Hi, I have a file with duplicate records. They look something like this:

<record>
<dateadd>012012</dateadd>
<nid>R04607295</nid>
<reflink></reflink>
<FPI>YES</FPI><TPG>NO</TPG><FT>YES</FT>
<num>631</num>
<author>Anon</author>
<title>ON THE WED</title>
</record>

<record>
<dateadd>012012</dateadd>
<idref>R04607297</idref>
<reflink></reflink>
<type>Article</type>
<FPI>YES</FPI><TPG>NO</TPG><FT>YES</FT>
<num>651</num>
<author>Bent, E</author>
<title>ENTRANCES AND EXITS</title>
</record>

<record>
<dateadd>012012</dateadd>
<nid>R04607295</nid>
<reflink></reflink>
<FPI>YES</FPI><TPG>NO</TPG><FT>YES</FT>
<num>631</num>
<author>Anon</author>
<title>ON THE WED</title>
</record>

<record>
<dateadd>012012</dateadd>
<idref>R04607297</idref>
<reflink></reflink>
<type>Article</type>
<FPI>YES</FPI><TPG>NO</TPG><FT>YES</FT>
<num>651</num>
<author>Bent, E</author>
<title>ENTRANCES &amp; EXITS</title>
</record>

Not all the records are 100% duplicates (see &amp; vs. AND), but the num fields contain a duplicate id that can be used.

Somewhere in the past I must have fiddled with this because I have this commented out at the bottom of a Python script I usually use to remove these pesky buggers:

sed -n "/<record>/,/<\/record>/p" 2010rec.got | sort | uniq | tee -a new.txt

I have fiddled with it some, but I can't get it working. So my question is simply: is this at all possible? Thanks

That's a good question! I don't think your sed line is going to work, unless each record is all on one line. The way the records appear to be formatted, your sort|uniq would give you a big pile of nothing.

Here's a (kind of ugly) script that I wrote to see if this would work... I *think* it does what you're looking for.

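input="test.txt"
# Collect the distinct <num> lines; each one serves as a search key.
nums="$(grep '^<num>' "$input" | sort -u)"
for num in $nums; do
    # -B/-A need a grep that supports context flags (e.g. GNU grep).
    # The counts assume the sample's record layout, so a longer record
    # can come out cut short by head.
    grep -B6 -A3 "$num" "$input" | head -n 9
    echo
done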

In this case, 'test.txt' is my input file, containing the 4 sample records you provided. Here's my output:

# sh test.sh
<record>
<dateadd>012012</dateadd>
<nid>R04607295</nid>
<reflink></reflink>
<FPI>YES</FPI><TPG>NO</TPG><FT>YES</FT>
<num>631</num>
<author>Anon</author>
<title>ON THE WED</title>
</record>

<record>
<dateadd>012012</dateadd>
<idref>R04607297</idref>
<reflink></reflink>
<type>Article</type>
<FPI>YES</FPI><TPG>NO</TPG><FT>YES</FT>
<num>651</num>
<author>Bent, E</author>
<title>ENTRANCES AND EXITS</title>

I hope this helps, or at least gives you a place to start! There's probably a much cleaner way to do it, but this was quick and simple.
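One possibly cleaner route, sketched here with awk instead of grep's context flags, is to treat each <record>...</record> block as a unit and print it only the first time its <num> line is seen. This is only a sketch under some assumptions: every record starts with a line beginning "<record>" and ends with "</record>", the <num> element sits on a line of its own, and the awk in use supports the "in" operator (on some older systems that may be nawk rather than awk). The function name dedupe_by_num is made up for illustration.

```shell
# Hypothetical helper: print each <record>...</record> block once per <num> value.
# Assumes <record>, </record> and <num> each start a line, as in the sample data.
dedupe_by_num() {
    awk '
        # Start of a record: reset the buffer and the key.
        /^<record>/        { buf = ""; key = ""; inrec = 1 }
        # While inside a record, collect every line (including the tags).
        inrec              { buf = buf $0 "\n" }
        # Remember the <num> line as the duplicate key.
        inrec && /^<num>/  { key = $0 }
        # End of a record: print it unless this key was already seen.
        # Records without a <num> line are always printed.
        /^<\/record>/ {
            if (inrec && (key == "" || !(key in seen))) {
                seen[key] = 1
                printf "%s\n", buf
            }
            inrec = 0
        }
    ' "$1"
}
```

It could be called as: dedupe_by_num 2010rec.got > new.txt. The first copy of each record wins, so for the sample data the "ENTRANCES AND EXITS" version is kept and the "&amp;" version is dropped.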

Thanks for the help. From time to time I help other departments out with questions like these. Mostly I just google it, and usually I manage to hack something together that works. In this case the problem was solved by someone else using an old Perl script they dug out somewhere. I prefer to fix these with Unix. This way other people learn how to help themselves with the tools at hand, and if I can get them to move away from Perl, that's a plus. We don't really support Perl anymore, and when these old Perl scripts go wrong I generally want to tear my hair out. Besides, in the past I've come up with some Unix one-liners that have replaced 300+ lines' worth of Perl scripts.

Anyway, I think I know what you're doing but I get this:

grep: illegal option -- B
grep: illegal option -- 6
grep: illegal option -- A
grep: illegal option -- 3
Usage: grep -hblcnsviw pattern file . . .

This means I'm using a different Unix or something than you, right? What are the -B6 and -A3 supposed to do?

Thanks again

Oh, right! Not all versions of grep have the -A and -B flags. Here's a definition:

-A NUM, --after-context=NUM
      Print NUM lines of trailing context after matching lines. Places a line containing -- between contiguous groups of matches.

-B NUM, --before-context=NUM
      Print NUM lines of leading context before matching lines. Places a line containing -- between contiguous groups of matches.

There's a way to do this with sed, but I haven't sat down with it to break it down and understand it myself. Here's a link to the O'Reilly Unix Power Tools page that talks about it: http://docstore.mik.ua/orelly/unix3/upt/ch13_09.htm

And here's a link to the 'cgrep' script that they use in the example: http://examples.oreilly.com/9780596003302/example_files.tar.gz
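If neither a grep with -A/-B nor cgrep is an option, the same idea can be approximated with awk. The following is a rough sketch, not a drop-in replacement: the function name context_grep is made up, it reads the whole file into memory, it does not print the "--" separators, and it assumes an awk that accepts -v (on some older systems that may be nawk).

```shell
# Hypothetical helper: context_grep PATTERN BEFORE AFTER FILE
# Prints every line matching PATTERN plus BEFORE lines of leading context
# and AFTER lines of trailing context, roughly like grep -B BEFORE -A AFTER.
context_grep() {
    awk -v pat="$1" -v before="$2" -v after="$3" '
        # Keep every line so context before a match can still be printed.
        { line[NR] = $0 }
        # Mark the matching line and its neighbours as wanted.
        $0 ~ pat {
            for (i = NR - before; i <= NR + after; i++) want[i] = 1
        }
        # Print the wanted lines once each, in file order.
        END {
            for (i = 1; i <= NR; i++) if (i in want) print line[i]
        }
    ' "$4"
}
```

For example, context_grep '<num>631</num>' 6 3 test.txt would print the same lines that grep -B6 -A3 selects for that pattern, minus the separator lines. Overlapping context from neighbouring matches is printed only once.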

I hope this helps!

I'm actually just looking at this at the moment. I'll post back if I make any progress. Thanks

- No cgrep, agrep or hgrep for me! Can't use any of these at work. :(

Well, cgrep is just the sed script in the example link. You might even be able to use it as-is (I haven't actually looked at it yet).
