I am trying to create a script that will compare one word at a time from one file to every word in a second file. I know how to design nested for loops to read the first entry in file1 and compare it to every entry in file2, so that's no problem. file2 contains lines that are multiple words in length. Each line of file1 starts with a non-alphabetic prefix followed by either a single word or two words, for example:
12/3:word
45/6:word word
78/9:word
01/23:word word
and so on...
I can design a regex that captures each section correctly (checking it on regexpal.com) in file1:
I can check the regex in a terminal emulator (GNOME Terminal):
echo "45/6:word word" | sed -e 's/\([0-9]*\/[0-9]*:\)\([a-z]*.[a-zA-Z]*\)/\2/'
That correctly removes the first part and returns word word as expected. However, when the input is read from file1 in a for loop, each line that has more than one word produces two entries: if file1 has 10 lines and two of them have two words, I get 12 results rather than 10:
for word in `less file1 | sed -e 's/\([0-9]*\/[0-9]*:\)\([a-z]*.[a-zA-Z]*\)/\2/'`
Why does it work with echo and not with my loop? I have tried adding .* and .$ to the end of the second part of the sed read section in an effort to reach to the end of the line, but neither option helped. I suspect it might have something to do with using less, but I'm not finding a solution. I also tried using cat in place of less, but no difference. Any help is greatly appreciated!
I have also tried adding a third memory marker (capture group) to catch the second word, but the results are still the same. If I put a space in the pattern, the regex no longer matches the single-word lines, and if I leave the space out, the loop still won't treat two words as one entry.
sed -e 's/\([0-9]*\/[0-9]*:\)\([a-z]*\)\( [a-z]*\)/\2\3/'
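One thing worth checking is whether sed is involved at all. A for loop over a backtick or $(...) command substitution iterates over the shell's word-split output, not over lines, so a line containing a space becomes two iterations no matter what the regex does. The sketch below compares the two ways of looping under that assumption. The temporary files and their contents are hypothetical stand-ins for file1, and the variable names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample data standing in for file1 (4 lines, 2 of them with two words).
tmp=$(mktemp)
printf '%s\n' '12/3:word' '45/6:word word' '78/9:word' '01/23:word word' > "$tmp"

# Same sed expression as in the question; save its output so both loops see identical input.
sed -e 's/\([0-9]*\/[0-9]*:\)\([a-z]*.[a-zA-Z]*\)/\2/' "$tmp" > "$tmp.out"

# for + command substitution: the shell splits the output on spaces, tabs
# and newlines, so "word word" yields two separate loop iterations.
split_count=0
for word in $(cat "$tmp.out"); do
    split_count=$((split_count + 1))
done

# while + read: each iteration receives one whole line, spaces included.
line_count=0
while IFS= read -r entry; do
    line_count=$((line_count + 1))
done < "$tmp.out"

echo "for loop:   $split_count entries"   # 6 entries for this sample
echo "while read: $line_count entries"    # 4 entries for this sample

rm -f "$tmp" "$tmp.out"
```

If that is the cause, it would also explain why echo appears to work: the echo test never goes through a for loop, so nothing splits the result. Reading with while IFS= read -r keeps each line intact without changing IFS for the rest of the script.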