Using sed with regex

chris.kelley.5015
Newbie Poster
6 posts since Sep 2012
Reputation Points: 0 [?]
Q&As Helped to Solve: 0 [?]
Skill Endorsements: 0 [?]

Hello,

I am trying to create a script that will compare one word at a time from one file to every word in a second file. I know how to design nested for loops to read the first entry in file1 and compare it to every entry in file2, so that's no problem. file1 contains entries that have non-alpha characters at the beginning of each line with either a single word or two words. file2 contains lines that are multiple words in length.

file1 contains:

12/3:word
45/6:word word
78/9:word
01/23:word word
and so on...

I can design a regex that captures each section correctly (checking it on regexpal.com) in file1:

[0-9]*\/[0-9]*:[a-zA-Z]*

I can check the regex in a terminal emulator (Gnome Term):

echo "45/6:word word" | sed -e 's/\([0-9]*\/[0-9]*:\)\([a-z]*.[a-zA-Z]*\)/\2/'

That correctly removes the first part and returns word word as expected. However, when read from file1, I get two entries for the lines that have more than one word, so that if file1 had 10 lines and two of them have two words I get 12 results rather than 10:

for word in `less file1 | sed -e 's/\([0-9]*\/[0-9]*:\)\([a-z]*.[a-zA-Z]*\)/\2/'`
do
    echo $word
done

Why does it work with echo and not with my loop? I have tried adding .* and .$ to the end of the second part of the sed read section in an effort to reach to the end of the line, but neither option helped. I suspect it might have something to do with using less, but I'm not finding a solution. I also tried using cat in place of less, but no difference. Any help is greatly appreciated!

chris.kelley.5015

I tried to eliminate less file1 | in the for loop:

sed -e 's/\([0-9]*\/[0-9]*:\)\([a-z]*.[a-zA-Z]*\)/\2/' < file1

This did not change the behavior, so I have to think it is my regex, but I still haven't been able to determine what. Again, any help is greatly appreciated.

chris.kelley.5015

Putting a space in the regex makes it match only the lines with two words separated by a space; the single-word lines are no longer recognized:

sed -e 's/\([0-9]*\/[0-9]*:\)\([a-z]*. [a-zA-Z]*\)/\2/' < file1

Still plugging away at this...

chris.kelley.5015

Continued efforts include adding a third capture group to catch the third segment, but the results are still the same. If I put a space in, the regex won't recognize the single-word lines, and if I leave the space out, the loop still won't treat two words as one entry.

sed -e 's/\([0-9]*\/[0-9]*:\)\([a-z]*\)\( [a-z]*\)/\2\3/'

Watael
Junior Poster
134 posts since Apr 2012
Reputation Points: 4 [?]
Q&As Helped to Solve: 27 [?]
Skill Endorsements: 2 [?]

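Hi,

less is a pager, and you don't need it here. When its output goes to a pipe it simply passes the file through, like cat, so it is not the cause. The for loop is: the shell splits the output of the backquoted command on whitespace, so a line such as "word word" becomes two separate items.

This is how to read a file line by line: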

while IFS= read -r line
do
    echo "$line"
done < file
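
To make the difference visible, here is a minimal sketch that counts the items each kind of loop sees. It assumes a four-line sample built from the lines quoted in the question (written to a temporary file that stands in for file1), and it uses a simplified sed expression that only strips the numeric prefix, not the exact regex from the question. The variable names are just for illustration.

```shell
#!/bin/sh
# Sample data taken from the question; the temp file stands in for file1.
file1=$(mktemp)
printf '%s\n' '12/3:word' '45/6:word word' '78/9:word' '01/23:word word' > "$file1"

# for + command substitution: the output is split on whitespace,
# so "word word" is seen as two separate items.
for_count=0
for word in $(sed -e 's/^[0-9]*\/[0-9]*://' "$file1"); do
    for_count=$((for_count + 1))
done

# while read: one iteration per line, spaces inside the line are kept.
# The here-document feeds sed's output to the loop without a subshell,
# so the counter is still set after the loop ends.
while_count=0
while IFS= read -r line; do
    while_count=$((while_count + 1))
done <<EOF
$(sed -e 's/^[0-9]*\/[0-9]*://' "$file1")
EOF

echo "for loop items:   $for_count"    # 6 items from 4 lines
echo "while loop items: $while_count"  # 4 items from 4 lines
rm -f "$file1"
```

With the sample above, the for loop reports 6 items and the while loop reports 4, which mirrors the "12 results rather than 10" from the question.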

Now, if you want to split each line into fields, set IFS (the Internal Field Separator) to the separator you want:

while IFS=':' read -r firstField secondField otherFields
do
    echo "$secondField"
done < file
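
Applied to the data from the question, that could look like the sketch below. It assumes two sample lines in the file1 format, and the names prefix, entry and entries are made up for the example. Because only ':' is in IFS, the space inside "word word" no longer splits anything, so sed is not needed at all.

```shell
#!/bin/sh
# Two sample lines in the format shown in the question (temp file stands in for file1).
file1=$(mktemp)
printf '%s\n' '12/3:word' '45/6:word word' > "$file1"

entries=""
count=0
# Split only on ':'; everything after the first colon lands in "entry",
# including any spaces.
while IFS=':' read -r prefix entry; do
    count=$((count + 1))
    entries="${entries}[${entry}]"
done < "$file1"

echo "$count entries: $entries"   # prints: 2 entries: [word][word word]
rm -f "$file1"
```

Using two variables instead of three means that anything after the first colon, further colons included, stays in the last variable.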

What does file2 look like?
And what is the desired output?
Maybe you don't need a loop at all. grep -f might be enough.
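
As a rough idea of the grep -f route: strip the prefixes from file1 into a pattern file, then let grep look for every pattern in file2 in one pass. The contents of both files below are invented for the example, since file2 was never shown, and -F (treat patterns as fixed strings) is my assumption about what is wanted.

```shell
#!/bin/sh
# Hypothetical sample data; the real file1 and file2 were not posted.
file1=$(mktemp); file2=$(mktemp); patterns=$(mktemp)
printf '%s\n' '12/3:apple' '45/6:green pear' > "$file1"
printf '%s\n' 'I ate an apple today' 'nothing to see here' 'a green pear fell' > "$file2"

# Drop the "NN/N:" prefix so each line of the pattern file is a bare entry.
sed -e 's/^[0-9]*\/[0-9]*://' "$file1" > "$patterns"

# -f reads one pattern per line; -F matches them literally, not as regexes.
matches=$(grep -F -f "$patterns" "$file2")
match_count=$(grep -c -F -f "$patterns" "$file2")

echo "$matches"
echo "matching lines: $match_count"
rm -f "$file1" "$file2" "$patterns"
```

This prints the lines of file2 that contain any entry from file1. Adding -w would restrict matches to whole words, if that is closer to what you need.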
