Help with SED/AWK email parser

I have a bunch of text files in different formats, each with email addresses somewhere inside. I would like to be able to parse any number of these files for the email addresses. Here are the types of input:

CFO: some_cfo@domain.com

misterman@domain.com

The Main Man mainman@domain.com

To take care of the situations I have the following seds:

# Removes an opening title (everything up to the last colon on the line)
sed -e 's/^.*://'

# Removes leading and trailing whitespace (spaces and tabs)
sed -e 's/^[[:blank:]]*//;s/[[:blank:]]*$//'

Those are both really simple, but for the life of me I can't figure out how to remove normal text from before the email address. I either end up clobbering the whole thing, or including it.

I just need something that keeps whatever is directly attached to the '@' and deletes everything before the preceding whitespace and after the following whitespace.

The Main Man mainman@domain.com
^            ^       ^
|            |       +-- part of the email
|            +---------- part of the email
+----------------------- not part of the email

Any clues anyone?
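For reference, the two cleanup steps from this post can be chained in a single sed call. This is only a sketch: the sample lines are made up to resemble the inputs above, and it uses the POSIX class `[[:blank:]]` on the assumption that `^t` was meant as a tab. It shows the remaining problem too, since the free text before the last address survives.

```shell
# Hypothetical sample input: a titled line, a tab-indented line with a
# trailing space, and a line with free text before the address.
sample=$(printf 'CFO: some_cfo@domain.com\n\tmisterman@domain.com \nThe Main Man mainman@domain.com\n')

# 1) drop everything up to the last colon
# 2) trim leading blanks (spaces and tabs)
# 3) trim trailing blanks
cleaned=$(printf '%s\n' "$sample" |
  sed -e 's/^.*://' -e 's/^[[:blank:]]*//' -e 's/[[:blank:]]*$//')

printf '%s\n' "$cleaned"
```

The first two lines come out as bare addresses, while the third still reads "The Main Man mainman@domain.com".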

i686-linux
Posting Whiz in Training
210 posts since Mar 2004
Reputation Points: 87
Solved Threads: 12
 

And grep saves the day. Next time I'll RTFM better. :)

grep -o "[[:alnum:][:graph:]]*@[[:alnum:][:graph:]]*"

I haven't tested for many bugs/quirks in the results yet, but a few quick checks seemed to work fine.
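One quirk that is easy to reproduce: `[[:graph:]]` also matches quotes, commas and other punctuation, so anything glued to the address comes along. The sketch below compares the pattern above with a narrower one. The sample text and the narrower character set (letters, digits, `.`, `_`, `+`, `-`) are assumptions for illustration, not a full address grammar, and `grep -o` itself is an extension found in GNU and BSD grep.

```shell
# Hypothetical sample input; the last line has punctuation touching the address.
sample='CFO: some_cfo@domain.com
misterman@domain.com
The Main Man mainman@domain.com
Reply to "boss@domain.com", please.'

# Pattern from the post: any run of visible characters around an "@".
loose=$(printf '%s\n' "$sample" | grep -o '[[:graph:]]*@[[:graph:]]*')

# Narrower pattern (assumed character set): at least one allowed
# character on each side of the "@".
strict=$(printf '%s\n' "$sample" |
  grep -o '[[:alnum:]._+-]\{1,\}@[[:alnum:].-]\{1,\}')

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
```

With this input the loose pattern returns `"boss@domain.com",` for the last line, while the narrower one returns `boss@domain.com`.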

If anyone has any further ideas though they would still be greatly appreciated!

i686-linux
 

Here are a few I've used in the past (dunno if they'll work for you since SED isn't my strong point):

# get Subject header, but remove initial "Subject: " portion
sed '/^Subject: */!d; s///;q'

# get return address header
sed '/^Reply-To:/q; /^From:/h; /./d;g;q'

# parse out the address proper. Pulls out the e-mail address by itself
# from the 1-line return address header (see preceding script)
sed 's/ *(.*)//; s/>.*//; s/.*[:<] *//'
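To see what these one-liners do, here is a small sketch that runs them against a made-up message. The message text is hypothetical, and the scripts assume the headers end at the first blank line.

```shell
# Hypothetical message: headers, a blank line, then the body.
msg='From: Main Man <mainman@domain.com>
To: cfo@domain.com
Subject: Quarterly numbers

Body text mentioning other@domain.com'

# Return address header: prefer Reply-To:, otherwise fall back to the
# From: line saved in the hold space once the blank line is reached.
header=$(printf '%s\n' "$msg" | sed '/^Reply-To:/q; /^From:/h; /./d;g;q')

# Strip a "(comment)", anything from ">" onward, and everything up to
# the last ":" or "<", leaving the bare address.
address=$(printf '%s\n' "$header" | sed 's/ *(.*)//; s/>.*//; s/.*[:<] *//')

# Subject text without the "Subject: " prefix.
subject=$(printf '%s\n' "$msg" | sed '/^Subject: */!d; s///;q')

printf '%s\n%s\n%s\n' "$header" "$address" "$subject"
```

Note that these only look at the headers, so the address in the body is never reported. For free-form text files like the ones in the original question they would need a different approach.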

TKS
Posting Pro in Training
470 posts since Jan 2004
Reputation Points: 108
Solved Threads: 18
 

Those look awfully familiar. Did you grab those off of a "100 useful SED scripts" site? :)

i686-linux
 

I got them from my friend Josh who probably did pull them off that very site :D

TKS
 

hello,

I am working with:

cat filename.txt | grep @

and that narrows the output down to the lines that contain an address. I think this will help. What I wonder is whether we can get grep to output only the matched expression instead of the whole dang line.

I am also wondering if AWK will do what you need.
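On the AWK idea, one possible sketch: awk already splits each line on whitespace, so printing every field that contains an "@" gives roughly the "keep what is attached to the '@'" behaviour. The sample lines are made up, and like the `[[:graph:]]` approach this keeps any punctuation stuck to the address.

```shell
# Hypothetical sample input modeled on the lines in the first post.
sample='CFO: some_cfo@domain.com
misterman@domain.com
The Main Man mainman@domain.com'

# Walk the whitespace-separated fields of each line and print the ones
# that contain an "@".
emails=$(printf '%s\n' "$sample" |
  awk '{ for (i = 1; i <= NF; i++) if ($i ~ /@/) print $i }')

printf '%s\n' "$emails"
```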

Christian

kc0arf
Posting Virtuoso
Team Colleague
1,937 posts since Mar 2004
Reputation Points: 121
Solved Threads: 57
 

Wow. Looks like a bunch of us working on it at the same time. Cool.

Christian

kc0arf
 
Quote from kc0arf: "What I wonder is if we can get grep to simply output the found expression instead of the whole dang line."


That is what I posted about:

grep -o "[[:alnum:][:graph:]]*@[[:alnum:][:graph:]]*"

grep -o prints only the part of each line that matches the expression, instead of the whole matching line.

I realized that this can be cut down to:

grep -o "[[:graph:]]*@[[:graph:]]*"
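The shorter form works because every `[:alnum:]` character is also a `[:graph:]` character, so listing both classes adds nothing. A quick way to check this on your own data is to compare the two outputs. The sample input below is made up.

```shell
# Hypothetical sample input with a mix of titles, names and punctuation.
sample='CFO: some_cfo@domain.com
The Main Man mainman@domain.com
see (misterman@domain.com) for details'

long=$(printf '%s\n' "$sample" | grep -o '[[:alnum:][:graph:]]*@[[:alnum:][:graph:]]*')
short=$(printf '%s\n' "$sample" | grep -o '[[:graph:]]*@[[:graph:]]*')

if [ "$long" = "$short" ]; then
  echo "same matches"
else
  echo "different matches"
fi
```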

i686-linux
 

Using sed, you could also extract a pattern, similar to grep -o:

sed -n 's/.*\(pattern\).*/\1/p' file
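One thing to watch with this template: the leading `.*` is greedy, so if the pattern itself can start with zero characters (as `[[:graph:]]*@...` can), the `.*` swallows the part before the "@". The sketch below shows the effect and one possible workaround, which is to prepend a space and require whitespace right before the captured group. The sample line and the workaround are illustrative assumptions, and like the template it prints at most one address per line.

```shell
# Hypothetical sample line.
line='The Main Man mainman@domain.com'

# Template as given: the greedy ".*" eats "mainman" as well.
naive=$(printf '%s\n' "$line" |
  sed -n 's/.*\([[:graph:]]*@[[:graph:]]*\).*/\1/p')

# Workaround: add a leading space so every address has whitespace in
# front of it, then anchor the group on that whitespace.
anchored=$(printf '%s\n' "$line" |
  sed -n 's/^/ /; s/.*[[:space:]]\([[:graph:]]*@[[:graph:]]*\).*/\1/p')

printf 'naive:    %s\nanchored: %s\n' "$naive" "$anchored"
```

Here the first command yields `@domain.com` and the second yields `mainman@domain.com`.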


Is the * in your grep -o example = to any character? I've never used that in a grep command before...

TKS
 

* = any number of matches (zero or more) of the previous expression

For example:

[[:graph:]]* is really "any printable and visible (non-space) character, repeated any number of times, including zero"
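Because `*` also allows zero repetitions, the pattern matches a lone "@" with nothing around it. A small sketch of the difference, using the POSIX interval `\{1,\}` to demand at least one character on each side. The sample text is made up.

```shell
# Hypothetical sample input: a stray "@" and a real address.
sample='meet @ noon
mainman@domain.com'

# "*" allows zero characters on either side, so the bare "@" matches.
star=$(printf '%s\n' "$sample" | grep -o '[[:graph:]]*@[[:graph:]]*')

# "\{1,\}" requires one or more characters on each side.
plus=$(printf '%s\n' "$sample" | grep -o '[[:graph:]]\{1,\}@[[:graph:]]\{1,\}')

printf 'star:\n%s\n\nplus:\n%s\n' "$star" "$plus"
```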

i686-linux
 

I'm a grep fan and rarely use sed or awk. To pull the first and last name that often precede the address you can use this.

grep -o "[[:alnum:]]*[[:blank:]][[:alnum:]]*[[:blank:]][[:graph:]]*@[[:graph:]]*"


This adds the first and last name preceding the address. It matches any run of letters and digits followed by exactly one blank (a space or a tab), twice: once for the first name, then again for the last name. It then matches the address itself, including any attached brackets, hyphens, underscores, etc. This makes building a contact list from typical email headers and forwarded headers very easy. Each name/name/address match is returned on its own line. You may want to switch alnum to graph for the names if you want any non-blank delimiter to be included.
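A short sketch of that pattern on some made-up header lines. Note one consequence of requiring two blank-separated words: a line that holds only a bare address produces no match at all.

```shell
# Hypothetical sample input: a From: header, a free-text line, and a
# bare address with no name in front of it.
headers='From: Main Man <mainman@domain.com>
The Main Man mainman@domain.com
misterman@domain.com'

# Two alphanumeric words, each followed by one blank, then the address.
contacts=$(printf '%s\n' "$headers" |
  grep -o '[[:alnum:]]*[[:blank:]][[:alnum:]]*[[:blank:]][[:graph:]]*@[[:graph:]]*')

printf '%s\n' "$contacts"
```

This prints `Main Man <mainman@domain.com>` and `Main Man mainman@domain.com`. The bare `misterman@domain.com` line is skipped.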

Skifter
Newbie Poster
5 posts since Mar 2010
Reputation Points: 10
Solved Threads: 0
 

This question has already been solved
