Hi all. I am trying to extract a group of words from a text file, but I don't get the expected result using a regular expression. Here is my code:

mytext.txt contains:

Group=1
Name=mattew
Sex=male
Age=25

Group=2
Name=John
Sex=Male
Age=19

When I try to get Group=1, I end up getting everything below Group 1 as well.

import re

data=open('mytext.txt').read()

p=re.compile(r'Group: [0-9]\n{1}#.*$',re.S)
m=p.search(data)
print m.group()

I only want to extract a certain group and the lines that belong to it, nothing more. Please, how do I do this?

Just an idea, but your regular expression looks for the string "Group:" with a colon, while the text file has "Group=" with an equals sign, so that could be what is breaking it.

I suspect you're right; regexes are picky that way.
Group: [0-9]\n{1}#.*$

Setting aside the fact that that regex would not match any of the text you gave as an example, the key to understanding your problem is .*$. That is a greedy match: .* grabs as many characters as it can. Because you compiled with re.S, the . also matches newlines, and without re.M the $ only matches at the very end of the data. So the match does not stop at the first end of line (the newline there is matched by the .); it runs on to the last one, which yields the longest possible match.
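To make the greedy behaviour concrete, here is a minimal sketch using the Group data from the first post. The two corrected patterns and the names original, greedy and lazy are my own illustration, not the original poster's code, and the non-greedy version assumes that groups are separated by a blank line. It is written with print() so it runs on Python 3 as well.

```python
import re

data = """Group=1
Name=mattew
Sex=male
Age=25

Group=2
Name=John
Sex=Male
Age=19
"""

# The pattern from the question expects "Group: " and a "#",
# neither of which occurs in the data, so nothing matches.
original = re.search(r'Group: [0-9]\n{1}#.*$', data, re.S)
print(original)

# Greedy: with re.S the dot also matches newlines,
# so .* runs on to the end of the data.
greedy = re.search(r'Group=1\n.*$', data, re.S)
print(greedy.group())

# Non-greedy: .*? stops as soon as the lookahead sees a blank line
# (or the end of the data, for the last group).
lazy = re.search(r'Group=1\n.*?(?=\n\n|\Z)', data, re.S)
print(lazy.group())
```

The only difference between the last two patterns is the ? after .* and the lookahead that tells the match where it is allowed to stop.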


Thanks for your replies. I still don't get it. OK, here is an example of what I want to do. I will use the contents of a robots.txt file for this.

User-Agent: Googlebot
#Disallow: /
Disallow: /comments
Disallow: /user
Disallow: /poll
Disallow: /print
Disallow: /search

User-Agent: Cogger
#Disallow: /
Disallow: /comments
Disallow: /user
Disallow: /poll
Disallow: /print
Disallow: /search

# Alexa Archver, allow them
User-Agent: ia_archive
Disallow: /comments
Disallow: /user
Disallow: /poll
Disallow: /print
Disallow: /search

Using this example, I would like to extract the exclusion rules for "User-Agent: Cogger". That is, I want everything below User-Agent: Cogger. That's exactly my problem.

Thanks

I came up with a couple of regexes, depending on whether you want Cogger to the end of the file, or just Cogger to the next agent (or end of file).

import re

data = """User-Agent: Googlebot
#Disallow: /
Disallow: /comments
Disallow: /user
Disallow: /poll
Disallow: /print
Disallow: /search

User-Agent: Cogger
#Disallow: /
Disallow: /comments
Disallow: /user
Disallow: /poll
Disallow: /print
Disallow: /search

# Alexa Archver, allow them
User-Agent: ia_archive
Disallow: /comments
Disallow: /user
Disallow: /poll
Disallow: /print
Disallow: /search
"""

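print("-- Greedy match --")
# DOTALL lets '.' match newlines, so (.*) captures everything
# after the Cogger line, right up to the end of the data
mm = re.search(r"User-Agent: Cogger[ \t]*\n(.*)", data, re.DOTALL)
if mm:
    lines = mm.group(1).split('\n')
    for line in lines:
        print(line)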

# output
# -- Greedy match --
# #Disallow: /
# Disallow: /comments
# Disallow: /user
# Disallow: /poll
# Disallow: /print
# Disallow: /search
#
# # Alexa Archver, allow them
# User-Agent: ia_archive
# Disallow: /comments
# Disallow: /user
# Disallow: /poll
# Disallow: /print
# Disallow: /search
#


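print("-- Not so greedy --")
# (.*?) is non-greedy: it stops at the next "User-Agent",
# or at the end of the data if there is no further agent
mm = re.search(r"User-Agent: Cogger[ \t]*\n(.*?)(User-Agent|$)", data, re.DOTALL)
if mm:
    lines = mm.group(1).split('\n')
    for line in lines:
        print(line)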

# output
# -- Not so greedy --
# #Disallow: /
# Disallow: /comments
# Disallow: /user
# Disallow: /poll
# Disallow: /print
# Disallow: /search
# 
# # Alexa Archver, allow them
#

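Note that both of them seem to have extra lines at the end: the greedy one runs on through the ia_archive block, and the not-so-greedy one still picks up the blank line and the comment that precede the next User-Agent.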
And you'll probably have to deal with the comments (I'm presuming that lines that start with # are comments.)
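One way to deal with both the extra lines and the comments is sketched below. This is only an illustration under two assumptions: blocks are separated by a blank line, and any line starting with # is a comment. The function name rules_for and the shortened sample data are made up for this example.

```python
import re

# A shortened copy of the robots.txt sample from the thread.
data = """User-Agent: Googlebot
#Disallow: /
Disallow: /comments

User-Agent: Cogger
#Disallow: /
Disallow: /comments
Disallow: /user
Disallow: /poll
Disallow: /print
Disallow: /search

# Alexa Archver, allow them
User-Agent: ia_archive
Disallow: /comments
Disallow: /user
"""

def rules_for(agent, text):
    """Return the non-comment lines below 'User-Agent: <agent>',
    stopping at the first blank line or at the end of the text."""
    pattern = (r'^User-Agent: ' + re.escape(agent) +
               r'[ \t]*\n(.*?)(?=\n[ \t]*\n|\Z)')
    mm = re.search(pattern, text, re.DOTALL | re.MULTILINE)
    if mm is None:
        return []
    # drop empty lines and lines that start with '#'
    return [line for line in mm.group(1).split('\n')
            if line.strip() and not line.lstrip().startswith('#')]

for rule in rules_for('Cogger', data):
    print(rule)
```

The lookahead stops the non-greedy match at the blank line, so the comment that belongs to the next block is never captured, and the list comprehension filters out the commented rule inside the Cogger block.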

Using a simple if instead of a regex:

test_file = [
'User-Agent: Googlebot',
'#Disallow: /',
'Disallow: /comments',
'Disallow: /user',
'Disallow: /poll',
'Disallow: /print',
'Disallow: /search',
'',
'User-Agent: Cogger',
'#Disallow: /',
'Disallow: /comments',
'Disallow: /user',
'Disallow: /poll',
'Disallow: /print',
'Disallow: /search',
'',
'User-Agent: ia_archive',
'Disallow: /comments',
'Disallow: /user',
'Disallow: /poll',
'Disallow: /print',
'Disallow: /search' ]

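found_list = []
found = False
to_find = "Cogger"
for rec in test_file:
    rec = rec.strip()
    # every User-Agent line starts a new block; keep collecting
    # only if it names the agent we are looking for
    if rec.startswith("User-Agent"):
        found = to_find in rec
    # collect the non-empty lines of the wanted block
    # (this includes the User-Agent line itself)
    if found and rec:
        found_list.append(rec)

for rec in found_list:
    print(rec)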
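If you need more than one agent, the same line-by-line idea can collect every block in one pass. The following is just a sketch along those lines; the function name group_rules and the shortened sample are hypothetical, and it assumes that lines starting with # are comments to be skipped.

```python
def group_rules(lines):
    """Map each User-Agent name to the non-comment lines that follow it."""
    groups = {}
    current = None
    for rec in lines:
        rec = rec.strip()
        if not rec or rec.startswith('#'):
            continue  # skip blank lines and comments
        if rec.startswith('User-Agent:'):
            # text after the first colon is the agent name
            current = rec.split(':', 1)[1].strip()
            groups.setdefault(current, [])
        elif current is not None:
            groups[current].append(rec)
    return groups

sample = """User-Agent: Googlebot
#Disallow: /
Disallow: /comments

User-Agent: Cogger
#Disallow: /
Disallow: /comments
Disallow: /user

# Alexa Archver, allow them
User-Agent: ia_archive
Disallow: /poll
"""

groups = group_rules(sample.splitlines())
for rule in groups['Cogger']:
    print(rule)
```

To read from a real file you could pass the open file object instead of sample.splitlines(), since the loop strips each line anyway.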

I like that type of solution better myself; I was just trying to make the regex work, as that is what the OP asked for.

Not that we always give them what they ask for :)

Thanks to everybody. I will work through every example and I'll be sure to get back here.
