Hi all. I am trying to extract a group of lines from a text file, but I don't get the expected result using a regular expression. Here is my code:

mytext.txt contains:

Group=1
Name=mattew
Sex=male
Age=25

Group=2
Name=John
Sex=Male
Age=19

When I try to get Group=1, I end up getting everything below it, to the end of the file.

import re

data=open('mytext.txt').read()

p=re.compile(r'Group: [0-9]\n{1}#.*$',re.S)
m=p.search(data)
print m.group()

I only want to extract a certain group and only the lines directly below it. Please, how do I do this?


Just an idea, but your regular expression looks for the string "Group:" with a colon, while the text file has "Group=", so that could be what's breaking it.

commented: I suspect you're right, regexes are picky that way +2
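To illustrate the colon-versus-equals point, here is a quick sketch. The `sample` string is just a cut-down copy of the first group from mytext.txt, and only the start of the pattern is tested, so treat it as a check of the separator and nothing more.

```python
import re

# A cut-down copy of the first group from mytext.txt (assumed contents).
sample = "Group=1\nName=mattew\nSex=male\nAge=25\n"

# The posted pattern expects "Group: " (colon plus space), which never
# occurs in the file, so the search finds nothing.
print(re.search(r'Group: [0-9]', sample))

# Using the separator the file actually contains does match.
m = re.search(r'Group=[0-9]', sample)
print(m.group())
```

When a search finds nothing, `re.search` returns `None`, so calling `.group()` on the result without checking it first will raise an `AttributeError`.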
Group: [0-9]\n{1}#.*$

Setting aside the fact that this regex would not match any of the example text you gave, the key to understanding your problem is .*$. The .* is greedy: it matches as many characters as it can. Because you compiled with re.S, the . also matches newlines, and without re.M the $ only matches at the very end of the text (or just before a final newline). So .*$ does not stop at the first end of line; it runs on to the last one, which yields the longest possible match.
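Here is a minimal sketch of the difference, assuming the groups in mytext.txt are separated by blank lines as in the original post. The helper name `extract_group` is made up for this example. It swaps the greedy `.*$` for a lazy `.*?` followed by a lookahead that stops at the first blank line or at the end of the text.

```python
import re

data = """Group=1
Name=mattew
Sex=male
Age=25

Group=2
Name=John
Sex=Male
Age=19
"""

def extract_group(text, number):
    """Return the lines under 'Group=<number>', up to the next blank line."""
    # (.*?) is lazy: it stops at the first position where the lookahead
    # holds, i.e. just before a blank line or at the very end of the text.
    pattern = r'^Group=%d\n(.*?)(?=\n\s*\n|\Z)' % number
    m = re.search(pattern, text, re.S | re.M)
    return m.group(1).strip() if m else None

# Greedy version: with re.S the match runs to the end of the text.
greedy = re.search(r'Group=1\n.*$', data, re.S).group()
print(greedy == data)

# Lazy version: only the lines that belong to group 1.
print(extract_group(data, 1))
```

If the real file separates groups some other way, the lookahead would need adjusting to match that separator.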

Thanks for your replies. I still don't get it. OK, here is an example of what I want to do, using the contents of a robots.txt file.

User-Agent: Googlebot
#Disallow: /
Disallow: /comments
Disallow: /user
Disallow: /poll
Disallow: /print
Disallow: /search

User-Agent: Cogger
#Disallow: /
Disallow: /comments
Disallow: /user
Disallow: /poll
Disallow: /print
Disallow: /search

# Alexa Archver, allow them
User-Agent: ia_archive
Disallow: /comments
Disallow: /user
Disallow: /poll
Disallow: /print
Disallow: /search

Using this example, I would like to extract the exclusion rules for "User-Agent: Cogger". That is, I want everything below User-Agent: Cogger. That's exactly my problem.

Thanks

I came up with a couple of regexes, depending on whether you want Cogger to the end of the file, or just Cogger up to the next agent (or the end of the file).

import re

data = """User-Agent: Googlebot
#Disallow: /
Disallow: /comments
Disallow: /user
Disallow: /poll
Disallow: /print
Disallow: /search

User-Agent: Cogger
#Disallow: /
Disallow: /comments
Disallow: /user
Disallow: /poll
Disallow: /print
Disallow: /search

# Alexa Archver, allow them
User-Agent: ia_archive
Disallow: /comments
Disallow: /user
Disallow: /poll
Disallow: /print
Disallow: /search
"""

print("-- Greedy match --")
# Greedy (.*) with re.DOTALL captures everything after the Cogger line, to the end of data.
mm = re.search(r"User-Agent: Cogger[ \t]*\n(.*)", data, re.DOTALL)
if mm:
    lines = mm.group(1).split('\n')
    for line in lines:
        print(line)

# output
# -- Greedy match --
# #Disallow: /
# Disallow: /comments
# Disallow: /user
# Disallow: /poll
# Disallow: /print
# Disallow: /search
#
# # Alexa Archver, allow them
# User-Agent: ia_archive
# Disallow: /comments
# Disallow: /user
# Disallow: /poll
# Disallow: /print
# Disallow: /search
#


print("-- Not so greedy --")
# Lazy (.*?) stops at the next "User-Agent" or, failing that, at the end of data.
mm = re.search(r"User-Agent: Cogger[ \t]*\n(.*?)(User-Agent|$)", data, re.DOTALL)
if mm:
    lines = mm.group(1).split('\n')
    for line in lines:
        print(line)

# output
# -- Not so greedy --
# #Disallow: /
# Disallow: /comments
# Disallow: /user
# Disallow: /poll
# Disallow: /print
# Disallow: /search
# 
# # Alexa Archver, allow them
#

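Note that both of them return something extra at the end: split('\n') leaves an empty string after the final newline, and the second regex also captures the "# Alexa Archver" comment that sits above the next User-Agent line.
You'll probably have to deal with the comments too (I'm presuming that lines that start with # are comments).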
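One possible way to deal with both the trailing extras and the comment lines is sketched below. It assumes that agent blocks are separated by a blank line (true for the sample, not guaranteed for every robots.txt) and that any line starting with # is a comment. The function name `rules_for` and the shortened `robots` text are made up for this example.

```python
import re

# Shortened version of the robots.txt sample, for illustration only.
robots = """User-Agent: Googlebot
#Disallow: /
Disallow: /comments

User-Agent: Cogger
#Disallow: /
Disallow: /comments
Disallow: /user

# Alexa Archver, allow them
User-Agent: ia_archive
Disallow: /search
"""

def rules_for(agent, text):
    """Return the non-comment rule lines that follow 'User-Agent: <agent>'."""
    # Lazy capture that ends at the first blank line or at the end of the
    # text, so a comment heading the next block is not swept in.
    pattern = (r'^User-Agent: %s[ \t]*\n(.*?)(?=\n[ \t]*\n|\Z)'
               % re.escape(agent))
    m = re.search(pattern, text, re.S | re.M)
    if m is None:
        return []
    lines = [line.strip() for line in m.group(1).splitlines()]
    # Drop empty lines and lines that start with '#'.
    return [line for line in lines if line and not line.startswith('#')]

print(rules_for("Cogger", robots))
```

Using `splitlines()` plus the filter avoids the empty string that `split('\n')` leaves behind after a final newline.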

Using a simple if

test_file = [
'User-Agent: Googlebot',
'#Disallow: /',
'Disallow: /comments',
'Disallow: /user',
'Disallow: /poll',
'Disallow: /print',
'Disallow: /search',
'',
'User-Agent: Cogger',
'#Disallow: /',
'Disallow: /comments',
'Disallow: /user',
'Disallow: /poll',
'Disallow: /print',
'Disallow: /search',
'',
'User-Agent: ia_archive',
'Disallow: /comments',
'Disallow: /user',
'Disallow: /poll',
'Disallow: /print',
'Disallow: /search' ]

found_list = []
found = False
to_find = "Cogger"
for rec in test_file:
    rec = rec.strip()
    # A User-Agent line starts a new block; collect it only if it names our agent.
    if rec.startswith("User-Agent"):
        found = to_find in rec
    # Keep non-empty lines while inside the wanted block.
    if found and rec:
        found_list.append(rec)

for rec in found_list:
    print(rec)

I like that type of solution better myself, I was just trying to make the regex work as that is what the OP asked for.

Not that we always give them what they ask for :)
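Along the same lines as the line-by-line approach, here is a rough sketch that collects the rules for every agent in one pass, so any agent can be looked up afterwards. The function name `parse_agents` and the shortened `sample` list are made up for this example, and it assumes that lines starting with # are comments to be skipped.

```python
def parse_agents(lines):
    """Map each User-Agent name to its list of non-comment rule lines."""
    agents = {}
    current = None  # name of the agent whose block we are currently in
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        if line.lower().startswith('user-agent:'):
            # Everything after the first colon is the agent name.
            current = line.split(':', 1)[1].strip()
            agents.setdefault(current, [])
        elif current is not None:
            agents[current].append(line)
    return agents

# Shortened version of the robots.txt sample, for illustration only.
sample = [
    'User-Agent: Googlebot',
    '#Disallow: /',
    'Disallow: /comments',
    '',
    'User-Agent: Cogger',
    '#Disallow: /',
    'Disallow: /comments',
    'Disallow: /user',
    '',
    '# Alexa Archver, allow them',
    'User-Agent: ia_archive',
    'Disallow: /search',
]

rules = parse_agents(sample)
print(rules['Cogger'])
```

To run it against a real file, something like `parse_agents(open('robots.txt'))` should work, since a file object yields its lines one at a time.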

Thanks, everybody. I will work through every example and I'll be sure to get back here.
