Extracting text from file

Question

codedhands 0 Light Poster

15 Years Ago

Hi all. I am trying to extract a group of words from a text file, but I don't get the expected result using a regular expression. Here is my code:

mytext.txt contains:

Group=1
Name=mattew
Sex=male
Age=25

Group=2
Name=John
Sex=Male
Age=19

When I try to get Group=1, I end up getting everything below group 1:

import re

data=open('mytext.txt').read()

p=re.compile(r'Group: [0-9]\n{1}#.*$',re.S)
m=p.search(data)
print m.group()

I want to extract only a certain group and the lines directly below it. Please, how do I do this?

python

All 8 Replies

lllllIllIlllI 178 Veteran Poster

15 Years Ago

Just an idea, but the regular expression looks for the string "Group:" with a colon, while the text file has "Group=" with an equals sign, so that could be what is breaking it.

Murtan commented: I suspect you're right, regexs are picky that way +2
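
The separator mismatch is easy to confirm in isolation. The following is a minimal sketch, assuming the sample data from the question is held in a string instead of being read from mytext.txt; the variable names are illustrative only.

```python
import re

# Sample data copied from the question (contents of mytext.txt).
data = """Group=1
Name=mattew
Sex=male
Age=25

Group=2
Name=John
Sex=Male
Age=19
"""

# The original pattern expects "Group: " (colon and space), which never occurs.
no_match = re.search(r"Group: [0-9]", data)
print(no_match)          # None

# Matching the "=" that the file actually uses finds the header line.
match = re.search(r"Group=[0-9]", data)
print(match.group())     # Group=1
```

Because search() returns None when nothing matches, calling .group() on the result of the original pattern would raise an AttributeError rather than print anything.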

Aia 1,977 Nearly a Posting Maven

15 Years Ago

Group: [0-9]\n{1}#.*$

Setting aside the fact that that regex would not match any of the text you gave as an example, the key to understanding your problem is .*$. That is a greedy match: with re.S the . also matches newline characters, so .* consumes as many characters as it can. The $ is therefore satisfied not at the end of the first line but at the end of the whole text, which yields the longest possible match.

Edited 11 Years Ago by Reverend Jim because: Fixed formatting
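
The greedy behaviour described above can be compared with a lazy quantifier on the data from the question. This is only a sketch: it assumes that groups are separated by a blank line, which is true of the sample but is not stated as a rule anywhere in the thread.

```python
import re

# Sample data copied from the question.
data = """Group=1
Name=mattew
Sex=male
Age=25

Group=2
Name=John
Sex=Male
Age=19
"""

# Greedy: with re.S the dot also matches newlines, so ".*" runs to the end of the text.
greedy = re.search(r"Group=1\n(.*)$", data, re.S)
print(repr(greedy.group(1)))

# Lazy: ".*?" stops at the first blank line, or at the end of the text if there is none.
lazy = re.search(r"Group=1\n(.*?)(?:\n\n|\n?\Z)", data, re.S)
print(lazy.group(1))
```

The greedy capture includes the Group=2 block, while the lazy capture contains only the Name, Sex and Age lines of group 1.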

codedhands 0 Light Poster · Answer 1 · 2008-12-29T13:09:08+00:00

Thanks for your replies. I still don't get it. OK, here is an example of what I want to do. I will use the contents of a robots.txt file for this.

User-Agent: Googlebot
#Disallow: /
Disallow: /comments
Disallow: /user
Disallow: /poll
Disallow: /print
Disallow: /search

User-Agent: Cogger
#Disallow: /
Disallow: /comments
Disallow: /user
Disallow: /poll
Disallow: /print
Disallow: /search

# Alexa Archver, allow them
User-Agent: ia_archive
Disallow: /comments
Disallow: /user
Disallow: /poll
Disallow: /print
Disallow: /search

Using this example, I would like to extract the exclusion rules for "User-Agent: Cogger". That is, I want everything below User-Agent: Cogger. That's exactly my problem.

Thanks

Murtan 317 Practically a Master Poster · Answer 2 · 2008-12-29T14:45:04+00:00

I came up with a couple of regexes, depending on whether you want everything from Cogger to the end of the file, or just from Cogger to the next agent (or the end of the file).

import re

data = """User-Agent: Googlebot
#Disallow: /
Disallow: /comments
Disallow: /user
Disallow: /poll
Disallow: /print
Disallow: /search

User-Agent: Cogger
#Disallow: /
Disallow: /comments
Disallow: /user
Disallow: /poll
Disallow: /print
Disallow: /search

# Alexa Archver, allow them
User-Agent: ia_archive
Disallow: /comments
Disallow: /user
Disallow: /poll
Disallow: /print
Disallow: /search
"""

print("-- Greedy match --")
# Greedy: (.*) with re.DOTALL captures everything after the Cogger line to the end of the data.
mm = re.search(r"User-Agent: Cogger[ \t]*\n(.*)", data, re.DOTALL)
if mm:
    lines = mm.group(1).split('\n')
    for line in lines:
        print(line)

# output
# -- Greedy match --
# #Disallow: /
# Disallow: /comments
# Disallow: /user
# Disallow: /poll
# Disallow: /print
# Disallow: /search
#
# # Alexa Archver, allow them
# User-Agent: ia_archive
# Disallow: /comments
# Disallow: /user
# Disallow: /poll
# Disallow: /print
# Disallow: /search
#


print("-- Not so greedy --")
# Lazy: (.*?) stops at the next "User-Agent" or, failing that, at the end of the data.
mm = re.search(r"User-Agent: Cogger[ \t]*\n(.*?)(User-Agent|$)", data, re.DOTALL)
if mm:
    lines = mm.group(1).split('\n')
    for line in lines:
        print(line)

# output
# -- Not so greedy --
# #Disallow: /
# Disallow: /comments
# Disallow: /user
# Disallow: /poll
# Disallow: /print
# Disallow: /search
# 
# # Alexa Archver, allow them
#

Note that both of them seem to have extra lines at the end.
And you'll probably have to deal with the comments (I'm presuming that lines that start with # are comments).
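
One way to deal with both the trailing extras and the comments is to filter the captured lines after the match. The sketch below builds on the "not so greedy" regex; the helper name rules_for is hypothetical, the sample data is shortened, and treating every line that starts with # as a comment is the same assumption made above.

```python
import re

# Shortened sample in the same layout as the robots.txt example above.
data = """User-Agent: Googlebot
Disallow: /comments

User-Agent: Cogger
#Disallow: /
Disallow: /comments
Disallow: /user

# Alexa Archver, allow them
User-Agent: ia_archive
Disallow: /search
"""

def rules_for(agent, text):
    """Return the non-blank, non-comment lines that follow the given User-Agent line."""
    pattern = r"User-Agent: " + re.escape(agent) + r"[ \t]*\n(.*?)(?:User-Agent|\Z)"
    match = re.search(pattern, text, re.DOTALL)
    if match is None:
        return []
    lines = (line.strip() for line in match.group(1).split("\n"))
    # Drop empty strings and lines assumed to be comments (starting with "#").
    return [line for line in lines if line and not line.startswith("#")]

print(rules_for("Cogger", data))
```

Note that this also drops the commented-out "#Disallow: /" line, which may or may not be what is wanted.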

woooee 814 Nearly a Posting Maven · Answer 3 · 2008-12-30T08:37:35+00:00

Using a simple if

test_file = [
'User-Agent: Googlebot',
'#Disallow: /',
'Disallow: /comments',
'Disallow: /user',
'Disallow: /poll',
'Disallow: /print',
'Disallow: /search',
'',
'User-Agent: Cogger',
'#Disallow: /',
'Disallow: /comments',
'Disallow: /user',
'Disallow: /poll',
'Disallow: /print',
'Disallow: /search',
'',
'User-Agent: ia_archive',
'Disallow: /comments',
'Disallow: /user',
'Disallow: /poll',
'Disallow: /print',
'Disallow: /search' ]

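found_list = []
found = False          # True while we are inside the wanted User-Agent block
to_find = "Cogger"
for rec in test_file:
   rec = rec.strip()
   if rec.startswith("User-Agent"):
      # Every User-Agent line starts a new block; keep it only if it names to_find
      found = to_find in rec
   if found and len(rec):
      # Collect non-empty lines of the wanted block (including its header line)
      found_list.append(rec)

for rec in found_list:
   print(rec)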
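
The same line-by-line idea can be extended to collect the rules for every agent in one pass. This is a sketch rather than a drop-in replacement: the function name parse_robots is hypothetical, it matches the agent name exactly instead of by substring, and it skips lines starting with #, which differs slightly from the loop above.

```python
def parse_robots(lines):
    """Group rule lines under the User-Agent line that precedes them."""
    groups = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and lines assumed to be comments
        if line.startswith("User-Agent:"):
            # Start a new block keyed by the agent name after the colon.
            current = line.split(":", 1)[1].strip()
            groups[current] = []
        elif current is not None:
            groups[current].append(line)
    return groups

# Shortened sample in the same layout as test_file above.
sample = [
    "User-Agent: Googlebot",
    "Disallow: /comments",
    "",
    "User-Agent: Cogger",
    "#Disallow: /",
    "Disallow: /comments",
    "Disallow: /user",
    "",
    "# Alexa Archver, allow them",
    "User-Agent: ia_archive",
    "Disallow: /search",
]

groups = parse_robots(sample)
print(groups["Cogger"])
```

Since the function only iterates over lines, an open file object could be passed in place of the list.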

Murtan 317 Practically a Master Poster · Answer 4 · 2008-12-30T09:07:40+00:00

I like that type of solution better myself, I was just trying to make the regex work as that is what the OP asked for.

Not that we always give them what they ask for :)

Aia 1,977 Nearly a Posting Maven · Answer 5 · 2008-12-30T12:29:45+00:00

To RE or not to RE, that is the question.
Complimentary slides

codedhands 0 Light Poster · Answer 6 · 2008-12-31T03:10:25+00:00

Thanks to everybody. I will work through every example and I'll be sure to get back here.
