What is the easiest way to search a text file for a particular string?

I have text files resembling the following:
FILE A HAS 2266 LINES OF WHICH 951 WERE IGNORED
FILE B HAS 2193 LINES OF WHICH 878 WERE IGNORED
THERE WERE 2 DIFFERENCES

Can anyone help me with a Python script (or even Java) to search the text file for the string "differences"?
Is there a way to read the number of differences and store it in a variable?

Thanks in advance.

You can use a regular expression to grab the number of differences... please ask if there's anything you don't understand.

>>> import re
>>> # With a real file you would read the lines like this:
>>> # my_file_handler = open( 'myfile.txt' )
>>> # my_file_lines = my_file_handler.readlines()
>>> # Here the contents of the file are written out by hand instead
>>> my_file_lines = [ 'FILE A HAS 2266 LINES OF WHICH 951 WERE IGNORED', 'FILE B HAS 2193 LINES OF WHICH 878 WERE IGNORED', 'THERE WERE 2 DIFFERENCES']
>>> rc = re.compile( '^THERE WERE ([0-9]*) DIFFERENCES$' )
>>> for line in my_file_lines:
...     m = rc.match(line)  # match once and reuse the result
...     if m:
...         num_diffs = m.group(1)  # a string; use int(num_diffs) for a number
...         print(num_diffs)
...
2
>>>
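
To read the count straight from a file and keep it as an integer in a variable, the loop above can be wrapped in a small function. This is only a minimal sketch: the name count_differences and the file name myfile.txt are assumptions for illustration, and the pattern uses + instead of * so that at least one digit is required.

```python
import re

# Same pattern as above, but with + so at least one digit must be present.
DIFF_RE = re.compile(r'^THERE WERE ([0-9]+) DIFFERENCES$')

def count_differences(lines):
    """Return the number of differences as an int, or None if no line matches."""
    for line in lines:
        # strip() drops the trailing newline that file iteration keeps
        m = DIFF_RE.match(line.strip())
        if m:
            return int(m.group(1))
    return None

sample = [
    'FILE A HAS 2266 LINES OF WHICH 951 WERE IGNORED\n',
    'FILE B HAS 2193 LINES OF WHICH 878 WERE IGNORED\n',
    'THERE WERE 2 DIFFERENCES\n',
]
num_diffs = count_differences(sample)
print(num_diffs)  # 2
```

With a real file, an open file object can be passed in place of the list, for example inside "with open('myfile.txt') as f:" call count_differences(f).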

HTH

Thanks Jim, will give this a try.

rc = re.compile( '^THERE WERE ([0-9]*) DIFFERENCES$' )
1. What do the symbols ^, * and $ stand for?
2. The text file is a diff file from a comparer program, so the number expected could have two or three digits. Should I change the pattern accordingly, say to [0-999]?

PS. I am very new to Python. Sorry if this is pretty standard stuff ...

I assume you mean something simple like this:

# extract the numeric value before the word 'DIFFERENCES'

text = """\
FILE A HAS 2266 LINES OF WHICH 951 WERE IGNORED
FILE B HAS 2193 LINES OF WHICH 878 WERE IGNORED
THERE WERE 2 DIFFERENCES"""

word_list = text.split()
print(word_list)

for ix, word in enumerate(word_list):
    if word.upper() == 'DIFFERENCES':
        # the diff value is just before 'DIFFERENCES'
        diff = int(word_list[ix-1])

print(diff)  # 2
Comments
The simpler, the better!
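
If the word 'DIFFERENCES' might be missing, or the word before it might not be a number (as in "THERE WERE NO DIFFERENCES"), the same split idea can be wrapped in a function that reports that case instead of failing. A rough sketch only; the function name diff_from_words is made up for this example.

```python
def diff_from_words(text):
    """Return the integer just before 'DIFFERENCES', or None if there isn't one."""
    words = text.split()
    for ix, word in enumerate(words):
        # ix > 0 guards against 'DIFFERENCES' being the very first word;
        # isdecimal() makes sure the previous word is safe to pass to int()
        if word.upper() == 'DIFFERENCES' and ix > 0 and words[ix - 1].isdecimal():
            return int(words[ix - 1])
    return None

print(diff_from_words("THERE WERE 2 DIFFERENCES"))   # 2
print(diff_from_words("THERE WERE NO DIFFERENCES"))  # None
```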

rc = re.compile( '^THERE WERE ([0-9]*) DIFFERENCES$' )
1. What do the symbols ^, * and $ stand for?

This is standard regular expression stuff (refer here for more)...
^ - Matches at the beginning of the line of text (i.e., it won't match something in the middle of a line; the line must start with the text immediately after this symbol).
* - Matches zero or more repetitions of the preceding element, as many as possible (in this example the element is [0-9], so it matches as many numerals as it can, including none at all).
$ - Matches at the end of the line of text. In our case, the line MUST end with 'DIFFERENCES'.
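
To see what the anchors and the * actually change, here is a small sketch comparing the pattern with and without ^ and $. The sample line and the variable names are invented for the illustration.

```python
import re

anchored = re.compile('^THERE WERE ([0-9]*) DIFFERENCES$')
unanchored = re.compile('THERE WERE ([0-9]*) DIFFERENCES')

line = 'NOTE: THERE WERE 7 DIFFERENCES FOUND'

# search() scans the whole string, but ^ and $ still make the anchored
# pattern fail because of the extra text before and after.
print(anchored.search(line))              # None
print(unanchored.search(line).group(1))   # 7

# * allows zero digits, so a line with an empty number still matches
print(repr(anchored.match('THERE WERE  DIFFERENCES').group(1)))  # ''
```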

2. The text file is a diff file from a comparer program, so the number expected could have two or three digits. Should I change the pattern accordingly, say to [0-999]?

No, [0-9] is shorthand for writing [0123456789], meaning match any one character from this set (square brackets indicate a character set; the parentheses around it are what form the group that group(1) returns). Since the following character is *, the regular expression will match any numeral as many times as possible, so numbers of any length are already covered. Here's an example to illustrate:

>>> import re
>>> rc = re.compile( '^THERE WERE ([0-9]*) DIFFERENCES$' )
>>> rc.match('THERE WERE 19384856 DIFFERENCES')
<_sre.SRE_Match object at 0x01D11F60>
>>> rc.match('THERE WERE 19384856 DIFFERENCES').group(1)
'19384856'
>>> rc.match('THERE WERE NO DIFFERENCES')
>>> # match() returned None (the interpreter prints nothing for None), which means no match
>>> rc.match('THERE WERE 0 DIFFERENCES')
<_sre.SRE_Match object at 0x01D11F60>
>>> rc.match('THERE WERE 0 DIFFERENCES').group(1)
'0'
>>>
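
As a follow-up on the [0-999] idea, the sketch below shows that it still matches only one character, and that the number of digits is controlled by the quantifier after the brackets. The + and {1,3} variants are alternatives offered for illustration, not part of the original pattern.

```python
import re

# [0-999] is the range 0-9 plus two redundant 9s: still one character per match
one_char = re.compile('^[0-999]$')
print(bool(one_char.match('7')))    # True
print(bool(one_char.match('42')))   # False

# A quantifier after the set decides how many digits are allowed
any_digits = re.compile('^THERE WERE ([0-9]+) DIFFERENCES$')       # one or more
up_to_three = re.compile('^THERE WERE ([0-9]{1,3}) DIFFERENCES$')  # one to three

print(any_digits.match('THERE WERE 123 DIFFERENCES').group(1))   # 123
print(up_to_three.match('THERE WERE 1234 DIFFERENCES'))          # None
```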

Hope that helps
