Hi,
I'm using SAS regular expressions to read a large text file containing numbers and characters; the file may have up to 500 lines of text. I would like to extract certain key words, specifically weight quantities and measures, e.g. kilos or lbs.
Each line of text is very long and may contain multiple occurrences of the information I want. I want to capture one or more occurrences, depending on the substance of interest, and print a table of the substances and weights to an output file.
"11/01 120 lbs of sugar was found in the room across the hall. The next day we also found a sizeable amount of 5000lbs of glue, and yesterday we may find even more lb 100X".

I need detailed SAS code. If anyone knows VBS or Python, that would also be very helpful.

Thanks in advance,
-rachel


I do not know about SAS, but this is how a Python regular expression would catch the numbers.

>>> import re
>>> sentence = "11/01  120 lbs of sugar was found in the room across the hall. The next day we also found a sizeable amount of 5000lbs of glue, and yesterday we may find even more lb 100X"
>>> re.findall(r"(\d+) *lb", sentence)
['120', '5000']
>>>
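Since the question also mentions kilos, here is a minimal sketch that captures the unit together with the number. The names `WEIGHT_RE` and `find_weights` and the list of unit spellings are assumptions for illustration, not something from the original post.

```python
import re

# Assumed unit spellings (lb, lbs, kg, kgs, kilo, kilos); extend as needed.
WEIGHT_RE = re.compile(r"(\d+)\s*(lbs?|kgs?|kilos?)\b", re.IGNORECASE)

def find_weights(text):
    """Return (amount, unit) pairs for every number directly followed by a unit."""
    return WEIGHT_RE.findall(text)

sentence = ("11/01 120 lbs of sugar was found in the room across the hall. "
            "The next day we also found a sizeable amount of 5000lbs of glue, "
            "and yesterday we may find even more lb 100X")
print(find_weights(sentence))  # [('120', 'lbs'), ('5000', 'lbs')]
```

As with the one-group pattern, "lb 100X" is not matched, because the unit has to follow the number.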

Thanks, that works. But how do I use the re.findall function to read this type of data from a very big text file (equivalent to 100+ pages) and write the output to another text, CSV, or XLS file?

I'd like the output to look like this, where the first row holds the headers.
Thank you again for your great help.
-r

Substance  Amount  Measure
Sugar      120     lbs
Glue       5000    lbs

This is not actually a Python forum, but you can read the whole file into memory, as long as it is smaller than a few hundred million characters, and write the matches to another file. (I consider a number followed by "lb" that is not a weight too unlikely to need special handling.)

import re
import webbrowser

with open('small_file.txt') as infile, open('lbs.csv', 'w') as outfile:
    # Excel understands ';'-separated CSV correctly
    outfile.write('; '.join(re.findall(r"(\d+) *lb", infile.read().lower())))

webbrowser.open('lbs.csv')

Maybe you wanted to take the same input as before and produce something like the output you described. The following gives almost that, except that the substance comes last in each row:

import re
import webbrowser
""" small_file.txt
11/01  120 lbs of sugar was found in the room across the hall. The next day we also found a sizeable amount of 5000lbs of glue, and yesterday we may find even more lb 100X. We wait to get also 300 kg of flour.
"""
with open('small_file.txt') as infile, open('lbs_kg.csv', 'w') as outfile:
    # Excel understands ';'-separated CSV correctly
    outfile.write('\n'.join(';'.join(match)
                            for match in re.findall(r"(\d+) *(lbs|kg) of (\w+)\W", infile.read().lower())))

webbrowser.open('lbs_kg.csv')
""" result in file lbs_kg.csv
120;lbs;sugar
5000;lbs;glue
300;kg;flour
"""
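For a file of 100+ pages, reading everything at once is usually fine, but the same idea can also be sketched line by line with the standard csv module, writing the header row and the Substance, Amount, Measure column order asked for earlier. This is only a sketch assuming Python 3 (on Python 2.7 the output file would be opened with mode 'wb' instead of passing newline). The names `ITEM_RE`, `extract_rows` and `write_table` are made up for illustration, and the trailing `\W` of the pattern above is dropped so that a substance at the very end of a line still matches.

```python
import csv
import re

# Same idea as the pattern above, precompiled; the unit list (lbs, kg) is an assumption.
ITEM_RE = re.compile(r"(\d+) *(lbs|kg) of (\w+)")

def extract_rows(lines):
    """Yield (substance, amount, measure) for every match in an iterable of lines."""
    for line in lines:
        for amount, measure, substance in ITEM_RE.findall(line.lower()):
            yield substance.capitalize(), amount, measure

def write_table(in_path, out_path):
    """Read in_path one line at a time and write a ';'-separated table to out_path."""
    with open(in_path) as infile, open(out_path, "w", newline="") as outfile:
        writer = csv.writer(outfile, delimiter=";")
        writer.writerow(["Substance", "Amount", "Measure"])  # header row
        writer.writerows(extract_rows(infile))
```

Calling `write_table('small_file.txt', 'lbs_kg.csv')` would then produce the header line followed by one row per match. Iterating over the file object keeps only one line in memory at a time.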

Tony,
Thank you very much. But where is the output file? Does the code need a close statement? I can't find the output file after I run the code in the IDLE GUI.
Thank you again.
-r

It should end up in the same directory the script is started from. You could enter the full path if you want. The with statement takes care of closing the file, even if an exception occurs inside the with block.
You can check the directory with

import os
print(os.path.realpath(os.curdir))
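To see exactly where a relative file name such as 'lbs_kg.csv' would land, the name can be joined with the current working directory. A small sketch follows; `resolve_output_path` is a hypothetical helper, not part of any library.

```python
import os

def resolve_output_path(filename, base_dir=None):
    """Return the absolute path that a relative filename would be written to."""
    if base_dir is None:
        base_dir = os.getcwd()  # relative paths are resolved against this directory
    return os.path.abspath(os.path.join(base_dir, filename))

print(resolve_output_path("lbs_kg.csv"))
```

Passing the returned path to open() instead of the bare file name removes any doubt about where the output goes.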

Hi Tony,
I tried typing in the path name directly like this,

webbrowser.open("C:\Python27\lbs.csv")

Then I receive the response "True".

But then it opens the Google homepage, and there is still no file.
Anything else I can try?
Thanks,
-r

It looks like you have no application associated with CSV files. Just remove the webbrowser part, or define an association for .csv. I would also use a raw string (r'...') to disable \ escapes when writing Windows paths, or use a Unix-style path with /, which also works on Windows.
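To illustrate the raw string advice, here is a short sketch using a hypothetical path, chosen because it contains \t and \n. In "C:\Python27\lbs.csv" the sequences \P and \l happen not to be escapes, so that particular string was probably not the problem, but other paths can be silently altered.

```python
# Hypothetical path chosen because \t and \n are escape sequences in normal strings.
plain = "C:\temp\new.csv"      # \t becomes a tab, \n becomes a newline
raw = r"C:\temp\new.csv"       # raw string: backslashes are kept literally
forward = "C:/temp/new.csv"    # forward slashes are accepted by Windows as well

print(len(plain), len(raw))    # 13 15: two backslash pairs collapsed into single characters
print("\t" in plain, "\t" in raw)
```

The r prefix goes directly before the opening quote.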

I'm trying your suggestion, but I can't get it to work.
Would you please look at the code and correct it?
I appreciate your help a lot.
Thanks,
-r

import re

""" small_file.txt
11/01 120 lbs of sugar was found in the room across the hall. The next day we also found a sizeable amount of 5000lbs of glue, and yesterday we may find even more lb 100X. We wait to get also 300 kg of flour.
"""
with open('C:\Python27\small_file.txt',r') as infile, open('lbs_kg.txt', 'w') as outfile:
#excel understand ';' separated csv correctly
outfile.write('\n'.join(';'.join(match)
for match in re.findall(r"(\d+) *(lbs|kg) of (\w+)\W", infile.read().lower())))

outfile.open('lbs_kg.txt')
""" result in file lbs_kg.csv
120;lbs;sugar
5000;lbs;glue
300;kg;flour
"""
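For reference, one possible corrected version of the script above, offered as a sketch rather than the definitive fix. Three things change: the r prefix moves in front of the opening quote, the body of the with block is indented, and the outfile.open call is dropped, since file objects have no open method and the with statement already closes the file. The function names `format_matches` and `extract_to_file` are introduced here for illustration, and the example paths simply assume the input file really is in C:\Python27.

```python
import re

PATTERN = re.compile(r"(\d+) *(lbs|kg) of (\w+)\W")

def format_matches(text):
    """Return one 'amount;measure;substance' line per match found in text."""
    return "\n".join(";".join(match) for match in PATTERN.findall(text.lower()))

def extract_to_file(in_path, out_path):
    """Read in_path, extract the matches, and write them to out_path."""
    # Both files are closed automatically when the with block ends.
    with open(in_path) as infile, open(out_path, "w") as outfile:
        outfile.write(format_matches(infile.read()))

# Example call (paths taken from the post above; adjust them to your machine):
# extract_to_file(r"C:\Python27\small_file.txt", r"C:\Python27\lbs_kg.txt")
```

Giving the output file a full path as well means it no longer depends on the directory IDLE happens to start in.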
