For my new job I need to learn a bit of Python to parse and extract data from .txt files.

Essentially, I have a table that looks like this:
Pair NO. Sense Antisense Coding/Noncoding Cis/Trans Overlap
ATH00001 At1g02170 At1g02180 coding-coding cis 3

As you can see, the top are simply categories for each respective column, and the bottom is the actual data. Now, I have 30000 of the data lines, and I need to extract the following:
If the first two letters of the first column are MM or HS, I need to extract the data in the first and third columns of that line.
Technically I could do this in Excel, but soon I'll be moving on to databases where Python will be the only solution.

I was just wondering what the best approach to this was. I was thinking something along the lines of splitting the string with aString.split(), then only taking the first and third element and writing them into a new file.

However, I'm utterly clueless as to how to start.
I guess I would open the file:
inp = file("SAdatabsse.txt","r")
Create a file to write to:
outp = open("SAdatabse2.txt","w")
Read lines with readlines command, then search for first two letters with find('MM',0,2) or find('HS',0,2). Then I guess I'd use some sort of boolean expression with a loop so that if it comes up true the first and third column for that line are stored in the new file.
Technically, I think I could do it, but I just have no idea how to structure the code, seeing as this is my first time working with Python.

So please, if you can, I would appreciate all and any help.
Thanks for your time, - Siberian.


You're on the right track! However, I typically like to use slicing when I know precisely where on a line the information I'm looking for is going to be. Here's what you suggested in a Python-friendly format:

inp = open('SAdatabsse.txt', 'r')
outp = open('SAdatabse2.txt', 'w')

for eachline in inp:
    line_data = eachline.split()
    if eachline[:2] == 'MM' or eachline[:2] == 'HS':
        # Index 0 is the first column and index 2 is the third; '\n' ends each output record
        outp.write( '%s %s\n' % ( line_data[0], line_data[2] ) )

inp.close()
outp.close()

BTW, the notation of [:2] is called slicing. Open up a Python interpreter (shell) and play around with it... basically you're "slicing" a piece out of an object.
example of playing around in shell:

>>> slice1 = 'Slicing Example 1'
>>> slice2 = [ 1, 2, 3, 4, 5, 6 ]
>>> slice1[:4]
'Slic'
>>> slice1[4:]
'ing Example 1'
>>> slice1[5:9]
'ng E'
>>> slice2[:2]
[1, 2]
>>> slice2[-2:]
[5, 6]
>>>

As you can see, omitting a number means either "from the beginning" or "to the end" of the object.
And you can use negative indexes too, which count back from the end!
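
One more thing worth trying in the shell, since it matters for blank or very short lines in a file: a slice that runs past the end of a string quietly returns whatever is there, while a plain index raises an error. A small sketch (the variable names are just for illustration):

```python
# Slicing is forgiving about bounds; plain indexing is not.
short = 'M'
empty = ''

print(repr(short[:2]))   # 'M' : the slice returns only what exists
print(repr(empty[:2]))   # ''  : slicing an empty string gives an empty string

try:
    empty[0]             # indexing a position that does not exist raises
except IndexError as err:
    print('IndexError:', err)
```

So a test like eachline[:2] == 'MM' is safe even on an empty line, whereas line_data[0] would fail there.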

It is a good idea to ensure that there is no leading or trailing white space in any of the records, so you should include
rec = rec.strip()
then you can use

substrs=rec.split()
substrs[0] = substrs[0].upper()
if substrs[0].startswith("MM") or substrs[0].startswith("HS"):
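
Put together, those lines could look something like the sketch below. The function name wanted_columns and the MM sample row are made up for illustration; the only trick added is that startswith() also accepts a tuple of prefixes, so both tests fit in one call.

```python
def wanted_columns(rec, prefixes=('MM', 'HS')):
    """Return (first, third) column of rec if its ID starts with a prefix, else None."""
    substrs = rec.strip().split()
    if len(substrs) < 3:
        return None                      # blank or too-short record
    first = substrs[0].upper()           # make the prefix test case-insensitive
    if first.startswith(prefixes):       # a tuple means "any of these prefixes"
        return first, substrs[2]
    return None

# Made-up MM record with stray whitespace, plus the real sample row from the question
print(wanted_columns('  mm00001 GeneA GeneB coding-coding cis 3\n'))
print(wanted_columns('ATH00001 At1g02170 At1g02180 coding-coding cis 3'))
```

The first call should give back the upper-cased ID and the third column; the second returns None because ATH is not one of the prefixes.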

jlm699, thanks so much!
I used to do a bit of C++, so when I look at the code it makes perfect sense, just for me structuring is, for some reason, difficult.
I've got a question: When I slice, does a space count as a character? What about a tab?

Woooee (or jlm),
the rec = rec.strip() gets rid of any white space that would be before the first two letters, correct? So as not to throw off the slicing?

Thank you guys so much for your help, and I'm slowly learning the ways of Python.
P.S. if I was to implement the rec = rec.strip(), I would put that right before the "if" statement, right?

Alright, it works like a charm. I ended up using this:

inp = open('C:\\Documents and Settings\\Lev.DEEPBLUE\\Desktop\\halfSA.txt', 'r')
outp = open('C:\\Documents and Settings\\Lev.DEEPBLUE\\Desktop\\halfSA2.txt', 'w')

for eachline in inp:
    line_data = eachline.split()
    if eachline[:2] == 'MM' or eachline[:2] == 'HS':
        outp.write( '%s %s %s' % ( line_data[1], line_data[2], "\n" ) )

inp.close()
outp.close()

strip() takes away all leading and trailing whitespace. It will always remove newline characters and carriage returns (\n on Linux and \r\n on Windows), and it will always remove excess spaces and tabs at the beginning and end of the text.

For your other question: yes, spaces, tabs, and even newlines each count as a single character and will be taken into account when using slicing.
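
A quick way to see this for yourself, as a small sketch (the record below is made up, with a tab deliberately placed in front):

```python
line = '\tMM00001 GeneA GeneB'   # made-up record with a leading tab

print(repr(line[:2]))            # '\tM' : the tab occupies the first position
print(repr(line.strip()[:2]))    # 'MM'  : strip() removes it before slicing
print(len(' \t\n'))              # 3     : space, tab and newline are one character each
```

That is why stripping before slicing matters: without it, a stray tab or space would make the 'MM' test fail.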

Here's your code with the strip statement. Also, you do not need a separate %s just to put a newline into your text. Read up on string formatting a little bit here, as I'll make a slight modification to that statement...

inp = open('C:\\Documents and Settings\\Lev.DEEPBLUE\\Desktop\\halfSA.txt', 'r')
outp = open('C:\\Documents and Settings\\Lev.DEEPBLUE\\Desktop\\halfSA2.txt', 'w')

for eachline in inp:
    eachline = eachline.strip()
    # You could alternately do line_data = eachline.strip().split(), which does the same thing
    line_data = eachline.split()
    if eachline[:2] == 'MM' or eachline[:2] == 'HS':
        outp.write( '%s %s\n' % ( line_data[1], line_data[2] ) )

inp.close()
outp.close()
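
If you want to go one step further, here is a rough sketch of the same loop using the with statement, which closes both files for you even if something goes wrong partway through. The function name extract_columns, the temporary files, and the MM/HS rows are all made up for the demo; it tests the first column with startswith() instead of slicing the line, which should behave the same because split() already ignores leading whitespace.

```python
import os
import tempfile

def extract_columns(in_path, out_path, prefixes=('MM', 'HS')):
    """Write columns 2 and 3 of every record whose ID starts with a prefix."""
    kept = 0
    with open(in_path, 'r') as inp, open(out_path, 'w') as outp:
        for eachline in inp:
            line_data = eachline.split()
            # The length check skips blank or short lines safely
            if len(line_data) >= 3 and line_data[0].startswith(prefixes):
                outp.write('%s %s\n' % (line_data[1], line_data[2]))
                kept += 1
    return kept

# Small demo on throwaway files; the MM and HS rows are invented sample data.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'halfSA.txt')
dst = os.path.join(workdir, 'halfSA2.txt')
with open(src, 'w') as f:
    f.write('Pair NO. Sense Antisense Coding/Noncoding Cis/Trans Overlap\n')
    f.write('ATH00001 At1g02170 At1g02180 coding-coding cis 3\n')
    f.write('MM00001 GeneA GeneB coding-coding cis 5\n')
    f.write('HS00002 GeneC GeneD coding-noncoding trans 7\n')

print(extract_columns(src, dst))   # number of records written
with open(dst) as f:
    print(f.read())
```

For your real data you would just call extract_columns() with your own two paths.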