For my new job I need to learn a bit of Python to parse and extract data from .txt files.
Essentially, I have a table that looks like this:
Pair NO.   Sense      Antisense  Coding/Noncoding  Cis/Trans  Overlap
ATH00001   At1g02170  At1g02180  coding-coding     cis        3
As you can see, the top row simply names each column, and the bottom row is the actual data. Now, I have 30000 of these data lines, and I need to extract the following:
If the first two letters of the first column are MM or HS, I need to extract the first and third columns of that line.
Technically I could do this in Excel, but soon I'll be moving onto databases where Python will be the only solution.
I was just wondering what the best approach to this was. I was thinking something along the lines of splitting the string with aString.split(), then only taking the first and third element and writing them into a new file.
However, I'm utterly clueless as to how to start.
I guess I would open the file:
inp = file("SAdatabsse.txt","r")
Create a file to write to:
outp = open("SAdatabse2.txt","w")
Read lines with readlines command, then search for first two letters with find('MM',0,2) or find('HS',0,2). Then I guess I'd use some sort of boolean expression with a loop so that if it comes up true the first and third column for that line are stored in the new file.
Technically, I think I could do it, but I just have no idea how to structure the code, seeing as this is my first time working with Python.
So please, if you can, I would appreciate all and any help.
Thanks for your time, - Siberian.
You're on the right track! However, I typically like to use slicing when I know precisely where on a line the information I'm looking for is going to be. Here's what you suggested in a friendly Python format:

inp = open('SAdatabsse.txt', 'r')
outp = open('SAdatabse2.txt', 'w')

for eachline in inp:
    line_data = eachline.split()
    if eachline[:2] == 'MM' or eachline[:2] == 'HS':
        # First and third columns are indices 0 and 2; '\n' ends each output line
        outp.write( '%s %s\n' % ( line_data[0], line_data[2] ) )

inp.close()
outp.close()

BTW, the notation [:2] is called slicing. Open up a Python interpreter (shell) and play around with it... basically you're "slicing" a piece out of an object.

Example of playing around in the shell:

>>> slice1 = 'Slicing Example 1'
>>> slice2 = [ 1, 2, 3, 4, 5, 6 ]
>>> slice1[:4]
'Slic'
>>> slice1[4:]
'ing Example 1'
>>> slice1[5:9]
'ng E'
>>> slice2[:2]
[1, 2]
>>> slice2[-2:]
[5, 6]
>>>

As you can see, by omitting a number you are basically saying either the beginning or the end of the object.
And you can use negative indexing too!
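One practical consequence of slicing, sketched here on the assumption that some lines in a file may be blank or shorter than expected: a slice that runs past the end of a string or list simply returns whatever is there, while a plain index raises IndexError. The sample values are made up for illustration.

```python
# Slices are forgiving about out-of-range bounds; plain indexing is not.
empty_line = ''
short_line = 'M'
fields = 'MM00001 geneA'.split()   # made-up record with only two columns

prefix_of_empty = empty_line[:2]   # '' (no error on an empty string)
prefix_of_short = short_line[:2]   # 'M' (fewer than two characters available)
first_three = fields[:3]           # ['MM00001', 'geneA'] (list slices behave the same)

try:
    third_column = fields[2]       # indexing a missing column raises IndexError
except IndexError:
    third_column = None
```

So a check like eachline[:2] is safe on an empty line, but picking a column by index deserves a length check if the file might contain short lines.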
Let's Go Pens!
It is a good idea to ensure that there is no leading or trailing white space in any of the records, so you should include
rec = rec.strip()
then you can use
substrs = rec.split()
substrs[0] = substrs[0].upper()
if substrs[0].startswith("MM") or substrs[0].startswith("HS"):
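Putting those pieces together, a minimal sketch might look like the following. The function name select_pairs and the sample records are made up for illustration; it relies on str.startswith accepting a tuple of prefixes, and it returns the first and third columns as the original question asked.

```python
def select_pairs(records, prefixes=('MM', 'HS')):
    """Return (first, third) column pairs for records whose ID starts with a prefix."""
    selected = []
    for rec in records:
        substrs = rec.strip().split()
        if len(substrs) < 3:
            continue  # skip blank or incomplete lines
        pair_id = substrs[0].upper()
        if pair_id.startswith(prefixes):  # a tuple matches any of the prefixes
            selected.append((pair_id, substrs[2]))
    return selected


# Hypothetical sample records, not taken from the real data file.
sample = [
    'ATH00001 At1g02170 At1g02180 coding-coding cis 3\n',
    '  mm00002 geneA geneB coding-coding cis 5\n',
    'HS00003 geneC geneD coding-noncoding trans 7\n',
]
result = select_pairs(sample)
```

Because of the upper() call the match is case-insensitive, which may or may not be wanted for the real data.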
jlm699, thanks so much!
I used to do a bit of C++, so the code makes perfect sense when I look at it; it's just that structuring it myself is, for some reason, difficult.
I've got a question: When I slice, does a space count as a character? What about a tab?
Woooee (or jlm),
the rec = rec.strip() gets rid of any white space that would be before the first two letters, correct? So as not to throw off the slicing?
Thank you guys so much for your help, and I'm slowly learning the ways of Python.
P.S. if I was to implement the rec = rec.strip(), I would put that right before the "if" statement, right?
Alright, it works like a charm. I ended up using this:
inp = open('C:\\Documents and Settings\\Lev.DEEPBLUE\\Desktop\\halfSA.txt', 'r')
outp = open('C:\\Documents and Settings\\Lev.DEEPBLUE\\Desktop\\halfSA2.txt', 'w')
for eachline in inp:
    line_data = eachline.split()
    if eachline[:2] == 'MM' or eachline[:2] == 'HS':
        outp.write( '%s %s %s' % ( line_data[1], line_data[2], "\n" ) )
inp.close()
outp.close()

Strip takes away all trailing and leading whitespace. It always removes newline characters and carriage returns (\n on Linux and \r\n on Windows), as well as any spaces and tabs at the beginning and end of the text.
For your other question: yes, spaces, tabs, and even newlines each count as a single character and are taken into account when slicing.
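A quick illustration of how whitespace interacts with slicing, using a made-up line that has a leading tab and a Windows-style line ending (the record itself is hypothetical):

```python
raw = '\tMM00001 geneA geneB\r\n'   # made-up line: leading tab, Windows line ending

raw_prefix = raw[:2]        # '\tM': the tab counts as one character
clean = raw.strip()         # 'MM00001 geneA geneB'
clean_prefix = clean[:2]    # 'MM'

# strip() only touches the ends; whitespace between columns is left alone.
inner = 'MM00001\tgeneA'.strip()   # still contains the tab
```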
Here's your code with the strip statement. Also, you do not have to use another %s to put a newline into your text. Read up on string formatting a little bit, as I'll make a slight modification to that statement...
inp = open('C:\\Documents and Settings\\Lev.DEEPBLUE\\Desktop\\halfSA.txt', 'r')
outp = open('C:\\Documents and Settings\\Lev.DEEPBLUE\\Desktop\\halfSA2.txt', 'w')
for eachline in inp:
    eachline = eachline.strip()
    # You could alternately do line_data = eachline.strip().split(), which does the same thing
    line_data = eachline.split()
    if eachline[:2] == 'MM' or eachline[:2] == 'HS':
        outp.write( '%s %s\n' % ( line_data[1], line_data[2] ) )
inp.close()
outp.close()
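For readers on current Python versions, the same loop can be sketched with "with" blocks, which close both files even if an error occurs partway through. The function name extract_columns, its parameters, and the temporary sample file are assumptions for illustration, not part of the script above.

```python
import os
import tempfile


def extract_columns(in_path, out_path, prefixes=('MM', 'HS')):
    """Copy the second and third columns of matching lines to out_path."""
    with open(in_path) as inp, open(out_path, 'w') as outp:
        for eachline in inp:
            eachline = eachline.strip()
            line_data = eachline.split()
            # Require three columns so short or blank lines are skipped safely
            if eachline[:2] in prefixes and len(line_data) >= 3:
                outp.write('%s %s\n' % (line_data[1], line_data[2]))


# Demonstration on a throwaway file with made-up records.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'sample.txt')
dst = os.path.join(workdir, 'sample_out.txt')
with open(src, 'w') as f:
    f.write('ATH00001 At1g02170 At1g02180 coding-coding cis 3\n')
    f.write('MM00002 geneA geneB coding-coding cis 5\n')

extract_columns(src, dst)
with open(dst) as f:
    output = f.read()
```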