biopython

msaenz
#1
Jul 8th, 2006
Hi all,
I don't know how many people here have worked with Biopython, but I am so close to the answer I can feel it! I just need a little help again. Basically, this script takes a FASTA file from NCBI and turns it into a dictionary, which is wonderful and easy. However, my FASTA file has several records with the same ID, and each one needs a unique ID. I wanted to append a number to any ID that repeats (the error comes from Bio.Mindy). I wrote this:
if key in index:
index[key] = index[key] + 1
but it didn't work. Any ideas? Thanks, guys/gals!


import string
from Bio import Fasta
from Bio.Alphabet import IUPAC
def get_accession_num(fasta_record):
title_atoms = string.split(fasta_record.title)
# all of the accession number information is stuck in the first element
# and separated by '|'s
accession_atoms = string.split(title_atoms[0], '|')

# the accession number is the 4th element
gb_name = accession_atoms[3]
# strip the version info before returning
return gb_name[:-2]
index_file(file_to_index,index_file_to_create,function_to_get_index_key)

if key in index:
index[key] = index[key] + 1
Fasta.index_file("ls_orchid.fasta", "orchid_2.idx",get_accession_num)
#dna_parser = Fasta.SequenceParser(IUPAC.protein)
orchid_dict = Fasta.Dictionary("orchid_2.idx")#,dna_parser)
Last edited by msaenz; Jul 8th, 2006 at 12:55 am.

vegaseat
#2
Jul 8th, 2006
How familiar are you with Python?

I noticed in your code that neither the statement after "if key in index:" nor the statements in the function "def get_accession_num(fasta_record):" are properly indented to form a block. Also, where does index come from?

Can you show us a short sample of your fasta file and what the dictionary should look like?
Last edited by vegaseat; Jul 8th, 2006 at 2:33 pm.
May 'the Google' be with you!

msaenz
#3
Jul 8th, 2006
Pretty decent. I copied and pasted, and probably messed up how my code looks. The index comes from
index_file(file_to_index, index_file_to_create, function_to_get_index_key)
If you look at the module Bio.Mindy (http://biopython.org/DIST/docs/api/), the index comes from the index number in the FASTA file from NCBI.

vegaseat
#4
Jul 8th, 2006
I assume you are talking about a DNA or RNA virus with nucleic acid or amino acid sequences. It still would be nice to see a short section of the data file, and what the index dictionary would look like.

If your value is a string then use:
    if key in index:
        index[key] = index[key] + '1'
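
For what it's worth, the "add a number to repeated IDs" idea can be sketched in plain Python, outside of Bio.Mindy. This is only a sketch: the helper name `make_unique_keys`, the `_N` suffix format, and the placeholder keys are assumptions for illustration, not anything Biopython provides.

```python
from collections import Counter


def make_unique_keys(keys):
    """Return the keys in order, suffixing repeats with _2, _3, ...

    Hypothetical helper, not a Biopython function. It does not guard
    against a suffixed key colliding with a key already in the input.
    """
    seen = Counter()
    unique = []
    for key in keys:
        seen[key] += 1
        # The first occurrence keeps its name; later ones get a counter suffix.
        unique.append(key if seen[key] == 1 else f"{key}_{seen[key]}")
    return unique


# Placeholder keys, not real accession numbers.
print(make_unique_keys(["ACC1", "ACC1", "ACC2"]))
```

The suffix is added before any index is built, so every record already has a distinct key by the time it is stored.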
Last edited by vegaseat; Jul 8th, 2006 at 3:16 pm.

G-Do
#5
Jul 10th, 2006
Hi msaenz,

I have used BioPython before and I am familiar with the format of FASTA files, but I'm not sure that I understand the problem. Is it that you want to create an indexed database of FASTA files for searching, but you keep getting key collisions because you're using GenBank accession numbers as your primary keys? So you want to add some number to the GenBank accessions - 1, 2, 3, and so on, yes?

If you are downloading all your files from NCBI, why not just use the NCBI identifiers? You know, the numbers that appear immediately after the ">gi|" on the header lines - I believe that these are unique primary keys for the NCBI Nucleotide database, so if you are only downloading FASTAs from NCBI these should be unique primary keys for your database as well. You could always pull up the GenBank accessions later when displaying information (since everything is getting indexed anyway).

Now, if you aren't downloading files solely from NCBI, or there is some other problem with this solution (maybe there's a front-end you don't have access to which displays the index identifier as the file header, or something similarly ugly) let me know and we'll take it from there.

Hope that helps.
Vi veri veniversum vivus vici

G-Do
#6
Jul 10th, 2006
To clarify (a primer on dealing with biological sequence data):

A FASTA file is a text file containing an oligonucleotide or protein sequence and some header information. These files are very popular with computational biologists and are (I believe) the most popular way of formatting biological sequence data for BLAST and phylogenetic (e.g., Phylip) analysis. The following is a sample FASTA file containing the mRNA sequence of a developmental gene, PitX2, from mouse.
>gi|109948276|ref|NM_001042504.1| Mus musculus paired-like homeodomain transcription factor 2 (Pitx2), transcript variant 1, mRNA
GGAGAGAGAGTGCGAGACCGAGAGAGAAAGCCGGAGAGCAGCAGACAGAAACTGCCGGCGCCCGCTAGCT
TTAGCAGCCCCCCGCGTGGACCCTCTCGGAACTTGGCACCCTCAAGATCCCCGCAGTTCCACCCAGACCC
GCTCCACGGCGCTGGCTGTGCAGCCCGAGCCTCGGCCGCCTGGCAGTCACCCTGGGAAGCGGTGGGACGG
GGAGACAGCCGTTCTCTCTCCGGTAGCCGATAACCGGGAATGGAGACCAATTGTCGCAAACTAGTGTCGG
CCTGCGTGCAATTAGAGAAAGATAAGGGCCAGCAAGGAAAGAATGAGGATGTGGGCGCCGAGGACCCGTC
CAAGAAGAAGCGGCAACGCCGGCAGAGGACTCATTTCACTAGCCAGCAGCTGCAGGAGCTGGAAGCCACT
TTCCAGAGAAACCGCTACCCAGACATGTCCACTCGCGAAGAAATCGCCGTGTGGACCAACCTTACGGAAG
CCCGAGTCCGGGTTTGGTTCAAGAATCGCCGGGCCAAATGGAGAAAGCGGGAACGCAACCAGCAGGCCGA
GCTGTGCAAGAATGGCTTTGGGCCGCAGTTCAACGGGCTCATGCAGCCCTACGATGACATGTACCCCGGC
TATTCGTACAACAATTGGGCTGCCAAGGGCCTCACGTCAGCGTCTCTGTCCACCAAGAGCTTCCCCTTCT
TCAACTCCATGAACGTCAATCCCCTGTCCTCTCAGAGTATGTTTTCCCCGCCCAACTCCATCTCATCTAT
GAGTATGTCGTCCAGCATGGTGCCCTCCGCGGTGACCGGCGTCCCGGGCTCCAGCCTCAATAGCCTGAAT
AACTTGAACAACCTGAGCAGCCCGTCGCTGAATTCCGCGGTGCCCACGCCCGCCTGTCCTTACGCGCCGC
CGACTCCTCCGTACGTTTATAGGGACACATGTAACTCGAGCCTGGCCAGCCTGAGACTGAAAGCAAAGCA
GCACTCCAGCTTCGGCTACGCCAGCGTGCAGAACCCGGCCTCCAACCTGAGTGCTTGCCAGTATGCAGTC
GACCGGCCGGTGTGAACCGCGCCCAGGGCGCGGGGATCCGAGGACTGTCGGAGTGGGCAACTCTGCCCCA
GAAAGACTGAGAATTGTGCTAGAAGGTCGTGCGCACTATGGGAAGGAAGAGGGGGGAAAAAAGATCAGAG
GAAAAGAAACCACTGAATTCAAAGAGAGAGCGCCTTTGATTTCAAAGGAATGTCCCCAAGTGTCTACGTC
TTTCGCTAAGAGTATTCCCAACAGTTGGAGGACGCGTACGCCCACAAATGTTTGACTGGATATGACATTT
TAACATTACTATAAGCTTGTTATTTTTTAAGTTTAGCATTGTTAACATTAAAATGACTGAAAGGATGTAT
ATATATCGAAATGTCAAATTAATTTTATAAAAGCAGTTGTTAGTACTATCACGACAGTGTTTTTAAAGGC
TAGGCTTTAAAATAAAGCATGTTATACAGAATCAGTTAGGATTTTTCGCTTGCGAGCAAAGGAATGTATA
TACTAAATGCCACACTGTATGTTTCTAACATATTATTATTATAAAAATGTGTGAATATAAGTTTTAGAGT
AGTTTCTCTGGTGGATGCCTTGTTTCTGAAACTGCTATGTACGACCCATCCTGTGTATAACATTTCGTAC
GATATTATTGTTTTACTTTTCAGCAAATATGAAAAAAAATGTGTTTTATTTCTTGGGAGTAAAATATACT
GCATACAAA
If I understand msaenz correctly, he or she wants to create his or her own local database of these records (or at least, the ones which are important to his or her own research) so that entries can be pulled out in the form of dictionaries, with the sequence as one key-value pair, and the elements of the header as other key-value pairs. As any relational database needs a primary key, msaenz has (I think) chosen the GenBank accession (also a RefSeq ID, which is why you see "ref" immediately before it), which for this file is "NM_001042504.1." My recommendation is to use the NCBI ID instead, which for this file is "109948276." NCBI archives their own data using this field as the primary key, and as far as I know they are unique, so if msaenz is drawing his or her FASTA entries solely from NCBI and uses the NCBI ID as the primary key, he or she should not run into key collision problems when indexing.
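
To make the "records as dictionaries, keyed by NCBI ID" idea concrete, here is a rough sketch in plain Python 3. It deliberately avoids Bio.Fasta and Bio.Mindy, so treat it as an illustration of the keying scheme rather than the Biopython way. The function name `parse_fasta_by_gi` and the field names are my own assumptions, it assumes every header has the form `>gi|<number>|<db>|<accession>| description`, and the sample text is just the header and first two sequence lines of the Pitx2 record above.

```python
def parse_fasta_by_gi(text):
    """Map each record's gi number to its header fields and sequence.

    Hypothetical helper; assumes NCBI-style 'gi|number|db|accession|' headers.
    """
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Split into at most five pieces: gi, number, db, accession, description.
            atoms = line[1:].split("|", 4)
            if len(atoms) < 5 or atoms[0] != "gi":
                raise ValueError(f"unexpected header: {line}")
            gi = atoms[1]
            if gi in records:
                raise KeyError(f"duplicate gi: {gi}")
            current = {
                "db": atoms[2],
                "accession": atoms[3],
                "description": atoms[4].strip(),
                "sequence": "",
            }
            records[gi] = current
        elif current is not None:
            # Sequence lines are concatenated onto the most recent record.
            current["sequence"] += line
    return records


sample = """>gi|109948276|ref|NM_001042504.1| Mus musculus paired-like homeodomain transcription factor 2 (Pitx2), transcript variant 1, mRNA
GGAGAGAGAGTGCGAGACCGAGAGAGAAAGCCGGAGAGCAGCAGACAGAAACTGCCGGCGCCCGCTAGCT
TTAGCAGCCCCCCGCGTGGACCCTCTCGGAACTTGGCACCCTCAAGATCCCCGCAGTTCCACCCAGACCC
"""

records = parse_fasta_by_gi(sample)
print(records["109948276"]["accession"])  # NM_001042504.1
```

Raising on a duplicate gi is a design choice: if the assumption that gi numbers are unique ever fails for a given file, the collision is reported instead of one record silently overwriting another.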

The NCBI database can be found here; oligonucleotide sequences can be searched for by clicking "Nucleotide" on the drop-down menu at the left and entering some search parameters.

msaenz
#7
Jul 10th, 2006
Yes, that is exactly what I wanted to do! I wanted to use the numbers after the gi as identifiers instead of the numbers after the ref, since some proteins share the same ref number (the file covers chain A and chain B), and the dictionary cannot be built when ref numbers repeat. That is why I wanted to use the gi identifiers, but I didn't know how. Sorry I've been out of the loop on this, and thanks for your posts.

G-Do
#8
Jul 10th, 2006
Hi msaenz,

In that case, here is what you need to do. Rather than getting accession_atoms[3], which pulls out the fourth token in the header (the RefSeq ID), get accession_atoms[1], which pulls out the second token in the header (the NCBI ID). Don't bother stripping it with the [:-2]; get rid of that part and just return the string as it is. That should give your database unique primary keys.
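
In code, the change might look roughly like this. The sketch uses Python 3 string methods instead of the old `string.split`, and `FakeRecord` is only a stand-in I made up so the function can be tried without Biopython; the one thing it assumes about a real record is the `.title` attribute that the original function already uses. The name `get_gi_num` is also just a suggestion.

```python
from collections import namedtuple

# Stand-in for a parsed FASTA record; only the .title attribute is used,
# mirroring fasta_record.title in the original function.
FakeRecord = namedtuple("FakeRecord", ["title"])


def get_gi_num(fasta_record):
    """Return the NCBI gi number, the second '|'-separated token of the title."""
    title_atoms = fasta_record.title.split()
    # All of the identifier information is in the first whitespace-delimited element.
    accession_atoms = title_atoms[0].split("|")
    # Index 1 is the gi number; it is returned whole, with no [:-2] stripping.
    return accession_atoms[1]


record = FakeRecord(
    "gi|109948276|ref|NM_001042504.1| Mus musculus paired-like "
    "homeodomain transcription factor 2 (Pitx2), transcript variant 1, mRNA"
)
print(get_gi_num(record))  # 109948276
```

A function like this would take the place of get_accession_num as the third argument to the indexing call.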

Just out of curiosity, what kind of work are you doing with protein sequences? Are you doing any structural analysis?

For everyone else, msaenz's comment about the protein chains means this. Proteins are biological molecules composed of one or more discrete units called chains; each chain is a string of small amino acid molecules daisy-chained end-to-end. There are twenty standard amino acids in the human body, and each is identifiable by a one-letter code, so that a protein chain can be thought of as a string of some length which draws from an alphabet of twenty characters. Some proteins are made up of just a single chain. Others are made up of many chains. The chains interact with each other (and also within themselves) by means of various physical and chemical forces (electrostatic attraction, van der Waals forces, hydrogen bonding, the hydrophobic effect, etc.) in such a way that they change shape: they tangle or "fold" around each other, and in so doing they become functional.
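
For the programmers in the audience, the "string over a twenty-letter alphabet" picture translates almost directly into code. A small sketch, where the function name `is_standard_chain` and the two-chain example protein are made up for illustration (the sequences are not real data):

```python
# One-letter codes of the twenty standard amino acids.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def is_standard_chain(sequence):
    """True if the chain is non-empty and uses only the twenty standard codes."""
    return bool(sequence) and set(sequence.upper()) <= STANDARD_AMINO_ACIDS


# A multi-chain protein modelled as chain label -> sequence (invented sequences).
protein = {"A": "MKVLHS", "B": "MKVLHS"}

for label, chain in protein.items():
    print(label, len(chain), is_standard_chain(chain))
```

Two chains can carry identical sequences, as in this toy example, so anything derived from the sequence or from a shared record name cannot tell them apart; a separate label or ID per chain is needed.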

Now, I'm not sure what the RefSeq naming convention is when it comes to protein chains (I knew it at some point, then promptly forgot it when I stopped working with protein sequences). However, if what msaenz says is true, it appears that RefSeq doesn't allow us to make distinctions about protein chains using the RefSeq ID alone.

I hope that helps.

msaenz
#9
Jul 10th, 2006
Thanks for your help. I am actually part of an REU funded by a National Science Foundation grant at my university. My research involves FASTA files and Plone, an open source content management system: I am trying to upload FASTA files and parse each sequence into text by making an archetype.

I am actually helping another student in biology. I do not recall exactly what her study is, but she needs to get protein sequences of truncated hemoglobin, then take a random line of each sequence and calculate the volume. I did all that thanks to the help of the forum, and just needed that last piece to help her out.

The title of her project is "Development of Microarray-Based PCR for Analysis of Nitrogen Cycling Functional Guilds in Soil", supervised by Dr. John Kelly (I just found it on our Blackboard online).

Can you believe we are actually doing this for fun? (Well, also for the resume.)
Last edited by msaenz; Jul 10th, 2006 at 5:39 pm.

G-Do
#10
Jul 10th, 2006
Hi msaenz,

Hmm! Well, I'm not sure how the volume of truncated hemoglobin chains fits into MME-PCR, but I wish you and your friend all the best. Gene expression microarray analysis is my own avenue of study - I am investigating expression differences which occur in human fibroblasts as they age and senesce, and "doing things" with the results.

Glad to have been of assistance.