:cheesy: hi all,
i do not know how many people have worked in biopython before but, i am soo close to this answer i can feel it! just need a lil help again... basically this takes a FASTA file from NCBI and makes it into a dictionary which is wonderful and easy. However my fasta file has id's that are the same name and need a unique id. I wanted to add like a number to the id that is the same..(error comes from bio.mindy) i wrote this:

if key in index:
        index[key] = index[key] + 1

but didnt work any ideas? thanks guys/gals!:mrgreen:

import string
from Bio import Fasta
from Bio.Alphabet import IUPAC
def get_accession_num(fasta_record):
title_atoms = string.split(fasta_record.title)
# all of the accession number information is stuck in the first element
# and separated by '|'s
accession_atoms = string.split(title_atoms[0], '|')
 
# the accession number is the 4th element
gb_name = accession_atoms[3]
# strip the version info before returning
return gb_name[:-2]
index_file(file_to_index,index_file_to_create,function_to_get_index_key)
 
if key in index:
index[key] = index[key] + 1
Fasta.index_file("ls_orchid.fasta", "orchid_2.idx",get_accession_num)
#dna_parser = Fasta.SequenceParser(IUPAC.protein)
orchid_dict = Fasta.Dictionary("orchid_2.idx")#,dna_parser)

How familiar are you with Python?

I noticed in your code that the statement after "if key in index:", or the statements in the function "def get_accession_num(fasta_record):" are not properly indented to form a block. Also where does index come from?

Can you show us a short sample of your fasta file and what the dictionary should look like?

Pretty decent, i copied and pasted and probabaly messed up how my code looks, the index comes from the

index_file(file_to_index,index_file_to_create,function_to_get_index_key)

if you look at the module Bio.Mindy http://biopython.org/DIST/docs/api/ the index come from the index number in the fasta file from ncbi.

I assume you are talking about a DNA or RNA virus with nucleic acid or amino acid sequences. It still would be nice to see a short section of the data file, and what the index dictionary would look like.

If your value is a string then use:

if key in index:
        index[key] = index[key] + '1'

Hi msaenz,

I have used BioPython before and I am familiar with the format of FASTA files, but I'm not sure that I understand the problem. Is it that you want to create an indexed database of FASTA files for searching, but you keep getting key collisions because you're using GenBank accession numbers as your primary keys? So you want to add some number to the GenBank accessions - 1, 2, 3, and so on, yes?

If you are downloading all your files from NCBI, why not just use the NCBI identifiers? You know, the numbers that appear immediately after the ">gi|" on the header lines - I believe that these are unique primary keys for the NCBI Nucleotide database, so if you are only downloading FASTAs from NCBI these should be unique primary keys for your database as well. You could always pull up the GenBank accessions later when displaying information (since everything is getting indexed anyway).

Now, if you aren't downloading files solely from NCBI, or there is some other problem with this solution (maybe there's a front-end you don't have access to which displays the index identifier as the file header, or something similarly ugly) let me know and we'll take it from there.

Hope that helps.

To clarify (a primer on dealing with biological sequence data):

A FASTA file is a text file containing an oligonucleotide or protein sequence and some header information. These files are very popular with computational biologists and are (I believe) the most popular way of formatting biological sequence data for BLAST and phylogenetic (e.g., Phylip) analysis. The following is a sample FASTA file containing the mRNA sequence of a developmental gene, PitX2, from mouse.

>gi|109948276|ref|NM_001042504.1| Mus musculus paired-like homeodomain transcription factor 2 (Pitx2), transcript variant 1, mRNA
GGAGAGAGAGTGCGAGACCGAGAGAGAAAGCCGGAGAGCAGCAGACAGAAACTGCCGGCGCCCGCTAGCT
TTAGCAGCCCCCCGCGTGGACCCTCTCGGAACTTGGCACCCTCAAGATCCCCGCAGTTCCACCCAGACCC
GCTCCACGGCGCTGGCTGTGCAGCCCGAGCCTCGGCCGCCTGGCAGTCACCCTGGGAAGCGGTGGGACGG
GGAGACAGCCGTTCTCTCTCCGGTAGCCGATAACCGGGAATGGAGACCAATTGTCGCAAACTAGTGTCGG
CCTGCGTGCAATTAGAGAAAGATAAGGGCCAGCAAGGAAAGAATGAGGATGTGGGCGCCGAGGACCCGTC
CAAGAAGAAGCGGCAACGCCGGCAGAGGACTCATTTCACTAGCCAGCAGCTGCAGGAGCTGGAAGCCACT
TTCCAGAGAAACCGCTACCCAGACATGTCCACTCGCGAAGAAATCGCCGTGTGGACCAACCTTACGGAAG
CCCGAGTCCGGGTTTGGTTCAAGAATCGCCGGGCCAAATGGAGAAAGCGGGAACGCAACCAGCAGGCCGA
GCTGTGCAAGAATGGCTTTGGGCCGCAGTTCAACGGGCTCATGCAGCCCTACGATGACATGTACCCCGGC
TATTCGTACAACAATTGGGCTGCCAAGGGCCTCACGTCAGCGTCTCTGTCCACCAAGAGCTTCCCCTTCT
TCAACTCCATGAACGTCAATCCCCTGTCCTCTCAGAGTATGTTTTCCCCGCCCAACTCCATCTCATCTAT
GAGTATGTCGTCCAGCATGGTGCCCTCCGCGGTGACCGGCGTCCCGGGCTCCAGCCTCAATAGCCTGAAT
AACTTGAACAACCTGAGCAGCCCGTCGCTGAATTCCGCGGTGCCCACGCCCGCCTGTCCTTACGCGCCGC
CGACTCCTCCGTACGTTTATAGGGACACATGTAACTCGAGCCTGGCCAGCCTGAGACTGAAAGCAAAGCA
GCACTCCAGCTTCGGCTACGCCAGCGTGCAGAACCCGGCCTCCAACCTGAGTGCTTGCCAGTATGCAGTC
GACCGGCCGGTGTGAACCGCGCCCAGGGCGCGGGGATCCGAGGACTGTCGGAGTGGGCAACTCTGCCCCA
GAAAGACTGAGAATTGTGCTAGAAGGTCGTGCGCACTATGGGAAGGAAGAGGGGGGAAAAAAGATCAGAG
GAAAAGAAACCACTGAATTCAAAGAGAGAGCGCCTTTGATTTCAAAGGAATGTCCCCAAGTGTCTACGTC
TTTCGCTAAGAGTATTCCCAACAGTTGGAGGACGCGTACGCCCACAAATGTTTGACTGGATATGACATTT
TAACATTACTATAAGCTTGTTATTTTTTAAGTTTAGCATTGTTAACATTAAAATGACTGAAAGGATGTAT
ATATATCGAAATGTCAAATTAATTTTATAAAAGCAGTTGTTAGTACTATCACGACAGTGTTTTTAAAGGC
TAGGCTTTAAAATAAAGCATGTTATACAGAATCAGTTAGGATTTTTCGCTTGCGAGCAAAGGAATGTATA
TACTAAATGCCACACTGTATGTTTCTAACATATTATTATTATAAAAATGTGTGAATATAAGTTTTAGAGT
AGTTTCTCTGGTGGATGCCTTGTTTCTGAAACTGCTATGTACGACCCATCCTGTGTATAACATTTCGTAC
GATATTATTGTTTTACTTTTCAGCAAATATGAAAAAAAATGTGTTTTATTTCTTGGGAGTAAAATATACT
GCATACAAA

If I understand msaenz correctly, he or she wants to create his or her own local database of these records (or at least, the ones which are important to his or her own research) so that entries can be pulled out in the form of dictionaries, with the sequence as one key-value pair, and the elements of the header as other key-value pairs. As any relational database needs a primary key, msaenz has (I think) chosen the GenBank accession (also a RefSeq ID, which is why you see "ref" immediately before it), which for this file is "NM_001042504.1." My recommendation is to use the NCBI ID instead, which for this file is "109948276." NCBI archives their own data using this field as the primary key, and as far as I know they are unique, so if msaenz is drawing his or her FASTA entries solely from NCBI and uses the NCBI ID as the primary key, he or she should not run into key collision problems when indexing.

The NCBI database can be found here; oligonucleotide sequences can be searched for by clicking "Nucleotide" on the drop-down menu at the left and entering some search parameters.

yes that is exactly what i wanted to do! i wanted to take the numbers after the gi to use as identifiers instead of the the numbers after the ref since some protiens have the same ref numbers since it is doing chaing a and chain b and it will not make a dictionary due to having the same ref numbers that is why i wanted to use the gi identifiers however i didnt know how. sorry ive been out of the loop about this but thanks for you posts...

Hi msaenz,

In that case, here is what you need to do. Rather than getting accession_atoms[3], which pulls out the fourth token in the header (the RefSeq ID), get accession_atoms[1], which pulls out the second token in the header (the NCBI ID). Don't bother stripping it with the [:-2]; get rid of that part and just return the string as it is. That should give your database unique primary keys.

Just out of curiosity, what kind of work are you doing with protein sequences? Are you doing any structural analysis?

For everyone else, msaenz's comment about the protein chains means this. Proteins are biological molecules composed of several discrete units called chains; each chain is a string of small amino acid molecules daisy-chained end-to-end. There are twenty standard amino acids in the human body, and each is identifiable by a one-letter code, so that a protein chain can be thought of as a string of some length which draws from an alphabet of twenty characters. Some proteins are made up of just a single chain. Others are made up of many chains. The chains interact with each other (and also within themselves) by means of various physical and chemical forces (electrostatic attraction, van der Waal's forces, hydrogen bonding, the hydrophobic effect, etc) in such a way that they change shape - they tangle or "fold" around each other, and in so doing they become functional.

Now, I'm not sure what the RefSeq naming convention is when it comes to protein chains (I knew it at some point, then promptly forgot it when I stopped working with protein sequences). However, if what msaenz says is true, it appears that RefSeq doesn't allow us to make distinctions about protein chains using the RefSeq ID alone.

I hope that helps.

Thank for your help actually I am part of a REU from the National Science Foundation grant at my university and my research is consisting of using FASTA files and Plone an open source content management system and trying to upload fasta files and parse each sequence into a text by making an archetype.

I am actually helping another student in biology who is doing, I do not recall what study but she needs to get proteins of truncated hemoglobin and then take a random line of that sequence and calculate the volume and i did all that thanks to the help of the forum and just needed that last piece to help her out...

the title of her project is:Development of Microarray-Based PCR for Analysis of Nitrogen Cycling Functional Guilds in Soil– supervision Dr. John Kelly (just found it on our blackboard online)

Can you believe we are actually doing this for fun (well also for the resume ;) )

Hi msaenz,

Hmm! Well, I'm not sure how the volume of truncated hemoglobin chains fits into MME-PCR, but I wish you and your friend all the best. Gene expression microarray analysis is my own avenue of study - I am investigating expression differences which occur in human fibroblasts as they age and senesce, and "doing things" with the results.

Glad to have been of assistance.

I think Python is really a great language for solving science problems. Thanks for enlightening the rest of us with a little biology and biochemistry.

This question has already been answered. Start a new discussion instead.