Parsing tab separated .txt files with common and distinct attributes

Question

haojam 0 Light Poster

14 Years Ago

Dear Sir,
I have written a script that extracts the first line of each file, the one that starts with Source Name and ends with Comment [ArrayExpress Data Retrieval URI]. That part works, but I could not extract the distinct (unique) attributes, that is, those that are not repeated in every file. I want to parse only the attributes in the first line, not the table values. Could you please correct this script? I would be glad of your support and cooperation. I have attached a zip file with all the sdrf.txt files and the output of the script I ran. The file can be found at this URL:
ftp://ftp.ebi.ac.uk/pub/databases/mi...FMX-1.sdrf.txt

Regards,
Haobijam

#!/usr/bin/python
import glob
#import linecache
outfile = open('output_att.txt' , 'w')
files = glob.glob('*.sdrf.txt')
for file in files:
    infile = open(file)
    #count = 0
    for line in infile:
        
        lineArray = line.rstrip()
        if not line.startswith('Source Name') : continue
        #count = count + 1
        lineArray = line.split('%s\t')
        print lineArray[0]
        output = "%s\t\n"%(lineArray[0])
        outfile.write(output)
    infile.close()
outfile.close()

open-source python

output_att.zip (3.01 KB)

sdrf.txt_.zip (95.67 KB)

All 22 Replies

TrustyTony 888 ex-Moderator

14 Years Ago

Is the data you want in the result file? Do you want unique lines, or something else? An example of the desired output would help me to help you.

TrustyTony 888 ex-Moderator

14 Years Ago

The word Factor is in both lines, so it is not unique. Define what you want. A computer can only give you things you know you want.

Here is code for finding unique words in file:

import string

inputstring=open('output_att.txt').read()

uniqwords=set(word.strip(string.punctuation+string.digits)
              for word in inputstring.lower().split())

print('The %i unique words are: %s' % (len(uniqwords),sorted(uniqwords)))

TrustyTony 888 ex-Moderator

14 Years Ago

Maybe like this?

#!/usr/bin/python
import glob
import string

outfile = open('output.txt' , 'w')
files = glob.glob('*.sdrf.txt')
previous = set()
for file in files:
    print('\n'+file)
    infile = open(file)
##    previous = set() # uncomment this if do not need to be unique between the files
    for line in infile:
        lineArray = line.rstrip()
        if not line.startswith('Source Name') : continue
        lineArray = line.split('%s\t')
        output = "%s\t\n"%(lineArray[0])
        outfile.write(output)
        uniqwords = set(word.strip() for word in lineArray[0].split('\t')
                        if word.strip() and word.strip() not in previous) 
        print('The %i unique terms are:\n\t%s' % (len(uniqwords),'\n\t'.join(sorted(uniqwords))))
        previous |=  uniqwords 
    infile.close()
    

outfile.close()
print('='*80)
print('The %i terms are:\n\t%s' % (len(previous),'\n\t'.join(sorted(previous))))

Gribouillis 1,391 Programming Explorer

14 Years Ago

You could normalize the unique terms using a regular expression:

import re
item_pattern = re.compile(r"[a-zA-Z][a-z]*|[^\s]")

def tuple_key(composite_term):
    return tuple(w.lower() for w in item_pattern.findall(composite_term))

def normalize(term):
    return ' '.join(tuple_key(term.strip()))
        
print(normalize("Characteristics [StrainOrLine]"))  # parentheses work in Python 2 and 3
"""my output -->
characteristics [ strain or line ]
"""

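The normalization above can also be used to treat differently cased spellings of a header as the same term. The following is a minimal sketch, not part of the original reply: dedupe_terms is a hypothetical helper, the sample headers are taken from the spellings mentioned elsewhere in this thread, and the code assumes Python 3.

```python
import re

# Same pattern as in the reply above: a letter followed by lowercase letters,
# or any single non-space character.
item_pattern = re.compile(r"[a-zA-Z][a-z]*|[^\s]")

def normalize(term):
    """Lower-case a header term and split CamelCase words and brackets into tokens."""
    return ' '.join(w.lower() for w in item_pattern.findall(term.strip()))

def dedupe_terms(terms):
    """Hypothetical helper: map each normalized term to the first spelling seen."""
    seen = {}
    for term in terms:
        seen.setdefault(normalize(term), term.strip())
    return seen

headers = ["Characteristics[OrganismPart]", "Characteristics[organism part]",
           "Characteristics[Sex]", "Characteristics[sex]"]
for key, first_spelling in dedupe_terms(headers).items():
    print(key, '<-', first_spelling)
```

Note that, with this pattern, a run of capitals such as URI is split into single letters; that is harmless for comparison as long as every term goes through the same function.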
woooee 814 Nearly a Posting Maven

14 Years Ago

Find where the error is. Print
line2[j], j, len(line2)
and with a separate print statement
a[j], j, len(a).
You can then fix the specific problem. Also, include some comments in the code explaining what is happening, so we know what it is supposed to be doing. No one can be expected to write all of the code for you.

haojam 0 Light Poster · Answer 1 · 2010-10-14T13:28:18+00:00

Dear ,
Yes, from the result file I would like to extract only the unique terms, such as Factor Value [bkv] and Factor Value [incubate], and possibly other terms that occur only once.

Regards,
Haobijam

haojam 0 Light Poster · Answer 2 · 2010-10-14T17:19:10+00:00

Dear,

I would like to read and parse only the unique attributes, those that are not repeated, from the first line (the headers, not the table values) of all the sdrf.txt files. The first line starts with Source Name and ends with Comment [ArrayExpress Data Retrieval URI] or sometimes with Comment [ArrayExpress FTP file].

Regards,
Haobijam

haojam 0 Light Poster · Answer 3 · 2010-10-14T17:37:36+00:00

#!/usr/bin/python
import glob
import string

outfile = open('output.txt' , 'w')
inputstring = open('output.txt').read()
files = glob.glob('*.sdrf.txt')
for file in files:
    infile = open(file)
    for line in infile:
        
        lineArray = line.rstrip()
        if not line.startswith('Source Name') : continue
        lineArray = line.split('%s\t')
        print lineArray[0]
        output = "%s\t\n"%(lineArray[0])
        outfile.write(output)
        uniqwords = set(word.strip(string.punctuation+string.digits)
        for word in inputstring.lower().split())
        print('The %i unique words are: %s' % (len(uniqwords),sorted(uniqwords)))
        inputstring.read(output)
    inputstring.close()
    infile.close()
outfile.close()

When I run it, I get this error:
Traceback (most recent call last):
File "C:\Users\haojam\Desktop\GEO\arrayexpress\Experiment\ab.py", line 21, in <module>
inputstring.read(output)
AttributeError: 'str' object has no attribute 'read'
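The traceback follows from the script itself: inputstring is the string returned by open('output.txt').read(), and a string has no read or close method. Because output.txt was opened for writing one line earlier, that string is also empty. A minimal sketch of one way around both problems, assuming Python 3, is to build the word set directly from the header line that was just read; header_words is a hypothetical name, and the sample line is made up for illustration.

```python
import string

def header_words(header_line):
    """Return the lower-cased words of one header line, stripped of punctuation and digits."""
    stripped = (word.strip(string.punctuation + string.digits)
                for word in header_line.lower().split())
    return set(word for word in stripped if word)

# Illustrative header line, not taken from a real sdrf.txt file.
line = "Source Name\tCharacteristics [Sex]\tFactor Value [bkv]\n"
print(sorted(header_words(line)))
```

This splits on whitespace, as the earlier unique-words snippet does, so multi-word attributes such as Source Name are broken into separate words; splitting on tabs instead keeps them whole.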

haojam 0 Light Poster · Answer 4 · 2010-10-14T21:43:36+00:00

When I run this code there is an error on the following lines. Could you please assist me?

print('The %i unique terms are:\n\t%s' % (len(uniqwords),'\n\t'.join(sorted(uniqwords))))
previous |=  uniqwords 
    infile.close()
    

outfile.close()
print('='*80)
print('The %i terms are:\n\t%s' % (len(previous),'\n\t'.join(sorted(previous))))

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 5 · 2010-10-14T21:45:27+00:00

Error message, Python version?

haojam 0 Light Poster · Answer 6 · 2010-10-19T11:12:44+00:00

Dear Sir,

I would like to extract only the terms that are unique across all the sdrf.txt files, but this Python code outputs the unique terms for each file individually. Terms such as Array Data File and Array Design REF are repeated in most sdrf.txt files, so I do not want them printed as unique terms. Could you also tell me how to ignore case in Python? Characteristics[OrganismPart] is printed as a different term from Characteristics[organism part], and likewise Characteristics[Sex] and Characteristics[sex]. I look forward to your reply.

#!/usr/bin/python
import glob
import string

outfile = open('output.txt' , 'w')
files = glob.glob('*.sdrf.txt')
previous = set()
for file in files:
    print('\n'+file)
    infile = open(file)
    #previous = set() # uncomment this if do not need to be unique between the files
    for line in infile:
        lineArray = line.rstrip()
        if not line.startswith('Source Name') : continue
        lineArray = line.split('%s\t')
        output = "%s\t\n"%(lineArray[0])
        outfile.write(output)
        uniqwords = set(word.strip() for word in lineArray[0].split('\t')
                        if word.strip() and word.strip() not in previous) 
        print('The %i unique terms are:\n\t%s' % (len(uniqwords),'\n\t'.join(sorted(uniqwords))))
        previous |=  uniqwords 
    infile.close()
outfile.close()
print('='*80)
print('The %i terms are:\n\t%s' % (len(previous),'\n\t'.join(sorted(previous))))

With regards,
Haobijam

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 7 · 2010-10-19T12:07:30+00:00

What have you tried? Which documentation have you consulted to solve your problem? Any error messages?

haojam 0 Light Poster · Answer 8 · 2010-10-19T13:07:56+00:00

Dear Sir,

I would like to print the unique terms for all the sdrf.txt files as a whole, not for each individual sdrf.txt file. I look forward to your reply.

Regards,
Haobijam

haojam 0 Light Poster · Answer 9 · 2010-10-21T20:23:43+00:00

Dear Sir,

I have a question about this parsing code. I would like to parse all the sdrf.txt files, but their header lines come in two forms: one starts with Source Name and the other starts with Labeled Extract Name, so I have to run the Python code separately for each. Could you please combine this into one script that uses an if/else condition to handle both cases? I have attached the sdrf.txt files that start with Labeled Extract Name. I would be glad of your support and cooperation.

#!/usr/bin/python
import glob
import string
outfile = open('output.txt' , 'w')
files = glob.glob('*.sdrf.txt')
previous = set()
uniqwords_new = set()
for file in files:
    #print('\n'+file)
    infile = open(file)
    previous = set() # uncomment this if do not need to be unique between the files
    for line in infile:
        lineArray = line.rstrip()
        if not line.startswith('Source Name') : continue
        lineArray = line.split('%s\t')
        output = "%s\t\n"%(lineArray[0])
        uniqwords = set(word.strip() for word in lineArray[0].split('\t')
                        if word.strip() and word.strip() not in previous)
        #print('The %i unique terms are:\n\t%s' % (len(uniqwords),'\n\t'.join(sorted(uniqwords))))
        previous |=  uniqwords
    uniqwords_new = uniqwords_new ^ uniqwords
    # making strings to replace the undesired words
    #---------------------------------------------
    str_old1 = str('sex')
    str_new1 = str('Sex')
    str_old2 = str('organism')
    str_new2 = str('Organism')
    str_old3 = str('rganism part')
    str_new3 = str('OrganismPart')
    str_old4 = str('time')
    str_new4 = str('Time')
    str_old5 = str('Quantity of labled extract used')
    str_new5 = str('Quantity of label target used')
    str_old6 = str('quantitiy')
    str_new6 = str('Quantity')
    #--------------add below the other words you wish to replace
    # replacing the words
    output = output.replace(str_old1,str_new1)
    output = output.replace(str_old2,str_new2)
    output = output.replace(str_old3,str_new3)
    output = output.replace(str_old4,str_new4)
    output = output.replace(str_old5,str_new5)
    output = output.replace(str_old6,str_new6)
    #------------ replace other words you wish to
    print (output)
    outfile.write(output)
    infile.close()
print('The %i unique terms are:\n\t%s' % (len(uniqwords_new),'\n\t'.join(sorted(uniqwords_new))))                  
outfile.close()
print('='*80)
print('The %i terms are:\n\t%s' % (len(previous),'\n\t'.join(sorted(previous))))

Regards,
Haobijam

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 10 · 2010-10-21T22:28:27+00:00

Your code in lines 22..44 was really clumsy. Use tuples and a for loop.

#!/usr/bin/python
import glob
import string

with open('output.txt' , 'w') as outfile:
    files = glob.glob('*.sdrf.txt')

    uniqwords_new = set()

    for file in files:
        with open(file) as infile:
            previous = set() # uncomment this if do not need to be unique between the files
            for line in infile:
                if not line.startswith('Source Name') : continue ## change this line to deal with other form
                output = line
                uniqwords = set(word.strip() for word in line.rstrip().split('\t')
                                if word.strip() and word.strip() not in previous)
                previous |=  uniqwords
                
            uniqwords_new = uniqwords_new ^ uniqwords
            # making tuples to replace the undesired words
            #---------------------------------------------
            replacement_tuples = (('sex','Sex'),
                                  ('organism','Organism'),
                                  ('rganism part','OrganismPart'),
                                  ('time','Time'),
                                  ('Quantity of labled extract used','Quantity of label target used'),
                                  ('quantitiy', 'Quantity') )
            #--------------add below the other words you wish to replace
            # replacing the words
            for old, new in replacement_tuples:
                output = output.replace(old, new)
            #------------ replace other words you wish to
            print (output)
            outfile.write(output)

print('The %i unique terms are:\n\t%s' % (len(uniqwords_new),'\n\t'.join(sorted(uniqwords_new))))                  
print('='*80)
print('The %i terms are:\n\t%s' % (len(previous),'\n\t'.join(sorted(previous))))

How are you processing the variable output, which is defined inside the inner for loop, outside that loop? It will only hold the value from the last matching line.

The output has double newlines, and line 15 does not do anything, so in the end lineArray[0] is the same as line, which is also what output holds, apart from a probably unnecessary \t\n.

previous is initialized to an empty set twice; take out line 6.

startswith is at line 14, as you can see; change it to also handle the other starting phrase.
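One way to handle both starting phrases without an if/else is to pass a tuple to str.startswith, which returns True if the line starts with any of the given prefixes. A minimal sketch, assuming Python 3; HEADER_PREFIXES and find_header are hypothetical names, and the sample lines are made up.

```python
# The two header forms mentioned in this thread.
HEADER_PREFIXES = ('Source Name', 'Labeled Extract Name')

def find_header(lines):
    """Return the first line that starts with one of the known prefixes, or None."""
    for line in lines:
        if line.startswith(HEADER_PREFIXES):  # a tuple of prefixes is accepted
            return line.rstrip('\n')
    return None

# Illustrative lines, not taken from a real sdrf.txt file.
print(find_header(["# comment\n", "Labeled Extract Name\tLabel\n", "e1\tCy5\n"]))
```

Further header forms can be supported by adding them to the tuple.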

haojam 0 Light Poster · Answer 11 · 2010-10-24T14:30:04+00:00

Dear Sir,

When I run the Python script on the SMDB sdrf.txt files attached earlier, whose header starts with Labeled Extract Name, an error occurs. Here is the message:

Traceback (most recent call last):
  File "C:/Users/haojam/Desktop/GEO/arrayexpress/Experiment/sdrf_10.py", line 20, in <module>
    uniqwords_new = uniqwords_new ^ uniqwords
NameError: name 'uniqwords' is not defined

Regards,
Haobijam

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 12 · 2010-10-24T18:36:46+00:00

uniqwords must be initialized.
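The NameError appears when no line in the first file starts with Source Name, so the assignment to uniqwords inside the loop never runs. A minimal sketch of the fix, assuming Python 3: the name is set to an empty set at the start of every file. collect_new_terms is a hypothetical function, and the in-memory demo dictionary stands in for real files.

```python
def collect_new_terms(files, prefix='Source Name'):
    """files maps a file name to its lines; returns {name: terms not seen in earlier files}."""
    previous = set()
    result = {}
    for name, lines in files.items():
        uniqwords = set()  # initialized per file, so it exists even without a header line
        for line in lines:
            if not line.startswith(prefix):
                continue
            uniqwords = set(w.strip() for w in line.rstrip().split('\t')
                            if w.strip() and w.strip() not in previous)
            previous |= uniqwords
        result[name] = uniqwords
    return result

demo = {
    'a.sdrf.txt': ["Source Name\tLabel\n", "s1\tCy3\n"],
    'b.sdrf.txt': ["Labeled Extract Name\tLabel\n", "e1\tCy5\n"],  # no 'Source Name' header
}
print(collect_new_terms(demo))
```

Resetting per file also avoids silently reusing the previous file's terms when a file has no matching header.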

haojam 0 Light Poster · Answer 13 · 2010-10-27T12:08:19+00:00

Sir,

When I run this Python script, the unique terms output and the terms output are the same. But when I uncomment previous = set(), the outputs differ. My question is: are the "%i terms" the terms common to all files?

Regards,
Haobijam

#!/usr/bin/python
import glob
import string

with open('output.txt' , 'w') as outfile:
    files = glob.glob('*.sdrf.txt')

    uniqwords_new = set()
    previous = set()
    for file in files:
        with open(file) as infile:
            #previous = set() # uncomment this if do not need to be unique between the files
            for line in infile:
                if not line.startswith('Source Name') : continue ## change this line to deal with other form
                output = line
                uniqwords = set(word.strip() for word in line.rstrip().split('\t')
                                if word.strip() and word.strip() not in previous)
                previous |=  uniqwords
                
            uniqwords_new = uniqwords_new ^ uniqwords
            # making tuples to replace the undesired words
            #---------------------------------------------
            replacement_tuples = (('sex','Sex'),
                                  ('organism','Organism'),
                                  ('organism part','OrganismPart'),
                                  ('time','Time'),
                                  ('Quantity of labled extract used','Quantity of label target used'),
                                  ('quantitiy', 'Quantity') )
            #--------------add below the other words you wish to replace
            # replacing the words
            for old, new in replacement_tuples:
                output = output.replace(old, new)
            #------------ replace other words you wish to
            #print (output)
            outfile.write(output)

print('The %i unique terms are:\n\t%s' % (len(uniqwords_new),'\n\t'.join(sorted(uniqwords_new))))                  
print('='*80)
print('The %i terms are:\n\t%s' % (len(previous),'\n\t'.join(sorted(previous))))
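In the script above, previous accumulates every term seen so far, so the final "%i terms" list is the union of all header terms, not the terms common to all files. And as long as every file has a header line, each file's uniqwords contains only terms not yet in previous, so the sets are disjoint and the ^ operator simply merges them, which would explain why both outputs are the same. A minimal sketch that separates the three notions, assuming Python 3; summarize_terms is a hypothetical function and the demo headers are illustrative.

```python
from collections import Counter

def summarize_terms(headers_by_file):
    """headers_by_file maps a file name to its list of header terms.

    Returns (all terms, terms present in every file, terms present in exactly one file).
    """
    term_sets = [set(t.strip() for t in terms if t.strip())
                 for terms in headers_by_file.values()]
    all_terms = set().union(*term_sets)
    common = set.intersection(*term_sets) if term_sets else set()
    counts = Counter(term for terms in term_sets for term in terms)
    only_once = set(term for term, n in counts.items() if n == 1)
    return all_terms, common, only_once

demo = {
    'a.sdrf.txt': ['Source Name', 'Factor Value [bkv]', 'Array Data File'],
    'b.sdrf.txt': ['Source Name', 'Factor Value [incubate]', 'Array Data File'],
}
all_terms, common, only_once = summarize_terms(demo)
print(sorted(common))
print(sorted(only_once))
```

Because each file contributes a set, a term repeated within one header line is still counted once for that file.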

haojam 0 Light Poster · Answer 14 · 2010-10-28T10:48:13+00:00

Dear Sir,

I have a question about parsing attributes and extracting unique terms from the adf.txt files from ArrayExpress [ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/array/]. The Python code here works when run on individual files whose header starts with the same term, but it is not practical for processing about 2270 adf.txt files at once. Could you please correct line 12 of this code or suggest some tips? I would like to parse the first line of every adf.txt file (2270 in number) and then extract the unique terms and the common terms. For your convenience I have attached a zip file of adf.txt files; more are available from the FTP site mentioned above. I would be glad of your support and cooperation.

With warm regards,
Haobijam

#!/usr/bin/python
import glob
import string
with open('output_Reporter Name.txt' , 'w') as outfile:
    files = glob.glob('*.adf.txt')
    uniqwords = set()
    previous = set()
    for file in files:
        with open(file) as infile:
            #previous = set() # uncomment this if do not need to be unique between the files
            for line in infile:
                if not line.startswith('Reporter Name') : continue ## change this line to deal with other form
                output = line
                uniqwords = set(word.strip() for word in line.rstrip().split('\t')
                                if word.strip() and word.strip() not in previous)
                previous |=  uniqwords
                print (output)
                outfile.write(output)
print('The %i unique terms are:\n\t%s' % (len(uniqwords),'\n\t'.join(sorted(uniqwords))))                  
print('='*80)
print('The %i terms are:\n\t%s' % (len(previous),'\n\t'.join(sorted(previous))))
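If the attribute names really are on the first line of every file, as the post describes, there is no need to test each line for a starting phrase: reading one line per file is enough and stays cheap for a couple of thousand files. A minimal sketch under that assumption, for Python 3; first_line_terms and terms_across_files are hypothetical names.

```python
import glob

def first_line_terms(path):
    """Read only the first line of a tab-separated file and return its non-empty fields."""
    with open(path) as infile:
        first = infile.readline()
    return [field.strip() for field in first.rstrip('\n').split('\t') if field.strip()]

def terms_across_files(pattern):
    """Collect the first-line terms of every file matching the glob pattern."""
    all_terms = set()
    for path in sorted(glob.glob(pattern)):
        all_terms.update(first_line_terms(path))
    return all_terms

# Prints 0 when no matching files are in the current directory.
print(len(terms_across_files('*.adf.txt')))
```

If some files carry other lines before the header, this assumption does not hold and the startswith test is still needed.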

haojam 0 Light Poster · Answer 15 · 2010-11-01T10:09:30+00:00

Sir,
I have written a Python script to parse the first line of all the sdrf.txt files [ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/] and extract the unique terms from all the files. But when I run this code there is an error. Could you please correct it? I would be glad of your support and cooperation.

With regards,
Haobijam

#!/usr/bin/python
import glob
import re
import linecache
linelist=[]
files = glob.glob('*.sdrf.txt')
for file in files:
    f1 = open(file)
    f2 = open('SDRFparse.txt','a+')

    filename = file.split('.')
    filename_1 = filename[0]
    #print filename_1

    line1 = linecache.getline(file, 1)
    line11 = line1.replace('\n','')
    line2 = line11.split('\t')
    i = len(line2)
    #last = line2[i-1]
    lines = f1.xreadlines()
    linecount = len(f1.readlines())

    for num in range(2,linecount+1):
        line3 = linecache.getline(file, num)
        a = line3.split('\t')
        for j in range(0,i-1):
            f2.write(line2[j] + '\t' + a[j] + '\n')
f1.close()
f2.close()

#output error
Traceback (most recent call last):
File "C:/Users/haojam/Desktop/GEO/arrayexpress/Experiment/sdrfparse.py", line 27, in <module>
f2.write(line2[j] + '\t' + a[j] + '\n')
IndexError: list index out of range
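An IndexError on this line means one of the two lists is shorter than the index j. A likely cause, though the thread does not confirm it, is a data row (for example a blank line) with fewer tab-separated fields than the header. A minimal sketch that pairs header names with values using zip, which stops at the shorter list, assuming Python 3; pair_rows is a hypothetical function and the demo lines are made up.

```python
def pair_rows(lines):
    """Yield (attribute, value) pairs for every data row, given the lines of one file."""
    header = lines[0].rstrip('\n').split('\t')
    for row in lines[1:]:
        if not row.strip():
            continue  # skip blank lines, which would split into a single empty field
        values = row.rstrip('\n').split('\t')
        # zip stops at the shorter sequence, so a short row cannot raise IndexError
        for name, value in zip(header, values):
            yield name, value

# Illustrative lines: the data row is one field short and a blank line follows.
demo = ["Source Name\tLabel\tArray Data File\n", "s1\tCy3\n", "\n"]
for name, value in pair_rows(demo):
    print(name + '\t' + value)
```

Unlike the script above, which loops over range(0, i-1) and so leaves out the last header column, this sketch pairs every column that has a value.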

haojam 0 Light Poster · Answer 16 · 2010-11-24T16:54:22+00:00

Dear,

I have written code to parse attributes and values from an XML file (the attached MINiML), but I get an error while running it at line number 86, GenBank@. When I remove the @ sign, the code runs without any error. Could you please suggest a fix or correct the code? It is not feasible to remove every @ sign from all the XML files. I would be glad of your support and cooperation. I hope we can solve this using the ElementTree method. I am also attaching the output here.

Regards,
Haobijam

#!/usr/bin/python
import xml.dom.minidom

# Load the Contributor collection
MINiML = xml.dom.minidom.parse ( 'MINiML.xml' )


def getTextFromElem(parent):
    '''Return a list of text found in the child nodes of a
    parent node, discarding whitespace.'''
    textList = []
    for n in parent.childNodes:
        # TEXT_NODE - 3
        if n.nodeType == 3 and n.nodeValue.strip():
            textList.append(str(n.nodeValue.strip()))
    return textList

def getElemChildren(parent):
    # Return a list of element nodes below parent
    elements = []
    for obj in parent.childNodes:
        if obj.nodeType == obj.ELEMENT_NODE:
            elements.append(obj)
    return elements

def nodeTree(element, pad=0):
    # Return list of strings representing the node tree below element
    results = ["%s%s" % (pad*" ", str(element.nodeName))]
    nextElems = getElemChildren(element)
    if nextElems:
        for node in nextElems:
            results.extend(nodeTree(node, pad+2))
    else:
        results.append("%s%s" % ((pad+2)*" ", ", ".join(getTextFromElem(element))))
    return results

contributors = MINiML.documentElement.getElementsByTagName( 'Contributor' )
for contributor in contributors:
    print "\n".join(nodeTree(contributor))
contributors = MINiML.documentElement.getElementsByTagName( 'Database' )
for contributor in contributors:
    print "\n".join(nodeTree(contributor))
contributors = MINiML.documentElement.getElementsByTagName( 'Platform' )
for contributor in contributors:
    print "\n".join(nodeTree(contributor))
contributors = MINiML.documentElement.getElementsByTagName( 'Sample' )
for contributor in contributors:
    print "\n".join(nodeTree(contributor))
contributors = MINiML.documentElement.getElementsByTagName( 'Series' )
for contributor in contributors:
    print "\n".join(nodeTree(contributor))
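The post asks about ElementTree, so here is a minimal sketch of the same tree printout with xml.etree.ElementTree, assuming Python 3. The thread does not show the actual error message; one possible cause under Python 2 is the str() call on text that contains a non-ASCII character, which is only a guess. In Python 3 all text is Unicode, so no such conversion is needed. SAMPLE is a shortened copy of the attached file's first Contributor element, and local_name and node_tree are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt of the attached MINiML file, kept inline so the sketch is self-contained.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<MINiML xmlns="http://www.ncbi.nlm.nih.gov/projects/geo/info/MINiML">
  <Contributor iid="contrib1">
    <Person><First>Yael</First><Last>Strulovici-Barel</Last></Person>
    <Email>yas2003@med.cornell.edu</Email>
  </Contributor>
</MINiML>"""

def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit('}', 1)[-1]

def node_tree(element, pad=0):
    """Return indented lines for an element, its non-blank text, and its children."""
    lines = [' ' * pad + local_name(element.tag)]
    text = (element.text or '').strip()
    if text:
        lines.append(' ' * (pad + 2) + text)
    for child in element:
        lines.extend(node_tree(child, pad + 2))
    return lines

root = ET.fromstring(SAMPLE.encode('utf-8'))
for elem in root:
    if local_name(elem.tag) == 'Contributor':
        print('\n'.join(node_tree(elem)))
```

For a file on disk, ET.parse('MINiML.xml').getroot() would take the place of ET.fromstring, and the same loop can be repeated for Database, Platform, Sample, and Series.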

MINiML.xml (270 KB)

The attachment preview is chopped off after the first 10 KB. Please download the entire file.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

<MINiML
   xmlns="http://www.ncbi.nlm.nih.gov/projects/geo/info/MINiML"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation="http://www.ncbi.nlm.nih.gov/projects/geo/info/MINiML http://www.ncbi.nlm.nih.gov/geo/info/MINiML.xsd"
   version="0.5.0" >

  <Contributor iid="contrib1">
    <Person><First>Yael</First><Last>Strulovici-Barel</Last></Person>
    <Email>yas2003@med.cornell.edu</Email>
    <Phone>646-962-5560</Phone>
    <Laboratory>Crystal</Laboratory>
    <Department>Department of Genetic Medicine</Department>
    <Organization>Weill Cornell Medical College</Organization>
    <Address>
      <Line>1300 York Avenue</Line>
      <City>New York </City>
      <State>NY</State>
      <Zip-Code>10021</Zip-Code>
      <Country>USA</Country>
    </Address>
  </Contributor>

  <Contributor iid="contrib2">
    <Organization></Organization>
    <Email>geo@ncbi.nlm.nih.gov, support@affymetrix.com</Email>
    <Phone>888-362-2447</Phone>
    <Organization>Affymetrix, Inc.</Organization>
    <Address>
      <City>Santa Clara</City>
      <State>CA</State>
      <Zip-Code>95051</Zip-Code>
      <Country>USA</Country>
    </Address>
    <Web-Link>http://www.affymetrix.com/index.affx</Web-Link>
  </Contributor>

  <Contributor iid="contrib3">
    <Person><First>Brendan</First><Last>Carolan</Last></Person>
  </Contributor>

  <Contributor iid="contrib4">
    <Person><First>Ben-Gary</First><Last>Harvey</Last></Person>
  </Contributor>

  <Contributor iid="contrib5">
    <Person><First>Bishnu</First><Middle>P</Middle><Last>De</Last></Person>
  </Contributor>

  <Contributor iid="contrib6">
    <Person><First>Holly</First><Last>Vanni</Last></Person>
  </Contributor>

  <Contributor iid="contrib7">
    <Person><First>Ronald</First><Middle>G</Middle><Last>Crystal</Last></Person>
  </Contributor>

  <Database iid="GEO">
    <Name>Gene Expression Omnibus (GEO)</Name>
    <Public-ID>GEO</Public-ID>
    <Organization>NCBI NLM NIH</Organization>
    <Web-Link>http://www.ncbi.nlm.nih.gov/geo</Web-Link>
    <Email>geo@ncbi.nlm.nih.gov</Email>
  </Database>

  <Platform iid="GPL570">
    <Status database="GEO">
      <Submission-Date>2003-11-07</Submission-Date>
      <Release-Date>2003-11-07</Release-Date>
      <Last-Update-Date>2010-10-21</Last-Update-Date>
    </Status>
    <Title>[HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array</Title>
    <Accession database="GEO">GPL570</Accession>
    <Technology>in situ oligonucleotide</Technology>
    <Distribution>commercial</Distribution>
    <Organism taxid="9606">Homo sapiens</Organism>
    <Manufacturer>Affymetrix</Manufacturer>
    <Manufacture-Protocol>
see manufacturer's web site
    </Manufacture-Protocol>
    <Description>
Affymetrix submissions are typically submitted to GEO using the GEOarchive method described at http://www.ncbi.nlm.nih.gov/projects/geo/info/geo_affy.html

Complete coverage of the Human Genome U133 Set plus 6,500 additional genes for analysis of over 47,000 transcripts
All probe sets represented on the GeneChip Human Genome U133 Set are identically replicated on the GeneChip Human Genome U133 Plus 2.0 Array. The sequences from which these probe sets were derived were selected from GenBank, dbEST, and RefSeq. The sequence clusters were created from the UniGene database (Build 133, April 20, 2001) and then refined by analysis and comparison with a number of other publicly available databases, including the Washington University EST trace repository and the University of California, Santa Cruz Golden-Path human genome database (April 2001 release). 
In addition, there are 9,921 new probe sets representing approximately 6,500 new genes. These gene sequences were selected from GenBank, dbEST, and RefSeq. Sequence clusters were created from the UniGene database (Build 159, January 25, 2003) and refined by analysis and comparison with a number of other publicly available databases, including the Washington University EST trace repository and the NCBI human genome assembly (Build 31).
    </Description>
    <Web-Link>http://www.affymetrix.com/support/technical/byproduct.affx?product=hg-u133-plus</Web-Link>
    <Web-Link>http://www.affymetrix.com/analysis/index.affx</Web-Link>
    <Relation type="Alternative to" target="GPL4454" comment="Alternative CDF" />
    <Relation type="Alternative to" target="GPL4866" comment="Alternative CDF" />
    <Relation type="Alternative to" target="GPL5760" comment="Alternative CDF" />
    <Relation type="Alternative to" target="GPL6671" comment="Alternative CDF" />
    <Relation type="Alternative to" target="GPL6732" />
    <Relation type="Alternative to" target="GPL6791" comment="Alternative CDF" />
    <Relation type="Alternative to" target="GPL6879" comment="Alternative CDF" />
    <Relation type="Alternative to" target="GPL7567" comment="Alternative CDF" />
    <Relation type="Alternative to" target="GPL8019" comment="Alternative CDF" />
    <Relation type="Alternative to" target="GPL8542" comment="Alternative CDF" />
    <Relation type="Alternative to" target="GPL8715" comment="Alternative CDF" />
    <Relation type="Alternative to" target="GPL8712" comment="Alternative CDF" />
    <Relation type="Alternative to" target="GPL9102" comment="Probe Level Version" />
    <Relation type="Alternative to" target="GPL9099" comment="Alternative CDF" />
    <Relation type="Alternative to" target="GPL9101" comment="Alternative CDF" />
    <Relation type="Alternative to" target="GPL9324" comment="Alternative CDF" />
    <Relation type="Alternative to" target="GPL9486" comment="Alternative CDF" />
    <Relation type="Alternative to" target="GPL9987" comment="Alternative CDF" />
    <Relation type="Alternative to" target="GPL10175" comment="Alternative CDF" />
    <Relation type="Alternative to" target="GPL10335" comment="Alternative CDF" />
    <Relation type="Alternative to" target="GPL10371" comment="Alternative CDF" />
    <Relation type="Alternative to" target="GPL10526" comment="Alternative CDF" />
    <Relation type="Alternative to" target="GPL10881" comment="Alternative CDF" />
    <Relation type="Alternative to" target="GPL10925" comment="Alternative CDF" />
    <Relation type="Alternative to" target="GPL11084" comment="Alternative CDF" />
    <Data-Table>
      <Column position="1">
        <Name>ID</Name>
        <Description>Affymetrix Probe Set ID </Description>
        <Link-Prefix>https://www.affymetrix.com/LinkServlet?array=U133PLUS&amp;probeset=</Link-Prefix>
      </Column>
      <Column position="2">
        <Name>GB_ACC</Name>
        <Description>GenBank Accession Number </Description>
        <Link-Prefix>http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Search&amp;db=Nucleotide&amp;term=</Link-Prefix>
      </Column>
      <Column position="3">
        <Name>SPOT_ID</Name>
        <Description>identifies controls</Description>
      </Column>
      <Column position="4">
        <Name>Species Scientific Name</Name>
        <Description>The genus and species of the organism represented by the probe set.</Description>
      </Column>
      <Column position="5">
        <Name>Annotation Date</Name>
        <Description>The date that the annotations for this probe array were last updated. It will generally be earlier than the date when the annotations were posted on the Affymetrix web site.</Description>
      </Column>
      <Column position="6">
        <Name>Sequence Type</Name>
      </Column>
      <Column position="7">
        <Name>Sequence Source</Name>
        <Description>The database from which the sequence used to design this probe set was taken.</Description>
      </Column>
      <Column position="8">
        <Name>Target Description</Name>
      </Column>
      <Column position="9">
        <Name>Representative Public ID</Name>
        <Description>The accession number of a representative sequence. Note that for consensus-based probe sets, the representative sequence is only one of several sequences (sequence sub-clusters) used to build the consensus sequence and it is not directly used to derive the probe sequences. The representative sequence is chosen during array design as a sequence that is best associated with the transcribed region being interrogated by the probe set. Refer to the &quot;Sequence Source&quot; field to determine the database used.</Description>
      </Column>
      <Column position="10">
        <Name>Gene Title</Name>
        <Description>Title of Gene represented by the probe set.</Description>
      </Column>
      <Column position="11">
        <Name>Gene Symbol</Name>
        <Description>A gene symbol, when one is available (from UniGene).</Description>
      </Column>
      <Column position="12">
        <Name>ENTREZ_GENE_ID</Name>
        <Description>Entrez Gene Database UID  </Description>
        <Link-Prefix>http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&amp;cmd=Retrieve&amp;dopt=Graphics&amp;list_uids=</Link-Prefix>
        <Link-Delimiter> /// </Link-Delimiter>
      </Column>
      <Column position="13">
        <Name>RefSeq Transcript ID</Name>
        <Description>References to multiple sequences in RefSeq. The field contains the ID and Description for each entry, and there can be multiple entries per ProbeSet.</Description>
      </Column>
      <Column position="14">
        <Name>Gene Ontology Biological Process</Name>
        <Description>Gene Ontology Consortium Biological Process derived from LocusLink.  Each annotation consists of three parts: &quot;Accession Number // Description // Evidence&quot;. The description corresponds directly to the GO ID. The evidence can be &quot;direct&quot;, or &quot;extended&quot;.</Description>
      </Column>
      <Column position="15">
        <Name>Gene Ontology Cellular Component</Name>
        <Description>Gene Ontology Consortium Cellular Component derived from LocusLink.  Each annotation consists of three parts: &quot;Accession Number // Description // Evidence&quot;. The description corresponds directly to the GO ID. The evidence can be &quot;direct&quot;, or &quot;extended&quot;.</Description>
      </Column>
      <Column position="16">
        <Name>Gene Ontol
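
Since the original question was about pulling out only the attribute names, here is a minimal sketch of how the column names could be read from a `Data-Table` block like the one previewed above. It is not the poster's script: the `SAMPLE` string is a cut-down copy of three columns from the preview, the helper names `local` and `column_names` are made up for illustration, and it assumes Python 2.7 or 3 (where `Element.iter()` exists) and that every `Column` carries a `position` attribute.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of three columns from the preview above (illustration only).
SAMPLE = """<Data-Table>
  <Column position="6">
    <Name>Sequence Type</Name>
  </Column>
  <Column position="7">
    <Name>Sequence Source</Name>
    <Description>The database from which the sequence used to design this probe set was taken.</Description>
  </Column>
  <Column position="12">
    <Name>ENTREZ_GENE_ID</Name>
    <Link-Delimiter> /// </Link-Delimiter>
  </Column>
</Data-Table>"""


def local(tag):
    """Drop a leading '{namespace}' from a tag name, if there is one."""
    return tag.rsplit('}', 1)[-1]


def column_names(root):
    """Return {position: name} for every Column element found under root."""
    names = {}
    for elem in root.iter():
        if local(elem.tag) != 'Column':
            continue
        for child in elem:
            if local(child.tag) == 'Name':
                # Assumes the position attribute is present and numeric.
                names[int(elem.get('position'))] = (child.text or '').strip()
    return names


columns = column_names(ET.fromstring(SAMPLE))
for position in sorted(columns):
    print("%d\t%s" % (position, columns[position]))
```

Keeping the result as a dictionary per file makes it straightforward to compare files afterwards, for example by turning the names into sets and taking intersections for the common attributes and differences for the distinct ones.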

output.txt (199.78 KB)

The attachment preview is chopped off after the first 10 KB. Please download the entire file.

Python 2.6.5 (r265:79096, Mar 19 2010, 21:48:26) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.

IDLE 2.6.5      
>>> ================================ RESTART ================================
>>> 
Contributor
  Person
    First
      Yael
    Last
      Strulovici-Barel
  Email
    yas2003@med.cornell.edu
  Phone
    646-962-5560
  Laboratory
    Crystal
  Department
    Department of Genetic Medicine
  Organization
    Weill Cornell Medical College
  Address
    Line
      1300 York Avenue
    City
      New York
    State
      NY
    Zip-Code
      10021
    Country
      USA
Contributor
  Organization
    
  Email
    geo@ncbi.nlm.nih.gov, support@affymetrix.com
  Phone
    888-362-2447
  Organization
    Affymetrix, Inc.
  Address
    City
      Santa Clara
    State
      CA
    Zip-Code
      95051
    Country
      USA
  Web-Link
    http://www.affymetrix.com/index.affx
Contributor
  Person
    First
      Brendan
    Last
      Carolan
Contributor
  Person
    First
      Ben-Gary
    Last
      Harvey
Contributor
  Person
    First
      Bishnu
    Middle
      P
    Last
      De
Contributor
  Person
    First
      Holly
    Last
      Vanni
Contributor
  Person
    First
      Ronald
    Middle
      G
    Last
      Crystal
Database
  Name
    Gene Expression Omnibus (GEO)
  Public-ID
    GEO
  Organization
    NCBI NLM NIH
  Web-Link
    http://www.ncbi.nlm.nih.gov/geo
  Email
    geo@ncbi.nlm.nih.gov
Platform
  Status
    Submission-Date
      2003-11-07
    Release-Date
      2003-11-07
    Last-Update-Date
      2010-10-21
  Title
    [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array
  Accession
    GPL570
  Technology
    in situ oligonucleotide
  Distribution
    commercial
  Organism
    Homo sapiens
  Manufacturer
    Affymetrix
  Manufacture-Protocol
    see manufacturer's web site
  Description
    Affymetrix submissions are typically submitted to GEO using the GEOarchive method described at http://www.ncbi.nlm.nih.gov/projects/geo/info/geo_affy.html

Complete coverage of the Human Genome U133 Set plus 6,500 additional genes for analysis of over 47,000 transcripts
 
In addition, there are 9,921 new probe sets representing approximately 6,500 new genes. These gene sequences were selected from GenBank, dbEST, and RefSeq. Sequence clusters were created from the UniGene database (Build 159, January 25, 2003) and refined by analysis and comparison with a number of other publicly available databases, including the Washington University EST trace repository and the NCBI human genome assembly (Build 31).
  Web-Link
    http://www.affymetrix.com/support/technical/byproduct.affx?product=hg-u133-plus
  Web-Link
    http://www.affymetrix.com/analysis/index.affx
  Relation
    
  Relation
    
  Relation
    
  Relation
    
  Relation
    
  Relation
    
  Relation
    
  Relation
    
  Relation
    
  Relation
    
  Relation
    
  Relation
    
  Relation
    
  Relation
    
  Relation
    
  Relation
    
  Relation
    
  Relation
    
  Relation
    
  Relation
    
  Relation
    
  Relation
    
  Relation
    
  Relation
    
  Relation
    
  Data-Table
    Column
      Name
        ID
      Description
        Affymetrix Probe Set ID
      Link-Prefix
        https://www.affymetrix.com/LinkServlet?array=U133PLUS&probeset=
    Column
      Name
        GB_ACC
      Description
        GenBank Accession Number
      Link-Prefix
        http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Search&db=Nucleotide&term=
    Column
      Name
        SPOT_ID
      Description
        identifies controls
    Column
      Name
        Species Scientific Name
      Description
        The genus and species of the organism represented by the probe set.
    Column
      Name
        Annotation Date
      Description
        The date that the annotations for this probe array were last updated. It will generally be earlier than the date when the annotations were posted on the Affymetrix web site.
    Column
      Name
        Sequence Type
    Column
      Name
        Sequence Source
      Description
        The database from which the sequence used to design this probe set was taken.
    Column
      Name
        Target Description
    Column
      Name
        Representative Public ID
      Description
        The accession number of a representative sequence. Note that for consensus-based probe sets, the representative sequence is only one of several sequences (sequence sub-clusters) used to build the consensus sequence and it is not directly used to derive the probe sequences. The representative sequence is chosen during array design as a sequence that is best associated with the transcribed region being interrogated by the probe set. Refer to the "Sequence Source" field to determine the database used.
    Column
      Name
        Gene Title
      Description
        Title of Gene represented by the probe set.
    Column
      Name
        Gene Symbol
      Description
        A gene symbol, when one is available (from UniGene).
    Column
      Name
        ENTREZ_GENE_ID
      Description
        Entrez Gene Database UID
      Link-Prefix
        http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=
      Link-Delimiter
        ///
    Column
      Name
        RefSeq Transcript ID
      Description
        References to multiple sequences in RefSeq. The field contains the ID and Description for each entry, and there can be multiple entries per ProbeSet.
    Column
      Name
        Gene Ontology Biological Process
      Description
        Gene Ontology Consortium Biological Process derived from LocusLink.  Each annotation consists of three parts: "Accession Number // Description // Evidence". The description corresponds directly to the GO ID. The evidence can be "direct", or "extended".
    Column
      Name
        Gene Ontology Cellular Component
      Description
        Gene Ontology Consortium Cellular Component derived from LocusLink.  Each annotation consists of three parts: "Accession Number // Description // Evidence". The description corresponds directly to the GO ID. The evidence can be "direct", or "extended".
    Column
      Name
        Gene Ontology Molecular Function
      Description
        Gene Ontology Consortium Molecular Function derived from LocusLink.  Each annotation consists of three parts: "Accession Number // Description // Evidence". The description corresponds directly to the GO ID. The evidence can be "direct", or "extended".
    External-Data
      GPL570-tbl-1.txt
Sample
  Status
    Submission-Date
      2007-12-21
    Release-Date
      2008-12-19
    Last-Update-Date
      2007-12-21
  Title
    large airways, non-smoker 002
  Accession
    GSM252799
  Type
    RNA
  Channel-Count
    1
  Channel
    Source
      airway epithelial cells obtained by bronchoscopy and brushing
    Organism
      Homo sapiens
    Characteristics
      61
    Characteristics
      M
    Characteristics
      white
    Characteristics
      non-smoker
    Molecule
      total RNA
    Extract-Protocol
      Trizol extraction and RNAeasy clean-up of total RNA was performed according to the manufacturer's instructions.
    Label
      biotin
    Label-Protocol
      Biotinylated cRNA were prepared according to the standard Affymetrix protocol from 3 microg total RNA (Expression Analysis Technical Manual, 701022 Rev.2, Affymetrix).
  Hybridization-Protocol
    Following fragmentation, 15 microg of cRNA were hybridized for 16 hr at 45C on GeneChip HG-U133 Plus 2.0. GeneChips were washed and stained in the Affymetrix Fluidics Station 450.
  Scan-Protocol
    GeneChips were scanned using the GeneChip Scanner 3000 7G.
  Description
    Comparison of gene expression in airway epithelial cells of normal non-smokers, phenotypic normal smokers, smokers with early COPD, and smokers with COPD.
  Data-Processing
    The data were analyzed with Microarray Suite version 5.0 (MAS 5.0) using Affymetrix default analysis settings and global scaling as normalization method.
  Platform-Ref
    
  Contact-Ref
    
  Supplementary-Data
    ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/samples/GSM252nnn/GSM252799/GSM252799.CEL.gz
  Supplementary-Data
    ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/samples/GSM252nnn/GSM252799/GSM252799.CHP.gz
  Data-Table
    Column
      Name
        ID_REF
    Column
      Name
        VALUE
      Description
        Signal
    Column
      Name
        ABS_CALL
      Description
        indicating whether the transcript was present (P), absent (A), or marginal (M)
    Column
      Name
        DETECTION P-VALUE
    External-Data
      GSM252799-tbl-1.txt
Sample
  Status
    Submission-Date
      2007-12-21
    Release-Date
      2008-12-19
    Last-Update-Date
      2007-12-21
  Title
    large airways, non-smoker 004
  Accession
    GSM252800
  Type
    RNA
  Channel-Count
    1
  Channel
    Source
      airway epithelial cells obtained by bronchoscopy and brushing
    Organism
      Homo sapiens
    Characteristics
      37
    Characteristics
      M
    Characteristics
      black
    Characteristics
      non-smoker
    Molecule
      total RNA
    Extract-Protocol
      Trizol extraction and RNAeasy clean-up of total RNA was performed according to the manufacturer's instructions.
    Label
      biotin
    Lab
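
For readers wondering how an indented listing like the one above can be generated, the following is a rough sketch under stated assumptions, not the script that actually produced output.txt. It walks an XML tree with `xml.etree.ElementTree`, prints each tag at its depth, and prints the stripped text of leaf elements one level deeper. The `SAMPLE` document, its `urn:example` namespace, and the helper names `local` and `dump` are hypothetical; only the names and titles inside it are taken from the listing.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; the namespace is a placeholder.
SAMPLE = """<Root xmlns="urn:example">
  <Contributor>
    <Person>
      <First>Brendan</First>
      <Last>Carolan</Last>
    </Person>
  </Contributor>
  <Sample>
    <Title>large airways, non-smoker 002</Title>
    <Platform-Ref/>
  </Sample>
</Root>"""


def local(tag):
    """Drop a leading '{namespace}' from a tag name, if there is one."""
    return tag.rsplit('}', 1)[-1]


def dump(elem, depth, lines):
    """Append the tag, then either the children or the leaf text, to lines."""
    lines.append('  ' * depth + local(elem.tag))
    children = list(elem)
    if children:
        for child in children:
            dump(child, depth + 1, lines)
    else:
        # Leaf element: emit its text, which may be empty.
        lines.append('  ' * (depth + 1) + (elem.text or '').strip())


lines = []
for top in ET.fromstring(SAMPLE):
    dump(top, 0, lines)
print('\n'.join(lines))
```

A dump of this kind shows only tags and text, so an element whose information sits in its attributes comes out as a blank line. That is presumably why entries such as `Relation`, `Platform-Ref` and `Contact-Ref` look empty in the listing; printing `elem.attrib` next to the tag would be one way to check.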

haojam 0 Light Poster · Answer 17 · 2010-11-24T17:33:32+00:00

This is GenBank®, not GenBank@.
