Dear Sir,
I have written a script to extract the first line, which starts with Source Name and ends with Comment [ArrayExpress Data Retrieval URI]. That part works, but I could not parse the distinct (unique) attributes, that is, the ones that are not repeated in every file. I would like to parse only the attributes in the first line, not the table values. Could you please correct this script? I would be glad of your support and cooperation. I have attached a zip file with all the sdrf.txt files and the output of the script I ran. The file may be located from this url -
ftp://ftp.ebi.ac.uk/pub/databases/mi...FMX-1.sdrf.txt

Regards,
Haobijam

#!/usr/bin/python
import glob
#import linecache
outfile = open('output_att.txt' , 'w')
files = glob.glob('*.sdrf.txt')
for file in files:
    infile = open(file)
    #count = 0
    for line in infile:
        
        lineArray = line.rstrip()
        if not line.startswith('Source Name') : continue
        #count = count + 1
        lineArray = line.split('%s\t')
        print lineArray[0]
        output = "%s\t\n"%(lineArray[0])
        outfile.write(output)
    infile.close()
outfile.close()

Is the data you want in the result file? Do you want unique lines, or something else? An example of the desired output would enable me to help you.

Dear ,
Yes, from the result file I would like to extract/parse only the unique words, like Factor Value [bkv] and Factor Value [incubate], and possibly more unique words that occur only once.

Regards,
Haobijam

The word Factor is in both lines, so it is not unique. Define what you want. A computer can give you the things you know you want.

Here is code for finding the unique words in a file:

import string

inputstring=open('output_att.txt').read()

uniqwords=set(word.strip(string.punctuation+string.digits)
              for word in inputstring.lower().split())

print('The %i unique words are: %s' % (len(uniqwords),sorted(uniqwords)))
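
Note that splitting on whitespace breaks a header such as Factor Value [bkv] into separate words. A minimal sketch of splitting on tabs instead, so that each bracketed attribute stays whole. The sample header string is made up for illustration, and header_terms is a hypothetical helper name, not something from the scripts above:

```python
def header_terms(header_line):
    """Return the set of non-empty, tab-separated column names in a header line."""
    return set(term.strip() for term in header_line.rstrip('\r\n').split('\t')
               if term.strip())

# Made-up header for illustration; real SDRF headers have many more columns.
sample = 'Source Name\tFactor Value [bkv]\tFactor Value [incubate]\n'
terms = header_terms(sample)
print('The %i terms are: %s' % (len(terms), sorted(terms)))
```

With this approach Factor Value [bkv] and Factor Value [incubate] count as two different terms, even though they share the word Factor.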

Dear,

I would like to read and parse only the unique words (attributes), that is, the ones that are not repeated, from the first line (the headers, not the table values) of all the sdrf.txt files. The first line starts with Source Name and ends with Comment [ArrayExpress Data Retrieval URI] or sometimes with Comment [ArrayExpress FTP file].

Regards,
Haobijam

#!/usr/bin/python
import glob
import string

outfile = open('output.txt' , 'w')
inputstring = open('output.txt').read()
files = glob.glob('*.sdrf.txt')
for file in files:
    infile = open(file)
    for line in infile:
        
        lineArray = line.rstrip()
        if not line.startswith('Source Name') : continue
        lineArray = line.split('%s\t')
        print lineArray[0]
        output = "%s\t\n"%(lineArray[0])
        outfile.write(output)
        uniqwords = set(word.strip(string.punctuation+string.digits)
        for word in inputstring.lower().split())
        print('The %i unique words are: %s' % (len(uniqwords),sorted(uniqwords)))
        inputstring.read(output)
    inputstring.close()
    infile.close()
outfile.close()

When I run it, the output comes with an error like -
Traceback (most recent call last):
File "C:\Users\haojam\Desktop\GEO\arrayexpress\Experiment\ab.py", line 21, in <module>
inputstring.read(output)
AttributeError: 'str' object has no attribute 'read'
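
The traceback above seems to come from mixing up the file object and its contents: open('output.txt').read() returns a plain string, and a string has no read() or close(). A minimal sketch of keeping the two apart. The temporary file and its one-line content are made up so the example runs on its own:

```python
import os
import tempfile

# Write a small made-up file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'output.txt')
with open(path, 'w') as outfile:
    outfile.write('Source Name\tFactor Value [bkv]\n')

# The file object has read() and close(); read() hands back the whole
# content as one str, which has neither.
with open(path) as infile:
    text = infile.read()

print(type(text).__name__)   # str
uniqwords = set(text.lower().split())
print('The %i unique words are: %s' % (len(uniqwords), sorted(uniqwords)))
```

Note also that the script above opens output.txt for writing first, which empties the file, so reading it back straight away would give an empty string in any case.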

Maybe like this?

#!/usr/bin/python
import glob
import string

outfile = open('output.txt' , 'w')
files = glob.glob('*.sdrf.txt')
previous = set()
for file in files:
    print('\n'+file)
    infile = open(file)
##    previous = set() # uncomment this if do not need to be unique between the files
    for line in infile:
        lineArray = line.rstrip()
        if not line.startswith('Source Name') : continue
        lineArray = line.split('%s\t')
        output = "%s\t\n"%(lineArray[0])
        outfile.write(output)
        uniqwords = set(word.strip() for word in lineArray[0].split('\t')
                        if word.strip() and word.strip() not in previous) 
        print('The %i unique terms are:\n\t%s' % (len(uniqwords),'\n\t'.join(sorted(uniqwords))))
        previous |=  uniqwords 
    infile.close()
    

outfile.close()
print('='*80)
print('The %i terms are:\n\t%s' % (len(previous),'\n\t'.join(sorted(previous))))

When I run this code there is an error on the following lines. Could you please assist me?

print('The %i unique terms are:\n\t%s' % (len(uniqwords),'\n\t'.join(sorted(uniqwords))))
previous |=  uniqwords 
    infile.close()
    

outfile.close()
print('='*80)
print('The %i terms are:\n\t%s' % (len(previous),'\n\t'.join(sorted(previous))))

What is the error message, and which Python version are you using?

Dear Sir,

I would like to extract only the unique terms from all the sdrf.txt files, but this Python code outputs the unique terms for every file individually. Terms like Array Data File, Array Design REF ... are repeated in most of the sdrf.txt files, so I do not want to print them as unique terms. Could you also please tell me how to ignore case in Python, because Characteristics[OrganismPart] is printed as a separate unique term from Characteristics[organism part], and similarly Characteristics[Sex] from Characteristics[sex]? I look forward to your support and reply.

#!/usr/bin/python
import glob
import string

outfile = open('output.txt' , 'w')
files = glob.glob('*.sdrf.txt')
previous = set()
for file in files:
    print('\n'+file)
    infile = open(file)
    #previous = set() # uncomment this if do not need to be unique between the files
    for line in infile:
        lineArray = line.rstrip()
        if not line.startswith('Source Name') : continue
        lineArray = line.split('%s\t')
        output = "%s\t\n"%(lineArray[0])
        outfile.write(output)
        uniqwords = set(word.strip() for word in lineArray[0].split('\t')
                        if word.strip() and word.strip() not in previous) 
        print('The %i unique terms are:\n\t%s' % (len(uniqwords),'\n\t'.join(sorted(uniqwords))))
        previous |=  uniqwords 
    infile.close()
outfile.close()
print('='*80)
print('The %i terms are:\n\t%s' % (len(previous),'\n\t'.join(sorted(previous))))

With regards,
Haobijam

What have you tried? Which documents have you looked at to solve your problem? Any error messages?

Dear Sir,

I would like to print the unique terms for all the sdrf.txt files taken as a whole, not for each individual sdrf.txt file. I look forward to your response.

Regards,
Haobijam

You could normalize the unique terms using a regular expression:

import re
item_pattern = re.compile(r"[a-zA-Z][a-z]*|[^\s]")

def tuple_key(composite_term):
    return tuple(w.lower() for w in item_pattern.findall(composite_term))

def normalize(term):
    return ' '.join(tuple_key(term.strip()))
        
print normalize("Characteristics [StrainOrLine]")
"""my output -->
characteristics [ strain or line ]
"""
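
A sketch of how the normalized form could be used to treat spelling variants as one term. It repeats the pattern above so that it runs on its own; unique_by_normal_form is a hypothetical helper name, and keeping the first spelling seen is just one possible choice:

```python
import re

item_pattern = re.compile(r"[a-zA-Z][a-z]*|[^\s]")

def normalize(term):
    """Lower-case a term and split CamelCase so spelling variants compare equal."""
    return ' '.join(w.lower() for w in item_pattern.findall(term.strip()))

def unique_by_normal_form(terms):
    """Map each normalized form to the first original spelling seen."""
    seen = {}
    for term in terms:
        seen.setdefault(normalize(term), term)
    return seen

# Spelling variants quoted earlier in the thread.
variants = ['Characteristics[OrganismPart]', 'Characteristics[organism part]',
            'Characteristics[Sex]', 'Characteristics[sex]']
for key, original in sorted(unique_by_normal_form(variants).items()):
    print('%s <- %s' % (key, original))
```

One side effect of this pattern is that acronyms such as URI are split letter by letter. That does not matter for comparing terms, but the normalized form is not pretty to print.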

Dear Sir,

I have a query regarding this parsing code. I would like to parse all the sdrf.txt files, but there is a problem with the attribute structure of the sdrf.txt files. There are two forms: one starts with Source Name and the other starts with Labeled Extract Name, so I need to run the Python code separately for each. Could you please combine this into one Python script, using an if/else condition for both cases? I have attached the sdrf.txt files starting with Labeled Extract Name. I would be glad of your support and cooperation.

#!/usr/bin/python
import glob
import string
outfile = open('output.txt' , 'w')
files = glob.glob('*.sdrf.txt')
previous = set()
uniqwords_new = set()
for file in files:
    #print('\n'+file)
    infile = open(file)
    previous = set() # uncomment this if do not need to be unique between the files
    for line in infile:
        lineArray = line.rstrip()
        if not line.startswith('Source Name') : continue
        lineArray = line.split('%s\t')
        output = "%s\t\n"%(lineArray[0])
        uniqwords = set(word.strip() for word in lineArray[0].split('\t')
                        if word.strip() and word.strip() not in previous)
        #print('The %i unique terms are:\n\t%s' % (len(uniqwords),'\n\t'.join(sorted(uniqwords))))
        previous |=  uniqwords
    uniqwords_new = uniqwords_new ^ uniqwords
    # making strings to replace the undesired words
    #---------------------------------------------
    str_old1 = str('sex')
    str_new1 = str('Sex')
    str_old2 = str('organism')
    str_new2 = str('Organism')
    str_old3 = str('rganism part')
    str_new3 = str('OrganismPart')
    str_old4 = str('time')
    str_new4 = str('Time')
    str_old5 = str('Quantity of labled extract used')
    str_new5 = str('Quantity of label target used')
    str_old6 = str('quantitiy')
    str_new6 = str('Quantity')
    #--------------add below the other words you wish to replace
    # replacing the words
    output = output.replace(str_old1,str_new1)
    output = output.replace(str_old2,str_new2)
    output = output.replace(str_old3,str_new3)
    output = output.replace(str_old4,str_new4)
    output = output.replace(str_old5,str_new5)
    output = output.replace(str_old6,str_new6)
    #------------ replace other words you wish to
    print (output)
    outfile.write(output)
    infile.close()
print('The %i unique terms are:\n\t%s' % (len(uniqwords_new),'\n\t'.join(sorted(uniqwords_new))))                  
outfile.close()
print('='*80)
print('The %i terms are:\n\t%s' % (len(previous),'\n\t'.join(sorted(previous))))

Regards,
Haobijam

Your code for lines 22..44 was really clumsy. Use tuples and a for loop.

#!/usr/bin/python
import glob
import string

with open('output.txt' , 'w') as outfile:
    files = glob.glob('*.sdrf.txt')

    uniqwords_new = set()

    for file in files:
        with open(file) as infile:
            previous = set() # uncomment this if do not need to be unique between the files
            for line in infile:
                if not line.startswith('Source Name') : continue ## change this line to deal with other form
                output = line
                uniqwords = set(word.strip() for word in line.rstrip().split('\t')
                                if word.strip() and word.strip() not in previous)
                previous |=  uniqwords
                
            uniqwords_new = uniqwords_new ^ uniqwords
            # making tuples to replace the undesired words
            #---------------------------------------------
            replacement_tuples = (('sex','Sex'),
                                  ('organism','Organism'),
                                  ('rganism part','OrganismPart'),
                                  ('time','Time'),
                                  ('Quantity of labled extract used','Quantity of label target used'),
                                  ('quantitiy', 'Quantity') )
            #--------------add below the other words you wish to replace
            # replacing the words
            for old, new in replacement_tuples:
                output = output.replace(old, new)
            #------------ replace other words you wish to
            print (output)
            outfile.write(output)

print('The %i unique terms are:\n\t%s' % (len(uniqwords_new),'\n\t'.join(sorted(uniqwords_new))))                  
print('='*80)
print('The %i terms are:\n\t%s' % (len(previous),'\n\t'.join(sorted(previous))))

How are you processing the variable output, which is defined inside the for loop, outside that loop? It will hold only the value of the last line's [0] element.

The output has double newlines, and line 15 does not do anything, so in the end lineArray[0] is the same as line, which is also stored as output with a probably unnecessary \t\n appended.

previous is initialized twice to an empty set; take out line 6.

startswith is at line 14, as you can see; change it to also deal with the other starting phrase.
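
For the two header forms, no if/else is needed, because str.startswith accepts a tuple of prefixes. A minimal sketch, with HEADER_PREFIXES and is_header as hypothetical names and made-up sample lines:

```python
HEADER_PREFIXES = ('Source Name', 'Labeled Extract Name')

def is_header(line):
    """True if the line begins with any of the known header prefixes."""
    # str.startswith accepts a tuple and returns True if any prefix matches.
    return line.startswith(HEADER_PREFIXES)

print(is_header('Source Name\tCharacteristics [Sex]\n'))   # True
print(is_header('Labeled Extract Name\tLabel\n'))          # True
print(is_header('sample 1\tmale\n'))                       # False
```

In the script, the test would then read `if not line.startswith(HEADER_PREFIXES): continue`, and further header forms can be added to the tuple as they turn up.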

Dear Sir,

When I run the Python script on the SMDB sdrf.txt files attached earlier, which have the Labeled Extract Name attribute, an error occurs. Here is the message --

Traceback (most recent call last):
  File "C:/Users/haojam/Desktop/GEO/arrayexpress/Experiment/sdrf_10.py", line 20, in <module>
    uniqwords_new = uniqwords_new ^ uniqwords
NameError: name 'uniqwords' is not defined

Regards,
Haobijam

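uniqwords must be initialized before the loop, for example to an empty set. Otherwise a file with no line starting with Source Name leaves it undefined.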

Sir,

When I run this Python script, the unique terms output and the terms output are the same. But when I uncomment previous = set(), the outputs differ. My question is: will the "%i terms" be the terms common to all files?

Regards,
Haobijam

#!/usr/bin/python
import glob
import string

with open('output.txt' , 'w') as outfile:
    files = glob.glob('*.sdrf.txt')

    uniqwords_new = set()
    previous = set()
    for file in files:
        with open(file) as infile:
            #previous = set() # uncomment this if do not need to be unique between the files
            for line in infile:
                if not line.startswith('Source Name') : continue ## change this line to deal with other form
                output = line
                uniqwords = set(word.strip() for word in line.rstrip().split('\t')
                                if word.strip() and word.strip() not in previous)
                previous |=  uniqwords
                
            uniqwords_new = uniqwords_new ^ uniqwords
            # making tuples to replace the undesired words
            #---------------------------------------------
            replacement_tuples = (('sex','Sex'),
                                  ('organism','Organism'),
                                  ('organism part','OrganismPart'),
                                  ('time','Time'),
                                  ('Quantity of labled extract used','Quantity of label target used'),
                                  ('quantitiy', 'Quantity') )
            #--------------add below the other words you wish to replace
            # replacing the words
            for old, new in replacement_tuples:
                output = output.replace(old, new)
            #------------ replace other words you wish to
            #print (output)
            outfile.write(output)

print('The %i unique terms are:\n\t%s' % (len(uniqwords_new),'\n\t'.join(sorted(uniqwords_new))))                  
print('='*80)
print('The %i terms are:\n\t%s' % (len(previous),'\n\t'.join(sorted(previous))))
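
As far as I can tell, no: previous collects every term seen so far, so without the per-file reset it is the union of all headers, not the common terms. Because uniqwords then only holds terms not seen before, the ^ (symmetric difference) just rebuilds the same union, which would explain why both outputs match. With the reset, ^ keeps the terms that occur in an odd number of files. A sketch of getting all, common and once-only terms by counting files per term instead. The file names and headers are made up:

```python
from collections import Counter

# Made-up header terms per file; in the real script each set would come from
# the header line of one sdrf.txt file.
headers = {
    'a.sdrf.txt': {'Source Name', 'Array Data File', 'Factor Value [bkv]'},
    'b.sdrf.txt': {'Source Name', 'Array Data File', 'Factor Value [incubate]'},
    'c.sdrf.txt': {'Source Name', 'Array Data File'},
}

# Count in how many files each term occurs (each set lists a term once).
file_count = Counter(term for terms in headers.values() for term in terms)

all_terms = set(file_count)                                           # in any file
common = set(t for t, n in file_count.items() if n == len(headers))   # in every file
only_once = set(t for t, n in file_count.items() if n == 1)           # in exactly one file

print('The %i terms are: %s' % (len(all_terms), sorted(all_terms)))
print('The %i common terms are: %s' % (len(common), sorted(common)))
print('The %i terms found in one file only are: %s' % (len(only_once), sorted(only_once)))
```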

Dear Sir,

I have a query about parsing attributes and extracting unique terms from the adf.txt files from ArrayExpress [ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/array/]. The Python code written here works for individual files that share the same starting term, but it is not feasible for running around 2270 adf.txt files at one time. Could you please correct line number 12 of this code, or suggest some tips? I would like to parse the first line of every adf.txt file (2270 in number) and later extract the unique terms and the common terms from them. For your convenience I have attached a zip file of adf.txt files; more are available from the FTP site mentioned above. I would be glad of your support and cooperation.

With warm regards,
Haobijam

#!/usr/bin/python
import glob
import string
with open('output_Reporter Name.txt' , 'w') as outfile:
    files = glob.glob('*.adf.txt')
    uniqwords = set()
    previous = set()
    for file in files:
        with open(file) as infile:
            #previous = set() # uncomment this if do not need to be unique between the files
            for line in infile:
                if not line.startswith('Reporter Name') : continue ## change this line to deal with other form
                output = line
                uniqwords = set(word.strip() for word in line.rstrip().split('\t')
                                if word.strip() and word.strip() not in previous)
                previous |=  uniqwords
                print (output)
                outfile.write(output)
print('The %i unique terms are:\n\t%s' % (len(uniqwords),'\n\t'.join(sorted(uniqwords))))                  
print('='*80)
print('The %i terms are:\n\t%s' % (len(previous),'\n\t'.join(sorted(previous))))
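
For a large number of files it may help to stop reading each file as soon as its header line is found, and to report files where no known header matches. A sketch under stated assumptions: find_header is a hypothetical helper, the two tiny files are made up so the code runs anywhere, and the prefix tuple is a guess that would need extending with the first columns that actually occur in the adf.txt files:

```python
import glob
import os
import tempfile

def find_header(path, prefixes):
    """Return the first line starting with one of the prefixes, or None."""
    with open(path) as infile:
        for line in infile:
            if line.startswith(prefixes):
                return line.rstrip('\r\n')   # stop here; the table is not needed
    return None

# Build two tiny made-up files so the sketch is self-contained.
workdir = tempfile.mkdtemp()
samples = {'one.adf.txt': 'Reporter Name\tReporter Sequence\nr1\tACGT\n',
           'two.adf.txt': 'Composite Element Name\tReporter Name\nc1\tr1\n'}
for name, text in samples.items():
    with open(os.path.join(workdir, name), 'w') as handle:
        handle.write(text)

# Assumed prefixes; add whatever other starting terms the real files use.
prefixes = ('Reporter Name', 'Composite Element Name')
all_terms = set()
for path in sorted(glob.glob(os.path.join(workdir, '*.adf.txt'))):
    header = find_header(path, prefixes)
    if header is None:
        print('no header found in %s' % os.path.basename(path))
        continue
    all_terms.update(term.strip() for term in header.split('\t') if term.strip())

print('The %i terms are: %s' % (len(all_terms), sorted(all_terms)))
```

In the script above, uniqwords is overwritten on every matching line, so the first print after the loop only shows the new terms of the last header processed, while previous holds everything.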

Sir,
I have written a Python script to parse the first line of all sdrf.txt files [ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/] and extract unique terms from all the files. But when I run this code there is an error. Could you please correct it? I would be glad of your support and cooperation.

With regards,
Haobijam

#!/usr/bin/python
import glob
import re
import linecache
linelist=[]
files = glob.glob('*.sdrf.txt')
for file in files:
    f1 = open(file)
    f2 = open('SDRFparse.txt','a+')

    filename = file.split('.')
    filename_1 = filename[0]
    #print filename_1

    line1 = linecache.getline(file, 1)
    line11 = line1.replace('\n','')
    line2 = line11.split('\t')
    i = len(line2)
    #last = line2[i-1]
    lines = f1.xreadlines()
    linecount = len(f1.readlines())

    for num in range(2,linecount+1):
        line3 = linecache.getline(file, num)
        a = line3.split('\t')
        for j in range(0,i-1):
            f2.write(line2[j] + '\t' + a[j] + '\n')
f1.close()
f2.close()

#output error
Traceback (most recent call last):
File "C:/Users/haojam/Desktop/GEO/arrayexpress/Experiment/sdrfparse.py", line 27, in <module>
f2.write(line2[j] + '\t' + a[j] + '\n')
IndexError: list index out of range

Find where the error is. Print
line2[j], j, len(line2)
and with a separate print statement
a[j], j, len(a).
You can then fix the specific problem. Also, include some comments in the code explaining what is happening, so we know what it is supposed to be doing. No one can be expected to write all of the code for you.
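
A likely cause, though I have not run the script against the files, is a row with fewer fields than the header, for example a blank line at the end of a file. A sketch of pairing names and values with zip, which stops at the shorter list instead of raising IndexError. pair_columns is a hypothetical helper and the header and row are made up:

```python
def pair_columns(header_line, row_line):
    """Pair each header name with the value in the same column of a data row."""
    names = header_line.rstrip('\r\n').split('\t')
    values = row_line.rstrip('\r\n').split('\t')
    if len(values) != len(names):
        # Report the mismatch rather than failing on it.
        print('column mismatch: %i names, %i values' % (len(names), len(values)))
    # zip stops at the shorter list, so a short row cannot raise IndexError.
    return list(zip(names, values))

header = 'Source Name\tCharacteristics [Sex]\tArray Data File\n'
short_row = 'sample 1\tmale\n'      # made-up row with the last column missing
for name, value in pair_columns(header, short_row):
    print('%s\t%s' % (name, value))
```

Note that the original loop uses range(0, i-1), which also skips the last column even for complete rows; pairing with zip avoids that as well.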

Dear,

I have written code to parse attributes and values from an XML file (attached MINiML), but I get an error at line number 86, GenBank@, when running the code. When I remove the @ sign, the code runs without any error. Could you please suggest a fix or correct the code, because it is impossible to remove the @ sign from all the XML files? I would be glad of your support and cooperation. I hope we can work it out using the ElementTree method. I am also attaching the output here.

Regards,
Haobijam

#!/usr/bin/python
import xml.dom.minidom

# Load the Contributor collection
MINiML = xml.dom.minidom.parse ( 'MINiML.xml' )


def getTextFromElem(parent):
    '''Return a list of text found in the child nodes of a
    parent node, discarding whitespace.'''
    textList = []
    for n in parent.childNodes:
        # TEXT_NODE - 3
        if n.nodeType == 3 and n.nodeValue.strip():
            textList.append(str(n.nodeValue.strip()))
    return textList

def getElemChildren(parent):
    # Return a list of element nodes below parent
    elements = []
    for obj in parent.childNodes:
        if obj.nodeType == obj.ELEMENT_NODE:
            elements.append(obj)
    return elements

def nodeTree(element, pad=0):
    # Return list of strings representing the node tree below element
    results = ["%s%s" % (pad*" ", str(element.nodeName))]
    nextElems = getElemChildren(element)
    if nextElems:
        for node in nextElems:
            results.extend(nodeTree(node, pad+2))
    else:
        results.append("%s%s" % ((pad+2)*" ", ", ".join(getTextFromElem(element))))
    return results

contributors = MINiML.documentElement.getElementsByTagName( 'Contributor' )
for contributor in contributors:
    print "\n".join(nodeTree(contributor))
contributors = MINiML.documentElement.getElementsByTagName( 'Database' )
for contributor in contributors:
    print "\n".join(nodeTree(contributor))
contributors = MINiML.documentElement.getElementsByTagName( 'Platform' )
for contributor in contributors:
    print "\n".join(nodeTree(contributor))
contributors = MINiML.documentElement.getElementsByTagName( 'Sample' )
for contributor in contributors:
    print "\n".join(nodeTree(contributor))
contributors = MINiML.documentElement.getElementsByTagName( 'Series' )
for contributor in contributors:
    print "\n".join(nodeTree(contributor))

This is GenBank® not GenBank@
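
My guess, since the traceback was not posted, is that the str(...) calls in getTextFromElem fail under Python 2 because ® is not an ASCII character. A sketch of the same tree walk with ElementTree that keeps the text as unicode throughout. The XML fragment is made up and is not the real MINiML layout, and node_tree is a hypothetical name:

```python
import xml.etree.ElementTree as ET

# Made-up fragment for illustration; \u00ae is the registered sign.
xml_bytes = (u'<?xml version="1.0" encoding="UTF-8"?>'
             u'<Database><Name>GenBank\u00ae</Name>'
             u'<Organization>NCBI</Organization></Database>').encode('utf-8')

def node_tree(element, pad=0):
    """Return indented lines for an element and everything below it."""
    lines = [u'%s%s' % (u' ' * pad, element.tag)]
    children = list(element)
    if children:
        for child in children:
            lines.extend(node_tree(child, pad + 2))
    else:
        # No str() here, so non-ASCII text is never forced into ASCII.
        text = (element.text or u'').strip()
        lines.append(u'%s%s' % (u' ' * (pad + 2), text))
    return lines

root = ET.fromstring(xml_bytes)
result = node_tree(root)
for line in result:
    # Escape non-ASCII for display so this runs on any console encoding.
    print(line.encode('ascii', 'backslashreplace').decode('ascii'))
```

If the real files declare an XML namespace, ElementTree returns tags in the form {namespace}Name, so lookups by tag name would need to include that prefix. For saving the result, writing to a file opened with io.open(..., encoding='utf-8') keeps the ® sign intact.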