Dear Sir,

I would like to extract only the unique terms across all sdrf.txt files, but this Python code outputs the unique terms for each file individually. Terms such as Array Data File and Array Design REF are repeated in most sdrf.txt files, so I don't want to print them as unique terms. Could you also please tell me how to ignore case in Python? At the moment Characteristics[OrganismPart] is printed as a separate unique term from Characteristics[organism part], and likewise Characteristics[Sex] from Characteristics[sex]. I am eagerly waiting for your support and reply.

Regards,
Haobijam

#!/usr/bin/python
import glob
import string

outfile = open('output.txt' , 'w')
files = glob.glob('*.sdrf.txt')
previous = set()
for file in files:
    print('\n'+file)
    infile = open(file)
    #previous = set() # uncomment this if do not need to be unique between the files
    for line in infile:
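        # only the header row, which starts with 'Source Name', holds the terms
        if not line.startswith('Source Name') : continue
        header = line.rstrip()  # header row without the trailing newline
        output = "%s\t\n"%(header)
        outfile.write(output)
        # terms in this header not yet seen in earlier files (comparison is case-sensitive)
        uniqwords = set(word.strip() for word in header.split('\t')
                        if word.strip() and word.strip() not in previous)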
        print('The %i unique terms are:\n\t%s' % (len(uniqwords),'\n\t'.join(sorted(uniqwords))))
        previous |=  uniqwords 
    infile.close()
outfile.close()
print('='*80)
print('The %i terms are:\n\t%s' % (len(previous),'\n\t'.join(sorted(previous))))
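
A minimal sketch of one way to ignore case when comparing terms, assuming it is acceptable to ignore spaces as well (otherwise Characteristics[OrganismPart] and Characteristics[organism part] would still differ). The helpers normalize_term and unique_terms are hypothetical names for illustration and are not part of the script above; the first spelling seen for each term is the one kept.

```python
def normalize_term(term):
    """Return a comparison key: all whitespace removed, then lowercased."""
    return ''.join(term.split()).lower()


def unique_terms(terms):
    """Return the sorted terms, keeping the first spelling seen per key."""
    seen = {}
    for term in terms:
        term = term.strip()
        if not term:
            continue  # skip empty columns
        # setdefault stores the term only if its key is not present yet
        seen.setdefault(normalize_term(term), term)
    return sorted(seen.values())


# Example header row (made-up data in the shape of an sdrf.txt header)
header = ('Source Name\tCharacteristics[OrganismPart]\t'
          'Characteristics[organism part]\tCharacteristics[Sex]\t'
          'Characteristics[sex]')
print(unique_terms(header.split('\t')))
```

In the script above, the same idea would mean storing normalize_term(word) in the set previous instead of the raw word.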

Dear Sir,
I have written a Python script to parse the attributes (i.e. the first line of each sdrf.txt file; the files are attached here as a zip file). I would also like to extract the unique terms from these attributes (output_att.txt) across all sdrf.txt files. Could you please help me?

Regards
Haobijam

#!/usr/bin/python
import glob
#import linecache
outfile = open('output_att.txt' , 'w')
files = glob.glob('*.sdrf.txt')
for file in files:
    infile = open(file)
    #count = 0
    for line in infile:
        
        # only the header row, which starts with 'Source Name', holds the attributes
        if not line.startswith('Source Name') : continue
        #count = count + 1
        header = line.rstrip()  # header row without the trailing newline
        print(header)
        output = "%s\t\n"%(header)
        outfile.write(output)
    infile.close()
outfile.close()
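
One possible way to see which attributes are shared by most files and which occur in only one is to count, for each term, the number of header rows it appears in. This is only a sketch: read_header and term_file_counts are hypothetical helper names, the comparison is case-sensitive, and it assumes each file has a single tab-separated header row starting with 'Source Name'.

```python
from collections import Counter
import glob


def read_header(path):
    """Return the 'Source Name' header row of one file, or None if absent."""
    with open(path) as infile:
        for line in infile:
            if line.startswith('Source Name'):
                return line.rstrip()
    return None


def term_file_counts(headers):
    """Count, for each term, how many header rows contain it."""
    counts = Counter()
    for header in headers:
        # a set, so a term repeated within one header is counted once
        terms = set(word.strip() for word in header.split('\t') if word.strip())
        counts.update(terms)
    return counts


if __name__ == '__main__':
    headers = [h for h in map(read_header, glob.glob('*.sdrf.txt')) if h]
    counts = term_file_counts(headers)
    for term in sorted(counts):
        print('%s\t%i' % (term, counts[term]))
```

Terms with a count of 1 occur in a single file, while terms such as Array Data File would show a count close to the number of files.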