I would like to extract only the unique terms across all of my sdrf.txt files, but this Python code reports the unique terms for each file individually. Terms such as Array Data File and Array Design REF are repeated in most of the sdrf.txt files, so I don't want them printed again as unique terms. Could you also tell me how to ignore case in Python? At the moment Characteristics[OrganismPart] is printed as a separate term from Characteristics[organism part], and likewise Characteristics[Sex] from Characteristics[sex]. I look forward to your reply.
#!/usr/bin/python
import glob

outfile = open('output.txt', 'w')
files = glob.glob('*.sdrf.txt')
previous = set()  # terms already seen in earlier files
for file in files:
    print('\n' + file)
    infile = open(file)
    #previous = set() # uncomment this if terms do not need to be unique between the files
    for line in infile:
        # only the header line, which starts with 'Source Name', holds the terms
        if not line.startswith('Source Name'):
            continue
        lineArray = line.rstrip()
        output = "%s\t\n" % lineArray
        outfile.write(output)
        # comparison is still case-sensitive here
        uniqwords = set(word.strip() for word in lineArray.split('\t')
                        if word.strip() and word.strip() not in previous)
        print('The %i unique terms are:\n\t%s' % (len(uniqwords), '\n\t'.join(sorted(uniqwords))))
        previous |= uniqwords
    infile.close()
outfile.close()
print('=' * 80)
print('The %i terms are:\n\t%s' % (len(previous), '\n\t'.join(sorted(previous))))
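
One possible way to handle both points is sketched below; it is an illustration under stated assumptions, not a definitive fix. The helper names `normalize_term` and `count_terms` are hypothetical. The sketch assumes that two headers should count as the same term when they differ only in letter case or whitespace (so `Characteristics[OrganismPart]` and `Characteristics[organism part]` match), and that each file's terms sit on the line starting with `Source Name`, as in the script above. It also records how many files each term appears in, so terms shared by most files can be filtered out if that is what is wanted.

```python
import glob
import re
from collections import Counter


def normalize_term(term):
    """Build a comparison key that ignores letter case and whitespace (assumed rule)."""
    return re.sub(r'\s+', '', term).lower()


def count_terms(headers_by_file):
    """Return (first spelling seen per key, number of files containing each key).

    headers_by_file maps a file name to its tab-separated header line.
    """
    first_spelling = {}
    file_counts = Counter()
    for name in sorted(headers_by_file):
        keys_in_file = set()
        for term in headers_by_file[name].rstrip('\n').split('\t'):
            term = term.strip()
            if not term:
                continue
            key = normalize_term(term)
            # keep the first spelling encountered for display
            first_spelling.setdefault(key, term)
            keys_in_file.add(key)
        # count each term at most once per file
        file_counts.update(keys_in_file)
    return first_spelling, file_counts


if __name__ == '__main__':
    headers = {}
    for path in glob.glob('*.sdrf.txt'):
        with open(path) as infile:
            for line in infile:
                if line.startswith('Source Name'):
                    headers[path] = line
                    break
    spelling, counts = count_terms(headers)
    print('The %i distinct terms are:' % len(spelling))
    for key in sorted(spelling):
        print('\t%s (in %i file(s))' % (spelling[key], counts[key]))
```

To list only the terms that occur in a single file, keep the keys whose count is 1, for example `[spelling[k] for k in spelling if counts[k] == 1]`.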