Dear Sir,

I would like to extract only the unique terms across all sdrf.txt files, but this Python code outputs the unique terms for each file individually. Terms such as Array Data File and Array Design REF are repeated in most of the sdrf.txt files, so I don't want them printed as unique terms. Could you also please tell me how to ignore case in Python? At the moment Characteristics[OrganismPart] is printed as a separate unique term from Characteristics[organism part], and likewise Characteristics[Sex] from Characteristics[sex]. I look forward to your reply.

Regards,
Haobijam

#!/usr/bin/python
import glob

outfile = open('output.txt' , 'w')
files = glob.glob('*.sdrf.txt')
previous = set()
for file in files:
    print('\n'+file)
    infile = open(file)
    #previous = set() # uncomment this if terms need not be unique between the files
    for line in infile:
        # only the header line, which starts with 'Source Name', lists the terms
        if not line.startswith('Source Name'): continue
        header = line.rstrip('\r\n')
        outfile.write(header + '\n')
        # split the header on tabs; keep non-empty terms not seen in an earlier file
        uniqwords = set(word.strip() for word in header.split('\t')
                        if word.strip() and word.strip() not in previous)
        print('The %i unique terms are:\n\t%s' % (len(uniqwords),'\n\t'.join(sorted(uniqwords))))
        previous |=  uniqwords 
    infile.close()
outfile.close()
print('='*80)
print('The %i terms are:\n\t%s' % (len(previous),'\n\t'.join(sorted(previous))))
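
One possible way to handle the case question is to compare terms through a normalized key rather than the raw text. The sketch below is only an illustration under stated assumptions: `normalize` and `count_terms` are hypothetical helper names, the key removes all whitespace and lowercases the term (so Characteristics [OrganismPart] and Characteristics [organism part] collapse to one entry), and the three header strings are shortened stand-ins for real 'Source Name' lines. It also counts how many files each term occurs in, so that terms found in a single file can be told apart from the common ones.

```python
import re
from collections import Counter


def normalize(term):
    """Hypothetical comparison key: remove all whitespace, then lowercase."""
    return re.sub(r'\s+', '', term).lower()


def count_terms(headers):
    """Count, per normalized term, how many header lines contain it."""
    file_counts = Counter()
    spelling = {}  # normalized key -> first spelling encountered
    for header in headers:
        keys = set()
        for term in header.split('\t'):
            term = term.strip()
            if not term:
                continue
            key = normalize(term)
            spelling.setdefault(key, term)
            keys.add(key)
        file_counts.update(keys)  # one count per file, even if repeated in it
    return file_counts, spelling


# Shortened stand-ins for the 'Source Name' lines of three sdrf.txt files
headers = [
    'Source Name\tCharacteristics [OrganismPart]\tCharacteristics [Sex]\tArray Data File',
    'Source Name\tCharacteristics [organism part]\tCharacteristics [sex]\tCharacteristics [strain]',
    'Source Name\tCharacteristics [Ecotype]\tArray Data File',
]
file_counts, spelling = count_terms(headers)
all_terms = sorted(spelling.values())
single_file_terms = sorted(spelling[k] for k, n in file_counts.items() if n == 1)
print('%i distinct terms:\n\t%s' % (len(all_terms), '\n\t'.join(all_terms)))
print('%i terms found in one file only:\n\t%s'
      % (len(single_file_terms), '\n\t'.join(single_file_terms)))
```

With real data, the `headers` list would be filled from the 'Source Name' line of each file matched by `glob.glob('*.sdrf.txt')`. Whether stripping the internal spaces is appropriate depends on the data; lowercasing alone would not merge OrganismPart with organism part.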

Dear Sir,
I have written a Python script to parse the attributes (i.e. the first line of each sdrf.txt file); the files are attached here as a zip file. I would also like to extract the unique terms from these attributes (output_att.txt) across all sdrf.txt files. Could you please help me?

Regards
Haobijam

#!/usr/bin/python
import glob
#import linecache
outfile = open('output_att.txt' , 'w')
files = glob.glob('*.sdrf.txt')
for file in files:
    infile = open(file)
    #count = 0
    for line in infile:
        
        # only the header line, which starts with 'Source Name', lists the attributes
        if not line.startswith('Source Name'): continue
        #count = count + 1
        header = line.rstrip('\r\n')
        print(header)
        outfile.write(header + '\n')
    infile.close()
outfile.close()
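
As a rough sketch of the follow-up step asked about here, the unique terms could be collected from the attribute lines by splitting each line on tabs and remembering what has already been seen. The function name `unique_terms` and the sample lines are assumptions made for illustration; the comparison is exact (case-sensitive), and blank or tab-only lines are skipped.

```python
def unique_terms(lines):
    """Return the distinct tab-separated terms, in first-seen order."""
    seen = set()
    ordered = []
    for line in lines:
        for term in line.split('\t'):
            term = term.strip()
            if term and term not in seen:  # also skips blank and tab-only lines
                seen.add(term)
                ordered.append(term)
    return ordered


# Stand-in for the lines of output_att.txt; with the real file, the open file
# object could be passed instead, e.g. unique_terms(open('output_att.txt')).
sample_lines = [
    'Source Name\tCharacteristics [Organism]\tTerm Source REF\tProtocol REF\n',
    '\t\n',
    'Source Name\tCharacteristics [strain]\tTerm Source REF\tProtocol REF\tProtocol REF\n',
]
terms = unique_terms(sample_lines)
print('The %i unique terms are:\n\t%s' % (len(terms), '\n\t'.join(terms)))
```

Keeping first-seen order makes the result easy to compare against the original headers; wrapping the result in `sorted()` would give an alphabetical list instead.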
Attachments
Source Name	Characteristics [Organism]	Term Source REF	Characteristics [SampleType]	Term Source REF	Characteristics [OrganismPart]	Term Source REF	Characteristics [Ecotype]	Term Source REF	Characteristics [BioSourceProvider]	Term Source REF	Protocol REF	Extract Name	Material Type	Protocol REF	Labeled Extract Name	Label	Material Type	Protocol REF	Hybridization Name	Array Design REF	Term Source REF	Comment [Array Design URI]	Protocol REF	Factor Value [Ecotype]	Term Source REF	Factor Value [OrganismPart]	Scan Name	Array Data File	Comment [ArrayExpress FTP file]	Comment [ArrayExpress Data Retrieval URI]
	
Source Name	Characteristics [Genotype]	Characteristics [Organism]	Term Source REF	Characteristics [DevelopmentalStage]	Characteristics [Ecotype]	Characteristics [ABRC seed stock id]	Term Source REF	Provider	Protocol REF	Parameter Value [day humidity]	Parameter Value [day temperature]	Unit [TemperatureUnit]	Parameter Value [night temperature]	Parameter Value [light source]	Parameter Value [night humidity]	Parameter Value [light hours]	Unit [TimeUnit]	Parameter Value [media]	Parameter Value [light intensity]	Sample Name	Protocol REF	Parameter Value [Extracted product]	Parameter Value [Amplification]	Extract Name	Material Type	Term Source REF	Protocol REF	Labeled Extract Name	Label	Material Type	Term Source REF	Protocol REF	Hybridization Name	Array Design REF	Term Source REF	Comment [Array Design URI]	Protocol REF	Factor Value [genotype]	Scan Name	Array Data File	Comment [ArrayExpress FTP file]	Comment [ArrayExpress Data Retrieval URI]	Protocol REF	Derived Array Data Matrix File	Comment [Derived ArrayExpress FTP file]	Comment [Derived ArrayExpress Data Retrieval URI]
	
Source Name	Characteristics [Organism]	Term Source REF	Characteristics [Origin]	Provider	Description	Protocol REF	Extract Name	Material Type	Description	Protocol REF	Labeled Extract Name	Label	Material Type	Protocol REF	Protocol REF	Hybridization Name	Array Design REF	Term Source REF	Comment [Array Design URI]	Factor Value [Duration of Exposure]	Unit [TimeUnit]	Factor Value [bkv]	Scan Name	Array Data File	Comment [ArrayExpress FTP file]	Comment [ArrayExpress Data Retrieval URI]
	
Source Name	Characteristics [strain]	Term Source REF	Characteristics [sex]	Term Source REF	Characteristics [Organism]	Term Source REF	Provider	Protocol REF	Parameter Value [dose]	Unit [ConcentrationUnit]	Parameter Value [compound]	Parameter Value [diet availability]	Parameter Value [diet]	Parameter Value [sacrifice method]	Sample Name	Characteristics [organism part]	Term Source REF	Protocol REF	Extract Name	Material Type	Protocol REF	Labeled Extract Name	Label	Material Type	Protocol REF	Protocol REF	Hybridization Name	Array Design REF	Term Source REF	Comment [Array Design URI]	Factor Value [compound]	Term Source REF	Factor Value [time]	Unit [TimeUnit]	Factor Value [strain]	Scan Name	Derived Array Data Matrix File	Comment [Derived ArrayExpress FTP file]	Comment [Derived ArrayExpress Data Retrieval URI]
	
Source Name	Characteristics [Organism]	Term Source REF	Characteristics [StrainOrLine]	Term Source REF	Characteristics [Genotype]	Term Source REF	Protocol REF	Parameter Value [temperature]	Parameter Value [seeding]	Parameter Value [agitation]	Parameter Value [time]	Parameter Value [volume]	Parameter Value [aeration]	Parameter Value [medium]	Sample Name	Protocol REF	Parameter Value [volume]	Parameter Value [aeration]	Parameter Value [medium]	Parameter Value [time]	Parameter Value [temperature]	Parameter Value [agitation]	Parameter Value [seeding]	Sample Name	Protocol REF	Parameter Value [yield]	Parameter Value [quantitation method]	Parameter Value [bacteria harvested]	Parameter Value [RNA stabilisation]	Extract Name	Material Type	Term Source REF	Protocol REF	Parameter Value [fluorescent label]	Parameter Value [DNA quantity]	Parameter Value [labelled extract yield]	Parameter Value [flourescence label]	Parameter Value [DNA quantitiy]	Labeled Extract Name	Label	Material Type	Term Source REF	Protocol REF	Parameter Value [Wash B temperature]	Parameter Value [Wash B time 2]	Parameter Value [Wash A temperature]	Parameter Value [Wash A time]	Parameter Value [hybridization volume]	Parameter Value [Wash B time 1]	Parameter Value [hybridization time]	Parameter Value [hybridization temperature]	Protocol REF	Parameter Value [Per Chip Percentile]	Parameter Value [Per Chip Normalization]	Parameter Value [Per Spot Normalization]	Parameter Value [Per Chip Background Correction]	Parameter Value [Per Chip Positive Control Genes]	Parameter Value [Data Transformation]	Parameter Value [Per Spot Cutoff]	Hybridization Name	Array Design REF	Term Source REF	Comment [Array Design URI]	Factor Value [genetic_modification]	Factor Value [incubate]	Term Source REF	Protocol REF
	
Source Name	Characteristics [Ecotype]	Characteristics [Organism]	Term Source REF	Characteristics [DevelopmentalStage]	Characteristics [Source]	Characteristics [OrganismPart]	Term Source REF	Characteristics [Genotype]	Protocol REF	Protocol REF	Sample Name	Protocol REF	Extract Name	Material Type	Term Source REF	Protocol REF	Labeled Extract Name	Label	Material Type	Term Source REF	Protocol REF	Protocol REF	Hybridization Name	Array Design REF	Term Source REF	Comment [Array Design URI]	Protocol REF	Factor Value [period of infection]	Unit [TimeUnit]	Factor Value [pathogen]	Scan Name	Array Data File	Comment [ArrayExpress FTP file]	Comment [ArrayExpress Data Retrieval URI]
	
Source Name	Characteristics [Sex]	Term Source REF	Characteristics [DiseaseState]	Term Source REF	Characteristics [Organism]	Description	Protocol REF	Sample Name	Characteristics [OrganismPart]	Term Source REF	Description	Protocol REF	Extract Name	Material Type	Term Source REF	Description	Protocol REF	Labeled Extract Name	Label	Material Type	Term Source REF	Description	Protocol REF	Hybridization Name	Array Design REF	Term Source REF	Comment [Array Design URI]	Protocol REF	Factor Value [Diabetic State]	Term Source REF	Scan Name	Array Data File	Comment [ArrayExpress FTP file]	Comment [ArrayExpress Data Retrieval URI]	Derived Array Data Matrix File	Comment [Derived ArrayExpress FTP file]	Comment [Derived ArrayExpress Data Retrieval URI]
	
Source Name	Characteristics [OrganismPart]	Term Source REF	Characteristics [DiseaseState]	Characteristics [Organism]	Term Source REF	Protocol REF	Sample Name	Protocol REF	Extract Name	Material Type	Term Source REF	Protocol REF	Labeled Extract Name	Label	Material Type	Term Source REF	Protocol REF	Hybridization Name	Array Design REF	Term Source REF	Comment [Array Design URI]	Protocol REF	Factor Value [DiseaseState]	Scan Name	Array Data File	Comment [ArrayExpress FTP file]	Comment [ArrayExpress Data Retrieval URI]	Protocol REF	Derived Array Data Matrix File	Comment [Derived ArrayExpress FTP file]	Comment [Derived ArrayExpress Data Retrieval URI]
	
Source Name	Characteristics [Age]	Characteristics [TimeUnit]	Characteristics [GeneticVariationType]	Characteristics [DiseaseState]	Characteristics [SurvivalTime]	Characteristics [DifferentiationGrade]	Characteristics [Sex]	Term Source REF	Characteristics [RecidiveFreeSurvival]	Characteristics [OrganismPart]	Term Source REF	Characteristics [Organism]	Term Source REF	Characteristics [DevelopmentalStage]	Characteristics [CellType]	Characteristics [ClinicalStage]	Characteristics [Progress]	Characteristics [Died]	Protocol REF	Parameter Value [Amplification]	Parameter Value [Extracted nucleic acid]	Protocol REF	Protocol REF	Parameter Value [Time]	Parameter Value [Time unit]	Parameter Value [Temperature (in C)]	Parameter Value [Media]	Extract Name	Material Type	Term Source REF	Protocol REF	Parameter Value [Mass unit]	Parameter Value [Amplification]	Parameter Value [Amount of nucleic acid labeled]	Labeled Extract Name	Label	Material Type	Protocol REF	Parameter Value [Mass unit]	Parameter Value [Volume]	Parameter Value [Temperature (in C)]	Parameter Value [Duration unit]	Parameter Value [Volume unit]	Parameter Value [Duration]	Parameter Value [Quantity of labled extract used]	Parameter Value [Chamber type]	Hybridization Name	Array Design REF	Term Source REF	Comment [Array Design URI]	Scan Name	Array Data File	Comment [ArrayExpress FTP file]	Comment [ArrayExpress Data Retrieval URI]	Protocol REF	Derived Array Data Matrix File	Comment [Derived ArrayExpress FTP file]	Comment [Derived ArrayExpress Data Retrieval URI]
	
Source Name	Characteristics [Organism]	Term Source REF	Characteristics [CellType]	Characteristics [CellLine]	Protocol REF	Protocol REF	Protocol REF	Parameter Value [Temperature]	Unit [TemperatureUnit]	Parameter Value [NaCl]	Unit [ConcentrationUnit]	Extract Name	Material Type	Protocol REF	Labeled Extract Name	Label	Material Type	Protocol REF	Protocol REF	Protocol REF	Hybridization Name	Array Design REF	Term Source REF	Comment [Array Design URI]	Protocol REF	Factor Value [NaCl]	Unit [ConcentrationUnit]	Factor Value [Temperature]	Unit [TemperatureUnit]	Scan Name	Array Data File	Comment [ArrayExpress FTP file]	Comment [ArrayExpress Data Retrieval URI]	Derived Array Data Matrix File	Comment [Derived ArrayExpress FTP file]	Comment [Derived ArrayExpress Data Retrieval URI]	
	
Source Name	Characteristics [InitialTimePoint]	Term Source REF	Characteristics [Organism]	Term Source REF	Characteristics [OrganismPart]	Characteristics [TargetedCellType]	Characteristics [DevelopmentalStage]	Protocol REF	Parameter Value [start time]	Parameter Value [media]	Parameter Value [min temperature]	Parameter Value [max temperature]	Parameter Value [stop time]	Sample Name	Protocol REF	Parameter Value [amplification]	Parameter Value [extracted product]	Extract Name	Material Type	Protocol REF	Parameter Value [amplification]	Parameter Value [label used]	Parameter Value [amount of nucleic acid labeled]	Labeled Extract Name	Label	Material Type	Protocol REF	Parameter Value [chamber type]	Parameter Value [temperature]	Unit [TemperatureUnit]	Parameter Value [quantity of label target used]	Unit [MassUnit]	Parameter Value [volume]	Unit [VolumeUnit]	Parameter Value [time]	Unit [TimeUnit]	Hybridization Name	Array Design REF	Term Source REF	Comment [Array Design URI]	Scan Name	Array Data File	Comment [ArrayExpress FTP file]	Comment [ArrayExpress Data Retri