Parsing attributes from sdrf.txt files and extracting unique terms for all sdrf.txt

Question

haojam 0 Light Poster

13 Years Ago

Dear Sir,

I would like to extract only the unique terms across all sdrf.txt files, but this Python code outputs the unique terms for each file individually. Terms such as Array Data File and Array Design REF are repeated in most of the sdrf.txt files, so I don't want to print them as unique terms. Could you also please tell me how to ignore case in Python? At the moment Characteristics[OrganismPart] is printed as a separate unique term from Characteristics[organism part], and likewise Characteristics[Sex] and Characteristics[sex]. I look forward to your reply.

Regards,
Haobijam

#!/usr/bin/python
import glob

outfile = open('output.txt' , 'w')
files = glob.glob('*.sdrf.txt')
previous = set()
for file in files:
    print('\n'+file)
    infile = open(file)
    #previous = set() # uncomment this if do not need to be unique between the files
    for line in infile:
        # only the header row, which starts with 'Source Name', holds the attributes
        if not line.startswith('Source Name') : continue
        # keep the header row as one string, without its trailing newline
        header = line.rstrip('\n')
        output = "%s\n"%(header)
        outfile.write(output)
        uniqwords = set(word.strip() for word in header.split('\t')
                        if word.strip() and word.strip() not in previous)
        print('The %i unique terms are:\n\t%s' % (len(uniqwords),'\n\t'.join(sorted(uniqwords))))
        previous |=  uniqwords 
    infile.close()
outfile.close()
print('='*80)
print('The %i terms are:\n\t%s' % (len(previous),'\n\t'.join(sorted(previous))))
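
The case question above can be approached by comparing terms through a normalized key rather than the raw string. The following is a minimal sketch, not a tested solution for the attached files: it assumes that differences in case and whitespace are the only ones that should be ignored, and the helper names `normalize`, `unique_terms` and `read_header` are made up for illustration.

```python
import glob
import re


def normalize(term):
    """Return a comparison key: lower-cased, with all whitespace removed."""
    return re.sub(r"\s+", "", term).lower()


def unique_terms(header_lines):
    """Map each normalized key to the first spelling seen for it."""
    seen = {}
    for line in header_lines:
        for word in line.rstrip("\n").split("\t"):
            word = word.strip()
            if word:
                # setdefault keeps the first spelling and ignores later variants
                seen.setdefault(normalize(word), word)
    return seen


def read_header(path):
    """Return the first line starting with 'Source Name', or '' if none."""
    with open(path) as infile:
        for line in infile:
            if line.startswith("Source Name"):
                return line
    return ""


if __name__ == "__main__":
    headers = [read_header(path) for path in glob.glob("*.sdrf.txt")]
    terms = unique_terms(headers)
    print("The %i terms are:\n\t%s" % (len(terms), "\n\t".join(sorted(terms.values()))))
```

With this key, Characteristics [OrganismPart] and Characteristics [organism part] collapse into one entry, and the spelling that is printed is whichever was read first. Removing whitespace as well as lower-casing is a design choice; lower-casing alone would still treat "OrganismPart" and "organism part" as different terms.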


sdrf.txt_.zip (99 KB)

Gribouillis 1,391 Programming Explorer

13 Years Ago

This thread is a duplicate of http://www.daniweb.com/forums/thread317912.html; see my answer there.

haojam 0 Light Poster · Answer 1 · 2010-10-19T17:15:14+00:00

Dear Sir,
I have written a Python script to parse the attributes (i.e. the first line of each sdrf.txt file; the files are attached here as a zip file). I would also like to extract the unique terms from these attributes (output_att.txt) across all sdrf.txt files. Could you please help me?

Regards
Haobijam

#!/usr/bin/python
import glob
#import linecache
outfile = open('output_att.txt' , 'w')
files = glob.glob('*.sdrf.txt')
for file in files:
    infile = open(file)
    #count = 0
    for line in infile:
        
        lineArray = line.rstrip()
        if not line.startswith('Source Name') : continue
        #count = count + 1
        lineArray = line.split('%s\t')
        print(lineArray[0])
        output = "%s\t\n"%(lineArray[0])
        outfile.write(output)
    infile.close()
outfile.close()
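
If "unique" is read as "appears in only one file", one possible approach is to count how many header rows contain each term. This is a sketch under that assumption only; the function names `file_counts` and `terms_in_one_file` are hypothetical, the two example rows are shortened stand-ins rather than real lines from output_att.txt, and terms are compared as exact strings, so case is not ignored here.

```python
from collections import Counter


def file_counts(headers):
    """Count, for each term, how many header rows (one per file) contain it."""
    counts = Counter()
    for header in headers:
        # a set, so a term repeated within one header counts once for that file
        terms = set(word.strip() for word in header.split("\t") if word.strip())
        counts.update(terms)
    return counts


def terms_in_one_file(headers):
    """Return the sorted terms that occur in exactly one header row."""
    return sorted(term for term, n in file_counts(headers).items() if n == 1)


# Shortened, made-up header rows standing in for the lines of output_att.txt.
example = [
    "Source Name\tCharacteristics [Organism]\tTerm Source REF\tArray Data File\n",
    "Source Name\tCharacteristics [sex]\tTerm Source REF\tArray Data File\n",
]
print(terms_in_one_file(example))
```

Terms shared by both rows, such as Array Data File and Term Source REF, are left out of the result. Changing the `n == 1` test to a different threshold would instead select terms that are rare or common across the files.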

output_att.txt (31.98 KB)

The attachment preview is chopped off after the first 10 KB. Please download the entire file.

Source Name	Characteristics [Organism]	Term Source REF	Characteristics [SampleType]	Term Source REF	Characteristics [OrganismPart]	Term Source REF	Characteristics [Ecotype]	Term Source REF	Characteristics [BioSourceProvider]	Term Source REF	Protocol REF	Extract Name	Material Type	Protocol REF	Labeled Extract Name	Label	Material Type	Protocol REF	Hybridization Name	Array Design REF	Term Source REF	Comment [Array Design URI]	Protocol REF	Factor Value [Ecotype]	Term Source REF	Factor Value [OrganismPart]	Scan Name	Array Data File	Comment [ArrayExpress FTP file]	Comment [ArrayExpress Data Retrieval URI]
	
Source Name	Characteristics [Genotype]	Characteristics [Organism]	Term Source REF	Characteristics [DevelopmentalStage]	Characteristics [Ecotype]	Characteristics [ABRC seed stock id]	Term Source REF	Provider	Protocol REF	Parameter Value [day humidity]	Parameter Value [day temperature]	Unit [TemperatureUnit]	Parameter Value [night temperature]	Parameter Value [light source]	Parameter Value [night humidity]	Parameter Value [light hours]	Unit [TimeUnit]	Parameter Value [media]	Parameter Value [light intensity]	Sample Name	Protocol REF	Parameter Value [Extracted product]	Parameter Value [Amplification]	Extract Name	Material Type	Term Source REF	Protocol REF	Labeled Extract Name	Label	Material Type	Term Source REF	Protocol REF	Hybridization Name	Array Design REF	Term Source REF	Comment [Array Design URI]	Protocol REF	Factor Value [genotype]	Scan Name	Array Data File	Comment [ArrayExpress FTP file]	Comment [ArrayExpress Data Retrieval URI]	Protocol REF	Derived Array Data Matrix File	Comment [Derived ArrayExpress FTP file]	Comment [Derived ArrayExpress Data Retrieval URI]
	
Source Name	Characteristics [Organism]	Term Source REF	Characteristics [Origin]	Provider	Description	Protocol REF	Extract Name	Material Type	Description	Protocol REF	Labeled Extract Name	Label	Material Type	Protocol REF	Protocol REF	Hybridization Name	Array Design REF	Term Source REF	Comment [Array Design URI]	Factor Value [Duration of Exposure]	Unit [TimeUnit]	Factor Value [bkv]	Scan Name	Array Data File	Comment [ArrayExpress FTP file]	Comment [ArrayExpress Data Retrieval URI]
	
Source Name	Characteristics [strain]	Term Source REF	Characteristics [sex]	Term Source REF	Characteristics [Organism]	Term Source REF	Provider	Protocol REF	Parameter Value [dose]	Unit [ConcentrationUnit]	Parameter Value [compound]	Parameter Value [diet availability]	Parameter Value [diet]	Parameter Value [sacrifice method]	Sample Name	Characteristics [organism part]	Term Source REF	Protocol REF	Extract Name	Material Type	Protocol REF	Labeled Extract Name	Label	Material Type	Protocol REF	Protocol REF	Hybridization Name	Array Design REF	Term Source REF	Comment [Array Design URI]	Factor Value [compound]	Term Source REF	Factor Value [time]	Unit [TimeUnit]	Factor Value [strain]	Scan Name	Derived Array Data Matrix File	Comment [Derived ArrayExpress FTP file]	Comment [Derived ArrayExpress Data Retrieval URI]
	
Source Name	Characteristics [Organism]	Term Source REF	Characteristics [StrainOrLine]	Term Source REF	Characteristics [Genotype]	Term Source REF	Protocol REF	Parameter Value [temperature]	Parameter Value [seeding]	Parameter Value [agitation]	Parameter Value [time]	Parameter Value [volume]	Parameter Value [aeration]	Parameter Value [medium]	Sample Name	Protocol REF	Parameter Value [volume]	Parameter Value [aeration]	Parameter Value [medium]	Parameter Value [time]	Parameter Value [temperature]	Parameter Value [agitation]	Parameter Value [seeding]	Sample Name	Protocol REF	Parameter Value [yield]	Parameter Value [quantitation method]	Parameter Value [bacteria harvested]	Parameter Value [RNA stabilisation]	Extract Name	Material Type	Term Source REF	Protocol REF	Parameter Value [fluorescent label]	Parameter Value [DNA quantity]	Parameter Value [labelled extract yield]	Parameter Value [flourescence label]	Parameter Value [DNA quantitiy]	Labeled Extract Name	Label	Material Type	Term Source REF	Protocol REF	Parameter Value [Wash B temperature]	Parameter Value [Wash B time 2]	Parameter Value [Wash A temperature]	Parameter Value [Wash A time]	Parameter Value [hybridization volume]	Parameter Value [Wash B time 1]	Parameter Value [hybridization time]	Parameter Value [hybridization temperature]	Protocol REF	Parameter Value [Per Chip Percentile]	Parameter Value [Per Chip Normalization]	Parameter Value [Per Spot Normalization]	Parameter Value [Per Chip Background Correction]	Parameter Value [Per Chip Positive Control Genes]	Parameter Value [Data Transformation]	Parameter Value [Per Spot Cutoff]	Hybridization Name	Array Design REF	Term Source REF	Comment [Array Design URI]	Factor Value [genetic_modification]	Factor Value [incubate]	Term Source REF	Protocol REF
	
Source Name	Characteristics [Ecotype]	Characteristics [Organism]	Term Source REF	Characteristics [DevelopmentalStage]	Characteristics [Source]	Characteristics [OrganismPart]	Term Source REF	Characteristics [Genotype]	Protocol REF	Protocol REF	Sample Name	Protocol REF	Extract Name	Material Type	Term Source REF	Protocol REF	Labeled Extract Name	Label	Material Type	Term Source REF	Protocol REF	Protocol REF	Hybridization Name	Array Design REF	Term Source REF	Comment [Array Design URI]	Protocol REF	Factor Value [period of infection]	Unit [TimeUnit]	Factor Value [pathogen]	Scan Name	Array Data File	Comment [ArrayExpress FTP file]	Comment [ArrayExpress Data Retrieval URI]
	
Source Name	Characteristics [Sex]	Term Source REF	Characteristics [DiseaseState]	Term Source REF	Characteristics [Organism]	Description	Protocol REF	Sample Name	Characteristics [OrganismPart]	Term Source REF	Description	Protocol REF	Extract Name	Material Type	Term Source REF	Description	Protocol REF	Labeled Extract Name	Label	Material Type	Term Source REF	Description	Protocol REF	Hybridization Name	Array Design REF	Term Source REF	Comment [Array Design URI]	Protocol REF	Factor Value [Diabetic State]	Term Source REF	Scan Name	Array Data File	Comment [ArrayExpress FTP file]	Comment [ArrayExpress Data Retrieval URI]	Derived Array Data Matrix File	Comment [Derived ArrayExpress FTP file]	Comment [Derived ArrayExpress Data Retrieval URI]
	
Source Name	Characteristics [OrganismPart]	Term Source REF	Characteristics [DiseaseState]	Characteristics [Organism]	Term Source REF	Protocol REF	Sample Name	Protocol REF	Extract Name	Material Type	Term Source REF	Protocol REF	Labeled Extract Name	Label	Material Type	Term Source REF	Protocol REF	Hybridization Name	Array Design REF	Term Source REF	Comment [Array Design URI]	Protocol REF	Factor Value [DiseaseState]	Scan Name	Array Data File	Comment [ArrayExpress FTP file]	Comment [ArrayExpress Data Retrieval URI]	Protocol REF	Derived Array Data Matrix File	Comment [Derived ArrayExpress FTP file]	Comment [Derived ArrayExpress Data Retrieval URI]
	
Source Name	Characteristics [Age]	Characteristics [TimeUnit]	Characteristics [GeneticVariationType]	Characteristics [DiseaseState]	Characteristics [SurvivalTime]	Characteristics [DifferentiationGrade]	Characteristics [Sex]	Term Source REF	Characteristics [RecidiveFreeSurvival]	Characteristics [OrganismPart]	Term Source REF	Characteristics [Organism]	Term Source REF	Characteristics [DevelopmentalStage]	Characteristics [CellType]	Characteristics [ClinicalStage]	Characteristics [Progress]	Characteristics [Died]	Protocol REF	Parameter Value [Amplification]	Parameter Value [Extracted nucleic acid]	Protocol REF	Protocol REF	Parameter Value [Time]	Parameter Value [Time unit]	Parameter Value [Temperature (in C)]	Parameter Value [Media]	Extract Name	Material Type	Term Source REF	Protocol REF	Parameter Value [Mass unit]	Parameter Value [Amplification]	Parameter Value [Amount of nucleic acid labeled]	Labeled Extract Name	Label	Material Type	Protocol REF	Parameter Value [Mass unit]	Parameter Value [Volume]	Parameter Value [Temperature (in C)]	Parameter Value [Duration unit]	Parameter Value [Volume unit]	Parameter Value [Duration]	Parameter Value [Quantity of labled extract used]	Parameter Value [Chamber type]	Hybridization Name	Array Design REF	Term Source REF	Comment [Array Design URI]	Scan Name	Array Data File	Comment [ArrayExpress FTP file]	Comment [ArrayExpress Data Retrieval URI]	Protocol REF	Derived Array Data Matrix File	Comment [Derived ArrayExpress FTP file]	Comment [Derived ArrayExpress Data Retrieval URI]
	
Source Name	Characteristics [Organism]	Term Source REF	Characteristics [CellType]	Characteristics [CellLine]	Protocol REF	Protocol REF	Protocol REF	Parameter Value [Temperature]	Unit [TemperatureUnit]	Parameter Value [NaCl]	Unit [ConcentrationUnit]	Extract Name	Material Type	Protocol REF	Labeled Extract Name	Label	Material Type	Protocol REF	Protocol REF	Protocol REF	Hybridization Name	Array Design REF	Term Source REF	Comment [Array Design URI]	Protocol REF	Factor Value [NaCl]	Unit [ConcentrationUnit]	Factor Value [Temperature]	Unit [TemperatureUnit]	Scan Name	Array Data File	Comment [ArrayExpress FTP file]	Comment [ArrayExpress Data Retrieval URI]	Derived Array Data Matrix File	Comment [Derived ArrayExpress FTP file]	Comment [Derived ArrayExpress Data Retrieval URI]	
	
Source Name	Characteristics [InitialTimePoint]	Term Source REF	Characteristics [Organism]	Term Source REF	Characteristics [OrganismPart]	Characteristics [TargetedCellType]	Characteristics [DevelopmentalStage]	Protocol REF	Parameter Value [start time]	Parameter Value [media]	Parameter Value [min temperature]	Parameter Value [max temperature]	Parameter Value [stop time]	Sample Name	Protocol REF	Parameter Value [amplification]	Parameter Value [extracted product]	Extract Name	Material Type	Protocol REF	Parameter Value [amplification]	Parameter Value [label used]	Parameter Value [amount of nucleic acid labeled]	Labeled Extract Name	Label	Material Type	Protocol REF	Parameter Value [chamber type]	Parameter Value [temperature]	Unit [TemperatureUnit]	Parameter Value [quantity of label target used]	Unit [MassUnit]	Parameter Value [volume]	Unit [VolumeUnit]	Parameter Value [time]	Unit [TimeUnit]	Hybridization Name	Array Design REF	Term Source REF	Comment [Array Design URI]	Scan Name	Array Data File	Comment [ArrayExpress FTP file]	Comment [ArrayExpress Data Retri

woooee 814 Nearly a Posting Maven · Answer 2 · 2010-10-19T22:54:00+00:00

This thread is a duplicate of http://www.daniweb.com/forums/thread317912.html; see my answer there.

And was copied to bytes.com. At some point you will have to write some of the code yourself.
