I have a series of data files with large headers. Here is an example:

SpectraSuite Data File
++++++++++++++++++++++++++++++++++++
Date: Fri Feb 25 13:43:55 EST 2011
User: group
Dark Spectrum Present: No
Reference Spectrum Present: No
Number of Sampled Component Spectra: 1
Spectrometers: USB2E7196
Integration Time (usec): 11000 (USB2E7196)
Spectra Averaged: 500 (USB2E7196)
Boxcar Smoothing: 0 (USB2E7196)
Correct for Electrical Dark: No (USB2E7196)
Strobe/Lamp Enabled: No (USB2E7196)
Correct for Detector Non-linearity: No (USB2E7196)
Correct for Stray Light: No (USB2E7196)
Number of Pixels in Processed Spectrum: 2048
>>>>>Begin Processed Spectral Data<<<<<
339.09 0.00
339.48 184.72
339.86 186.46
340.24 187.76
340.63 189.11
341.01 190.97
...
...
1023.36 196.86
1023.65 196.36
>>>>>End Processed Spectral Data<<<<<

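In my original post I highlighted the header and footer in red: the header is everything up to and including the ">>>>>Begin Processed Spectral Data<<<<<" line, and the footer is the ">>>>>End Processed Spectral Data<<<<<" line. I would like code that crops these sections out completely. I have done some regex matching, but I don't know the best way to identify an entire line as a string. Is there a simple, smart way to do this?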

Thanks.

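Well, assuming you know how to work with file I/O in Python, you could just have it write out every line except the ones you want to drop.
It would look similar to this (infile and outfile are already-open file objects):
writing = False
for line in infile:
    if line.startswith('>'):
        writing = not writing    # toggle at the Begin and End marker lines
    elif writing:
        outfile.write(line)
I'm sure there are better ways, but I'm still new to Python so I can't help a lot.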

This should work. I have not actually tested that it even 'compiles', but you can see the pattern, right?

with open(theFileName,'r') as inbound:
    with open(theNewFileName, 'w') as outbound:
        state = 'skip'
        for line in inbound:
            if line.startswith('>>>>>Begin Processed Spectral Data'):
                state = 'write'
                continue
            if line.startswith('>>>>>End Processed Spectral Data'):
                state = 'skip'
                continue
            if 'write' == state:
                outbound.write(line)

I suppose the continue after the End check could better be a break, which would make the state = 'skip' assignment just above it unneeded.
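For what it's worth, here is a minimal Python 3 sketch of that variant, with break at the End marker. The function name copy_spectral_data and the in-memory StringIO sample are my own assumptions for illustration; in real use you would pass two open file objects instead.

```python
import io

BEGIN = ">>>>>Begin Processed Spectral Data"
END = ">>>>>End Processed Spectral Data"

def copy_spectral_data(inbound, outbound):
    """Copy only the lines between the Begin and End markers."""
    writing = False
    for line in inbound:
        if line.startswith(BEGIN):
            writing = True      # data starts on the next line
            continue
        if line.startswith(END):
            break               # nothing of interest follows the footer
        if writing:
            outbound.write(line)

# Small in-memory demonstration (hypothetical sample, not a real data file).
sample = io.StringIO(
    "User: group\n"
    ">>>>>Begin Processed Spectral Data<<<<<\n"
    "339.09 0.00\n"
    "339.48 184.72\n"
    ">>>>>End Processed Spectral Data<<<<<\n"
)
result = io.StringIO()
copy_spectral_data(sample, result)
print(result.getvalue(), end="")
```

A boolean flag replaces the 'skip'/'write' strings here; the behavior is otherwise the same as the code above.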

Simpler, but potentially unsafe, is to write out only the lines that start with a digit:

for line in infile:
    if line[0].isdigit(): outfile.write(line)
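To illustrate why this can be unsafe, here is a small Python 3 sketch. The helper name digit_lines and the header line "2048 pixels follow" are made up for the example; the point is only that any header line that happens to start with a digit slips through the filter.

```python
def digit_lines(lines):
    """Yield only the lines whose first character is a digit."""
    for line in lines:
        # line[:1] also tolerates an empty string, unlike line[0]
        if line[:1].isdigit():
            yield line

lines = [
    "Number of Pixels in Processed Spectrum: 2048\n",
    "2048 pixels follow\n",   # hypothetical header line starting with a digit
    "339.09 0.00\n",
    "\n",
]
kept = list(digit_lines(lines))
print(kept)
```

The second line is kept even though it is not data, so this shortcut is only reasonable when you are sure no header line begins with a digit.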

Using the itertools module

import itertools as itt

def not_is_begin(line):
    return not line.startswith(">>>>>Begin Processed Spectral Data<<<<<")

def not_is_end(line):
    return not line.startswith(">>>>>End Processed Spectral Data<<<<<")

lines = itt.dropwhile(not_is_begin, open(FILENAME))
lines = itt.islice(lines, 1, None)
lines = itt.takewhile(not_is_end, lines)

for line in lines:
    print line

Thanks guys. These suggestions were all helpful and I believe I will have this working momentarily.

I like the itertools solution. However, I noticed that if the file has no header, the program finds no data, because the Begin marker never appears. I've wrapped your code in a function called import_data. Since I have files with headers and files without, I was thinking of something like this:

try:
(if first line is composed of digits)
print lines
else:
call import data (the code you wrote)


What would be the best syntax for the try statement do you think? I was thinking of doing what the poster above said:

if line[0].isdigit():

What do you think?

Here's how I implemented it:

def import_data(file):
	'''Reads in list of files and stores data in an array'''
	f=open(file, 'r')	
	line=f.readline()
	if not re.match("#", line): 
		if not line[0].isdigit():       #If first line is a letter, cut the header out

		####### Cuts out everything before the Begin marker and after the End marker; not very modular
			def not_is_begin(line):
				return not line.startswith(">>>>>Begin Processed Spectral Data<<<<<")   #CUTS THE HEADER
			def not_is_end(line):
				return not line.startswith(">>>>>End Processed Spectral Data<<<<<")
			lines=itt.dropwhile(not_is_begin, open(file))
			lines = itt.islice(lines, 1, None)
			lines = itt.takewhile(not_is_end, lines)

			for line in lines:
				line=line.strip()
				sline=line.split()
				print sline
		#######
		if line[0].isdigit():      #If first line is a number, just go on and read the data
			for line in f:		
				line=line.strip()
				sline=line.split()
				print sline

Let me know if you guys see any more improvements. As of now, it would fail if the header didn't contain the key strings (>>>>>Begin Processed Spectral Data<<<<< etc.), and it would fail if the header itself began with a number.

This code is horrible. You're saying that the code would fail if the header is different or nonexistent, but in order to write the code, we must know the different possible formats of your files. So what are these possible formats?

I don't understand the if not re.match("#", line). Why is it there? Also, there should be only one call to open() in your function.

Sometimes the files do not have headers, but have the same type of data. I wanted to incorporate this into the code you wrote, so that if the header isn't there, it still reads the data.

Any ideas?

The re.match("#", line) check is there in case I ever put comments at the beginning of my file, but it is not necessary.
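One possible way to handle both kinds of file with a single pass over one open file is sketched below in Python 3. It assumes only the two formats described in this thread (the SpectraSuite header shown in the first post, or no header at all), and the function name read_rows is my own. It also avoids losing the first data line, which the readline() call in import_data consumes before the for loop starts in the headerless case.

```python
import io
import itertools as itt

BEGIN = ">>>>>Begin Processed Spectral Data"
END = ">>>>>End Processed Spectral Data"

def read_rows(stream):
    """Return rows of split fields from a file with or without the header."""
    # Drop blank lines and '#' comment lines up front.
    lines = [ln for ln in stream if ln.strip() and not ln.startswith("#")]
    if any(ln.startswith(BEGIN) for ln in lines):
        # Header present: keep only what lies between the two markers.
        lines = itt.dropwhile(lambda ln: not ln.startswith(BEGIN), lines)
        lines = itt.islice(lines, 1, None)   # skip the Begin marker itself
        lines = itt.takewhile(lambda ln: not ln.startswith(END), lines)
    return [ln.split() for ln in lines]

# In-memory stand-ins for the two kinds of file (hypothetical samples).
with_header = io.StringIO(
    "SpectraSuite Data File\n"
    ">>>>>Begin Processed Spectral Data<<<<<\n"
    "339.09 0.00\n"
    "339.48 184.72\n"
    ">>>>>End Processed Spectral Data<<<<<\n"
)
without_header = io.StringIO("#comment\n339.09 0.00\n339.48 184.72\n")

rows_a = read_rows(with_header)
rows_b = read_rows(without_header)
print(rows_a)
print(rows_b)
```

This reads the whole file into a list first, which seems acceptable for spectra of about 2048 lines; for very large files a streaming approach would be a better fit.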

Another way, in the spirit of Python's duck-typing philosophy: a line is valid data if it can be converted to floating-point values:

import pretty
files =  ['testdata.txt', 'testdata2.txt']
values = []
perline = 2
for fn in files:
    with open(fn) as this_file:
        for line in this_file:
            try:
                value = map(float, line.strip().split())
                if len(value) == perline:
                    values.append(value)
            except ValueError as e:
                # debug
                print 'Dumped: %s'%line,
                pass

pretty.printer(values)
                    
            
""" output:
Dumped: SpectraSuite Data File
Dumped: ++++++++++++++++++++++++++++++++++++
Dumped: Date: Fri Feb 25 13:43:55 EST 2011
Dumped: User: group
Dumped: Dark Spectrum Present: No
Dumped: Reference Spectrum Present: No
Dumped: Number of Sampled Component Spectra: 1
Dumped: Spectrometers: USB2E7196
Dumped: Integration Time (usec): 11000 (USB2E7196)
Dumped: Spectra Averaged: 500 (USB2E7196)
Dumped: Boxcar Smoothing: 0 (USB2E7196)
Dumped: Correct for Electrical Dark: No (USB2E7196)
Dumped: Strobe/Lamp Enabled: No (USB2E7196)
Dumped: Correct for Detector Non-linearity: No (USB2E7196)
Dumped: Correct for Stray Light: No (USB2E7196)
Dumped: Number of Pixels in Processed Spectrum: 2048
Dumped: >>>>>Begin Processed Spectral Data<<<<<
Dumped: >>>>>End Processed Spectral Data<<<<<
Dumped: #comment

  [
    [339.09, 0.0], 
    [339.48, 184.72], 
    [339.86, 186.46], 
    [340.24, 187.76], 
    [340.63, 189.11], 
    [341.01, 190.97], 
    [1023.36, 196.86], 
    [1023.65, 196.36], 
    [339.09, 0.0], 
    [339.48, 184.72], 
    [339.86, 186.46], 
    [340.24, 187.76], 
    [340.63, 189.11], 
    [341.01, 190.97], 
    [1023.36, 196.86], 
    [1023.65, 196.36]]
"""

The input is the file you gave us, plus a copy of it without the beginning and end parts and with a comment added.
pretty is my pretty-printer module, posted in the code snippets section on DaniWeb. If you do not use it, you can manage with pprint or a simple for loop.
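Note that the code above is Python 2. In Python 3, map() returns an iterator, so len(value) would raise a TypeError. A rough Python 3 equivalent, using a list comprehension and no pretty module, might look like this; the function name parse_rows and the sample lines are assumptions for illustration.

```python
def parse_rows(lines, per_line=2):
    """Keep lines that convert to exactly per_line floats; collect the rest."""
    rows, dumped = [], []
    for line in lines:
        try:
            row = [float(field) for field in line.split()]
        except ValueError:
            dumped.append(line)      # not numeric: header, footer, comment
            continue
        if len(row) == per_line:
            rows.append(row)
        # Lines with a different number of values (including blank lines)
        # are skipped silently, as in the code above.
    return rows, dumped

sample = [
    "User: group\n",
    "339.09 0.00\n",
    "...\n",
    "1023.65 196.36\n",
    "\n",
]
rows, dumped = parse_rows(sample)
print(rows)
print(dumped)
```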

This is very cool. I will certainly try it. Thank you

One with regex.

import re

data = '''\
Correct for Detector Non-linearity: No (USB2E7196)
Correct for Stray Light: No (USB2E7196)
Number of Pixels in Processed Spectrum: 2048
>>>>>Begin Processed Spectral Data<<<<<
339.09 0.00
339.48 184.72
339.86 186.46
340.24 187.76
340.63 189.11
341.01 190.97
...
...
1023.36 196.86
1023.65 196.36
>>>>>End Processed Spectral Data<<<<<'''

r = re.compile(r'(\d+\..+)')
for match in r.finditer(data):
    print match.group()

'''Output-->
339.09 0.00
339.48 184.72
339.86 186.46
340.24 187.76
340.63 189.11
341.01 190.97
1023.36 196.86
1023.65 196.36
'''

Thanks. Can you explain this one a bit? I don't understand exactly what the re.compile() is doing.
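As far as I understand it, re.compile() just turns the pattern string into a reusable pattern object; finditer() then scans the text and yields one match object per non-overlapping match. Here is a commented Python 3 sketch of the same pattern on a shortened sample (the sample text is cut down from the first post; the variable names are mine):

```python
import re

# \d+   one or more digits
# \.    a literal dot
# .+    the rest of the line ('.' does not match a newline by default)
pattern = re.compile(r"\d+\..+")

text = (
    "Number of Pixels in Processed Spectrum: 2048\n"
    ">>>>>Begin Processed Spectral Data<<<<<\n"
    "339.09 0.00\n"
    "...\n"
    "1023.65 196.36\n"
    ">>>>>End Processed Spectral Data<<<<<\n"
)

matches = [m.group() for m in pattern.finditer(text)]
print(matches)

# The module-level function gives the same result without compiling first.
same = re.findall(r"\d+\..+", text)
```

The header line ending in 2048 is not matched because no dot follows the digits, and the "..." lines are not matched because no digit precedes the dot. Any header line containing a decimal number would be matched from that number onward, though, so this relies on the header format shown in this thread.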
