0

I'm working with some really ugly files at the moment. When I get them, they can look like any of these:

All data on one line delimited by ┌
data1|data2|data3|┌data1|data2|data3|┌data1|data2|data3|┌data1|data2|data3|┌data1|data2|data3|┌

Nice data. All the bits I'm interested in are already on one line per bit of information:
data1|data2|data3|
data1|data2|data3|

Mixed:
data1|data2|data3|┌data1|data2|data3|┌data1|data2|data3|┌
data1|data2|data3|
data1|data2|data3|

or even:
"data1|data2|data3|"
"data1|data2|data3|"
"data1|data2|data3|"
"data1|data2|data3|"

So at the moment I have this:

import os

def process_data(data):
    print '%s' % data

directory = '.'
absdir = os.path.abspath(directory)
for files in os.listdir(absdir):
    if files.startswith(('Upd', 'UPD')):
        nfile = os.path.join(absdir,files)
        print nfile
        with open(nfile, 'r') as infile:
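            for line in infile:
                line = line.strip()
                # discard blank lines
                if not line:
                    continue
                # '\x01' is the actual record delimiter; the box character
                # in the examples above is only a displayable stand-in
                if '\x01' in line:
                    for sline in line.split('\x01'):
                        # skip the empty piece after a trailing delimiter
                        if sline:
                            # drop the trailing '|'
                            process_data(sline[:-1])
                elif line.startswith('"') and line.endswith('"'):
                    # drop the opening quote and the trailing '|"'
                    process_data(line[1:-2])
                else:
                    process_data(line[:-1])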

This seems to work OK, but I'm not convinced this is the best way to go about it. Does anyone have any suggestions on how I can tidy this up?

Also the delimiter character is not really the one I have but it is the closest I could find that would display here.

1

Like:

# -*- coding: utf-8 -*-
for d in (u'data1|data2|data3|┌data1|data2|data3|┌data1|data2|data3|┌data1|data2|data3|┌data1|data2|data3|',
          u'''data1|data2|data3|┌data1|data2|data3|┌data1|data2|data3|┌
data1|data2|data3|
data1|data2|data3|''',
          u'''"data1|data2|data3|"
"data1|data2|data3|"
"data1|data2|data3|"
"data1|data2|data3|"'''):
    if d.strip():
        print(''.join(c for c in d.replace(u'┌', '\n') if c.isalnum() or  c in ('|','\n')))

Edited by pyTony

0

Thanks Tony. I'll have to figure out exactly what that does later (lunch time first :)

Quick question though. Can I then do something like this:

import os

def process_data(data):
    for d in (data):
        print d
    #    if d.strip():
    #        print(''.join(c for c in d if c.isalnum() or  c in ('|','\n')))

directory = '.'
absdir = os.path.abspath(directory)
for files in os.listdir(absdir):
    if files.startswith(('Upd', 'UPD')):
        nfile = os.path.join(absdir,files)
        with open(nfile, 'r') as infile:
            the_file = infile.read()
            process_data(the_file)

At the moment I'm getting:
d
a
t
a
1
d
a
t
a
2
d
a
t
.
.
.

So obviously I'm doing something wrong.
Cheers

1
Change

def process_data(data):
    for d in (data):
        print d
    #    if d.strip():
    #        print(''.join(c for c in d if c.isalnum() or  c in ('|','\n')))

to

def process_data(data):
    return (''.join(c for c in data.replace(u'┌', '\n').replace('\n\n','\n') if c.isalnum() or  c in ('|','\n')))

print(process_data(the_line))

It is better to return the value and print it in the caller.
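
A minimal sketch of that pattern, assuming the box character stands in for the real delimiter and that only letters, digits, '|' and newlines should survive. The sample string is made up for illustration:

```python
# -*- coding: utf-8 -*-
DELIMITER = u'┌'  # stand-in delimiter used in this thread, not the real byte


def process_data(data):
    """Return the cleaned text; printing is left to the caller."""
    text = data.replace(DELIMITER, u'\n')
    # Keep letters, digits, the '|' field separator and newlines only.
    return u''.join(c for c in text if c.isalnum() or c in (u'|', u'\n'))


sample = u'"data1|data2|data3|┌data1|data2|data3|┌"'
cleaned = process_data(sample)
print(cleaned)  # the caller decides whether to print, write or collect
```

Because the function returns a string, the same call can feed a file write or a list just as easily as a print.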

Edited by pyTony

0

Ok thanks again Tony.

I ended up with this:

# -*- coding: utf-8 -*-

import os
def process_data(data):
    return ''.join(c for c in data.replace(u'\x01', '\n') if c.isalnum() or  c in ('|','\n'))


directory = '.'
absdir = os.path.abspath(directory)
for files in os.listdir(absdir):
    if files.startswith(('Upd', 'UPD')):
        nfile = os.path.join(absdir,files)
        print nfile


        with open(nfile, 'r') as infile:
            for line in infile:
                if line.strip():
                    print(process_data(line.strip()))
0

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 10182: ordinal not in range(128)
Dernit!

0

Just to be clear I have no idea how to fix this. My guess is that Python is expecting the data to be ascii but it is something else right?

1

A known solution is

import codecs
file = codecs.open(path, encoding='iso8859-1')

see if it works for you.
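
A small sketch of what decoding on read looks like. It uses io.open, which takes the same encoding argument as codecs.open and is the built-in open on Python 3. The file name and sample bytes are made up, and ISO 8859-1 is only a guess at the real encoding:

```python
import io
import os
import tempfile

# Hypothetical sample: byte 0xe9 is an accented 'e' in ISO 8859-1.
raw = b'caf\xe9|data2|data3|\n'

path = os.path.join(tempfile.mkdtemp(), 'Upd_example.txt')  # made-up name
with open(path, 'wb') as fh:
    fh.write(raw)

# The file object decodes while reading, so each line is unicode text.
with io.open(path, encoding='iso8859-1') as infile:
    lines = [line.strip() for line in infile]

print(repr(lines))
```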

Edited by Gribouillis

0

I found the problem. The records are separated by '┌', then a space, then one or sometimes two NULL characters. I was trying to get rid of the null characters using
.replace(u'\0', '', line)

This is what brought up the error. At the moment I'm using this:

# -*- coding: utf-8 -*-
import os
import re

def process_data(data):
    return ''.join(c for c in data.replace('\x01', '\n') if c.isalnum() or c in ('|','\n'))

out = open('outddd.txt', 'w')
directory = '.'
absdir = os.path.abspath(directory)
for files in os.listdir(absdir):
    if files.startswith(('Upd', 'UPD')):
        nfile = os.path.join(absdir,files)
        print nfile


        with open(nfile, 'r') as infile:
            for line in infile:
                line = re.sub(r'\0', r'',line)
                if line.strip():
                    out.write(process_data(line.strip()))
out.close()

But I'm losing all my spaces and underscores. Any idea why this is?
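
One likely cause, judging from the code above: the generator in process_data keeps only characters for which isalnum() is true, plus '|' and '\n'. Spaces and underscores are neither, so they are filtered out. A sketch with an explicit allow-list follows; the KEEP set and the sample string are assumptions for illustration:

```python
KEEP = set(u'|\n _')  # assumed: spaces and underscores should survive too


def process_data(data, delimiter=u'\x01'):
    """Turn delimiters into newlines, then drop everything not allowed."""
    text = data.replace(delimiter, u'\n')
    return u''.join(c for c in text if c.isalnum() or c in KEEP)


sample = u'first_name|last name|\x01 \x00\x00other_name|x y|'
print(repr(process_data(sample)))
```

The same filter also drops the NUL characters, since they are not alphanumeric, so a separate substitution for them may not be needed.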

0

\0 characters usually mean that your file is encoded. Try to open it with codecs.open and the appropriate encoding (it could be 'utf8' or 'iso8859-1' or another encoding). You could try this first

with open(filename, 'rb') as ifh:
    print repr(ifh.read(4))

this may give you the BOM from which we could perhaps guess the encoding.
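
To compare those first bytes against the known byte order marks, the constants in the codecs module can be used. A rough sketch; guess_encoding_from_bom is a made-up helper name, and a missing BOM says nothing about the encoding either way:

```python
import codecs

# Longest marks first, so a UTF-32 BOM is not mistaken for UTF-16.
BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]


def guess_encoding_from_bom(head):
    """Return a codec name if `head` starts with a known BOM, else None."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None


print(guess_encoding_from_bom(codecs.BOM_UTF8 + b'abc'))
print(guess_encoding_from_bom(b'data'))
```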

Edited by Gribouillis

0

All I get when I run that is this:
'UDC_'
These characters appear at the beginning of the 'stream', if I understand this correctly. Does this help me?

I'm reading up on encodings at the moment...

0

sigh, again

So, as I was saying... If I do this:

text_file = open('example.txt')
text_file.readline()

my output looks like this:

UDC_*data|data|data|\x01 \x00\x00UDC_data|data|data|\x01 \x00\x00UDC_data|data|data|\x01 \x00\x00\x00"

If I look at the table found here: http://en.wikipedia.org/wiki/Byte_order_mark , it seems these are not proper BOMs?

0

Is there a way I can do something like this:

        with open(nfile, 'r') as infile:
            for line in infile:
                match = re.search(u'(\x..)',line)
              #  line = re.sub(r'\\x..', r'',line)
              #  line = re.sub(r'\x01', r'',line)
                #if line.strip():
                #    out.write(process_data(line))
                if match != None:
                    print match.group(1)

This works fine:
line = re.sub(r'\x01', r'',line)

But this:
match = re.search(u'(\x..)',line)

gives me an error:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 1-3: truncated \xXX escape

I want to look through all the files I have and see how many of these characters exist and what they are.
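
The SyntaxError comes from the string literal itself: \x has to be followed by exactly two hex digits, so '\x..' is not a valid escape. A character class with a range, such as [\x00-\x1f], is one way to match any control byte. A sketch that tallies them, assuming the data is read in binary mode; count_control_bytes is a made-up helper and the sample is modelled on the readline() output above:

```python
import re
from collections import Counter

# Control bytes 0x00-0x1f, leaving out tab, line feed and carriage return.
CONTROL = re.compile(br'[\x00-\x08\x0b\x0c\x0e-\x1f]')


def count_control_bytes(data):
    """Count each control byte found in a bytes object."""
    return Counter(CONTROL.findall(data))


sample = b'UDC_data|data|data|\x01 \x00\x00UDC_data|data|data|\x01 \x00\x00\x00'
counts = count_control_bytes(sample)
for byte, n in sorted(counts.items()):
    print(repr(byte), n)
```

For real files, the same function could be fed open(nfile, 'rb').read() for each file, adding the per-file Counter objects together.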

0

It doesn't look like a BOM. Did you try to open the file with the codecs module to see if it solves your accented letters issue ?

0

If I do
infile = codecs.open(nfile, encoding='iso8859-1')
I get:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8421: ordinal not in range(128)

If I try utf8 I get:
UnicodeError: UTF-16 stream does not start with BOM
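
The UnicodeEncodeError is most likely raised on the way out rather than on the way in: once the lines are decoded to unicode, Python 2 falls back to the 'ascii' codec when they are printed or written to a stream without an encoding. One way around it is to give the output file an explicit encoding as well. A sketch with made-up file names and sample bytes, again assuming ISO 8859-1 input:

```python
import io
import os
import tempfile

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'Upd_example.txt')  # hypothetical input file
dst = os.path.join(workdir, 'out_example.txt')  # hypothetical output file

with open(src, 'wb') as fh:
    fh.write(b'r\xe9sum\xe9|data2|\x01 \x00\x00next|data2|')

# Decode on read, encode on write; no implicit ASCII step in between.
with io.open(src, encoding='iso8859-1') as infile, \
        io.open(dst, 'w', encoding='utf-8') as out:
    for line in infile:
        # Delimiter becomes a newline; the NUL padding is dropped.
        out.write(line.replace(u'\x01', u'\n').replace(u'\x00', u''))

with io.open(dst, encoding='utf-8') as fh:
    result = fh.read()
print(repr(result))
```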
