We get files from a unix system that are delimited with linefeed only, this is not a problem. The problem is that within some of the fields themselves, there are carraige controls ("\r\n"). Reading the file using OS will see a row like this and stop reading at the crlf.

I am trying to figure out how to A) Read to the end of the record regardless of crlf's, and B) if I do encounter a field that has one, remove it, and C) write that row back out in a proper format windows (and SQL) can use by replacing the newlines with CRLF's.

This is the code I'm using:

import csv

class MyDialect(csv.excel): 
        lineterminator = "\n" 
csv.register_dialect("myDialect", MyDialect) 

cr = csv.reader(open("data.csv","rb"), dialect = "myDialect")
cw = csv.writer(open("clean_data.csv", "wb"))
crlf = '\r\n'
for row in cr:
        for col in row:
                if crlf in col:
                        #col.replace("\r\n", "") <-- didn't work
                        col = col.rstrip()
        cw.writerow(row)
    
print "Finished"

I tried (delimiter = '\n') without any luck either. Is there any way to get Python to ignore CRLF's all together?

Recommended Answers

All 10 Replies

I don't know much about OOP. But shouldn't it be like that.

And about the replace, can you post the error message???

import csv

class MyDialect(csv.excel): 
    self.lineterminator = "\n"  #  here
csv.register_dialect("myDialect", MyDialect) 

cr = csv.reader(open("data.csv","rb"), dialect = "myDialect")
cw = csv.writer(open("clean_data.csv", "wb"))
crlf = '\r\n'
for row in cr:
    for col in row:
        if crlf in col:
            #col.replace("\r\n", "") <-- didn't work
            col = col.rstrip()
    cw.writerow(row)
    
print "Finished"

It doesn't error out at all. Since the records are veriable lentgh to begin with, it just lops it off at the embedded CRLF and treats what follows as a new record.

There is one file crlf.py in scripts directory, is it relevant?

D:\Python27>cat Tools\Scripts\crlf.py
#! /usr/bin/env python
"Replace CRLF with LF in argument files.  Print names of changed files."

import sys, os

def main():
    for filename in sys.argv[1:]:
        if os.path.isdir(filename):
            print filename, "Directory!"
            continue
        data = open(filename, "rb").read()
        if '\0' in data:
            print filename, "Binary!"
            continue
        newdata = data.replace("\r\n", "\n")
        if newdata != data:
            print filename
            f = open(filename, "wb")
            f.write(newdata)
            f.close()

if __name__ == '__main__':
    main()

This looks to replace the CRLF at the end of a line with a LF, for a unix/linux environment. The CSV reader does read the record from start to finsh as I've discovered, but it retains the CRLF inside the columns too. Using 'COL.Replace('\r\n', '') doesn't error out, but it doesn't seem to be doing anything either.
MSSQL keeps choking on the embedded CRLF's as it tries to import, so I need to flatten out the data and get rid of all the carraige control, and add just the record end. I appreciate you're help though!!

How about count of '\r'? And .replace('\r','')?

Or doing ' '.join(col.split()) replacing white space with one space everywhere?

And something like:

newdata = data.replace("\\r\\n", "")

No luck. I tried all of the above, and switched from csv to OS using the os.linesep with no sucess.

import os

os.linesep = '\n'

file = "data.csv"
infile = open(file,'rb') 
outfile = open( 'new_' + file, 'wb') 
i = 0
LineNo = 'Record Number: '
for line in infile:
        i += 1
        print LineNo + str(i) + ' ================================================'
        print line
        
print "Finished"

infile.close() 
outfile.close()

The first record (Header) ends in a LF, so it's intact. The second record has an embeded CRLF in one of the fields, and the record ends at that line CRLF, the third record picks up in the middle of the field following the CRLF as it's starting point to the end of the record.

Isn't the OS.Linesep supposed to tell Windows what the proper line-end is supposed to be, and ignore anything else?

The only way I can think to fix this is to byte-read the file once, replacing and CRLF combinations (chr(13)+chr(10)) with an empty (''), and replace and Linefeed-Only's (chr(10)) with the CRLF.

Maybe I'm using OS.Linesep wrong?

About linefeed handling of Python I think it does it's own magic automatically so that everything looks '\n':

os.linesep: The string used to indicate line breaks. On Unix-based systems (Unix, Linux, Mac OS X), this is '\n'. However, if you want to emulate a Windows system, you can change this to '\r\n' to represent the carriage return and newline feed strings used on Windows platforms.

General Purpose Variables

Here is code I tested the removing of \r\n yesterday.

import random, string

rope = ''
for i in range(100):
    rope += random.choice(string.ascii_letters+' '*8+',.')
    if random.random()<0.1:
        rope += '\r\n'

print repr(rope)
print rope
print '-'*40
rope = ' '.join(rope.split())
print repr(rope)
print rope
print rope.count('\r'),"\\r's"

This does work and is a great way to get rid of them. The issue is not removing the CRLF's, it's getting Python to read 'past' them and to the UNIX line-end of 'LF', keeping the CRLF's inside the fields only, so that I 'can' remove them.

I will go back to the CSV with your solution, as it seems the CSV method keeps the CRLF's in the fields and see if it works.

Thanks!

Could be that you are prefering too much to use readline. In Python it is good to prefer to use open('file.csv', 'rb') the binary mode might be necessary to turn off the Python magic processing of new lines.

If you could give sample of your original input, maybe I could give simple input solution for that even without csv module.

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.