Hi Folks,

I have a set of CSV files that I read row by row with a csv.DictReader. This works fine 99% of the time, but occasionally one of the fields in a record contains an extra newline character. For example, here's the format of such a file:

field A~field B~field C~field D~field E~field F~field G
field A~field B~field C~field D~field E~field F~field G
field A~field B~field C~field D~field E~field F~field G
field A~field B~field C~field D~field E~field F~field G
field A~field B~field C~field D~field E~field

F~field G
field A~field B~field C~field D~field E~field F~field G
field A~field B~field C~field D~field E~field F~field G
field A~field B~field C~field D~field E~field F~field G
...
...

The Python code I use to read through a CSV file is:

import csv
fields = ["A","B","C","D","E","F","G"]
delim = "~"
lineReader = csv.DictReader(open('./input/26.dat', 'rb'), delimiter=delim,fieldnames=fields)
fileRows = []
for row in lineReader:
    fileRows.append(row)

This works great for MOST of the CSV files I read, but not for 'bad' files like the example above. The error I get when reading a file in this format is:

File "<stdin>", line 1, in <module>
File "/usr/lib/python2.6/csv.py", line 104, in next
row = self.reader.next()
_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?

I've tried to google the above error but I can't find anything specific to my scenario. Any suggestions?

I failed to replicate the error running your script on the data you posted.

It looks like you have Python version 2.6. I do too (2.6.4 to be exact) so I guess the problem isn't different versions. (Maybe different versions of the csv module?)

Also, posting data in quote tags may not preserve the format. Maybe if you attach the data file to your post or post it between code tags we can reproduce the error.

Edit: Gribouillis, I still have problems even after opening the file in universal-newline mode.

Yeah, I was trying to simplify the format for the sake of the thread. See the attached 45.txt for the real data. Here is the updated code, as an example of what I'm trying to do with it:

import csv
fields = ["PROGTITLE", "SUBTITLE", "EPISODE", "YEAR", "DIRECTOR", "PERFORMERS",
          "PREMIERE", "FILM", "REPEAT", "SUBTITLES", "WIDESCREEN", "NEWSERIES",
          "DEAFSIGNED", "BNW", "STARRATING", "CERTIFICATE", "GENRE", "DESCRIPTION",
          "CHOICE", "DATE", "STARTTIME", "ENDTIME", "DURATION"]
delim = '~'

lineReader = csv.DictReader(open('45.txt', 'rbU'), delimiter=delim, fieldnames=fields)

def FormatDate(date):
    return date[6:10] +"-" +date[3:5] + "-" +date[0:2]

channelPrograms = []
for row in lineReader:
    row["DATE"] = FormatDate(row["DATE"])
    channelPrograms.append(row)

The FormatDate function works for every line except the one that has \r characters in the description.

The error I get is

Traceback (most recent call last):
File "readcsv.py", line 15, in <module>
row["DATE"] = FormatDate(row["DATE"])
File "readcsv.py", line 11, in FormatDate
return date[6:10] +"-" +date[3:5] + "-" +date[0:2]
TypeError: 'NoneType' object is unsubscriptable

This is because it fails to read the record properly, but I'm not sure how to read the record correctly even if there are new line characters in the description.

Edit: See the attached screenshot which shows the extra \r\n characters in the middle of the description.

The error occurs if you call your FormatDate function with a parameter having a value of None. I don't see a relation to the presence or absence of newline characters.

>>> def FormatDate(date):
      return date[6:10] +"-" +date[3:5] + "-" +date[0:2]

>>> FormatDate('12/10/2010') #Try it with text representing a date
'2010-10-12'
>>> 
>>> FormatDate(None) #What if the parameter has no value?
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    FormatDate(None) #What if the parameter has no value?
  File "<pyshell#1>", line 2, in FormatDate
    return date[6:10] +"-" +date[3:5] + "-" +date[0:2]
TypeError: 'NoneType' object is unsubscriptable
>>>

The error is merely a symptom of the wider problem. The problem is that the reader treats the carriage return/line feed in the middle of a record as the start of a new record, even though each record should have a fixed number of fields.

You should probably roll your own reader here: read one line, split it on the delimiter ('~' in your file, not a comma), and test the length. If there are too few fields, read the next line and append it to the first before splitting again. There is probably a little more coding than that, since the "\r\n" falls in the middle of the description. Sometimes the difficult way is the easy way.
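To make that suggestion concrete, here is a minimal sketch of the read, split, and test-for-length loop. The name merge_split_records, the joiner parameter, and the choice to rejoin fragments with a single space are my own assumptions for illustration, not anything provided by the csv module. It only counts fields, so it cannot notice a break that falls inside the last field of a record.

```python
def merge_split_records(lines, delim, n_fields, joiner=" "):
    """Yield lists of field values, rejoining records broken by stray newlines.

    lines    -- any iterable of text lines, e.g. an open file
    delim    -- the field delimiter ('~' in the thread's data)
    n_fields -- how many fields a complete record has
    joiner   -- text put back where a stray newline was (an assumption)
    """
    pending = ""  # text of a record that is still missing fields
    for line in lines:
        text = line.rstrip("\r\n")
        if not text:
            continue  # blank line left behind by a stray CR/LF
        pending = pending + joiner + text if pending else text
        parts = pending.split(delim)
        if len(parts) >= n_fields:
            yield parts
            pending = ""
    if pending:
        # Incomplete record at end of input: hand it over for the caller to check.
        yield pending.split(delim)
```

Each yielded list can be turned into the same kind of row DictReader produces with dict(zip(fields, parts)), after checking that len(parts) == len(fields).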

Yes, I understand now. The reader interprets the carriage return/line feed as an end-of-line marker, so the remaining fields of that record get a value of None. Your program expects these fields to hold values other than None, so errors result.
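A small experiment shows the effect. The three-field layout and the sample lines below are made up for illustration; the only csv behaviour relied on is that DictReader fills missing fields with its restval, which defaults to None.

```python
import csv

fields = ["A", "B", "C"]
# The second record is broken across two physical lines, as in the thread.
lines = ["a1~b1~c1\n", "a2~b\n", "2~c2\n"]

rows = list(csv.DictReader(lines, fieldnames=fields, delimiter="~"))

# Both halves of the broken record come back as separate rows, each padded
# with None for the fields it is missing.
short_rows = [i for i, row in enumerate(rows) if None in row.values()]
```

Checking for None in row.values() before calling FormatDate is a cheap way to spot the damaged records, though it does not repair them.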

It would be good if there were a way to tell the csv module that the end-of-line character is not the usual CR/LF, but the docs say it is hard-coded:

The reader is hard-coded to recognise either '\r' or '\n' as end-of-line, and ignores lineterminator. This behavior may change in the future.

I guess your options include rolling your own parser, as woooee suggests, or pre-processing the file: read it, remove the stray CR/LF characters that fall inside a record, and write the result to a new file that you then use as input for the csv parsing.
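The pre-processing route could look roughly like this. It is only a sketch: write_clean_copy is a hypothetical helper, it takes already-open text file objects so the open mode stays up to you, and it assumes a record is complete once it contains n_fields - 1 delimiters. Replacing each stray line break with a single space is also an assumption about what the description text should look like.

```python
def write_clean_copy(src, dst, delim, n_fields, joiner=" "):
    """Copy src to dst, merging physical lines until each record is complete.

    Returns the number of records written. Hypothetical helper, not part of
    the csv module.
    """
    needed = n_fields - 1  # delimiters in a complete record
    pending = []           # fragments of the record being assembled
    written = 0
    for line in src:
        text = line.rstrip("\r\n")
        if not text:
            continue       # drop blank lines left behind by a stray CR/LF
        pending.append(text)
        record = joiner.join(pending)
        if record.count(delim) >= needed:
            dst.write(record + "\n")
            written += 1
            pending = []
    if pending:
        # Write an incomplete tail as-is so no data is silently dropped.
        dst.write(joiner.join(pending) + "\n")
        written += 1
    return written
```

The cleaned file can then be passed to csv.DictReader exactly as in the original script. Comparing the returned count with the number of records you expect is a quick sanity check.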