Hi Folks,

I have a set of CSV files that I read row by row with csv.DictReader. This works fine 99% of the time, but occasionally one of the fields in a record contains an extra newline character. For example, here is the format of such a file:

field A~field B~field C~field D~field E~field F~field G
field A~field B~field C~field D~field E~field F~field G
field A~field B~field C~field D~field E~field F~field G
field A~field B~field C~field D~field E~field F~field G
field A~field B~field C~field D~field E~field

F~field G
field A~field B~field C~field D~field E~field F~field G
field A~field B~field C~field D~field E~field F~field G
field A~field B~field C~field D~field E~field F~field G
...
...

The python code I have for reading through a csv file is

import csv
fields = ["A","B","C","D","E","F","G"]
delim = "~"
lineReader = csv.DictReader(open('./input/26.dat', 'rb'), delimiter=delim,fieldnames=fields)
fileRows = []
for row in lineReader:
    fileRows.append(row)

This works great for MOST of the CSV files I read, but not for 'bad' files like the example above. The error I get when reading a file in this format is

File "<stdin>", line 1, in <module>
File "/usr/lib/python2.6/csv.py", line 104, in next
row = self.reader.next()
_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?

I've tried to google the above error but I can't find anything specific to my scenario. Any suggestions?

I failed to replicate the error running your script on the data you posted.

It looks like you have Python version 2.6. I do too (2.6.4 to be exact) so I guess the problem isn't different versions. (Maybe different versions of the csv module?)

Also, posting data in quote tags may not preserve the format. Maybe if you attach the data file to your post or post it between code tags we can reproduce the error.

Edited 6 Years Ago by d5e5: n/a

Edit: Gribouillis, I still have problems even after opening the file in universal-newline mode.

Yeah, I was trying to simplify the format for the sake of the thread. See the attached 45.txt. Here is the updated code, as an example of what I'm trying to do with the data:

import csv
fields = ["PROGTITLE", "SUBTITLE", "EPISODE", "YEAR", "DIRECTOR", "PERFORMERS",
          "PREMIERE", "FILM", "REPEAT", "SUBTITLES", "WIDESCREEN", "NEWSERIES",
          "DEAFSIGNED", "BNW", "STARRATING", "CERTIFICATE", "GENRE", "DESCRIPTION",
          "CHOICE", "DATE", "STARTTIME", "ENDTIME", "DURATION"]
delim    = '~'
     
lineReader   = csv.DictReader(open('45.txt', 'rbU'), delimiter=delim,fieldnames=fields)

def FormatDate(date):
      return date[6:10] +"-" +date[3:5] + "-" +date[0:2]
      
channelPrograms = []
for row in lineReader:
   row["DATE"] = FormatDate(row["DATE"])   
   channelPrograms.append(row)

The FormatDate function works for all lines apart from the one where it has \r characters in the description.

The error I get is

Traceback (most recent call last):
File "readcsv.py", line 15, in <module>
row["DATE"] = FormatDate(row["DATE"])
File "readcsv.py", line 11, in FormatDate
return date[6:10] +"-" +date[3:5] + "-" +date[0:2]
TypeError: 'NoneType' object is unsubscriptable

This is because it fails to read the record properly, but I'm not sure how to read the record correctly even if there are new line characters in the description.

Edit: See the attached screenshot which shows the extra \r\n characters in the middle of the description.

Edited 6 Years Ago by PaulStat: n/a

Attachments
PROGTITLE~SUBTITLE~EPISODE~YEAR~DIRECTOR~PERFORMERS~PREMIERE~FILM~REPEAT~SUBTITLES~WIDESCREEN~NEWSERIES~DEAFSIGNED~BNW~STARRATING~CERTIFICATE~GENRE~DESCRIPTION~CHOICE~12/10/2010~STARTTIME~ENDTIME~DURATION
Lip Service~1/6, series 1~~~John McKay~Cat MacKenzie*Laura Fraser|Frankie Alan*Ruta Gedmintas|Tess Roberts*Fiona Button|Jay Adams*Emun Elliott|Ed MacKenzie*James Anthony Pearson|Sam Murray*Heather Peace|Lou Foster*Roxanne McKee|Cameron Alan*Tom Mannion|Karen Alan*Romana Abercromby|Becky*Cush Jumbo|Chloe*Lisa Livingstone|Sally*Alexis Peterman|Alistair Brice*Gilly Gilchrist|Ali*India Wadsworth|Carla*Ashley Lilley|Shona*Frances Mayli McCann|Ad director*Simon Thorpe|Receptionist*Moyo Omoniyi~false~false~false~true~true~true~false~false~~~Drama~With lesbians still almost invisible on British telly, this relationship drama about gay Glasgow girls inevitably feels encumbered by the banner it has to carry. But it could really be a lot more confident, and a lot better, than it is. There are two traps to avoid here. One is portraying lesbianism as exotic deviancy, on screen to entertain randy voyeurs. The other is being too self-conscious and apologetic about the subject matter. Lip Service manages to fall into both: it opens with a topless knee-trembler and closes with another sex scene that's a ludicrous attempt to shock, but in between it's a rather pappy saga about nice, straight-looking women who keep saying, "I'm a lesbian!" as if surprised, in a way that lesbians tend not to. There's no fizzing dialogue or original plotting to compensate. It's good that Lip Service exists, but you can't say much more in its favour than that.
        
       Radio Times reviewer - Jack Seale~true~12/10/2010~22:30~23:30~60
PROGTITLE~SUBTITLE~EPISODE~YEAR~DIRECTOR~PERFORMERS~PREMIERE~FILM~REPEAT~SUBTITLES~WIDESCREEN~NEWSERIES~DEAFSIGNED~BNW~STARRATING~CERTIFICATE~GENRE~DESCRIPTION~CHOICE~12/10/2010~STARTTIME~ENDTIME~DURATION
PROGTITLE~SUBTITLE~EPISODE~YEAR~DIRECTOR~PERFORMERS~PREMIERE~FILM~REPEAT~SUBTITLES~WIDESCREEN~NEWSERIES~DEAFSIGNED~BNW~STARRATING~CERTIFICATE~GENRE~DESCRIPTION~CHOICE~12/10/2010~STARTTIME~ENDTIME~DURATION
newlines.jpg 116.55 KB

The error occurs if you call your FormatDate function with a parameter having a value of None. I don't see a relation to the presence or absence of newline characters.

>>> def FormatDate(date):
      return date[6:10] +"-" +date[3:5] + "-" +date[0:2]

>>> FormatDate('12/10/2010') #Try it with text representing a date
'2010-10-12'
>>> 
>>> FormatDate(None) #What if the parameter has no value?
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    FormatDate(None) #What if the parameter has no value?
  File "<pyshell#1>", line 2, in FormatDate
    return date[6:10] +"-" +date[3:5] + "-" +date[0:2]
TypeError: 'NoneType' object is unsubscriptable
>>>

The error is merely a symptom of the wider problem: the reader treats the carriage return/line feed as the end of a record, even though it occurs in the middle of a record that has a fixed number of fields.

Edited 6 Years Ago by PaulStat: n/a

You should probably roll your own here: read one line, split it on the delimiter (a tilde in your files), and test the length. If the result is too short, read another line, split it, and append it to the first. There is probably a little more coding involved, since the "\r\n" falls in the middle of the description and the two halves of that field have to be joined back together. Sometimes the difficult way is the easy way.
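A minimal sketch of that idea, written for Python 3 rather than the 2.6 used in this thread. The name read_records and the sample lines are made up for illustration, and joining the broken pieces with a single space is an assumption about what the stray line break replaced. It also assumes the break never falls inside the last field, because in that case the field count is already complete before the continuation line arrives.

```python
FIELDS = ["A", "B", "C", "D", "E", "F", "G"]

def read_records(lines, fieldnames=FIELDS, delimiter="~"):
    """Yield one dict per logical record, re-joining records split by stray newlines."""
    pieces = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines, including ones inside a broken record
        pieces.append(text)
        # Assumption: the stray line break stood in for a single space.
        values = " ".join(pieces).split(delimiter)
        if len(values) >= len(fieldnames):
            # Enough fields collected: emit the record and start a new one.
            # zip() silently drops any values beyond the known field names.
            yield dict(zip(fieldnames, values))
            pieces = []
    if pieces:
        # Input ended before the last record reached the full field count.
        raise ValueError("incomplete record at end of input: %r" % " ".join(pieces))

# Hypothetical data: the second record is broken across three physical lines.
sample_lines = [
    "a1~b1~c1~d1~e1~f1~g1\r\n",
    "a2~b2~c2~d2~e2~field\r\n",
    "\r\n",
    "F~g2\r\n",
    "a3~b3~c3~d3~e3~f3~g3\r\n",
]

records = list(read_records(sample_lines))
for record in records:
    print(record)
```

With a real file you would pass the open file object instead of sample_lines, since iterating over it yields the physical lines one at a time.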

Edited 6 Years Ago by woooee: n/a

The error is merely a symptom of the wider problem: the reader treats the carriage return/line feed as the end of a record, even though it occurs in the middle of a record that has a fixed number of fields.

Yes, I understand now. The reader interprets the carriage return/line feed as end-of-line characters, so the record is cut short and the remaining fields for that record get a value of None. Your program expects these fields to have values other than None, so errors result.
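To make that concrete, here is a small sketch (Python 3, with made-up sample data held in an io.StringIO instead of a file) showing how DictReader fills the missing fields of a short row with None, which is its default restval:

```python
import csv
import io

fields = ["A", "B", "C", "D", "E", "F", "G"]

# One intact record, then a record broken in two by a stray newline.
sample = (
    "a1~b1~c1~d1~e1~f1~g1\n"
    "a2~b2~c2~d2~e2~f\n"
    "2~g2\n"
)

reader = csv.DictReader(io.StringIO(sample), delimiter="~", fieldnames=fields)
rows = list(reader)

for row in rows:
    print(row)
```

The broken record comes back as two rows. The first of the two is missing G, and the second has values only for A and B. Any code that then slices one of the None fields, as FormatDate does with row["DATE"], raises the TypeError shown earlier.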

It would be good if there were a way to tell the csv module that the end-of-line character is not the usual CR/LF, but the docs say it is hard-coded:

The reader is hard-coded to recognise either '\r' or '\n' as end-of-line, and ignores lineterminator. This behavior may change in the future.

I guess your options include rolling your own parser, as woooee suggests, or pre-processing the file: read it, remove the CR/LF characters that fall inside a record (not the ones that end a record), and write the result to a new file that you then use as input for the csv parsing.
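A rough sketch of the pre-processing option, again for Python 3 and with made-up sample data. The function name strip_embedded_newlines is hypothetical. It counts delimiters character by character and keeps a line break only once the current record has all of its delimiters, so it assumes every record has exactly the same number of fields and that the stray break never falls inside the last field.

```python
import csv
import io

fields = ["A", "B", "C", "D", "E", "F", "G"]

def strip_embedded_newlines(text, delimiter="~", n_fields=7):
    """Drop CR/LF characters that appear before a record has all its delimiters."""
    out = []
    seen = 0  # delimiters seen so far in the current record
    for ch in text:
        if ch == delimiter:
            seen += 1
            out.append(ch)
        elif ch in "\r\n":
            if seen >= n_fields - 1:
                out.append("\n")  # record is complete: keep one line break
                seen = 0
            # otherwise the break is inside a record (or is a blank line): drop it
        else:
            out.append(ch)
    return "".join(out)

# Hypothetical data: the second record has a CR/LF pair and a blank line inside field F.
sample = (
    "a1~b1~c1~d1~e1~f1~g1\r\n"
    "a2~b2~c2~d2~e2~fie\r\n"
    "\r\n"
    "ld~g2\r\n"
    "a3~b3~c3~d3~e3~f3~g3\r\n"
)

cleaned = strip_embedded_newlines(sample, n_fields=len(fields))
rows = list(csv.DictReader(io.StringIO(cleaned), delimiter="~", fieldnames=fields))

for row in rows:
    print(row)
```

For a real file you would read its contents into a string, write the cleaned text to a new file, and open that new file for the DictReader instead of using io.StringIO.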
