unicode related problem when reading a collection of files

Question

winecoding 0 Junior Poster in Training

10 Years Ago

I am trying to do some text processing on a collection of files stored in a directory. The data set is the standard 20 Newsgroups data. However, running the following code segment gives an error message such as UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 240: invalid start byte. I think it is related to a Unicode problem, but I am not sure how to solve it.

    9: DIR = 'C:\\Users\\Desktop\\data\\rec.sport.hockey'
   10: posts = [open(os.path.join(DIR,f)).read() for f in os.listdir(DIR)]
   11: x_train = vectorizer.fit_transform(posts)

The full traceback is as follows:

Traceback (most recent call last):
  File "C:/Users/PycharmProjects/Project3/demo10.py", line 11, in <module>
    x_train = vectorizer.fit_transform(posts)
  File "C:\Users\AppData\Roaming\Python\Python27\site-packages\sklearn\feature_extraction\text.py", line 804, in fit_transform
    self.fixed_vocabulary_)
  File "C:\Users\AppData\Roaming\Python\Python27\site-packages\sklearn\feature_extraction\text.py", line 739, in _count_vocab
    for feature in analyze(doc):
  File "C:\Users\AppData\Roaming\Python\Python27\site-packages\sklearn\feature_extraction\text.py", line 236, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Users\AppData\Roaming\Python\Python27\site-packages\sklearn\feature_extraction\text.py", line 113, in decode
    doc = doc.decode(self.encoding, self.decode_error)
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 240: invalid start byte

python

All 3 Replies

Gribouillis 1,391 Programming Explorer

10 Years Ago

"By default open() uses the ASCII encoding"

According to the documentation, the default encoding for open() in Python 3 is locale.getpreferredencoding(), not ASCII. For me it is:

>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'

You can try to guess your file's encoding with the chardet module or its command-line utility.
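
If chardet is not available, a rough stdlib-only stand-in is to try a few candidate encodings in order and keep the first one that decodes cleanly. The sketch below is not chardet and does no statistical detection; the function name guess_encoding and the candidate list are assumptions chosen for illustration. Latin-1 accepts every byte, so it is only useful as a last resort.

```python
def guess_encoding(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes `raw` without error.

    `guess_encoding` and the candidate order are hypothetical; this is a
    crude stand-in for chardet, not a real detector.
    """
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Byte 0x80 is not a valid UTF-8 start byte, but cp1252 maps it to a character.
sample = b"price: \x80 5"
encoding = guess_encoding(sample)
text = sample.decode(encoding)
```

A clean decode only shows that the bytes are legal in that encoding, not that the guess is right, so it is worth eyeballing the decoded text.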

bouncer 0 Newbie Poster · Answer 1 · 2015-05-10T23:13:21+00:00

For Python 3:
By default open() uses the ASCII encoding, which only covers the first 128 byte values. Latin-1 covers all 256, so every byte decodes to some character. Specify the encoding explicitly in your open() call by passing the keyword argument encoding='latin-1'.
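
A minimal sketch of that suggestion in Python 3, applied to the list comprehension from the question. It builds a throwaway directory with one file containing the byte 0x80 so that it can run on its own; the helper name read_posts and the sample file are assumptions, not part of the original code.

```python
import os
import tempfile


def read_posts(directory, encoding="latin-1"):
    """Read every file in `directory` as text using an explicit encoding."""
    posts = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # encoding is passed by keyword; the second positional argument is mode.
        with open(path, encoding=encoding) as handle:
            posts.append(handle.read())
    return posts


# Throwaway data standing in for the newsgroup directory.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "post1"), "wb") as handle:
        handle.write(b"Subject: hockey\n\x80 scores\n")
    posts = read_posts(demo_dir)
```

Because Latin-1 maps every byte to a code point, this never raises, but text that was really written in another encoding will come out as wrong characters rather than as an error.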

vegaseat 1,735 DaniWeb's Hypocrite Team Colleague · Answer 2 · 2015-05-11T19:14:20+00:00

My current encoding comes up 'cp1252' because I am using the Anaconda3 Python system. It might be best to set it in open() if you need a specific encoding.
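
To see which encoding a given Python 3 installation picks when none is passed, one can inspect the encoding attribute of the file object returned by open(). A small sketch; the temporary file exists only to make the example self-contained, and the printed default will differ between machines.

```python
import locale
import os
import tempfile

# Create an empty temporary file to open in text mode.
fd, path = tempfile.mkstemp()
os.close(fd)

try:
    with open(path) as handle:  # no encoding given: platform-dependent
        default_encoding = handle.encoding
    with open(path, encoding="cp1252") as handle:  # explicit: same everywhere
        explicit_encoding = handle.encoding
finally:
    os.remove(path)

print("locale preferred:", locale.getpreferredencoding(False))
print("open() default:  ", default_encoding)
print("open() explicit: ", explicit_encoding)
```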
