I am trying to run some text processing tasks on a collection of files stored in a directory. The data set is the standard 20 Newsgroups data. However, running the following code segment raises an error such as: UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 240: invalid start byte. I think it is related to a Unicode problem, but I am not sure how to solve it.

 9: DIR = 'C:\\Users\\Desktop\\data\\rec.sport.hockey'
10: posts = [open(os.path.join(DIR, f)).read() for f in os.listdir(DIR)]
11: x_train = vectorizer.fit_transform(posts)


The traceback message is as follows:

Traceback (most recent call last):
  File "C:/Users/PycharmProjects/Project3/demo10.py", line 11, in <module>
    x_train = vectorizer.fit_transform(posts)
  File "C:\Users\AppData\Roaming\Python\Python27\site-packages\sklearn\feature_extraction\text.py", line 804, in fit_transform
    self.fixed_vocabulary_)
  File "C:\Users\AppData\Roaming\Python\Python27\site-packages\sklearn\feature_extraction\text.py", line 739, in _count_vocab
    for feature in analyze(doc):
  File "C:\Users\AppData\Roaming\Python\Python27\site-packages\sklearn\feature_extraction\text.py", line 236, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Users\AppData\Roaming\Python\Python27\site-packages\sklearn\feature_extraction\text.py", line 113, in decode
    doc = doc.decode(self.encoding, self.decode_error)
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 240: invalid start byte


For Python 3:
By default open() may use the ASCII encoding, which only recognizes the first 128 values; 'Latin-1' handles all 256 byte values. So specify the encoding explicitly in your open() call by passing the keyword argument encoding='Latin-1' (the second positional parameter of open() is the mode, not the encoding).
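
To illustrate, here is a minimal Python 3 sketch. The sample bytes and the file name post.txt are made up for the demonstration; they are not taken from the newsgroup data. It shows that a file containing byte 0x80 fails to decode as UTF-8 but reads cleanly when encoding='latin-1' is passed to open().

```python
import os
import tempfile

# Hypothetical sample: byte 0x80 cannot start a UTF-8 sequence.
raw = b"price: \x80 5\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "post.txt")
    with open(path, "wb") as fh:
        fh.write(raw)

    # Strict UTF-8 decoding raises UnicodeDecodeError on the 0x80 byte.
    try:
        with open(path, encoding="utf-8") as fh:
            fh.read()
        utf8_ok = True
    except UnicodeDecodeError:
        utf8_ok = False

    # Latin-1 maps each byte value 0-255 to one code point, so it never fails.
    with open(path, encoding="latin-1") as fh:
        text = fh.read()
```

Note that Latin-1 never raises an error, but it only produces the right characters if the file really is Latin-1; otherwise the non-ASCII characters may come out wrong.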

Regarding "By default open() uses the ASCII encoding": according to the documentation, the default encoding is whatever locale.getpreferredencoding() returns, which is not necessarily ASCII. For me it is:

>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'


You can try to guess your file's encoding with the chardet module or its command-line utility.
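
If installing chardet is not an option, a much cruder stand-in is to try a few candidate encodings in order and keep the first one that decodes without error. The sketch below is not chardet and does no statistical detection; the function name guess_encoding and the candidate list are assumptions chosen for illustration.

```python
def guess_encoding(data, candidates=("ascii", "utf-8", "cp1252", "latin-1")):
    """Return the first candidate that decodes `data` (bytes) without error.

    This is trial decoding only: a successful decode does not prove the
    guess is right, it only rules out encodings that cannot be right.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the bytes, try the next
        return name
    return None  # none of the candidates worked
```

Order matters here: stricter encodings go first, and latin-1 goes last because it accepts any byte sequence.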

My current encoding comes up as 'cp1252' on my Anaconda3 Python installation. Since the default varies from system to system, it is best to set it explicitly in open() if you need a specific encoding.
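
Applied to the original question, setting the encoding in open() could look roughly like this in Python 3. The helper name read_posts and the default of 'latin-1' are assumptions for the sketch, and the temporary directory stands in for the real newsgroup folder.

```python
import os
import tempfile


def read_posts(directory, encoding="latin-1"):
    """Read every regular file in `directory` as text using `encoding`."""
    posts = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        # The context manager closes each file, unlike a bare open().read().
        with open(path, encoding=encoding) as fh:
            posts.append(fh.read())
    return posts


# Demo on a throwaway directory standing in for the newsgroup folder.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "a.txt"), "wb") as fh:
        fh.write(b"plain ascii")
    with open(os.path.join(demo_dir, "b.txt"), "wb") as fh:
        fh.write(b"byte \x80 here")
    demo_posts = read_posts(demo_dir)
```

Because the returned posts are already decoded strings, they can be passed to vectorizer.fit_transform(posts) without the vectorizer having to decode bytes itself.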