I am trying to run some text processing tasks on a collection of files stored in a directory. The data set is the standard 20 Newsgroups data. However, running the following code segment raises an error: UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 240: invalid start byte
I think it is related to a Unicode encoding problem, but I am not sure how to solve it.
9: DIR = 'C:\\Users\\Desktop\\data\\rec.sport.hockey'
10: posts = [open(os.path.join(DIR,f)).read() for f in os.listdir(DIR)]
11: x_train = vectorizer.fit_transform(posts)
The full traceback is as follows:
Traceback (most recent call last):
  File "C:/Users/PycharmProjects/Project3/demo10.py", line 11, in <module>
    x_train = vectorizer.fit_transform(posts)
  File "C:\Users\AppData\Roaming\Python\Python27\site-packages\sklearn\feature_extraction\text.py", line 804, in fit_transform
    self.fixed_vocabulary_)
  File "C:\Users\AppData\Roaming\Python\Python27\site-packages\sklearn\feature_extraction\text.py", line 739, in _count_vocab
    for feature in analyze(doc):
  File "C:\Users\AppData\Roaming\Python\Python27\site-packages\sklearn\feature_extraction\text.py", line 236, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Users\AppData\Roaming\Python\Python27\site-packages\sklearn\feature_extraction\text.py", line 113, in decode
    doc = doc.decode(self.encoding, self.decode_error)
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 240: invalid start byte
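
To narrow this down, my working assumption is that some of the files contain bytes that are not valid UTF-8. Below is a minimal sketch of a check that should list which files fail to decode and at which byte offset. The helper name find_undecodable is my own and hypothetical (it is not part of scikit-learn); it only uses os and the built-in bytes decode method.

```python
import os


def find_undecodable(directory, encoding="utf-8"):
    """Return (filename, byte offset) for each file that fails to decode.

    Hypothetical diagnostic helper; the name and signature are assumptions
    made for illustration only.
    """
    failures = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        # Read raw bytes so that no implicit decoding happens on open().
        with open(path, "rb") as handle:
            raw = handle.read()
        try:
            raw.decode(encoding)
        except UnicodeDecodeError as exc:
            # exc.start is the offset of the first offending byte.
            failures.append((name, exc.start))
    return failures
```

Calling find_undecodable(DIR) with the directory from the snippet above should, if the assumption holds, report at least one file with an offending byte, which would confirm that the input encoding is the cause rather than the vectorizer itself.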