builtins.UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd2 in position 14: invalid continuation byte

I'm getting the following error that pops up not in my script but in the codecs.py file. I've used code exactly like this in another program and it worked just fine. Any ideas? Script below.

#!/usr/bin/env python3

import sys

song = sys.argv[1]
file = open(song)
tag = b'artist='


for i in range(0,2):
    for line in file:
        if tag in line.lower():
            print(tag)

Did you check the type of line? It seems to me that it is a str (which means unicode in Python 3). Again, you are mixing bytes and str implicitly (tag in line.lower()). Use explicit conversions to control the types.
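A minimal sketch of what an explicit conversion could look like, assuming the tag is plain ASCII. contains_tag is a made-up helper name, not something from the original script:

```python
def contains_tag(line, tag=b"artist="):
    """Case-insensitive search that converts explicitly instead of mixing types."""
    if isinstance(line, str):
        # Text line: convert the bytes tag to str first (ASCII is enough for this tag).
        return tag.decode("ascii") in line.lower()
    # Bytes line: compare bytes with bytes.
    return tag in line.lower()


print(contains_tag(b"ARTIST=Some Band"))   # bytes in, bytes compared
print(contains_tag("Artist=Some Band"))    # str in, tag converted to str

# Mixing the two types implicitly is what Python 3 refuses to do:
try:
    b"artist=" in "artist=some band"
except TypeError as exc:
    print("TypeError:", exc)
```

Printing type(line) inside the loop shows which branch applies to your file object.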

Edited 2 Years Ago by Gribouillis

I tried tag as both a string and as bytes. I used tag = 'artist=' as well as tag = b'artist='. I get the same error either way.

builtins.UnicodeDecodeError: 'utf-8'

I don't know much about Unicode, but it may be that the OS is not set up for UTF-8. Try including the line below at the top of the file to tell the interpreter to use UTF-8, or whatever encoding you need, instead of the encoding set by the OS when you try to print (I think the line is correct, but it may only be close).

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

Edited 2 Years Ago by woooee

builtins.UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd2 in position 14: invalid continuation byte

You must use the traceback and print the repr of the data to find exactly which conversion yields this error (which string, converted to what).
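One way this could be done, as a rough sketch: catch the UnicodeDecodeError, print the traceback, and print the repr of the bytes that failed. try_decode is a hypothetical helper and the sample bytes are invented for illustration:

```python
import traceback


def try_decode(data, encoding="utf-8"):
    """Decode data; on failure, report which bytes could not be decoded."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        traceback.print_exc()
        # exc.object is the input bytes; exc.start and exc.end bracket the bad bytes.
        print("input   :", repr(exc.object[:40]))
        print("bad byte:", repr(exc.object[exc.start:exc.end]), "at index", exc.start)
        return None


# Invented sample: binary-looking data with a 0xd2 byte that is not valid UTF-8 here.
sample = b"\x00\x01TAG artist=\xd2xyz"
try_decode(sample)
```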

I think Gribouillis has posted some things about Unicode in Python 3 in your previous threads.
I can add a little more about this, because Unicode was a big change in Python 3.

If you get a UnicodeDecodeError,
it most likely means that you're not reading the file with the correct encoding.
You should carefully read the specification of whatever it is that you're reading and check that you're doing it right (e.g., not reading data as UTF-8 when it is really Latin-1, or whatever it needs to be).
Python 3 is much pickier about this than Python 2, because of the changes made to Unicode handling.

An example: in the interactive shell everything is OK, because we are not reading from a file and Unicode just works.

Python 3.4
>>> print('Spicy jalapeño ☂')
Spicy jalapeño ☂

Python 2.7
>>> print('Spicy jalapeño ☂')
Spicy jalapeño ☂

Save Spicy jalapeño ☂ as jala.txt and read() it.

>>> f = open('jala.txt', 'rt', encoding='utf-8')
>>> print(f.read())
Traceback (most recent call last):
  File "<pyshell#53>", line 1, in <module>
    print(f.read())
  File "C:\Python34\lib\codecs.py", line 313, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 12: invalid continuation byte

In this test I saved Spicy jalapeño ☂ with cp1252 encoding.
When I try to read it as 'utf-8', I get an error.

>>> f = open('jala.txt', 'rt', encoding='cp1252')
>>> print(f.read())
Spicy jalapeño ?

It almost works, but the umbrella is wrong: cp1252 has no umbrella character, so it comes back as ?.

Save Spicy jalapeño ☂ with utf-8 encoding.

>>> f = open('jala.txt', 'rt', encoding='utf-8')
>>> print(f.read())
Spicy jalapeño ☂

Yes, it works.

What if I try to read the file with ascii encoding?

>>> f = open('jala.txt', 'rt', encoding='ascii')
>>> print(f.read())
Traceback (most recent call last):
  File "<pyshell#64>", line 1, in <module>
    print(f.read())
  File "C:\Python34\lib\encodings\ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

As expected, I get a UnicodeDecodeError.
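A side note on the byte at position 0: 0xef is not part of the visible text. A plausible explanation (an assumption, since the post does not say which editor was used) is that the editor wrote a UTF-8 byte order mark, the bytes EF BB BF, at the start of the file. A small sketch of how that looks from Python, using in-memory bytes instead of jala.txt:

```python
import codecs

text = "Spicy jalapeño ☂"
# Assumed file contents: UTF-8 text preceded by a byte order mark.
with_bom = codecs.BOM_UTF8 + text.encode("utf-8")

print(with_bom[:3])                       # the three BOM bytes, starting with 0xef

plain = with_bom.decode("utf-8")          # BOM survives as the invisible U+FEFF
stripped = with_bom.decode("utf-8-sig")   # 'utf-8-sig' removes a leading BOM

print(ascii(plain[0]), stripped == text)
```

If the BOM gets in the way, encoding='utf-8-sig' can be passed to open() in the same way.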

Sometimes the encoding can be correct and we still get an error.
open() takes an optional errors argument to deal with such errors.
Here I use it to stop the ascii encoding from raising an error.

>>> f = open('jala.txt', 'rt', encoding='ascii', errors='replace')
>>> print(f.read())
���Spicy jalape��o ���
>>> #Or
>>> f = open('jala.txt', 'rt', encoding='ascii', errors='ignore')
>>> print(f.read())
Spicy jalapeo 

No error: it works with ascii encoding, but the funny Unicode characters are gone.
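The same error handlers can be tried on bytes directly, without a file. This is only a sketch; the 'backslashreplace' handler is an addition to the two shown above, and as far as I know it works for decoding only from Python 3.5 on:

```python
data = "Spicy jalapeño ☂".encode("utf-8")

results = {}
for handler in ("replace", "ignore", "backslashreplace"):
    # Each handler decides what happens to bytes the ascii codec cannot decode.
    results[handler] = data.decode("ascii", errors=handler)
    print(handler, "->", ascii(results[handler]))
```

'backslashreplace' keeps the undecodable bytes visible as escapes, which can help when you are still trying to work out what the data really is.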

So if you have no idea what encoding the file uses, there is no method that can give you a 100% correct answer.
There is chardet, which makes a guess.
So what will chardet guess about jala.txt?

C:\>cd python34
C:\Python34>chardetect jala.txt
jala.txt: utf-8 with confidence 0.87625

It's pretty sure that it is utf-8, and I am 100% sure, because I know that in this version I saved jala.txt with utf-8 encoding.
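For comparison, a much cruder guess can be made with the standard library alone by trying a few candidate encodings in order. This is not how chardet works; it is only a sketch, and guess_encoding and its candidate list are made-up for illustration:

```python
def guess_encoding(data, candidates=("ascii", "utf-8", "cp1252", "latin-1")):
    """Return the first candidate that decodes data without error, else None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


print(guess_encoding("Spicy jalapeño ☂".encode("utf-8")))
print(guess_encoding("Spicy jalapeño".encode("cp1252")))
```

The order matters: latin-1 accepts every possible byte, so it goes last, and a successful decode only means the bytes are valid in that encoding, not that the guess is right.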

Edited 2 Years Ago by snippsat

The files I'm working on are all binary files, .mp3, .flac, & .wav. Does that help at all?

Sure, you have to look at the specification for these file types.
For example, mp3 uses ID3 for metadata.
Search for something like "mp3 id3 character encoding".
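Since these are binary files, one way to avoid the UnicodeDecodeError altogether is to open them in binary mode, so Python never tries to decode the contents as UTF-8. A rough sketch, assuming the tag appears literally as artist= somewhere in the file (true for Vorbis comments in FLAC as far as I know, but not for ID3 frames in mp3). find_tag and its parameters are made-up names:

```python
import sys


def find_tag(path, tag=b"artist=", context=40):
    """Return the raw bytes following each case-insensitive occurrence of tag."""
    with open(path, "rb") as f:      # binary mode: no decoding happens
        data = f.read()
    lowered = data.lower()           # bytes.lower() only touches ASCII letters
    hits = []
    start = lowered.find(tag)
    while start != -1:
        value_start = start + len(tag)
        hits.append(data[value_start:value_start + context])
        start = lowered.find(tag, value_start)
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    for raw in find_tag(sys.argv[1]):
        # repr shows the exact bytes, so the right decoding can be chosen afterwards.
        print(repr(raw))
```

The bytes that come back still have to be decoded with whatever encoding the file format's specification prescribes.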

Look at eyeD3.
You can see in the source code that the author really had to think about encoding.
Just one example:

    @property
    def text_delim(self):
        assert(self.encoding is not None)
        return b"\x00\x00" if self.encoding in (UTF_16_ENCODING,
                                                UTF_16BE_ENCODING) else b"\x00"

    def _initEncoding(self):
        assert(self.header.version and len(self.header.version) == 3)
        if self.encoding is not None:
            # Make sure the encoding is valid for this version
            if self.header.version[:2] < (2, 4):
                if self.header.version[0] == 1:
                    self.encoding = LATIN1_ENCODING
                else:
                    if self.encoding > UTF_16_ENCODING:
                        # v2.3 cannot do utf16 BE or utf8
                        self.encoding = UTF_16_ENCODING
        else:
            if self.header.version[:2] < (2, 4):
                if self.header.version[0] == 2:
                    self.encoding = UTF_16_ENCODING
                else:
                    self.encoding = LATIN1_ENCODING
            else:
                self.encoding = UTF_8_ENCODING

        assert(LATIN1_ENCODING <= self.encoding <= UTF_8_ENCODING)
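The encoding constants in the excerpt above correspond to the one-byte text-encoding marker at the start of an ID3v2 text frame. A minimal sketch of how such a marker could be mapped to Python codecs, assuming the ID3v2.4 values (0 = Latin-1, 1 = UTF-16 with BOM, 2 = UTF-16BE, 3 = UTF-8). ID3_TEXT_CODECS and decode_text_frame are hypothetical names and not part of eyeD3:

```python
# Assumed mapping from the ID3v2.4 text-encoding byte to a Python codec name.
ID3_TEXT_CODECS = {0: "latin-1", 1: "utf-16", 2: "utf-16-be", 3: "utf-8"}


def decode_text_frame(payload):
    """Decode a text frame body whose first byte names the encoding."""
    encoding_byte, body = payload[0], payload[1:]
    codec = ID3_TEXT_CODECS[encoding_byte]   # KeyError for an unknown marker
    text = body.decode(codec)
    return text.rstrip("\x00")               # drop any trailing terminator


print(ascii(decode_text_frame(b"\x00Caf\xe9")))
print(ascii(decode_text_frame(b"\x03" + "Café".encode("utf-8") + b"\x00")))
```

This ignores frame headers, sizes and the version differences that the eyeD3 code handles, so treat it as an illustration of the idea only.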

Edited 2 Years Ago by snippsat
