One More Blog: Python Encodings and Unicode | One More Blog

  • Nikolaus · 1 year ago
    I'm just trying to wrap my head around this topic and found your post quite helpful, thanks. I think it is slightly redundant, but then that's probably the way most people learn things. :-) A minor nitpick: cp1252 and Latin-1 are different encodings. - I find it really weird that the official Python documentation is so quiet about this. Or at least, if this is clearly explained somewhere, it seems to be well hidden. :-)
  • ericmoritz · 1 year ago
    Ah yup, you're absolutely correct. Latin-1 is the same as ISO-8859-1,
    not cp1252. Sorry for the mix-up. Latin-1 is basically extended ASCII.

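    I like to use cp1252 instead of Latin-1 because Latin-1 is missing
    chars that cp1252 has (smart quotes, em dash and so on). If you decode
    cp1252 data as 8859-1 and those chars are in the byte stream, Python
    hands back C1 control characters instead of the punctuation you wanted,
    and a stricter parser downstream may reject them.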

    However, if you try to decode 8859-1 as 1252 you don't have that issue,
    because all the printable 8859-1 chars are the same in 1252.

    I've often seen that when people say that their encoding is 8859-1 in
    an XML header or something they actually mean cp1252. As soon as they
    stick a smart quote or em dash in their XML file my parser will break
    because I'm using 8859-1 to decode the data.
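
For anyone who wants to check the difference between the two encodings, here is a small sketch in Python 3 syntax. The byte string is made up for illustration. One caveat: CPython's own latin-1 codec maps every byte value, so it does not raise on the cp1252 punctuation bytes; any failure would come from a stricter consumer further down the line.

```python
# Bytes as a Windows editor might write them: smart quotes around "hi"
# followed by an em dash. (Sample data invented for this illustration.)
raw = b"\x93hi\x94 \x97 ok"

as_cp1252 = raw.decode("cp1252")   # curly quotes and an em dash
as_latin1 = raw.decode("latin-1")  # same length, but C1 control characters

print(repr(as_cp1252))
print(repr(as_latin1))

# Bytes 0xA0-0xFF decode to the same characters in both encodings.
high = bytes(range(0xA0, 0x100))
same_upper_range = high.decode("cp1252") == high.decode("latin-1")

# A handful of bytes (0x81 is one of them) are undefined in Python's
# cp1252 table, so decoding them strictly raises.
try:
    b"\x81".decode("cp1252")
    cp1252_rejects_0x81 = False
except UnicodeDecodeError:
    cp1252_rejects_0x81 = True
```

So the practical risk of labelling cp1252 data as 8859-1 is silently wrong characters rather than a guaranteed exception.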
  • Will · 1 year ago
    Perfect.

    The section on "when python automatically encodes/decodes" pointed me to a stinky bug in my Django code that had been bothering me for weeks. Involving the infamous en-dash, as it happens. Funny how none of the other web descriptions of UTF-8 and python have much to say about the automatic conversion thing...
  • Brian Becker · 2 months ago
    I've been programming for years (Fortran, Pascal, Clipper, VB, Lisp) but started using Python about 2 years ago when I wanted to write a program with major regex work (my regex pattern is over 25,000 chars). Anyway, I've been so frustrated by the "'ascii' codec can't" error, and I've read several articles about the issue. I finally created a try/except loop that did endless replacements by grabbing the position of the error in the except:
        while True:
            try:
                str( uploadedFile )
                break
            except:
                thisError = sys.YADA # get char that caused the problem from system error object
                uploadedFile = uploadedFile.replace( thisError, '?' )

    It worked, but depending on what was uploaded, I lost tons of characters. But THIS ARTICLE helped me so much with one phrase:
    Decode First, Encode Last

    Thanks!!!
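
The "Decode First, Encode Last" rule quoted above could look roughly like this in Python 3. This is only a sketch: the function name `normalize_upload` and the choice of cp1252 as the incoming encoding are assumptions made for the example, not something taken from the article. Passing `errors="replace"` does in one step what the retry loop was doing by hand.

```python
def normalize_upload(data: bytes, source_encoding: str = "cp1252") -> bytes:
    """Hypothetical helper: bytes in, UTF-8 bytes out, text in between."""
    # 1. Decode first: turn incoming bytes into text at the boundary.
    #    errors="replace" substitutes U+FFFD for undecodable bytes
    #    instead of raising an exception.
    text = data.decode(source_encoding, errors="replace")

    # 2. Work only with text (str) in the middle of the program.
    text = text.strip().replace("\r\n", "\n")

    # 3. Encode last: convert back to bytes only when writing out.
    return text.encode("utf-8")


# Made-up upload: "cafe" and "resume" with accents plus an em dash, in cp1252.
uploaded = b"  caf\xe9 \x97 r\xe9sum\xe9\r\n"
result = normalize_upload(uploaded)
print(result.decode("utf-8"))
```

Because the replacement happens inside the decoder, only bytes that truly cannot be decoded are lost, not every non-ASCII character.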
  • Philip · 2 months ago
    Thanks, I found this really helpful. In case it is useful, I wanted to convert some content from a web page into ASCII. I used:

    content = content.decode("utf-8") #decode byte stream to unicode
    content = content.encode("ascii", "ignore") #encode to ASCII byte stream, removing characters with codes >127
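
A Python 3 rendering of the same two steps, as a sketch. The optional NFKD normalization is an addition that is not in the comment above: it splits an accented letter into a base letter plus a combining mark, so "ignore" drops only the mark rather than the whole letter. The function name `to_ascii` and the sample text are made up for the example.

```python
import unicodedata


def to_ascii(content: bytes, keep_base_letters: bool = False) -> bytes:
    """Hypothetical helper: UTF-8 bytes in, ASCII-only bytes out."""
    text = content.decode("utf-8")  # decode byte stream to text
    if keep_base_letters:
        # NFKD decomposes an accented letter into base letter + combining mark.
        text = unicodedata.normalize("NFKD", text)
    # Encode to ASCII, silently dropping every character above code point 127.
    return text.encode("ascii", "ignore")


# Sample text: "Cafe", an en dash, "naive", written with accented letters.
page = "Caf\u00e9 \u2013 na\u00efve".encode("utf-8")

plain = to_ascii(page)
folded = to_ascii(page, keep_base_letters=True)
print(plain)
print(folded)
```

Either way the en dash disappears, since it has no ASCII equivalent under this approach.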
  • Mert Nuhoglu · 2 months ago
    Thank you very much. I have read lots of articles about Unicode in Python. This is the most useful one.