Website: http://eric.themoritzfamily.com
Original page: http://eric.themoritzfamily.com/2008/11/21/python-encodings-and-unicode/
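not cp1252. Sorry for the mix-up. Latin-1 is basically extended ASCII.
I like to use cp1252 instead of Latin-1 because Latin-1 has no printable
chars in the 0x80-0x9F range, which is where 1252 keeps smart quotes,
dashes and the like. If you decode 1252 as 8859-1 and those bytes are in
the byte stream, Python won't raise an error; it silently hands you
control characters instead of the chars you wanted.
Decoding 8859-1 as 1252 is almost always safe, because every printable
8859-1 char sits at the same byte in 1252. The only exceptions are five
bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) that Python's cp1252 codec leaves
undefined; those raise a UnicodeDecodeError.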
I've often seen that when people say that their encoding is 8859-1 in
an XML header or something they actually mean cp1252. As soon as they
stick a smart quote or em dash in their XML file my parser will break
because I'm using 8859-1 to decode the data.
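For anyone who wants to check how the two codecs treat the same bytes, here is a minimal Python 3 sketch. The sample bytes and variable names are made up for illustration; only the built-in `cp1252` and `latin-1` codecs are used.

```python
# Hypothetical sample: cp1252 bytes for a left smart quote, "hi",
# a right smart quote, a space and an em dash.
data = b"\x93hi\x94 \x97"

# Decoded with the codec that actually produced the bytes.
as_cp1252 = data.decode("cp1252")

# latin-1 maps every byte 0x00-0xFF to the code point with the same number,
# so this never raises; 0x93, 0x94 and 0x97 become C1 control characters.
as_latin1 = data.decode("latin-1")

print(ascii(as_cp1252))
print(ascii(as_latin1))

# The reverse direction can fail: Python's cp1252 codec leaves a few bytes
# (0x81 is one of them) undefined.
try:
    b"\x81".decode("cp1252")
    cp1252_rejected = False
except UnicodeDecodeError:
    cp1252_rejected = True

print(cp1252_rejected)
```

A decode that succeeds is therefore no proof that the right encoding was picked; it is worth looking at the resulting characters too.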
The section on "when python automatically encodes/decodes" pointed me to a stinky bug in my Django code that had been bothering me for weeks. Involving the infamous en-dash, as it happens. Funny how none of the other web descriptions of UTF-8 and python have much to say about the automatic conversion thing...
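while True:
    try:
        str(uploadedFile)  # Python 2: implicit ASCII encode of a unicode object
        break
    except UnicodeEncodeError as e:
        thisError = uploadedFile[e.start:e.end]  # get chars that caused the problem from the error object
        uploadedFile = uploadedFile.replace(thisError, '?')  # replace() returns a new string, so keep it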
It worked, but depending on what was uploaded, I lost tons of characters. But THIS ARTICLE helped me so much with one phrase:
Decode First, Encode Last
Thanks!!!
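A rough Python 3 sketch of what "Decode First, Encode Last" can look like for uploaded data. The function name `normalize_upload`, its parameters and the sample input are hypothetical, not taken from the article; the assumption is that the upload's real encoding is known.

```python
def normalize_upload(raw: bytes, source_encoding: str = "utf-8") -> bytes:
    """Hypothetical helper: decode at the boundary, work on text, encode on the way out."""
    # Decode first: turn the incoming bytes into text as soon as they arrive.
    text = raw.decode(source_encoding)

    # Do all processing on text, never on raw bytes.
    text = text.strip().upper()

    # Encode last: convert back to bytes only when the data leaves the program.
    return text.encode("utf-8")


# Assumed input: a cp1252 upload containing an accented letter and an en dash.
upload = " caf\u00e9 \u2013 ok ".encode("cp1252")
result = normalize_upload(upload, source_encoding="cp1252")
print(result.decode("utf-8"))
```

Nothing is swapped for '?' here: every character survives because the bytes are decoded with the encoding they were written in.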
content = content.decode("utf-8") #decode byte stream to unicode
content = content.encode("ascii", "ignore") #encode to ASCII byte stream, removing characters with codes >127
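The "ignore" handler used above drops characters without a trace. As a small illustrative comparison (Python 3, with a made-up sample string), here is the same text encoded to ASCII with three of the standard error handlers:

```python
# Hypothetical sample text containing three non-ASCII characters.
text = "na\u00efve \u2013 caf\u00e9"

# "ignore" silently drops anything outside ASCII.
ignored = text.encode("ascii", "ignore")

# "replace" substitutes a question mark, so the loss stays visible.
replaced = text.encode("ascii", "replace")

# "xmlcharrefreplace" keeps the information as numeric character references,
# which is handy when the output is HTML or XML.
char_refs = text.encode("ascii", "xmlcharrefreplace")

print(ignored)
print(replaced)
print(char_refs)
```

Which handler fits depends on where the bytes are going; if the destination can take UTF-8, encoding to UTF-8 avoids the loss altogether.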