Building a UTF-8 (Unicode) Sanitizer for Python
Problem: I am running a number of scripts that receive input from HTML forms or API clients. The system is a human/machine translation service. Internally, everything is in UTF-8, which normally works fine, but we sometimes receive ISO-Latin-1 (ISO-8859-1) and other encodings without proper headers, and then things break sporadically.
Solution: a cgi2utf() function that takes an input text and tries to convert it to UTF-8 that string functions can process without throwing a codec error and that, in the case of messy UTF-8, sanitizes it (it is OK to insert placeholder characters). For my purposes, I just need to catch a few common encodings, as well as common problems. I've been pulling my hair out over this for a long time, as I am sure many others have. It'd be nice to have a function like this that can preprocess anything coming in from a CGI interface to keep it from causing problems downstream. Any suggestions or code fragments would be welcome. If I come up with anything myself, I plan to share it, as I know this issue is a hang-up for many people.
Thanks,
Brian McConnell
www.worldwidelexicon.org
1 Reply
I would first run the bytes through Mark Pilgrim's Universal Encoding Detector (chardet), which should let you detect the common encodings you mentioned without any fuss. It also reports a confidence score for the encoding it guesses. If you still can't decode the bytes at that point, it is probably such an edge case that it'd be worthwhile to just report an error back to the user. Or you could convert each byte into ASCII. But, generally, without knowing what the encoding is, it's a bad idea to keep the data, because it could range anywhere from slightly malformed to complete garbage. So I guess the code snippet would look like:
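import chardet

class MalformedInputBadNaughtyUserError(ValueError):
    """Input could not be decoded with the detected encoding."""

def to_unicode(raw):
    assert isinstance(raw, bytes)
    # chardet.detect() returns a single best guess as a dict with
    # 'encoding' (possibly None) and 'confidence' keys, not a list of pairs.
    encoding = chardet.detect(raw)["encoding"]
    if encoding:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            pass
    # Deal with errors.
    raise MalformedInputBadNaughtyUserError("Sorry, we were unable to process your input. Could you try using ASCII only?")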
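If you would rather insert placeholder characters than reject the input, as the question allows, something along these lines might work using only the standard library. Treat it as a rough sketch and not a tested solution: the fallbacks parameter, the choice of cp1252 as the default fallback, and the label strings returned alongside the text are all my own assumptions. Note that ISO-8859-1 (latin-1) can decode any byte sequence, so if you add it to the fallbacks it will always "succeed" and the placeholder step will never run.

```python
def cgi2utf(raw, fallbacks=("cp1252",)):
    """Return (text, label) for incoming data without raising on bad bytes.

    `fallbacks` and the label strings are illustrative assumptions.
    """
    # Already-decoded text needs no work.
    if isinstance(raw, str):
        return raw, "unicode"

    # Try strict UTF-8 first, then each fallback encoding in order.
    for encoding in ("utf-8",) + tuple(fallbacks):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue

    # Last resort: keep whatever is valid UTF-8 and replace every
    # undecodable byte sequence with U+FFFD (the replacement character).
    return raw.decode("utf-8", errors="replace"), "utf-8-replaced"
```

For example, cgi2utf(b"caf\xe9") falls through to cp1252 and gives back "café", while cgi2utf(b"\x81abc") is invalid in both UTF-8 and Python's cp1252 codec, so it comes back as "\ufffdabc" with the "utf-8-replaced" label. Returning the label lets the caller log or flag inputs that needed sanitizing instead of silently accepting them.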