Message ID | 20140611103312.GE24193@spoyarek.pnq.redhat.com |
---|---|
State | New |
On Wed, 11 Jun 2014, Siddhesh Poyarekar wrote:

> + try_charsets = ['utf-8', 'windows-1252', 'ascii', 'iso-8859']

I suspect you mean iso-8859-1. (Both windows-1252 and iso-8859-1 provide
useful fallbacks for mixed-character-set patches - what's relevant is
actually the bytes of the patch, after all, when it's changing files that
don't all use the same character set, and misinterpreting those bytes as
the wrong characters doesn't matter so much. This particular patch was
actually iso-8859-1 - +/- signs in pre-existing comments in patch
context.)
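
The point about bytes versus characters can be checked with a short Python 3 sketch (the patch under review targets Python 2, so this is an illustration, not the patch's code). The sample string `raw` and the helper `decodes_as` are hypothetical names made up for the example. It shows a lone 0xB1 byte being rejected by UTF-8 but accepted by both single-byte charsets, and that iso-8859-1 maps every byte value to a character and back without loss.

```python
# Illustrative only: 'raw' and 'decodes_as' are made-up names for this sketch.
# 0xB1 is the plus-minus sign in both ISO-8859-1 and Windows-1252,
# but a lone 0xB1 is not valid UTF-8.
raw = b"tolerance \xb1 1ulp"


def decodes_as(data, charset):
    """Return True if 'data' decodes cleanly with 'charset'."""
    try:
        data.decode(charset)
        return True
    except UnicodeDecodeError:
        return False


# ISO-8859-1 assigns a character to every byte value, so decoding never
# fails and re-encoding gives back exactly the original bytes.
all_bytes = bytes(range(256))
latin1_roundtrip = (
    all_bytes.decode("iso-8859-1").encode("iso-8859-1") == all_bytes
)

print("utf-8 accepts raw:       ", decodes_as(raw, "utf-8"))
print("windows-1252 accepts raw:", decodes_as(raw, "windows-1252"))
print("iso-8859-1 accepts raw:  ", decodes_as(raw, "iso-8859-1"))
print("iso-8859-1 round-trips all 256 byte values:", latin1_roundtrip)
# Python's windows-1252 codec leaves a few byte values (such as 0x81)
# undefined, so it can still fail where iso-8859-1 cannot.
print("windows-1252 accepts all 256 byte values:",
      decodes_as(all_bytes, "windows-1252"))
```

Because iso-8859-1 never fails, placing it last in a fallback list means every payload ends up decoded, even if some characters are displayed wrongly.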

On Wed, Jun 11, 2014 at 12:11:12PM +0000, Joseph S. Myers wrote:
> On Wed, 11 Jun 2014, Siddhesh Poyarekar wrote:
>
> > + try_charsets = ['utf-8', 'windows-1252', 'ascii', 'iso-8859']
>
> I suspect you mean iso-8859-1. (Both windows-1252 and iso-8859-1 provide
> useful fallbacks for mixed-character-set patches - what's relevant is
> actually the bytes of the patch, after all, when it's changing files that
> don't all use the same character set, and misinterpreting those bytes as
> the wrong characters doesn't matter so much. This particular patch was
> actually iso-8859-1 - +/- signs in pre-existing comments in patch
> context.)

Thanks, fixed.

Siddhesh

--- a/apps/patchwork/bin/parsemail.py	2014-06-11 15:53:12.685666812 +0530
+++ b/apps/patchwork/bin/parsemail.py	2014-06-11 15:53:03.991667186 +0530
@@ -147,6 +147,13 @@
         return match.group(1)
     return None
 
+def try_decode(payload, charset):
+    try:
+        payload = unicode(payload, charset)
+    except UnicodeDecodeError:
+        return None
+    return payload
+
 def find_content(project, mail):
     patchbuf = None
     commentbuf = ''
@@ -157,15 +164,27 @@
             continue
 
         payload = part.get_payload(decode=True)
-        charset = part.get_content_charset()
         subtype = part.get_content_subtype()
 
-        # if we don't have a charset, assume utf-8
-        if charset is None:
-            charset = 'utf-8'
-
         if not isinstance(payload, unicode):
-            payload = unicode(payload, charset)
+            charset = part.get_content_charset()
+
+            # If there is no charset or if it is unknown, then try some common
+            # charsets before we fail.
+            if charset is None or charset == 'x-unknown':
+                try_charsets = ['utf-8', 'windows-1252', 'ascii', 'iso-8859']
+            else:
+                try_charsets = [charset]
+
+            for cset in try_charsets:
+                decoded_payload = try_decode(payload, cset)
+                if decoded_payload is not None:
+                    break
+            payload = decoded_payload
+
+            # Could not find a valid decoded payload. Fail.
+            if payload is None:
+                return (None, None)
 
         if subtype in ['x-patch', 'x-diff']:
             patchbuf = payload
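
For readers on Python 3, the fallback logic in the patch could be sketched roughly as below. This is not the code that was committed. `decode_payload` and `FALLBACK_CHARSETS` are hypothetical names. The list uses `iso-8859-1` as suggested in the review and omits `ascii`, on the assumption that it adds nothing after `utf-8` (every ASCII byte sequence is already valid UTF-8). Catching `LookupError` is also an addition of this sketch: it is the exception Python raises for a charset name it does not recognise, which the patch's `except UnicodeDecodeError` would not catch.

```python
# A minimal Python 3 sketch of the charset fallback idea; names other than
# try_decode are assumptions made for illustration.
FALLBACK_CHARSETS = ["utf-8", "windows-1252", "iso-8859-1"]


def try_decode(payload, charset):
    """Return 'payload' decoded with 'charset', or None if that fails."""
    try:
        return payload.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # UnicodeDecodeError: the bytes are not valid in this charset.
        # LookupError: Python does not know a codec by this name.
        return None


def decode_payload(payload, charset=None):
    """Decode a MIME part payload, trying common charsets if none is declared."""
    if isinstance(payload, str):
        return payload  # already text, nothing to do

    # With no declared charset (or the placeholder 'x-unknown'), try the
    # common ones in order; otherwise trust only the declared charset.
    if charset is None or charset == "x-unknown":
        candidates = FALLBACK_CHARSETS
    else:
        candidates = [charset]

    for cset in candidates:
        decoded = try_decode(payload, cset)
        if decoded is not None:
            return decoded
    return None  # caller treats this as "could not decode"


if __name__ == "__main__":
    print(repr(decode_payload(b"+/- written as \xb1")))
    print(repr(decode_payload(b"\xff\xfe", "utf-8")))
```

Trying `utf-8` first matters: it is the only entry that rejects most non-UTF-8 input, whereas the single-byte charsets accept almost anything and would otherwise mask correctly encoded UTF-8 text.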