Detection of unlabeled UTF-8

Detection of unlabeled UTF-8

Zack Weinberg-2
All the discussion of fallback character encodings has reminded me of an
issue I've been meaning to bring up for some time: As a user of the
en-US localization, nowadays the overwhelmingly most common situation
where I see mojibake is when a site puts UTF-8 in its pages without
declaring any encoding at all (neither via <meta charset> nor
Content-Type). It is possible to distinguish UTF-8 from most legacy
encodings heuristically with high reliability, and I'd like to suggest
that we ought to do so, independent of locale.
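
To make the claim concrete, here is a rough Python sketch of the kind of
binary check meant here; it is illustrative only, not actual Gecko code.
The idea is that bytes which decode as strict UTF-8 *and* contain at least
one non-ASCII sequence are almost certainly UTF-8, because text in a
single-byte legacy encoding very rarely happens to form valid UTF-8
multi-byte sequences.

    def looks_like_utf8(data: bytes) -> bool:
        """Binary UTF-8/unibyte heuristic (illustrative sketch only)."""
        try:
            data.decode("utf-8", errors="strict")
        except UnicodeDecodeError:
            # Invalid UTF-8 sequence: treat as a legacy single-byte encoding.
            return False
        # Pure ASCII decodes identically under every common legacy encoding,
        # so it provides no evidence either way; only claim UTF-8 when at
        # least one byte >= 0x80 (i.e. a multi-byte sequence) is present.
        return any(b >= 0x80 for b in data)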

Having read through a bunch of the "fallback encoding is wrong" bugs
Henri's been filing, I have the impression that Henri would prefer we
*not* detect UTF-8, if only to limit the amount of 'magic' platform
behavior; however, I have three counterarguments for this:

1. There exist sites that still regularly add new, UTF-8-encoded
content, but whose *structure* was laid down in the late 1990s or early
2000s, declares no encoding, and is unlikely ever to be updated again.
The example I have to hand is
http://www.eyrie-productions.com/Forum/dcboard.cgi?az=read_count&om=138&forum=DCForumID24&viewmode=threaded 
; many other posts on this forum have the same problem. Take note of the
vintage HTML. I suggested to the admins of this site that they add <meta
charset="utf-8"> to the master page template, and was told that no one
involved in current day-to-day operations has the necessary access
privileges. I suspect that this kind of situation is rather more common
than we would like to believe.

2. For some of the fallback-encoding-is-wrong bugs still open, a binary
UTF-8/unibyte heuristic would save the localization from having to
choose between displaying legacy minority-language content correctly and
displaying legacy hegemonic-language content correctly. If I understand
correctly, this is the case at least for Welsh:
https://bugzilla.mozilla.org/show_bug.cgi?id=844087 .

3. Files loaded from local disk have no encoding metadata from the
transport, and may have no in-band label either; in particular, UTF-8
plain text with no byte order mark, which is increasingly common, should
not be misidentified as the legacy encoding.  Having a binary
UTF-8/unibyte heuristic might address some of the concerns mentioned in
the "File API should not use 'universal' character detection" bug,
https://bugzilla.mozilla.org/show_bug.cgi?id=848842 .
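
Pulling these points together, a possible decision order for content with
no transport or in-band label might look like the following rough sketch.
The function and parameter names, and the windows-1252 default, are
illustrative stand-ins, not real Gecko interfaces: honour a byte order
mark if present, then apply the binary UTF-8 check, and only then fall
back to the locale's legacy encoding.

    import codecs

    def guess_unlabeled_encoding(data: bytes,
                                 locale_fallback: str = "windows-1252") -> str:
        # 1. A byte order mark is an explicit in-band label; honour it first.
        if data.startswith(codecs.BOM_UTF8):
            return "utf-8"
        if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
            return "utf-16"
        # 2. Binary UTF-8/unibyte heuristic: a strict decode plus at least
        #    one non-ASCII byte counts as positive evidence of UTF-8.
        try:
            data.decode("utf-8", errors="strict")
            if any(b >= 0x80 for b in data):
                return "utf-8"
        except UnicodeDecodeError:
            pass
        # 3. Otherwise keep today's behaviour: the locale's legacy fallback
        #    (windows-1252 here is just a placeholder).
        return locale_fallback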

If people are concerned about "infecting" the modern platform with
heuristics, perhaps we could limit application of the heuristic to
quirks mode, for HTML delivered over HTTP. I expect this would cover the
majority of the sites described under point 1, and probably 2 as well.
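
A rough sketch of that gating, with made-up parameter names standing in
for whatever the parser actually knows at that point (none of these are
real Gecko interfaces):

    def should_sniff_utf8(doc_mode: str, scheme: str,
                          has_encoding_label: bool) -> bool:
        # Only sniff when nothing declared an encoding, the document is in
        # quirks mode, and it arrived over HTTP(S).
        return (not has_encoding_label
                and doc_mode == "quirks"
                and scheme in ("http", "https"))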

zw

Re: Detection of unlabeled UTF-8

Gervase Markham
On 29/08/13 19:41, Zack Weinberg wrote:
> All the discussion of fallback character encodings has reminded me of an
> issue I've been meaning to bring up for some time: As a user of the
> en-US localization, nowadays the overwhelmingly most common situation
> where I see mojibake is when a site puts UTF-8 in its pages without
> declaring any encoding at all (neither via <meta charset> nor
> Content-Type). It is possible to distinguish UTF-8 from most legacy
> encodings heuristically with high reliability, and I'd like to suggest
> that we ought to do so, independent of locale.

That seems wise to me, on gut instinct. If the web is moving to UTF-8,
and we are trying to encourage that, then it seems we should expect that
this is what we get unless there are hints that we are wrong, whether
that's the TLD, the statistical profile of the characters, or something
else.

We don't want people to try to move to UTF-8 but then move back because
they haven't figured out how (or are technically unable) to label it
correctly and "it comes out all wrong".

Gerv



Re: Detection of unlabeled UTF-8

Robert Kaiser
In reply to this post by Zack Weinberg-2
Zack Weinberg schrieb:
> It is possible to distinguish UTF-8 from most legacy
> encodings heuristically with high reliability, and I'd like to suggest
> that we ought to do so, independent of locale.

I would very much agree with doing that. UTF-8 is the encoding being
recommended everywhere as the one to move to, and since we should be able
to detect it easily enough, we should do so and switch to it when we find
it.

Robert Kaiser


Re: Detection of unlabeled UTF-8

Neil-4
In reply to this post by Zack Weinberg-2
And then you get sites that send ISO-8859-1 but the server is configured
to send UTF-8 in the headers, e.g.
http://darwinawards.com/darwin/darwin1999-38.html
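
Worth noting: the same strict-validation idea can at least flag that case,
since ISO-8859-1 text with accented characters will almost never decode as
valid UTF-8, so a decode failure is strong evidence the declared label is
wrong. A rough, illustrative sketch (not a description of current
behaviour):

    def decode_claimed_utf8(data: bytes,
                            legacy_fallback: str = "iso-8859-1") -> str:
        try:
            # If the header's UTF-8 claim is honest, strict decoding succeeds.
            return data.decode("utf-8", errors="strict")
        except UnicodeDecodeError:
            # Bytes >= 0x80 in ISO-8859-1 text rarely form valid UTF-8
            # sequences, so failure suggests the label is wrong; fall back.
            return data.decode(legacy_fallback)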

--
Warning: May contain traces of nuts.