This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
When validating by direct HTML input I get the following result:

    This document was successfully checked as HTML 4.01 Transitional!
    Result: Passed

When validating the same HTML document by URL I get the following result:

    This document was Tentatively checked as HTML 4.01 Transitional
    Tentatively passed, 1 warning(s)
    No Character Encoding Found! Falling back to UTF-8.

So the character encoding is not checked when validating by direct HTML input. Imho the character encoding should also be checked in that mode. I usually validate my HTML documents by direct input, and only upload them to the server once validation has succeeded.
(In reply to comment #0)
> So, character encoding is not checked when using validating by direct HTML
> input. Imho the character encoding should also be checked when validating by
> direct HTML input.

When you validate by URI, the validator retrieves the resource via HTTP. What it retrieves are bytes, which must be decoded into characters - hence the importance of knowing the character encoding, whether from the charset= parameter in the HTTP headers sent by the server, from the <meta> information in the HTML, or from other possible sources. File upload works much the same way, with some minor differences (there is no web server, but the web browser pretty much plays that role).

When using direct input, however, what you give to the validator is not a series of bytes: you copy and paste characters into a form on the validator's home page. That page is encoded in UTF-8, which means the form submitted to the validator will automatically be in UTF-8, regardless of what your original content was encoded in and regardless of whatever meta information is present.

Does the validator need to check the encoding in "direct input" mode? No, per the above. Should it do so anyway? The answer here, too, is "no". Imagine that your document is encoded in ISO-8859-1 (a.k.a. Latin-1). It is properly served as Latin-1 by your web server, and there is a <meta> tag with that information too. All is well. Now imagine that you take that page source and copy-paste it into the "direct input" form of the validator: as explained above, the markup automatically becomes UTF-8 characters. Should the validator complain that it is receiving UTF-8 content when the source says "iso-8859-1"? Of course not; that would be wrong and confusing.

In other words, direct input and by-URI validation are very different paradigms, and the difference shows most clearly in the handling of encodings. It *is* confusing, and if you can think of any way to make it less confusing, ideas are welcome.
In validation by input form, there could be a warning when no <meta http-equiv /> / XML declaration is present: "Warning: the content encoding is not declared in the document." (with a bit more explanation as to why this might be a problem). Likewise, when an encoding is declared, there could be a warning (or a "notice", if such a thing exists?) saying that the document's declared encoding was not verified.
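The check proposed above could work along these lines. This is a simplified, hypothetical sketch (a real validator would use a tokenizer rather than regular expressions, and these function names are made up):

```python
import re

def encoding_declared(markup):
    """True if the markup declares an encoding via an XML declaration
    or a charset in a <meta> tag. Simplified: regexes stand in for a
    proper tokenizer here."""
    if re.search(r'<\?xml[^>]*\bencoding\s*=\s*["\'][^"\']+["\']', markup, re.I):
        return True
    if re.search(r'<meta[^>]*\bcharset\s*=\s*["\']?[\w-]+', markup, re.I):
        return True
    return False

def direct_input_note(markup):
    """Return the proposed warning text, or None when an encoding is declared."""
    if not encoding_declared(markup):
        return "Warning: the character encoding is not declared in the document."
    return None

# A document with no encoding declaration triggers the warning...
assert direct_input_note("<html><head><title>t</title></head></html>") is not None

# ...while one with a charset in <meta> does not.
decl = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=iso-8859-1"></head></html>')
assert direct_input_note(decl) is None
```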
Thank you for your prompt replies :o)

OK, it is right (of course) that validating by URI and validating by direct input are completely different paradigms. I also agree with Dominique Hazael-Massieux: when validating by direct input, a warning or notice could be displayed when there is no <meta http-equiv />, which would make it less confusing and help webmasters validate before uploading to the web server. The warning should contain exactly what Olivier Thereaux said here, so it is not possible to forget this and everyone will understand it (imho).

Validating by direct input is a handy way to validate quickly from any source, but it should not assume an encoding (imho). So only that feature needs adjusting; validating via URI remains the same, of course.

Whatever the answer turns out to be: w3.org offers great stuff, and I wish more web developers, web designers, and webmasters would use it in order to offer "better" web pages (regardless of which browser users are using) for all people.
(In reply to comment #2)
> in validation by input form, there could be warning when no <meta http-equiv
> /> / XML Declaration is present: "Warning: the content encoding is not declared
> in the document." (with a bit more of explanation as to why this might be
> problem).

Right. We can do that ** if (and only if) we actually want to recommend declaring the encoding at the document level **

The XML declaration is a liability in text/html, at least as long as IE6 has such a big market share (anything before the doctype triggers quirks mode). charset in <meta> doesn't seem to be a problem, other than the fact that it is the most misused/mistyped HTML construct ever (cue link to hixie's study). Having a <meta> can be a problem with transcoding servers, and it also seems to be widespread for HTTP and <meta> to disagree; see e.g. http://dev.opera.com/articles/view/mama-document-encodings/#agree

OTOH, I do see "For these reasons you should always ensure that encoding information is also declared inside the document." in http://www.w3.org/International/tutorials/tutorial-char-enc/#Slide0250 but the recommendation is confusing, even for me. I guess an "info" about it would be in order, then. The algorithm may not be trivial to add to the validator, but it is feasible.

> Likewise, when encoding is declared, there could be a warning (or a "notice" if
> such a thing exists?) saying that the document's correct encoding was not
> verified.

Good. I'm reopening the bug until we address the usability issue.
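The HTTP-versus-<meta> disagreement mentioned above could be detected with a check along these lines. Again a hypothetical Python sketch with invented function names, not the validator's actual code:

```python
import re

def charset_param(content_type):
    """Extract the charset= parameter from a Content-Type value, if any."""
    m = re.search(r'charset\s*=\s*["\']?([\w-]+)', content_type, re.I)
    return m.group(1).lower() if m else None

def encodings_disagree(http_content_type, meta_content):
    """True when both HTTP and <meta> declare a charset and they differ.
    Per HTML 4.01 section 5.2.2, the HTTP header takes precedence."""
    http_cs = charset_param(http_content_type)
    meta_cs = charset_param(meta_content)
    return bool(http_cs and meta_cs and http_cs != meta_cs)

# The kind of mismatch the MAMA survey linked above found to be widespread:
assert encodings_disagree("text/html; charset=utf-8",
                          "text/html; charset=iso-8859-1")
assert not encodings_disagree("text/html; charset=utf-8",
                              "text/html; charset=UTF-8")
```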
*** Bug 6458 has been marked as a duplicate of this bug. ***
(In reply to comment #4)
> I'm reopening the bug, until we address the usability issue.

I added two note outputs to the validation results:

* a suggestion to add charset info within the document when none is present
* an explanation of the forced UTF-8 in direct input mode

http://lists.w3.org/Archives/Public/www-validator-cvs/2009Jan/0195.html

The wording of the warnings can surely be improved, but I believe this solves the cases raised in this bug report.