php-src/ext/mbstring/tests/utf8_error_handling.phpt
Alex Dowad 04e59c916f Error handling for UTF-8 complies with WHATWG specification
In 7502c86342, I adjusted the number of error markers emitted on
invalid UTF-8 text to be more consistent with mbstring's behavior on
other text encodings (generally, it emits one error marker for one
unexpected byte). I didn't expect that anybody would actually care one
way or the other, but felt that it was better to be consistent than
not.

Later, Martin Auswöger kindly pointed out that the WHATWG encoding
specification, which governs how various text encodings are handled
by web browsers, does actually specify how many error markers should
be generated for any given piece of invalid UTF-8 text.

Until now, we have never really paid much attention to the WHATWG
specification, but we do want to comply with as many relevant
specifications as possible. And since PHP is commonly used for web
applications, compatibility with the behavior of web browsers is
obviously a good thing.
2022-04-16 15:04:38 +02:00


--TEST--
Confirm error handling for UTF-8 complies with WHATWG spec
--EXTENSIONS--
mbstring
--FILE--
<?php
/* The WHATWG specifies not just how web browsers should handle _valid_
* UTF-8 text, but how they should handle _invalid_ UTF-8 text (such
* as how many error markers each invalid byte sequence should decode
* to).
* That specification is followed by the JavaScript Encoding API.
*
* The API documentation for mb_convert_encoding does not specify how
* many error markers we will emit for each possible invalid byte
* sequence, so we might as well comply with the WHATWG specification.
*
* Thanks to Martin Auswöger for pointing this out... and another big
* thanks for providing test cases!
*
* Ref: https://encoding.spec.whatwg.org/#utf-8-decoder
*/
// Emit '%' (0x25) as the error marker so each one is easy to count below
mb_substitute_character(0x25);
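// Each entry is [input bytes, expected output after UTF-8 -> UTF-8 conversion]
$testCases = [
  // Stray continuation byte; byte which is never valid in UTF-8
  ["\x80", "%"],
  ["\xFF", "%"],
  // 2-byte sequences: boundaries are U+0080 (C2 80) and U+07FF (DF BF)
  ["\xC2\x7F", "%\x7F"],
  ["\xC2\x80", "\xC2\x80"],
  ["\xDF\xBF", "\xDF\xBF"],
  ["\xDF\xC0", "%%"],
  // 3-byte sequences: boundaries are U+0800 (E0 A0 80) and U+FFFF (EF BF BF)
  ["\xE0\xA0\x7F", "%\x7F"],
  ["\xE0\xA0\x80", "\xE0\xA0\x80"],
  ["\xEF\xBF\xBF", "\xEF\xBF\xBF"],
  ["\xEF\xBF\xC0", "%%"],
  // 4-byte sequences: boundaries are U+10000 and U+10FFFF
  ["\xF0\x90\x80\x7F", "%\x7F"],
  ["\xF0\x90\x80\x80", "\xF0\x90\x80\x80"],
  ["\xF4\x8F\xBF\xBF", "\xF4\x8F\xBF\xBF"],
  ["\xF4\x8F\xBF\xC0", "%%"],
  // Obsolete 5- and 6-byte forms: every byte yields its own error marker
  ["\xFA\x80\x80\x80\x80", "%%%%%"],
  ["\xFB\xBF\xBF\xBF\xBF", "%%%%%"],
  ["\xFD\x80\x80\x80\x80\x80", "%%%%%%"],
  ["\xFD\xBF\xBF\xBF\xBF\xBF", "%%%%%%"]
];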
foreach ($testCases as $testCase) {
  $result = mb_convert_encoding($testCase[0], 'UTF-8', 'UTF-8');
  if ($result !== $testCase[1]) {
    die("Expected UTF-8 string " . bin2hex($testCase[0]) . " to convert to UTF-8 string " . bin2hex($testCase[1]) . "; got " . bin2hex($result));
  }
}
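
// Illustration only, not part of mbstring: a minimal sketch of the error
// counting rule in the WHATWG UTF-8 decoder (see the Ref: URL in the header
// comment), to show why e.g. "\xDF\xC0" yields two markers but
// "\xE0\xA0\x7F" yields only one. The name whatwg_utf8_error_count() is
// hypothetical, and the sketch only counts errors; it builds no code points.
function whatwg_utf8_error_count(string $bytes): int {
  $errors = 0;
  $needed = 0;      // continuation bytes still expected
  $lower  = 0x80;   // allowed range for the next continuation byte
  $upper  = 0xBF;
  $len    = strlen($bytes);
  for ($i = 0; $i < $len; ) {
    $b = ord($bytes[$i]);
    if ($needed === 0) {
      if ($b <= 0x7F) {
        // ASCII byte: nothing to do
      } elseif ($b >= 0xC2 && $b <= 0xDF) {
        $needed = 1;
      } elseif ($b >= 0xE0 && $b <= 0xEF) {
        if ($b === 0xE0) { $lower = 0xA0; } // reject overlong 3-byte forms
        if ($b === 0xED) { $upper = 0x9F; } // reject surrogates
        $needed = 2;
      } elseif ($b >= 0xF0 && $b <= 0xF4) {
        if ($b === 0xF0) { $lower = 0x90; } // reject overlong 4-byte forms
        if ($b === 0xF4) { $upper = 0x8F; } // reject values above U+10FFFF
        $needed = 3;
      } else {
        $errors++; // stray continuation byte or invalid lead byte
      }
      $i++;
      continue;
    }
    $inRange = ($b >= $lower && $b <= $upper);
    $lower = 0x80;
    $upper = 0xBF;
    if (!$inRange) {
      // One error for the truncated sequence; the offending byte is then
      // reprocessed as a possible lead byte, so $i is not advanced here
      $needed = 0;
      $errors++;
      continue;
    }
    $needed--;
    $i++;
  }
  if ($needed !== 0) {
    $errors++; // input ended in the middle of a sequence
  }
  return $errors;
}
// Sanity check on two literal inputs; silent on success, so --EXPECT-- is unchanged
if (whatwg_utf8_error_count("\xDF\xC0") !== 2 || whatwg_utf8_error_count("\xE0\xA0\x7F") !== 1) {
  die("WHATWG error-count sketch disagrees with the expected marker counts");
}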
echo "All done!\n";
?>
--EXPECT--
All done!