Previously, mbstring used the same logic for encoding validation as for
encoding conversion.
However, there are cases where we want to use different logic for validation
and conversion. For example, if a string ends up with missing input
required by the encoding, or if a character is input that is invalid
as an encoding but can be converted, the conversion should succeed and
the validation should fail.
To achieve this, a function pointer mb_check_fn has been added to
struct mbfl_encoding to implement the logic used for validation.
Also, added implementation of validation logic for UTF-7, UTF7-IMAP,
ISO-2022-JP and JIS.
(The same change has already been made to PHP 8.2 and 8.3; see
6fc8d014df. This commit is backporting the change to PHP 8.1.)
Some places use an if check, which implicitly checks for a non-zero
value, and some places use > 0. The > 0 is the correct one because at
least some of those functions already use the CK() macro to return -1 on
error. Because -1 != 0 this is wrongly interpreted as a success instead
of a failure.
Fixes GH-10627
The php_mb_convert_encoding() function can return NULL on error, but
this case was not handled, which led to a NULL pointer dereference and
hence a crash.
Closes GH-10628
Signed-off-by: George Peter Banyard <girgias@php.net>
Commit 8bbd0952e5 added a check rejecting empty strings; in the
merge commiot 379d9a1cfc however it was changed to a NULL check,
one that did not make sense because ZSTR_VAL() is guaranteed to never
be NULL; the length check was accidently removed by that merge commit.
This bug was found by GCC's -Waddress warning:
ext/mbstring/mbstring.c:748:27: warning: the comparison will always evaluate as ‘true’ for the address of ‘val’ will never be NULL [-Waddress]
748 | if (!new_value || !ZSTR_VAL(new_value)) {
| ^
Closes GH-10532
Signed-off-by: George Peter Banyard <girgias@php.net>
As a performance optimization, mbstring implements some functions using
tables which give the (byte) length of a multi-byte character using a
lookup based on the value of the first byte. These tables are called
`mblen_table`.
For many years, the mblen_table for SJIS has had '2' in position 0x80.
That is wrong; it should have been '1'. Reasons:
For SJIS, SJIS-2004, and mobile variants of SJIS, 0x80 has never been
treated as the first byte of a 2-byte character. It has always been
treated as a single erroneous byte. On the other hand, 0x80 is a valid
character in MacJapanese... but a 1-byte character, not a 2-byte one.
The same applies to bytes 0xFD-FF; these are 1-byte characters in
MacJapanese, and in other SJIS variants, they are not valid (as the
first byte of a character).
Thanks to the GitHub user 'youkidearitai' for finding this problem.
In b5ff87ca71, I made a number of adjustments to our conversion code
for CP1252. One of the adjustments was to make the mappings match those
published by the Unicode Consortium in the file CP1252.TXT. These do
not include mappings for the CP1252 bytes 0x81, 0x8D, 0x8F, 0x90, and
0x9D.
Rostyslav Gulka reported that this caused a problem. His application
stores binary JPEG data in an MS-SQL database. When they SELECT the
binary data out of the database, it is treated as CP1252 text and
automatically converted to UTF-8. To recover the original binary
data, they then do a conversion from UTF-8 to CP1252.
Obviously, that does not work if certain CP1252 bytes do not map to
any Unicode codepoint at all.
While this is a very unusual application of text encoding conversion,
and we might choose not to support it if there was no other basis for
including those mappings, it seems that Microsoft does actually include
them in the Win32 API as "best fit" mappings. These are extra mappings
from Unicode to other text encodings, which the Win32 API function
WideCharToMultiByte uses by default unless the WC_NO_BEST_FIT_CHARS
flag was passed.
A list of these "best fit" mappings for CP1252 can be found here:
https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt
The existing implementation of mb_strcut extracts part of a
multi-byte encoded string by pulling out raw bytes and then running
them through a conversion filter to ensure that the output is valid
in the requested encoding.
If the conversion filter emits error markers when doing the final
'flush' operation which ends the conversion of the extracted bytes,
these error markers may (in some cases) be included in the output.
The conversion operation does not respect the value of
mb_substitute_character; rather, it always uses '?' as an error marker.
So this issue manifests itself as unwanted '?' characters being
inserted into the output.
This issue has existed for a long time, but became noticeable in PHP
8.1 because for at least some of the supported text encodings, mbstring
is now more strict about emitting error markers when strings end in an
illegal state.
The simplest fix is to suppress error markers during the final flush
operation.
While working on a fix for this problem, another problem with mb_strcut
was discovered; since it decides when to stop consuming bytes from
the input by looking at the byte length of its OUTPUT, anything which
causes extra bytes to be emitted to the output may cause mb_strcut to
not consume all the bytes in the requested range.
The one case where we DO emit extra output bytes is for encodings
which have a selectable mode, like ISO-2022-JP; if a string in such
an encoding ends in a mode which is not the default, we emit an ending
escape sequence which changes back to the default mode. This is done
so that concatenating strings in such encodings is safe.
However, as mentioned, this can cause the output of mb_strcut to be
shorter than it logically should be. This bug has existed for a long
time, and fixing it now will be a BC break, so we may not fix it right
away.
Therefore, tests for THIS fix which don't pass because of that OTHER
bug have been split out into a separate test file (gh9535b.phpt), and
that file has been marked XFAIL.
Up until now, I believed that mbstring had been designed such
that (legacy) text conversion filter objects should not be
re-used after the 'flush' function is called to complete a
text conversion operation.
However, it turns out that the implementation of
_php_mb_encoding_handler_ex DID re-use filter objects
after flush. That means that functions which were based on
_php_mb_encoding_handler_ex, including mb_parse_str and
php_mb_post_handler, would break in some cases; state left
over from converting one substring (perhaps a variable name)
would affect the results of converting another substring
(perhaps the value of the same variable), and could cause
extraneous characters to get inserted into the output.
All this code should be deleted soon, but fixing it helps me
to avoid spurious failures when fuzzing the new/old code to
look for differences in behavior.
(This bug fix commit was originally applied to PHP-8.2 when fuzzing
the new mbstring text conversion code to check for differences with
the old code. Later, Kentaro Ohkouchi kindly reported a problem with
mb_encode_mimeheader under PHP 8.1 which was caused by the same issue.
Hence, this commit was backported to PHP-8.1.)
Fixes GH-9683.
In 0d0029d729 and 315d48b434, I changed the mappings used for Unicode
to Shift-JIS-2004, in an attempt to follow the JISC specification
more closely. However, feedback from Japanese PHP users indicates
that most users of SJIS-2004 expect 0x5C and 0x7E to be treated as
equivalent to the same ASCII bytes. This is due to a long history of
non-complying implementations which then became a de-facto standard.
Therefore, restore the earlier mappings for U+005C and U+007E.
Thanks to the GitHub user 'youkidearitai' for reporting this issue.
Fixes GH-9528.
In e2459857af, I combined mbstring's "SJIS-win" text encoding
into CP932. This was done after doing some testing which appeared
to show that the mappings for "SJIS-win" were the same as those
for "CP932".
Later, it was found that there was actually a small difference
prior to e2459857af when converting Unicode to CP932. The
mappings for the following two codepoints were different:
CP932 SJIS-win
U+203E 0x7E 0x81 0x50
U+00A5 0x5C 0x81 0x8F
As shown, mbstring's "CP932" mapped Unicode's 'OVERLINE' and
'YEN SIGN' to the ASCII bytes which have conflicting uses in
most legacy Japanese text encodings. "SJIS-win" mapped these
to equivalent JIS X 0208 fullwidth characters.
Since e2459867af was not intended to cause any user-visible
change in behavior, I am rolling back the merge of "CP932"
and "SJIS-win".
It seems doubtful whether these two text encodings should
be kept separate or merged in a future release. An extensive
discussion of the related historical background and
compatibility issues involved can be found in this
GitHub thread:
https://github.com/php/php-src/issues/8308
Passing `null` to `$encodings` is supposed to behave like passing the
result of `mb_detect_order()`. Therefore, we need to remove the non-
encodings from the `elist` in this case as well. Thus, we duplicate
the global `elist`, so we can modify it.
Closes GH-9063.
According to the relevant Japan Industrial Standards Committee standards,
SJIS 0x5C is a Yen sign, and 0x7E is an overline.
However, this conflicts with the implementation of SJIS in various legacy
software (notably Microsoft products), where SJIS 0x5C and 0x7E are taken
as equivalent to the same ASCII bytes.
Prior to PHP 8.1, mbstring's implementation of SJIS handled these bytes
compatibly with Microsoft products. This was changed in PHP 8.1.0, in an
attempt to comply with the JISC specifications. However, after discussion
with various concerned Japanese developers, it seems that the historical
behavior was more useful in the majority of applications which process
SJIS-encoded text.
Since we are now treating SJIS 0x5C as equivalent to U+005C and 0x7E as
equivalent to U+007E, it does not make sense to convert U+203E (OVERLINE)
to 0x7E, nor does it make sense to convert U+00A5 (YEN SIGN) to 0x5C. Restore
the mappings for those codepoints from before PHP 8.1.0.
Thanks to Côme Chilliet for reporting that mb_detect_encoding was not
detecting the desired text encoding for strings containing š or Ž.
These characters are used in Czech, Serbian, Croatian, Bosnian,
Macedonian, etc. names.
In 7502c86342, I adjusted the number of error markers emitted on
invalid UTF-8 text to be more consistent with mbstring's behavior on
other text encodings (generally, it emits one error marker for one
unexpected byte). I didn't expect that anybody would actually care one
way or the other, but felt that it was better to be consistent than
not.
Later, Martin Auswöger kindly pointed out that the WHATWG encoding
specification, which governs how various text encodings are handled
by web browsers, does actually specify how many error markers should
be generated for any given piece of invalid UTF-8 text.
Until now, we have never really paid much attention to the WHATWG
specification, but we do want to comply with as many relevant
specifications as possible. And since PHP is commonly used for web
applications, compatibility with the behavior of web browsers is
obviously a good thing.
This was the old behavior of mb_check_encoding() before 3e7acf901d,
but yours truly broke it. If only we had more thorough tests at that
time, this might not have slipped through the cracks.
Thanks to divinity76 for the report.
In a2bc57e0e5, mb_detect_encoding was modified to ensure it would never
return 'UUENCODE', 'QPrint', or other non-encodings as the "detected
text encoding". Before mb_detect_encoding was enhanced so that it could
detect any supported text encoding, those were never returned, and they
are not desired. Actually, we want to eventually remove them completely
from mbstring, since PHP already contains other implementations of
UUEncode, QPrint, Base64, and HTML entities.
For more clarity on why we need to suppress UUEncode, etc. from being
detected by mb_detect_encoding, the existing UUEncode implementation
in mbstring *never* treats any input as erroneous. It just accepts
everything. This means that it would *always* be treated as a valid
choice by mb_detect_encoding, and would be returned in many, many cases
where the input is obviously not UUEncoded.
It turns out that the form of mb_convert_encoding where the user passes
multiple candidate encodings (and mbstring auto-detects which one to
use) was also affected by the same issue. Apply the same fix.
`php_mb_check_encoding()` now uses conversion to `mbfl_encoding_wchar`.
Since `mbfl_encoding_7bit` has no `input_filter`, no filter can be
found. Since we don't actually need to convert to wchar, we encode to
8bit.
Closes GH-7712.
Previously, some accented letters commonly used to write Polish text
were counted as 'rare' codepoints. Treat them as 'common' instead.
Thanks to Alec for pointing this out.
We must not reuse per-request memory across multiple requests, so this
check triggered during RINIT makes no sense. As explained in the bug
report[1], it can be even harmful, if some request startup fails, and
the pointers refer to already freed memory in the next request.
[1] <https://bugs.php.net/76167>
Closes GH-7604.
Among the text encodings supported by mbstring are several which are
not really 'text encodings'. These include Base64, QPrint, UUencode,
HTML entities, '7 bit', and '8 bit'.
Rather than providing an explicit list of text encodings which they are
interested in, users may pass the output of mb_list_encodings to
mb_detect_encoding. Since Base64, QPrint, and so on are included in
the output of mb_list_encodings, mb_detect_encoding can return one of
these as its 'detected encoding' (and in fact, this often happens).
Before mb_detect_encoding was enhanced so it could detect any of the
supported text encodings, this did not happen, and it is never desired.
Originally, `mb_detect_encoding` essentially just checked all candidate
encodings to see which ones the input string was valid in. However, it
was only able to do this for a limited few of all the text encodings
which are officially supported by mbstring.
In 3e7acf901d, I modified it so it could 'detect' any text encoding
supported by mbstring. While this is arguably an improvement, if the
only text encodings one is interested in are those which
`mb_detect_encoding` could originally handle, the old
`mb_detect_encoding` may have been preferable. Because the new one has
more possible encodings which it can guess, it also has more chances to
get the answer wrong.
This commit adjusts the detection heuristics to provide accurate
detection in a wider variety of scenarios. While the previous detection
code would frequently confuse UTF-32BE with UTF-32LE or UTF-16BE with
UTF-16LE, the adjusted code is extremely accurate in those cases.
Detection for Chinese text in Chinese encodings like GB18030 or BIG5
and for Japanese text in Japanese encodings like EUC-JP or SJIS is
greatly improved. Detection of UTF-7 is also greatly improved. An 8KB
table, with one bit for each codepoint from U+0000 up to U+FFFF, is
used to achieve this.
One significant constraint is that the heuristics are completely based
on looking at each codepoint in a string in isolation, treating some
codepoints as 'likely' and others as 'unlikely'. It might still be
possible to achieve great gains in detection accuracy by looking at
sequences of codepoints rather than individual codepoints. However,
this might require huge tables. Further, we might need a huge corpus
of text in various languages to derive those tables.
Accuracy is still dismal when trying to distinguish single-byte
encodings like ISO-8859-1, ISO-8859-2, KOI8-R, and so on. This is
because the valid bytes in these encodings are basically all the same,
and all valid bytes decode to 'likely' codepoints, so our method of
detection (which is based on rating codepoints as likely or unlikely)
cannot tell any difference between the candidates at all. It just
selects the first encoding in the provided list of candidates.
Speaking of which, if one wants to get good results from
`mb_detect_encoding`, it is important to order the list of candidate
encodings according to your prior belief of which are more likely to
be correct. When the function cannot tell any difference between two
candidates, it returns whichever appeared earlier in the array.
Headers should not be processed in a locale-depdendent fashion.
Switch from upper to lowercasing because that's the standard for
PHP and we provide an ASCII implementation of this operation.
This is adapted from GH-7506.
As a performance optimization, mb_detect_encoding tries to stop
processing the input string early when there is only one 'candidate'
encoding which the input string is valid in. However, the code which
keeps count of how many candidate encodings have already been rejected
was buggy. This caused mb_detect_encoding to prematurely stop
processing the input when it should have continued.
As a result, it did not notice that in the test case provided by Alec,
the input string was not valid in UTF-16.
...By just testing the input codepoints if they are within a few fixed
ranges instead. This avoids hash lookups in property tables.
From (micro-)benchmarking on my PC, this looks to be a bit less than 4x
faster than the existing code.
mbstring has always had the conversion tables to support CP932 codes
in ku 115-119, and the conversion code for CP5022x has an 'if' clause
specifically to handle such characters... but that 'if' clause was dead
code, since a guard clause earlier in the same function prevented it
from accepting 2-byte characters with a starting byte of 0x93-0x97.
Adjust the guard clause so that these characters can be converted as
the original author apparently intended.
The code which handles ku 115-119 is the part which reads:
} else if (s >= cp932ext3_ucs_table_min && s < cp932ext3_ucs_table_max) {
w = cp932ext3_ucs_table[s - cp932ext3_ucs_table_min];