php-src/ext/mbstring
NathanFreeman fa0401b0b5 Fix GH-9535 (unintended behavior change for mb_strcut in PHP 8.1)
The existing implementation of mb_strcut extracts part of a
multi-byte encoded string by pulling out raw bytes and then running
them through a conversion filter to ensure that the output is valid
in the requested encoding.

If the conversion filter emits error markers when doing the final
'flush' operation which ends the conversion of the extracted bytes,
these error markers may (in some cases) be included in the output.
The conversion operation does not respect the value of
mb_substitute_character; rather, it always uses '?' as an error marker.
So this issue manifests itself as unwanted '?' characters being
inserted into the output.

This issue has existed for a long time, but became noticeable in PHP
8.1 because, for at least some of the supported text encodings, mbstring
is now stricter about emitting error markers when strings end in an
illegal state.

The simplest fix is to suppress error markers during the final flush
operation.
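
As a rough illustration of the flush behavior described above, here is a
toy model in C. It is not libmbfl code: the function names
(toy_extract, utf8_seq_len), the suppress_flush_errors flag, and the
choice of UTF-8 are assumptions made only for this sketch. It copies
complete characters and, on flush, reports a dangling partial character
with '?' unless the flag is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only, NOT libmbfl code. All names here are hypothetical. */

/* Length of a UTF-8 sequence judged from its lead byte alone. */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1; /* stray byte: treated as a 1-byte unit in this toy */
}

/* Copies complete characters from in to out and returns the output length.
 * out must have room for at least in_len bytes.
 * Continuation bytes are not validated; this is only a sketch. */
size_t toy_extract(const unsigned char *in, size_t in_len,
                   bool suppress_flush_errors, unsigned char *out)
{
    assert(in != NULL && out != NULL);
    size_t i = 0, o = 0;

    while (i < in_len) {
        size_t n = utf8_seq_len(in[i]);
        if (i + n > in_len)
            break; /* the extracted bytes end in the middle of a character */
        for (size_t k = 0; k < n; k++)
            out[o++] = in[i + k];
        i += n;
    }

    /* Final "flush": leftover bytes mean the input ended in an illegal
     * state. Emit an error marker unless the caller suppressed it. */
    if (i < in_len && !suppress_flush_errors)
        out[o++] = '?';

    return o;
}
```

With the flag unset, a truncated trailing character shows up as an extra
'?' at the end of the output; with the flag set, the output simply ends
after the last complete character.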

While working on a fix for this problem, another problem with mb_strcut
was discovered. Since it decides when to stop consuming bytes from
the input by looking at the byte length of its OUTPUT, anything which
causes extra bytes to be emitted to the output may cause mb_strcut to
not consume all the bytes in the requested range.

The one case where we DO emit extra output bytes is for encodings
which have a selectable mode, like ISO-2022-JP; if a string in such
an encoding ends in a mode which is not the default, we emit an ending
escape sequence which changes back to the default mode. This is done
so that concatenating strings in such encodings is safe.

However, as mentioned, this can cause the output of mb_strcut to be
shorter than it logically should be. This bug has existed for a long
time, and fixing it now would be a BC break, so we may not fix it right
away.
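
The interaction between an output-length limit and a closing escape
sequence can be sketched with a toy model in C. This is not the libmbfl
implementation: toy_cut, the two-mode simplification of ISO-2022-JP
(ASCII and two-byte JIS X 0208 only), and the rule of reserving room for
the closing escape are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model only, NOT libmbfl code. All names here are hypothetical. */
enum toy_mode { TOY_ASCII, TOY_KANJI };

static const unsigned char TO_KANJI[3] = {0x1B, '$', 'B'}; /* ESC $ B */
static const unsigned char TO_ASCII[3] = {0x1B, '(', 'B'}; /* ESC ( B */

/* Copies units from in to out while the OUTPUT stays within limit bytes,
 * always leaving room for the escape sequence that returns to ASCII.
 * Returns the number of INPUT bytes consumed and stores the output
 * length in *out_len. out must have room for at least limit bytes. */
size_t toy_cut(const unsigned char *in, size_t in_len, size_t limit,
               unsigned char *out, size_t *out_len)
{
    assert(in != NULL && out != NULL && out_len != NULL);
    enum toy_mode mode = TOY_ASCII;
    size_t i = 0, o = 0;

    while (i < in_len) {
        size_t unit;
        enum toy_mode next = mode;

        if (in_len - i >= 3 && memcmp(in + i, TO_KANJI, 3) == 0) {
            unit = 3;
            next = TOY_KANJI;
        } else if (in_len - i >= 3 && memcmp(in + i, TO_ASCII, 3) == 0) {
            unit = 3;
            next = TOY_ASCII;
        } else {
            unit = (mode == TOY_KANJI) ? 2 : 1; /* one character */
        }
        if (unit > in_len - i)
            break; /* truncated character at the end of the input */

        /* Extra bytes needed at the end if we stop right after this unit. */
        size_t closing = (next == TOY_KANJI) ? 3 : 0;
        if (o + unit + closing > limit)
            break; /* the stop decision is driven by OUTPUT length */

        memcpy(out + o, in + i, unit);
        o += unit;
        i += unit;
        mode = next;
    }

    /* Return to the default mode so that concatenation stays safe. */
    if (mode == TOY_KANJI) {
        memcpy(out + o, TO_ASCII, 3);
        o += 3;
    }
    *out_len = o;
    return i;
}
```

In this model, take a 10-byte input made of ESC $ B, two two-byte
characters, and ESC ( B. A limit of 7 covers the opening escape and
both characters in the input, yet only the 3-byte opening escape is
consumed: each character would also need 3 bytes reserved for the
closing escape, which pushes the output past the limit.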

Therefore, tests for THIS fix which don't pass because of that OTHER
bug have been split out into a separate test file (gh9535b.phpt), and
that file has been marked XFAIL.
2022-11-13 14:37:55 +02:00
libmbfl Fix GH-9535 (unintended behavior change for mb_strcut in PHP 8.1) 2022-11-13 14:37:55 +02:00
tests Fix GH-9535 (unintended behavior change for mb_strcut in PHP 8.1) 2022-11-13 14:37:55 +02:00
ucgendat Combine control into one character group 2021-08-24 20:39:16 +02:00
common_codepoints.txt mb_detect_encoding recognizes all letters in Hungarian alphabet 2022-05-25 08:22:07 +02:00
config.m4 Remove duplicate implementation of CP932 from mbstring 2021-06-17 13:12:40 +02:00
config.w32 Remove duplicate implementation of CP932 from mbstring 2021-06-17 13:12:40 +02:00
CREDITS
gen_rare_cp_bitvec.php Improve detection accuracy of mb_detect_encoding 2021-10-19 18:05:51 +02:00
mb_gpc.c Update http->https in license (#6945) 2021-05-06 12:16:35 +02:00
mb_gpc.h Update http->https in license (#6945) 2021-05-06 12:16:35 +02:00
mbstring.c Fix GH-9008: mb_detect_encoding(): wrong results with null $encodings 2022-07-20 16:58:55 +02:00
mbstring.h Update http->https in license (#6945) 2021-05-06 12:16:35 +02:00
mbstring.stub.php Add support for generating MAY_BE_ARRAY_OF_REF func info flag (#7416) 2021-08-30 13:50:34 +02:00
mbstring_arginfo.h Add support for generating MAY_BE_ARRAY_OF_REF func info flag (#7416) 2021-08-30 13:50:34 +02:00
php_mbregex.c Update http->https in license (#6945) 2021-05-06 12:16:35 +02:00
php_mbregex.h Update http->https in license (#6945) 2021-05-06 12:16:35 +02:00
php_onig_compat.h
php_unicode.c Return bool from php_unicode_is_prop() 2021-08-24 19:21:21 +02:00
php_unicode.h Add comments to grouped character properties 2021-08-24 22:09:26 +02:00
rare_cp_bitvec.h mb_detect_encoding recognizes all letters in Hungarian alphabet 2022-05-25 08:22:07 +02:00
unicode_data.h Update Unicode tables to 14.0.0 2021-09-20 09:58:20 +02:00