Similar to the fast, specialized mb_strcut implementation for UTF-8
in 1f0cf133db, this new implementation of mb_strcut for UTF-16 strings
just examines a few bytes before each cut point.
Even for short strings, the new implementation is around 2x faster.
For strings around 10,000 bytes in length, it comes out about 100-500x
faster in my microbenchmarks.
The new implementation behaves identically to the old one on valid
UTF-16 strings; a fuzzer was used to help verify this.
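To illustrate the idea, here is a rough sketch (with invented names; not the actual php-src code) of how a UTF-16LE cut point can be adjusted by looking at just a couple of bytes:

```
#include <stddef.h>

/* Rough sketch with invented names; not the actual php-src code.
 * Adjust a byte offset into a valid UTF-16LE string so it does not split
 * a 2-byte code unit or a surrogate pair. Only the bytes at the cut point
 * are examined, so the adjustment takes constant time. */
static size_t utf16le_adjust_cut(const unsigned char *s, size_t len, size_t cut)
{
    if (cut >= len)
        return len;
    cut &= ~(size_t)1; /* code units are 2 bytes wide; keep the cut even */
    unsigned int unit = s[cut] | (s[cut + 1] << 8);
    if (cut >= 2 && unit >= 0xDC00 && unit <= 0xDFFF)
        cut -= 2; /* low surrogate; back up to the preceding high surrogate */
    return cut;
}
```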
This bug was introduced in cb840799b4.
Thanks to Ignace Nyamagana Butera for discovering this bug and
to Sebastian Bergmann for doing an initial investigation and opening
a bug ticket.
...So conditionally including code which uses __builtin_usub_overflow
(for performance) merely because the macro is defined is not correct.
We also need to check that the macro is defined to a non-zero value.
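In other words, the check needs to have this shape (a sketch;
PHP_HAVE_BUILTIN_USUB_OVERFLOW stands in for whatever name the build
system actually defines):

```
/* Sketch; PHP_HAVE_BUILTIN_USUB_OVERFLOW stands in for whatever macro name
 * the build system actually uses. If configure always defines the macro,
 * to either 0 or 1, then #ifdef alone is always true; the value must be
 * tested too. */
#if defined(PHP_HAVE_BUILTIN_USUB_OVERFLOW) && PHP_HAVE_BUILTIN_USUB_OVERFLOW
    overflow = __builtin_usub_overflow(a, b, &result); /* fast path */
#else
    result = a - b;     /* portable fallback */
    overflow = (b > a);
#endif
```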
Apparently this broke the build for a user whose C compiler is GCC
4.9.4. Sorry, user! That was my fault!
Thanks to Jakub Zelenka for reporting the issue.
The old implementation runs through the entire string to pick out the
part which should be returned by mb_strcut. This creates significant
performance overhead. The new specialized implementation of mb_strcut
for UTF-8 usually only examines a few bytes around the starting and
ending cut points, meaning it generally runs in constant time.
For UTF-8 strings just a few bytes long, the new implementation is
around 10% faster (according to microbenchmarks which I ran locally).
For strings around 10,000 bytes in length, it is 50-300x faster.
(Yes, that is 300x and not 300%.)
The new implementation behaves identically to the old one on VALID
UTF-8 strings; a fuzzer was used to help ensure this is the case.
On invalid UTF-8 strings, there is a difference: in some cases, the
old implementation will pass invalid byte sequences through unchanged,
while in others it will remove them. The new implementation has
behavior which is perhaps slightly more predictable: it simply backs
up the starting and ending cut points to the preceding "starter
byte" (one which is not a UTF-8 continuation byte).
These unit tests cover situations which were not previously tested by the
mbstring test suite. Adding them will make the test suite more complete.
To be specific, the 'obscure' case which we are now testing is: what happens
when the first half of a surrogate pair appears at the end of an improperly
terminated Base64 section in UTF7-IMAP text?
I don't believe such a buffer overrun will ever occur, but just in
case the code is changed in the future, it will be good to have an
assertion here to help catch bugs. (A similar assertion is already
used in the UTF-7 version of this function.)
We need to remove the value from the GC buffer before freeing it; otherwise
shutdown will trigger a use-after-free when running the GC. Do that by
switching from zend_hash_destroy to zend_array_destroy, which should also be
faster for freeing members due to inlining of i_zval_ptr_dtor.
Closes GH-11822
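For reference, the shape of the change (a sketch, not the exact patch):

```
/* Sketch, not the exact patch. zend_hash_destroy only destroys the table
 * contents; it does not take the array out of the GC root buffer, so a GC
 * run at shutdown can end up scanning freed memory: */
zend_hash_destroy(ht);
efree(ht);

/* zend_array_destroy removes the array from the GC buffer before freeing
 * it, and destroys the members through the inlined i_zval_ptr_dtor: */
zend_array_destroy(ht);
```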
When not providing a pad string, *and* not having other defaulted
arguments, the function would crash on a NULL pad zend_string*.
Despite testing with an empty pad string, the issue wasn't found because
when using named arguments the pad string *is* filled in.
I tweaked the #if check such that the workaround only applies on GCC
versions older than 8.0.
I tested this with GCC 7.5, 8.4, 9.4, and 13.1.1, as well as Clang 10.0.
Closes GH-11516.
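For reference, the general shape of such a version check (illustrative
only; the surrounding code is omitted):

```
/* Illustrative only. GCC encodes its major version in __GNUC__; Clang also
 * defines __GNUC__, so it must be excluded explicitly. */
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ < 8
    /* apply workaround for GCC versions older than 8.0 */
#endif
```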
This bug was introduced in e837a8800b. In that commit, I improved the
performance of CP949 text conversion, but accidentally broke the case
where 0xC9 (illegal byte to start a character) is followed by a valid
character with a first byte less than 0xA1. The 'broken' behavior is
that both the 0xC9 byte and the following valid character would be
converted to error markers.
When combining all the CJK encoding conversion code into a single file,
I merged some redundant mblen tables. This check will help to ensure
that all the mblen tables are correct.
These (static) tables were defined in a header file, which was included
in two different .c files. That will result in two copies of the tables
being included in the PHP binary.
But the tables were only used in one of the two .c files. Move them into
the file where they are used to avoid needlessly bloating the binary. (I
checked in a
hex editor and confirmed that while the previous binary contained two
copies of these tables, it now only contains one.)
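The underlying C rule, illustrated with invented names: a file-scope
'static' definition in a header gives every including translation unit
its own private copy.

```
/* tables.h -- invented names for illustration */
static const unsigned char mblen_table_example[256] = { 1, /* ... */ };

/* Both a.c and b.c #include "tables.h". Because the definition is 'static',
 * each translation unit gets its own copy, and both copies are emitted into
 * the linked binary. Defining the table in the single .c file which uses it
 * avoids the duplication. */
```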
Conversion of SJIS-2004 text to UTF-8 using `mb_convert_encoding` is
now about 60% faster than before. (Many other mbstring functions will
also be faster now on SJIS-2004 text.)
This will make it easier to combine duplicated code between all the
CJK text encodings (a significant amount is already combined in this
commit, such as the repeated definitions of SJIS_DECODE and
SJIS_ENCODE), but I hope to remove even more redundancy in the future.
The table used to implement mb_strlen for CP932 has been changed to
the same table as "SJIS-win".
In 6fc8d014df, pakutoma added specialized validity checking functions
for some legacy text encodings like ISO-2022-JP and UTF-7. These
check functions perform a more strict validity check than the encoding
conversion functions for the same text encodings. For example, the
check function for ISO-2022-JP verifies that the string ends in the
correct state required by the specification for ISO-2022-JP.
These check functions are already being used to make detection of text
encoding more accurate when 'strict' detection mode is enabled.
However, since the default is 'non-strict' detection (a bad API design
but we're stuck with it now), most users will not benefit from
pakutoma's work. I was previously reluctant to enable this new logic
for non-strict detection mode. My intention was to reduce the scope of
behavior changes, since almost *any* behavior change may affect *some*
user in a way we don't expect.
However, we definitely have users whose (production) code was broken
by the changes I made in 28b346bc06, and enabling pakutoma's check
functions for non-strict detection mode would un-break it. (See
GH-10192 as an example.) The added checks do also make sense.
In non-strict detection mode, we will not immediately reject candidate
encodings whose validity check function returns false, but they will
be much less likely to be selected. Failure of the validity check is,
however, weighted less heavily than an encoding error detected by the
encoding conversion function.
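In pseudo-code, the weighting looks something like this (a hedged sketch
with invented names and penalty values; the real heuristics are more
involved):

```
/* Hedged sketch with invented names; the real heuristics are more involved.
 * Candidates accumulate demerits; the candidate with the fewest wins. */
if (saw_conversion_error(candidate)) {
    candidate->demerits += HARD_ERROR_PENALTY;   /* weighted heavily */
}
if (candidate->enc->check && !candidate->enc->check(in, in_len)) {
    candidate->demerits += CHECK_FAILED_PENALTY; /* weighted less heavily */
}
```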
The documentation for mb_detect_encoding says that this function
"Detects the most likely character encoding for string `string` from an
ordered list of candidates".
Prior to 28b346bc06, mb_detect_encoding did not really attempt to
determine the "most likely" text encoding for the input string. It
would just return the first candidate encoding for which the string was
valid. In 28b346bc06, I amended this function so that it uses heuristics
to try to guess which candidate encoding is "most likely".
However, the caller did not have any way to indicate which candidate
text encoding(s) they consider to be more likely, in case the
heuristics applied are inconclusive. In the language of Bayesian
probability, there was no way for the caller to indicate their 'prior'
assignment of probabilities.
Further, the documentation for mb_detect_encoding also says that the
second parameter `encodings` is "a list of character encodings to try,
in order". The documentation clearly implies that the order of
the `encodings` argument should be significant.
Therefore, amend mb_detect_encoding so that while it still uses
heuristics to guess the most likely text encoding for the input string,
it favors those which are earlier in the list of candidate encodings.
One complication is that many callers of mb_detect_encoding use it
in this way:
mb_detect_encoding($string, mb_list_encodings());
In a majority of cases, this is bad code; mb_detect_encoding will both
be much slower and the results will be less reliable than if a smaller
list of candidates is used. However, since such code already exists and
people are using it in production, we should not unnecessarily break it.
The order of candidate encodings obviously does not express any prior
belief of which candidates are more likely in this case, and treating
it as if it did will degrade the accuracy of the result.
Since mb_list_encodings now returns a single, immutable array on each
call, we can avoid that problem by turning off the new behavior when
we receive the array of encodings returned by mb_list_encodings.
This implementation means that if the user does this:
$a = mb_list_encodings();
mb_detect_encoding($string, $a);
...then the order of candidate encodings will not be considered.
However, if the user explicitly initializes their own array of all
supported legacy text encodings, then the order *will* be considered.
The other functions which also follow this new behavior are:
• mb_convert_variables
• mb_convert_encoding (when multiple candidate input encodings are
listed)
Other places where "detection" (or really "guessing") of text encoding
may be performed include:
• mb_send_mail
• Zend engine, when determining the encoding of a PHP script
• mbstring processing of HTTP request contents, when http_input INI
parameter is set to a list
In these cases, the new logic based on order of candidate encodings
is *not* enabled. It *might* be logical to consider the order of
candidate encodings in some or all of these cases, but I'm not sure if
that is true, so it seems wiser to avoid more behavior changes than is
necessary. Further, ever since the new encoding detection heuristics
were implemented in 28b346bc06, we have not received any complaints of
user code being broken in these areas. So I am reluctant to "fix what
isn't broken".
Well, some might say that applying the new detection heuristics
to mb_send_mail, etc. in 28b346bc06 was "fixing what wasn't broken",
but (cough cough) I don't have any comment on that...
This will allow us to easily check, in other mbstring functions, whether
the list of all supported encodings returned by mb_list_encodings has
been passed in as input.
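A sketch of how the check can work (invented names): since
mb_list_encodings now returns one shared, immutable array, pointer
identity is enough to recognize it.

```
/* Sketch with invented names. mb_list_encodings returns the same immutable
 * array on every call, so callers can recognize it by pointer identity
 * instead of comparing contents. */
static HashTable *cached_encodings_list; /* set when first built */

static bool is_full_encoding_list(const HashTable *candidates)
{
    return candidates == cached_encodings_list;
}
```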
Co-authored-by: Ilija Tovilo <ilija.tovilo@me.com>
We were setting the encoding in PHP_FUNCTION(mb_strpos), but mbfl_strpos
would discard it, setting it to mbfl_encoding_pass and making zend_memnrstr
crash on a null-pointer dereference.
Fixes GH-11217
Closes GH-11220
Compiling in release mode with UBSAN gives me the following compiler warning:
```
In function ‘mb_wchar_to_sjismac’:
mbfilter_sjis.c:1419:89: warning: ‘i’ may be used uninitialized [-Wmaybe-uninitialized]
1419 | buf->state = (i << 24) | (index << 16) | (w & 0xFFFF);
| ^~
mbfilter_sjis.c:1398:42: note: ‘i’ was declared here
1398 | for (int i = 0; i < code_tbl_m_len; i++) {
| ^
```
Since the if condition will always be true after the goto, we can get
rid of the warning by moving the label inside the if.
Signed-off-by: Alex Dowad <alexinbeijing@gmail.com>
For mb_parse_str, when mbstring.http_input (INI parameter) is a list of
multiple possible text encodings (which is not the case by default),
this new implementation is about 25% faster.
When mbstring.http_input is a single value, then nothing is changed.
(No automatic encoding detection is done in that case.)
The documentation for mb_strcut states:
mb_strcut(
    string $string,
    int $start,
    ?int $length = null,
    ?string $encoding = null
): string
mb_strcut() extracts a substring from a string similarly to mb_substr(),
but operates on bytes instead of characters. If the cut position happens
to be between two bytes of a multi-byte character, the cut is performed
starting from the first byte of that character.
My understanding of the $length parameter for mb_strcut was that it
specifies the range of bytes to extract from $string, and that all
characters encoded by those bytes should be included in the returned
string, even if that means the returned string would be longer than
$length bytes. This can happen either if 1) there is more than one way
to encode the same character in $encoding, and one way requires more
bytes than the other, or 2) $encoding uses escape sequences.
However, discussion with users of mb_strcut indicates that many of them
interpret $length as the maximum length of the *returned* string.
This is also the historical behavior of the function.
Hence, there is no need to modify the behavior of mb_strcut and then
remove XFAIL from these test cases afterwards. We can keep the current
behavior.
This (rare) situation was already handled correctly for the 1st and 2nd
of every 3 codepoints in a Base64-encoded section of a UTF-7 string.
However, it was not handled correctly if it happened on the 3rd,
6th, 9th, etc. codepoint of such a Base64-encoded section.
Previously, mbstring used the same logic for encoding validation as for
encoding conversion.
However, there are cases where we want to use different logic for validation
and conversion. For example, if a string is truncated so that input
required by the encoding is missing, or if it contains a byte sequence
which is invalid under the encoding but can still be converted, the
conversion should succeed while the validation fails.
To achieve this, a function pointer mb_check_fn has been added to
struct mbfl_encoding to implement the logic used for validation.
Also, added implementation of validation logic for UTF-7, UTF7-IMAP,
ISO-2022-JP and JIS.
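In simplified form, the change looks like this (the actual struct has
more members):

```
/* Simplified; the actual struct has more members. A NULL 'check' means the
 * encoding has no separate validation logic and the conversion filters are
 * used for validation, as before. */
typedef bool (*mb_check_fn)(unsigned char *in, size_t in_len);

struct mbfl_encoding {
    /* ... name, conversion filter functions, etc. ... */
    mb_check_fn check; /* strict validation, e.g. final-state check for ISO-2022-JP */
};
```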
(The same change has already been made to PHP 8.2 and 8.3; see
6fc8d014df. This commit is backporting the change to PHP 8.1.)
In 6fc8d014df, pakutoma added some additional validation logic to
mb_detect_encoding. Since the implementation of mb_detect_encoding
has changed significantly between PHP 8.2 and 8.3, when merging this
change down from PHP-8.2 into master, I had to port his code over to
the new implementation in master.
However, I did this incorrectly. In merge commit 0779950768,
the ported code modifies a function argument (to mb_guess_encoding)
which is marked 'const'. In the Windows CI job, MS VC++ rightly
flags this as a compile error.
Adjust the code to accomplish the same thing, but without destructively
modifying 'const' arguments.
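Roughly, the fix has this shape (sketched with invented names): copy the
surviving candidates into a local array and filter that, leaving the
'const' argument untouched.

```
/* Sketched with invented names; filter into a local copy instead of
 * destructively modifying the caller's 'const' array. */
static void guess_encoding_sketch(const mbfl_encoding **elist, size_t n)
{
    const mbfl_encoding *candidates[64]; /* invented bound for the sketch */
    size_t n_candidates = 0;

    for (size_t i = 0; i < n && n_candidates < 64; i++) {
        if (still_possible(elist[i])) /* invented predicate */
            candidates[n_candidates++] = elist[i];
    }
    /* work only with 'candidates' from here on */
}
```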
When I built and tested 0779950768 locally, the build was successful
and all tests passed. However, in CI, some CI jobs are failing due to
compile errors. Fix those.