php-src

mirror of https://github.com/php/php-src.git synced 2025-08-16 05:58:45 +02:00

Author	SHA1	Message	Date
Yuya Hamada	c50172e812	Fix mb_strlen is wrong length for CP932 when 0x80.	2023-05-30 13:44:30 -07:00
pakutoma	b721d0f71e	Fix phpGH-10648: add check function pointer into mbfl_encoding Previously, mbstring used the same logic for encoding validation as for encoding conversion. However, there are cases where we want to use different logic for validation and conversion. For example, if a string ends up with missing input required by the encoding, or if a character is input that is invalid as an encoding but can be converted, the conversion should succeed and the validation should fail. To achieve this, a function pointer mb_check_fn has been added to struct mbfl_encoding to implement the logic used for validation. Also, added implementation of validation logic for UTF-7, UTF7-IMAP, ISO-2022-JP and JIS. (The same change has already been made to PHP 8.2 and 8.3; see `6fc8d014df`. This commit is backporting the change to PHP 8.1.)	2023-03-25 09:52:10 +02:00
Alex Dowad	7c1ee5a02a	mb_encode_mimeheader does not crash if provided encoding has no MIME name set	2023-03-07 11:30:21 +02:00
nielsdos	d66ca5dabb	Propagate error checks for mbfl_filt_conv_illegal_output()	2023-03-02 22:36:00 +02:00
nielsdos	263655a520	Use CK() macro to check the output function in mbfilter_unicode2sjis_emoji_sb()	2023-03-02 22:36:00 +02:00
nielsdos	69543e6a10	Make error checks on encoding methods for docomo, kddi, sb consistent Some places use an if check, which implicitly checks for a non-zero value, and some places use > 0. The > 0 is the correct one because at least some of those functions already use the CK() macro to return -1 on error. Because -1 != 0 this is wrongly interpreted as a success instead of a failure.	2023-03-02 22:36:00 +02:00
Niels Dossche	ed0c0df351	Fix GH-10627: mb_convert_encoding crashes PHP on Windows Fixes GH-10627 The php_mb_convert_encoding() function can return NULL on error, but this case was not handled, which led to a NULL pointer dereference and hence a crash. Closes GH-10628 Signed-off-by: George Peter Banyard <girgias@php.net>	2023-02-20 13:33:11 +00:00
Max Kellermann	243865ae57	ext/mbstring: fix new_value length check Commit `8bbd0952e5` added a check rejecting empty strings; in the merge commit `379d9a1cfc` however it was changed to a NULL check, one that did not make sense because ZSTR_VAL() is guaranteed to never be NULL; the length check was accidentally removed by that merge commit. This bug was found by GCC's -Waddress warning: ext/mbstring/mbstring.c:748:27: warning: the comparison will always evaluate as ‘true’ for the address of ‘val’ will never be NULL [-Waddress] 748 | if (!new_value || !ZSTR_VAL(new_value)) { Closes GH-10532 Signed-off-by: George Peter Banyard <girgias@php.net>	2023-02-20 13:32:56 +00:00
Alex Dowad	3152b7b26f	Use different mblen_table for different SJIS variants	2023-01-06 14:09:43 +02:00
Alex Dowad	d104481af8	Correct entry for 0x80,0xFD-FF in SJIS multi-byte character length table As a performance optimization, mbstring implements some functions using tables which give the (byte) length of a multi-byte character using a lookup based on the value of the first byte. These tables are called `mblen_table`. For many years, the mblen_table for SJIS has had '2' in position 0x80. That is wrong; it should have been '1'. Reasons: For SJIS, SJIS-2004, and mobile variants of SJIS, 0x80 has never been treated as the first byte of a 2-byte character. It has always been treated as a single erroneous byte. On the other hand, 0x80 is a valid character in MacJapanese... but a 1-byte character, not a 2-byte one. The same applies to bytes 0xFD-FF; these are 1-byte characters in MacJapanese, and in other SJIS variants, they are not valid (as the first byte of a character). Thanks to the GitHub user 'youkidearitai' for finding this problem.	2023-01-05 14:05:39 +02:00
Alex Dowad	a1a69c3734	Support Microsoft's "Best Fit" mappings for Windows-1252 text encoding In `b5ff87ca71`, I made a number of adjustments to our conversion code for CP1252. One of the adjustments was to make the mappings match those published by the Unicode Consortium in the file CP1252.TXT. These do not include mappings for the CP1252 bytes 0x81, 0x8D, 0x8F, 0x90, and 0x9D. Rostyslav Gulka reported that this caused a problem. His application stores binary JPEG data in an MS-SQL database. When they SELECT the binary data out of the database, it is treated as CP1252 text and automatically converted to UTF-8. To recover the original binary data, they then do a conversion from UTF-8 to CP1252. Obviously, that does not work if certain CP1252 bytes do not map to any Unicode codepoint at all. While this is a very unusual application of text encoding conversion, and we might choose not to support it if there was no other basis for including those mappings, it seems that Microsoft does actually include them in the Win32 API as "best fit" mappings. These are extra mappings from Unicode to other text encodings, which the Win32 API function WideCharToMultiByte uses by default unless the WC_NO_BEST_FIT_CHARS flag was passed. A list of these "best fit" mappings for CP1252 can be found here: https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt	2022-12-09 15:18:37 +02:00
NathanFreeman	fa0401b0b5	Fix GH-9535 (unintended behavior change for mb_strcut in PHP 8.1) The existing implementation of mb_strcut extracts part of a multi-byte encoded string by pulling out raw bytes and then running them through a conversion filter to ensure that the output is valid in the requested encoding. If the conversion filter emits error markers when doing the final 'flush' operation which ends the conversion of the extracted bytes, these error markers may (in some cases) be included in the output. The conversion operation does not respect the value of mb_substitute_character; rather, it always uses '?' as an error marker. So this issue manifests itself as unwanted '?' characters being inserted into the output. This issue has existed for a long time, but became noticeable in PHP 8.1 because for at least some of the supported text encodings, mbstring is now more strict about emitting error markers when strings end in an illegal state. The simplest fix is to suppress error markers during the final flush operation. While working on a fix for this problem, another problem with mb_strcut was discovered; since it decides when to stop consuming bytes from the input by looking at the byte length of its OUTPUT, anything which causes extra bytes to be emitted to the output may cause mb_strcut to not consume all the bytes in the requested range. The one case where we DO emit extra output bytes is for encodings which have a selectable mode, like ISO-2022-JP; if a string in such an encoding ends in a mode which is not the default, we emit an ending escape sequence which changes back to the default mode. This is done so that concatenating strings in such encodings is safe. However, as mentioned, this can cause the output of mb_strcut to be shorter than it logically should be. This bug has existed for a long time, and fixing it now will be a BC break, so we may not fix it right away. 
Therefore, tests for THIS fix which don't pass because of that OTHER bug have been split out into a separate test file (gh9535b.phpt), and that file has been marked XFAIL.	2022-11-13 14:37:55 +02:00
Alex Dowad	faa5425b0f	Add regression test for problem with mb_encode_mimeheader reported as GH-9683	2022-10-10 20:46:12 +09:00
Alex Dowad	5812b4fe54	In legacy text conversion filters, reset filter state in 'flush' function Up until now, I believed that mbstring had been designed such that (legacy) text conversion filter objects should not be re-used after the 'flush' function is called to complete a text conversion operation. However, it turns out that the implementation of _php_mb_encoding_handler_ex DID re-use filter objects after flush. That means that functions which were based on _php_mb_encoding_handler_ex, including mb_parse_str and php_mb_post_handler, would break in some cases; state left over from converting one substring (perhaps a variable name) would affect the results of converting another substring (perhaps the value of the same variable), and could cause extraneous characters to get inserted into the output. All this code should be deleted soon, but fixing it helps me to avoid spurious failures when fuzzing the new/old code to look for differences in behavior. (This bug fix commit was originally applied to PHP-8.2 when fuzzing the new mbstring text conversion code to check for differences with the old code. Later, Kentaro Ohkouchi kindly reported a problem with mb_encode_mimeheader under PHP 8.1 which was caused by the same issue. Hence, this commit was backported to PHP-8.1.) Fixes GH-9683.	2022-10-10 20:46:12 +09:00
Alex Dowad	dd00e2f1e3	Restore backwards-compatible mappings of U+005C and U+007E to SJIS-2004 In `0d0029d729` and `315d48b434`, I changed the mappings used for Unicode to Shift-JIS-2004, in an attempt to follow the JISC specification more closely. However, feedback from Japanese PHP users indicates that most users of SJIS-2004 expect 0x5C and 0x7E to be treated as equivalent to the same ASCII bytes. This is due to a long history of non-complying implementations which then became a de-facto standard. Therefore, restore the earlier mappings for U+005C and U+007E. Thanks to the GitHub user 'youkidearitai' for reporting this issue. Fixes GH-9528.	2022-10-05 12:18:38 +09:00
Alex Dowad	371367ce3e	Reintroduce legacy 'SJIS-win' text encoding in mbstring In `e2459857af`, I combined mbstring's "SJIS-win" text encoding into CP932. This was done after doing some testing which appeared to show that the mappings for "SJIS-win" were the same as those for "CP932". Later, it was found that there was actually a small difference prior to `e2459857af` when converting Unicode to CP932. The mappings for the following two codepoints were different: U+203E mapped to 0x7E in CP932 but to 0x81 0x50 in SJIS-win; U+00A5 mapped to 0x5C in CP932 but to 0x81 0x8F in SJIS-win. As shown, mbstring's "CP932" mapped Unicode's 'OVERLINE' and 'YEN SIGN' to the ASCII bytes which have conflicting uses in most legacy Japanese text encodings. "SJIS-win" mapped these to equivalent JIS X 0208 fullwidth characters. Since `e2459857af` was not intended to cause any user-visible change in behavior, I am rolling back the merge of "CP932" and "SJIS-win". It seems doubtful whether these two text encodings should be kept separate or merged in a future release. An extensive discussion of the related historical background and compatibility issues involved can be found in this GitHub thread: https://github.com/php/php-src/issues/8308	2022-08-16 20:18:54 +02:00
Christoph M. Becker	c2bdaa48e1	Fix GH-9008: mb_detect_encoding(): wrong results with null $encodings Passing `null` to `$encodings` is supposed to behave like passing the result of `mb_detect_order()`. Therefore, we need to remove the non- encodings from the `elist` in this case as well. Thus, we duplicate the global `elist`, so we can modify it. Closes GH-9063.	2022-07-20 16:58:55 +02:00
Alex Dowad	2dc9026cbc	Restore backwards-compatible mappings of 0x5C and 0x7E in SJIS According to the relevant Japan Industrial Standards Committee standards, SJIS 0x5C is a Yen sign, and 0x7E is an overline. However, this conflicts with the implementation of SJIS in various legacy software (notably Microsoft products), where SJIS 0x5C and 0x7E are taken as equivalent to the same ASCII bytes. Prior to PHP 8.1, mbstring's implementation of SJIS handled these bytes compatibly with Microsoft products. This was changed in PHP 8.1.0, in an attempt to comply with the JISC specifications. However, after discussion with various concerned Japanese developers, it seems that the historical behavior was more useful in the majority of applications which process SJIS-encoded text. Since we are now treating SJIS 0x5C as equivalent to U+005C and 0x7E as equivalent to U+007E, it does not make sense to convert U+203E (OVERLINE) to 0x7E, nor does it make sense to convert U+00A5 (YEN SIGN) to 0x5C. Restore the mappings for those codepoints from before PHP 8.1.0.	2022-06-10 21:04:36 +02:00
Remi Collet	966a90873d	Merge branch 'PHP-8.0' into PHP-8.1 * PHP-8.0: NEWS for GH-8685 Fix GH-8685 mbstring requires pcre	2022-06-03 07:54:58 +02:00
Remi Collet	2eb2f9d74f	Fix GH-8685 mbstring requires pcre	2022-06-03 07:53:48 +02:00
Alex Dowad	58d0aad75c	mb_detect_encoding recognizes all letters in Hungarian alphabet	2022-05-25 08:22:07 +02:00
Alex Dowad	6a4b6d2344	mb_detect_encoding recognizes all letters in Czech alphabet	2022-05-25 07:52:39 +02:00
Alex Dowad	9bb97ee8bc	Fix mb_detect_encoding's recognition of Slavic names Thanks to Côme Chilliet for reporting that mb_detect_encoding was not detecting the desired text encoding for strings containing š or Ž. These characters are used in Czech, Serbian, Croatian, Bosnian, Macedonian, etc. names.	2022-05-24 15:32:20 +02:00
Alex Dowad	04e59c916f	Error handling for UTF-8 complies with WHATWG specification In `7502c86342`, I adjusted the number of error markers emitted on invalid UTF-8 text to be more consistent with mbstring's behavior on other text encodings (generally, it emits one error marker for one unexpected byte). I didn't expect that anybody would actually care one way or the other, but felt that it was better to be consistent than not. Later, Martin Auswöger kindly pointed out that the WHATWG encoding specification, which governs how various text encodings are handled by web browsers, does actually specify how many error markers should be generated for any given piece of invalid UTF-8 text. Until now, we have never really paid much attention to the WHATWG specification, but we do want to comply with as many relevant specifications as possible. And since PHP is commonly used for web applications, compatibility with the behavior of web browsers is obviously a good thing.	2022-04-16 15:04:38 +02:00
Christoph M. Becker	5003831260	Merge branch 'PHP-8.0' into PHP-8.1 * PHP-8.0: Fix GH-8208: mb_encode_mimeheader: $indent functionality broken	2022-03-17 17:34:31 +01:00
Christoph M. Becker	d0417ebc93	Fix GH-8208: mb_encode_mimeheader: $indent functionality broken We also need to factor in the indent, when getting the encoder result. Closes GH-8213.	2022-03-17 17:31:58 +01:00
Alex Dowad	8a8533d263	mb_check_encoding($str, '7bit') rejects strings with bytes over 0x7F This was the old behavior of mb_check_encoding() before `3e7acf901d`, but yours truly broke it. If only we had more thorough tests at that time, this might not have slipped through the cracks. Thanks to divinity76 for the report.	2022-02-22 23:56:56 +02:00
Christoph M. Becker	69f6b09b2a	Merge branch 'PHP-8.0' into PHP-8.1 * PHP-8.0: Fix GH-7902: mb_send_mail may delimit headers with LF only	2022-01-18 13:09:52 +01:00
Christoph M. Becker	03816fba46	Fix GH-7902: mb_send_mail may delimit headers with LF only Email headers are supposed to be separated with CRLF. Period. We introduce a `CRLF` macro for better comprehensibility right away. Closes GH-7907.	2022-01-18 13:08:08 +01:00
Alex Dowad	f07c193583	mb_convert_encoding will not auto-detect input string as UUEncode, Base64, QPrint In `a2bc57e0e5`, mb_detect_encoding was modified to ensure it would never return 'UUENCODE', 'QPrint', or other non-encodings as the "detected text encoding". Before mb_detect_encoding was enhanced so that it could detect any supported text encoding, those were never returned, and they are not desired. Actually, we want to eventually remove them completely from mbstring, since PHP already contains other implementations of UUEncode, QPrint, Base64, and HTML entities. For more clarity on why we need to suppress UUEncode, etc. from being detected by mb_detect_encoding, the existing UUEncode implementation in mbstring never treats any input as erroneous. It just accepts everything. This means that it would always be treated as a valid choice by mb_detect_encoding, and would be returned in many, many cases where the input is obviously not UUEncoded. It turns out that the form of mb_convert_encoding where the user passes multiple candidate encodings (and mbstring auto-detects which one to use) was also affected by the same issue. Apply the same fix.	2021-12-20 22:09:33 +02:00
Christoph M. Becker	929d847152	Fix #81693 : mb_check_encoding(7bit) segfaults `php_mb_check_encoding()` now uses conversion to `mbfl_encoding_wchar`. Since `mbfl_encoding_7bit` has no `input_filter`, no filter can be found. Since we don't actually need to convert to wchar, we encode to 8bit. Closes GH-7712.	2021-12-03 22:49:47 +01:00
Alex Dowad	1a2c608053	Add unit tests for mb_detect_encoding on Polish text	2021-11-26 17:42:53 +02:00
Alex Dowad	d573054ebe	Enable encoding detection for Polish text Previously, some accented letters commonly used to write Polish text were counted as 'rare' codepoints. Treat them as 'common' instead. Thanks to Alec for pointing this out.	2021-11-25 11:10:47 +02:00
Christoph M. Becker	7fcf17c41e	Merge branch 'PHP-8.0' into PHP-8.1 * PHP-8.0: Fix #76167: mbstring may use pointer from some previous request	2021-10-25 12:41:21 +02:00
Christoph M. Becker	6e6a8443a8	Merge branch 'PHP-7.4' into PHP-8.0 * PHP-7.4: Fix #76167: mbstring may use pointer from some previous request	2021-10-25 12:39:57 +02:00
Christoph M. Becker	d3d6d7906e	Fix #76167 : mbstring may use pointer from some previous request We must not reuse per-request memory across multiple requests, so this check triggered during RINIT makes no sense. As explained in the bug report[1], it can be even harmful, if some request startup fails, and the pointers refer to already freed memory in the next request. [1] <https://bugs.php.net/76167> Closes GH-7604.	2021-10-25 12:37:28 +02:00
Alex Dowad	a2bc57e0e5	mb_detect_encoding will not return non-encodings Among the text encodings supported by mbstring are several which are not really 'text encodings'. These include Base64, QPrint, UUencode, HTML entities, '7 bit', and '8 bit'. Rather than providing an explicit list of text encodings which they are interested in, users may pass the output of mb_list_encodings to mb_detect_encoding. Since Base64, QPrint, and so on are included in the output of mb_list_encodings, mb_detect_encoding can return one of these as its 'detected encoding' (and in fact, this often happens). Before mb_detect_encoding was enhanced so it could detect any of the supported text encodings, this did not happen, and it is never desired.	2021-10-19 18:05:52 +02:00
Alex Dowad	28b346bc06	Improve detection accuracy of mb_detect_encoding Originally, `mb_detect_encoding` essentially just checked all candidate encodings to see which ones the input string was valid in. However, it was only able to do this for a limited few of all the text encodings which are officially supported by mbstring. In `3e7acf901d`, I modified it so it could 'detect' any text encoding supported by mbstring. While this is arguably an improvement, if the only text encodings one is interested in are those which `mb_detect_encoding` could originally handle, the old `mb_detect_encoding` may have been preferable. Because the new one has more possible encodings which it can guess, it also has more chances to get the answer wrong. This commit adjusts the detection heuristics to provide accurate detection in a wider variety of scenarios. While the previous detection code would frequently confuse UTF-32BE with UTF-32LE or UTF-16BE with UTF-16LE, the adjusted code is extremely accurate in those cases. Detection for Chinese text in Chinese encodings like GB18030 or BIG5 and for Japanese text in Japanese encodings like EUC-JP or SJIS is greatly improved. Detection of UTF-7 is also greatly improved. An 8KB table, with one bit for each codepoint from U+0000 up to U+FFFF, is used to achieve this. One significant constraint is that the heuristics are completely based on looking at each codepoint in a string in isolation, treating some codepoints as 'likely' and others as 'unlikely'. It might still be possible to achieve great gains in detection accuracy by looking at sequences of codepoints rather than individual codepoints. However, this might require huge tables. Further, we might need a huge corpus of text in various languages to derive those tables. Accuracy is still dismal when trying to distinguish single-byte encodings like ISO-8859-1, ISO-8859-2, KOI8-R, and so on. 
This is because the valid bytes in these encodings are basically all the same, and all valid bytes decode to 'likely' codepoints, so our method of detection (which is based on rating codepoints as likely or unlikely) cannot tell any difference between the candidates at all. It just selects the first encoding in the provided list of candidates. Speaking of which, if one wants to get good results from `mb_detect_encoding`, it is important to order the list of candidate encodings according to your prior belief of which are more likely to be correct. When the function cannot tell any difference between two candidates, it returns whichever appeared earlier in the array.	2021-10-19 18:05:51 +02:00
Nikita Popov	46315defc7	Use locale-independent case conversion in mb_send_mail() Headers should not be processed in a locale-dependent fashion. Switch from upper to lowercasing because that's the standard for PHP and we provide an ASCII implementation of this operation. This is adapted from GH-7506.	2021-09-23 17:20:54 +02:00
Alex Dowad	c25a1ef8d0	Bug #81390 : mb_detect_encoding should not prematurely stop processing input As a performance optimization, mb_detect_encoding tries to stop processing the input string early when there is only one 'candidate' encoding which the input string is valid in. However, the code which keeps count of how many candidate encodings have already been rejected was buggy. This caused mb_detect_encoding to prematurely stop processing the input when it should have continued. As a result, it did not notice that in the test case provided by Alec, the input string was not valid in UTF-16.	2021-09-20 11:21:39 +02:00
Alex Dowad	ca33ab59ad	mb_detect_encoding with only one candidate encoding uses mb_check_encoding ...Because it's about 5% faster.	2021-09-20 11:20:53 +02:00
Alex Dowad	6acd4f7f3a	Optimize text encoding detection for speed (eliminate Unicode property lookups) ...By just testing the input codepoints if they are within a few fixed ranges instead. This avoids hash lookups in property tables. From (micro-)benchmarking on my PC, this looks to be a bit less than 4x faster than the existing code.	2021-09-20 11:20:53 +02:00
Colin O'Dell	fe36b81d5e	Update Unicode tables to 14.0.0 Closes GH-7502.	2021-09-20 09:58:20 +02:00
Alex Dowad	df32267494	Add more tests for UTF7-IMAP text conversion	2021-08-31 13:41:34 +02:00
Alex Dowad	16a1e0a219	In UTF7-IMAP, reject the 2nd part of surrogate pair if it appears unexpectedly	2021-08-31 13:41:34 +02:00
Alex Dowad	355464935d	Add another test for UTF-7 text conversion	2021-08-31 13:41:34 +02:00
Alex Dowad	51b6c687db	Add another test for GB18030 text conversion	2021-08-31 13:41:34 +02:00
Alex Dowad	a0415b22ab	Add more tests for CP5022{0,1,2} text conversion	2021-08-31 13:41:34 +02:00
Alex Dowad	e3f6a9fbfe	CP5022{0,1,2} supports 'IBM extension' codes from ku 115-119 mbstring has always had the conversion tables to support CP932 codes in ku 115-119, and the conversion code for CP5022x has an 'if' clause specifically to handle such characters... but that 'if' clause was dead code, since a guard clause earlier in the same function prevented it from accepting 2-byte characters with a starting byte of 0x93-0x97. Adjust the guard clause so that these characters can be converted as the original author apparently intended. The code which handles ku 115-119 is the part which reads: } else if (s >= cp932ext3_ucs_table_min && s < cp932ext3_ucs_table_max) { w = cp932ext3_ucs_table[s - cp932ext3_ucs_table_min];	2021-08-31 13:41:34 +02:00
Alex Dowad	671dcee01e	Add test for mb_str_split on UCS-2 text	2021-08-31 13:41:34 +02:00
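
Commit d104481af8 above describes `mblen_table`, a lookup from the first byte of a character to its byte length. The following is a minimal C sketch of that idea, not the table or code from php-src. All names prefixed `sketch_` are hypothetical, and the lead-byte ranges (0x81-0x9F and 0xE0-0xFC as 2-byte leads) are a simplifying assumption for a generic SJIS-like encoding. It shows why the entry for 0x80 matters: with length 1, an erroneous 0x80 byte does not swallow the byte that follows it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified length table indexed by the first byte of a
 * character. This is an illustration, not the php-src table. */
unsigned char sketch_mblen_table[256];

/* Fill the table: every byte defaults to length 1, and only the assumed
 * 2-byte lead ranges get length 2. 0x80 and 0xFD-0xFF stay at 1. */
void sketch_init_mblen_table(void)
{
    memset(sketch_mblen_table, 1, sizeof(sketch_mblen_table));
    for (int b = 0x81; b <= 0x9F; b++) {
        sketch_mblen_table[b] = 2;
    }
    for (int b = 0xE0; b <= 0xFC; b++) {
        sketch_mblen_table[b] = 2;
    }
}

/* Count characters by hopping from lead byte to lead byte.
 * A truncated trailing character still counts as one character. */
size_t sketch_strlen(const unsigned char *s, size_t nbytes)
{
    assert(s != NULL);
    size_t chars = 0;
    size_t i = 0;
    while (i < nbytes) {
        i += sketch_mblen_table[s[i]];
        chars++;
    }
    return chars;
}
```

With this table, the two bytes 0x80 0x41 count as two characters. If position 0x80 held 2 instead, the same input would count as one, which is the kind of wrong length the commit describes.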

2162 commits
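
Commit 69543e6a10 in the log above explains why a plain truthiness check is wrong for functions that return -1 on error. A minimal C sketch of that pitfall follows. `sketch_emit`, `sketch_ok_truthy`, and `sketch_ok_positive` are hypothetical stand-ins, not mbstring functions, and the convention "positive on success, -1 on error" is assumed from the commit message.

```c
#include <stdbool.h>

/* Hypothetical encoder step: returns 1 on success and -1 on error,
 * mirroring the convention the commit message describes. */
int sketch_emit(int codepoint)
{
    return (codepoint >= 0 && codepoint <= 0x10FFFF) ? 1 : -1;
}

/* Buggy check: -1 is non-zero, so an error is treated as success. */
bool sketch_ok_truthy(int codepoint)
{
    return sketch_emit(codepoint) ? true : false;
}

/* Correct check: only a positive return value counts as success. */
bool sketch_ok_positive(int codepoint)
{
    return sketch_emit(codepoint) > 0;
}
```

For an out-of-range input such as -5, `sketch_ok_truthy` reports success while `sketch_ok_positive` reports failure.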