Previously, mbstring used the same logic for encoding validation as for
encoding conversion.
However, there are cases where we want to use different logic for validation
and conversion. For example, if a string ends prematurely, missing input
required by the encoding, or if it contains a character that is invalid
in the encoding but can still be converted, conversion should succeed
while validation fails.
To achieve this, a function pointer mb_check_fn has been added to
struct mbfl_encoding to implement the logic used for validation.
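Roughly, the new hook looks like this (abbreviated sketch; the real
definition in mbfl_encoding.h has more fields):

  /* Returns true if the whole buffer is valid in this encoding; kept
     separate from the conversion path so validation can be stricter */
  typedef bool (*mb_check_fn)(unsigned char *in, size_t in_len);

  typedef struct _mbfl_encoding {
      /* ...name, aliases, flags, conversion function pointers... */
      mb_check_fn check; /* NULL: fall back to conversion-based checks */
  } mbfl_encoding;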
Validation logic has also been implemented for UTF-7, UTF7-IMAP,
ISO-2022-JP, and JIS.
Fixes GH-10627
The php_mb_convert_encoding() function can return NULL on error, but
this case was not handled, which led to a NULL pointer dereference and
hence a crash.
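Sketched (the argument list and error path here are illustrative, not
the exact call site):

  result = php_mb_convert_encoding(input, length, to_enc, from_enc, &result_len);
  if (result == NULL) {
      /* conversion failed; previously we fell through and dereferenced
         the NULL pointer */
      RETURN_FALSE;
  }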
Closes GH-10628
Signed-off-by: George Peter Banyard <girgias@php.net>
Commit 8bbd0952e5 added a check rejecting empty strings; in the
merge commit 379d9a1cfc, however, it was changed to a NULL check,
which did not make sense because ZSTR_VAL() is guaranteed to never
be NULL; the length check was accidentally removed by that merge commit.
This bug was found by GCC's -Waddress warning:
ext/mbstring/mbstring.c:748:27: warning: the comparison will always evaluate as ‘true’ for the address of ‘val’ will never be NULL [-Waddress]
  748 |         if (!new_value || !ZSTR_VAL(new_value)) {
      |                           ^
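The fix restores the original intent by testing the length instead,
roughly:

  /* ZSTR_VAL() is never NULL, but ZSTR_LEN() can be zero */
  if (!new_value || ZSTR_LEN(new_value) == 0) {
      return FAILURE;
  }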
Closes GH-10532
Signed-off-by: George Peter Banyard <girgias@php.net>
When this INI option is enabled, it reverts the line separator for
headers and message body to LF, which was the non-conformant behavior
in PHP 7.
This is done because some non-conformant MTAs fail to parse the CRLF
line separator in headers and body.
It is used by the mail() and mb_send_mail() functions.
The 'h' flag makes mb_convert_kana convert zenkaku hiragana to hankaku
katakana; 'k' makes it convert zenkaku katakana to hankaku katakana.
When working on the implementation of mb_convert_kana, I added some
additional checks to catch combinations of flags which do not make
sense; but there is no conflict between 'h' and 'k' (they control
conversions for two disjoint ranges of codepoints) and this combination
should not have been restricted.
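For reference, a self-contained sketch of why the ranges are disjoint
(block boundaries per Unicode; mbstring's internal tables may differ
in detail):

  #include <stdbool.h>
  #include <stdint.h>

  /* Zenkaku hiragana letters and zenkaku katakana letters occupy
     disjoint Unicode ranges, so 'h' and 'k' can never both apply to
     the same codepoint */
  static bool is_zenkaku_hiragana(uint32_t cp) { return cp >= 0x3041 && cp <= 0x3096; }
  static bool is_zenkaku_katakana(uint32_t cp) { return cp >= 0x30A1 && cp <= 0x30FA; }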
Thanks to the GitHub user 'akira345' for reporting this problem.
Closes GH-10174.
This new implementation of mb_strimwidth uses the new text
encoding conversion filters. Changes from the previous
implementation:
• mb_strimwidth allows a negative 'from' argument, which
should count backwards from the end of the string. However,
the implementation of this feature was buggy (starting right
from when it was first implemented).
It used the following code:
  if ((from < 0) || (width < 0)) {
      swidth = mbfl_strwidth(&string);
  }
  if (from < 0) {
      from += swidth;
  }
Do you see the bug? 'from' is a count of CODEPOINTS, but
'swidth' is a count of TERMINAL COLUMNS. Adding those two
together does not make sense. If there were no fullwidth
characters in the input string, then the two counts coincide
and the feature would work correctly. However, each
fullwidth character would throw the result off by one,
causing more characters to be skipped than was requested.
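A sketch of the corrected adjustment (simplified; mbfl_strlen is a
real libmbfl helper, but the surrounding details are not literal):

  if (from < 0) {
      /* adjust by the string length in CODEPOINTS, not columns */
      zend_long len = (zend_long)mbfl_strlen(&string);
      from = (from + len < 0) ? 0 : from + len;
  }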
• mb_strimwidth also allows a negative 'width' argument,
which again counts backwards from the end of the string;
in this case, it is not determining the START of the portion
which we want to extract, but rather, the END of that portion.
Perhaps unsurprisingly, this feature was also buggy.
Code:
  if (width < 0) {
      width = swidth + width - from;
  }
'swidth + width' is fine here; the problem is '- from'.
Again, that is subtracting a count of CODEPOINTS from a
count of TERMINAL COLUMNS. In this case, we really need
to count the terminal width of the string prefix skipped
over by 'from', and subtract that rather than the number
of codepoints which are being skipped.
As a result, if a 'from' count was passed along with a
negative 'width', for every fullwidth character in the
skipped prefix, the result of mb_strimwidth was one
terminal column wider than requested.
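Sketched, the correct computation is (abbreviated; mbfl_substr and
mbfl_strwidth are real libmbfl helpers, but the calls here are not
literal):

  if (width < 0) {
      /* subtract the terminal WIDTH of the prefix skipped by 'from',
         not the number of codepoints skipped */
      mbfl_string prefix;
      mbfl_substr(&string, &prefix, 0, from);
      width = swidth + width - mbfl_strwidth(&prefix);
  }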
Since these situations were covered by unit tests, you
might wonder why the bugs were not caught. Well, as far as
I can see, it looks like the author of the 'tests' just
captured the actual output of mb_strimwidth and defined it
as 'correct'. The tests were written in such a way that it
was difficult to examine them and see whether they made
sense or not; but a careful examination of the inputs and
outputs clearly shows that the legacy tests did not conform
to the documented contract of mb_strimwidth.
• The old implementation would always pass the input string
through decoding/encoding filters before returning it to
the caller, even if it fit within the specified width. This
means that invalid byte sequences would be converted to
error markers. For performance, the new implementation
returns the very same string which was passed in if it
does not exceed the specified width. This means that
erroneous byte sequences are not converted to error markers
unless it is necessary to trim the string.
• The same applies to the 'trim marker' string: if the trim marker
  was invalid in the specified encoding, the old implementation would
  likewise convert its erroneous byte sequences to error markers, but
  I have not imitated that behavior (for performance).
• The old implementation was buggy in the (unusual)
case that the trim marker is wider than the requested
maximum width of the result. It did an unsigned subtraction
of the requested width and the width of the trim marker. If the
width of the trim marker was greater, that subtraction would
underflow and yield a huge number. As a result, mb_strimwidth
would then pass the input string through, even if it was
far wider than the requested maximum width.
In that case, since the input string is wider than the
requested width, and NONE of it will fit together with the
trim marker, the new implementation returns just the trim
marker. This is the one case where the output can be wider
than the requested width: when BOTH the input string and
also the trim marker are too wide.
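The guard needed is roughly this (names illustrative):

  size_t marker_width = mbfl_strwidth(&marker);
  if (marker_width > (size_t)width) {
      /* nothing from the input fits alongside the marker; return just
         the marker rather than letting the subtraction underflow */
      RETURN_STR_COPY(marker_str);
  }
  size_t remaining = width - marker_width; /* safe: cannot underflow */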
• Since it passed the input string and trim marker through
decoding/encoding filters, when using "Quoted-Printable" as
the encoding, newlines could be inserted into the trim marker
to maintain the maximum line length for QP.
This is an extremely bizarre use case and I don't think there
is any point in worrying about it. QP will be removed from
mbstring in time, anyways.
PERFORMANCE:
• From micro-benchmarking with various input string lengths and
text encodings, it appears that the new implementation is 2-3x
faster for UTF-8 and UTF-16. For legacy Japanese text encodings
like ISO-2022-JP or SJIS, the new implementation is perhaps 25%
faster.
• Note that correctly implementing negative 'from' and 'width'
arguments imposes a small performance burden in such cases; one
which the old implementation did not pay. This slightly skews
benchmarking results in favor of the old implementation. However,
even so, the new implementation is faster in all cases which I
tested.
Passing `null` to `$encodings` is supposed to behave like passing the
result of `mb_detect_order()`. Therefore, we need to remove the non-
encodings from the `elist` in this case as well. Thus, we duplicate
the global `elist`, so we can modify it.
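Sketched (MBSTRG is the real mbstring globals accessor; the rest is
abbreviated):

  size_t size = MBSTRG(current_detect_order_list_size);
  const mbfl_encoding **elist = safe_emalloc(size, sizeof(*elist), 0);
  memcpy(elist, MBSTRG(current_detect_order_list), size * sizeof(*elist));
  /* ...drop non-encodings from the copy, run detection, efree(elist)... */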
Closes GH-9063.
mb_convert_kana now uses the new text encoding conversion
filters. Microbenchmarking shows speed gains of 50%-150%
across various text encodings and input string lengths.
The behavior is the same as the old mb_convert_kana
except for one fix: if the 'zero codepoint' U+0000 appeared
in the input, the old implementation would sometimes drop
it, not passing it through to the output. This is now
fixed.
Since mb_decode_numericentity does not require all HTML entities
to end with ';', but allows them to be terminated by ANY non-digit
character, it doesn't make sense that valid entities which butt
up against the end of the input string are not converted.
As it turned out, supporting this case also made it possible
to simplify the code nicely.
Thanks to Kamil Tieleka for suggesting that some of the behaviors of
the legacy implementation which the new mb_decode_numericentity
implementation took care to maintain were actually bugs and should
be fixed. Thanks also to Trevor Rowbotham for providing a link to
the HTML specification, showing how HTML numeric entities should
be interpreted.
mb_decode_numericentity now processes numeric entities in the
following situations where the old implementation would not:
- &<ENTITY> (for example, &A)
- &#<ENTITY>
- &#x<ENTITY>
- <VALID BUT UNTERMINATED DECIMAL ENTITY><ENTITY> (for example, AA)
- <VALID BUT UNTERMINATED HEX ENTITY><ENTITY>
- <INVALID AND UNTERMINATED DECIMAL ENTITY><ENTITY> (it does not matter why
the first entity is invalid; the value could be too big, it could have
too many digits, or it could not match the 'convmap' parameter)
- <INVALID AND UNTERMINATED HEX ENTITY><ENTITY>
This is consistent with the way that web browsers process
HTML entities.
This new implementation uses the new encoding conversion filters.
Aside from fewer LOC and (hopefully) improved readability,
the differences are as follows:
BEHAVIOR CHANGES:
- The old implementation used signed arithmetic when operating
on the 'convmap'. This meant that results could be surprising when
using convmap entries with 1 in the MSB. Further, types like 'int'
were used rather than those with a specific bit width, such as
'int32_t'. This meant that results could also depend on the
platform width of an 'int'.
Now unsigned arithmetic is used, with explicit bit widths.
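Sketched (the {start, end, offset, mask} convmap semantics are as
documented for mb_encode_numericentity; variable names are
illustrative):

  uint32_t lo   = (uint32_t)convmap[i];
  uint32_t hi   = (uint32_t)convmap[i+1];
  uint32_t off  = (uint32_t)convmap[i+2];
  uint32_t mask = (uint32_t)convmap[i+3];
  if (cp >= lo && cp <= hi) {
      value = (cp + off) & mask; /* all arithmetic in uint32_t */
  }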
- Similarly, while converting decimal numeric entities, the
legacy implementation would ensure that the value never overflowed
INT_MAX, and if it did, the entity would be treated as invalid
and passed through unconverted.
However, that again means that results depend on the platform
size of an 'int'. So now, we use a value with explicit bit width
(32 bits) to hold the value of a decoded decimal entity, and
ensure that the entity value does not overflow that.
Further, because we are using an UNSIGNED 32-bit value rather
than a signed one, the ceiling for how large a decimal entity
can be is higher now.
All of this will probably not affect anyone, since Unicode
codepoints above U+10FFFF are invalid anyways. To see the
difference, you need to be using a text encoding like UCS-4,
which allows huge 'codepoints'.
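The overflow check amounts to this (illustrative; the real code is
integrated into the conversion loop):

  uint32_t value = 0;
  while (p < end && *p >= '0' && *p <= '9') {
      uint32_t digit = *p - '0';
      if (value > (UINT32_MAX - digit) / 10) {
          goto invalid; /* too large for 32 bits; pass through unconverted */
      }
      value = (value * 10) + digit;
      p++;
  }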
- If it saw something which looked like a hex entity, but
turned out not to be a valid numeric entity, the old
implementation would sometimes convert the hexadecimal
digits a-f to A-F (uppercase). The new implementation passes
invalid numeric entities through without performing case
conversion.
- The old implementation of mb_encode_numericentity was
limited in how many decimal/hex digits it could emit.
If a text encoding like UCS-4 was in use, where 'codepoints'
can have huge values (larger than the valid range
stipulated by the Unicode standard), it would not error
out on a 'codepoint' whose value was too large for it,
but would rather mangle the value and emit a numeric
entity which decoded to some other random codepoint.
The new implementation is able to emit enough digits to
express any value which fits in 32 bits.
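For reference, a self-contained sketch of the digit budget (standard C,
not the actual emit code):

  #include <inttypes.h>
  #include <stdio.h>

  /* UINT32_MAX is 4294967295: at most 10 decimal digits, so
     "&#" + 10 digits + ";" + NUL fits in 14 bytes */
  static int emit_decimal_entity(char *buf, size_t size, uint32_t cp)
  {
      return snprintf(buf, size, "&#%" PRIu32 ";", cp);
  }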
PERFORMANCE:
Based on micro-benchmarks run on my development machine:
Decoding numeric HTML entities is about 4 times faster, for
both decimal and hexadecimal entities, across a variety of
input string lengths. Encoding is about 3 times faster.
Even for single-character strings, this is about 50% faster for
ASCII, UTF-8, and UTF-16. For long strings, the performance gain is
enormous, since the old code would convert the ENTIRE string, just
to pick out the first codepoint.
Benchmarking reveals that this is about 8% slower for UTF-8 strings
which have a bad codepoint at the very beginning of the string.
For good strings, or those where the first bad codepoint is much
later in the string, it is significantly faster (2-3 times faster
in many cases).
When converting text to/from wchars, mbstring makes one function call
for each and every byte or wchar to be converted. Typically, each of
these conversion functions contains a state machine, and its state has
to be restored and then saved for every single one of these calls.
It doesn't take much to see that this is grossly inefficient.
Instead of converting one byte or wchar on each call, the new
conversion functions will either fill up or drain a whole buffer of
wchars on each call. In benchmarks, this is about 3-10× faster.
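The shape of the new interface, sketched (modeled on the definitions
in mbfl_encoding.h, though abbreviated here):

  /* Convert up to bufsize wchars per call, advancing *in and *in_len
     and carrying decoder state in *state across calls; returns the
     number of wchars produced */
  size_t (*to_wchar)(unsigned char **in, size_t *in_len,
                     uint32_t *buf, size_t bufsize, unsigned int *state);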
Adding the new, faster conversion functions for all supported legacy
text encodings still needs some work. Also, all the code which uses
the old-style conversion functions needs to be converted to use the
new ones. After that, the old code can be dropped. (The mailparse
extension will also have to be fixed up so it will still compile.)
In a2bc57e0e5, mb_detect_encoding was modified to ensure it would never
return 'UUENCODE', 'QPrint', or other non-encodings as the "detected
text encoding". Before mb_detect_encoding was enhanced so that it could
detect any supported text encoding, those were never returned, and they
are not desired. Actually, we want to eventually remove them completely
from mbstring, since PHP already contains other implementations of
UUEncode, QPrint, Base64, and HTML entities.
For more clarity on why we need to suppress UUEncode, etc. from being
detected by mb_detect_encoding, the existing UUEncode implementation
in mbstring *never* treats any input as erroneous. It just accepts
everything. This means that it would *always* be treated as a valid
choice by mb_detect_encoding, and would be returned in many, many cases
where the input is obviously not UUEncoded.
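The suppression amounts to filtering the candidate list, roughly (the
predicate name here is hypothetical):

  size_t kept = 0;
  for (size_t i = 0; i < elist_size; i++) {
      if (!is_non_encoding(elist[i])) { /* hypothetical predicate */
          elist[kept++] = elist[i];
      }
  }
  elist_size = kept;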
It turns out that the form of mb_convert_encoding where the user passes
multiple candidate encodings (and mbstring auto-detects which one to
use) was also affected by the same issue. Apply the same fix.
The purpose of mbstring is for working with Unicode and legacy text
encodings; but Base64, QPrint, etc. are not text encodings and don't
really belong in mbstring. PHP already contains separate implementations
of Base64, QPrint, and HTML entities. It will be better to eventually
remove these non-encodings from mbstring.
Regarding HTML entities... there is a bit more to say. mbstring's
implementation of HTML entities is different from the other built-in
implementation (htmlspecialchars and htmlentities). Those functions
convert <, >, and & to HTML entities, but mbstring does not.
It appears that the original author of mbstring intended for something
to be done with <, >, and &. He used a table to identify which
characters should be converted to HTML entities, and </>/& all have a
special value in that table. However, nothing ever checks for that
special value, so the characters are passed through unconverted.
This seems like a very useless implementation of HTML entities. The most
important characters which need to be expressed as entities in HTML
documents are those three!
We must not reuse per-request memory across multiple requests, so this
check triggered during RINIT makes no sense. As explained in the bug
report[1], it can be even harmful, if some request startup fails, and
the pointers refer to already freed memory in the next request.
[1] <https://bugs.php.net/76167>
Closes GH-7604.
Among the text encodings supported by mbstring are several which are
not really 'text encodings'. These include Base64, QPrint, UUencode,
HTML entities, '7 bit', and '8 bit'.
Rather than providing an explicit list of text encodings which they are
interested in, users may pass the output of mb_list_encodings to
mb_detect_encoding. Since Base64, QPrint, and so on are included in
the output of mb_list_encodings, mb_detect_encoding can return one of
these as its 'detected encoding' (and in fact, this often happens).
Before mb_detect_encoding was enhanced so it could detect any of the
supported text encodings, this did not happen, and it is never desired.
mb_convert_kana is controlled by user-provided flags, which specify what it should convert
and to what. These flags come in inverse pairs, for example "fullwidth numerals to halfwidth
numerals" and "halfwidth numerals to fullwidth numerals". It does not make sense to combine
inverse flags.
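The kind of check added looks like this (the flag macro names are
hypothetical; the real constants live in mbstring's filter headers):

  /* 'n' converts zenkaku numerals to hankaku; 'N' converts hankaku
     numerals to zenkaku. Setting both is contradictory, so reject it. */
  if ((flags & KANA_ZEN2HAN_NUMERIC) && (flags & KANA_HAN2ZEN_NUMERIC)) {
      zend_argument_value_error(2, "must not combine 'n' and 'N' flags");
      RETURN_THROWS();
  }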
But, clever reader of commit logs, you will surely say: What if I want all my halfwidth
numerals to become fullwidth, and all my fullwidth numerals to become halfwidth? Much too
clever, you are! Let's put aside the fact that this bizarre switch-up is ridiculous and
will never be used, and face up to another stark reality: mb_convert_kana does not work
for that case, and never has. This was probably never noticed because nobody ever tried.
Disallowing useless combinations of flags gives freedom to rearrange the kana conversion
code without changing behavior.
We can also reject unrecognized flags. This may help users to catch bugs.
Interestingly, the existing tests used a 'Z' flag, which is useless (it's not recognized
at all).