php-src

mirror of https://github.com/php/php-src.git synced 2025-08-16 14:08:47 +02:00

Author	SHA1	Message	Date
Alex Dowad	3f12d26e3a	Merge branch 'PHP-8.1' * PHP-8.1: Error handling for UTF-8 complies with WHATWG specification	2022-04-16 20:32:12 +02:00
Alex Dowad	04e59c916f	Error handling for UTF-8 complies with WHATWG specification In `7502c86342`, I adjusted the number of error markers emitted on invalid UTF-8 text to be more consistent with mbstring's behavior on other text encodings (generally, it emits one error marker for one unexpected byte). I didn't expect that anybody would actually care one way or the other, but felt that it was better to be consistent than not. Later, Martin Auswöger kindly pointed out that the WHATWG encoding specification, which governs how various text encodings are handled by web browsers, does actually specify how many error markers should be generated for any given piece of invalid UTF-8 text. Until now, we have never really paid much attention to the WHATWG specification, but we do want to comply with as many relevant specifications as possible. And since PHP is commonly used for web applications, compatibility with the behavior of web browsers is obviously a good thing.	2022-04-16 15:04:38 +02:00
Christoph M. Becker	20c0eb47df	Merge branch 'PHP-8.1' * PHP-8.1: Fix GH-8208: mb_encode_mimeheader: $indent functionality broken	2022-03-17 17:35:06 +01:00
Christoph M. Becker	5003831260	Merge branch 'PHP-8.0' into PHP-8.1 * PHP-8.0: Fix GH-8208: mb_encode_mimeheader: $indent functionality broken	2022-03-17 17:34:31 +01:00
Christoph M. Becker	d0417ebc93	Fix GH-8208: mb_encode_mimeheader: $indent functionality broken We also need to factor in the indent, when getting the encoder result. Closes GH-8213.	2022-03-17 17:31:58 +01:00
Alex Dowad	ff76694f28	Merge branch 'PHP-8.1' * PHP-8.1: mb_check_encoding($str, '7bit') rejects strings with bytes over 0x7F	2022-02-22 23:58:57 +02:00
Alex Dowad	8a8533d263	mb_check_encoding($str, '7bit') rejects strings with bytes over 0x7F This was the old behavior of mb_check_encoding() before `3e7acf901d`, but yours truly broke it. If only we had more thorough tests at that time, this might not have slipped through the cracks. Thanks to divinity76 for the report.	2022-02-22 23:56:56 +02:00
Christoph M. Becker	58cbee1ce3	Merge branch 'PHP-8.1' * PHP-8.1: Fix GH-7902: mb_send_mail may delimit headers with LF only	2022-01-18 13:11:01 +01:00
Christoph M. Becker	69f6b09b2a	Merge branch 'PHP-8.0' into PHP-8.1 * PHP-8.0: Fix GH-7902: mb_send_mail may delimit headers with LF only	2022-01-18 13:09:52 +01:00
Christoph M. Becker	03816fba46	Fix GH-7902: mb_send_mail may delimit headers with LF only Email headers are supposed to be separated with CRLF. Period. We introduce a `CRLF` macro for better comprehensibility right away. Closes GH-7907.	2022-01-18 13:08:08 +01:00
Dmitry Stogov	e1782c08bf	Fix ASAN undefined behavior (unsigned char << 24) ext/mbstring/libmbfl/filters/mbfilter_utf32.c:259:20: runtime error: left shift of 128 by 24 places cannot be represented in type 'int'	2022-01-11 09:13:22 +03:00
Christoph M. Becker	51eec5086f	Run mb_send_mail tests on Windows, too We use the run-tests.php `{MAIL}` abstraction instead of `cat`. Closes GH-7908.	2022-01-07 22:46:02 +01:00
Alex Dowad	53ffba967c	Implement fast text conversion interface for CP5022{0,1,2}	2021-12-26 22:19:51 +02:00
Alex Dowad	01afd9f141	Implement fast text conversion interface for JIS	2021-12-26 22:19:51 +02:00
Alex Dowad	cb4626c5b2	Implement fast text conversion interface for GB18030	2021-12-26 22:19:51 +02:00
Alex Dowad	3e8088dc80	Implement fast text conversion interface for EUC-JP-MS	2021-12-26 22:19:51 +02:00
Alex Dowad	e5af94b74f	Implement fast text conversion interface for CP51932	2021-12-26 22:19:51 +02:00
Alex Dowad	6ef1b35223	Implement fast text conversion interface for EUC-CN	2021-12-26 22:19:51 +02:00
Alex Dowad	9bd08a97d9	Implement fast text conversion interface for EUC-TW	2021-12-26 22:19:51 +02:00
Alex Dowad	661a10160b	Implement fast text conversion interface for CP936	2021-12-26 22:19:51 +02:00
Alex Dowad	20555371d5	Implement fast text conversion interface for CP932	2021-12-26 22:19:51 +02:00
Alex Dowad	43bb97c539	Implement fast text conversion interface for EUC-KR	2021-12-26 22:19:51 +02:00
Alex Dowad	c0936d48b0	Implement fast text conversion interface for UHC	2021-12-26 22:19:51 +02:00
Alex Dowad	40809cb19f	Implement fast text conversion interface for HZ	2021-12-26 22:19:51 +02:00
Alex Dowad	da58d42d94	Implement fast text conversion interface for CP950	2021-12-26 22:19:51 +02:00
Alex Dowad	eac50a360f	Implement fast text conversion interface for Big5	2021-12-26 22:19:51 +02:00
Alex Dowad	3c73225125	New internal interface for fast text conversion in mbstring When converting text to/from wchars, mbstring makes one function call for each and every byte or wchar to be converted. Typically, each of these conversion functions contains a state machine, and its state has to be restored and then saved for every single one of these calls. It doesn't take much to see that this is grossly inefficient. Instead of converting one byte or wchar on each call, the new conversion functions will either fill up or drain a whole buffer of wchars on each call. In benchmarks, this is about 3-10× faster. Adding the new, faster conversion functions for all supported legacy text encodings still needs some work. Also, all the code which uses the old-style conversion functions needs to be converted to use the new ones. After that, the old code can be dropped. (The mailparse extension will also have to be fixed up so it will still compile.)	2021-12-21 08:33:11 +02:00
Alex Dowad	edc6b756c1	Merge branch 'PHP-8.1' * PHP-8.1: mb_convert_encoding will not auto-detect input string as UUEncode, Base64, QPrint	2021-12-20 22:47:18 +02:00
Alex Dowad	f07c193583	mb_convert_encoding will not auto-detect input string as UUEncode, Base64, QPrint In `a2bc57e0e5`, mb_detect_encoding was modified to ensure it would never return 'UUENCODE', 'QPrint', or other non-encodings as the "detected text encoding". Before mb_detect_encoding was enhanced so that it could detect any supported text encoding, those were never returned, and they are not desired. Actually, we want to eventually remove them completely from mbstring, since PHP already contains other implementations of UUEncode, QPrint, Base64, and HTML entities. For more clarity on why we need to suppress UUEncode, etc. from being detected by mb_detect_encoding, the existing UUEncode implementation in mbstring never treats any input as erroneous. It just accepts everything. This means that it would always be treated as a valid choice by mb_detect_encoding, and would be returned in many, many cases where the input is obviously not UUEncoded. It turns out that the form of mb_convert_encoding where the user passes multiple candidate encodings (and mbstring auto-detects which one to use) was also affected by the same issue. Apply the same fix.	2021-12-20 22:09:33 +02:00
Christoph M. Becker	97f78b3bb7	Merge branch 'PHP-8.1' * PHP-8.1: Fix #81693: mb_check_encoding(7bit) segfaults	2021-12-03 22:50:27 +01:00
Christoph M. Becker	929d847152	Fix #81693: mb_check_encoding(7bit) segfaults `php_mb_check_encoding()` now uses conversion to `mbfl_encoding_wchar`. Since `mbfl_encoding_7bit` has no `input_filter`, no filter can be found. Since we don't actually need to convert to wchar, we encode to 8bit. Closes GH-7712.	2021-12-03 22:49:47 +01:00
Alex Dowad	ee3caef8eb	Merge branch 'PHP-8.1' * PHP-8.1: Add unit tests for mb_detect_encoding on Polish text	2021-11-26 17:43:40 +02:00
Alex Dowad	1a2c608053	Add unit tests for mb_detect_encoding on Polish text	2021-11-26 17:42:53 +02:00
Remi Collet	7c0f2b4dc0	Merge branch 'PHP-8.1' * PHP-8.1: add missing cond. Enable encoding detection for Polish text	2021-11-25 10:16:34 +01:00
Alex Dowad	d573054ebe	Enable encoding detection for Polish text Previously, some accented letters commonly used to write Polish text were counted as 'rare' codepoints. Treat them as 'common' instead. Thanks to Alec for pointing this out.	2021-11-25 11:10:47 +02:00
Dmitry Stogov	1a4f49f1fe	Use cheaper memchr() instead of php_memnstr()	2021-11-10 10:19:49 +03:00
Alex Dowad	9308974f8c	Deprecate use of mbstring to convert text to Base64/QPrint/HTML entities/etc The purpose of mbstring is for working with Unicode and legacy text encodings; but Base64, QPrint, etc. are not text encodings and don't really belong in mbstring. PHP already contains separate implementations of Base64, QPrint, and HTML entities. It will be better to eventually remove these non-encodings from mbstring. Regarding HTML entities... there is a bit more to say. mbstring's implementation of HTML entities is different from the other built-in implementation (htmlspecialchars and htmlentities). Those functions convert <, >, and & to HTML entities, but mbstring does not. It appears that the original author of mbstring intended for something to be done with <, >, and &. He used a table to identify which characters should be converted to HTML entities, and </>/& all have a special value in that table. However, nothing ever checks for that special value, so the characters are passed through unconverted. This seems like a very useless implementation of HTML entities. The most important characters which need to be expressed as entities in HTML documents are those three!	2021-11-01 11:23:21 +02:00
Christoph M. Becker	7c75c61206	Merge branch 'PHP-8.1' * PHP-8.1: Fix #76167: mbstring may use pointer from some previous request	2021-10-25 12:41:46 +02:00
Christoph M. Becker	7fcf17c41e	Merge branch 'PHP-8.0' into PHP-8.1 * PHP-8.0: Fix #76167: mbstring may use pointer from some previous request	2021-10-25 12:41:21 +02:00
Christoph M. Becker	6e6a8443a8	Merge branch 'PHP-7.4' into PHP-8.0 * PHP-7.4: Fix #76167: mbstring may use pointer from some previous request	2021-10-25 12:39:57 +02:00
Christoph M. Becker	d3d6d7906e	Fix #76167: mbstring may use pointer from some previous request We must not reuse per-request memory across multiple requests, so this check triggered during RINIT makes no sense. As explained in the bug report[1], it can be even harmful, if some request startup fails, and the pointers refer to already freed memory in the next request. [1] <https://bugs.php.net/76167> Closes GH-7604.	2021-10-25 12:37:28 +02:00
Alex Dowad	9962aa9774	Merge branch 'PHP-8.1' * PHP-8.1: mb_detect_encoding will not return non-encodings Improve detection accuracy of mb_detect_encoding	2021-10-19 18:11:35 +02:00
Alex Dowad	a2bc57e0e5	mb_detect_encoding will not return non-encodings Among the text encodings supported by mbstring are several which are not really 'text encodings'. These include Base64, QPrint, UUencode, HTML entities, '7 bit', and '8 bit'. Rather than providing an explicit list of text encodings which they are interested in, users may pass the output of mb_list_encodings to mb_detect_encoding. Since Base64, QPrint, and so on are included in the output of mb_list_encodings, mb_detect_encoding can return one of these as its 'detected encoding' (and in fact, this often happens). Before mb_detect_encoding was enhanced so it could detect any of the supported text encodings, this did not happen, and it is never desired.	2021-10-19 18:05:52 +02:00
Alex Dowad	28b346bc06	Improve detection accuracy of mb_detect_encoding Originally, `mb_detect_encoding` essentially just checked all candidate encodings to see which ones the input string was valid in. However, it was only able to do this for a limited few of all the text encodings which are officially supported by mbstring. In `3e7acf901d`, I modified it so it could 'detect' any text encoding supported by mbstring. While this is arguably an improvement, if the only text encodings one is interested in are those which `mb_detect_encoding` could originally handle, the old `mb_detect_encoding` may have been preferable. Because the new one has more possible encodings which it can guess, it also has more chances to get the answer wrong. This commit adjusts the detection heuristics to provide accurate detection in a wider variety of scenarios. While the previous detection code would frequently confuse UTF-32BE with UTF-32LE or UTF-16BE with UTF-16LE, the adjusted code is extremely accurate in those cases. Detection for Chinese text in Chinese encodings like GB18030 or BIG5 and for Japanese text in Japanese encodings like EUC-JP or SJIS is greatly improved. Detection of UTF-7 is also greatly improved. An 8KB table, with one bit for each codepoint from U+0000 up to U+FFFF, is used to achieve this. One significant constraint is that the heuristics are completely based on looking at each codepoint in a string in isolation, treating some codepoints as 'likely' and others as 'unlikely'. It might still be possible to achieve great gains in detection accuracy by looking at sequences of codepoints rather than individual codepoints. However, this might require huge tables. Further, we might need a huge corpus of text in various languages to derive those tables. Accuracy is still dismal when trying to distinguish single-byte encodings like ISO-8859-1, ISO-8859-2, KOI8-R, and so on. This is because the valid bytes in these encodings are basically all the same, and all valid bytes decode to 'likely' codepoints, so our method of detection (which is based on rating codepoints as likely or unlikely) cannot tell any difference between the candidates at all. It just selects the first encoding in the provided list of candidates. Speaking of which, if one wants to get good results from `mb_detect_encoding`, it is important to order the list of candidate encodings according to your prior belief of which are more likely to be correct. When the function cannot tell any difference between two candidates, it returns whichever appeared earlier in the array.	2021-10-19 18:05:51 +02:00
Alex Dowad	dcaa010fff	Strict validation of conversion flags to mb_convert_kana mb_convert_kana is controlled by user-provided flags, which specify what it should convert and to what. These flags come in inverse pairs, for example "fullwidth numerals to halfwidth numerals" and "halfwidth numerals to fullwidth numerals". It does not make sense to combine inverse flags. But, clever reader of commit logs, you will surely say: What if I want all my halfwidth numerals to become fullwidth, and all my fullwidth numerals to become halfwidth? Much too clever, you are! Let's put aside the fact that this bizarre switch-up is ridiculous and will never be used, and face up to another stark reality: mb_convert_kana does not work for that case, and never has. This was probably never noticed because nobody ever tried. Disallowing useless combinations of flags gives freedom to rearrange the kana conversion code without changing behavior. We can also reject unrecognized flags. This may help users to catch bugs. Interestingly, the existing tests used a 'Z' flag, which is useless (it's not recognized at all).	2021-10-01 19:27:39 +02:00
Alex Dowad	7800491289	Inline SKIP_LONG_HEADER... macro which is only used once I don't find that pulling this code out into a macro makes anything clearer. Not at all.	2021-09-29 18:19:01 +02:00
Alex Dowad	0b32a15eb0	Optimize mb_str{,im}width for performance Rather than doing a linear search of a table of fullwidth codepoint ranges for every input character, 1) Short-cut the search if the codepoint is below the first such range 2) Otherwise, do a binary (rather than linear) search	2021-09-29 18:19:01 +02:00
Alex Dowad	f4365d2c26	Remove unused typedef 'mbfl_encoding_id'	2021-09-29 18:19:01 +02:00
Alex Dowad	3bf431969e	Don't check for impossible error condition in mb_substr_count	2021-09-29 18:19:01 +02:00
Alex Dowad	8c32deb605	Don't check for impossible error condition in mb_strwidth	2021-09-29 18:19:01 +02:00
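
Note on `0b32a15eb0` (Optimize mb_str{,im}width for performance): the two steps named in that message, an early exit for codepoints below the first fullwidth range and a binary search otherwise, might look roughly like the sketch below. This is not the code from the commit. The range table is a heavily abbreviated, illustrative stand-in, and the names `cp_range`, `fullwidth_ranges`, `is_fullwidth`, and `display_width` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, abbreviated table of fullwidth codepoint ranges.
 * It must be sorted by `first` and non-overlapping for the search to work.
 * The real mbstring table is larger and is not reproduced here. */
struct cp_range { uint32_t first, last; };

static const struct cp_range fullwidth_ranges[] = {
    { 0x1100, 0x115F },
    { 0x3000, 0x303E },
    { 0x3041, 0x33FF },
    { 0x4E00, 0x9FFF },
    { 0xAC00, 0xD7A3 },
    { 0xFF01, 0xFF60 },
};

#define N_RANGES (sizeof(fullwidth_ranges) / sizeof(fullwidth_ranges[0]))

/* Returns 1 if cp falls inside one of the ranges above, else 0. */
int is_fullwidth(uint32_t cp)
{
    /* Step 1: short-cut. Anything below the first range (e.g. ASCII)
     * cannot match, so skip the search entirely. */
    if (cp < fullwidth_ranges[0].first)
        return 0;

    /* Step 2: binary search over the half-open interval [lo, hi). */
    size_t lo = 0, hi = N_RANGES;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < fullwidth_ranges[mid].first)
            hi = mid;
        else if (cp > fullwidth_ranges[mid].last)
            lo = mid + 1;
        else
            return 1;
    }
    return 0;
}

/* Display width of a codepoint array: 2 columns per fullwidth
 * codepoint, 1 column for everything else. */
size_t display_width(const uint32_t *cps, size_t len)
{
    assert(cps != NULL || len == 0);
    size_t width = 0;
    for (size_t i = 0; i < len; i++)
        width += is_fullwidth(cps[i]) ? 2 : 1;
    return width;
}
```

With this sketch, ASCII input never enters the loop, and other codepoints need at most about log2(N) comparisons instead of a scan of all N ranges.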


2207 commits