php-src

mirror of https://github.com/php/php-src.git synced 2025-08-17 14:38:49 +02:00

Author	SHA1	Message	Date
Remi Collet	2eb2f9d74f	Fix GH-8685 mbstring requires pcre	2022-06-03 07:53:48 +02:00
Alex Dowad	492021168d	php_mb_convert_encoding{,_ex} returns zend_string That's what all existing callers want anyways. This avoids 2 unnecessary copies of the converted string.	2022-05-28 21:53:39 +02:00
Alex Dowad	0154a5ac9f	Use fast text conversion filters to implement php_mb_convert_encoding_ex	2022-05-28 21:53:38 +02:00
Christoph M. Becker	58cbee1ce3	Merge branch 'PHP-8.1' * PHP-8.1: Fix GH-7902: mb_send_mail may delimit headers with LF only	2022-01-18 13:11:01 +01:00
Christoph M. Becker	69f6b09b2a	Merge branch 'PHP-8.0' into PHP-8.1 * PHP-8.0: Fix GH-7902: mb_send_mail may delimit headers with LF only	2022-01-18 13:09:52 +01:00
Christoph M. Becker	03816fba46	Fix GH-7902: mb_send_mail may delimit headers with LF only Email headers are supposed to be separated with CRLF. Period. We introduce a `CRLF` macro for better comprehensibility right away. Closes GH-7907.	2022-01-18 13:08:08 +01:00
Alex Dowad	3c73225125	New internal interface for fast text conversion in mbstring When converting text to/from wchars, mbstring makes one function call for each and every byte or wchar to be converted. Typically, each of these conversion functions contains a state machine, and its state has to be restored and then saved for every single one of these calls. It doesn't take much to see that this is grossly inefficient. Instead of converting one byte or wchar on each call, the new conversion functions will either fill up or drain a whole buffer of wchars on each call. In benchmarks, this is about 3-10× faster. Adding the new, faster conversion functions for all supported legacy text encodings still needs some work. Also, all the code which uses the old-style conversion functions needs to be converted to use the new ones. After that, the old code can be dropped. (The mailparse extension will also have to be fixed up so it will still compile.)	2021-12-21 08:33:11 +02:00
Alex Dowad	edc6b756c1	Merge branch 'PHP-8.1' * PHP-8.1: mb_convert_encoding will not auto-detect input string as UUEncode, Base64, QPrint	2021-12-20 22:47:18 +02:00
Alex Dowad	f07c193583	mb_convert_encoding will not auto-detect input string as UUEncode, Base64, QPrint In `a2bc57e0e5`, mb_detect_encoding was modified to ensure it would never return 'UUENCODE', 'QPrint', or other non-encodings as the "detected text encoding". Before mb_detect_encoding was enhanced so that it could detect any supported text encoding, those were never returned, and they are not desired. Actually, we want to eventually remove them completely from mbstring, since PHP already contains other implementations of UUEncode, QPrint, Base64, and HTML entities. For more clarity on why we need to suppress UUEncode, etc. from being detected by mb_detect_encoding, the existing UUEncode implementation in mbstring never treats any input as erroneous. It just accepts everything. This means that it would always be treated as a valid choice by mb_detect_encoding, and would be returned in many, many cases where the input is obviously not UUEncoded. It turns out that the form of mb_convert_encoding where the user passes multiple candidate encodings (and mbstring auto-detects which one to use) was also affected by the same issue. Apply the same fix.	2021-12-20 22:09:33 +02:00
Dmitry Stogov	1a4f49f1fe	Use cheaper memchr() instead of php_memnstr()	2021-11-10 10:19:49 +03:00
Alex Dowad	9308974f8c	Deprecate use of mbstring to convert text to Base64/QPrint/HTML entities/etc The purpose of mbstring is for working with Unicode and legacy text encodings; but Base64, QPrint, etc. are not text encodings and don't really belong in mbstring. PHP already contains separate implementations of Base64, QPrint, and HTML entities. It will be better to eventually remove these non-encodings from mbstring. Regarding HTML entities... there is a bit more to say. mbstring's implementation of HTML entities is different from the other built-in implementation (htmlspecialchars and htmlentities). Those functions convert <, >, and & to HTML entities, but mbstring does not. It appears that the original author of mbstring intended for something to be done with <, >, and &. He used a table to identify which characters should be converted to HTML entities, and </>/& all have a special value in that table. However, nothing ever checks for that special value, so the characters are passed through unconverted. This seems like a very useless implementation of HTML entities. The most important characters which need to be expressed as entities in HTML documents are those three!	2021-11-01 11:23:21 +02:00
Christoph M. Becker	7c75c61206	Merge branch 'PHP-8.1' * PHP-8.1: Fix #76167: mbstring may use pointer from some previous request	2021-10-25 12:41:46 +02:00
Christoph M. Becker	7fcf17c41e	Merge branch 'PHP-8.0' into PHP-8.1 * PHP-8.0: Fix #76167: mbstring may use pointer from some previous request	2021-10-25 12:41:21 +02:00
Christoph M. Becker	6e6a8443a8	Merge branch 'PHP-7.4' into PHP-8.0 * PHP-7.4: Fix #76167: mbstring may use pointer from some previous request	2021-10-25 12:39:57 +02:00
Christoph M. Becker	d3d6d7906e	Fix #76167: mbstring may use pointer from some previous request We must not reuse per-request memory across multiple requests, so this check triggered during RINIT makes no sense. As explained in the bug report[1], it can be even harmful, if some request startup fails, and the pointers refer to already freed memory in the next request. [1] <https://bugs.php.net/76167> Closes GH-7604.	2021-10-25 12:37:28 +02:00
Alex Dowad	9962aa9774	Merge branch 'PHP-8.1' * PHP-8.1: mb_detect_encoding will not return non-encodings Improve detection accuracy of mb_detect_encoding	2021-10-19 18:11:35 +02:00
Alex Dowad	a2bc57e0e5	mb_detect_encoding will not return non-encodings Among the text encodings supported by mbstring are several which are not really 'text encodings'. These include Base64, QPrint, UUencode, HTML entities, '7 bit', and '8 bit'. Rather than providing an explicit list of text encodings which they are interested in, users may pass the output of mb_list_encodings to mb_detect_encoding. Since Base64, QPrint, and so on are included in the output of mb_list_encodings, mb_detect_encoding can return one of these as its 'detected encoding' (and in fact, this often happens). Before mb_detect_encoding was enhanced so it could detect any of the supported text encodings, this did not happen, and it is never desired.	2021-10-19 18:05:52 +02:00
Alex Dowad	dcaa010fff	Strict validation of conversion flags to mb_convert_kana mb_convert_kana is controlled by user-provided flags, which specify what it should convert and to what. These flags come in inverse pairs, for example "fullwidth numerals to halfwidth numerals" and "halfwidth numerals to fullwidth numerals". It does not make sense to combine inverse flags. But, clever reader of commit logs, you will surely say: What if I want all my halfwidth numerals to become fullwidth, and all my fullwidth numerals to become halfwidth? Much too clever, you are! Let's put aside the fact that this bizarre switch-up is ridiculous and will never be used, and face up to another stark reality: mb_convert_kana does not work for that case, and never has. This was probably never noticed because nobody ever tried. Disallowing useless combinations of flags gives freedom to rearrange the kana conversion code without changing behavior. We can also reject unrecognized flags. This may help users to catch bugs. Interestingly, the existing tests used a 'Z' flag, which is useless (it's not recognized at all).	2021-10-01 19:27:39 +02:00
Alex Dowad	7800491289	Inline SKIP_LONG_HEADER... macro which is only used once I don't find that pulling this code out into a macro makes anything clearer. Not at all.	2021-09-29 18:19:01 +02:00
Alex Dowad	8c32deb605	Don't check for impossible error condition in mb_strwidth	2021-09-29 18:19:01 +02:00
Alex Dowad	bf78070cbe	Don't check for impossible error condition in mb_strlen	2021-09-29 18:19:01 +02:00
Alex Dowad	d3f56e5ac9	Rename php_mb_mbchar_bytes_ex to php_mb_mbchar_bytes ...And remove the original php_mb_mbchar_bytes, which was not being used.	2021-09-29 18:19:01 +02:00
Alex Dowad	774cd960ab	No need to null-terminate buffer in php_mb_chr `mbfl_buffer_converter_feed_result` will not overrun the specified length.	2021-09-29 18:19:01 +02:00
Alex Dowad	abf83e5079	Rename php_mb_safe_strrchr_ex to php_mb_safe_strrchr ...And remove the original php_mb_safe_strrchr, which was not being used anywhere.	2021-09-29 18:19:01 +02:00
Nikita Popov	c37b35fa41	Merge branch 'PHP-8.1' * PHP-8.1: Use locale-independent case conversion in mb_send_mail()	2021-09-23 17:21:14 +02:00
Nikita Popov	46315defc7	Use locale-independent case conversion in mb_send_mail() Headers should not be processed in a locale-dependent fashion. Switch from upper to lowercasing because that's the standard for PHP and we provide an ASCII implementation of this operation. This is adapted from GH-7506.	2021-09-23 17:20:54 +02:00
Alex Dowad	36c979e2b6	Use stack-allocated buffer in php_mb_chr	2021-09-20 11:27:54 +02:00
Alex Dowad	1170981b33	Fix mb_str_split on empty strings in variable-length text encodings Previously, when passed an empty string, and given an encoding which uses a variable number of bytes per character (and which doesn't have a 'character length table'), mb_str_split would return an array containing a single empty string, rather than an empty array. The ISO-2022 encodings are among those which were affected by this bug.	2021-09-20 11:27:54 +02:00
Alex Dowad	f663344f33	Merge branch 'PHP-8.1' * PHP-8.1: Bug #81390: mb_detect_encoding should not prematurely stop processing input mb_detect_encoding with only one candidate encoding uses mb_check_encoding Optimize text encoding detection for speed (eliminate Unicode property lookups)	2021-09-20 11:27:07 +02:00
Alex Dowad	ca33ab59ad	mb_detect_encoding with only one candidate encoding uses mb_check_encoding ...Because it's about 5% faster.	2021-09-20 11:20:53 +02:00
Alex Dowad	9e1447dbf3	Rename KANA2HIRA and HIRA2KANA constants (for mb_convert_kana) mb_convert_kana is able to convert fullwidth katakana to fullwidth hiragana (and vice versa). The constants referring to these modes had names like MBFL_FILT_TL_ZEN2HAN_KANA2HIRA. The "ZEN2HAN" part of the name is misleading, since these modes do not convert fullwidth (zenkaku) kana to halfwidth (hankaku). The converted characters are fullwidth both before and after the conversion. So... let's name the constants accordingly.	2021-09-06 13:16:23 +02:00
Alex Dowad	776296e12f	mbstring no longer provides 'long' substitutions for erroneous input bytes Previously, mbstring had a special mode whereby it would convert erroneous input byte sequences to output like "BAD+XXXX", where "XXXX" would be the erroneous bytes expressed in hexadecimal. This mode could be enabled by calling `mb_substitute_character("long")`. However, accurately reproducing input byte sequences from the cached state of a conversion filter is often tricky, and this significantly complicates the implementation. Further, the means used for passing the erroneous bytes through to where the "BAD+XXXX" text is generated only allows for up to 3 bytes to be passed, meaning that some erroneous byte sequences are truncated anyways. More to the point, a search of publicly available PHP code indicates that nobody is really using this feature anyways. Incidentally, this feature also provided error output like "JIS+XXXX" if the input 'should have' represented a JISX 0208 codepoint, but it decodes to a codepoint which does not exist in the JISX 0208 charset. Similarly, specific error output was provided for non-existent JISX 0212 codepoints, and likewise for JISX 0213, CP932, and a few other charsets. All of that is now consigned to the flames. However, "long" error markers also include a somewhat more useful "U+XXXX" marker for Unicode codepoints which were successfully decoded from the input text, but cannot be represented in the output encoding. Those are still supported. With this change, there is no need to use a variety of special values in the high bits of a wchar to represent different types of error values. We can (and will) just use a single error value. This will be equal to -1. One complicating factor: Text conversion functions return an integer to indicate whether the conversion operation should be immediately aborted, and the magic 'abort' marker is -1. Also, almost all of these functions would return the received byte/codepoint to indicate success. That doesn't work with the new error value; if an input filter detects an error and passes -1 to the output filter, and the output filter returns it back, that would be taken to mean 'abort'. Therefore, amend all these functions to return 0 for success.	2021-08-31 13:41:34 +02:00
Nikita Popov	639015845f	Deprecate calling mb_check_encoding() without argument Part of https://wiki.php.net/rfc/deprecations_php_8_1.	2021-07-08 15:34:49 +02:00
George Peter Banyard	e7135cb817	Use zend_string_equals_* API in a couple more places Closes GH-6979	2021-05-14 13:45:17 +01:00
George Peter Banyard	aca6aefd85	Remove 'register' type qualifier (#6980) The compiler should be smart enough to optimize this on its own	2021-05-14 13:38:01 +01:00
KsaR	01b3fc03c3	Update http->https in license (#6945) 1. Update: http://www.php.net/license/3_01.txt to https, as there is anyway server header "Location:" to https. 2. Update few license 3.0 to 3.01 as 3.0 states "php 5.1.1, 4.1.1, and earlier". 3. In some license comments is "at through the world-wide-web" while most is without "at", so deleted. 4. fixed indentation in some files before |	2021-05-06 12:16:35 +02:00
Christoph M. Becker	592cfa309e	Merge branch 'PHP-8.0' * PHP-8.0: Fix #81011: mb_convert_encoding removes references from arrays	2021-05-04 18:40:23 +02:00
Christoph M. Becker	d1c0cbdcb1	Merge branch 'PHP-7.4' into PHP-8.0 * PHP-7.4: Fix #81011: mb_convert_encoding removes references from arrays	2021-05-04 18:39:39 +02:00
Christoph M. Becker	0cafd53d18	Fix #81011: mb_convert_encoding removes references from arrays We need to dereference references. Closes GH-6938.	2021-05-04 18:37:40 +02:00
George Peter Banyard	09efad615b	Use zend_string_equals_(literal_)ci() API more often Also drive-by usage of zend_ini_parse_bool() Closes GH-6844	2021-04-09 02:34:50 +01:00
George Peter Banyard	5caaf40b43	Introduce pseudo-keyword ZEND_FALLTHROUGH And use it instead of comments	2021-04-07 00:46:29 +01:00
Alex Dowad	a06c20a17c	Remove useless constant MBFL_ENCTYPE_MBCS This flag indicated that an encoding was 'multi-byte'; it can use a variable number of bytes to encode each character. As it turns out, we don't actually need to check this flag anywhere, so it's better to remove it.	2021-01-15 21:55:41 +02:00
Nikita Popov	3e01f5afb1	Replace zend_bool uses with bool We're starting to see a mix between uses of zend_bool and bool. Replace all usages with the standard bool type everywhere. Of course, zend_bool is retained as an alias.	2021-01-15 12:33:06 +01:00
Alex Dowad	72660c416a	Combine MBFL_ENCTYPE_WCS{2,4}{BE,LE} constants These flags identify text encodings in mbstring which use a constant number of bytes per character. While some parts of the code do use these flags, usually to detect cases which can be optimized due to constant-width encoding, nothing cares whether the encodings are 'LE' (little-endian) or 'BE' (big-endian). So we can simplify things by combining constants.	2020-11-25 19:52:19 +02:00
Alex Dowad	e169ad3b61	Consolidate all single-byte encodings in one source file We can squeeze out a lot of duplicated code in this way.	2020-11-11 11:18:59 +02:00
Alex Dowad	3e7acf901d	Remove mbstring identify filters mbstring had an 'identify filter' for almost every supported text encoding which was used when auto-detecting the most likely encoding for a string. It would run over the string and set a 'flag' if it saw anything which did not appear likely to be the encoding in question. One problem with this scheme was that encodings which merely appeared less likely to be the correct one were completely rejected, even if there was no better candidate. Another problem was that the 'identify filters' had a huge amount of code duplication with the 'conversion filters'. Eliminate the identify filters. Instead, when auto-detecting text encoding, use conversion filters to see whether the input string is valid in candidate encodings or not. At the same time, watch the type of codepoints which the string decodes to and mark it as less likely if non-printable characters (ESC, form feed, bell, etc.) or 'private use area' codepoints are seen. Interestingly, one old test case in which JIS text was misidentified as UTF-8 (and this wrong behavior was enshrined in the test) was 'fixed' and the JIS string is now auto-detected as JIS.	2020-11-09 13:45:17 +02:00
Alex Dowad	be1a215538	Optimize (AND FIX) mb_check_encoding (cut execution time by 50%+) Previously, `mb_check_encoding` did an awful lot of unneeded work. In order to determine whether a string was valid or not, it would convert the whole string into wchar (code points), which required dynamically allocating a (potentially large) buffer. Then it would turn right around and convert that big 'ol buffer of code points back to the original encoding again. Finally, it would check whether any invalid bytes were detected during that long and onerous process. The thing is, mbstring _already_ has machinery for detecting whether a string is valid in a certain encoding or not, and it doesn't require copying any data around or allocating buffers. Better yet, it can fail fast when an invalid byte is found. Why not use it? It's sure a lot faster! Further, the legacy code was also badly broken. Why? Because aside from checking whether illegal characters were detected, it would also check whether the conversion to and from wchars was lossless. But, some encodings have more than one valid encoding for the same character. In such cases, it is not possible to make the conversion to and from wchars lossless for every valid character. So `mb_check_encoding` would actually reject good strings in a lot of encodings!	2020-11-02 21:31:06 +02:00
Alex Dowad	7dc16374b4	Remove unused IS_SJIS1 and IS_SJIS2 macros	2020-10-14 08:31:51 +02:00
Nikita Popov	4371a4b241	Merge branch 'PHP-8.0' * PHP-8.0: Fix incorrect zpp parameter count in mb_substr() / mb_strcut()	2020-10-13 17:47:11 +02:00
Nikita Popov	9b4094c3d7	Fix incorrect zpp parameter count in mb_substr() / mb_strcut() These functions only accept 4 params.	2020-10-13 17:46:56 +02:00

906 commits