php-src

mirror of https://github.com/php/php-src.git synced 2025-08-15 21:48:51 +02:00

Author	SHA1	Message	Date
Alex Dowad	3ab10da758	Take order of candidate encodings into account when guessing text encoding The documentation for mb_detect_encoding says that this function "Detects the most likely character encoding for string `string` from an ordered list of candidates". Prior to `28b346bc06`, mb_detect_encoding did not really attempt to determine the "most likely" text encoding for the input string. It would just return the first candidate encoding for which the string was valid. In `28b346bc06`, I amended this function so that it uses heuristics to try to guess which candidate encoding is "most likely". However, the caller did not have any way to indicate which candidate text encoding(s) they consider to be more likely, in case the heuristics applied are inconclusive. In the language of Bayesian probability, there was no way for the caller to indicate their 'prior' assignment of probabilities. Further, the documentation for mb_detect_encoding also says that the second parameter `encodings` is "a list of character encodings to try, in order". The documentation clearly implies that the order of the `encodings` argument should be significant. Therefore, amend mb_detect_encoding so that while it still uses heuristics to guess the most likely text encoding for the input string, it favors those which are earlier in the list of candidate encodings. One complication is that many callers of mb_detect_encoding use it in this way: mb_detect_encoding($string, mb_list_encodings()); In a majority of cases, this is bad code; mb_detect_encoding will both be much slower and the results will be less reliable than if a smaller list of candidates is used. However, since such code already exists and people are using it in production, we should not unnecessarily break it. The order of candidate encodings obviously does not express any prior belief of which candidates are more likely in this case, and treating it as if it did will degrade the accuracy of the result. Since mb_list_encodings now returns a single, immutable array on each call, we can avoid that problem by turning off the new behavior when we receive the array of encodings returned by mb_list_encodings. This implementation means that if the user does this: $a = mb_list_encodings(); mb_detect_encoding($string, $a); ...then the order of candidate encodings will not be considered. However, if the user explicitly initializes their own array of all supported legacy text encodings, then the order will be considered. The other functions which also follow this new behavior are: • mb_convert_variables • mb_convert_encoding (when multiple candidate input encodings are listed) Other places where "detection" (or really "guessing") of text encoding may be performed include: • mb_send_mail • Zend engine, when determining the encoding of a PHP script • mbstring processing of HTTP request contents, when http_input INI parameter is set to a list In these cases, the new logic based on order of candidate encodings is not enabled. It might be logical to consider the order of candidate encodings in some or all of these cases, but I'm not sure if that is true, so it seems wiser to avoid more behavior changes than is necessary. Further, ever since the new encoding detection heuristics were implemented in `28b346bc06`, we have not received any complaints of user code being broken in these areas. So I am reluctant to "fix what isn't broken". Well, some might say that applying the new detection heuristics to mb_send_mail, etc. in `28b346bc06` was "fixing what wasn't broken", but (cough cough) I don't have any comment on that...	2023-05-16 07:01:07 -07:00
Alex Dowad	97e29bed9e	Use shared, immutable array for return value of mb_list_encodings This will allow us to easily check in other mbstring functions if the list of all supported encodings, returned by mb_list_encodings, is passed in as input to another function. Co-authored-by: Ilija Tovilo <ilija.tovilo@me.com>	2023-05-16 07:01:07 -07:00
Alex Dowad	6df7557e43	mb_parse_str, mb_http_input, and mb_convert_variables use fast text conversion code for automatic encoding detection For mb_parse_str, when mbstring.http_input (INI parameter) is a list of multiple possible text encodings (which is not the case by default), this new implementation is about 25% faster. When mbstring.http_input is a single value, then nothing is changed. (No automatic encoding detection is done in that case.)	2023-04-12 19:57:52 +02:00
Alex Dowad	a9a672048b	Implement mb_output_handler using fast text conversion filters	2023-01-03 09:02:21 +02:00
Alex Dowad	0c0774f5b4	Use fast text conversion filters for mb_strpos, mb_stripos, mb_substr, etc This boosts the performance of mb_strpos, mb_stripos, mb_strrpos, mb_strripos, mb_strstr, mb_stristr, mb_strrchr, and mb_strrichr when used on non-UTF-8 strings. mb_substr is also faster. With UTF-8 input, there is no appreciable difference in performance for mb_strpos, mb_stripos, mb_strrpos, etc. This is expected, since the only real difference here (aside from shorter and simpler code) is that the new text conversion code is used when converting non-UTF-8 input strings to UTF-8. (This is done because internally, mb_strpos, etc. work only on UTF-8 text.) For ASCII, speed is boosted by 30-65%. For other legacy text encodings, the degree of performance improvement will depend on how slow the legacy conversion code was. One other minor, but notable difference is that strings encoded using UTF-8 variants from Japanese mobile vendors (SoftBank, KDDI, Docomo) will not undergo encoding conversion but will be processed "as is". It is expected that this will result in a large performance boost for such input strings; but realistically, the number of users who work with such strings is probably minute. I was not originally planning to include mb_substr in this commit, but fuzzing of the reimplemented mb_strstr revealed that mb_substr needed to be reimplemented, too; using the old mbfl_substr, which was based on the old text conversion filters, in combination with functions which use the new text conversion filters caused bugs. The performance boost for mb_substr varies from 10%-500%, depending on the encoding and input string used.	2022-12-12 16:28:49 +02:00
Alex Dowad	d0d834429f	Cache UTF-8-validity status of strings in GC flags The PCRE extension is already doing this. The flag is set when a string is determined to be valid UTF-8, and cleared in zend_string_forget_hash_val. We might as well make good use of it in mbstring as well. This should result in a negligible slowdown for non-UTF-8 strings, bad UTF-8 strings, and good UTF-8 strings which are checked only once. However, when microbenchmarking this change using a variety of text encodings and string lengths, I found that in most of these cases, the 'new' code was a few percent faster. In a couple of cases, the 'old' code was a few percent faster. This was not a result of sampling error, since I could reproduce these test results repeatedly, and even when running a large number of iterations. Both the new and old code were compiled with -O3 -march=native. My (unproved) hypothesis is that although the new code appears to only add one more conditional branch, the compiler may emit slightly different code from before (perhaps with different register allocation and so on), and this may cause some cases to run slightly faster and others to run slightly slower. I have not disassembled the old and new binaries to see if an examination of the emitted assembly code would support this hypothesis. For good UTF-8 strings which are checked repeatedly, the speedup is about 40% even for strings 1-5 bytes in length. For ~100 byte strings, it is ~700%, and for ~10000 byte strings, it is ~80000%. I tried fuzzing MBString's php_mb_check_encoding function and pcre2lib's valid_utf function to see if I could find any cases where their output would be different. After running the fuzzer for a couple of minutes, it had tried more than 1 million test cases without finding any where the output was different. Therefore, it appears that MBString's UTF-8 validation is compatible with PCRE's.	2022-11-15 19:14:35 +02:00
Alex Dowad	3ce888a837	Use uint32_t for 'illegal_substchar' codepoint in mbstring This value is a wchar, so the best type for it is uint32_t.	2022-10-05 10:02:02 +09:00
Alex Dowad	492021168d	php_mb_convert_encoding{,_ex} returns zend_string That's what all existing callers want anyways. This avoids 2 unnecessary copies of the converted string.	2022-05-28 21:53:39 +02:00
Alex Dowad	d3f56e5ac9	Rename php_mb_mbchar_bytes_ex to php_mb_mbchar_bytes ...And remove the original php_mb_mbchar_bytes, which was not being used.	2021-09-29 18:19:01 +02:00
Alex Dowad	abf83e5079	Rename php_mb_safe_strrchr_ex to php_mb_safe_strrchr ...And remove the original php_mb_safe_strrchr, which was not being used anywhere.	2021-09-29 18:19:01 +02:00
KsaR	01b3fc03c3	Update http->https in license (#6945) 1. Update: http://www.php.net/license/3_01.txt to https, as there is anyway server header "Location:" to https. 2. Update few license 3.0 to 3.01 as 3.0 states "php 5.1.1, 4.1.1, and earlier". 3. In some license comments is "at through the world-wide-web" while most is without "at", so deleted. 4. fixed indentation in some files before |	2021-05-06 12:16:35 +02:00
Nikita Popov	3e01f5afb1	Replace zend_bool uses with bool We're starting to see a mix between uses of zend_bool and bool. Replace all usages with the standard bool type everywhere. Of course, zend_bool is retained as an alias.	2021-01-15 12:33:06 +01:00
Alex Dowad	7eddcabe2b	Don't guard mbstring code with #ifdef HAVE_MBSTRING This is just a very silly feature of mbstring -- you can compile the source files with HAVE_MBSTRING undefined, and it will all just compile to (almost) nothing. What is the use of this? Why compile the source files and link against them if you don't want the mbstring extension? It doesn't make any kind of sense.	2020-08-31 23:18:13 +02:00
Alex Dowad	62317d592f	Remove redundant includes from mbstring (and make sure correct config.h is used) Very interesting... it turns out that when Valgrind support was enabled, `#include "config.h"` from within mbstring was actually including the file "config.h" from Valgrind, and not the one from mbstring!! This is because -I/usr/include/valgrind was added to the compiler invocation _before_ -Iext/mbstring/libmbfl. Make sure we actually include the file which was intended.	2020-08-31 23:17:58 +02:00
George Peter Banyard	68164f40ce	Fix [-Wundef] warning in MBString extension	2020-05-16 15:31:20 +02:00
Máté Kocsis	21cfa03f17	Generate function entries for another batch of extensions Closes GH-5352	2020-04-05 21:15:30 +02:00
Nikita Popov	11f0e1d1cb	Move encoding fetching out of php_mb_convert_encoding()	2020-03-31 21:47:55 +02:00
Nikita Popov	7cea789cfc	Parse mb_convert_encoding() encodings only once Instead of re-parsing them for every converted value. Also reuse the generic parse_array() helper.	2020-03-30 14:54:15 +02:00
Nikita Popov	ed850f2723	Move encoding fetching outside php_mb_stripos()	2020-03-30 12:29:11 +02:00
Nikita Popov	7db3a51884	Only fetch to_encoding once in mb_convert_encoding() Instead of doing it on every conversion. This is both more efficient and avoids generating multiple warnings.	2020-01-28 15:12:24 +01:00
Nikita Popov	21e631e473	Merge branch 'PHP-7.4'	2019-10-06 10:07:57 +02:00
Nikita Popov	6623e7ac51	Add support for mbstring.regex_retry_limit This is very similar to the existing mbstring.regex_stack_limit, but for backtracking. The default value matches pcre.backtrack_limit. Only used on libonig >= 2.8.0.	2019-10-06 10:06:33 +02:00
Gabriel Caruso	5d6e923d46	Remove mention of PHP major version in Copyright headers Closes GH-4732.	2019-09-25 14:51:43 +02:00
Nikita Popov	1d53d6df7e	Merge branch 'PHP-7.4'	2019-04-17 14:06:05 +02:00
Nikita Popov	f73f190c3f	Fix internal_encoding fallback in mbstring By introducing a hook that is called whenever one of internal_encoding / input_encoding / output_encoding changes, so that mbstring can adjust its internal state. This also makes internal_encoding work with zend multibyte.	2019-04-17 14:05:53 +02:00
Stanislav Malyshev	63e0c22037	Merge branch 'PHP-7.4' * PHP-7.4: Unfortunately, travis CI has old oniguruma library Update NEWS & UPGRADING Add fallbacks for older oniguruma versions Add mbstring.regex_stack_limit to php.ini-* Implement RF bug #72777 - ensure stack limits on mbstring functions.	2019-04-01 00:32:49 -07:00
Stanislav Malyshev	077ce33aa9	Merge branch 'PHP-7.3' into PHP-7.4 * PHP-7.3: Update NEWS & UPGRADING Add fallbacks for older oniguruma versions Add mbstring.regex_stack_limit to php.ini-* Implement RF bug #72777 - ensure stack limits on mbstring functions.	2019-04-01 00:05:36 -07:00
Yasuo Ohgaki	738016bd88	Implement RF bug #72777 - ensure stack limits on mbstring functions. The patch creates new config: mbstring.regex_stack_limit, which defaults to 100000.	2019-03-28 00:31:57 -07:00
Nikita Popov	e683c189f2	Merge branch 'PHP-7.4'	2019-02-12 16:43:34 +01:00
legale	d77ad27415	Implement mb_str_split() RFC: https://wiki.php.net/rfc/mb_str_split	2019-02-12 16:42:51 +01:00
Peter Kokot	623911f993	Merge branch 'PHP-7.4' * PHP-7.4: Remove local variables	2019-02-03 21:23:18 +01:00
Peter Kokot	92ac598aab	Remove local variables This patch removes the so called local variables defined per file basis for certain editors to properly show tab width, and similar settings. These are mainly used by Vim and Emacs editors yet with recent changes the once working definitions don't work anymore in Vim without custom plugins or additional configuration. Neither are these settings synced across the PHP code base. A simpler and better approach is EditorConfig and fixing code using some code style fixing tools in the future instead. This patch also removes the so called modelines for Vim. Modelines allow Vim editor specifically to set some editor configuration such as syntax highlighting, indentation style and tab width to be set in the first line or the last 5 lines per file basis. Since the php test files have syntax highlighting already set in most editors properly and EditorConfig takes care of the indentation settings, this patch removes these as well for the Vim 6.0 and newer versions. With the removal of local variables for certain editors such as Emacs and Vim, the footer is also probably not needed anymore when creating extensions using ext_skel.php script. Additionally, Vim modelines for setting php syntax and some editor settings has been removed from some *.phpt files. All these are mostly not relevant for phpt files neither work properly in the middle of the file.	2019-02-03 21:03:00 +01:00
Zeev Suraski	0cf7de1c70	Remove yearly range from copyright notice	2019-01-30 11:03:12 +02:00
Zeev Suraski	38c337f22e	Remove year range from copyright notice	2019-01-30 11:00:23 +02:00
Nikita Popov	331e56ce38	Remove mbstring.func_overload Deprecated in PHP 7.2 as part of https://wiki.php.net/rfc/deprecations_php_7_2.	2019-01-28 15:58:23 +01:00
Nikita Popov	24085b187a	Remove unused prototype in mbstring Reported by legale.	2019-01-24 16:11:35 +01:00
Nikita Popov	a7d6b2c1fb	Use zend_string for mbstring last encoding cache Saves us a string duplication, as well as case-insensitive string comparisons for the likely case of an interned string encoding.	2018-10-29 20:29:22 +01:00
Peter Kokot	8d3f8ca12a	Remove unused Git attributes ident The $Id$ keywords were used in Subversion where they can be substituted with filename, last revision number change, last changed date, and last user who changed it. In Git this functionality is different and can be done with Git attribute ident. These need to be defined manually for each file in the .gitattributes file and are afterwards replaced with 40-character hexadecimal blob object name which is based only on the particular file contents. This patch simplifies handling of $Id$ keywords by removing them since they are not used anymore.	2018-07-25 00:53:25 +02:00
Xinchen Hui	a6519d0514	year++	2018-01-02 12:57:58 +08:00
Peter Kokot	a57de26c3d	Refactor mbstring READMEs	2017-10-08 17:51:02 +02:00
Anatol Belski	13a2629005	size_t fixes	2017-07-25 19:03:33 +02:00
Anatol Belski	ea83b69883	Adjust datatypes and reorder which saves 8 bytes on 64-bit	2017-07-23 16:37:30 +02:00
Anatol Belski	4fbd7ccba2	touch yet more places for datatypes	2017-07-23 00:47:24 +02:00
Anatol Belski	e0825ec60f	Mitigation for ssize_t issue in `22a5f554a8` and some more	2017-07-22 22:34:16 +02:00
Nikita Popov	ba383b8239	Add basic mbstring encoding cache Store the last used encoding and compare against it. It's quite likely that an application is going to be using the same encoding again and again. The actual mbfl_name2encoding() function could also be optimized to use a hash lookup rather than a linear scan, but we don't have a hashtable implementation in libmbfl...	2017-07-20 13:58:40 +02:00
Nikita Popov	adaea77593	Switch libmbfl to use size_t Switch mbfl_string and related structures to use size_t lengths. Quite likely that I broke some things along the way...	2017-07-20 13:58:40 +02:00
Nikita Popov	dead4f0b1b	Avoid unnecessary encoding lookups in mbstring Extract part of php_mb_convert_encoding that does the actual work and use it whenever we already know the encoding.	2017-07-19 23:59:42 +02:00
Sammy Kaye Powers	9e29f841ce	Update copyright headers to 2017	2017-01-02 09:30:12 -06:00
Anatol Belski	b204b3abd1	further normalizations, uint vs uint32_t fix merge mistake yet one more replacement run	2016-11-26 17:29:01 +01:00
Yasuo Ohgaki	8ad4ef98b6	pull-request/1099 Request #65081 - implementing mb_scrub	2016-08-10 14:09:48 +09:00


161 commits