php-src/ext/mbstring
Alex Dowad 1f0cf133db Add fast mb_strcut implementation for UTF-8
The old implementation runs through the entire string to pick out the
part which should be returned by mb_strcut. This creates significant
performance overhead. The new specialized implementation of mb_strcut
for UTF-8 usually only examines a few bytes around the starting and
ending cut points, meaning it generally runs in constant time.

For UTF-8 strings just a few bytes long, the new implementation is
around 10% faster (according to microbenchmarks which I ran locally).
For strings around 10,000 bytes in length, it is 50-300x faster.
(Yes, that is 300x and not 300%.)

The new implementation behaves identically to the old one on VALID
UTF-8 strings; a fuzzer was used to help ensure this is the case.
On invalid UTF-8 strings, there is a difference: in some cases, the
old implementation will pass invalid byte sequences through unchanged,
while in others it will remove them. The new implementation has
behavior which is perhaps slightly more predictable: it simply backs
up the starting and ending cut points to the preceding "starter
byte" (one which is not a UTF-8 continuation byte).
2023-10-04 09:10:38 +02:00
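The boundary adjustment described in the commit message above can be sketched in C as follows. This is a minimal illustration, not the code from the commit: the names is_continuation, back_up_to_starter and utf8_cut_bounds are hypothetical, and measuring the ending cut point from the adjusted start is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* A UTF-8 continuation byte has the bit pattern 10xxxxxx. */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Step backward from pos until a starter byte (any non-continuation byte)
 * or the beginning of the buffer is reached. s[pos] must be readable. */
static size_t back_up_to_starter(const unsigned char *s, size_t pos)
{
    while (pos > 0 && is_continuation(s[pos])) {
        pos--;
    }
    return pos;
}

/* Hypothetical helper: compute the byte range [*out_start, *out_end) that a
 * cut of `len` bytes beginning at byte offset `from` would cover in the
 * n-byte buffer s. Only bytes near the two cut points are examined, so the
 * work does not grow with n.
 * Assumption of this sketch: the end point is start + len, measured from
 * the adjusted start, and is then backed up in the same way. */
static void utf8_cut_bounds(const unsigned char *s, size_t n,
                            size_t from, size_t len,
                            size_t *out_start, size_t *out_end)
{
    assert(s != NULL && out_start != NULL && out_end != NULL);

    if (from >= n) {
        /* Cut starts at or past the end: empty range. */
        *out_start = n;
        *out_end = n;
        return;
    }

    size_t start = back_up_to_starter(s, from);
    size_t end;
    if (len >= n - start) {
        /* Requested length reaches the end of the buffer. */
        end = n;
    } else {
        /* start + len < n here, so s[start + len] is readable. */
        end = back_up_to_starter(s, start + len);
    }

    *out_start = start;
    *out_end = end;
}
```

For the five bytes 61 E2 82 AC 62 ("a", the three-byte euro sign, "b"), a cut with from = 2 and len = 2 backs the start up to offset 1 and the end up to offset 1 as well, giving an empty range, while from = 1 and len = 3 gives the range [1, 4), exactly the euro sign.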
Name | Last commit | Date
libmbfl | Add fast mb_strcut implementation for UTF-8 | 2023-10-04 09:10:38 +02:00
tests | Add test cases for mb_strcut | 2023-10-04 09:10:25 +02:00
ucgendat | Optimize mb_str{,im}width for performance | 2021-09-29 18:19:01 +02:00
common_codepoints.txt | Improve mb_detect_encoding accuracy for text containing vowels with macrons | 2023-08-25 12:09:55 +02:00
config.m4 | Combine CJK encoding conversion code in a single source file | 2023-05-20 21:27:48 -07:00
config.w32 | Combine CJK encoding conversion code in a single source file | 2023-05-20 21:27:48 -07:00
CREDITS | |
gen_rare_cp_bitvec.php | Mark globals as const (#10303) | 2023-01-23 13:46:58 +00:00
mb_gpc.c | Take order of candidate encodings into account when guessing text encoding | 2023-05-16 07:01:07 -07:00
mb_gpc.h | Remove unused 'to_language' and 'from_language' struct fields | 2022-08-16 16:43:26 +02:00
mbstring.c | Add fast mb_strcut implementation for UTF-8 | 2023-10-04 09:10:38 +02:00
mbstring.h | Take order of candidate encodings into account when guessing text encoding | 2023-05-16 07:01:07 -07:00
mbstring.stub.php | [RFC] Implement mb_str_pad() (#11284) | 2023-06-20 21:22:04 +02:00
mbstring_arginfo.h | [RFC] Implement mb_str_pad() (#11284) | 2023-06-20 21:22:04 +02:00
php_mbregex.c | Reduce memory allocated by var_export, json_encode, serialize, and other (#8902) | 2022-07-08 14:47:46 +02:00
php_mbregex.h | Declare ext/mbstring constants in stubs (#8798) | 2022-06-23 17:34:08 +02:00
php_onig_compat.h | |
php_unicode.c | Implement conditional casing for Greek letter sigma when title-casing text | 2023-01-12 17:41:11 +02:00
php_unicode.h | Speed boost for mb_stripos (when not using UTF-8) | 2022-12-18 15:31:20 +02:00
rare_cp_bitvec.h | Improve mb_detect_encoding accuracy for text containing vowels with macrons | 2023-08-25 12:09:55 +02:00
unicode_data.h | Update Unicode tables to 14.0.0 | 2021-09-20 09:58:20 +02:00