php-src

mirror of https://github.com/php/php-src.git synced 2025-08-16 05:58:45 +02:00

Author	SHA1	Message	Date
Alex Dowad	39b46a5398	Implement Unicode conditional casing rules for Greek letter sigma The capital Greek letter sigma (Σ) should be lowercased as σ except when it appears at the end of a word; in that case, it should be lowercased as the special form ς. This rule is included in the Unicode data file SpecialCasing.txt. The condition for applying the rule is called "Final_Sigma" and is defined in Unicode technical report 21. The rule is: • For the special casing form to apply, the capital letter sigma must be preceded by 0 or more "case-ignorable" characters, preceded by at least 1 "cased" character. • Further, capital sigma must NOT be followed by 0 or more case-ignorable characters and then at least 1 cased character. "Case-ignorable" characters include certain punctuation marks, like the apostrophe, as well as various accent marks. There are actually close to 500 different case-ignorable characters, including accent marks from Cyrillic, Hebrew, Armenian, Arabic, Syriac, Bengali, Gujarati, Telugu, Tibetan, and many other alphabets. This category also includes zero-width spaces, codepoints which indicate RTL/LTR text direction, certain musical symbols, etc. Since the rule involves scanning over "0 or more" of such case-ignorable characters, it may be necessary to scan arbitrarily far to the left and right of capital sigma to determine whether the special lowercase form should be used or not. However, since we are trying to be both memory-efficient and CPU-efficient, this implementation limits how far to the left we will scan. Generally, we scan up to 63 characters to the left looking for a "cased" character, but not more. When scanning to the right, we go up to the end of the string if necessary, even if it means scanning over thousands of characters. Anyways, it is almost impossible to imagine that natural text will include "words" with more than 63 successive apostrophes (for example) followed by a capital sigma. Closes GH-8096.	2023-01-12 17:41:11 +02:00
Alex Dowad	4427b2e1ab	Mark UTF-8 strings emitted by mbstring functions as valid UTF-8 We now have a couple of mbstring functions which have fast paths for strings marked as 'valid UTF-8'. Later, we may likely have more. So that these fast paths can be used more frequently, mark UTF-8 strings emitted by mbstring as 'valid UTF-8'. This is always a correct thing to do, because mbstring never returns invalid UTF-8 as the result of a conversion (or similar) operation. Internally, we do have a conversion mode which deliberately emits invalid UTF-8 in some cases. (This is done to prevent unwanted matches when we are converting strings to UTF-8 before performing matching operations on them.) For such strings, don't set the 'valid UTF-8' flag. It probably wouldn't hurt anything to set it, because strings generated using that special conversion mode should never be returned to userland, and I don't think we do anything with them which cares about the IS_STR_VALID_UTF8 flag... but still, it would likely cause confusion for developers.	2023-01-11 17:08:27 +02:00
Alex Dowad	744ca16e73	Speed boost for mb_stripos (when not using UTF-8) Instead of case-folding a string and then converting it to UTF-8 as a separate operation, why not convert it to UTF-8 at the same time as we fold case? For non-UTF-8 encodings, this typically makes mb_stripos about 2x faster.	2022-12-18 15:31:20 +02:00
Alex Dowad	3ce888a837	Use uint32_t for 'illegal_substchar' codepoint in mbstring This value is a wchar, so the best type for it is uint32_t.	2022-10-05 10:02:02 +09:00
Alex Dowad	20769fb9ab	Make enum for valid case_mode values (for php_unicode_convert_case)	2022-10-05 10:02:02 +09:00
Alex Dowad	7eef2fb45e	Use fast text conversion filters for mb_convert_case, mb_strtoupper, mb_strtolower Speed increase is only about 50% for title casing, but 2-3x for other forms of case conversion.	2022-10-05 10:02:02 +09:00
Alex Dowad	4e51810f9b	Optimize mbstring upper/lowercasing: use fast path in more cases The 'fast path' in the uppercase/lowercase functions for Unicode text can be used for a slightly greater range of characters. This is not expected to have a big impact on performance, since the number of characters which will use the 'fast path' is only increased by about 50-60, and these are not very commonly used characters... but still, it doesn't cost anything.	2021-09-20 11:27:54 +02:00
Alex Dowad	a312620607	Remove redundant NULL checks in mbstring Whoever originally wrote mbstring seems to have a deathly fear of NULL pointers lurking behind every corner. A common pattern is that one function will check if a pointer is NULL, then pass it to another function, which will again check if it is NULL, then pass to yet another function, which will yet again check if it is NULL... it's NULL checks all the way down. Remove all the NULL checks in places where pointers could not possibly be NULL.	2021-09-06 13:16:23 +02:00
Nikita Popov	d2073179e3	Return bool from php_unicode_is_prop()	2021-08-24 19:21:21 +02:00
Nikita Popov	3be94217f4	Don't use sentinel value for unicode property lookup 0xffff was used to mark character properties without any members. This made the code unnecessarily complicated, because we need to check for 0xffff values when looking up the property ranges. We can simply encode this as an empty set of ranges.	2021-08-24 15:53:43 +02:00
Patrick Allaert	aff365871a	Fixed some spaces used instead of tabs	2021-06-29 11:30:26 +02:00
KsaR	01b3fc03c3	Update http->https in license (#6945) 1. Update: http://www.php.net/license/3_01.txt to https, as there is anyway server header "Location:" to https. 2. Update few license 3.0 to 3.01 as 3.0 states "php 5.1.1, 4.1.1, and earlier". 3. In some license comments is "at through the world-wide-web" while most is without "at", so deleted. 4. fixed indentation in some files before |	2021-05-06 12:16:35 +02:00
Alex Dowad	7eddcabe2b	Don't guard mbstring code with #ifdef HAVE_MBSTRING This is just a very silly feature of mbstring -- you can compile the source files with HAVE_MBSTRING undefined, and it will all just compile to (almost) nothing. What is the use of this? Why compile the source files and link against them if you don't want the mbstring extension? It doesn't make any kind of sense.	2020-08-31 23:18:13 +02:00
Alex Dowad	62317d592f	Remove redundant includes from mbstring (and make sure correct config.h is used) Very interesting... it turns out that when Valgrind support was enabled, `#include "config.h"` from within mbstring was actually including the file "config.h" from Valgrind, and not the one from mbstring!! This is because -I/usr/include/valgrind was added to the compiler invocation _before_ -Iext/mbstring/libmbfl. Make sure we actually include the file which was intended.	2020-08-31 23:17:58 +02:00
Alex Dowad	ea3f0ee0b9	Optimize php_unicode_convert_case (cuts mbstring case conversion time ~15%) This function uses various subfunctions to convert case of Unicode wchars. Previously, these subfunctions would store the case-converted characters in a buffer, and the parent function would then pass them (byte by byte) to the next filter in the filter chain. Rather than passing around that buffer, it's better for the subfunctions to directly pass the case-converted bytes to the next filter in the filter chain. This speeds things up nicely.	2020-08-31 23:17:25 +02:00
George Peter Banyard	68164f40ce	Fix [-Wundef] warning in MBString extension	2020-05-16 15:31:20 +02:00
Christoph M. Becker	ebdaeb8572	Fix #79371: mb_strtolower (UTF-32LE): stack-buffer-overflow We make sure that negative values are properly compared.	2020-03-16 22:42:15 -07:00
Gabriel Caruso	5d6e923d46	Remove mention of PHP major version in Copyright headers Closes GH-4732.	2019-09-25 14:51:43 +02:00
Nikita Popov	8e8d129d7f	Use EMPTY_SWITCH_DEFAULT_CASE in php_unicode.c Avoids a potentially uninitialized variable warning.	2019-04-12 10:26:11 +02:00
Peter Kokot	92ac598aab	Remove local variables This patch removes the so called local variables defined per file basis for certain editors to properly show tab width, and similar settings. These are mainly used by Vim and Emacs editors yet with recent changes the once working definitions don't work anymore in Vim without custom plugins or additional configuration. Neither are these settings synced across the PHP code base. A simpler and better approach is EditorConfig and fixing code using some code style fixing tools in the future instead. This patch also removes the so called modelines for Vim. Modelines allow Vim editor specifically to set some editor configuration such as syntax highlighting, indentation style and tab width to be set in the first line or the last 5 lines per file basis. Since the php test files have syntax highlighting already set in most editors properly and EditorConfig takes care of the indentation settings, this patch removes these as well for the Vim 6.0 and newer versions. With the removal of local variables for certain editors such as Emacs and Vim, the footer is also probably not needed anymore when creating extensions using ext_skel.php script. Additionally, Vim modelines for setting php syntax and some editor settings has been removed from some *.phpt files. All these are mostly not relevant for phpt files neither work properly in the middle of the file.	2019-02-03 21:03:00 +01:00
Zeev Suraski	0cf7de1c70	Remove yearly range from copyright notice	2019-01-30 11:03:12 +02:00
Nikita Popov	9d63f4dec1	Fixed bug #76319 While at it, also make sure that mbstring case conversion takes into account the specified substitution character and substitution mode.	2018-05-25 11:33:13 +02:00
Xinchen Hui	a6519d0514	year++	2018-01-02 12:57:58 +08:00
Anatol Belski	f9c3ee9ae8	fix c89 compat	2017-07-28 22:18:51 +02:00
Nikita Popov	f4a1d9c821	Fixed bug #65544 and #71298	2017-07-28 14:57:08 +02:00
Nikita Popov	582a65b06f	Implement full case mapping Implement full case mapping according to SpecialCasing.txt and also full case folding according to CaseFolding.txt (F). There are a number of caveats: * Only language-agnostic and unconditional full case mapping is implemented. The only language-agnostic conditional case mapping rule relates to Greek sigma in final position (Final_Sigma). Correctly handling this requires both arbitrary lookahead and lookbehind, which would require some larger changes to how the case mapping is implemented. This is a possible future extension. * The only language-specific handling that is implemented is for Turkish dotted/undotted Is, if the ISO-8859-9 encoding is used. This matches the previous behavior and makes sure that no codepoints not supported by the encoding are produced. A future extension would be to also handle the Turkish mappings specified by SpecialCasing.txt based on the mbfl internal language. * Full case folding is implemented, but case-insensitive mb_* operations continue to use simple case folding. The reason is that full case folding of the haystack string may change the position at which a match occurred. This would have to be mapped back into the position in the original string. * mb_convert_case() exposes both the full and the simple case mapping / folding, where full is the default. The constants are: * MB_CASE_LOWER (used by mb_strtolower) * MB_CASE_UPPER (used by mb_strtoupper) * MB_CASE_TITLE * MB_CASE_FOLD * MB_CASE_LOWER_SIMPLE * MB_CASE_UPPER_SIMPLE * MB_CASE_TITLE_SIMPLE * MB_CASE_FOLD_SIMPLE (used by case-insensitive operations)	2017-07-28 12:32:50 +02:00
Nikita Popov	9ac7c1e71d	Use case-folding for case insensitive comparisons Instead of using lowercasing.	2017-07-28 12:32:50 +02:00
Nikita Popov	80a0601fe5	Use MPH for case maps Instead of performing a binary search, use a hashtable to store the case maps. In particular a minimal perfect hash construction is used, which does not require collision resolution (but does use an auxiliary table for the hash perturbation).	2017-07-28 12:32:50 +02:00
Nikita Popov	3c6b2512cb	Change layout of case mapping table Previously the case mapping table was segregated by the type of the character (upper, lower, title) and always stored the other two variants (key, other1, other2). Now the table is segregated by the target type (key, other). As only very few characters have more than one target this only slightly increases the size of the table. The advantage of this layout is that we only need to perform a single table lookup in the case table. Previously, depending on the case that was hit, either one lookup in the property table, or two lookups in the property table and one lookup in the case table were required. This changes the layout from libunicode in the OpenLDAP project -- however, the last commit there was over 10 years ago, so I don't see value in keeping this in sync.	2017-07-23 18:33:15 +02:00
Nikita Popov	7077c719db	Merge branch 'PHP-7.2'	2017-07-23 15:36:25 +02:00
Nikita Popov	c0bcd301d3	Another fix for bug #69267 mb_strtoupper() was converting lowercase characters into titlecase characters, instead of uppercase characters. Luckily there are only very few characters with a distinct titlecase representation, so this mostly worked out okay...	2017-07-23 15:07:02 +02:00
Nikita Popov	0e4af9192f	Partial fix for bug #69267 This pulls in 60a25c72ba389f53b0621ca250bc99f3b295d43f from the OpenLDAP project.	2017-07-23 14:47:21 +02:00
Nikita Popov	b3c1d9d111	Directly use encodings instead of no_encoding in libmbfl In particular strings now store encoding rather than the no_encoding. I've also pruned out libmbfl APIs that existed in two forms, one using no_encoding and the other using encoding. We were not actually using any of the former.	2017-07-20 21:41:52 +02:00
Nikita Popov	c098304e17	Reduce number of encoding conversions in case conversion Don't indirect through UCS4BE, instead directly work on wchars using a custom filter. This replaces the pipeline utf8 -> wchar -> ucs4be -> wchar -case-> wchar -> ucs4be -> wchar -> utf8 with utf8 -> wchar -case-> wchar -> utf8	2017-07-20 15:33:24 +02:00
Nikita Popov	17da862b51	Optimize php_unicode_tolower/upper for ASCII	2017-07-20 13:58:40 +02:00
Nikita Popov	9c73be898d	Directly accept encoding in php_unicode_convert_case() As a side-effect mb_strtolower() and mb_strtoupper() now correctly handle a NULL encoding parameter by using the internal encoding. This is what caused the two test changes.	2017-07-19 23:59:42 +02:00
Nikita Popov	4cf22cbb2d	Optimize php_unicode_is_prop() Do not try to extract the properties from a bitmask. Instead make the function variadic and pass all properties individually. Also add a php_unicode_is_prop1() function to check only a single property.	2017-07-19 23:59:42 +02:00
Nikita Popov	dead4f0b1b	Avoid unnecessary encoding lookups in mbstring Extract part of php_mb_convert_encoding that does the actual work and use it whenever we already know the encoding.	2017-07-19 23:59:42 +02:00
Sammy Kaye Powers	9e29f841ce	Update copyright headers to 2017	2017-01-02 09:30:12 -06:00
Lior Kaplan	ed35de784f	Merge branch 'PHP-5.6' into PHP-7.0 * PHP-5.6: Happy new year (Update copyright to 2016)	2016-01-01 19:48:25 +02:00
Lior Kaplan	49493a2dcf	Happy new year (Update copyright to 2016)	2016-01-01 19:21:47 +02:00
Xinchen Hui	fc33f52d8c	bump year	2015-01-15 23:27:30 +08:00
Xinchen Hui	0579e8278d	bump year	2015-01-15 23:26:37 +08:00
Stanislav Malyshev	b7a7b1a624	trailing whitespace removal	2015-01-10 15:07:38 -08:00
Anatol Belski	bdeb220f48	first shot remove TSRMLS_* things	2014-12-13 23:06:14 +01:00
Johannes Schlüter	d0cb715373	s/PHP 5/PHP 7/	2014-09-19 18:33:14 +02:00
Xinchen Hui	c081ce628f	Bump year	2014-01-03 11:08:10 +08:00
Xinchen Hui	a666285bc2	Happy New Year	2013-01-01 16:37:09 +08:00
Felipe Pena	8775a37559	- Year++	2012-01-01 13:15:04 +00:00
Felipe Pena	0203cc3d44	- Year++	2011-01-01 02:17:06 +00:00
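
The Final_Sigma condition described in commit 39b46a5398 can be sketched roughly as follows. This is an illustrative Python sketch, not the C code from php-src. The helper names are hypothetical, the "case-ignorable" test is a deliberately partial approximation (a few general categories plus a handful of punctuation marks), and the reading of the 63-character lookbehind limit is an assumption based only on the commit message.

```python
import unicodedata

# Assumption: a small, partial stand-in for the roughly 500 case-ignorable
# characters mentioned in the commit message.
_IGNORABLE_CATEGORIES = {"Mn", "Me", "Cf", "Lm", "Sk"}
_IGNORABLE_PUNCTUATION = {"'", ".", ":", "\u00b7", "\u2018", "\u2019"}

CAPITAL_SIGMA = "\u03a3"  # Σ
SMALL_SIGMA = "\u03c3"    # σ
FINAL_SIGMA = "\u03c2"    # ς


def is_cased(ch):
    # Approximation of the Unicode "Cased" property.
    return ch.isupper() or ch.islower() or unicodedata.category(ch) == "Lt"


def is_case_ignorable(ch):
    return (ch in _IGNORABLE_PUNCTUATION
            or unicodedata.category(ch) in _IGNORABLE_CATEGORIES)


def cased_before(text, i, limit):
    # Scan left over case-ignorable characters, looking at no more than
    # `limit` characters, until a cased character is found.
    start = max(0, i - limit)
    for j in range(i - 1, start - 1, -1):
        if is_cased(text[j]):
            return True
        if not is_case_ignorable(text[j]):
            return False
    return False


def cased_after(text, i):
    # Scan right without a limit, up to the end of the string if needed.
    for j in range(i + 1, len(text)):
        if is_cased(text[j]):
            return True
        if not is_case_ignorable(text[j]):
            return False
    return False


def lower_with_final_sigma(text, max_lookbehind=63):
    out = []
    for i, ch in enumerate(text):
        if ch == CAPITAL_SIGMA:
            final = (cased_before(text, i, max_lookbehind)
                     and not cased_after(text, i))
            out.append(FINAL_SIGMA if final else SMALL_SIGMA)
        else:
            out.append(ch.lower())
    return "".join(out)
```

Each character is tested for "cased" before "case-ignorable", so a character that has both properties still counts as the cased character the rule is looking for.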


71 commits
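
Commit 582a65b06f notes that full case folding of a haystack can shift the position of a match, which is why the case-insensitive mb_* operations stayed on simple case folding. The effect can be illustrated with a small Python sketch. Python's `str.casefold()` performs full case folding. Using `str.lower()` as a stand-in for a length-preserving simple folding is an assumption that happens to hold for this particular string.

```python
haystack = "Straße X"
needle = "x"

# Full case folding expands "ß" to "ss", so the folded string is one
# character longer than the original.
folded = haystack.casefold()
pos_in_folded = folded.find(needle.casefold())

# A length-preserving mapping keeps offsets aligned with the original string.
pos_in_original = haystack.lower().find(needle)

print(folded, pos_in_folded, pos_in_original)
```

The offset found in the fully folded string points one character past the real match in the original, so it would have to be mapped back before being returned to the caller.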