71 commits

Author SHA1 Message Date
Alex Dowad
39b46a5398 Implement Unicode conditional casing rules for Greek letter sigma
The capital Greek letter sigma (Σ) should be lowercased as σ except
when it appears at the end of a word; in that case, it should be
lowercased as the special form ς.

This rule is included in the Unicode data file SpecialCasing.txt.
The condition for applying the rule is called "Final_Sigma" and is
defined in Unicode technical report 21. The rule is:

• For the special casing form to apply, the capital letter sigma must
  be preceded by 0 or more "case-ignorable" characters, preceded by
  at least 1 "cased" character.
• Further, capital sigma must NOT be followed by 0 or more
  case-ignorable characters and then at least 1 cased character.

"Case-ignorable" characters include certain punctuation marks, like
the apostrophe, as well as various accent marks. There are actually
close to 500 different case-ignorable characters, including accent marks
from Cyrillic, Hebrew, Armenian, Arabic, Syriac, Bengali, Gujarati,
Telugu, Tibetan, and many other alphabets. This category also includes
zero-width spaces, codepoints which indicate RTL/LTR text direction,
certain musical symbols, etc.

Since the rule involves scanning over "0 or more" of such
case-ignorable characters, it may be necessary to scan arbitrarily far
to the left and right of capital sigma to determine whether the special
lowercase form should be used or not. However, since we are trying to
be both memory-efficient and CPU-efficient, this implementation limits
how far to the left we will scan. Generally, we scan up to 63 characters
to the left looking for a "cased" character, but not more.

When scanning to the right, we go up to the end of the string if
necessary, even if it means scanning over thousands of characters.

In any case, it is almost impossible to imagine that natural text will
include "words" with more than 63 successive apostrophes (for example)
followed by a capital sigma.

Closes GH-8096.
2023-01-12 17:41:11 +02:00
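The Final_Sigma rule described in commit 39b46a5398 can be sketched in Python as follows. This is an illustration only, not mbstring's C implementation. The helpers `is_cased` and `is_case_ignorable` are hypothetical and only approximate the Unicode Cased and Case_Ignorable properties through general categories, and the exact way the 63-character limit on the left scan is applied here is an assumption.

```python
import unicodedata

SIGMA_UPPER = "\u03a3"   # capital sigma
SIGMA_LOWER = "\u03c3"   # regular lowercase sigma
SIGMA_FINAL = "\u03c2"   # word-final lowercase sigma
MAX_LOOKBEHIND = 63      # limit taken from the commit message


def is_cased(ch):
    # Approximation (assumption): upper-, lower- and titlecase letters only.
    return unicodedata.category(ch) in ("Lu", "Ll", "Lt")


def is_case_ignorable(ch):
    # Approximation (assumption): apostrophes plus marks, format characters,
    # modifier letters and modifier symbols.
    return ch in "'\u2019" or unicodedata.category(ch) in ("Mn", "Me", "Cf", "Lm", "Sk")


def _reaches_cased(chars):
    """Skip case-ignorable characters; report whether a cased one comes next."""
    for ch in chars:
        if is_cased(ch):
            return True
        if not is_case_ignorable(ch):
            return False
    return False


def lowercase_sigma(text, i):
    """Return the lowercase form to use for the capital sigma at text[i]."""
    if text[i] != SIGMA_UPPER:
        raise ValueError("text[i] is not a capital sigma")
    # Bounded scan to the left: look at no more than MAX_LOOKBEHIND characters.
    window = text[max(0, i - MAX_LOOKBEHIND):i]
    preceded = _reaches_cased(reversed(window))
    # Unbounded scan to the right, up to the end of the string if necessary.
    followed = _reaches_cased(text[i + 1:])
    return SIGMA_FINAL if preceded and not followed else SIGMA_LOWER
```

With this sketch, a capital sigma separated from the nearest cased letter by 64 apostrophes is lowercased to the regular form, because the bounded left scan gives up before reaching that letter.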
Alex Dowad
4427b2e1ab Mark UTF-8 strings emitted by mbstring functions as valid UTF-8
We now have a couple of mbstring functions which have fast paths for
strings marked as 'valid UTF-8'. Later, we will likely have more. So
that these fast paths can be used more frequently, mark UTF-8 strings
emitted by mbstring as 'valid UTF-8'. This is always a correct thing
to do, because mbstring never returns invalid UTF-8 as the result of
a conversion (or similar) operation.

Internally, we do have a conversion mode which deliberately emits
invalid UTF-8 in some cases. (This is done to prevent unwanted matches
when we are converting strings to UTF-8 before performing matching
operations on them.) For such strings, don't set the 'valid UTF-8' flag.
It probably wouldn't hurt anything to set it, because strings generated
using that special conversion mode should *never* be returned to
userland, and I don't think we do anything with them which cares about
the IS_STR_VALID_UTF8 flag... but still, it would likely cause
confusion for developers.
2023-01-11 17:08:27 +02:00
Alex Dowad
744ca16e73 Speed boost for mb_stripos (when not using UTF-8)
Instead of case-folding a string and then converting it to UTF-8 as a
separate operation, why not convert it to UTF-8 at the same time as
we fold case?

For non-UTF-8 encodings, this typically makes mb_stripos about 2x
faster.
2022-12-18 15:31:20 +02:00
Alex Dowad
3ce888a837 Use uint32_t for 'illegal_substchar' codepoint in mbstring
This value is a wchar, so the best type for it is uint32_t.
2022-10-05 10:02:02 +09:00
Alex Dowad
20769fb9ab Make enum for valid case_mode values (for php_unicode_convert_case) 2022-10-05 10:02:02 +09:00
Alex Dowad
7eef2fb45e Use fast text conversion filters for mb_convert_case, mb_strtoupper, mb_strtolower
Speed increase is only about 50% for title casing, but 2-3x for other
forms of case conversion.
2022-10-05 10:02:02 +09:00
Alex Dowad
4e51810f9b Optimize mbstring upper/lowercasing: use fast path in more cases
The 'fast path' in the uppercase/lowercase functions for Unicode text can be used
for a slightly greater range of characters. This is not expected to have a big
impact on performance, since the number of characters which will use the 'fast path'
is only increased by about 50-60, and these are not very commonly used characters...
but still, it doesn't cost anything.
2021-09-20 11:27:54 +02:00
Alex Dowad
a312620607 Remove redundant NULL checks in mbstring
Whoever originally wrote mbstring seems to have a deathly fear of NULL
pointers lurking behind every corner. A common pattern is that one
function will check if a pointer is NULL, then pass it to another
function, which will again check if it is NULL, then pass to yet another
function, which will yet again check if it is NULL... it's NULL checks
all the way down.

Remove all the NULL checks in places where pointers could not possibly
be NULL.
2021-09-06 13:16:23 +02:00
Nikita Popov
d2073179e3 Return bool from php_unicode_is_prop() 2021-08-24 19:21:21 +02:00
Nikita Popov
3be94217f4 Don't use sentinel value for unicode property lookup
0xffff was used to mark character properties without any members.
This made the code unnecessarily complicated, because we need to
check for 0xffff values when looking up the property ranges. We
can simply encode this as an empty set of ranges.
2021-08-24 15:53:43 +02:00
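The idea in commit 3be94217f4 (encode a property without members as an empty set of ranges instead of a 0xffff sentinel) can be sketched in Python. The table contents and the names `RANGES`, `OFFSETS` and `is_prop` are hypothetical and do not reproduce mbstring's generated tables.

```python
# Inclusive code point ranges for all properties, stored back to back.
RANGES = [(0x41, 0x5A), (0xC0, 0xD6), (0x61, 0x7A)]

# OFFSETS[p] .. OFFSETS[p + 1] delimits the slice of RANGES for property p.
# Property 1 has no members: its slice is simply empty (2 .. 2).
OFFSETS = [0, 2, 2, 3]


def is_prop(code, prop):
    """Binary search over the ranges of one property."""
    lo, hi = OFFSETS[prop], OFFSETS[prop + 1] - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        start, end = RANGES[mid]
        if code < start:
            hi = mid - 1
        elif code > end:
            lo = mid + 1
        else:
            return True
    return False
```

For the empty property the loop condition is false from the start, so no special case is needed in the lookup.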
Patrick Allaert
aff365871a Fixed some spaces used instead of tabs 2021-06-29 11:30:26 +02:00
KsaR
01b3fc03c3 Update http->https in license (#6945)
1. Update http://www.php.net/license/3_01.txt to https, since the server already redirects to https via the "Location:" header.
2. Update a few license 3.0 references to 3.01, as 3.0 states "php 5.1.1, 4.1.1, and earlier".
3. Some license comments say "at through the world-wide-web" while most omit "at", so it was deleted.
4. Fix indentation in some files before |
2021-05-06 12:16:35 +02:00
Alex Dowad
7eddcabe2b Don't guard mbstring code with #ifdef HAVE_MBSTRING
This is just a very silly feature of mbstring -- you can compile the source files with
HAVE_MBSTRING undefined, and it will all just compile to (almost) nothing. What is the
use of this? Why compile the source files and link against them if you don't want the
mbstring extension? It doesn't make any kind of sense.
2020-08-31 23:18:13 +02:00
Alex Dowad
62317d592f Remove redundant includes from mbstring (and make sure correct config.h is used)
Very interesting... it turns out that when Valgrind support was enabled,
`#include "config.h"` from within mbstring was actually including the file "config.h"
from Valgrind, and not the one from mbstring!!

This is because -I/usr/include/valgrind was added to the compiler invocation _before_
-Iext/mbstring/libmbfl.

Make sure we actually include the file which was intended.
2020-08-31 23:17:58 +02:00
Alex Dowad
ea3f0ee0b9 Optimize php_unicode_convert_case (cuts mbstring case conversion time ~15%)
This function uses various subfunctions to convert case of Unicode wchars.
Previously, these subfunctions would store the case-converted characters in
a buffer, and the parent function would then pass them (byte by byte) to
the next filter in the filter chain.

Rather than passing around that buffer, it's better for the subfunctions to
directly pass the case-converted bytes to the next filter in the filter chain.
This speeds things up nicely.
2020-08-31 23:17:25 +02:00
George Peter Banyard
68164f40ce Fix [-Wundef] warning in MBString extension 2020-05-16 15:31:20 +02:00
Christoph M. Becker
ebdaeb8572 Fix #79371: mb_strtolower (UTF-32LE): stack-buffer-overflow
We make sure that negative values are properly compared.
2020-03-16 22:42:15 -07:00
Gabriel Caruso
5d6e923d46 Remove mention of PHP major version in Copyright headers
Closes GH-4732.
2019-09-25 14:51:43 +02:00
Nikita Popov
8e8d129d7f Use EMPTY_SWITCH_DEFAULT_CASE in php_unicode.c
Avoids a potentially uninitialized variable warning.
2019-04-12 10:26:11 +02:00
Peter Kokot
92ac598aab Remove local variables
This patch removes the so-called local variables defined on a
per-file basis so that certain editors show tab width and similar
settings properly. These are mainly used by the Vim and Emacs editors,
yet with recent changes the once-working definitions no longer work
in Vim without custom plugins or additional configuration.
Nor are these settings synced across the PHP code base.

A simpler and better approach is EditorConfig and fixing code
using some code style fixing tools in the future instead.

This patch also removes the so-called modelines for Vim. Modelines
allow Vim specifically to set editor configuration such as syntax
highlighting, indentation style, and tab width in the first line or
the last 5 lines of a file. Since the PHP test files already have
syntax highlighting set properly in most editors and EditorConfig
takes care of the indentation settings, this patch removes these as
well for Vim 6.0 and newer versions.

With the removal of local variables for certain editors such as
Emacs and Vim, the footer is probably also no longer needed when
creating extensions using the ext_skel.php script.

Additionally, Vim modelines for setting PHP syntax and some editor
settings have been removed from some *.phpt files. These are mostly
not relevant for phpt files, nor do they work properly in the
middle of the file.
2019-02-03 21:03:00 +01:00
Zeev Suraski
0cf7de1c70 Remove yearly range from copyright notice 2019-01-30 11:03:12 +02:00
Nikita Popov
9d63f4dec1 Fixed bug #76319
While at it, also make sure that mbstring case conversion takes
into account the specified substitution character and substitution
mode.
2018-05-25 11:33:13 +02:00
Xinchen Hui
a6519d0514 year++ 2018-01-02 12:57:58 +08:00
Anatol Belski
f9c3ee9ae8 fix c89 compat 2017-07-28 22:18:51 +02:00
Nikita Popov
f4a1d9c821 Fixed bug #65544 and #71298 2017-07-28 14:57:08 +02:00
Nikita Popov
582a65b06f Implement full case mapping
Implement full case mapping according to SpecialCasing.txt and
also full case folding according to CaseFolding.txt (F). There
are a number of caveats:

* Only language-agnostic and unconditional full case mapping
  is implemented. The only language-agnostic conditional case
  mapping rule relates to Greek sigma in final position
  (Final_Sigma). Correctly handling this requires both arbitrary
  lookahead and lookbehind, which would require some larger
  changes to how the case mapping is implemented. This is a
  possible future extension.
* The only language-specific handling that is implemented is
  for Turkish dotted/undotted Is, if the ISO-8859-9 encoding
  is used. This matches the previous behavior and makes sure
  that no codepoints not supported by the encoding are
  produced. A future extension would be to also handle the
  Turkish mappings specified by SpecialCasing.txt based on
  the mbfl internal language.
* Full case folding is implemented, but case-insensitive mb_*
  operations continue to use simple case folding. The reason is
  that full case folding of the haystack string may change the
  position at which a match occurred. This would have to be
  mapped back into the position in the original string.
* mb_convert_case() exposes both the full and the simple case
  mapping / folding, where full is the default. The constants
  are:

   * MB_CASE_LOWER (used by mb_strtolower)
   * MB_CASE_UPPER (used by mb_strtoupper)
   * MB_CASE_TITLE
   * MB_CASE_FOLD
   * MB_CASE_LOWER_SIMPLE
   * MB_CASE_UPPER_SIMPLE
   * MB_CASE_TITLE_SIMPLE
   * MB_CASE_FOLD_SIMPLE (used by case-insensitive operations)
2017-07-28 12:32:50 +02:00
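The caveat in commit 582a65b06f about full case folding changing match positions can be illustrated in Python, whose `str.casefold()` performs full case folding. The function name `find_both` is hypothetical, and Python's string methods stand in for the mb_* functions here.

```python
def find_both(haystack, needle):
    """Return (offset in the original, offset in the fully folded haystack)."""
    original = haystack.find(needle)
    folded = haystack.casefold().find(needle.casefold())
    return original, folded


# U+00DF (sharp s) folds to "ss", so everything after it shifts right by one.
print(find_both("Stra\u00dfe 1", "1"))  # prints (7, 8)
```

An offset found in the folded string would therefore have to be mapped back to the original string before it could be returned to the caller, which is the extra work the commit message refers to.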
Nikita Popov
9ac7c1e71d Use case-folding for case insensitive comparisons
Instead of using lowercasing.
2017-07-28 12:32:50 +02:00
Nikita Popov
80a0601fe5 Use MPH for case maps
Instead of performing a binary search, use a hashtable to store
the case maps. In particular a minimal perfect hash construction
is used, which does not require collision resolution (but does
use an auxiliary table for the hash perturbation).
2017-07-28 12:32:50 +02:00
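A minimal perfect hash of the kind mentioned in commit 80a0601fe5 can be sketched in Python. This is a generic hash-and-displace construction under stated assumptions, not the scheme used in mbstring: the mixer `mix`, the functions `build_mph` and `lookup`, and the trick of storing a negative seed for single-key buckets are all choices made for this sketch.

```python
def mix(seed, key):
    """32-bit integer mixer; `seed` is the perturbation value."""
    h = ((seed ^ key) * 0x9E3779B1) & 0xFFFFFFFF
    return h ^ (h >> 15)


def build_mph(mapping):
    """Build (seeds, slots) so that each key lands in its own slot."""
    n = len(mapping)
    seeds = [0] * n          # auxiliary perturbation table
    slots = [None] * n       # exactly one slot per key
    buckets = [[] for _ in range(n)]
    for key in mapping:
        buckets[mix(0, key) % n].append(key)

    # Place the fullest buckets first, searching for a seed that sends
    # every key of the bucket to a distinct, still-free slot.
    for b in sorted(range(n), key=lambda i: -len(buckets[i])):
        bucket = buckets[b]
        if len(bucket) < 2:
            break
        for seed in range(1, 1 << 16):
            targets = {mix(seed, key) % n for key in bucket}
            if len(targets) == len(bucket) and all(slots[t] is None for t in targets):
                break
        else:
            raise RuntimeError("no suitable seed found")
        seeds[b] = seed
        for key in bucket:
            slots[mix(seed, key) % n] = (key, mapping[key])

    # Single-key buckets take the remaining free slots directly;
    # a negative seed encodes the slot index.
    free = [i for i, slot in enumerate(slots) if slot is None]
    for b, bucket in enumerate(buckets):
        if len(bucket) == 1:
            slot = free.pop()
            seeds[b] = -slot - 1
            slots[slot] = (bucket[0], mapping[bucket[0]])
    return seeds, slots


def lookup(seeds, slots, key):
    """Two table reads and one key comparison; no collision chain to walk."""
    n = len(slots)
    if n == 0:
        return None
    seed = seeds[mix(0, key) % n]
    index = -seed - 1 if seed < 0 else mix(seed, key) % n
    stored_key, value = slots[index]
    return value if stored_key == key else None
```

For example, after `seeds, slots = build_mph({0x41: 0x61, 0x3A3: 0x3C3})`, calling `lookup(seeds, slots, 0x3A3)` returns `0x3C3`, and a key that was never inserted returns `None` because the stored key does not match.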
Nikita Popov
3c6b2512cb Change layout of case mapping table
Previously the case mapping table was segregated by the type of
the character (upper, lower, title) and always stored the other
two variants (key, other1, other2). Now the table is segregated
by the target type (key, other). As only very few characters have
more than one target this only slightly increases the size of the
table.

The advantage of this layout is that we only need to perform a
single table lookup in the case table. Previously, depending on
the case that was hit, either one lookup in the property table,
or two lookups in the property table and one lookup in the case
table were required.

This changes the layout from libunicode in the OpenLDAP project
-- however, the last commit there was over 10 years ago, so I
don't see value in keeping this in sync.
2017-07-23 18:33:15 +02:00
Nikita Popov
7077c719db Merge branch 'PHP-7.2' 2017-07-23 15:36:25 +02:00
Nikita Popov
c0bcd301d3 Another fix for bug #69267
mb_strtoupper() was converting lowercase characters into
titlecase characters, instead of uppercase characters. Luckily
there are only very few characters with a distinct titlecase
representation, so this mostly worked out okay...
2017-07-23 15:07:02 +02:00
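The distinction behind the bug fixed in commit c0bcd301d3 can be checked with a short Python snippet. A few code points, such as the digraph U+01C6, have a titlecase form that differs from the uppercase form. Python's string methods stand in for mbstring here, and the helper name `case_forms` is hypothetical.

```python
import unicodedata


def case_forms(ch):
    """Return (uppercase form, titlecase form) of a single character."""
    return ch.upper(), ch.title()


DZ_LOWER = "\u01c6"  # LATIN SMALL LETTER DZ WITH CARON
upper, title = case_forms(DZ_LOWER)

# The uppercase form is U+01C4 (category Lu); the titlecase form is
# U+01C5 (category Lt), a different code point.
for label, ch in (("upper", upper), ("title", title)):
    print(label, f"U+{ord(ch):04X}", unicodedata.category(ch))
```

For most letters the two forms coincide, which is why returning the titlecase form from an uppercasing function went largely unnoticed.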
Nikita Popov
0e4af9192f Partial fix for bug #69267
This pulls in 60a25c72ba389f53b0621ca250bc99f3b295d43f from the
OpenLDAP project.
2017-07-23 14:47:21 +02:00
Nikita Popov
b3c1d9d111 Directly use encodings instead of no_encoding in libmbfl
In particular strings now store encoding rather than the
no_encoding.

I've also pruned out libmbfl APIs that existed in two forms, one
using no_encoding and the other using encoding. We were not actually
using any of the former.
2017-07-20 21:41:52 +02:00
Nikita Popov
c098304e17 Reduce number of encoding conversions in case conversion
Don't indirect through UCS4BE, instead directly work on wchars
using a custom filter.

This replaces the pipeline
  utf8 -> wchar -> ucs4be -> wchar -case-> wchar -> ucs4be -> wchar -> utf8
with
  utf8 -> wchar -case-> wchar -> utf8
2017-07-20 15:33:24 +02:00
Nikita Popov
17da862b51 Optimize php_unicode_tolower/upper for ASCII 2017-07-20 13:58:40 +02:00
Nikita Popov
9c73be898d Directly accept encoding in php_unicode_convert_case()
As a side-effect mb_strtolower() and mb_strtoupper() now correctly
handle a NULL encoding parameter by using the internal encoding.
This is what caused the two test changes.
2017-07-19 23:59:42 +02:00
Nikita Popov
4cf22cbb2d Optimize php_unicode_is_prop()
Do not try to extract the properties from a bitmask. Instead make
the function variadic and pass all properties individually.

Also add a php_unicode_is_prop1() function to check only a single
property.
2017-07-19 23:59:42 +02:00
Nikita Popov
dead4f0b1b Avoid unnecessary encoding lookups in mbstring
Extract part of php_mb_convert_encoding that does the actual work
and use it whenever we already know the encoding.
2017-07-19 23:59:42 +02:00
Sammy Kaye Powers
9e29f841ce Update copyright headers to 2017 2017-01-02 09:30:12 -06:00
Lior Kaplan
ed35de784f Merge branch 'PHP-5.6' into PHP-7.0
* PHP-5.6:
  Happy new year (Update copyright to 2016)
2016-01-01 19:48:25 +02:00
Lior Kaplan
49493a2dcf Happy new year (Update copyright to 2016) 2016-01-01 19:21:47 +02:00
Xinchen Hui
fc33f52d8c bump year 2015-01-15 23:27:30 +08:00
Xinchen Hui
0579e8278d bump year 2015-01-15 23:26:37 +08:00
Stanislav Malyshev
b7a7b1a624 trailing whitespace removal 2015-01-10 15:07:38 -08:00
Anatol Belski
bdeb220f48 first shot remove TSRMLS_* things 2014-12-13 23:06:14 +01:00
Johannes Schlüter
d0cb715373 s/PHP 5/PHP 7/ 2014-09-19 18:33:14 +02:00
Xinchen Hui
c081ce628f Bump year 2014-01-03 11:08:10 +08:00
Xinchen Hui
a666285bc2 Happy New Year 2013-01-01 16:37:09 +08:00
Felipe Pena
8775a37559 - Year++ 2012-01-01 13:15:04 +00:00
Felipe Pena
0203cc3d44 - Year++ 2011-01-01 02:17:06 +00:00