Headers should not be processed in a locale-dependent fashion.
Switch from uppercasing to lowercasing, because that's the standard for
PHP and because we provide an ASCII implementation of this operation.
This is adapted from GH-7506.
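To illustrate the motivation in userland terms (the actual change is in C code;
the helper below is purely hypothetical): header names are ASCII, so an
ASCII-only lowercase gives comparisons which never depend on the process locale.

    <?php
    // Hypothetical helper: normalize header names with an ASCII-only lowercase.
    // A locale-aware case mapping could misbehave, e.g. in a Turkish locale,
    // where 'I' maps to a dotless 'ı' and case-insensitive matching breaks.
    function normalize_header_name(string $name): string {
        return strtr($name, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz');
    }

    var_dump(normalize_header_name('Content-Type') === normalize_header_name('CONTENT-TYPE')); // bool(true)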
Previously, when passed an empty string, and given an encoding which
uses a variable number of bytes per character (and which doesn't have
a 'character length table'), mb_str_split would return an array
containing a single empty string, rather than an empty array.
The ISO-2022 encodings are among those which were affected by this bug.
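A sketch of the expected behavior after the fix, using ISO-2022-JP as one of
the affected encodings:

    <?php
    // An empty string contains zero characters, so splitting it should yield an empty array:
    var_dump(mb_str_split('', 1, 'ISO-2022-JP'));
    // Now:        array(0) {}
    // Previously: array(1) { [0] => string(0) "" }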
* PHP-8.1:
  Bug #81390: mb_detect_encoding should not prematurely stop processing input
  mb_detect_encoding with only one candidate encoding uses mb_check_encoding
  Optimize text encoding detection for speed (eliminate Unicode property lookups)
mb_convert_kana is able to convert fullwidth katakana to fullwidth
hiragana (and vice versa). The constants referring to these modes had
names like MBFL_FILT_TL_ZEN2HAN_KANA2HIRA.
The "ZEN2HAN" part of the name is misleading, since these modes do not
convert fullwidth (zenkaku) kana to halfwidth (hankaku). The converted
characters are fullwidth both before and after the conversion. So...
let's rename the constants accordingly.
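For reference, this is the user-visible behavior behind those modes (a small
sketch; the option letters are per the mb_convert_kana documentation):

    <?php
    // 'c' converts fullwidth katakana to fullwidth hiragana; 'C' does the reverse.
    // Both the input and the output are fullwidth (zenkaku) characters.
    echo mb_convert_kana('カタカナ', 'c'), "\n"; // かたかな
    echo mb_convert_kana('ひらがな', 'C'), "\n"; // ヒラガナ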
Previously, mbstring had a special mode whereby it would convert
erroneous input byte sequences to output like "BAD+XXXX", where "XXXX"
would be the erroneous bytes expressed in hexadecimal. This mode could
be enabled by calling `mb_substitute_character("long")`.
However, accurately reproducing input byte sequences from the cached
state of a conversion filter is often tricky, and this significantly
complicates the implementation. Further, the means used for passing
the erroneous bytes through to where the "BAD+XXXX" text is generated
only allows for up to 3 bytes to be passed, meaning that some erroneous
byte sequences are truncated anyways.
More to the point, a search of publicly available PHP code indicates
that nobody is really using this feature anyways.
Incidentally, this feature also provided error output like "JIS+XXXX"
if the input 'should have' represented a JISX 0208 codepoint, but it
decodes to a codepoint which does not exist in the JISX 0208 charset.
Similarly, specific error output was provided for non-existent
JISX 0212 codepoints, and likewise for JISX 0213, CP932, and a few
other charsets. All of that is now consigned to the flames.
However, "long" error markers also include a somewhat more useful
"U+XXXX" marker for Unicode codepoints which were successfully
decoded from the input text, but cannot be represented in the output
encoding. Those are still supported.
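A sketch of what remains, assuming the "U+XXXX" hex formatting described above:

    <?php
    mb_substitute_character('long');
    // '漢' (U+6F22) decodes fine from UTF-8, but cannot be represented in ASCII,
    // so it is replaced by a "U+XXXX" marker in the output:
    echo mb_convert_encoding('a漢b', 'ASCII', 'UTF-8'), "\n"; // aU+6F22b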
With this change, there is no need to use a variety of special values
in the high bits of a wchar to represent different types of error
values. We can (and will) just use a single error value. This will be
equal to -1.
One complicating factor: Text conversion functions return an integer to
indicate whether the conversion operation should be immediately
aborted, and the magic 'abort' marker is -1. Also, almost all of these
functions would return the received byte/codepoint to indicate success.
That doesn't work with the new error value; if an input filter detects
an error and passes -1 to the output filter, and the output filter
returns it back, that would be taken to mean 'abort'.
Therefore, amend all these functions to return 0 for success.
1. Update http://www.php.net/license/3_01.txt to https, as the server sends a "Location:" header redirecting to https anyway.
2. Update a few license 3.0 references to 3.01, as 3.0 states "php 5.1.1, 4.1.1, and earlier".
3. Some license comments say "at through the world-wide-web" while most omit the "at", so it was deleted.
4. Fixed indentation in some files before the "|".
This flag indicated that an encoding was 'multi-byte', i.e. that it can use a variable
number of bytes to encode each character. As it turns out, we don't actually
need to check this flag anywhere, so it's better to remove it.
We're starting to see a mix of zend_bool and bool usages.
Replace all usages with the standard bool type everywhere.
Of course, zend_bool is retained as an alias.
These flags identify text encodings in mbstring which use a constant number of
bytes per character. While some parts of the code do use these flags, usually
to detect cases which can be optimized due to constant-width encoding, nothing
cares whether the encodings are 'LE' (little-endian) or 'BE' (big-endian).
So we can simplify things by combining constants.
mbstring had an 'identify filter' for almost every supported text encoding
which was used when auto-detecting the most likely encoding for a string.
It would run over the string and set a 'flag' if it saw anything which
did not appear likely to be the encoding in question.
One problem with this scheme was that encodings which merely appeared
less likely to be the correct one were completely rejected, even if there
was no better candidate. Another problem was that the 'identify filters'
had a huge amount of code duplication with the 'conversion filters'.
Eliminate the identify filters. Instead, when auto-detecting text
encoding, use conversion filters to see whether the input string is valid
in candidate encodings or not. At the same time, watch the types of
codepoints which the string decodes to, and mark an encoding as less likely if
non-printable characters (ESC, form feed, bell, etc.) or 'private use
area' codepoints are seen.
Interestingly, one old test case in which JIS text was misidentified
as UTF-8 (and this wrong behavior was enshrined in the test) was 'fixed'
and the JIS string is now auto-detected as JIS.
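A hedged illustration of that case (the candidate list here is an assumption,
not the original test's):

    <?php
    $jis = mb_convert_encoding('日本語のテキスト', 'JIS', 'UTF-8');
    // The JIS escape sequences decode (as UTF-8) to non-printable ESC codepoints,
    // so UTF-8 is scored as a less likely candidate and JIS wins:
    var_dump(mb_detect_encoding($jis, ['UTF-8', 'JIS'], true)); // string(3) "JIS"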
Previously, `mb_check_encoding` did an awful lot of unneeded work. In order to
determine whether a string was valid or not, it would convert the whole string
into wchar (code points), which required dynamically allocating a (potentially
large) buffer. Then it would turn right around and convert that big ol' buffer
of code points back to the original encoding again. Finally, it would check
whether any invalid bytes were detected during that long and onerous process.
The thing is, mbstring _already_ has machinery for detecting whether a string
is valid in a certain encoding or not, and it doesn't require copying any data
around or allocating buffers. Better yet, it can fail fast when an invalid byte
is found. Why not use it? It's sure a lot faster!
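A usage sketch of what the fast path amounts to (validation only, with no
conversion round-trip):

    <?php
    var_dump(mb_check_encoding('abc', 'UTF-8'));      // bool(true)
    // 0xC3 starts a two-byte UTF-8 sequence, but '(' is not a continuation byte,
    // so the check can fail as soon as that byte is seen:
    var_dump(mb_check_encoding("ab\xC3(", 'UTF-8'));  // bool(false)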
Further, the legacy code was also badly broken. Why? Because aside from
checking whether illegal characters were detected, it would also check whether
the conversion to and from wchars was lossless. But some encodings have
more than one valid byte sequence for the same character. In such cases, it is
not possible to make the conversion to and from wchars lossless for every
valid character. So `mb_check_encoding` would actually reject good strings
in a lot of encodings!
- Make everything less gratuitously verbose
- Don't litter the code with lots of unneeded NULL checks (for things which
will never be NULL)
- Don't return success/failure code from functions which can never fail
- For encoding structs, don't use pointers to pointers to pointers for the
list of alias strings. Pointers to pointers (2 levels of indirection)
are what actually makes sense. This gets rid of some extraneous
dereference operations.
This is just a very silly feature of mbstring -- you can compile the source files with
HAVE_MBSTRING undefined, and it will all just compile to (almost) nothing. What is the
use of this? Why compile the source files and link against them if you don't want the
mbstring extension? It doesn't make any kind of sense.
Very interesting... it turns out that when Valgrind support was enabled,
`#include "config.h"` from within mbstring was actually including the file "config.h"
from Valgrind, and not the one from mbstring!!
This is because -I/usr/include/valgrind was added to the compiler invocation _before_
-Iext/mbstring/libmbfl.
Make sure we actually include the file which was intended.
Rather than using a magic boolean parameter to choose different behavior of
the subfunction, inline it. The code size doesn't really grow anyways. And
soon these will be trimmed down more.