The documentation for mb_detect_encoding says that this function
"Detects the most likely character encoding for string `string` from an
ordered list of candidates".
Prior to 28b346bc06, mb_detect_encoding did not really attempt to
determine the "most likely" text encoding for the input string. It
would just return the first candidate encoding for which the string was
valid. In 28b346bc06, I amended this function so that it uses heuristics
to try to guess which candidate encoding is "most likely".
However, the caller did not have any way to indicate which candidate
text encoding(s) they consider to be more likely, in case the
heuristics applied are inconclusive. In the language of Bayesian
probability, there was no way for the caller to indicate their 'prior'
assignment of probabilities.
Further, the documentation for mb_detect_encoding also says that the
second parameter `encodings` is "a list of character encodings to try,
in order". The documentation clearly implies that the order of
the `encodings` argument should be significant.
Therefore, amend mb_detect_encoding so that while it still uses
heuristics to guess the most likely text encoding for the input string,
it favors those which are earlier in the list of candidate encodings.
One complication is that many callers of mb_detect_encoding use it
in this way:
mb_detect_encoding($string, mb_list_encodings());
In a majority of cases, this is bad code; mb_detect_encoding will both
be much slower and the results will be less reliable than if a smaller
list of candidates is used. However, since such code already exists and
people are using it in production, we should not unnecessarily break it.
The order of candidate encodings obviously does not express any prior
belief of which candidates are more likely in this case, and treating
it as if it did will degrade the accuracy of the result.
Since mb_list_encodings now returns a single, immutable array on each
call, we can avoid that problem by turning off the new behavior when
we receive the array of encodings returned by mb_list_encodings.
This implementation means that if the user does this:
$a = mb_list_encodings();
mb_detect_encoding($string, $a);
...then the order of candidate encodings will not be considered.
However, if the user explicitly initializes their own array of all
supported legacy text encodings, then the order *will* be considered.
The other functions which also follow this new behavior are:
• mb_convert_variables
• mb_convert_encoding (when multiple candidate input encodings are
listed)
Other places where "detection" (or really "guessing") of text encoding
may be performed include:
• mb_send_mail
• Zend engine, when determining the encoding of a PHP script
• mbstring processing of HTTP request contents, when http_input INI
parameter is set to a list
In these cases, the new logic based on order of candidate encodings
is *not* enabled. It *might* be logical to consider the order of
candidate encodings in some or all of these cases, but I'm not sure if
that is true, so it seems wiser to avoid more behavior changes than is
necessary. Further, ever since the new encoding detection heuristics
were implemented in 28b346bc06, we have not received any complaints of
user code being broken in these areas. So I am reluctant to "fix what
isn't broken".
Well, some might say that applying the new detection heuristics
to mb_send_mail, etc. in 28b346bc06 was "fixing what wasn't broken",
but (cough cough) I don't have any comment on that...
This will allow us to easily check in other mbstring functions if the
list of all supported encodings, returned by mb_list_encodings, is
passed in as input to another function.
Co-authored-by: Ilija Tovilo <ilija.tovilo@me.com>
For mb_parse_str, when mbstring.http_input (INI parameter) is a list of
multiple possible text encodings (which is not the case by default),
this new implementation is about 25% faster.
When mbstring.http_input is a single value, then nothing is changed.
(No automatic encoding detection is done in that case.)
This boosts the performance of mb_strpos, mb_stripos, mb_strrpos,
mb_strripos, mb_strstr, mb_stristr, mb_strrchr, and mb_strrichr when
used on non-UTF-8 strings. mb_substr is also faster.
With UTF-8 input, there is no appreciable difference in performance for
mb_strpos, mb_stripos, mb_strrpos, etc. This is expected, since the only
real difference here (aside from shorter and simpler code) is that the
new text conversion code is used when converting non-UTF-8 input strings
to UTF-8. (This is done because internally, mb_strpos, etc. work only
on UTF-8 text.)
For ASCII, speed is boosted by 30-65%. For other legacy text encodings,
the degree of performance improvement will depend on how slow the
legacy conversion code was.
One other minor, but notable difference is that strings encoded using
UTF-8 variants from Japanese mobile vendors (SoftBank, KDDI, Docomo)
will not undergo encoding conversion but will be processed "as is". It
is expected that this will result in a large performance boost for
such input strings; but realistically, the number of users who work
with such strings is probably minute.
I was not originally planning to include mb_substr in this commit, but
fuzzing of the reimplemented mb_strstr revealed that mb_substr needed
to be reimplemented, too; using the old mbfl_substr, which was based
on the old text conversion filters, in combination with functions which
use the new text conversion filters caused bugs.
The performance boost for mb_substr varies from 10%-500%, depending
on the encoding and input string used.
The PCRE extension is already doing this. The flag is set when a string
is determined to be valid UTF-8, and cleared in
zend_string_forget_hash_val.
We might as well make good use of it in mbstring as well.
This should result in a negligible slowdown for non-UTF-8 strings,
bad UTF-8 strings, and good UTF-8 strings which are checked only once.
However, when microbenchmarking this change using a variety of text
encodings and string lengths, I found that in most of these cases,
the 'new' code was a few percent faster. In a couple of cases, the 'old'
code was a few percent faster. This was not a result of sampling error,
since I could reproduce these test results repeatedly, and even when
running a large number of iterations. Both the new and old code
were compiled with -O3 -march=native. My (unproved) hypothesis is that
although the new code appears to only add one more conditional branch,
the compiler may emit slightly different code from before (perhaps
with different register allocation and so on), and this may cause some
cases to run slightly faster and others to run slightly slower. I have
not disassembled the old and new binaries to see if an examination of
the emitted assembly code would support this hypothesis.
For good UTF-8 strings which are checked repeatedly, the speedup is
about 40% even for strings 1-5 bytes in length. For ~100 byte strings,
it is ~700%, and for ~10000 byte strings, it is ~80000%.
I tried fuzzing MBString's php_mb_check_encoding function and
pcre2lib's valid_utf function to see if I could find any cases where
their output would be different. After running the fuzzer for a couple
of minutes, it had tried more than 1 million test cases without finding
any where the output was different. Therefore, it appears that
MBString's UTF-8 validation is compatible with PCRE's.
1. Update: http://www.php.net/license/3_01.txt to https, as there is anyway server header "Location:" to https.
2. Update few license 3.0 to 3.01 as 3.0 states "php 5.1.1, 4.1.1, and earlier".
3. In some license comments is "at through the world-wide-web" while most is without "at", so deleted.
4. fixed indentation in some files before |
We're starting to see a mix between uses of zend_bool and bool.
Replace all usages with the standard bool type everywhere.
Of course, zend_bool is retained as an alias.
This is just a very silly feature of mbstring -- you can compile the source files with
HAVE_MBSTRING undefined, and it will all just compile to (almost) nothing. What is the
use of this? Why compile the source files and link against them if you don't want the
mbstring extension? It doesn't make any kind of sense.
Very interesting... it turns out that when Valgrind support was enabled,
`#include "config.h"` from within mbstring was actually including the file "config.h"
from Valgrind, and not the one from mbstring!!
This is because -I/usr/include/valgrind was added to the compiler invocation _before_
-Iext/mbstring/libmbfl.
Make sure we actually include the file which was intended.
This is very similar to the existing mbstring.regex_stack_limit,
but for backtracking. The default value matches pcre.backtrack_limit.
Only used on libonig >= 2.8.0.
By introducing a hook that is called whenever one of
internal_encoding / input_encoding / output_encoding changes, so
that mbstring can adjust it's internal state.
This also makes internal_encoding work with zend multibyte.
This patch removes the so called local variables defined per
file basis for certain editors to properly show tab width, and
similar settings. These are mainly used by Vim and Emacs editors
yet with recent changes the once working definitions don't work
anymore in Vim without custom plugins or additional configuration.
Neither are these settings synced across the PHP code base.
A simpler and better approach is EditorConfig and fixing code
using some code style fixing tools in the future instead.
This patch also removes the so called modelines for Vim. Modelines
allow Vim editor specifically to set some editor configuration such as
syntax highlighting, indentation style and tab width to be set in the
first line or the last 5 lines per file basis. Since the php test
files have syntax highlighting already set in most editors properly and
EditorConfig takes care of the indentation settings, this patch removes
these as well for the Vim 6.0 and newer versions.
With the removal of local variables for certain editors such as
Emacs and Vim, the footer is also probably not needed anymore when
creating extensions using ext_skel.php script.
Additionally, Vim modelines for setting php syntax and some editor
settings has been removed from some *.phpt files. All these are
mostly not relevant for phpt files neither work properly in the
middle of the file.
The $Id$ keywords were used in Subversion where they can be substituted
with filename, last revision number change, last changed date, and last
user who changed it.
In Git this functionality is different and can be done with Git attribute
ident. These need to be defined manually for each file in the
.gitattributes file and are afterwards replaced with 40-character
hexadecimal blob object name which is based only on the particular file
contents.
This patch simplifies handling of $Id$ keywords by removing them since
they are not used anymore.
Store the last used encoding and compare against it. It's quite
likely that an application is going to be using the same encoding
again and again.
The actual mbfl_name2encoding() function could also be optimized
to use a hash lookup rather than a linear scan, but we don't have
a hashtable implmentation in libmbfl...