Instead of returning the encoding of the current substitution
character. This allows a robust check for the failure case. The
substitution character (especially the default of "?") is also
a valid output of mb_chr() for a valid input (for "?" that would be
0x3f), so it's a bad choice for an error value.
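To illustrate why the substitution character makes a poor error value, here is a minimal sketch in Python. It is not the mbstring code: the function names are hypothetical, None stands in for whatever distinct failure value is returned, and "failure" is reduced to "codepoint out of range".

```python
SUBSTITUTE = "?"  # assumed default substitution character

def chr_with_sentinel(cp: int) -> str:
    # Old-style behavior: failure is reported as the substitution character.
    try:
        return chr(cp)
    except (ValueError, OverflowError):
        return SUBSTITUTE

def chr_or_none(cp: int):
    # New-style behavior: failure is reported out of band.
    try:
        return chr(cp)
    except (ValueError, OverflowError):
        return None

# 0x3F is valid input whose correct output is "?", so the sentinel
# version cannot tell it apart from a failure such as 0x110000.
assert chr_with_sentinel(0x3F) == chr_with_sentinel(0x110000)
assert chr_or_none(0x3F) == "?"
assert chr_or_none(0x110000) is None
```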
Previously mb_chr() had two different encoding-dependent behaviors:
* For "Unicode-encodings" it took a Unicode codepoint and returned
  its encoded representation.
* Otherwise it returned a big-endian binary encoding of the passed
  integer.
Now the input is always interpreted as a Unicode codepoint. If
a big-endian binary encoding is what you want, you don't need
mbstring to implement that.
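As a rough sketch of how little is needed for the big-endian case, the following Python helper packs an integer into bytes. The helper name is hypothetical, and using the minimum number of bytes is an assumption; the byte width used by the old behavior is not stated here.

```python
def big_endian_bytes(value: int) -> bytes:
    # Use the smallest number of bytes (at least one) that holds the value.
    length = max(1, (value.bit_length() + 7) // 8)
    return value.to_bytes(length, "big")

assert big_endian_bytes(0x3F) == b"\x3f"
assert big_endian_bytes(0x8140) == b"\x81\x40"
```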
The introduced checks were not correct in two respects:
* It was checked whether the source encoding of the string matches
  the internal encoding, while the actually relevant encoding is
  the *target* encoding.
* Even if the correct encoding is used, the checks are still too
  conservative. Just because something is not a "Unicode-encoding"
  does not mean that it does not map any non-ASCII characters.
I've reverted the added checks and instead adjusted mbfl_convert
to first try to use the provided substitution character and if
that fails, perform the fallback to '?' at that point. This means
that any codepoint mapped in the target encoding should now be
correctly supported and anything else should fall back to '?'.
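The fallback order can be sketched as follows. This is a Python illustration of the idea only, not the mbfl_convert code: `convert` is a hypothetical helper, and it assumes a stateless target encoding so that characters can be encoded one at a time.

```python
def convert(text: str, target: str, substitute: str = "?") -> bytes:
    out = bytearray()
    for ch in text:
        # Try the character itself, then the requested substitute,
        # and only then fall back to "?".
        for candidate in (ch, substitute, "?"):
            try:
                out += candidate.encode(target)
                break
            except UnicodeEncodeError:
                continue
    return bytes(out)

# U+3013 is not mapped in ISO-8859-1, so the fallback to "?" kicks in.
assert convert("a\u20acb", "latin-1", "\u3013") == b"a?b"
# U+00BF is mapped in ISO-8859-1, so it is used as given.
assert convert("a\u20acb", "latin-1", "\u00bf") == b"a\xbfb"
```

With a target such as Shift_JIS, where U+3013 is mapped, the same call would emit the encoded substitute instead of "?".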
The introduced checks did not treat "non-Unicode" encodings correctly,
because they treated the passed integer as encoded in the internal
encoding in that case, while in actuality the substitute character
is always a Unicode codepoint.
Additionally, checking the codepoint against the internal encoding
is not correct in any case, because the substitution character must
be mapped in the *target* encoding of the conversion, which does
not necessarily coincide with the internal encoding (the internal
encoding is the default *source* encoding, not *target* encoding).
This reverts the checks back to simple range checks, but in a way
that still resolves #69079: Characters outside the Basic
Multilingual Plane are now accepted and surrogate codepoints are
rejected. A distinction between UTF-8 and non-UTF-8 encodings is
not made for surrogate checks (as in the original patch), as
surrogates are always illegal on their own. Specifying a surrogate
as substitution character would only make sense if you could
specify a substitution string with more than one character --
however we do not support that.
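A simple range check of this kind might look like the sketch below (Python, for illustration only; the function name is hypothetical and accepting 0 as the lower bound is an assumption).

```python
def is_valid_substitute(cp: int) -> bool:
    # Accept the whole Unicode range, including planes beyond the BMP...
    if not 0 <= cp <= 0x10FFFF:
        return False
    # ...but reject surrogates, which are never legal on their own.
    return not 0xD800 <= cp <= 0xDFFF

assert is_valid_substitute(0x1F600)      # outside the BMP, accepted
assert not is_valid_substitute(0xD800)   # surrogate, rejected
assert not is_valid_substitute(0x110000) # beyond Unicode, rejected
```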
Implement full case mapping according to SpecialCasing.txt and
also full case folding according to CaseFolding.txt (F). There
are a number of caveats:
* Only language-agnostic and unconditional full case mapping
  is implemented. The only language-agnostic conditional case
  mapping rule relates to Greek sigma in final position
  (Final_Sigma). Correctly handling this requires both arbitrary
  lookahead and lookbehind, which would require some larger
  changes to how the case mapping is implemented. This is a
  possible future extension.
* The only language-specific handling that is implemented is
  for Turkish dotted/undotted Is, if the ISO-8859-9 encoding
  is used. This matches the previous behavior and makes sure
  that only codepoints supported by the encoding are
  produced. A future extension would be to also handle the
  Turkish mappings specified by SpecialCasing.txt based on
  the mbfl internal language.
* Full case folding is implemented, but case-insensitive mb_*
  operations continue to use simple case folding. The reason is
  that full case folding of the haystack string may change the
  position at which a match occurred. This would have to be
  mapped back into the position in the original string.
* mb_convert_case() exposes both the full and the simple case
  mapping / folding, where full is the default. The constants
  are:
  * MB_CASE_LOWER (used by mb_strtolower)
  * MB_CASE_UPPER (used by mb_strtoupper)
  * MB_CASE_TITLE
  * MB_CASE_FOLD
  * MB_CASE_LOWER_SIMPLE
  * MB_CASE_UPPER_SIMPLE
  * MB_CASE_TITLE_SIMPLE
  * MB_CASE_FOLD_SIMPLE (used by case-insensitive operations)
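The position problem with full case folding is easy to reproduce. The sketch below uses Python, whose str.casefold() performs full case folding; it only illustrates the offset shift and says nothing about how mbstring is implemented. `match_offsets` is a hypothetical helper.

```python
def match_offsets(haystack: str, needle: str):
    # Offset of the match after full case folding ("ß" folds to "ss").
    pos_folded = haystack.casefold().find(needle.casefold())
    # Offset after plain lowercasing, which keeps the length of this
    # particular input and therefore matches the original string.
    pos_original = haystack.lower().find(needle.lower())
    return pos_folded, pos_original

# The fold of "ß" adds one character, shifting the match by one.
assert "Maße".casefold() == "masse"
assert match_offsets("Maße und Gewichte", "UND") == (6, 5)
```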
This takes the substr from the offset to the end of the string.
This avoids pointless searching for the end position and also
saves us a length calculation in the strstr family of functions.
In particular strings now store encoding rather than the
no_encoding.
I've also pruned out libmbfl APIs that existed in two forms, one
using no_encoding and the other using encoding. We were not actually
using any of the former.
Store the last used encoding and compare against it. It's quite
likely that an application is going to be using the same encoding
again and again.
The actual mbfl_name2encoding() function could also be optimized
to use a hash lookup rather than a linear scan, but we don't have
a hashtable implementation in libmbfl...
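The caching idea amounts to a one-entry cache in front of the linear scan. A minimal Python sketch, with a made-up encoding table and a hypothetical class name, and no claim to mirror the actual C code:

```python
ENCODINGS = ["UTF-8", "ISO-8859-1", "Shift_JIS", "EUC-JP"]  # illustrative table

class EncodingLookup:
    def __init__(self, names):
        self._names = list(names)
        self._last_query = None
        self._last_result = None
        self.scans = 0  # number of linear scans performed, for demonstration

    def find(self, name: str):
        # Fast path: same name as the previous successful lookup.
        if self._last_result is not None and name == self._last_query:
            return self._last_result
        # Slow path: case-insensitive linear scan over all known names.
        self.scans += 1
        for candidate in self._names:
            if candidate.lower() == name.lower():
                self._last_query, self._last_result = name, candidate
                return candidate
        return None

lookup = EncodingLookup(ENCODINGS)
assert lookup.find("utf-8") == "UTF-8"
assert lookup.find("utf-8") == "UTF-8"
assert lookup.scans == 1  # the second call was served from the cache
```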
As a side-effect mb_strtolower() and mb_strtoupper() now correctly
handle a NULL encoding parameter by using the internal encoding.
This is what caused the two test changes.