Commit graph

1983 commits

Author SHA1 Message Date
Alex Dowad
e169ad3b61 Consolidate all single-byte encodings in one source file
We can squeeze out a lot of duplicated code in this way.
2020-11-11 11:18:59 +02:00
Alex Dowad
17d82b6886 Enhance mbstring support for UCS-2 text
- For consistency with UTF-16, UTF-32, and UCS-4, strip leading byte
  order marks.
- Treat it as an error if string is truncated (i.e. has an odd number
  of bytes).
2020-11-11 11:18:59 +02:00
Alex Dowad
6dd75478d5 Leading BOM is stripped for UTF-32
For consistency with UTF-16 and UCS-4.

Also, do some code cleanup.
2020-11-11 11:18:59 +02:00
Alex Dowad
1cf12c02f0 Add test suite for SJIS-mac encoding 2020-11-11 11:18:58 +02:00
Alex Dowad
fbdcab953d Unicode -> SJIS-mac conversion doesn't reject valid codepoints after a bad transcoding hint
To give the background on this issue, here is an excerpt from JAPANESE.txt,
from the Unicode Consortium:

    Apple has defined a block of 32 corporate characters as "transcoding
    hints." These are used in combination with standard Unicode characters
    to force them to be treated in a special way for mapping to other
    encodings; they have no other effect. Sixteen of these transcoding
    hints are "grouping hints" - they indicate that the next 2-4 Unicode
    characters should be treated as a single entity for transcoding. The
    other sixteen transcoding hints are "variant tags" - they are like
    combining characters, and can follow a standard Unicode (or a sequence
    consisting of a base character and other combining characters) to
    cause it to be treated in a special way for transcoding. These always
    terminate a combining-character sequence.

    The transcoding coding hints used in this mapping table are:

    0xF860  group next 2 characters as a single entity for transcoding
    0xF861  group next 3 characters as a single entity for transcoding
    0xF862  group next 4 characters as a single entity for transcoding
    0xF87A  variant tag for "negative" (i.e. black & white reversed)
    0xF87E  variant tag for vertical form
    0xF87F  variant tag for other alternate form

    For example, the Apple addition character 0x85AB is Roman numeral
    thirteen. There is no single Unicode for this (although there are
    standard Unicodes for Roman numerals 1-12). Using the grouping hint
    0xF862 in combination with standard Unicodes, we can map this as
    0xF862+0x0058+0x0049+0x0049+0x0049 (i.e. X + I + I + I).

Our SJIS-mac conversion code actually recognizes some special sequences
which start with an Apple 'transcoding hint'. However, if a transcoding
hint is misplaced and is not followed by one of the expected sequences,
we can just emit one error marker for the bad transcoding hint and then
process the following codepoint as normal.
2020-11-11 11:18:58 +02:00
Alex Dowad
b27a34c5a9 SJIS-mac encoding conversion: Stop the carnage of innocent Unicode codepoints
When converting Unicode to MacJapanese, some special sequences of Unicode
codepoints are collapsed into a single SJIS character. When the implementation
sees a codepoint which *might* begin such a sequence, it is cached and examined
again after the next codepoint arrives.

If it turns out that it wasn't one of the 'special' sequences, then a 'fallback'
conversion table is consulted to convert the cached codepoint. Then we re-enter
the regular conversion code to convert the immediately following codepoint.
BUT, local variables need to be reinitialized properly when doing this!

Because the locals weren't reinitialized, the sad result was that some codepoints
would get chopped up into bit salad and emitted as something totally bogus
(which might not even be valid SJIS-mac text at all).
2020-11-11 11:18:58 +02:00
Alex Dowad
7f0e86b2dc Convert Unicode halfwidth Yen sign to MacJapanese halfwidth Yen sign
Since 1993, Unicode has had a specific codepoint for a fullwidth Yen sign.
Likewise, MacJapanese has separate kuten codes for halfwidth and fullwidth
Yen signs. But mbstring mapped _both_ Yen sign codepoints to the
MacJapanese fullwidth Yen sign.

It's probably more appropriate to map the 'ordinary' Yen sign to the
MacJapanese halfwidth Yen sign. Besides, this means that the conversion
between Unicode and MacJapanese is closer to being lossless and reversible.
2020-11-11 11:18:58 +02:00
Alex Dowad
4c39cd3d1d SJIS-mac encoding conversion: handle invalid (or truncated) 2nd byte for Kanji correctly
Also, don't accept 1st bytes above 0xED, since none of the possible 2-byte
sequences starting with 0xEE and above are actually mapped to any character.
2020-11-11 11:18:58 +02:00
Alex Dowad
d40f9cf735 Add test suite for SJIS-2004 encoding 2020-11-11 11:18:58 +02:00
Alex Dowad
eda73a5f6f Don't mangle non-Japanese chars which appear after a 'combining' kana in SJIS-2004
Unicode has 'combining' characters which join with another following character.
Japanese hiragana and katakana with the 'two dots' voice mark can be represented
in this way, with one Unicode character for the 'base' kana and another one which
adds the voice mark.

In SJIS-2004, however, there are dedicated characters for voiced and unvoiced
kana. So some special checks are done to identify sequences of Unicode characters
which need to be 'collapsed' into a single SJIS-2004 character.

If a kana is immediately followed by some other unrelated character, like a
Cyrillic letter, then the cached kana should be output 'as is' and we
proceed with encoding the unrelated character. When doing this, though,
we need to re-initialize local variables, or else the unrelated character
will be mangled in some cases.
2020-11-11 11:18:58 +02:00
Alex Dowad
2f98bd8844 SJIS-2004 encoding conversion: handle invalid (or truncated) 2nd byte for Kanji correctly
If the 2nd byte of a 2-byte character is invalid, then mb_substitute_character()
should be respected. Instead, what mbstring was doing was 'swallowing' the
first byte, then emitting the 2nd byte as if it was an ASCII character.

Likewise, if the 2nd byte is missing, instead of just keeping quiet, report an
illegal character as specified by mb_substitute_character().
2020-11-11 11:18:58 +02:00
Alex Dowad
a5827c2d35 Fix broken binary search function in mbstring
This faulty binary search would never reject values at the very high
end of the range being searched, even if they were not actually in
the table.

Among other things, this meant that some Unicode codepoints which do
not correspond to any character in JIS X 0213 would be converted to
bogus Shift-JIS-2004 values rather than being rejected.
2020-11-11 11:18:58 +02:00
Alex Dowad
b05ad5112a Don't redundantly flush mbstring filters multiple times
Each flush function in a chain of mbstring conversion filters always
calls the next flush function in the chain. So it is not necessary to
explicitly flush the second filter in a chain. (Due to this bug, in many
cases, flush functions were actually being called three times.)
2020-11-11 11:18:58 +02:00
Alex Dowad
d1d50c2b7a Test EUC-JP and Shift-JIS more thoroughly
Previously, the unit tests for these text encodings covered all mappings
from legacy -> Unicode, and all _reversible_ mappings from Unicode -> legacy.
However, we should also test the few Unicode -> legacy mappings which
are not reversible.
2020-11-11 11:18:58 +02:00
Alex Dowad
3e7acf901d Remove mbstring identify filters
mbstring had an 'identify filter' for almost every supported text encoding
which was used when auto-detecting the most likely encoding for a string.
It would run over the string and set a 'flag' if it saw anything which
did not appear likely to be the encoding in question.

One problem with this scheme was that encodings which merely appeared
less likely to be the correct one were completely rejected, even if there
was no better candidate. Another problem was that the 'identify filters'
had a huge amount of code duplication with the 'conversion filters'.

Eliminate the identify filters. Instead, when auto-detecting text
encoding, use conversion filters to see whether the input string is valid
in candidate encodings or not. At the same type, watch the type of
codepoints which the string decodes to and mark it as less likely if
non-printable characters (ESC, form feed, bell, etc.) or 'private use
area' codepoints are seen.

Interestingly, one old test case in which JIS text was misidentified
as UTF-8 (and this wrong behavior was enshrined in the test) was 'fixed'
and the JIS string is now auto-detected as JIS.
2020-11-09 13:45:17 +02:00
Alex Dowad
a416f938f3 Treat non-ASCII characters as erroneous when converting ASCII text 2020-11-09 13:45:17 +02:00
Alex Dowad
8f6889b20d Fix mbstring support for EUC-JP text encoding
- Don't allow control characters to appear in the middle of a multi-byte
  character. (A strange feature, or perhaps misfeature, of mbstring which is
  not present in other libraries such as iconv.)
- When checking whether string is valid, reject kuten codes which do not
  map to any character, whether converting from EUC-JP to another encoding,
  or converting another encoding which uses JIS X 0208/0212 charsets to
  EUC-JP.
- Truncated multi-byte characters are treated as an error.
2020-11-09 13:45:17 +02:00
Alex Dowad
ad7e0f16cc Fix mbstring support for Shift-JIS
- Reject otherwise valid kuten codes which don't map to anything in JIS X 0208.
- Handle truncated multi-byte characters as an error.
- Convert Shift-JIS 0x7E to Unicode 0x203E (overline) as recommended by the
  Unicode Consortium, and as iconv does.
- Convert Shift-JIS 0x5C to Unicode 0xA5 (yen sign) as recommended by the
  Unicode Consortium, and as iconv does.
  (NOTE: This will affect PHP scripts which use an internal encoding of
  Shift-JIS! PHP assigns a special meaning to 0x5C, the backslash. For example,
  it is used for escapes in double-quoted strings. Mapping the Shift-JIS yen
  sign to the Unicode yen sign means the yen sign will not be usable for
  C escapes in double-quoted strings. Japanese PHP programmers who want to
  write their source code in Shift-JIS for some strange reason will have to
  use the JIS X 0208 backlash or 'REVERSE SOLIDUS' character for their C
  escapes.)
- Convert Unicode 0x5C (backslash) to Shift-JIS 0x815F (reverse solidus).
- Immediately handle error if first Shift-JIS byte is over 0xEF, rather than
  waiting to see the next byte. (Previously, the value used was 0xFC, which is
  the limit for the 2nd byte and not the 1st byte of a multi-byte character.)
- Don't allow 'control characters' to appear in the middle of a multi-byte
  character.

The test case for bug 47399 is now obsolete. That test assumed that a number
of Shift-JIS byte sequences which don't map to any character were 'valid'
(because the byte values were within the legal ranges).
2020-11-09 13:45:16 +02:00
Alex Dowad
cc03c54c36 Remove useless byte{2,4}{be,le} encodings from mbstring
There is no meaningful difference between these and UCS-{2,4}. They are
just a little bit more lax about passing errors silently. They also have
no known use.

Alias to UCS-{2,4} in case someone, somewhere is using them.
2020-11-09 13:45:16 +02:00
Alex Dowad
3eb8828d1a Fix issues with mbstring encoding tests
I made some mistakes on this code, which meant that not everything which
should be tested was actually being tested.
2020-11-09 13:45:16 +02:00
Alex Dowad
ff953f254c Add test suite for ARMSCII-8 encoding 2020-11-02 21:31:06 +02:00
Alex Dowad
9f5a4b3bd9 Fix mbstring support for ARMSCII-8
- Identify filter was completely wrong.
- Respect `mb_substitute_character` rather than converting invalid bytes to
  Unicode 0xFFFD (generic replacement character).
- Don't convert Unicode 0xFFFD to a valid ARMSCII-8 character.
- When converting ARMSCII-8 to ARMSCII-8, don't pass invalid bytes through
  silently.
2020-11-02 21:31:06 +02:00
Alex Dowad
be1a215538 Optimize (AND FIX) mb_check_encoding (cut execution time by 50%+)
Previously, `mb_check_encoding` did an awful lot of unneeded work. In order to
determine whether a string was valid or not, it would convert the whole string
into wchar (code points), which required dynamically allocating a (potentially
large) buffer. Then it would turn right around and convert that big 'ol buffer
of code points back to the original encoding again. Finally, it would check
whether any invalid bytes were detected during that long and onerous process.

The thing is, mbstring _already_ has machinery for detecting whether a string
is valid in a certain encoding or not, and it doesn't require copying any data
around or allocating buffers. Better yet, it can fail fast when an invalid byte
is found. Why not use it? It's sure a lot faster!

Further, the legacy code was also badly broken. Why? Because aside from
checking whether illegal characters were detected, it would also check whether
the conversion to and from wchars was lossless. But, some encodings have
more than one valid encoding for the same character. In such cases, it is
not possible to make the conversion to and from wchars lossless for every
valid character. So `mb_check_encoding` would actually reject good strings
in a lot of encodings!
2020-11-02 21:31:06 +02:00
Alex Dowad
335c1b98c2 Add test suite for KOI8-U encoding 2020-11-02 21:31:06 +02:00
Alex Dowad
e81458862b Remove dead code from mbfilter_koi8u.c (and do general code cleanup) 2020-11-02 21:31:06 +02:00
Alex Dowad
f9826fba46 All bytes are valid in KOI8-U encoding 2020-11-02 21:31:06 +02:00
Alex Dowad
9db4387f14 Add test suite for KOI8-R encoding 2020-11-02 21:31:06 +02:00
Alex Dowad
fde7794556 Remove dead code from mbfilter_iso8859_{2,4,5,9,10,13,14,15,16}.c
...Plus some dead code related to ISO-8859-1.
2020-11-02 21:31:06 +02:00
Alex Dowad
0a8ebb36a5 Remove dead code from mbfilter_koi8r.c 2020-11-02 21:31:06 +02:00
Alex Dowad
7b97789ec0 All bytes are valid in KOI8-R encoding 2020-11-02 21:31:06 +02:00
Alex Dowad
9980534a4e Add test suite for CP850 encoding 2020-11-02 21:31:06 +02:00
Alex Dowad
b6e75265d0 Remove dead code from mbfilter_cp850.c (and do general code cleanup)
Since there are no invalid bytes in CP850, these `if` conditions will never
be true.
2020-11-02 21:31:06 +02:00
Alex Dowad
8926252ee8 All bytes are valid in CP850 encoding 2020-11-02 21:31:06 +02:00
Alex Dowad
0485bed4c7 Add test suite for CP866 encoding 2020-11-02 21:31:06 +02:00
Alex Dowad
20a404f765 Remove dead code from mbfilter_cp866.c (and do general code cleanup)
Since there are no invalid bytes in CP866, these `if` conditions will never
be true.
2020-11-02 21:31:06 +02:00
Alex Dowad
bc04e0cc6d All bytes are valid in CP866 encoding 2020-11-02 21:31:05 +02:00
Alex Dowad
0b13305ccc Add test suite for CP1254 encoding 2020-11-02 21:31:05 +02:00
Alex Dowad
e6d17cfe44 Fix mbstring support for CP1254 encoding
One funny thing: while the original author used Unicode 0xFFFD (generic
replacement character) for invalid bytes in CP1251 and CP1252, for CP1254
they used 0xFFFE, which is not a valid Unicode codepoint at all, but is a
reversed byte-order mark. Probably this was by mistake.

Anyways,

- Fixed identify filter, which was completely wrong.
- Don't convert Unicode 0xFFFE to a random (but valid) CP1254 byte.
- When converting CP1254 to CP1254, don't pass invalid bytes through silently.
2020-11-02 21:31:05 +02:00
Alex Dowad
eb4151e89e Add test suite for CP1251 encoding 2020-11-02 21:31:05 +02:00
Alex Dowad
44bd5804b0 Fix mbstring support for CP1251 encoding
- Identify filter was as wrong as wrong can be.
- Invalid CP1251 byte 0x98 was converted to Unicode 0xFFFD (generic
  replacement character), rather than respecting `mb_substitute_character`.
- Unicode 0xFFFD was converted to some random CP1251 byte.
- When converting CP1251 to CP1251, don't pass invalid bytes through silently.
2020-11-02 21:31:05 +02:00
Alex Dowad
b18b9c9ef6 Test cases for mbstring encodings are less repetitive 2020-11-02 21:31:05 +02:00
Alex Dowad
831abe2d90 Add test suite for CP1252 encoding
Also remove a bogus test (bug62545.phpt) which wrongly assumed that all invalid
characters in CP1251 and CP1252 should map to Unicode 0xFFFD (REPLACEMENT
CHARACTER).

mbstring has an interface to specify what invalid characters should be
replaced with; it's called `mb_substitute_character`. If a user wants to see
the Unicode 'replacement character', they can specify that using
`mb_substitute_character`. But if they specify something else, we should
follow that.
2020-10-30 22:13:27 +02:00
Alex Dowad
b5ff87ca71 Fix mbstring support for CP1252 encoding
It's a bit surprising how much was broken here.

- Identify filter was utterly and completely wrong.
- Instead of handling invalid CP1252 bytes as specified by
  `mb_substitute_character`, it would convert them to Unicode 0xFFFD
  (generic replacement character).
- When converting ISO-8859-1 to CP1252, invalid ISO-8859-1 bytes would
  be passed through silently.
- Unicode codepoints from 0x80-0x9F were converted to CP1252 bytes 0x80-0x9F,
  which is wrong.
- Unicode codepoint 0xFFFD was converted to CP1252 0x9F, which is very wrong.

Also clean up some unneeded code, and make the conversion table consistent with
others by using zero as a 'invalid' marker, rather than 0xFFFD.
2020-10-30 22:13:27 +02:00
Alex Dowad
e26234a044 UTF-32 conversion treats truncated characters as illegal 2020-10-27 10:19:01 +02:00
Alex Dowad
7047e5d2c4 Add identify filter for UTF-32{,BE,LE} 2020-10-27 10:19:01 +02:00
Alex Dowad
d8895cd054 Improve error handling for UTF-16{,BE,LE}
Catch various errors such as the first part of a surrogate pair not being
followed by a proper second part, the first part of a surrogate pair appearing
at the end of a string, the second part of a surrogate pair appearing out
of place, and so on.
2020-10-27 10:19:01 +02:00
Alex Dowad
d9ddeb6e85 UTF-16 text conversion handles truncated characters as illegal
This broke one old test (Zend/tests/multibyte_encoding_003.phpt), which used
a PHP script encoded as UTF-16. The problem was that to terminate the test
script, we need the text: "\n--EXPECT--". Out of that text, the terminating
newline (0x0A byte) becomes part of the resulting test script; but a bare
0x0A byte with no 0x00 is not valid UTF-16.

Since we now treat truncated UTF-16 characters as erroneous, an extra '?' is
appended to the output as an 'illegal character' marker.

Really, if we are running PHP scripts which are treated as encoded in UTF-16
or some other arbitrary text encoding (not ASCII), and the script is not
actually a valid string in that encoding, inserting '?' characters into the
code which the PHP interpreter runs is a bad thing to do. In such cases, the
script shouldn't be treated as UTF-16 (or whatever) at all.

I wonder if mbstring's encoding detection is being used in 'non-strict' mode?
2020-10-27 10:19:00 +02:00
Alex Dowad
84c180d88b Add test suite for ISO-8859-x encoding verification and conversion 2020-10-16 22:25:48 +02:00
Alex Dowad
bc18e32690 Do not pass invalid ISO-8859-{3,6,7,8} characters through silently
mbstring has a bad habit of passing invalid characters through silently
when converting to the same (or a "compatible") encoding.

For example, if you give it an invalid JIS X 0208 kuten code encoded with SJIS,
and try to convert that to EUC-JP, mbstring will just quietly re-encode the
invalid code in the EUC-JP representation.

At the same, some parts of the code (like `mb_check_encoding`) assume that
invalid characters will be treated as... well, invalid. Let's unbreak things
by actually catching errors and reporting them, instead of swallowing them.
2020-10-16 22:17:45 +02:00
Alex Dowad
5c6b2a7ad2 Add identify filter for ISO-8859-8 (Latin/Hebrew) 2020-10-16 22:17:45 +02:00