Commit graph

2207 commits

Author SHA1 Message Date
Alex Dowad
3f12d26e3a Merge branch 'PHP-8.1'
* PHP-8.1:
  Error handling for UTF-8 complies with WHATWG specification
2022-04-16 20:32:12 +02:00
Alex Dowad
04e59c916f Error handling for UTF-8 complies with WHATWG specification
In 7502c86342, I adjusted the number of error markers emitted on
invalid UTF-8 text to be more consistent with mbstring's behavior on
other text encodings (generally, it emits one error marker for one
unexpected byte). I didn't expect that anybody would actually care one
way or the other, but felt that it was better to be consistent than
not.

Later, Martin Auswöger kindly pointed out that the WHATWG encoding
specification, which governs how various text encodings are handled
by web browsers, does actually specify how many error markers should
be generated for any given piece of invalid UTF-8 text.

Until now, we have never really paid much attention to the WHATWG
specification, but we do want to comply with as many relevant
specifications as possible. And since PHP is commonly used for web
applications, compatibility with the behavior of web browsers is
obviously a good thing.
2022-04-16 15:04:38 +02:00
Christoph M. Becker
20c0eb47df
Merge branch 'PHP-8.1'
* PHP-8.1:
  Fix GH-8208: mb_encode_mimeheader: $indent functionality broken
2022-03-17 17:35:06 +01:00
Christoph M. Becker
5003831260
Merge branch 'PHP-8.0' into PHP-8.1
* PHP-8.0:
  Fix GH-8208: mb_encode_mimeheader: $indent functionality broken
2022-03-17 17:34:31 +01:00
Christoph M. Becker
d0417ebc93
Fix GH-8208: mb_encode_mimeheader: $indent functionality broken
We also need to factor in the indent, when getting the encoder result.

Closes GH-8213.
2022-03-17 17:31:58 +01:00
Alex Dowad
ff76694f28 Merge branch 'PHP-8.1'
* PHP-8.1:
  mb_check_encoding($str, '7bit') rejects strings with bytes over 0x7F
2022-02-22 23:58:57 +02:00
Alex Dowad
8a8533d263 mb_check_encoding($str, '7bit') rejects strings with bytes over 0x7F
This was the old behavior of mb_check_encoding() before 3e7acf901d,
but yours truly broke it. If only we had more thorough tests at that
time, this might not have slipped through the cracks.

Thanks to divinity76 for the report.
2022-02-22 23:56:56 +02:00
Christoph M. Becker
58cbee1ce3
Merge branch 'PHP-8.1'
* PHP-8.1:
  Fix GH-7902: mb_send_mail may delimit headers with LF only
2022-01-18 13:11:01 +01:00
Christoph M. Becker
69f6b09b2a
Merge branch 'PHP-8.0' into PHP-8.1
* PHP-8.0:
  Fix GH-7902: mb_send_mail may delimit headers with LF only
2022-01-18 13:09:52 +01:00
Christoph M. Becker
03816fba46
Fix GH-7902: mb_send_mail may delimit headers with LF only
Email headers are supposed to be separated with CRLF. Period.

We introduce a `CRLF` macro for better comprehensibility right away.

Closes GH-7907.
2022-01-18 13:08:08 +01:00
Dmitry Stogov
e1782c08bf Fix ASAN undefined behavior (unsigned char << 24)
ext/mbstring/libmbfl/filters/mbfilter_utf32.c:259:20: runtime error: left shift of 128 by 24 places cannot be represented in type 'int'
2022-01-11 09:13:22 +03:00
Christoph M. Becker
51eec5086f
Run mb_send_mail tests on Windows, too
We use the run-tests.php `{MAIL}` abstraction instead of `cat`.

Closes GH-7908.
2022-01-07 22:46:02 +01:00
Alex Dowad
53ffba967c Implement fast text conversion interface for CP5022{0,1,2} 2021-12-26 22:19:51 +02:00
Alex Dowad
01afd9f141 Implement fast text conversion interface for JIS 2021-12-26 22:19:51 +02:00
Alex Dowad
cb4626c5b2 Implement fast text conversion interface for GB18030 2021-12-26 22:19:51 +02:00
Alex Dowad
3e8088dc80 Implement fast text conversion interface for EUC-JP-MS 2021-12-26 22:19:51 +02:00
Alex Dowad
e5af94b74f Implement fast text conversion interface for CP51932 2021-12-26 22:19:51 +02:00
Alex Dowad
6ef1b35223 Implement fast text conversion interface for EUC-CN 2021-12-26 22:19:51 +02:00
Alex Dowad
9bd08a97d9 Implement fast text conversion interface for EUC-TW 2021-12-26 22:19:51 +02:00
Alex Dowad
661a10160b Implement fast text conversion interface for CP936 2021-12-26 22:19:51 +02:00
Alex Dowad
20555371d5 Implement fast text conversion interface for CP932 2021-12-26 22:19:51 +02:00
Alex Dowad
43bb97c539 Implement fast text conversion interface for EUC-KR 2021-12-26 22:19:51 +02:00
Alex Dowad
c0936d48b0 Implement fast text conversion interface for UHC 2021-12-26 22:19:51 +02:00
Alex Dowad
40809cb19f Implement fast text conversion interface for HZ 2021-12-26 22:19:51 +02:00
Alex Dowad
da58d42d94 Implement fast text conversion interface for CP950 2021-12-26 22:19:51 +02:00
Alex Dowad
eac50a360f Implement fast text conversion interface for Big5 2021-12-26 22:19:51 +02:00
Alex Dowad
3c73225125 New internal interface for fast text conversion in mbstring
When converting text to/from wchars, mbstring makes one function call
for each and every byte or wchar to be converted. Typically, each of
these conversion functions contains a state machine, and its state has
to be restored and then saved for every single one of these calls.
It doesn't take much to see that this is grossly inefficient.

Instead of converting one byte or wchar on each call, the new
conversion functions will either fill up or drain a whole buffer of
wchars on each call. In benchmarks, this is about 3-10× faster.

Adding the new, faster conversion functions for all supported legacy
text encodings still needs some work. Also, all the code which uses
the old-style conversion functions needs to be converted to use the
new ones. After that, the old code can be dropped. (The mailparse
extension will also have to be fixed up so it will still compile.)
2021-12-21 08:33:11 +02:00
Alex Dowad
edc6b756c1 Merge branch 'PHP-8.1'
* PHP-8.1:
  mb_convert_encoding will not auto-detect input string as UUEncode, Base64, QPrint
2021-12-20 22:47:18 +02:00
Alex Dowad
f07c193583 mb_convert_encoding will not auto-detect input string as UUEncode, Base64, QPrint
In a2bc57e0e5, mb_detect_encoding was modified to ensure it would never
return 'UUENCODE', 'QPrint', or other non-encodings as the "detected
text encoding". Before mb_detect_encoding was enhanced so that it could
detect any supported text encoding, those were never returned, and they
are not desired. Actually, we want to eventually remove them completely
from mbstring, since PHP already contains other implementations of
UUEncode, QPrint, Base64, and HTML entities.

For more clarity on why we need to suppress UUEncode, etc. from being
detected by mb_detect_encoding, the existing UUEncode implementation
in mbstring *never* treats any input as erroneous. It just accepts
everything. This means that it would *always* be treated as a valid
choice by mb_detect_encoding, and would be returned in many, many cases
where the input is obviously not UUEncoded.

It turns out that the form of mb_convert_encoding where the user passes
multiple candidate encodings (and mbstring auto-detects which one to
use) was also affected by the same issue. Apply the same fix.
2021-12-20 22:09:33 +02:00
Christoph M. Becker
97f78b3bb7
Merge branch 'PHP-8.1'
* PHP-8.1:
  Fix #81693: mb_check_encoding(7bit) segfaults
2021-12-03 22:50:27 +01:00
Christoph M. Becker
929d847152
Fix #81693: mb_check_encoding(7bit) segfaults
`php_mb_check_encoding()` now uses conversion to `mbfl_encoding_wchar`.
Since `mbfl_encoding_7bit` has no `input_filter`, no filter can be
found.  Since we don't actually need to convert to wchar, we encode to
8bit.

Closes GH-7712.
2021-12-03 22:49:47 +01:00
Alex Dowad
ee3caef8eb Merge branch 'PHP-8.1'
* PHP-8.1:
  Add unit tests for mb_detect_encoding on Polish text
2021-11-26 17:43:40 +02:00
Alex Dowad
1a2c608053 Add unit tests for mb_detect_encoding on Polish text 2021-11-26 17:42:53 +02:00
Remi Collet
7c0f2b4dc0
Merge branch 'PHP-8.1'
* PHP-8.1:
  add missing cond.
  Enable encoding detection for Polish text
2021-11-25 10:16:34 +01:00
Alex Dowad
d573054ebe Enable encoding detection for Polish text
Previously, some accented letters commonly used to write Polish text
were counted as 'rare' codepoints. Treat them as 'common' instead.

Thanks to Alec for pointing this out.
2021-11-25 11:10:47 +02:00
Dmitry Stogov
1a4f49f1fe Use cheaper memchr() instead of php_memnstr() 2021-11-10 10:19:49 +03:00
Alex Dowad
9308974f8c Deprecate use of mbstring to convert text to Base64/QPrint/HTML entities/etc
The purpose of mbstring is for working with Unicode and legacy text
encodings; but Base64, QPrint, etc. are not text encodings and don't
really belong in mbstring. PHP already contains separate implementations
of Base64, QPrint, and HTML entities. It will be better to eventually
remove these non-encodings from mbstring.

Regarding HTML entities... there is a bit more to say. mbstring's
implementation of HTML entities is different from the other built-in
implementation (htmlspecialchars and htmlentities). Those functions
convert <, >, and & to HTML entities, but mbstring does not.

It appears that the original author of mbstring intended for something
to be done with <, >, and &. He used a table to identify which
characters should be converted to HTML entities, and </>/& all have a
special value in that table. However, nothing ever checks for that
special value, so the characters are passed through unconverted.

This seems like a very useless implementation of HTML entities. The most
important characters which need to be expressed as entities in HTML
documents are those three!
2021-11-01 11:23:21 +02:00
Christoph M. Becker
7c75c61206
Merge branch 'PHP-8.1'
* PHP-8.1:
  Fix #76167: mbstring may use pointer from some previous request
2021-10-25 12:41:46 +02:00
Christoph M. Becker
7fcf17c41e
Merge branch 'PHP-8.0' into PHP-8.1
* PHP-8.0:
  Fix #76167: mbstring may use pointer from some previous request
2021-10-25 12:41:21 +02:00
Christoph M. Becker
6e6a8443a8
Merge branch 'PHP-7.4' into PHP-8.0
* PHP-7.4:
  Fix #76167: mbstring may use pointer from some previous request
2021-10-25 12:39:57 +02:00
Christoph M. Becker
d3d6d7906e
Fix #76167: mbstring may use pointer from some previous request
We must not reuse per-request memory across multiple requests, so this
check triggered during RINIT makes no sense.  As explained in the bug
report[1], it can be even harmful, if some request startup fails, and
the pointers refer to already freed memory in the next request.

[1] <https://bugs.php.net/76167>

Closes GH-7604.
2021-10-25 12:37:28 +02:00
Alex Dowad
9962aa9774 Merge branch 'PHP-8.1'
* PHP-8.1:
  mb_detect_encoding will not return non-encodings
  Improve detection accuracy of mb_detect_encoding
2021-10-19 18:11:35 +02:00
Alex Dowad
a2bc57e0e5 mb_detect_encoding will not return non-encodings
Among the text encodings supported by mbstring are several which are
not really 'text encodings'. These include Base64, QPrint, UUencode,
HTML entities, '7 bit', and '8 bit'.

Rather than providing an explicit list of text encodings which they are
interested in, users may pass the output of mb_list_encodings to
mb_detect_encoding. Since Base64, QPrint, and so on are included in
the output of mb_list_encodings, mb_detect_encoding can return one of
these as its 'detected encoding' (and in fact, this often happens).
Before mb_detect_encoding was enhanced so it could detect any of the
supported text encodings, this did not happen, and it is never desired.
2021-10-19 18:05:52 +02:00
Alex Dowad
28b346bc06 Improve detection accuracy of mb_detect_encoding
Originally, `mb_detect_encoding` essentially just checked all candidate
encodings to see which ones the input string was valid in. However, it
was only able to do this for a limited few of all the text encodings
which are officially supported by mbstring.

In 3e7acf901d, I modified it so it could 'detect' any text encoding
supported by mbstring. While this is arguably an improvement, if the
only text encodings one is interested in are those which
`mb_detect_encoding` could originally handle, the old
`mb_detect_encoding` may have been preferable. Because the new one has
more possible encodings which it can guess, it also has more chances to
get the answer wrong.

This commit adjusts the detection heuristics to provide accurate
detection in a wider variety of scenarios. While the previous detection
code would frequently confuse UTF-32BE with UTF-32LE or UTF-16BE with
UTF-16LE, the adjusted code is extremely accurate in those cases.
Detection for Chinese text in Chinese encodings like GB18030 or BIG5
and for Japanese text in Japanese encodings like EUC-JP or SJIS is
greatly improved. Detection of UTF-7 is also greatly improved. An 8KB
table, with one bit for each codepoint from U+0000 up to U+FFFF, is
used to achieve this.

One significant constraint is that the heuristics are completely based
on looking at each codepoint in a string in isolation, treating some
codepoints as 'likely' and others as 'unlikely'. It might still be
possible to achieve great gains in detection accuracy by looking at
sequences of codepoints rather than individual codepoints. However,
this might require huge tables. Further, we might need a huge corpus
of text in various languages to derive those tables.

Accuracy is still dismal when trying to distinguish single-byte
encodings like ISO-8859-1, ISO-8859-2, KOI8-R, and so on. This is
because the valid bytes in these encodings are basically all the same,
and all valid bytes decode to 'likely' codepoints, so our method of
detection (which is based on rating codepoints as likely or unlikely)
cannot tell any difference between the candidates at all. It just
selects the first encoding in the provided list of candidates.

Speaking of which, if one wants to get good results from
`mb_detect_encoding`, it is important to order the list of candidate
encodings according to your prior belief of which are more likely to
be correct. When the function cannot tell any difference between two
candidates, it returns whichever appeared earlier in the array.
2021-10-19 18:05:51 +02:00
Alex Dowad
dcaa010fff Strict validation of conversion flags to mb_convert_kana
mb_convert_kana is controlled by user-provided flags, which specify what it should convert
and to what. These flags come in inverse pairs, for example "fullwidth numerals to halfwidth
numerals" and "halfwidth numerals to fullwidth numerals". It does not make sense to combine
inverse flags.

But, clever reader of commit logs, you will surely say: What if I want all my halfwidth
numerals to become fullwidth, and all my fullwidth numerals to become halfwidth? Much too
clever, you are! Let's put aside the fact that this bizarre switch-up is ridiculous and
will never be used, and face up to another stark reality: mb_convert_kana does not work
for that case, and never has. This was probably never noticed because nobody ever tried.

Disallowing useless combinations of flags gives freedom to rearrange the kana conversion
code without changing behavior.

We can also reject unrecognized flags. This may help users to catch bugs.

Interestingly, the existing tests used a 'Z' flag, which is useless (it's not recognized
at all).
2021-10-01 19:27:39 +02:00
Alex Dowad
7800491289 Inline SKIP_LONG_HEADER... macro which is only used once
I don't find that pulling this code out into a macro makes anything
clearer. Not at all.
2021-09-29 18:19:01 +02:00
Alex Dowad
0b32a15eb0 Optimize mb_str{,im}width for performance
Rather than doing a linear search of a table of fullwidth codepoint
ranges for every input character,

1) Short-cut the search if the codepoint is below the first such range
2) Otherwise, do a binary (rather than linear) search
2021-09-29 18:19:01 +02:00
Alex Dowad
f4365d2c26 Remove unused typedef 'mbfl_encoding_id' 2021-09-29 18:19:01 +02:00
Alex Dowad
3bf431969e Don't check for impossible error condition in mb_substr_count 2021-09-29 18:19:01 +02:00
Alex Dowad
8c32deb605 Don't check for impossible error condition in mb_strwidth 2021-09-29 18:19:01 +02:00