791 commits

Author SHA1 Message Date
pakutoma
b721d0f71e Fix GH-10648: add check function pointer into mbfl_encoding
Previously, mbstring used the same logic for encoding validation as for
encoding conversion.

However, there are cases where we want to use different logic for validation
and conversion. For example, if a string ends without input that the
encoding requires, or if it contains a character that is invalid in the
encoding but can still be converted, the conversion should succeed and
the validation should fail.

To achieve this, a function pointer mb_check_fn has been added to
struct mbfl_encoding to implement the logic used for validation.
This commit also adds validation logic for UTF-7, UTF7-IMAP,
ISO-2022-JP and JIS.

(The same change has already been made to PHP 8.2 and 8.3; see
6fc8d014df. This commit is backporting the change to PHP 8.1.)
2023-03-25 09:52:10 +02:00
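The split between conversion and validation described in b721d0f71e can be sketched roughly as follows. This is only an illustration, not mbstring's code: `toy_encoding`, `toy_decode`, `toy_check`, and `toy_validate` are hypothetical names, and the toy encoding simply uses two bytes per character, so an odd-length input counts as "missing input".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoding descriptor: conversion and validation are
 * separate function pointers, so they can disagree about an input. */
typedef struct {
    const char *name;
    size_t (*decode)(const unsigned char *in, size_t len,
                     uint32_t *out, size_t out_cap);
    bool (*check)(const unsigned char *in, size_t len);
} toy_encoding;

/* Lenient conversion: a dangling final byte becomes '?' and the
 * conversion still succeeds. Returns the number of code points written. */
static size_t toy_decode(const unsigned char *in, size_t len,
                         uint32_t *out, size_t out_cap)
{
    size_t n = 0, i = 0;
    assert(in != NULL && out != NULL);
    for (; i + 1 < len && n < out_cap; i += 2) {
        out[n++] = ((uint32_t)in[i] << 8) | in[i + 1];
    }
    if (i < len && n < out_cap) {
        out[n++] = '?'; /* substitute for the truncated character */
    }
    return n;
}

/* Strict validation: input with a dangling byte is rejected. */
static bool toy_check(const unsigned char *in, size_t len)
{
    (void)in;
    return len % 2 == 0;
}

static const toy_encoding toy_two_byte = { "TOY-2BYTE", toy_decode, toy_check };

/* Validation goes through the dedicated check pointer, not the decoder. */
static bool toy_validate(const toy_encoding *enc,
                         const unsigned char *in, size_t len)
{
    assert(enc != NULL && enc->check != NULL);
    return enc->check(in, len);
}
```

With the three-byte input `{0x00, 0x41, 0x00}`, `toy_decode` produces two code points while `toy_validate` reports the string as invalid, which is the behaviour the commit message asks for.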
Alex Dowad
7c1ee5a02a mb_encode_mimeheader does not crash if provided encoding has no MIME name set 2023-03-07 11:30:21 +02:00
Niels Dossche
ed0c0df351
Fix GH-10627: mb_convert_encoding crashes PHP on Windows
Fixes GH-10627

The php_mb_convert_encoding() function can return NULL on error, but
this case was not handled, which led to a NULL pointer dereference and
hence a crash.

Closes GH-10628

Signed-off-by: George Peter Banyard <girgias@php.net>
2023-02-20 13:33:11 +00:00
Max Kellermann
243865ae57
ext/mbstring: fix new_value length check
Commit 8bbd0952e5 added a check rejecting empty strings; in the
merge commit 379d9a1cfc, however, it was changed to a NULL check,
one that did not make sense because ZSTR_VAL() is guaranteed to never
be NULL; the length check was accidentally removed by that merge commit.

This bug was found by GCC's -Waddress warning:

 ext/mbstring/mbstring.c:748:27: warning: the comparison will always evaluate as ‘true’ for the address of ‘val’ will never be NULL [-Waddress]
   748 |         if (!new_value || !ZSTR_VAL(new_value)) {
       |                           ^

Closes GH-10532

Signed-off-by: George Peter Banyard <girgias@php.net>
2023-02-20 13:32:56 +00:00
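The reasoning in 243865ae57 can be illustrated with a small stand-in for a length-prefixed string. The sketch assumes a struct with a flexible array member; `toy_string`, `toy_string_new`, and `toy_value_is_empty` are hypothetical names, not the Zend API. Because the character data lives inside the struct, its address can never be NULL, so only a length test detects an empty value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical length-prefixed string. */
typedef struct {
    size_t len;
    char val[]; /* flexible array member: its address is never NULL */
} toy_string;

/* Allocates a copy of s; returns NULL only if allocation fails. */
static toy_string *toy_string_new(const char *s)
{
    assert(s != NULL);
    size_t len = strlen(s);
    toy_string *str = malloc(sizeof *str + len + 1);
    if (str == NULL) {
        return NULL;
    }
    str->len = len;
    memcpy(str->val, s, len + 1); /* copy including the terminator */
    return str;
}

/* Testing `!s->val` here would always be false, so test the length. */
static bool toy_value_is_empty(const toy_string *s)
{
    return s == NULL || s->len == 0;
}
```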
Christoph M. Becker
c2bdaa48e1
Fix GH-9008: mb_detect_encoding(): wrong results with null $encodings
Passing `null` to `$encodings` is supposed to behave like passing the
result of `mb_detect_order()`.  Therefore, we need to remove the non-
encodings from the `elist` in this case as well.  Thus, we duplicate
the global `elist`, so we can modify it.

Closes GH-9063.
2022-07-20 16:58:55 +02:00
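The approach in c2bdaa48e1 (copy the global list, then drop the non-encodings from the copy) might look roughly like this. The types and names are hypothetical, and the `is_text_encoding` flag is an assumption made for the sketch; how mbstring actually marks non-encodings is not shown in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical list entry; the flag is an assumption for this sketch. */
typedef struct {
    const char *name;
    bool is_text_encoding;
} toy_encoding_entry;

/* Copies only real text encodings from src into dst (capacity >= n).
 * The source list is left untouched. Returns the number kept. */
static size_t toy_copy_text_encodings(const toy_encoding_entry *const *src,
                                      size_t n,
                                      const toy_encoding_entry **dst)
{
    size_t kept = 0;
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        if (src[i]->is_text_encoding) {
            dst[kept++] = src[i];
        }
    }
    return kept;
}
```

Filtering into a separate array keeps the shared detect order intact for later calls.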
Remi Collet
966a90873d
Merge branch 'PHP-8.0' into PHP-8.1
* PHP-8.0:
  NEWS for GH-8685
  Fix GH-8685 mbstring requires pcre
2022-06-03 07:54:58 +02:00
Remi Collet
2eb2f9d74f
Fix GH-8685 mbstring requires pcre 2022-06-03 07:53:48 +02:00
Christoph M. Becker
69f6b09b2a
Merge branch 'PHP-8.0' into PHP-8.1
* PHP-8.0:
  Fix GH-7902: mb_send_mail may delimit headers with LF only
2022-01-18 13:09:52 +01:00
Christoph M. Becker
03816fba46
Fix GH-7902: mb_send_mail may delimit headers with LF only
Email headers are supposed to be separated with CRLF. Period.

We introduce a `CRLF` macro for better comprehensibility right away.

Closes GH-7907.
2022-01-18 13:08:08 +01:00
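A minimal sketch of the idea behind 03816fba46: define the line ending once and use it for every header. The macro body and the helper `toy_append_header` are assumptions for illustration; the commit message only says that a `CRLF` macro was introduced.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define CRLF "\r\n" /* assumed definition of the macro */

/* Appends "name: value" CRLF to the NUL-terminated string in buf.
 * Returns 0 on success, -1 if the header does not fit. */
static int toy_append_header(char *buf, size_t cap,
                             const char *name, const char *value)
{
    size_t used = strlen(buf);
    assert(used < cap);
    int n = snprintf(buf + used, cap - used, "%s: %s" CRLF, name, value);
    if (n < 0 || (size_t)n >= cap - used) {
        buf[used] = '\0'; /* undo the partial write */
        return -1;
    }
    return 0;
}
```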
Alex Dowad
f07c193583 mb_convert_encoding will not auto-detect input string as UUEncode, Base64, QPrint
In a2bc57e0e5, mb_detect_encoding was modified to ensure it would never
return 'UUENCODE', 'QPrint', or other non-encodings as the "detected
text encoding". Before mb_detect_encoding was enhanced so that it could
detect any supported text encoding, those were never returned, and they
are not desired. Actually, we want to eventually remove them completely
from mbstring, since PHP already contains other implementations of
UUEncode, QPrint, Base64, and HTML entities.

For more clarity on why we need to suppress UUEncode, etc. from being
detected by mb_detect_encoding, the existing UUEncode implementation
in mbstring *never* treats any input as erroneous. It just accepts
everything. This means that it would *always* be treated as a valid
choice by mb_detect_encoding, and would be returned in many, many cases
where the input is obviously not UUEncoded.

It turns out that the form of mb_convert_encoding where the user passes
multiple candidate encodings (and mbstring auto-detects which one to
use) was also affected by the same issue. Apply the same fix.
2021-12-20 22:09:33 +02:00
Christoph M. Becker
7fcf17c41e
Merge branch 'PHP-8.0' into PHP-8.1
* PHP-8.0:
  Fix #76167: mbstring may use pointer from some previous request
2021-10-25 12:41:21 +02:00
Christoph M. Becker
6e6a8443a8
Merge branch 'PHP-7.4' into PHP-8.0
* PHP-7.4:
  Fix #76167: mbstring may use pointer from some previous request
2021-10-25 12:39:57 +02:00
Christoph M. Becker
d3d6d7906e
Fix #76167: mbstring may use pointer from some previous request
We must not reuse per-request memory across multiple requests, so this
check triggered during RINIT makes no sense.  As explained in the bug
report[1], it can be even harmful, if some request startup fails, and
the pointers refer to already freed memory in the next request.

[1] <https://bugs.php.net/76167>

Closes GH-7604.
2021-10-25 12:37:28 +02:00
Alex Dowad
a2bc57e0e5 mb_detect_encoding will not return non-encodings
Among the text encodings supported by mbstring are several which are
not really 'text encodings'. These include Base64, QPrint, UUencode,
HTML entities, '7 bit', and '8 bit'.

Rather than providing an explicit list of text encodings which they are
interested in, users may pass the output of mb_list_encodings to
mb_detect_encoding. Since Base64, QPrint, and so on are included in
the output of mb_list_encodings, mb_detect_encoding can return one of
these as its 'detected encoding' (and in fact, this often happens).
Before mb_detect_encoding was enhanced so it could detect any of the
supported text encodings, this did not happen, and it is never desired.
2021-10-19 18:05:52 +02:00
Nikita Popov
46315defc7 Use locale-independent case conversion in mb_send_mail()
Headers should not be processed in a locale-dependent fashion.
Switch from upper to lowercasing because that's the standard for
PHP and we provide an ASCII implementation of this operation.

This is adapted from GH-7506.
2021-09-23 17:20:54 +02:00
Alex Dowad
ca33ab59ad mb_detect_encoding with only one candidate encoding uses mb_check_encoding
...Because it's about 5% faster.
2021-09-20 11:20:53 +02:00
Alex Dowad
776296e12f mbstring no longer provides 'long' substitutions for erroneous input bytes
Previously, mbstring had a special mode whereby it would convert
erroneous input byte sequences to output like "BAD+XXXX", where "XXXX"
would be the erroneous bytes expressed in hexadecimal. This mode could
be enabled by calling `mb_substitute_character("long")`.

However, accurately reproducing input byte sequences from the cached
state of a conversion filter is often tricky, and this significantly
complicates the implementation. Further, the means used for passing
the erroneous bytes through to where the "BAD+XXXX" text is generated
only allows for up to 3 bytes to be passed, meaning that some erroneous
byte sequences are truncated anyways.

More to the point, a search of publicly available PHP code indicates
that nobody is really using this feature anyways.

Incidentally, this feature also provided error output like "JIS+XXXX"
if the input 'should have' represented a JISX 0208 codepoint, but it
decodes to a codepoint which does not exist in the JISX 0208 charset.
Similarly, specific error output was provided for non-existent
JISX 0212 codepoints, and likewise for JISX 0213, CP932, and a few
other charsets. All of that is now consigned to the flames.

However, "long" error markers also include a somewhat more useful
"U+XXXX" marker for Unicode codepoints which were successfully
decoded from the input text, but cannot be represented in the output
encoding. Those are still supported.

With this change, there is no need to use a variety of special values
in the high bits of a wchar to represent different types of error
values. We can (and will) just use a single error value. This will be
equal to -1.

One complicating factor: Text conversion functions return an integer to
indicate whether the conversion operation should be immediately
aborted, and the magic 'abort' marker is -1. Also, almost all of these
functions would return the received byte/codepoint to indicate success.
That doesn't work with the new error value; if an input filter detects
an error and passes -1 to the output filter, and the output filter
returns it back, that would be taken to mean 'abort'.

Therefore, amend all these functions to return 0 for success.
2021-08-31 13:41:34 +02:00
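The return-value convention from 776296e12f can be sketched with a two-stage toy pipeline. Everything here is hypothetical (`TOY_BAD_INPUT`, `toy_sink`, and the function names are not mbstring's). The sketch only shows why filters return 0 for success: the error marker passed downstream is -1, and -1 as a return value is reserved for "abort".

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BAD_INPUT (-1) /* single error marker passed between filters */

typedef struct {
    char buf[16];
    size_t len;
} toy_sink;

/* Output stage: returns 0 on success, -1 to abort (buffer full).
 * It must not echo `c` back, or TOY_BAD_INPUT would look like an abort. */
static int toy_output(int c, toy_sink *sink)
{
    assert(sink != NULL);
    if (sink->len >= sizeof sink->buf) {
        return -1;
    }
    sink->buf[sink->len++] = (c == TOY_BAD_INPUT) ? '?' : (char)c;
    return 0;
}

/* Input stage: bytes >= 0x80 are erroneous in this toy ASCII decoder. */
static int toy_ascii_input(unsigned char byte, toy_sink *sink)
{
    int c = (byte < 0x80) ? (int)byte : TOY_BAD_INPUT;
    return toy_output(c, sink);
}

/* Drives the pipeline; stops at the first abort. */
static int toy_convert(const unsigned char *in, size_t len, toy_sink *sink)
{
    assert(in != NULL);
    for (size_t i = 0; i < len; i++) {
        if (toy_ascii_input(in[i], sink) < 0) {
            return -1;
        }
    }
    return 0;
}
```

An erroneous byte in the middle of the input is replaced and the conversion continues; only a full buffer stops it.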
Nikita Popov
639015845f Deprecate calling mb_check_encoding() without argument
Part of https://wiki.php.net/rfc/deprecations_php_8_1.
2021-07-08 15:34:49 +02:00
George Peter Banyard
e7135cb817
Use zend_string_equals_* API in a couple more places
Closes GH-6979
2021-05-14 13:45:17 +01:00
George Peter Banyard
aca6aefd85
Remove 'register' type qualifier (#6980)
The compiler should be smart enough to optimize this on its own
2021-05-14 13:38:01 +01:00
KsaR
01b3fc03c3
Update http->https in license (#6945)
1. Update http://www.php.net/license/3_01.txt to https, since the server already redirects to https via a "Location:" header.
2. Update a few license 3.0 references to 3.01, as 3.0 states "php 5.1.1, 4.1.1, and earlier".
3. Some license comments say "at through the world-wide-web" while most omit "at", so it was deleted.
4. Fix indentation before | in some files.
2021-05-06 12:16:35 +02:00
Christoph M. Becker
592cfa309e
Merge branch 'PHP-8.0'
* PHP-8.0:
  Fix #81011: mb_convert_encoding removes references from arrays
2021-05-04 18:40:23 +02:00
Christoph M. Becker
d1c0cbdcb1
Merge branch 'PHP-7.4' into PHP-8.0
* PHP-7.4:
  Fix #81011: mb_convert_encoding removes references from arrays
2021-05-04 18:39:39 +02:00
Christoph M. Becker
0cafd53d18
Fix #81011: mb_convert_encoding removes references from arrays
We need to dereference references.

Closes GH-6938.
2021-05-04 18:37:40 +02:00
George Peter Banyard
09efad615b
Use zend_string_equals_(literal_)ci() API more often
Also drive-by usage of zend_ini_parse_bool()

Closes GH-6844
2021-04-09 02:34:50 +01:00
George Peter Banyard
5caaf40b43
Introduce pseudo-keyword ZEND_FALLTHROUGH
And use it instead of comments
2021-04-07 00:46:29 +01:00
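A fallthrough pseudo-keyword of the kind introduced in 5caaf40b43 is typically a macro over a compiler attribute. The sketch below uses a hypothetical `TOY_FALLTHROUGH` and is not the actual definition of `ZEND_FALLTHROUGH`; it assumes a compiler that either supports `__has_attribute` (GCC and Clang do) or accepts the no-op fallback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Use the compiler attribute where available, else a harmless no-op. */
#if defined(__has_attribute)
# if __has_attribute(fallthrough)
#  define TOY_FALLTHROUGH __attribute__((fallthrough))
# endif
#endif
#ifndef TOY_FALLTHROUGH
# define TOY_FALLTHROUGH ((void)0)
#endif

/* Packs up to three bytes into an integer, least significant first.
 * Each case deliberately continues into the next one. */
static uint32_t toy_pack_tail(const unsigned char *p, size_t n)
{
    uint32_t v = 0;
    assert(p != NULL && n <= 3);
    switch (n) {
    case 3:
        v |= (uint32_t)p[2] << 16;
        TOY_FALLTHROUGH;
    case 2:
        v |= (uint32_t)p[1] << 8;
        TOY_FALLTHROUGH;
    case 1:
        v |= p[0];
        break;
    default:
        break;
    }
    return v;
}
```

Unlike a comment, the macro is visible to the compiler, so warnings about implicit fallthrough can tell intended cases from forgotten `break` statements.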
Alex Dowad
a06c20a17c Remove useless constant MBFL_ENCTYPE_MBCS
This flag indicated that an encoding was 'multi-byte'; it can use a variable
number of bytes to encode each character. As it turns out, we don't actually
need to check this flag anywhere, so it's better to remove it.
2021-01-15 21:55:41 +02:00
Nikita Popov
3e01f5afb1 Replace zend_bool uses with bool
We're starting to see a mix between uses of zend_bool and bool.
Replace all usages with the standard bool type everywhere.

Of course, zend_bool is retained as an alias.
2021-01-15 12:33:06 +01:00
Alex Dowad
72660c416a Combine MBFL_ENCTYPE_WCS{2,4}{BE,LE} constants
These flags identify text encodings in mbstring which use a constant number of
bytes per character. While some parts of the code do use these flags, usually
to detect cases which can be optimized due to constant-width encoding, nothing
cares whether the encodings are 'LE' (little-endian) or 'BE' (big-endian).

So we can simplify things by combining constants.
2020-11-25 19:52:19 +02:00
Alex Dowad
e169ad3b61 Consolidate all single-byte encodings in one source file
We can squeeze out a lot of duplicated code in this way.
2020-11-11 11:18:59 +02:00
Alex Dowad
3e7acf901d Remove mbstring identify filters
mbstring had an 'identify filter' for almost every supported text encoding
which was used when auto-detecting the most likely encoding for a string.
It would run over the string and set a 'flag' if it saw anything which
did not appear likely to be the encoding in question.

One problem with this scheme was that encodings which merely appeared
less likely to be the correct one were completely rejected, even if there
was no better candidate. Another problem was that the 'identify filters'
had a huge amount of code duplication with the 'conversion filters'.

Eliminate the identify filters. Instead, when auto-detecting text
encoding, use conversion filters to see whether the input string is valid
in candidate encodings or not. At the same time, watch the type of
codepoints which the string decodes to and mark it as less likely if
non-printable characters (ESC, form feed, bell, etc.) or 'private use
area' codepoints are seen.

Interestingly, one old test case in which JIS text was misidentified
as UTF-8 (and this wrong behavior was enshrined in the test) was 'fixed'
and the JIS string is now auto-detected as JIS.
2020-11-09 13:45:17 +02:00
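The "mark it as less likely" step from 3e7acf901d can be pictured as counting suspicious code points in the decoded text. The function below is a sketch under stated assumptions: the name `toy_demerits`, the equal weighting, and the exact ranges treated as suspicious are choices made here, not taken from mbstring.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Counts code points that make a candidate encoding look less likely:
 * control characters (other than tab, LF, CR) and private-use code points. */
static unsigned toy_demerits(const uint32_t *cps, size_t n)
{
    unsigned demerits = 0;
    assert(cps != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        uint32_t c = cps[i];
        bool control = (c < 0x20 && c != '\t' && c != '\n' && c != '\r')
                    || c == 0x7F;
        bool private_use = (c >= 0xE000 && c <= 0xF8FF)
                        || (c >= 0xF0000 && c <= 0xFFFFD)
                        || (c >= 0x100000 && c <= 0x10FFFD);
        if (control || private_use) {
            demerits++;
        }
    }
    return demerits;
}
```

A detector built this way would decode the input under each candidate encoding, discard candidates for which decoding reports an error, and prefer the remaining candidate with the fewest demerits.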
Alex Dowad
be1a215538 Optimize (AND FIX) mb_check_encoding (cut execution time by 50%+)
Previously, `mb_check_encoding` did an awful lot of unneeded work. In order to
determine whether a string was valid or not, it would convert the whole string
into wchar (code points), which required dynamically allocating a (potentially
large) buffer. Then it would turn right around and convert that big ol' buffer
of code points back to the original encoding again. Finally, it would check
whether any invalid bytes were detected during that long and onerous process.

The thing is, mbstring _already_ has machinery for detecting whether a string
is valid in a certain encoding or not, and it doesn't require copying any data
around or allocating buffers. Better yet, it can fail fast when an invalid byte
is found. Why not use it? It's sure a lot faster!

Further, the legacy code was also badly broken. Why? Because aside from
checking whether illegal characters were detected, it would also check whether
the conversion to and from wchars was lossless. But, some encodings have
more than one valid encoding for the same character. In such cases, it is
not possible to make the conversion to and from wchars lossless for every
valid character. So `mb_check_encoding` would actually reject good strings
in a lot of encodings!
2020-11-02 21:31:06 +02:00
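To make the contrast in be1a215538 concrete, here is what a fail-fast validity check looks like for one encoding, UTF-8. It is an independent sketch, not the machinery mbstring uses, and `toy_utf8_is_valid` is a hypothetical name. It scans the bytes once, allocates nothing, and returns at the first invalid sequence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if s[0..len) is well-formed UTF-8. */
static bool toy_utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;
    assert(s != NULL || len == 0);
    while (i < len) {
        unsigned char c = s[i];
        size_t need;      /* number of continuation bytes expected */
        uint32_t cp, min; /* decoded value and smallest non-overlong value */

        if (c < 0x80) {
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return false; /* stray continuation byte or invalid lead byte */
        }
        if (len - i - 1 < need) {
            return false; /* truncated sequence at end of input */
        }
        for (size_t k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80) {
                return false; /* not a continuation byte */
            }
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return false; /* overlong form, out of range, or surrogate */
        }
        i += need + 1;
    }
    return true;
}
```

Because the check never converts back to the source encoding, it cannot reject a valid string merely because a round trip through code points would produce different bytes.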
Alex Dowad
7dc16374b4 Remove unused IS_SJIS1 and IS_SJIS2 macros 2020-10-14 08:31:51 +02:00
Nikita Popov
4371a4b241 Merge branch 'PHP-8.0'
* PHP-8.0:
  Fix incorrect zpp parameter count in mb_substr() / mb_strcut()
2020-10-13 17:47:11 +02:00
Nikita Popov
9b4094c3d7 Fix incorrect zpp parameter count in mb_substr() / mb_strcut()
These functions only accept 4 params.
2020-10-13 17:46:56 +02:00
Nikita Popov
40e920ebd9 Merge branch 'PHP-8.0'
* PHP-8.0:
  Fix argument nullability in mbstring
2020-10-13 16:03:29 +02:00
Nikita Popov
124bce3c7a Fix argument nullability in mbstring
These arguments were declared nullable in stubs (and should be
nullable), but didn't accept null in zpp.
2020-10-13 16:03:04 +02:00
Alex Dowad
0ffc1f55b3 Refactor mbfl_ident.c, mbfl_encoding.c, mbfl_memory_device.c, mbfl_string.c
- Make everything less gratuitously verbose
- Don't litter the code with lots of unneeded NULL checks (for things which
  will never be NULL)
- Don't return success/failure code from functions which can never fail
- For encoding structs, don't use pointers to pointers to pointers for the
  list of alias strings. Pointers to pointers (2 levels of indirection)
  is what actually makes sense. This gets rid of some extraneous
  dereference operations.
2020-10-13 06:12:38 +02:00
Máté Kocsis
e950ca13ea
Consolidate the usage of "either" and "one of" in error messages
Closes GH-6173
2020-09-20 19:41:47 +02:00
Máté Kocsis
c37a1cd650
Promote a few remaining errors in ext/standard
Closes GH-6110
2020-09-15 14:26:16 +02:00
Máté Kocsis
1c81a34563
Make mb_send_mail() consistent with mail()
The $additional_headers parameter shouldn't accept null.
2020-09-14 11:52:33 +02:00
Máté Kocsis
c98d47696f
Consolidate new union type ZPP macro names
They will now follow the canonical order of types. Older macros are
left intact due to maintaining BC.

Closes GH-6112
2020-09-11 11:00:18 +02:00
Nikita Popov
f33fd9b7fe Throw ValueError on null bytes in mb_send_mail()
Instead of silently replacing with spaces.
2020-09-11 10:46:59 +02:00
Alex Dowad
5b78d76ec8 mb_str_split is already documented on php.net
So remove TODO comment which implies that it's not.
2020-09-08 20:09:45 +02:00
Nikita Popov
2386f655d8 Always use PCRE for mbstring.http_output_conv_mimetypes
Instead of using either oniguruma or pcre depending on which is
available. We always have PCRE, so use it. This ensures consistent
behavior.
2020-09-08 15:02:15 +02:00
Nikita Popov
623bf96e7e Throw on invalid mb_http_input() type 2020-09-07 09:59:51 +02:00
Nikita Popov
d57f9e5ea4 Handle null encoding in mb_http_input() 2020-09-04 17:15:35 +02:00
Alex Dowad
409aa20ab0 Refactor mbfl_convert.c 2020-09-03 15:56:29 +02:00
Máté Kocsis
3e800e997b
Move custom type checks to ZPP
Closes GH-6034
2020-09-02 11:11:38 +02:00
Alex Dowad
b03fd37677 Code cleanup in mbstring.c 2020-08-31 23:19:43 +02:00