Commit graph

765 commits

Author SHA1 Message Date
Alex Dowad
a06c20a17c Remove useless constant MBFL_ENCTYPE_MBCS
This flag indicated that an encoding was 'multi-byte'; it can use a variable
number of bytes to encode each character. As it turns out, we don't actually
need to check this flag anywhere, so it's better to remove it.
2021-01-15 21:55:41 +02:00
Nikita Popov
3e01f5afb1 Replace zend_bool uses with bool
We're starting to see a mix between uses of zend_bool and bool.
Replace all usages with the standard bool type everywhere.

Of course, zend_bool is retained as an alias.
2021-01-15 12:33:06 +01:00
Alex Dowad
72660c416a Combine MBFL_ENCTYPE_WCS{2,4}{BE,LE} constants
These flags identify text encodings in mbstring which use a constant number of
bytes per character. While some parts of the code do use these flags, usually
to detect cases which can be optimized due to constant-width encoding, nothing
cares whether the encodings are 'LE' (little-endian) or 'BE' (big-endian).

So we can simplify things by combining constants.
2020-11-25 19:52:19 +02:00
Alex Dowad
e169ad3b61 Consolidate all single-byte encodings in one source file
We can squeeze out a lot of duplicated code in this way.
2020-11-11 11:18:59 +02:00
Alex Dowad
3e7acf901d Remove mbstring identify filters
mbstring had an 'identify filter' for almost every supported text encoding
which was used when auto-detecting the most likely encoding for a string.
It would run over the string and set a 'flag' if it saw anything which
did not appear likely to be the encoding in question.

One problem with this scheme was that encodings which merely appeared
less likely to be the correct one were completely rejected, even if there
was no better candidate. Another problem was that the 'identify filters'
had a huge amount of code duplication with the 'conversion filters'.

Eliminate the identify filters. Instead, when auto-detecting text
encoding, use conversion filters to see whether the input string is valid
in candidate encodings or not. At the same type, watch the type of
codepoints which the string decodes to and mark it as less likely if
non-printable characters (ESC, form feed, bell, etc.) or 'private use
area' codepoints are seen.

Interestingly, one old test case in which JIS text was misidentified
as UTF-8 (and this wrong behavior was enshrined in the test) was 'fixed'
and the JIS string is now auto-detected as JIS.
2020-11-09 13:45:17 +02:00
Alex Dowad
be1a215538 Optimize (AND FIX) mb_check_encoding (cut execution time by 50%+)
Previously, `mb_check_encoding` did an awful lot of unneeded work. In order to
determine whether a string was valid or not, it would convert the whole string
into wchar (code points), which required dynamically allocating a (potentially
large) buffer. Then it would turn right around and convert that big 'ol buffer
of code points back to the original encoding again. Finally, it would check
whether any invalid bytes were detected during that long and onerous process.

The thing is, mbstring _already_ has machinery for detecting whether a string
is valid in a certain encoding or not, and it doesn't require copying any data
around or allocating buffers. Better yet, it can fail fast when an invalid byte
is found. Why not use it? It's sure a lot faster!

Further, the legacy code was also badly broken. Why? Because aside from
checking whether illegal characters were detected, it would also check whether
the conversion to and from wchars was lossless. But, some encodings have
more than one valid encoding for the same character. In such cases, it is
not possible to make the conversion to and from wchars lossless for every
valid character. So `mb_check_encoding` would actually reject good strings
in a lot of encodings!
2020-11-02 21:31:06 +02:00
Alex Dowad
7dc16374b4 Remove unused IS_SJIS1 and IS_SJIS2 macros 2020-10-14 08:31:51 +02:00
Nikita Popov
4371a4b241 Merge branch 'PHP-8.0'
* PHP-8.0:
  Fix incorrect zpp parameter count in mb_substr() / mb_strcut()
2020-10-13 17:47:11 +02:00
Nikita Popov
9b4094c3d7 Fix incorrect zpp parameter count in mb_substr() / mb_strcut()
These functions only accept 4 params.
2020-10-13 17:46:56 +02:00
Nikita Popov
40e920ebd9 Merge branch 'PHP-8.0'
* PHP-8.0:
  Fix argument nullability in mbstring
2020-10-13 16:03:29 +02:00
Nikita Popov
124bce3c7a Fix argument nullability in mbstring
These arguments were declared nullable in stubs (and should be
nullable), but didn't accept null in zpp.
2020-10-13 16:03:04 +02:00
Alex Dowad
0ffc1f55b3 Refactor mbfl_ident.c, mbfl_encoding.c, mbfl_memory_device.c, mbfl_string.c
- Make everything less gratuitously verbose
- Don't litter the code with lots of unneeded NULL checks (for things which
  will never be NULL)
- Don't return success/failure code from functions which can never fail
- For encoding structs, don't use pointers to pointers to pointers for the
  list of alias strings. Pointers to pointers (2 levels of indirection)
  is what actually makes sense. This gets rid of some extraneous
  dereference operations.
2020-10-13 06:12:38 +02:00
Máté Kocsis
e950ca13ea
Consolidate the usage of "either" and "one of" in error messages
Closes GH-6173
2020-09-20 19:41:47 +02:00
Máté Kocsis
c37a1cd650
Promote a few remaining errors in ext/standard
Closes GH-6110
2020-09-15 14:26:16 +02:00
Máté Kocsis
1c81a34563
Make mb_send_mail() consistent with mail()
The $additional_headers parameter shouldn't accept null.
2020-09-14 11:52:33 +02:00
Máté Kocsis
c98d47696f
Consolidate new union type ZPP macro names
They will now follow the canonical order of types. Older macros are
left intact due to maintaining BC.

Closes GH-6112
2020-09-11 11:00:18 +02:00
Nikita Popov
f33fd9b7fe Throw ValueError on null bytes in mb_send_mail()
Instead of silently replacing with spaces.
2020-09-11 10:46:59 +02:00
Alex Dowad
5b78d76ec8 mb_str_split is already documented on php.net
So remove TODO comment which implies that it's not.
2020-09-08 20:09:45 +02:00
Nikita Popov
2386f655d8 Always use PCRE for mbstring.http_output_conv_mimetypes
Instead of using either oniguruma or pcre depending on which is
available. We always have PCRE, so use it. This ensures consistent
behavior.
2020-09-08 15:02:15 +02:00
Nikita Popov
623bf96e7e Throw on invalid mb_http_input() type 2020-09-07 09:59:51 +02:00
Nikita Popov
d57f9e5ea4 Handle null encoding in mb_http_input() 2020-09-04 17:15:35 +02:00
Alex Dowad
409aa20ab0 Refactor mbfl_convert.c 2020-09-03 15:56:29 +02:00
Máté Kocsis
3e800e997b
Move custom type checks to ZPP
Closes GH-6034
2020-09-02 11:11:38 +02:00
Alex Dowad
b03fd37677 Code cleanup in mbstring.c 2020-08-31 23:19:43 +02:00
Alex Dowad
7eddcabe2b Don't guard mbstring code with #ifdef HAVE_MBSTRING
This is just a very silly feature of mbstring -- you can compile the source files with
HAVE_MBSTRING undefined, and it will all just compile to (almost) nothing. What is the
use of this? Why compile the source files and link against them if you don't want the
mbstring extension? It doesn't make any kind of sense.
2020-08-31 23:18:13 +02:00
Alex Dowad
62317d592f Remove redundant includes from mbstring (and make sure correct config.h is used)
Very interesting... it turns out that when Valgrind support was enabled,
`#include "config.h"` from within mbstring was actually including the file "config.h"
from Valgrind, and not the one from mbstring!!

This is because -I/usr/include/valgrind was added to the compiler invocation _before_
-Iext/mbstring/libmbfl.

Make sure we actually include the file which was intended.
2020-08-31 23:17:58 +02:00
Alex Dowad
ddc76e5abf Fix typos in comments in mb_send_mail 2020-08-31 23:17:14 +02:00
Alex Dowad
a64241b540 Remove unused functions from mbstring
- mbfl_buffer_converter_reset
- mbfl_buffer_converter_strncat
- mbfl_buffer_converter_getbuffer
- mbfl_oddlen
- mbfl_filter_output_pipe_flush
- mbfl_memory_device_output2
- mbfl_memory_device_output4
- mbfl_is_support_encoding
- mbfl_buffer_converter_feed2
- _php_mb_regex_globals_dtor
- mime_header_encoder_feed
- mime_header_decoder_feed
- mbfl_convert_filter_feed
2020-08-31 23:16:57 +02:00
Alex Dowad
8d13348bb5 Separate implementation of mb_{en,de}code_numericentity
Rather than using a magic boolean parameter to choose different behavior of
the subfunction, inline it. The code size doesn't really grow anyways. And
soon these will be trimmed down more.
2020-08-31 23:16:28 +02:00
Alex Dowad
29b02bf290 Use new-style argument parsing macros in mbstring.c 2020-08-31 23:16:21 +02:00
Alex Dowad
d4ef7ef11d Inline unneeded indirection for mbstring memory management
All memory allocation and deallocation for mbstring bounces through a table of
function pointers before going to emalloc/efree/etc. But this is unnecessary.
The allocators are never swapped out. Better to just call them directly.
2020-08-31 23:16:09 +02:00
George Peter Banyard
fa8d9b1183 Improve type declarations for Zend APIs
Voidification of Zend API which always succeeded
Use bool argument types instead of int for boolean arguments
Use bool return type for functions which return true/false (1/0)
Use zend_result return type for functions which return SUCCESS/FAILURE as they don't follow normal boolean semantics

Closes GH-6002
2020-08-28 15:41:27 +02:00
Máté Kocsis
ac0da090ae
Fix UNKNOWN default values in ext/mbstring and ext/gd
Closes GH-5598
2020-07-28 17:06:25 +02:00
Máté Kocsis
d30cd7d7e7
Review the usage of apostrophes in error messages
Closes GH-5590
2020-07-10 21:05:28 +02:00
Max Semenik
2b5de6f839
Remove proto comments from C files
Closes GH-5758
2020-07-06 21:13:34 +02:00
Nikita Popov
88021ffe0e Fix count_commas implementation
Ooops, I did not account for the changing length here.
2020-06-12 11:04:35 +02:00
Nikita Popov
f691693ebc Fix null pointer ub in encoding parsing
And do a bit of drive-by cleanup by extracting count_commas and
reducing some variable scopes.
2020-06-12 10:08:34 +02:00
Christoph M. Becker
5a04796f76 Fix MSVC level 1 (severe) warnings
We fix (hopefully) all instances of:

* <https://docs.microsoft.com/en-us/cpp/error-messages/compiler-warnings/compiler-warning-level-1-c4005>
* <https://docs.microsoft.com/en-us/cpp/error-messages/compiler-warnings/compiler-warning-level-1-c4024>
* <https://docs.microsoft.com/en-us/cpp/error-messages/compiler-warnings/compiler-warning-level-1-c4028>
* <https://docs.microsoft.com/en-us/cpp/error-messages/compiler-warnings/compiler-warning-level-1-c4047>
* <https://docs.microsoft.com/en-us/cpp/error-messages/compiler-warnings/compiler-warning-level-1-c4087>
* <https://docs.microsoft.com/en-us/cpp/error-messages/compiler-warnings/compiler-warning-level-1-c4090>
* <https://docs.microsoft.com/en-us/cpp/error-messages/compiler-warnings/compiler-warning-level-1-c4273>
* <https://docs.microsoft.com/en-us/cpp/error-messages/compiler-warnings/compiler-warning-level-1-c4312>

`zend_llist_add_element()` and `zend_llist_prepend_element()` now
explicitly expect a *const* pointer.

We use the macro `ZEND_VOIDP()` instead of a `(void*)` cast to suppress
C4090; this should prevent accidential removal of the cast by
clarifying the intention, and makes it easier to remove the casts if
the issue[1] will be resolved sometime.

[1] <https://developercommunity.visualstudio.com/content/problem/390711/c-compiler-incorrect-propagation-of-const-qualifie.html>
2020-06-05 11:17:05 +02:00
George Peter Banyard
68164f40ce Fix [-Wundef] warning in MBString extension 2020-05-16 15:31:20 +02:00
George Peter Banyard
7dd332f110 Refactor mb_substitute_character()
Using the new Fast ZPP API for string|int|null

This also fixes Bug #79448 which was too disruptive to fix in PHP 7.x
2020-05-11 17:30:01 +02:00
Nikita Popov
481b7421f3 Throw warning if invalid internal_encoding ini is specified 2020-05-07 14:44:13 +02:00
Nikita Popov
217f6013b3 Remove no_language from mbfl_string
This is not actually used for anything and just causes confusion.
2020-05-07 11:36:57 +02:00
Nikita Popov
226d9dd30a Only allow "pass" as input/output encoding
"pass" is not a real encoding, it just means "don't perform any
conversion". Using it as an internal encoding or passing it to
any of the mbstring() function will not work (and on master commonly
assert).
2020-05-07 11:19:14 +02:00
Nikita Popov
5bfa9598f4 Return false from failed mb_convert_variables()
If we fail to detect the encoding return false, just like
mb_convert_encoding() does, and the implementation here clearly
intended. Previously the "pass" pseudo-incoding was returned.
2020-05-07 10:16:46 +02:00
Nikita Popov
71f48260af Fix assertion failure when failing to detect encoding
Looks like prior to 7.3 this just passed the original string
through. Since 7.3 it returns false. Let's stick with that
behavior.
2020-05-06 22:56:01 +02:00
Nikita Popov
7d4ff8443e Remove persistent allocators from libmbfl
These functions are not used, and I don't think we have any plans
to ever use them.
2020-05-04 23:19:07 +02:00
Máté Kocsis
6111d64cda
Improve a last couple of argument error messages
Closes GH-5404
2020-04-20 13:09:00 +02:00
Máté Kocsis
1f48feebb9
Improve some TypeError and ValueError messages
Closes GH-5377
2020-04-14 14:38:45 +02:00
George Peter Banyard
01762e56ed Adapt assertion as mbfl_strwidth returns a size_t 2020-04-12 19:34:05 +02:00
George Peter Banyard
12ec7a2730 Convert if blocks to assertions and adapt stubs accordingly 2020-04-09 13:50:37 +02:00