Commit graph

20 commits

Author SHA1 Message Date
George Peter Banyard
90d41cccfd ext/mbstring: move another test case that only works on 64 bits 2023-12-08 17:17:28 +00:00
Gina Peter Banyard
7684a3d138
ext/mbstring: move unsigned 32 bit integer tests to a new test (#12891)
And only run it on 64 bit architectures as those are floats on 32 bit.
2023-12-07 20:19:11 +00:00
Gina Peter Banyard
e74bf42c81
ext/mbstring: Check conversion map only has integers 2023-12-06 23:47:00 +00:00
Alex Dowad
76a92c26e3 mb_decode_numericentity decodes valid entities which are truncated at end of string
Since mb_decode_numericentity does not require all HTML entities
to end with ';', but allows them to be terminated by ANY non-digit
character, it doesn't make sense that valid entities which butt
up against the end of the input string are not converted.

As it turned out, supporting this case also made it possible
to simplify the code nicely.
2022-07-18 15:11:47 +02:00
Alex Dowad
5d6bd557b3 mb_decode_numericentity converts entities which immediately follow a valid/invalid entity
Thanks to Kamil Tieleka for suggesting that some of the behaviors of
the legacy implementation which the new mb_decode_numericentity
implementation took care to maintain were actually bugs and should
be fixed. Thanks also to Trevor Rowbotham for providing a link to
the HTML specification, showing how HTML numeric entities should
be interpreted.

mb_decode_numericentity now processes numeric entities in the
following situations where the old implementation would not:

- &<ENTITY> (for example, &&#65;)
- &#<ENTITY>
- &#x<ENTITY>
- <VALID BUT UNTERMINATED DECIMAL ENTITY><ENTITY> (for example, &#65&#65;)
- <VALID BUT UNTERMINATED HEX ENTITY><ENTITY>
- <INVALID AND UNTERMINATED DECIMAL ENTITY><ENTITY> (it does not matter why
  the first entity is invalid; the value could be too big, it could have
  too many digits, or it could not match the 'convmap' parameter)
- <INVALID AND UNTERMINATED HEX ENTITY><ENTITY>

This is consistent with the way that web browsers process
HTML entities.
2022-07-18 15:11:32 +02:00
Alex Dowad
91969e908f New implementation of mb_{de,en}code_numericentity
This new implementation uses the new encoding conversion filters.
Aside from fewer LOC and (hopefully) improved readability,
the differences are as follows:

BEHAVIOR CHANGES:

- The old implementation used signed arithmetic when operating
on the 'convmap'. This meant that results could be surprising when
using convmap entries with 1 in the MSB. Further, types like 'int'
were used rather than those with a specific bit width, such as
'int32_t'. This meant that results could also depend on the
platform width of an 'int'.

Now unsigned arithmetic is used, with explicit bit widths.

- Similarly, while converting decimal numeric entities, the
legacy implementation would ensure that the value never overflowed
INT_MAX, and if it did, the entity would be treated as invalid
and passed through unconverted.

However, that again means that results depend on the platform
size of an 'int'. So now, we use a value with explicit bit width
(32 bits) to hold the value of a deconverted decimal entity, and
ensure that the entity value does not overflow that.

Further, because we are using an UNSIGNED 32-bit value rather
than a signed one, the ceiling for how large a decimal entity
can be is higher now.

All of this will probably not affect anyone, since Unicode
codepoints above U+10FFFF are invalid anyways. To see the
difference, you need to be using a text encoding like UCS-4,
which allows huge 'codepoints'.

- If it saw something which looked like a hex entity, but
turned out not to be a valid numeric entity, the old
implementation would sometimes convert the hexadecimal
digits a-f to A-F (uppercase). The new implementation passes
invalid numeric entities through without performing case
conversion.

- The old implementation of mb_encode_numericentity was
limited in how many decimal/hex digits it could emit.
If a text encoding like UCS-4 was in use, where 'codepoints'
can have huge values (larger than the valid range
stipulated by the Unicode standard), it would not error
out on a 'codepoint' whose value was too large for it,
but would rather mangle the value and emit a numeric
entity which decoded to some other random codepoint.
The new implementation is able to emit enough digits to
express any value which fits in 32 bits.

PERFORMANCE:

Based on micro-benchmarks run on my development machine:

Decoding numeric HTML entities is about 4 times faster, for
both decimal and hexadecimal entities, across a variety of
input string lengths. Encoding is about 3 times faster.
2022-07-18 15:11:30 +02:00
Alex Dowad
57eafd44c6 Add more tests for mb_decode_numericentity 2021-09-20 11:27:54 +02:00
Nikita Popov
a06d015e61 Remove unnecessary mbstring skipifs
These functions are always available (if the extension is available
at all).
2021-06-14 15:27:28 +02:00
Nikita Popov
7485978339
Migrate SKIPIF -> EXTENSIONS (#7138)
This is an automated migration of most SKIPIF extension_loaded checks.
2021-06-11 11:57:42 +02:00
Nikita Popov
cafceea742 Update mbstring parameter names
Closes GH-6207.
2020-09-28 09:51:58 +02:00
Alex Dowad
dc98c1346d Additional tests for mbstring extension 2020-08-31 23:15:57 +02:00
Máté Kocsis
6111d64cda
Improve a last couple of argument error messages
Closes GH-5404
2020-04-20 13:09:00 +02:00
Nikita Popov
7d170eb295 Merge branch 'PHP-7.4'
* PHP-7.4:
  Fix shift ub in mbstring
  Restore digit check in mb_decode_numericentity()
2020-01-30 10:08:21 +01:00
Nikita Popov
9aadcb18e1 Restore digit check in mb_decode_numericentity()
I replaced it with a multiplication overflow check in
18599f9c52. However, we need both,
because the code for restoring the number can't handle numbers
with many leading zeros right now and I don't feel like teaching it.
2020-01-30 10:07:01 +01:00
Nikita Popov
b2c8abe951 Merge branch 'PHP-7.4'
* PHP-7.4:
  Better overflow check for entity decoding
2020-01-29 16:08:55 +01:00
Nikita Popov
18599f9c52 Better overflow check for entity decoding
Check for multiplication overflow rather than number of digits.
2020-01-29 16:08:46 +01:00
Nikita Popov
bc32cce6a2 Merge branch 'PHP-7.4'
* PHP-7.4:
  Fix recovery of large entities in mb_decode_numericentity()
2020-01-29 11:49:27 +01:00
Nikita Popov
91f878779c Fix recovery of large entities in mb_decode_numericentity()
Make sure we don't overflow the integer.
2020-01-29 11:48:34 +01:00
Christoph M. Becker
e2100619ac Expect appropriate parameter type in the first place
`mb_encode_numericentity()` and `mb_decode_numericentity()` accepted
arbitrary zvals as `$convmap`, but ignored anything else than arrays.
This appears to be an unresolved relict of their ZPP conversion for
PHP 5.3[1].  We now expect an array in the first place.

We also expect `count($convmap)` to be a multiple of four (else we
throw a `ValueError`), and do no longer special case empty `$convmap`.

[1] <http://git.php.net/?p=php-src.git;a=commit;h=1c77f594294aee9d60e7309279c616c01c39ba9d>
2019-10-07 16:48:08 +02:00
Felipe Pena
b79740b458 - New tests (WurzbrugUG testfest) 2009-07-07 01:15:12 +00:00