Commit graph

41 commits

Author SHA1 Message Date
Niels Dossche
286030d532
Merge branch 'PHP-8.4'
* PHP-8.4:
  Fix GH-17609: Typo in error message: Dom\NO_DEFAULT_NS instead of Dom\HTML_NO_DEFAULT_NS
  PHP-8.4 is now for PHP 8.4.5-dev
2025-01-28 19:30:55 +01:00
Niels Dossche
359eb30351
Fix GH-17609: Typo in error message: Dom\NO_DEFAULT_NS instead of Dom\HTML_NO_DEFAULT_NS 2025-01-28 19:30:25 +01:00
Niels Dossche
72708f298b
Merge branch 'PHP-8.4'
* PHP-8.4:
  Fix GH-17481: UTF-8 corruption in \Dom\HTMLDocument
  Fix GH-17486: Incorrect error line numbers reported in Dom\HTMLDocument::createFromString
2025-01-17 16:25:23 +01:00
Niels Dossche
2952e164a9
Fix GH-17481: UTF-8 corruption in \Dom\HTMLDocument
We need to properly handle the case where we return because there were
too few bytes. This needs to be handled separately because the while
loop otherwise just performs a partial byte copy.

Closes GH-17489.
2025-01-17 16:25:08 +01:00
Niels Dossche
21c170c75a
Fix GH-17486: Incorrect error line numbers reported in Dom\HTMLDocument::createFromString
Closes GH-17491.
2025-01-17 16:24:28 +01:00
Niels Dossche
935fef29bd
Optimize DOM HTML serialization for UTF-8 (#16376)
* Use a direct call for decoding the UTF-8 buffer

* Add fast path for UTF-8 HTML serialization

This patch adds a fast path to the HTML serialization encoding that has
to encode to UTF-8. Because the DOM internally represents all strings
using UTF-8, we only need to validate here.

Tested on Wikipedia English home page on an i7-4790:
```
Benchmark 1: ./sapi/cli/php x.php
  Time (mean ± σ):     516.0 ms ±   6.4 ms    [User: 511.2 ms, System: 3.5 ms]
  Range (min … max):   506.0 ms … 527.1 ms    10 runs

Benchmark 2: ./sapi/cli/php_old x.php
  Time (mean ± σ):     682.8 ms ±   6.5 ms    [User: 676.8 ms, System: 3.8 ms]
  Range (min … max):   675.8 ms … 695.6 ms    10 runs

Summary
  ./sapi/cli/php x.php ran
    1.32 ± 0.02 times faster than ./sapi/cli/php_old x.php
```

(And if you're interested: it takes over a second on my machine using the old DOMDocument class)

Future optimizations are certainly possible, but let's start here.
2024-10-22 07:18:36 +02:00
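The fast path in 935fef29bd rests on one observation: when the data is already held as UTF-8 and the output encoding is UTF-8, serialization only has to check well-formedness instead of decoding and re-encoding. A minimal standalone sketch of such a check follows. It is not the ext/dom code, and `is_valid_utf8` is a hypothetical name used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of php-src: returns true if buf[0..len)
 * is well-formed UTF-8 (no overlong forms, no surrogates, max U+10FFFF). */
bool is_valid_utf8(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    size_t i = 0;
    while (i < len) {
        unsigned char lead = buf[i];
        if (lead < 0x80) { i++; continue; }   /* ASCII byte */

        size_t need;                          /* continuation bytes expected */
        unsigned char lo = 0x80, hi = 0xBF;   /* allowed range of the first one */
        if (lead >= 0xC2 && lead <= 0xDF) {
            need = 1;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            need = 2;
            if (lead == 0xE0) lo = 0xA0;      /* reject overlong 3-byte forms */
            if (lead == 0xED) hi = 0x9F;      /* reject UTF-16 surrogates */
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            need = 3;
            if (lead == 0xF0) lo = 0x90;      /* reject overlong 4-byte forms */
            if (lead == 0xF4) hi = 0x8F;      /* reject values above U+10FFFF */
        } else {
            return false;                     /* stray continuation or invalid lead */
        }
        if (len - i - 1 < need) return false; /* sequence cut off by end of input */
        if (buf[i + 1] < lo || buf[i + 1] > hi) return false;
        for (size_t k = 2; k <= need; k++) {
            if ((buf[i + k] & 0xC0) != 0x80) return false;
        }
        i += need + 1;
    }
    return true;
}
```

In this sketch, a caller would copy the input straight to the output when the check succeeds and fall back to a slower decode-and-replace path when it fails.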
Niels Dossche
baa76be615
Use SWAR to seek for non-ASCII UTF-8 in DOM parsing (#16350)
GitHub FYP test case:
```
Benchmark 1: ./sapi/cli/php test.php
  Time (mean ± σ):     502.8 ms ±   6.2 ms    [User: 498.3 ms, System: 3.2 ms]
  Range (min … max):   495.2 ms … 509.8 ms    10 runs

Benchmark 2: ./sapi/cli/php_old test.php
  Time (mean ± σ):     518.4 ms ±   4.3 ms    [User: 513.9 ms, System: 3.2 ms]
  Range (min … max):   511.5 ms … 525.5 ms    10 runs

Summary
  ./sapi/cli/php test.php ran
    1.03 ± 0.02 times faster than ./sapi/cli/php_old test.php
```

Wikipedia English homepage test case:
```
Benchmark 1: ./sapi/cli/php test.php
  Time (mean ± σ):     301.1 ms ±   4.2 ms    [User: 295.5 ms, System: 4.8 ms]
  Range (min … max):   296.3 ms … 308.8 ms    10 runs

Benchmark 2: ./sapi/cli/php_old test.php
  Time (mean ± σ):     308.2 ms ±   1.7 ms    [User: 304.6 ms, System: 2.9 ms]
  Range (min … max):   306.9 ms … 312.8 ms    10 runs

Summary
  ./sapi/cli/php test.php ran
    1.02 ± 0.02 times faster than ./sapi/cli/php_old test.php
```
2024-10-12 13:29:33 +02:00
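SWAR ("SIMD within a register"), as named in baa76be615, means testing several bytes with one integer operation. For finding non-ASCII bytes, the high bit of each byte in a 64-bit word can be tested with a single AND. The following is a rough sketch of that idea under the assumption of a plain byte buffer. It is not the code from the commit, and `find_first_non_ascii` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of php-src: returns the index of the first
 * byte >= 0x80 in buf[0..len), or len if every byte is ASCII. */
size_t find_first_non_ascii(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    const uint64_t high_bits = UINT64_C(0x8080808080808080);
    size_t i = 0;

    /* Word-at-a-time scan: one AND checks the top bit of 8 bytes. */
    while (len - i >= sizeof(uint64_t)) {
        uint64_t word;
        memcpy(&word, buf + i, sizeof(word)); /* safe for unaligned input */
        if (word & high_bits) {
            break;                            /* some byte in this word is non-ASCII */
        }
        i += sizeof(word);
    }

    /* Byte-wise scan: handles the tail and pinpoints the byte in a flagged word. */
    while (i < len && buf[i] < 0x80) {
        i++;
    }
    return i;
}
```

Locating the exact byte with a byte-wise loop keeps the sketch independent of endianness; a tuned version could use a count-trailing-zeros intrinsic instead.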
Niels Dossche
1e949d189a
Fix edge-case in DOM parsing decoding
There are three connected subtle issues:
1) The fast path didn't correctly handle the case where the decoder
   requests more data. This caused a bogus additional replacement
   sequence to be output when encountering an incomplete sequence at
   the edges of a buffer.
2) The finishing of decoding incorrectly assumed that the fast path
   cannot be in a state where the last few bytes were an incomplete
   sequence, but this is not true as shown by test 08.
3) The finishing of decoding could output bytes twice because it called
   into dom_process_parse_chunk() twice without clearing the decoded
   data. However, calling twice is not even necessary as the buffer
   cannot be filled up entirely.

Closes GH-16226.
2024-10-05 18:27:18 +02:00
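The three issues in 1e949d189a all concern a multi-byte sequence that is split across the edge of a buffer. One common way to handle this in a chunked decoder is to measure how many trailing bytes start a sequence that the buffer cuts off, hold those bytes back, and prepend them to the next chunk. The sketch below illustrates only that measurement. It is an assumption about the general technique, not the logic used in ext/dom, and `utf8_incomplete_tail` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of php-src: returns how many bytes (0..3)
 * at the end of buf[0..len) begin a multi-byte UTF-8 sequence that is cut
 * off by the end of the buffer. Only the structure is checked here; the
 * continuation byte ranges are left to the real decoder. */
size_t utf8_incomplete_tail(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    /* A lead byte of an unfinished sequence is at most 3 bytes from the end. */
    for (size_t back = 1; back <= 3 && back <= len; back++) {
        unsigned char b = buf[len - back];
        if ((b & 0xC0) == 0x80) {
            continue;                 /* continuation byte: keep looking back */
        }

        size_t total;                 /* full length announced by the lead byte */
        if (b >= 0xC2 && b <= 0xDF) {
            total = 2;
        } else if (b >= 0xE0 && b <= 0xEF) {
            total = 3;
        } else if (b >= 0xF0 && b <= 0xF4) {
            total = 4;
        } else {
            return 0;                 /* ASCII or invalid lead: nothing to hold back */
        }
        return back < total ? back : 0;
    }
    return 0;                         /* no lead byte found within reach */
}
```

With a helper like this, a caller would decode `len - tail` bytes now and carry the remaining `tail` bytes over. If the input ends while bytes are still held back, they form a truncated sequence and should produce a replacement character exactly once.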
Niels Dossche
88393cfaf7
Fix GH-13988: Storing DOMElement consume 4 times more memory in PHP 8.1 than in PHP 8.0
We avoid creating backing storage by using the feature introduced in
f78d5cfcd2.

Closes GH-15593.
2024-08-27 20:14:25 +02:00
Niels Dossche
d32b97a1c7
Fix NULL pointer dereference with NULL content in legacy nodes in title getting (#15558) 2024-08-23 19:38:13 +02:00
Gina Peter Bnayard
5853cdb73d Use "must not" instead of "cannot" wording 2024-08-21 21:12:17 +01:00
Gina Peter Bnayard
6d9a74cde0 ext/dom: Use standard wording for ValueError 2024-08-21 21:12:17 +01:00
Niels Dossche
80a4783d25
Deduplicate NULL checks in ext/dom (#15015)
This introduces a new helper php_dom_create_nullable_object() that does
the NULL check and puts NULL in return_value. Otherwise it runs
php_dom_create_object(). This deduplicates a bit of code.
2024-07-18 21:20:03 +02:00
Niels Dossche
6980eba863
Support templated content
The template element in HTML 5 is special in the sense that it does not
add its contents into the DOM tree, but instead keeps them in a separate
shadow DOM document fragment. Interacting with the DOM tree cannot touch
the elements in the document fragment.

Closes GH-14906.
2024-07-15 11:10:51 +02:00
Niels Dossche
4ef7539144
Split off private data from the ns mapper 2024-07-15 11:02:52 +02:00
Niels Dossche
88da914910 Implement CSS selectors 2024-06-29 13:00:26 -07:00
Niels Dossche
48c9f1e2c3 Implement Dom\HTMLElement class 2024-06-26 12:17:12 -07:00
Niels Dossche
78401ba867 Implement Dom\Document::$title setter 2024-06-26 12:17:12 -07:00
Niels Dossche
04af960397 Implement Dom\Document::$title getter 2024-06-26 12:17:12 -07:00
Niels Dossche
a12db3b656 Implement Dom\Document::$body setter 2024-06-26 12:17:12 -07:00
Niels Dossche
287cf91724 Implement Dom\Document::$head 2024-06-26 12:17:12 -07:00
Niels Dossche
a1485df55a Implement Dom\Document::$body getter 2024-06-26 12:17:12 -07:00
Arnaud Le Blanc
11accb5cdf
Preferably include from build dir (#13516)
* Include from build dir first

This fixes out of tree builds by ensuring that configure artifacts are included
from the build dir.

Before, out of tree builds would preferably include files from the src dir, as
the include path was defined as follows (ignoring includes from ext/ and sapi/):

    -I$(top_builddir)/main
    -I$(top_srcdir)
    -I$(top_builddir)/TSRM
    -I$(top_builddir)/Zend
    -I$(top_srcdir)/main
    -I$(top_srcdir)/Zend
    -I$(top_srcdir)/TSRM
    -I$(top_builddir)/

As a result, an out of tree build would include configure artifacts such as
`main/php_config.h` from the src dir.

After this change, the include path is defined as follows:

    -I$(top_builddir)/main
    -I$(top_builddir)
    -I$(top_srcdir)/main
    -I$(top_srcdir)
    -I$(top_builddir)/TSRM
    -I$(top_builddir)/Zend
    -I$(top_srcdir)/Zend
    -I$(top_srcdir)/TSRM

* Fix extension include path for out of tree builds

* Include config.h with the brackets form

`#include "config.h"` searches in the directory containing the including-file
before any other include path. This can include the wrong config.h when building
out of tree and a config.h exists in the source tree.

`#include <config.h>` searches only the include path, which gives
priority to the build dir.
2024-06-26 00:26:43 +02:00
Peter Kokot
84a0da1574
Sync #if/ifdef/defined (#14508)
This syncs CPP macro conditions:
- _WIN32
- _WIN64
- HAVE_ALLOCA_H
- HAVE_ALPHASORT
- HAVE_ARPA_INET_H
- HAVE_CONFIG_H
- HAVE_DIRENT_H
- HAVE_DLFCN_H
- HAVE_GETTIMEOFDAY
- HAVE_LIBDL
- HAVE_POLL_H
- HAVE_PWD_H
- HAVE_SCANDIR
- HAVE_SYS_FILE_H
- HAVE_SYS_PARAM_H
- HAVE_SYS_SOCKET_H
- HAVE_SYS_TIME_H
- HAVE_SYS_TYPES_H
- HAVE_SYS_WAIT_H
- HAVE_UNISTD_H
- PHP_WIN32
- ZEND_WIN32

These are either undefined or defined to 1 in Autotools and Windows.

Follow-up to GH-5526 (-Wundef).
2024-06-09 14:23:41 +02:00
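Because the macros listed in 84a0da1574 are either undefined or defined to 1, `#ifdef X`, `#if defined(X)` and `#if X` all select the same branch. The practical difference is that `#if X` on an undefined macro triggers a `-Wundef` warning in GCC and Clang. A small sketch of that equivalence, using the hypothetical macros `HAVE_EXAMPLE_H` and `HAVE_MISSING_H`, which are not real PHP configure symbols:

```c
#include <assert.h>

/* Hypothetical feature macro standing in for something like HAVE_UNISTD_H.
 * Build systems define such macros to 1 or leave them undefined. */
#define HAVE_EXAMPLE_H 1
/* HAVE_MISSING_H is deliberately left undefined. */

static_assert(HAVE_EXAMPLE_H == 1, "feature macros are defined to 1");

/* Counts how many of the three condition forms accept a defined macro. */
int count_enabled_forms(void)
{
    int n = 0;
#ifdef HAVE_EXAMPLE_H
    n++;
#endif
#if defined(HAVE_EXAMPLE_H)
    n++;
#endif
#if HAVE_EXAMPLE_H              /* no warning: the macro is defined */
    n++;
#endif
    return n;
}

/* Counts how many forms accept an undefined macro. */
int count_missing_forms(void)
{
    int n = 0;
#ifdef HAVE_MISSING_H
    n++;
#endif
#if defined(HAVE_MISSING_H)
    n++;
#endif
    /* `#if HAVE_MISSING_H` would also be false, since an undefined macro
     * evaluates to 0, but it warns under -Wundef, so it is left out. */
    return n;
}
```

Here `count_enabled_forms()` returns 3 and `count_missing_forms()` returns 0, which is why the three forms can be used interchangeably for these macros as long as the `-Wundef` case is avoided.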
Niels Dossche
1fdbb0aba6 Get rid of unused declarations 2024-05-13 19:46:51 +02:00
Niels Dossche
e7af2bfd5b Get rid of reserved name usage 2024-05-13 19:46:51 +02:00
Niels Dossche
44485892df Factor out all common code for XML serialization and merge common paths 2024-05-11 18:09:39 +02:00
Niels Dossche
6e7adb3c48
Update ext/dom names after policy change (#14171) 2024-05-09 10:40:53 +02:00
Niels Dossche
191d0501a5
Cleanup dom_html_document_encoding_write() (#13788) 2024-03-23 22:17:58 +01:00
Niels Dossche
b955973818 Only register error handling when observable
Closes GH-13702.
2024-03-17 18:24:40 +01:00
Niels Dossche
9fd74cfc9d Use temporary variables to reduce memory stores 2024-03-17 18:21:59 +01:00
Niels Dossche
cbc421e163 Add fast path for ASCII bytes in UTF-8 validation 2024-03-17 18:21:59 +01:00
Niels Dossche
cc0260e014
Change return type of DOM\HTMLDocument::saveHTML() (#13701)
Strict error checking is always true for classes in "new DOM".
This means that we always throw an error when calling
`php_dom_throw_error`, and therefore the false return value is not
actually possible.
Also change the stub to reflect this.
2024-03-13 21:49:40 +01:00
Niels Dossche
539d8d9259 Use common helper macro for getting the node in property handlers 2024-03-10 11:08:46 +01:00
Niels Dossche
d57e7a920b Use BAD_CAST consistently 2024-03-10 11:08:46 +01:00
Niels Dossche
6c55513e33 Use true instead of 1 with php_dom_throw_error 2024-03-10 11:08:46 +01:00
Niels Dossche
14b6c981c3
[RFC] Add a way to opt-in ext/dom spec compliance (#13031)
RFC: https://wiki.php.net/rfc/opt_in_dom_spec_compliance
2024-03-09 16:56:00 +01:00
Niels Dossche
2f1fe3209c Use a direct statically-known call for decoding in the fast path 2024-02-07 18:02:42 +01:00
Niels Dossche
89ea24f63e
Give anonymous dom structs a name (#13135) 2024-01-13 11:34:40 +01:00
Niels Dossche
a9064816db
Optimizations for HTML 5 loading (#12896)
* Fix inverted NULL and add dictionary

* Avoid useless error processing if no reporting is set

* Avoid double work while processing attributes and use fast text instantiation
2023-12-08 18:45:01 +01:00
Niels Dossche
1492be5286
[RFC] DOM HTML5 parsing and serialization support (#12111) 2023-11-13 20:18:19 +01:00