We only use that buffer for parsing integers and floats, which
are unlikely to be very big, and if they are we can just use RB_ALLOCV as it will
almost always end in a small `alloca`.
This allows us to no longer need `rb_protect` around the parser.
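For illustration, a minimal sketch of how such a scratch buffer can be managed with `RB_ALLOCV` (the function name and the conversion call are illustrative, not the gem's actual code):
```
#include <ruby.h>
#include <string.h>

static VALUE
parse_float_slice(const char *start, long length)
{
    VALUE scratch;
    /* Small sizes resolve to alloca(), larger ones to a GC-tracked
     * temporary buffer, so a raised exception cannot leak it. */
    char *buffer = RB_ALLOCV(scratch, length + 1);
    memcpy(buffer, start, length);
    buffer[length] = '\0';
    double value = rb_cstr_to_dbl(buffer, 1);
    RB_ALLOCV_END(scratch);
    return DBL2NUM(value);
}
```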
994859916a
And get rid of the Ragel parser.
This is 7% faster on activitypub, 15% faster on twitter and 11% faster
on citm_catalog.
There might be some more optimization opportunities. I did a quick
optimization pass to fix a regression in string parsing, but other
than that I haven't dug much into performance.
Ref: https://github.com/ruby/json/pull/718
The existing `Parser` interface is pretty bad, as it forces us to
instantiate a new instance for each document.
Instead it's preferable to only take the config and do all the
initialization needed, and then keep the parsing state on the
stack or in ephemeral memory.
This refactor makes the `JSON::Coder` pull request much easier to
implement in a performant way.
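A rough sketch of that split, assuming hypothetical struct and function names (the real parser carries more options and a full value dispatcher):
```
#include <ruby.h>
#include <stdbool.h>

/* Configuration: computed once from the options hash. */
typedef struct {
    bool symbolize_names;
    bool allow_nan;
    long max_nesting;
} parser_config;

/* Per-document state: lives on the stack of each parse call. */
typedef struct {
    const char *cursor;
    const char *end;
    int depth;
} parser_state;

static VALUE
parse_document(const parser_config *config, VALUE document)
{
    parser_state state = {
        .cursor = RSTRING_PTR(document),
        .end    = RSTRING_END(document),
        .depth  = 0,
    };
    (void)config;
    /* ... the actual value dispatch would walk state.cursor here ... */
    return Qnil;
}
```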
c8d5236a92
Co-Authored-By: Étienne Barrié <etienne.barrie@gmail.com>
Before this commit, we would try to scan for a float, then if that
failed, scan for an integer. But floats and integers have many bytes in
common, so we would end up scanning the same bytes multiple times.
This patch combines integer and float scanning machines so that we only
have to scan bytes once. If the machine finds "float parts", then it
executes the "isFloat" transition in the machine, which sets a boolean
letting us know that the parser found a float.
If we didn't find a float, but we did match, then we know it's an int.
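The gist of the single-pass scan, written out as a plain C sketch rather than the actual Ragel-generated machine:
```
#include <stdbool.h>
#include <stddef.h>

static size_t
scan_number(const char *p, const char *end, bool *is_float)
{
    const char *start = p;
    *is_float = false;
    if (p < end && *p == '-') p++;
    while (p < end && *p >= '0' && *p <= '9') p++;
    if (p < end && *p == '.') {                 /* fractional part */
        *is_float = true;
        p++;
        while (p < end && *p >= '0' && *p <= '9') p++;
    }
    if (p < end && (*p == 'e' || *p == 'E')) {  /* exponent part */
        *is_float = true;
        p++;
        if (p < end && (*p == '+' || *p == '-')) p++;
        while (p < end && *p >= '0' && *p <= '9') p++;
    }
    return (size_t)(p - start);                 /* bytes consumed */
}
```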
0c0e0930cd
`rb_cstr2inum` isn't very fast because it handles tons of
different scenarios, and also requires a NULL-terminated string,
which forces us to copy the number into a secondary buffer.
But since the parser already computed the length, we can do this much
more cheaply with a very simple function, as long as the number
is small enough to fit into a native type (`long long`).
If the number is too long, we can fall back to the `rb_cstr2inum`
slow path.
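A hedged sketch of that fast path with its fallback; the 18-digit cutoff and the function name are illustrative:
```
#include <ruby.h>
#include <stdbool.h>
#include <string.h>

static VALUE
decode_integer(const char *start, long length)
{
    /* 18 decimal digits always fit in a long long (LLONG_MAX has 19),
     * so short literals can be accumulated without overflow checks. */
    if (length <= 18) {
        long long value = 0;
        const char *p = start;
        bool negative = (*p == '-');
        if (negative) p++;
        while (p < start + length) {
            value = value * 10 + (*p - '0');
            p++;
        }
        return LL2NUM(negative ? -value : value);
    }

    /* Slow path: rb_cstr2inum() wants a NUL-terminated string, so copy
     * the digits into a scratch buffer first. */
    VALUE scratch;
    char *buffer = RB_ALLOCV(scratch, length + 1);
    memcpy(buffer, start, length);
    buffer[length] = '\0';
    VALUE result = rb_cstr2inum(buffer, 10);
    RB_ALLOCV_END(scratch);
    return result;
}
```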
Before:
```
== Parsing citm_catalog.json (1727030 bytes)
ruby 3.4.0dev (2024-11-06T07:59:09Z precompute-hash-wh.. 7943f98a8a) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
json 40.000 i/100ms
oj 35.000 i/100ms
Oj::Parser 45.000 i/100ms
rapidjson 38.000 i/100ms
Calculating -------------------------------------
json 425.941 (± 1.9%) i/s (2.35 ms/i) - 2.160k in 5.072833s
oj 349.617 (± 1.7%) i/s (2.86 ms/i) - 1.750k in 5.006953s
Oj::Parser 464.767 (± 1.7%) i/s (2.15 ms/i) - 2.340k in 5.036381s
rapidjson 382.413 (± 2.4%) i/s (2.61 ms/i) - 1.938k in 5.070757s
Comparison:
json: 425.9 i/s
Oj::Parser: 464.8 i/s - 1.09x faster
rapidjson: 382.4 i/s - 1.11x slower
oj: 349.6 i/s - 1.22x slower
```
After:
```
== Parsing citm_catalog.json (1727030 bytes)
ruby 3.4.0dev (2024-11-06T07:59:09Z precompute-hash-wh.. 7943f98a8a) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
json 46.000 i/100ms
oj 33.000 i/100ms
Oj::Parser 45.000 i/100ms
rapidjson 39.000 i/100ms
Calculating -------------------------------------
json 462.332 (± 3.2%) i/s (2.16 ms/i) - 2.346k in 5.080504s
oj 351.140 (± 1.1%) i/s (2.85 ms/i) - 1.782k in 5.075616s
Oj::Parser 473.500 (± 1.3%) i/s (2.11 ms/i) - 2.385k in 5.037695s
rapidjson 395.052 (± 3.5%) i/s (2.53 ms/i) - 1.989k in 5.042275s
Comparison:
json: 462.3 i/s
Oj::Parser: 473.5 i/s - same-ish: difference falls within error
rapidjson: 395.1 i/s - 1.17x slower
oj: 351.1 i/s - 1.32x slower
```
3a4dc9e1b4
This is somewhat dead code, as unless you are using `JSON::Parser.new`
directly we never allocate `JSON::Ext::Parser` anymore.
But still, we should mark all its references in case some code out there
uses that.
Followup: #675
8bf74a977b
* rb_str_conv_enc() returns the source string unmodified
if the conversion did not work. But we should be consistent with
the generator here and only accept BINARY or strings convertible to UTF-8.
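A sketch of the kind of check this implies (the helper name is illustrative, not necessarily the parser's actual function):
```
#include <ruby.h>
#include <ruby/encoding.h>

static VALUE
ensure_utf8(VALUE str)
{
    rb_encoding *utf8 = rb_utf8_encoding();

    if (rb_enc_get(str) == utf8) return str;

    /* BINARY input is accepted as-is and assumed to hold UTF-8 bytes. */
    if (rb_enc_get(str) == rb_ascii8bit_encoding()) {
        return rb_enc_associate(rb_str_dup(str), utf8);
    }

    /* Pure 7-bit ASCII needs no transcoding. */
    if (rb_enc_str_asciionly_p(str)) return str;

    VALUE converted = rb_str_conv_enc(str, rb_enc_get(str), utf8);
    /* rb_str_conv_enc() hands back the original string when it cannot
     * convert, so an unchanged encoding means the conversion failed. */
    if (rb_enc_get(converted) != utf8) {
        rb_raise(rb_eEncodingError, "JSON source must be UTF-8 or convertible to UTF-8");
    }
    return converted;
}
```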
1344ad6f66
[Feature #19528]
Ref: https://bugs.ruby-lang.org/issues/19528
`load` is understood as the default method for serialization libraries, and
the default options of `JSON.load` have caused many security vulnerabilities over the
years.
The plan is to do like YAML/Psych: deprecate these default options and direct
users toward `JSON.unsafe_load`, so at least it's obvious it's unsafe to use
against untrusted data.
Ref: https://github.com/ruby/json/issues/655
Followup: https://github.com/ruby/json/issues/657
Assuming the generator might be used for fairly small documents,
we can start with a reasonable buffer size on the stack, and if
we outgrow it, we can spill onto the heap.
In a way this is optimizing for micro-benchmarks, but there are
valid use cases for fairly small JSON documents in actual real world
scenarios, so trashing the GC less in such cases makes sense.
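Roughly, the idea looks like this; the struct, helper names, and the 512-byte initial size are illustrative rather than the gem's actual `FBuffer` code, and allocation error handling is elided:
```
#include <ruby.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char  *ptr;
    size_t len;
    size_t capa;
    bool   heap_allocated;
} out_buffer;

static void
out_buffer_append(out_buffer *buf, const char *data, size_t len)
{
    if (buf->len + len > buf->capa) {
        size_t new_capa = buf->capa;
        while (buf->len + len > new_capa) new_capa *= 2;
        if (buf->heap_allocated) {
            buf->ptr = realloc(buf->ptr, new_capa);
        }
        else {
            /* Spill from the stack to the heap, keeping what was written. */
            char *heap = malloc(new_capa);
            memcpy(heap, buf->ptr, buf->len);
            buf->ptr = heap;
            buf->heap_allocated = true;
        }
        buf->capa = new_capa;
    }
    memcpy(buf->ptr + buf->len, data, len);
    buf->len += len;
}

static VALUE
generate_document(VALUE obj)
{
    char stack_space[512];
    out_buffer buf = { stack_space, 0, sizeof(stack_space), false };
    (void)obj; /* the object would be serialized into buf here */
    out_buffer_append(&buf, "{}", 2);
    VALUE result = rb_utf8_str_new(buf.ptr, buf.len);
    if (buf.heap_allocated) free(buf.ptr);
    return result;
}
```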
Before:
```
ruby 3.3.4 (2024-07-09 revision be1089c8ec) +YJIT [arm64-darwin23]
Warming up --------------------------------------
Oj 518.700k i/100ms
JSON reuse 483.370k i/100ms
Calculating -------------------------------------
Oj 5.722M (± 1.8%) i/s (174.76 ns/i) - 29.047M in 5.077823s
JSON reuse 5.278M (± 1.5%) i/s (189.46 ns/i) - 26.585M in 5.038172s
Comparison:
Oj: 5722283.8 i/s
JSON reuse: 5278061.7 i/s - 1.08x slower
```
After:
```
ruby 3.3.4 (2024-07-09 revision be1089c8ec) +YJIT [arm64-darwin23]
Warming up --------------------------------------
Oj 517.837k i/100ms
JSON reuse 548.871k i/100ms
Calculating -------------------------------------
Oj 5.693M (± 1.6%) i/s (175.65 ns/i) - 28.481M in 5.004056s
JSON reuse 5.855M (± 1.2%) i/s (170.80 ns/i) - 29.639M in 5.063004s
Comparison:
Oj: 5692985.6 i/s
JSON reuse: 5854857.9 i/s - 1.03x faster
```
fe607f4806
Extracted from: https://github.com/ruby/json/pull/512
Use `rb_hash_lookup2` to check for hash key existence instead
of going through `rb_funcall`.
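For illustration, the lookup can be as small as this; the helper name is illustrative, and `Qundef` works as a sentinel because it can never be stored in a Hash:
```
#include <ruby.h>
#include <stdbool.h>

static bool
option_given_p(VALUE opts, VALUE key)
{
    /* Probe the hash directly instead of calling #key? via rb_funcall. */
    return rb_hash_lookup2(opts, key, Qundef) != Qundef;
}
```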
43835a0d13
Co-Authored-By: lukeg <luke.gru@gmail.com>
I, Luke T. Shumaker, am the sole author of the added code.
I did not reference CVTUTF when writing it. I did reference the
Unicode standard (15.0.0), the Wikipedia article on UTF-8, and the
Wikipedia article on UTF-16. When I saw some tests fail, I did
reference the old deleted code (but a JSON-specific part, inherently
not as based on CVTUTF) to determine that script_safe should also
escape U+2028 and U+2029.
I targeted simplicity and clarity when writing the code--it can likely
be optimized. In my mind, the obvious next optimization is to have it
combine contiguous non-escaped characters into just one call to
fbuffer_append(), instead of calling fbuffer_append() for each
character.
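That batching idea would look roughly like this; `needs_escape()` and `append_escaped()` are hypothetical helpers, only `fbuffer_append()` is the gem's real appender:
```
#include "fbuffer.h" /* the gem's buffer helpers */

static void
escape_string(FBuffer *out, const char *ptr, size_t len)
{
    size_t beg = 0;
    for (size_t pos = 0; pos < len; pos++) {
        if (needs_escape(ptr[pos])) {
            /* Flush the clean run accumulated so far in one call. */
            fbuffer_append(out, ptr + beg, pos - beg);
            append_escaped(out, ptr[pos]);
            beg = pos + 1;
        }
    }
    /* Trailing clean run. */
    fbuffer_append(out, ptr + beg, len - beg);
}
```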
Regarding the use of the "modern" types `uint32_t`, `uint16_t`, and
`bool`:
- ruby.h is guaranteed to give us uint32_t and uint16_t.
- Since Ruby 3.0.0, ruby.h is guaranteed to give us bool... but we
support down to Ruby 2.3. But, ruby.h is guaranteed to give us
HAVE_STDBOOL_H for the C99 stdbool.h; so use that to include
stdbool.h if we can, and if not then fall back to a copy of the
same bool definition that Ruby 3.0.5 uses with C89.
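Concretely, something along these lines (a minimal sketch of the conditional include described above):
```
#ifdef HAVE_STDBOOL_H
# include <stdbool.h>
#else
/* C89 fallback mirroring the definition Ruby 3.0.5 ships. */
typedef unsigned char _Bool;
# define bool  _Bool
# define true  1
# define false 0
#endif
```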
c96351f874
> https://github.com/flori/json/pull/525
> Rename escape_slash to script_safe and also escape U+2028 and U+2029
Co-authored-by: Jean Boussier <jean.boussier@gmail.com>
> https://github.com/flori/json/pull/454
> Remove unnecessary initialization of create_id in JSON.parse()
Co-authored-by: Watson <watson1978@gmail.com>
Previously in the JSON::Ext parser, when we encountered an "Infinity"
token (and weren't allowing NaN/Infinity) we would try to display the
"unexpected token" error at the character before it.
42ac170712
It makes testing for JSON errors very tedious. You either have
to use a Regexp or regularly update all your assertions
when JSON is upgraded.
de9eb1d28e
When `HAVE_RB_ENC_INTERNED_STR` is enabled it is possible to
pass a null pointer through to `rb_enc_interned_str`, resulting
in a segfault.
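A minimal sketch of the guard, assuming an illustrative helper name:
```
#include <ruby.h>
#include <ruby/encoding.h>

static VALUE
build_interned_string(const char *ptr, long len)
{
#ifdef HAVE_RB_ENC_INTERNED_STR
    /* Passing a NULL pointer here was reported to segfault, so
     * substitute an empty literal for a zero-length slice. */
    if (ptr == NULL) {
        return rb_enc_interned_str("", 0, rb_utf8_encoding());
    }
    return rb_enc_interned_str(ptr, len, rb_utf8_encoding());
#else
    return rb_enc_str_new(ptr, len, rb_utf8_encoding());
#endif
}
```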
Fixes #495
b59368a8c2