129 commits

Author SHA1 Message Date
Hiroshi SHIBATA
0f06626915
Bump up strscan version to 3.1.5.dev 2025-05-02 10:11:09 +09:00
Sutou Kouhei
af6d6b64ea [ruby/strscan] named_captures: fix incompatibility with
MatchData#named_captures
(https://github.com/ruby/strscan/pull/146)

Fix https://github.com/ruby/strscan/pull/145

`MatchData#named_captures` uses the last matched value for each name.

Reported by Linus Sellberg. Thanks!!!

a6086ea322
2025-05-02 09:52:38 +09:00
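
The rule mentioned above can be sketched as follows. This is a minimal illustration, not taken from the commit: the pattern constant and both helper methods are hypothetical names. The `MatchData` result is core Ruby behavior; what `StringScanner#named_captures` returns depends on the installed strscan version, so no result is claimed for it here.

```ruby
require 'strscan'

# Hypothetical pattern: two groups share the name "a".
DUPLICATE_NAME_PATTERN = /(?<a>.)(?<a>.)/

# Reference behavior from core Ruby: for a duplicated name, MatchData
# keeps the value of the last group that matched.
def match_data_captures(text)
  match = DUPLICATE_NAME_PATTERN.match(text)
  match && match.named_captures
end

# StringScanner counterpart. Returns nil when the installed strscan does
# not provide StringScanner#named_captures at all.
def scanner_captures(text)
  scanner = StringScanner.new(text)
  return nil unless scanner.respond_to?(:named_captures)
  scanner.scan(DUPLICATE_NAME_PATTERN)
  scanner.named_captures
end

p match_data_captures("01") # the value stored under "a" is "1"
p scanner_captures("01")    # expected to agree with MatchData once this fix is present
```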
Hiroshi SHIBATA
4634a0042e
Mark development version for unreleased gems 2025-04-22 11:27:24 +09:00
Sutou Kouhei
067fc410fc
[ruby/strscan] Bump version
8ff80150c4
2025-04-22 11:27:24 +09:00
Sutou Kouhei
ad8cb532d5 [ruby/strscan] Bump version
7b1eb1e4ed
2025-04-14 16:18:48 +09:00
Jean byroot Boussier
0db87b8943 [ruby/strscan] Allow parsing strings larger than 2GiB
(https://github.com/ruby/strscan/pull/147)

For an unknown reason, even though `pos` is stored as a `long`,
`#pos` and `#pos=` treat it as an `int`, which prevents seeking into
strings larger than 2GiB.

b76368416e

Co-authored-by: Jean Boussier <jean.boussier@gmail.com>
2025-04-14 16:18:47 +09:00
NAITOH Jun
018943ba05 [ruby/strscan] Fix inconsistency of IndexError vs nil for
unknown capture group
(https://github.com/ruby/strscan/pull/143)

Fix https://github.com/ruby/strscan/pull/139

Reported by Benoit Daloze. Thanks!!!

bc8a0d2623
2025-02-25 15:36:46 +09:00
NAITOH Jun
36ab247e4d [ruby/strscan] Fix a bug that scanning methods that don't use Regexp
don't clear named capture groups
(https://github.com/ruby/strscan/pull/142)

Fix https://github.com/ruby/strscan/pull/135

b957443e20
2025-02-25 15:36:46 +09:00
Jean Boussier
bf6c106d54 [ruby/strscan] scan_integer(base: 16) ignore x suffix if not
followed by hexadecimal
(https://github.com/ruby/strscan/pull/141)

Fix: https://github.com/ruby/strscan/issues/140

`0x<EOF>`, `0xZZZ` should be parsed as `0` instead of not matching at
all.

c4e4795ed2
2025-02-21 11:31:36 +09:00
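
A minimal sketch of the call this commit affects, assuming a strscan version that provides `StringScanner#scan_integer` with the `base:` keyword. The helper `scan_hex` is a hypothetical name used only for illustration. The result for `0xZZZ` depends on whether the installed version contains this fix, so it is printed rather than asserted.

```ruby
require 'strscan'

# Hypothetical helper (not part of strscan): scan one base 16 integer from
# the start of +text+ and report the value plus the resulting scan position.
def scan_hex(text)
  scanner = StringScanner.new(text)
  value = scanner.scan_integer(base: 16)
  { value: value, pos: scanner.pos }
end

# Guarded so the sketch still loads on versions without scan_integer.
if StringScanner.method_defined?(:scan_integer)
  p scan_hex("0x1F")  # well-formed input with the 0x prefix
  p scan_hex("0xZZZ") # 0 with this fix; versions without it may not match at all
end
```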
NAITOH Jun
eee9bd1aa4 [ruby/strscan] Fix a bug that scan_until behaves differently with
Regexp and String patterns
(https://github.com/ruby/strscan/pull/138)

Fix https://github.com/ruby/strscan/pull/131

e1cec2e726
2025-02-17 11:04:32 +09:00
Hiroshi SHIBATA
b4ed6db096
Removed trailing spaces 2025-02-14 16:16:55 +09:00
Jean Boussier
51004c3641
[ruby/strscan] Fix a bug that scan_integer doesn't update matched
data
(https://github.com/ruby/strscan/pull/133)

Fix https://github.com/ruby/strscan/pull/130

Reported by Andrii Konchyn. Thanks!!!

4e5f17f87a
2025-02-14 16:13:26 +09:00
Alexander Momchilov
41e24c2f3e
[ruby/strscan] [DOC] Add syntax highlighting to MarkDown code blocks
(https://github.com/ruby/strscan/pull/126)

Split off from https://github.com/ruby/ruby/pull/12322

9bee37e0f5
2024-12-16 10:10:34 +09:00
Sutou Kouhei
219c2eee5a
[ruby/strscan] Bump version
fd140b8582
2024-12-16 10:10:34 +09:00
Hiroshi SHIBATA
78ca87f8a8
Lock released version of strscan-3.1.1 2024-12-12 16:14:25 +09:00
Jean Boussier
636d57bd1c [ruby/strscan] Micro optimize encoding checks
(https://github.com/ruby/strscan/pull/117)

Profiling shows a lot of time spent in various encoding check functions.
I'm working on optimizing them on the Ruby side, but if we assume most
strings are one of the simple 3 encodings, we can skip a lot of
overhead.

```ruby
require 'strscan'
require 'benchmark/ips'

source = 10_000.times.map { rand(9999999).to_s }.join(",").force_encoding(Encoding::UTF_8).freeze

def scan_to_i(source)
  scanner = StringScanner.new(source)
  while number = scanner.scan(/\d+/)
    number.to_i
    scanner.skip(",")
  end
end

def scan_integer(source)
  scanner = StringScanner.new(source)
  while scanner.scan_integer
    scanner.skip(",")
  end
end

Benchmark.ips do |x|
  x.report("scan.to_i") { scan_to_i(source) }
  x.report("scan_integer") { scan_integer(source) }
  x.compare!
end
```

Before:

```
ruby 3.3.4 (2024-07-09 revision be1089c8ec) +YJIT [arm64-darwin23]
Warming up --------------------------------------
           scan.to_i    93.000 i/100ms
        scan_integer   232.000 i/100ms
Calculating -------------------------------------
           scan.to_i    933.191 (± 0.2%) i/s    (1.07 ms/i) -      4.743k in   5.082597s
        scan_integer      2.326k (± 0.8%) i/s  (429.99 μs/i) -     11.832k in   5.087974s

Comparison:
        scan_integer:     2325.6 i/s
           scan.to_i:      933.2 i/s - 2.49x  slower
```

After:

```
ruby 3.3.4 (2024-07-09 revision be1089c8ec) +YJIT [arm64-darwin23]
Warming up --------------------------------------
           scan.to_i    96.000 i/100ms
        scan_integer   274.000 i/100ms
Calculating -------------------------------------
           scan.to_i    969.489 (± 0.2%) i/s    (1.03 ms/i) -      4.896k in   5.050114s
        scan_integer      2.756k (± 0.1%) i/s  (362.88 μs/i) -     13.974k in   5.070837s

Comparison:
        scan_integer:     2755.8 i/s
           scan.to_i:      969.5 i/s - 2.84x  slower
```

c02b1ce684
2024-12-02 10:50:34 +09:00
Jean Boussier
79cc3d26ed StringScanner#scan_integer support base 16 integers (#116)
Followup: https://github.com/ruby/strscan/pull/115

`scan_integer` is now implemented in Ruby so as to efficiently handle
keyword arguments without allocating a Hash. Given that the goal of
`scan_integer` is to parse integers more efficiently without having to
allocate an intermediary object, using `rb_scan_args` would defeat the
purpose.

Additionally, the C implementation now uses `rb_isdigit` and
`rb_isxdigit`, because on Windows `isdigit` is locale dependent.
2024-12-02 10:50:34 +09:00
Jean Boussier
d5de1a5789 [ruby/strscan] Implement #scan_integer to efficiently parse Integer
(https://github.com/ruby/strscan/pull/115)

Fix: https://github.com/ruby/strscan/issues/113

This allows directly parsing an Integer from a String without needing
to first allocate a substring.

Notes:

The implementation is limited by design. It's meant as a first step:
only the most straightforward case, base 10 integers, is supported.

6a3c74b4c8
2024-11-27 09:24:07 +09:00
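
As a rough usage sketch of the method introduced here, assuming a strscan version that provides `StringScanner#scan_integer`: the helper `parse_integer_list` is a hypothetical name, and the comma-separated input format is only an example.

```ruby
require 'strscan'

# Hypothetical helper: collect comma-separated base 10 integers from +text+.
# scan_integer returns an Integer directly, so no substring such as "12"
# is allocated and converted with to_i.
def parse_integer_list(text)
  scanner = StringScanner.new(text)
  numbers = []
  while (number = scanner.scan_integer)
    numbers << number
    scanner.skip(",") # step over the separator, if there is one
  end
  numbers
end

# Guarded so the sketch still loads on versions without scan_integer.
if StringScanner.method_defined?(:scan_integer)
  p parse_integer_list("12,-7,300")
end
```

The loop stops at the first position where no integer can be scanned, which includes the end of the string.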
NAITOH Jun
e73f35ddaf [ruby/strscan] [CRuby] Optimize strscan_do_scan(): Remove
unnecessary use of `rb_enc_get()`
(https://github.com/ruby/strscan/pull/108)

- before: #106

## Why?

In `rb_strseq_index()`, the result of `rb_enc_check()` is used.

- 6c7209cd37/string.c (L4335-L4368)
> enc = rb_enc_check(str, sub);

> return strseq_core(str_ptr, str_ptr_end, str_len, sub_ptr, sub_len,
offset, enc);

- 6c7209cd37/string.c (L4309-L4318)
```C
strseq_core(const char *str_ptr, const char *str_ptr_end, long str_len,
            const char *sub_ptr, long sub_len, long offset, rb_encoding *enc)
{
    const char *search_start = str_ptr;
    long pos, search_len = str_len - offset;

    for (;;) {
        const char *t;
        pos = rb_memsearch(sub_ptr, sub_len, search_start, search_len, enc);
```

## Benchmark

It shows String as a pattern is 1.24x faster than Regexp as a pattern.

```
$ benchmark-driver benchmark/check_until.yaml
Warming up --------------------------------------
              regexp     9.225M i/s -      9.328M times in 1.011068s (108.40ns/i)
          regexp_var     9.327M i/s -      9.413M times in 1.009214s (107.21ns/i)
              string     9.200M i/s -      9.355M times in 1.016840s (108.70ns/i)
          string_var    11.249M i/s -     11.255M times in 1.000578s (88.90ns/i)
Calculating -------------------------------------
              regexp     9.565M i/s -     27.676M times in 2.893476s (104.55ns/i)
          regexp_var    10.111M i/s -     27.982M times in 2.767496s (98.90ns/i)
              string    10.060M i/s -     27.600M times in 2.743465s (99.40ns/i)
          string_var    12.519M i/s -     33.746M times in 2.695615s (79.88ns/i)

Comparison:
          string_var:  12518707.2 i/s
          regexp_var:  10111089.6 i/s - 1.24x  slower
              string:  10060144.4 i/s - 1.24x  slower
              regexp:   9565124.4 i/s - 1.31x  slower
```

ff2d7afa19
2024-10-26 18:44:15 +09:00
Nobuyoshi Nakada
d6046bccb7 [ruby/strscan] Use C90 as far as supporting 2.6 or earlier
(https://github.com/ruby/strscan/pull/101)

d31274f41b
2024-10-26 18:44:15 +09:00
NAITOH Jun
d81b0588bb
[ruby/strscan] Accept String as a pattern at non head
(https://github.com/ruby/strscan/pull/106)

It supports non-head match cases such as StringScanner#scan_until.

If we use a String as a pattern, we can improve match performance.
Here is the result of the included benchmark.

## CRuby

It shows String as a pattern is 1.18x faster than Regexp as a pattern.

```
$ benchmark-driver benchmark/check_until.yaml
Warming up --------------------------------------
              regexp     9.403M i/s -      9.548M times in 1.015459s (106.35ns/i)
          regexp_var     9.162M i/s -      9.248M times in 1.009479s (109.15ns/i)
              string     8.966M i/s -      9.274M times in 1.034343s (111.54ns/i)
          string_var    11.051M i/s -     11.190M times in 1.012538s (90.49ns/i)
Calculating -------------------------------------
              regexp    10.319M i/s -     28.209M times in 2.733707s (96.91ns/i)
          regexp_var    10.032M i/s -     27.485M times in 2.739807s (99.68ns/i)
              string     9.681M i/s -     26.897M times in 2.778397s (103.30ns/i)
          string_var    12.162M i/s -     33.154M times in 2.726046s (82.22ns/i)

Comparison:
          string_var:  12161920.6 i/s
              regexp:  10318949.7 i/s - 1.18x  slower
          regexp_var:  10031617.6 i/s - 1.21x  slower
              string:   9680843.7 i/s - 1.26x  slower
```

## JRuby

It shows String as a pattern is 2.11x faster than Regexp as a pattern.

```
$ benchmark-driver benchmark/check_until.yaml
Warming up --------------------------------------
              regexp     7.591M i/s -      7.544M times in 0.993780s (131.74ns/i)
          regexp_var     6.143M i/s -      6.125M times in 0.997038s (162.77ns/i)
              string    14.135M i/s -     14.079M times in 0.996067s (70.75ns/i)
          string_var    14.079M i/s -     14.057M times in 0.998420s (71.03ns/i)
Calculating -------------------------------------
              regexp     9.409M i/s -     22.773M times in 2.420268s (106.28ns/i)
          regexp_var    10.116M i/s -     18.430M times in 1.821820s (98.85ns/i)
              string    21.389M i/s -     42.404M times in 1.982519s (46.75ns/i)
          string_var    20.897M i/s -     42.237M times in 2.021187s (47.85ns/i)

Comparison:
              string:  21389191.1 i/s
          string_var:  20897327.5 i/s - 1.02x  slower
          regexp_var:  10116464.7 i/s - 2.11x  slower
              regexp:   9409222.3 i/s - 2.27x  slower
```

See:
be7815ec02/core/src/main/java/org/jruby/util/StringSupport.java (L1706-L1736)

---------

f9d96c446a

Co-authored-by: Sutou Kouhei <kou@clear-code.com>
2024-09-17 15:12:25 +09:00
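
A small sketch of what passing a String to a non-head method such as `StringScanner#scan_until` looks like. The helper `read_through` is a hypothetical name. It assumes that strscan versions without this change raise `TypeError` for a String pattern, and it makes no claim about the exact String-pattern return value, which a later commit in this log adjusts.

```ruby
require 'strscan'

# Hypothetical helper: scan from the start of +text+ through the next
# occurrence of +delimiter+, which may be a Regexp or a String.
def read_through(text, delimiter)
  scanner = StringScanner.new(text)
  scanner.scan_until(delimiter)
rescue TypeError
  # Assumed behavior of older strscan: only a Regexp is accepted here.
  nil
end

p read_through("key=value;rest", /;/) # Regexp pattern
p read_through("key=value;rest", ";") # String pattern, where supported
```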
Hiroshi SHIBATA
32f134bb85
Added pre-release suffix for development version of default gems
https://github.com/ruby/stringio/issues/81
2024-08-31 14:22:17 +09:00
Hiroshi SHIBATA
3eda59e975
Sync strscan HEAD again.
https://github.com/ruby/strscan/pull/99 split document with multi-byte
chars.
2024-06-04 12:40:08 +09:00
Hiroshi SHIBATA
78bfde5d9f
Revert "[ruby/strscan] Doc for StringScanner"
This reverts commit 974ed1408c.
2024-05-30 21:13:10 +09:00
Hiroshi SHIBATA
d70b0da482
Revert "Fix reference path for strscan documentation"
This reverts commit 1fa93fb948.
2024-05-30 21:13:01 +09:00
Hiroshi SHIBATA
1fa93fb948
Fix reference path for strscan documentation 2024-05-30 14:29:25 +09:00
Burdette Lamar
974ed1408c
[ruby/strscan] Doc for StringScanner
(https://github.com/ruby/strscan/pull/96)

#peek_byte and #scan_byte not updated (not available in my repo --
sorry).

---------

0123da7352

Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
2024-05-30 12:34:18 +09:00
Aaron Patterson
164e464b04 [ruby/strscan] Add a method for peeking and reading bytes as
integers
(https://github.com/ruby/strscan/pull/89)

This commit adds `scan_byte` and `peek_byte`. `scan_byte` will scan the
current byte, return it as an integer, and advance the cursor.
`peek_byte` will return the current byte as an integer without advancing
the cursor.

Currently `StringScanner#get_byte` returns a string, but I want to get
the current byte without allocating a string. I think this will help
with writing high performance lexers.

---------

873aba2e5d

Co-authored-by: Sutou Kouhei <kou@clear-code.com>
2024-02-26 15:54:54 +09:00
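
The difference between the two new methods can be sketched like this, assuming a strscan version that provides `scan_byte` and `peek_byte`. Both helper names are hypothetical.

```ruby
require 'strscan'

# Hypothetical helper: peek at the current byte, then scan it, recording
# the scan position after each step.
def peek_then_scan(text)
  scanner = StringScanner.new(text)
  peeked = scanner.peek_byte  # Integer; the cursor does not move
  pos_after_peek = scanner.pos
  scanned = scanner.scan_byte # same Integer; the cursor advances one byte
  [peeked, pos_after_peek, scanned, scanner.pos]
end

# Hypothetical helper: collect every byte of +text+ as an Integer without
# creating a one-character String per byte, as get_byte would.
def byte_values(text)
  scanner = StringScanner.new(text)
  bytes = []
  while (byte = scanner.scan_byte)
    bytes << byte
  end
  bytes
end

# Guarded so the sketch still loads on versions without these methods.
if StringScanner.method_defined?(:scan_byte)
  p peek_then_scan("AB")
  p byte_values("AB")
end
```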
Sutou Kouhei
ce2618c628
[ruby/strscan] Bump version
ba338b882c
2024-02-08 14:43:56 +09:00
Sutou Kouhei
5afae77ce9
[ruby/strscan] Bump version
842845af1f
2024-02-08 14:43:56 +09:00
Sutou Kouhei
ac636f5709
[ruby/strscan] Bump version
d6f97ec102
2024-01-19 10:49:12 +09:00
NAITOH Jun
338eb0065b [ruby/strscan] StringScanner#captures: Return nil not "" for
unmatched capture
(https://github.com/ruby/strscan/pull/72)

fix https://github.com/ruby/strscan/issues/70
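If no substring matches a group, `s[3]` returns nil but the
corresponding element of `s.captures` is "", so the two behave
differently (first example below).

If no substring matches the group, the corresponding element of
`s.captures` should be nil, like `s[3]` (second example below).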

```
s = StringScanner.new('foobarbaz') #=> #<StringScanner 0/9 @ "fooba...">
s.scan /(foo)(bar)(BAZ)?/  #=> "foobar"
s[0]           #=> "foobar"
s[1]           #=> "foo"
s[2]           #=> "bar"
s[3]           #=> nil
s.captures #=> ["foo", "bar", ""]
s.captures.compact #=> ["foo", "bar", ""]
```

```
s = StringScanner.new('foobarbaz') #=> #<StringScanner 0/9 @ "fooba...">
s.scan /(foo)(bar)(BAZ)?/  #=> "foobar"
s[0]           #=> "foobar"
s[1]           #=> "foo"
s[2]           #=> "bar"
s[3]           #=> nil
s.captures #=> ["foo", "bar", nil]
s.captures.compact #=> ["foo", "bar"]
```

https://docs.ruby-lang.org/ja/latest/method/MatchData/i/captures.html
```
/(foo)(bar)(BAZ)?/ =~ "foobarbaz" #=> 0
$~.to_a        #=> ["foobar", "foo", "bar", nil]
$~.captures #=> ["foo", "bar", nil]
$~.captures.compact #=> ["foo", "bar"]
```

* StringScanner#captures is not yet documented.
https://docs.ruby-lang.org/ja/latest/class/StringScanner.html

1fbfdd3c6f
2024-01-14 22:27:24 +09:00
Hiroshi SHIBATA
f54369830f Revert "Rollback to released version numbers of stringio and strscan"
This reverts commit 6a79e53823.
2023-12-25 21:12:49 +09:00
Hiroshi SHIBATA
6a79e53823
Rollback to released version numbers of stringio and strscan 2023-12-16 12:00:59 +08:00
Sutou Kouhei
ce8301084f [ruby/strscan] Bump version
1b3393be05
2023-11-08 09:26:58 +09:00
Peter Zhu
91e13a5207 [ruby/strscan] Fix indentation in strscan.c
[ci skip]
2023-07-28 10:12:52 -04:00
Peter Zhu
7193b404a1 Add function rb_reg_onig_match
rb_reg_onig_match performs preparation, error handling, and cleanup for
matching a regex against a string. This reduces repetitive code and
removes the need for StringScanner to access internal data of regex.
2023-07-27 13:33:40 -04:00
Peter Zhu
e27eab2f85 [ruby/strscan] Sync missed commit
Syncs commit ruby/strscan@76b377a5d8.
2023-07-27 09:42:42 -04:00
Sutou Kouhei
18e840ac60 [ruby/strscan] Bump version
681cde0f27
2023-02-21 19:31:36 +09:00
OKURA Masafumi
a44f5ab089 [ruby/strscan] Mention return value of rest? in the doc
(https://github.com/ruby/strscan/pull/49)

The doc of `rest?` was unclear about return value. This commit adds the
return value to the doc.
2023-02-21 19:31:35 +09:00
Sutou Kouhei
79ad045214 [ruby/strscan] Bump version
3ada12613d
2022-12-26 15:09:21 +09:00
Hiroshi SHIBATA
4e31fea77d Merge strscan-3.0.5 2022-12-09 16:36:22 +09:00
Sutou Kouhei
c0c43276a1 [ruby/strscan] Bump version
If we use the same version as the default strscan gem in Ruby, "gem
install" doesn't extract the .gem. "gem install" then fails because it
can't find ext/strscan/ to build.

3ceafa6cdc
2021-10-24 05:57:48 +09:00
Gannon McGibbon
a42b7de436 [ruby/strscan] Replace "iff" with "if and only if" (#18)
iff means if and only if, but readers without that knowledge might
assume this to be a spelling mistake. To me, this seems like
exclusionary language that is unnecessary. Simply using "if and only if"
instead should suffice.

066451c11e
2021-05-06 16:21:14 +09:00
Kenichi Kamiya
564ccd095a [ruby/strscan] Fix segmentation fault of StringScanner#charpos when String#byteslice returns non string value [Bug #17756] (#20)
92961cde2b
2021-05-06 16:20:38 +09:00
Jeremy Evans
c03b723f56 Update class documentation for StringScanner
The [] wasn't being displayed. Also try to fix the formatting for bol?
and << (even if they aren't linked).

Fixes [Bug #17620]
2021-02-10 08:17:07 -08:00
Kenta Murata
b5de66e132
[strscan] Fix license comment and files
a999f2c6d1
2020-12-18 14:25:48 +09:00
Kenta Murata
5370963992
[strscan] Version 3.0.0
08645e4e77
2020-12-18 14:25:42 +09:00
Kenta Murata
985f0af257
[strscan] Make strscan Ractor safe (#17)
* Make strscan Ractor safe

* Add test-unit in the development dependencies

3c93c2bebe
2020-12-18 14:25:41 +09:00
Aaron Patterson
6aa466ba9c mark regex internal to string scanner 2020-10-02 12:01:57 -07:00