Commit graph

658 commits

Author SHA1 Message Date
Nobuyoshi Nakada
b49cd84311 Remove REG_LITERAL flag
All `Regexp` literals are frozen now.
2023-02-09 19:21:24 +09:00
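A minimal illustration of the frozen-literal behavior mentioned in the commit above, assuming a Ruby version in which regexp literals are frozen (3.0 or later). The variable names are only for illustration.

```ruby
# Regexp literals are frozen objects; regexps built at run time
# with Regexp.new are not.
literal = /ab+c/
dynamic = Regexp.new("ab+c")

puts literal.frozen?  # true
puts dynamic.frozen?  # false
```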
Jeremy Evans
eccfc978fd Fix parsing of regexps that toggle extended mode on/off inside regexp
This was broken in ec3542229b. That commit
didn't handle cases where extended mode was turned on/off inside the
regexp.  There are two ways to turn extended mode on/off:

```
/(?-x:#y)#z
/x =~ '#y'

/(?-x)#y(?x)#z
/x =~ '#y'
```

These can be nested inside the same regexp:

```
/(?-x:(?x)#x
(?-x)#y)#z
/x =~ '#y'
```

As you can probably imagine, this makes handling these regexps
somewhat complex. Due to the nesting inside portions of regexps,
the unescape_nonascii function needs to be recursive.  In
recursive mode, it needs to track both opening and closing
parentheses, similar to how it already tracked opening and
closing brackets for character classes.

When scanning the regexp and coming to `(?` not followed by `#`,
scan for options, and use `x` and `i` to determine whether to
turn on or off extended mode.  For `:`, indicating only the
current regexp section should have the extended mode
switched, recurse with the extended mode set or unset. For `)`,
indicating the remainder of the regexp (or current regexp portion
if already recursing) should turn extended mode on or off, just
change the extended mode flag and keep scanning.

While testing this, I noticed that `a`, `d`, and `u` are accepted
as options, in addition to `i`, `m`, and `x`, but I can't see
where those options are documented.  I'm not sure whether or not
handling `a`, `d`, and `u` as options is a bug.

Fixes [Bug #19379]
2023-01-30 08:51:12 -08:00
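The recursive scan described in the commit above can be sketched in Ruby as a toy model. This is not the C code in re.c: `scan_section` and `strip_extended_comments` are hypothetical names, and the sketch only removes comments from a pattern source so the effect of each mode switch is visible. It tracks character classes, recurses on groups, and treats `(?flags)` as a switch for the rest of the current section.

```ruby
# Toy model only; all names are hypothetical and not part of re.c.
# Scans src from pos and returns [text_without_comments, next_pos].
# When nested is true, scanning stops at the ")" closing the group.
def scan_section(src, pos, extended, nested)
  out = +""
  class_depth = 0
  while pos < src.length
    ch = src[pos]
    if ch == "\\"                     # escaped character: copy verbatim
      out << src[pos, 2]
      pos += 2
    elsif class_depth > 0             # inside [...]: "#" is a literal
      class_depth += 1 if ch == "["
      class_depth -= 1 if ch == "]"
      out << ch
      pos += 1
    elsif ch == "["
      class_depth = 1
      out << ch
      pos += 1
    elsif ch == "#" && extended       # extended comment: skip to end of line
      pos += 1 while pos < src.length && src[pos] != "\n"
    elsif ch == ")" && nested         # end of the current group
      return [out, pos]
    elsif src[pos, 3] == "(?#"        # comment group: skip through ")"
      close = src.index(")", pos)
      pos = close ? close + 1 : src.length
    elsif (m = /\G\(\?([a-z]*)(?:-([a-z]*))?([:)])/.match(src, pos))
      mode = extended
      mode = true if m[1].include?("x")
      mode = false if m[2].to_s.include?("x")
      out << m[0]
      pos = m.end(0)
      if m[3] == ")"                  # (?x) or (?-x): rest of this section
        extended = mode
      else                            # (?x:...) or (?-x:...): recurse
        inner, pos = scan_section(src, pos, mode, true)
        out << inner
        if src[pos] == ")"
          out << ")"
          pos += 1
        end
      end
    elsif ch == "("                   # ordinary group: recurse, same mode
      inner, pos = scan_section(src, pos + 1, extended, true)
      out << "(" << inner
      if src[pos] == ")"
        out << ")"
        pos += 1
      end
    else
      out << ch
      pos += 1
    end
  end
  [out, pos]
end

def strip_extended_comments(src, extended: true)
  scan_section(src, 0, extended, false).first
end

p strip_extended_comments("(?-x:#y)#z\n")               # "(?-x:#y)\n"
p strip_extended_comments("(?-x)#y(?x)#z\n")            # "(?-x)#y(?x)\n"
p strip_extended_comments("(?-x:(?x)#x\n(?-x)#y)#z\n")  # "(?-x:(?x)\n(?-x)#y)\n"
```

In each case `#y` survives because it sits in a section where extended mode is off, while `#x` and `#z` are dropped as comments.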
Burdette Lamar
30bd2a32fa
[DOC] Correction to RDoc for Regexp.new (#7130)
Correction to RDoc for Regexp.new
2023-01-16 11:02:23 -06:00
Jeremy Evans
7e8fa06022 Always issue deprecation warning when calling Regexp.new with 3rd positional argument
Previously, only certain values of the 3rd argument triggered a
deprecation warning.

First step for fix for bug #18797.  Support for the 3rd argument
will be removed after the release of Ruby 3.2.

Fix minor fallout discovered by the tests.

Co-authored-by: Nobuyoshi Nakada <nobu@ruby-lang.org>
2022-12-22 11:50:26 -08:00
Nobuyoshi Nakada
e61e4ae60b
Refactor reg_extract_args to return regexp if given 2022-12-22 19:27:27 +09:00
Nobuyoshi Nakada
454c00723a Share argument parsing in Regexp#initialize and Regexp.linear_time? 2022-12-22 15:51:00 +09:00
卜部昌平
34d43ed9f5 typo in doc [ci skip] 2022-12-19 11:20:55 +09:00
卜部昌平
47a6e7b518 Note about Regexp.linear_time? [ci skip] 2022-12-19 11:05:55 +09:00
TSUYUSATO Kitsune
fbedadb61f
Add Regexp.linear_time? (#6901) 2022-12-14 12:57:14 +09:00
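A short usage sketch for `Regexp.linear_time?`, assuming Ruby 3.2 or later. The two patterns are illustrative choices: a plain repetition, and a backreference, which is one construct that prevents the linear-time guarantee.

```ruby
# Simple repetition can be matched in linear time.
puts Regexp.linear_time?(/a+b/)      # true

# A backreference rules out the linear-time guarantee.
puts Regexp.linear_time?(/(a)\1/)    # false

# A pattern string is accepted as well, as with Regexp.new.
puts Regexp.linear_time?("(a)\\1")   # false
```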
S-H-GAMELINKS
1a64d45c67 Introduce encoding check macro 2022-12-02 01:31:27 +09:00
Yusuke Endoh
ab4c7077cc Prevent segfault in String#scan with ObjectSpace.each_object
Calling `String#scan` without a block creates an incomplete MatchData
object whose `RMATCH(match)->str` is Qfalse. Usually this object is not
leaked, but it was possible to pull it by using ObjectSpace.each_object.

This change hides the internal MatchData object by using rb_obj_hide.

Fixes [Bug #19159]
2022-12-01 02:38:51 +09:00
S-H-GAMELINKS
1f4f6c9832 Using UNDEF_P macro 2022-11-16 18:58:33 +09:00
Nobuyoshi Nakada
001606097b Suppress false warning by a bug of gcc
GCC [Bug 99578] seems to be triggered by calling `rb_reg_last_match` before
`match_check(match)`, probably by `NIL_P(match)` in `rb_reg_nth_match`.

[Bug 99578]: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99578
2022-11-08 16:13:30 +09:00
Yusuke Endoh
67ed70da61 Refactor timeout-setting code to a function 2022-10-24 18:21:30 +09:00
Yusuke Endoh
ef01482f64 Refactor timeout-related code in re.c a little 2022-10-24 18:13:26 +09:00
Yusuke Endoh
b51b22513f
Fix per-instance Regexp timeout (#6621)
Fix per-instance Regexp timeout

This makes it follow what was decided in [Bug #19055]:

* `Regexp.new(str, timeout: nil)` should respect the global timeout
* `Regexp.new(str, timeout: huge_val)` should use the maximum value that
  can be represented in the internal representation
* `Regexp.new(str, timeout: 0 or negative value)` should raise an error
2022-10-24 18:03:26 +09:00
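The three rules in the commit above can be checked with a small sketch, assuming Ruby 3.2 or later. The pattern `"a+"` and the variable names are arbitrary. The huge-value case is omitted because the clamped maximum depends on the internal representation.

```ruby
# An explicit per-instance timeout is reported by Regexp#timeout.
with_timeout = Regexp.new("a+", timeout: 1.5)
puts with_timeout.timeout.inspect     # 1.5

# timeout: nil sets no per-instance timeout, so the global
# Regexp.timeout (if any) applies when matching.
no_timeout = Regexp.new("a+", timeout: nil)
puts no_timeout.timeout.inspect       # nil

# Zero or negative values are rejected.
error = begin
  Regexp.new("a+", timeout: 0)
  nil
rescue ArgumentError => e
  e
end
puts error.class                      # ArgumentError
```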
S-H-GAMELINKS
c4089e6524 Fix argument & Remove enum 2022-10-23 17:38:59 +09:00
S-H-GAMELINKS
1e06ef1328 Introduce rb_memsearch_with_char_size function 2022-10-23 17:38:59 +09:00
git
2dd1a037de * expand tabs. [ci skip]
Tabs were expanded because the file did not have any tab indentation in unedited lines.
Please update your editor config, and use misc/expand_tabs.rb in the pre-commit hook.
2022-10-10 13:22:15 +09:00
Nobuyoshi Nakada
0a98dd1cff
Should use dedicated function Check_Type 2022-10-10 13:21:57 +09:00
Vladimir Dementyev
4954c9fc0f Add MatchData#deconstruct/deconstruct_keys 2022-10-10 12:41:13 +09:00
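A brief sketch of what `MatchData#deconstruct` and `MatchData#deconstruct_keys` enable, assuming Ruby 3.2 or later. The time-of-day pattern and the group names `hour` and `min` are made up for the example.

```ruby
m = /(?<hour>\d+):(?<min>\d+)/.match("12:34")

# deconstruct returns the captures, for array patterns.
p m.deconstruct               # ["12", "34"]

# deconstruct_keys returns named captures with Symbol keys, for hash patterns.
p m.deconstruct_keys([:hour]) # {:hour=>"12"} (printed as {hour: "12"} on newer Rubies)

# Both hooks let a MatchData be used directly in pattern matching.
summary =
  case m
  in {hour:, min:}
    "#{hour}h #{min}m"
  end
puts summary                  # 12h 34m
```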
Nobuyoshi Nakada
c53667691a
[DOC] offset argument of Regexp#match 2022-08-18 23:25:05 +09:00
Aaron Patterson
e4e054e3ce Speed up setting the backref match object
This patch speeds up setting the backref match object by avoiding some
memcopies.  Take the following code for example:

```ruby
"hello world" =~ /hello/
p $~
```

When the RE matches the string, we have to set the Match object in the
backref global.  So we would allocate a match object[^1] and use
`rb_reg_region_copy`[^2] to make a deep copy of the stack allocated
`re_registers` struct[^3] into the newly created Ruby object.  This
could possibly trigger GC[^4], and would allocate new memory.

This patch makes a shallow copy of the `re_registers` struct onto the
Match object allowing the match object to manage the `re_registers`
pointer and also avoiding some calls to `xmalloc` and some manual
memcopy.

Benchmark looks like this:

```ruby

require "benchmark/ips"

def test_re thing
  thing =~ /hello/
end

Benchmark.ips do |x|
  x.report("re hit") do
    test_re "hello world"
  end

  x.report("re miss") do
    test_re "world"
  end
end
```

Before this patch:

```
$ ruby -v test.rb
ruby 3.2.0dev (2022-07-27T22:29:00Z master 4ad69899b7) [arm64-darwin21]
Ignoring bcrypt-3.1.16 because its extensions are not built. Try: gem pristine bcrypt --version 3.1.16
Warming up --------------------------------------
              re hit   345.401k i/100ms
             re miss   673.584k i/100ms
Calculating -------------------------------------
              re hit      3.452M (± 0.5%) i/s -     17.270M in   5.002535s
             re miss      6.736M (± 0.4%) i/s -     34.353M in   5.099593s
```

After this patch:

```
$ ./ruby -v test.rb
ruby 3.2.0dev (2022-08-01T21:24:12Z less-memcpy 0ff2a56606) [arm64-darwin21]
Warming up --------------------------------------
              re hit   419.578k i/100ms
             re miss   673.251k i/100ms
Calculating -------------------------------------
              re hit      4.201M (± 0.7%) i/s -     21.398M in   5.093593s
             re miss      6.716M (± 0.4%) i/s -     33.663M in   5.012756s
```

Matches get faster and misses maintain the same speed.

[^1]: 24204d54ab/re.c (L1737)
[^2]: 24204d54ab/re.c (L1738)
[^3]: 24204d54ab/re.c (L1686)
[^4]: 24204d54ab/re.c (L981)
2022-08-02 09:04:04 -07:00
Takashi Kokubun
5b21e94beb Expand tabs [ci skip]
[Misc #18891]
2022-07-21 09:42:04 -07:00
Kazuhiro NISHIYAMA
846a6bb60f
[DOC] Fix a typo [ci skip] 2022-06-26 14:17:14 +09:00
Jeremy Evans
596f4b0d3a Document that Regexp#source does not retain lexer escapes
Related to [Feature #18838]
2022-06-20 15:56:28 -07:00
Nobuyoshi Nakada
4a6facc2d6 [Feature #18788] [DOC] String options to Regexp.new
Co-Authored-By: Janosch Müller <janosch.mueller@betterplace.org>
2022-06-20 19:35:12 +09:00
Nobuyoshi Nakada
1e9939dae2 [Feature #18788] Support options as String to Regexp.new
`Regexp.new` now supports passing the regexp flags not only as an
`Integer`, but also as a `String`.  Unknown flags raise errors.
2022-06-20 19:35:12 +09:00
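A small sketch of the String form of the options argument described in the commit above, assuming Ruby 3.2 or later. The pattern `"abc"` and the invalid flag `"z"` are arbitrary choices.

```ruby
# Flags may be given as a String of option letters...
by_string = Regexp.new("abc", "mi")

# ...which is equivalent to OR-ing the Integer constants.
by_integer = Regexp.new("abc", Regexp::MULTILINE | Regexp::IGNORECASE)
puts by_string.options == by_integer.options  # true

# An unknown flag letter raises instead of being silently ignored.
error = begin
  Regexp.new("abc", "z")
  nil
rescue ArgumentError => e
  e
end
puts error.class                              # ArgumentError
```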
Nobuyoshi Nakada
ab2a43265c Warn suspicious flag to Regexp.new
Now the second argument should be `true`, `false`, `nil`, or an Integer.
This flag is sometimes confused with the third argument.
2022-06-20 19:35:12 +09:00
Nobuyoshi Nakada
7f8a915715
[DOC] Refine Regexp.new argument descriptions 2022-06-20 18:39:50 +09:00
Nobuyoshi Nakada
914c26eab3
[DOC] Regexp timeout is float or nil 2022-06-20 17:47:44 +09:00
Nobuyoshi Nakada
cd3a5cd0e3
[DOC] Fixed omissions in Regexp.new arguments 2022-06-20 09:26:11 +09:00
Jeremy Evans
ec3542229b
Ignore invalid escapes in regexp comments
Invalid escapes are handled at multiple levels.  The first level
is in parse.y, so skip invalid unicode escape checks for regexps
in parse.y.

Make rb_reg_preprocess and unescape_nonascii accept the regexp
options.  In unescape_nonascii, if the regexp is an extended
regexp, when "#" is encountered, ignore all characters until the
end of line or end of regexp.

Unfortunately, in extended regexps, you can use "#" as a non-comment
character inside a character class, so also parse "[" and "]"
specially for extended regexps, and only skip comments if "#" is
not inside a character class. Handle nested character classes as well.

This issue doesn't just affect extended regexps, it also affects
"(?#" comments inside all regexps.  So for those comments, scan
until trailing ")" and ignore content inside.

I'm not sure if there are other corner cases not handled.  A
better fix would be to redesign the regexp parser so that it
unescapes during parsing instead of before parsing, so you already
know the current parsing state.

Fixes [Bug #18294]

Co-authored-by: Nobuyoshi Nakada <nobu@ruby-lang.org>
2022-06-06 13:50:03 -07:00
Burdette Lamar
b41de3a1e8
[DOC] Enhanced RDoc for MatchData (#5822)
Treats:
    #to_s
    #named_captures
    #string
    #inspect
    #hash
    #==
2022-04-18 18:19:10 -05:00
Burdette Lamar
6db3f7c405
Enhanced RDoc for MatchData (#5821)
Treats:
    #[]
    #values_at
2022-04-18 15:52:07 -05:00
Burdette Lamar
86e23529ad
Enhanced RDoc for MatchData (#5820)
Treats:
    #pre_match
    #post_match
    #to_a
    #captures
2022-04-18 14:34:40 -05:00
Burdette Lamar
b074bc3d61
[DOC] Enhanced RDoc for MatchData (#5819)
Treats:
    #begin
    #end
    #match
    #match_length
2022-04-18 13:02:35 -05:00
Burdette Lamar
9d1dd7a9ed
[DOC] Enhanced RDoc for MatchData (#5818)
Treats:
    #regexp
    #names
    #size
    #offset
2022-04-18 11:31:30 -05:00
Burdette Lamar
51ea67698e
[DOC] Enhanced RDoc for Regexp (#5815)
Treats:
    ::new
    ::escape
    ::try_convert
    ::union
    ::last_match
2022-04-18 10:45:29 -05:00
Burdette Lamar
2b4b513ef0
[DOC] Enhanced RDoc for Regexp (#5812)
Treats:

    #fixed_encoding?
    #hash
    #==
    #=~
    #match
    #match?

Also, in regexp.rdoc:

    Changes heading from 'Special Global Variables' to 'Regexp Global Variables'.
    Add tiny section 'Regexp Interpolation'.
2022-04-16 15:20:03 -05:00
Burdette Lamar
e021754db0
[DOC] Enhanced RDoc for Regexp (#5807)
Treats:

    #source
    #inspect
    #to_s
    #casefold?
    #options
    #names
    #named_captures
2022-04-15 13:31:15 -05:00
Nobuyoshi Nakada
d8189ed23f
Return only captured range in MatchData [Bug #18670] 2022-03-31 18:01:15 +09:00
Yusuke Endoh
c499a4c28a re.c: stop a wrong warning of "flags ignored" on Regexp.new(//)
[Bug #18669]
2022-03-31 10:07:09 +09:00
Yusuke Endoh
5df2589b64 internal/ractor.h: Added
Currently it has only one function prototype.
2022-03-30 16:50:46 +09:00
Yusuke Endoh
2ade40276b re.c: raise Regexp::TimeoutError instead of RuntimeError 2022-03-30 16:50:46 +09:00
Yusuke Endoh
ce87bb8bd6 re.c: Add timeout keyword for Regexp.new and Regexp#timeout 2022-03-30 16:50:46 +09:00
Yusuke Endoh
ffc3b37f96 re.c: Add Regexp.timeout= and Regexp.timeout
[Feature #17837]
2022-03-30 16:50:46 +09:00
Shugo Maeda
c8817d6a3e
Add String#byteindex, String#byterindex, and MatchData#byteoffset (#5518)
* Add String#byteindex, String#byterindex, and MatchData#byteoffset [Feature #13110]

Co-authored-by: NARUSE, Yui <naruse@airemix.jp>
2022-02-19 19:10:00 +09:00
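A short comparison of the byte-based methods from the commit above with their character-based counterparts, assuming Ruby 3.2 or later. The sample string is arbitrary; it contains one two-byte UTF-8 character so the two kinds of offset differ.

```ruby
s = "h\u00e9llo"           # "héllo": the "é" takes two bytes in UTF-8

# Character index versus byte index of the first "l".
puts s.index("l")          # 2
puts s.byteindex("l")      # 3

# Byte index of the last "l".
puts s.byterindex("l")     # 4

# MatchData offsets in characters and in bytes.
m = /l+/.match(s)
p m.offset(0)              # [2, 4]
p m.byteoffset(0)          # [3, 5]
```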
Shugo Maeda
cda5aee74e
LONG2NUM() should be used for rmatch_offset::{beg,end}
https://github.com/ruby/ruby/pull/5518#discussion_r809645406
2022-02-18 22:13:45 +09:00
Nobuyoshi Nakada
16fdc1ff46
[DOC] Fix broken links to literals.rdoc 2022-02-08 01:27:52 +09:00