1847 commits

Author SHA1 Message Date
nagachika
726bff43b4 merge revision(s) c224ca4fea: [Backport #21172]
Fix a race condition with interned strings sweeping.

	[Bug #21172]

	This fixes a rare CI failure.

	The timeline of the race condition is:

	- A `"foo" oid=1` string is interned.
	- `"foo" oid=1` is no longer referenced and will be swept in the future.
	- Another `"foo" oid=2` string is interned.
	- `register_fstring` finds `"foo" oid=1`, but since it is about to be swept,
	  removes it from `fstring_table` and inserts `"foo" oid=2` instead.
	- `"foo" oid=1` is swept, since it has the `RSTRING_FSTR` flag,
	  a `st_delete` is issued in `fstring_table` which removes `"foo" oid=2`.

	I don't know how to reproduce this bug consistently in a single test
	case.
2025-03-16 18:02:45 +09:00
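The timeline in this commit message can be mimicked with a toy model in plain Ruby. This is only an illustrative sketch: `Fstr`, `table`, and the identity-guarded deletion are hypothetical names and logic, not the interpreter's data structures and not necessarily how the actual fix works.

```ruby
# Toy model: a Hash stands in for fstring_table, a Struct for interned strings.
Fstr = Struct.new(:content, :oid)

table = {}
stale = Fstr.new("foo", 1)
table[stale.content] = stale      # "foo" oid=1 is interned

# oid=1 is now unreferenced; a second "foo" is interned and replaces the entry.
fresh = Fstr.new("foo", 2)
table[fresh.content] = fresh

# Naive sweep of oid=1: delete by content, which removes oid=2 by mistake.
naive = table.dup
naive.delete(stale.content)

# Guarded sweep (assumed mitigation): delete only if the entry is the swept object.
guarded = table.dup
guarded.delete(stale.content) if guarded[stale.content].equal?(stale)

puts "naive table empty?   #{naive.empty?}"
puts "guarded table keeps: oid=#{guarded["foo"].oid}"
```

In the naive variant the live string silently disappears from the table, which matches the last step of the timeline.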
Takashi Kokubun
1e48631e0f merge revision(s) 02b70256b5, 6b4f8945d6: [Backport #20909]
Check negative integer underflow

	Many Oniguruma functions need validly encoded strings
2025-01-14 17:50:24 -08:00
Jean byroot Boussier
d1ffd5ecfa
String.new(capacity:) don't subtract termlen (#11027)
[Bug #20585]

This was changed in 36a06efdd9 because
`String.new(capacity: 1024)` would end up allocating `1025` bytes, but the problem
with this change is that the caller may be trying to right-size a String.

So instead, we should just better document the behavior of `capacity:`.

Co-authored-by: Jean Boussier <jean.boussier@gmail.com>
2024-06-20 10:39:20 -07:00
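As a rough illustration of the "right-sizing" use case mentioned in this commit, a caller might preallocate a buffer before filling it. The figure of 1024 bytes and the chunk contents are arbitrary assumptions; the capacity itself is not observable from Ruby code.

```ruby
# Preallocate roughly the space we expect to fill (1024 is an arbitrary choice).
buf = String.new(capacity: 1024)

# Append chunks; with enough capacity the buffer should not need to grow.
64.times { |i| buf << "chunk #{i}\n" }

puts "bytes written: #{buf.bytesize}"
```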
Takashi Kokubun
548c7cb9f5 merge revision(s) 7e4b1f8e19: [Backport #20322]
[Bug #20322] Fix rb_enc_interned_str_cstr null encoding

	The documentation for `rb_enc_interned_str_cstr` notes that `enc` can be
	a null pointer, but this currently causes a segmentation fault when
	trying to autoload the encoding. This commit fixes the issue by checking
	for NULL before calling `rb_enc_autoload`.
2024-05-29 11:07:07 -07:00
Takashi Kokubun
f12c947192 merge revision(s) e04146129e: [Backport #20292]
[Bug #20292] Truncate embedded string to new capacity
2024-05-29 10:19:49 -07:00
Takashi Kokubun
b77b5c1915 merge revision(s) 5e0c171451: [Backport #20169]
Make io_fwrite safe for compaction

	[Bug #20169]

	Embedded strings are not safe for system calls without the GVL because
	compaction can cause pages to be locked causing the operation to fail
	with EFAULT. This commit changes io_fwrite to use rb_str_tmp_frozen_no_embed_acquire,
	which guarantees that the returned string is not embedded.
2024-05-28 14:22:45 -07:00
NARUSE, Yui
ce372be903
merge revision(s) ade56737e2: [Backport #20190] (#10300)
Fix coderange of invalid_encoding_string.<<(ord)

	Appending a validly encoded character can change the coderange from invalid to valid.
	Example: "\x95".force_encoding('sjis')<<0x5C will be a valid string "\x{955C}"
2024-03-20 13:40:46 +00:00
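The example in this commit message can be tried out as below. This is a sketch, not the commit's test: the `reference` string is an added assumption, built from the same bytes so that its validity is computed from scratch. What `after` reports depends on whether the interpreter includes the fix.

```ruby
# "\x95" alone is an incomplete Shift_JIS sequence (lead byte without trail byte).
s = "\x95".b.force_encoding(Encoding::Shift_JIS)
before = s.valid_encoding?   # also caches the "broken" coderange on s

s << 0x5C                    # append code point 0x5C in the string's encoding
after = s.valid_encoding?    # depends on whether the interpreter has the fix

# A fresh string with the same bytes has no cached coderange.
reference = s.bytes.pack("C*").force_encoding(Encoding::Shift_JIS).valid_encoding?

puts "before append: #{before}"
puts "after append:  #{after}"
puts "fresh copy:    #{reference}"
```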
NARUSE, Yui
fafe5db732
merge revision(s) b3d6128049: [Backport #20150] (#10253)
Fix memory leak in grapheme clusters

	[Bug #20150]

	String#grapheme_clusters and String#each_grapheme_cluster leak memory
	because if the string is not UTF-8, then the created regex will not
	be freed.

	For example:

	    str = "hello world".encode(Encoding::UTF_32LE)

	    10.times do
	      1_000.times do
	        str.grapheme_clusters
	      end

	      puts `ps -o rss= -p #{$$}`
	    end

	Before:

	    26000
	    42256
	    59008
	    75792
	    92528
	    109232
	    125936
	    142672
	    159392
	    176160

	After:

	    9264
	    9504
	    9808
	    10000
	    10128
	    10224
	    10352
	    10544
	    10704
	    10896
	---
	 string.c                 | 98 +++++++++++++++++++++++++++++++-----------------
	 test/ruby/test_string.rb | 11 ++++++
	 2 files changed, 75 insertions(+), 34 deletions(-)
2024-03-14 14:18:15 +00:00
Peter Zhu
7002e77694 Fix Symbol#inspect for GC compaction
The test fails when RGENGC_CHECK_MODE is turned on:

    1) Failure:
    TestSymbol#test_inspect_under_gc_compact_stress [test/ruby/test_symbol.rb:123]:
    <":testing"> expected but was
    <":\x00\x00\x00\x00\x00\x00\x00">.
2023-12-24 21:29:40 -05:00
Peter Zhu
50bf437341 Fix String#sub for GC compaction
The test fails when RGENGC_CHECK_MODE is turned on:

    TestString#test_sub_gc_compact_stress = 9.42 s
    1) Failure:
    TestString#test_sub_gc_compact_stress [test/ruby/test_string.rb:2089]:
    <"aaa [amp] yyy"> expected but was
    <"aaa [] yyy">.
2023-12-23 18:00:27 -05:00
Nobuyoshi Nakada
ab7f54688b
Stir the hash value more with encoding index 2023-12-17 00:30:00 +09:00
Nobuyoshi Nakada
b710f96b5a
[Bug #20068] Encoding does not matter to empty strings 2023-12-16 16:00:12 +09:00
Jeremy Evans
0d53dba7ce Make String#chomp! raise ArgumentError for 2+ arguments if string is empty
String#chomp! returned nil without checking the number of passed
arguments in this case.
2023-12-13 07:05:21 -08:00
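A small sketch of the behaviour described in this commit. The non-empty case raises `ArgumentError` because the arity check runs; for the empty receiver the outcome depends on whether the interpreter includes this change, so the code only reports it.

```ruby
# Non-empty receiver: the argument count is checked, so two arguments raise.
non_empty_raises =
  begin
    "abc".dup.chomp!("a", "b")
    false
  rescue ArgumentError
    true
  end

# Empty receiver: nil without the change, ArgumentError with it.
empty_result =
  begin
    "".dup.chomp!("a", "b")
  rescue ArgumentError => e
    e.class
  end

puts "non-empty raises ArgumentError: #{non_empty_raises}"
puts "empty receiver gives: #{empty_result.inspect}"
```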
Peter Zhu
ee0eca191f Make String#undump compaction safe 2023-12-01 15:04:31 -05:00
Peter Zhu
80ea7fbad8 Pin embedded shared strings
Embedded shared strings cannot be moved because strings point into the
slot of the shared string. There may be code using the RSTRING_PTR on
the stack, which would pin the string but not pin the shared string,
causing it to move.
2023-12-01 15:04:31 -05:00
Peter Zhu
3d908a41ab Guard match from GC in String#gsub
We need to guard match from GC because otherwise it could end up being
reclaimed or moved in compaction.
2023-11-29 19:21:40 -05:00
Peter Zhu
94015e0dce Guard match from GC when scanning string
We need to guard match from GC because otherwise it could end up being
reclaimed or moved in compaction.
2023-11-27 16:49:52 -05:00
Jean Boussier
83c385719d Specialize String#dup
`String#+@` is 2-3 times faster than `String#dup` because it can
directly go through `rb_str_dup` instead of using the generic
much slower `rb_obj_dup`.

This fact led to the existence of the ugly `Performance/UnfreezeString`
rubocop performance rule that encourages users to rewrite the much
more readable and convenient `"foo".dup` into the ugly `(+"foo")`.

Let's make that rubocop rule useless.

```
compare-ruby: ruby 3.3.0dev (2023-11-20T02:02:55Z master 701b0650de) [arm64-darwin22]
last_commit=[ruby/prism] feat: add encoding for IBM865 (https://github.com/ruby/prism/pull/1884)
built-ruby: ruby 3.3.0dev (2023-11-20T12:51:45Z faster-str-lit-dup 6b745bbc5d) [arm64-darwin22]
warming up..

|       |compare-ruby|built-ruby|
|:------|-----------:|---------:|
|uplus  |     16.312M|   16.332M|
|       |           -|     1.00x|
|dup    |      5.912M|   16.329M|
|       |           -|     2.76x|
```
2023-11-20 14:33:20 +01:00
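For readers unfamiliar with the two spellings compared in this commit, the following sketch shows that they produce equivalent results for a frozen string. The variable names are illustrative only, and no timing is claimed here.

```ruby
frozen = "foo".freeze

via_uplus = +frozen     # String#+@ returns an unfrozen copy when the receiver is frozen
via_dup   = frozen.dup  # String#dup returns an unfrozen copy

puts via_uplus.frozen?            # both copies are mutable
puts via_dup.frozen?
puts via_uplus == via_dup         # and have the same content
```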
Jean Boussier
ea1b1ea1aa String#force_encoding don't clear coderange if encoding is unchanged
Some code out there blindly calls `force_encoding` without checking
what the original encoding was, which clears the coderange uselessly.

If the String is big, it can be a rather costly mistake.

For instance the `rack-utf8_sanitizer` gem does this on request
bodies.
2023-11-09 12:38:10 +01:00
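Before this change, callers who wanted to avoid the cost had to guard the call themselves, along the lines of the sketch below. `ensure_utf8` is a hypothetical helper written for illustration; it is not taken from `rack-utf8_sanitizer`.

```ruby
# Hypothetical helper: only touch the string when its encoding actually differs.
def ensure_utf8(body)
  body.force_encoding(Encoding::UTF_8) unless body.encoding == Encoding::UTF_8
  body
end

already_utf8 = "payload".dup   # UTF-8 by default, left untouched
binary       = "payload".b     # ASCII-8BIT copy, gets re-tagged

puts ensure_utf8(already_utf8).encoding
puts ensure_utf8(binary).encoding
```

With the commit applied, the unguarded `force_encoding` call should be just as cheap when the encoding is unchanged.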
Nobuyoshi Nakada
1910bd4247
String for string literal is not resizable 2023-11-08 00:59:45 +09:00
Jean Boussier
ac8ec004e5 Make String.new size pools aware.
If the required capacity would fit in an embedded string,
return one.

This can reduce malloc churn for code that uses string buffers.
2023-11-02 23:34:58 +01:00
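One way to observe the effect from Ruby code is `ObjectSpace.memsize_of` from the `objspace` standard library. The capacities below are arbitrary assumptions, and the reported sizes vary by Ruby version and platform, so no expected numbers are given.

```ruby
require "objspace"

# Map each requested capacity to the memory the resulting String reports.
sizes = [8, 64, 1024].to_h do |capa|
  [capa, ObjectSpace.memsize_of(String.new(capacity: capa))]
end

sizes.each do |capa, bytes|
  puts "capacity: #{capa} -> memsize_of: #{bytes} bytes"
end
```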
Nobuyoshi Nakada
50520cc193
[DOC] Missing comment markers 2023-09-27 16:18:05 +09:00
Nobuyoshi Nakada
6b66b5fded [Bug #19902] Update the coderange regarding the changed region 2023-09-26 15:35:40 +09:00
John Hawthorn
d89b15cdce Use end of char boundary in start_with?
Previously we used the next character following the found prefix to
determine if the match ended on a broken character.

This had caused surprising behaviour when a valid character was followed
by a UTF-8 continuation byte.

This commit changes the behaviour to instead look for the end of the
last character in the prefix.

[Bug #19784]

Co-authored-by: ywenc <ywenc@github.com>
Co-authored-by: Nobuyoshi Nakada <nobu@ruby-lang.org>
2023-09-01 16:23:28 -07:00
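A minimal sketch of the kind of input this commit is concerned with: a prefix that ends in the middle of a multi-byte character. The strings are chosen for illustration and are not taken from the bug report; the second result is printed rather than asserted because it depends on the interpreter's boundary check.

```ruby
text = "\u3042\u3044"        # two characters, three bytes each in UTF-8
whole_char_prefix = "\u3042" # ends exactly on a character boundary
partial_prefix = "\xE3"      # only the first byte of the first character (broken UTF-8)

on_boundary  = text.start_with?(whole_char_prefix)
mid_char     = text.start_with?(partial_prefix)

puts "prefix on a character boundary: #{on_boundary}"
puts "prefix ending mid-character:    #{mid_char}"
```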
Nobuyoshi Nakada
b054c2fe06 [Bug #19784] Fix behaviors against prefix with broken encoding
- String#start_with?
- String#delete_prefix
- String#delete_prefix!
2023-08-26 08:58:02 +09:00
Nobuyoshi Nakada
00ac3a64ba Introduce at_char_boundary function 2023-08-26 08:58:02 +09:00
Alan Wu
2214bcb70d Fix premature string collection during append
Previously, the following crashed due to use-after-free
with AArch64 Alpine Linux 3.18.3 (aarch64-linux-musl):

```ruby
str = 'a' * (32*1024*1024)
p({z: str})
```

32 MiB is the default for `GC_MALLOC_LIMIT_MAX`, and the crash
could be dodged by setting `RUBY_GC_MALLOC_LIMIT_MAX` to large values.
Under a debugger, one can see the `str2` of rb_str_buf_append()
getting prematurely collected while str_buf_cat4() allocates capacity.

Add GC guards so the buffer of `str2` lives across the GC run
initiated in str_buf_cat4().

[Bug #19792]
2023-08-23 18:07:49 -04:00
Peter Zhu
837c12b0c8 Use STR_EMBED_P instead of testing STR_NOEMBED 2023-08-22 16:31:36 -04:00
Peter Zhu
724223b4ca Don't check for STR_NOEMBED in rb_fstring
We don't need to check for STR_NOEMBED because the check above for
STR_EMBED_P means that it can never be false.
2023-08-18 09:24:45 -04:00
Burdette Lamar
0e162457d6
[DOC] Don't suppress autolinks (#8208) 2023-08-11 19:22:21 -04:00
Kunshan Wang
132f097149 No computing embed_capa_max in str_subseq
Fix str_subseq so that it does not attempt to predict the size of the
object returned by str_alloc_heap.
2023-08-03 14:52:44 -04:00
Nobuyoshi Nakada
af04e26924
Fill terminator properly 2023-07-28 22:17:53 +09:00
alexandre184
e5825de7c9
[Bug #19769] Fix range of size 1 in String#tr 2023-07-15 16:36:53 +09:00
Nobuyoshi Nakada
9dcdffb8bf
Make the string index functions closer to symmetric
So that irregular parts may be more noticeable.
2023-07-09 18:45:51 +09:00
Nobuyoshi Nakada
5e79d5a560
Make rb_str_rindex return byte index
Leave callers to convert byte index to char index, as well as
`rb_str_index`, so that `rb_str_rpartition` does not need to
re-convert char index to byte index.
2023-07-09 16:39:28 +09:00
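The distinction between byte index and char index in this commit is about C-level functions, but the same two notions are visible from Ruby. The sketch below assumes Ruby 3.2 or later, where `String#byterindex` is available.

```ruby
s = "\u3042\u3044\u3046"   # three characters, three bytes each in UTF-8
needle = "\u3046"

char_idx = s.rindex(needle)       # index counted in characters
byte_idx = s.byterindex(needle)   # index counted in bytes

# Converting a byte index to a char index means counting the characters before it.
converted = s.byteslice(0, byte_idx).length

puts "char index: #{char_idx}"
puts "byte index: #{byte_idx}"
puts "converted:  #{converted}"
```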
Nobuyoshi Nakada
e2257831ab
[Bug #19763] Raise same message exception for regexp 2023-07-09 16:21:02 +09:00
Nobuyoshi Nakada
3d7a6bbc12 Ensure the byte position is a valid boundary 2023-06-28 22:42:04 +09:00
Nobuyoshi Nakada
bc3ac1872e [Bug #19748] Fix out-of-bound access in String#byteindex 2023-06-28 17:23:32 +09:00
Nobuyoshi Nakada
0cbfeb8210 [Bug #19746] String#index with regexp should clear $~ unless matched 2023-06-28 14:06:28 +09:00
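A short sketch of the situation this commit title describes. The value of `$~` after the failed search depends on whether the interpreter includes the fix, so it is printed rather than asserted.

```ruby
"abc" =~ /b/                 # a successful match sets $~
stale = $~

result = "abc".index(/z/)    # unsuccessful search with a regexp
last_match = $~              # should be nil once the fix is in place

puts "earlier match: #{stale.inspect}"
puts "index result:  #{result.inspect}"
puts "$~ afterwards: #{last_match.inspect}"
```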
Burdette Lamar
932dd9f10e
[DOC] Regexp doc (#7923) 2023-06-20 09:28:21 -04:00
Matt Valentine-House
d54f66d1b4 Assign into optimal size pools using String#split("")
When String#split is used with an empty string as the field separator, it
effectively splits the original string into chars, and there is a
pre-existing fast path for this using SPLIT_TYPE_CHARS.

However this path creates an empty array in the smallest size pool and
grows from there, despite already knowing the size of the desired array.

This commit pre-allocates the correct size array in this case in order
to allow the arrays to be embedded and avoid being allocated in the
transient heap
2023-06-09 10:54:40 +01:00
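The fast path this commit optimizes is reached from Ruby code like the following. The sketch only shows the observable behaviour; the allocation strategy is internal and cannot be seen here.

```ruby
word = "hello"

chars_via_split = word.split("")   # empty separator: split into characters
same_as_chars   = chars_via_split == word.chars

p chars_via_split
puts "same as String#chars: #{same_as_chars}"
```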
Peter Zhu
7577c101ed
Unify length field for embedded and heap strings (#7908)
* Unify length field for embedded and heap strings

The length field is of the same type and position in RString for both
embedded and heap allocated strings, so we can unify it.

* Remove RSTRING_EMBED_LEN
2023-06-06 10:19:20 -04:00
Peter Zhu
1a7ee14578 [DOC] Update flags doc for strings
The length of an embedded string is no longer in the flags.
2023-06-05 09:49:35 -04:00
Peter Zhu
a16cffe384 Simplify duplicated code
The capacity of the string can be calculated using the str_capacity
function.
2023-06-01 08:32:29 -04:00
Peter Zhu
8a8618d4f3 Don't refetch ptr and len
The call to RSTRING_GETMEM already fetched the pointer and length, so we
don't need to fetch it again.
2023-06-01 08:32:29 -04:00
Peter Zhu
c37ebfe08f Remove dead code in string.c
The STR_DEC_LEN macro is not used.
2023-05-26 13:34:26 -04:00
Matt Valentine-House
026321c5b9 [Feature #19474] Refactor NEWOBJ macros
NEWOBJ_OF is now our canonical newobj macro. It takes an optional ec
2023-04-06 11:07:16 +01:00
Peter Zhu
1da2e7fca3
[Feature #19579] Remove !USE_RVARGC code (#7655)
Remove !USE_RVARGC code

[Feature #19579]

The Variable Width Allocation feature was turned on by default in Ruby
3.2. Since then, we haven't received bug reports or backports to the
non-Variable Width Allocation code paths, so we assume that nobody is
using it. We also don't plan on maintaining the non-Variable Width
Allocation code, so we are going to remove it.
2023-04-04 17:30:06 -04:00
Takashi Kokubun
32e0c97dfa RJIT: Optimize String#bytesize 2023-03-18 23:35:42 -07:00
Takashi Kokubun
233ddfac54 Stop exporting symbols for MJIT 2023-03-06 21:59:23 -08:00