1799 commits

Author SHA1 Message Date
Nobuyoshi Nakada
e26e8423b5 Suppress gcc 15 unterminated-string-initialization warnings 2025-07-24 14:39:20 +09:00
nagachika
937319126a merge revision(s) 02b70256b5, 6b4f8945d6: [Backport #20909]
Check negative integer underflow

	Many of Oniguruma functions need valid encoding strings
2024-11-30 18:34:32 +09:00
nagachika
a6b7aad954 merge revision(s) 7e4b1f8e19: [Backport #20322]
[Bug #20322] Fix rb_enc_interned_str_cstr null encoding

	The documentation for `rb_enc_interned_str_cstr` notes that `enc` can be
	a null pointer, but this currently causes a segmentation fault when
	trying to autoload the encoding. This commit fixes the issue by checking
	for NULL before calling `rb_enc_autoload`.
2024-07-15 13:40:01 +09:00
nagachika
0cb1e753ca Revert "merge revision(s) 5e0c171451: [Backport #20169]"
This reverts commit 6b73406833.
2024-07-15 11:55:41 +09:00
nagachika
b5e554d03a Revert "merge revision(s) e04146129e, d5080f6e8b: [Backport #20292]"
This reverts commit a54c717c7a.
2024-07-15 11:08:50 +09:00
nagachika
8051a6d385 Revert "follow-up for a54c717c7a."
This reverts commit 715633ba6e.
2024-07-15 11:07:31 +09:00
nagachika
715633ba6e follow-up for a54c717c7a. 2024-07-15 10:41:21 +09:00
nagachika
a54c717c7a merge revision(s) e04146129e, d5080f6e8b: [Backport #20292]
[Bug #20292] Truncate embedded string to new capacity

	Fix -Wsign-compare on String#initialize

	../string.c:1886:57: warning: comparison of integer expressions of different signedness: ‘size_t’ {aka ‘long unsigned int’} and ‘long int’ [-Wsign-compare]
	 1886 |                 if (STR_EMBED_P(str)) RUBY_ASSERT(osize <= str_embed_capa(str));
	      |                                                         ^~
2024-07-15 09:26:25 +09:00
nagachika
6b73406833 merge revision(s) 5e0c171451: [Backport #20169]
Make io_fwrite safe for compaction

	[Bug #20169]

	Embedded strings are not safe for system calls without the GVL because
	compaction can cause pages to be locked causing the operation to fail
	with EFAULT. This commit changes io_fwrite to use rb_str_tmp_frozen_no_embed_acquire,
	which guarantees that the return string is not embedded.
2024-07-15 08:50:38 +09:00
Jean Boussier
449899b383 Fix String#index to clear MatchData when a regexp is passed
[Bug #20421]

The bug was fixed in Ruby 3.3 via 9dcdffb8bf
2024-05-14 09:29:21 +02:00
nagachika
4f3ed07d5b merge revision(s) ade56737e2: [Backport #20190]
Fix coderange of invalid_encoding_string.<<(ord)

	Appending a validly encoded character can change the coderange from invalid to valid.
	Example: "\x95".force_encoding('sjis')<<0x5C will be a valid string "\x{955C}"
	---
	 string.c                 | 6 +++++-
	 test/ruby/test_string.rb | 3 +++
	 2 files changed, 8 insertions(+), 1 deletion(-)
2024-03-31 17:18:55 +09:00
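The Shift_JIS case from the commit message above can be reproduced with a short script. This is a minimal sketch: the helper name `append_trail_byte` is made up for illustration, and the value printed for the appended string depends on whether the running Ruby includes this fix, so no output is claimed for that line.

```ruby
# Hypothetical helper (not part of Ruby): builds the string from the
# commit message and appends the code point 0x5C to it.
def append_trail_byte
  # "\x95" is a lone Shift_JIS lead byte, so the string starts out invalid.
  str = "\x95".b.force_encoding(Encoding::Shift_JIS)
  valid_before = str.valid_encoding? # also caches the "broken" coderange
  str << 0x5C                        # append a single code point
  [valid_before, str]
end

valid_before, str = append_trail_byte
puts "valid before append: #{valid_before}"
puts "bytes after append:  #{str.bytes.map { |b| format('%02X', b) }.join(' ')}"
# Version dependent: a Ruby without the fix may still report the cached state.
puts "valid after append:  #{str.valid_encoding?}"

# A copy rebuilt from the raw bytes has no cached coderange and is rescanned.
fresh = str.bytes.pack("C*").force_encoding(Encoding::Shift_JIS)
puts "valid when rescanned: #{fresh.valid_encoding?}"
```

Comparing the last two lines is a quick way to tell whether the interpreter in use carries the backport.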
nagachika
b4f8623441 merge revision(s) b3d6128049: [Backport #20150]
Fix memory leak in grapheme clusters

	[Bug #20150]

	String#grapheme_clusters and String#each_grapheme_cluster leak memory
	because if the string is not UTF-8, then the created regex will not
	be freed.

	For example:

	    str = "hello world".encode(Encoding::UTF_32LE)

	    10.times do
	      1_000.times do
	        str.grapheme_clusters
	      end

	      puts `ps -o rss= -p #{$$}`
	    end

	Before:

	    26000
	    42256
	    59008
	    75792
	    92528
	    109232
	    125936
	    142672
	    159392
	    176160

	After:

	    9264
	    9504
	    9808
	    10000
	    10128
	    10224
	    10352
	    10544
	    10704
	    10896
	---
	 string.c                 | 98 +++++++++++++++++++++++++++++++-----------------
	 test/ruby/test_string.rb | 11 ++++++
	 2 files changed, 75 insertions(+), 34 deletions(-)
2024-01-18 11:50:31 +09:00
nagachika
ddbab4f837 merge revision(s) 6b66b5fded: [Backport #19902]
[Bug #19902] Update the coderange regarding the changed region

	---
	 ext/-test-/string/set_len.c       | 10 ++++++++++
	 string.c                          | 27 +++++++++++++++++++++++++++
	 test/-ext-/string/test_set_len.rb | 29 +++++++++++++++++++++++++++++
	 3 files changed, 66 insertions(+)
2023-09-30 13:51:18 +09:00
nagachika
d30781db4d merge revision(s) 2214bcb70d: [Backport #19792]
Fix premature string collection during append

	Previously, the following crashed due to use-after-free
	with AArch64 Alpine Linux 3.18.3 (aarch64-linux-musl):

	```ruby
	str = 'a' * (32*1024*1024)
	p({z: str})
	```

	32 MiB is the default for `GC_MALLOC_LIMIT_MAX`, and the crash
	could be dodged by setting `RUBY_GC_MALLOC_LIMIT_MAX` to large values.
	Under a debugger, one can see the `str2` of rb_str_buf_append()
	getting prematurely collected while str_buf_cat4() allocates capacity.

	Add GC guards so the buffer of `str2` lives across the GC run
	initiated in str_buf_cat4().

	[Bug #19792]
	---
	 string.c | 2 ++
	 1 file changed, 2 insertions(+)
2023-09-30 13:07:35 +09:00
nagachika
65d294ad01 merge revision(s) bc3ac1872e: [Backport #19748]
[Bug #19748] Fix out-of-bound access in `String#byteindex`

	---
	 string.c                 | 17 +++++++----------
	 test/ruby/test_string.rb |  3 +++
	 2 files changed, 10 insertions(+), 10 deletions(-)
2023-07-22 13:39:44 +09:00
NARUSE, Yui
b309c246ee merge revision(s) d78ae78fd7: [Backport #19468]
rb_str_modify_expand: clear the string coderange

	[Bug #19468]

	b0b9f7201a erroneously stopped
	clearing the coderange.

	Since `rb_str_modify` clears it, `rb_str_modify_expand`
	should too.
	---
	 string.c | 1 +
	 1 file changed, 1 insertion(+)
2023-03-17 10:56:18 +09:00
NARUSE, Yui
40e0b1e123 merge revision(s) 9726736006: [Backport #19327]
Set STR_SHARED_ROOT flag on root of string

	---
	 string.c | 1 +
	 1 file changed, 1 insertion(+)
2023-01-31 23:46:50 +09:00
NARUSE, Yui
373e62248c merge revision(s) f7b72462aa: [Backport #19356]
String#bytesplice should return self

	In Feature #19314, we concluded that the return value of String#bytesplice
	should be changed from the source string to the receiver, because the source
	string is useless and confusing when extra arguments are added.

	This change should be included in Ruby 3.2.1.
	---
	 string.c                 | 4 ++--
	 test/ruby/test_string.rb | 2 +-
	 2 files changed, 3 insertions(+), 3 deletions(-)
2023-01-20 12:24:24 +09:00
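A small script can show which object `String#bytesplice` returns on a given interpreter. This is only a sketch: `bytesplice_demo` is a hypothetical helper, and the method is guarded because `String#bytesplice` exists only from Ruby 3.2 onward.

```ruby
# Hypothetical helper (not part of Ruby): reports what bytesplice returns.
def bytesplice_demo
  receiver = +"hello"
  # String#bytesplice was added in Ruby 3.2; skip quietly on older versions.
  return nil unless receiver.respond_to?(:bytesplice)

  # Replace the first two bytes of the receiver with "HE".
  returned = receiver.bytesplice(0, 2, "HE")
  {
    receiver: receiver,
    returned: returned,
    returns_self: returned.equal?(receiver) # expected true once this change is in
  }
end

p bytesplice_demo
```

The receiver is modified in place either way; only the return value differs between a Ruby with and without this change.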
NARUSE, Yui
6a8fcb5021 merge revision(s) 3be2acfafd: [Backport #19327]
Fix re-embedding of strings during compaction

	The reference updating code for strings is not re-embedding strings
	because the code is incorrectly wrapped inside of a
	`if (STR_SHARED_P(obj))` clause. Shared strings can't be re-embedded
	so this ends up being a no-op. This means that strings can be moved to a
	large size pool during compaction, but won't be re-embedded, which would
	waste the space.
	---
	 gc.c                         | 16 +++++++++-------
	 string.c                     | 12 ++++++++----
	 test/ruby/test_gc_compact.rb |  8 ++++----
	 3 files changed, 21 insertions(+), 15 deletions(-)
2023-01-19 21:52:47 +09:00
NARUSE, Yui
686b38f83e merge revision(s) d8ef0a98c6: [Backport #19319]
[Bug #19319] Fix crash in rb_str_casemap

	The following code crashes on my machine:

	```
	GC.stress = true

	str = "testing testing testing"

	puts str.capitalize
	```

	We need to ensure that the object `buffer_anchor` remains on the stack
	so it does not get GC'd.
	---
	 string.c | 2 ++
	 1 file changed, 2 insertions(+)
2023-01-19 11:59:43 +09:00
Nobuyoshi Nakada
98fbebf110 [DOC] Fix typo 2022-12-22 00:01:18 +09:00
S-H-GAMELINKS
1a64d45c67 Introduce encoding check macro 2022-12-02 01:31:27 +09:00
Jeremy Evans
571d21fd4a Make String#rstrip{,!} raise Encoding::CompatibilityError for broken coderange
It's questionable whether we want to allow rstrip to work for strings
where the broken coderange occurs before the trailing whitespace and
not after, but this approach is probably simpler, and I don't think
users should expect string operations like rstrip to work on broken
strings.

In some cases, this changes rstrip to raise
Encoding::CompatibilityError instead of ArgumentError.  However, as
the problem is related to an encoding issue in the receiver, and
not due to an issue with an argument, I think
Encoding::CompatibilityError is the more appropriate error.

Fixes [Bug #18931]
2022-11-24 18:24:42 -08:00
S-H-GAMELINKS
1f4f6c9832 Using UNDEF_P macro 2022-11-16 18:58:33 +09:00
Takashi Kokubun
e7443dbbca Rewrite Symbol#to_sym and #intern in Ruby (#6683) 2022-11-15 21:34:30 -08:00
Peter Zhu
710c1ada84 Use string's capacity to determine if reembeddable
During auto-compaction, using length to determine whether or not a
string can be re-embedded may be a problem for newly created strings.
This is because usually it requires a malloc before setting the length.
If the malloc triggers compaction, then the string may be re-embedded
and can cause crashes.
2022-11-14 16:59:43 -05:00
Peter Zhu
0468136a1b Make str_alloc_heap return a STR_NOEMBED string
This commit refactors str_alloc_heap to return a string with the
STR_NOEMBED flag set.
2022-11-03 09:09:11 -04:00
Vaevictusnet
7726f6bfff Correcting example for swapcase! method
Line 3 of the swapcase! example was incorrect: it implied that swapcase! did /not/ change the starting string.
2022-10-04 10:07:01 +09:00
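The behaviour the corrected documentation describes can be checked directly. A minimal sketch follows; `swapcase_demo` is a hypothetical helper name and the sample strings are arbitrary.

```ruby
# Hypothetical helper (not part of Ruby): exercises String#swapcase!.
def swapcase_demo
  s = +"Hello World!"
  result = s.swapcase!       # mutates s and returns it
  unchanged = +"123"
  no_change = unchanged.swapcase! # nothing to swap, so nil is returned
  [s, result, no_change]
end

s, result, no_change = swapcase_demo
p s         # => "hELLO wORLD!"
p result    # => "hELLO wORLD!"
p no_change # => nil
```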
Peter Zhu
28a572f8bf Fix bug when slicing a string with broken encoding
Commit aa2a428 introduced a bug where non-embedded string slices copied
the encoding of the original string. If the original string had a broken
encoding but the slice has valid encoding, then the slice would be
incorrectly marked as broken encoding.
2022-09-28 09:05:23 -04:00
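The situation described above, a valid slice taken from a string whose encoding is broken, looks like this from Ruby. This is a sketch only: `slice_of_broken_string` is a hypothetical helper, and the length of 100 is an arbitrary choice meant to keep the string too long to be embedded.

```ruby
# Hypothetical helper (not part of Ruby): slices a valid tail out of a
# string that starts with an invalid UTF-8 byte.
def slice_of_broken_string
  original = "\xFF".b.force_encoding(Encoding::UTF_8) + ("a" * 100)
  slice = original[1, 100] # skip the invalid byte, keep the 100 "a"s
  [original, slice]
end

original, slice = slice_of_broken_string
p original.valid_encoding? # => false
p slice.valid_encoding?    # => true
```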
Peter Zhu
6f8d17e43c Make string slices views rather than copies
Just like commit 1c16645 for arrays, this commit changes string slices
to be a view rather than a copy even if it can be allocated through VWA.
2022-09-28 09:05:23 -04:00
Peter Zhu
aa2a428cfb Refactor str_substr and str_subseq
This commit extracts common code between str_substr and rb_str_subseq
into a function called str_subseq.

This commit also applies optimizations in commit 2e88bca to
rb_str_subseq.
2022-09-26 14:54:32 -04:00
Jean Boussier
2e88bca24f string.c: don't create a frozen copy for str_new_shared
str_new_shared already has all the necessary logic to do this
and is also smart enough to skip this step if the source string
is already a shared string itself.

This saves a useless String allocation on each call.
2022-09-26 13:41:17 +02:00
Kazuki Yamaguchi
5b0396473b Fix coderange calculation in String#b
Leave the new coderange unknown if the original encoding is not
ASCII-compatible. Non-ASCII-compatible encoding strings with valid or
broken coderange can end up as ascii-only.

Fixes 9a8f6e392f ("Cheaply derive code range for String#b return
value", 2022-07-25).
2022-09-26 16:44:46 +09:00
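To see why a string in a non-ASCII-compatible encoding can become ASCII-only after `String#b`, consider UTF-16LE text made of ASCII letters. The sketch below uses a hypothetical helper, `b_of_utf16`, with an arbitrary sample string.

```ruby
# Hypothetical helper (not part of Ruby): compares a UTF-16LE string with
# its binary copy.
def b_of_utf16(text)
  utf16 = text.encode(Encoding::UTF_16LE)
  bin = utf16.b
  {
    source_ascii_only: utf16.ascii_only?, # UTF-16LE is not ASCII-compatible
    binary_bytes: bin.bytes,
    binary_ascii_only: bin.ascii_only?    # every byte is below 0x80
  }
end

p b_of_utf16("ab")
# => {source_ascii_only: false, binary_bytes: [97, 0, 98, 0], binary_ascii_only: true}
# (hash formatting varies slightly between Ruby versions)
```

Because the receiver is never ASCII-only in its own encoding while its bytes may be, the coderange of the result cannot be copied from the receiver and has to be left unknown.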
Yusuke Endoh
a78c733cc3 Revert "Revert "error.c: Let Exception#inspect inspect its message""
This reverts commit b9f030954a.

[Bug #18170]
2022-09-23 16:40:59 +09:00
Benoit Daloze
6525b6f760 Remove get_actual_encoding() and the dynamic endian detection for dummy UTF-16/UTF-32
* And simplify callers of get_actual_encoding().
* See [Feature #18949].
* See https://github.com/ruby/ruby/pull/6322#issuecomment-1242758474
2022-09-12 14:02:34 +02:00
Kazuki Yamaguchi
aff6534e32 Avoid unnecessary copying when removing the leading part of a string
Remove the superfluous str_modify_keep_cr() call from rb_str_update().
It ends up calling either rb_str_drop_bytes() or rb_str_splice_0(),
both of which already perform the checks where necessary.

The extra call makes the string "independent". This is not always
wanted, in other words, it can keep the same shared root when merely
removing the leading part of a shared string.
2022-09-09 16:03:20 +09:00
Jean Boussier
cd1724bdde rb_str_concat_literals: use rb_str_buf_append
That's about 1.30x faster.
2022-09-08 15:02:21 +02:00
Nobuyoshi Nakada
332d29df53 [DOC] non-positive base in Kernel#Integer and String#to_i 2022-09-08 11:52:16 +09:00
Nobuyoshi Nakada
576bdec03f [Bug #18973] Promote US-ASCII to ASCII-8BIT when adding 8-bit char 2022-08-31 17:27:59 +09:00
Nobuyoshi Nakada
fe4dd18db4 [DOC] Fix a typo [ci skip] 2022-08-27 12:54:42 +09:00
Nobuyoshi Nakada
43e8d9a050 Check if encoding capable object before check if ASCII compatible 2022-08-20 10:06:40 +09:00
Jean Boussier
b0b9f7201a rb_str_resize: Only clear coderange on truncation
If we are expanding the string or only stripping extra capacity
then coderange won't change, so clearing it is wasteful.
2022-08-18 10:09:08 +02:00
Jeremy Evans
49517b3bb4 Fix inspect for unicode codepoint 0x85
This is an inelegant hack, by manually checking for this specific
code point in rb_str_inspect.  Some testing indicates that this is
the only code point affected.

It's possible a better fix would be inside of lower-level encoding
code, such that rb_enc_isprint would return false and not true for
codepoint 0x85.

Fixes [Bug #16842]
2022-08-11 08:47:29 -07:00
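The code point in question is U+0085 (NEXT LINE). A quick way to see how a given interpreter inspects it is sketched below; `inspect_nel` is a hypothetical helper, and no output is claimed because the result depends on the Ruby version and on the default external encoding.

```ruby
# Hypothetical helper (not part of Ruby): inspects a string holding U+0085.
def inspect_nel
  nel = "\u0085" # U+0085 NEXT LINE, a C1 control character
  nel.inspect
end

# With this fix the character is expected to appear escaped rather than raw.
puts inspect_nel
```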
Nobuyoshi Nakada
2d1cf658ee Adjust indent [ci skip] 2022-07-26 18:33:21 +09:00
Kevin Menard
9a8f6e392f Cheaply derive code range for String#b return value
The result of String#b is a string with an ASCII_8BIT/BINARY encoding. That encoding is ASCII-compatible and has no byte sequences that are invalid for the encoding. If we know the receiver's code range, we can derive the resulting string's code range without needing to perform a full code range scan.
2022-07-26 09:03:44 +02:00
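The properties this optimisation relies on are visible from Ruby: the result of `String#b` is always validly encoded, and it is ASCII-only exactly when the receiver's bytes are. A minimal sketch for ASCII-compatible receivers, with `describe_binary` as a hypothetical helper:

```ruby
# Hypothetical helper (not part of Ruby): summarises the result of String#b.
def describe_binary(str)
  bin = str.b
  { encoding: bin.encoding, ascii_only: bin.ascii_only?, valid: bin.valid_encoding? }
end

ascii  = "abc"                                      # 7-bit receiver
utf8   = "\u00e9"                                   # valid, non-ASCII receiver
broken = "\xFF".b.force_encoding(Encoding::UTF_8)   # broken receiver

[ascii, utf8, broken].each { |s| p describe_binary(s) }
# ascii  -> ascii_only: true,  valid: true
# utf8   -> ascii_only: false, valid: true
# broken -> ascii_only: false, valid: true
```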
Jean Boussier
31a5586d1e rb_str_buf_append: add a fast path for ENC_CODERANGE_VALID
If the RHS has valid encoding, and both strings have the same
encoding, we can use the fast path.

However we need to update the LHS coderange.

```
compare-ruby: ruby 3.2.0dev (2022-07-21T14:46:32Z master cdbb9b8555) [arm64-darwin21]
built-ruby: ruby 3.2.0dev (2022-07-25T07:25:41Z string-concat-vali.. 11a2772bdd) [arm64-darwin21]
warming up...

|                    |compare-ruby|built-ruby|
|:-------------------|-----------:|---------:|
|binary_concat_7bit  |    554.816k|  556.460k|
|                    |           -|     1.00x|
|utf8_concat_7bit    |    556.367k|  555.101k|
|                    |       1.00x|         -|
|utf8_concat_UTF8    |    412.555k|  556.824k|
|                    |           -|     1.35x|
```
2022-07-25 14:18:52 +02:00
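The note about updating the LHS coderange has a visible counterpart in Ruby: appending valid non-ASCII text to an ASCII-only string must leave the result valid but no longer ASCII-only. The sketch below is illustrative only; `append_and_report` is a hypothetical helper and does not reproduce the benchmark above.

```ruby
# Hypothetical helper (not part of Ruby): appends rhs to a copy of lhs and
# reports the observable encoding state before and after.
def append_and_report(lhs, rhs)
  buffer = lhs.dup
  ascii_before = buffer.ascii_only?
  buffer << rhs # same encoding on both sides
  {
    ascii_only_before: ascii_before,
    ascii_only_after: buffer.ascii_only?,
    valid_after: buffer.valid_encoding?,
    encoding: buffer.encoding
  }
end

p append_and_report("abc", "\u00e9")
# ascii_only_before: true, ascii_only_after: false, valid_after: true, encoding: UTF-8
```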
Takashi Kokubun
5b21e94beb Expand tabs [ci skip]
[Misc #18891]
2022-07-21 09:42:04 -07:00
Jeremy Evans
423b41cba7 Make String#each_line work correctly with paragraph separator and chomp
Previously, it was including one newline when chomp was used,
which is inconsistent with IO#each_line behavior. This makes
behavior consistent with IO#each_line, chomping all paragraph
separators (multiple consecutive newlines), but not single
newlines.

Partially Fixes [Bug #18768]
2022-07-21 08:02:32 -07:00
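Paragraph mode is selected by passing an empty string as the separator. The sketch below prints what `String#each_line` yields with and without `chomp`; `paragraphs` is a hypothetical helper and the sample text is arbitrary. No exact output is claimed, since the chomped result is precisely what this commit changes.

```ruby
# Hypothetical helper (not part of Ruby): splits text in paragraph mode.
def paragraphs(text, chomp:)
  text.each_line("", chomp: chomp).to_a
end

text = "a\n\n\nb\n" # two paragraphs separated by a run of newlines

p paragraphs(text, chomp: false)
# With this change, the paragraph separator is expected to be removed entirely
# from the first element, while a single trailing newline is kept.
p paragraphs(text, chomp: true)
```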
Jean Boussier
f954c5dae4 string.c: use str_enc_fastpath in TERM_LEN
Not having to fetch the rb_encoding saves a significant
amount of time.

Additionally, even when we have to fetch it, we can do
it faster using `ENCODING_GET` rather than `rb_enc_get`.

```
compare-ruby: ruby 3.2.0dev (2022-07-19T08:41:40Z master cb9fd920a3) [arm64-darwin21]
built-ruby: ruby 3.2.0dev (2022-07-21T11:16:16Z faster-buffer-conc.. 4f001f0748) [arm64-darwin21]
warming up...

|                      |compare-ruby|built-ruby|
|:---------------------|-----------:|---------:|
|binary_concat_utf8    |    510.580k|  565.600k|
|                      |           -|     1.11x|
|binary_concat_binary  |    512.653k|  571.483k|
|                      |           -|     1.11x|
|utf8_concat_utf8      |    511.396k|  566.879k|
|                      |           -|     1.11x|
```
2022-07-21 15:06:50 +02:00
Jean Boussier
cb9fd920a3 str_buf_cat: preserve coderange when going through fastpath
rb_str_modify clears the coderange, which in this case isn't
necessary.

```
compare-ruby: ruby 3.2.0dev (2022-07-12T15:01:11Z master 71aec68566) [arm64-darwin21]
built-ruby: ruby 3.2.0dev (2022-07-19T07:17:01Z faster-buffer-conc.. 3cad62aab4) [arm64-darwin21]
warming up...

|                      |compare-ruby|built-ruby|
|:---------------------|-----------:|---------:|
|binary_concat_utf8    |    360.617k|  605.091k|
|                      |           -|     1.68x|
|binary_concat_binary  |    446.650k|  605.053k|
|                      |           -|     1.35x|
|utf8_concat_utf8      |    454.166k|  597.311k|
|                      |           -|     1.32x|
```

```
|            |compare-ruby|built-ruby|
|:-----------|-----------:|---------:|
|erb_render  |      1.790M|    2.045M|
|            |           -|     1.14x|
```
2022-07-19 10:41:40 +02:00