Follow up to fix 3b7373fd00.
In that commit, the line number in the first frame was overwritten after
the whole backtrace was created. There was a problem that the line
number was overwritten even if the location was backpatched.
Instead, this commit uses first_lineno if the frame is
VM_FRAME_MAGIC_DUMMY when generating the backtrace.
Before the patch:
```
$ ./miniruby -e '[1, 2].inject(:tap)'
-e:in '<main>': wrong number of arguments (given 1, expected 0) (ArgumentError)
from -e:1:in 'Enumerable#inject'
from -e:1:in '<main>'
```
After the patch:
```
$ ./miniruby -e '[1, 2].inject(:tap)'
-e:1:in '<main>': wrong number of arguments (given 1, expected 0) (ArgumentError)
from -e:1:in 'Enumerable#inject'
from -e:1:in '<main>'
```
This implements a hash set which is wait-free for lookup and lock-free
for insert (unless resizing) to use for fstring de-duplication.
As highlighted in https://bugs.ruby-lang.org/issues/19288, heavy use of
fstrings (frozen interned strings) can significantly reduce the
parallelism of Ractors.
I tried a few other approaches first: using an RWLock, striping a series
of RWlocks (partitioning the hash N-ways to reduce lock contention), and
putting a cache in front of it. All of these improved the situation, but
were unsatisfying as all still required locks for writes (and granular
locks are awkward, since we run the risk of needing to reach a vm
barrier) and this table is somewhat write-heavy.
My main reference for this was Cliff Click's talk on a lock free
hash-table for java https://www.youtube.com/watch?v=HJ-719EGIts. It
turns out this lock-free hash set is made easier to implement by a few
properties:
* We only need a hash set rather than a hash table (we only need keys,
not values), and so the full entry can be written as a single VALUE
* As a set we only need lookup/insert/delete, no update
* Delete is only run inside GC so does not need to be atomic (It could
be made concurrent)
* I use rb_vm_barrier for the (rare) table rebuilds (It could be made
concurrent) We VM lock (but don't require other threads to stop) for
table rebuilds, as those are rare
* The conservative garbage collector makes deferred replication easy,
using a T_DATA object
Another benefits of having a table specific to fstrings is that we
compare by value on lookup/insert, but by identity on delete, as we only
want to remove the exact string which is being freed. This is faster and
provides a second way to avoid the race condition in
https://bugs.ruby-lang.org/issues/21172.
This is a pretty standard open-addressing hash table with quadratic
probing. Similar to our existing st_table or id_table. Deletes (which
happen on GC) replace existing keys with a tombstone, which is the only
type of update which can occur. Tombstones are only cleared out on
resize.
Unlike st_table, the VALUEs are stored in the hash table itself
(st_table's bins) rather than as a compact index. This avoids an extra
pointer dereference and is possible because we don't need to preserve
insertion order. The table targets a load factor of 2 (it is enlarged
once it is half full).
The instruction counter is slowing multi-Ractor applications. I had
changed it to use a thread local, but using a thread local is slowing
single threaded applications. This commit only enables the instruction
counter in YJIT stats builds until we can figure out a way to gather the
information with lower overhead.
Co-authored-by: Randy Stauner <randy.stauner@shopify.com>
`rb_vm_insns_count` is a global variable used for reporting YJIT
statistics. It is a counter that tallies the number of interpreter
instructions that have been executed, this way we can approximate how
much time we're spending in YJIT compared to the interpreter.
Unfortunately keeping this statistic means that every instruction
executed in the interpreter loop must increment the counter. Normally
this isn't a problem, but in multi-threaded situations (when Ractors are
used), incrementing this counter can become quite costly due to page
caching issues.
Additionally, since there is no locking when incrementing this global,
the count can't really make sense in a multi-threaded environment.
This commit changes `rb_vm_insns_count` to a thread local. That way each
Ractor has it's own copy of the counter and incrementing the counter
becomes quite cheap. Of course this means that in multi-threaded
situations, the value doesn't really make sense (but it didn't make
sense before because of the lack of locking).
The counter is used for YJIT statistics, and since YJIT is basically
disabled when Ractors are in use, I don't think we care about
inaccuracies (for the time being). We can revisit this counter when we
give YJIT multi-threading support, but for the time being this commit
restores multi-threaded performance.
To test this, I used the benchmark in [Bug #20489].
Here is the performance on Ruby 3.2:
```
$ time RUBY_MAX_CPU=12 ./miniruby -v ../test.rb 8 8
ruby 3.2.0 (2022-12-25 revision a528908271) [x86_64-linux]
[0...1, 1...2, 2...3, 3...4, 4...5, 5...6, 6...7, 7...8]
../test.rb:43: warning: Ractor is experimental, and the behavior may change in future versions of Ruby! Also there are many implementation issues.
________________________________________________________
Executed in 2.53 secs fish external
usr time 19.86 secs 370.00 micros 19.86 secs
sys time 0.02 secs 320.00 micros 0.02 secs
```
We can see the regression in performance on the master branch:
```
$ time RUBY_MAX_CPU=12 ./miniruby -v ../test.rb 8 8
ruby 3.5.0dev (2025-01-10T16:22:26Z master 4a2702dafb) +PRISM [x86_64-linux]
[0...1, 1...2, 2...3, 3...4, 4...5, 5...6, 6...7, 7...8]
../test.rb:43: warning: Ractor is experimental, and the behavior may change in future versions of Ruby! Also there are many implementation issues.
________________________________________________________
Executed in 24.87 secs fish external
usr time 195.55 secs 0.00 micros 195.55 secs
sys time 0.00 secs 716.00 micros 0.00 secs
```
Here are the stats after this commit:
```
$ time RUBY_MAX_CPU=12 ./miniruby -v ../test.rb 8 8
ruby 3.5.0dev (2025-01-10T20:37:06Z tl 3ef0432779) +PRISM [x86_64-linux]
[0...1, 1...2, 2...3, 3...4, 4...5, 5...6, 6...7, 7...8]
../test.rb:43: warning: Ractor is experimental, and the behavior may change in future versions of Ruby! Also there are many implementation issues.
________________________________________________________
Executed in 2.46 secs fish external
usr time 19.34 secs 381.00 micros 19.34 secs
sys time 0.01 secs 321.00 micros 0.01 secs
```
[Bug #20489]
* Remove 1 allocation in Enumerable#each_with_index
Previously, each call to Enumerable#each_with_index allocates 2
objects, one for the counting index, the other an imemo_ifunc passed
to `self.each` as a block.
Use `struct vm_ifunc::data` to hold the counting index directly to
remove 1 allocation.
* [DOC] Brief summary for usages of `struct vm_ifunc`
This function accepts flags:
RB_NO_KEYWORDS, RB_PASS_KEYWORDS, RB_PASS_CALLED_KEYWORDS:
Works as the same as rb_block_call_kw.
RB_BLOCK_NO_USE_PACKED_ARGS:
The given block ("bl_proc") does not use "yielded_arg" of rb_block_call_func_t.
Instead, the block accesses the yielded arguments via "argc" and "argv".
This flag allows the called method to yield arguments without allocating an Array.
[Feature #13557]
Setting the backtrace with an array of strings is lossy. The resulting
exception will return nil on `#backtrace_locations`.
By accepting an array of `Backtrace::Location` instance, we can rebuild
a `Backtrace` instance and have a fully functioning Exception.
Co-Authored-By: Étienne Barrié <etienne.barrie@gmail.com>
This `st_table` is used to both mark and pin classes
defined from the C API. But `vm->mark_object_ary` already
does both much more efficiently.
Currently a Ruby process starts with 252 rooted classes,
which uses `7224B` in an `st_table` or `2016B` in an `RArray`.
So a baseline of 5kB saved, but since `mark_object_ary` is
preallocated with `1024` slots but only use `405` of them,
it's a net `7kB` save.
`vm->mark_object_ary` is also being refactored.
Prior to this changes, `mark_object_ary` was a regular `RArray`, but
since this allows for references to be moved, it was marked a second
time from `rb_vm_mark()` to pin these objects.
This has the detrimental effect of marking these references on every
minors even though it's a mostly append only list.
But using a custom TypedData we can save from having to mark
all the references on minor GC runs.
Addtionally, immediate values are now ignored and not appended
to `vm->mark_object_ary` as it's just wasted space.
when the RUBY_FREE_ON_SHUTDOWN environment variable is set, manually free memory at shutdown.
Co-authored-by: Nobuyoshi Nakada <nobu@ruby-lang.org>
Co-authored-by: Peter Zhu <peter@peterzhu.ca>
This commit moves IO#readline to Ruby. In order to call C functions,
keyword arguments must be converted to hashes. Prior to this commit,
code like `io.readline(chomp: true)` would allocate a hash. This
commits moves the keyword "denaturing" to Ruby, allowing us to send
positional arguments to the C API and avoiding the hash allocation.
Here is an allocation benchmark for the method:
```
x = GC.stat(:total_allocated_objects)
File.open("/usr/share/dict/words") do |f|
f.readline(chomp: true) until f.eof?
end
p ALLOCATIONS: GC.stat(:total_allocated_objects) - x
```
Before this commit, the output was this:
```
$ make run
./miniruby -I./lib -I. -I.ext/common -r./arm64-darwin22-fake ./test.rb
{:ALLOCATIONS=>707939}
```
Now it is this:
```
$ make run
./miniruby -I./lib -I. -I.ext/common -r./arm64-darwin22-fake ./test.rb
{:ALLOCATIONS=>471962}
```
[Bug #19890] [ruby-core:114803]
Before object shapes, we were using class serial to invalidate
inline caches. Now that we use shape_id for inline cache keys,
the class serial is unnecessary.
Co-Authored-By: Aaron Patterson <tenderlove@ruby-lang.org>
This patch pushes dummy frames when loading code for the
profiling purpose.
The following methods push a dummy frame:
* `Kernel#require`
* `Kernel#load`
* `RubyVM::InstructionSequence.compile_file`
* `RubyVM::InstructionSequence.load_from_binary`
https://bugs.ruby-lang.org/issues/18559
This commit reintroduces finer-grained constant cache invalidation.
After 8008fb7 got merged, it was causing issues on token-threaded
builds (such as on Windows).
The issue was that when you're iterating through instruction sequences
and using the translator functions to get back the instruction structs,
you're either using `rb_vm_insn_null_translator` or
`rb_vm_insn_addr2insn2` depending if it's a direct-threading build.
`rb_vm_insn_addr2insn2` does some normalization to always return to
you the non-trace version of whatever instruction you're looking at.
`rb_vm_insn_null_translator` does not do that normalization.
This means that when you're looping through the instructions if you're
trying to do an opcode comparison, it can change depending on the type
of threading that you're using. This can be very confusing. So, this
commit creates a new translator function
`rb_vm_insn_normalizing_translator` to always return the non-trace
version so that opcode comparisons don't have to worry about different
configurations.
[Feature #18589]
This reverts commits for [Feature #18589]:
* 8008fb7352
"Update formatting per feedback"
* 8f6eaca2e1
"Delete ID from constant cache table if it becomes empty on ISEQ free"
* 629908586b
"Finer-grained inline constant cache invalidation"
MSWin builds on AppVeyor have been crashing since the merger.
Current behavior - caches depend on a global counter. All constant mutations cause caches to be invalidated.
```ruby
class A
B = 1
end
def foo
A::B # inline cache depends on global counter
end
foo # populate inline cache
foo # hit inline cache
C = 1 # global counter increments, all caches are invalidated
foo # misses inline cache due to `C = 1`
```
Proposed behavior - caches depend on name components. Only constant mutations with corresponding names will invalidate the cache.
```ruby
class A
B = 1
end
def foo
A::B # inline cache depends constants named "A" and "B"
end
foo # populate inline cache
foo # hit inline cache
C = 1 # caches that depend on the name "C" are invalidated
foo # hits inline cache because IC only depends on "A" and "B"
```
Examples of breaking the new cache:
```ruby
module C
# Breaks `foo` cache because "A" constant is set and the cache in foo depends
# on "A" and "B"
class A; end
end
B = 1
```
We expect the new cache scheme to be invalidated less often because names aren't frequently reused. With the cache being invalidated less, we can rely on its stability more to keep our constant references fast and reduce the need to throw away generated code in YJIT.
RubyVM::AST.of(Thread::Backtrace::Location) returns a node that
corresponds to the location. Typically, the node is a method call, but
not always.
This change also includes iseq's dump/load support of node_ids for each
instructions.
`cd` is passed to method call functions to method invocation
functions, but `cd` can be manipulated by other ractors simultaneously
so it contains thread-safety issue.
To solve this issue, this patch stores `ci` and found `cc` to `calling`
and stops to pass `cd`.
* `GC.auto_compact=`, `GC.auto_compact` can be used to control when
compaction runs. Setting `auto_compact=` to true will cause
compaction to occurr duing major collections. At the moment,
compaction adds significant overhead to major collections, so please
test first!
[Feature #17176]
According to MSVC manual (*1), cl.exe can skip including a header file
when that:
- contains #pragma once, or
- starts with #ifndef, or
- starts with #if ! defined.
GCC has a similar trick (*2), but it acts more stricter (e. g. there
must be _no tokens_ outside of #ifndef...#endif).
Sun C lacked #pragma once for a looong time. Oracle Developer Studio
12.5 finally implemented it, but we cannot assume such recent version.
This changeset modifies header files so that each of them include
strictly one #ifndef...#endif. I believe this is the most portable way
to trigger compiler optimizations. [Bug #16770]
*1: https://docs.microsoft.com/en-us/cpp/preprocessor/once
*2: https://gcc.gnu.org/onlinedocs/cppinternals/Guard-Macros.html
This patch contains several ideas:
(1) Disposable inline method cache (IMC) for race-free inline method cache
* Making call-cache (CC) as a RVALUE (GC target object) and allocate new
CC on cache miss.
* This technique allows race-free access from parallel processing
elements like RCU.
(2) Introduce per-Class method cache (pCMC)
* Instead of fixed-size global method cache (GMC), pCMC allows flexible
cache size.
* Caching CCs reduces CC allocation and allow sharing CC's fast-path
between same call-info (CI) call-sites.
(3) Invalidate an inline method cache by invalidating corresponding method
entries (MEs)
* Instead of using class serials, we set "invalidated" flag for method
entry itself to represent cache invalidation.
* Compare with using class serials, the impact of method modification
(add/overwrite/delete) is small.
* Updating class serials invalidate all method caches of the class and
sub-classes.
* Proposed approach only invalidate the method cache of only one ME.
See [Feature #16614] for more details.
Now, rb_call_info contains how to call the method with tuple of
(mid, orig_argc, flags, kwarg). Most of cases, kwarg == NULL and
mid+argc+flags only requires 64bits. So this patch packed
rb_call_info to VALUE (1 word) on such cases. If we can not
represent it in VALUE, then use imemo_callinfo which contains
conventional callinfo (rb_callinfo, renamed from rb_call_info).
iseq->body->ci_kw_size is removed because all of callinfo is VALUE
size (packed ci or a pointer to imemo_callinfo).
To access ci information, we need to use these functions:
vm_ci_mid(ci), _flag(ci), _argc(ci), _kwarg(ci).
struct rb_call_info_kw_arg is renamed to rb_callinfo_kwarg.
rb_funcallv_with_cc() and rb_method_basic_definition_p_with_cc()
is temporary removed because cd->ci should be marked.
This removes the warnings added in 2.7, and changes the behavior
so that a final positional hash is not treated as keywords or
vice-versa.
To handle the arg_setup_block splat case correctly with keyword
arguments, we need to check if we are taking a keyword hash.
That case didn't have a test, but it affects real-world code,
so add a test for it.
This removes rb_empty_keyword_given_p() and related code, as
that is not needed in Ruby 3. The empty keyword case is the
same as the no keyword case in Ruby 3.
This changes rb_scan_args to implement keyword argument
separation for C functions when the : character is used.
For backwards compatibility, it returns a duped hash.
This is a bad idea for performance, but not duping the hash
breaks at least Enumerator::ArithmeticSequence#inspect.
Instead of having RB_PASS_CALLED_KEYWORDS be a number,
simplify the code by just making it be rb_keyword_given_p().
Saves comitters' daily life by avoid #include-ing everything from
internal.h to make each file do so instead. This would significantly
speed up incremental builds.
We take the following inclusion order in this changeset:
1. "ruby/config.h", where _GNU_SOURCE is defined (must be the very
first thing among everything).
2. RUBY_EXTCONF_H if any.
3. Standard C headers, sorted alphabetically.
4. Other system headers, maybe guarded by #ifdef
5. Everything else, sorted alphabetically.
Exceptions are those win32-related headers, which tend not be self-
containing (headers have inclusion order dependencies).