Commit graph

18 commits

Author SHA1 Message Date
Jean Boussier
85c52079aa set.c: Store set_table->bins at the end of set_table->entries
This saves one pointer in `struct set_table`, which would allow
`Set` objects to still fit in 80B TypedData slots even if RTypedData
goes from 32B to 40B large.

The existing set benchmark seem to show this doesn't have a very
significant impact. Smaller sets are a bit faster, larger sets
a bit slower.

It seem consistent over multiple runs, but it's unclear how much
of that is just error margin.

```
compare-ruby: ruby 3.5.0dev (2025-08-12T02:14:57Z master 428937a536) +YJIT +PRISM [arm64-darwin24]
built-ruby: ruby 3.5.0dev (2025-08-12T07:22:26Z set-entries-bounds da30024fdc) +YJIT +PRISM [arm64-darwin24]
warming up........

|                         |compare-ruby|built-ruby|
|:------------------------|-----------:|---------:|
|new_0                    |     15.459M|   15.823M|
|                         |           -|     1.02x|
|new_10                   |      3.484M|    3.574M|
|                         |           -|     1.03x|
|new_100                  |    546.992k|  564.679k|
|                         |           -|     1.03x|
|new_1000                 |     49.391k|   48.169k|
|                         |       1.03x|         -|
|aref_0                   |     18.643M|   19.350M|
|                         |           -|     1.04x|
|aref_10                  |      5.941M|    6.006M|
|                         |           -|     1.01x|
|aref_100                 |    822.197k|  814.219k|
|                         |       1.01x|         -|
|aref_1000                |     83.230k|   79.411k|
|                         |       1.05x|         -|
```
2025-08-12 21:56:57 +02:00
Jean Boussier
5bcfc53d6f set.c: use rb_gc_mark_and_move
The `p->field = rb_gc_location(p->field)` isn't ideal because it means all
references are rewritten on compaction, regardless of whether the referenced
object has moved. This isn't good for caches nor for Copy-on-Write.

`rb_gc_mark_and_move` avoid needless writes, and most of the time allow to
have a single function for both marking and updating references.
2025-08-07 21:00:00 +02:00
Nobuyoshi Nakada
6179cc0118
[DOC] Fill undocumented documents 2025-08-04 02:23:43 +09:00
viralpraxis
d4020dd5fa [Bug #21513] Raise on converting endless range to set
ref: https://bugs.ruby-lang.org/issues/21513

Before this patch, trying to convert endless range (e.g. `(1..)`) to set
(using `to_set`) would hang
2025-07-29 18:42:18 +09:00
Jeremy Evans
2ab38691a2 Add Set C-API
This should be a minimal C-API needed to deal with Set objects. It
supports creating the sets, checking whether an element is the set,
adding and removing elements, iterating over the elements, clearing
a set, and returning the size of the set.

Co-authored-by: Nobuyoshi Nakada <nobu.nakada@gmail.com>
2025-07-11 15:24:23 +09:00
Jeremy Evans
08d4f7893e Rename some set_* functions to set_table_*
These functions conflict with the planned C-API functions. Since they
deal with the underlying set_table pointers and not Set instances,
this seems like a more accurate name as well.
2025-07-11 15:24:23 +09:00
Jeremy Evans
7c3bbfcddb Include Set subclass name in Set#inspect output
Fixes [Bug #21377]

Co-authored-by: zzak <zzak@hey.com>
2025-06-25 09:21:07 +09:00
Jeremy Evans
3a9c091cf3 Simplify Set#inspect output
As Set is now a core collection class, it should have special inspect
output.  Ideally, inspect output should be suitable to eval, similar
to array and hash (assuming the elements are also suitable to eval):

  set = Set[1, 2, 3]
  eval(set.inspect) == set # should be true

The simplest way to do this is to use the Set[] syntax.

This deliberately does not use any subclass name in the output,
similar to array and hash. It is more important that users know they
are dealing with a set than which subclass:

  Class.new(Set)[]
  # this does: Set[]
  # not: #<Class:0x00000c21c78699e0>[]

This inspect change breaks the power_assert bundled gem tests, so
add power_assert to TEST_BUNDLED_GEMS_ALLOW_FAILURES in the workflows.

Implements [Feature #21389]
2025-06-25 09:21:07 +09:00
tomoya ishida
67346a7d94
[Bug #21449] Fix Set#divide{|a,b|} using Union-find structure (#13680)
* [Bug #21449] Fix Set#divide{|a,b|} using Union-find structure

Implements Union-find structure with path compression.
Since divide{|a,b|} calls the given block n**2 times in the worst case, there is no need to implement union-by-rank or union-by-size optimization.

* Avoid internal arrays from being modified from block passed to Set#divide

Internal arrays can be modified from yielded block through ObjectSpace.
Freeze readonly array, use ALLOCV_N instead of mutable array.
2025-06-24 02:56:04 +09:00
John Hawthorn
c962735fe8 Add missing write barrier in set_i_initialize_copy
When we copy the table from one set to another we need to run write
barriers.
2025-06-09 10:06:47 -07:00
Jeremy Evans
0b07d2a1e3 Deprecate passing arguments to Set#to_set and Enumerable#to_set
Array#to_a, Hash#to_h, Enumerable#to_a, and Enumerable#to_h do not
allow you to specify subclasses.  This has undesired behavior when
passing non-Set subclasses.  All of these are currently allowed, and
none make sense:

```ruby
enum = [1,2,3].to_enum

enum.to_set(Hash)
enum.to_set(Struct.new("A", :a))
enum.to_set(ArgumentError)
enum.to_set(Thread){}
```

Users who want to create instances of a subclass of Set from an
enumerable should pass the enumerable to SetSubclass.new instead of
using to_set.
2025-06-06 01:24:05 +09:00
Jean Boussier
1ee4b43a56 Set#merge: raise if called during iteration
[Bug #21332]
2025-05-13 22:27:42 +02:00
Jeremy Evans
203199251f Remove Set#to_h
This overrides Enumerable#to_h, but doesn't handle a block, breaking
backwards compatibility.

Set#to_h was added in the marshalling support commit, but isn't
necessary for that, as the underlying function is called. Remove
the method definition to restore backwards compatibility.
2025-05-13 10:09:58 +09:00
Jeremy Evans
21035c826d Handle mutating of array passed to Set.new during iteration
This avoids a heap-use-after-free.

Fixes [Bug #21306]
2025-05-04 04:10:57 +09:00
Jeremy Evans
be665cf855 Handle mutation of array being merged into set
Check length of array during every iteration, as a #hash method
could truncate the array, resulting in heap-use-after-free.

Fixes [Bug #21305]
2025-05-04 04:10:57 +09:00
Aaron Patterson
e6974be545 Don't call hash tombstone compaction from GC compaction
Tombstone removal may possibly require allocation, and we're not allowed
to allocate during GC.  This commit also renames `set_compact` to
`set_update_references` to differentiate tombstone removal compaction with GC
object compaction.

Co-Authored-By: Max Bernstein <max.bernstein@shopify.com>
Co-authored-by: Jean Boussier <jean.boussier@gmail.com>
2025-04-29 13:33:10 -07:00
Jeremy Evans
926411171d
Support Marshal.{dump,load} for core Set
This was missed when adding core Set, because it's handled
implicitly for T_OBJECT.

Keep marshal compatibility between core Set and stdlib Set,
so you can unmarshal core Set with stdlib Set and vice versa.

Co-authored-by: Nobuyoshi Nakada <nobu@ruby-lang.org>
2025-04-28 08:38:35 -07:00
Jeremy Evans
e4f85bfc31 Implement Set as a core class
Set has been an autoloaded standard library since Ruby 3.2.
The standard library Set is less efficient than it could be, as it
uses Hash for storage, which stores unnecessary values for each key.

Implementation details:

* Core Set uses a modified version of `st_table`, named `set_table`.
  than `s/st_/set_/`, the main difference is that the stored records
  do not have values, making them 1/3 smaller. `st_table_entry` stores
  `hash`, `key`, and `record` (value), while `set_table_entry` only
  stores `hash` and `key`.  This results in large sets using ~33% less
  memory compared to stdlib Set.  For small sets, core Set uses 12% more
  memory (160 byte object slot and 64 malloc bytes, while stdlib set
  uses 40 for Set and 160 for Hash).  More memory is used because
  the set_table is embedded and 72 bytes in the object slot are
  currently wasted. Hopefully we can make this more efficient and have
  it stored in an 80 byte object slot in the future.

* All methods are implemented as cfuncs, except the pretty_print
  methods, which were moved to `lib/pp.rb` (which is where the
  pretty_print methods for other core classes are defined).  As is
  typical for core classes, internal calls call C functions and
  not Ruby methods.  For example, to check if something is a Set,
  `rb_obj_is_kind_of` is used, instead of calling `is_a?(Set)` on the
  related object.

* Almost all methods use the same algorithm that the pure-Ruby
  implementation used.  The exception is when calling `Set#divide` with a
  block with 2-arity.  The pure-Ruby method used tsort to implement this.
  I developed an algorithm that only allocates a single intermediate
  hash and does not need tsort.

* The `flatten_merge` protected method is no longer necessary, so it
  is not implemented (it could be).

* Similar to Hash/Array, subclasses of Set are no longer reflected in
  `inspect` output.

* RDoc from stdlib Set was moved to core Set, with minor updates.

This includes a comprehensive benchmark suite for all public Set
methods.  As you would expect, the native version is faster in the
vast majority of cases, and multiple times faster in many cases.
There are a few cases where it is significantly slower:

* Set.new with no arguments (~1.6x)
* Set#compare_by_identity for small sets (~1.3x)
* Set#clone for small sets (~1.5x)
* Set#dup for small sets (~1.7x)

These are slower as Set does not currently use the AR table
optimization that Hash does, so a new set_table is initialized for
each call.  I'm not sure it's worth the complexity to have an AR
table-like optimization for small sets (for hashes it makes sense,
as small hashes are used everywhere in Ruby).

The rbs and repl_type_completor bundled gems will need updates to
support core Set.  The pull request marks them as allowed failures.

This passes all set tests with no changes.  The following specs
needed modification:

* Modifying frozen set error message (changed for the better)
* `Set#divide` when passed a 2-arity block no longer yields the same
  object as both the first and second argument (this seems like an issue
  with the previous implementation).
* Set-like objects that override `is_a?` such that `is_a?(Set)` return
  `true` are no longer treated as Set instances.
* `Set.allocate.hash` is no longer the same as `nil.hash`
* `Set#join` no longer calls `Set#to_a` (it calls the underlying C
   function).
* `Set#flatten_merge` protected method is not implemented.

Previously, `set.rb` added a `SortedSet` autoload, which loads
`set/sorted_set.rb`.  This replaces the `Set` autoload in `prelude.rb`
with a `SortedSet` autoload, but I recommend removing it and
`set/sorted_set.rb`.

This moves `test/set/test_set.rb` to `test/ruby/test_set.rb`,
reflecting that switch to a core class.  This does not move the spec
files, as I'm not sure how they should be handled.

Internally, this uses the st_* types and functions as much as
possible, and only adds set_* types and functions as needed.
The underlying set_table implementation is stored in st.c, but
there is no public C-API for it, nor is there one planned, in
order to keep the ability to change the internals going forward.

For internal uses of st_table with Qtrue values, those can
probably be replaced with set_table.  To do that, include
internal/set_table.h.  To handle symbol visibility (rb_ prefix),
internal/set_table.h uses the same macro approach that
include/ruby/st.h uses.

The Set class (rb_cSet) and all methods are defined in set.c.
There isn't currently a C-API for the Set class, though C-API
functions can be added as needed going forward.

Implements [Feature #21216]

Co-authored-by: Jean Boussier <jean.boussier@gmail.com>
Co-authored-by: Oliver Nutter <mrnoname1000@riseup.net>
2025-04-26 10:31:11 +09:00