mirror of
https://github.com/servo/servo.git
synced 2026-05-01 07:14:51 +02:00
Page:
Workweek string interning
Pages
Adding a new WebIDL binding
Alternative Logo Proposals and Related Swag
Asynchronous WebAssembly compilation project
Austin Oxidation
Autogeneration of style structs
Basic SVG support project
Beginner's guide to rebasing and squashing
Benchmarking
Benchmarks
Bots
Browser Engine Research
Build Errors FAQ
Buildbot administration
Building for Android
Building for Magic Leap
Building for UWP
Building on ARM desktop Linux
Building
CI Services we use
CSS parse error reporting
CSSOM student project
Canvas rendering project
Cargo upgrade service project
Code rust concurrency
Code Review
Code of Conduct
Coding standards
Compiler upgrade recipes
Compositor Layer Design
Contributing
Control Servo using WebDriver
Creating and viewing WARC web archives in Servo
Creating new OpenSSL Windows binary distributions
Cross compiling from linux to mac
Crowbot
Css selector matching meeting 2013 07 19
DOM Design
DOM documentation
DOM missing pieces
Debugging JS web compat issues
Debugging and editing tools
Debugging
Design
Developer tools student project
Devtools CSS errors
Devtools plans
Devtools
Diagnosing SpiderMonkey JIT issues
Eric Atkinson visit 2013 09 10
Events and sundry
Expand HTTP request response monitoring
Fetch improvement project
Firefox Reality release notes
FirefoxReality build
Firewall setup for servo master1
Focus student project
Form validation student project
GSoC project brainstorming
Garbage collected DOM
Getting started with layout
GitHub Labels
Github & Critic PR handling 101
Github workflow
Glossary
Governance
Graphics toolkit integration
HTML parser improvement project
HTMLElement binding conversion
HTTP archive support project
HTTP library requirements
Hawaii Rooting
High priority content for layout
Highfive
HoloLens 2 test plan
Home
How to generate GStreamer binaries for CI
Image load conformance student project
Image maps project
Implement HTML charset parsing project
Implement ImageBitmap project
Implement missing WebAudio automation student project
Implement support for missing XMLHttpRequest APIs
Implement worker modules
Implementing a web standard (RGSoC)
Improve specification conformance of unicode bidi library
Incremental flow tree construction
Infrastructure
Integrate xml5ever
Intern project brainstorming
Intern projects
JS objects, wrappers, and cross origin concerns 2013 08 07
Layout 2020
Layout Overview
Layout resources
Layout revamp ideas
Leo meyerovich visit 2013 07 22
Linux sandboxing
London Oxidation
London Security
Meeting 2014 10 27
Meeting 2014 12 08
Meeting 2012 02 08
Meeting 2012 02 16
Meeting 2012 07 20
Meeting 2013 04 01
Meeting 2013 04 15
Meeting 2013 04 22
Meeting 2013 04 29
Meeting 2013 05 06
Meeting 2013 05 13
Meeting 2013 05 20
Meeting 2013 06 03
Meeting 2013 06 10
Meeting 2013 06 14
Meeting 2013 06 17
Meeting 2013 06 24
Meeting 2013 07 01
Meeting 2013 07 15
Meeting 2013 07 22
Meeting 2013 07 29
Meeting 2013 08 05
Meeting 2013 08 12
Meeting 2013 08 19
Meeting 2013 09 09
Meeting 2013 09 16
Meeting 2013 09 23
Meeting 2013 09 30
Meeting 2013 10 14
Meeting 2013 10 21
Meeting 2013 10 28
Meeting 2013 11 04
Meeting 2013 11 18
Meeting 2013 11 25
Meeting 2013 12 02
Meeting 2013 12 09
Meeting 2013 12 16
Meeting 2014 01 06
Meeting 2014 01 13
Meeting 2014 01 21
Meeting 2014 01 27
Meeting 2014 02 03
Meeting 2014 02 10
Meeting 2014 02 24
Meeting 2014 03 10
Meeting 2014 03 17
Meeting 2014 03 24
Meeting 2014 03 31
Meeting 2014 04 07
Meeting 2014 04 14
Meeting 2014 04 21
Meeting 2014 04 28
Meeting 2014 05 05
Meeting 2014 05 13
Meeting 2014 05 19
Meeting 2014 06 09
Meeting 2014 06 17
Meeting 2014 06 23
Meeting 2014 06 30
Meeting 2014 07 07
Meeting 2014 07 14
Meeting 2014 07 21
Meeting 2014 07 29
Meeting 2014 08 04
Meeting 2014 08 11
Meeting 2014 08 12
Meeting 2014 08 18
Meeting 2014 08 25
Meeting 2014 09 08
Meeting 2014 09 15
Meeting 2014 09 22
Meeting 2014 09 29
Meeting 2014 10 06
Meeting 2014 10 13
Meeting 2014 10 20
Meeting 2014 11 10
Meeting 2014 11 17
Meeting 2014 11 24
Meeting 2014 12 15
Meeting 2015 01 05
Meeting 2015 01 12
Meeting 2015 01 26
Meeting 2015 02 09
Meeting 2015 02 23
Meeting 2015 03 02
Meeting 2015 03 16
Meeting 2015 03 30
Meeting 2015 04 06
Meeting 2015 04 13
Meeting 2015 04 27
Meeting 2015 05 04
Meeting 2015 05 11
Meeting 2015 05 18
Meeting 2015 06 01
Meeting 2015 06 08
Meeting 2015 06 15
Meeting 2015 07 06
Meeting 2015 07 13
Meeting 2015 07 27
Meeting 2015 08 10
Meeting 2015 08 17
Meeting 2015 08 24
Meeting 2015 08 31
Meeting 2015 09 14
Meeting 2015 09 21
Meeting 2015 09 28
Meeting 2015 10 05
Meeting 2015 10 12
Meeting 2015 10 19
Meeting 2015 10 26
Meeting 2015 11 02
Meeting 2015 11 09
Meeting 2015 11 16
Meeting 2015 11 30
Meeting 2016 01 04
Meeting 2016 01 11
Meeting 2016 01 25
Meeting 2016 02 01
Meeting 2016 02 08
Meeting 2016 02 22
Meeting 2016 03 07
Meeting 2016 03 21
Meeting Devtools Servo 2
Meetings
Microdata project
Minutes Hackathon 2012 03 27
Missing DOM features project
More ServiceWorker support project
More developer tools student project
Mozlandia Automation
Mozlandia B2S
Mozlandia JS
Mozlandia Rust In Gecko
Mozlandia WPT
Mozlandia gfx
Mozlando Devtools Servo
Mozlando Oxidation
Mozlando SM Servo
Mozlando Servo Bluetooth
Mozlando Servo MagicDOM
Mozlando Servo SMStrings
Mutation observer project
Mutation testing project
NCSU student projects
Network security project
Off main thread HTML parsing project
Offscreen canvas improvements project
Offscreen canvas project
Orlando Oxidation 2018
Oxidation 2015 11 05
Persistent sessions student project
Preparing ARM libraries for CI
Priority of CSS properties
Priority of DOM implementation
Priority of dom bindings
Private browsing student project
Profiling
Project proposal deadlines
Prototype JS form controls student project
Prototype ways of splitting the script crate
Publishing a new ANGLE NuGet version
Publishing a new app store release
Push vs Pull for caching
Random web content project
Refactor GLES2 student project
Refactor bluetooth support student project
Remaining work
Removing push notifications from IRC hooks
Replace C libraries student project
Report new contributors project
Representation of computed style
Research
Reviewer
Roadmap
Running Web Platform Tests on Servo
Rust HTML parser
Rust SpiderMonkey debugger API
Rust cssparser code walk 2013 08 02
SaltStack Administration
San Francisco Oxidation
Servo Benchmarking Report (December 2024)
Servo Benchmarking Report (November 2024)
Servo Benchmarking Report (October 2024)
Servo Layout Engines Report
Servo and SpiderMonkey Report
Servo for Gecko Developers
Specification Links
SpiderMonkey related tasks
SpiderMonkey infodump
SpiderMonkey upgrade details
Storage student project
Streaming webassembly student project
Strings
Student project brainstorm
Student projects
Styling overview
Stylo hacking guide
Summer of Code 2014: Implement XMLHttpRequest
Summer of Code 2016: Fetch API
Summer of Code 2016: File support
Summer of Code 2016: ServiceWorker infrastructure
Summer of Code projects
Summit meeting 2013 09 09
Support WebDriver based tests project
Syncing web platform tests (WPT)
TaskCluster
Testing
Tools
Tracking intermittent failures over time project
Transcription Notes from Servo Architecture talk in Suwon
Transcription notes from rust patterns talk in suwon
Transcription parallelism
Transcription rust concurrency
Transcription rust runtime
Transription layout and acid2
Trinity College Dublin student projects
UPenn student projects
Updating the Rust compiler used by Servo
Upgrading non taskcluster linux CI machines
Upgrading the UWP gstreamer binaries
Upgrading the windows LLVM binaries
Upgrading wptrunner
Using DOM types
Using Rust Spidermonkey Prototype
Using WebWorker Prototype
Version 0.1
Videos and presentations
WebAudio JS interfaces student project
WebAudio nodes student project
WebCompatBug
WebSocket student project
Webdriver student project
Webdriver tests student project
Webrender Overview
Whistler 2019 notes
Whistler Bugzilla
Whistler FFOS
Whistler GFX
Whistler Houdini1
Whistler Houdini2
Whistler Necko
Whistler Oxidation 2019
Work items for new contributors
Workweek COW DOM
Workweek alt js
Workweek android arm
Workweek boot 2 servo
Workweek compiler lints
Workweek displaylist
Workweek dogfooding
Workweek encoding
Workweek generated content
Workweek governance
Workweek graphics stack
Workweek graphics toolkit
Workweek incremental layout
Workweek js bindings status
Workweek layers
Workweek layers2
Workweek pixels
Workweek rasterization
Workweek reftests
Workweek roadmap
Workweek script crate
Workweek security
Workweek string interning
Workweek tables
Workweek writing modes
XML parser student project
infra triage notes
jQuery status
webxr.today support
No results
2
Workweek string interning
Josh Matthews edited this page 2014-07-07 17:47:11 -07:00
Table of Contents
String interning
- kmc: I've been looking into interning in the html parser. a string is either statically interned (many strings we know at compile time), dynamically interned during parsing (css classes, can't predict but need fast matching), not interned. datatype has those three variants. static is good, have macros to allow quick matches on strings that optimizes really well - llvm uses bitmask on immediate 64bit values, so most common tags in first 64 are very quick. dynamic should be orthogonal to static interning; one idea is if string is <sizeof(pointer), store those characters if it's not a static interned string. need invariant that if string can be static variant, always is. so string is either statically interned to small integer, dynamically interned to hashtable, dynamically "interned" immediate, not interned.
- jack: how do you determined inline vs. in hashtable?
- gw: set a bit?
- kmc: yes, tagging. could never intern on non-ascii, and ensure high-bit is set if pointer. current impl is rust enum, so two words, but want to turn that into one.
- jack: could use invalid utf8?
- kmc that's another option. there's a email thread, I'll put it in the etherpad. [ https://groups.google.com/d/msg/mozilla.dev.servo/1K2-Qy27e3A/LqiZ0EIT0EMJ ]
- jack: wasn't there something about sha-something? [ https://groups.google.com/d/msg/mozilla.dev.servo/4NJsID0kwoc/3BWneVlCegwJ https://github.com/mozilla/servo/issues/2217 ]
- kmc: not that good; need at least sha256, so everything gets bigger. optimizes parsing at expense of layout because you don't synchronize on hashtable, but matching is comparing four words and takes more memory. probably using strong crypto hash function is not a win even if fn is fast; interning by storing small strings immediately should be a win, lots of small strings. less synchronizing on hashtable. don't think we should worry on which datastructure, just touch it less. inline interning is one way to do that. start with a simple mutex-guarded hashtable; measure and replace if it's a problem.
- jack: what about a task instead of concurrent hashtable?
- kmc: yes, sync on channel instead of mutex.
- brson: performance will be worse.
- gw: it's a green thread context, though.
- brson: will be faster than a kernel context switch on a mutex
- kmc: yeah but [on Linux] mutex is futex, so you only hit the kernel in the contended case. I like the idea of interface of "this hands out interned strings" so we can measure different implementations easily.
DOM string representation
- kmc: another related topic is how to represent DOM strings.
- simonsapin: utf8 vs. ucs2?
- kmc: yes. I want to be aggressive in servo about pushing utf8 as far as possible in web platform, assuming people aren't using lone surrogates much.
- simonsapin: problem of ucs2 is indexing - slow path is not lone surrogates, but anything outside of Basic Multilingual Plane. if we want utf8 in JS, need to wait on spidermonkey.
- kmc: they like the idea. we can do a lazy conversion - guessing that most strings don't get touched by JS, so we could only convert when necessary. win for memory usage, since most text won't be converted at all. parser should be faster since most content comes off the wire as utf8.
- jack: what do we do right now?
- kmc: we have conversion from utf8 -> utf16 in bindings.
- jack: so we do this already?
- kmc: but we don't save the results
- jack: what would we do for indexing for the lazy conversion?
- kmc: if SM is entirely ucs2 and we do lazy conversion, no issue. interesting question is if there is anything outside of JS that requires indexing into ucs2 strings. if in world where some JS strings are not ucs2 - could flag ones that contain non-BMP chars, or even non-ascii. could do it chunk-wise (chunks of similar encodings).
- zwarich: could be tricky preserving current optimizations that benchmarks rely upon.
- jack: could be fine, since they probably create their own strings and DOM doesn't touch them.
- zwarich: all strings in webkit are ucs2 or 8 bit Latin-1. there are more cases in engine where utf8 would suffice, but if ascii then don't worry about JS interface. have specialized parsers for 8 bit strings, which is big performance win.
- kmc: my parser exploits that all html chars with special meaning are ascii, so can do byte-oriented parsing. already a big performance win and moreso with SSE string instructions. would be interesting research to try fancy adaptive string representations (e.g. succinct rank-selection dicts for UTF-8 indexing), but our short-term approach that doesn't rely on SM hacking is lazy conversion to utf16.
- simonsapin: concern that lone surrogate is not just slow path, but cannot be represented in utf8.
- kmc: only way to get lone surrogate is from JS, in which case could be ucs2 forever.
- simonsapin: if script creates dom node with lone surrogate, in rust dom node we wouldn't store utf8?
- kmc: the lazy-conversion proposal is that DOMString would be an enum of either ucs2 or utf8 string. utf8 only gets created by parsing, ucs2 is created out of script or converted before returning to script. could have small string optimization (attr names could be stored inline). haven't considered text rendering; text node could be either utf8 or utf16, do we convert or have text engine that accepts both?
- zwarich: want to avoid doing conversions everywhere.
- kmc: pretty sure harfbuzz only takes utf8 right now. ucs2 in Servo only exists in SM at this time.
- zwarich: will we convert everything in html parser to utf8? there are other encodings...
- kmc: looked at numbers - utf8 is at least 80-90% of content [ed: 80.8% according to http://w3techs.com/technologies/overview/character_encoding/all], next most common are latin1/1252 [9.9%], [then Windows-1251 for Cyrillic, 2.5%], 16 bit chinese encodings [GB2312, 1.6%], couple others...
- simonsapin: converting to utf8 is not worse than converting to ucs2.
- kmc: better, since halves most content size. almost no content is utf16 so you always have to convert once. if you have invalid utf8, it's handled before parsing stage. previous stage turns invalid utf8 into replacement char.
- kmc: parser needs to handle surrogate pair split across document.write calls, so hold buffer of surrogate until second call. when see leading surrogate, can store one code unit and next code unit is either trailing surrogate (now valid utf8) or is anything else and turn surrogate into replacement character. depends on tiny change to web spec which hsivonen is fine with (fine to turn lone surrogates into replacement character). spec assumes all browsers/parsers are ucs2, but probably fine to change spec/break compat on lone surrogates, as long as proper surrogate pairs are handled properly under this scheme.
More on string interning
- ryan: q about string intern: if strings in different formats, how to compare for equality? static string and dynamic string.
- kmc: for different encoding, natural thing for lazy conversion would be converting to utf16 if necessary.
- kmc: for static vs. dynamic interned strings, initially wrote to compare characters always. but for fast matches, we want the invariant if string can be statically interned it always is. however it may be case that strings are sometimes dynamically or not interned, and can be handled by detecting mismatch.
- ryan: when I tried last time, the conversion on the fly was quite slow. need something to make it fast, but not sure.
- kmc: in my scheme, a dynamically interned string is a pointer to string owned by hashtable. non-interned is an owning pointer to the string. in both cases, you have pointer to characters and can compare. I think you intern dynstring by storing pointer to string itself, so comparison is fast.
- ryan: do they point to the same string?
- kmc: yes, if both strings are dyn, can compare by pointer. if one isn't, need to compare characters.
- gw: we need to refcount hash table entries so we can delete unused interned strings
- kmc: can be expensive, I'm not sure how often they're copied during style recalc/selector matching
- gw: only applies to strings interned through hashtable...
- kmc: if most are statically interned or immediately interned (inline), shouldn't be issue.
- gw: think you need to be able to remove from the hashtable.
- kmc: probably sites that create classes on fly and fill up the table. no way to determine all owners.
- ryan: how do we do case-insensitive comparison on interned strings?
- kmc: tricky. for the parser things mostly get lowercased before interning. in general at interning time if you don't know will be compared caselessly you have no benefit from interning.
- gw: some parts of the DOM today store both original and lower-case versions; we could just intern both [That's not for case-insensitive matching, though.]
- kmc: if you refcounted them together it could save memory. I think many details will wait until we have implementation and see performance numbers.
- zwarich: important to think about 8 bit character case, since webkit and blink see big benefit from optimizing. also when there's SM support, there is no need to be fancy to expose those to JS.
- kmc: ascii-only strings?
- zwarich: Webkit and Blink do it for latin1. when you create string object, can create from ucs2 buffer or ascii buffer. benefit is in JS engine can just pass around smaller, less-fancy indexing data structures.
- kmc: fixed-size for certain number of characters is tried and true.
- zwarich: CSS parser should probably do similar things to HTML parser.
- simonsapin: css parser is utf8, could change. may need to change to handle lone surrogates anyways.
- jack: what would that be used for?
- jdm: maybe multiple appends of surrogates to inline css block?
- simonsapin: not quite like document.write; separate appends are separate rules, so not parsed the same way. still might be found on the web.
- jack: at some point it's ok to break these. who's working on this? gw and kmc? who will be the person to commit it to servo?
- kmc: do we see value in landing interning before parser is ready?
- zwarich: could be good for interface for interning to land before html parser so can update dom bindings, css parser, etc.
- kmc: I could pull out the interning and put it into servo before the parser is done.
- jack: maybe we can finish benchmarking first.
- simonsapin: is it only internal to the parser?
- kmc: clients of the treebuilder see Atom rather than String. can compare atoms, turn atom into string slice. basically looks like an immutable String but sometimes comparison is much faster