- The current bottleneck isn't hardware I/O anymore but system calls.
Each system call causes a CPU mode switch between user mode and kernel mode. The switch costs 1000-1500 CPU cycles.
On a 3GHz processor, 1000-1500 cycles is about 500 nanoseconds. This might sound negligibly fast, but modern SSDs can handle over 1 million operations per second. If each operation requires a system call, you're burning 1.5 billion cycles per second just on mode switching.
A package manager can trigger 50k+ system calls to install React, for example.
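As a rough illustration of why the syscall count matters (my own sketch, not Bun's code; the file name is just an example), reading a file in tiny unbuffered chunks issues one read(2) per chunk, while a buffered reader amortizes the mode switches over a few large reads:

```rust
use std::fs::File;
use std::io::{BufReader, Read};

fn main() -> std::io::Result<()> {
    // Unbuffered: one read(2) syscall (and mode switch) per 16-byte chunk.
    let mut raw = File::open("package.json")?;
    let mut chunk = [0u8; 16];
    let mut small_reads = 0usize;
    while raw.read(&mut chunk)? > 0 {
        small_reads += 1;
    }

    // Buffered: BufReader issues a handful of large read(2) calls (8 KiB by default).
    let mut buffered = BufReader::new(File::open("package.json")?);
    let mut contents = String::new();
    buffered.read_to_string(&mut contents)?;

    println!("{small_reads} small reads vs a few buffered ones");
    Ok(())
}
```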
- JS adds overhead, especially in NodeJS with its layered architecture: there are more steps in the pipeline just to read the contents of a file. Bun reads package.json 2.2x faster than NodeJS because of this.
Another use case is string optimization: package-lock files follow an expected format with predefined strings ("MIT", "license", etc.). These repeated strings can be deduplicated.
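A minimal sketch of the idea (my own illustration, not Bun's actual implementation): repeated strings such as "MIT" are stored once and referenced by a small integer id.

```rust
use std::collections::HashMap;

// Tiny string interner: repeated values share one allocation and one id.
#[derive(Default)]
struct Interner {
    ids: HashMap<String, u32>,
    strings: Vec<String>,
}

impl Interner {
    fn intern(&mut self, s: &str) -> u32 {
        if let Some(&id) = self.ids.get(s) {
            return id; // already stored: reuse the id, no new allocation
        }
        let id = self.strings.len() as u32;
        self.strings.push(s.to_owned());
        self.ids.insert(s.to_owned(), id);
        id
    }
}

fn main() {
    let mut interner = Interner::default();
    let a = interner.intern("MIT");
    let b = interner.intern("MIT");
    assert_eq!(a, b); // thousands of "MIT" license fields share one entry
}
```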
The manifest of each package is stored in a binary format.
Bun stores the response's ETag and sends an If-None-Match header on subsequent requests.
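A hedged sketch of how such conditional requests look in general (illustrative only, not Bun's code; uses the reqwest crate with the blocking feature, and the registry URL is just an example):

```rust
use reqwest::blocking::Client;
use reqwest::StatusCode;

fn fetch_manifest(client: &Client, url: &str, cached_etag: Option<&str>) -> reqwest::Result<()> {
    let mut req = client.get(url);
    if let Some(etag) = cached_etag {
        // If the server still has the same version, it answers 304 with no body.
        req = req.header("If-None-Match", etag);
    }
    let resp = req.send()?;
    if resp.status() == StatusCode::NOT_MODIFIED {
        println!("manifest unchanged, reuse the local cache");
    } else if let Some(etag) = resp.headers().get("etag") {
        // Store the new ETag alongside the cached manifest for the next run.
        println!("new etag: {:?}", etag);
    }
    Ok(())
}

fn main() -> reqwest::Result<()> {
    let client = Client::new();
    fetch_manifest(&client, "https://registry.npmjs.org/react", None)
}
```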
- The buffer for the tarball decompression is allocated in advance. When the data size is unknown, the buffer must be repeatedly reallocated to grow. Bun buffers the entire tarball before decompressing. Most JS packages are under 1 MB, so this is fine (the TypeScript package is around 50 MB and is still handled fine).
The uncompressed size is known from the last 4 bytes of the gzip format (the ISIZE field), so a single seek to the end of the file is enough; see the sketch below.
Bun uses libdeflate, which is optimized with SIMD instructions.
The NodeJS equivalent is a readStream, which is not as efficient as a seek operation.
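A minimal sketch of reading that size field (mine, not Bun's code; the file name is an example): seek to the last 4 bytes of the .gz file and decode them as a little-endian u32, which gives the uncompressed size modulo 2^32.

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

// Read gzip's ISIZE footer: the uncompressed size (mod 2^32), little-endian.
fn gzip_uncompressed_size(path: &str) -> std::io::Result<u32> {
    let mut f = File::open(path)?;
    f.seek(SeekFrom::End(-4))?; // one seek instead of streaming the whole file
    let mut footer = [0u8; 4];
    f.read_exact(&mut footer)?;
    Ok(u32::from_le_bytes(footer))
}

fn main() -> std::io::Result<()> {
    println!("{} bytes uncompressed", gzip_uncompressed_size("package.tgz")?);
    Ok(())
}
```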
- Cache-friendly data layout
Parsed JSON is inefficient because every access hops through a chain of pointers to separately allocated objects and strings. "The CPU accesses a pointer that tells it where Next's data is located in memory. This data then contains yet another pointer to where its dependencies live, which in turn contains more pointers to the actual dependency strings."
Fetching data from RAM is slow, so the CPU loads and caches data in 64-byte cache lines.
Because JSON (and especially JS objects) is scattered randomly across RAM, each cache line is used inefficiently, often for only a few bytes.
This optimization works great for data that's stored sequentially, but it backfires when your data is scattered randomly across memory.
The nested structure of objects creates what's called "pointer chasing", a common anti-pattern in systems programming.
For a project with 1000 packages averaging 5 dependencies, that's 2ms of pure memory latency.
5. Structure of arrays (SoA) instead of array of structs
Bun uses large contiguous buffers. While a single package entry is only 8 bytes, the CPU loads an entire 64-byte cache line, covering packages[0] through packages[7] at once.
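To make the layout difference concrete, here is an illustrative Rust sketch of array-of-structs versus structure-of-arrays (my own example with 8-byte entries, not Bun's actual data structures):

```rust
// Array of structs: each element interleaves all fields of one package.
struct PackageAoS {
    name_offset: u32,
    version_offset: u32,
}

// Structure of arrays: each field lives in its own dense, contiguous buffer.
struct PackagesSoA {
    name_offsets: Vec<u64>,    // 8 bytes per package: 8 packages per 64-byte cache line
    version_offsets: Vec<u64>,
}

fn main() {
    let aos: Vec<PackageAoS> = (0..1000)
        .map(|i| PackageAoS { name_offset: i, version_offset: i })
        .collect();
    let soa = PackagesSoA {
        name_offsets: (0..1000).collect(),
        version_offsets: (0..1000).collect(),
    };
    // Scanning one field in the SoA layout walks sequential memory, so the
    // prefetcher and the cache lines are fully used.
    let sum: u64 = soa.name_offsets.iter().sum();
    println!("{} packages, offsets sum {}", aos.len(), sum);
    let _ = &soa.version_offsets;
}
```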
As a sidenote: Bun originally used a binary lockfile format (bun.lockb) to avoid JSON parsing overhead entirely, but binary files are impossible to review in pull requests and can't be merged when conflicts happen.
- File copying
Copying a file can be expensive because the data first passes through kernel memory. There are ways to optimize it though.
On macOS, clonefile can clone entire directories, so it's an O(n) operation.
Linux has hardlinks, with fallbacks such as ioctl_ficlone for Btrfs and XFS, copy_file_range, or sendfile.
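A small hedged sketch of such a fallback chain in Rust (illustrative only; the paths are made up and Bun's actual logic uses the platform-specific calls listed above):

```rust
use std::fs;
use std::io;
use std::path::Path;

// Try the cheapest strategy first: a hard link shares the same inode and
// copies no data; if that fails (cross-device, unsupported filesystem),
// fall back to a regular copy (std::fs::copy uses copy_file_range on Linux
// when it can).
fn link_or_copy(src: &Path, dst: &Path) -> io::Result<()> {
    match fs::hard_link(src, dst) {
        Ok(()) => Ok(()),
        Err(_) => fs::copy(src, dst).map(|_| ()),
    }
}

fn main() -> io::Result<()> {
    link_or_copy(Path::new("cache/pkg.tgz"), Path::new("node_modules/pkg.tgz"))
}
```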
- Multi-Core parallelism
Bun uses lock-free data structures. It also uses a thread pool with up to 64 concurrent HTTP connections.
Each thread gets its own memory pool.
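As a toy illustration of the per-thread memory pool idea (my sketch, not Bun's implementation): each worker owns its own scratch buffer, so no synchronization is needed on the hot path.

```rust
use std::thread;

fn main() {
    // Pretend each chunk is a tarball to process.
    let chunks: Vec<Vec<u8>> = (0..8).map(|i| vec![i as u8; 1024]).collect();

    let handles: Vec<_> = chunks
        .into_iter()
        .map(|chunk| {
            thread::spawn(move || {
                // Thread-local scratch buffer: grown and reused without locks.
                let mut scratch = Vec::with_capacity(chunk.len());
                scratch.extend_from_slice(&chunk);
                scratch.len()
            })
        })
        .collect();

    let total: usize = handles.into_iter().map(|h| h.join().unwrap()).sum();
    println!("processed {total} bytes across threads");
}
```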
- Conclusion
[...] npm gave us a foundation to build on, yarn made managing workspaces less painful, and pnpm came up with a clever way to save space and speed things up with hardlinks. Each worked hard to solve the problems developers were actually hitting at the time. But that world no longer exists. SSDs are 70× faster, CPUs have dozens of cores, and memory is cheap. The real bottleneck shifted from hardware speed to software abstractions. [...] The tools that will define the next decade of developer productivity are being written right now, by teams who understand that performance bottlenecks shifted when storage got fast and memory got cheap. Installing packages 25x faster isn't "magic": it's what happens when tools are built for the hardware we actually have.
It's absolutely possible to beat even the best sort implementations with domain specific knowledge, careful benchmarking and an understanding of CPU micro-architectures. At the same time, assumptions will become invalid, mistakes can creep in silently and good sort implementations can be surprisingly fast even without prior domain knowledge. If you have access to a high-quality sort implementation, think twice about replacing it with something home-grown.
Optimizing some endpoints in Rust inside a Go app.
The results show nearly a 2x performance improvement.
Modular CSS or a bundle? It follows up on Rethinking modular CSS and build-free design systems.
On first load, modular CSS files are worse.
Once the files are cached, subsequent renders are just 100ms to 200ms slower with modular files than with one bundled file.
Given that the guiding ethos of Kelp is that the web is for everyone, it looks like I should probably be encouraging folks to use a bundled version as the main entry point.
tl;dr: the issue isn’t the @import rule itself, but that files under 1kb often end up the same size or even bigger when gzipped, so you get no compression benefits.
The experiment shows that atomic CSS files are not optimal.
If the files I was importing were larger, it might make sense. As tiny, modular files? Not so much!
The complete library concatenated and gzipped is less than a single HTTP request. It’s just over 25-percent of the transfer size of sending modular gzipped files instead.
The naive Rust implementation is 10 times faster than the Python one.
It remains 6 times faster than the optimized one.
Python has a collections.Counter class that is approximately as fast as the naive Rust version.
In the end, leaving a device plugged in with its battery at 100% all the time is like taking a deep breath and holding it. It's not good.
Is keeping the battery between 20% and 80% still a good idea, given that BMSs (battery management systems) now include dedicated logic for this?
Replace the standard DefaultHasher with ahash::{AHashMap, AHashSet} to gain an 18% improvement.
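A minimal sketch of that swap (assuming the ahash crate is added as a dependency; the 18% figure is the article's measurement, not reproduced here):

```rust
// AHashMap is a drop-in HashMap replacement using the faster,
// non-cryptographic (but DoS-resistant) aHash hasher.
use ahash::AHashMap;

fn main() {
    let mut counts: AHashMap<&str, u32> = AHashMap::new();
    for word in ["a", "b", "a"] {
        *counts.entry(word).or_insert(0) += 1;
    }
    println!("{:?}", counts.get("a")); // Some(2)
}
```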
Only transfer the useful part of a font: it subsets the font into static Unicode ranges, so only the needed parts will be downloaded.
Dump the database as SQL statements instead of copying it along with its indexes, then compress the resulting text file.
# Create the backup
sqlite3 my_db.sqlite .dump | gzip -c > my_db.sqlite.txt.gz
# Reconstruct the database from the compressed text file
gunzip -c my_db.sqlite.txt.gz | sqlite3 my_db_restored.sqlite
As a complete script example:
# Create a gzip-compressed text file on the server
ssh username@server "sqlite3 my_remote_database.db .dump | gzip -c > my_remote_database.db.txt.gz"
# Copy the gzip-compressed text file to my local machine
rsync --progress username@server:my_remote_database.db.txt.gz my_local_database.db.txt.gz
# Remove the gzip-compressed text file from my server
ssh username@server "rm my_remote_database.db.txt.gz"
# Uncompress the text file
gunzip my_local_database.db.txt.gz
# Reconstruct the database from the text file
cat my_local_database.db.txt | sqlite3 my_local_database.db
# Remove the local text file
rm my_local_database.db.txt
There should be better ways though.
Rust's Option has zero memory cost for types with a niche: Option<&T> or Option<Box<T>> is the same size as the bare pointer, because None is encoded as the forbidden null value.
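A small demonstration of this niche optimization (standard Rust behavior, not code from the linked article):

```rust
use std::mem::size_of;

fn main() {
    // None reuses the forbidden null value, so the Option wrapper adds no bytes
    // for non-nullable pointer types.
    assert_eq!(size_of::<&u8>(), size_of::<Option<&u8>>());
    assert_eq!(size_of::<Box<u8>>(), size_of::<Option<Box<u8>>>());

    // Types without a niche need an extra discriminant (plus padding).
    println!("u64: {} bytes, Option<u64>: {} bytes",
             size_of::<u64>(), size_of::<Option<u64>>()); // 8 vs 16 in practice
}
```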
This is huge:
Cores may stay idle for seconds while ready threads are waiting in runqueues. In our experiments, these performance bugs caused many-fold performance degradation for synchronization-heavy scientific applications, 13% higher latency for kernel make, and a 14-23% decrease in TPC-H throughput for a widely used commercial database.
DOI: https://dl.acm.org/doi/10.1145/2901318.2901326
It may be useful to read it completely.
Fixes:
- compare the minimum load of each scheduling group instead of the average
- Linux spawns threads on the same core as their parent thread: a node can steal threads from another node by comparing the average load
and two others
It is useful to read about their tools (an online sanity checker for invariants such as "No core remains idle while another core is overloaded").
During the 00s, dozens of papers described new scheduling algorithms, [... but] a few of them were adopted in mainstream operating systems, mainly because it is not clear how to integrate all these ideas in scheduler safely.
Similarly, the Related Work section describes the current state of research in other domains: performance bugs, kernel correctness, tracing.
The resources are available on Github: https://github.com/jplozi/wastedcores
We can expect an 8x speedup for a big transaction.
Optimization is not always progress in every field.
Similar to https://shaarli.lyokolux.space/shaare/xwiTHQ
Bit operators are the fastest, then static arrays, then dynamic arrays.
Objects are heavy in comparison.
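As an illustration of that ranking (sketched in Rust rather than the benchmark's original JS): packing flags into an integer makes a membership test a shift and a mask instead of an array or object lookup.

```rust
fn main() {
    // Bit operators: 64 boolean flags packed into a single 8-byte integer.
    let mut flags: u64 = 0;
    flags |= 1 << 5;                       // set flag 5
    flags |= 1 << 42;                      // set flag 42
    let has_5 = flags & (1 << 5) != 0;     // test = one AND + one compare
    let has_7 = flags & (1 << 7) != 0;
    println!("{has_5} {has_7}");           // true false

    // Static array: simple indexing, but 64 bytes instead of 8.
    let mut static_flags = [false; 64];
    static_flags[5] = true;

    // Dynamic array: adds heap allocation and growth bookkeeping.
    let mut dynamic_flags = vec![false; 64];
    dynamic_flags[42] = true;
    println!("{} {}", static_flags[5], dynamic_flags[42]);
}
```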