The bun single binary performs better!
64-bit pointer addresses can be compressed to 32 bits.
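A minimal sketch of the idea, assuming the common arena/index technique: instead of 8-byte pointers, store 4-byte indices into one contiguous buffer. The names (`Arena`, `Node`, `NIL`) are illustrative, not from any particular codebase.

```rust
// Sketch: replace 64-bit pointers with 32-bit indices into one arena.
const NIL: u32 = u32::MAX; // sentinel instead of a null pointer

struct Node {
    value: u32,
    next: u32, // 4-byte "compressed pointer" (an index), not an 8-byte Box
}

struct Arena {
    nodes: Vec<Node>,
}

impl Arena {
    fn push(&mut self, value: u32, next: u32) -> u32 {
        self.nodes.push(Node { value, next });
        (self.nodes.len() - 1) as u32
    }
}

fn main() {
    // A Node is 8 bytes; with a real 64-bit pointer it would be 16.
    assert_eq!(std::mem::size_of::<Node>(), 8);

    let mut arena = Arena { nodes: Vec::new() };
    let tail = arena.push(2, NIL);
    let head = arena.push(1, tail);

    // Walk the chain through indices: 1 -> 2
    let mut sum = 0;
    let mut cur = head;
    while cur != NIL {
        let n = &arena.nodes[cur as usize];
        sum += n.value;
        cur = n.next;
    }
    assert_eq!(sum, 3);
    println!("sum = {sum}");
}
```

Halving the pointer size also means twice as many links fit in each cache line.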
Switching from PNG/JPEG at quality 90 to AVIF at quality 50 saves at least 75% of bandwidth.
The most innovative idea is to compress resources in advance, before they are used.
[Pre-compressing before deploying] means the files can be compressed only once, at the maximum level, and nginx can be told to serve the pre-compressed files directly. Zero CPU per request, and above all a better final ratio, because we can compress harder.
Moreover, Zopfli can compress to .zip with 3-8% better efficiency.
# Serve pre-compressed files generated at build time
gzip_static on;
brotli_static on; # requires libnginx-mod-http-brotli-static
# Fallback for content that is not pre-compressed
gzip on;
gzip_vary on;
gzip_min_length 1024;
gzip_types text/plain text/css text/xml text/javascript
application/javascript application/json
application/xml image/svg+xml;
Brotli compression shrinks the HTML by 81%. The Core Web Vitals score went from 70-85 to 99.
OK, FreeType renders fonts on LCD screens 40% faster
Reading a file is actually slow.
getCurrentThreadUserTime() uses many syscalls because it reads from /proc.
clock_gettime(CLOCK_THREAD_CPUTIME_ID) has only one syscall and a direct function call chain.
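A Linux-only sketch of that single call, assuming the constant and struct layout from the Linux/glibc headers (`CLOCK_THREAD_CPUTIME_ID = 3`, 64-bit `timespec` fields); the function name `thread_cpu_time_ns` is mine:

```rust
// Linux-only sketch: call clock_gettime(CLOCK_THREAD_CPUTIME_ID) directly,
// instead of opening and parsing files under /proc.
#[repr(C)]
struct Timespec {
    tv_sec: i64,
    tv_nsec: i64,
}

const CLOCK_THREAD_CPUTIME_ID: i32 = 3; // value from the Linux headers

extern "C" {
    fn clock_gettime(clk_id: i32, tp: *mut Timespec) -> i32;
}

fn thread_cpu_time_ns() -> Option<u64> {
    let mut ts = Timespec { tv_sec: 0, tv_nsec: 0 };
    // One call and a direct function chain; no /proc round-trips.
    let rc = unsafe { clock_gettime(CLOCK_THREAD_CPUTIME_ID, &mut ts) };
    if rc == 0 {
        Some(ts.tv_sec as u64 * 1_000_000_000 + ts.tv_nsec as u64)
    } else {
        None
    }
}

fn main() {
    // Burn a little CPU so the counter is visibly non-zero.
    let mut x = 0u64;
    for i in 0..1_000_000 {
        x = x.wrapping_add(i);
    }
    let ns = thread_cpu_time_ns().expect("clock_gettime failed");
    println!("thread CPU time: {} ns (x = {})", ns, x);
}
```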
The optimisation can be done, but:
- The kernel policy is clear: don't break userspace
- It's not documented anywhere!
- Author's take: if glibc depends on it, it's not going away.
This is why I like browsing commits of large open source projects. A 40-line deletion eliminated a 400x performance gap. The fix required no new kernel features, just knowledge of a stable-but-obscure Linux ABI detail.
The lessons:
- read the kernel source. POSIX tells you what's portable; the kernel source code tells you what's possible.
- check the old assumptions: revisiting them occasionally pays off.
Optimizations that don't need Rust:
- HTTP range requests for metadata
- Parallel downloads
- Global cache with hardlinks
- Python-free resolution
- PubGrub resolver algorithm
Rust has benefits though:
- zero-copy deserialization
- Thread-level parallelism
- No interpreter startup
- compact version representation: uv packs versions into u64 integers. This micro-optimization compounds across millions of comparisons
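A hedged sketch of the packing idea (not uv's actual bit layout): give each component a fixed-width field in a u64, so comparing two versions is a single integer comparison with the right ordering built in.

```rust
// Illustrative only: 16 bits per component, high bits = most significant
// component, so plain integer ordering matches semantic version ordering.
fn pack_version(major: u64, minor: u64, patch: u64) -> u64 {
    debug_assert!(major < (1 << 16) && minor < (1 << 16) && patch < (1 << 16));
    (major << 32) | (minor << 16) | patch
}

fn main() {
    assert!(pack_version(1, 2, 3) < pack_version(1, 2, 4));
    // No string-comparison trap: 1.10.0 correctly sorts after 1.9.9.
    assert!(pack_version(1, 9, 9) < pack_version(1, 10, 0));
    assert!(pack_version(2, 0, 0) > pack_version(1, 99, 99));
    println!("ordering holds");
}
```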
uv is possible because of several PEPs that have landed since 2016 (too recent for me): PEP 518, 517, 621, and 658. These are the low-hanging fruits: static metadata, no code execution to discover dependencies, and the ability to resolve everything upfront before downloading.
How to optimize a Rust program to squeeze out maximum performance while using as little RAM as possible.
These are obvious to me, but they are good.
Some of them are totally useless for Rust in comparison, though the two have different targets. It is moreover awesome to see 100x improvements.
The heap is a performance killer in Rust. One workaround is to swap in a more efficient memory allocator such as jemalloc or mimalloc.
In Cargo.toml:
[dependencies]
mimalloc = "0.1"
In main.rs:
#[global_allocator]
static GLOBAL: mimalloc::MiMalloc = mimalloc::MiMalloc;
The best performance optimisation is to avoid the heap. There is the heapless crate for that. "The only thing to know is that the size of heapless types and collections needs to be known at compile-time."
A minimalistic UI and a minimal page weight
- The current hardware bottleneck isn't I/O anymore but system calls.
Each system call causes a CPU mode switch between user mode and kernel mode. The switch costs 1000-1500 CPU cycles.
On a 3GHz processor, 1000-1500 cycles is about 500 nanoseconds. This might sound negligibly fast, but modern SSDs can handle over 1 million operations per second. If each operation requires a system call, you're burning 1.5 billion cycles per second just on mode switching.
A package manager can trigger 50k+ system calls to install React, for example.
- JS adds overhead, especially with Node.js and its layers: there are more steps in the pipeline to read a file's contents. Bun reads package.json 2.2x faster than Node.js because of this.
Another use case is string optimization. package-lock files have an expected format with predefined strings ("MIT", "license", etc.). These repeated strings can be optimized.
The manifest of each package is stored in a binary format
Bun stores the response's ETag and sends an If-None-Match header.
The buffer for tarball decompression is allocated in advance. When the data size is unknown, the buffer must be reallocated to grow. Bun buffers the entire tarball before decompressing. Most JS packages are around 1MB, so this is fine (the TypeScript package is 50MB and still OK).
The uncompressed file size is stored in the last 4 bytes of the gzip format.
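A small sketch of reading that trailer, assuming the gzip spec (RFC 1952): the last 4 bytes are ISIZE, the uncompressed size modulo 2^32, little-endian. The function name is mine.

```rust
// Read ISIZE (uncompressed size mod 2^32) from a gzip stream's trailer.
fn gzip_uncompressed_size(data: &[u8]) -> Option<u32> {
    let tail: [u8; 4] = data.get(data.len().checked_sub(4)?..)?.try_into().ok()?;
    Some(u32::from_le_bytes(tail))
}

fn main() {
    // Hand-built gzip stream of the empty input:
    // 10-byte header, empty deflate block, CRC32 = 0, ISIZE = 0.
    let empty_gz: [u8; 20] = [
        0x1f, 0x8b, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x03, // header
        0x03, 0x00, // deflate: final empty block
        0x00, 0x00, 0x00, 0x00, // CRC32
        0x00, 0x00, 0x00, 0x00, // ISIZE
    ];
    assert_eq!(gzip_uncompressed_size(&empty_gz), Some(0));
    println!("ok");
}
```

Knowing the final size up front lets the decompression buffer be allocated once, with no growth reallocations.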
Bun uses libdeflate, optimized with SIMD instructions.
For comparison, Node.js uses a readStream, which is not as efficient as a seek operation.
Cache-friendly data layout
JSON is inefficient because every access is another pointer hop. "The CPU accesses a pointer that tells it where Next's data is located in memory. This data then contains yet another pointer to where its dependencies live, which in turn contains more pointers to the actual dependency strings."
Fetching data from RAM is slow, so the CPU mitigates this by loading data in cache lines.
Because JSON (and especially JS objects) is stored at scattered addresses in RAM, the cache line is wasted or only a few of its bytes get used.
This optimization works great for data that's stored sequentially, but it backfires when your data is scattered randomly across memory.
The nested structure of objects creates what's called "pointer chasing", a common anti-pattern in systems programming.
For a project with 1000 packages averaging 5 dependencies, that's 2ms of pure memory latency.
- Structure of arrays (SoA) instead of arrays of structs
Bun uses large contiguous buffers. While a package entry is only 8 bytes, the CPU loads an entire 64-byte cache line, so packages[0] through packages[7] arrive together.
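A sketch of the SoA layout (the field names are illustrative, not Bun's): each field lives in its own contiguous buffer, so a scan of one field walks straight through memory and every cache line is fully used.

```rust
// Array of structs: each element drags along fields we don't need,
// pushing the interesting values far apart in memory.
#[allow(dead_code)]
struct PackageRow {
    id: u32,
    dep_count: u32,
    name: [u8; 56], // other fields
}

// Struct of arrays: one contiguous buffer per field.
struct PackagesSoA {
    ids: Vec<u32>,
    dep_counts: Vec<u32>,
    names: Vec<[u8; 56]>,
}

fn total_deps(p: &PackagesSoA) -> u64 {
    // Touches only dep_counts: 16 u32 counts per 64-byte cache line.
    p.dep_counts.iter().map(|&c| c as u64).sum()
}

fn main() {
    let n: u32 = 1000;
    let soa = PackagesSoA {
        ids: (0..n).collect(),
        dep_counts: vec![5; n as usize],
        names: vec![[0u8; 56]; n as usize],
    };
    assert_eq!(total_deps(&soa), 5000);
    println!("total deps: {}", total_deps(&soa));
}
```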
As a sidenote: Bun originally used a binary lockfile format (bun.lockb) to avoid JSON parsing overhead entirely, but binary files are impossible to review in pull requests and can't be merged when conflicts happen.
- File copying
Copying a file can be expensive, as the data first runs through kernel memory. There are ways to optimize it though.
On macOS, clonefile can clone entire directories in a single copy-on-write operation.
Linux has hardlinks, with fallbacks such as the FICLONE ioctl for Btrfs and XFS, copy_file_range, or sendfile.
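A minimal sketch of the hardlink trick with the standard library (file names and the `link_from_cache` helper are made up for the demo): the destination gets a second directory entry for the same inode, so no bytes are copied at all.

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

// Materialize a cached file into a project by hardlinking, not copying.
fn link_from_cache(cache_path: &Path, dest: &Path) -> std::io::Result<()> {
    // Same inode, two directory entries: zero data movement.
    fs::hard_link(cache_path, dest)
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir();
    let cache = dir.join("demo-cache-pkg.txt");
    let dest = dir.join("demo-project-pkg.txt");
    let _ = fs::remove_file(&dest); // ignore error if absent

    fs::File::create(&cache)?.write_all(b"package bytes")?;
    link_from_cache(&cache, &dest)?;

    assert_eq!(fs::read(&dest)?, b"package bytes");
    println!("hardlinked ok");
    Ok(())
}
```

A real package manager would fall back to a reflink (FICLONE) or a plain copy when the cache and the project live on different filesystems, since hardlinks can't cross them.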
- Multi-Core parallelism
Bun uses lock-free data structures. It also uses a thread pool of 64 concurrent HTTP connections.
Each thread gets its own memory pool.
- Conclusion
[...] npm gave us a foundation to build on, yarn made managing workspaces less painful, and pnpm came up with a clever way to save space and speed things up with hardlinks. Each worked hard to solve the problems developers were actually hitting at the time. But that world no longer exists. SSDs are 70× faster, CPUs have dozens of cores, and memory is cheap. The real bottleneck shifted from hardware speed to software abstractions. [...] The tools that will define the next decade of developer productivity are being written right now, by teams who understand that performance bottlenecks shifted when storage got fast and memory got cheap. Installing packages 25x faster isn't "magic": it's what happens when tools are built for the hardware we actually have.
It's absolutely possible to beat even the best sort implementations with domain specific knowledge, careful benchmarking and an understanding of CPU micro-architectures. At the same time, assumptions will become invalid, mistakes can creep in silently and good sort implementations can be surprisingly fast even without prior domain knowledge. If you have access to a high-quality sort implementation, think twice about replacing it with something home-grown.
Tuned sorting > tuned hash table > baseline sorting > baseline hash table.
Note that sorting wins, but it depends on the data size: tuned sorting is more efficient for data sizes above 256KiB.
Quicksort handles spread-out data more efficiently than radix sort; radix sort is more efficient on random data.
Using a faster hash algorithm combined with radix sort performs better than quicksort though: hash with MulSwapMul, then radix sort using diverting LSD radix sort.
Hashed radix sort is more restrictive than hash tables though. Where batching is viable, hashed radix sort is typically viable too.
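To make the radix-sort half concrete, here is a minimal LSD radix sort on u64 keys, one byte per pass. This is a simplification: the diverting LSD variant and the MulSwapMul hash from the article are not reproduced here.

```rust
// Minimal LSD radix sort: 8 stable counting-sort passes, one byte each,
// least significant byte first.
fn radix_sort_u64(keys: &mut Vec<u64>) {
    let mut buf = vec![0u64; keys.len()];
    for pass in 0..8 {
        let shift = pass * 8;

        // Histogram of the current byte.
        let mut counts = [0usize; 256];
        for &k in keys.iter() {
            counts[((k >> shift) & 0xff) as usize] += 1;
        }

        // Prefix sums: starting offset of each bucket.
        let mut total = 0;
        for c in counts.iter_mut() {
            let n = *c;
            *c = total;
            total += n;
        }

        // Stable scatter into the buffer, then swap roles.
        for &k in keys.iter() {
            let b = ((k >> shift) & 0xff) as usize;
            buf[counts[b]] = k;
            counts[b] += 1;
        }
        std::mem::swap(keys, &mut buf);
    }
}

fn main() {
    let mut v: Vec<u64> = vec![170, 45, 75, 90, 802, 24, 2, 66];
    radix_sort_u64(&mut v);
    assert_eq!(v, vec![2, 24, 45, 66, 75, 90, 170, 802]);
    println!("{v:?}");
}
```

The "hash then radix sort" idea runs exactly these passes, but over hashed keys, which makes the byte distribution uniform and the bucket sizes predictable.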
Optimizing some endpoints in Rust inside a Go app.
The results show nearly 2x performance.
Computer architectures encompass 13 orders of magnitude of performance! That’s roughly the difference between something like a trivial function call processing data in L1 cache to a remote call out to something in another region. People often make relatively “small” mistakes of 3 or 4 orders of magnitude, which is still crazy if you think about it, but that’s considered to be a minor sin relative to making the architecture diagram look pretty.
It's easy to accidentally perform expensive operations in software systems.
If we can delete a single unnecessary 10^9-order operation, that pays for a lot of unnecessary 10^5-order operations.
Optimize for the higher order of magnitude.