- Organizations don't use that much data.
Of queries that scan at least 1 MB, the median query scans about 100 MB. The 99.9th percentile query scans about 300 GB.
But 99.9% of real-world queries could run on a single large node.
I did the analysis for this post using DuckDB, and it can scan the entire 11 GB Snowflake query sample on my Mac Studio in a few seconds.
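As a rough illustration of that kind of analysis (the file name and the `bytes_scanned` column are assumptions for the sketch, not the actual schema of the Snowflake sample), a DuckDB percentile query over a Parquet dump could look like this:

```python
# Sketch only: assumes a Parquet file with a bytes_scanned column;
# the real Snowflake query sample may use different names.
import duckdb

result = duckdb.sql("""
    SELECT
        quantile_cont(bytes_scanned, 0.5)   AS median_bytes,   -- ~100 MB in the sample
        quantile_cont(bytes_scanned, 0.999) AS p999_bytes      -- ~300 GB in the sample
    FROM read_parquet('snowflake_queries.parquet')
    WHERE bytes_scanned >= 1e6   -- only queries scanning at least 1 MB
""").fetchall()

print(result)
```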
When we think about new database architectures, we’re hypnotized by scaling limits. If it can’t handle petabytes, or at least terabytes, it’s not in the conversation. But most applications will never see a terabyte of data, even if they’re successful. We’re using jackhammers to drive finish nails.
As an industry, we’ve become absolutely obsessed with “scale”, seemingly at the expense of all else: simplicity, ease of maintenance, and reducing developer cognitive load.
Years it takes for data to grow 10x, by annual growth rate (quick check below the list):
10% -> ~24y
50% -> ~5.7y
200% -> ~2.1y
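These figures follow from a simple compounding formula: years to 10x = ln(10) / ln(1 + annual growth rate). A quick check:

```python
import math

def years_to_10x(annual_growth: float) -> float:
    """Years until data grows 10x at a constant annual growth rate."""
    return math.log(10) / math.log(1 + annual_growth)

for rate in (0.10, 0.50, 2.00):
    print(f"{rate:.0%} per year -> {years_to_10x(rate):.1f} years")
# 10% -> 24.2 years, 50% -> 5.7 years, 200% -> 2.1 years
```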
Scaling is also a luxury problem in many cases: needing it means the business is doing well.
- Hardware is getting really, really good
In the last decade:
SSDs got ~5.6x cheaper, hold ~30x more data on a single drive, and got ~11x faster in sequential reads and ~18x faster in random reads.
CPU core counts went up ~2.6x, price per core dropped at least 5x, and each Turin core is probably 2x-2.5x faster as well.
Distributed systems also become overkill for more and more workloads as hardware improves faster than data grows.