Bad Ways to Get Data
In working on my Sapper export library, I ran into a very peculiar problem: my pages were being generated a lot more slowly than expected. Some numbers for comparison:
- I was generating ~100 pages, and it took about 16 seconds (0.16s/page)
- https://insiderx.com/ does 30k pages, and it takes 15 minutes (0.03s/page)
- an extremely simple tweak of the default Sapper app generates 100k very content-light pages in 95 seconds (0.001s/page)
- Gatsby v2 built about 5,000 pages in 37 seconds and 25k pages in 7.5 minutes (0.01 to 0.02s/page)
- Do Expensive Reads Over and Over
- Do One Big Read Upfront
- Have No Way to Profile
- Have No Way to Test Cheaply
- Use synchronous I/O in an asynchronous environment
- Do Everything Twice
- Have No Idea if You Need to Redo Work
- Have No Way to Estimate Time For Completion
- Ultimate lesson: Have a Plan
So I was at least an order of magnitude off from where I should be. I originally thought that this was justified, as I’d been told Sapper uses Puppeteer to crawl pages, but this was wrong. I then thought it might be JSON serialization/deserialization over the local server, but this was wrong too.
It was the fetching of the data.
Because I had handcooked my own data process to be a single function that, when called, returned a Big Ball of Data, it was easy to code each page to fetch this Big Ball on every single render, whether or not the page needed it. If the reads were expensive (which they were, with syntax highlighting), then they were executed for each file, for each page, over and over and over again.
Once I realized this, the fix was obvious: refactor the data pipeline to do one upfront data read, dump the result somewhere as a static, postprocessed file, and then only refer to that static file, where reads are guaranteed to be cheap.
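Here is a minimal sketch of that refactor, not the actual code from my export library. The names `expensivePostprocess`, `buildDataFile`, and `loadPage` are made up for illustration, and uppercasing a string stands in for the syntax highlighting step.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical stand-in for an expensive step such as syntax highlighting.
let expensiveCalls = 0;
function expensivePostprocess(raw) {
  expensiveCalls += 1;
  return raw.toUpperCase();
}

// Build step: run the expensive work once per source item, then dump
// the postprocessed result to a static JSON file.
function buildDataFile(sources, outFile) {
  const data = {};
  for (const [slug, raw] of Object.entries(sources)) {
    data[slug] = expensivePostprocess(raw);
  }
  fs.writeFileSync(outFile, JSON.stringify(data));
}

// Page step: only ever reads the static file, never redoes the expensive work.
function loadPage(outFile, slug) {
  const data = JSON.parse(fs.readFileSync(outFile, "utf8"));
  return data[slug];
}

const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), "blog-data-"));
const outFile = path.join(tmpDir, "data.json");
const sources = { hello: "hello world", bye: "goodbye" };

buildDataFile(sources, outFile);
const pages = Object.keys(sources).map((slug) => loadPage(outFile, slug));
console.log(pages, "expensive calls:", expensiveCalls);
```

The expensive function now runs once per source item instead of once per item per page. Note that every page still parses the whole file, which is its own cost.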
As of writing, this blog now exports its ~100 pages in 7s (0.07s/page). Still slow but half as bad.
So this is where we are now. Everything happens in sequence. I can’t do anything with the result of the first file read until I am done with the last file read. I am storing the data as one big ball, which means reading it is also one big ball. Which of these are the bottlenecks? Or is it something else I don’t even know about?
It seems like a really good way to be bad at getting data quickly is to not have any information with which to improve.
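Even a crude timer per stage would answer the "which of these is the bottleneck" question. A rough sketch, assuming synchronous stages only; `timed`, `readSource`, and `highlight` are hypothetical names, not part of Sapper or any library:

```javascript
const { performance } = require("perf_hooks");

// Accumulated call counts and elapsed milliseconds, keyed by stage name.
const timings = {};

// Wrap a synchronous function so every call adds its elapsed time
// to the running total for that stage.
function timed(name, fn) {
  return (...args) => {
    const start = performance.now();
    try {
      return fn(...args);
    } finally {
      const entry = timings[name] || (timings[name] = { calls: 0, ms: 0 });
      entry.calls += 1;
      entry.ms += performance.now() - start;
    }
  };
}

// Hypothetical pipeline stages.
const readSource = timed("read", (n) => `source ${n}`);
const highlight = timed("highlight", (text) => text.toUpperCase());

for (let i = 0; i < 3; i++) {
  highlight(readSource(i));
}
console.table(timings);
```

Async stages would need the wrapper to await the result before stopping the clock.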
Some plugin systems encourage tacking on plugin after plugin with no way to debug apart from just running the code and seeing if it works. Plugins should be led by an introspection API that can be logged out and studied without actually executing. A strong parallel is GraphQL’s schema system and GraphiQL.
I use writeFileSync to write files. This essentially means each file write blocks the next. Pretty silly when Node is supposed to be async by default. A very nice way to mutex yourself for convenience.
Currently I read source data, combine and save it to the Big Ball, then query against it for a slice, and then save the slice again. Why not directly read and save slices and be done with it?
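A sketch of the non-blocking alternative using `fs.promises.writeFile` and `Promise.all`. The `writeAll` helper and the file names are illustrative assumptions, not what my exporter does today:

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Start every write at once and wait for all of them,
// instead of blocking on each file in turn.
async function writeAll(dir, files) {
  await Promise.all(
    Object.entries(files).map(([name, contents]) =>
      fs.promises.writeFile(path.join(dir, name), contents)
    )
  );
}

async function main() {
  const dir = await fs.promises.mkdtemp(path.join(os.tmpdir(), "export-"));
  await writeAll(dir, { "a.json": "{}", "b.json": "[]" });
  const names = await fs.promises.readdir(dir);
  return names.sort();
}

main().then((names) => console.log(names));
```

With a very large number of files, starting every write at once can hit the operating system's open-file limit, so a real exporter would probably want to cap concurrency.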
There’s a need to store indexes that cut across slices. Should we do that during, between, or after the slice read/writes?
Immediate mode is easy to write and debug: you throw away all state, so you are only responsible for declaring new state. Retained mode, of course, can be more efficient if you do it right.
The way to know whether you need to redo work is to treat data fetches as “pure functions”: the same inputs always produce the same output. Given this assumption, you can memoize on those inputs and skip any fetch whose inputs haven’t changed.
There’s a related idea as well. Data fetch processes can be expensive both in the initial fetch and in the postprocessing, so there are two things you can do:
- If it is possible to fetch an index, memoize on that index and only fetch the items whose cache entries have been invalidated.
- If not, you can still do the full initial fetch, but memoize the postprocessing.
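A minimal memoization sketch under that purity assumption. `memoize` and `fetchPost` are hypothetical names, and the cache key is just the JSON-serialized arguments, which only works when the arguments are JSON-serializable:

```javascript
// Memoize a function that is assumed to be pure:
// the same arguments always produce the same result.
function memoize(fn) {
  const cache = new Map();
  return (...args) => {
    const key = JSON.stringify(args);
    if (!cache.has(key)) {
      cache.set(key, fn(...args));
    }
    return cache.get(key);
  };
}

// Hypothetical data fetch, with a counter to show how often it really runs.
let fetchCount = 0;
function fetchPost(slug) {
  fetchCount += 1;
  return { slug, html: `<h1>${slug}</h1>` };
}

const cachedFetchPost = memoize(fetchPost);
cachedFetchPost("hello");
cachedFetchPost("hello"); // served from the cache, fetchPost is not called again
cachedFetchPost("bye");
console.log("real fetches:", fetchCount); // real fetches: 2
```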
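Both ideas can be sketched in a few lines. This assumes the index is a plain object mapping item ids to some version marker (a modified time or a revision number, say); `fetchChanged`, `memoizeByContent`, and the fake fetch and postprocess steps are all hypothetical:

```javascript
const crypto = require("crypto");

// Idea 1: compare a cheap index (id -> version) against the cache
// and only fetch the items whose version has changed.
function fetchChanged(index, cache, fetchItem) {
  const refetched = [];
  for (const [id, version] of Object.entries(index)) {
    const hit = cache.get(id);
    if (hit && hit.version === version) continue; // cache entry still valid
    cache.set(id, { version, raw: fetchItem(id) });
    refetched.push(id);
  }
  return refetched;
}

// Idea 2: fetch everything, but memoize the postprocessing
// on a hash of the raw content.
function memoizeByContent(postprocess) {
  const results = new Map();
  return (raw) => {
    const key = crypto.createHash("sha256").update(raw).digest("hex");
    if (!results.has(key)) {
      results.set(key, postprocess(raw));
    }
    return results.get(key);
  };
}

// Hypothetical fetch and postprocess steps.
const fetchItem = (id) => `raw content of ${id}`;
let postprocessCalls = 0;
const highlight = memoizeByContent((raw) => {
  postprocessCalls += 1;
  return raw.toUpperCase();
});

const cache = new Map();
const firstRun = fetchChanged({ a: 1, b: 1 }, cache, fetchItem); // ["a", "b"]
const secondRun = fetchChanged({ a: 1, b: 2 }, cache, fetchItem); // ["b"]

highlight("same text");
highlight("same text"); // identical content, so the postprocess step is skipped
console.log(firstRun, secondRun, postprocessCalls);
```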
The halting problem is undecidable, but you can at least give a credible estimate of time to completion by using cumulative results from the current run and prior results from earlier runs. This is important for long processes like machine learning.
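The simplest version is a linear extrapolation: average time per finished item, multiplied by the items left. A sketch, where `estimateRemainingMs` and its `priorMsPerItem` fallback (a per-item time saved from an earlier run) are names I made up for illustration:

```javascript
// Estimate remaining time by linear extrapolation.
// Uses the average time per finished item in this run, or falls back to
// a per-item time recorded from a prior run when nothing has finished yet.
function estimateRemainingMs({ done, total, elapsedMs, priorMsPerItem = null }) {
  const msPerItem = done > 0 ? elapsedMs / done : priorMsPerItem;
  if (msPerItem === null) return null; // nothing to extrapolate from
  return msPerItem * (total - done);
}

// 25 of 100 pages done after 4 seconds: 160ms per page, 75 pages to go.
console.log(estimateRemainingMs({ done: 25, total: 100, elapsedMs: 4000 })); // 12000

// Nothing done yet, but the last run averaged 70ms per page.
console.log(estimateRemainingMs({ done: 0, total: 100, elapsedMs: 0, priorMsPerItem: 70 })); // 7000
```

This assumes every item costs roughly the same, which is rarely exactly true, but it beats having no estimate at all.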
I think this lesson is a general one from the Database world - before making big data queries, make a Query Plan! Also called a manifest. I guess the difference between a plan and a manifest is that a manifest can have useful info for others to consume, while a plan has no such obligation.
You can then make optimizations across the plan, as well as memoize parts of it based off a manifest, and so on.
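To make that concrete, here is one possible shape for a plan and a manifest. This is an assumption about how such a thing could look, not an existing API: the plan is plain data decided before any work runs, so it can be logged and inspected, and the manifest is what a run leaves behind for the next run to compare against.

```javascript
// Build a plan: a plain list of steps, decided before any work happens.
// Sources whose version matches the previous manifest are marked "skip".
function makePlan(sources, previousManifest = {}) {
  return sources.map(({ id, version }) => ({
    id,
    version,
    action: previousManifest[id] === version ? "skip" : "build",
  }));
}

// Execute a plan and return the manifest (id -> version) it produced.
function runPlan(plan, build) {
  const manifest = {};
  for (const step of plan) {
    if (step.action === "build") build(step.id);
    manifest[step.id] = step.version;
  }
  return manifest;
}

const built = [];
const build = (id) => built.push(id); // hypothetical build step

const plan1 = makePlan([{ id: "a", version: 1 }, { id: "b", version: 1 }]);
const manifest1 = runPlan(plan1, build); // builds a and b

const plan2 = makePlan([{ id: "a", version: 1 }, { id: "b", version: 2 }], manifest1);
console.log(plan2.map((step) => `${step.id}: ${step.action}`)); // [ 'a: skip', 'b: build' ]
runPlan(plan2, build); // only rebuilds b
```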
That any large data pipeline should learn lessons from the data world seems so brutally obvious in retrospect, but we consistently fail to design prototypes and APIs that respect this basic principle.