Currently, prepare_indexed_txs is still one of the slowest parts of the indexer due to disk access, even when using batched queries.
We add TxNumCache, which internally works like a "conveyor belt" of buckets that are filled with TxId -> TxNum entries.
Each block puts all of its TxId -> TxNum tuples into the bucket at the front of the belt, one-by-one. When this bucket is full, the bucket at the end of the belt is "dropped off", emptied and moved to the front, shifting all other buckets one step back. The new empty bucket is then filled until it is full, and so on.
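A minimal sketch of the conveyor-belt mechanism, for illustration only (the struct and names like `TxNumCacheSketch` and the plain `[u8; 32]` TxId type are placeholders, not the actual implementation):

```rust
use std::collections::{HashMap, VecDeque};

type TxId = [u8; 32];
type TxNum = u64;

struct TxNumCacheSketch {
    /// Buckets ordered front (index 0) to back; new entries go into the front.
    buckets: VecDeque<HashMap<TxId, TxNum>>,
    bucket_size: usize,
}

impl TxNumCacheSketch {
    fn new(num_buckets: usize, bucket_size: usize) -> Self {
        TxNumCacheSketch {
            buckets: (0..num_buckets)
                .map(|_| HashMap::with_capacity(bucket_size))
                .collect(),
            bucket_size,
        }
    }

    /// Insert one TxId -> TxNum tuple into the front bucket. When the front
    /// bucket is full, the bucket at the back of the belt is emptied and
    /// moved to the front, shifting all other buckets one step back.
    fn insert(&mut self, txid: TxId, tx_num: TxNum) {
        if self.buckets[0].len() >= self.bucket_size {
            let mut emptied = self.buckets.pop_back().expect("at least 1 bucket");
            emptied.clear(); // keeps the allocated capacity
            self.buckets.push_front(emptied);
        }
        self.buckets[0].insert(txid, tx_num);
    }

    /// Look up a TxId in all buckets, from the front (most recent) to the back.
    fn get(&self, txid: &TxId) -> Option<TxNum> {
        self.buckets.iter().find_map(|bucket| bucket.get(txid).copied())
    }
}

fn main() {
    let mut cache = TxNumCacheSketch::new(10, 100_000);
    cache.insert([0x11; 32], 42);
    assert_eq!(cache.get(&[0x11; 32]), Some(42));
}
```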
The required memory is about 40B per entry (32-byte txid + 8-byte TxNum, plus small HashMap overhead), so a bucket of 100000 txs requires 4MB, and the default of 10 buckets requires 40MB. Users can configure the number of buckets (-chroniktxnumcachebuckets) and the bucket size (-chroniktxnumcachebucketsize) to tune Chronik to their machine; the default of 40MB strikes a good balance between RAM usage and performance.
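For illustration, the defaults could be spelled out explicitly on the command line (assuming Chronik is enabled via -chronik; the values shown are the defaults described above):

```
bitcoind -chronik -chroniktxnumcachebuckets=10 -chroniktxnumcachebucketsize=100000
```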
Benchmarks for the first ~300000 blocks show a significant speedup of 5.24x for the prepare step when using a bucket size of 100k (40MB RAM), and a speedup of 2.26x (4MB RAM) when using a bucket size of 10k (speedups relative to the prepare time of "base" in the table below). Based on this, a bucket size of 100k was chosen. One benchmark with 100 buckets and a bucket size of 100k (400MB RAM) showed a speedup of 7.4x, but the RAM usage is excessive.
Using 5 buckets with a bucket size of 200k is slower than 10 buckets of 100k. Using 10 buckets of 200k barely speeds up indexing further but uses 2x the RAM.
I also benchmarked using the faster seahash hashing algorithm for the HashMap, but it made no significant difference, so for now the slower but cryptographically secure default hasher is kept. This indicates that we're memory-bound rather than compute-bound, which is to be expected. In the future, we might want to switch to a faster hashing algorithm.
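For reference, swapping the hasher would look roughly like this (a sketch assuming the `seahash` crate, whose `SeaHasher` implements `std::hash::Hasher` and `Default`; this is not what the code currently does, since the default hasher is kept):

```rust
use std::collections::HashMap;
use std::hash::BuildHasherDefault;

use seahash::SeaHasher;

type TxId = [u8; 32];
type TxNum = u64;
// HashMap keyed with seahash instead of the default SipHash
type SeaHashMap = HashMap<TxId, TxNum, BuildHasherDefault<SeaHasher>>;

fn main() {
    let mut bucket = SeaHashMap::default();
    bucket.insert([0x11; 32], 42);
    assert_eq!(bucket.get(&[0x11; 32]), Some(&42));
}
```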
| bench | 300k total time [s] | 300k time prepare [s] |
|---|---|---|
| noprepare | 1066.85 | 0 |
| old, before D15571 | 4375.39 | 3308.54 |
| base | 3148.51 | 2081.66 |
| d10-b10k | 1988.43 | 921.58 |
| d10-b25k | 1672.90 | 606.05 |
| d10-b100k | 1464.02 | 397.17 |
| d10-b200k | 1432.11 | 365.26 |
| d100-b10k | 1664.05 | 597.20 |
| d100-b100k | 1348.24 | 281.39 |
| d10-b10k-sea | 1907.64 | 840.79 |
| d10-b100k-sea | 1488.08 | 421.23 |
| d5-b200k | 1648.01 | 581.16 |
For reference, a previous version of this diff used a simpler approach, where the cache simply stores the last depth_blocks worth of tx nums in a VecDeque of HashMaps, like a queue of blocks: the oldest block of txids is dropped off the end and the new block is added at the front (see the sketch below). However, benchmarks showed this is slower than the conveyor-belt cache. The benchmarks are kept in this diff for future reference.
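A minimal sketch of that earlier block-queue approach, for comparison (names like `BlockQueueCacheSketch` are illustrative, not the actual code of the previous version):

```rust
use std::collections::{HashMap, VecDeque};

type TxId = [u8; 32];
type TxNum = u64;

struct BlockQueueCacheSketch {
    depth_blocks: usize,
    /// One HashMap per block, newest block at the front.
    blocks: VecDeque<HashMap<TxId, TxNum>>,
}

impl BlockQueueCacheSketch {
    fn new(depth_blocks: usize) -> Self {
        BlockQueueCacheSketch {
            depth_blocks,
            blocks: VecDeque::new(),
        }
    }

    /// Add a whole block's TxId -> TxNum map at the front; drop the oldest
    /// block off the end once the queue exceeds `depth_blocks`.
    fn add_block(&mut self, block_tx_nums: HashMap<TxId, TxNum>) {
        self.blocks.push_front(block_tx_nums);
        if self.blocks.len() > self.depth_blocks {
            self.blocks.pop_back();
        }
    }

    /// Look up a TxId in the cached blocks, newest to oldest.
    fn get(&self, txid: &TxId) -> Option<TxNum> {
        self.blocks.iter().find_map(|block| block.get(txid).copied())
    }
}

fn main() {
    let mut cache = BlockQueueCacheSketch::new(10);
    cache.add_block(HashMap::from([([0x11u8; 32], 42u64)]));
    assert_eq!(cache.get(&[0x11; 32]), Some(42));
}
```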
| bench | 300k total time [s] | 300k time prepare [s] | 400k total time [s] | 400k time prepare [s] |
|---|---|---|---|---|
| no prepare | 1066.85 | 0 | 3476.10 | 0 |
| old, before D15571 | 4375.39 | 3308.54 | | |
| D15571 | 3148.51 | 2081.66 | 11453.99 | 7977.89 |
| block queue, depth = 10 | 2749.21 | 1682.36 | 8595.97 | 5119.87 |
| block queue, depth = 100 | 2297.97 | 1231.12 | 7214.19 | 3738.08 |
| block queue, depth = 1000 | 2969.34 | 1902.49 | 10865.72 | 7389.62 |