Currently, `prepare_indexed_txs` is still one of the slowest parts of the indexer due to disk access, even when using batched queries.
In this diff, we add a `TxNumCache`, which internally works like a "conveyor belt" of buckets that are filled with `TxId -> TxNum` entries.
Each block puts all of its `TxId -> TxNum` tuples into the bucket at the front of the belt, one by one. If this bucket is full, the bucket at the end of the belt is "dropped off", emptied and moved to the front, moving all other buckets one step back. Then the new empty bucket is filled until it is full, and so on.
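To make the mechanics concrete, here's a minimal sketch of such a conveyor belt cache. This is not the actual Chronik implementation; the `TxId`/`TxNum` aliases and method names are simplified stand-ins, and error handling is omitted:

```rust
use std::collections::{HashMap, VecDeque};

// Simplified stand-ins; the real indexer has its own TxId/TxNum types.
type TxId = [u8; 32];
type TxNum = u64;

/// "Conveyor belt" of buckets: new entries always go into the front bucket;
/// when it is full, the bucket at the back is emptied and rotated to the
/// front, pushing all other buckets one step back.
struct TxNumCache {
    bucket_size: usize,
    buckets: VecDeque<HashMap<TxId, TxNum>>,
}

impl TxNumCache {
    /// Create a cache with `num_buckets` buckets (assumed >= 1), each holding
    /// up to `bucket_size` entries, i.e. roughly 40B * num_buckets * bucket_size.
    fn new(num_buckets: usize, bucket_size: usize) -> Self {
        TxNumCache {
            bucket_size,
            buckets: (0..num_buckets)
                .map(|_| HashMap::with_capacity(bucket_size))
                .collect(),
        }
    }

    /// Put one TxId -> TxNum tuple into the bucket at the front of the belt.
    fn insert(&mut self, txid: TxId, tx_num: TxNum) {
        if self.buckets[0].len() >= self.bucket_size {
            // Front bucket is full: "drop off" the bucket at the end of the
            // belt, empty it, and move it to the front.
            let mut last = self.buckets.pop_back().unwrap();
            last.clear();
            self.buckets.push_front(last);
        }
        self.buckets[0].insert(txid, tx_num);
    }

    /// Look up a cached TxNum, checking the newest bucket first.
    fn get(&self, txid: &TxId) -> Option<TxNum> {
        self.buckets.iter().find_map(|bucket| bucket.get(txid).copied())
    }
}

fn main() {
    // Default-like configuration: 10 buckets of 25000 entries (~10MB).
    let mut cache = TxNumCache::new(10, 25_000);
    cache.insert([0xab; 32], 1234);
    assert_eq!(cache.get(&[0xab; 32]), Some(1234));
}
```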
The required memory is 40B per TxId (32-byte txid + 8-byte TxNum, + small HashMap overhead), so a 25000 tx bucket requires 1MB; meaning 10 buckets require 10MB, which is the default. Users can configure the number of buckets and the bucket size to tune Chronik to their machine; the default of 10MB strikes a good balance of RAM usage and performance.
Benchmarks for the first 300000ish blocks show a significant speedup of 5.24x for `prepare_indexed_txs` when using a bucket size of 100k (40MB RAM), and a speedup of 2.26x when using a bucket size of 10k (4MB RAM). Based on this, a default bucket size of 25k was chosen, but it's recommended to set it to 100k if possible. One benchmark with 100 buckets of bucket size 100k (400MB RAM) showed a speedup of 7.4x.
I also benchmarked using the faster seahash hashing algorithm for HashMap (see the sketch after the table below), but it made no big difference, so for now it is left at the slower but cryptographically secure default hash. This indicates that we're not compute-bound but memory-bound, which is to be expected. In the future, we might want to switch to a faster hashing algorithm.
| bench | 300k total time [s] | 300k time prepare [s] |
|---|---|---|
| no prepare | 1066.85 | 0 |
| old, before D15571 | 4375.39 | 3308.54 |
| base | 3148.51 | 2081.66 |
| d10-b10k | 1988.43 | 921.58 |
| d10-b100k | 1464.02 | 397.17 |
| d100-b10k | 1664.05 | 597.20 |
| d100-b100k | 1348.24 | 281.39 |
| d10-b10k-sea | 1907.64 | 840.79 |
| d10-b100k-sea | 1488.08 | 421.23 |
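For reference, the seahash experiment amounts to swapping the `HashMap` hasher. A minimal sketch of what that looks like, assuming the `seahash` crate (this is not part of the diff, since the default hasher was kept):

```rust
use std::collections::HashMap;
use std::hash::BuildHasherDefault;

use seahash::SeaHasher;

// Same map shape as in the cache sketch above, but hashed with seahash
// instead of the default SipHash-based hasher; swapping the hasher is a
// type-level change only.
type SeaHashMap = HashMap<[u8; 32], u64, BuildHasherDefault<SeaHasher>>;

fn main() {
    let mut bucket: SeaHashMap = HashMap::default();
    bucket.insert([0xab; 32], 1234);
    assert_eq!(bucket.get(&[0xab; 32]), Some(&1234));
}
```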
For reference, a previous version of this diff used a simpler approach, where the cache simply stores the last `depth_blocks` worth of tx nums in a VecDeque of HashMaps: the oldest block of txids is dropped off at the end, and the new block is added at the beginning. At 40B per TxId, a 2000 tx block required about 80kB, so a cache of depth 10 needed around 800kB, depth 100 about 8MB, and depth 1000 about 80MB; with depth 100, this gave a speedup of about 1.69x for the first 300000ish blocks. Benchmarks however showed this is slower than the conveyor belt cache; they are kept in this diff for future reference, and a sketch of that approach follows the table below.
| bench | 300k total time [s] | 300k time prepare [s] | 400k total time [s] | 400k time prepare [s] |
|---|---|---|---|---|
| no prepare | 1066.85 | 0 | 3476.10 | 0 |
| old, before D15571 | 4375.39 | 3308.54 | | |
| D15571 | 3148.51 | 2081.66 | 11453.99 | 7977.89 |
| this diff, depth = 10 | 2749.21 | 1682.36 | 8595.97 | 5119.87 |
| this diff, depth = 100 | 2297.97 | 1231.12 | 7214.19 | 3738.08 |
| this diff, depth = 1000 | 2969.34 | 1902.49 | 10865.72 | 7389.62 |
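A minimal sketch of that earlier per-block approach, again with simplified stand-in types rather than the code that was actually benchmarked:

```rust
use std::collections::{HashMap, VecDeque};

type TxId = [u8; 32];
type TxNum = u64;

/// Earlier design: keep one HashMap per block for the last `depth_blocks`
/// blocks. Adding a block pushes its map to the front and drops the oldest
/// block's map off the end.
struct BlockTxNumCache {
    depth_blocks: usize,
    blocks: VecDeque<HashMap<TxId, TxNum>>,
}

impl BlockTxNumCache {
    fn new(depth_blocks: usize) -> Self {
        BlockTxNumCache {
            depth_blocks,
            blocks: VecDeque::with_capacity(depth_blocks),
        }
    }

    /// Add all TxId -> TxNum tuples of a newly indexed block.
    fn add_block(&mut self, block_txs: HashMap<TxId, TxNum>) {
        if self.blocks.len() >= self.depth_blocks {
            // Drop the oldest block of txids off the end.
            self.blocks.pop_back();
        }
        self.blocks.push_front(block_txs);
    }

    /// Look up a cached TxNum, checking the most recent blocks first.
    fn get(&self, txid: &TxId) -> Option<TxNum> {
        self.blocks.iter().find_map(|block| block.get(txid).copied())
    }
}
```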