[Chronik] Add `TxNumCache` to speed up `prepare_indexed_txs`
03fb995c1d54
Actions

Description

[Chronik] Add TxNumCache to speed up prepare_indexed_txs

Summary:
Currently, prepare_indexed_txs is still one of the slowest parts of the indexer due to the disk access, even when using batched queries.

We add TxNumCache, which internally works like a "conveyor belt" of buckets that are filled with TxId -> TxNum.

Each block puts all their TxId -> TxNum tuples into the bucket at the front of the belt, one-by-one. If this bucket is full, the bucket at the end of the belt is "dropped off", emptied and moved to the front, moving all other buckets one step back. Then the new empty bucket will be filled, until it is full, etc.

The required memory will be 40B per TxId (32-byte txid + 8-byte TxNum, +small HashMap overhead), so a 100000 tx bucket will require 4MB; meaning 10 buckets require 40MB, which is the default. Users can configure the number of buckets (-chroniktxnumcachebuckets) and bucket size (-chroniktxnumcachebucketsize) to tune Chronik to their machine; the default of 40MB strikes a good balance of RAM usage and performance.

Benchmarks for the first 300000ish blocks show a significant speedup of 5.24x when using a bucket size of 100k (40MB RAM), and a speedup of 2.26x (4MB RAM) when using a bucket size of 10k. Based on this, a bucket size of 100k was chosen. One benchmark with 100 buckets and bucket size 100k (400MB RAM) showed a speedup of 7.4x, but RAM usage is excessive.

Using 5 buckets and a bucket size of 200k is slower than 10 buckets a 100k. Using 10 buckets a 200k barely speeds up indexing but uses 2x the RAM.

I also benchmarked using the faster seahash hashing algorithm for HashMap, but it made no big difference, so for now it is left at the slower but cryptographically secure default hash. This indicates that we're not compute but memory bound, which is to be expected. In the future, we might want to use a faster hashing algorithm.

bench	300k total time [s]	300k time prepare [s]
noprepare	1066.85	0
old, before D15571	4375.39	3308.54
base	3148.51	2081.66
d10-b10k	1988.43	921.58
d10-b25k	1672.90	606.05
d10-b100k	1464.02	397.17
d10-b200k	1432.11	365.26
d100-b10k	1664.05	597.20
d100-b100k	1348.24	281.39
d10-b10k-sea	1907.64	840.79
d10-b100k-sea	1488.08	421.23
d5-b200k	1648.01	581.16

For reference, a previous version of this used a simpler approach, where the cache simply stores the last depth_blocks worth of tx nums in a VecDeque of HashMaps, like a queue of blocks. The oldest block of txids was dropped off at the end, and the new block is added at the beginning. Benchmarks however showed this is slower than using the conveyor belt cache. Benchmarks are kept in this diff for future reference.

bench	300k total time [s]	300k time prepare [s]	400k total time [s]	400k time prepare [s]
no prepare	1066.85	0	3476.10	0
old, before D15571	4375.39	3308.54
D15571	3148.51	2081.66	11453.99	7977.89
block queue, depth = 10	2749.21	1682.36	8595.97	5119.87
block queue, depth = 100	2297.97	1231.12	7214.19	3738.08
block queue, depth = 1000	2969.34	1902.49	10865.72	7389.62