Page MenuHomePhabricator

fix p2p_compactblocks flakiness
ClosedPublic

Authored by markblundeberg on Feb 6 2020, 07:46.

Details

Summary

p2p_compactblocks has been failing lately, example:
https://build.bitcoinabc.org/viewLog.html?buildId=28193

The reason is a bit subtle, and involves both CTOR and an
apparent (mild) bug in the handling of orphan transactions.

  • When it fails, it is expecting a getdata-block with a single hash. It receives the getdata-block BUT also a getdata-tx with five txids. Hence the error message 5 != 1.
  • It doesn't make sense that any getdata-tx is sent at this time. This test case is test_incorrect_blocktxn_response and the reason for the getdata-tx is THREE TEST CASES EARLIER in test_getblocktxn_requests.
  • test_getblocktxn_requests is sending 10 transactions out of order. Many of them get put in orphan pool, but eventually they are de-orphaned (accepted) and they are all mined.
  • For some reason, bitcoind is re-requesting txes that had been de-orphaned a long time ago. (see stuff about tx_process_time in net_processing.cpp) This is probably a bug.

This fix doesn't address the root cause bug but does make this test
not induce orphan transactions, which probably it shouldn't be
relying on in the first place. I will be looking into the origin
of the incorrect orphan tx handling as well.

Test Plan
ninja all
./test/functional/test_runner.py --nocleanup p2p_compactblocks

Examine node debug.logs to ensure the absence of "orphan tx" logs
and spurious "Requesting tx" logs (they may occur when node0 and
node1 talk to each other, in which case they will happen
immediately after receiving a tx INV from the partner node.)

Diff Detail

Repository
rABC Bitcoin ABC
Branch
p2pcompact_deflake
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 9295
Build 16527: Default Diff Build & Tests
Build 16526: arc lint + arc unit

Event Timeline

good-getdata and bad-getdata near end of test_framework.log: https://build.bitcoinabc.org/repository/download/BitcoinABC_Master_BitcoinAbcMasterAsan/28193:id/test-output.tar.gz%21/bitcoin_test_runner_20200205_183212/p2p_compactblocks_542/test_framework.log

2020-02-06T02:36:05.841000Z TestFramework.mininode (DEBUG): Received message from 127.0.0.1:15337: msg_getdata(inv=[CInv(type=Block hash=7fdb3ff134fc96bf6b2b88e6c102a2ba214c40c4f4652a3599561b7622fd03b5)])
2020-02-06T02:36:05.842000Z TestFramework.mininode (DEBUG): Received message from 127.0.0.1:15337: msg_getdata(inv=[CInv(type=TX hash=6683ac8f07e73c074b211a2d372fe40000a8946d1945280f683cf20731e8380a), CInv(type=TX hash=bdf21b88966bbbbc4d243ac054519f7ce4de6b4b08ae7a905b08b3ff900740ec), CInv(type=TX hash=7ecb1213fa09ee5ebc348317601ce4d0f0ce88d16498f0a2500f4de0adca0df3), CInv(type=TX hash=dd74d819470a1f05788568a7762fa4a362798ad670ee880b632c78a9f5694799), CInv(type=TX hash=e76a9b0803ccbe34ace219f127fed6d7b63ef9561332a4565bf0114c50c74eed)])

corresponding entries near end of debug.log: https://build.bitcoinabc.org/repository/download/BitcoinABC_Master_BitcoinAbcMasterAsan/28193:id/test-output.tar.gz%21/bitcoin_test_runner_20200205_183212/p2p_compactblocks_542/node1/regtest/debug.log

2020-02-06T02:36:05.840019Z sending getdata (37 bytes) peer=2
2020-02-06T02:36:05.840368Z Requesting tx 6683ac8f07e73c074b211a2d372fe40000a8946d1945280f683cf20731e8380a peer=2
2020-02-06T02:36:05.840433Z Requesting tx bdf21b88966bbbbc4d243ac054519f7ce4de6b4b08ae7a905b08b3ff900740ec peer=2
2020-02-06T02:36:05.840501Z Requesting tx 7ecb1213fa09ee5ebc348317601ce4d0f0ce88d16498f0a2500f4de0adca0df3 peer=2
2020-02-06T02:36:05.840556Z Requesting tx dd74d819470a1f05788568a7762fa4a362798ad670ee880b632c78a9f5694799 peer=2
2020-02-06T02:36:05.840617Z Requesting tx e76a9b0803ccbe34ace219f127fed6d7b63ef9561332a4565bf0114c50c74eed peer=2
2020-02-06T02:36:05.840669Z sending getdata (181 bytes) peer=2

origin of the offending log messages: https://github.com/Bitcoin-ABC/bitcoin-abc/blob/c0dcb98975dad7c7e654d0da567d9c4c5b2a342a/src/net_processing.cpp#L4680

Note that you can check all those txids and see they were orphan txes earlier on in the debug.log ... but they definitely got mined right after being deorphaned (this is the last two paragraphs of test_getblocktxn_requests), earlier on in the debug.log:

2020-02-06T02:36:03.807075Z received: tx (101 bytes) peer=2
2020-02-06T02:36:03.810412Z stored orphan tx 21d09b4587f1a5fad1334904d669a95559535a40f8425ecbcca3b0228b3313f0 (mapsz 1 outsz 1)
2020-02-06T02:36:03.811939Z received: tx (101 bytes) peer=2
2020-02-06T02:36:03.812389Z stored orphan tx 2f606638b87b7352607c753e017a7aa9266e64e77535a7cf67451074455f4eb8 (mapsz 2 outsz 2)
2020-02-06T02:36:03.814300Z received: tx (101 bytes) peer=2
2020-02-06T02:36:03.814744Z stored orphan tx 6683ac8f07e73c074b211a2d372fe40000a8946d1945280f683cf20731e8380a (mapsz 3 outsz 3)
2020-02-06T02:36:03.815374Z received: tx (101 bytes) peer=2
2020-02-06T02:36:03.815848Z stored orphan tx 7ecb1213fa09ee5ebc348317601ce4d0f0ce88d16498f0a2500f4de0adca0df3 (mapsz 4 outsz 4)
2020-02-06T02:36:03.819822Z received: tx (101 bytes) peer=2
2020-02-06T02:36:03.820277Z stored orphan tx 8cb9459394be50b23cd540e08f664b6469256325b14d829c06f5b74a6bd91436 (mapsz 5 outsz 5)
2020-02-06T02:36:03.820916Z received: tx (101 bytes) peer=2
2020-02-06T02:36:03.821313Z stored orphan tx bdf21b88966bbbbc4d243ac054519f7ce4de6b4b08ae7a905b08b3ff900740ec (mapsz 6 outsz 6)
2020-02-06T02:36:03.821942Z received: tx (101 bytes) peer=2
2020-02-06T02:36:03.822362Z stored orphan tx c30a0b54fe21422ebf97e2ab2bbbb29013f2e98629dc3d4df0a4fdc48de5ebd6 (mapsz 7 outsz 7)
2020-02-06T02:36:03.829884Z received: tx (101 bytes) peer=2
2020-02-06T02:36:03.830358Z stored orphan tx dd74d819470a1f05788568a7762fa4a362798ad670ee880b632c78a9f5694799 (mapsz 8 outsz 8)
2020-02-06T02:36:03.831006Z received: tx (101 bytes) peer=2
2020-02-06T02:36:03.831418Z stored orphan tx e62312945183d7343499c98ef1f40d10da5ae6eb91e471fe200bf08f5158590a (mapsz 9 outsz 9)
2020-02-06T02:36:03.832026Z received: tx (101 bytes) peer=2
2020-02-06T02:36:03.832837Z Checking mempool with 1 transactions and 1 inputs
2020-02-06T02:36:03.833058Z AcceptToMemoryPool: peer=2: accepted e76a9b0803ccbe34ace219f127fed6d7b63ef9561332a4565bf0114c50c74eed (poolsz 1 txn, 0 kB)
2020-02-06T02:36:03.833704Z    accepted orphan tx c30a0b54fe21422ebf97e2ab2bbbb29013f2e98629dc3d4df0a4fdc48de5ebd6
2020-02-06T02:36:03.833818Z Checking mempool with 2 transactions and 2 inputs
2020-02-06T02:36:03.834479Z    accepted orphan tx dd74d819470a1f05788568a7762fa4a362798ad670ee880b632c78a9f5694799
2020-02-06T02:36:03.834592Z Checking mempool with 3 transactions and 3 inputs
2020-02-06T02:36:03.835389Z    accepted orphan tx bdf21b88966bbbbc4d243ac054519f7ce4de6b4b08ae7a905b08b3ff900740ec
2020-02-06T02:36:03.835531Z Checking mempool with 4 transactions and 4 inputs
2020-02-06T02:36:03.836408Z    accepted orphan tx 2f606638b87b7352607c753e017a7aa9266e64e77535a7cf67451074455f4eb8
2020-02-06T02:36:03.836521Z Checking mempool with 5 transactions and 5 inputs
2020-02-06T02:36:03.837454Z    accepted orphan tx 7ecb1213fa09ee5ebc348317601ce4d0f0ce88d16498f0a2500f4de0adca0df3
2020-02-06T02:36:03.837580Z Checking mempool with 6 transactions and 6 inputs
2020-02-06T02:36:03.838567Z    accepted orphan tx 6683ac8f07e73c074b211a2d372fe40000a8946d1945280f683cf20731e8380a
2020-02-06T02:36:03.838674Z Checking mempool with 7 transactions and 7 inputs
2020-02-06T02:36:03.839742Z    accepted orphan tx 21d09b4587f1a5fad1334904d669a95559535a40f8425ecbcca3b0228b3313f0
2020-02-06T02:36:03.839910Z Checking mempool with 8 transactions and 8 inputs
2020-02-06T02:36:03.841008Z    accepted orphan tx 8cb9459394be50b23cd540e08f664b6469256325b14d829c06f5b74a6bd91436
2020-02-06T02:36:03.841122Z Checking mempool with 9 transactions and 9 inputs
2020-02-06T02:36:03.842362Z    accepted orphan tx e62312945183d7343499c98ef1f40d10da5ae6eb91e471fe200bf08f5158590a
2020-02-06T02:36:03.842492Z Checking mempool with 10 transactions and 10 inputs
2020-02-06T02:36:03.857671Z received: ping (8 bytes) peer=2
2020-02-06T02:36:03.857745Z sending pong (8 bytes) peer=2
2020-02-06T02:36:03.870124Z Received a POST request for / from 127.0.0.1:51776
2020-02-06T02:36:03.870325Z ThreadRPCServer method=getrawmempool
2020-02-06T02:36:03.872958Z received: cmpctblock (252 bytes) peer=2
2020-02-06T02:36:03.874750Z Initialized PartiallyDownloadedBlock for block 41e9ca1d323164c000b3fca58b6dbc18a5768a0bc0f7dc8863ce82dfa98c295c using a cmpctblock of size 252
2020-02-06T02:36:03.874861Z received: blocktxn (33 bytes) peer=2
2020-02-06T02:36:03.875469Z Successfully reconstructed block 41e9ca1d323164c000b3fca58b6dbc18a5768a0bc0f7dc8863ce82dfa98c295c with 1 txn prefilled, 10 txn from mempool (incl at least 0 from extra pool) and 0 txn requested
2020-02-06T02:36:03.875863Z PeerLogicValidation::NewPoWValidBlock sending header-and-ids 41e9ca1d323164c000b3fca58b6dbc18a5768a0bc0f7dc8863ce82dfa98c295c to peer=0
2020-02-06T02:36:03.875967Z sending cmpctblock (252 bytes) peer=0
2020-02-06T02:36:03.881347Z   - Load block from disk: 0.01ms [0.01s]
2020-02-06T02:36:03.881503Z     - Sanity checks: 0.02ms [0.03s (0.07ms/blk)]
2020-02-06T02:36:03.881839Z     - Fork checks: 0.35ms [0.17s (0.37ms/blk)]
2020-02-06T02:36:03.882476Z       - Connect 11 transactions: 0.63ms (0.057ms/tx, 0.063ms/txin) [0.21s (0.47ms/blk)]
2020-02-06T02:36:03.882557Z     - Verify 10 txins: 0.72ms (0.072ms/txin) [0.29s (0.65ms/blk)]
2020-02-06T02:36:03.882817Z     - Index writing: 0.26ms [0.22s (0.50ms/blk)]
2020-02-06T02:36:03.882882Z     - Callbacks: 0.05ms [0.03s (0.07ms/blk)]
2020-02-06T02:36:03.883073Z   - Connect total: 1.75ms [0.96s (2.13ms/blk)]
2020-02-06T02:36:03.883141Z   - Flush: 0.07ms [0.06s (0.14ms/blk)]
2020-02-06T02:36:03.883237Z   - Writing chainstate: 0.10ms [0.09s (0.21ms/blk)]
2020-02-06T02:36:03.885141Z UpdateTip: new best=41e9ca1d323164c000b3fca58b6dbc18a5768a0bc0f7dc8863ce82dfa98c295c height=337 version=0x00000004 log2_work=9.4008794 tx=439 date='2020-02-06T02:36:37Z' progress=1.000000 cache=0.1MiB(428txo)
2020-02-06T02:36:03.885228Z   - Connect postprocess: 1.99ms [0.27s (0.61ms/blk)]
2020-02-06T02:36:03.885263Z - Connect block: 3.91ms [1.40s (3.12ms/blk)]
2020-02-06T02:36:03.885353Z Checking mempool with 0 transactions and 0 inputs
markblundeberg retitled this revision from fix p2p_compactblocks flakeyness to fix p2p_compactblocks flakiness.Feb 6 2020, 08:00
This revision is now accepted and ready to land.Feb 6 2020, 08:10

OK so why the orphaning handling is weird.... (this is exercising a weird edge case):

  • When we get an orphan tx, we queue with RequestTx on all the missing parents and then AddOrphanTx add it to the orphan pool.
  • (*critical for test -- this creates the delay*) In RequestTx the mininode is an inbound peer, so tx requests are delayed by ~2 seconds even for orphans.
  • When a transaction is deorphaned or mined, it does not get cleared out from m_tx_process_time because it's probably too slow to iterate.
  • When the time comes to download with m_tx_process_time, the node checks if it already has the tx using AlreadyHave:
    • Check in mempool, we have of course. (*but in the test, the txes get mined immediately*)
    • Check in confirmed UTXO set cache for the first and second outputs of the tx, as a quick heuristic for 'already mined'. (*but in the test, there is a chain of txes and the intermediate utxos are not in cache*)

So the weirdness requires multiple conditions:

  • Node wants to requests a transaction for whatever reason (either inv received or orphan of some other tx).
  • The peer is inbound, so the request is queued up and delayed by a couple of seconds.
  • During that time, the tx is mined, or received and then mined.
  • The first and second outputs of the tx are also mined quickly (likely within the same block).
  • ==> Now you get a situation where the node goes ahead and requests a tx that is already confirmed (and it may have been in mempool briefly).

The fix is probably to make it so that when a block is mined, m_tx_process_time gets cleaned out of any matches. Not sure if this is worth it.