Page MenuHomePhabricator

Fix flakey feature_pruning
AbandonedPublic

Authored by markblundeberg on Wed, Feb 5, 17:32.

Details

Reviewers
deadalnix
jasonbcox
Group Reviewers
Restricted Project
Summary

We get failures of this sort, very occasionally:
https://build.bitcoinabc.org/viewLog.html?buildId=23720&buildTypeId=BitcoinABCMasterLinux&tab=buildResultsDiv

Locally it has gotten very bad now that I'm making diffs
with sigops counting disabled, because that speeds up block
validation massively (these tests make billions of
OP_NOPs which get iterated one by one during sigop counting).

The failure is because when reorg_back starts, node 0 may still
be sending blocks to node 2 as a consequence of the previous test.
When node 2 calls invalidateblock during the reorg_back
function and then node 0 sends a block building off that invalid
chain, this earns a ban for node 0, and the test ends up timing
out later on when node 2 was supposed to have gotten a reorg chain
from node 0.

Test Plan
ninja all && ./test/functional/test_runner.py feature_pruning

Diff Detail

Repository
rABC Bitcoin ABC
Branch
feature_pruning_flake
Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 9278
Build 16497: Bitcoin ABC Buildbot
Build 16496: arc lint + arc unit

Event Timeline

markblundeberg created this revision.Wed, Feb 5, 17:32
Herald added a reviewer: Restricted Project. · View Herald TranscriptWed, Feb 5, 17:32

(No joke about the speedup with sigops off... we're talking about logged ConnectBlock times going from 85 ms to 9 ms. And yes the test runs overall about 2x faster...)

markblundeberg added inline comments.
test/functional/feature_pruning.py
49

Macro sobig:

deadalnix requested changes to this revision.Wed, Feb 5, 19:00
deadalnix added a subscriber: deadalnix.

Good catch, this has been bugging me too.

test/functional/feature_pruning.py
235

The problem with that is that it is going through the RPC rather than the network, so you aren't really testing what you are supposed to anymore.

You may want to add a wait for that polls for node 2's tip and see if it is where it needs to - and fail after a timeout.

This revision now requires changes to proceed.Wed, Feb 5, 19:00
jasonbcox requested changes to this revision.Wed, Feb 5, 19:07
jasonbcox added a subscriber: jasonbcox.
jasonbcox added inline comments.
test/functional/feature_pruning.py
229

Looks to me like the sync_blocks call actually belongs here because reorg_test should not leave the test in a bad state after completing (or nearly so).

It also seems incorrect for calc_usage to be called if the nodes are not synced.

markblundeberg abandoned this revision.Thu, Feb 6, 01:37

Theres's a backport for this: D5168 . I could have sworn I looked for backports on this, but apparently I missed that.

test/functional/feature_pruning.py
229

Yep on both points. calc_usage there is checking node 2 which is the problematic one, so we definitely want it synced.

235

Hmm I don't get what you mean... sync_blocks is the standard way to do 'wait until all these nodes are on the same tip' and it's used throughout the test. What's wrong with RPC?

deadalnix added inline comments.Thu, Feb 6, 02:10
test/functional/feature_pruning.py
235

Nothing wrong with RPC, unless you ant to test that a node actually prune blocks when it recieve them as now block are not recieved via the same codepath.