Page MenuHomePhabricator

synchronize validation queue during submitblock
ClosedPublic

Authored by markblundeberg on Feb 3 2020, 16:34.

Details

Summary

I was observing some intermittent segfaults of feature_dbcrash
with my debug build, where just after submitblock was called (and
a block was connected): a CValidationInterface was having its
vtable pointer nulled out just as it attempted to execute one of
its background slots (such as UpdatedBlockTip, BlockConnected).

This seems to be the explanation: the boost signals2 system allows
disconnections to happen concurrently with slot execution, thus
the UnregisterValidationInterface can complete and sc can go out
of scope while one of its slots is running. There seems to be a very
tight race involved here, visible due to the debug build slowdown,
and because feature_dbcrash does thousands of submitblock calls.

(It's not quite as tight as just calling an empty function: for each slot
call, there are something like 8 wrappers in between when the
disconnection could be noticed, and when the CValidationInterface
method is actually called.)

https://www.boost.org/doc/libs/1_55_0/doc/html/signals2/thread-safety.html

Test Plan

I was observing failures in this setup:

CC=clang CXX=clang++ cmake -GNinja .. -DCMAKE_BUILD_TYPE=Debug
ninja check-extended

Diff Detail

Repository
rABC Bitcoin ABC
Branch
submitblock_sync
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 9244
Build 16433: Default Diff Build & Tests
Build 16432: arc lint + arc unit

Event Timeline

Was discussing this with Fabien a fair bit.

If I'm right about this, it seems this is a fairly systemic error regarding how we unregister validation interfaces (see also wallet), also done like this in Core. I find it strange that I'm the only one observing this problem, but on the other hand it's a very difficult to trigger race.

This revision is now accepted and ready to land.Feb 3 2020, 17:18

I've also noted an issue in Core: https://github.com/bitcoin/bitcoin/issues/18065 , plus a few more details (the fix I used here is really a bandaid and does not fully solve the logical problem at hand).