Page MenuHomePhabricator

Fix a potential data race between CConman and the PeerLogicValidation
ClosedPublic

Authored by Fabien on Nov 25 2020, 14:40.

Details

Summary

This issue was found by TSAN:
https://build.bitcoinabc.org/viewLog.html?buildTypeId=BitcoinABC_ResourceIntensiveBuilds_BitcoinAbcMasterTsan&buildId=135843&tab=buildLog&logTab=tree&filter=debug&expand=all&_focus=1665&guest=1

What happened here is a race on the m_max_outbound_full_relay or
m_max_outbound_block_relay conman members. There is a cirular
dependency betwen CConman and PeerLogicValidation, which is "solved"
by the use of an Init() function to pass the peer logic validation to
the conman. The racey members are initialized at this time.
But the peer logic validation starts a thread at construction time which
in turns needs access to these members. Under certain timing
circumstances, the conman initialization and the peer logic validation
can concurrently set/read these members, causing the race.

This behavior can only occur once at initialization time (either in the
main app or during tests) and only when certain timing conditions are
met, which happens under heavy load. This is why TSAN triggered on CI:
the unit tests running in parallel are very resources intensive,
especially since we have the radix insert stress test consuming 100% CPU
for quite some time and requiring over 16GB of memory. This is also the
reason why core won't notice this issue: they don't have this test.

Since the peer validation logic locks cs_main, we can do the same during
the racey init part. Since this is only called once at startup it will
not cause any performance issue.

Test Plan

Rebase this patch on top of D8376.
With TSAN:

ninja check-bitcoin-radix_tests check-bitcoin-coinselector_tests

./contrib/teamcity/build-configurations.py build-tsan

Diff Detail

Repository
rABC Bitcoin ABC
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

Fabien requested review of this revision.Nov 25 2020, 14:40
deadalnix requested changes to this revision.Nov 25 2020, 16:19
deadalnix added a subscriber: deadalnix.

If that can only happen at startup, then locking at every use is clearly not the way to go.

This revision now requires changes to proceed.Nov 25 2020, 16:19
Fabien edited the summary of this revision. (Show Details)

Only lock once at startup but using cs_main. Updated summary.

This revision is now accepted and ready to land.Dec 9 2020, 12:14