Page MenuHomePhabricator

Fix TSAN out of memory when running the radix_tests/insert_stress_test
AbandonedPublic

Authored by Fabien on Jul 1 2020, 13:18.

Details

Reviewers
deadalnix
Group Reviewers
Restricted Project
Summary

The choosen values for the number of threads and elements in the test
cause TSAN to peak out at 100% CPU all cores (on a 30 cores machine !)
and over 16GB of memory.
This diff makes the parameters configurable and lower the values when
running with TSAN. This will also speed up the TSAN build significantly,
as the test actually takes over 12 minutes to complete on CI. The peak
memory usage is lowered to about 2.8GB on my machine.

Test Plan
./contrib/teamcity/build-configurations.py build-tsan

Event Timeline

Fabien requested review of this revision.Jul 1 2020, 13:18

The above comment will trigger a build failure, don't take it into account.

Snippet of first build failure:

2020-07-01T13:53:43.447000Z TestFramework (INFO): Initializing test directory /work/abc-ci-builds/build-tsan/test/tmp/test_runner_₿₵_  _20200701_135011/wallet_txn_clone_164
2020-07-01T13:53:46.927000Z TestFramework (INFO): Stopping nodes
2020-07-01T13:53:47.333000Z TestFramework (INFO): Cleaning up /work/abc-ci-builds/build-tsan/test/tmp/test_runner_₿₵_  _20200701_135011/wallet_txn_clone_164 on exit
2020-07-01T13:53:47.333000Z TestFramework (INFO): Tests successful
[13:54:19]i:		 [Bitcoin ABC functional tests] wallet_txn_clone.py --mineblock
[13:54:19]i:			 [wallet_txn_clone.py --mineblock] [Test Output]
2020-07-01T13:53:13.180000Z TestFramework (INFO): Initializing test directory /work/abc-ci-builds/build-tsan/test/tmp/test_runner_₿₵_  _20200701_135011/wallet_txn_clone_143
2020-07-01T13:53:18.026000Z TestFramework (INFO): Stopping nodes
2020-07-01T13:53:18.381000Z TestFramework (INFO): Cleaning up /work/abc-ci-builds/build-tsan/test/tmp/test_runner_₿₵_  _20200701_135011/wallet_txn_clone_143 on exit
2020-07-01T13:53:18.381000Z TestFramework (INFO): Tests successful
[13:54:19]i:		 [Bitcoin ABC functional tests] wallet_txn_doublespend.py
[13:54:19]i:			 [wallet_txn_doublespend.py] [Test Output]
2020-07-01T13:54:05.458000Z TestFramework (INFO): Initializing test directory /work/abc-ci-builds/build-tsan/test/tmp/test_runner_₿₵_  _20200701_135011/wallet_txn_doublespend_194
2020-07-01T13:54:07.973000Z TestFramework (INFO): Stopping nodes
2020-07-01T13:54:08.378000Z TestFramework (INFO): Cleaning up /work/abc-ci-builds/build-tsan/test/tmp/test_runner_₿₵_  _20200701_135011/wallet_txn_doublespend_194 on exit
2020-07-01T13:54:08.378000Z TestFramework (INFO): Tests successful
[13:54:19]i:		 [Bitcoin ABC functional tests] wallet_txn_doublespend.py --mineblock
[13:54:19]i:			 [wallet_txn_doublespend.py --mineblock] [Test Output]
2020-07-01T13:53:15.432000Z TestFramework (INFO): Initializing test directory /work/abc-ci-builds/build-tsan/test/tmp/test_runner_₿₵_  _20200701_135011/wallet_txn_doublespend_144
2020-07-01T13:53:20.083000Z TestFramework (INFO): Stopping nodes
2020-07-01T13:53:20.338000Z TestFramework (INFO): Cleaning up /work/abc-ci-builds/build-tsan/test/tmp/test_runner_₿₵_  _20200701_135011/wallet_txn_doublespend_144 on exit
2020-07-01T13:53:20.338000Z TestFramework (INFO): Tests successful
[13:54:19]i:		 [Bitcoin ABC functional tests] wallet_zapwallettxes.py
[13:54:19]i:			 [wallet_zapwallettxes.py] [Test Output]
2020-07-01T13:52:51.003000Z TestFramework (INFO): Initializing test directory /work/abc-ci-builds/build-tsan/test/tmp/test_runner_₿₵_  _20200701_135011/wallet_zapwallettxes_132
2020-07-01T13:52:52.355000Z TestFramework (INFO): Mining blocks...
2020-07-01T13:53:00.587000Z TestFramework (INFO): Stopping nodes
2020-07-01T13:53:00.940000Z TestFramework (INFO): Cleaning up /work/abc-ci-builds/build-tsan/test/tmp/test_runner_₿₵_  _20200701_135011/wallet_zapwallettxes_132 on exit
2020-07-01T13:53:00.940000Z TestFramework (INFO): Tests successful
[13:54:19]i:		 [Bitcoin ABC functional tests] feature_maxuploadtarget.py
[13:54:19]i:			 [feature_maxuploadtarget.py] [Test Output]
2020-07-01T13:50:48.343000Z TestFramework (INFO): Initializing test directory /work/abc-ci-builds/build-tsan/test/tmp/test_runner_₿₵_  _20200701_135011/feature_maxuploadtarget_95
2020-07-01T13:50:59.648000Z TestFramework (INFO): Peer 0 disconnected after downloading old block too many times
2020-07-01T13:51:10.088000Z TestFramework (INFO): Peer 1 able to repeatedly download new block
2020-07-01T13:51:10.194000Z TestFramework (INFO): Peer 1 disconnected after trying to download old block
2020-07-01T13:51:10.194000Z TestFramework (INFO): Advancing system time on node to clear counters...
2020-07-01T13:51:10.299000Z TestFramework (INFO): Peer 2 able to download old block
2020-07-01T13:51:10.299000Z TestFramework (INFO): Restarting nodes with -whitelist=127.0.0.1
2020-07-01T13:51:15.685000Z TestFramework (INFO): Peer still connected after trying to download old block (whitelisted)
2020-07-01T13:51:15.736000Z TestFramework (INFO): Stopping nodes
[13:54:19]i:			 [feature_maxuploadtarget.py] [Test Error Output]
Traceback (most recent call last):
  File "/work/test/functional/feature_maxuploadtarget.py", line 177, in <module>
    MaxUploadTest().main()
  File "/work/test/functional/test_framework/test_framework.py", line 210, in main
    self.stop_nodes()
  File "/work/test/functional/test_framework/test_framework.py", line 407, in stop_nodes
    node.stop_node(wait=wait)
  File "/work/test/functional/test_framework/test_node.py", line 325, in stop_node
    "Unexpected stderr {} != {}".format(stderr, expected_stderr))
AssertionError: Unexpected stderr WARNING: too long mutex cycle found !=
[13:54:42] :			 [Creating archive artifacts.tar.gz] Archive was created, file size 225.19 MB (236136099 bytes)
[13:54:42] :		 [Publishing artifacts] Publishing 1 file using [WebPublisher]: results/artifacts => artifacts.tar.gz
[13:54:42] :		 [Publishing artifacts] Publishing 1 file using [ArtifactsCachePublisher]: results/artifacts => artifacts.tar.gz
[13:54:18] :	 [Step 1/1] ninja: build stopped: subcommand failed.
[13:54:18]i:	 [Step 1/1] ##teamcity[publishArtifacts '+:results/artifacts=>artifacts.tar.gz']
[13:54:42] :	 [Step 1/1] 
[13:54:42] :	 [Step 1/1] Build build-tsan failed with exit code 1
[13:54:42]W:	 [Step 1/1] Process exited with code 1
[13:54:51]E:	 [Step 1/1] Process exited with code 1 (Step: Command Line)

Each failure log is accessible here:
Bitcoin ABC functional tests: feature_maxuploadtarget.py

deadalnix requested changes to this revision.Jul 1 2020, 13:59
deadalnix added a subscriber: deadalnix.
deadalnix added inline comments.
src/test/main.cpp
31 ↗(On Diff #21923)

That way is by using a static constructor.

This revision now requires changes to proceed.Jul 1 2020, 13:59

Snippet of first build failure:

2020-07-01T14:03:21.722000Z TestFramework (INFO): Initializing test directory /work/abc-ci-builds/build-tsan/test/tmp/test_runner_₿₵_  _20200701_135929/wallet_txn_clone_645
2020-07-01T14:03:25.720000Z TestFramework (INFO): Stopping nodes
2020-07-01T14:03:26.086000Z TestFramework (INFO): Cleaning up /work/abc-ci-builds/build-tsan/test/tmp/test_runner_₿₵_  _20200701_135929/wallet_txn_clone_645 on exit
2020-07-01T14:03:26.086000Z TestFramework (INFO): Tests successful
[14:04:00]i:		 [Bitcoin ABC functional tests] wallet_txn_clone.py --mineblock
[14:04:00]i:			 [wallet_txn_clone.py --mineblock] [Test Output]
2020-07-01T14:02:48.452000Z TestFramework (INFO): Initializing test directory /work/abc-ci-builds/build-tsan/test/tmp/test_runner_₿₵_  _20200701_135929/wallet_txn_clone_624
2020-07-01T14:02:53.347000Z TestFramework (INFO): Stopping nodes
2020-07-01T14:02:53.652000Z TestFramework (INFO): Cleaning up /work/abc-ci-builds/build-tsan/test/tmp/test_runner_₿₵_  _20200701_135929/wallet_txn_clone_624 on exit
2020-07-01T14:02:53.652000Z TestFramework (INFO): Tests successful
[14:04:00]i:		 [Bitcoin ABC functional tests] wallet_txn_doublespend.py
[14:04:00]i:			 [wallet_txn_doublespend.py] [Test Output]
2020-07-01T14:03:45.386000Z TestFramework (INFO): Initializing test directory /work/abc-ci-builds/build-tsan/test/tmp/test_runner_₿₵_  _20200701_135929/wallet_txn_doublespend_675
2020-07-01T14:03:47.847000Z TestFramework (INFO): Stopping nodes
2020-07-01T14:03:48.201000Z TestFramework (INFO): Cleaning up /work/abc-ci-builds/build-tsan/test/tmp/test_runner_₿₵_  _20200701_135929/wallet_txn_doublespend_675 on exit
2020-07-01T14:03:48.202000Z TestFramework (INFO): Tests successful
[14:04:00]i:		 [Bitcoin ABC functional tests] wallet_txn_doublespend.py --mineblock
[14:04:00]i:			 [wallet_txn_doublespend.py --mineblock] [Test Output]
2020-07-01T14:02:50.760000Z TestFramework (INFO): Initializing test directory /work/abc-ci-builds/build-tsan/test/tmp/test_runner_₿₵_  _20200701_135929/wallet_txn_doublespend_625
2020-07-01T14:02:55.777000Z TestFramework (INFO): Stopping nodes
2020-07-01T14:02:56.682000Z TestFramework (INFO): Cleaning up /work/abc-ci-builds/build-tsan/test/tmp/test_runner_₿₵_  _20200701_135929/wallet_txn_doublespend_625 on exit
2020-07-01T14:02:56.682000Z TestFramework (INFO): Tests successful
[14:04:00]i:		 [Bitcoin ABC functional tests] wallet_zapwallettxes.py
[14:04:00]i:			 [wallet_zapwallettxes.py] [Test Output]
2020-07-01T14:02:22.689000Z TestFramework (INFO): Initializing test directory /work/abc-ci-builds/build-tsan/test/tmp/test_runner_₿₵_  _20200701_135929/wallet_zapwallettxes_613
2020-07-01T14:02:24.277000Z TestFramework (INFO): Mining blocks...
2020-07-01T14:02:32.665000Z TestFramework (INFO): Stopping nodes
2020-07-01T14:02:32.868000Z TestFramework (INFO): Cleaning up /work/abc-ci-builds/build-tsan/test/tmp/test_runner_₿₵_  _20200701_135929/wallet_zapwallettxes_613 on exit
2020-07-01T14:02:32.868000Z TestFramework (INFO): Tests successful
[14:04:00]i:		 [Bitcoin ABC functional tests] feature_maxuploadtarget.py
[14:04:00]i:			 [feature_maxuploadtarget.py] [Test Output]
2020-07-01T14:00:12.221000Z TestFramework (INFO): Initializing test directory /work/abc-ci-builds/build-tsan/test/tmp/test_runner_₿₵_  _20200701_135929/feature_maxuploadtarget_576
2020-07-01T14:00:23.422000Z TestFramework (INFO): Peer 0 disconnected after downloading old block too many times
2020-07-01T14:00:33.982000Z TestFramework (INFO): Peer 1 able to repeatedly download new block
2020-07-01T14:00:34.088000Z TestFramework (INFO): Peer 1 disconnected after trying to download old block
2020-07-01T14:00:34.088000Z TestFramework (INFO): Advancing system time on node to clear counters...
2020-07-01T14:00:34.249000Z TestFramework (INFO): Peer 2 able to download old block
2020-07-01T14:00:34.249000Z TestFramework (INFO): Restarting nodes with -whitelist=127.0.0.1
2020-07-01T14:00:40.178000Z TestFramework (INFO): Peer still connected after trying to download old block (whitelisted)
2020-07-01T14:00:40.230000Z TestFramework (INFO): Stopping nodes
[14:04:00]i:			 [feature_maxuploadtarget.py] [Test Error Output]
Traceback (most recent call last):
  File "/work/test/functional/feature_maxuploadtarget.py", line 177, in <module>
    MaxUploadTest().main()
  File "/work/test/functional/test_framework/test_framework.py", line 210, in main
    self.stop_nodes()
  File "/work/test/functional/test_framework/test_framework.py", line 407, in stop_nodes
    node.stop_node(wait=wait)
  File "/work/test/functional/test_framework/test_node.py", line 325, in stop_node
    "Unexpected stderr {} != {}".format(stderr, expected_stderr))
AssertionError: Unexpected stderr WARNING: too long mutex cycle found !=
[14:04:24] :			 [Creating archive artifacts.tar.gz] Archive was created, file size 225.19 MB (236137506 bytes)
[14:04:24] :		 [Publishing artifacts] Publishing 1 file using [ArtifactsCachePublisher]: results/artifacts => artifacts.tar.gz
[14:04:24] :		 [Publishing artifacts] Publishing 1 file using [WebPublisher]: results/artifacts => artifacts.tar.gz
[14:03:58] :	 [Step 1/1] ninja: build stopped: subcommand failed.
[14:03:58]i:	 [Step 1/1] ##teamcity[publishArtifacts '+:results/artifacts=>artifacts.tar.gz']
[14:04:24] :	 [Step 1/1] 
[14:04:24] :	 [Step 1/1] Build build-tsan failed with exit code 1
[14:04:24]W:	 [Step 1/1] Process exited with code 1
[14:04:35]E:	 [Step 1/1] Process exited with code 1 (Step: Command Line)

Each failure log is accessible here:
Bitcoin ABC functional tests: feature_maxuploadtarget.py

Snippet of first build failure:

2020-07-01T14:11:39.224000Z TestFramework (INFO): Tests successful
[14:12:42]i:		 [Bitcoin ABC functional tests] wallet_txn_doublespend.py
[14:12:42]i:			 [wallet_txn_doublespend.py] [Test Output]
2020-07-01T14:12:26.524000Z TestFramework (INFO): Initializing test directory /work/abc-ci-builds/build-tsan/test/tmp/test_runner_₿₵_  _20200701_140819/wallet_txn_doublespend_658
2020-07-01T14:12:30.389000Z TestFramework (INFO): Stopping nodes
2020-07-01T14:12:30.745000Z TestFramework (INFO): Cleaning up /work/abc-ci-builds/build-tsan/test/tmp/test_runner_₿₵_  _20200701_140819/wallet_txn_doublespend_658 on exit
2020-07-01T14:12:30.745000Z TestFramework (INFO): Tests successful
[14:12:42]i:		 [Bitcoin ABC functional tests] wallet_txn_doublespend.py --mineblock
[14:12:42]i:			 [wallet_txn_doublespend.py --mineblock] [Test Output]
2020-07-01T14:11:35.722000Z TestFramework (INFO): Initializing test directory /work/abc-ci-builds/build-tsan/test/tmp/test_runner_₿₵_  _20200701_140819/wallet_txn_doublespend_608
2020-07-01T14:11:40.623000Z TestFramework (INFO): Stopping nodes
2020-07-01T14:11:40.927000Z TestFramework (INFO): Cleaning up /work/abc-ci-builds/build-tsan/test/tmp/test_runner_₿₵_  _20200701_140819/wallet_txn_doublespend_608 on exit
2020-07-01T14:11:40.927000Z TestFramework (INFO): Tests successful
[14:12:42]i:		 [Bitcoin ABC functional tests] wallet_zapwallettxes.py
[14:12:42]i:			 [wallet_zapwallettxes.py] [Test Output]
2020-07-01T14:11:08.077000Z TestFramework (INFO): Initializing test directory /work/abc-ci-builds/build-tsan/test/tmp/test_runner_₿₵_  _20200701_140819/wallet_zapwallettxes_596
2020-07-01T14:11:09.432000Z TestFramework (INFO): Mining blocks...
2020-07-01T14:11:17.696000Z TestFramework (INFO): Stopping nodes
2020-07-01T14:11:17.899000Z TestFramework (INFO): Cleaning up /work/abc-ci-builds/build-tsan/test/tmp/test_runner_₿₵_  _20200701_140819/wallet_zapwallettxes_596 on exit
2020-07-01T14:11:17.899000Z TestFramework (INFO): Tests successful
[14:12:42]i:		 [Bitcoin ABC functional tests] feature_maxuploadtarget.py
[14:12:42]i:			 [feature_maxuploadtarget.py] [Test Output]
2020-07-01T14:08:54.515000Z TestFramework (INFO): Initializing test directory /work/abc-ci-builds/build-tsan/test/tmp/test_runner_₿₵_  _20200701_140819/feature_maxuploadtarget_559
2020-07-01T14:09:06.160000Z TestFramework (INFO): Peer 0 disconnected after downloading old block too many times
2020-07-01T14:09:16.298000Z TestFramework (INFO): Peer 1 able to repeatedly download new block
2020-07-01T14:09:16.404000Z TestFramework (INFO): Peer 1 disconnected after trying to download old block
2020-07-01T14:09:16.405000Z TestFramework (INFO): Advancing system time on node to clear counters...
2020-07-01T14:09:16.572000Z TestFramework (INFO): Peer 2 able to download old block
2020-07-01T14:09:16.572000Z TestFramework (INFO): Restarting nodes with -whitelist=127.0.0.1
2020-07-01T14:09:22.161000Z TestFramework (INFO): Peer still connected after trying to download old block (whitelisted)
2020-07-01T14:09:22.213000Z TestFramework (INFO): Stopping nodes
[14:12:42]i:			 [feature_maxuploadtarget.py] [Test Error Output]
Traceback (most recent call last):
  File "/work/test/functional/feature_maxuploadtarget.py", line 177, in <module>
    MaxUploadTest().main()
  File "/work/test/functional/test_framework/test_framework.py", line 210, in main
    self.stop_nodes()
  File "/work/test/functional/test_framework/test_framework.py", line 407, in stop_nodes
    node.stop_node(wait=wait)
  File "/work/test/functional/test_framework/test_node.py", line 325, in stop_node
    "Unexpected stderr {} != {}".format(stderr, expected_stderr))
AssertionError: Unexpected stderr WARNING: too long mutex cycle found
WARNING: too long mutex cycle found
WARNING: too long mutex cycle found
WARNING: too long mutex cycle found
WARNING: too long mutex cycle found
WARNING: too long mutex cycle found
WARNING: too long mutex cycle found
WARNING: too long mutex cycle found
WARNING: too long mutex cycle found
WARNING: too long mutex cycle found
WARNING: too long mutex cycle found !=
[14:13:05] :			 [Creating archive artifacts.tar.gz] Archive was created, file size 225.19 MB (236137468 bytes)
[14:13:05] :		 [Publishing artifacts] Publishing 1 file using [ArtifactsCachePublisher]: results/artifacts => artifacts.tar.gz
[14:13:05] :		 [Publishing artifacts] Publishing 1 file using [WebPublisher]: results/artifacts => artifacts.tar.gz
[14:12:40] :	 [Step 1/1] Build build-tsan failed with exit code 1
[14:12:40]i:	 [Step 1/1] ##teamcity[publishArtifacts '+:results/artifacts=>artifacts.tar.gz']
[14:13:05] :	 [Step 1/1] 
[14:13:05]W:	 [Step 1/1] Process exited with code 1
[14:13:18]E:	 [Step 1/1] Process exited with code 1 (Step: Command Line)

Each failure log is accessible here:
Bitcoin ABC functional tests: feature_maxuploadtarget.py

Move the test argument decalaration where it belongs

deadalnix requested changes to this revision.Jul 2 2020, 12:52

Now this patch looks much better, but I'd like to loop back on the motivation. This is a stress test, and the purpose of a stress test is to stress the system.

Measuring on my machine, this test lasts about a minute under tsan and takes 16G of RAM. The former is not a huge deal the later maybe. But that begs the question, why are we paying engineer time to make the test worse so they can run on underpowered machines instead of spending money on more powerful machines?

src/test/util/setup_common.h
89 ↗(On Diff #21953)

Pass by ref instead of making copies.

This revision now requires changes to proceed.Jul 2 2020, 12:52

The stress test is still running full capacity on all other tests, and when running with TSAN we are especially looking for deadlocks rather than performance issues.
The point still stands, this patch was a solution I can leverage by myself to work around the issue now but I will look at if/how we can increase the machines capacity.

Fabien planned changes to this revision.Jul 3 2020, 17:55
Fabien added a subscriber: jasonbcox.

Talked to @jasonbcox and the server will be upgraded to more memory