Page MenuHomePhabricator

[Chronik] Add logs to find deadlocks
AbandonedPublicDraft

Authored by tobias_ruck on Jan 17 2024, 10:58.

Details

Reviewers
Fabien
Group Reviewers
Restricted Project
Summary

It seems like some tests deadlock. Adding logs helps diagnosing the issue

Test Plan

ninja check-functional

Diff Detail

Repository
rABC Bitcoin ABC
Branch
chronik-add-deadlock-logs
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 26442
Build 52454: Build Diffbuild-chronik-plugins · build-chronik · chronik-client-integration-tests
Build 52453: arc lint + arc unit

Event Timeline

also add read lock acquires

rerun CI to stir up trouble

Failed tests logs:

====== Bitcoin ABC functional tests: chronik_avalanche.py ======

------- Stdout: -------
2024-01-17T17:39:04.588000Z TestFramework (INFO): Initializing test directory /work/abc-ci-builds/build-chronik/test/tmp/test_runner_₿₵_  _20240117_173746/chronik_avalanche_15
2024-01-17T17:43:18.679000Z TestFramework (ERROR): JSONRPC error
Traceback (most recent call last):
  File "/work/test/functional/test_framework/authproxy.py", line 216, in _get_response
    http_response = self.__conn.getresponse()
  File "/usr/lib/python3.9/http/client.py", line 1347, in getresponse
    response.begin()
  File "/usr/lib/python3.9/http/client.py", line 307, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.9/http/client.py", line 268, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.9/socket.py", line 704, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/work/test/functional/test_framework/test_framework.py", line 148, in main
    self._run_test_internal()
  File "/work/test/functional/test_framework/test_framework.py", line 138, in _run_test_internal
    self.run_test()
  File "/work/test/functional/chronik_avalanche.py", line 122, in run_test
    new_block_hashes = self.generate(node, 10, sync_fun=self.no_op)
  File "/work/test/functional/test_framework/test_framework.py", line 737, in generate
    blocks = generator.generate(*args, invalid_call=False, **kwargs)
  File "/work/test/functional/test_framework/test_node.py", line 448, in generate
    return self.generatetoaddress(
  File "/work/test/functional/test_framework/test_node.py", line 461, in generatetoaddress
    return self.__getattr__("generatetoaddress")(*args, **kwargs)
  File "/work/test/functional/test_framework/coverage.py", line 47, in __call__
    return_val = self.auth_service_proxy_instance.__call__(*args, **kwargs)
  File "/work/test/functional/test_framework/authproxy.py", line 175, in __call__
    response, status = self._request(
  File "/work/test/functional/test_framework/authproxy.py", line 126, in _request
    return self._get_response()
  File "/work/test/functional/test_framework/authproxy.py", line 218, in _get_response
    raise JSONRPCException(
test_framework.authproxy.JSONRPCException: 'generatetoaddress' RPC took longer than 240 seconds. Consider using larger timeout for calls that take longer to return. (-344)
2024-01-17T17:43:18.735000Z TestFramework (INFO): Stopping nodes
2024-01-17T17:43:18.736000Z TestFramework.node0 (ERROR): Unable to stop node.
Traceback (most recent call last):
  File "/work/test/functional/test_framework/test_node.py", line 502, in stop_node
    self.stop(wait=wait)
  File "/work/test/functional/test_framework/coverage.py", line 47, in __call__
    return_val = self.auth_service_proxy_instance.__call__(*args, **kwargs)
  File "/work/test/functional/test_framework/authproxy.py", line 175, in __call__
    response, status = self._request(
  File "/work/test/functional/test_framework/authproxy.py", line 125, in _request
    self.__conn.request(method, path, postdata, headers)
  File "/usr/lib/python3.9/http/client.py", line 1255, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.9/http/client.py", line 1266, in _send_request
    self.putrequest(method, url, **skips)
  File "/usr/lib/python3.9/http/client.py", line 1092, in putrequest
    raise CannotSendRequest(self.__state)
http.client.CannotSendRequest: Request-sent
2024-01-17T17:44:18.778000Z TestFramework.utils (ERROR): wait_until() failed. Predicate: ''''
    def is_node_stopped(self):
        """Checks whether the node has stopped.

        Returns True if the node has stopped. False otherwise.
        This method is responsible for freeing resources (self.process)."""
        if not self.running:
            return True
        return_code = self.process.poll()
        if return_code is None:
            return False

        # process has stopped. Assert that it didn't return an error code.
        assert return_code == 0, self._node_msg(
            f"Node returned non-zero exit code ({return_code}) when stopping"
        )
        self.running = False
        self.process = None
        self.rpc_connected = False
        self.rpc = None
        self.log.debug("Node stopped")
        return True
'''
[node 0] Cleaning up leftover process
------- Stderr: -------
Traceback (most recent call last):
  File "/work/test/functional/chronik_avalanche.py", line 145, in <module>
    ChronikAvalancheTest().main()
  File "/work/test/functional/test_framework/test_framework.py", line 171, in main
    exit_code = self.shutdown()
  File "/work/test/functional/test_framework/test_framework.py", line 397, in shutdown
    self.stop_nodes()
  File "/work/test/functional/test_framework/test_framework.py", line 652, in stop_nodes
    node.wait_until_stopped()
  File "/work/test/functional/test_framework/test_node.py", line 547, in wait_until_stopped
    wait_until_helper(
  File "/work/test/functional/test_framework/util.py", line 298, in wait_until_helper
    raise AssertionError(
AssertionError: Predicate ''''
    def is_node_stopped(self):
        """Checks whether the node has stopped.

        Returns True if the node has stopped. False otherwise.
        This method is responsible for freeing resources (self.process)."""
        if not self.running:
            return True
        return_code = self.process.poll()
        if return_code is None:
            return False

        # process has stopped. Assert that it didn't return an error code.
        assert return_code == 0, self._node_msg(
            f"Node returned non-zero exit code ({return_code}) when stopping"
        )
        self.running = False
        self.process = None
        self.rpc_connected = False
        self.rpc = None
        self.log.debug("Node stopped")
        return True
''' not true after 60.0 seconds

Each failure log is accessible here:
Bitcoin ABC functional tests: chronik_avalanche.py

add some more logs, especially after locks were acquired

add logs to blockstorage.cpp

add logs to AvalancheFinalizeBlock

The build failed due to an unexpected infrastructure outage. The administrators have been notified to investigate. Sorry for the inconvenience.