[chronik-client] Fix WebSocket retry loop issues during disconnection
Needs Review · Public

Authored by alitayin on Wed, Apr 23, 06:44.

Details

Reviewers

Fabien
bytesofman
emack

Group Reviewers

Restricted Owners Package	(Owns No Changed Paths)
Restricted Project

Summary

This change addresses the following issue: when a node accepts a WebSocket connection (it passes _websocketUrlConnects) but fires onerror or onclose immediately afterwards, the client keeps reconnecting to that same faulty node instead of switching to another one, and only moves on once the WebSocket connection itself fails to establish. When the faulty node is the only URL available, this produces high-frequency requests against it and consumes significant resources. When other URLs are available, the client still fails to switch over to them.

The solution is to delay the reconnection with setTimeout (which may also prevent stack overflow issues in extreme cases, though this
is uncertain) while ensuring that the client switches nodes. The fallback delay is calculated dynamically from the number of nodes.
For example, with only 1 node, the client keeps retrying with a 500ms delay: it never exits, but it no longer causes the
issues described above. With 5 nodes, the base delay is 500ms divided by the square of the node count, i.e. 500 / 25 = 20ms, and the
delay then scales with this._workingIndex (0 to 4):

When this._workingIndex = 0: 20 * 1 = 20ms
When this._workingIndex = 1: 20 * 2 = 40ms
When this._workingIndex = 2: 20 * 3 = 60ms
When this._workingIndex = 3: 20 * 4 = 80ms
When this._workingIndex = 4: 20 * 5 = 100ms
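
The calculation above can be written as a small function. This is a minimal sketch, not the code from failoverProxy.ts: the name fallbackDelayMs, its signature, and the constant name are hypothetical, and rounding down to whole milliseconds is an assumption inferred from the delays logged in the test plan.

```typescript
// Hypothetical helper illustrating the delay formula from the summary.
// The 500ms budget comes from the summary; the names are assumptions.
const MAX_SINGLE_NODE_DELAY_MS = 500;

function fallbackDelayMs(nodeCount: number, workingIndex: number): number {
    if (!Number.isInteger(nodeCount) || nodeCount < 1) {
        throw new RangeError('nodeCount must be a positive integer');
    }
    if (!Number.isInteger(workingIndex) || workingIndex < 0 || workingIndex >= nodeCount) {
        throw new RangeError('workingIndex must be in [0, nodeCount)');
    }
    // Base delay shrinks with the square of the node count.
    const baseDelay = MAX_SINGLE_NODE_DELAY_MS / (nodeCount * nodeCount);
    // Delay grows linearly with the index of the node that just failed.
    // Assumption: the result is rounded down to whole milliseconds.
    return Math.floor(baseDelay * (workingIndex + 1));
}

console.log([0, 1, 2, 3, 4].map(i => fallbackDelayMs(5, i))); // [ 20, 40, 60, 80, 100 ]
console.log(fallbackDelayMs(1, 0)); // 500
```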

The purpose is to avoid switching nodes too quickly when all nodes are unavailable (they pass _websocketUrlConnects but are actually
faulty) while keeping retries efficient. For example, a full pass over the list takes 500ms with 1 node, 125+250=375ms with 2 nodes,
and 55+111+166=332ms with 3 nodes (each delay rounded down to whole milliseconds).
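
The per-cycle totals can be checked by summing the delay over every index. The sketch below is illustrative only: totalCycleDelayMs is a hypothetical name, and flooring each delay is an assumption that matches the 55+111+166 figures. Without rounding, the sum works out to 250 * (n + 1) / n milliseconds for n nodes, so the total approaches 250ms as the list grows.

```typescript
// Hypothetical helper: total delay spent in one full pass over the node list,
// assuming delay(index) = floor(500 / n^2 * (index + 1)).
const CYCLE_BUDGET_MS = 500;

function totalCycleDelayMs(nodeCount: number): number {
    if (!Number.isInteger(nodeCount) || nodeCount < 1) {
        throw new RangeError('nodeCount must be a positive integer');
    }
    const baseDelay = CYCLE_BUDGET_MS / (nodeCount * nodeCount);
    let total = 0;
    for (let index = 0; index < nodeCount; index++) {
        total += Math.floor(baseDelay * (index + 1));
    }
    return total;
}

for (const n of [1, 2, 3, 5]) {
    console.log(`${n} node(s): ${totalCycleDelayMs(n)}ms`);
}
// 1 node(s): 500ms
// 2 node(s): 375ms
// 3 node(s): 332ms
// 5 node(s): 300ms
```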

This resolves the issues described above: the reconnection mechanism keeps working in each of these situations without getting stuck
in a resource-intensive loop.
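
The delay-then-switch behaviour described in this summary could look roughly like the sketch below. It is not the code in failoverProxy.ts: the RotatingReconnector class, its members, and the injectable scheduler are all hypothetical, and the order of "compute delay, then advance the index" is an assumption chosen to match the logged output in the test plan.

```typescript
// Hypothetical sketch of scheduling a reconnect on the next node.
type Scheduler = (fn: () => void, ms: number) => unknown;

class RotatingReconnector {
    private workingIndex = 0;
    private readonly urls: string[];
    private readonly connect: (url: string) => void;
    private readonly schedule: Scheduler;

    constructor(
        urls: string[],
        connect: (url: string) => void,
        schedule: Scheduler = (fn, ms) => setTimeout(fn, ms),
    ) {
        if (urls.length === 0) {
            throw new Error('at least one URL is required');
        }
        this.urls = urls;
        this.connect = connect;
        this.schedule = schedule;
    }

    /** Call from the socket's onclose/onerror handler. Returns the delay used. */
    onDisconnect(): number {
        const n = this.urls.length;
        // Delay is based on the index of the node that just failed.
        const delayMs = Math.floor((500 / (n * n)) * (this.workingIndex + 1));
        // Advance to the next URL, wrapping around, so a node that connects
        // and then drops is not retried back to back (unless it is the only one).
        this.workingIndex = (this.workingIndex + 1) % n;
        const nextUrl = this.urls[this.workingIndex];
        // Deferring through the scheduler returns control to the event loop
        // instead of reconnecting synchronously from inside the handler.
        this.schedule(() => this.connect(nextUrl), delayMs);
        return delayMs;
    }
}
```

With a single URL the index wraps back to 0 every time, so the same node is retried after 500ms on each disconnect, which matches the single-node behaviour described above.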

Test Plan

Added a local test script.

// Alitatest.ts
import { ChronikClient, ConnectionStrategy } from './src/ChronikClient';

function sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
}

async function main() {
    try {
        const urls = [
            'https://chronik.e.cash',
            'https://chronik1.alitayin.com',
            'https://chronik-native1.fabien.cash',
            'https://chronik-native2.fabien.cash',
            'https://chronik-native3.fabien.cash',
            'https://chronik2.alitayin.com'
        ];

        console.log('===== Starting WebSocket Reconnection Test =====');
        
        const chronik = await ChronikClient.useStrategy(
            ConnectionStrategy.ClosestFirst, 
            urls
        );
        
        const wsEndpoint = chronik.ws({
            onMessage: (msg) => {
                if (msg.type === 'Block') {
                    console.log(`Received block message: ${msg.blockHash}`);
                } else if (msg.type === 'Tx') {
                    console.log(`Received transaction message: ${msg.txid}`);
                } else if (msg.type === 'Error') {
                    console.log(`Received error message: ${msg.msg}`);
                }
            },
            onConnect: () => console.log('WebSocket connected'),
            onReconnect: () => console.log('WebSocket reconnecting'),
            onError: () => console.log('WebSocket connection error'),
            onEnd: () => console.log('WebSocket connection ended'),
        });
        
        console.log('Connecting to WebSocket...');
        await wsEndpoint.waitForOpen();
        
        console.log('Subscribing to block updates...');
        wsEndpoint.subscribeToBlocks();
        
        await sleep(60000); 
        
        wsEndpoint.close();
        console.log('===== WebSocket Reconnection Test Completed =====');

    } catch (error) {
        console.error('Test error:', error);
        return 1;
    }
    return 0;
}

main().then(process.exit);

Run as is, this script connects to the nodes normally and nothing fails. To simulate nodes that are reachable but disconnect immediately, add the following at the end of wsEndpoint.connected = new Promise (line 333):

console.log(`Node ${this._workingIndex} connected, closing immediately to test reconnection mechanism`);
setTimeout(() => ws.close(), 100);

This simulates nodes that establish a WebSocket connection but fail right after. Running the script then gives:

===== Starting WebSocket Reconnection Test =====
  1. https://chronik2.alitayin.com - latency: 165ms
  2. https://chronik-native2.fabien.cash - latency: 580ms
  3. https://chronik-native3.fabien.cash - latency: 647ms
  4. https://chronik.e.cash - latency: 896ms
  5. https://chronik1.alitayin.com - latency: >1000ms
  6. https://chronik-native1.fabien.cash - latency: >1000ms
Connecting to WebSocket...
WebSocket connected
Node 0 connected, closing immediately to test reconnection mechanism
Subscribing to block updates...
WebSocket reconnecting
WebSocket connected
Node 0 connected, closing immediately to test reconnection mechanism
WebSocket reconnecting
WebSocket connected
Node 0 connected, closing immediately to test reconnection mechanism
WebSocket reconnecting
WebSocket connected
Node 0 connected, closing immediately to test reconnection mechanism

This reproduces the continuous loop on node 0 before the fix.

Then, if we test again with our fix, we get:

===== Starting WebSocket Reconnection Test =====
  1. https://chronik2.alitayin.com - latency: 134ms
  2. https://chronik1.alitayin.com - latency: 137ms
  3. https://chronik-native2.fabien.cash - latency: 567ms
  4. https://chronik-native3.fabien.cash - latency: 569ms
  5. https://chronik.e.cash - latency: 869ms
  6. https://chronik-native1.fabien.cash - latency: >1000ms
Connecting to WebSocket...
WebSocket connected
Node 0 connected, closing immediately to test reconnection mechanism
Subscribing to block updates...
WebSocket reconnecting
Reconnecting to node 1 with a delay of 13 ms
WebSocket connected
Node 1 connected, closing immediately to test reconnection mechanism
WebSocket reconnecting
Reconnecting to node 2 with a delay of 27 ms
WebSocket connected
Node 2 connected, closing immediately to test reconnection mechanism
WebSocket reconnecting
Reconnecting to node 3 with a delay of 41 ms
WebSocket connected
Node 3 connected, closing immediately to test reconnection mechanism
WebSocket reconnecting
Reconnecting to node 4 with a delay of 55 ms
WebSocket connected
Node 4 connected, closing immediately to test reconnection mechanism
WebSocket reconnecting
Reconnecting to node 5 with a delay of 69 ms

Now, node switching works smoothly. When we set the number of nodes to 1, the delay becomes 500ms.

Diff Detail

Repository

rABC Bitcoin ABC

Branch

alita0423

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 33057
Build 65598: Build Diff	ecash-lib-integration-tests · ecash-agora-integration-tests · ecash-herald-tests · token-server-tests · chronik-client-integration-tests · chronik-client-tests
Build 65597: arc lint + arc unit

Event Timeline

alitayin created this revision. Wed, Apr 23, 06:44

Owners added a reviewer: Restricted Owners Package. Wed, Apr 23, 06:44

Herald added a reviewer: Restricted Project. Wed, Apr 23, 06:44

alitayin requested review of this revision. Wed, Apr 23, 06:44

Harbormaster completed remote builds in B33057: Diff 53617. Wed, Apr 23, 06:55

alitayin edited the test plan for this revision. Wed, Apr 23, 06:58

alitayin edited the summary of this revision. Wed, Apr 23, 07:40

Fabien added inline comments. Wed, Apr 23, 08:46

modules/chronik-client/src/failoverProxy.ts
288	What is this callback expected to do?
294	What is the point of the square? Why not divide by the number of nodes directly, so that we loop in a constant time regardless of the URL array length?

alitayin added inline comments. Wed, Apr 23, 09:50

modules/chronik-client/src/failoverProxy.ts
288	It lets users handle errors themselves, since we only reduce the retry frequency and never exit. If all nodes are unavailable, the client stays in a reconnect-and-wait loop. Applications can catch this and provide UI feedback, logging, or another appropriate response.
294	This non-linear design lets the client retry with shorter delays when it first encounters failures. For example, compare *20ms 40ms 60ms 80ms 100ms* (300ms in total) with *60ms 60ms 60ms 60ms 60ms*: the former switches faster across the "better nodes", given the "relatively healthy node index order". As failures accumulate, the retry delay increases, on the assumption that "nodes further down the index" can reasonably be given more delay time.

alitayin added inline comments. Wed, Apr 23, 09:58

modules/chronik-client/src/failoverProxy.ts
294	In fact, the total time decreases as the number of URLs increases. With 1 URL, it takes 500ms; with 2 URLs, it takes 125+250=375ms; with 5 URLs, it takes 300ms. The delay is designed to balance retry efficiency against the request rate per URL: the more URLs there are, the shorter the total duration can be. Considering the additional time needed for _websocketUrlConnects, the actual delay before each URL is "retried" will still be greater than 500ms.

Revision Contents
Changeset List

Path	Size
modules/chronik-client/src/failoverProxy.ts	15 lines

Diff 53617


modules/chronik-client/src/failoverProxy.ts
