
[chronik-client] Fix WebSocket retry loop issues during disconnection
Needs Review
Public

Authored by alitayin on Wed, Apr 23, 06:44.

Details

Reviewers
Fabien
bytesofman
emack
Group Reviewers
Restricted Owners Package(Owns No Changed Paths)
Restricted Project
Summary

This change addresses the following issue. When a node accepts a WebSocket connection (so it passes _websocketUrlConnects) but fires onerror or onclose immediately afterwards, the client keeps reconnecting to that same faulty node instead of switching to another one, and it only moves on once the WebSocket connection itself fails to establish. If the faulty node is the only URL available, this produces high-frequency requests to it and consumes significant resources. If other nodes are available, the client still never fails over to them.

The solution is to delay each reconnection attempt with setTimeout (which may also prevent stack overflow in extreme cases, though
this is uncertain) while making sure the client switches nodes. The fallback delay is computed dynamically from the number of nodes.
With only 1 node, the client keeps retrying with a 500ms delay: it never gives up, but it no longer causes the issues described
above. With 5 nodes, the base delay is 500ms divided by the square of the node count, which is 20ms, and the delay then scales
with this._workingIndex (0 to 4):

When this._workingIndex = 0: 20 * 1 = 20ms
When this._workingIndex = 1: 20 * 2 = 40ms
When this._workingIndex = 2: 20 * 3 = 60ms
When this._workingIndex = 3: 20 * 4 = 80ms
When this._workingIndex = 4: 20 * 5 = 100ms
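
The calculation above can be sketched as a small function. This is only an illustration: the helper names and the 500ms constant name are hypothetical, and truncating with Math.floor is an assumption inferred from the whole-millisecond delays in the test output (13, 27, 41, ... for 6 nodes), not taken from the diff itself.

```typescript
// Hypothetical helper: names and rounding are assumptions, not the code in failoverProxy.ts.
const MAX_DELAY_MS = 500;

function fallbackDelayMs(nodeCount: number, workingIndex: number): number {
    if (!Number.isInteger(nodeCount) || nodeCount < 1) {
        throw new RangeError('nodeCount must be a positive integer');
    }
    // The base delay shrinks with the square of the node count
    const baseDelay = MAX_DELAY_MS / (nodeCount * nodeCount);
    // The delay grows linearly with the position in the URL list
    return Math.floor(baseDelay * (workingIndex + 1));
}

// Delays for every index of a URL list of the given length
function delaysFor(nodeCount: number): number[] {
    return Array.from({ length: nodeCount }, (_, i) =>
        fallbackDelayMs(nodeCount, i),
    );
}

console.log(delaysFor(5)); // [ 20, 40, 60, 80, 100 ]
console.log(delaysFor(1)); // [ 500 ]
```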

The goal is to avoid cycling through nodes too quickly when every node is unavailable (passes _websocketUrlConnects but is actually
faulty) while keeping retries efficient. For example, a full pass over the list takes 500ms with 1 node, 125+250=375ms with 2 nodes,
and 55+111+166=332ms with 3 nodes.

This solution can solve the aforementioned issues. It allows the reconnection mechanism to function properly in various situations
without getting stuck in resource-intensive loops.
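
As a rough illustration of the mechanism described in this summary, the sketch below advances to the next node and defers the retry with setTimeout. All names here are hypothetical and the structure is not the code in failoverProxy.ts. It assumes the delay is derived from the index of the node that just failed and truncated to whole milliseconds, which is consistent with the "Reconnecting to node 1 with a delay of 13 ms" lines in the test plan.

```typescript
// Hypothetical sketch: names and structure are assumptions, not failoverProxy.ts.
interface ReconnectPlan {
    nextIndex: number; // node to try next
    delayMs: number; // how long to wait before trying it
}

function planReconnect(failedIndex: number, nodeCount: number): ReconnectPlan {
    const baseDelay = 500 / (nodeCount * nodeCount);
    return {
        // Always move on, wrapping around; with a single node this stays at 0
        nextIndex: (failedIndex + 1) % nodeCount,
        delayMs: Math.floor(baseDelay * (failedIndex + 1)),
    };
}

function scheduleReconnect(
    failedIndex: number,
    nodeCount: number,
    connect: (index: number) => void,
): ReturnType<typeof setTimeout> {
    const { nextIndex, delayMs } = planReconnect(failedIndex, nodeCount);
    // setTimeout defers the retry to a later event-loop turn instead of
    // re-entering connect() synchronously from the close handler
    return setTimeout(() => connect(nextIndex), delayMs);
}
```

Keeping the plan as a pure function makes the delay and index logic easy to unit test without real timers; only scheduleReconnect touches setTimeout.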

Test Plan

Added a local test script.

// Alitatest.ts
import { ChronikClient, ConnectionStrategy } from './src/ChronikClient';

function sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
}

async function main() {
    try {
        const urls = [
            'https://chronik.e.cash',
            'https://chronik1.alitayin.com',
            'https://chronik-native1.fabien.cash',
            'https://chronik-native2.fabien.cash',
            'https://chronik-native3.fabien.cash',
            'https://chronik2.alitayin.com'
        ];

        console.log('===== Starting WebSocket Reconnection Test =====');
        
        const chronik = await ChronikClient.useStrategy(
            ConnectionStrategy.ClosestFirst, 
            urls
        );
        
        const wsEndpoint = chronik.ws({
            onMessage: (msg) => {
                if (msg.type === 'Block') {
                    console.log(`Received block message: ${msg.blockHash}`);
                } else if (msg.type === 'Tx') {
                    console.log(`Received transaction message: ${msg.txid}`);
                } else if (msg.type === 'Error') {
                    console.log(`Received error message: ${msg.msg}`);
                }
            },
            onConnect: () => console.log('WebSocket connected'),
            onReconnect: () => console.log('WebSocket reconnecting'),
            onError: () => console.log('WebSocket connection error'),
            onEnd: () => console.log('WebSocket connection ended'),
        });
        
        console.log('Connecting to WebSocket...');
        await wsEndpoint.waitForOpen();
        
        console.log('Subscribing to block updates...');
        wsEndpoint.subscribeToBlocks();
        
        await sleep(60000); 
        
        wsEndpoint.close();
        console.log('===== WebSocket Reconnection Test Completed =====');

    } catch (error) {
        console.error('Test error:', error);
        return 1;
    }
    return 0;
}

main().then(process.exit);

Run as is, this script connects to the nodes normally without any failures. To simulate nodes that are reachable but disconnect immediately, add the following at the end of wsEndpoint.connected = new Promise (line 333):

console.log(`Node ${this._workingIndex} connected, closing immediately to test reconnection mechanism`);
setTimeout(() => ws.close(), 100);

This simulates nodes that establish a WebSocket connection but fail right after. Running the script without the fix gives:

===== Starting WebSocket Reconnection Test =====
  1. https://chronik2.alitayin.com - latency: 165ms
  2. https://chronik-native2.fabien.cash - latency: 580ms
  3. https://chronik-native3.fabien.cash - latency: 647ms
  4. https://chronik.e.cash - latency: 896ms
  5. https://chronik1.alitayin.com - latency: >1000ms
  6. https://chronik-native1.fabien.cash - latency: >1000ms
Connecting to WebSocket...
WebSocket connected
Node 0 connected, closing immediately to test reconnection mechanism
Subscribing to block updates...
WebSocket reconnecting
WebSocket connected
Node 0 connected, closing immediately to test reconnection mechanism
WebSocket reconnecting
WebSocket connected
Node 0 connected, closing immediately to test reconnection mechanism
WebSocket reconnecting
WebSocket connected
Node 0 connected, closing immediately to test reconnection mechanism

This reproduces the continuous loop on node 0 before the fix.

Then, if we test again with our fix, we get:

===== Starting WebSocket Reconnection Test =====
  1. https://chronik2.alitayin.com - latency: 134ms
  2. https://chronik1.alitayin.com - latency: 137ms
  3. https://chronik-native2.fabien.cash - latency: 567ms
  4. https://chronik-native3.fabien.cash - latency: 569ms
  5. https://chronik.e.cash - latency: 869ms
  6. https://chronik-native1.fabien.cash - latency: >1000ms
Connecting to WebSocket...
WebSocket connected
Node 0 connected, closing immediately to test reconnection mechanism
Subscribing to block updates...
WebSocket reconnecting
Reconnecting to node 1 with a delay of 13 ms
WebSocket connected
Node 1 connected, closing immediately to test reconnection mechanism
WebSocket reconnecting
Reconnecting to node 2 with a delay of 27 ms
WebSocket connected
Node 2 connected, closing immediately to test reconnection mechanism
WebSocket reconnecting
Reconnecting to node 3 with a delay of 41 ms
WebSocket connected
Node 3 connected, closing immediately to test reconnection mechanism
WebSocket reconnecting
Reconnecting to node 4 with a delay of 55 ms
WebSocket connected
Node 4 connected, closing immediately to test reconnection mechanism
WebSocket reconnecting
Reconnecting to node 5 with a delay of 69 ms

Now, node switching works smoothly. When we set the number of nodes to 1, the delay becomes 500ms.

Event Timeline

Owners added a reviewer: Restricted Owners Package. Wed, Apr 23, 06:44
modules/chronik-client/src/failoverProxy.ts
288

What is this callback expected to do?

294

What is the point of the square? Why not divide by the number of nodes directly, so that a full loop takes a constant time regardless of the URL array length?

modules/chronik-client/src/failoverProxy.ts
288

Users can handle errors themselves, since we only reduce the retry frequency and never exit. If all nodes are unavailable, the client stays in the reconnect-and-wait cycle. Applications can catch this and respond with UI feedback, logging, or other appropriate handling.
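
One way an application might do this is to count consecutive reconnects and react once a threshold is reached. The sketch below is purely illustrative: ReconnectMonitor is a hypothetical application-side helper, not part of chronik-client, and it assumes the application wires handleReconnect into its onReconnect handler. The counter is reset on a received message rather than on connect, because the faulty nodes in this scenario do connect before dropping.

```typescript
// Hypothetical application-side helper; not part of chronik-client.
class ReconnectMonitor {
    private consecutiveReconnects = 0;
    private readonly threshold: number;
    private readonly onUnhealthy: (count: number) => void;

    constructor(threshold: number, onUnhealthy: (count: number) => void) {
        this.threshold = threshold;
        this.onUnhealthy = onUnhealthy;
    }

    // Call from the onReconnect handler
    handleReconnect(): void {
        this.consecutiveReconnects += 1;
        // Fire once when the threshold is reached, not on every later retry
        if (this.consecutiveReconnects === this.threshold) {
            this.onUnhealthy(this.consecutiveReconnects);
        }
    }

    // Call from the onMessage handler: a received message shows the node works
    handleMessage(): void {
        this.consecutiveReconnects = 0;
    }
}

const monitor = new ReconnectMonitor(10, count =>
    console.warn(`No healthy node after ${count} consecutive reconnects`),
);
```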

294

This non-linear design aims to allow users to retry immediately with shorter delays when first encountering failures.

For example, compare 20ms, 40ms, 60ms, 80ms, 100ms (300ms in total) with a constant 60ms, 60ms, 60ms, 60ms, 60ms (also 300ms in total). The former switches faster among the "better nodes", on the assumption that nodes earlier in the index order are relatively healthy. As failures accumulate, the retry delay increases, on the assumption that nodes further down the index can reasonably be given more delay.
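
To make the comparison concrete, the short sketch below computes the elapsed time after each retry for both schedules. The helper name is hypothetical and the delay values are the ones quoted in this comment.

```typescript
// Running totals: element k is the time elapsed once the k-th retry has fired
function cumulative(delays: number[]): number[] {
    let elapsed = 0;
    return delays.map(d => (elapsed += d));
}

const ramped = [20, 40, 60, 80, 100]; // 500 / 5^2 * (index + 1)
const constant = [60, 60, 60, 60, 60]; // same 300ms total, spread evenly

console.log(cumulative(ramped)); // [ 20, 60, 120, 200, 300 ]
console.log(cumulative(constant)); // [ 60, 120, 180, 240, 300 ]
```

Both schedules finish a full pass at 300ms, but the ramped one fires the second retry at 60ms instead of 120ms and the third at 120ms instead of 180ms.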

modules/chronik-client/src/failoverProxy.ts
294

In fact, the total time decreases as the number of URLs increases. With 1 URL, a full pass takes 500ms; with 2 URLs, it takes 125+250=375ms; with 5 URLs, it takes 300ms. This is because the delay is designed to balance retry efficiency against the request rate per URL. The more URLs there are, the shorter the total duration can be. Considering the additional time needed for _websocketUrlConnects, the actual delay before each URL is "retried" will still be greater than 500ms.
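
The trend can be checked with a short derivation. Writing n for the number of URLs and T(n) for the sum of the delays over one full pass (both symbols are introduced here for illustration), and ignoring any rounding of individual delays:

```latex
T(n) = \sum_{i=0}^{n-1} \frac{500}{n^{2}}\,(i+1)
     = \frac{500}{n^{2}} \cdot \frac{n(n+1)}{2}
     = 250 + \frac{250}{n} \quad \text{(ms)}
```

This gives T(1) = 500ms, T(2) = 375ms and T(5) = 300ms, matching the figures above, and T(n) decreases toward 250ms as n grows. For n = 3 the formula gives about 333ms; the 332ms quoted in the summary is what results if each delay is truncated to a whole millisecond, which is an assumption about the implementation.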