
[chronik-client] Fix WebSocket retry loop issues during disconnection
ClosedPublic

Authored by alitayin on Wed, Apr 23, 06:44.

Details

Reviewers
Fabien
bytesofman
emack
Group Reviewers
Restricted Owners Package (Owns No Changed Paths)
Restricted Project
Commits
rABCca9b14c07a55: [chronik-client] Fix WebSocket retry loop issues during disconnection
Summary

Removed the previous delay handling. This diff only ensures that faulty URLs can be switched correctly; the improvements to the WsEndpoint object mentioned in the discussion will be handled in future diffs.

Before modification: when a WebSocket connection to a URL is terminated by a fault, triggering the onclose event callback, the system automatically attempts to reconnect to the same closed URL for failover, which may not succeed and poses a risk of recursion.

After modification: when a WebSocket connection to a URL is terminated by a fault, triggering the onclose event callback, the system automatically attempts to connect to the next available URL for failover.
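The intended failover behavior can be sketched as follows. `FailoverSelector` and its methods are hypothetical stand-ins for illustration; the real logic lives in modules/chronik-client/src/failoverProxy.ts and uses different identifiers:

```typescript
// Sketch of "switch to the next url on an unexpected close" (assumed
// names; not the actual failoverProxy implementation).
class FailoverSelector {
    private workingIndex = 0;
    constructor(private urls: string[]) {}

    /** URL the proxy should currently connect to. */
    current(): string {
        return this.urls[this.workingIndex];
    }

    /**
     * Called from the ws onclose handler when the close was NOT requested
     * by the user: advance to the next url instead of retrying the same
     * (possibly faulty) one, wrapping around at the end of the list.
     */
    onUnexpectedClose(): string {
        this.workingIndex = (this.workingIndex + 1) % this.urls.length;
        return this.current();
    }
}

const selector = new FailoverSelector([
    'wss://a.example/ws',
    'wss://b.example/ws',
]);
selector.onUnexpectedClose(); // now points at wss://b.example/ws
```

The wrap-around means that when every URL is faulty, the selector cycles through the whole list rather than exiting, matching the "keep waiting to reconnect" behavior discussed below.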

Test Plan

npm test

Event Timeline

Owners added a reviewer: Restricted Owners Package. Wed, Apr 23, 06:44
modules/chronik-client/src/failoverProxy.ts
288

What is this callback expected to do?

294

What is the point of the square? Why not divide by the number of nodes directly, so we loop in constant time independently of the url array length?

modules/chronik-client/src/failoverProxy.ts
288

Users can manage error handling themselves; we only reduce the retry frequency but don't exit. If all nodes are unavailable, the client remains in a reconnection-waiting state. Applications can catch this and provide corresponding UI feedback, logging, or other appropriate responses.

294

This non-linear design aims to let users retry immediately, with shorter delays, when first encountering failures.

For example, comparing 20ms, 40ms, 60ms, 80ms, 100ms (300ms in total) with 60ms, 60ms, 60ms, 60ms, 60ms: the former means faster switching attempts while we are still on the "better" nodes in a relatively healthy index order. As failures accumulate, the retry delay starts to increase, on the assumption that nodes further down the index can appropriately receive more delay time.

modules/chronik-client/src/failoverProxy.ts
294

In fact, the total time decreases as the number of URLs increases: with 1 URL it takes 500ms; with 2 URLs, 125 + 250 = 375ms; with 5 URLs, 300ms. The delay is designed to balance retry efficiency against the per-URL request rate, so the more URLs there are, the shorter the total round can be. Considering the additional time needed by _websocketUrlConnects, the actual delay before each URL is retried will still exceed 500ms.
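For reference, the figures quoted above are consistent with a per-attempt delay of i · 500 / n² ms for n URLs (a reconstruction inferred from the numbers in this discussion, not taken from the diff itself; the n² divisor is presumably the "square" asked about earlier):

```typescript
// Assumed delay schedule, reverse-engineered from the quoted figures
// (20/40/60/80/100ms for 5 urls, 125/250ms for 2 urls, 500ms for 1).
const BASE_MS = 500;

function retryDelayMs(attempt: number, numUrls: number): number {
    // attempt is 1-based; dividing by numUrls^2 makes each round's total
    // time SHRINK as the url list grows: total = 250 * (n + 1) / n ms.
    return (attempt * BASE_MS) / (numUrls * numUrls);
}

function totalRoundMs(numUrls: number): number {
    let total = 0;
    for (let i = 1; i <= numUrls; i++) {
        total += retryDelayMs(i, numUrls);
    }
    return total;
}
```

Under this assumed formula, `totalRoundMs(1)` is 500, `totalRoundMs(2)` is 375, and `totalRoundMs(5)` is 300, matching the totals claimed above.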

This is a complicated fix to a niche issue that impacts behavior everywhere. I don't see a compelling case that the impact of this diff justifies its complexity.

a node that can establish a WebSocket connection (_websocketUrlConnects) but immediately throws onerror or onclose after connection,

Is this happening somewhere? What kind of chronik server condition causes this?

This revision now requires changes to proceed. Wed, Apr 23, 16:40

“A node that can establish a WebSocket connection (_websocketUrlConnects) but immediately throws onerror or onclose after connecting,” is not an accurate description; it’s more of an extreme example.

However, the onerror and onclose being addressed here are specifically failures occurring within chronik itself, for various reasons. For instance, there was such an incident on March 17th, which caused chronik-native1 to run into a large number of reconnection loops.

The changes here only target WebSocket failures in chronik and do not affect other user behaviors (such as proto Error messages). Due to node maintenance, resource exhaustion, or network anomalies, it is more likely that a normally functioning node experiences fluctuations and triggers onerror; in these cases the client keeps making requests aggressively and cannot switch to a valid node.

The improvements themselves are not complicated:

  • It simply increments the index by 1.
  • It adds a slight delay before retrying to control the request frequency.
  • If nonlinear delays (e.g., exponential backoff) aren't a good choice, we can switch to a fixed delay instead, as Fabien suggested.

Fabien requested changes to this revision.Thu, Apr 24, 08:44

I think there is a fundamental issue, either with the analysis or the explanation of the analysis.
This diff does 2 things:

  • It switches to the next url when the ws is closed and this is not deliberate.
  • It adds a delay before switching the urls.

Let's go back to the claims of this diff:

if there are internal failures in chronik, whether network-related, resource-related, manual operations, or others, during onerror/onclose handling, the system will continuously attempt to reconnect to this faulty URL instead of switching to another URL

  1. The onerror case is not handled. More interestingly, it doesn't work and never worked to begin with: the callback will never be called, not with this code and not on master (this is not a bug introduced here). Also, what would fire this? What kind of error is this?
  2. The current code expects that if a ws connection is closed, you can either reconnect immediately and everything works fine, or it will fail to connect and switch to the next url because of the loop. You are claiming that the first assumption could be incorrect: the client could connect, but the server keeps closing the ws connection. I don't know if this is really something that we saw in practice, but there is a potential misbehavior here. Your proposal is to fix this by increasing the working index, which may or may not work (it's actually very dubious, because the next call to connectWs will override the value anyway if the first array element connects), but before we jump on this there is one question that remains unanswered (which I asked in my previous review): what is onReconnect() supposed to do? Isn't it already the mechanism to handle this case? Is it even used in any project for anything?

Now regarding the delay: this diff is trying to solve a high-frequency reconnection issue from the client when the above-mentioned loop error occurs:

  1. If this is another issue, it should be another diff.
  2. If this diff succeeds in fixing the reconnection issue, there is no need for any delay, because there is no reconnection anymore.

There might be reasons for delaying a reconnection attempt, but I don't think the rationale here makes sense, and the execution is not good either. If you fix the underlying issue, the problem is gone.

Regarding the review itself, the test plan is not the best. For example, the script you provided will not produce the expected output (the connection log is missing). I suggest you write a test that triggers the issue, so we can check that it fails before the patch and passes after the patch.

This revision now requires changes to proceed. Thu, Apr 24, 08:44

Based on offline discussions, Joey's view is correct. "A node that can establish a WebSocket connection (_websocketUrlConnects) but immediately throws onerror or onclose after connection" was inferred from a March failure, but that was my incorrect analysis of the cause of that failure, so this issue hasn't been captured in detail yet.

However, switching nodes still makes sense (onclose handling itself is not improved further in this diff), and index+1 does work. I suggest keeping it but improving onReconnect.

Therefore, the initial assumption of a "magical faulty node" has been set aside: nodes that can establish connections can work normally (even a re-indexing node can accept connections and won't fire onclose), so it's difficult to fall into the assumed loop and the delay can be removed.

@bytesofman @Fabien

Removed the previous delay handling. This diff only ensures that faulty nodes can be switched correctly.

alitayin edited the test plan for this revision. (Show Details)
alitayin edited the summary of this revision. (Show Details)

Add README.md and npm version patch

Fabien requested changes to this revision. Mon, Apr 28, 07:27
Fabien added inline comments.
modules/chronik-client/test/test.ts
339 (On Diff #53714)

Please check that connectWsCallCount doesn't increment after a manual close and that the working index doesn't change. You want to make sure that manually closing doesn't trigger the backend switch.
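The requested check can be sketched as a small simulation; `ManualCloseProxy` and its fields are hypothetical stand-ins for the real FailoverProxy (only connectWsCallCount is named in the review, the rest is assumed for illustration):

```typescript
// Sketch: a deliberate (manual) close must not advance the working index
// or trigger another connectWs call; only unexpected closes fail over.
class ManualCloseProxy {
    connectWsCallCount = 0;
    workingIndex = 0;
    private manuallyClosed = false;

    connectWs(): void {
        this.connectWsCallCount++;
    }

    close(): void {
        this.manuallyClosed = true; // remember that the user asked for it
    }

    // Simulates the ws onclose callback firing after the socket closes.
    handleClose(): void {
        if (this.manuallyClosed) {
            return; // deliberate close: no backend switch, no reconnect
        }
        this.workingIndex++; // unexpected close: fail over...
        this.connectWs(); //    ...and reconnect to the next url
    }
}
```

A test along the lines requested would connect once, call close(), fire handleClose(), and then assert that both connectWsCallCount and workingIndex are unchanged.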

This revision now requires changes to proceed. Mon, Apr 28, 07:27

We really need a functional test for this feature. That should be another diff, and it can be done after this one, since this one already comes with a test.

Add test for manual close

Tail of the build log:

  [ecash-lib per-file code coverage table and TeamCity coverage statistics omitted]
mv: cannot stat 'test_results/ecash-lib-integration-tests-junit.xml': No such file or directory
Build ecash-lib-integration-tests failed with exit code 1
This revision is now accepted and ready to land. Mon, Apr 28, 12:19