
[avalanche] Make the network more resilient to temporary network slow downs
Closed, Public

Authored by Fabien on Sep 26 2023, 13:29.

Details

Summary

Under severe network congestion, an avalanche response can arrive after its query has timed out. This can cause the peer's ban score to slowly increase over time, eventually leading to a disconnect.

Increasing the query timeout only reduces how often this occurs. It does not solve the issue, and it adds some delay before a misbehaving node gets banned.

This diff implements a windowed fault counter, so the node's ban score is only increased after a number of misbehaving messages have occurred. If the node resumes correct operation, the counter is reset after some time. This allows a little tolerance for a node facing networking issues while still preventing DoS.
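
As a rough illustration of the windowed fault counter idea, here is a minimal standalone sketch. It is not the code from this diff: the class name `WindowedFaultCounter`, its methods, the threshold, and the window length are all hypothetical. It also assumes that "reset after some time" means the counter is cleared once no fault has been seen for a full window, and that the counter restarts from zero after a penalty is applied.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical sketch of a windowed fault counter. Names, threshold and
// window semantics are illustrative assumptions, not taken from the diff.
class WindowedFaultCounter {
public:
    using Clock = std::chrono::steady_clock;

    WindowedFaultCounter(uint32_t threshold, Clock::duration window)
        : m_threshold(threshold), m_window(window) {
        assert(threshold > 0);
    }

    // Record one fault observed at `now`. Returns true when the caller
    // should increase the peer's ban score.
    bool RecordFault(Clock::time_point now) {
        // Assumption: a quiet period of at least one window since the
        // last fault means the peer recovered, so earlier faults are
        // forgotten.
        if (m_count > 0 && now - m_lastFault >= m_window) {
            m_count = 0;
        }
        m_lastFault = now;

        if (++m_count < m_threshold) {
            // Still within tolerance: no penalty yet.
            return false;
        }

        // Threshold reached: penalize and start counting again.
        m_count = 0;
        return true;
    }

    // Number of faults currently accumulated toward the next penalty.
    uint32_t Count() const { return m_count; }

private:
    uint32_t m_threshold;
    Clock::duration m_window;
    uint32_t m_count{0};
    Clock::time_point m_lastFault{};
};
```

With a threshold of 3 and a 60 second window, two late responses followed by a quiet period longer than 60 seconds cost the peer nothing, while three faults in quick succession trigger a single ban score increase. The timestamp is passed in as a parameter rather than read from the clock so the behavior can be exercised deterministically in tests.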

Test Plan
ninja all check-all

Diff Detail