Page MenuHomePhabricator

[avalanche] Compute a node activity score
ClosedPublic

Authored by Fabien on Jul 27 2021, 12:25.

Details

Reviewers
deadalnix
Group Reviewers
Restricted Project
Maniphest Tasks
Restricted Maniphest Task
Commits
rABC37aa80b686c4: [avalanche] Compute a node activity score
Summary

This score reflects the node activity over the last invs polled. It ranges between 0 and 100 so it can be easily expressed as a percentage.
This score can be used for determining which node is the most reliable in a peer and protect it from eviction.

Ref T1634.

Test Plan
ninja all check-all

Diff Detail

Event Timeline

Fabien requested review of this revision.Jul 27 2021, 12:25
deadalnix requested changes to this revision.Jul 27 2021, 14:00
deadalnix added a subscriber: deadalnix.
deadalnix added inline comments.
src/net.cpp
3064 ↗(On Diff #29313)

It sound like you should compute this live instead. This accumulate errors, and require a lock, that seems worse on all fronts. I highly doubt a couple of floating points ops are most costly than this (especially since floating point divide is actually cheaper than integer divide, and all the require atomic ops would be read only, while the mutex require a write).

3096 ↗(On Diff #29313)

This is not the right place. You don't even know if these nodes will exchange anything yet.

src/net.h
1058 ↗(On Diff #29313)

Shouldn't the mutex be on the smart pointer instead? The metric themselves can be atomic anyways, so a mutex doesn't seem necessary.

src/net_processing.cpp
4350 ↗(On Diff #29313)

Do we want this actually separated from the avalanche state? Maybe this should just be part of it.

This revision now requires changes to proceed.Jul 27 2021, 14:00

Move to AvalancheState, run the score update in the scheduler queue and use RCU to make it lock free most of the time.

deadalnix requested changes to this revision.Aug 2 2021, 19:25

I don't understand what this whole RCU business achieves. It's trivially completely wrong: the RCU pointer is stuffed into a larger structure that might be deleted anyways, nothing's actually protected. And it seems that you fitted all the metrics into one value that is small enough to be atomic anyways, so is RCU needed at all?

In any case, it is also self evident that the scheduler business is wrong, for the same reason the RCU business is wrong: it can't handle node deletion.

src/net.cpp
3056 ↗(On Diff #29326)

When does it stop?

3065 ↗(On Diff #29326)

I don't think this code will misbehave if the count is zero.

3071 ↗(On Diff #29326)

dito

3082 ↗(On Diff #29326)

I though you were implementing an exponential low pass filter. You should update every window.

3099 ↗(On Diff #29326)

This achieves absolutely nothing meaningful. nextScore is assigned from availability score, so you are just assigning it to itself. But i can't even figure out what you meant to do here. The whole point of RCU to be able to wait for other to stop using a resource to do something with it, for instance compute statistics on value that won't ever be updated. For that to happen, you need to remove the pointer from public view, then wait for others to reach a revision after the pointer is out of public view, then you know you are the only owner of the resosurce and you can do whatever with it - for instance destroy it.

If you leave it there, then it doesn't matter how long you wait, someone will still be able to see it.

But it seems that you didn't even try to do that because you jumped through hoops to make all the stats fit into an atomic, and you reset them all atomically this way, so I'm not sure why is there even an attempt at doing some swapping business.

I wish I could be of more help, but none of this makes any sense to me. i don't even know what you are trying to do with this.

This revision now requires changes to proceed.Aug 2 2021, 19:25
  • Use a single sheduler task, so node deletion is not a problem
  • use a simple Mutex to lock only during updates, reads are lock free
  • use a different score computation, no longer special case when there is no poll during the window
deadalnix requested changes to this revision.Aug 4 2021, 21:39
deadalnix added inline comments.
src/avalanche/processor.cpp
702 ↗(On Diff #29327)

If the node doesn't have an avalanche state, we shouldn't be polling it.

src/net.cpp
3068 ↗(On Diff #29327)

Why? This comment says what I already know because I can read the code, but it doesn't tell me what i want to know: why.

3070 ↗(On Diff #29327)

This is bad way to compute the expo, because difference between floats tends to destroy precision, and in this case, precision errors do not cancel out. What you want is something akin to

(previousScore + k * (2 * votes - polls)) / (1 + k)

No need to do as flat what you can do as int, and no need to fear the division. It is faster for floats than it is for ints, and do not lose precision (same goes for multiplications).

src/net.h
103 ↗(On Diff #29327)

Refreshing all the node every minutes seems like it's way overkill.

1051 ↗(On Diff #29327)

Why does this need to be a different object? I don't think it does.

src/net_processing.cpp
1557 ↗(On Diff #29327)

Does that pass sanitizers? You are capturing by reference, but that lambda clearly outlive what's captured. You need to make things explicit, like for the other lambda.

This revision now requires changes to proceed.Aug 4 2021, 21:39
  • Remove the redundant check the m_avalanche_state != nullptr. This requires to build it for the tests.
  • Don't clamp the min score to 0. It was only there to make it more human friendly by keeping the score always a positive value but was not really needed.
  • Compute without any float substraction.
  • Bump the refresh time to 10 minutes.
  • Remove the AvalancheStatistics object and merge it with the AvalancheState.
  • Explicitely capture this for the scheduler task.
deadalnix requested changes to this revision.Aug 5 2021, 11:44
deadalnix added inline comments.
src/net.cpp
3068

Do not do floating point ops when you don't need to.

I also guess that decay factor is negative here, and you end up doing a fair bit of double negations that makes the code confusing.

src/net.h
1077

Add new lines in between these.

This revision now requires changes to proceed.Aug 5 2021, 11:44
  • Add a couple new lines
  • Use uppercase for the precomputed decay factor, since it's a constant
  • Simplify the reading by directly using the well known form of the exponential computation
  • 2. * votes - polls don't need to be float calculation
This revision is now accepted and ready to land.Aug 6 2021, 01:29
This revision was automatically updated to reflect the committed changes.