Waiting too long when the process has finished only stalls the build agent
for an unnecessarily long period of time. If the logs are not available a few
seconds after the process completes, it should warn and continue.
Background:
We've started to see an increase in the number of CI builds that
appear to timeout and yet many of them are completed or nearly so.
In a common failure case such as this one, note how the build supposedly times out
and then the log flushes all at once: https://build.bitcoinabc.org/viewLog.html?buildId=77854&buildTypeId=BitcoinABC_BitcoinAbcStaging&tab=buildLog&_focus=1940
Yet scrolling to the bottom, you'll find the tests passed at the end.
The root cause of this is still unknown, although there are some hints that it
may be due to one or a combination of the following:
- Python version (CI is currently on 3.7, and lots of changes to asyncio were made from 3.6 - 3.8, so it's hard to know if this is an issue that was fixed or not).
- Python stdout buffering behavior. I'm running an A/B test to see if setting PYTHONUNBUFFERED modifies the behavior of the log flushing, but this will take time to evaluate. The docs indicate that this is supposed to be enabled by default in 3.7, but a quick reading of the Python source code indicates this may not be true.
- Docker hanging. For example, https://github.com/moby/moby/issues/41052 was resolved recently, but all of our build agents should have this fix, so I think this is less likely.
I have been unsuccessful in reproducing this locally, so the next best strategy
is for us to side-step the issue until we either reproduce or it gets fixed from
regular upgrades to our infra. This issue only affects a minority of builds,
but causes a disproportionate negative impact on devs as one failure out of
many builds on a diff causes a lot of noise during review.