IRC logs for #buildstream for Monday, 2022-02-21

*** coldtom <coldtom!coldtom@2a00:23c7:5e9a:5301:0:0:0:1b8> has joined #buildstream08:04
*** tpollard <tpollard!tompollard@cpc109023-salf6-2-0-cust777.10-2.cable.virginm.net> has joined #buildstream09:36
*** tpollard <tpollard!tompollard@cpc109023-salf6-2-0-cust777.10-2.cable.virginm.net> has quit IRC17:30
*** coldtom <coldtom!coldtom@2a00:23c7:5e9a:5301:0:0:0:1b8> has quit IRC17:33
nanonymejuergbi: I really wonder at this point if using regular GitHub public runners would work more reliably than the Apache runners18:58
juergbinanonyme: I don't know but I don't think we could even choose those18:59
nanonymeYeah, I guess you would have to stop being an Apache project to do so18:59
juergbiin that case we would have stayed on gitlab19:00
nanonymeMaybe it makes sense to move back if the situation doesn't improve? The project is at the moment at the point where project members merge PRs that do not go through CI because the build infrastructure is so broken19:01
nanonymeHaving CI that gets routinely ignored is pointless19:02
juergbiyes, we definitely need to solve this19:02
juergbinanonyme: are there ever significant issues for jobs that don't exceed 30min or 'only' for longer jobs?19:03
nanonymeIs there any way to escalate to Apache?19:03
juergbiyes, we can definitely talk to the infrastructure team. I don't know whether tristan already has19:04
nanonymejuergbi: I don't know, bst master overall is super-slow. Even linter takes almost ten minutes to run (it takes a minute on bst-1)19:04
juergbithis line takes 7 min: lint installdeps: -rrequirements/requirements.txt, -rrequirements/dev-requirements.txt19:05
juergbiso I suspect the issue there is that it's compiling grpcio or some other Python module19:05
nanonymeHmm, why would it build grpcio in the first place anyway? Shouldn't it be using pre-built wheels? Those are what bst consumers will most likely be using anyway, unless they use distro packages for bst19:06
juergbisplitting up the tests might help (non-integration vs. integration and also separate [external] plugin tests)19:07
nanonymeIt did help a bit to make everything sequential through https://github.com/apache/buildstream/pull/159319:07
juergbiI don't know. it's just my suspicion because installing from wheels shouldn't take that long19:07
nanonymeBut not much, everything still hangs even if we run tests one by one19:07
nanonymejuergbi: and https://github.com/apache/buildstream/runs/5267010220?check_suite_focus=true was still almost ten minutes even with a guarantee that nothing else was running at the same time19:09
nanonymeIt would be nice if that had timestamps19:11
juergbinanonyme: press Shift-T with the log open19:14
nanonymejuergbi: but yeah, the infra team really needs to be consulted. It looks like builds are somehow being terminated, probably from the runner side, so the GitHub workflow goes into a hung state.19:14
juergbinanonyme: I've pinged someone from the infra team. let's see whether he can provide some insight19:15
*** coldtom <coldtom!coldtom@2a00:23c7:5e9a:5301:0:0:0:1b8> has joined #buildstream19:16
nanonymeGreat, thanks19:17
nanonymejuergbi: but yeah, lessons learned, concurrency can be managed efficiently by using needs between jobs to construct a pipeline and max-parallel with matrix. This may make individual jobs pass faster.19:19
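
As a rough illustration of what nanonyme describes, a workflow along these lines chains jobs with needs and throttles the matrix with max-parallel; the job names, tox environments and matrix values here are hypothetical, not the project's actual workflow file:

    jobs:
      lint:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v2
          - run: tox -e lint                  # hypothetical tox environment
      tests:
        needs: lint                           # forms a pipeline: tests wait for lint
        runs-on: ubuntu-latest
        strategy:
          max-parallel: 1                     # run one matrix node at a time
          fail-fast: false
          matrix:
            image: [debian-10, fedora-35]     # illustrative matrix values
        steps:
          - uses: actions/checkout@v2
          - run: tox -e py39                  # hypothetical test invocation
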
juergbiour own runners could theoretically be an option: https://cwiki.apache.org/confluence/display/INFRA/GitHub+self-hosted+runners19:20
juergbibut we'd need someone to sponsor and maintain19:20
nanonymeIndeed19:20
juergbia single beefy machine could actually be really useful and not that expensive but still19:21
nanonymeAt that point it's a bit questionable how much benefit you're getting out of being an Apache project anymore but anyway19:21
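
For context on the self-hosted option juergbi links, targeting a sponsored machine would roughly mean pointing runs-on at the registered runner's labels; the labels and test command below are illustrative only:

    jobs:
      tests:
        runs-on: [self-hosted, linux, x64]    # labels assigned when the sponsored runner is registered
        steps:
          - uses: actions/checkout@v2
          - run: tox -e py39                  # hypothetical test invocation
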
juergbiIt makes a lot of sense for a project to be under a legal entity19:26
juergbimany other ASF projects apparently use GitHub runners without issues right now19:31
juergbiwondering whether we changed anything that has triggered this issue19:31
juergbitrying action with a revert https://github.com/apache/buildstream/actions/runs/187804461119:46
nanonymejuergbi: https://github.com/apache/buildstream/runs/5279286321?check_suite_focus=true already shows signs of hang19:55
nanonymeEven at ten minutes19:55
nanonymeWith well-behaving builds you can get console output by opening the job; in this bad case you can't19:56
nanonymejuergbi: I accidentally ran a PR with changes from last March or so and it failed the same way. There have been PRs since then that have not failed.19:57
nanonymeIt might be a repository configuration issue though19:57
juergbithat's interesting19:57
juergbiin the summary I see two kinds of errors19:58
juergbi* An error occurred while provisioning resources (Error Type: Disconnect).19:58
juergbi* The hosted runner: GitHub Actions 51 lost communication with the server. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.19:58
juergbiAt least the first one might be completely independent of the actual job19:58
nanonymejuergbi: right, I originally suspected it might be something to do with overuse of resources, hence in my MR I added needs so only the matrix build runs, and set max-parallel to 1 so the matrix runs one node at a time20:00
nanonymeBut it still failed :D20:00
juergbinanonyme: did you ever see issues on bst-1?20:01
*** coldtom <coldtom!coldtom@2a00:23c7:5e9a:5301:0:0:0:1b8> has quit IRC20:01
nanonymejuergbi: no but bst-1 tests also finish very fast20:01
nanonymejuergbi: eg https://github.com/apache/buildstream/actions/runs/185680062520:01
juergbiright but if there is a runner provisioning error, I'd expect bst-1 to occasionally be affected as well20:02
juergbithe lower job count may help, of course20:02
nanonymeYes, that's why I lowered master to same as bst-1 but master still fails, bst-1 never fails20:02
juergbiright, that's really odd20:03
juergbiunless there are some kind of long-term resource limits per target branch20:03
juergbiwell, my test action run doesn't even have a PR20:03
nanonymeI just hope they haven't done some master-to-main configuration change that breaks CI in master20:04
juergbiafter pretty much exactly 30 min 6 jobs fail at the same time20:16
juergbitwo with: An error occurred while provisioning resources (Error Type: Disconnect).20:16
juergbifour with: Received request to deprovision: The request was cancelled by the remote provider.20:16
nanonymejuergbi: is it really failing at 30 minutes? The jobs stop emitting output far earlier20:17
juergbithe error message/status appeared after 30 min20:17
juergbiI suspect it's a provisioning timeout20:17
nanonymeHmm, but what about my PR? It was clearly running only one thing and that got as far as 58% of the tests done until it hung and failed, again at around 30 minutes20:18
nanonymejuergbi: can you create a copy of master with some other name and we do test pipeline there?20:19
juergbithat errored out with what error message?20:19
juergbithe test run I did was not a PR, so there is no direct connection to master, just my branch20:20
nanonymeNo error, output just stopped20:20
juergbinanonyme: can you link to the one you're referring to?20:20
nanonymejuergbi: the last thing I saw in the logs tends to be around tests/integration/artifact.py::test_cache_buildtrees PASSED              [ 58%], then it just stopped emitting anything for ten minutes and failed at thirty minutes20:21
juergbiok but what's the corresponding error message in the action summary?20:22
nanonymeI don't know, where can you see it? https://github.com/apache/buildstream/runs/5265739784?check_suite_focus=true20:22
nanonymeOh, there20:23
juergbiReceived request to deprovision: The request was cancelled by the remote provider.20:23
nanonymeBut why do they get cancelled? Are they protesting against the branch name master or what? :D20:23
juergbithis means that this error happens after it was successfully provisioned20:24
nanonymeRight20:24
nanonymehttps://github.com/apache/buildstream/actions/runs/1873212020 also exotic output20:24
juergbiit would be interesting to aggressively cut down tests20:25
juergbioptimizing CI job time would be a good thing anyway, I suppose20:26
nanonymejuergbi: ah yes, exactly. So those ran sequentially. Looks like the runner somehow died while executing the Debian 10 tests, and the Fedora ones failed because the runner was down20:26
nanonymeIt's weird that that would happen though. Doesn't Apache have more than one runner?20:28
nanonymejuergbi: next iteration: I'll re-order matrix so Fedora 35 is run first, not Debian 10 https://github.com/apache/buildstream/runs/5279821414?check_suite_focus=true20:45
juergbimaybe we could run the most important tests first in jobs that hopefully don't take so long20:47
juergbiand later tests would be for additional distro coverage, external plugins etc.20:47
nanonymeLike smoke tests. Sure.20:47
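
A sketch of the smoke-tests-first idea, again with hypothetical job names and test selections: a small, fast job gates the wider distro/plugin matrix via needs, so the most important feedback arrives before the long runs start:

    jobs:
      smoke:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v2
          - run: tox -e py39 -- tests/frontend   # hypothetical quick subset, not the full suite
      coverage:
        needs: smoke                             # extra distros, external plugins etc. only after smoke passes
        runs-on: ubuntu-latest
        strategy:
          matrix:
            image: [debian-10, fedora-35]        # illustrative
        steps:
          - uses: actions/checkout@v2
          - run: tox -e py39                     # full integration run
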
juergbiwe need to speed up the tox installation part such that we can utilize the working CI time for actual tests20:48
nanonymeIt would help some if we built and cached wheels but I'm not sure if that really works20:49
nanonymeLike grpcio wheel could be cached to avoid recompilation20:49
nanonymeStill, regardless, there's something dramatically wrong with the infra20:50
nanonymejuergbi: in general we're invoking setup.py quite a lot of times for deps and that's slow. Wheels are always preferable; they can just be unpacked.20:51
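
A sketch of the wheel idea, assuming pip performs the installs and the jobs run on Linux: cache pip's cache directory between runs and prefer pre-built wheels, so grpcio is not recompiled on every job. The action versions and the cache key are illustrative, not the project's actual configuration:

    steps:
      - uses: actions/checkout@v2
      - uses: actions/cache@v2
        with:
          path: ~/.cache/pip                                # pip's default cache dir on Linux
          key: pip-${{ hashFiles('requirements/*.txt') }}   # reuse the cache until requirements change
      - run: pip install --prefer-binary -r requirements/requirements.txt
        # --prefer-binary makes pip pick a wheel over an sdist when both exist,
        # so grpcio is only compiled when no compatible wheel is published
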
nanonymejuergbi: what? Why do we have hardcoded 30 minute timeout with tox?20:53
juergbioh, indeed, I forgot about that20:54
nanonymeWell, that does definitely explain why all runs stop at 30 minutes20:54
juergbiah but that's per test20:54
juergbinot per job20:54
nanonymeAre you sure?20:54
juergbiand would result in a pytest error, not a GitHub runner error20:54
nanonymeWell, pytest timeout method is raising signal20:55
nanonymeMaybe that signal is breaking everything20:55
nanonymeErm, tox timeout method even20:55
juergbiit's pytest, not tox, right?20:55
juergbiin setup.cfg20:56
nanonymeYeah. It just outputs20:56
nanonymeMon, 21 Feb 2022 20:51:46 GMT timeout: 1800.0s20:56
nanonymeMon, 21 Feb 2022 20:51:46 GMT timeout method: signal20:56
juergbiyes, that's the pytest-timeout plugin config20:56
juergbiI don't think that's related20:57
nanonymeWe'll see soon. If this iteration fails, I'll remove timeout for next one.21:01
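
For the "remove the timeout" experiment, assuming the tox environments forward their positional arguments to pytest, the pytest-timeout setting from setup.cfg could be overridden for one diagnostic run instead of being edited out, roughly:

    - name: Run tests without the 30 min per-test timeout (diagnostic only)
      run: tox -e py39 -- --timeout=0   # pytest-timeout treats 0 as disabled; assumes tox passes posargs to pytest
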
nanonymejuergbi: hmm, F35 running as first one, *again* output completely stopped at tests/integration/artifact.py::test_cache_buildtrees PASSED              [ 58%]21:10
juergbiI think that's the slowest test21:11
juergbitook 33.6s locally21:12
nanonymeyeah but it said passed21:12
nanonymeSo it's clearly hanging after that21:12
nanonymejuergbi: test_preserve_environment is the first test that didn't run21:13
nanonymejuergbi: hmm, where are we actually writing temp data during build with tmpdir? It's not in /tmp, right?21:15
nanonymeJust wondering, what if we're writing to RAM when we think we're writing to disk21:16
nanonymeAh, that cannot be the case. We use --basetemp ./tmp21:17
juergbitox/pytest create tmp directories, iirc21:17
nanonymeSo we write under that21:17
nanonymejuergbi: well, yes, but it matters where we tell it to write them. I suppose ./tmp is fine though, that should probably actually be on disk21:18
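
To check the writing-to-RAM suspicion, a hypothetical diagnostic step could print the filesystem types behind /tmp and the checkout before the tests start; tmpfs in the output would confirm that temp data is actually going to memory:

    - name: Check whether temp data is RAM-backed
      run: |
        df -hT /tmp .                                               # "tmpfs" in the Type column means RAM-backed
        python3 -c "import tempfile; print(tempfile.gettempdir())"  # where pytest would default to without --basetemp
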
