*** coldtom <coldtom!coldtom@2a00:23c7:5e9a:5301:0:0:0:1b8> has joined #buildstream | 08:04 | |
*** tpollard <tpollard!tompollard@cpc109023-salf6-2-0-cust777.10-2.cable.virginm.net> has joined #buildstream | 09:36 | |
*** tpollard <tpollard!tompollard@cpc109023-salf6-2-0-cust777.10-2.cable.virginm.net> has quit IRC | 17:30 | |
*** coldtom <coldtom!coldtom@2a00:23c7:5e9a:5301:0:0:0:1b8> has quit IRC | 17:33 | |
nanonyme | juergbi: I really wonder at this point if using regular GitHub public runners would work more reliably than the Apache runners | 18:58 |
juergbi | nanonyme: I don't know but I don't think we could even choose those | 18:59 |
nanonyme | Yeah, I guess you would have to stop being an Apache project to do so | 18:59 |
juergbi | in that case we would have stayed on gitlab | 19:00 |
nanonyme | Maybe it makes sense to move back if the situation doesn't improve? The project is at the point where project members merge PRs that haven't gone through CI because the build infrastructure is so broken | 19:01 |
nanonyme | Having CI that gets routinely ignored is pointless | 19:02 |
juergbi | yes, we definitely need to solve this | 19:02 |
juergbi | nanonyme: are there ever significant issues for jobs that don't exceed 30min or 'only' for longer jobs? | 19:03 |
nanonyme | Is there any way to escalate to Apache? | 19:03 |
juergbi | yes, we can definitely talk to the infrastructure team. I don't know whether tristan already has | 19:04 |
nanonyme | juergbi: I don't know, bst master overall is super-slow. Even linter takes almost ten minutes to run (it takes a minute on bst-1) | 19:04 |
juergbi | this line takes 7 min: lint installdeps: -rrequirements/requirements.txt, -rrequirements/dev-requirements.txt | 19:05 |
juergbi | so I suspect the issue there is that it's compiling grpcio or some other Python module | 19:05 |
nanonyme | Hmm, why would it build grpcio in the first place anyway? Shouldn't it be using pre-built wheels? Those are what bst consumers will most likely be using anyway, unless they use distro packages for bst | 19:06 |
juergbi | splitting up the tests might help (non-integration vs. integration and also separate [external] plugin tests) | 19:07 |
nanonyme | It did help a bit to make everything sequential through https://github.com/apache/buildstream/pull/1593 | 19:07 |
juergbi | I don't know. it's just my suspicion because installing from wheels shouldn't take that long | 19:07 |
nanonyme | But not much, everything still hangs even if we run tests one by one | 19:07 |
nanonyme | juergbi: and https://github.com/apache/buildstream/runs/5267010220?check_suite_focus=true was still almost ten minutes even with guarantee that nothing else was running at the same time | 19:09 |
nanonyme | It would be nice if that had timestamps | 19:11 |
juergbi | nanonyme: press Shift-T with the log open | 19:14 |
nanonyme | juergbi: but yeah, the infra team really needs to be consulted. It looks like builds are somehow being terminated, probably from the runner side, so the GitHub workflow ends up in a hung state. | 19:14 |
juergbi | nanonyme: I've pinged someone from the infra team. let's see whether he can provide some insight | 19:15 |
*** coldtom <coldtom!coldtom@2a00:23c7:5e9a:5301:0:0:0:1b8> has joined #buildstream | 19:16 | |
nanonyme | Great, thanks | 19:17 |
nanonyme | juergbi: but yeah, lessons learned: concurrency can be managed efficiently by using needs between jobs to construct a pipeline and max-parallel with a matrix. That may make individual jobs pass faster. | 19:19 |
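A minimal sketch of that layout, assuming tox environments like the ones discussed above; the job names and matrix entries are illustrative rather than BuildStream's actual workflow config:

```yaml
# Hypothetical GitHub Actions sketch: `needs` turns independent jobs into a
# pipeline, and `max-parallel` limits how many matrix nodes run at once.
name: ci-sketch

on: [push, pull_request]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: pip install tox && tox -e lint

  tests:
    needs: lint            # only start once lint has passed
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      max-parallel: 1      # run one matrix node at a time
      matrix:
        image: [debian-10, fedora-35]
    steps:
      - uses: actions/checkout@v3
      # Placeholder: the real job would pick a container image from
      # matrix.image and run the corresponding test suite.
      - run: pip install tox && tox
```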
juergbi | our own runners could theoretically be an option: https://cwiki.apache.org/confluence/display/INFRA/GitHub+self-hosted+runners | 19:20 |
juergbi | but we'd need someone to sponsor and maintain | 19:20 |
nanonyme | Indeed | 19:20 |
juergbi | a single beefy machine could actually be really useful and not that expensive but still | 19:21 |
nanonyme | At that point it's a bit questionable how much benefit you're getting out of being an Apache project anymore but anyway | 19:21 |
juergbi | It makes a lot of sense for a project to be under a legal entity | 19:26 |
juergbi | many other ASF projects apparently use GitHub runners without issues right now | 19:31 |
juergbi | wondering whether we changed anything that has triggered this issue | 19:31 |
juergbi | trying action with a revert https://github.com/apache/buildstream/actions/runs/1878044611 | 19:46 |
nanonyme | juergbi: https://github.com/apache/buildstream/runs/5279286321?check_suite_focus=true already shows signs of hang | 19:55 |
nanonyme | Even at ten minutes | 19:55 |
nanonyme | With well-behaved builds you can get console output by opening the job; in this bad case you can't | 19:56 |
nanonyme | juergbi: I accidentally ran a PR with changes from last March or so and it failed the same way. There have been PRs since then that have not failed. | 19:57 |
nanonyme | It might be repository configuration issue though | 19:57 |
juergbi | that's interesting | 19:57 |
juergbi | in the summary I see two kinds of errors | 19:58 |
juergbi | * An error occurred while provisioning resources (Error Type: Disconnect). | 19:58 |
juergbi | * The hosted runner: GitHub Actions 51 lost communication with the server. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error. | 19:58 |
juergbi | At least the first one might be completely independent of the actual job | 19:58 |
nanonyme | juergbi: right, I originally suspected it might have something to do with over-use of resources, hence in my MR I added needs so only the matrix build runs, and then set max-parallel to 1 so the matrix runs one node at a time | 20:00 |
nanonyme | But it still failed :D | 20:00 |
juergbi | nanonyme: ever saw issues on bst-1? | 20:01 |
*** coldtom <coldtom!coldtom@2a00:23c7:5e9a:5301:0:0:0:1b8> has quit IRC | 20:01 | |
nanonyme | juergbi: no but bst-1 tests also finish very fast | 20:01 |
nanonyme | juergbi: eg https://github.com/apache/buildstream/actions/runs/1856800625 | 20:01 |
juergbi | right but if there is a runner provisioning error, I'd expect bst-1 to occasionally be affected as well | 20:02 |
juergbi | the lower job count may help, of course | 20:02 |
nanonyme | Yes, that's why I lowered master to the same as bst-1, but master still fails and bst-1 never fails | 20:02 |
juergbi | right, that's really odd | 20:03 |
juergbi | unless there are some kind of long-term resource limits per target branch | 20:03 |
juergbi | well, my test action run doesn't even have a PR | 20:03 |
nanonyme | I just hope they haven't done some master-to-main configuration change that breaks CI in master | 20:04 |
juergbi | after pretty much exactly 30 min 6 jobs fail at the same time | 20:16 |
juergbi | two with: An error occurred while provisioning resources (Error Type: Disconnect). | 20:16 |
juergbi | four with: Received request to deprovision: The request was cancelled by the remote provider. | 20:16 |
nanonyme | juergbi: is it really failing at 30 minutes? The jobs stop emitting output far earlier | 20:17 |
juergbi | the error message/status appeared after 30 min | 20:17 |
juergbi | I suspect it's a provisioning timeout | 20:17 |
nanonyme | Hmm, but what about my PR? It was clearly running only one thing, and that got as far as 58% of tests done before it hung and failed, again at around 30 minutes | 20:18 |
nanonyme | juergbi: can you create a copy of master with some other name so we can do a test pipeline there? | 20:19 |
juergbi | that errored out with what error message? | 20:19 |
juergbi | the test run I did was not a PR, so there is no direct connection to master, just my branch | 20:20 |
nanonyme | No error, output just stopped | 20:20 |
juergbi | nanonyme: can you link to the one you're referring to? | 20:20 |
nanonyme | juergbi: the last thing I saw in the logs tends to be around tests/integration/artifact.py::test_cache_buildtrees PASSED [ 58%], then it just stopped emitting anything for ten minutes and failed at thirty minutes | 20:21 |
juergbi | ok but what's the corresponding error message in the action summary? | 20:22 |
nanonyme | I don't know, where can you see it? https://github.com/apache/buildstream/runs/5265739784?check_suite_focus=true | 20:22 |
nanonyme | Oh, there | 20:23 |
juergbi | Received request to deprovision: The request was cancelled by the remote provider. | 20:23 |
nanonyme | But why do they get cancelled? Are they protesting against the branch name master or what? :D | 20:23 |
juergbi | this means that this error happens after it was successfully provisioned | 20:24 |
nanonyme | Right | 20:24 |
nanonyme | https://github.com/apache/buildstream/actions/runs/1873212020 also has exotic output | 20:24 |
juergbi | it would be interesting to aggressively cut down tests | 20:25 |
juergbi | optimizing CI job time would be a good thing anyway, I suppose | 20:26 |
nanonyme | juergbi: ah yes, exactly. So those ran sequentially. Looks like the runner somehow died while executing the Debian 10 tests, and the Fedora ones failed because the runner was down | 20:26 |
nanonyme | It's weird that that would happen though. Doesn't Apache have more than one runner? | 20:28 |
nanonyme | juergbi: next iteration: I'll re-order matrix so Fedora 35 is run first, not Debian 10 https://github.com/apache/buildstream/runs/5279821414?check_suite_focus=true | 20:45 |
juergbi | maybe we could run the most important tests first in jobs that hopefully don't take so long | 20:47 |
juergbi | and later tests would be for additional distro coverage, external plugins etc. | 20:47 |
nanonyme | Like smoke tests. Sure. | 20:47 |
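A rough sketch of that ordering under the same assumptions as above; the smoke job's test subset and the coverage matrix entries are guesses for illustration, not the project's real split:

```yaml
# Hypothetical sketch: a fast smoke job gives quick feedback, and the broader
# distro / external-plugin coverage only starts after it passes.
name: staged-ci-sketch

on: [push, pull_request]

jobs:
  smoke:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      # Assumed fast subset; depends on the tox env forwarding posargs to pytest.
      - run: pip install tox && tox -- tests/frontend

  coverage:
    needs: smoke           # full coverage only runs once the smoke job is green
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        suite: [debian-10, fedora-35, external-plugins]
    steps:
      - uses: actions/checkout@v3
      # Placeholder for the full integration and plugin suites per matrix.suite.
      - run: pip install tox && tox
```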
juergbi | we need to speed up the tox installation part such that we can utilize the working CI time for actual tests | 20:48 |
nanonyme | It would help some if we built and cached wheels but I'm not sure if that really works | 20:49 |
nanonyme | Like grpcio wheel could be cached to avoid recompilation | 20:49 |
nanonyme | Still, regardless, there's something dramatically wrong with the infra | 20:50 |
nanonyme | juergbi: in general we're invoking setup.py quite a lot of times for deps and that's slow. Wheels are always preferable; they can just be unpacked. | 20:51 |
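One possible way to cut the installdeps time, sketched under the assumption that jobs can reuse pip's cache directory between runs; the cache key and path are illustrative:

```yaml
# Hypothetical sketch: persist pip's cache (which includes locally built
# wheels such as grpcio) so source builds only happen when requirements change.
name: cached-deps-sketch

on: [push, pull_request]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/cache@v3
        with:
          path: ~/.cache/pip
          key: pip-${{ runner.os }}-${{ hashFiles('requirements/*.txt') }}
      - run: |
          pip install -r requirements/requirements.txt \
                      -r requirements/dev-requirements.txt
```

If the tests actually run inside container images rather than directly on the runner, the cached path would have to match the directory pip uses inside the container for this to help.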
nanonyme | juergbi: what? Why do we have a hardcoded 30-minute timeout with tox? | 20:53 |
juergbi | oh, indeed, I forgot about that | 20:54 |
nanonyme | Well, that does definitely explain why all runs stop at 30 minutes | 20:54 |
juergbi | ah but that's per test | 20:54 |
juergbi | not per job | 20:54 |
nanonyme | Are you sure? | 20:54 |
juergbi | and would result in a pytest error, not a GitHub runner error | 20:54 |
nanonyme | Well, pytest's timeout method is raising a signal | 20:55 |
nanonyme | Maybe that signal is breaking everything | 20:55 |
nanonyme | Erm, tox timeout method even | 20:55 |
juergbi | it's pytest, not tox, right? | 20:55 |
juergbi | in setup.cfg | 20:56 |
nanonyme | Yeah. It just outputs | 20:56 |
nanonyme | Mon, 21 Feb 2022 20:51:46 GMT timeout: 1800.0s | 20:56 |
nanonyme | Mon, 21 Feb 2022 20:51:46 GMT timeout method: signal | 20:56 |
juergbi | yes, that's the pytest-timeout plugin config | 20:56 |
juergbi | I don't think that's related | 20:57 |
nanonyme | We'll see soon. If this iteration fails, I'll remove timeout for next one. | 21:01 |
nanonyme | juergbi: hmm, F35 running as first one, *again* output completely stopped at tests/integration/artifact.py::test_cache_buildtrees PASSED [ 58%] | 21:10 |
juergbi | I think that's the slowest test | 21:11 |
juergbi | took 33.6s locally | 21:12 |
nanonyme | yeah but it said passed | 21:12 |
nanonyme | So it's clearly hanging after that | 21:12 |
nanonyme | juergbi: test_preserve_environment is the first test that didn't run | 21:13 |
nanonyme | juergbi: hmm, where are we actually writing temp data during build with tmpdir? It's not in /tmp, right? | 21:15 |
nanonyme | Just wondering, what if we're writing to RAM when we think we're writing to disk | 21:16 |
nanonyme | Ah, that cannot be the case. We use --basetemp ./tmp | 21:17 |
juergbi | tox/pytest create tmp directories, iirc | 21:17 |
nanonyme | So we write under that | 21:17 |
nanonyme | juergbi: well, yes, but it matters where we tell it to write them. I suppose ./tmp is fine though; that should actually be on disk | 21:18 |