*** coldtom <coldtom!coldtom@2a00:23c7:5e9a:5301:0:0:0:1b8> has joined #buildstream | 08:04 | |
*** tpollard <tpollard!tompollard@cpc109023-salf6-2-0-cust777.10-2.cable.virginm.net> has joined #buildstream | 09:36 | |
*** tpollard <tpollard!tompollard@cpc109023-salf6-2-0-cust777.10-2.cable.virginm.net> has quit IRC | 17:30 | |
*** coldtom <coldtom!coldtom@2a00:23c7:5e9a:5301:0:0:0:1b8> has quit IRC | 17:33 | |
nanonyme | juergbi: I really wonder at this point if using regular GitHub public runners would work more reliably than the Apache runners | 18:58 |
juergbi | nanonyme: I don't know but I don't think we could even choose those | 18:59 |
nanonyme | Yeah, I guess you would have to stop being an Apache project to do so | 18:59 |
juergbi | in that case we would have stayed on gitlab | 19:00 |
nanonyme | Maybe it makes sense to move back if the situation doesn't improve? The project is at the point where project members merge PRs that haven't gone through CI because the build infrastructure is so broken | 19:01 |
nanonyme | Having CI that gets routinely ignored is pointless | 19:02 |
juergbi | yes, we definitely need to solve this | 19:02 |
juergbi | nanonyme: are there ever significant issues for jobs that don't exceed 30min or 'only' for longer jobs? | 19:03 |
nanonyme | Is there any way to escalate to Apache? | 19:03 |
juergbi | yes, we can definitely talk to the infrastructure team. I don't know whether tristan already has | 19:04 |
nanonyme | juergbi: I don't know, bst master overall is super-slow. Even linter takes almost ten minutes to run (it takes a minute on bst-1) | 19:04 |
juergbi | this line takes 7 min: lint installdeps: -rrequirements/requirements.txt, -rrequirements/dev-requirements.txt | 19:05 |
juergbi | so I suspect the issue there is that it's compiling grpcio or some other Python module | 19:05 |
nanonyme | Hmm, why would it build grpcio in the first place anyway? Shouldn't it be using pre-built wheels? Those are what bst consumers will most likely be using anyway, unless they use distro packages for bst | 19:06 |
juergbi | splitting up the tests might help (non-integration vs. integration and also separate [external] plugin tests) | 19:07 |
nanonyme | It did help a bit to make everything sequential through https://github.com/apache/buildstream/pull/1593 | 19:07 |
juergbi | I don't know. it's just my suspicion because installing from wheels shouldn't take that long | 19:07 |
nanonyme | But not much, everything still hangs even if we run tests one by one | 19:07 |
nanonyme | juergbi: and https://github.com/apache/buildstream/runs/5267010220?check_suite_focus=true was still almost ten minutes even with guarantee that nothing else was running at the same time | 19:09 |
nanonyme | It would be nice if that had timestamps | 19:11 |
juergbi | nanonyme: press Shift-T with the log open | 19:14 |
nanonyme | juergbi: but yeah, the infra team really needs to be consulted. It looks like builds are somehow being terminated, probably from the runner side, so the GitHub workflow ends up in a hung state. | 19:14 |
juergbi | nanonyme: I've pinged someone from the infra team. let's see whether he can provide some insight | 19:15 |
*** coldtom <coldtom!coldtom@2a00:23c7:5e9a:5301:0:0:0:1b8> has joined #buildstream | 19:16 | |
nanonyme | Great, thanks | 19:17 |
nanonyme | juergbi: but yeah, lessons learned: concurrency can be managed efficiently by using needs between jobs to construct a pipeline and max-parallel with a matrix. That may make individual jobs pass faster. | 19:19 |
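A minimal sketch of that layout, assuming tox environments like the ones discussed above; the job names and matrix entries are illustrative rather than BuildStream's actual workflow config:

```yaml
# Hypothetical GitHub Actions sketch: `needs` turns independent jobs into a
# pipeline, and `max-parallel` limits how many matrix nodes run at once.
name: ci-sketch

on: [push, pull_request]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: pip install tox && tox -e lint

  tests:
    needs: lint            # only start once lint has passed
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      max-parallel: 1      # run one matrix node at a time
      matrix:
        image: [debian-10, fedora-35]
    steps:
      - uses: actions/checkout@v3
      # Placeholder: the real job would pick a container image from
      # matrix.image and run the corresponding test suite.
      - run: pip install tox && tox
```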
juergbi | our own runners could theoretically be an option: https://cwiki.apache.org/confluence/display/INFRA/GitHub+self-hosted+runners | 19:20 |
juergbi | but we'd need someone to sponsor and maintain | 19:20 |
nanonyme | Indeed | 19:20 |
juergbi | a single beefy machine could actually be really useful and not that expensive but still | 19:21 |
nanonyme | At that point it's a bit questionable how much benefit you're getting out of being an Apache project anymore but anyway | 19:21 |
juergbi | It makes a lot of sense for a project to be under a legal entity | 19:26 |
juergbi | many other ASF projects apparently use GitHub runners without issues right now | 19:31 |
juergbi | wondering whether we changed anything that has triggered this issue | 19:31 |
juergbi | trying action with a revert https://github.com/apache/buildstream/actions/runs/1878044611 | 19:46 |
nanonyme | juergbi: https://github.com/apache/buildstream/runs/5279286321?check_suite_focus=true already shows signs of hang | 19:55 |
nanonyme | Even at ten minutes | 19:55 |
nanonyme | With well-behaved builds you can get console output by opening the job; in this bad case you can't | 19:56 |
nanonyme | juergbi: I accidentally ran a PR with changes from last March or so and it failed the same way. There have been PRs since then that have not failed. | 19:57 |
nanonyme | It might be repository configuration issue though | 19:57 |
juergbi | that's interesting | 19:57 |
juergbi | in the summary I see two kinds of errors | 19:58 |
juergbi | * An error occurred while provisioning resources (Error Type: Disconnect). | 19:58 |
juergbi | * The hosted runner: GitHub Actions 51 lost communication with the server. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error. | 19:58 |
juergbi | At least the first one might be completely independent of the actual job | 19:58 |
nanonyme | juergbi: right, I originally suspected it might have something to do with over-use of resources, hence in my MR I added needs so only the matrix build runs, and then set max-parallel to 1 so the matrix runs one node at a time | 20:00 |
nanonyme | But it still failed :D | 20:00 |
juergbi | nanonyme: ever saw issues on bst-1? | 20:01 |
*** coldtom <coldtom!coldtom@2a00:23c7:5e9a:5301:0:0:0:1b8> has quit IRC | 20:01 | |
nanonyme | juergbi: no but bst-1 tests also finish very fast | 20:01 |
nanonyme | juergbi: eg https://github.com/apache/buildstream/actions/runs/1856800625 | 20:01 |
juergbi | right but if there is a runner provisioning error, I'd expect bst-1 to occasionally be affected as well | 20:02 |
juergbi | the lower job count may help, of course | 20:02 |
nanonyme | Yes, that's why I lowered master to the same as bst-1, but master still fails and bst-1 never fails | 20:02 |
juergbi | right, that's really odd | 20:03 |
juergbi | unless there are some kind of long-term resource limits per target branch | 20:03 |
juergbi | well, my test action run doesn't even have a PR | 20:03 |
nanonyme | I just hope they haven't done some master-to-main configuration change that breaks CI in master | 20:04 |
juergbi | after pretty much exactly 30 min 6 jobs fail at the same time | 20:16 |
juergbi | two with: An error occurred while provisioning resources (Error Type: Disconnect). | 20:16 |
juergbi | four with: Received request to deprovision: The request was cancelled by the remote provider. | 20:16 |
nanonyme | juergbi: is it really failing at 30 minutes? The jobs stop emitting output far earlier | 20:17 |
juergbi | the error message/status appeared after 30 min | 20:17 |
juergbi | I suspect it's a provisioning timeout | 20:17 |
nanonyme | Hmm, but what about my PR? It was clearly running only one thing, and that got as far as 58% of tests done before it hung and failed, again at around 30 minutes | 20:18 |
nanonyme | juergbi: can you create a copy of master with some other name so we can do a test pipeline there? | 20:19 |
juergbi | that errored out with what error message? | 20:19 |
juergbi | the test run I did was not a PR, so there is no direct connection to master, just my branch | 20:20 |
nanonyme | No error, output just stopped | 20:20 |
juergbi | nanonyme: can you link to the one you're referring to? | 20:20 |
nanonyme | juergbi: the last thing I saw in the logs tends to be around tests/integration/artifact.py::test_cache_buildtrees PASSED [ 58%], then it just stopped emitting anything for ten minutes and failed at thirty minutes | 20:21 |
juergbi | ok but what's the corresponding error message in the action summary? | 20:22 |
nanonyme | I don't know, where can you see it? https://github.com/apache/buildstream/runs/5265739784?check_suite_focus=true | 20:22 |
nanonyme | Oh, there | 20:23 |
juergbi | Received request to deprovision: The request was cancelled by the remote provider. | 20:23 |
nanonyme | But why do they get cancelled? Are they protesting against the branch name master or what? :D | 20:23 |
juergbi | this means that this error happens after it was successfully provisioned | 20:24 |
nanonyme | Right | 20:24 |
nanonyme | https://github.com/apache/buildstream/actions/runs/1873212020 also has exotic output | 20:24 |
juergbi | it would be interesting to aggressively cut down tests | 20:25 |
juergbi | optimizing CI job time would be a good thing anyway, I suppose | 20:26 |
nanonyme | juergbi: ah yes, exactly. So those ran sequentially. Looks like the runner somehow died while executing the Debian 10 tests, and the Fedora ones failed because the runner was down | 20:26 |
nanonyme | It's weird that that would happen though. Doesn't Apache have more than one runner? | 20:28 |
nanonyme | juergbi: next iteration: I'll re-order matrix so Fedora 35 is run first, not Debian 10 https://github.com/apache/buildstream/runs/5279821414?check_suite_focus=true | 20:45 |
juergbi | maybe we could run the most important tests first in jobs that hopefully don't take so long | 20:47 |
juergbi | and later tests would be for additional distro coverage, external plugins etc. | 20:47 |
nanonyme | Like smoke tests. Sure. | 20:47 |
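A rough sketch of that ordering under the same assumptions as above; the smoke job's test subset and the coverage matrix entries are guesses for illustration, not the project's real split:

```yaml
# Hypothetical sketch: a fast smoke job gives quick feedback, and the broader
# distro / external-plugin coverage only starts after it passes.
name: staged-ci-sketch

on: [push, pull_request]

jobs:
  smoke:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      # Assumed fast subset; depends on the tox env forwarding posargs to pytest.
      - run: pip install tox && tox -- tests/frontend

  coverage:
    needs: smoke           # full coverage only runs once the smoke job is green
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        suite: [debian-10, fedora-35, external-plugins]
    steps:
      - uses: actions/checkout@v3
      # Placeholder for the full integration and plugin suites per matrix.suite.
      - run: pip install tox && tox
```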
juergbi | we need to speed up the tox installation part such that we can utilize the working CI time for actual tests | 20:48 |
nanonyme | It would help some if we built and cached wheels but I'm not sure if that really works | 20:49 |
nanonyme | Like grpcio wheel could be cached to avoid recompilation | 20:49 |
nanonyme | Still, regardless, there's something dramatically wrong with the infra | 20:50 |
nanonyme | juergbi: in general we're invoking setup.py quite a lot of times for deps and that's slow. Wheels are always preferable; they can just be unpacked. | 20:51 |
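One possible way to cut the installdeps time, sketched under the assumption that jobs can reuse pip's cache directory between runs; the cache key and path are illustrative:

```yaml
# Hypothetical sketch: persist pip's cache (which includes locally built
# wheels such as grpcio) so source builds only happen when requirements change.
name: cached-deps-sketch

on: [push, pull_request]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/cache@v3
        with:
          path: ~/.cache/pip
          key: pip-${{ runner.os }}-${{ hashFiles('requirements/*.txt') }}
      - run: |
          pip install -r requirements/requirements.txt \
                      -r requirements/dev-requirements.txt
```

If the tests actually run inside container images rather than directly on the runner, the cached path would have to match the directory pip uses inside the container for this to help.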
nanonyme | juergbi: what? Why do we have a hardcoded 30-minute timeout with tox? | 20:53 |
juergbi | oh, indeed, I forgot about that | 20:54 |
nanonyme | Well, that does definitely explain why all runs stop at 30 minutes | 20:54 |
juergbi | ah but that's per test | 20:54 |
juergbi | not per job | 20:54 |
nanonyme | Are you sure? | 20:54 |
juergbi | and would result in a pytest error, not a GitHub runner error | 20:54 |
nanonyme | Well, pytest's timeout method is raising a signal | 20:55 |
nanonyme | Maybe that signal is breaking everything | 20:55 |
nanonyme | Erm, tox timeout method even | 20:55 |
juergbi | it's pytest, not tox, right? | 20:55 |
juergbi | in setup.cfg | 20:56 |
nanonyme | Yeah. It just outputs | 20:56 |
nanonyme | Mon, 21 Feb 2022 20:51:46 GMT timeout: 1800.0s | 20:56 |
nanonyme | Mon, 21 Feb 2022 20:51:46 GMT timeout method: signal | 20:56 |
juergbi | yes, that's the pytest-timeout plugin config | 20:56 |
juergbi | I don't think that's related | 20:57 |
nanonyme | We'll see soon. If this iteration fails, I'll remove timeout for next one. | 21:01 |
nanonyme | juergbi: hmm, F35 running as first one, *again* output completely stopped at tests/integration/artifact.py::test_cache_buildtrees PASSED [ 58%] | 21:10 |
juergbi | I think that's the slowest test | 21:11 |
juergbi | took 33.6s locally | 21:12 |
nanonyme | yeah but it said passed | 21:12 |
nanonyme | So it's clearly hanging after that | 21:12 |
nanonyme | juergbi: test_preserve_environment is the first test that didn't run | 21:13 |
nanonyme | juergbi: hmm, where are we actually writing temp data during build with tmpdir? It's not in /tmp, right? | 21:15 |
nanonyme | Just wondering, what if we're writing to RAM when we think we're writing to disk | 21:16 |
nanonyme | Ah, that cannot be the case. We use --basetemp ./tmp | 21:17 |
juergbi | tox/pytest create tmp directories, iirc | 21:17 |
nanonyme | So we write under that | 21:17 |
nanonyme | juergbi: well, yes, but it matters where we tell it to write them. I suppose ./tmp is fine though; that should actually be on disk | 21:18 |