*** cs-shadow has quit IRC | 00:23 | |
*** aminbegood has joined #buildstream | 03:20 | |
*** tristan has quit IRC | 03:44 | |
*** aminbegood has quit IRC | 03:44 | |
*** aminbegood has joined #buildstream | 05:16 | |
*** aminbegood has quit IRC | 05:20 | |
*** aminbegood has joined #buildstream | 05:21 | |
*** traveltissues has quit IRC | 05:37 | |
*** traveltissues has joined #buildstream | 05:37 | |
*** aminbegood has quit IRC | 06:30 | |
juergbi | benschubert: afaict, the issue is that `config.clone()` in `Element.__extract_config()` doesn't actually clone `ScalarNode` (its clone implementation is a no-op) | 06:46 |
juergbi | this means that when the first Element of a plugin is instantiated, the default config of that plugin is permanently expanded with the variables of that first Element | 06:49 |
juergbi | as `Variables.expand()` replaces `value` of the original `ScalarNode` | 06:50 |
juergbi | so in the context of freedesktop-sdk/glibc, we see the issue depending on whether the first Element has the `builddir` variable set to empty string or not | 06:50 |
juergbi | a proper implementation of `ScalarNode.clone()` fixes the issue, however, this might have a performance penalty | 06:51 |
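A minimal sketch of the fix being discussed, in plain Python rather than the real Cython node.pyx; the constructor and attribute names (value, file_index, line, column) are assumptions drawn from this conversation, not the actual BuildStream API:

```python
# Hedged sketch only: a ScalarNode.clone() that actually copies the node, so
# Variables.expand() can no longer mutate a plugin's shared default config
# through a shared ScalarNode instance.
class Node:
    def __init__(self, file_index, line, column):
        self.file_index = file_index
        self.line = line
        self.column = column

class ScalarNode(Node):
    def __init__(self, value, file_index, line, column):
        super().__init__(file_index, line, column)
        self.value = value

    def clone(self):
        # Previously effectively a no-op (returning the same instance), which
        # is what let the first Element's expanded variables leak into the
        # plugin's default config.
        return ScalarNode(self.value, self.file_index, self.line, self.column)
```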
juergbi | https://paste.gnome.org/pdjbzxzwt | 06:52 |
*** slaf has quit IRC | 06:58 | |
*** narispo has quit IRC | 07:12 | |
*** narispo has joined #buildstream | 07:12 | |
*** pointswaves has joined #buildstream | 07:14 | |
*** seanborg_ has joined #buildstream | 07:20 | |
benschubert | Oh good catch. Yeah I'll make a patch and benchmark then (unless you're on it?) Thanks juergbi | 07:20 |
juergbi | no, please go ahead | 07:20 |
juergbi | we should also add a test case. shouldn't be too difficult with two elements of the same kind but different variables | 07:21 |
benschubert | yep definitely | 07:22 |
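A rough sketch of such a test, assuming the buildstream.testing `cli` fixture and a hypothetical project containing first.bst and second.bst of the same kind, where only first.bst overrides a variable; the element names, project layout and checked value are all illustrative:

```python
# Hedged test sketch: two elements of the same kind with different variables.
# If ScalarNode.clone() is a no-op, instantiating the first element leaks its
# expanded variables into the plugin's default config and thus into the second.
import os

import pytest

from buildstream.testing import cli  # noqa: F401  (pytest fixture)

DATA_DIR = os.path.join(os.path.dirname(os.path.realpath(__file__)), "project")

@pytest.mark.datafiles(DATA_DIR)
def test_default_config_not_shared(cli, datafiles):
    project = str(datafiles)
    result = cli.run(project=project, args=[
        "show", "--format", "%{name}: %{vars}", "first.bst", "second.bst",
    ])
    result.assert_success()
    # The value overridden in first.bst (hypothetically "builddir-from-first")
    # must show up for first.bst only, not leak into second.bst.
    assert result.output.count("builddir-from-first") == 1
```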
juergbi | WSalmon, pointswaves: the above should fix the overlap issue | 07:24 |
juergbi | (may need full rebuild) | 07:24 |
WSalmon | Thanks juergbi | 07:35 |
WSalmon | juergbi, is that on a branch somewhere? Does it need a load of performance testing or further work before merging? It's a lot easier to use it in my CI if it's a branch, so can I make one if you don't have one? | 07:39 |
juergbi | not on a branch yet, benschubert is taking care of it | 07:39 |
juergbi | feel free to create a temporary branch yourself if you need it immediately | 07:40 |
*** pointswaves has quit IRC | 07:40 | |
juergbi | I hope we can merge it to master today, tomorrow at the latest | 07:40 |
*** WSalmon has left #buildstream | 07:40 | |
*** WSalmon has joined #buildstream | 07:41 | |
benschubert | WSalmon: !1929 if you want it in advance | 07:41 |
WSalmon | thanks benschubert | 07:42 |
* WSalmon -> looks | 07:42 | |
benschubert | juergbi: :/ seems to take a noticeable hit on `bst show` | 08:04 |
juergbi | :-/ | 08:05 |
juergbi | benschubert: a possible alternative would be to create a new ScalarNode only on expand, instead of setting .value of the existing node | 08:05 |
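Roughly, that alternative would mean expansion producing a fresh node rather than assigning in place; the substitution helper below is a hypothetical stand-in, not the real Variables API:

```python
# Hedged sketch of the alternative: Variables expansion returns a brand-new
# ScalarNode carrying the expanded string, leaving the original node (and any
# shared default config that references it) untouched.  `_subst()` and the
# ScalarNode constructor used here are assumptions.
def _expand_scalar(self, node):
    expanded = self._subst(node.value)  # hypothetical variable substitution
    return ScalarNode(expanded, node.file_index, node.line, node.column)
```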
benschubert | yeah I'll finish the complete benchmark then try this solution | 08:06 |
juergbi | however, ScalarNode.clone() would still be misleading, unless we can make `value` immutable somehow | 08:06 |
benschubert | but then it becomes murky depending how you call it | 08:06 |
benschubert | from python it is immutable | 08:06 |
benschubert | but I was accessing it from C | 08:06 |
juergbi | ah ok, so maybe a big comment that we must never set value in the .pyx | 08:08 |
juergbi | benschubert: murky because it is no longer fully in-place? | 08:08 |
benschubert | yep | 08:08 |
juergbi | maybe it would be better to change it to create a whole new tree | 08:08 |
*** jude has joined #buildstream | 08:09 | |
benschubert | _or_ should we have the composite do a full clone? | 08:09 |
juergbi | might not be a big performance difference if we have to recreate the many ScalarNodes anyway | 08:09 |
benschubert | that way we only do this for scalars | 08:09 |
juergbi | not sure what you mean. two clone variants, the one used by composite would also be cloning scalarnodes? | 08:10 |
benschubert | yeah | 08:11 |
benschubert | since the problem only occurs when we share nodes | 08:11 |
juergbi | seems fairly confusing to me | 08:11 |
juergbi | I think either ScalarNode should be immutable and then clone can stay as it is | 08:11 |
juergbi | or it is mutable and clone() always has to actually clone it | 08:11 |
juergbi | a mix seems prone to bugs | 08:11 |
benschubert | ok, then we need to take the hit I guess, unless we force `expand` to only accept mappings (in which case we can swap the Scalars without problem) | 08:12 |
benschubert | but it's again more risky | 08:12 |
benschubert | oh well, guess we'll need to take the hit | 08:12 |
juergbi | benschubert: do you think recreating the whole tree is even that much more expensive than only recreating ScalarNodes? | 08:13 |
juergbi | I'd expect the number of ScalarNodes to be higher | 08:13 |
juergbi | but I suppose adding those ScalarNodes to their parent is also not free... | 08:13 |
benschubert | on base-files, it's 7MB more, and 0.15s more | 08:14 |
benschubert | which is not _that_ much, but it's a very barebone element definition | 08:14 |
juergbi | do you have another project with more variables as a better benchmark? | 08:15 |
benschubert | I can try fsdk, but that's a small project again | 08:16 |
juergbi | bst show on freedesktop-sdk might be interesting | 08:16 |
juergbi | yes, but relative timing might still be interesting | 08:16 |
benschubert | and internally, it would take me too much time to update | 08:16 |
juergbi | assuming there isn't too much noise | 08:16 |
benschubert | sure I can try, but the noise is mainly waiting for casd to get up :'D | 08:16 |
juergbi | hehe | 08:16 |
juergbi | well, we actually do this in the background | 08:16 |
juergbi | and casd master now uses a disk usage file | 08:16 |
juergbi | or with an empty cache it's fairly fast anyway | 08:17 |
benschubert | do you know of a good target element to show on fsdk? | 08:17 |
juergbi | (well, grpc init is actually not as fast as I'd like) | 08:17 |
juergbi | hm, sdk-image.bst is really way too fast | 08:18 |
juergbi | create a helper project with 100 junctions to fdo-sdk ;) | 08:19 |
benschubert | x) | 08:19 |
*** slaf has joined #buildstream | 08:21 | |
benschubert | which will again be testing casd, as we'll be ingesting the junction | 08:21 |
juergbi | not in the second invocation with the junction cache | 08:23 |
*** seanborg_ has quit IRC | 08:23 | |
juergbi | assuming we don't block on anything else casd-related | 08:23 |
benschubert | right | 08:23 |
*** seanborg_ has joined #buildstream | 08:23 | |
*** tpollard has joined #buildstream | 08:29 | |
*** santi has joined #buildstream | 08:42 | |
*** seanborg_ has quit IRC | 08:48 | |
*** seanborg_ has joined #buildstream | 08:48 | |
*** mohan43u has quit IRC | 08:52 | |
*** mohan43u has joined #buildstream | 08:53 | |
*** aminbegood has joined #buildstream | 10:47 | |
*** aminbegood has quit IRC | 11:07 | |
*** tristan has joined #buildstream | 11:35 | |
*** seanborg_ has quit IRC | 12:05 | |
juergbi | tristan: shall we get rid of _legacy_command_steps in buildelement.py? | 12:05 |
*** seanborg_ has joined #buildstream | 12:06 | |
juergbi | I think we should also tweak get_unique_key() to skip empty command lists | 12:06 |
*** ChanServ sets mode: +o tristan | 12:06 | |
tristan | benschubert, is it possible that this new behavior of infinite loops when jobs terminate themselves because of a BUG exception (where the Job has a None `process`), i.e. https://gitlab.com/BuildStream/buildstream/-/issues/1309#note_342481685 ... is due to the cleanup of how we exit BuildStream from some weeks ago? | 12:06 |
juergbi | that way we could in the future add command lists without affecting the cache key | 12:06 |
tristan | juergbi, Yes I think we should just drop those, indeed; need to be able to add things without affecting cache key | 12:07 |
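In other words, something along these lines for BuildElement.get_unique_key(); `self.__commands` as a dict mapping command step name to command list is an assumed attribute name:

```python
# Hedged sketch: drop empty command lists from the unique key so that adding
# a new, empty-by-default command step later does not invalidate existing
# cache keys.
def get_unique_key(self):
    return {
        command_name: commands
        for command_name, commands in self.__commands.items()
        if commands  # empty lists contribute nothing, so leave them out
    }
```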
tristan | hmmm, I thought that was a benschubert MR but I can't find it in closed MRs | 12:08 |
* tristan should reproduce this again and get to the bottom of it | 12:09 | |
tristan | I wonder if we can test for it, but I doubt it | 12:09 |
tristan | looks like tpollard's https://gitlab.com/BuildStream/buildstream/-/merge_requests/1788 might be related to this, though | 12:11 |
juergbi | tristan: can't we have a test plugin raise a bad exception triggering a BUG? | 12:11 |
tristan | juergbi, I don't think so, I mean we do have some of those already | 12:12 |
juergbi | and/or even SIGKILL itself inside assemble(), if that may lead to different behavior | 12:12 |
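For illustration, such a test plugin could be as small as the sketch below, written against the public Element plugin interface; this is not an existing plugin from the test suite, and the details are assumptions:

```python
# Hedged sketch of a test element whose assemble() raises a plain exception,
# which BuildStream treats as a BUG (unhandled plugin error) rather than a
# normal plugin failure.
from buildstream import Element

class BuggyElement(Element):
    def configure(self, node):
        pass

    def preflight(self):
        pass

    def get_unique_key(self):
        return {}

    def configure_sandbox(self, sandbox):
        pass

    def stage(self, sandbox):
        pass

    def assemble(self, sandbox):
        # A non-BstError exception escaping a plugin is reported as a bug.
        raise RuntimeError("deliberately broken plugin")

# Entry point expected by the plugin loader
def setup():
    return BuggyElement
```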
tristan | I suspect that we now have a larger surface area that is unaffected by plugins which can cause this cycle to happen | 12:12 |
tristan | Once you're inside assemble(), Job certainly has a .process assigned | 12:13 |
tristan | This is happening while a Job exists, but process is None | 12:13 |
* tristan recalls we used to have only a couple such areas with really only a few lines of code, in Job and Queue, where we needed to be very careful to not have bugs | 12:14 | |
tristan | When Jobs exit and just before they start | 12:14 |
juergbi | tristan: hm, but casd `process` should actually be assigned in the main process early on, long before we fork | 12:15 |
tristan | juergbi, I don't see how this is related to casd process | 12:15 |
tristan | I mean, Job processes | 12:15 |
tristan | each one is individually forked | 12:15 |
juergbi | ah, sorry, got confused with CASDProcessManager `process` variable | 12:16 |
juergbi | not related | 12:16 |
tristan | For some reason, Job.terminate() is being called with a None process | 12:16 |
tristan | and that raises an exception, which results eventually in... you guessed it: calling Job.terminate() | 12:16 |
*** santi has quit IRC | 12:18 | |
*** santi has joined #buildstream | 12:18 | |
tristan | Aside from https://gitlab.com/BuildStream/buildstream/-/issues/1309, I can reproduce this in my active junction branch by reverting the part which makes element names get printed in full | 12:19 |
tristan | So I can debug from there... | 12:19 |
juergbi | tristan: maybe the solution is to call job.start() before self._active_jobs.append(job) in Scheduler._start_job(), fixing the race condition | 12:19 |
juergbi | and to make it easy to debug/hit, add a sleep between the two | 12:19 |
tristan | I'm hoping we can reduce the LOC where this could possibly happen | 12:20 |
juergbi | if we switch the order, I'd expect it to be completely fixed | 12:20 |
tristan | Maybe starting it earlier could help | 12:20 |
juergbi | well | 12:20 |
juergbi | it may introduce the inverse | 12:20 |
juergbi | job being started but not killed | 12:20 |
tristan | I'm sure that with job control, there's always going to be a critical window where we just have to be very careful | 12:21 |
tristan | We just need to make sure it's a very small window | 12:21 |
juergbi | job control SIGINT is handled asynchronously, should be possible race-free | 12:22 |
*** santi has quit IRC | 12:22 | |
juergbi | as long as the main process doesn't get SIGKILL or similar, I don't think we have to accept a race condition | 12:23 |
juergbi | (and in the SIGKILL case there is nothing portable we can do, although on Linux there is PDEATHSIG) | 12:23 |
*** santi has joined #buildstream | 12:25 | |
tristan | Yeah, still the job fork is a bit delicate with sigaction | 12:29 |
tristan | And I was more referring to when a job completes... there are like one or two lines of code in Queue where we collect the job, before we handle exceptions in that context (as I recall) | 12:30 |
tristan | that is a tiny window, I recall a time when we did crazy things like calculate state in that window, causing any core crashes to come across horribly | 12:31 |
tristan | currently we update workspaces and manage used scheduler resources in those 2 lines, so that part hasn't changed drastically | 12:32 |
benschubert | tristan: I think in that case, the "right" fix is in the exit to check if process is not None. I *think* I know why that happens but would need some time to deep dive again (and I can't today) | 12:33 |
tristan | benschubert, Perhaps just ignore terminate() when process is None, right? I'll look into it; might also make sense to taint the job (I think it has a `terminated` variable already?) and raise an error if that job is started after having been terminated | 12:35 |
benschubert | tristan: yep, I think this would be the most reliable fix there | 12:36 |
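That guard might look roughly like this (kill() would need the same treatment); `_process` and `_terminated` are the attribute names implied above, everything else is an assumption:

```python
# Hedged sketch of the guard discussed for Job.terminate(): a job that was
# created but never started has no child process, so terminating it must be
# a no-op rather than raising and re-entering the termination path.
def terminate(self):
    self._terminated = True
    if self._process is None:
        # Never started: nothing to signal and nothing to wait for.
        return
    self._process.terminate()
```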
tristan | I'll reproduce and see what happens | 12:36 |
benschubert | if you manage to have a test case, that would be great | 12:37 |
tristan | Yeah, I wish; that's the part I'm not sure about, I haven't seen it happen with plugin code | 12:37 |
tristan | Hmmmm, not that easy unfortunately | 13:36 |
tristan | Looks like the scheduler really wants the unstarted jobs to terminate, or something | 13:39 |
tristan | the mainloop just keeps running indefinitely | 13:39 |
tristan | first it hangs until a timeout, then crashes again on job.kill(), then I fix job.kill() in a similar way... and the scheduler just stays there forever | 13:39 |
tristan | I think the bug here, is that we're adding jobs to scheduler._active_jobs, but they aren't actually active jobs | 13:40 |
tristan | Why are we doing that at all? | 13:40 |
* tristan thinks that we only create jobs as they are ready to be spawned, and carefully with the allowed resources | 13:41 | |
tristan | We loop backwards over the queues and Queue.harvest_jobs(), then spawn them immediately, only ever having active jobs in the list | 13:41 |
tristan | Then again, my case of reproducing this is very specific, it has to do with announcing state to the frontend | 13:43 |
tristan | this raises an error when we are in the process of launching a job | 13:43 |
tristan | Announcing state is a thing which the core has no control over what happens, so it is a variable/possibly large window | 13:44 |
tristan | We shouldn't (A) Create the job... (B) Announce the state to the frontend (... where whatever can happen ...) (C) Launch the job | 13:44 |
tristan | Instead we should have a tiny window, either announce the job before it exists, or after it's been launched, but not in "mid-launch" | 13:45 |
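Concretely, a small-window version of that launch path might look like the sketch below; the notification helper is a hypothetical stand-in for the existing callback into the frontend:

```python
# Hedged sketch: register and start the job first, and only then hand control
# to the frontend.  If the frontend callback raises, the scheduler is never
# left tracking a job that has no process behind it.
def _start_job(self, job):
    self._active_jobs.append(job)
    job.start()
    # Announce the task strictly after the job is running.
    self._notify_task_started(job)  # hypothetical wrapper around the notification
```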
tristan | That could explain my case, but I'm curious about https://gitlab.com/BuildStream/buildstream/-/issues/1309 | 13:45 |
* tristan supposes this scheduler notification thing which talks to Stream is an artifact of the unfinished work in splitting out the scheduler from the frontend | 13:58 | |
tristan | I think I'd prefer we didn't have that if we're not going to finish that | 13:59 |
tristan | it's what State() is for, and it's duplicated in this notification handler, hard to follow for something not useful | 13:59 |
tpollard | indeed it was the basis for that | 14:01 |
tristan | Yeah it makes sense, I wonder if it could have been hidden inside State() though | 14:04 |
tristan | i.e. if we split up the scheduler process from the frontend, the forked State() object could have an IPC connection to itself in the main process | 14:04 |
tpollard | we were basically achieving that with two Queues() | 14:05 |
tristan | Ok so, definitely in scheduler.py:Scheduler.start_job(), the code is dangerous as it launches a notification to an abstract third party (which is what the core should consider the frontend) right in between adding the job to the list and starting the job | 14:06 |
tristan | So callbacks offloaded to the frontend happen in a 'danger zone' | 14:06 |
tristan | Strangely, fixing this still results in a hanging loop | 14:06 |
tristan | (it does at least only show relevant stack traces and doesn't loop infinitely spewing stack traces, though) | 14:07 |
tristan | tpollard, so... what was wrong with two Queues() ? | 14:07 |
tristan | seems a low price to pay in order to save us from a double API abstraction for the core to deliver notifications to the frontend | 14:08 |
tpollard | tristan: there wasn't per se, but you can't implement the queues until you have the subprocess split | 14:09 |
tristan | Right | 14:09 |
tpollard | the WIP branch is still up & showed some promise | 14:09 |
tpollard | although I've not rebased it in a long long time | 14:10 |
tristan | tpollard, what I'm curious about is how come the Stream doesn't just hand over the State object to the Scheduler and any core-side components for those components to use directly to deliver notifications on | 14:10 |
tpollard | off the top of my head it probably has unpicklable references on it | 14:10 |
tristan | i.e. why is there an *extra* abstraction for the scheduler to invoke a Stream() callback so that Stream() can tell State() so that State() talks to frontend | 14:10 |
tpollard | we block the stream and the scheduler objects from being pickled | 14:11 |
tristan | State() shouldn't, I mean... its whole purpose is to be a delivery system | 14:11 |
tristan | it only needs to hold Queue counts and hold simple objects representing the ongoing tasks | 14:12 |
tristan | and it was sort of designed with the intention that it could do a messaging system, so we would have a single focal point for this | 14:12 |
tpollard | https://gitlab.com/BuildStream/buildstream/-/merge_requests/1613 | 14:12 |
* tristan is pretty sure we could do away with this arduous double notification system even if we do land this process separation | 14:14 | |
tristan | Maybe we could even do it conditionally (only on systems where we know it will buy us a speed increase?) | 14:14 |
tpollard | That was proposed at one point | 14:15 |
tristan | if its all nicely encapsulated in State + one conditional point where we fork() or not, it would probably not be horrible to maintain that way | 14:15 |
tristan | State() could just "know" by checking utils._is_main_process() or such | 14:16 |
tristan | Well, a bit tricky with the Queues and such, probably needs some help from Stream() to pass along a file descriptor or "Queue" if it forked | 14:17 |
tpollard | State is basically a placeholder for callbacks to the frontend | 14:18 |
tpollard | if you pickled that over to a different process I'm not sure how you'd get it to work | 14:18 |
tristan | I don't know why you would want to pickle it | 14:18 |
tristan | State is "The module for the core to deliver messages to the frontend" | 14:18 |
tristan | What you want is to have a handle to it everywhere that you need to deliver messages, and not care what process you are in: leave that to State to figure out | 14:19 |
tpollard | 'how come the Stream doesn't just hand over the State object to the Scheduler' | 14:19 |
tristan | Its purpose is to avoid specific marshalling of messages to focal points, by being the focal point | 14:19 |
tpollard | if the scheduler is in a separate process, it needs to be pickled | 14:19 |
tristan | tpollard, the State in the frontend would clearly not be the same instance in the scheduler process | 14:20 |
tristan | Maybe you would instantiate it twice, handing it a Queue from whence the process was forked on the subprocess side | 14:20 |
tristan | So that State could send messages to its counterpart in the frontend | 14:20 |
tpollard | you could have two states() talking to each other yes | 14:21 |
tristan | You would have, whether you originally had to pickle it or not | 14:21 |
tpollard | which is basically what the WIP was doing, via queues in Stream | 14:21 |
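A toy model of that idea, with State standing in for the real BuildStream State class (whose API differs) and a multiprocessing queue bridging the forked scheduler process to the frontend:

```python
# Hedged toy sketch: one State instance per process, bridged by a queue.  The
# scheduler-side State forwards task updates; the frontend-side State drains
# the queue and fires its registered callbacks.
import multiprocessing

class State:
    def __init__(self, queue, in_main_process):
        self._queue = queue
        self._in_main_process = in_main_process
        self._callbacks = []

    def register_task_changed_callback(self, callback):
        self._callbacks.append(callback)

    def task_changed(self, task_id, status):
        if self._in_main_process:
            for callback in self._callbacks:
                callback(task_id, status)       # deliver directly to the frontend
        else:
            self._queue.put((task_id, status))  # forward to the counterpart State

    def pump(self):
        # Frontend side: drain notifications coming from the scheduler process.
        while not self._queue.empty():
            task_id, status = self._queue.get_nowait()
            for callback in self._callbacks:
                callback(task_id, status)

# The frontend creates the queue before forking; each process then wraps it
# in its own State.
queue = multiprocessing.Queue()
frontend_state = State(queue, in_main_process=True)
scheduler_state = State(queue, in_main_process=False)
```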
tristan | When I went through designing this with jonathanmaw, we kind of made sure that State was used this way in the code base, with the foresight that the process separation work would only have to figure out how to make that transparent (and not have messages marshalled here and there in various places) | 14:22 |
tristan | tpollard, right, it is doing that now, but it's doing it with an extra level of abstraction we could live without | 14:23 |
tristan | If I want to track what happens with a message or notification, I need to know twice as much as I need :) | 14:23 |
tristan | This quite bothers me hehe | 14:24 |
tpollard | as I said, if there's not going to be work to implement the process split then I agree it's added noise | 14:25 |
tristan | tpollard, I'm now convinced it's added noise either way | 14:25 |
tristan | if we do process separation, we can keep it all focused in one place | 14:26 |
tpollard | there's way more that the frontend depends on when the processes are split than just tasks tracked on State() | 14:26 |
tristan | It needs to talk to stream in some way I suppose for suspend/resume and there is the whole story of shelling into failed builds on demand | 14:27 |
tristan | This seems all quite orthogonal to the logging/state tracking focal point of State() | 14:28 |
* tristan has to leave the closing coffee shop and go catch a midnight ferry... | 14:29 | |
tpollard | sounds like an adventure! | 14:29 |
*** tristan has quit IRC | 14:32 | |
*** toscalix has joined #buildstream | 17:46 | |
*** santi has quit IRC | 17:46 | |
*** tpollard has quit IRC | 18:06 | |
*** toscalix has quit IRC | 18:26 | |
*** toscalix has joined #buildstream | 18:49 | |
*** jude has quit IRC | 18:53 | |
*** phildawson_ has joined #buildstream | 19:12 | |
*** phildawson has quit IRC | 19:14 | |
*** phildawson_ has quit IRC | 19:15 | |
*** seanborg_ has quit IRC | 19:54 | |
*** seanborg_ has joined #buildstream | 19:57 | |
*** benschubert has quit IRC | 20:00 | |
*** seanborg_ has quit IRC | 20:04 | |
*** seanborg_ has joined #buildstream | 20:48 | |
*** seanborg_ has quit IRC | 20:51 | |
*** toscalix has quit IRC | 21:03 | |
*** xjuan has joined #buildstream | 21:09 | |
*** xjuan has quit IRC | 21:13 | |
*** xjuan has joined #buildstream | 21:21 | |
*** mohan43u has quit IRC | 22:10 | |
*** mohan43u has joined #buildstream | 22:17 | |
*** xjuan has quit IRC | 22:25 |
Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!