*** cs-shadow has quit IRC | 00:23 | |
*** aminbegood has joined #buildstream | 03:20 | |
*** tristan has quit IRC | 03:44 | |
*** aminbegood has quit IRC | 03:44 | |
*** aminbegood has joined #buildstream | 05:16 | |
*** aminbegood has quit IRC | 05:20 | |
*** aminbegood has joined #buildstream | 05:21 | |
*** traveltissues has quit IRC | 05:37 | |
*** traveltissues has joined #buildstream | 05:37 | |
*** aminbegood has quit IRC | 06:30 | |
juergbi | benschubert: afaict, the issue is that `config.clone()` in `Element.__extract_config()` doesn't actually clone `ScalarNode` (its clone implementation is a no-op) | 06:46 |
juergbi | this means that when the first Element of a plugin is instantiated, the default config of that plugin is permanently expanded with the variables of that first Element | 06:49 |
juergbi | as `Variables.expand()` replaces `value` of the original `ScalarNode` | 06:50 |
juergbi | so in the context of freedesktop-sdk/glibc, we see the issue depending on whether the first Element has the `builddir` variable set to empty string or not | 06:50 |
juergbi | a proper implementation of `ScalarNode.clone()` fixes the issue, however, this might have a performance penalty | 06:51 |
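A minimal sketch of the fix being discussed, in plain Python rather than the real Cython node.pyx; the constructor and attribute names (value, file_index, line, column) are assumptions drawn from this conversation, not the actual BuildStream API:

```python
# Hedged sketch only: a ScalarNode.clone() that actually copies the node, so
# Variables.expand() can no longer mutate a plugin's shared default config
# through a shared ScalarNode instance.
class Node:
    def __init__(self, file_index, line, column):
        self.file_index = file_index
        self.line = line
        self.column = column

class ScalarNode(Node):
    def __init__(self, value, file_index, line, column):
        super().__init__(file_index, line, column)
        self.value = value

    def clone(self):
        # Previously effectively a no-op (returning the same instance), which
        # is what let the first Element's expanded variables leak into the
        # plugin's default config.
        return ScalarNode(self.value, self.file_index, self.line, self.column)
```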
juergbi | https://paste.gnome.org/pdjbzxzwt | 06:52 |
*** slaf has quit IRC | 06:58 | |
*** narispo has quit IRC | 07:12 | |
*** narispo has joined #buildstream | 07:12 | |
*** pointswaves has joined #buildstream | 07:14 | |
*** seanborg_ has joined #buildstream | 07:20 | |
benschubert | Oh good catch. Yeah I'll make a patch and benchmark then (unless you're on it?) Thanks juergbi | 07:20 |
juergbi | no, please go ahead | 07:20 |
juergbi | we should also add a test case. shouldn't be too difficult with two elements of the same kind but different variables | 07:21 |
benschubert | yep definitely | 07:22 |
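A rough sketch of such a test, assuming the buildstream.testing `cli` fixture and a hypothetical project containing first.bst and second.bst of the same kind, where only first.bst overrides a variable; the element names, project layout and checked value are all illustrative:

```python
# Hedged test sketch: two elements of the same kind with different variables.
# If ScalarNode.clone() is a no-op, instantiating the first element leaks its
# expanded variables into the plugin's default config and thus into the second.
import os

import pytest

from buildstream.testing import cli  # noqa: F401  (pytest fixture)

DATA_DIR = os.path.join(os.path.dirname(os.path.realpath(__file__)), "project")

@pytest.mark.datafiles(DATA_DIR)
def test_default_config_not_shared(cli, datafiles):
    project = str(datafiles)
    result = cli.run(project=project, args=[
        "show", "--format", "%{name}: %{vars}", "first.bst", "second.bst",
    ])
    result.assert_success()
    # The value overridden in first.bst (hypothetically "builddir-from-first")
    # must show up for first.bst only, not leak into second.bst.
    assert result.output.count("builddir-from-first") == 1
```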
juergbi | WSalmon, pointswaves: the above should fix the overlap issue | 07:24 |
juergbi | (may need full rebuild) | 07:24 |
WSalmon | Thanks juergbi | 07:35 |
WSalmon | juergbi, is that on a branch somewhere? Does it need a load of performance testing or further work before merging? It's a lot easier to use it in my CI if it's a branch, so can I make one if you don't have one? | 07:39 |
juergbi | not on a branch yet, benschubert is taking care of it | 07:39 |
juergbi | feel free to create a temporary branch yourself if you need it immediately | 07:40 |
*** pointswaves has quit IRC | 07:40 | |
juergbi | I hope we can merge it to master today, tomorrow at the latest | 07:40 |
*** WSalmon has left #buildstream | 07:40 | |
*** WSalmon has joined #buildstream | 07:41 | |
benschubert | WSalmon: !1929 if you want it in advance | 07:41 |
WSalmon | thanks benschubert | 07:42 |
* WSalmon -> looks | 07:42 | |
benschubert | juergbi: :/ seems to take a noticeable hit on `bst show` | 08:04 |
juergbi | :-/ | 08:05 |
juergbi | benschubert: a possible alternative would be to create a new ScalarNode only on expand, instead of setting .value of the existing node | 08:05 |
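Roughly, that alternative would mean expansion producing a fresh node rather than assigning in place; the substitution helper below is a hypothetical stand-in, not the real Variables API:

```python
# Hedged sketch of the alternative: Variables expansion returns a brand-new
# ScalarNode carrying the expanded string, leaving the original node (and any
# shared default config that references it) untouched.  `_subst()` and the
# ScalarNode constructor used here are assumptions.
def _expand_scalar(self, node):
    expanded = self._subst(node.value)  # hypothetical variable substitution
    return ScalarNode(expanded, node.file_index, node.line, node.column)
```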
benschubert | yeah I'll finish the complete benchmark then try this solution | 08:06 |
juergbi | however, ScalarNode.clone() would still be misleading, unless we can make `value` immutable somehow | 08:06 |
benschubert | but then it becomes murky depending how you call it | 08:06 |
benschubert | from python it is immutable | 08:06 |
benschubert | but I was accessing it from C | 08:06 |
juergbi | ah ok, so maybe a big comment that we must never set value in the .pyx | 08:08 |
juergbi | benschubert: murky because it is no longer fully in-place? | 08:08 |
benschubert | yep | 08:08 |
juergbi | maybe it would be better to change it to create a whole new tree | 08:08 |
*** jude has joined #buildstream | 08:09 | |
benschubert | _or_ should we have the composite do a full clone? | 08:09 |
juergbi | might not be a big performance difference if we have to recreate the many ScalarNodes anyway | 08:09 |
benschubert | that way we only do this for scalars | 08:09 |
juergbi | not sure what you mean. two clone variants, the one used by composite would also be cloning scalarnodes? | 08:10 |
benschubert | yeah | 08:11 |
benschubert | since the problem only occurs when we share nodes | 08:11 |
juergbi | seems fairly confusing to me | 08:11 |
juergbi | I think either ScalarNode should be immutable and then clone can stay as it is | 08:11 |
juergbi | or it is mutable and clone() always has to actually clone it | 08:11 |
juergbi | a mix seems prone to bugs | 08:11 |
benschubert | ok, then we need to take the hit I guess, unless we force `expand` to only accept mappings (in which case we can swap the Scalars without problem) | 08:12 |
benschubert | but it's again more risky | 08:12 |
benschubert | oh well, guess we'll need to take the hit | 08:12 |
juergbi | benschubert: do you think recreating the whole tree is even that much more expensive than only recreating ScalarNodes? | 08:13 |
juergbi | I'd expect the number of ScalarNodes to be higher | 08:13 |
juergbi | but I suppose adding those ScalarNodes to their parent is also not free... | 08:13 |
benschubert | on base-files, it's 7MB more, and 0.15s more | 08:14 |
benschubert | which is not _that_ much, but it's a very barebone element definition | 08:14 |
juergbi | do you have another project with more variables as a better benchmark? | 08:15 |
benschubert | I can try fsdk, but that's a small project again | 08:16 |
juergbi | bst show on freedesktop-sdk might be interesting | 08:16 |
juergbi | yes, but relative timing might still be interesting | 08:16 |
benschubert | and internally, it would take me too much time to update | 08:16 |
juergbi | assuming there isn't too much noise | 08:16 |
benschubert | sure I can try, but the noise is mainly waiting for casd to get up :'D | 08:16 |
juergbi | hehe | 08:16 |
juergbi | well, we actually do this in the background | 08:16 |
juergbi | and casd master now uses a disk usage file | 08:16 |
juergbi | or with an empty cache it's fairly fast anyway | 08:17 |
benschubert | do you know of a good target element to show on fsdk? | 08:17 |
juergbi | (well, grpc init is actually not as fast as I'd like) | 08:17 |
juergbi | hm, sdk-image.bst is really way too fast | 08:18 |
juergbi | create a helper project with 100 junctions to fdo-sdk ;) | 08:19 |
benschubert | x) | 08:19 |
*** slaf has joined #buildstream | 08:21 | |
benschubert | which will again be testing casd, as we'll be ingesting the junction | 08:21 |
juergbi | not in the second invocation with the junction cache | 08:23 |
*** seanborg_ has quit IRC | 08:23 | |
juergbi | assuming we don't block on anything else casd-related | 08:23 |
benschubert | right | 08:23 |
*** seanborg_ has joined #buildstream | 08:23 | |
*** tpollard has joined #buildstream | 08:29 | |
*** santi has joined #buildstream | 08:42 | |
*** seanborg_ has quit IRC | 08:48 | |
*** seanborg_ has joined #buildstream | 08:48 | |
*** mohan43u has quit IRC | 08:52 | |
*** mohan43u has joined #buildstream | 08:53 | |
*** aminbegood has joined #buildstream | 10:47 | |
*** aminbegood has quit IRC | 11:07 | |
*** tristan has joined #buildstream | 11:35 | |
*** seanborg_ has quit IRC | 12:05 | |
juergbi | tristan: shall we get rid of _legacy_command_steps in buildelement.py? | 12:05 |
*** seanborg_ has joined #buildstream | 12:06 | |
juergbi | I think we should also tweak get_unique_key() to skip empty command lists | 12:06 |
*** ChanServ sets mode: +o tristan | 12:06 | |
tristan | benschubert, is it possible that this new behavior of infinite loops when jobs terminate themselves because of a BUG exception (where the Job has a None `process`), i.e. https://gitlab.com/BuildStream/buildstream/-/issues/1309#note_342481685 ... is due to the cleanup of how we exit BuildStream from some weeks ago? | 12:06 |
juergbi | that way we could in the future add command lists without affecting the cache key | 12:06 |
tristan | juergbi, Yes I think we should just drop those, indeed; need to be able to add things without affecting cache key | 12:07 |
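In other words, something along these lines for BuildElement.get_unique_key(); `self.__commands` as a dict mapping command step name to command list is an assumed attribute name:

```python
# Hedged sketch: drop empty command lists from the unique key so that adding
# a new, empty-by-default command step later does not invalidate existing
# cache keys.
def get_unique_key(self):
    return {
        command_name: commands
        for command_name, commands in self.__commands.items()
        if commands  # empty lists contribute nothing, so leave them out
    }
```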
tristan | hmmm, I thought that was a benschubert MR but I can't find it in closed MRs | 12:08 |
* tristan should reproduce this again and get to the bottom of it | 12:09 | |
tristan | I wonder if we can test for it, but I doubt it | 12:09 |
tristan | looks like tpollard's https://gitlab.com/BuildStream/buildstream/-/merge_requests/1788 might be related to this, though | 12:11 |
juergbi | tristan: can't we have a test plugin raise a bad exception triggering a BUG? | 12:11 |
tristan | juergbi, I don't think so, I mean we do have some of those already | 12:12 |
juergbi | and/or even SIGKILL itself inside assemble(), if that may lead to different behavior | 12:12 |
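For illustration, such a test plugin could be as small as the sketch below, written against the public Element plugin interface; this is not an existing plugin from the test suite, and the details are assumptions:

```python
# Hedged sketch of a test element whose assemble() raises a plain exception,
# which BuildStream treats as a BUG (unhandled plugin error) rather than a
# normal plugin failure.
from buildstream import Element

class BuggyElement(Element):
    def configure(self, node):
        pass

    def preflight(self):
        pass

    def get_unique_key(self):
        return {}

    def configure_sandbox(self, sandbox):
        pass

    def stage(self, sandbox):
        pass

    def assemble(self, sandbox):
        # A non-BstError exception escaping a plugin is reported as a bug.
        raise RuntimeError("deliberately broken plugin")

# Entry point expected by the plugin loader
def setup():
    return BuggyElement
```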
tristan | I suspect that we now have a larger surface area that is unaffected by plugins which can cause this cycle to happen | 12:12 |
tristan | Once you're inside assemble(), Job certainly has a .process assigned | 12:13 |
tristan | This is happening while a Job exists, but process is None | 12:13 |
* tristan recalls we used to have only a couple such areas with really only a few lines of code, in Job and Queue, where we needed to be very careful to not have bugs | 12:14 | |
tristan | When Jobs exit and just before they start | 12:14 |
juergbi | tristan: hm, but casd `process` should actually be assigned in the main process early on, long before we fork | 12:15 |
tristan | juergbi, I don't see how this is related to casd process | 12:15 |
tristan | I mean, Job processes | 12:15 |
tristan | each one is individually forked | 12:15 |
juergbi | ah, sorry, got confused with CASDProcessManager `process` variable | 12:16 |
juergbi | not related | 12:16 |
tristan | For some reason, Job.terminate() is being called with a None process | 12:16 |
tristan | and that raises an exception, which results eventually in... you guessed it: calling Job.terminate() | 12:16 |
*** santi has quit IRC | 12:18 | |
*** santi has joined #buildstream | 12:18 | |
tristan | Aside from https://gitlab.com/BuildStream/buildstream/-/issues/1309, I can reproduce this in my active junction branch by reverting the part which makes element names get printed in full | 12:19 |
tristan | So I can debug from there... | 12:19 |
juergbi | tristan: maybe the solution is to call job.start() before self._active_jobs.append(job) in Scheduler._start_job(), fixing the race condition | 12:19 |
juergbi | and to make it easy to debug/hit, add a sleep between the two | 12:19 |
tristan | I'm hoping we can reduce the LOC where this could possibly happen | 12:20 |
juergbi | if we switch the order, I'd expect it to be completely fixed | 12:20 |
tristan | Maybe starting it earlier could help | 12:20 |
juergbi | well | 12:20 |
juergbi | it may introduce the inverse | 12:20 |
juergbi | job being started but not killed | 12:20 |
tristan | I'm sure that with job control, there's always going to be a critical window where we just have to be very careful | 12:21 |
tristan | We just need to make sure it's a very small window | 12:21 |
juergbi | job control SIGINT is handled asynchronously, should be possible race-free | 12:22 |
*** santi has quit IRC | 12:22 | |
juergbi | as long as the main process doesn't get SIGKILL or similar, I don't think we have to accept a race condition | 12:23 |
juergbi | (and in the SIGKILL case there is nothing portable we can do, although on Linux there is PDEATHSIG) | 12:23 |
*** santi has joined #buildstream | 12:25 | |
tristan | Yeah, still the job fork is a bit delicate with sigaction | 12:29 |
tristan | And I was more referring to when a job completes... there are like one or two lines of code in Queue where we collect the job, before we handle exceptions in that context (as I recall) | 12:30 |
tristan | that is a tiny window, I recall a time when we did crazy things like calculate state in that window, causing any core crashes to come across horribly | 12:31 |
tristan | currently we update workspaces and manage used scheduler resources in those 2 lines, so that part hasn't changed drastically | 12:32 |
benschubert | tristan: I think in that case, the "right" fix is in the exit to check if process is not None. I *think* I know why that happens but would need some time to deep dive again (and I can't today) | 12:33 |
tristan | benschubert, Perhaps just ignore terminate() when process is None, right? I'll look into it; might also make sense to taint the job (I think it has a `terminated` variable already?) and raise an error if that job is started after having been terminated | 12:35 |
benschubert | tristan: yep, I think this would be the most reliable fix there | 12:36 |
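That guard might look roughly like this (kill() would need the same treatment); `_process` and `_terminated` are the attribute names implied above, everything else is an assumption:

```python
# Hedged sketch of the guard discussed for Job.terminate(): a job that was
# created but never started has no child process, so terminating it must be
# a no-op rather than raising and re-entering the termination path.
def terminate(self):
    self._terminated = True
    if self._process is None:
        # Never started: nothing to signal and nothing to wait for.
        return
    self._process.terminate()
```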
tristan | I'll reproduce and see what happens | 12:36 |
benschubert | if you manage to have a test case, that would be great | 12:37 |
tristan | Yeah, I wish; that's the part I'm not sure about, I haven't seen it happen with plugin code | 12:37 |
tristan | Hmmmm, not that easy unfortunately | 13:36 |
tristan | Looks like the scheduler really wants the unstarted jobs to terminate, or something | 13:39 |
tristan | the mainloop just keeps running indefinitely | 13:39 |
tristan | first it hangs until a timeout, then crashes again on job.kill(), then I fix job.kill() in a similar way... and the scheduler just stays there forever | 13:39 |
tristan | I think the bug here, is that we're adding jobs to scheduler._active_jobs, but they aren't actually active jobs | 13:40 |
tristan | Why are we doing that at all? | 13:40 |
* tristan thinks that we only create jobs as they are ready to be spawned, and carefully with the allowed resources | 13:41 | |
tristan | We loop backwards over the queues and Queue.harvest_jobs(), then spawn them immediately, only ever having active jobs in the list | 13:41 |
tristan | Then again, my case of reproducing this is very specific, it has to do with announcing state to the frontend | 13:43 |
tristan | this raises an error when we are in the process of launching a job | 13:43 |
tristan | Announcing state is a thing which the core has no control over what happens, so it is a variable/possibly large window | 13:44 |
tristan | We shouldn't (A) Create the job... (B) Announce the state to the frontend (... where whatever can happen ...) (C) Launch the job | 13:44 |
tristan | Instead we should have a tiny window, either announce the job before it exists, or after it's been launched, but not in "mid-launch" | 13:45 |
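Concretely, a small-window version of that launch path might look like the sketch below; the notification helper is a hypothetical stand-in for the existing callback into the frontend:

```python
# Hedged sketch: register and start the job first, and only then hand control
# to the frontend.  If the frontend callback raises, the scheduler is never
# left tracking a job that has no process behind it.
def _start_job(self, job):
    self._active_jobs.append(job)
    job.start()
    # Announce the task strictly after the job is running.
    self._notify_task_started(job)  # hypothetical wrapper around the notification
```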
tristan | That could explain my case, but I'm curious about https://gitlab.com/BuildStream/buildstream/-/issues/1309 | 13:45 |
* tristan supposes this scheduler notification thing which talks to Stream is an artifact of the unfinished work in splitting out the scheduler from the frontend | 13:58 | |
tristan | I think I'd prefer we didn't have that if we're not going to finish that | 13:59 |
tristan | it's what State() is for, and it's duplicated in this notification handler, hard to follow for something not useful | 13:59 |
tpollard | indeed it was the basis for that | 14:01 |
tristan | Yeah it makes sense, I wonder if it could have been hidden inside State() though | 14:04 |
tristan | i.e. if we split up the scheduler process from the frontend, the forked State() object could have an IPC connection to itself in the main process | 14:04 |
tpollard | we were basically achieving that with two Queues() | 14:05 |
tristan | Ok so, definitely in scheduler.py:Scheduler.start_job(), the code is dangerous as it launches a notification to an abstract third party (which is what the core should consider the frontend) right in between adding the job to the list and starting the job | 14:06 |
tristan | So callbacks offloaded to the frontend happen in a 'danger zone' | 14:06 |
tristan | Strangely, fixing this still results in a hanging loop | 14:06 |
tristan | (it does at least only show relevant stack traces and doesn't loop infinitely spewing stack traces, though) | 14:07 |
tristan | tpollard, so... what was wrong with two Queues() ? | 14:07 |
tristan | seems a low price to pay in order to save us from a double API abstraction for the core to deliver notifications to the frontend | 14:08 |
tpollard | tristan: there wasn't per se, but you can't implement the queues until you have the subprocess split | 14:09 |
tristan | Right | 14:09 |
tpollard | the WIP branch is still up & showed some promise | 14:09 |
tpollard | although I've not rebased it in a long long time | 14:10 |
tristan | tpollard, what I'm curious about is how come the Stream doesn't just hand over the State object to the Scheduler and any core-side components for those components to use directly to deliver notifications on | 14:10 |
tpollard | off the top of my head it probably has unpicklable references on it | 14:10 |
tristan | i.e. why is there an *extra* abstraction for the scheduler to invoke a Stream() callback so that Stream() can tell State() so that State() talks to frontend | 14:10 |
tpollard | we block the stream and the scheduler objects from being pickled | 14:11 |
tristan | State() shouldn't, I mean... its whole purpose is to be a delivery system | 14:11 |
tristan | it only needs to hold Queue counts and hold simple objects representing the ongoing tasks | 14:12 |
tristan | and it was sort of designed with the intention that it could do a messaging system, so we would have a single focal point for this | 14:12 |
tpollard | https://gitlab.com/BuildStream/buildstream/-/merge_requests/1613 | 14:12 |
* tristan is pretty sure we could do away with this arduous double notification system even if we do land this process separation | 14:14 | |
tristan | Maybe we could even do it conditionally (only on systems where we know it will buy us a speed increase?) | 14:14 |
tpollard | That was proposed at one point | 14:15 |
tristan | if its all nicely encapsulated in State + one conditional point where we fork() or not, it would probably not be horrible to maintain that way | 14:15 |
tristan | State() could just "know" by checking utils._is_main_process() or such | 14:16 |
tristan | Well, a bit tricky with the Queues and such, probably needs some help from Stream() to pass along a file descriptor or "Queue" if it forked | 14:17 |
tpollard | State is basically a placeholder for callbacks to the frontend | 14:18 |
tpollard | if you pickled that over to a different process I'm not sure how you'd get it to work | 14:18 |
tristan | I don't know why you would want to pickle it | 14:18 |
tristan | State is "The module for the core to deliver messages to the frontend" | 14:18 |
tristan | What you want is to have a handle to it everywhere that you need to deliver messages, and not care what process you are in: leave that to State to figure out | 14:19 |
tpollard | 'how come the Stream doesn't just hand over the State object to the Scheduler' | 14:19 |
tristan | Its purpose is to avoid specific marshalling of messages to focal points, by being the focal point | 14:19 |
tpollard | if the scheduler is in a separate process, it needs to be pickled | 14:19 |
tristan | tpollard, the State in the frontend would clearly not be the same instance in the scheduler process | 14:20 |
tristan | Maybe you would instantiate it twice, handing it a Queue from whence the process was forked on the subprocess side | 14:20 |
tristan | So that State could send messages to its counterpart in the frontend | 14:20 |
tpollard | you could have two states() talking to each other yes | 14:21 |
tristan | You would have, whether you originally had to pickle it or not | 14:21 |
tpollard | which is basically what the WIP was doing, via queues in Stream | 14:21 |
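A toy model of that idea, with State standing in for the real BuildStream State class (whose API differs) and a multiprocessing queue bridging the forked scheduler process to the frontend:

```python
# Hedged toy sketch: one State instance per process, bridged by a queue.  The
# scheduler-side State forwards task updates; the frontend-side State drains
# the queue and fires its registered callbacks.
import multiprocessing

class State:
    def __init__(self, queue, in_main_process):
        self._queue = queue
        self._in_main_process = in_main_process
        self._callbacks = []

    def register_task_changed_callback(self, callback):
        self._callbacks.append(callback)

    def task_changed(self, task_id, status):
        if self._in_main_process:
            for callback in self._callbacks:
                callback(task_id, status)       # deliver directly to the frontend
        else:
            self._queue.put((task_id, status))  # forward to the counterpart State

    def pump(self):
        # Frontend side: drain notifications coming from the scheduler process.
        while not self._queue.empty():
            task_id, status = self._queue.get_nowait()
            for callback in self._callbacks:
                callback(task_id, status)

# The frontend creates the queue before forking; each process then wraps it
# in its own State.
queue = multiprocessing.Queue()
frontend_state = State(queue, in_main_process=True)
scheduler_state = State(queue, in_main_process=False)
```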
tristan | When I went through designing this with jonathanmaw, we kind of made sure that State was used this way in the code base, with the foresight that the process separation work would only have to figure out how to make that transparent (and not have messages marshalled here and there in various places) | 14:22 |
tristan | tpollard, right, it is doing that now, but it's doing it with an extra level of abstraction we could live without | 14:23 |
tristan | If I want to track what happens with a message or notification, I need to know twice as much as I need :) | 14:23 |
tristan | This quite bothers me hehe | 14:24 |
tpollard | as I said, if there's not going to be work to implement the process split then I agree it's added noise | 14:25 |
tristan | tpollard, I'm now convinced it's added noise either way | 14:25 |
tristan | if we do process separation, we can keep it all focused in one place | 14:26 |
tpollard | there's way more that the frontend depends on when the processes are split than just tasks tracked on State() | 14:26 |
tristan | It needs to talk to stream in some way I suppose for suspend/resume and there is the whole story of shelling into failed builds on demand | 14:27 |
tristan | This seems all quite orthogonal to the logging/state tracking focal point of State() | 14:28 |
* tristan has to leave the closing coffee shop and go catch a midnight ferry... | 14:29 | |
tpollard | sounds like an adventure! | 14:29 |
*** tristan has quit IRC | 14:32 | |
*** toscalix has joined #buildstream | 17:46 | |
*** santi has quit IRC | 17:46 | |
*** tpollard has quit IRC | 18:06 | |
*** toscalix has quit IRC | 18:26 | |
*** toscalix has joined #buildstream | 18:49 | |
*** jude has quit IRC | 18:53 | |
*** phildawson_ has joined #buildstream | 19:12 | |
*** phildawson has quit IRC | 19:14 | |
*** phildawson_ has quit IRC | 19:15 | |
*** seanborg_ has quit IRC | 19:54 | |
*** seanborg_ has joined #buildstream | 19:57 | |
*** benschubert has quit IRC | 20:00 | |
*** seanborg_ has quit IRC | 20:04 | |
*** seanborg_ has joined #buildstream | 20:48 | |
*** seanborg_ has quit IRC | 20:51 | |
*** toscalix has quit IRC | 21:03 | |
*** xjuan has joined #buildstream | 21:09 | |
*** xjuan has quit IRC | 21:13 | |
*** xjuan has joined #buildstream | 21:21 | |
*** mohan43u has quit IRC | 22:10 | |
*** mohan43u has joined #buildstream | 22:17 | |
*** xjuan has quit IRC | 22:25 |
Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!