IRC logs for #buildstream for Thursday, 2020-05-14

*** cs-shadow has quit IRC00:23
*** aminbegood has joined #buildstream03:20
*** tristan has quit IRC03:44
*** aminbegood has quit IRC03:44
*** aminbegood has joined #buildstream05:16
*** aminbegood has quit IRC05:20
*** aminbegood has joined #buildstream05:21
*** traveltissues has quit IRC05:37
*** traveltissues has joined #buildstream05:37
*** aminbegood has quit IRC06:30
<juergbi> benschubert: afaict, the issue is that `config.clone()` in `Element.__extract_config()` doesn't actually clone `ScalarNode` (its clone implementation is a no-op)  06:46
<juergbi> this means that when the first Element of a plugin is instantiated, the default config of that plugin is permanently expanded with the variables of that first Element  06:49
<juergbi> as `Variables.expand()` replaces the `value` of the original `ScalarNode`  06:50
<juergbi> so in the context of freedesktop-sdk/glibc, we see the issue depending on whether the first Element has the `builddir` variable set to an empty string or not  06:50
<juergbi> a proper implementation of `ScalarNode.clone()` fixes the issue; however, this might have a performance penalty  06:51
<juergbi> https://paste.gnome.org/pdjbzxzwt  06:52
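(A self-contained sketch of the aliasing bug juergbi describes: if `clone()` is a no-op, expanding variables on one element's config mutates the plugin's shared default config. The classes and names below are simplified stand-ins, not BuildStream's real Cython implementation.)

```python
class ScalarNode:
    def __init__(self, value):
        self.value = value

    def clone(self):
        # The behaviour under discussion: a no-op clone would be `return self`,
        # letting every element share the plugin's default nodes.
        # The proposed fix is to return an actual copy:
        return ScalarNode(self.value)


def extract_config(defaults):
    # mimics Element.__extract_config(): each element should get its own copy
    return {key: node.clone() for key, node in defaults.items()}


def expand(config, variables):
    # mimics Variables.expand(), which rewrites `value` in place
    for node in config.values():
        for name, val in variables.items():
            node.value = node.value.replace("%{" + name + "}", val)


DEFAULTS = {"configure-commands": ScalarNode("./configure --prefix=%{prefix} %{builddir}")}

first = extract_config(DEFAULTS)
expand(first, {"prefix": "/usr", "builddir": ""})
second = extract_config(DEFAULTS)
# With a no-op clone(), `second` would already contain `first`'s expanded
# values; with a real copy, DEFAULTS stays pristine.
print(second["configure-commands"].value)
```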
*** slaf has quit IRC06:58
*** narispo has quit IRC07:12
*** narispo has joined #buildstream07:12
*** pointswaves has joined #buildstream07:14
*** seanborg_ has joined #buildstream07:20
<benschubert> Oh, good catch. Yeah, I'll make a patch and benchmark it then (unless you're on it?). Thanks juergbi  07:20
<juergbi> no, please go ahead  07:20
<juergbi> we should also add a test case. Shouldn't be too difficult with two elements of the same kind but different variables  07:21
<benschubert> yep, definitely  07:22
<juergbi> WSalmon, pointswaves: the above should fix the overlap issue  07:24
<juergbi> (may need a full rebuild)  07:24
<WSalmon> Thanks juergbi  07:35
<WSalmon> juergbi, is that on a branch somewhere? Does it need a load of performance testing or further work before merging? It's a lot easier to use it in my CI if it's on a branch, so can I make one if you don't have one?  07:39
<juergbi> not on a branch yet, benschubert is taking care of it  07:39
<juergbi> feel free to create a temporary branch yourself if you need it immediately  07:40
*** pointswaves has quit IRC07:40
<juergbi> I hope we can merge it to master today, tomorrow at the latest  07:40
*** WSalmon has left #buildstream07:40
*** WSalmon has joined #buildstream07:41
<benschubert> WSalmon: !1929 if you want it in advance  07:41
<WSalmon> thanks benschubert  07:42
* WSalmon -> looks  07:42
<benschubert> juergbi: :/ seems to take a noticeable hit on `show`  08:04
<juergbi> :-/  08:05
<juergbi> benschubert: a possible alternative would be to create a new ScalarNode only on expand, instead of setting .value on the existing node  08:05
<benschubert> yeah, I'll finish the complete benchmark, then try this solution  08:06
<juergbi> however, ScalarNode.clone() would still be misleading, unless we can make `value` immutable somehow  08:06
<benschubert> but then it becomes murky depending on how you call it  08:06
<benschubert> from Python it is immutable  08:06
<benschubert> but I was accessing it from C  08:06
<juergbi> ah ok, so maybe a big comment that we must never set `value` in the .pyx  08:08
<juergbi> benschubert: murky because it is no longer fully in-place?  08:08
<benschubert> yep  08:08
<juergbi> maybe it would be better to change it to create a whole new tree  08:08
*** jude has joined #buildstream08:09
<benschubert> _or_ should we have the composite do a full clone?  08:09
<juergbi> might not be a big performance difference if we have to recreate the many ScalarNodes anyway  08:09
<benschubert> that way we only do this for scalars  08:09
<juergbi> not sure what you mean. Two clone variants, where the one used by composite would also clone ScalarNodes?  08:10
<benschubert> yeah  08:11
<benschubert> since the problem only occurs when we share nodes  08:11
<juergbi> seems fairly confusing to me  08:11
<juergbi> I think either ScalarNode should be immutable, and then clone can stay as it is  08:11
<juergbi> or it is mutable, and clone() always has to actually clone it  08:11
<juergbi> a mix seems prone to bugs  08:11
<benschubert> ok, then we need to take the hit, I guess, unless we force `expand` to only accept mappings (in which case we can swap the Scalars without a problem)  08:12
<benschubert> but it's again more risky  08:12
<benschubert> oh well, guess we'll need to take the hit  08:12
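(For reference, the non-mutating alternative juergbi raises would look roughly like this: `expand` swaps in a fresh node per scalar instead of rewriting `value` in place, so nodes shared across elements stay untouched. This is only a sketch in terms of the toy `ScalarNode` above, not the real `Variables` code.)

```python
def expand_mapping(mapping, variables):
    # Sketch: only accept mappings and swap in new ScalarNodes, rather than
    # mutating each node's `value` in place.
    for key, node in mapping.items():
        new_value = node.value
        for name, val in variables.items():
            new_value = new_value.replace("%{" + name + "}", val)
        mapping[key] = ScalarNode(new_value)
```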
<juergbi> benschubert: do you think recreating the whole tree is even that much more expensive than only recreating ScalarNodes?  08:13
<juergbi> I'd expect the number of ScalarNodes to be higher  08:13
<juergbi> but I suppose adding those ScalarNodes to their parent is also not free...  08:13
<benschubert> on base-files, it's 7MB more, and 0.15s more  08:14
<benschubert> which is not _that_ much, but it's a very bare-bones element definition  08:14
<juergbi> do you have another project with more variables as a better benchmark?  08:15
<benschubert> I can try fsdk, but that's a small project again  08:16
<juergbi> bst show on freedesktop-sdk might be interesting  08:16
<juergbi> yes, but relative timing might still be interesting  08:16
<benschubert> and internally, it would take me too much time to update  08:16
<juergbi> assuming there isn't too much noise  08:16
<benschubert> sure, I can try, but the noise is mainly waiting for casd to come up :'D  08:16
<juergbi> hehe  08:16
<juergbi> well, we actually do this in the background  08:16
<juergbi> and casd master now uses a disk usage file  08:16
<juergbi> or on an empty cache it's fairly fast anyway  08:17
<benschubert> do you know of a good target element to show on fsdk?  08:17
<juergbi> (well, grpc init is actually not as fast as I'd like)  08:17
<juergbi> hm, sdk-image.bst is really way too fast  08:18
<juergbi> create a helper project with 100 junctions to fdo-sdk ;)  08:19
<benschubert> x)  08:19
*** slaf has joined #buildstream08:21
<benschubert> which will again be testing casd, as we'll be ingesting the junction  08:21
<juergbi> not in the second invocation with the junction cache  08:23
*** seanborg_ has quit IRC08:23
<juergbi> assuming we don't block on anything else casd-related  08:23
<benschubert> right  08:23
*** seanborg_ has joined #buildstream08:23
*** tpollard has joined #buildstream08:29
*** santi has joined #buildstream08:42
*** seanborg_ has quit IRC08:48
*** seanborg_ has joined #buildstream08:48
*** mohan43u has quit IRC08:52
*** mohan43u has joined #buildstream08:53
*** aminbegood has joined #buildstream10:47
*** aminbegood has quit IRC11:07
*** tristan has joined #buildstream11:35
*** seanborg_ has quit IRC12:05
<juergbi> tristan: shall we get rid of _legacy_command_steps in buildelement.py?  12:05
*** seanborg_ has joined #buildstream12:06
<juergbi> I think we should also tweak get_unique_key() to skip empty command lists  12:06
*** ChanServ sets mode: +o tristan12:06
<tristan> benschubert, is it possible that this new behavior of infinite loops when jobs terminate themselves because of a BUG exception (where the Job has a None `process`), i.e. https://gitlab.com/BuildStream/buildstream/-/issues/1309#note_342481685 ... is due to the cleanup of how we exit BuildStream from some weeks ago?  12:06
<juergbi> that way we could in the future add command lists without affecting the cache key  12:06
<tristan> juergbi, Yes, I think we should just drop those, indeed; we need to be able to add things without affecting the cache key  12:07
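(A rough sketch of what juergbi suggests for get_unique_key(): omit command steps whose lists are empty when composing the unique key, so that a command step introduced later, which defaults to an empty list, leaves existing cache keys untouched. The attribute names below are assumptions, not the actual BuildElement code.)

```python
def get_unique_key(self):
    key = {}
    # assumed: self._commands maps step names ("configure-commands", ...) to
    # lists of command strings parsed from the element's config
    for step, commands in self._commands.items():
        if commands:
            # only non-empty command lists contribute to the cache key, so a
            # step added later that defaults to [] won't invalidate old keys
            key[step] = commands
    return key
```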
<tristan> hmmm, I thought that was a benschubert MR but I can't find it in closed MRs  12:08
* tristan should reproduce this again and get to the bottom of it  12:09
<tristan> I wonder if we can test for it, but I doubt it  12:09
<tristan> looks like tpollard's https://gitlab.com/BuildStream/buildstream/-/merge_requests/1788 might be related to this, though  12:11
<juergbi> tristan: can't we have a test plugin raise a bad exception triggering a BUG?  12:11
<tristan> juergbi, I don't think so, I mean we do have some of those already  12:12
<juergbi> and/or even SIGKILL itself inside assemble(), if that might lead to different behavior  12:12
<tristan> I suspect that we now have a larger surface area, unaffected by plugins, where this cycle can happen  12:12
<tristan> Once you're inside assemble(), Job certainly has a .process assigned  12:13
<tristan> This is happening while a Job exists, but its process is None  12:13
* tristan recalls we used to have only a couple of such areas, with really only a few lines of code in Job and Queue, where we needed to be very careful not to have bugs  12:14
<tristan> When Jobs exit and just before they start  12:14
<juergbi> tristan: hm, but the casd `process` should actually be assigned in the main process early on, long before we fork  12:15
<tristan> juergbi, I don't see how this is related to the casd process  12:15
<tristan> I mean, Job processes  12:15
<tristan> each one is individually forked  12:15
<juergbi> ah, sorry, got confused with the CASDProcessManager `process` variable  12:16
<juergbi> not related  12:16
<tristan> For some reason, Job.terminate() is being called with a None process  12:16
<tristan> and that raises an exception, which eventually results in... you guessed it: calling Job.terminate()  12:16
*** santi has quit IRC12:18
*** santi has joined #buildstream12:18
<tristan> Aside from https://gitlab.com/BuildStream/buildstream/-/issues/1309, I can reproduce this in my active junction branch by reverting the part which makes element names get printed in full  12:19
<tristan> So I can debug from there...  12:19
<juergbi> tristan: maybe the solution is to call job.start() before self._active_jobs.append(job) in Scheduler._start_job(), fixing the race condition  12:19
<juergbi> and to make it easy to debug/hit, add a sleep between the two  12:19
<tristan> I'm hoping we can reduce the LOC where this could possibly happen  12:20
<juergbi> if we switch the order, I'd expect it to be completely fixed  12:20
<tristan> Maybe starting it earlier could help  12:20
<juergbi> well  12:20
<juergbi> it may introduce the inverse  12:20
<juergbi> a job being started but not killed  12:20
<tristan> I'm sure that with job control, there's always going to be a critical window where we just have to be very careful  12:21
<tristan> We just need to make sure it's a very small window  12:21
<juergbi> job control SIGINT is handled asynchronously, so it should be possible to make this race-free  12:22
*** santi has quit IRC12:22
<juergbi> as long as the main process doesn't get SIGKILL or similar, I don't think we have to accept a race condition  12:23
<juergbi> (and in the SIGKILL case there is nothing portable we can do, although on Linux there is PDEATHSIG)  12:23
*** santi has joined #buildstream12:25
<tristan> Yeah, still, the job fork is a bit delicate with sigaction  12:29
<tristan> And I was referring more to when a job completes... there are one or two lines of code in Queue, between collecting the job and handling exceptions in that context (as I recall)  12:30
<tristan> that is a tiny window; I recall a time when we did crazy things like calculate state in that window, causing any core crashes to come across horribly  12:31
<tristan> currently we update workspaces and manage used scheduler resources in those two lines, so that part hasn't changed drastically  12:32
<benschubert> tristan: I think in that case, the "right" fix is, on exit, to check whether process is not None. I *think* I know why that happens, but I would need some time to dive deep again (and I can't today)  12:33
<tristan> benschubert, Perhaps just ignore terminate() when process is None, right? I'll look into it; it might also make sense to taint the job (I think it already has a `terminated` variable?) and raise an error if that job is started after having been terminated  12:35
<benschubert> tristan: yep, I think this would be the most reliable fix there  12:36
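(A minimal sketch of the guard tristan and benschubert converge on; the real code lives in the scheduler's Job class, and the exact attribute names here are assumptions.)

```python
def terminate(self):
    # Taint the job first, so a later attempt to start it can be caught.
    self._terminated = True
    if self._process is None:
        # The job was never actually started: there is nothing to kill, and
        # raising here would re-enter terminate() via the BUG handling path.
        return
    self._process.terminate()


def start(self):
    if self._terminated:
        raise AssertionError("Attempted to start a job that was already terminated")
    ...
```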
<tristan> I'll reproduce and see what happens  12:36
<benschubert> if you manage to come up with a test case, that would be great  12:37
<tristan> Yeah, I wish; that's the part I'm not sure about, I haven't seen it happen with plugin code  12:37
<tristan> Hmmmm, not that easy unfortunately  13:36
<tristan> Looks like the scheduler really wants the unstarted jobs to terminate, or smth  13:39
<tristan> the main loop just keeps running indefinitely  13:39
<tristan> first it hangs until a timeout, then crashes again on job.kill(), then I fix job.kill() in a similar way... and the scheduler just stays there forever  13:39
<tristan> I think the bug here is that we're adding jobs to scheduler._active_jobs, but they aren't actually active jobs  13:40
<tristan> Why are we doing that at all?  13:40
* tristan thinks that we only create jobs as they are ready to be spawned, and carefully, with respect to the allowed resources  13:41
<tristan> We loop backwards over the queues calling Queue.harvest_jobs(), then spawn them immediately, only ever having active jobs in the list  13:41
<tristan> Then again, my case for reproducing this is very specific; it has to do with announcing state to the frontend  13:43
<tristan> this raises an error while we are in the process of launching a job  13:43
<tristan> Announcing state is something where the core has no control over what happens, so it is a variable, possibly large, window  13:44
<tristan> We shouldn't (A) create the job... (B) announce the state to the frontend (... where anything can happen ...) and then (C) launch the job  13:44
<tristan> Instead we should have a tiny window: either announce the job before it exists, or after it's been launched, but not "mid-launch"  13:45
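(For illustration, one possible ordering that keeps the window tiny in Scheduler._start_job(); the method name and self._active_jobs come from the discussion above, while the notification call is a hypothetical stand-in for whatever the frontend hook is actually called.)

```python
def _start_job(self, job):
    # Make the job genuinely active before anything outside the scheduler's
    # control runs: append and start back-to-back, then notify.
    self._active_jobs.append(job)
    job.start()
    # Announcing only after launch means an exception raised in frontend code
    # can no longer leave an unstarted job (process is None) in _active_jobs
    # for terminate()/kill() to trip over.
    self._notify_job_started(job)   # hypothetical name for the frontend hook
```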
<tristan> That could explain my case, but I'm curious about https://gitlab.com/BuildStream/buildstream/-/issues/1309  13:45
* tristan supposes this scheduler notification thing which talks to Stream is an artifact of the unfinished work on splitting out the scheduler from the frontend  13:58
<tristan> I think I'd prefer we didn't have that if we're not going to finish that work  13:59
<tristan> that's what State() is for, and it's duplicated in this notification handler; hard to follow for something that isn't useful  13:59
<tpollard> indeed, it was the basis for that  14:01
<tristan> Yeah, it makes sense; I wonder if it could have been hidden inside State() though  14:04
<tristan> i.e. if we split the scheduler process from the frontend, the forked State() object could have an IPC connection to its counterpart in the main process  14:04
<tpollard> we were basically achieving that with two Queues()  14:05
<tristan> Ok so, definitely in scheduler.py:Scheduler.start_job(), the code is dangerous, as it launches a notification to an abstract third party (which is what the core should consider the frontend) right in between adding the job to the list and starting the job  14:06
<tristan> So callbacks offloaded to the frontend are executed in a 'danger zone'  14:06
<tristan> Strangely, fixing this still results in a hanging loop  14:06
<tristan> (it does at least only show relevant stack traces and doesn't loop infinitely spewing them, though)  14:07
<tristan> tpollard, so... what was wrong with two Queues()?  14:07
<tristan> seems like a low price to pay in order to save us from a double API abstraction for the core to deliver notifications to the frontend  14:08
<tpollard> tristan: there wasn't anything wrong per se, but you can't implement the queues until you have the subprocess split  14:09
<tristan> Right  14:09
<tpollard> the WIP branch is still up & showed some promise  14:09
<tpollard> although I've not rebased it in a long, long time  14:10
<tristan> tpollard, what I'm curious about is why the Stream doesn't just hand over the State object to the Scheduler and any core-side components, for those components to use directly to deliver notifications  14:10
<tpollard> off the top of my head, it probably has unpicklable references on it  14:10
<tristan> i.e. why is there an *extra* abstraction for the scheduler to invoke a Stream() callback, so that Stream() can tell State(), so that State() talks to the frontend  14:10
<tpollard> we block the Stream and the Scheduler objects from being pickled  14:11
<tristan> State() shouldn't, I mean... its whole purpose is to be a delivery system  14:11
<tristan> it only needs to hold Queue counts and simple objects representing the ongoing tasks  14:12
<tristan> and it was sort of designed with the intention that it could become a messaging system, so we would have a single focal point for this  14:12
<tpollard> https://gitlab.com/BuildStream/buildstream/-/merge_requests/1613  14:12
* tristan is pretty sure we could do away with this arduous double notification system even if we do land this process separation  14:14
<tristan> Maybe we could even do it conditionally (only on systems where we know it will buy us a speed increase?)  14:14
<tpollard> That was proposed at one point  14:15
<tristan> if it's all nicely encapsulated in State, plus one conditional point where we fork() or not, it would probably not be horrible to maintain that way  14:15
<tristan> State() could just "know" by checking utils._is_main_process() or such  14:16
<tristan> Well, it's a bit tricky with the Queues and such; it probably needs some help from Stream() to pass along a file descriptor or "Queue" if it forked  14:17
<tpollard> State is basically a placeholder for callbacks to the frontend  14:18
<tpollard> if you pickled that over to a different process, I'm not sure how you'd get it to work  14:18
<tristan> I don't know why you would want to pickle it  14:18
<tristan> State is "The module for the core to deliver messages to the frontend"  14:18
<tristan> What you want is to have a handle to it everywhere that you need to deliver messages, and not care what process you are in: leave that to State to figure out  14:19
<tpollard> 'how come the Stream doesn't just hand over the State object to the Scheduler'  14:19
<tristan> Its purpose is to avoid specific marshalling of messages to focal points, by being the focal point  14:19
<tpollard> if the scheduler is in a separate process, it needs to be pickled  14:19
<tristan> tpollard, the State in the frontend would clearly not be the same instance as in the scheduler process  14:20
<tristan> Maybe you would instantiate it twice, handing it a Queue from where the process was forked, on the subprocess side  14:20
<tristan> So that State could send messages to its counterpart in the frontend  14:20
<tpollard> you could have two State()s talking to each other, yes  14:21
<tristan> You would have that, whether you originally had to pickle it or not  14:21
<tpollard> which is basically what the WIP was doing, via Queues in Stream  14:21
<tristan> When I went through designing this with jonathanmaw, we kind of made sure that State was used this way in the code base, with the foresight that the process separation work would only have to figure out how to make that transparent (and not have messages marshalled here and there in various places)  14:22
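(A toy, self-contained sketch of the shape being described: the same State-like class is instantiated on both sides of a fork, and the subprocess side forwards notifications over a multiprocessing Queue to its counterpart in the frontend process. This is not BuildStream's actual State API, just an illustration of the design.)

```python
import multiprocessing as mp


class State:
    """Toy stand-in: delivers notifications via callbacks in the frontend
    process, or forwards them over a Queue when living in the scheduler
    subprocess."""

    def __init__(self, queue, in_subprocess):
        self._queue = queue
        self._in_subprocess = in_subprocess
        self._callbacks = []

    def register_callback(self, callback):
        self._callbacks.append(callback)

    def notify(self, message):
        if self._in_subprocess:
            # scheduler side: hand the message to the frontend counterpart
            self._queue.put(message)
        else:
            for callback in self._callbacks:
                callback(message)

    def pump(self):
        # frontend side: deliver anything the subprocess forwarded
        while not self._queue.empty():
            message = self._queue.get()
            for callback in self._callbacks:
                callback(message)


def scheduler_process(queue):
    state = State(queue, in_subprocess=True)
    state.notify("task started: build/hello.bst")


if __name__ == "__main__":
    queue = mp.Queue()
    frontend_state = State(queue, in_subprocess=False)
    frontend_state.register_callback(print)

    proc = mp.Process(target=scheduler_process, args=(queue,))
    proc.start()
    proc.join()

    frontend_state.pump()   # prints the forwarded notification
```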
<tristan> tpollard, right, it is doing that now, but it's doing it with an extra level of abstraction we could live without  14:23
<tristan> If I want to track what happens with a message or notification, I need to know twice as much as I should have to :)  14:23
<tristan> This quite bothers me, hehe  14:24
<tpollard> as I said, if there's not going to be work to implement the process split, then I agree it's added noise  14:25
<tristan> tpollard, I'm now convinced it's added noise either way  14:25
<tristan> if we do process separation, we can keep it all focused in one place  14:26
<tpollard> there's way more that the frontend depends on when the processes are split than just the tasks tracked on State()  14:26
<tristan> It needs to talk to Stream in some way, I suppose, for suspend/resume, and there is the whole story of shelling into failed builds on demand  14:27
<tristan> This all seems quite orthogonal to the logging/state-tracking focal point of State()  14:28
* tristan has to leave the closing coffee shop and go catch a midnight ferry...  14:29
<tpollard> sounds like an adventure!  14:29
*** tristan has quit IRC14:32
*** toscalix has joined #buildstream17:46
*** santi has quit IRC17:46
*** tpollard has quit IRC18:06
*** toscalix has quit IRC18:26
*** toscalix has joined #buildstream18:49
*** jude has quit IRC18:53
*** phildawson_ has joined #buildstream19:12
*** phildawson has quit IRC19:14
*** phildawson_ has quit IRC19:15
*** seanborg_ has quit IRC19:54
*** seanborg_ has joined #buildstream19:57
*** benschubert has quit IRC20:00
*** seanborg_ has quit IRC20:04
*** seanborg_ has joined #buildstream20:48
*** seanborg_ has quit IRC20:51
*** toscalix has quit IRC21:03
*** xjuan has joined #buildstream21:09
*** xjuan has quit IRC21:13
*** xjuan has joined #buildstream21:21
*** mohan43u has quit IRC22:10
*** mohan43u has joined #buildstream22:17
*** xjuan has quit IRC22:25

Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!