IRC logs for #buildstream for Thursday, 2020-07-02

*** tristan has quit IRC06:59
*** benschubert has joined #buildstream07:27
benschuberttristan: oh ! great :D07:27
*** tristan has joined #buildstream07:35
*** ChanServ sets mode: +o tristan07:35
tristanbenschubert, could you launch another benchmark on the latest please ?08:00
tristanJust see how we're doing08:00
benschubertsure, I'll do that in 30 minutes, just need to finish some urgent stuff :)08:00
tristansure08:01
tristanI think this one is a winner, I'll go deeper anyway but would just like to know :)08:01
benschuberttristan: running!08:13
*** hasebastian has joined #buildstream08:13
benschubertBy the way does someone remember how jennis generated the debian project? I could not find the script to do that :)08:13
Frazer61hi, does anyone have any thoughts on the `bst source show` feature? Since secure downloads may not be entirely possible for the time being, as they wouldn't be very well defined, this is the next best thing08:33
*** santi has joined #buildstream08:37
benschubertFrazer61: that might be better discussed on the ML :) One thing I could see is showing all the `config` for each source in the given element, or something like that?08:38
Frazer61tristan wanted to discuss it on the irc chat to get others involved more, so "Better to have this conversation in public." at least for the brainstorming08:41
coldtomimo it might be better to open up some kind of interface to the yaml loader as public API, and have this source checking in some python script08:43
Frazer61so avoid bst source show ?08:46
benschubertI'd rather go with the `source show`, it's easier to open a cli interface than an API for that08:49
tristanI suggested that we have this discussion on IRC because, with the mailing list, this will be a long drawn out affair08:51
tristanIt may still be, but having some direction to get started on the mailing list would be helpful I think08:51
benschuberttristan: as long as the important part of the discussion is on the ML that's fine :)08:52
tristanYeah of course08:52
tristanI think that Frazer61 is not currently armed with all the context he needs to post a full proposal08:52
tristanOne thing I think we need to consider here are the needs of collect_manifest08:53
benschubertright08:53
tristanI.e., having `bst source show` give us URLs requires some kind of plugin interaction I think, if we do that and don't consider the rest, we've kind of wasted some opportunity08:54
tristanCore access to raw 'config', what are the benefits and drawbacks ?08:54
tristanI think off the bat, it doesn't allow a consumer of the API (either Source API or command line), to know what it is looking at08:54
tristanI.e. there is nothing standard at all about source `config` data, except for the `directory` it will be staged in08:55
tristanThe rest is a free for all, steered by general convention08:55
tristanThe benefit of course is that we could implement something fast and easy, and hand over the problem space to the consumer, I don't think that is ideal08:56
tristanEspecially considering, it will be the same as https://gitlab.com/BuildStream/bst-plugins-experimental/-/issues/208:57
benschubertOn the other hand, project authors have the knowledge on what sources they are using, so might be the ones best placed to decide what to look for?08:57
tristanEven if we report a 'kind' string for each source, the end user doesn't have the right to know what that 'kind' string means08:57
benschubertWhy don't they?08:58
tristanbecause any project can introduce a plugin with any kind string08:58
tristanthe `git` source on your toplevel might be something completely different from the `git` source of a junctioned project08:58
tristanand the junctioned project can change its `git` source later on to mean something different08:58
benschubertSo for example, if I want a linter that ensures only a subset of all 'kind' plugins are used in my project, how would I do that without it?08:59
Frazer61freedesktop-sdk is like that i think08:59
tristanbenschubert, you don't have that ability08:59
benschuberttristan: plugins are defined in a scope, should we scope the `kind` ?08:59
benschuberttristan: I think that might be a common thing asked for projects to be able to lint on08:59
benschubertI know you can't now08:59
benschubertbut that's definitely something we should consider09:00
tristanbenschubert, I've been thinking about extensibility of core plugins, and in that context; I think we need to make use of core plugins explicit (in the project.conf origins) and also implement an `alias` feature to rename plugins whose names might otherwise clash09:00
tristanbenschubert, I think perhaps we could have something like plugin identities at some point, but maybe that is completely orthogonal to providing concise information through `bst source show` and Source python APIs ?09:01
tristanbenschubert, for instance, it might make sense to associate a hosting/url/unique identity with a plugin, but I *think* we can have that conversation separately ?09:02
benschuberttristan: not 100%, in a `bst source show` I would definitely expect to see a kind :)09:02
tristanWe don't show a kind in `bst show` do we ?09:02
coldtomjust as an aside, opening up some interface to the yaml loader would allow replacements for collect_manifest, collect_integration and any other build metadata -> sandbox plugin09:03
tristancoldtom, not in a useful way09:03
tristancoldtom, i.e. not by just exposing raw nodes, because as stated in this backlog, raw nodes are not meaningful09:03
benschuberttristan: I don't know, the way we went with was parsing the yaml ourselves, and forbidding anything that is not basic yaml (not that this is a good solution)09:03
benschubertbut sure, let's not take too long on that point09:04
tristancoldtom, the configuration of a plugin is only meaningful to the plugin parsing it09:04
tristanI think that it's important for me as a user of `bst source show`, to know 2 things primarily... (A) A URI or a list of URIs... (B) Some additional identifying information for each URI reported for that source09:05
benschuberttristan: one thing that I would personally find very useful would be able to get the parsed configuration for the sources09:05
tristanI would not be particularly pleased if I had to have intimate knowledge of Source configurations in order to derive these meaningful aspects09:06
tristanbenschubert, Why ?09:06
tristanAnd what is "the parsed configuration" ?09:06
tristanMaybe we are saying the same thing ?09:06
tristanFor (B) above, I would consider that like a checksum of a tarball URI, or a commit sha of a git URI09:07
tristanSomething uniquely identifying09:07
tristanOf course raw YAML wouldn't give you that anyway, because refs are possibly in project.refs09:07
benschuberttristan: Here are a few things I want to ensure on a project list of sources:09:08
benschubert- every uri-type field is using an alias09:08
benschubert- Only plugins in a given list are allowed09:08
benschubertyeah I don't mean raw yaml09:08
tristanRight :)09:08
coldtomah, i should clarify, when i say the loader, i actually mean "some API to grab Elements and Sources given an element of input"09:08
benschubertbut rather a dump of the source's configuration after parsing09:08
tristanbenschubert, for the list of plugins, I think we need to consider that separately because there is nothing to identify them afaics, but we should keep in mind that that needs to be extensible09:09
tristani.e. for "only given plugins are allowed"09:09
benschuberttristan: I'm fine with that :)09:09
tristanAnother thing I would want to know is cached state of course, like bst show09:10
tristanWhether (A) The source is cached locally, (B) The source has a known ref and can be fetched, (C) The source cannot possibly fetched because there is no ref, and so it needs to be tracked09:11
tristani.e. the Consistency enum state of a Source09:12
benschubertright09:12
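A rough sketch of the three states tristan lists, in the shape of BuildStream's Consistency concept (the member names below are approximations drawn from this discussion, not copied from the source tree):

```python
import enum

class Consistency(enum.Enum):
    INCONSISTENT = 0  # (C) no ref yet: the source must be tracked before it can be fetched
    RESOLVED = 1      # (B) a ref is known, so the source can be fetched
    CACHED = 2        # (A) the source data is already available locally
```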
benschuberthaving something similar to the `bst show --format` (anything that makes sense also in the source context)09:13
tristanOrthogonal tangent we can discard: There is a scenario with remote execution where we might want to consider "cached locally" as "cached on the RE build cloud"... that would affect `bst show` (artifacts) and `bst source show` (sources) equally, so I think it is irrelevant for this conversation09:14
tristanYeah, for the command line we'd want format09:14
tristanWhat bothers me about `bst source show` in general, is how to represent lists09:14
tristanDo we output YAML ?09:15
tristanfor those details ?09:15
benschubertI'd love that, it's so much easier to parse09:15
tristanSources have 0-N URLs09:15
tristan(and as such, refs, although internally a ref might be a list)09:15
benschubertrefs can be anything09:16
benschubertwe have some refs that are json blobs09:16
tristanYes, but I think we want to have a concept of "A unique identifier for this URI" in an outward facing API right ?09:16
tristanSo it would be a Source's responsibility to provide that09:17
tristanEven if it's only for the purpose of `bst source show`, or use in other plugins which want to make queries09:17
tristanThere is also an interesting facet about Source URIs; their URI lists are variable in length09:18
tristanSometimes a Source has 0 or 1 URI, but more URIs after tracking09:18
tristanLike git submodules, or rust crates09:19
benschuberttristan: having `bst source show` ask each plugin to give the info seems a good idea09:19
benschubertso the ref can be formatted nicely, etc09:20
benschubertthat could return a mappingNode, with all the info09:20
tristanRight, maybe the API needs to be a string based query09:20
tristanAh you mean, trust the plugin to format a dictionary in a specific format ?09:21
benschubertlike in some refs, there are some things that are "internal" to the plugin that you don't need to expose09:21
benschubertyep09:21
tristanCan we do away with the trust thing ?09:21
tristanhehe09:21
benschubertwell, yes? Because as you said the only thing that knows what's important is the plugin :)09:21
benschuberthence why we don't want a 'uri' checker in Core09:21
tristanI'd much prefer to have a more controlled API which the core can query sources with, like: urls = Source.get_info(InfoType.URLS)09:22
tristanAnd the core creates YAML (or not) as needed for output on the CLI09:23
tristanget_info() or such could be private _get_info() until we are absolutely sure that Elements in reverse deps should be allowed to see this information09:23
tristanAnd we guarantee a stable format if the `bst source show` user decided they wanted yaml output (maybe default is something like `bst show` one liners ?)09:24
tristanHmmm09:25
tristanSo that's a good point... what about sources which don't ever have a URI, we already have some of those (except they grow URIs after tracking)09:25
tristanDo we need to know that context ?09:25
tristanURIs vs "Discovered URIs" ?09:26
benschubertand then what about things that use `uri` as something else? Like, I could envision the `pip` package to use `uri` for the registry. Do we want this? ;)09:27
tristanbenschubert, in fact, since the core can use Node() APIs to validate fields, the Plugin could return MappingNode and there is no trust issue, I'm not sure if that is more convenient than constructing the mapping node in Core, though09:28
tristanOh, the name of the field in Element/Source configuration is of course irrelevant09:28
tristanonly meaningful data must be reported09:29
tristanAh and yeah, that09:29
tristanThe shape of the data indeed varies wildly09:29
tristana url can be a base, I think it is similar for the crate plugin too09:29
benschubertwell the registry is important and might not be the same between all, though this would not be reported as a `uri`, so what should it be? What I'm trying to say is that it's quite hard to have a consistent non-plugin-dependent representation here09:29
tristanbenschubert, in that case, `pip` and `crate` maybe have a 'primary URI' which is the base URI which never has a ref, and secondary URIs (all of the individual crates) have refs09:30
tristanYeah it is, but I think that we're only interested in commonality, something that can be understood equally for all plugins09:31
tristanIf we want to query for URIs, then what are our reasons to do so ?09:31
tristan  * Policy URI schemes (gnome-build-meta wants this)09:32
benschubertensure they all have an alias set?09:32
tristan  * Ensure internal mirroring of all input data09:32
benschubert(thus you don't want the expanded uri, just the normal one)09:32
tristanbenschubert, I don't think we want that alias thing in this context09:32
benschuberthow do you ensure all uris are aliased then?09:32
tristanbenschubert, We should already have a warning (configurable to fatal ?) if ever an unaliased URI is used right ?09:33
benschubertoh maybe09:33
benschubertyou might be right sorry09:33
tristanIf we don't we should09:33
tristanWe already have all the mirroring URI indirection which should make that possible09:33
tristanPlugins which use hard coded URIs or discover URIs and fail to use the indirections, are buggy (and could equally hide these unaliased URIs from users in a new API)09:34
tristanFrazer61, Taking notes ?09:34
benschuberthttps://gitlab.com/snippets/1992018 btw :)09:34
tristanStill slower ?!?!?!09:35
tristanDamn09:35
tristancrazy09:35
Frazer61tristan yeah, just don't want to interrupt just yet10:35
benschubertmemory's much better though09:35
tristanFrazer61, just checking09:36
tristanSo another reason is to create manifests09:36
tristanreporting on the provenance of build inputs09:37
tristanCertainly not for discrete processing of the outputs or automation of tasks based on the output09:37
tristanValidation, reporting09:38
tristanChecking that internal mirrors have everything we are using, may be an automatable task, but it can't be realistically achieved without the assistance of BuildStream itself09:39
tristanBecause you can't have knowledge of the meaning of refs09:39
tristanYou can only know that they are unique for a given URL09:39
tristanI think that we are essentially on the same page that (A) The source reports concise data about URIs and refs  (B) A source can have 0-N URIs  (C) Some URIs are discovered during processing (tracking)09:41
tristanAnd I think that we can enforce that a unique string can be reported for each URI (internally based on "ref")09:42
tristanEven if it's not entirely true ?09:42
tristanI *think* we don't need to distinguish between primary URIs (somehow encoded into configuration) and discovered URIs09:43
tristanWe will also need extensibility to support plugin unique identities09:44
benschubertI think that's good enough yeah09:44
tristanFor extensibility, we'll need to define what happens when a Source is asked for something which it has not been implemented to report09:45
tristanFor a CLI feature, the user probably needs to decide09:46
tristanI guess that in this sense, we could start with a Source.list_uris() and Source.get_display_ref(uri) API, and add additional APIs for reporting other things later on09:46
* benschubert is worried about the number of APIs that might end up being required09:47
tristanIn that way we handle an UnimplementedError() or such09:47
benschubertbut I guess we don't really have a choice09:47
tristanYeah I don't like it either09:47
tristanThen again, I'm less concerned about Source plugins - they are high effort and potentially fewer09:48
benschubertright09:48
benschubertand the base ones will probably be well maintained (for the most used ones)09:49
tristanYup09:49
tristanNow if people can stop being annoying and implementing weird stuff like rust, JS and python09:50
tristanWe won't have to deal with these whacky concepts of language-centric distribution vectors09:50
benschubertI feel you :)09:50
tristanWe can only pray :)09:50
benschubertpyproject.toml with pep517 is quite fun to go around09:50
benschubertat least gentoo had a great shot at it and it's now manageable in a sandbox09:51
* tristan is genuinely surprised by this latest benchmark :-S09:52
tristanThe code is actually running consistently faster with the variables test09:52
tristanbut still 0.5 seconds behind on 6K debian elements :-S09:52
tristanWhich means, something about repetition, we are doing the *same thing* many times and should be able to not do it09:53
tristanAha !09:53
tristanno, not really aha :-S09:54
tristanThe old code had different hashing, but still per-Variables-instance tables09:54
Frazer61so are we at agreement that we will store the URIs or whole URL in a manifest, and bst source show will just read the manifest?10:01
tristanFrazer61, not really no10:02
tristanFrazer61, We are in agreement that we first need a Source plugin addition for reporting a list of URIs and a corresponding list of refs, both as simple strings10:03
tristanAbove I suggested get_uris() and get_display_ref(uri), but maybe it's simpler to have a single API which returns two lists10:04
tristanThat part we can refine in the proposal10:04
tristanFrazer61, using that new API, we can implement `bst source show` in such a way that it can use the `--format` argument (similar to `bst show`) to display (A) cached state, (B) URIs (C) refs for a given source10:07
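A minimal sketch of the two plugin-facing methods floated above; nothing here exists yet, and the names get_uris() and get_display_ref() are just the ones proposed in this conversation:

```python
# Hypothetical additions to the Source plugin API, per the discussion above.
class Source:

    def get_uris(self):
        """Report the list of URIs this source refers to.

        May be empty, and may grow after tracking (e.g. git submodules, crates).
        """
        raise NotImplementedError("this Source does not report URIs")

    def get_display_ref(self, uri):
        """Report a unique, human-readable ref string for the given URI,
        e.g. a commit sha for git or a checksum for a tarball."""
        raise NotImplementedError("this Source does not report refs")
```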
Frazer61i would rather have 2 functions, rather than 1 function returning 2 data lists, but thats just my personal preference10:08
tristanFrazer61, some additional thinking needs to go into how the `bst source show` command works, specifically we need to consider how to identify sources of a plugin10:08
tristanFrazer61, one function might be easier to implement - remember that in this case, we have few calling instances vs many implementors (and the priority is on ease and clarity for implementors, not callers)10:09
tristanThe core we can always control, plugins are for people to implement at will10:09
tristanFrazer61, from here, it would be great if you could collect the general output of this and try to fill in any gaps you can see (like how to address/list per element sources when calling `bst source show foo.bst`)10:10
tristanFrazer61, and provide a writeup for the ML, I would expect a back-and-forth on the ML but not too much10:11
tristanWith this, collect_manifest could safely be transformed into a shell script that calls `bst show` and `bst source show`10:12
tristanor maybe a python script which does the same10:12
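For illustration, such a script could look roughly like this, assuming a future `bst source show` that accepts `--deps` and a `--format` string with %{name}/%{uris}/%{refs} keys; none of that exists yet:

```python
import subprocess

# Hypothetical manifest collector; the `bst source show` command and its
# format keys are assumptions from this discussion, not an existing CLI.
def collect_manifest(target):
    result = subprocess.run(
        ["bst", "source", "show", "--deps", "all", target,
         "--format", "%{name}: %{uris} %{refs}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()

if __name__ == "__main__":
    for line in collect_manifest("gnome.bst"):
        print(line)
```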
Frazer61ok, I'll do some experimenting10:13
tristan(and of course gnome-build-meta could parse `bst source show --deps all gnome.bst --format '%{uris}'` or similar to easily implement a git accept hook)10:13
*** tristan has quit IRC10:17
*** tristan has joined #buildstream10:58
*** ChanServ sets mode: +o tristan10:58
douglaswinshipCan anyone help me with these CI failures? They all happened fairly close together, and they all relied on a freedesktop-sdk branch that I had just rebased to include a bunch of extra commits from master, so I thought they might be connected. But they all look completely different.11:23
douglaswinshipI think maybe some of them are just runner issues, but I haven't seen them before.11:23
douglaswinshipThe easiest ones are these two: https://gitlab.com/freedesktop-sdk/freedesktop-sdk/-/jobs/617168643 & https://gitlab.com/freedesktop-sdk/freedesktop-sdk/-/jobs/61716881211:23
douglaswinshipThe error message there is "[00:00:40] FAILURE components/sdl2-ttf.bst: Failed to pull artifact with status UNAVAILABLE: failed to connect to all addresses".11:24
douglaswinshipDoes that just mean the remote caches went down / couldn't be reached, and I should just re-run the job?11:24
douglaswinshipI'm just surprised that that causes a build failure. I thought caches were there to make the build run faster, but I didn't think they were essential for a build.11:24
douglaswinship(There are a few more, that look different, but I don't want to completely spam the channel.)11:24
coldtomyep that's caused by cache issues douglaswinship11:27
coldtomimo that should not cause build failures though11:28
douglaswinshipNext question: Any ideas on this one? https://gitlab.com/freedesktop-sdk/freedesktop-sdk/-/jobs/61651057211:31
douglaswinshipcoldtom: and thanks. Will re-run11:31
douglaswinshipre the last link, the first issue I see is "make: *** [Makefile:968: all] Error 2", and I don't see anything before that that suggests any problem to me. So I'm not sure where to start.11:33
coldtomnot sure on that one, it's possible the error happened further up the build log and it's not caught there, the full log might be more informative11:44
douglaswinshipAh yes, this might be more relevant: "make[2]: *** No rule to make target 'cpp.texi', needed by 'doc/cpp.info'.  Stop."12:15
tristanbenschubert, so I'm not sure how wise it will be to dig down as far as the strings, but I've got a working theory that the code is slowed down by my earlier theory that we should be interning the variable names13:02
benschuberttristan: do you mean that if we stop interning it will be better?13:02
tristanbenschubert, locally I've run the variables benchmark a bunch of times with different configurations, and it would seem that interning the values ("%{foo} and %{bar}" strings) but not the variable names is better13:03
tristanIt has to do with frequency of calling sys.intern() vs using those comparisons13:03
benschubertyeah fair point, I can re-benchmark a version without interning if you want :)13:04
tristanUpon entry into Variables instantiation, interning all the variable strings every time has an expense13:04
tristanI pushed one13:04
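Roughly the trade-off being measured here, with made-up strings: sys.intern() costs a hash lookup on every Variables() construction, in exchange for cheaper identity comparisons later.

```python
import sys

# After interning, equal strings are the same object, so later comparisons
# are pointer checks -- but every sys.intern() call costs a hash lookup.
name_a = sys.intern("install-root")
name_b = sys.intern("install-root")
assert name_a is name_b            # identity comparison, effectively O(1)

# Strings built at runtime are usually distinct objects and compare
# character by character -- no up-front intern cost, dearer per compare.
value_a = "%{prefix}" + "/bin"
value_b = "%{prefix}/bin"
assert value_a == value_b          # equality holds, identity need not
```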
* benschubert should set up an on-demand benchmarking system13:04
benschubertok let me start a benchmark13:04
tristanThanks a lot for being there throughout :)13:04
tristanI've been giving consideration to deeper intimate string manipulation, but I worry a bit13:05
tristanPython has dragons in their strings it seems13:05
tristanit's not as straight forward as "everything is UTF-8"13:05
benschubertyeah I would really prefer not going that deep13:05
benschubertHappy to help, seems I have quite an easy way of benchmarking, so if it can make buildstream faster, it's not a big cost :D13:06
tristanWe have some help from https://github.com/cython/cython/blob/master/Cython/Includes/cpython/unicode.pxd13:06
tristanBut if we do strings, I think it's safer to use straight C API from Python.h13:06
tristanBecause that lets us go from str -> explicitly utf8 string -> str13:07
tristanWhereas the cython wrapper gives you ` char* PyUnicode_AS_DATA(object o)`... (whatever "DATA" is: fend for yourself)13:07
benschubertbenchmark running!13:09
tristan(with `cdef extern from "Python.h"` we can have `const char* PyUnicode_AsUTF8(object unicode)` and the opposite)13:09
tristanI was thinking of trying the "".join() in C and tracking the length of each string etc13:10
tristanFor the final resolutions13:10
tristannot sure we gain anything13:10
tristanThe python `dict` is hard to kick too, if we went full string it would be better to have a C hash table13:10
benschubertWhat I'm wondering now is if it's really worth going that far? The idea was to make the code more readable and understandable (+the lazy resolution for junctions), and I wonder if we are not making it worse if we go too far down that road?13:11
tristan:(13:12
benschubertI mean, if it means that we have a blazing fast variable resolution I'm happy with it13:12
tristanI know, its terrible, I don't understand13:12
tristanWhat was there before was confusing lists, plus manually raising RecursionError(), not supporting more than a hard coded limit of recursion13:12
benschubertyeah before is clearly not great13:13
benschubertso clean well documented C/Cython code is still probably better13:13
benschubertI just would really like us not to ever have a segfault because of that :)13:13
benschuberton the other hand, with enough tests, it would probably be good13:13
tristanYeah I wouldn't worry much about that13:13
tristanThere's not huge opportunity for segfaulting here, except when you introduce code changes initially13:14
benschubertright13:14
benschubertand the current implementation actually could segfault in very specific cases13:14
tristanWell, it will python stack trace when reaching the limits of recursion, I'm not that familiar anymore with the current implementation heh13:15
benschubertthe check is not actually perfect and if you are already deep in a stack you can actually make the code segfault :) but that's a very weird setup required13:17
benschubertso would not happen during normal usage13:17
benschubertbut I think you are right, if we are careful with C and it actually gives us easier to understand code and allows us to squeeze a good chunk of performance here, it's worth it13:18
tristanOh yeah13:18
tristanOf course13:18
tristanThat recursion check is not even accurate13:18
benschubertit's a "best effort" check :'D13:18
tristanYou'd have to hit the glibc stack limit13:18
tristanI guess13:18
benschubertyep13:18
tristanbest effort could at least observe Python->internal_stack_depth_counter or smth on entry ;-)13:19
tristanand still possibly be wrong13:19
benschubertthus your rewrite being much better even if we get more C code13:19
* tristan does really like it better....13:20
tristanbut I'm obviously biased :)13:20
benschubertI'll like it better once it's faster :D13:20
tristanreally upsetting the performance issues though13:20
* tristan could add pointer array13:22
tristanbenschubert, maybe give a look at the main resolve loop https://gitlab.com/BuildStream/buildstream/-/blob/tristan/partial-variables/src/buildstream/_variables.pyx#L258, let me know if you spot anything ?13:23
tristanObvious things include, I could store deps in a pointer array (possibly realloc() its length to double its size any time I add a new item), instead of a python list13:23
tristanEven start out pointer arrays with enough space for say 8 values, possibly hitting realloc and then 16, 32, etc13:24
tristan8 * sizeof(PyObject *)13:24
tristanProbably it will very rarely need to realloc13:25
tristanbut the algorithm, checking circular refs and collecting the ordered list of dependencies in the same loop, appears to be pretty solid to me13:26
tristanI could be missing something obvious I don't know13:26
benschuberttristan: mmh that seems a lot of context switch right now, might have the brain more free this weekend if you don't mind?13:26
tristanDon't worry about it :)13:26
tristanThe moral support and benchmarking is good enough, don't let me disturb you too much if you're on a roll :)13:26
douglaswinshipjuergbi: You were talking to WSalmon on tuesday about buildbox-casd, and it still having some memory issues even under 0.0.9?13:33
douglaswinshipI have an example of a build failure in CI. The casd log is here https://gitlab.com/libreml/libreml/-/jobs/618055732/artifacts/browse/cache/buildstream/logs/_casd/13:33
benschuberttristan: I'll have a look this weekend if you want, just that for today it would be too much of a context switch :)13:33
douglaswinship"   what():  OutOfSpaceException exception thrown at [buildboxcasd_lrulocalcas.cpp:325], errMsg = "Insufficient storage quota"   "13:33
WSalmondouglaswinship, interesting13:35
WSalmoncphang how much disk/RAM dose that runner have?13:35
tristanbenschubert, I'm sure you will eventually review it anyway, and if it's not really ready for review by this weekend and performing acceptably, then I'll appreciate you looking at it by then too :)13:35
juergbidouglaswinship: could the out of space/quota error be correct, i.e., you either don't have enough disk space or quota configured too low to cover a whole bst session?13:36
benschuberttristan: seems good!13:36
juergbidouglaswinship: it is supposed to handle that case with a better error message (on the BuildStream side), however, it would be interesting to know whether the error is bogus or whether it's just an issue of not properly handling/forwarding this error13:36
*** cphang has joined #buildstream13:37
WSalmonjuergbi, as in it might be crashing while doing disk management or it might just be the issue and we should report it better?13:37
juergbirunning out of space should result in a termination of the build session but with a reasonable error message13:38
WSalmon+113:38
WSalmoni seemed to rember it doing so at one point13:38
WSalmonbut not 100%13:38
cphangWSalmon https://gitlab.com/libreml/infrastructure/infrastructure/-/blob/master/ansible/roles/gitlab-runner/defaults/main.yml#L44 50GB13:38
juergbiwe start casd with --protect-session-blobs, which prevents casd from expiring any blobs created or accessed in the current bst session13:38
juergbias buildstream can't deal with blobs suddenly vanishing mid-session13:39
juergbihowever, casd is supposed to return a suitable grpc error in that case, which buildstream should be handling13:39
juergbiand I think we have a test case for this13:39
WSalmonyep i thought at one point it did say that in some cases but i might be thinking of bst1.413:39
juergbiwe have test_artifact_too_large to verify this13:40
juergbihowever, we may hit this in different code paths and might not handle it properly in all of them13:41
juergbi(in bst or casd)13:41
WSalmondouglaswinship, that job looked like it had built a LOT of stuff, so a simple restart should be enough13:41
WSalmonbut thanks for bringing this case to our attention13:41
douglaswinshipSo if I'm understanding this right: If buildbox had responded perfectly, then the CI job would still have failed, but buildbox would have given a more helpful error message? (ie buildbox-casd isn't the actual reason this pipeline failed?)13:42
WSalmoncan you document this in a issue, just some links and a copy and past of the error message, the title should be something like, report filling cas in one run better etc13:42
douglaswinshipor is it that we're not sure either way?13:42
WSalmondouglaswinship, it should give a more helpful message in any case so we should have an issue there but there may be more at play13:43
juergbiat first glance it seems primarily like a buildbox-casd bug13:43
juergbiit shouldn't terminate13:43
douglaswinshipdang. That's frustrating.13:44
WSalmonbut 50G is not a lot for a big rebuild..13:44
WSalmonso it might be a genuine issue13:44
WSalmonbut not being handled well13:45
cphangWSalmon agreed, and a supporting argument for this is that the aarch64 jobs succeeded13:45
WSalmoncphang, have you rebuilt everything including FD on those runners before?13:45
cphangWSalmon yeh I'm pretty sure, we've handled changes to bootstrap import before w/o cache13:46
WSalmonummm, interesting13:46
douglaswinshipWSalmon: should I be raising this as an issue in buildstream, or the buildbox-casd repo? (Or both?)13:46
WSalmonat least buildstream, as it should be reporting things better13:46
cphangAs a temporary workaround douglaswinship maybe I'd suggest bumping up the size of https://gitlab.com/libreml/infrastructure/infrastructure/-/blob/master/ansible/roles/gitlab-runner/defaults/main.yml#L44 to 128GB, I'm happy to approve that and I'll restart the gitlab runner (this will probably be end of day).13:46
WSalmonjuergbi, do you want us to raise something on buildbox-casd about this too?13:47
WSalmonsounds like it should have been handled more gracefully?13:47
juergbiyes, should be fixed in buildbox-casd13:48
WSalmondouglaswinship, ^13:49
douglaswinshipWSalmon, juergbi: thanks :)13:53
*** tristan has quit IRC14:03
*** traveltissues has joined #buildstream15:00
*** tristan has joined #buildstream16:00
*** ChanServ sets mode: +o tristan16:00
tristanbenschubert, any word on the internless run ?16:00
benschuberttristan: woops forgot about that, let me see16:01
tristan.... drumroll ...16:01
benschuberthttps://gitlab.com/snippets/1992018#note_372673299 better but not quite there :)16:02
tristan.25 seconds behind16:02
tristanOk next time with arrays16:03
tristanI also noticed that I'm not doing the optimization of making sure value dependencies are unique (since they are not interned anymore, the identity comparison is worthless), that means a loop per ValuePart per ValueClass, but as those are cached, they won't be meaningful really16:05
* tristan can either remove that unique check since "%{foo} and %{foo} and %{foo}" is a rare value expression16:05
tristanor I can make it a string compare, but probably just ignore and leave the redundancy in16:06
benschuberttristan: possible yeah :)16:35
Frazer61hi, does anyone know the best way to use the src/buildstream/testing/_sourcetests/project to experiment with the master bst?16:39
* tristan has ObjectArray, but realizes he should have PointerArray with a free_func & element_size17:04
tristanAt least with ObjectArray I now have the deps list in resolve() as an array17:05
tristanPointerArray would let me change the ValueParts to aligned chunks in an array17:05
tristanFrazer61, I think that those tests are a more difficult angle to work from17:06
tristanFrazer61, most of the test cases in tests/ for instance, you can just run BuildStream on them and they work (for the majority which are statically committed tests)17:07
tristanOr, the integration tests17:07
tristanFrazer61, What I'm doing, which might not be considered "best", is to run `tox -e venv /bin/bash` to run BuildStream17:08
tristanI exit the shell and re-enter every time I make code changes17:08
Frazer61i assume not but does it matter if i use zsh instead of bash, assuming /bin/bash overrides with bash17:09
tristanIt doesn't matter17:09
Frazer61ok, thanks17:09
tristanIn zsh, there is no completion17:09
tristanIn bash, the awesome completion is broken, cause douglaswinship still didn't submit that patch which fixes it17:10
tristanThe one I know he has hiding in his workspace and keeping all to himself17:10
tristanbenschubert, you're probably dropping off soon... maybe run one more benchmark before you go ?17:12
*** santi has quit IRC17:12
tristanAnd let me know what the result is when you get back online tomorrow ?17:12
benschuberttristan: Sure, I'll run that now17:12
benschubertperfect timing, since I'll need my desktop soon :D17:13
Frazer61thanks tristan17:13
*** santi has joined #buildstream17:13
tristanGreat, _resolve() is looking very white and beige now hehe17:14
tristanin fact, the only yellow line is _resolve() itself17:14
tristan`__Pyx_INCREF(Py_None);` <-- that looks to me a little dumb to be generating :-S17:20
tristanYou'd think that one could rely on `None` existing throughout the course of a python session17:20
benschubertTrue :)17:20
* tristan wonders how many python bugs revolve around imbalanced refcounts on None ;-P17:21
tristanThose, along side threading issues, are like the most difficult things to debug17:21
* tristan should write a little loop which takes valentind's variables testing element and multiplies it N times, each time dumping it as a new element which depends on the last17:26
tristanThen I could have a test more similar17:26
tristanRunning `bst show` on that small test gives quite erratic results17:26
tristanAlso, how come a yaml cache doesn't get created, is that obsolete ?17:28
tristanbenschubert, in your benchmarks, you have show-cached (I presume it's run a second time without removing the cache), there used to be some sort of extra file used to assist the yaml load, is that it ?17:29
* tristan thinks that was removed in favor of just yaml code that works better all the time17:29
benschuberttristan: no more yaml cache as it was slower yep :)17:29
tristanOk yeah, that's good17:30
benschubertNo `show cached` means that you have all your artifacts cached locally17:30
tristanAhh17:30
benschubertwhich is impacted by casd doing the checks :)17:30
benschubertIt's basically:17:30
benschubertFrom an empty state -> show -> source fetch -> show-fetched -> build -> show-cached17:30
tristanRight, I've noticed that the difference between my branch and master is not affected (it's about the same difference all the time)17:31
tristanlast time .25 seconds in both scenarios17:31
benschubertwhich is expected since you are not touching anything around the 'caching'17:31
tristancython is pretty smart17:33
tristanif I do `foo = PyMem_Malloc(...)` and then `if not foo: ... raise MemoryError()`17:34
tristanThen cython does `if (unlikely(__pyx_t_1)) {`17:34
tristanfor the generated `not` clause17:35
tristanSo it's got some idea/knowledge about the reason why I'm checking my return (perhaps a bit spooky)17:35
tristanOr rather, probably PyMem_Malloc() has some annotation to it, indicating that it shouldn't be returning NULL17:36
benschubertThat's also possible? I'm not familiar enough with the internals of python17:37
* tristan finds this area particularly fun, I like this level of the stack17:37
tristanAs a glib/GObject guy, I've always enjoyed Obj-C17:38
tristanPython appears very similar under the thin hood17:38
benschuberttristan: One thing I'm very happy about is having someone else looking into all the cython/optimized part of BuildStream, we're going to be able to do much better :)17:40
tristanCareful hehe17:40
tristanIf I get excited, there may no longer be an excuse to keep the python around, and it will all just meld into C17:40
tristanheh17:40
tristanlink the interpreter and invoke it for plugins17:41
benschuberttristan: I think a "C" layer for performance-critical stuff, a "cython" layer for performance stuff but not that critical and "python" for the 90% of hte codebase would probably be a good thing :)17:42
*** dftxbs3e has quit IRC17:42
benschubertBut I'd be careful at bringing in "too much" C, as it's potentially harder to involve new people17:42
tristanThat 90% quickly becomes 10%17:42
tristanRight I see what you mean17:43
*** dftxbs3e has joined #buildstream17:43
tristanI kind of feel like ideally, the high level business logic belongs in a language like python where it's hard to get things wrong17:43
benschubertYep, and the "critical" paths can be optimized in Cython/C when we need17:44
tristanthe more critical stuff in C would be good, it also kind of scares away people from touching critical stuff17:44
tristanWhich... I guess has drawbacks as well as bonuses17:44
benschubertyep17:45
*** hasebastian has quit IRC17:45
benschubertwhich is why I initially went with Cython for the initial optimizations, since it's still very close to python and harder to shoot yourself in the foot17:45
benschuberttristan: actually speaking of C, what tool would you recommend to track what's starting and stopping threads? I've still got 4 threads lingering when I run tests in tox :/ (not pytest itself funnily enough)17:48
tristanif this flies, I should put the PointerArray in a different file, maybe we can convert the loader code to use C arrays for dependencies and such17:49
benschubertwe could17:49
benschubertI had a general idea of trying to dig more into the "c" interface of ruamel and work more in Cython/C than python there17:49
benschubertand so the loader could use this17:50
tristanYeah, I think we just need a real C implementation of round tripping YAML17:50
tristanIt really makes no sense to have this in python17:50
benschuberttristan: actually...17:50
benschubertWe keep the location in the parser right?17:50
tristanruamel.yaml is in a sad state too, and fails to preserve formatting when round tripping17:50
tristanThe "location" ?17:51
benschubertSo we could use a fast C (libyaml?) parser for the "base case" and when we need to write, we could reload the file itself with Ruamel and find the right thing to overwrite no?17:51
benschubertIt would be more work17:51
benschubertbut might give us the best of both worlds17:51
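A rough sketch of that idea, assuming PyYAML's libyaml-backed CSafeLoader for the hot read path and ruamel.yaml only for the occasional write-back; the file structure touched in write_back_ref() is invented for illustration:

```python
import yaml                   # PyYAML; CSafeLoader is only present if libyaml is available
from ruamel.yaml import YAML  # round-tripping parser, used only on the write path

def fast_load(path):
    # Hot path: parse with the C loader, keeping no comments or formatting.
    with open(path) as f:
        return yaml.load(f, Loader=yaml.CSafeLoader)

def write_back_ref(path, new_ref):
    # Cold path: reload with ruamel.yaml to preserve comments and formatting,
    # overwrite just the value we care about, then dump the file again.
    rt = YAML()
    with open(path) as f:
        data = rt.load(f)
    data["sources"][0]["ref"] = new_ref   # invented element layout, for illustration
    with open(path, "w") as f:
        rt.dump(data, f)
```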
tristanHmmm17:51
benschubertuntil we decide we're too fed up and re-write a yaml parser :D17:51
tristanSounds like a lot of work and residual maintenance just to avoid a proper rewrite17:52
benschuberttristan: rewriting ruamel in cython/C ? with only the things we'd need?17:52
cphangtristan benschubert would something like https://github.com/tlsa/libcyaml be useful?17:53
tristanThat's another option, I was thinking it really should be a pure C round tripping YAML implementation17:53
tristancphang, does it roundtrip ?17:53
tristanI don't think so17:53
benschubertyeah, not sure what's better than https://github.com/yaml/libyaml there? Do you know cphang ?17:54
tristanlooking at that page, looks like it does load/save, but not preservation of line/column/whitespace/comments etc17:54
tristanruamel.yaml even manages to lose some comments completely when you put them in the "wrong" places17:55
benschubertyeah :/ and it being on sourceforge doesn't help for contributing17:56
tristanis it on sourceforge ?!17:56
tristanjaysus I didn't know the yaml format itself was even that old17:57
benschubertit moved from bitbucket to sourceforge17:57
benschubertbecause bitbucket dropped mercurial support17:57
* tristan wonders if they upgraded to CVS !17:58
tristanEverything in CVS good ol' days are back !17:58
benschubertNah, they just dropped to only support git17:59
benschubertand this project preferred moving to sourceforge rather than move to git17:59
tristanI meant ruamel.yaml moving to sourceforge so they could... use CVS but... yeah same difference I guess17:59
tristanWell, it does remind me of the CVS days, when you really needed to have a "Configuration Manager" for every dev team18:00
cphangbenschubert tristan the main value of libcyaml is that you can define schemas to parse the yaml and map them into structs https://github.com/tlsa/libcyaml/blob/master/docs/guide.md to my knowledge libcyaml uses libyaml under the hood.18:00
tristanCause developers and VCSes are like oil and water18:00
tristanyeah doesn't seem to fit the bill18:01
benschubertcphang: we don't really have a schema, since each plugin has different stuff, so might not be the best fit, but thanks for showing :)18:08
benschuberttristan: you might want to revert your last change: https://gitlab.com/snippets/199201818:08
benschubert*https://gitlab.com/snippets/1992018#note_372759175 sorry18:08
tristanCrap18:09
tristanAnd the change is (A) Use a more generic array implementation (instead of PyObject ** and Py_ssize_t on ValueClass, we use ObjectArray which has a function call for appending values)18:15
tristanAnd change a `list` for an `ObjectArray *`, oh also the new ObjectArray holds references to the objects18:15
tristanbut that should be an atomic operation18:16
tristansurely assigning `(PyObject **)array[idx] = <PyObject>value && Py_INCREF(value)` must be cheaper than list.append(value)18:17
benschubertWhat's the C code generated for both?18:17
tristannot easy to say, because using a list calls into python18:18
tristanand iterating it/popping it also18:18
tristanwhile iterating over a pointer array should be lightning fast18:19
benschubertright18:19
tristanwe don't technically need the ref/unref statements, I just thought they would be safer (in this case, we know these strings are already owned by other means)18:20
tristanI'll try to cook up a more intensive local benchmark here tomorrow18:22
tristanIn any case, this particular run is telling18:22
tristanIt shows that the tiniest things are really having a huge impact on the result18:22
tristanWell, huge, is a vast overstatement18:22
tristanWe are talking about .25 of a second for (project.conf variables) * 600018:23
tristanstill it's a visible difference18:23
benschubertstill around 5-ish% of the run yeah :)18:23
benschubertAnd this is a "simple" 6k elements project18:24
benschubertso for more realistic examples it will probably be worse :/18:24
tristanI'm not sure of that18:24
benschubertyeah that's actually true, cas would be taking a bigger part for bigger projects18:25
tristanbenschubert, the more realistic and complex a project gets, the more composition will occur; but variable resolution will still happen once flatly for each element18:25
tristanIf you have a variety of element kinds and custom variable assignments in elements, this won't affect resolution (much) is my point18:25
tristanyou might end up having a wider value cache18:25
tristan(cache of regex'ed "%{foo} and %{bar}" strings)18:26
tristanthose will indeed be more expensive18:26
benschubertyeah, but the time will be mostly taken by casd looking if things are in cas right?18:27
benschubertwell, I'm sure of that actually ;)18:27
tristanSure but that's another aspect18:27
tristanI don't think that technically a larger artifact should take a longer time to verify if it is cached, though18:28
tristanevery artifact should take the same time to check18:28
benschubertOk, should we remove a part of the "optimizations" that aren't really optimizations and just ensure the code is the easiest to maintain right now? And we can revisit later maybe?18:28
benschuberttristan: would it? You'd need to check each blob18:28
tristanbenschubert, I am using the word "should" here :)18:29
tristanI don't know what it is18:29
tristanBut definitely, it should18:29
benschubertwe'd need an indexed cas for that (and thus know what's stored), I don't think casd supports that yet? juergbi will know more about this subject than me though18:30
tristanCache expiry is certainly hard to get right but, we shouldn't have to verify all the blobs related to a ref just to determine that the ref is valid18:30
tristanAt worst, we'd find out when it's time to build something that we were unable to store all the elements required for the build in the local cache size18:30
tristanand have a late stage failure18:31
juergbiBuildStream can't handle that18:31
juergbiso we have to check up front18:31
juergbiwe obviously don't want the whole build session to fail if a blob is missing18:31
tristanWe can bail out and say "sorry you wasted your time, next time get a larger disk"18:31
juergbiwe already do this if the cache is not large enough for the current session18:32
juergbihowever, at the beginning of the session we check existing artifacts whether they are completely available18:32
juergbiseparate scenario18:32
tristanjuergbi, what's the worst that can happen though ? You wouldn't need to rebuild everything in a subsequent session, you'd have to increase your `quota:` in your config, give it some more space, and rebuild18:32
tristanand then you will only build the missing blob and continue from there18:32
juergbifor that we need to know that the blob is missing18:33
benschuberttristan: the problem is if half an artifact is missing18:33
tristanjuergbi, Right, I'm guessing that check is overkill right ? As soon as we remove blobs shouldn't we be removing refs ?18:33
juergbiwhich means that we have to check each blob for each artifact18:33
juergbino, that's what we used to do18:33
juergbiwhich resulted in expiry taking 24h18:33
* benschubert remembers this with dread18:34
juergbiso we switched to the blob-based expiry18:34
benschubertit was more like 72 hours but hey :)18:34
cphangthis feels like a referential integrity problem..18:34
juergbiwho's counting ;)18:34
tristanyeah18:34
tristanCould this not be solvable with like, sqlite ?18:34
juergbiwe could probably do something more fancy if we move to a proper database18:34
* tristan beat you to it18:34
tristanheh18:34
juergbihehe18:34
juergbicreating an index in a database could certainly help18:35
juergbibut it's also not trivial18:35
juergbiand we also don't want to overcomplicate casd18:35
juergbi(although a simple database would likely make sense even if we keep blob-based expiry)18:35
cphangBut  you *only* need to guarantee that referential integrity over the duration of a BuildStream invocation.18:35
tristanIn that scenario, we could run a GC of refs at the beginning of a session and hopefully that GC is less costly when there are no refs to cleanup18:35
benschubertcphang: you'd need to build it each start then, which takes some time18:35
benschubertcphang: easily multiple minutes with large-ish projects18:36
juergbifollowing all references can take a very long time18:36
juergbibtw: the buildgrid cas server now has an indexed cas with expiry support18:36
* cphang feels like Bazel has faced similar problems in the past..18:36
tristanbenschubert, every time ? or only when blobs have expired ?18:36
benschuberttristan: not sure I understand the question?18:37
juergbiI think keeping blob-based expiry is the right approach, however, we can hopefully speed up the blob checks with a database index18:37
benschubertthat's what I would be thinking18:38
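A hedged sketch of what such an index might look like; nothing like this exists in buildbox-casd today, and the table and column names are invented:

```python
import sqlite3

# One row per CAS blob, keyed on its digest, so "which of these blobs are
# missing?" becomes an indexed lookup rather than a filesystem check each.
conn = sqlite3.connect("cas-index.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS blobs ("
    " digest TEXT PRIMARY KEY,"       # content hash of the blob
    " size_bytes INTEGER NOT NULL,"
    " last_used REAL NOT NULL)"       # timestamp, for LRU-style expiry decisions
)

def missing_blobs(digests):
    # Return the subset of digests that are not present in the local index.
    if not digests:
        return []
    placeholders = ",".join("?" for _ in digests)
    present = {row[0] for row in conn.execute(
        "SELECT digest FROM blobs WHERE digest IN (%s)" % placeholders, list(digests))}
    return [d for d in digests if d not in present]
```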
tristanKnowing benschubert's definition of "large-ish" I think I would be happy to spend a couple of minutes at `bst build` launch time, every time the last build resulted in expiry of blobs18:38
cphangjuergbi buildbarn has an ADR basically on this issue: https://github.com/buildbarn/bb-adrs/blob/master/0002-storage.md, to satisfy the guarantee that CAS blobs cannot expire over the invocation of a Bazel build18:38
tristanbenschubert, I mean if nothing expired last run, it should return right away18:38
tristanno refs to GC18:38
benschuberttristan: a couple of minutes at each start is prohibitive, and things expire quickly, I'm regularly working with 2TB+ of cache :)18:38
juergbicphang: we have a protection of session blobs in casd as well. the issue we're talking about is integrity across sessions18:39
benschuberteven casd cleanup when going over takes north of 15 minutes18:39
benschubertanyways, I'll drop off now :) have a good evening/morning/night/other !18:39
juergbitristan: also, with support for partial CAS (e.g. intentionally only storing the directory objects without the files locally) reference-based expiry probably doesn't make sense18:40
juergbithis can be very useful with remote execution where the client doesn't need the intermediate blobs18:40
persiaI think object lifecycle management is a known-hard problem.  There are some reasonable examples in the HBase sources, but I've definitely encountered people who complained about near-quota behaviours causing dependency-production-and-purge cycles preventing query resolution.18:41
persiaDon't fear using an actual database for it: maybe sqlite, or maybe some custom columnar store.18:41
tristanindexing the blobs for improved "check if it's present" can work too18:41
tristanI guess the shape of data plays into this too18:41
juergbiyes, likely18:42
tristanWhen people have an artifact with many files, they are doing it wrong already18:42
juergbidon't forget sources. they are also in CAS18:42
juergbialthough, they don't have to be checked that frequently, I suppose18:42
tristansome few artifacts (compositions) might have many files18:42
benschuberttristan: I fear we have a different baseline for what "many" files are :'D18:43
tristanbenschubert, I doubt it... I mean, having build elements with huge amounts of files in the artifacts, is akin to writing huge programs in one big file.c18:43
tristancommon sense dictates you want things to be modular18:44
persiaI can think of many reasons to create artifacts that are not compositions that are hundreds of files, and even more reasons to create compositions with thousands of files which are then themselves components in larger compositions.18:44
tristanData18:44
tristanNot code, indeed18:44
persiaI hadn't conceived of data.  My reasons are entirely related to code being compiled into things.18:45
tristanBut even then, if you have language packs for instance, you'd still want to break them into per language artifacts18:45
tristanWell, then we might again have a different definition of "huge", I think that a webkit or firefox installation meets the criteria of "huge" when it comes to software builds18:46
juergbilong build time and lots of files in the artifact are fairly independent18:47
persiaIf you have a debian-based system, it might be worth querying which packages you have installed, and running `dpkg -L <pkgname> | wc -l` against each of them.18:47
persiaYou may be surprised about how many end up with a very large number of files.18:47
juergbialthough e.g. libreoffice covers both. almost 33k files for a full un-split build here18:48
persiaYep.  Chromium is another fun one :)18:48
tristanholy crap18:48
persiaYep.18:48
juergbimost are IDL and documentation, though18:48
tristanSee, but we are talking about monolithic cathedrals which I think everyone can agree needs to go through the xorg exercise18:48
tristanwhich was once a similar breed, but sensibly became modular18:49
juergbimaybe they should but we have to deal with real projects18:49
juergbialso, spreading all these files out to multiple artifacts in a single pipeline doesn't reduce the total number of blob checks18:50
tristanYeah, I'm just saying that if the indexing is slow for the (potentially few) projects which are ridiculously monolithic, it's justified to say "You're doing it wrong" ;-)18:50
tristantrue, but it should give you better feedback18:50
juergbipotentially, yes18:51
juergbibtw: I also have vague plans of parallelizing those checks18:51
tristanI've been saying this for years !18:51
juergbicasd supports real threads, not like Python18:51
juergbi;)18:51
tristannobody got around to it yet18:52
tristansilly18:52
juergbiso if we use async requests in Python, we can get this parallel without GIL blocking us18:52
tristanI don't think GIL is a question here, we could have been doing this last year18:53
tristanit's just not the right time to do it while benschubert is in the midst of changing the parallelization model18:53
juergbiwell, without casd it would be fairly difficult to really parallelize18:53
tristanif we had CacheCheckJob in the first place, it wouldn't change much18:53
juergbias the Python threads probably wouldn't be worth it18:53
tristanNo no, we parallelize through jobs in the scheduler18:54
tristanright now that's done with multiprocessing18:54
tristanif benschubert changes it to threads, then it's threads18:54
juergbiwould have to check whether the overhead is not too high for these checks18:54
tristanThat could have worked a long time ago18:54
juergbiwith the fork overhead, I'm not sure18:55
juergbifork in BuildStream is far slower than raw fork typically is in Linux18:55
tristanReally ?18:55
juergbiyes, it shows up very high on profiles18:55
tristanWhat for ?18:55
juergbiI'm not sure, I suspect atfork handlers18:55
juergbigRPC could be involved18:55
tristanPython or smth I guess18:55
tristanEven then, we could have easily batched them18:56
tristanthrow 100 elements at each CacheCheck job18:57
juergbiyes18:57
juergbiit's mainly a question how to refactor the code to allow this18:58
juergbias it's currently handled by each element18:58
juergbianyway, not something for today18:58
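A sketch of the batched, asynchronous shape those checks could take; check_blobs here stands in for whatever FindMissingBlobs-style call casd exposes, and its name and signature are assumptions:

```python
import asyncio

# Hypothetical: issue many cache checks concurrently instead of one per job.
async def check_missing(check_blobs, digests, batch_size=100):
    batches = [digests[i:i + batch_size]
               for i in range(0, len(digests), batch_size)]
    results = await asyncio.gather(*(check_blobs(batch) for batch in batches))
    # Flatten the per-batch lists of missing digests into one list.
    return [digest for missing in results for digest in missing]
```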
tristanThe scheduler has element specific jobs, but for a long time now that has been a subset of its jobs18:59
tristanbut yeah, better to not fiddle in there while benschubert is in it18:59
juergbithere are no non-element jobs anymore, iirc, however, we could probably re-add that if needed19:01
tristanThe framework is quasi in place anyway19:03
tristanwhat with the scheduler resources and such19:03
tristanof course the logging has to be different for such batch jobs, but that seems neither challenging nor worthy of concern :)19:04
benschubertpython being slow for fork:19:20
benschubert-> close the event loop, all file descriptor + other stuff19:20
benschubert-> the refcounting GC means copying all the memory pages on fork19:20
benschubertOtherwise I'm not sure I understood everything juergbi: would you be doing the async requests in grpc in the scheduler or before?19:20
benschubertIf before: that would not affect me19:20
benschubertif during: I think that will be great, I wanted to try the asyncio support of grpc now too19:20
juergbiI was planning before19:21
juergbiit's not something I'm working on right now, though19:21
juergbiand probably also not anytime soon19:21
benschubertyeah if it's before I would not be affected, so you can work on that if/when you want19:21
benschubertjuergbi: btw, how long to run the non integration tests on your system again? (and which parallelism?)19:21
juergbihave to rerun to check19:22
benschubertrouughly, don't need super accurate :)19:22
juergbiwith integration but without coverage it's 2min 15s with -n 1219:23
juergbi(Python 3.8)19:23
juergbiwith coverage it's a minute more or so (and problematic with 3.8)19:23
benschubertok! thanks, so threading is indeed slightly slower even on the test suite for now19:23
benschubertI still get occasional lingering threads but I've got no idea why19:24
juergbido they cause any issues?19:24
benschubertNo, but it would be cleaner for testing no?19:25
juergbior is it just about making sure it's all working cleanly19:25
benschubertUnless we don't care and I remove the check entirely19:25
juergbiyes, definitely19:25
juergbiwithout integration, without coverage: 1353 passed, 150 skipped in 52.12s19:25
benschubertok my tests take 2m12 with -n10 :/19:25
benschubertoh well19:25
juergbi-nocover?19:26
benschubertI'll try a bit further then cleanup my branch and make a WIP PR with thread checking disabled for tests19:26
benschubertpytest -n 10 directly19:26
juergbiah ok19:26
juergbiis tox behaving now?19:26
benschubertnope19:26
juergbi:-/19:26
benschubertstill getting lingering threads with tox19:27
benschubertIt's 4 threads all the time19:27
juergbiah, the lingering thread issue is only with tox, not with pytest itself?19:27
benschubertand they are not python threads19:27
benschubertwith pytest rarely (once every 2-3 test runs)19:27
juergbido you use a shared gRPC channel for casd?19:28
benschubertwith tox all the time, 90% of the tests fail19:28
benschubertI think so, I haven't changed how that was handled19:28
juergbihave you removed the explicit closing of gRPC before spawning new job?19:28
juergbithat shouldn't be necessary anymore19:28
benschuberthttps://gitlab.com/BuildStream/buildstream/-/compare/master...bschubert%2Fno-multiprocessing#57eb6e43c19557b544ffb8be1ee3811ea76c8d95_558_552 yes AFAICT19:29
benschubertmaybe do you see something obvious on the diff?19:30
juergbiyes, you've removed it: prepare_fork19:32
benschubertinteresting: tox -e venv /bin/bash then pytest fails19:38
benschubertso, that is something related to my environment19:38
benschubertthat's fishy19:38
* tristan would hope that any parallelism is done with Scheduler invocations, for consistency, and to follow the same rules laid out in the Resources19:57
tristanScheduler can be invoked multiple times with different queues19:58
juergbiit only makes sense if any actual scheduling is required20:01
juergbiI have to look into it again but it's possible batching is all that's needed, in which case it's simpler and more efficient without involving scheduler20:02
benschubertjuergbi: I have a working theory for my threading problem: if I don't start the cacheusagemonitor at all (now a thread in my implementation), then I don't have those lingering threads. So it seems that having two threads sharing the connection causes problems? Is there anything weird around GRPC I need to close/handle/lock?20:03
juergbiI'm not familiar with Python gRPC's API semantics with regards to multiple (Python) threads20:04
juergbihowever, why would it be different for cacheusagemonitor and job threads?20:04
benschubertI have no idea20:05
juergbiunless job threads are simply less likely to conflict timing wise in the tests20:05
benschubertnot starting it makes all tests pass but one (the one around checking the cache usage)20:06
juergbiis the cacheusagemonitor thread shut down / joined properly at the end of each 'cli' invocation?20:06
benschubertYep20:06
benschubertDo you see something blatant here: https://gitlab.com/BuildStream/buildstream/-/compare/master...bschubert%2Fno-multiprocessing#eee68cb7737cd5a69ac2b8ddcf0119307cf48977_734_739 ?20:07
juergbiand done in such a way that it's not forcefully torn down in the middle of a grpc request?20:07
benschubertwait, seems like some tests actually still fail20:07
benschubertso it's not perfect20:07
benschubertbut better20:07
benschubertso it seems like this is indeed the problem, grpc not shutting down nicely20:07
juergbiself._connection.close() in CacheUsageMonitor seems odd to me20:08
juergbiI don't think the CacheUsageMonitor owns the connection20:08
juergbibut no idea whether that's related to the issue20:09
juergbihow/where do you terminate this thread?20:09
benschubertit actually has no effect, removing it doesn't fix the problem20:09
juergbiand do you know how exactly this is done internally? trigger a Python exception in that thread?20:09
benschuberthttps://gitlab.com/BuildStream/buildstream/-/compare/master...bschubert%2Fno-multiprocessing#eee68cb7737cd5a69ac2b8ddcf0119307cf48977_134_136 for shutdown20:09
juergbiah, right, that seems clean20:10
benschubert(adds a 5s delay so will have to do it differently after) but yeah :/20:10
benschubertand I tried joining the thread before closing the connection right after20:10
benschubertand this didn't change either20:11
juergbiI would definitely do the join before20:11
juergbias it might still be using the connection20:11
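A generic sketch (not the actual CacheUsageMonitor) of the shutdown order being suggested here: signal the monitor thread, join it, and only then close the shared channel, so no request is torn down mid-call. Using an Event for the poll also avoids blocking a full interval on shutdown.

    import threading

    class UsageMonitor(threading.Thread):
        # Illustrative monitor thread; 'connection.query_usage()' is a
        # placeholder for whatever RPC the real monitor performs.
        def __init__(self, connection, interval=5.0):
            super().__init__(daemon=True)
            self._connection = connection
            self._interval = interval
            self._stop = threading.Event()

        def run(self):
            # wait() returns early as soon as stop() is called, so shutdown
            # does not pay the full polling interval.
            while not self._stop.wait(self._interval):
                self._connection.query_usage()

        def stop(self):
            self._stop.set()

    # Shutdown order: monitor.stop(); monitor.join(); connection.close()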
juergbiI'd suggest trying to find documentation on python gRPC thread safety20:12
juergbisharing a channel might not be supported at all or require certain precautions20:12
juergbialso have to wrap my head around GIL in this context20:13
juergbiwith GIL presumably fewer things go wrong than without, but it's not magic thread safety20:13
benschuberthttps://github.com/grpc/grpc/issues/9320 it seems correct19:13
juergbiI'm not used to this odd middle way20:13
juergbiok, so the gRPC side should be fine, hm20:14
benschuberthttps://github.com/googleapis/google-cloud-python/issues/327220:16
benschubertSo the 4 are indeed the grpc threads20:16
juergbiour _establish_connection is not really thread-safe20:17
juergbidon't know whether this could cause this issue, though20:17
benschubertooh good point20:18
juergbibut should be fixed in any case20:18
benschubertwell I can easily lock it and see what happens :)20:18
juergbiyep20:18
benschubertor force creation at the start anyways20:18
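For the _establish_connection race, a generic double-checked-locking sketch (an assumed shape, not the real BuildStream code): without the lock, two threads that both see no channel each create their own, which matches the duplicate-connection symptom.

    import threading

    import grpc

    class ConnectionHolder:
        # Lazily create one shared channel, guarded by a lock so concurrent
        # callers cannot each establish their own connection.
        def __init__(self, target):
            self._target = target
            self._channel = None
            self._lock = threading.Lock()

        def _establish_connection(self):
            if self._channel is None:          # fast path, lock not taken
                with self._lock:
                    if self._channel is None:  # re-check under the lock
                        self._channel = grpc.insecure_channel(self._target)
            return self._channel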
benschubertjuergbi: oh gosh, that was it20:22
juergbi:)20:22
benschubertso we'd actually create two20:22
benschubertuh, more errors, apparently not the only part x')20:23
tristanjuergbi, I think that one would want to throttle the number of parallel checks; querying for cached state can mean 3 things; (A) Checking cas for cached artifact, (B) Checking cas for cached sources, (C) Checking local disk for sources20:30
tristanWouldn't it make sense not to launch all of that for hundreds of elements simultaneously? If it would: that is scheduling, I think20:31
benschuberttristan: we mean at the start, before showing all elements, getting their state as we need to prepare the scheduler. It makes sense to batch all of them in a single request to casd, which should make it as fast as possible :)20:34
juergbiyes, it really depends on where/how we do the batching20:35
*** xjuan_ has joined #buildstream20:40
tristanI mean, the scheduler takes care of throttling parallel jobs so as not to overstress the hardware20:43
tristanThe user gets to say things like how many fetches and builds go on in parallel20:43
tristanIf we're going to launch jobs which need to be throttled and run in parallel, and we have a Scheduler with the Resources stuff to control such throttling, we should not re-invent that from scratch, but reuse the Scheduler object and Job harnesses to do so20:44
*** xjuan_ has quit IRC20:45
tristaneven if it appears easier to write something simpler that doesn't use the scheduler, there's no reason why using the scheduler cannot be as simple20:45
* tristan believes he had this conversation before, upon finding other threads/subprocesses/multiprocessing hacks in startup phase, over a year ago20:46
* tristan would just like to use one tool for the job consistently, and not consider that the Scheduler is only for the main build process20:47
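To make the throttling point concrete, a standalone sketch of bounding concurrent cache checks with a semaphore; in BuildStream terms the Scheduler/Resources machinery would own this limit (and the user-facing knobs) rather than an ad-hoc constant like the one assumed here.

    import asyncio

    MAX_PARALLEL_CHECKS = 4  # assumed limit, for illustration only

    async def throttled(semaphore, element, check):
        async with semaphore:   # at most MAX_PARALLEL_CHECKS run at once
            return await check(element)

    async def run_checks(elements, check):
        semaphore = asyncio.Semaphore(MAX_PARALLEL_CHECKS)
        return await asyncio.gather(
            *(throttled(semaphore, e, check) for e in elements)
        )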
juergbiI would definitely not reinvent a scheduler in buildstream20:47
juergbihowever, e.g. casd (as a gRPC server) has a thread pool20:47
juergbiand essentially letting that take care of scheduling for the startup check could also make sense20:48
tristanRight, but launching 4 threads in a hard-coded manner to do something in parallel at the beginning would not look like inventing a scheduler, but it would still be one20:48
tristanAlso, I think that the jobs need to check local disk, not only do gRPC jobs20:48
juergbiyes, I also wouldn't want to do that20:48
juergbiit's casd that performs the actual disk checks (updating timestamps)20:49
tristaneach element needs to check local disk to get Source consistency (in the absence of source cache)20:49
tristanSeems logical to couple all cache checks (including local disk) into one group20:49
juergbiyes, so there is the non-CAS part first20:49
tristanI think the non-CAS part comes after20:50
tristanWe only need to check the disk if the source cache is empty? Or maybe we still need to know, not sure20:50
juergbifirst check is artifact, we only check anything source-related if we don't have an artifact20:50
juergbiand the main part of the artifact check is CAS20:50
juergbiif the artifact is missing or incomplete, we check the source cache, which is very similar to the artifact check20:51
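The ordering described above, written out as a hedged pseudo-flow (the helper methods are placeholders, not real API):

    def query_cached_state(element):
        # 1. Artifact first: mostly a CAS/casd query.
        if element.artifact_is_cached():         # placeholder helper
            return "artifact-cached"
        # 2. No complete artifact: check the source cache (a similar query).
        if element.sources_are_cached():          # placeholder helper
            return "sources-cached"
        # 3. Otherwise fall back to checking sources on local disk.
        if element.sources_available_locally():   # placeholder helper
            return "sources-local"
        return "needs-fetch"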
tristanRight, in builds we can probably even discard the checks for source consistency but, I have a feeling that's not a long term reliable rule20:51
tristanSome scenarios might require knowing20:51
juergbisource consistency, i.e., having ref doesn't involve the disk, though, right?20:52
juergbi(except .bst and project.refs)20:52
benschubertit can, if the source needs to be staged before it has a ref20:52
tristanconsistency includes whether it's cached and whether the ref is known or not (tristate)20:52
benschubert(local sources, workspaces)20:52
tristanbut consistency has been replaced under the hood20:52
tristanI still think it's a decent thing to bring back if only for `bst source show` output20:53
juergbibut you don't want to pay the price for all other operations20:53
tristan(not implementation-wise, there are good reasons why this was broken down into 2 methods under the hood)20:53
juergbiright, can be built on top20:53
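The tristate mentioned above, sketched from memory of the old Consistency notion (names and values are assumptions and may differ from what BuildStream actually had), built on top of the two lower-level checks rather than stored directly:

    from enum import Enum

    class Consistency(Enum):
        INCONSISTENT = 0   # no ref known for the source
        RESOLVED = 1       # ref known, but the source is not cached
        CACHED = 2         # ref known and the source content is cached

    def consistency(source):
        # Derived from the two underlying checks (placeholder helpers).
        if not source.has_ref():
            return Consistency.INCONSISTENT
        return Consistency.CACHED if source.is_cached() else Consistency.RESOLVED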
juergbibtw: are you still awake?20:54
juergbior again20:54
tristanI also suspect we may hit a turning point where sources might be required by reverse deps20:54
* tristan is about to crash :)20:55
tristancan't sleep early this week20:55
tristancython has put me into a sleep at 7am pattern20:55
juergbievil cython20:55
*** dftxbs3e has quit IRC23:06
*** dftxbs3e has joined #buildstream23:23
