IRC logs for #buildstream for Thursday, 2020-07-02

*** tristan has quit IRC06:59
*** benschubert has joined #buildstream07:27
benschuberttristan: oh ! great :D07:27
*** tristan has joined #buildstream07:35
*** ChanServ sets mode: +o tristan07:35
tristanbenschubert, could you launch another benchmark on the latest please ?08:00
tristanJust see how we're doing08:00
benschubertsure, I'll do that in 30 minutes, just need to finish some urgent stuff :)08:00
tristansure08:01
tristanI think this one is a winner, I'll go deeper anyway but would just like to know :)08:01
benschuberttristan: running!08:13
*** hasebastian has joined #buildstream08:13
benschubertBy the way does someone remember how jennis generated the debian project? I could not find the script to do that :)08:13
Frazer61hi, does anyone have any thoughts on the `bst source show` feature? Since secure downloads may not be entirely possible for the time being, as they wouldn't be very well defined, this is the next best thing08:33
*** santi has joined #buildstream08:37
benschubertFrazer61: that might be better discussed on the ML :) One thing I could see is showing all the `config` for each source in the given element, or something like that?08:38
Frazer61tristan wanted to discuss it on the irc chat to get others involved more, so "Better to have this conversation in public." at least for the brainstorming08:41
coldtomimo it might be better to open up some kind of interface to the yaml loader as public API, and have this source checking in some python script08:43
Frazer61so avoid bst source show ?08:46
benschubertI'd rather go with the `source show`, it's easier to open a cli interface than an API for that08:49
tristanI suggested that we have this discussion on IRC because, with the mailing list, this will be a long drawn out affair08:51
tristanIt may still be, but having some direction to get started on the mailing list would be helpful I think08:51
benschuberttristan: as long as the important part of the discussion is on the ML that's fine :)08:52
tristanYeah of course08:52
tristanI think that Frazer61 is not currently armed with all the context he needs to post a full proposal08:52
tristanOne thing I think we need to consider here are the needs of collect_manifest08:53
benschubertright08:53
tristanI.e., having `bst source show` give us URLs requires some kind of plugin interaction I think, if we do that and don't consider the rest, we've kind of wasted some opportunity08:54
tristanCore access to raw 'config', what are the benefits and drawbacks ?08:54
tristanI think off the bat, it doesn't allow a consumer of the API (either Source API or command line), to know what it is looking at08:54
tristanI.e. there is nothing standard at all about source `config` data, except for the `directory` it will be staged in08:55
tristanThe rest is a free for all, steered by general convention08:55
tristanThe benefit of course is that we could implement something fast and easy, and hand over the problem space to the consumer, I don't think that is ideal08:56
tristanEspecially considering, it will be the same as https://gitlab.com/BuildStream/bst-plugins-experimental/-/issues/208:57
benschubertOn the other hand, project authors have the knowledge on what sources they are using, so might be the ones best placed to decide what to look for?08:57
tristanEven if we report a 'kind' string for each source, the end user doesn't have the right to know what that 'kind' string means08:57
benschubertWhy don't they?08:58
tristanbecause any project can introduce a plugin with any kind string08:58
tristanthe `git` source on your toplevel might be something completely different from the `git` source of a junctioned project08:58
tristanand the junctioned project can change its `git` source later on to mean something different08:58
benschubertSo for example, if I want a linter that ensures only a subset of all 'kind' plugins are used in my project, how would I do that without it?08:59
Frazer61freedesktop-sdk is like that i think08:59
tristanbenschubert, you don't have that ability08:59
benschuberttristan: plugins are defined in a scope, should we scope the `kind` ?08:59
benschuberttristan: I think that might be a common thing asked for projects to be able to lint on08:59
benschubertI know you can't now08:59
benschubertbut that's definitely something we should consider09:00
tristanbenschubert, I've been thinking about extensibility of core plugins, and in that context; I think we need to make use of core plugins explicit (in the project.conf origins) and also implement an `alias` feature to rename plugins whose names might otherwise clash09:00
tristanbenschubert, I think perhaps we could have something like plugin identities at some point, but maybe that is completely orthogonal to providing concise information through `bst source show` and Source python APIs ?09:01
tristanbenschubert, for instance, it might make sense to associate a hosting/url/unique identity with a plugin, but I *think* we can have that conversation separately ?09:02
benschuberttristan: not 100%, in a `bst source show` I would definitely expect to see a kind :)09:02
tristanWe don't show a kind in `bst show` do we ?09:02
coldtomjust as an aside, opening up some interface to the yaml loader would allow replacements for collect_manifest, collect_integration and any other build metadata -> sandbox plugin09:03
tristancoldtom, not in a useful way09:03
tristancoldtom, i.e. not by just exposing raw nodes, because as stated in this backlog, raw nodes are not meaningful09:03
benschuberttristan: I don't know, the way we went with was parsing the yaml ourselves, and forbidding anything that is not basic yaml (not that this is a good solution)09:03
benschubertbut sure, let's not take too long on that point09:04
tristancoldtom, the configuration of a plugin is only meaningful to the plugin parsing it09:04
tristanI think that it's important for me as a user of `bst source show`, to know 2 things primarily... (A) A URI or a list of URIs... (B) Some additional identifying information for each URI reported for that source09:05
benschuberttristan: one thing that I would personally find very useful would be able to get the parsed configuration for the sources09:05
tristanI would not be particularly pleased if I had to have intimate knowledge of Source configurations in order to derive these meaningful aspects09:06
tristanbenschubert, Why ?09:06
tristanAnd what is "the parsed configuration" ?09:06
tristanMaybe we are saying the same thing ?09:06
tristanFor (B) above, I would consider that like a checksum of a tarball URI, or a commit sha of a git URI09:07
tristanSomething uniquely identifying09:07
tristanOf course raw YAML wouldn't give you that anyway, because refs are possibly in project.refs09:07
benschuberttristan: Here are a few things I want to ensure on a project list of sources:09:08
benschubert- every uri-type field is using an alias09:08
benschubert- Only plugins in a given list are allowed09:08
benschubertyeah I don't mean raw yaml09:08
tristanRight :)09:08
coldtomah, i should clarify, when i say the loader, i actually mean "some API to grab Elements and Sources given an element of input"09:08
benschubertbut rather a dump of the source's configuration after parsing09:08
tristanbenschubert, for the list of plugins, I think we need to consider that separately because there is nothing to identify them afaics, but we should keep in mind that that needs to be extensible09:09
tristani.e. for "only given plugins are allowed"09:09
benschuberttristan: I'm fine with that :)09:09
tristanAnother thing I would want to know is cached state of course, like bst show09:10
tristanWhether (A) The source is cached locally, (B) The source has a known ref and can be fetched, (C) The source cannot possibly fetched because there is no ref, and so it needs to be tracked09:11
tristani.e. the Consistency enum state of a Source09:12
benschubertright09:12
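A rough sketch of the three states tristan lists, in the shape of BuildStream's Consistency concept (the member names below are approximations drawn from this discussion, not copied from the source tree):

```python
import enum

class Consistency(enum.Enum):
    INCONSISTENT = 0  # (C) no ref yet: the source must be tracked before it can be fetched
    RESOLVED = 1      # (B) a ref is known, so the source can be fetched
    CACHED = 2        # (A) the source data is already available locally
```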
benschuberthaving something similar to the `bst show --format` (anything that makes sense also in the source context)09:13
tristanOrthogonal tangent we can discard: There is a scenario with remote execution where we might want to consider "cached locally" as "cached on the RE build cloud"... that would affect `bst show` (artifacts) and `bst source show` (sources) equally, so I think it is irrelevant for this conversation09:14
tristanYeah, for the command line we'd want format09:14
tristanWhat bothers me about `bst source show` in general, is how to represent lists09:14
tristanDo we output YAML ?09:15
tristanfor those details ?09:15
benschubertI'd love that, it's so much easier to parse09:15
tristanSources have 0-N URLs09:15
tristan(and as such, refs, although internally a ref might be a list)09:15
benschubertrefs can be anything09:16
benschubertwe have some refs that are json blobs09:16
tristanYes, but I think we want to have a concept of "A unique identifier for this URI" in an outward facing API right ?09:16
tristanSo it would be a Source's responsibility to provide that09:17
tristanEven if it's only for the purpose of `bst source show`, or use in other plugins which want to make queries09:17
tristanThere is also an interesting facet about Source URIs; their URI lists are variable in length09:18
tristanSometimes a Source has 0 or 1 URI, but more URIs after tracking09:18
tristanLike git submodules, or rust crates09:19
benschuberttristan: having `bst source show` ask each plugin to give the info seems a good idea09:19
benschubertso the ref can be formatted nicely, etc09:20
benschubertthat could return a mappingNode, with all the info09:20
tristanRight, maybe the API needs to be a string based query09:20
tristanAh you mean, trust the plugin to format a dictionary in a specific format ?09:21
benschubertlike in some refs, there are some things that are "internal" to the plugin that you don't need to expose09:21
benschubertyep09:21
tristanCan we do away with the trust thing ?09:21
tristanhehe09:21
benschubertwell, yes? Because as you said the only thing that knows what's important is the plugin :)09:21
benschuberthence why we don't want a 'uri' checker in Core09:21
tristanI'd much prefer to have a more controlled API which the core can query sources with, like: urls = Source.get_info(InfoType.URLS)09:22
tristanAnd the core creates YAML (or not) as needed for output on the CLI09:23
tristanget_info() or such could be private _get_info() until we are absolutely sure that Elements in reverse deps should be allowed to see this information09:23
tristanAnd we guarantee a stable format if the `bst source show` user decided they wanted yaml output (maybe default is something like `bst show` one liners ?)09:24
tristanHmmm09:25
tristanSo that's a good point... what about sources which don't ever have a URI, we already have some of those (except they grow URIs after tracking)09:25
tristanDo we need to know that context ?09:25
tristanURIs vs "Discovered URIs" ?09:26
benschubertand then what about things that use `uri` as something else? Like, I could envision the `pip` package to use `uri` for the registry. Do we want this? ;)09:27
tristanbenschubert, in fact, since the core can use Node() APIs to validate fields, the Plugin could return MappingNode and there is no trust issue, I'm not sure if that is more convenient than constructing the mapping node in Core, though09:28
tristanOh, the name of the field in Element/Source configuration is of course irrelevant09:28
tristanonly meaningful data must be reported09:29
tristanAh and yeah, that09:29
tristanThe shape of the data indeed varies wildly09:29
tristana url can be a base, I think it is similar for the crate plugin too09:29
benschubertwell the registry is important and might not be the same between all, though this would not be reported as a `uri`, so what should it be? What I'm trying to say is that it's quite hard to have a consistent non-plugin-dependent representation here09:29
tristanbenschubert, in that case, `pip` and `crate` maybe have a 'primary URI' which is the base URI which never has a ref, and secondary URIs (all of the individual crates) have refs09:30
tristanYeah it is, but I think that we're only interested in commonality, something that can be understood equally for all plugins09:31
tristanIf we want to query for URIs, then what are our reasons to do so ?09:31
tristan  * Policy URI schemes (gnome-build-meta wants this)09:32
benschubertensure they all have an alias set?09:32
tristan  * Ensure internal mirroring of all input data09:32
benschubert(thus you don't want the expanded uri, just the normal one)09:32
tristanbenschubert, I don't think we want that alias thing in this context09:32
benschuberthow do you ensure all uris are aliased then?09:32
tristanbenschubert, We should already have a warning (configurable to fatal ?) if ever an unaliased URI is used right ?09:33
benschubertoh maybe09:33
benschubertyou might be right sorry09:33
tristanIf we don't we should09:33
tristanWe already have all the mirroring URI indirection which should make that possible09:33
tristanPlugins which use hard coded URIs or discover URIs and fail to use the indirections, are buggy (and could equally hide these unaliased URIs from users in a new API)09:34
tristanFrazer61, Taking notes ?09:34
benschuberthttps://gitlab.com/snippets/1992018 btw :)09:34
tristanStill slower ?!?!?!09:35
tristanDamn09:35
tristancrazy09:35
Frazer61tristan yeah, just don't want to interrupt just yet10:35
benschubertmemory's much better though09:35
tristanFrazer61, just checking09:36
tristanSo another reason is to create manifests09:36
tristanreporting on the provenance of build inputs09:37
tristanCertainly not for discrete processing of the outputs or automation of tasks based on the output09:37
tristanValidation, reporting09:38
tristanChecking that internal mirrors have everything we are using, may be an automatable task, but it can't be realistically achieved without the assistance of BuildStream itself09:39
tristanBecause you can't have knowledge of the meaning of refs09:39
tristanYou can only know that they are unique for a given URL09:39
tristanI think that we are essentially on the same page that (A) The source reports concise data about URIs and refs  (B) A source can have 0-N URIs  (C) Some URIs are discovered during processing (tracking)09:41
tristanAnd I think that we can enforce that a unique string can be reported for each URI (internally based on "ref")09:42
tristanEven if it's not entirely true ?09:42
tristanI *think* we don't need to distinguish between primary URIs (somehow encoded into configuration) and discovered URIs09:43
tristanWe will also need extensibility to support plugin unique identities09:44
benschubertI think that's good enough yeah09:44
tristanFor extensibility, we'll need to define what happens when a Source is asked for something which it has not been implemented to report09:45
tristanFor a CLI feature, the user probably needs to decide09:46
tristanI guess that in this sense, we could start with a Source.list_uris() and Source.get_display_ref(uri) API, and add additional APIs for reporting other things later on09:46
* benschubert is worried about the number of APIs that might end up being required09:47
tristanIn that way we handle an UnimplementedError() or such09:47
benschubertbut I guess we don't really have a choice09:47
tristanYeah I don't like it either09:47
tristanThen again, I'm less concerned about Source plugins - they are high effort and potentially fewer09:48
benschubertright09:48
benschubertand the base ones will probably be well maintained (for the most used ones)09:49
tristanYup09:49
tristanNow if people can stop being annoying and implementing weird stuff like rust, JS and python09:50
tristanWe won't have to deal with these whacky concepts of language-centric distribution vectors09:50
benschubertI feel you :)09:50
tristanWe can only pray :)09:50
benschubertpyproject.toml with pep517 is quite fun to go around09:50
benschubertat least gentoo had a great shot at it and it's now manageable in a sandbox09:51
* tristan is genuinely surprised by this latest benchmark :-S09:52
tristanThe code is actually running consistently faster with the variables test09:52
tristanbut still 0.5 seconds behind on 6K debian elements :-S09:52
tristanWhich means, something about repetition, we are doing the *same thing* many times and should be able to not do it09:53
tristanAha !09:53
tristanno, not really aha :-S09:54
tristanThe old code had different hashing, but still per-Variables-instance tables09:54
Frazer61so are we at agreement that we will store the URIs or whole URL in a manifest, and bst source show will just read the manifest?10:01
tristanFrazer61, not really no10:02
tristanFrazer61, We are in agreement that we first need a Source plugin addition for reporting a list of URIs and a corresponding list of refs, both as simple strings10:03
tristanAbove I suggested get_uris() and get_display_ref(uri), but maybe it's simpler to have a single API which returns two lists10:04
tristanThat part we can refine in the proposal10:04
tristanFrazer61, using that new API, we can implement `bst source show` in such a way that it can use the `--format` argument (similar to `bst show`) to display (A) cached state, (B) URIs (C) refs for a given source10:07
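A minimal sketch of the two plugin-facing methods floated above; nothing here exists yet, and the names get_uris() and get_display_ref() are just the ones proposed in this conversation:

```python
# Hypothetical additions to the Source plugin API, per the discussion above.
class Source:

    def get_uris(self):
        """Report the list of URIs this source refers to.

        May be empty, and may grow after tracking (e.g. git submodules, crates).
        """
        raise NotImplementedError("this Source does not report URIs")

    def get_display_ref(self, uri):
        """Report a unique, human-readable ref string for the given URI,
        e.g. a commit sha for git or a checksum for a tarball."""
        raise NotImplementedError("this Source does not report refs")
```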
Frazer61i would rather have 2 functions, rather than 1 function returning 2 data lists, but thats just my personal preference10:08
tristanFrazer61, some additional thinking needs to go into how the `bst source show` command works, specifically we need to consider how to identify sources of a plugin10:08
tristanFrazer61, one function might be easier to implement - remember that in this case, we have few calling instances vs many implementors (and the priority is on ease and clarity for implementors, not callers)10:09
tristanThe core we can always control, plugins are for people to implement at will10:09
tristanFrazer61, from here, it would be great if you could collect the general output of this and try to fill in any gaps you can see (like how to address/list per element sources when calling `bst source show foo.bst`)10:10
tristanFrazer61, and provide a writeup for the ML, I would expect a back-and-forth on the ML but not too much10:11
tristanWith this, collect_manifest could safely be transformed into a shell script that calls `bst show` and `bst source show`10:12
tristanor maybe a python script which does the same10:12
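For illustration, such a script could look roughly like this, assuming a future `bst source show` that accepts `--deps` and a `--format` string with %{name}/%{uris}/%{refs} keys; none of that exists yet:

```python
import subprocess

# Hypothetical manifest collector; the `bst source show` command and its
# format keys are assumptions from this discussion, not an existing CLI.
def collect_manifest(target):
    result = subprocess.run(
        ["bst", "source", "show", "--deps", "all", target,
         "--format", "%{name}: %{uris} %{refs}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()

if __name__ == "__main__":
    for line in collect_manifest("gnome.bst"):
        print(line)
```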
Frazer61ok, I'll do some experimenting10:13
tristan(and of course gnome-build-meta could parse `bst source show --deps all gnome.bst --format '%{uris}'` or similar to easily implement a git accept hook)10:13
*** tristan has quit IRC10:17
*** tristan has joined #buildstream10:58
*** ChanServ sets mode: +o tristan10:58
douglaswinshipCan anyone help me with these CI failures? They all happened fairly close together, and they all relied on a freedesktop-sdk branch that I had just rebased to include a bunch of extra commits from master, so I thought they might be connected. But they all look completely different.11:23
douglaswinshipI think maybe some of them are just runner issues, but I haven't seen them before.11:23
douglaswinshipThe easiest ones are these two: https://gitlab.com/freedesktop-sdk/freedesktop-sdk/-/jobs/617168643 & https://gitlab.com/freedesktop-sdk/freedesktop-sdk/-/jobs/61716881211:23
douglaswinshipThe error message there is "[00:00:40] FAILURE components/sdl2-ttf.bst: Failed to pull artifact with status UNAVAILABLE: failed to connect to all addresses".11:24
douglaswinshipDoes that just mean the remote caches went down / couldn't be reached, and I should just re-run the job?11:24
douglaswinshipI'm just surprised that that causes a build failure. I thought caches were there to make the build run faster, but I didn't think they were essential for a build.11:24
douglaswinship(There are a few more, that look different, but I don't want to completely spam the channel.)11:24
coldtomyep that's caused by cache issues douglaswinship11:27
coldtomimo that should not cause build failures though11:28
douglaswinshipNext question: Any ideas on this one? https://gitlab.com/freedesktop-sdk/freedesktop-sdk/-/jobs/61651057211:31
douglaswinshipcoldtom: and thanks. Will re-run11:31
douglaswinshipre the last link, the first issue I see is "make: *** [Makefile:968: all] Error 2", and I don't see anything before that that suggests any problem to me. So I'm not sure where to start.11:33
coldtomnot sure on that one, it's possible the error happened further up the build log and it's not caught there, the full log might be more informative11:44
douglaswinshipAh yes, this might be more relevant: "make[2]: *** No rule to make target 'cpp.texi', needed by 'doc/cpp.info'.  Stop."12:15
tristanbenschubert, so I'm not sure how wise it will be to dig down as far as the strings, but I've got a working theory that the code is slowed down by my earlier theory that we should be interning the variable names13:02
benschuberttristan: do you mean that if we stop interning it will be better?13:02
tristanbenschubert, locally I've run the variables benchmark a bunch of times with different configurations, and it would seem that interning the values ("%{foo} and %{bar}" strings) but not the variable names is better13:03
tristanIt has to do with frequency of calling sys.intern() vs using those comparisons13:03
benschubertyeah fair point, I can re-benchmark a version without interning if you want :)13:04
tristanUpon entry into Variables instantiation, interning all the variable strings every time has an expense13:04
tristanI pushed one13:04
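Roughly the trade-off being measured here, with made-up strings: sys.intern() costs a hash lookup on every Variables() construction, in exchange for cheaper identity comparisons later.

```python
import sys

# After interning, equal strings are the same object, so later comparisons
# are pointer checks -- but every sys.intern() call costs a hash lookup.
name_a = sys.intern("install-root")
name_b = sys.intern("install-root")
assert name_a is name_b            # identity comparison, effectively O(1)

# Strings built at runtime are usually distinct objects and compare
# character by character -- no up-front intern cost, dearer per compare.
value_a = "%{prefix}" + "/bin"
value_b = "%{prefix}/bin"
assert value_a == value_b          # equality holds, identity need not
```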
* benschubert should set up an on-demand benchmarking system13:04
benschubertok let me start a benchmark13:04
tristanThanks a lot for being there throughout :)13:04
tristanI've been giving consideration to deeper intimate string manipulation, but I worry a bit13:05
tristanPython has dragons in their strings it seems13:05
tristanit's not as straight forward as "everything is UTF-8"13:05
benschubertyeah I would really prefer not going that deep13:05
benschubertHappy to help, seems I have quite an easy way of benchmarking, so if it can make buildstream faster, it's not a big cost :D13:06
tristanWe have some help from https://github.com/cython/cython/blob/master/Cython/Includes/cpython/unicode.pxd13:06
tristanBut if we do strings, I think it's safer to use straight C API from Python.h13:06
tristanBecause that lets us go from str -> explicitly utf8 string -> str13:07
tristanWhereas the cython wrapper gives you ` char* PyUnicode_AS_DATA(object o)`... (whatever "DATA" is: fend for yourself)13:07
benschubertbenchmark running!13:09
tristan(with `cdef extern from "Python.h"` we can have `const char* PyUnicode_AsUTF8(object unicode)` and the opposite)13:09
tristanI was thinking of trying the "".join() in C and tracking the length of each string etc13:10
tristanFor the final resolutions13:10
tristannot sure we gain anything13:10
tristanThe python `dict` is hard to kick too, if we went full string it would be better to have a C hash table13:10
benschubertWhat I'm wondering now is if it's really worth going that far? The idea was to make the code more readable and understandable (+the lazy resolution for junctions), and I wonder if we are not making it worse if we go too far down that road?13:11
tristan:(13:12
benschubertI mean, if it means that we have a blazing fast variable resolution I'm happy with it13:12
tristanI know, its terrible, I don't understand13:12
tristanWhat was there before was confusing lists, plus manually raising RecursionError(), not supporting more than a hard coded limit of recursion13:12
benschubertyeah before is clearly not great13:13
benschubertso clean well documented C/Cython code is still probably better13:13
benschubertI just would really like us not to ever have a segfault because of that :)13:13
benschuberton the other hand, with enough tests, it would probably be good13:13
tristanYeah I wouldn't worry much about that13:13
tristanThere's not huge opportunity for segfaulting here, except when you introduce code changes initially13:14
benschubertright13:14
benschubertand the current implementation actually could segfault in very specific cases13:14
tristanWell, it will python stack trace when reaching the limits of recursion, I'm not that familiar anymore with the current implementation heh13:15
benschubertthe check is not actually perfect and if you are already deep in a stack you can actually make the code segfault :) but that's a very weird setup required13:17
benschubertso would not happen during normal usage13:17
benschubertbut I think you are right, if we are careful with C and it actually gives us easier to understand code and allows us to squeeze a good chunk of performance here, it's worth it13:18
tristanOh yeah13:18
tristanOf course13:18
tristanThat recursion check is not even accurate13:18
benschubertit's a "best effort" check :'D13:18
tristanYou'd have to hit the glibc stack limit13:18
tristanI guess13:18
benschubertyep13:18
tristanbest effort could at least observe Python->internal_stack_depth_counter or smth on entry ;-)13:19
tristanand still possibly be wrong13:19
benschubertthus your rewrite being much better even if we get more C code13:19
* tristan does really like it better....13:20
tristanbut I'm obviously biased :)13:20
benschubertI'll like it better once it's faster :D13:20
tristanreally upsetting the performance issues though13:20
* tristan could add pointer array13:22
tristanbenschubert, maybe give a look at the main resolve loop https://gitlab.com/BuildStream/buildstream/-/blob/tristan/partial-variables/src/buildstream/_variables.pyx#L258, let me know if you spot anything ?13:23
tristanObvious things include, I could store deps in a pointer array (possibly realloc() its length to double its size any time I add a new item), instead of a python list13:23
tristanEven start out pointer arrays with enough space for say 8 values, possibly hitting realloc and then 16, 32, etc13:24
tristan8 * sizeof(PyObject *)13:24
tristanProbably it will very rarely need to realloc13:25
tristanbut the algorithm, checking circular refs and collecting the ordered list of dependencies in the same loop, appears to be pretty solid to me13:26
tristanI could be missing something obvious I don't know13:26
benschuberttristan: mmh that seems a lot of context switch right now, might have the brain more free this weekend if you don't mind?13:26
tristanDon't worry about it :)13:26
tristanThe moral support and benchmarking is good enough, don't let me disturb you too much if you're on a roll :)13:26
douglaswinshipjuergbi: You were talking to WSalmon on tuesday about buildbox-casd, and it still having some memory issues even under 0.0.9?13:33
douglaswinshipI have an example of a build failure in CI. The casd log is here https://gitlab.com/libreml/libreml/-/jobs/618055732/artifacts/browse/cache/buildstream/logs/_casd/13:33
benschuberttristan: I'll have a look this weekend if you want, just that for today it would be too much of a context switch :)13:33
douglaswinship"   what():  OutOfSpaceException exception thrown at [buildboxcasd_lrulocalcas.cpp:325], errMsg = "Insufficient storage quota"   "13:33
WSalmondouglaswinship, interesting13:35
WSalmoncphang how much disk/RAM dose that runner have?13:35
tristanbenschubert, I'm sure you will eventually review it anyway, and if it's not really ready for review by this weekend and performing acceptably, then I'll appreciate you looking at it by then too :)13:35
juergbidouglaswinship: could the out of space/quota error be correct, i.e., you either don't have enough disk space or quota configured too low to cover a whole bst session?13:36
benschuberttristan: seems good!13:36
juergbidouglaswinship: it is supposed to handle that case with a better error message (on the BuildStream side), however, it would be interesting to know whether the error is bogus or whether it's just an issue of not properly handling/forwarding this error13:36
*** cphang has joined #buildstream13:37
WSalmonjuergbi, as in it might be crashing while doing disk management or it might just be the issue and we should report it better?13:37
juergbirunning out of space should result in a termination of the build session but with a reasonable error message13:38
WSalmon+113:38
WSalmoni seemed to rember it doing so at one point13:38
WSalmonbut not 100%13:38
cphangWSalmon https://gitlab.com/libreml/infrastructure/infrastructure/-/blob/master/ansible/roles/gitlab-runner/defaults/main.yml#L44 50GB13:38
juergbiwe start casd with --protect-session-blobs, which prevents casd from expiring any blobs created or accessed in the current bst session13:38
juergbias buildstream can't deal with blobs suddenly vanishing mid-session13:39
juergbihowever, casd is supposed to return a suitable grpc error in that case, which buildstream should be handling13:39
juergbiand I think we have a test case for this13:39
WSalmonyep i thought at one point it did say that in some cases but i might be thinking of bst1.413:39
juergbiwe have test_artifact_too_large to verify this13:40
juergbihowever, we may hit this in different code paths and might not handle it properly in all of them13:41
juergbi(in bst or casd)13:41
WSalmondouglaswinship, that job looked like it had built a LOT of stuff, so a simple restart should be enough13:41
WSalmonbut thanks for bringing this case to our attention13:41
douglaswinshipSo if I'm understanding this right: If buildbox had responded perfectly, then the CI job would still have failed, but buildbox would have given a more helpful error message? (ie buildbox-casd isn't the actual reason this pipeline failed?)13:42
WSalmoncan you document this in a issue, just some links and a copy and past of the error message, the title should be something like, report filling cas in one run better etc13:42
douglaswinshipor is it that we're not sure either way?13:42
WSalmondouglaswinship, it should give a more helpful message in any case so we should have an issue there but there may be more at play13:43
juergbiat first glance it seems primarily like a buildbox-casd bug13:43
juergbiit shouldn't terminate13:43
douglaswinshipdang. That's frustrating.13:44
WSalmonbut 50G is not a lot for a big rebuild..13:44
WSalmonso it might be a genuine issue13:44
WSalmonbut not being handled well13:45
cphangWSalmon agreed, and a supporting argument for this is that the aarch64 jobs succeeded13:45
WSalmoncphang, have you rebuilt everything including FD on those runners before?13:45
cphangWSalmon yeh I'm pretty sure, we've handled changes to bootstrap import before w/o cache13:46
WSalmonummm, interesting13:46
douglaswinshipWSalmon: should I be raising this as an issue in buildstream, or the buildbox-casd repo? (Or both?)13:46
WSalmonat least buildstream, as it should be reporting things better13:46
cphangAs a temporary workaround douglaswinship maybe I'd suggest bumping up the size of https://gitlab.com/libreml/infrastructure/infrastructure/-/blob/master/ansible/roles/gitlab-runner/defaults/main.yml#L44 to 128GB, I'm happy to approve that and I'll restart the gitlab runner (this will probably be end of day).13:46
WSalmonjuergbi, do you want us to raise something on buildbox-casd about this too?13:47
WSalmonsounds like it should have been handled more gracefully?13:47
juergbiyes, should be fixed in buildbox-casd13:48
WSalmondouglaswinship, ^13:49
douglaswinshipWSalmon, juergbi: thanks :)13:53
*** tristan has quit IRC14:03
*** traveltissues has joined #buildstream15:00
*** tristan has joined #buildstream16:00
*** ChanServ sets mode: +o tristan16:00
tristanbenschubert, any word on the internless run ?16:00
benschuberttristan: woops forgot about that, let me see16:01
tristan.... drumroll ...16:01
benschuberthttps://gitlab.com/snippets/1992018#note_372673299 better but not quite there :)16:02
tristan.25 seconds behind16:02
tristanOk next time with arrays16:03
tristanI also noticed that I'm not doing the optimization of making sure value dependencies are unique (since they are not interned anymore, the identity comparison is worthless), that means a loop per ValuePart per ValueClass, but as those are cached, they won't be meaningful really16:05
* tristan can either remove that unique check since "%{foo} and %{foo} and %{foo}" is a rare value expression16:05
tristanor I can make it a string compare, but probably just ignore and leave the redundancy in16:06
benschuberttristan: possible yeah :)16:35
Frazer61hi, does anyone know the best way to use the src/buildstream/testing/_sourcetests/project to experiment with the master bst?16:39
* tristan has ObjectArray, but realizes he should have PointerArray with a free_func & element_size17:04
tristanAt least with ObjectArray I now have the deps list in resolve() as an array17:05
tristanPointerArray would let me change the ValueParts to aligned chunks in an array17:05
tristanFrazer61, I think that those tests are a more difficult angle to work from17:06
tristanFrazer61, most of the test cases in tests/ for instance, you can just run BuildStream on them and they work (for the majority which are statically committed tests)17:07
tristanOr, the integration tests17:07
tristanFrazer61, What I'm doing, which might not be considered "best", is to run `tox -e venv /bin/bash` to run BuildStream17:08
tristanI exit the shell and re-enter every time I make code changes17:08
Frazer61i assume not but does it matter if i use zsh instead of bash, assuming /bin/bash overrides with bash17:09
tristanIt doesn't matter17:09
Frazer61ok, thanks17:09
tristanIn zsh, there is no completion17:09
tristanIn bash, the awesome completion is broken, cause douglaswinship still didn't submit that patch which fixes it17:10
tristanThe one I know he has hiding in his workspace and keeping all to himself17:10
tristanbenschubert, you're probably dropping off soon... maybe run one more benchmark before you go ?17:12
*** santi has quit IRC17:12
tristanAnd let me know what the result is when you get back online tomorrow ?17:12
benschuberttristan: Sure, I'll run that now17:12
benschubertperfect timing, since I'll need my desktop soon :D17:13
Frazer61thanks tristan17:13
*** santi has joined #buildstream17:13
tristanGreat, _resolve() is looking very white and beige now hehe17:14
tristanin fact, the only yellow line is _resolve() itself17:14
tristan`__Pyx_INCREF(Py_None);` <-- that looks to me a little dumb to be generating :-S17:20
tristanYou'd think that one could rely on `None` existing throughout the course of a python session17:20
benschubertTrue :)17:20
* tristan wonders how many python bugs revolve around imbalanced refcounts on None ;-P17:21
tristanThose, along side threading issues, are like the most difficult things to debug17:21
* tristan should write a little loop which takes valentind's variables testing element and multiplies it N times, each time dumping it as a new element which depends on the last17:26
tristanThen I could have a test more similar17:26
tristanRunning `bst show` on that small test gives quite erratic results17:26
tristanAlso, how come a yaml cache doesn't get created, is that obsolete ?17:28
tristanbenschubert, in your benchmarks, you have show-cached (I presume it's run a second time without removing the cache), there used to be some sort of extra file used to assist the yaml load, is that it ?17:29
* tristan thinks that was removed in favor of just yaml code that works better all the time17:29
benschuberttristan: no more yaml cache as it was slower yep :)17:29
tristanOk yeah, that's good17:30
benschubertNo `show cached` means that you have all your artifacts cached locally17:30
tristanAhh17:30
benschubertwhich is impacted by casd doing the checks :)17:30
benschubertIt's basically:17:30
benschubertFrom an empty state -> show -> source fetch -> show-fetched -> build -> show-cached17:30
tristanRight, I've noticed that the difference between my branch and master is not affected (it's about the same difference all the time)17:31
tristanlast time .25 seconds in both scenarios17:31
benschubertwhich is expected since you are not touching anything around the 'caching'17:31
tristancython is pretty smart17:33
tristanif I do `foo = PyMem_Malloc(...)` and then `if not foo: ... raise MemoryError()`17:34
tristanThen cython does `if (unlikely(__pyx_t_1)) {`17:34
tristanfor the generated `not` clause17:35
tristanSo it's got some idea/knowledge about the reason why I'm checking my return (perhaps a bit spooky)17:35
tristanOr rather, probably PyMem_Malloc() has some annotation to it, indicating that it shouldn't be returning NULL17:36
benschubertThat's also possible? I'm not familiar enough with the internals of python17:37
* tristan finds this area particularly fun, I like this level of the stack17:37
tristanAs a glib/GObject guy, I've always enjoyed Obj-C17:38
tristanPython appears very similar under the thin hood17:38
benschuberttristan: One thing I'm very happy about is having someone else looking into all the cython/optimized part of BuildStream, we're going to be able to do much better :)17:40
tristanCareful hehe17:40
tristanIf I get excited, there may no longer be an excuse to keep the python around, and it will all just meld into C17:40
tristanheh17:40
tristanlink the interpreter and invoke it for plugins17:41
benschuberttristan: I think a "C" layer for performance-critical stuff, a "cython" layer for performance stuff but not that critical and "python" for the 90% of hte codebase would probably be a good thing :)17:42
*** dftxbs3e has quit IRC17:42
benschubertBut I'd be careful at bringing in "too much" C, as it's potentially harder to involve new people17:42
tristanThat 90% quickly becomes 10%17:42
tristanRight I see what you mean17:43
*** dftxbs3e has joined #buildstream17:43
tristanI kind of feel like ideally, the high level business logic belongs in a language like python where it's hard to get things wrong17:43
benschubertYep, and the "critical" paths can be optimized in Cython/C when we need17:44
tristanthe more critical stuff in C would be good, it also kind of scares away people from touching critical stuff17:44
tristanWhich... I guess has drawbacks as well as bonuses17:44
benschubertyep17:45
*** hasebastian has quit IRC17:45
benschubertwhich is why I initially went with Cython for the initial optimizations, since it's still very close to python and harder to shoot yourself in the foot17:45
benschuberttristan: actually speaking of C, what tool would you recommend to track what's starting and stopping threads? I've still got 4 threads lingering when I run tests in tox :/ (not pytest itself funnily enough)17:48
tristanif this flies, I should put the PointerArray in a different file, maybe we can convert the loader code to use C arrays for dependencies and such17:49
benschubertwe could17:49
benschubertI had a general idea of trying to dig more into the "c" interface of ruamel and work more in Cython/C than python there17:49
benschubertand so the loader could use this17:50
tristanYeah, I think we just need a real C implementation of round tripping YAML17:50
tristanIt really makes no sense to have this in python17:50
benschuberttristan: actually...17:50
benschubertWe keep the location in the parser right?17:50
tristanruamel.yaml is in a sad state too, and fails to preserve formatting when round tripping17:50
tristanThe "location" ?17:51
benschubertSo we could use a fast C (libyaml?) parser for the "base case" and when we need to write, we could reload the file itself with Ruamel and find the right thing to overwrite no?17:51
benschubertIt would be more work17:51
benschubertbut might give us the best of both worlds17:51
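A rough sketch of that idea, assuming PyYAML's libyaml-backed CSafeLoader for the hot read path and ruamel.yaml only for the occasional write-back; the file structure touched in write_back_ref() is invented for illustration:

```python
import yaml                   # PyYAML; CSafeLoader is only present if libyaml is available
from ruamel.yaml import YAML  # round-tripping parser, used only on the write path

def fast_load(path):
    # Hot path: parse with the C loader, keeping no comments or formatting.
    with open(path) as f:
        return yaml.load(f, Loader=yaml.CSafeLoader)

def write_back_ref(path, new_ref):
    # Cold path: reload with ruamel.yaml to preserve comments and formatting,
    # overwrite just the value we care about, then dump the file again.
    rt = YAML()
    with open(path) as f:
        data = rt.load(f)
    data["sources"][0]["ref"] = new_ref   # invented element layout, for illustration
    with open(path, "w") as f:
        rt.dump(data, f)
```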
tristanHmmm17:51
benschubertuntil we decide we're too fed up and re-write a yaml parser :D17:51
tristanSounds like a lot of work and residual maintenance just to avoid a proper rewrite17:52
benschuberttristan: rewriting ruamel in cython/C ? with only the things we'd need?17:52
cphangtristan benschubert would something like https://github.com/tlsa/libcyaml be useful?17:53
tristanThat's another option, I was thinking it really should be a pure C round tripping YAML implementation17:53
tristancphang, does it roundtrip ?17:53
tristanI don't think so17:53
benschubertyeah, not sure what's better than https://github.com/yaml/libyaml there? Do you know cphang ?17:54
tristanlooking at that page, looks like it does load/save, but not preservation of line/column/whitespace/comments etc17:54
tristanruamel.yaml even manages to lose some comments completely when you put them in the "wrong" places17:55
benschubertyeah :/ and it being on sourceforge doesn't help for contributing17:56
tristanis it on sourceforge ?!17:56
tristanjaysus I didn't know the yaml format itself was even that old17:57
benschubertit moved from bitbucket to sourceforge17:57
benschubertbecause bitbucket dropped mercurial support17:57
* tristan wonders if they upgraded to CVS !17:58
tristanEverything in CVS good ol' days are back !17:58
benschubertNah, they just dropped to only support git17:59
benschubertand this project preferred moving to sourceforge rather than move to git17:59
tristanI meant ruamel.yaml moving to sourceforge so they could... use CVS but... yeah same difference I guess17:59
tristanWell, it does remind me of the CVS days, when you really needed to have a "Configuration Manager" for every dev team18:00
cphangbenschubert tristan the main value of libcyaml is that you can define schemas to parse the yaml and map them into structs https://github.com/tlsa/libcyaml/blob/master/docs/guide.md to my knowledge libcyaml uses libyaml under the hood.18:00
tristanCause developers and VCSes are like oil and water18:00
tristanyeah doesn't seem to fit the bill18:01
benschubertcphang: we don't really have a schema, since each plugin has different stuff, so might not be the best fit, but thanks for showing :)18:08
benschuberttristan: you might want to revert your last change: https://gitlab.com/snippets/199201818:08
benschubert*https://gitlab.com/snippets/1992018#note_372759175 sorry18:08
tristanCrap18:09
tristanAnd the change is (A) Use a more generic array implementation (instead of PyObject ** and Py_ssize_t on ValueClass, we use ObjectArray which has a function call for appending values)18:15
tristanAnd change a `list` for an `ObjectArray *`, oh also the new ObjectArray holds references to the objects18:15
tristanbut that should be an atomic operation18:16
tristansurely assigning `(PyObject **)array[idx] = <PyObject>value && Py_INCREF(value)` must be cheaper than list.append(value)18:17
benschubertWhat's the C code generated for both?18:17
tristannot easy to say, because using a list calls into python18:18
tristanand iterating it/popping it also18:18
tristanwhile iterating over a pointer array should be lightning fast18:19
benschubertright18:19
tristanwe don't technically need the ref/unref statements, I just thought they would be safer (in this case, we know these strings are already owned by other means)18:20
tristanI'll try to cook up a more intensive local benchmark here tomorrow18:22
tristanIn any case, this particular run is telling18:22
tristanIt shows that the tiniest things are really having a huge impact on the result18:22
tristanWell, huge, is a vast overstatement18:22
tristanWe are talking about .25 of a second for (project.conf variables) * 600018:23
tristanstill it's a visible difference18:23
benschubertstill around 5-ish% of the run yeah :)18:23
benschubertAnd this is a "simple" 6k elements project18:24
benschubertso for more realistic examples it will probably be worse :/18:24
tristanI'm not sure of that18:24
benschubertyeah that's actually true, cas would be taking a bigger part for bigger projects18:25
tristanbenschubert, the more realistic and complex a project gets, the more composition will occur; but variable resolution will still happen once flatly for each element18:25
tristanIf you have a variety of element kinds and custom variable assignments in elements, this won't affect resolution (much) is my point18:25
tristanyou might end up having a wider value cache18:25
tristan(cache of regex'ed "%{foo} and %{bar}" strings)18:26
tristanthose will indeed be more expensive18:26
benschubertyeah, but the time will be mostly taken by casd looking if things are in cas right?18:27
benschubertwell, I'm sure of that actually ;)18:27
tristanSure but that's another aspect18:27
tristanI don't think that technically a larger artifact should take a longer time to verify if it is cached, though18:28
tristanevery artifact should take the same time to check18:28
benschubertOk, should we remove a part of the "optimizations" that aren't really optimizations and just ensure the code is the easiest to maintain right now? And we can revisit later maybe?18:28
benschuberttristan: would it? You'd need to check each blob18:28
tristanbenschubert, I am using the word "should" here :)18:29
tristanI don't know what it is18:29
tristanBut definitely, it should18:29
benschubertwe'd need an indexed cas for that (and thus know what's stored), I don't think casd supports that yet? juergbi will know more about this subject than me though18:30
tristanCache expiry is certainly hard to get right but, we shouldn't have to verify all the blobs related to a ref just to determine that the ref is valid18:30
tristanAt worst, we'd find out when it's time to build something that we were unable to store all the elements required for the build in the local cache size18:30
tristanand have a late stage failure18:31
juergbiBuildStream can't handle that18:31
juergbiso we have to check up front18:31
juergbiwe obviously don't want the whole build session to fail if a blob is missing18:31
tristanWe can bail out and say "sorry you wasted your time, next time get a larger disk"18:31
juergbiwe already do this if the cache is not large enough for the current session18:32
juergbihowever, at the beginning of the session we check existing artifacts whether they are completely available18:32
juergbiseparate scenario18:32
tristanjuergbi, what's the worst that can happen though ? You wouldn't need to rebuild everything in a subsequent session, you'd have to increase your `quota:` in your config, give it some more space, and rebuild18:32
tristanand then you will only build the missing blob and continue from there18:32
juergbifor that we need to know that the blob is missing18:33
benschuberttristan: the problem is if half an artifact is missing18:33
tristanjuergbi, Right, I'm guessing that check is overkill right ? As soon as we remove blobs shouldn't we be removing refs ?18:33
juergbiwhich means that we have to check each blob for each artifact18:33
juergbino, that's what we used to do18:33
juergbiwhich resulted in expiry taking 24h18:33
* benschubert remembers this with dread18:34
juergbiso we switched to the blob-based expiry18:34
benschubertit was more like 72 hours but hey :)18:34
cphangthis feels like a referential integrity problem..18:34
juergbiwho's counting ;)18:34
tristanyeah18:34
tristanCould this not be solvable with like, sqlite ?18:34
juergbiwe could probably do something more fancy if we move to a proper database18:34
* tristan beat you to it18:34
tristanheh18:34
juergbihehe18:34
juergbicreating an index in a database could certainly help18:35
juergbibut it's also not trivial18:35
juergbiand we also don't want to overcomplicate casd18:35
juergbi(although a simple database would likely make sense even if we keep blob-based expiry)18:35
cphangBut  you *only* need to guarantee that referential integrity over the duration of a BuildStream invocation.18:35
tristanIn that scenario, we could run a GC of refs at the beginning of a session and hopefully that GC is less costly when there are no refs to cleanup18:35
benschubertcphang: you'd need to build it each start then, which takes some time18:35
benschubertcphang: easily multiple minutes with large-ish projects18:36
juergbifollowing all references can take a very long time18:36
juergbibtw: the buildgrid cas server now has an indexed cas with expiry support18:36
* cphang feels like Bazel has faced similar problems in the past..18:36
tristanbenschubert, every time ? or only when blobs have expired ?18:36
benschuberttristan: not sure I understand the question?18:37
juergbiI think keeping blob-based expiry is the right approach, however, we can hopefully speed up the blob checks with a database index18:37
benschubertthat's what I would be thinking18:38
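A hedged sketch of what such an index might look like; nothing like this exists in buildbox-casd today, and the table and column names are invented:

```python
import sqlite3

# One row per CAS blob, keyed on its digest, so "which of these blobs are
# missing?" becomes an indexed lookup rather than a filesystem check each.
conn = sqlite3.connect("cas-index.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS blobs ("
    " digest TEXT PRIMARY KEY,"       # content hash of the blob
    " size_bytes INTEGER NOT NULL,"
    " last_used REAL NOT NULL)"       # timestamp, for LRU-style expiry decisions
)

def missing_blobs(digests):
    # Return the subset of digests that are not present in the local index.
    if not digests:
        return []
    placeholders = ",".join("?" for _ in digests)
    present = {row[0] for row in conn.execute(
        "SELECT digest FROM blobs WHERE digest IN (%s)" % placeholders, list(digests))}
    return [d for d in digests if d not in present]
```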
tristanKnowing benschubert's definition of "large-ish" I think I would be happy to spend a couple of minutes at `bst build` launch time, every time the last build resulted in expiry of blobs18:38
cphangjuergbi buildbarn has an ADR basically on this issue: https://github.com/buildbarn/bb-adrs/blob/master/0002-storage.md, to satisfy the guarantee that CAS blobs cannot expire over the invocation of a Bazel build18:38
tristanbenschubert, I mean if nothing expired last run, it should return right away18:38
tristanno refs to GC18:38
benschuberttristan: a couple of minutes at each start is prohibitive, and things expire quickly, I'm regularly working with 2TB+ of cache :)18:38
juergbicphang: we have a protection of session blobs in casd as well. the issue we're talking about is integrity across sessions18:39
benschuberteven casd cleanup when going over takes north of 15 minutes18:39
benschubertanyways, I'll drop off now :) have a good evening/morning/night/other !18:39
juergbitristan: also, with support for partial CAS (e.g. intentionally only storing the directory objects without the files locally) reference-based expiry probably doesn't make sense18:40
juergbithis can be very useful with remote execution where the client doesn't need the intermediate blobs18:40
persiaI think object lifecycle management is a known-hard problem.  There are some reasonable examples in the HBase sources, but I've definitely encountered people who complained about near-quota behaviours causing dependency-production-and-purge cycles preventing query resolution.18:41
persiaDon't fear using an actual database for it: maybe sqlite, or maybe some custom columnar store.18:41
tristanindexing the blobs for improved "check if it's present" can work too18:41
tristanI guess the shape of data plays into this too18:41
juergbiyes, likely18:42
tristanWhen people have an artifact with many files, they are doing it wrong already18:42
juergbidon't forget sources. they are also in CAS18:42
juergbialthough, they don't have to be checked that frequently, I suppose18:42
tristansome few artifacts (compositions) might have many files18:42
benschuberttristan: I fear we have a different baseline for what "many" files are :'D18:43
tristanbenschubert, I doubt it... I mean, having build elements with huge amounts of files in the artifacts, is akin to writing huge programs in one big file.c18:43
tristancommon sense dictates you want things to be modular18:44
persiaI can think of many reasons to create artifacts that are not compositions that are hundreds of files, and even more reasons to create compositions with thousands of files which are then themselves components in larger compositions.18:44
tristanData18:44
tristanNot code, indeed18:44
persiaI hadn't conceived of data.  My reasons are entirely related to code being compiled into things.18:45
tristanBut even then, if you have language packs for instance, you'd still want to break them into per language artifacts18:45
tristanWell, then we might again have a different definition of "huge", I think that a webkit or firefox installation meets the criteria of "huge" when it comes to software builds18:46
juergbilong build time and lots of files in the artifact are fairly independent18:47
persiaIf you have a debian-based system, it might be worth querying which packages you have installed, and running `dpkg -L <pkgname> | wc -l` against each of them.18:47
persiaYou may be surprised about how many end up with a very large number of files.18:47
juergbialthough e.g. libreoffice covers both. almost 33k files for a full un-split build here18:48
persiaYep.  Chromium is another fun one :)18:48
tristanholy crap18:48
persiaYep.18:48
juergbimost are IDL and documentation, though18:48
tristanSee, but we are talking about monolithic cathedrals which I think everyone can agree needs to go through the xorg exercise18:48
tristanwhich was once a similar breed, but sensibly became modular18:49
juergbimaybe they should but we have to deal with real projects18:49
juergbialso, spreading all these files out to multiple artifacts in a single pipeline doesn't reduce the total number of blob checks18:50
tristanYeah, I'm just saying that if the indexing is slow for the (potentially few) projects which are ridiculously monolithic, it's justified to say "You're doing it wrong" ;-)18:50
tristantrue, but it should give you better feedback18:50
juergbipotentially, yes18:51
juergbibtw: I also have vague plans of parallelizing those checks18:51
tristanI've been saying this for years !18:51
juergbicasd supports real threads, not like Python18:51
juergbi;)18:51
tristannobody got around to it yet18:52
tristansilly18:52
juergbiso if we use async requests in Python, we can get this parallel without GIL blocking us18:52
tristanI don't think GIL is a question here, we could have been doing this last year18:53
tristanit's just not the right time to do it while benschubert is in the midst of changing the parallelization model18:53
juergbiwell, without casd it would be fairly difficult to really parallelize18:53
tristanif we had CacheCheckJob in the first place, it wouldn't change much18:53
juergbias the Python threads probably wouldn't be worth it18:53
tristanNo no, we parallelize through jobs in the scheduler18:54
tristanright now that's done with multiprocessing18:54
tristanif benschubert changes it to threads, then it's threads18:54
juergbiwould have to check whether the overhead is not too high for these checks18:54
tristanThat could have worked a long time ago18:54
juergbiwith the fork overhead, I'm not sure18:55
juergbifork in BuildStream is far slower than raw fork typically is in Linux18:55
tristanReally ?18:55
juergbiyes, it shows up very high on profiles18:55
tristanWhat for ?18:55
juergbiI'm not sure, I suspect atfork handlers18:55
juergbigRPC could be involved18:55
tristanPython or smth I guess18:55
tristanEven then, we could have easily batched them18:56
tristanthrow 100 elements at each CacheCheck job18:57
juergbiyes18:57
juergbiit's mainly a question how to refactor the code to allow this18:58
juergbias it's currently handled by each element18:58
juergbianyway, not something for today18:58
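A sketch of the batched, asynchronous shape those checks could take; check_blobs here stands in for whatever FindMissingBlobs-style call casd exposes, and its name and signature are assumptions:

```python
import asyncio

# Hypothetical: issue many cache checks concurrently instead of one per job.
async def check_missing(check_blobs, digests, batch_size=100):
    batches = [digests[i:i + batch_size]
               for i in range(0, len(digests), batch_size)]
    results = await asyncio.gather(*(check_blobs(batch) for batch in batches))
    # Flatten the per-batch lists of missing digests into one list.
    return [digest for missing in results for digest in missing]
```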
tristanThe scheduler has element specific jobs, but for a long time now that has been a subset of its jobs18:59
tristanbut yeah, better to not fiddle in there while benschubert is in it18:59
juergbithere are no non-element jobs anymore, iirc, however, we could probably re-add that if needed19:01
tristanThe framework is quasi in place anyway19:03
tristanwhat with the scheduler resources and such19:03
tristanof course the logging has to be different for such batch jobs, but that seems neither challenging nor worthy of concern :)19:04
benschubertpython being slow for fork:19:20
benschubert-> close the event loop, all file descriptor + other stuff19:20
benschubert-> the refcounting GC means copying all the memory pages on fork19:20
benschubertOtherwise I'm not sure I understood everything juergbi: would you be doing the async requests in grpc in the scheduler or before?19:20
benschubertIf before: that would not affect me19:20
benschubertif during: I think that will be great, I wanted to try the asyncio support of grpc now too19:20
juergbiI was planning before19:21
juergbiit's not something I'm working on right now, though19:21
juergbiand probably also not anytime soon19:21
benschubertyeah if it's before I would not be affected, so you can work on that if/when you want19:21
benschubertjuergbi: btw, how long to run the non integration tests on your system again? (and which parallelism?)19:21
juergbihave to rerun to check19:22
benschubertrouughly, don't need super accurate :)19:22
juergbiwith integration but without coverage it's 2min 15s with -n 1219:23
juergbi(Python 3.8)19:23
juergbiwith coverage it's a minute more or so (and problematic with 3.8)19:23
benschubertok! thanks, so threading is indeed slightly slower even on the test suite for now19:23
benschubertI still get occasional lingering threads but I've got no idea why19:24
juergbido they cause any issues?19:24
benschubertNo, but it would be cleaner for testing no?19:25
juergbior is it just about making sure it's all working cleanly19:25
benschubertUnless we don't care and I remove the check entirely19:25
juergbiyes, definitely19:25
juergbiwithout integration, without coverage: 1353 passed, 150 skipped in 52.12s19:25
benschubertok my tests take 2m12 with -n10 :/19:25
benschubertoh well19:25
juergbi-nocover?19:26
benschubertI'll try a bit further then cleanup my branch and make a WIP PR with thread checking disabled for tests19:26
benschubertpytest -n 10 directly19:26
juergbiah ok19:26
juergbiis tox behaving now?19:26
benschubertnope19:26
juergbi:-/19:26
benschubertstill getting lingering threads with tox19:27
benschubertIt's 4 threads all the time19:27
juergbiah, the lingering thread issue is only with tox, not with pytest itself?19:27
benschubertand they are not python threads19:27
benschubertwith pytest rarely (once every 2-3 test runs)19:27
juergbido you use a shared gRPC channel for casd?19:28
benschubertwith tox all the time, 90% of the tests fail19:28
benschubertI think so, I haven't changed how that was handled19:28
juergbihave you removed the explicit closing of gRPC before spawning new job?19:28
juergbithat shouldn't be necessary anymore19:28
benschuberthttps://gitlab.com/BuildStream/buildstream/-/compare/master...bschubert%2Fno-multiprocessing#57eb6e43c19557b544ffb8be1ee3811ea76c8d95_558_552 yes AFAICT19:29
benschubertmaybe do you see something obvious on the diff?19:30
juergbiyes, you've removed it: prepare_fork19:32
benschubertinteresting: tox -e venv /bin/bash then pytest fails19:38
benschubertso, that is something related to my environment19:38
benschubertthat's fishy19:38
* tristan would hope that any parallelism is done with Scheduler invocations, for consistency, and to follow the same rules laid out in the Resources19:57
tristanScheduler can be invoked multiple times with different queues19:58
juergbiit only makes sense if any actual scheduling is required20:01
juergbiI have to look into it again but it's possible batching is all that's needed, in which case it's simpler and more efficient without involving scheduler20:02
benschubertjuergbi: I have a working theory for my threading problem: if I don't start the cacheusagemonitor at all (now a thread in my implementation), then I don't have those lingering threads. So it seems that having two threads sharing the connection causes problems? Is there anything weird around GRPC I need to close/handle/lock?20:03
juergbiI'm not familiar with Python gRPC's API semantics with regards to multiple (Python) threads20:04
juergbihowever, why would it be different for cacheusagemonitor and job threads?20:04
benschubertI have no idea20:05
juergbiunless job threads are simply less likely to conflict timing wise in the tests20:05
benschubertnot starting it makes all tests pass but one (the one around checking the cache usage)20:06
juergbiis the cacheusagemonitor thread shut down / joined properly at the end of each 'cli' invocation?20:06
benschubertYep20:06
benschubertDo you see something blatant here: https://gitlab.com/BuildStream/buildstream/-/compare/master...bschubert%2Fno-multiprocessing#eee68cb7737cd5a69ac2b8ddcf0119307cf48977_734_739 ?20:07
juergbiand done in such a way that it's not forcefully torn down in the middle of a grpc request?20:07
benschubertwait, seems like some tests actually still fail20:07
benschubertso it's not perfect20:07
benschubertbut better20:07
benschubertso it seems like this is indeed the problem, grpc not shutting down nicely20:07
juergbiself._connection.close() in CacheUsageMonitor seems odd to me20:08
juergbiI don't think the CacheUsageMonitor owns the connection20:08
juergbibut no idea whether that's related to the issue20:09
juergbihow/where do you terminate this thread?20:09
benschubertit actually has no effect, removing it doesn't fix the problem20:09
juergbiand do you know how exactly this is done internally? trigger a Python exception in that thread?20:09
benschuberthttps://gitlab.com/BuildStream/buildstream/-/compare/master...bschubert%2Fno-multiprocessing#eee68cb7737cd5a69ac2b8ddcf0119307cf48977_134_136 for shutdown20:09
juergbiah, right, that seems clean20:10
benschubert(adds a 5s delay so will have to do it differently after) but yeah :/20:10
benschubertand I tried joining the thread before closing the connection right after20:10
benschubertand this didn't change either20:11
juergbiI would definitely do the join before20:11
juergbias it might still be using the connection20:11
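A generic sketch (not the actual CacheUsageMonitor) of the shutdown order being suggested here: signal the monitor thread, join it, and only then close the shared channel, so no request is torn down mid-call. Using an Event for the poll also avoids blocking a full interval on shutdown.

    import threading

    class UsageMonitor(threading.Thread):
        # Illustrative monitor thread; 'connection.query_usage()' is a
        # placeholder for whatever RPC the real monitor performs.
        def __init__(self, connection, interval=5.0):
            super().__init__(daemon=True)
            self._connection = connection
            self._interval = interval
            self._stop = threading.Event()

        def run(self):
            # wait() returns early as soon as stop() is called, so shutdown
            # does not pay the full polling interval.
            while not self._stop.wait(self._interval):
                self._connection.query_usage()

        def stop(self):
            self._stop.set()

    # Shutdown order: monitor.stop(); monitor.join(); connection.close()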
juergbiI'd suggest trying to find documentation on python gRPC thread safety20:12
juergbisharing a channel might not be supported at all or require certain precautions20:12
juergbialso have to wrap my head around GIL in this context20:13
juergbiwith GIL presumably fewer things go wrong than without, but it's not magic thread safety20:13
benschuberthttps://github.com/grpc/grpc/issues/9320 it seems correct19:13
juergbiI'm not used to this odd middle way20:13
juergbiok, so the gRPC side should be fine, hm20:14
benschuberthttps://github.com/googleapis/google-cloud-python/issues/327220:16
benschubertSo the 4 are indeed the grpc threads20:16
juergbiour _establish_connection is not really thread-safe20:17
juergbidon't know whether this could cause this issue, though20:17
benschubertooh good point20:18
juergbibut should be fixed in any case20:18
benschubertwell I can easily lock it and see what happens :)20:18
juergbiyep20:18
benschubertor force creation at the start anyways20:18
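For the _establish_connection race, a generic double-checked-locking sketch (an assumed shape, not the real BuildStream code): without the lock, two threads that both see no channel each create their own, which matches the duplicate-connection symptom.

    import threading

    import grpc

    class ConnectionHolder:
        # Lazily create one shared channel, guarded by a lock so concurrent
        # callers cannot each establish their own connection.
        def __init__(self, target):
            self._target = target
            self._channel = None
            self._lock = threading.Lock()

        def _establish_connection(self):
            if self._channel is None:          # fast path, lock not taken
                with self._lock:
                    if self._channel is None:  # re-check under the lock
                        self._channel = grpc.insecure_channel(self._target)
            return self._channel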
benschubertjuergbi: oh gosh, that was it20:22
juergbi:)20:22
benschubertso we'd actually create two20:22
benschubertuh, more errors, apparently not the only part x')20:23
tristanjuergbi, I think that one would want to throttle the number of parallel checks; querying for cached state can mean 3 things; (A) Checking cas for cached artifact, (B) Checking cas for cached sources, (C) Checking local disk for sources20:30
tristanWouldn't it make sense not to launch all of that for hundreds of elements simultaneously? If it would: that is scheduling, I think20:31
benschuberttristan: we mean at the start, before showing all elements, getting their state as we need to prepare the scheduler. It makes sense to batch all of them in a single request to casd, which should make it as fast as possible :)20:34
juergbiyes, it really depends on where/how we do the batching20:35
*** xjuan_ has joined #buildstream20:40
tristanI mean, the scheduler takes care of throttling parallel jobs so as not to overstress the hardware20:43
tristanThe user gets to say things like how many fetches and builds go on in parallel20:43
tristanIf we're going to launch jobs which need to be throttled and run in parallel, and we have a Scheduler with the Resources stuff to control such throttling, we should not re-invent that from scratch, but reuse the Scheduler object and Job harnesses to do so20:44
*** xjuan_ has quit IRC20:45
tristaneven if it appears easier to write something simpler that doesn't use the scheduler, there's no reason why using the scheduler cannot be as simple20:45
* tristan believes he had this conversation before, upon finding other threads/subprocesses/multiprocessing hacks in startup phase, over a year ago20:46
* tristan would just like to use one tool for the job consistently, and not consider that the Scheduler is only for the main build process20:47
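To make the throttling point concrete, a standalone sketch of bounding concurrent cache checks with a semaphore; in BuildStream terms the Scheduler/Resources machinery would own this limit (and the user-facing knobs) rather than an ad-hoc constant like the one assumed here.

    import asyncio

    MAX_PARALLEL_CHECKS = 4  # assumed limit, for illustration only

    async def throttled(semaphore, element, check):
        async with semaphore:   # at most MAX_PARALLEL_CHECKS run at once
            return await check(element)

    async def run_checks(elements, check):
        semaphore = asyncio.Semaphore(MAX_PARALLEL_CHECKS)
        return await asyncio.gather(
            *(throttled(semaphore, e, check) for e in elements)
        )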
juergbiI would definitely not reinvent a scheduler in buildstream20:47
juergbihowever, e.g. casd (as a gRPC server) has a thread pool20:47
juergbiand essentially letting that take care of scheduling for the startup check could also make sense20:48
tristanRight, but launching 4 threads in a hard-coded manner to do something in parallel at the beginning would not look like inventing a scheduler, but it would still be one20:48
tristanAlso, I think that the jobs need to check local disk, not only do gRPC jobs20:48
juergbiyes, I also wouldn't want to do that20:48
juergbiit's casd that performs the actual disk checks (updating timestamps)20:49
tristaneach element needs to check local disk to get Source consistency (in the absence of source cache)20:49
tristanSeems logical to couple all cache checks (including local disk) into one group20:49
juergbiyes, so there is the non-CAS part first20:49
tristanI think the non-CAS part comes after20:50
tristanWe only need to check the disk if the source cache is empty? Or maybe we still need to know, not sure20:50
juergbifirst check is artifact, we only check anything source-related if we don't have an artifact20:50
juergbiand the main part of the artifact check is CAS20:50
juergbiif the artifact is missing or incomplete, we check the source cache, which is very similar to the artifact check20:51
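The ordering described above, written out as a hedged pseudo-flow (the helper methods are placeholders, not real API):

    def query_cached_state(element):
        # 1. Artifact first: mostly a CAS/casd query.
        if element.artifact_is_cached():         # placeholder helper
            return "artifact-cached"
        # 2. No complete artifact: check the source cache (a similar query).
        if element.sources_are_cached():          # placeholder helper
            return "sources-cached"
        # 3. Otherwise fall back to checking sources on local disk.
        if element.sources_available_locally():   # placeholder helper
            return "sources-local"
        return "needs-fetch"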
tristanRight, in builds we can probably even discard the checks for source consistency but, I have a feeling that's not a long term reliable rule20:51
tristanSome scenarios might require knowing20:51
juergbisource consistency, i.e., having ref doesn't involve the disk, though, right?20:52
juergbi(except .bst and project.refs)20:52
benschubertit can, if the source needs to be staged before it has a ref20:52
tristanconsistency includes whether it's cached and whether the ref is known or not (tristate)20:52
benschubert(local sources, workspaces)20:52
tristanbut consistency has been replaced under the hood20:52
tristanI still think it's a decent thing to bring back if only for `bst source show` output20:53
juergbibut you don't want to pay the price for all other operations20:53
tristan(not implementation-wise, there are good reasons why this was broken down into 2 methods under the hood)20:53
juergbiright, can be built on top20:53
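The tristate mentioned above, sketched from memory of the old Consistency notion (names and values are assumptions and may differ from what BuildStream actually had), built on top of the two lower-level checks rather than stored directly:

    from enum import Enum

    class Consistency(Enum):
        INCONSISTENT = 0   # no ref known for the source
        RESOLVED = 1       # ref known, but the source is not cached
        CACHED = 2         # ref known and the source content is cached

    def consistency(source):
        # Derived from the two underlying checks (placeholder helpers).
        if not source.has_ref():
            return Consistency.INCONSISTENT
        return Consistency.CACHED if source.is_cached() else Consistency.RESOLVED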
juergbibtw: are you still awake?20:54
juergbior again20:54
tristanI also suspect we may hit a turning point where sources might be required by reverse deps20:54
* tristan is about to crash :)20:55
tristancan't sleep early this week20:55
tristancython has put me into a sleep at 7am pattern20:55
juergbievil cython20:55
*** dftxbs3e has quit IRC23:06
*** dftxbs3e has joined #buildstream23:23
