*** tristan has quit IRC | 06:59 | |
*** benschubert has joined #buildstream | 07:27 | |
benschubert | tristan: oh ! great :D | 07:27 |
*** tristan has joined #buildstream | 07:35 | |
*** ChanServ sets mode: +o tristan | 07:35 | |
tristan | benschubert, could you launch another benchmark on the latest please ? | 08:00 |
tristan | Just see how we're doing | 08:00 |
benschubert | sure, I'll do that in 30 minutes, just need to finish some urgent stuff :) | 08:00 |
tristan | sure | 08:01 |
tristan | I think this one is a winner, I'll go deeper anyway but would just like to know :) | 08:01 |
benschubert | tristan: running! | 08:13 |
*** hasebastian has joined #buildstream | 08:13 | |
benschubert | By the way does someone remember how jennis generated the debian project? I could not find the script to do that :) | 08:13 |
Frazer61 | hi, does anyone have any thoughts on the `bst source show` feature? Since the secure downloads may not be entirely possible for the time being, as it wouldn't be very well defined, this is the next best thing | 08:33 |
*** santi has joined #buildstream | 08:37 | |
benschubert | Frazer61: that might be better discussed on the ML :) One thing I could see is showing all the `config` for each source in the given element, or something like that? | 08:38 |
Frazer61 | tristan wanted to discuss it on the irc chat to get others involved more, so "Better to have this conversation in public." at least for the brainstorming | 08:41 |
coldtom | imo it might be better to open up some kind of interface to the yaml loader as public API, and have this source checking in some python script | 08:43 |
Frazer61 | so avoid bst source show ? | 08:46 |
benschubert | I'd rather go with the `source show`, it's easier to open a cli interface than an API for that | 08:49 |
tristan | I suggested that we have this discussion on IRC because, with the mailing list, this will be a long drawn out affair | 08:51 |
tristan | It may still be, but having some direction to get started on the mailing list would be helpful I think | 08:51 |
benschubert | tristan: as long as the important part of the discussion is on the ML that's fine :) | 08:52 |
tristan | Yeah of course | 08:52 |
tristan | I think that Frazer61 is not currently armed with all the context he needs to post a full proposal | 08:52 |
tristan | One thing I think we need to consider here are the needs of collect_manifest | 08:53 |
benschubert | right | 08:53 |
tristan | I.e., having `bst source show` give us URLs requires some kind of plugin interaction I think, if we do that and don't consider the rest, we've kind of wasted some opportunity | 08:54 |
tristan | Core access to raw 'config', what are the benefits and drawbacks ? | 08:54 |
tristan | I think off the bat, it doesn't allow a consumer of the API (either Source API or command line), to know what it is looking at | 08:54 |
tristan | I.e. there is nothing standard at all about source `config` data, except for the `directory` it will be staged in | 08:55 |
tristan | The rest is a free for all, steered by general convention | 08:55 |
tristan | The benefit of course is that we could implement something fast and easy, and hand over the problem space to the consumer, I don't think that is ideal | 08:56 |
tristan | Especially considering, it will be the same as https://gitlab.com/BuildStream/bst-plugins-experimental/-/issues/2 | 08:57 |
benschubert | On the other hand, project authors have the knowledge of what sources they are using, so they might be best placed to decide what to look for? | 08:57 |
tristan | Even if we report a 'kind' string for each source, the end user doesn't have the right to know what that 'kind' string means | 08:57 |
benschubert | Why don't they? | 08:58 |
tristan | because any project can introduce a plugin with any kind string | 08:58 |
tristan | the `git` source on your toplevel might be something completely different from the `git` source of a junctioned project | 08:58 |
tristan | and the junctioned project can change its `git` source later on to mean something different | 08:58 |
benschubert | So for example, if I want a linter that ensures only a subset of all 'kind' plugins are used in my project, how would I do that without it? | 08:59 |
Frazer61 | freedesktop-sdk is like that i think | 08:59 |
tristan | benschubert, you don't have that ability | 08:59 |
benschubert | tristan: plugins are defined in a scope, should we scope the `kind` ? | 08:59 |
benschubert | tristan: I think that might be a common thing asked for projects to be able to lint on | 08:59 |
benschubert | I know you can't now | 08:59 |
benschubert | but that's definitely something we should consider | 09:00 |
tristan | benschubert, I've been thinking about extensibility of core plugins, and in that context; I think we need to make use of core plugins explicit (in the project.conf origins) and also implement an `alias` feature to rename plugins whose names might otherwise clash | 09:00 |
tristan | benschubert, I think perhaps we could have something like plugin identities at some point, but maybe that is completely orthogonal to providing concise information through `bst source show` and Source python APIs ? | 09:01 |
tristan | benschubert, for instance, it might make sense to associate a hosting/url/unique identity with a plugin, but I *think* we can have that conversation separately ? | 09:02 |
benschubert | tristan: not 100%, in a `bst source show` I would definitely expect to see a kind :) | 09:02 |
tristan | We don't show a kind in `bst show` do we ? | 09:02 |
coldtom | just as an aside, opening up some interface to the yaml loader would allow replacements for collect_manifest, collect_integration and any other build metadata -> sandbox plugin | 09:03 |
tristan | coldtom, not in a useful way | 09:03 |
tristan | coldtom, i.e. not by just exposing raw nodes, because as stated in this backlog, raw nodes are not meaningful | 09:03 |
benschubert | tristan: I don't know, the way we went with was parsing the yaml ourselves, and forbidding anything that is not basic yaml (not that this is a good solution) | 09:03 |
benschubert | but sure, let's not take too long on that point | 09:04 |
tristan | coldtom, the configuration of a plugin is only meaningful to the plugin parsing it | 09:04 |
tristan | I think that it's important for me as a user of `bst source show`, to know 2 things primarily... (A) A URI or a list of URIs... (B) Some additional identifying information for each URI reported for that source | 09:05 |
benschubert | tristan: one thing that I would personally find very useful would be able to get the parsed configuration for the sources | 09:05 |
tristan | I would not be particularly pleased if I had to have intimate knowledge of Source configurations in order to derive these meaningful aspects | 09:06 |
tristan | benschubert, Why ? | 09:06 |
tristan | And what is "the parsed configuration" ? | 09:06 |
tristan | Maybe we are saying the same thing ? | 09:06 |
tristan | For (B) above, I would consider that like a checksum of a tarball URI, or a commit sha of a git URI | 09:07 |
tristan | Something uniquely identifying | 09:07 |
tristan | Of course raw YAML wouldnt give you that anyway, because refs are possibly in project.refs | 09:07 |
benschubert | tristan: Here are a few things I want to ensure on a project list of sources: | 09:08 |
benschubert | - every uri-type field is using an alias | 09:08 |
benschubert | - Only plugins in a given list are allowed | 09:08 |
benschubert | yeah I don't mean raw yaml | 09:08 |
tristan | Right :) | 09:08 |
coldtom | ah, i should clarify, when i say the loader, i actually mean "some API to grab Elements and Sources given an element of input" | 09:08 |
benschubert | but rather a dump of the source's configuration after parsing | 09:08 |
tristan | benschubert, for the list of plugins, I think we need to consider that separately because there is nothing to identify them afaics, but we should keep in mind that that needs to be extensible | 09:09 |
tristan | i.e. for "only given plugins are allowed" | 09:09 |
benschubert | tristan: I'm fine with that :) | 09:09 |
tristan | Another thing I would want to know is cached state of course, like bst show | 09:10 |
tristan | Whether (A) The source is cached locally, (B) The source has a known ref and can be fetched, (C) The source cannot possibly be fetched because there is no ref, and so it needs to be tracked | 09:11 |
tristan | i.e. the Consistency enum state of a Source | 09:12 |
benschubert | right | 09:12 |
benschubert | having something similar to the `bst show --format` (anything that makes sense also in the source context) | 09:13 |
tristan | Orthogonal tangent we can discard: There is a scenario with remote execution where we might want to consider "cached locally" as "cached on the RE build cloud"... that would affect `bst show` (artifacts) and `bst source show` (sources) equally, so I think it is irrelevant for this conversation | 09:14 |
tristan | Yeah, for the command line we'd want format | 09:14 |
tristan | What bothers me about `bst source show` in general, is how to represent lists | 09:14 |
tristan | Do we output YAML ? | 09:15 |
tristan | for those details ? | 09:15 |
benschubert | I'd love that, it's so much easier to parse | 09:15 |
tristan | Sources have 0-N URLs | 09:15 |
tristan | (and as such, refs, although internally a ref might be a list) | 09:15 |
benschubert | refs can be anything | 09:16 |
benschubert | we have some refs that are json blobs | 09:16 |
tristan | Yes, but I think we want to have a concept of "A unique identifier for this URI" in an outward facing API right ? | 09:16 |
tristan | So it would be a Source's responsibility to provide that | 09:17 |
tristan | Even if it's only for the purpose of `bst source show`, or use in other plugins which want to make queries | 09:17 |
tristan | There is also an interesting facet about Source URIs; their URIs lists are variable in length | 09:18 |
tristan | Sometimes a Source has 0 or 1 URI, but more URIs after tracking | 09:18 |
tristan | Like git submodules, or rust crates | 09:19 |
benschubert | tristan: having `bst source show` ask each plugin to give the info seems a good idea | 09:19 |
benschubert | so the ref can be formatted nicely, etc | 09:20 |
benschubert | that could return a MappingNode, with all the info | 09:20 |
tristan | Right, maybe the API needs to be a string based query | 09:20 |
tristan | Ah you mean, trust the plugin to format a dictionary in a specific format ? | 09:21 |
benschubert | like in some refs, there are some things that are "internal" to the plugin that you don't need to expose | 09:21 |
benschubert | yep | 09:21 |
tristan | Can we do away with the trust thing ? | 09:21 |
tristan | hehe | 09:21 |
benschubert | well, yes? Because as you said the only thing that knows what's important is the plugin :) | 09:21 |
benschubert | hence why we don't want a 'uri' checker in Core | 09:21 |
tristan | I'd much prefer to have a more controlled API which the core can query sources with, like: urls = Source.get_info(InfoType.URLS) | 09:22 |
tristan | And the core creates YAML (or not) as needed for output on the CLI | 09:23 |
tristan | get_info() or such could be private _get_info() until we are absolutely sure that Elements in reverse deps should be allowed to see this information | 09:23 |
tristan | And we guarantee a stable format if the `bst source show` user decided they wanted yaml output (maybe default is something like `bst show` one liners ?) | 09:24 |
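[editor's note: the query-style API tristan sketches here could look roughly like the following toy model. `InfoType` and `get_info()` are hypothetical names lifted from the chat, not real BuildStream API; the point is only that the core queries the plugin with a controlled key and owns the output formatting.]

```python
import enum


class InfoType(enum.Enum):
    # Hypothetical query keys, per the chat discussion
    URLS = "urls"
    REFS = "refs"
    MIRRORS = "mirrors"  # deliberately not implemented below


class Source:
    """Toy stand-in for a BuildStream Source plugin (not the real class)."""

    def __init__(self, urls, refs):
        self._urls = urls
        self._refs = refs

    def get_info(self, info_type):
        # The plugin returns plain strings; the core decides how to
        # render them (YAML, `bst show`-style one-liners, etc.)
        if info_type is InfoType.URLS:
            return list(self._urls)
        if info_type is InfoType.REFS:
            return list(self._refs)
        raise NotImplementedError(f"{info_type} not reported by this source")


source = Source(["https://example.com/repo.git"], ["deadbeef"])
urls = source.get_info(InfoType.URLS)
```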
tristan | Hmmm | 09:25 |
tristan | So that's a good point... what about sources which don't ever have a URI, we already have some of those (except they grow URIs after tracking) | 09:25 |
tristan | Do we need to know that context ? | 09:25 |
tristan | URIs vs "Discovered URIs" ? | 09:26 |
benschubert | and then what about things that use `uri` as something else? Like, I could envision the `pip` package to use `uri` for the registry. Do we want this? ;) | 09:27 |
tristan | benschubert, in fact, since the core can use Node() APIs to validate fields, the Plugin could return MappingNode and there is no trust issue, I'm not sure if that is more convenient than constructing the mapping node in Core, though | 09:28 |
tristan | Oh, the name of the field in Element/Source configuration is of course irrelevant | 09:28 |
tristan | only meaningful data must be reported | 09:29 |
tristan | Ah and yeah, that | 09:29 |
tristan | The shape of the data indeed varies wildly | 09:29 |
tristan | a url can be a base, I think it is similar for the crate plugin too | 09:29 |
benschubert | well, the registry is important and might not be the same between all of them, though this would not be reported as a `uri`, so what should be? What I'm trying to say is that it's quite hard to have a consistent, non-plugin-dependent representation here | 09:29 |
tristan | benschubert, in that case, `pip` and `crate` maybe have a 'primary URI' which is the base URI which never has a ref, and secondary URIs (all of the individual crates) have refs | 09:30 |
tristan | Yeah it is, but I think that we're only interested in commonality, something that can be understood equally for all plugins | 09:31 |
tristan | If we want to query for URIs, then what are our reasons to do so ? | 09:31 |
tristan | * Policy URI schemes (gnome-build-meta wants this) | 09:32 |
benschubert | ensure they all have an alias set? | 09:32 |
tristan | * Ensure internal mirroring of all input data | 09:32 |
benschubert | (thus you don't want the expanded uri, just the normal one) | 09:32 |
tristan | benschubert, I don't think we want that alias thing in this context | 09:32 |
benschubert | how do you ensure all uris are aliased then? | 09:32 |
tristan | benschubert, We should already have a warning (configurable to fatal ?) if ever an unaliased URI is used right ? | 09:33 |
benschubert | oh maybe | 09:33 |
benschubert | you might be right sorry | 09:33 |
tristan | If we don't we should | 09:33 |
tristan | We already have all the mirroring URI indirection which should make that possible | 09:33 |
tristan | Plugins which use hard coded URIs or discover URIs and fail to use the indirections, are buggy (and could equally hide these unaliased URIs from users in a new API) | 09:34 |
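[editor's note: the alias lint benschubert asks about earlier could reduce to something like this toy check. The `alias:path` shape follows BuildStream's source-alias convention, but the helper and its inputs are hypothetical illustrations, not anything in the codebase.]

```python
def unaliased_urls(urls, known_aliases):
    """Return the URLs that do not go through a declared project alias.

    A URL counts as aliased when it starts with "<alias>:" for a known
    alias (BuildStream-style), e.g. "gnome:glib.git". A raw scheme like
    "https" is not in the alias set, so raw URLs get flagged.
    """
    flagged = []
    for url in urls:
        alias, sep, _ = url.partition(":")
        if not sep or alias not in known_aliases:
            flagged.append(url)
    return flagged


urls = ["gnome:glib.git", "https://example.com/raw.git"]
bad = unaliased_urls(urls, {"gnome", "freedesktop"})
```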
tristan | Frazer61, Taking notes ? | 09:34 |
benschubert | https://gitlab.com/snippets/1992018 btw :) | 09:34 |
tristan | Still slower ?!?!?! | 09:35 |
tristan | Damn | 09:35 |
tristan | crazy | 09:35 |
Frazer61 | tristan yeah, just dont want to interrupt just yet | 09:35 |
benschubert | memory's much better though | 09:35 |
tristan | Frazer61, just checking | 09:36 |
tristan | So another reason is to create manifests | 09:36 |
tristan | reporting on the provenance of build inputs | 09:37 |
tristan | Certainly not for discrete processing of the outputs or automation of tasks based on the output | 09:37 |
tristan | Validation, reporting | 09:38 |
tristan | Checking that internal mirrors have everything we are using, may be an automatable task, but it can't be realistically achieved without the assistance of BuildStream itself | 09:39 |
tristan | Because you can't have knowledge of the meaning of refs | 09:39 |
tristan | You can only know that they are unique for a given URL | 09:39 |
tristan | I think that we are essentially on the same page that (A) The source reports concise data about URIs and refs (B) A source can have 0-N URIs (C) Some URIs are discovered during processing (tracking) | 09:41 |
tristan | And I think that we can enforce that a unique string can be reported for each URI (internally based on "ref") | 09:42 |
tristan | Even if it's not entirely true ? | 09:42 |
tristan | I *think* we don't need to distinguish between primary URIs (somehow encoded into configuration) and discovered URIs | 09:43 |
tristan | We will also need extensibility to support plugin unique identities | 09:44 |
benschubert | I think that's good enough yeah | 09:44 |
tristan | For extensibility, we'll need to define what happens when a Source is asked for something which it has not been implemented to report | 09:45 |
tristan | For a CLI feature, the user probably needs to decide | 09:46 |
tristan | I guess that in this sense, we could start with a Source.list_uris() and Source.get_display_ref(uri) API, and add additional APIs for reporting other things later on | 09:46 |
* benschubert is worried about the number of APIs that might end up being required | 09:47 | |
tristan | In that way we handle a NotImplementedError() or such | 09:47 |
benschubert | but I guess we don't really have a choice | 09:47 |
tristan | Yeah I don't like it either | 09:47 |
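[editor's note: the fallback behaviour being discussed, where the core catches an unimplemented report instead of crashing, might look roughly like this. `list_uris()`/`get_display_ref()` are the hypothetical names from this conversation, and the classes below are toy stand-ins for real plugins.]

```python
class BaseSource:
    """Toy base class; real plugins would subclass BuildStream's Source."""

    def list_uris(self):
        raise NotImplementedError

    def get_display_ref(self, uri):
        raise NotImplementedError


class GitLikeSource(BaseSource):
    def __init__(self, url, ref):
        self._url, self._ref = url, ref

    def list_uris(self):
        return [self._url]

    def get_display_ref(self, uri):
        return self._ref


def show_source(source):
    # Core-side rendering: degrade gracefully when a plugin has not
    # been implemented to report URIs, rather than erroring out.
    try:
        uris = source.list_uris()
    except NotImplementedError:
        return ["<not reported>"]
    return [f"{uri} ({source.get_display_ref(uri)})" for uri in uris]


lines = show_source(GitLikeSource("gnome:glib.git", "abc123"))
```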
tristan | Then again, I'm less concerned about Source plugins - they are high effort and potentially fewer | 09:48 |
benschubert | right | 09:48 |
benschubert | and the base ones will probably be well maintained (for the most used ones) | 09:49 |
tristan | Yup | 09:49 |
tristan | Now if people can stop being annoying and implementing weird stuff like rust, JS and python | 09:50 |
tristan | We won't have to deal with these whacky concepts of language centric distribution vectors | 09:50 |
benschubert | I feel you :) | 09:50 |
tristan | We can only pray :) | 09:50 |
benschubert | pyproject.toml with pep517 is quite fun to go around | 09:50 |
benschubert | at least gentoo had a great shot at it and it's now manageable in a sandbox | 09:51 |
* tristan is genuinely surprised by this latest benchmark :-S | 09:52 | |
tristan | The code is actually running consistently faster with the variables test | 09:52 |
tristan | but still 0.5 seconds behind on 6K debian elements :-S | 09:52 |
tristan | Which means, something about repetition, we are doing the *same thing* many times and should be able to not do it | 09:53 |
tristan | Aha ! | 09:53 |
tristan | no, not really aha :-S | 09:54 |
tristan | The old code had different hashing, but still per-Variables-instance tables | 09:54 |
Frazer61 | so are we in agreement that we will store the URIs or whole URL in a manifest, and bst source show will just read the manifest? | 10:01 |
tristan | Frazer61, not really no | 10:02 |
tristan | Frazer61, We are in agreement that we first need a Source plugin addition for reporting a list of URIs and a corresponding list of refs, both as simple strings | 10:03 |
tristan | Above I suggested get_uris() and get_display_ref(uri), but maybe it's simpler to have a single API which returns two lists | 10:04 |
tristan | That part we can refine in the proposal | 10:04 |
tristan | Frazer61, using that new API, we can implement `bst source show` in such a way that it can use the `--format` argument (similar to `bst show`) to display (A) cached state, (B) URIs (C) refs for a given source | 10:07 |
Frazer61 | i would rather have 2 functions, rather than 1 function returning 2 data lists, but thats just my personal preference | 10:08 |
tristan | Frazer61, some additional thinking needs to go into how the `bst source show` command works, specifically we need to consider how to identify sources of a plugin | 10:08 |
tristan | Frazer61, one function might be easier to implement - remember that in this case, we have few calling instances vs many implementors (and the priority is on ease and clarity for implementors, not callers) | 10:09 |
tristan | The core we can always control, plugins are for people to implement at will | 10:09 |
tristan | Frazer61, from here, it would be great if you could collect the general output of this and try to fill in any gaps you can see (like how to address/list per element sources when calling `bst source show foo.bst`) | 10:10 |
tristan | Frazer61, and provide a writeup for the ML, I would expect a back-and-forth on the ML but not too much | 10:11 |
tristan | With this, collect_manifest could safely be transformed into a shell script that calls `bst show` and `bst source show` | 10:12 |
tristan | or maybe a python script which does the same | 10:12 |
Frazer61 | ok, ill do some experimenting | 10:13 |
tristan | (and of course gnome-build-meta could parse `bst source show --deps all gnome.bst --format '%{uris}'` or similar to easily implement a git accept hook) | 10:13 |
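[editor's note: the accept-hook idea might reduce to a filter like the one below. The output format of `bst source show` did not exist at the time of this conversation, so the one-URI-per-line shape and the prefix whitelist are pure assumptions for illustration.]

```python
def check_allowed_schemes(show_output, allowed_prefixes):
    """Reject any URI whose prefix (alias or scheme) is not whitelisted.

    `show_output` is assumed to be one URI per line, as a hypothetical
    `bst source show --deps all --format '%{uris}'` might print.
    """
    rejected = []
    for line in show_output.splitlines():
        uri = line.strip()
        if uri and not uri.startswith(tuple(allowed_prefixes)):
            rejected.append(uri)
    return rejected


sample = "gnome:glib.git\nfreedesktop:linux.git\nhttp://insecure.example/tarball.tar.gz\n"
rejected = check_allowed_schemes(sample, ["gnome:", "freedesktop:", "https://"])
```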
*** tristan has quit IRC | 10:17 | |
*** tristan has joined #buildstream | 10:58 | |
*** ChanServ sets mode: +o tristan | 10:58 | |
douglaswinship | Can anyone help me with these CI failures? They all happened fairly close together, and they all relied on a freedesktop-sdk branch that I had just rebased to include a bunch of extra commits from master, so I thought they might be connected. But they all look completely different. | 11:23 |
douglaswinship | I think maybe some of them are just runner issues, but I haven't seen them before. | 11:23 |
douglaswinship | The easiest ones are these two: https://gitlab.com/freedesktop-sdk/freedesktop-sdk/-/jobs/617168643 & https://gitlab.com/freedesktop-sdk/freedesktop-sdk/-/jobs/617168812 | 11:23 |
douglaswinship | The error message there is "[00:00:40] FAILURE components/sdl2-ttf.bst: Failed to pull artifact with status UNAVAILABLE: failed to connect to all addresses". | 11:24 |
douglaswinship | Does that just mean the remote caches went down / couldn't be reached, and I should just re-run the job? | 11:24 |
douglaswinship | I'm just surprised that that causes a build failure. I thought caches were there to make the build run faster, but I didn't think they were essential for a build. | 11:24 |
douglaswinship | (There are a few more that look different, but I don't want to completely spam the channel.) | 11:24 |
coldtom | yep that's caused by cache issues douglaswinship | 11:27 |
coldtom | imo that should not cause build failures though | 11:28 |
douglaswinship | Next question: Any ideas on this one? https://gitlab.com/freedesktop-sdk/freedesktop-sdk/-/jobs/616510572 | 11:31 |
douglaswinship | coldtom: and thanks. Will re-run | 11:31 |
douglaswinship | re the last link, the first issue I see is "make: *** [Makefile:968: all] Error 2", and don't see anything before that that suggests any problem to me. So I'm not sure where to start. | 11:33 |
coldtom | not sure on that one, it's possible the error happened further up the build log and it's not caught there, the full log might be more informative | 11:44 |
douglaswinship | Ah yes, this might be more relevant: "make[2]: *** No rule to make target 'cpp.texi', needed by 'doc/cpp.info'. Stop." | 12:15 |
tristan | benschubert, so I'm not sure how wise it will be to dig down as far as the strings, but I've got a working theory that the code is slowed down by interning the variable names | 13:02 |
benschubert | tristan: do you mean that if we stop interning it will be better? | 13:02 |
tristan | benschubert, locally I've run the variables benchmark a bunch of times with different configurations, and it would seem that interned values ("%{foo} and %{bar}" strings) but not interned variable names is better | 13:03 |
tristan | It has to do with frequency of calling sys.intern() vs using those comparisons | 13:03 |
benschubert | yeah fair point, I can re-benchmark a version without interning if you want :) | 13:04 |
tristan | Upon entry into Variables instantiation, interning all the variable strings every time has an expense | 13:04 |
tristan | I pushed one | 13:04 |
* benschubert should set up an on-demand benchmarking system | 13:04 | |
benschubert | ok let me start a benchmark | 13:04 |
tristan | Thanks a lot for being there throughout :) | 13:04 |
tristan | I've been giving consideration to deeper intimate string manipulation, but I worry a bit | 13:05 |
tristan | Python has dragons in their strings it seems | 13:05 |
tristan | it's not as straight forward as "everything is UTF-8" | 13:05 |
benschubert | yeah I would really prefer not going that deep | 13:05 |
benschubert | Happy to help, seems I have quite an easy way of benchmarking, so if it can make buildstream faster, it's not a big cost :D | 13:06 |
tristan | We have some help from https://github.com/cython/cython/blob/master/Cython/Includes/cpython/unicode.pxd | 13:06 |
tristan | But if we do strings, I think it's safer to use straight C API from Python.h | 13:06 |
tristan | Because that lets us go from str -> explicitly utf8 string -> str | 13:07 |
tristan | Whereas the cython wrapper gives you ` char* PyUnicode_AS_DATA(object o)`... (whatever "DATA" is: fend for yourself) | 13:07 |
benschubert | benchmark running! | 13:09 |
tristan | (with `cdef extern from "Python.h"` we can have `const char* PyUnicode_AsUTF8(object unicode)` and the opposite) | 13:09 |
tristan | I was thinking of trying the "".join() in C and tracking the length of each string etc | 13:10 |
tristan | For the final resolutions | 13:10 |
tristan | not sure we gain anything | 13:10 |
tristan | The python `dict` is hard to kick too, if we went full string it would be better to have a C hash table | 13:10 |
benschubert | What I'm wondering now is if it's really worth going that far? The idea was to make the code more readable and understandable (+the lazy resolution for junctions), and I wonder if we are not making it worse if we go too far that road? | 13:11 |
tristan | :( | 13:12 |
benschubert | I mean, if it means that we have a blazing fast variable resolution I'm happy with it | 13:12 |
tristan | I know, its terrible, I don't understand | 13:12 |
tristan | What was there before was confusing lists, plus manually raising RecursionError(), not supporting more than a hard-coded limit of recursion | 13:12 |
benschubert | yeah before is clearly not great | 13:13 |
benschubert | so clean well documented C/Cython code is still probably better | 13:13 |
benschubert | I just would really like us not to ever have a segfault because of that :) | 13:13 |
benschubert | on the other hand, with enough tests, it would probably be good | 13:13 |
tristan | Yeah I wouldn't worry much about that | 13:13 |
tristan | There's not huge opportunity for segfaulting here, except when you introduce code changes initially | 13:14 |
benschubert | right | 13:14 |
benschubert | and the current implementation actually could segfault in very specific cases | 13:14 |
tristan | Well, it will produce a Python stack trace when reaching the limits of recursion, I'm not that familiar anymore with the current implementation heh | 13:15 |
benschubert | the check is not actually perfect, and if you are already deep in a stack you can actually make the code segfault :) but that requires a very weird setup | 13:17 |
benschubert | so would not happen during normal usage | 13:17 |
benschubert | but I think you are right, if we are careful with C and it actually gives us easier-to-understand code and allows us to squeeze out a good chunk of performance here, it's worth it | 13:18 |
tristan | Oh yeah | 13:18 |
tristan | Of course | 13:18 |
tristan | That recursion check is not even accurate | 13:18 |
benschubert | it's a "best effort" check :'D | 13:18 |
tristan | You'd have to hit the glibc stack limit | 13:18 |
tristan | I guess | 13:18 |
benschubert | yep | 13:18 |
tristan | best effort could at least observe Python->internal_stack_depth_counter or smth on entry ;-) | 13:19 |
tristan | and still possibly be wrong | 13:19 |
benschubert | thus your rewrite being much better even if we get more C code | 13:19 |
* tristan does really like it better.... | 13:20 | |
tristan | but I'm obviously biased :) | 13:20 |
benschubert | I'll like it better once it's faster :D | 13:20 |
tristan | really upsetting the performance issues though | 13:20 |
* tristan could add pointer array | 13:22 | |
tristan | benschubert, maybe give a look at the main resolve loop https://gitlab.com/BuildStream/buildstream/-/blob/tristan/partial-variables/src/buildstream/_variables.pyx#L258, let me know if you spot anything ? | 13:23 |
tristan | Obvious things include, I could store deps in a pointer array (possibly realloc() it's length to double it's size any time I add a new item), instead of a python list | 13:23 |
tristan | Even start out pointer arrays with enough space for say 8 values, possibly hitting realloc and then 16, 32, etc | 13:24 |
tristan | 8 * sizeof(PyObject *) | 13:24 |
tristan | Probably it will very rarely need to realloc | 13:25 |
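[editor's note: the growth strategy tristan describes (start with room for 8 pointers, `realloc()` to double the size when full) amortizes to O(1) per append. A small Python model of just the capacity bookkeeping, not the actual C code:]

```python
def grow_capacity(current_capacity):
    # Start with space for 8 entries, then double: 8, 16, 32, ...
    return 8 if current_capacity == 0 else current_capacity * 2


def capacities_for(n_items):
    """List the capacities a doubling array passes through to hold n_items."""
    capacity, history, count = 0, [], 0
    while count < n_items:
        if count == capacity:
            # This is where the C version would call realloc()
            capacity = grow_capacity(capacity)
            history.append(capacity)
        count += 1
    return history


history = capacities_for(20)  # three (re)allocations to hold 20 items
```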
tristan | but the algorithm, checking circular refs and collecting the ordered list of dependencies in the same loop, appears to be pretty solid to me | 13:26 |
tristan | I could be missing something obvious I don't know | 13:26 |
benschubert | tristan: mmh that seems like a lot of context switching right now, might have the brain more free this weekend if you don't mind? | 13:26 |
tristan | Don't worry about it :) | 13:26 |
tristan | The moral support and benchmarking is good enough, don't let me disturb you too much if you're on a roll :) | 13:26 |
douglaswinship | juergbi: You were talking to WSalmon on tuesday about buildbox-casd, and it still having some memory issues even under 0.0.9? | 13:33 |
douglaswinship | I have an example of a build failure in CI. The casd log is here https://gitlab.com/libreml/libreml/-/jobs/618055732/artifacts/browse/cache/buildstream/logs/_casd/ | 13:33 |
benschubert | tristan: I'll have a look this weekend if you want, just that for today it would be too much of a context switch :) | 13:33 |
douglaswinship | " what(): OutOfSpaceException exception thrown at [buildboxcasd_lrulocalcas.cpp:325], errMsg = "Insufficient storage quota" " | 13:33 |
WSalmon | douglaswinship, interesting | 13:35 |
WSalmon | cphang how much disk/RAM dose that runner have? | 13:35 |
tristan | benschubert, I'm sure you will eventually review it anyway, and if it's not really ready for review by this weekend and performing acceptably, then I'll appreciate you looking at it by then too :) | 13:35 |
juergbi | douglaswinship: could the out of space/quota error be correct, i.e., you either don't have enough disk space or quota configured too low to cover a whole bst session? | 13:36 |
benschubert | tristan: seems good! | 13:36 |
juergbi | douglaswinship: it is supposed to handle that case with a better error message (on the BuildStream side), however, it would be interesting to know whether the error is bogus or whether it's just an issue of not properly handling/forwarding this error | 13:36 |
*** cphang has joined #buildstream | 13:37 | |
WSalmon | juergbi, as in it might be crashing while doing disk management, or it might just be the issue and we should report it better? | 13:37 |
juergbi | running out of space should result in a termination of the build session but with a reasonable error message | 13:38 |
WSalmon | +1 | 13:38 |
WSalmon | i seem to remember it doing so at one point | 13:38 |
WSalmon | but not 100% | 13:38 |
cphang | WSalmon https://gitlab.com/libreml/infrastructure/infrastructure/-/blob/master/ansible/roles/gitlab-runner/defaults/main.yml#L44 50GB | 13:38 |
juergbi | we start casd with --protect-session-blobs, which prevents casd from expiring any blobs created or accessed in the current bst session | 13:38 |
juergbi | as buildstream can't deal with blobs suddenly vanishing mid-session | 13:39 |
juergbi | however, casd is supposed to return a suitable grpc error in that case, which buildstream should be handling | 13:39 |
juergbi | and I think we have a test case for this | 13:39 |
WSalmon | yep i thought at one point it did say that in some cases but i might be thinking of bst1.4 | 13:39 |
juergbi | we have test_artifact_too_large to verify this | 13:40 |
juergbi | however, we may hit this in different code paths and might not handle it properly in all of them | 13:41 |
juergbi | (in bst or casd) | 13:41 |
WSalmon | douglaswinship, that job looked like it had built a LOT of stuff, so a simple restart should be enough | 13:41 |
WSalmon | but thanks for bringing this case to our attention | 13:41 |
douglaswinship | So if I'm understanding this right: If buildbox had responded perfectly, then the CI job would still have failed, but buildbox would have given a more helpful error message? (i.e. buildbox-casd isn't the actual reason this pipeline failed?) | 13:42 |
WSalmon | can you document this in an issue, just some links and a copy and paste of the error message, the title should be something like, report filling cas in one run better etc | 13:42 |
douglaswinship | or is it that we're not sure either way? | 13:42 |
WSalmon | douglaswinship, it should give a more helpful message in any case so we should have an issue there but there may be more at play | 13:43 |
juergbi | at first glance it seems primarily like a buildbox-casd bug | 13:43 |
juergbi | it shouldn't terminate | 13:43 |
douglaswinship | dang. That's frustrating. | 13:44 |
WSalmon | but 50G is not a lot for a big rebuild.. | 13:44 |
WSalmon | so it might be a genuine issue | 13:44 |
WSalmon | but not being handled well | 13:45 |
cphang | WSalmon agreed, and a supporting argument for this is that the aarch64 jobs succeeded | 13:45 |
WSalmon | cphang, have you rebuilt everything including FD on those runners before? | 13:45 |
cphang | WSalmon yeh I'm pretty sure, we've handled changes to bootstrap import before w/o cache | 13:46 |
WSalmon | ummm, interesting | 13:46 |
douglaswinship | WSalmon: should I be raising this as an issue in buildstream, or the buildbox-casd repo? (Or both?) | 13:46 |
WSalmon | at least buildstream, as it should be reporting things better | 13:46 |
cphang | As a temporary workaround douglaswinship maybe I'd suggest bumping up the size of https://gitlab.com/libreml/infrastructure/infrastructure/-/blob/master/ansible/roles/gitlab-runner/defaults/main.yml#L44 to 128GB, I'm happy to approve that and I'll restart the gitlab runner (this will probably be end of day). | 13:46 |
WSalmon | juergbi, do you want us to raise something on buildbox-casd about this too? | 13:47 |
WSalmon | sounds like it should have been handled more gracefully? | 13:47 |
juergbi | yes, should be fixed in buildbox-casd | 13:48 |
WSalmon | douglaswinship, ^ | 13:49 |
douglaswinship | WSalmon, juergbi: thanks :) | 13:53 |
*** tristan has quit IRC | 14:03 | |
*** traveltissues has joined #buildstream | 15:00 | |
*** tristan has joined #buildstream | 16:00 | |
*** ChanServ sets mode: +o tristan | 16:00 | |
tristan | benschubert, any word on the internless run ? | 16:00 |
benschubert | tristan: woops forgot about that, let me see | 16:01 |
tristan | .... drumroll ... | 16:01 |
benschubert | https://gitlab.com/snippets/1992018#note_372673299 better but not quite there :) | 16:02 |
tristan | .25 seconds behind | 16:02 |
tristan | Ok next time with arrays | 16:03 |
tristan | I also noticed that I'm not doing the optimization of making sure value dependencies are unique (since they are not interned anymore, the identity comparison is worthless), that means a loop per ValuePart per ValueClass, but as those are cached, they won't be meaningful really | 16:05 |
* tristan can either remove that unique check since "%{foo} and %{foo} and %{foo}" is a rare value expression | 16:05 | |
tristan | or I can make it a string compare, but probably just ignore and leave the redundancy in | 16:06 |
benschubert | tristan: possible yeah :) | 16:35 |
Frazer61 | hi, does anyone know the best way to use the src/buildstream/testing/_sourcetests/project to experiment with the master bst? | 16:39 |
* tristan has ObjectArray, but realizes he should have PointerArray with a free_func & element_size | 17:04 | |
tristan | At least with ObjectArray I now have the deps list in resolve() as an array | 17:05 |
tristan | PointerArray would let me change the ValueParts to aligned chunks in an array | 17:05 |
tristan | Frazer61, I think that those tests are a more difficult angle to work from | 17:06 |
tristan | Frazer61, most of the test cases in tests/ for instance, you can just run BuildStream on them and they work (for the majority which are statically committed tests) | 17:07 |
tristan | Or, the integration tests | 17:07 |
tristan | Frazer61, What I'm doing, which might not be considered "best", is to run `tox -e venv /bin/bash` to run BuildStream | 17:08 |
tristan | I exit the shell and re-enter every time I make code changes | 17:08 |
Frazer61 | i assume not, but does it matter if i use zsh instead of bash? assuming /bin/bash overrides with bash | 17:09 |
tristan | It doesn't matter | 17:09 |
Frazer61 | ok, thanks | 17:09 |
tristan | In zsh, there is no completion | 17:09 |
tristan | In bash, the awesome completion is broken, cause douglaswinship still didn't submit that patch which fixes it | 17:10 |
tristan | The one I know he has hiding in his workspace and keeping all to himself | 17:10 |
tristan | benschubert, you're probably dropping off soon... maybe run one more benchmark before you go ? | 17:12 |
*** santi has quit IRC | 17:12 | |
tristan | And let me know what the result is when you get back online tomorrow ? | 17:12 |
benschubert | tristan: Sure, I'll run that now | 17:12 |
benschubert | perfect timing, since I'll need my desktop soon :D | 17:13 |
Frazer61 | thanks tristan | 17:13 |
*** santi has joined #buildstream | 17:13 | |
tristan | Great, _resolve() is looking very white and beige now hehe | 17:14 |
tristan | in fact, the only yellow line is _resolve() itself | 17:14 |
tristan | `__Pyx_INCREF(Py_None);` <-- that looks to me a little dumb to be generating :-S | 17:20 |
tristan | You'd think that one could rely on `None` existing throughout the course of a python session | 17:20 |
benschubert | True :) | 17:20 |
* tristan wonders how many python bugs revolve around imbalanced refcounts on None ;-P | 17:21 | |
tristan | Those, alongside threading issues, are like the most difficult things to debug | 17:21 |
* tristan should write a little loop which takes valentind's variables testing element and multiplies it N times, each time dumping it as a new element which depends on the last | 17:26 | |
tristan | Then I could have a test more similar | 17:26 |
tristan | Running `bst show` on that small test gives quite erratic results | 17:26 |
tristan | Also, how come a yaml cache doesn't get created, is that obsolete ? | 17:28 |
tristan | benschubert, in your benchmarks, you have show-cached (I presume it's run a second time without removing the cache), there used to be some sort of extra file used to assist the yaml load, is that it ? | 17:29 |
* tristan thinks that was removed in favor of just yaml code that works better all the time | 17:29 | |
benschubert | tristan: no more yaml cache as it was slower yep :) | 17:29 |
tristan | Ok yeah, that's good | 17:30 |
benschubert | No, `show cached` means that you have all your artifacts cached locally | 17:30 |
tristan | Ahh | 17:30 |
benschubert | which is impacted by casd doing the checks :) | 17:30 |
benschubert | It's basically: | 17:30 |
benschubert | From an empty state -> show -> source fetch -> show-fetched -> build -> show-cached | 17:30 |
tristan | Right, I've noticed that the difference between my branch and master is not affected (it's about the same difference all the time) | 17:31 |
tristan | last time .25 seconds in both scenarios | 17:31 |
benschubert | which is expected since you are not touching anything around the 'caching' | 17:31 |
tristan | cython is pretty smart | 17:33 |
tristan | if I do `foo = PyMem_Malloc(...)` and then `if not foo: ... raise MemoryError()` | 17:34 |
tristan | Then cython does `if (unlikely(__pyx_t_1)) {` | 17:34 |
tristan | for the generated `not` clause | 17:35 |
tristan | So it's got some idea/knowledge about the reason why I'm checking my return (perhaps a bit spooky) | 17:35 |
tristan | Or rather, probably PyMem_Malloc() has some annotation to it, indicating that it shouldn't be returning NULL | 17:36 |
benschubert | That's also possible? I'm not familiar enough with the internals of python | 17:37 |
* tristan finds this area particularly fun, I like this level of the stack | 17:37 | |
tristan | As a glib/GObject guy, I've always enjoyed Obj-C | 17:38 |
tristan | Python appears very similar under the thin hood | 17:38 |
benschubert | tristan: One thing I'm very happy about is having someone else looking into all the cython/optimized part of BuildStream, we're going to be able to do much better :) | 17:40 |
tristan | Careful hehe | 17:40 |
tristan | If I get excited, there may no longer be an excuse to keep the python around, and it will all just meld into C | 17:40 |
tristan | heh | 17:40 |
tristan | link the interpreter and invoke it for plugins | 17:41 |
benschubert | tristan: I think a "C" layer for performance-critical stuff, a "cython" layer for performance stuff that's not as critical and "python" for the 90% of the codebase would probably be a good thing :) | 17:42 |
*** dftxbs3e has quit IRC | 17:42 | |
benschubert | But I'd be careful at bringing in "too much" C, as it's potentially harder to involve new people | 17:42 |
tristan | That 90% quickly becomes 10% | 17:42 |
tristan | Right I see what you mean | 17:43 |
*** dftxbs3e has joined #buildstream | 17:43 | |
tristan | I kind of feel like ideally, the high level business logic belongs in a language like python where it's hard to get things wrong | 17:43 |
benschubert | Yep, and the "critical" paths can be optimized in Cython/C when we need | 17:44 |
tristan | the more critical stuff in C would be good, it also kind of scares away people from touching critical stuff | 17:44 |
tristan | Which... I guess has drawbacks as well as bonuses | 17:44 |
benschubert | yep | 17:45 |
*** hasebastian has quit IRC | 17:45 | |
benschubert | which is why I initially went with Cython for the initial optimizations, since it's still very close to python and harder to shoot yourself in the foot | 17:45 |
benschubert | tristan: actually speaking of C, what tool would you recommend to track what's starting and stopping threads? I've still got 4 threads lingering when I run tests in tox :/ (not pytest itself funnily enough) | 17:48 |
tristan | if this flies, I should put the PointerArray in a different file, maybe we can convert the loader code to use C arrays for dependencies and such | 17:49 |
benschubert | we could | 17:49 |
benschubert | I had a general idea of trying to dig more into the "c" interface of ruamel and work more in Cython/C than python there | 17:49 |
benschubert | and so the loader could use this | 17:50 |
tristan | Yeah, I think we just need a real C implementation of round tripping YAML | 17:50 |
tristan | It really makes no sense to have this in python | 17:50 |
benschubert | tristan: actually... | 17:50 |
benschubert | We keep the location in the parser right? | 17:50 |
tristan | ruamel.yaml is in a sad state too, and fails to preserve formatting when round tripping | 17:50 |
tristan | The "location" ? | 17:51 |
benschubert | So we could use a fast C (libyaml?) parser for the "base case" and when we need to write, we could reload the file itself with Ruamel and find the right thing to overwrite no? | 17:51 |
benschubert | It would be more work | 17:51 |
benschubert | but might give us the best of both worlds | 17:51 |
tristan | Hmmm | 17:51 |
benschubert | until we decide we're too fed up and re-write a yaml parser :D | 17:51 |
tristan | Sounds like a lot of work and residual maintenance just to avoid a proper rewrite | 17:52 |
benschubert | tristan: rewriting ruamel in cython/C ? with only the things we'd need? | 17:52 |
cphang | tristan benschubert would something like https://github.com/tlsa/libcyaml be useful? | 17:53 |
tristan | That's another option, I was thinking it really should be a pure C round tripping YAML implementation | 17:53 |
tristan | cphang, does it roundtrip ? | 17:53 |
tristan | I don't think so | 17:53 |
benschubert | yeah, not sure what's better than https://github.com/yaml/libyaml there? Do you know cphang ? | 17:54 |
tristan | looking at that page, looks like it does load/save, but not preservation of line/column/whitespace/comments etc | 17:54 |
tristan | ruamel.yaml even manages to lose some comments completely when you put them in the "wrong" places | 17:55 |
benschubert | yeah :/ and it being on sourceforge doesn't help for contributing | 17:56 |
tristan | is it on sourceforge ?! | 17:56 |
tristan | jaysus I didn't know the yaml format itself was even that old | 17:57 |
benschubert | it moved from bitbucket to sourceforge | 17:57 |
benschubert | because bitbucket dropped mercurial support | 17:57 |
* tristan wonders if they upgraded to CVS ! | 17:58 | |
tristan | Everything in CVS good ol' days are back ! | 17:58 |
benschubert | Nah, they just dropped to only support git | 17:59 |
benschubert | and this project preferred moving to sourceforge rather than move to git | 17:59 |
tristan | I meant ruamel.yaml moving to sourceforge so they could... use CVS but... yeah same difference I guess | 17:59 |
tristan | Well, it does remind me of the CVS days, when you really needed to have a "Configuration Manager" for every dev team | 18:00 |
cphang | benschubert tristan the main value of libcyaml is that you can define schemas to parse the yaml and map them into structs https://github.com/tlsa/libcyaml/blob/master/docs/guide.md to my knowledge libcyaml uses libyaml under the hood. | 18:00 |
tristan | Cause developers and VCSes are like oil and water | 18:00 |
tristan | yeah doesn't seem to fit the bill | 18:01 |
benschubert | cphang: we don't really have a schema, since each plugin has different stuff, so might not be the best fit, but thanks for showing :) | 18:08 |
benschubert | tristan: you might want to revert your last change: https://gitlab.com/snippets/1992018 | 18:08 |
benschubert | *https://gitlab.com/snippets/1992018#note_372759175 sorry | 18:08 |
tristan | Crap | 18:09 |
tristan | And the change is (A) Use a more generic array implementation (instead of PyObject ** and Py_ssize_t on ValueClass, we use ObjectArray which has a function call for appending values) | 18:15 |
tristan | And change a `list` for an `ObjectArray *`, oh also the new ObjectArray holds references to the objects | 18:15 |
tristan | but that should be an atomic operation | 18:16 |
tristan | surely assigning `(PyObject **)array[idx] = <PyObject>value && Py_INCREF(value)` must be cheaper than list.append(value) | 18:17 |
benschubert | What's the C code generated for both? | 18:17 |
tristan | not easy to say, because using a list calls into python | 18:18 |
tristan | and iterating it/popping it also | 18:18 |
tristan | while iterating over a pointer array should be lightning fast | 18:19 |
benschubert | right | 18:19 |
tristan | we don't technically need the ref/unref statements, I just thought they would be safer (in this case, we know these strings are already owned by other means) | 18:20 |
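The append-versus-preallocated-slot tradeoff above can be illustrated at the Python level; the real win tristan describes is in C, where the ObjectArray write skips the interpreter entirely, so this only shows the shape of the two approaches (function names are invented for the example):

```python
# Growing a list with append() versus writing into preallocated slots --
# a Python-level analogue of list.append vs. a raw PyObject** array.
def with_append(values):
    out = []
    for v in values:
        out.append(v)       # may trigger reallocation as the list grows
    return out

def with_prealloc(values):
    out = [None] * len(values)   # allocate once, like the C array
    for i, v in enumerate(values):
        out[i] = v               # plain slot assignment, no growth checks
    return out
```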
tristan | I'll try to cook up a more intensive local benchmark here tomorrow | 18:22 |
tristan | In any case, this particular run is telling | 18:22 |
tristan | It shows that the tiniest things are really having a huge impact on the result | 18:22 |
tristan | Well, huge, is a vast overstatement | 18:22 |
tristan | We are talking about .25 of a second for (project.conf variables) * 6000 | 18:23 |
tristan | still it's a visible difference | 18:23 |
benschubert | still around 5-ish% of the run yeah :) | 18:23 |
benschubert | And this is a "simple" 6k elements project | 18:24 |
benschubert | so for more realistic examples it will probably be worse :/ | 18:24 |
tristan | I'm not sure of that | 18:24 |
benschubert | yeah that's actually true, cas would be taking a bigger part for bigger projects | 18:25 |
tristan | benschubert, the more realistic and complex a project gets, the more composition will occur; but variable resolution will still happen once flatly for each element | 18:25 |
tristan | If you have a variety of element kinds and custom variable assignments in elements, this won't affect resolution (much) is my point | 18:25 |
tristan | you might end up having a wider value cache | 18:25 |
tristan | (cache of regex'ed "%{foo} and %{bar}" strings) | 18:26 |
tristan | those will indeed be more expensive | 18:26 |
benschubert | yeah, but the time will be mostly taken by casd looking if things are in cas right? | 18:27 |
benschubert | well, I'm sure of that actually ;) | 18:27 |
tristan | Sure but that's another aspect | 18:27 |
tristan | I don't think that technically a larger artifact should take a longer time to verify if it is cached, though | 18:28 |
tristan | every artifact should take the same time to check | 18:28 |
benschubert | Ok, should we remove a part of the "optimizations" that are not really some and just ensure the code is the easiest to maintain right now? And we can revisit later maybe? | 18:28 |
benschubert | tristan: would it? You'd need to check each blob | 18:28 |
tristan | benschubert, I am using the word "should" here :) | 18:29 |
tristan | I don't know what it is | 18:29 |
tristan | But definitely, it should | 18:29 |
benschubert | we'd need an indexed cas for that (and thus know what's stored), I don't think casd supports that yet? juergbi will know more about this subject than me though | 18:30 |
tristan | Cache expiry is certainly hard to get right but we shouldn't have to verify the blobs related to a ref just to determine that the ref is valid | 18:30 |
tristan | At worst, we'd find out when it's time to build something that we were unable to store all the elements required for the build in the local cache size | 18:30 |
tristan | and have a late stage failure | 18:31 |
juergbi | BuildStream can't handle that | 18:31 |
juergbi | so we have to check up front | 18:31 |
juergbi | we obviously don't want the whole build session to fail if a blob is missing | 18:31 |
tristan | We can bail out and say "sorry you wasted your time, next time get a larger disk" | 18:31 |
juergbi | we already do this if the cache is not large enough for the current session | 18:32 |
juergbi | however, at the beginning of the session we check existing artifacts whether they are completely available | 18:32 |
juergbi | separate scenario | 18:32 |
tristan | juergbi, what's the worst that can happen though ? You wouldn't need to rebuild everything in a subsequent session, you'd have to increase your `quota:` in your config, give it some more space, and rebuild | 18:32 |
tristan | and then you will only build the missing blob and continue from there | 18:32 |
juergbi | for that we need to know that the blob is missing | 18:33 |
benschubert | tristan: the problem is if half an artifact is missing | 18:33 |
tristan | juergbi, Right, I'm guessing that check is overkill right ? As soon as we remove blobs shouldn't we be removing refs ? | 18:33 |
juergbi | which means that we have to check each blob for each artifact | 18:33 |
juergbi | no, that's what we used to do | 18:33 |
juergbi | which resulted in expiry taking 24h | 18:33 |
* benschubert remembers this with dread | 18:34 | |
juergbi | so we switched to the blob-based expiry | 18:34 |
benschubert | it was more like 72 hours but hey :) | 18:34 |
cphang | this feels like a referential integrity problem.. | 18:34 |
juergbi | who's counting ;) | 18:34 |
tristan | yeah | 18:34 |
tristan | Could this not be solvable with like, sqlite ? | 18:34 |
juergbi | we could probably do something more fancy if we move to a proper database | 18:34 |
* tristan beat you to it | 18:34 | |
tristan | heh | 18:34 |
juergbi | hehe | 18:34 |
juergbi | creating an index in a database could certainly help | 18:35 |
juergbi | but it's also not trivial | 18:35 |
juergbi | and we also don't want to overcomplicate casd | 18:35 |
juergbi | (although a simple database would likely make sense even if we keep blob-based expiry) | 18:35 |
cphang | But you *only* need to guarantee that referential integrity over the duration of a BuildStream invocation. | 18:35 |
tristan | In that scenario, we could run a GC of refs at the beginning of a session and hopefully that GC is less costly when there are no refs to cleanup | 18:35 |
benschubert | cphang: you'd need to build it each start then, which takes some time | 18:35 |
benschubert | cphang: easily multiple minutes with large-ish projects | 18:36 |
juergbi | following all references can take a very long time | 18:36 |
juergbi | btw: the buildgrid cas server now has an indexed cas with expiry support | 18:36 |
* cphang feels like Bazel has faced similar problems in the past.. | 18:36 | |
tristan | benschubert, every time ? or only when blobs have expired ? | 18:36 |
benschubert | tristan: not sure I understand the question? | 18:37 |
juergbi | I think keeping blob-based expiry is the right approach, however, we can hopefully speed up the blob checks with a database index | 18:37 |
benschubert | that's what I would be thinking | 18:38 |
tristan | Knowing benschubert's definition of "large-ish" I think I would be happy to spend a couple of minutes at `bst build` launch time, every time the last build resulted in expiry of blobs | 18:38 |
cphang | juergbi buildbarn has an ADR basically on this issue: https://github.com/buildbarn/bb-adrs/blob/master/0002-storage.md, to satisfy the guarantee that CAS blobs cannot expire over the invocation of a Bazel build | 18:38 |
tristan | benschubert, I mean if nothing expired last run, it should return right away | 18:38 |
tristan | no refs to GC | 18:38 |
benschubert | tristan: a couple of minutes at each start is prohibitive, and things expire quickly, I'm regularly working with 2TB+ of cache :) | 18:38 |
juergbi | cphang: we have a protection of session blobs in casd as well. the issue we're talking about is integrity across sessions | 18:39 |
benschubert | even casd cleanup when going over takes north of 15 minutes | 18:39 |
benschubert | anyways, I'll drop off now :) have a good evening/morning/night/other ! | 18:39 |
juergbi | tristan: also, with support for partial CAS (e.g. intentionally only storing the directory objects without the files locally) reference-based expiry probably doesn't make sense | 18:40 |
juergbi | this can be very useful with remote execution where the client doesn't need the intermediate blobs | 18:40 |
persia | I think object lifecycle management is a known-hard problem. There's some reasonable examples in the HBase sources, but I've definitely encountered people who complained about near-quota behaviours causing dependency-production-and-purge cycles preventing query resolution. | 18:41 |
persia | Don't fear using an actual database for it: maybe sqllite, or maybe some custom columnar store. | 18:41 |
tristan | indexing the blobs for improved "check if it's present" can work too | 18:41 |
tristan | I guess the shape of data plays into this too | 18:41 |
juergbi | yes, likely | 18:42 |
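A blob index along the lines juergbi and persia suggest could be sketched with stdlib sqlite3. casd itself is not Python, and the schema and function names here are invented purely for illustration of blob-based LRU expiry backed by a database index:

```python
import sqlite3
import time

# Illustrative blob index: one row per blob, expired least-recently-used.
def open_index(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS blobs ("
               "digest TEXT PRIMARY KEY, size INTEGER, atime REAL)")
    return db

def touch(db, digest, size):
    # Record (or refresh) a blob access; atime drives the LRU order.
    db.execute("INSERT INTO blobs VALUES (?, ?, ?) "
               "ON CONFLICT(digest) DO UPDATE SET atime = excluded.atime",
               (digest, size, time.time()))

def expire(db, quota):
    # Drop least-recently-used blobs until the total size fits the quota.
    total = db.execute("SELECT COALESCE(SUM(size), 0) FROM blobs").fetchone()[0]
    for digest, size in db.execute(
            "SELECT digest, size FROM blobs ORDER BY atime").fetchall():
        if total <= quota:
            break
        db.execute("DELETE FROM blobs WHERE digest = ?", (digest,))
        total -= size
    return total
```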
tristan | When people have an artifact with many files, they are doing it wrong already | 18:42 |
juergbi | don't forget sources. they are also in CAS | 18:42 |
juergbi | although, they don't have to be checked that frequently, I suppose | 18:42 |
tristan | some few artifacts (compositions) might have many files | 18:42 |
benschubert | tristan: I fear we have a different baseline for what "many" files are :'D | 18:43 |
tristan | benschubert, I doubt it... I mean, having build elements with huge amounts of files in the artifacts, is akin to writing huge programs in one big file.c | 18:43 |
tristan | common sense dictates you want things to be modular | 18:44 |
persia | I can think of many reasons to create artifacts that are not compositions that are hundreds of files, and even more reasons to create compositions with thousands of files which are then themselves components in larger compositions. | 18:44 |
tristan | Data | 18:44 |
tristan | Not code, indeed | 18:44 |
persia | I hadn't conceived of data. My reasons are entirely related to code being compiled into things. | 18:45 |
tristan | But even then, if you have language packs for instance, you'd still want to break them into per language artifacts | 18:45 |
tristan | Well, then we might again have a different definition of "huge", I think that a webkit or firefox installation meets the criteria of "huge" when it comes to software builds | 18:46 |
juergbi | long build time and lots of files in the artifact are fairly independent | 18:47 |
persia | If you have a debian-based system, it might be worth querying which packages you have installed, and running `dpkg -L <pkgname> | wc -l` against each of them. | 18:47 |
persia | You may be surprised about how many end up with a very large number of files. | 18:47 |
juergbi | although e.g. libreoffice covers both. almost 33k files for a full un-split build here | 18:48 |
persia | Yep. Chromium is another fun one :) | 18:48 |
tristan | holy crap | 18:48 |
persia | Yep. | 18:48 |
juergbi | most are IDL and documentation, though | 18:48 |
tristan | See, but we are talking about monolithic cathedrals which I think everyone can agree needs to go through the xorg exercise | 18:48 |
tristan | which was once a similar breed, but sensibly became modular | 18:49 |
juergbi | maybe they should but we have to deal with real projects | 18:49 |
juergbi | also, spreading all these files out to multiple artifacts in a single pipeline doesn't reduce the total number of blob checks | 18:50 |
tristan | Yeah, I'm just saying that if the indexing is slow for the (potentially few) projects which are ridiculously monolithic, it's justified to say "You're doing it wrong" ;-) | 18:50 |
tristan | true, but it should give you better feedback | 18:50 |
juergbi | potentially, yes | 18:51 |
juergbi | btw: I also have vague plans of parallelizing those checks | 18:51 |
tristan | I've been saying this for years ! | 18:51 |
juergbi | casd supports real threads, not like Python | 18:51 |
juergbi | ;) | 18:51 |
tristan | nobody got around to it yet | 18:52 |
tristan | silly | 18:52 |
juergbi | so if we use async requests in Python, we can get this parallel without GIL blocking us | 18:52 |
tristan | I don't think GIL is a question here, we could have been doing this last year | 18:53 |
tristan | it's just not the right time to do it while benschubert is in the midst of changing the parallelization model | 18:53 |
juergbi | well, without casd it would be fairly difficult to really parallelize | 18:53 |
tristan | if we had CacheCheckJob in the first place, it wouldn't change much | 18:53 |
juergbi | as the Python threads probably wouldn't be worth it | 18:53 |
tristan | No no, we parallelize through jobs in the scheduler | 18:54 |
tristan | right now that's done with multiprocessing | 18:54 |
tristan | if benschubert changes it to threads, then it's threads | 18:54 |
juergbi | would have to check whether the overhead is not too high for these checks | 18:54 |
tristan | That could have worked a long time ago | 18:54 |
juergbi | with the fork overhead, I'm not sure | 18:55 |
juergbi | fork in BuildStream is far slower than raw fork typically is in Linux | 18:55 |
tristan | Really ? | 18:55 |
juergbi | yes, it shows up very high on profiles | 18:55 |
tristan | What for ? | 18:55 |
juergbi | I'm not sure, I suspect atfork handlers | 18:55 |
juergbi | gRPC could be involved | 18:55 |
tristan | Python or smth I guess | 18:55 |
tristan | Even then, we could have easily batched them | 18:56 |
tristan | throw 100 elements at each CacheCheck job | 18:57 |
juergbi | yes | 18:57 |
juergbi | it's mainly a question how to refactor the code to allow this | 18:58 |
juergbi | as it's currently handled by each element | 18:58 |
juergbi | anyway, not something for today | 18:58 |
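The batching-plus-async idea in this exchange (tristan's "throw 100 elements at each CacheCheck job", juergbi's "async requests in Python") might look roughly like this; `check_batch` is a stand-in for the real casd RPC and the numbers are arbitrary:

```python
import asyncio

# Stand-in for the casd cache-presence RPC; a real implementation would
# issue a batched gRPC request here.
async def check_batch(elements):
    await asyncio.sleep(0)
    return {e: True for e in elements}

async def check_all(elements, batch_size=100, max_in_flight=4):
    # Split into batches and bound how many RPCs are in flight at once,
    # so the checks run in parallel without the GIL being a bottleneck
    # (the waiting happens inside the RPC, not in Python).
    sem = asyncio.Semaphore(max_in_flight)

    async def run(batch):
        async with sem:
            return await check_batch(batch)

    batches = [elements[i:i + batch_size]
               for i in range(0, len(elements), batch_size)]
    results = {}
    for part in await asyncio.gather(*(run(b) for b in batches)):
        results.update(part)
    return results
```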
tristan | The scheduler has element specific jobs, but since a long time ago that is a subset of its jobs | 18:59 |
tristan | but yeah, better to not fiddle in there while benschubert is in it | 18:59 |
juergbi | there are no non-element jobs anymore, iirc, however, we could probably re-add that if needed | 19:01 |
tristan | The framework is quasi in place anyway | 19:03 |
tristan | what with the scheduler resources and such | 19:03 |
tristan | of course the logging has to be different for such batch jobs, but that seems not challenging and unworthy of concern :) | 19:04 |
benschubert | python being slow for fork: | 19:20 |
benschubert | -> close the event loop, all file descriptor + other stuff | 19:20 |
benschubert | -> the refcounting GC means copying all the memory pages on fork | 19:20 |
benschubert | Otherwise I'm not sure I understood everything juergbi: would you be doing the async requests in grpc in the scheduler or before? | 19:20 |
benschubert | If before: that would not affect me | 19:20 |
benschubert | if during: I think that will be great, I wanted to try the asyncio support of grpc now too | 19:20 |
juergbi | I was planning before | 19:21 |
juergbi | it's not something I'm working on right now, though | 19:21 |
juergbi | and probably also not anytime soon | 19:21 |
benschubert | yeah if it's before I would not be affected, so you can work on that if/when you want | 19:21 |
benschubert | juergbi: btw, how long to run the non integration tests on your system again? (and which parallelism?) | 19:21 |
juergbi | have to rerun to check | 19:22 |
benschubert | roughly, don't need super accurate :) | 19:22 |
juergbi | with integration but without coverage it's 2min 15s with -n 12 | 19:23 |
juergbi | (Python 3.8) | 19:23 |
juergbi | with coverage it's a minute more or so (and problematic with 3.8) | 19:23 |
benschubert | ok! thanks, so threading is indeed slightly slower even on the test suite for now | 19:23 |
benschubert | I still get occasional lingering threads but I've got no idea why | 19:24 |
juergbi | do they cause any issues? | 19:24 |
benschubert | No, but it would be cleaner for testing no? | 19:25 |
juergbi | or is it just about making sure it's all working cleanly | 19:25 |
benschubert | Unless we don't care and I remove the check entirely | 19:25 |
juergbi | yes, definitely | 19:25 |
juergbi | without integration, without coverage: 1353 passed, 150 skipped in 52.12s | 19:25 |
benschubert | ok my tests take 2m12 with -n10 :/ | 19:25 |
benschubert | oh well | 19:25 |
juergbi | -nocover? | 19:26 |
benschubert | I'll try a bit further then cleanup my branch and make a WIP PR with thread checking disabled for tests | 19:26 |
benschubert | pytest -n 10 directly | 19:26 |
juergbi | ah ok | 19:26 |
juergbi | is tox behaving now? | 19:26 |
benschubert | nope | 19:26 |
juergbi | :-/ | 19:26 |
benschubert | still getting lingering threads with tox | 19:27 |
benschubert | It's 4 threads all the time | 19:27 |
juergbi | ah, the lingering thread issue is only with tox, not with pytest itself? | 19:27 |
benschubert | and they are not python threads | 19:27 |
benschubert | with pytest rarely (once every 2-3 test runs) | 19:27 |
juergbi | do you use a shared gRPC channel for casd? | 19:28 |
benschubert | with tox all the time, 90% of the tests fail | 19:28 |
benschubert | I think so, I haven't changed how that was handled | 19:28 |
juergbi | have you removed the explicit closing of gRPC before spawning new job? | 19:28 |
juergbi | that shouldn't be necessary anymore | 19:28 |
benschubert | https://gitlab.com/BuildStream/buildstream/-/compare/master...bschubert%2Fno-multiprocessing#57eb6e43c19557b544ffb8be1ee3811ea76c8d95_558_552 yes AFAICT | 19:29 |
benschubert | maybe do you see something obvious on the diff? | 19:30 |
juergbi | yes, you've removed it: prepare_fork | 19:32 |
benschubert | interesting: tox -e venv /bin/bash then pytest fails | 19:38 |
benschubert | so, that is something related to my environment | 19:38 |
benschubert | that's fishy | 19:38 |
* tristan would hope that any parallelism is done with Scheduler invocations, for consistency, and to follow the same rules laid out in the Resources | 19:57 |
tristan | Scheduler can be invoked multiple times with different queues | 19:58 |
juergbi | it only makes sense if any actual scheduling is required | 20:01 |
juergbi | I have to look into it again but it's possible batching is all that's needed, in which case it's simpler and more efficient without involving scheduler | 20:02 |
benschubert | juergbi: I have a working theory for my threading problem: if I don't start the cacheusagemonitor at all (now a thread in my implementation), then I don't have those lingering threads. So it seems that having two threads sharing the connection causes problems? Is there anything weird around gRPC I need to close/handle/lock? | 20:03 |
juergbi | I'm not familiar with Python gRPC's API semantics with regards to multiple (Python) threads | 20:04 |
juergbi | however, why would it be different for cacheusagemonitor and job threads? | 20:04 |
benschubert | I have no idea | 20:05 |
juergbi | unless job threads are simply less likely to conflict timing wise in the tests | 20:05 |
benschubert | removing the starting makes all tests pass but one (the one checking the cache usage) | 20:06 |
juergbi | is the cacheusagemonitor thread shut down / joined properly at the end of each 'cli' invocation? | 20:06 |
benschubert | Yep | 20:06 |
benschubert | Do you see something blatant here: https://gitlab.com/BuildStream/buildstream/-/compare/master...bschubert%2Fno-multiprocessing#eee68cb7737cd5a69ac2b8ddcf0119307cf48977_734_739 ? | 20:07 |
juergbi | and done in such a way that it's not forcefully torn down in the middle of a grpc request? | 20:07 |
benschubert | wait, seems like some tests actually still fail | 20:07 |
benschubert | so it's not perfect | 20:07 |
benschubert | but better | 20:07 |
benschubert | so it seems like this is indeed the problem, grpc not shutting down nicely | 20:07 |
juergbi | self._connection.close() in CacheUsageMonitor seems odd to me | 20:08 |
juergbi | I don't think the CacheUsageMonitor owns the connection | 20:08 |
juergbi | but no idea whether that's related to the issue | 20:09 |
juergbi | how/where do you terminate this thread? | 20:09 |
benschubert | it actually has no effect, removing it doesn't fix the problem | 20:09 |
juergbi | and do you know how exactly this is done internally? trigger a Python exception in that thread? | 20:09 |
benschubert | https://gitlab.com/BuildStream/buildstream/-/compare/master...bschubert%2Fno-multiprocessing#eee68cb7737cd5a69ac2b8ddcf0119307cf48977_134_136 for shutdown | 20:09 |
juergbi | ah, right, that seems clean | 20:10 |
benschubert | (adds a 5s delay so will have to do it differently after) but yeah :/ | 20:10 |
benschubert | and I tried joining the thread before closing the connection right after | 20:10 |
benschubert | and this didn't change either | 20:11 |
juergbi | I would definitely do the join before | 20:11 |
juergbi | as it might still be using the connection | 20:11 |
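[An illustrative sketch of the shutdown ordering juergbi recommends: signal the monitor thread, join it, and only then close the shared channel, so the close can never race an in-flight request. Names here are hypothetical, not BuildStream's actual code; using `Event.wait()` as an interruptible sleep also addresses the 5-second shutdown delay benschubert mentions below, since `set()` wakes the thread immediately.]

```python
# Sketch (illustrative names): join the polling thread *before* closing the
# shared gRPC channel.
import threading
import time

class CacheUsageMonitor:
    def __init__(self, poll):
        self._poll = poll                # callable issuing the cache-usage query
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        # Event.wait() doubles as an interruptible sleep: a long poll interval
        # no longer delays shutdown, because stop() wakes it at once.
        while not self._stop.wait(timeout=0.01):
            self._poll()

    def stop(self):
        self._stop.set()
        self._thread.join()  # after this, no request can still be mid-flight

calls = []
monitor = CacheUsageMonitor(lambda: calls.append(1))
monitor.start()
time.sleep(0.05)
monitor.stop()
# Only now would it be safe to close the shared channel, e.g. channel.close().
print(len(calls) > 0)  # True
```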
juergbi | I'd suggest trying to find documentation on python gRPC thread safety | 20:12 |
juergbi | sharing a channel might not be supported at all or require certain precautions | 20:12 |
juergbi | also have to wrap my head around GIL in this context | 20:13 |
juergbi | with GIL presumably fewer things go wrong than without, but it's not magic thread safety | 20:13 |
benschubert | https://github.com/grpc/grpc/issues/9320 it seems correct | 20:13 |
juergbi | I'm not used to this odd middle way | 20:13 |
juergbi | ok, so the gRPC side should be fine, hm | 20:14 |
benschubert | https://github.com/googleapis/google-cloud-python/issues/3272 | 20:16 |
benschubert | So the 4 is indeed the grpc threads | 20:16 |
juergbi | our _establish_connection is not really thread-safe | 20:17 |
juergbi | don't know whether this could cause this issue, though | 20:17 |
benschubert | ooh good point | 20:18 |
juergbi | but should be fixed in any case | 20:18 |
benschubert | well I can easily lock it and see what happens :) | 20:18 |
juergbi | yep | 20:18 |
benschubert | or force creation at the start anyways | 20:18 |
benschubert | juergbi: oh gosh, that was it | 20:22 |
juergbi | :) | 20:22 |
benschubert | so we'd actually create two | 20:22 |
benschubert | uh, more errors, apparently not the only part x') | 20:23 |
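[A sketch of the race juergbi spotted and the fix that worked. If two threads call an unlocked `_establish_connection()` concurrently, both can pass the "is it already open?" check and each open its own channel, leaving an extra connection (and its native gRPC worker threads) lingering. The class and attribute names below are illustrative, not BuildStream's actual code; the pattern is double-checked locking.]

```python
# Sketch (hypothetical names): thread-safe lazy connection establishment.
import threading

class LazyConnection:
    def __init__(self):
        self._channel = None           # stands in for the gRPC channel
        self._lock = threading.Lock()  # guards lazy initialisation
        self.created = 0               # how many "channels" were opened

    def _establish_connection(self):
        # Double-checked locking: cheap unlocked read first, then re-check
        # under the lock so only one thread actually creates the channel.
        if self._channel is None:
            with self._lock:
                if self._channel is None:
                    self.created += 1
                    self._channel = object()  # grpc.insecure_channel(...) in real code
        return self._channel

conn = LazyConnection()
threads = [threading.Thread(target=conn._establish_connection) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(conn.created)  # 1
```

The alternative benschubert mentions, creating the connection eagerly at startup, sidesteps the race entirely at the cost of always paying for the connection.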
tristan | juergbi, I think that one would want to throttle the number of parallel checks; querying for cached state can mean 3 things; (A) Checking cas for cached artifact, (B) Checking cas for cached sources, (C) Checking local disk for sources | 20:30 |
tristan | Wouldn't it make sense to not launch all of that for hundreds of elements simultaneously ? if it would make sense: That is scheduling I think | 20:31 |
benschubert | tristan: we mean at the start, before showing all elements, getting their state as we need to prepare the scheduler. It makes sense to batch all of them in a single request to casd, which should make it as fast as possible :) | 20:34 |
juergbi | yes, it really depends on where/how we do the batching | 20:35 |
*** xjuan_ has joined #buildstream | 20:40 | |
tristan | I mean, the scheduler takes care of throttling parallel jobs so as not to overstress the hardware | 20:43 |
tristan | The user gets to say things like how many fetches and builds go on in parallel | 20:43 |
tristan | If we're going to launch jobs which need to be throttled and run in parallel, and we have a Scheduler with the Resources stuff to control such throttling, we should not re-invent that from scratch, but reuse the Scheduler object and Job harnesses to do so | 20:44 |
*** xjuan_ has quit IRC | 20:45 | |
tristan | even if it appears simpler to write something that doesn't use the scheduler, there's no reason why using the scheduler can't be as simple | 20:45 |
* tristan believes he had this conversation before, upon finding other threads/subprocesses/multiprocessing hacks in startup phase, over a year ago | 20:46 | |
* tristan would just like to use one tool for the job consistently, and not consider that the Scheduler is only for the main build process | 20:47 | |
juergbi | I would definitely not reinvent a scheduler in buildstream | 20:47 |
juergbi | however, e.g. casd (as a gRPC server) has a thread pool | 20:47 |
juergbi | and essentially letting that take care of scheduling for the startup check could also make sense | 20:48 |
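[A sketch of the trade-off under discussion: bounding the concurrency of the startup cache-state checks with a fixed-size pool. Everything here is hypothetical (element names, the `query_cached_state` stub); the point is that even a small hard-coded pool is itself a minimal scheduler, which is tristan's argument below for reusing the real Scheduler and its Resources instead.]

```python
# Sketch: throttled parallel state queries via a bounded thread pool.
from concurrent.futures import ThreadPoolExecutor

def query_cached_state(element):
    # Placeholder for the real checks: (A) artifact in CAS,
    # (B) sources in CAS, (C) sources on local disk.
    return (element, "cached")

elements = [f"element-{i}.bst" for i in range(100)]

# max_workers bounds how many checks are in flight at once, playing the
# role the Scheduler's Resources would otherwise play.
with ThreadPoolExecutor(max_workers=4) as pool:
    states = dict(pool.map(query_cached_state, elements))

print(len(states))  # 100
```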
tristan | Right, but launching 4 threads in a hard coded manner which do something in parallel at the beginning, would not look like inventing a scheduler, but would still be | 20:48 |
tristan | Also, I think that the jobs need to check local disk, not only do gRPC jobs | 20:48 |
juergbi | yes, I also wouldn't want to do that | 20:48 |
juergbi | it's casd that performs the actual disk checks (updating timestamps) | 20:49 |
tristan | each element needs to check local disk to get Source consistency (in the absence of source cache) | 20:49 |
tristan | Seems logical to couple all cache checks (including local disk) into one group | 20:49 |
juergbi | yes, so there is the non-CAS part first | 20:49 |
tristan | I think non-cas part is after | 20:50 |
tristan | We only need to check disk if source cache is empty ? or maybe we still need to know, not sure | 20:50 |
juergbi | first check is artifact, we only check anything source-related if we don't have an artifact | 20:50 |
juergbi | and the main part of the artifact check is CAS | 20:50 |
juergbi | if the artifact is missing or incomplete, we check the source cache, which is very similar to the artifact check | 20:51 |
tristan | Right, in builds we can probably even discard the checks for source consistency but, I have a feeling that's not a long term reliable rule | 20:51 |
tristan | Some scenarios might require knowing | 20:51 |
juergbi | source consistency, i.e., having ref doesn't involve the disk, though, right? | 20:52 |
juergbi | (except .bst and project.refs) | 20:52 |
benschubert | it can if you need to be staged before a ref | 20:52 |
tristan | consistency includes whether it's cached and whether the ref is known or not (tristate) | 20:52 |
benschubert | (local sources, workspaces) | 20:52 |
tristan | but consistency has been replaced under the hood | 20:52 |
tristan | I still think it's a decent thing to bring back if only for `bst source show` output | 20:53 |
juergbi | but you don't want to pay the price for all other operations | 20:53 |
tristan | (not implementation-wise, there are good reasons why this was broken down into 2 methods under the hood) | 20:53 |
juergbi | right, can be built on top | 20:53 |
juergbi | btw: are you still awake? | 20:54 |
juergbi | or again | 20:54 |
tristan | I also suspect we may hit a turning point where sources might be required by reverse deps | 20:54 |
* tristan is about to crash :) | 20:55 | |
tristan | can't sleep early this week | 20:55 |
tristan | cython has put me into a sleep at 7am pattern | 20:55 |
juergbi | evil cython | 20:55 |
*** dftxbs3e has quit IRC | 23:06 | |
*** dftxbs3e has joined #buildstream | 23:23 |
Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!