IRC logs for #buildstream for Sunday, 2020-04-19

*** ceibal has joined #buildstream00:11
*** ceibal has quit IRC00:14
*** narispo has quit IRC00:42
*** narispo has joined #buildstream00:42
*** narispo has quit IRC01:14
*** narispo has joined #buildstream01:14
*** slaf has joined #buildstream02:02
*** narispo has quit IRC04:46
*** narispo has joined #buildstream04:46
*** narispo has quit IRC05:44
*** narispo has joined #buildstream05:44
*** slaf has quit IRC09:21
*** slaf has joined #buildstream09:27
*** mohan43u has joined #buildstream12:42
*** mohan43u has quit IRC13:44
*** mohan43u has joined #buildstream13:48
*** mohan43u has quit IRC13:55
jjardonbenbrown: juergbi I've added another bastion for the whole buildstream group so we don't rely only on the one in buildstream/buildstream (and we can split the requests between the 2), but I think we are hitting a rate limit on the number of requests to DO: I see a lot of "Rate limit detected (waiting at least 5s)" in the logs14:11
jjardonI think we are actually hitting a new burst rate limit DO introduced quite recently, and we can see it clearly now because we are using the docker-machine fork from gitlab; https://developers.digitalocean.com/documentation/v2/#rate-limit14:13
jjardon5,000 requests per hour or 250 requests per minute (5% of the hourly total) (the second part is new)14:14
jjardonhttps://developers.digitalocean.com/documentation/changelog/api-v2/add-burst-rate-limits/14:15
juergbijjardon: ah, this is relatively new on the DO side, that's why we didn't see this before14:17
jjardonyeah, I think that is what happened14:17
juergbido you know roughly how many requests docker-machine issues per droplet/job?14:18
jjardoncombined with the fact that we now spin up around 20 machines for each pipeline14:18
jjardonjuergbi: no idea; I've been trying to find information about that14:18
jjardonbut I think we reach the limit with 2 pipelines running already14:18
jjardon(40 jobs)14:18
juergbimaybe docker-machine could be optimized, no idea, though14:19
juergbijjardon: do the two bastions use different OAuth tokens to increase our total limit?14:20
jjardonjuergbi: yep14:20
jjardonI'm about to set up a permanent runner on a very big machine, so at least there are more options available until we find a permanent solution14:21
jjardonthe fact that docker-machine is in maintenance mode is not good either14:22
jjardonmaybe we should put the runners in a kubernetes cluster, which seems to be the modern way to do this14:22
cphangjjardon I'd maybe look at a container job service such as ECS14:30
cphangkubernetes will do the job, but more setup will be required14:31
jjardoncphang: all the configuration is almost automatic with gitlab14:31
jjardonproblem is that I'm not sure you can make the cluster elastic automatically14:32
cphangYeah, I'm not worried about the setup of the cluster itself, but 1) the autoscaling, 2) the job dispatch, and 3) reporting back to gitlab might have a few steps to walk through14:33
cphangnothing insurmountable though14:33
jjardoncphang: but we would still be using docker-machine, which is the thing we would like to avoid14:34
jjardonor is there another way?14:34
* cphang is reading https://docs.gitlab.com/runner/executors/kubernetes.html#workflow14:36
cphangit seems to create pods, which if they don't use the job abstraction makes me a little sad14:36
cphanghttps://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/14:36
cphangTo make it elastic, the dream would be to have a daemonset with a cluster autoscaler e.g. https://docs.aws.amazon.com/eks/latest/userguide/cluster-autoscaler.html14:37
cphangSo nodes get spun up as CPU load reaches a certain level14:37
cphangThat being said, I think that might be overengineering it for this use-case. I've had no problems with docker-machine on libreML, using GCP instead of DO14:39
cphangAnd I know AWS has been used with some success too.14:39
jjardonI think it would be possible, we simply need to install https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/ and it would work14:39
jjardoncphang: we want to stop using docker-machine because it is not being developed anymore, not because it doesn't work (we have been working for 2 years with a lot of load without a massive amount of problems); I think most of the problems currently are on the DO side14:41
cphangSure, but I would do it by node, rather than pod, so you are actually requesting more resources to your cluster, rather than using the existing resources on a cluster with more pods14:41
cphangjjardon ack14:41
jjardoncphang: yep, that is exactly what the above does; it creates more nodes if needed14:43
*** mohan43u has joined #buildstream14:44
jjardonI will create the big permanent runner so people can still work tomorrow, then maybe I will play with that idea14:45
*** mohan43u has quit IRC14:45
cphangjjardon to my knowledge https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/ will create more pods for an existing deployment, replicaset or statefulset, but https://docs.aws.amazon.com/eks/latest/userguide/cluster-autoscaler.html will register more nodes to an existing cluster. I think for your workloads you want the latter14:45
cphangrather than the former.14:45
jjardoncphang: you are totally correct14:46
jjardonsorry, I linked the incorrect thing14:46
cphangnp :), just wanted to make sure my knowledge was correct as much as anything else :)14:46
cphangIt's worth digging into the details of https://docs.gitlab.com/runner/executors/kubernetes.html#connecting-to-the-kubernetes-api a lot more14:47
cphangthat will determine the cluster design imo14:47
jjardonyeah, cluster autoscaler is a standard thing btw, not EKS only: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler14:47
cphangoh yeh, forgot about that.14:47
cphangcool14:47
jjardonso, I _think_ gitlab will request more resources, and if you have the above configured everything will work automagically14:48
cphangjjardon it might be worth leveraging https://gitlab.com/celduin/infrastructure/celduin-infra/-/blob/master/terraform/lib/celduin.libsonnet14:49
cphangyou'll need to make some modifications to it, but the EKS portion should work out of the box14:50
cphangIf gitlab just wants you to set up a namespace + autoscaler and it'll handle the rest, that might be a quick win.14:50
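
As an illustration of what the Kubernetes executor setup being discussed could look like, here is a minimal config.toml sketch, assuming the runner is registered against gitlab.com and can reach the cluster API; the runner name, namespace, image and resource values are placeholders, not the actual BuildStream configuration:

    concurrent = 10

    [[runners]]
      name = "buildstream-k8s"      # hypothetical runner name
      url = "https://gitlab.com/"
      token = "REDACTED"            # registration token not shown in the log
      executor = "kubernetes"
      [runners.kubernetes]
        namespace = "gitlab-runner" # assumed namespace
        image = "docker:stable"     # default job image when a job does not set one
        privileged = true           # assumed; needed for docker-in-docker style jobs
        cpu_request = "1"
        memory_request = "1Gi"
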
juergbijjardon: could a simpler short term solution be to set MaxBuilds in the gitlab-runner config such that it will use a single droplet for multiple jobs before destroying it?14:56
juergbias the jobs themselves are running in docker, it should still be clean14:56
juergbiand maybe also set IdleCount to something non-zero (but we probably still want a low number)14:57
jjardonmmm, I thought that option was only for jobs executed using the docker executor, not docker+machine; let me check14:58
juergbiit's a runner.machine config option, i.e., it is explicitly about machines/autoscaling14:58
jjardonoh, yeah you are right14:59
jjardonbut I think this will not solve the problem; machines already stay around for 30 min after being created14:59
jjardonbut if we create several pipelines (>1) on top of the ones already running, we will reach the rate limit as we need to create more machines15:01
juergbiso the default of MaxBuilds is already >1? somehow the documentation doesn't mention a default...15:01
jjardon(IdleTime = 1800 atm)15:01
juergbibut we have IdleCount 015:02
juergbiso we don't keep any around15:02
jjardonI need to refresh my memory15:02
juergbiif we start multiple pipelines at once, MaxBuilds might not be sufficient to avoid hitting the rate limit completely. but I'd hope it would still help a lot in average15:02
jjardonbut I think it is for the minimum number of machines that need to be idle at any point15:03
jjardonlet me check15:03
jjardonyeah: "As for IdleCount, it should be set to a value that will generate a minimum amount of not used machines when the job queue is empty."15:04
jjardonfrom https://docs.gitlab.com/runner/configuration/autoscale.html#how-concurrent-limit-and-idlecount-generate-the-upper-limit-of-running-machines15:04
juergbiah ok, and MaxBuilds is 0 which apparently means, there is no limit15:05
juergbii.e., droplets should get destroyed only after the 1800s idle time15:05
jjardonyeah, that is the current configuration (or at least was the intention)15:06
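
To make the settings being discussed concrete, a sketch of the [runners.machine] section of config.toml with the values mentioned in the conversation (IdleCount = 0, IdleTime = 1800, MaxBuilds left at its default of 0); the machine name pattern and driver options are placeholders, not the real bastion configuration:

    [runners.machine]
      IdleCount = 0          # no idle droplets kept warm; a small non-zero value would keep some ready
      IdleTime = 1800        # seconds (30 min) a droplet may sit idle before it is destroyed
      MaxBuilds = 0          # 0 = no limit on jobs per droplet before it is removed
      MachineDriver = "digitalocean"
      MachineName = "bst-runner-%s"                            # placeholder naming pattern
      MachineOptions = ["digitalocean-access-token=REDACTED"]  # placeholder driver options
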
jjardonjuergbi: ok, 2 permanent runners are set up for the entire buildstream group. I'm not aware of a way to prioritize runners (use permanent first, then fall back to elastic ones); should we pause the elastic ones until we find a solution for the rate problem?15:41
juergbijjardon: thanks. how many jobs can the permanent ones do in parallel?15:42
jjardon15 each15:43
jjardonso 30 max; we can try with more, but with 25 I see jobs failing because the docker daemon is not available15:43
juergbiok. could it make sense to keep the autoscale with a low max machine limit such that the chance of hitting the rate limit is low?15:43
jjardonjuergbi: ok, I will do that; 10 machines each for now15:44
juergbiok, let's try this for now. if the autoscale ones still act up, let's pause them15:44
jjardonyeah, ok15:45
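
Assuming the "10 machines each" cap is applied through the per-runner limit setting (the log does not say exactly which option was used), the relevant config.toml lines would look roughly like this, with a placeholder runner name and an assumed global concurrent value:

    concurrent = 20                  # assumed global cap across both elastic runners

    [[runners]]
      name = "elastic-do-runner-1"   # placeholder name
      executor = "docker+machine"
      limit = 10                     # at most 10 concurrent jobs, hence at most 10 droplets, from this runner
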
*** mohan43u has joined #buildstream15:47
jjardonjuergbi: benbrown as a reference, I've updated https://gitlab.com/BuildStream/infrastructure/infrastructure/-/wikis/home with current runners config16:24
*** suadrif has joined #buildstream16:28
*** suadrif has left #buildstream16:29
jjardonkubernetes seems to work! :) https://gitlab.com/BuildStream/buildstream/-/jobs/51761711117:00
jjardonI will pause it until more testing is done though17:00
juergbinice17:05
jjardonmmm, seems the kubernetes runner doesn't handle more than one job17:18
* jjardon needs to investigate more17:18
jjardonah no; I have 4 pods, but the problem is that it is not autoscaling17:21

Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!