*** ceibal has joined #buildstream | 00:11 | |
*** ceibal has quit IRC | 00:14 | |
*** narispo has quit IRC | 00:42 | |
*** narispo has joined #buildstream | 00:42 | |
*** narispo has quit IRC | 01:14 | |
*** narispo has joined #buildstream | 01:14 | |
*** slaf has joined #buildstream | 02:02 | |
*** narispo has quit IRC | 04:46 | |
*** narispo has joined #buildstream | 04:46 | |
*** narispo has quit IRC | 05:44 | |
*** narispo has joined #buildstream | 05:44 | |
*** slaf has quit IRC | 09:21 | |
*** slaf has joined #buildstream | 09:27 | |
*** mohan43u has joined #buildstream | 12:42 | |
*** mohan43u has quit IRC | 13:44 | |
*** mohan43u has joined #buildstream | 13:48 | |
*** mohan43u has quit IRC | 13:55 | |
jjardon | benbrown: juergbi I've added another bastion for the whole buildstream group so we don't rely only on the one in buildstream/buildstream (and we can split all the requests between the 2), but I think we are hitting a rate limit on the number of requests to DO: I see a lot of "Rate limit detected (waiting at least 5s)" in the logs | 14:11 |
jjardon | I think we are actually hitting a new burst rate limit DO introduced quite recently, and we can see it clearly now because we are using the docker-machine fork from gitlab; https://developers.digitalocean.com/documentation/v2/#rate-limit | 14:13 |
jjardon | 5,000 requests per hour or 250 requests per minute (5% of the hourly total) (the second part is new) | 14:14 |
jjardon | https://developers.digitalocean.com/documentation/changelog/api-v2/add-burst-rate-limits/ | 14:15 |
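A minimal sketch of how the remaining quota for a token could be checked: DO returns its rate-limit counters as response headers on every API call. This assumes the `requests` library and a `DO_TOKEN` environment variable holding one of the bastion tokens; both names are placeholders, not taken from the actual setup.

```python
# Sketch: read DigitalOcean's documented rate-limit headers.
# DO_TOKEN is a placeholder environment variable, not part of the real setup.
import os
import requests

resp = requests.get(
    "https://api.digitalocean.com/v2/account",
    headers={"Authorization": f"Bearer {os.environ['DO_TOKEN']}"},
)
# DO documents these headers alongside the 5,000/hour and 250/minute limits.
for name in ("ratelimit-limit", "ratelimit-remaining", "ratelimit-reset"):
    print(name, resp.headers.get(name))
```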
juergbi | jjardon: ah, this is relatively new on the DO side, that's why we didn't see this before | 14:17 |
jjardon | yeah, I think that is what happened | 14:17 |
juergbi | do you know roughly how many requests docker-machine issues per droplet/job? | 14:18 |
jjardon | combined with the fact that we now spin up like 20 machines for each pipeline | 14:18 |
jjardon | juergbi: no idea; I've been trying to find out | 14:18 |
jjardon | but I think we already reach the limit with 2 pipelines running | 14:18 |
jjardon | (40 jobs) | 14:18 |
juergbi | maybe docker-machine could be optimized, no idea, though | 14:19 |
juergbi | jjardon: do the two bastions use different OAuth tokens to increase our total limit? | 14:20 |
jjardon | juergbi: yep | 14:20 |
jjardon | I'm about to set up a permanent runner on a very big machine, so at least there are more options available until we find a permanent solution | 14:21 |
jjardon | the fact that docker-machine is in maintenance mode is not good either | 14:22 |
jjardon | maybe we should put the runners in a kubernetes cluster, which seems to be the modern way to do this | 14:22 |
cphang | jjardon I'd maybe look at a container job service such as ECS | 14:30 |
cphang | kubernetes will do the job, but more setup will be required | 14:31 |
jjardon | cphang: all the configuration is almost automatic with gitlab | 14:31 |
jjardon | problem is that I'm not sure you can make the cluster elastic automatically | 14:32 |
cphang | Yeah, I'm not worried about the setup of the cluster itself, but 1) the autoscaling 2) the job dispatch 3) reporting back to gitlab might have a few steps to walk through | 14:33 |
cphang | nothing insurmountable though | 14:33 |
jjardon | cphang: but we would still be using docker-machine, which is the thing we would like to avoid | 14:34 |
jjardon | or is there another way? | 14:34 |
* cphang is reading https://docs.gitlab.com/runner/executors/kubernetes.html#workflow | 14:36 | |
cphang | it seems to create pods, which if they don't use the job abstraction makes me a little sad | 14:36 |
cphang | https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/ | 14:36 |
cphang | To make it elastic, the dream would be to have a daemonset with a cluster autoscaler e.g. https://docs.aws.amazon.com/eks/latest/userguide/cluster-autoscaler.html | 14:37 |
cphang | So nodes get spun up as CPU load reaches a certain level | 14:37 |
cphang | That being said, I think that might be overengineering it for this use-case. I've had no problems with docker-machine on libreML, using GCP instead of DO | 14:39 |
cphang | And I know AWS has been used with some success too. | 14:39 |
jjardon | I think it would be possible, we simply need to install https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/ and it would work | 14:39 |
jjardon | cphang: we want to stop using docker-machine because it is not being developed anymore, not because it doesn't work (we have been running it for 2 years under a lot of load without a massive amount of problems); I think most of the current problems are on the DO side | 14:41 |
cphang | Sure, but I would do it by node, rather than pod, so you are actually requesting more resources to your cluster, rather than using the existing resources on a cluster with more pods | 14:41 |
cphang | jjardon ack | 14:41 |
jjardon | cphang: yep, that is exactly what above does; it creates more nodes if needed | 14:43 |
*** mohan43u has joined #buildstream | 14:44 | |
jjardon | I will create the big permanent runner so people can still work tomorrow, then maybe I will play with that idea | 14:45 |
*** mohan43u has quit IRC | 14:45 | |
cphang | jjardon to my knowledge https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/ will create more pods for an existing deployment, replicaset or statefulset, but https://docs.aws.amazon.com/eks/latest/userguide/cluster-autoscaler.html will register more nodes to an existing cluster. I think for your workloads you want the latter | 14:45 |
cphang | rather than the former. | 14:45 |
jjardon | cphang: you are totally correct | 14:46 |
jjardon | sorry, I linked the incorrect thing | 14:46 |
cphang | np :), just wanted to make sure my knowledge was correct as much as anything else :) | 14:46 |
cphang | It's worth digging into the details of https://docs.gitlab.com/runner/executors/kubernetes.html#connecting-to-the-kubernetes-api a lot more | 14:47 |
cphang | that will determine the cluster design imo | 14:47 |
jjardon | yeah, cluster autoscaler is a standard thing btw, not EKS only: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler | 14:47 |
cphang | oh yeh, forgot about that. | 14:47 |
cphang | cool | 14:47 |
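A rough sketch of how the upstream cluster-autoscaler is usually pointed at a node group on AWS/EKS, via its container command-line flags; the node-group name, min/max sizes and image tag are placeholders, not values from this discussion.

```yaml
# Relevant fragment of a cluster-autoscaler Deployment pod spec (sketch).
containers:
  - name: cluster-autoscaler
    image: k8s.gcr.io/cluster-autoscaler:v1.17.3   # registry/tag are placeholders; match the cluster version
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --nodes=1:20:buildstream-runner-nodes      # min:max:autoscaling-group name (placeholder)
      - --scale-down-unneeded-time=10m             # how long a node stays unneeded before removal
```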
jjardon | so, I _think_ gitlab will just request more resources, and if you have the above configured everything will work automagically | 14:48 |
cphang | jjardon it might be worth leveraging https://gitlab.com/celduin/infrastructure/celduin-infra/-/blob/master/terraform/lib/celduin.libsonnet | 14:49 |
cphang | you'll need to make some modifications to it, but the EKS portion should work out of the box | 14:50 |
cphang | If gitlab just wants you to setup a namespace + autoscaler and it'll handle the rest, that might be a quick win. | 14:50 |
juergbi | jjardon: could a simpler short term solution be to set MaxBuilds in the gitlab-runner config such that it will use a single droplet for multiple jobs before destroying it? | 14:56 |
juergbi | as the jobs themselves are running in docker, it should still be clean | 14:56 |
juergbi | and maybe also set IdleCount to something non-zero (but we probably still want a low number) | 14:57 |
jjardon | mmm, I thought that option was only for jobs executed using the docker executor, not docker+machine; let me check | 14:58 |
juergbi | it's a runner.machine config option, i.e., it is explicitly about machines/autoscaling | 14:58 |
jjardon | oh, yeah you are right | 14:59 |
jjardon | but I think this will not solve the problem; machines already stay around for 180 min after they are created | 14:59 |
jjardon | but if we create several pipelines (>1) on top of the ones already running, we will reach the rate limit as we need to create more machines | 15:01 |
juergbi | so the default of MaxBuilds is already >1? somehow the documentation doesn't mention a default... | 15:01 |
jjardon | (IdleTime = 1800 atm) | 15:01 |
juergbi | but we have IdleCount 0 | 15:02 |
juergbi | so we don't keep any around | 15:02 |
jjardon | I need to refresh my memory | 15:02 |
juergbi | if we start multiple pipelines at once, MaxBuilds might not be sufficient to avoid hitting the rate limit completely, but I'd hope it would still help a lot on average | 15:02 |
jjardon | but I think it is for the minimum number of machines that need to be idle at any point | 15:03 |
jjardon | let me check | 15:03 |
jjardon | yeah: "As for IdleCount, it should be set to a value that will generate a minimum amount of not used machines when the job queue is empty." | 15:04 |
jjardon | from https://docs.gitlab.com/runner/configuration/autoscale.html#how-concurrent-limit-and-idlecount-generate-the-upper-limit-of-running-machines | 15:04 |
juergbi | ah ok, and MaxBuilds is 0 which apparently means, there is no limit | 15:05 |
juergbi | i.e., droplets should get destroyed only after the 1800s idle time | 15:05 |
jjardon | yeah, that is the current configuration (or at least was the intention) | 15:06 |
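For context, all three knobs live in the `[runners.machine]` section of the gitlab-runner `config.toml`. A sketch reflecting the values quoted above (IdleCount = 0, IdleTime = 1800, MaxBuilds currently 0 meaning no limit), with an illustrative MaxBuilds value and placeholder driver details:

```toml
[[runners]]
  executor = "docker+machine"
  [runners.machine]
    IdleCount = 0        # idle machines kept around when the job queue is empty
    IdleTime  = 1800     # seconds an idle machine survives before it is removed
    MaxBuilds = 20       # illustrative: jobs per machine before it is destroyed (0 = no limit)
    MachineDriver = "digitalocean"
    MachineName   = "runner-%s"   # placeholder pattern
```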
jjardon | juergbi: ok, 2 permanent runners are set up for the entire buildstream group. I'm not aware of a way to prioritize runners (use permanent first, then fall back to elastic ones); should we pause the elastic ones until we find a solution for the rate problem? | 15:41 |
juergbi | jjardon: thanks. how many jobs can the permanent ones do in parallel? | 15:42 |
jjardon | 15 each | 15:43 |
jjardon | so 30 max; we can try more, but with 25 I see jobs failing because the docker daemon is not available | 15:43 |
juergbi | ok. could it make sense to keep the autoscale with a low max machine limit such that the chance of hitting the rate limit is low? | 15:43 |
jjardon | juergbi: ok, I will do that; 10 machines each for now | 15:44 |
juergbi | ok, let's try this for now. if the autoscale ones still act up, let's pause them | 15:44 |
jjardon | yeah, ok | 15:45 |
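Capping an autoscaled runner like this is typically done with the per-runner `limit` setting, which bounds concurrent jobs (and therefore droplets) for that registration; a sketch using the 10-machine figure mentioned above, everything else a placeholder:

```toml
[[runners]]
  name  = "do-autoscale-1"    # placeholder
  limit = 10                  # at most 10 concurrent jobs => at most ~10 droplets
  executor = "docker+machine"
```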
*** mohan43u has joined #buildstream | 15:47 | |
jjardon | juergbi: benbrown as a reference, I've updated https://gitlab.com/BuildStream/infrastructure/infrastructure/-/wikis/home with current runners config | 16:24 |
*** suadrif has joined #buildstream | 16:28 | |
*** suadrif has left #buildstream | 16:29 | |
jjardon | kubernetes seems to work! :) https://gitlab.com/BuildStream/buildstream/-/jobs/517617111 | 17:00 |
jjardon | I will pause it until more testing is done though | 17:00 |
juergbi | nice | 17:05 |
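The kubernetes executor itself is selected per runner in `config.toml`; a minimal sketch of such a registration, where the namespace, image and resource values are placeholders rather than the actual BuildStream configuration:

```toml
concurrent = 15

[[runners]]
  name = "k8s-runner"                  # placeholder name
  url = "https://gitlab.com/"
  token = "REDACTED"
  executor = "kubernetes"
  [runners.kubernetes]
    namespace = "gitlab-runner"        # namespace the job pods are created in
    image = "docker:stable"            # default image when a job sets none
    cpu_request = "2"
    memory_request = "4Gi"
```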
jjardon | mmm, seems the kubernetes runner doesn't handle more than one job | 17:18 |
* jjardon needs to investigate more | 17:18 | |
jjardon | ah no; I have 4 pods but the problem is that it is not autoscaling | 17:21 |