*** ceibal has joined #buildstream | 00:11 | |
*** ceibal has quit IRC | 00:14 | |
*** narispo has quit IRC | 00:42 | |
*** narispo has joined #buildstream | 00:42 | |
*** narispo has quit IRC | 01:14 | |
*** narispo has joined #buildstream | 01:14 | |
*** slaf has joined #buildstream | 02:02 | |
*** narispo has quit IRC | 04:46 | |
*** narispo has joined #buildstream | 04:46 | |
*** narispo has quit IRC | 05:44 | |
*** narispo has joined #buildstream | 05:44 | |
*** slaf has quit IRC | 09:21 | |
*** slaf has joined #buildstream | 09:27 | |
*** mohan43u has joined #buildstream | 12:42 | |
*** mohan43u has quit IRC | 13:44 | |
*** mohan43u has joined #buildstream | 13:48 | |
*** mohan43u has quit IRC | 13:55 | |
jjardon | benbrown: juergbi I've added another bastion for the whole buildstream group so we don't rely only on the one in buildstream/buildstream (and we can split all the requests between the 2), but I think we are hitting a rate limit on the number of requests to DO: I see a lot of "Rate limit detected (waiting at least 5s)" in the logs | 14:11 |
jjardon | I think we are actually hitting a new burst rate limit DO introduced quite recently, and we can see it clearly now because we are using the docker-machine fork from gitlab; https://developers.digitalocean.com/documentation/v2/#rate-limit | 14:13 |
jjardon | 5,000 requests per hour or 250 requests per minute (5% of the hourly total) (the second part is new) | 14:14 |
jjardon | https://developers.digitalocean.com/documentation/changelog/api-v2/add-burst-rate-limits/ | 14:15 |
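A minimal sketch of how the remaining quota for a token could be checked: DO returns its rate-limit counters as response headers on every API call. This assumes the `requests` library and a `DO_TOKEN` environment variable holding one of the bastion tokens; both names are placeholders, not taken from the actual setup.

```python
# Sketch: read DigitalOcean's documented rate-limit headers.
# DO_TOKEN is a placeholder environment variable, not part of the real setup.
import os
import requests

resp = requests.get(
    "https://api.digitalocean.com/v2/account",
    headers={"Authorization": f"Bearer {os.environ['DO_TOKEN']}"},
)
# DO documents these headers alongside the 5,000/hour and 250/minute limits.
for name in ("ratelimit-limit", "ratelimit-remaining", "ratelimit-reset"):
    print(name, resp.headers.get(name))
```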
juergbi | jjardon: ah, this is relatively new on the DO side, that's why we didn't see this before | 14:17 |
jjardon | yeah, I think that is what happened | 14:17 |
juergbi | do you know roughly how many requests docker-machine issues per droplet/job? | 14:18 |
jjardon | combined with the fact that we now spin up like 20 machines for each pipeline | 14:18 |
jjardon | juergbi: no idea; I've been trying to find out | 14:18 |
jjardon | but I think we already reach the limit with 2 pipelines running | 14:18 |
jjardon | (40 jobs) | 14:18 |
juergbi | maybe docker-machine could be optimized, no idea, though | 14:19 |
juergbi | jjardon: do the two bastions use different OAuth tokens to increase our total limit? | 14:20 |
jjardon | juergbi: yep | 14:20 |
jjardon | I'm about to set up a permanent runner on a very big machine, so at least there are more options available until we find a permanent solution | 14:21 |
jjardon | the fact that docker-machine is in maintenance mode is not good either | 14:22 |
jjardon | maybe we should put the runners in a kubernetes cluster, which seems to be the modern way to do this | 14:22 |
cphang | jjardon I'd maybe look at a container job service such as ECS | 14:30 |
cphang | kubernetes will do the job, but more setup will be required | 14:31 |
jjardon | cphang: all the configuration is almost automatic with gitlab | 14:31 |
jjardon | problem is that I'm not sure you can make the cluster elastic automatically | 14:32 |
cphang | Yeah, I'm not worried about the setup of the cluster itself, but 1) the autoscaling 2) the job dispatch 3) reporting back to gitlab might have a few steps to walk through | 14:33 |
cphang | nothing insurmountable though | 14:33 |
jjardon | cphang: but we would still be using docker-machine, which is the thing we would like to avoid | 14:34 |
jjardon | or is there another way? | 14:34 |
* cphang is reading https://docs.gitlab.com/runner/executors/kubernetes.html#workflow | 14:36 | |
cphang | it seems to create pods, which if they don't use the job abstraction makes me a little sad | 14:36 |
cphang | https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/ | 14:36 |
cphang | To make it elastic, the dream would be to have a daemonset with a cluster autoscaler e.g. https://docs.aws.amazon.com/eks/latest/userguide/cluster-autoscaler.html | 14:37 |
cphang | So nodes get spun up as CPU load reaches a certain level | 14:37 |
cphang | That being said, I think that might be overengineering it for this use-case. I've had no problems with docker-machine on libreML, using GCP instead of DO | 14:39 |
cphang | And I know AWS has been used with some success too. | 14:39 |
jjardon | I think it would be possible, we simply need to install https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/ and it would work | 14:39 |
jjardon | cphang: we want to stop using docker-machine because it is not being developed anymore, not because it doesn't work (we have been running it for 2 years under a lot of load without a massive amount of problems); I think most of the current problems are on the DO side | 14:41 |
cphang | Sure, but I would do it by node, rather than pod, so you are actually requesting more resources to your cluster, rather than using the existing resources on a cluster with more pods | 14:41 |
cphang | jjardon ack | 14:41 |
jjardon | cphang: yep, that is exactly what above does; it creates more nodes if needed | 14:43 |
*** mohan43u has joined #buildstream | 14:44 | |
jjardon | I will create the big permanent runner so people can still work tomorrow, then maybe I will play with that idea | 14:45 |
*** mohan43u has quit IRC | 14:45 | |
cphang | jjardon to my knowledge https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/ will create more pods for an existing deployment, replicaset or statefulset, but https://docs.aws.amazon.com/eks/latest/userguide/cluster-autoscaler.html will register more nodes to an existing cluster. I think for your workloads you want the latter | 14:45 |
cphang | rather than the former. | 14:45 |
jjardon | cphang: you are totally correct | 14:46 |
jjardon | sorry, I linked the incorrect thing | 14:46 |
cphang | np :), just wanted to make sure my knowledge was correct as much as anything else :) | 14:46 |
cphang | It's worth digging into the details of https://docs.gitlab.com/runner/executors/kubernetes.html#connecting-to-the-kubernetes-api a lot more | 14:47 |
cphang | that will determine the cluster design imo | 14:47 |
jjardon | yeah, cluster autoscaler is a standard thing btw, not EKS only: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler | 14:47 |
cphang | oh yeh, forgot about that. | 14:47 |
cphang | cool | 14:47 |
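A rough sketch of how the upstream cluster-autoscaler is usually pointed at a node group on AWS/EKS, via its container command-line flags; the node-group name, min/max sizes and image tag are placeholders, not values from this discussion.

```yaml
# Relevant fragment of a cluster-autoscaler Deployment pod spec (sketch).
containers:
  - name: cluster-autoscaler
    image: k8s.gcr.io/cluster-autoscaler:v1.17.3   # registry/tag are placeholders; match the cluster version
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --nodes=1:20:buildstream-runner-nodes      # min:max:autoscaling-group name (placeholder)
      - --scale-down-unneeded-time=10m             # how long a node stays unneeded before removal
```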
jjardon | so, I _think_ gitlab will just request more resources, and if you have the above configured everything will work automagically | 14:48 |
cphang | jjardon it might be worth leveraging https://gitlab.com/celduin/infrastructure/celduin-infra/-/blob/master/terraform/lib/celduin.libsonnet | 14:49 |
cphang | you'll need to make some modifications to it, but the EKS portion should work out of the box | 14:50 |
cphang | If gitlab just wants you to setup a namespace + autoscaler and it'll handle the rest, that might be a quick win. | 14:50 |
juergbi | jjardon: could a simpler short term solution be to set MaxBuilds in the gitlab-runner config such that it will use a single droplet for multiple jobs before destroying it? | 14:56 |
juergbi | as the jobs themselves are running in docker, it should still be clean | 14:56 |
juergbi | and maybe also set IdleCount to something non-zero (but we probably still want a low number) | 14:57 |
jjardon | mmm, I thought that option was only for jobs executed using the docker executor, not docker+machine; let me check | 14:58 |
juergbi | it's a runner.machine config option, i.e., it is explicitly about machines/autoscaling | 14:58 |
jjardon | oh, yeah you are right | 14:59 |
jjardon | but I think this will not solve the problem; machines already stay around for 180 min after they are created | 14:59 |
jjardon | but if we create several pipelines (>1) on top of the ones already running, we will reach the rate limit as we need to create more machines | 15:01 |
juergbi | so the default of MaxBuilds is already >1? somehow the documentation doesn't mention a default... | 15:01 |
jjardon | (IdleTime = 1800 atm) | 15:01 |
juergbi | but we have IdleCount 0 | 15:02 |
juergbi | so we don't keep any around | 15:02 |
jjardon | I need to refresh my memory | 15:02 |
juergbi | if we start multiple pipelines at once, MaxBuilds might not be sufficient to avoid hitting the rate limit completely, but I'd hope it would still help a lot on average | 15:02 |
jjardon | but I think it is for the minimum number of machines that need to be idle at any point | 15:03 |
jjardon | let me check | 15:03 |
jjardon | yeah: "As for IdleCount, it should be set to a value that will generate a minimum amount of not used machines when the job queue is empty." | 15:04 |
jjardon | from https://docs.gitlab.com/runner/configuration/autoscale.html#how-concurrent-limit-and-idlecount-generate-the-upper-limit-of-running-machines | 15:04 |
juergbi | ah ok, and MaxBuilds is 0 which apparently means, there is no limit | 15:05 |
juergbi | i.e., droplets should get destroyed only after the 1800s idle time | 15:05 |
jjardon | yeah, that is the current configuration (or at least was the intention) | 15:06 |
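For context, all three knobs live in the `[runners.machine]` section of the gitlab-runner `config.toml`. A sketch reflecting the values quoted above (IdleCount = 0, IdleTime = 1800, MaxBuilds currently 0 meaning no limit), with an illustrative MaxBuilds value and placeholder driver details:

```toml
[[runners]]
  executor = "docker+machine"
  [runners.machine]
    IdleCount = 0        # idle machines kept around when the job queue is empty
    IdleTime  = 1800     # seconds an idle machine survives before it is removed
    MaxBuilds = 20       # illustrative: jobs per machine before it is destroyed (0 = no limit)
    MachineDriver = "digitalocean"
    MachineName   = "runner-%s"   # placeholder pattern
```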
jjardon | juergbi: ok, 2 permanent runners are set up for the entire buildstream group. I'm not aware of a way to prioritize runners (use permanent first, then fall back to elastic ones); should we pause the elastic ones until we find a solution for the rate problem? | 15:41 |
juergbi | jjardon: thanks. how many jobs can the permanent ones do in parallel? | 15:42 |
jjardon | 15 each | 15:43 |
jjardon | so 30 max; we can try more, but with 25 I see jobs failing because the docker daemon is not available | 15:43 |
juergbi | ok. could it make sense to keep the autoscale with a low max machine limit such that the chance of hitting the rate limit is low? | 15:43 |
jjardon | juergbi: ok, I will do that; 10 machines each for now | 15:44 |
juergbi | ok, let's try this for now. if the autoscale ones still act up, let's pause them | 15:44 |
jjardon | yeah, ok | 15:45 |
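Capping an autoscaled runner like this is typically done with the per-runner `limit` setting, which bounds concurrent jobs (and therefore droplets) for that registration; a sketch using the 10-machine figure mentioned above, everything else a placeholder:

```toml
[[runners]]
  name  = "do-autoscale-1"    # placeholder
  limit = 10                  # at most 10 concurrent jobs => at most ~10 droplets
  executor = "docker+machine"
```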
*** mohan43u has joined #buildstream | 15:47 | |
jjardon | juergbi: benbrown as a reference, I've updated https://gitlab.com/BuildStream/infrastructure/infrastructure/-/wikis/home with current runners config | 16:24 |
*** suadrif has joined #buildstream | 16:28 | |
*** suadrif has left #buildstream | 16:29 | |
jjardon | kubernetes seems to work! :) https://gitlab.com/BuildStream/buildstream/-/jobs/517617111 | 17:00 |
jjardon | I will pause it until more testing is done though | 17:00 |
juergbi | nice | 17:05 |
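The kubernetes executor itself is selected per runner in `config.toml`; a minimal sketch of such a registration, where the namespace, image and resource values are placeholders rather than the actual BuildStream configuration:

```toml
concurrent = 15

[[runners]]
  name = "k8s-runner"                  # placeholder name
  url = "https://gitlab.com/"
  token = "REDACTED"
  executor = "kubernetes"
  [runners.kubernetes]
    namespace = "gitlab-runner"        # namespace the job pods are created in
    image = "docker:stable"            # default image when a job sets none
    cpu_request = "2"
    memory_request = "4Gi"
```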
jjardon | mmm, seems the kubernetes runner doesn't handle more than one job | 17:18 |
* jjardon needs to investigate more | 17:18 | |
jjardon | ah no; I have 4 pods but the problem is that it is not autoscaling | 17:21 |