8 comments

  • hazz99 2 hours ago
    I’m sure this work is very impressive, but these QPS numbers don’t seem particularly high to me, at least compared to existing horizontally scalable service patterns. Why is it hard for the kube control plane to hit these numbers?

    For instance, postgres can hit this sort of QPS easily, afaik. It’s not distributed, but I’m sure Vitess could do something similar. The query patterns don’t seem particularly complex either.

    Not trying to be reductive - I’m sure there’s some complexity here I’m missing!

    • phrotoma 43 minutes ago
      I am extremely Not A Database Person, but I understand that the rationale for Kubernetes adopting etcd as its preferred data store was more about its distributed consistency features and less about query throughput. etcd is slower because it's doing Raft things and flushing writes to disk.

      Projects like kine allow K8s users to swap SQLite or Postgres in place of etcd, which (I assume, please correct me otherwise) would deliver better throughput since those backends don't need to perform consensus operations.

      https://github.com/k3s-io/kine
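
      A rough sketch of what that swap looks like in practice (connection strings are placeholders; I'm assuming k3s's --datastore-endpoint flag and kine's --endpoint flag here, so correct me if those have changed):

          # k3s with an external Postgres datastore (kine is embedded)
          k3s server --datastore-endpoint="postgres://user:pass@db-host:5432/kubernetes"

          # or run kine standalone; it exposes an etcd-compatible API,
          # so kube-apiserver's --etcd-servers can point at it
          kine --endpoint="postgres://user:pass@db-host:5432/kubernetes"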

      • dijit 3 minutes ago
        You might not be a database person, but you’re spot on.

        A well-managed HA PostgreSQL setup (active/passive) is going to run circles around etcd for kube control-plane operations.

        The caveat is an increased risk of downtime and much higher management overhead, which is why it's not the default.

  • yanhangyhy 34 minutes ago
    There is a doc about how to do this with 1M nodes: https://bchess.github.io/k8s-1m/#_why

    So I guess the title is not true?

  • xyse53 1 hour ago
    They mention GCS FUSE. We've had nothing but performance and stability problems with it.

    We treat it as a best effort alternative when native GCS access isn't possible.

    • dijit 0 minutes ago
      FUSE-based filesystems in general shouldn't be treated as production-ready, in my experience.

      They're wonderful for low-volume, low-performance, low-reliability operations (browsing, copying, integrating with legacy systems that don't permit native access), but beyond that they consume huge resources and do odd things when the backend is not in its most ideal state.

  • jakupovic 20 minutes ago
    Doing this at anything > 1k nodes is a pain in the butt. We decided to run many <100-node clusters rather than a few big ones.
  • zoobab 1 hour ago
    The new mainframe.
  • belter 54 minutes ago
    130k nodes...cute...but can Google conquer the ultimate software engineering challenge they warn you about in CS school? A functional online signup flow?
    • jasonvorhe 43 minutes ago
      For what? Access to the control plane API?
      • belter 27 minutes ago
        In general... Try to sign up for their AI services...
  • rvz 2 hours ago
    > While we don’t yet officially support 130K nodes, we're very encouraged by these findings. If your workloads require this level of scale, reach out to us to discuss your specific needs

    Obviously this is a typical experiment at Google on running a K8s cluster at 130K nodes, but if there is a company out there that "requires" this scale, I must question their architecture and their infrastructure costs.

    But of course someone will always insist that they somehow need this sort of scale to run their enterprise app. So once again, let's remind the pre-revenue startups talking about scale before they hit PMF:

    Unless you are ready to donate tens of billions of dollars yearly, you do not need this.

    You are not Google.

    • mlnj 22 minutes ago
      >You are not Google.

      It's literally Google coming out with this capability, so how is the criticism still "You are not Google"?

      • Rastonbury 1 minute ago
        The criticism is aimed at pre-PMF startups who believe they need something similar.