Karpenter Introduction

Karpenter is an open-source node lifecycle manager built by AWS. It watches for incoming pods and launches appropriately sized nodes in response; node selection is driven by provisioning policies and by the specifications of the incoming pods, including their resource requests and scheduling constraints.

(Figure: Karpenter, a new generation of Kubernetes autoscaling)

Its main functions include:

  • Starting nodes for unschedulable Pods
  • Replacing existing nodes to improve resource utilization
  • Terminating nodes when they expire or are no longer needed
  • Gracefully terminating nodes before preemption

Advantages#

Karpenter observes the aggregated resource requests of unscheduled Pods and makes decisions to start and terminate nodes to minimize scheduling latency and infrastructure costs.

Benefits

  1. Faster: Karpenter talks directly to the EC2 Fleet API on AWS, so it can pick instance types that match pod specifications and bring capacity up quickly.
  2. More cost-effective: Karpenter provisions nodes to fit the workload, bin-packing pods onto the smallest number of right-sized nodes and saving costs.
  3. More flexible: Karpenter offers instance-type and capacity flexibility and diversity without having to create dozens of node groups.
  4. Avoids some inherent limitations of the native Cluster Autoscaler, such as being coupled to Auto Scaling groups when adding or removing capacity and being unable to mix instance sizes within a single Auto Scaling group.
  5. Karpenter automatically detects storage dependencies: if a pod's PersistentVolume (for example, an EBS volume) was created in zone A when the pod first started, subsequent node-provisioning decisions for that pod will also land in zone A, because EBS volumes are zonal and cannot be attached across availability zones (see the sketch after this list).
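
As a sketch of this behavior, consider an EBS-backed PersistentVolumeClaim; the names and sizes are illustrative, and the AWS EBS CSI driver is assumed to be installed. The volume is created in whichever zone the pod first lands in, and from then on Karpenter provisions nodes for that pod only in that zone.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3 # illustrative name
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer # create the volume where the pod first schedules
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ebs-gp3
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: stateful-app
spec:
  containers:
    - name: app
      image: nginx # illustrative image
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data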

Usage#

Prerequisites#

  1. Deploy the Metrics Server so that pod and node metrics are available for resource-based decisions.
  2. Install the Karpenter controller, which is the core component of Karpenter.
  3. Install the Karpenter dashboards to view the nodes managed by Karpenter and their performance metrics.

For installation details, see the official guide: Getting Started | Karpenter.

Karpenter Provisioner template:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  providerRef:
    name: default

  # Taints applied to provisioned nodes; pods must tolerate them to be scheduled
  taints:
    - key: example.com/special-taint
      effect: NoSchedule

  # Taints applied only during startup; expected to be removed by another
  # controller (e.g. a CNI plugin) once node initialization completes
  startupTaints:
    - key: example.com/another-taint
      effect: NoSchedule

  # Labels applied to all nodes created by this Provisioner
  labels:
    billing-team: my-team

  # Requirements that constrain the parameters of provisioned nodes.
  # These requirements are combined with pod.spec.affinity.nodeAffinity rules.
  # Operators { In, NotIn } are supported to enable including or excluding values
  requirements:
    - key: "karpenter.k8s.aws/instance-category"
      operator: In
      values: ["c", "m", "r"]
    - key: "karpenter.k8s.aws/instance-cpu"
      operator: In
      values: ["4", "8", "16", "32"]
    - key: "karpenter.k8s.aws/instance-hypervisor"
      operator: In
      values: ["nitro"]
    - key: "topology.kubernetes.io/zone"
      operator: In
      values: ["us-west-2a", "us-west-2b"]
    - key: "kubernetes.io/arch"
      operator: In
      values: ["arm64", "amd64"]
    - key: "karpenter.sh/capacity-type" # If not included, the webhook for the AWS cloud provider will default to on-demand
      operator: In
      values: ["spot", "on-demand"]

  # Kubelet settings passed through to the nodes this Provisioner launches
  kubeletConfiguration:
    clusterDNS: ["10.0.1.100"]
    containerRuntime: containerd
    systemReserved:
      cpu: 100m
      memory: 100Mi
      ephemeral-storage: 1Gi
    kubeReserved:
      cpu: 200m
      memory: 100Mi
      ephemeral-storage: 3Gi
    evictionHard:
      memory.available: 5%
      nodefs.available: 10%
      nodefs.inodesFree: 10%
    evictionSoft:
      memory.available: 500Mi
      nodefs.available: 15%
      nodefs.inodesFree: 15%
    evictionSoftGracePeriod:
      memory.available: 1m
      nodefs.available: 1m30s
      nodefs.inodesFree: 2m
    evictionMaxPodGracePeriod: 3m
    podsPerCore: 2
    maxPods: 20

  # Cluster-wide cap on the total resources this Provisioner may launch
  limits:
    resources:
      cpu: "1000"
      memory: 1000Gi

  # Consolidation actively deletes or replaces under-utilized nodes to cut cost
  consolidation:
    enabled: true

  ttlSecondsUntilExpired: 2592000 # 30 days = 60 * 60 * 24 * 30 seconds

  # Mutually exclusive with consolidation.enabled=true; enable one or the other
  # ttlSecondsAfterEmpty: 30
  weight: 10 # relative priority among Provisioners; higher weights are preferred
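
Once this Provisioner is applied, any pods that cannot fit on existing nodes trigger provisioning. Below is a minimal sketch of a workload that exercises it; the Deployment name, replica count, and resource numbers are illustrative, while the nodeSelector and toleration match the Provisioner above. Note that nodes will not actually launch until the node template referenced by providerRef (shown below) exists.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 5
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      nodeSelector:
        karpenter.sh/capacity-type: spot # must fall within the Provisioner's requirements
      tolerations:
        - key: example.com/special-taint # tolerate the taint the Provisioner applies
          operator: Exists
          effect: NoSchedule
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
          resources:
            requests:
              cpu: "1" # aggregated requests drive Karpenter's instance-type choice

Scaling this Deployment up leaves pods Pending; Karpenter observes them and launches a node that satisfies both the Provisioner's requirements and the pods' own constraints.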

You also need to configure the node template that providerRef points to.

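In the AWS provider this is an AWSNodeTemplate, which tells Karpenter how to discover subnets and security groups and which AMI family to use. A minimal sketch, assuming karpenter.sh/discovery: my-cluster tags (the tag value and instance tags are illustrative and must match your own environment):

apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default # matches the Provisioner's providerRef
spec:
  subnetSelector: # discover subnets by tag
    karpenter.sh/discovery: my-cluster
  securityGroupSelector: # discover security groups by tag
    karpenter.sh/discovery: my-cluster
  amiFamily: AL2 # EKS-optimized Amazon Linux 2 AMIs
  tags: # extra tags applied to the launched instances
    managed-by: karpenter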

Principles#

Karpenter uses a layered constraint model; where a pod can run is governed by three layers of constraints:

  • It must run in an availability zone where its dependent applications or storage live
  • It may require specific processor types or other hardware (matching nodes must exist)
  • It may use techniques such as topology spread to ensure high availability

The first layer is bounded by the hardware types and zones the cloud provider offers; the third layer is expressed through pod-level scheduling features; Karpenter implements the second layer: by configuring Provisioners, you control which nodes Karpenter is allowed to launch, and it makes provisioning decisions within those bounds.

The constraints that pods can request, and that Karpenter honors when provisioning, include:

  • Resource requests: request a certain amount of memory or CPU.
  • Node selection: run only on nodes with specific labels (nodeSelector).
  • Node affinity: schedule onto nodes with specific attributes (affinity).
  • Topology spread: use topology spread constraints to keep the application highly available.
  • Pod affinity/anti-affinity: pull pods towards or away from topology domains based on where other pods are scheduled.
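
As a sketch of how several of these constraints combine in a single pod spec (the app label, image, and resource numbers are illustrative; billing-team: my-team and the toleration match the Provisioner template above):

apiVersion: v1
kind: Pod
metadata:
  name: constrained-pod
  labels:
    app: web
spec:
  nodeSelector: # node selection by label
    billing-team: my-team
  tolerations: # tolerate the taint set by the Provisioner template above
    - key: example.com/special-taint
      operator: Exists
      effect: NoSchedule
  affinity:
    nodeAffinity: # node affinity: require a supported CPU architecture
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values: ["amd64", "arm64"]
  topologySpreadConstraints: # spread replicas across zones for availability
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: web
  containers:
    - name: web
      image: nginx # illustrative image
      resources:
        requests: # requests that Karpenter aggregates when sizing a node
          cpu: 500m
          memory: 512Mi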

Karpenter scales nodes down (deprovisions) through several mechanisms:

  • Provisioner deletion: deleting a Provisioner gracefully drains and terminates all of its nodes.
  • Emptiness: when a node runs no non-DaemonSet workloads and ttlSecondsAfterEmpty is set, the node is reclaimed once the TTL elapses.
  • Expiration: when ttlSecondsUntilExpired is set, a node is automatically taken offline once it reaches its expiration time.
  • Consolidation: Karpenter proactively deletes nodes, or replaces them with cheaper ones, to save cost. Specifically:
    • if every pod on a node can run on the free capacity of other nodes in the cluster, the node can be deleted;
    • if every pod on a node can run on a combination of the free capacity of other nodes and a single cheaper replacement node, the node can be replaced.
  • Interruption: with interruption detection enabled, Karpenter determines when a node is about to experience an interruption event (such as spot reclamation) and proactively takes it offline.
  • Drift: Karpenter takes offline nodes that have drifted from their desired state, for example when the AMI on the instance no longer matches the AMI configured in the AWSNodeTemplate.
  • Manual deletion: nodes deleted by hand (kubectl delete node) are also drained gracefully by Karpenter.

Use Cases#

On AWS, Karpenter can replace the creation of fixed node groups: taints, tolerations, and affinity/anti-affinity rules express far more flexible policies for how nodes are provisioned and which pods land on them, as in the sketch below.
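
For example, instead of maintaining a dedicated GPU node group, a tainted GPU Provisioner can coexist with the default one, and GPU workloads opt in through a toleration. A minimal sketch, assuming the NVIDIA device plugin exposes the nvidia.com/gpu resource (the instance types, taint key, and image are illustrative):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu
spec:
  providerRef:
    name: default # reuse the node template from above
  requirements:
    - key: "node.kubernetes.io/instance-type"
      operator: In
      values: ["g4dn.xlarge", "g4dn.2xlarge"]
  taints:
    - key: nvidia.com/gpu # keep non-GPU pods off these nodes
      value: "true"
      effect: NoSchedule
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job
spec:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvidia/cuda:11.8.0-base-ubuntu22.04 # illustrative image
      resources:
        requests:
          nvidia.com/gpu: "1" # extended resources must appear in both requests and limits
        limits:
          nvidia.com/gpu: "1"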
