Troubleshooting

Datadog Agent Startup Failure With Hostname Detection

The Datadog (DD) agent failed to start in a container because it was not auto-detecting the container hostname. The agent works fine when it runs directly on a host, but the failure appears when it runs in a container, so the agent configuration needs another look.

Fix: The issue was the agent being unable to connect to the kubelet API over HTTPS, because TLS verification is enabled by default. Disable TLS verification by setting the DD_KUBELET_TLS_VERIFY variable for all containers in the agent manifest (with Helm, in values.yaml), then redeploy the agent and the issue will be fixed (a hedged Helm sketch follows below).

For more such troubleshooting, follow us at cubensquare.com
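Here is that sketch; it assumes the datadog/datadog Helm chart exposes a datadog.kubelet.tlsVerify value that maps to DD_KUBELET_TLS_VERIFY, so verify the key against the chart version you run, or set the environment variable directly on the agent containers instead:

# Sketch only: disable kubelet TLS verification via the Helm chart
helm repo add datadog https://helm.datadoghq.com
helm upgrade --install datadog datadog/datadog \
  --namespace datadog \
  --set datadog.apiKey=<YOUR_API_KEY> \
  --set datadog.kubelet.tlsVerify=false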

Datadog Visualization Thing

For a specific cluster, Datadog was reporting a change in the Max, Min and Current replicas roughly every 20 minutes for the metric kubernetes_state.hpa.max_replicas, even though neither the actual number of pods/containers nor the cluster's config files appeared to be changing. In rare cases this turns out to be a Datadog visualization issue rather than a real scaling event: aggregating with max by instead of sum by produced the desired visualization. Switching to incognito/private browsing, clearing the cache, or logging out and back in after a few minutes also cleared the stale view.
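As a hedged illustration (the metric name comes from the post; the cluster tag and the grouping tag are placeholders for whatever your dashboard uses), the widget query change looks roughly like this:

sum:kubernetes_state.hpa.max_replicas{cluster_name:my-cluster} by {hpa}   (can show apparent jumps when several series report)
max:kubernetes_state.hpa.max_replicas{cluster_name:my-cluster} by {hpa}   (produced the expected steady value)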

CPU & Memory Resource Limit For A POD

Is there a suggested way to set the CPU and memory request and limit for a pod? Yes. The resource request should sit near where the application runs 100% of the time: the intention is to operate near the request, tolerate spikes above it, and stay out of the 'gray zone' between the request and the limit. For example, if the application normally uses around 500Mi of memory, you could set the request to 750Mi to include a buffer and the limit to 1Gi. Kubernetes will OOM-kill pods that run above their limit, and may also kill pods running in the gray area, depending on whether the node needs the resources. A minimal manifest sketch of this pattern follows below.
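Here is that sketch, using the example numbers from the text; the pod name, image and CPU figures are illustrative assumptions, not recommendations:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: sample-app            # hypothetical pod name
spec:
  containers:
  - name: app
    image: nginx:1.25         # placeholder image
    resources:
      requests:
        memory: "750Mi"       # a little above typical usage (~500Mi) to allow a buffer
        cpu: "500m"           # illustrative CPU request
      limits:
        memory: "1Gi"         # hard ceiling; exceeding it gets the container OOM-killed
        cpu: "1"              # illustrative CPU limit
EOF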

Failed Create Pod SandBox

Troubleshooting:
– Checked the dedicated node that was having issues with Docker.
– Docker appeared healthy enough to pass its checks, but not healthy enough to run pods.
– This scenario typically happens when an underlying infrastructure issue (for example, exhausted I/O) is present even though the node looks fine.
– After a few minutes, the pods in Error state came back to Running.
– Two days later the issue popped up again, and the new node hit 100% CPU utilisation almost instantly.
– Digging further, we found that a Datadog agent had been installed/configured recently, with no other changes.
– Considering the Datadog memory usage, the node instance type was bumped up from c5.xlarge to c5.2xlarge.
– Even after the instance type change, the issue remained intermittent.
– Digging further still, we found the actual root cause.

Actual root cause: the Datadog integration with Kubernetes. An earlier version of the library had memory issues; the pod grew to about 5 GB and then failed at that point. Once we moved to the full release version of the library, things were stable with no memory issues. So the real issue and fix was not a bigger instance type with more CPU/memory, but the unstable library. A few commands that help surface this kind of runaway pod are sketched below.
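These are hedged examples of checks that help catch this pattern; the node name and the datadog namespace are placeholders for your environment:

kubectl get events -A --field-selector reason=FailedCreatePodSandBox   # find the nodes throwing the sandbox error
kubectl describe node <node-name>                                      # check Conditions: MemoryPressure, DiskPressure, PIDPressure
kubectl top pods -n datadog --sort-by=memory                           # watch the agent pod's memory creep toward several GB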

Auto Scaling Of PODs To Respond To The Real Demand Of Service | TPS Is A Real Challenge

Streaming movies, web series and sports is the current market trend. Google 'world cup streaming' and you will find any number of streaming platforms. Considering this trend and customers' expectation of flawless streaming, setting up the technology platform has become a challenge. We have multiple tool options to handle the workload, but filtering them and implementing one is another big challenge, and at the same time a fun-filled journey that includes planning, design, implementation and monitoring.

For one streaming-platform customer, the expectation set for the technical team was to ensure that the platform autoscales / has elasticity and that dynamic workloads are handled. Based on the teams that play and their statistics, the expected number of viewers, transactions per second (TPS) and requests per second (RPS) are decided. For instance, if there is a match between India and Pakistan, the number of viewers will be very high, and as a cherry on top, if the match stays interesting beyond any winner prediction, we can see a huge increase in TPS. So how do we handle such an increase in workload?

The Java applications run in a Kubernetes environment with 3 master nodes and 100+ worker nodes. Based on previous metrics, min/max pod counts are decided, and the team ensures there is enough CPU and memory on the nodes to handle the max pod count. But the bigger question is: can containers scale at the pace of a sudden, huge increase in workload/TPS? Within seconds, the incoming transactions will reach 5k, 10k, 15k. Will pods spin up quickly enough, in seconds, to handle such big loads? Practically speaking, the answer is no: pods take at least 2 to 3 minutes to spin up, reach Running status and start taking the incoming traffic. To avoid this delay and ensure smooth online streaming without interruption, we pre-scaled the Kubernetes pods.

Step 1: Take the last 6 months of metrics and analyse the peak load, how the min/max pod counts were set, and the CPU and memory utilisation.
Step 2: Get the approximate transactions per second / load expected for the event from the product owner.
Step 3: Request a load test with the predicted TPS.
Step 4: The DevOps team performs the prescaling with min/max pod counts, sets up anti-affinity parameters as required for high availability, and checks node resource quotas. Reason for prescaling: Kubernetes autoscaling is a good option, but not for dynamic load that shoots up within a few seconds.
Step 5: During the load test, monitor the metrics below:
CPU utilisation of pods, containers and nodes
Memory utilisation of pods, containers and nodes
Node resource utilisation metrics
Pod scaling
Node scaling
Kubernetes control plane: ensure the control plane can handle the load of node autoscaling, saving details to etcd and fetching templates from etcd to spin up pods as required
Transactions per second
Requests per second
Network traffic
Disk I/O pressure
Heap memory
Step 6: Based on the observations, decide on the min/max pod settings and node autoscaling readiness, which may include changing the node instance type (AWS example: r5.xlarge to r5.2xlarge).
Step 7: Perform the prescaling before the match starts and scale down after the match (a minimal kubectl sketch of this step follows below).

This time we could not find a better option than prescaling the Kubernetes platform rather than letting the default autoscaling do its job. Prescaling worked perfectly, and we scale down after every match.
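Here is that sketch; the HPA name, namespace and replica counts are illustrative, with the real values coming from the load-test observations above:

# Before the match: raise the floor so enough pods are already Running when traffic spikes
kubectl patch hpa streaming-api -n prod -p '{"spec":{"minReplicas":120,"maxReplicas":300}}'
# After the match: return to the normal baseline
kubectl patch hpa streaming-api -n prod -p '{"spec":{"minReplicas":20,"maxReplicas":100}}'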
Let's see how the technology evolves and how we adapt to the right tools to perform autoscaling under this kind of peak-load increase. Stay tuned: how do AWS and Kubernetes costs impact us during autoscaling/prescaling? Follow us for more details at cubensquare.com

Docker Node Image Will Not Be Supported By GKE (Google Kubernetes Engine) Starting v1.24

Kubernetes nodes use a container runtime to launch, manage, and stop the containers running in Pods. The containerd runtime is an industry-standard container runtime supported by GKE. It provides the layering abstraction that allows a rich set of features like gVisor and Image streaming to extend GKE functionality, and it is considered more resource-efficient and secure than the Docker runtime. GKE 1.24 and later only support node images that use the containerd runtime, so to use these versions of GKE you must migrate to a containerd node image; a hedged migration sketch follows below.

For more such technical blogs follow us: cubensquare.com/blog
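Here is that sketch; the cluster, node pool and zone names are placeholders, and the exact gcloud syntax should be checked against your SDK version:

# Check which image type a node pool currently uses
gcloud container node-pools describe my-pool --cluster my-cluster --zone us-central1-a \
  --format="value(config.imageType)"
# Switch the pool to a containerd image type; GKE recreates the nodes
gcloud container clusters upgrade my-cluster --node-pool my-pool --zone us-central1-a \
  --image-type COS_CONTAINERD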

EKS IPV4 Exhaustion

Problem Statement:
Elastic Kubernetes Service (EKS) is predominantly used by many organizations because it is an upstream, certified-conformant version of Kubernetes with backported security fixes, and it provides a managed Kubernetes experience for performant, reliable and secure clusters. In a rapidly growing business or organization, where the workloads deployed to EKS increase rapidly, Kubernetes admins face a situation where new pods run out of IPs during initialization as part of scaling.

Background:
When we use a third-party networking plugin like Calico, Cilium or Flannel, the IPs of the node and the pod are assigned from different CIDRs: the pod IP space (network plugin CIDR) and the node IP space (from the VPC subnet) are separate, and pods get IP addresses isolated from other services. The situation is a bit different when we use EKS with the AWS VPC CNI networking plugin, because that plugin assigns a private IPv4 or IPv6 address from your VPC to each pod and service. Your pods and services have the same IP address inside the pod as they do on the VPC network. This is intentional, to ease communication between pods and other AWS services.

Solution:
1. Enable IPv6: create the EKS cluster with the IPv6 option enabled.
2. Add secondary CIDR ranges to the existing EKS cluster.
We will discuss the second solution in detail and how to achieve it via Terraform.

Steps in Detail:
Create subnets with a new CIDR range (the examples below assume our AWS region is us-west-2 and that $VPC_ID is set to the cluster's VPC ID).
1. List all the Availability Zones in your AWS Region:
aws ec2 describe-availability-zones --region us-west-2 --query 'AvailabilityZones[*].ZoneName'
2. Choose the Availability Zones where you want to add the subnets, and assign them to variables. For example:
export AZ1=us-west-2a
export AZ2=us-west-2b
export AZ3=us-west-2c
3. Create new subnets under the VPC with the new CIDR range:
SUBNETA=$(aws ec2 create-subnet --cidr-block 100.64.0.0/19 --vpc-id $VPC_ID --availability-zone $AZ1 | jq -r .Subnet.SubnetId)
SUBNETB=$(aws ec2 create-subnet --cidr-block 100.64.32.0/19 --vpc-id $VPC_ID --availability-zone $AZ2 | jq -r .Subnet.SubnetId)
SUBNETC=$(aws ec2 create-subnet --cidr-block 100.64.64.0/19 --vpc-id $VPC_ID --availability-zone $AZ3 | jq -r .Subnet.SubnetId)
4. (Optional) Add a Name tag for your subnets by setting a key-value pair. For example:
aws ec2 create-tags --resources $SUBNETA --tags Key=Name,Value=SubnetA
aws ec2 create-tags --resources $SUBNETB --tags Key=Name,Value=SubnetB
aws ec2 create-tags --resources $SUBNETC --tags Key=Name,Value=SubnetC
5. Associate your new subnets with a route table. List the route tables under the VPC, then export the one you want:
aws ec2 describe-route-tables --filters Name=vpc-id,Values=$VPC_ID | jq -r '.RouteTables[].RouteTableId'
export ROUTETABLE_ID=rtb-xxxxxxxxx
6. Associate the route table with all the new subnets. For example:
aws ec2 associate-route-table --route-table-id $ROUTETABLE_ID --subnet-id $SUBNETA
aws ec2 associate-route-table --route-table-id $ROUTETABLE_ID --subnet-id $SUBNETB
aws ec2 associate-route-table --route-table-id $ROUTETABLE_ID --subnet-id $SUBNETC

Configure the CNI plugin to use the newly created secondary CIDR via Terraform:
var.eks_pod_subnet_ids: subnet IDs created in the previous step
var.availability_zones: list of Availability Zones for which an ENIConfig has to be created
(A kubectl-based sketch of the equivalent CNI configuration follows below.)

Summary:
With this method we can avoid running out of IPv4 addresses in our Kubernetes environment. For more such technical blogs: cubensquare.com/blog
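The post wires this up with Terraform; as a hedged kubectl equivalent (the security group ID is a placeholder, and each ENIConfig name must match the Availability Zone label of your nodes), enabling VPC CNI custom networking and creating one ENIConfig per Availability Zone looks roughly like this:

# Tell the AWS VPC CNI to pick pod subnets from ENIConfig objects selected by the node's zone label
kubectl set env daemonset aws-node -n kube-system AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true
kubectl set env daemonset aws-node -n kube-system ENI_CONFIG_LABEL_DEF=topology.kubernetes.io/zone

# One ENIConfig per AZ, pointing pods at the secondary-CIDR subnet created above
kubectl apply -f - <<EOF
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
  name: us-west-2a                 # must match the zone label value of the nodes in this AZ
spec:
  subnet: ${SUBNETA}               # subnet from the 100.64.0.0 secondary range created earlier
  securityGroups:
    - sg-xxxxxxxxxxxx              # placeholder security group for the pod ENIs
EOF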

Karpenter Containerd Runtime Mismatch With Datadog Docker Daemon

Issue: Datadog is unable to post its payload from the nodes provisioned by Karpenter. It logs the error below, complaining about multiple mount points in your Kubernetes pod.

2023-01-03 18:37:16 UTC | CORE | WARN | (pkg/collector/python/datadog_agent.go:125 in LogMessage) | disk:e5dffb8bef24336f | (disk.py:135) | Unable to get disk metrics for /host/var/run/containerd/io.containerd.runtime.v2.task/k8s.io/84b24aadc886673856bde8c5ceb172658ec8e4f6d2d30e13b4c7ed2528da00af/rootfs/host/proc/sys/fs/binfmt_misc: [Errno 40] Too many levels of symbolic links: '/host/var/run/containerd/io.containerd.runtime.v2.task/k8s.io/84b24aadc886673856bde8c5ceb172658ec8e4f6d2d30e13b4c7ed2528da00af/rootfs/host/proc/sys/fs/binfmt_misc'. You can exclude this mountpoint in the settings if it is invalid.

Debug Steps:
– Connect to the right context of your Kubernetes cluster.
– Fetch the name of the Datadog pod installed in your Datadog namespace and check its logs:
kubectl logs datadog-xxxx -n datadog -c agent
– Log in to the pod:
kubectl exec -i -t -n datadog datadog-xxxx -- /bin/sh
– Check the file system mounts. This clearly shows that the nodes provisioned by Karpenter use containerd as the runtime environment when creating containers.
– kubectl get nodes -o wide also shows the container runtime (dockerd / containerd) of each node.

Solution:
Check the Datadog agent runtime environment. In this case the Datadog agent uses dockerd while Karpenter provisions nodes with containerd. Modify the Karpenter Provisioner configuration:
kubectl edit provisioner default
kubeletConfiguration:
  containerRuntime: dockerd
Ref: https://karpenter.sh/v0.18.1/provisioner/
After making the change, do a rolling restart to spawn new containers on nodes with the dockerd container runtime:
kubectl rollout restart deploy <name>
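As a hedged, non-interactive alternative to kubectl edit for the same change (this targets the v1alpha5 Provisioner API referenced above; newer Karpenter releases replaced the Provisioner resource, so check your version first):

kubectl patch provisioner default --type merge \
  -p '{"spec":{"kubeletConfiguration":{"containerRuntime":"dockerd"}}}'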

TCP Connection Intermittent Failures

Problem Statement: Some of the TCP connections from instances in a private subnet to a specific destination through a NAT gateway succeed, but others fail or time out.

Causes
The cause of this problem might be one of the following:
• The destination endpoint is responding with fragmented TCP packets. NAT gateways do not support IP fragmentation for TCP or ICMP.
• The tcp_tw_recycle option is enabled on the remote server, which is known to cause issues when there are multiple connections from behind a NAT device.

What it is?
The tcp_tw_recycle option is a Boolean setting that enables fast recycling of TIME_WAIT sockets. The default value is 0. When enabled, the kernel becomes more aggressive and makes assumptions about the timestamps used by remote hosts: it tracks the last timestamp used by each remote host and allows a socket to be reused only if the timestamp has increased.

Solution
Verify whether the endpoint you are trying to connect to is responding with fragmented TCP packets:
1. Use an instance in a public subnet with a public IP address to trigger a response large enough to cause fragmentation from the specific endpoint.
2. Use the tcpdump utility to verify that the endpoint is sending fragmented packets. Important: you must use an instance in a public subnet to perform these checks; you cannot use the instance from which the original connection was failing, or an instance in a private subnet behind a NAT gateway or NAT instance. Diagnostic tools that send or receive large ICMP packets will report packet loss; for example, the command ping -s 10000 example.com does not work behind a NAT gateway.
3. If the endpoint is sending fragmented TCP packets, you can use a NAT instance instead of a NAT gateway.

If you have access to the remote server, you can verify whether the tcp_tw_recycle option is enabled:
1. From the server, run the following command:
cat /proc/sys/net/ipv4/tcp_tw_recycle
If the output is 1, the tcp_tw_recycle option is enabled.
2. If tcp_tw_recycle is enabled, we recommend disabling it. If you need to reuse connections, tcp_tw_reuse is a safer option.

If you don't have access to the remote server, you can test by temporarily disabling the tcp_timestamps option on an instance in the private subnet and then connecting to the remote server again. If the connection succeeds, the earlier failures were likely caused by tcp_tw_recycle being enabled on the remote server. If possible, contact the owner of the remote server to verify whether this option is enabled and request that it be disabled. Hedged examples of these checks follow below.
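These are hedged examples; the interface name and endpoint are placeholders, and note that tcp_tw_recycle was removed entirely in Linux kernel 4.12 and later, so it only exists on older kernels:

# From the public-subnet instance: capture fragmented IP packets coming back from the endpoint
sudo tcpdump -n -i eth0 'host <endpoint-ip> and (ip[6:2] & 0x3fff) != 0'
# On the remote server: 1 means fast TIME_WAIT recycling is enabled
cat /proc/sys/net/ipv4/tcp_tw_recycle
# Disable it; if sockets must be reused, tcp_tw_reuse is the safer option
sudo sysctl -w net.ipv4.tcp_tw_recycle=0
sudo sysctl -w net.ipv4.tcp_tw_reuse=1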
