
2 posts tagged with "CloudWatch"


How AWS CloudWatch Agent on Kubernetes Blew Our AWS Bill

· 4 min read

When running a microservice-based architecture, traffic flows from the front-end, passes through multiple microservices, and eventually receives the final response from the back-end. Kubernetes is a container orchestration platform that helps us run and manage these numerous microservices, including multiple replicas of them when necessary.

During the lifecycle of a request, if it fails at a specific microservice while moving from one service to another, pinpointing the exact point of failure becomes challenging. Observability is a paradigm that allows us to understand the system end-to-end. It provides insights into the “what,” “where,” and “why” of any event within the system, and how it may impact application performance.

There are various monitoring tools available for microservice setups on Kubernetes, both open source (such as Prometheus and Grafana) and enterprise (such as AppDynamics, Datadog, and AWS CloudWatch). Each serves a somewhat different purpose.

Story Time: How We Built Our Kubernetes Cluster

In one of our projects, we decided to build a lower environment on an AWS Kubernetes cluster using Amazon Elastic Kubernetes Service (EKS) on Amazon EC2. We had 80+ microservices running on EKS, built and deployed into the cluster using GitLab pipelines. During the initial development phase, we had poorly optimized Docker images that consumed a significant amount of disk space and included unnecessary components. Additionally, we were not using multi-stage builds, which further inflated the image size. For monitoring, we deployed the AWS CloudWatch agent, which uses Fluentd to aggregate logs from all the nodes and send them to CloudWatch Logs.

Reference: Setting up Container Insights on Amazon EKS and Kubernetes (AWS documentation on how to install and set up CloudWatch Container Insights).

During a routine cost check, we made a startling discovery. The bill for AWS CloudWatch Logs (where the CloudWatch agent sends logs) in our setup was typically around $20–30 per day, but it had spiked to $700–900 per day. This had been going on for five days, resulting in a bill of roughly $4,500 solely for CloudWatch Logs and the NAT gateway (used for sending logs to CloudWatch over public HTTPS). As an initial response, we stopped the CloudWatch agent DaemonSet and refreshed the entire EKS setup with new nodes.

What went wrong

As a temporary fix, we halted the CloudWatch agent running as a DaemonSet in our cluster to prevent further billing. Upon investigation, we discovered that a large number of pods were in an evicted state. The new pods attempting to start (as Kubernetes tries to match the desired state specified in manifests/Helm charts) were also being evicted. This produced a high volume of logs, which the CloudWatch agent shipped to CloudWatch Logs. Since log billing is based on ingestion and storage, this significantly inflated our AWS bill. The evictions were caused by a condition called node disk pressure, which arose for the following reasons:

  • The existing pod had generated a large number of logs, occupying significant disk space.
  • When a new version of the app was deployed in the cluster, the new container (approximately 3 GB in size) could not start due to insufficient available space.
  • After multiple attempts to start the pod, it went into an evicted state.
  • As the current pod was evicted, the deployment controller deployed another pod to match the desired state specified in the deployment.
  • These events generated more logs, further consuming available disk space.
  • This cycle continued for five days, exacerbating the situation.
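The eviction loop above leaves a trail you can spot quickly from pod status. As a minimal sketch (pod names and the trimmed-down sample data are hypothetical; in practice you would feed it the output of `kubectl get pods -A -o json`):

```python
# Sketch: count evicted pods from `kubectl get pods -A -o json`-style output.
# The sample data below is a trimmed-down, illustrative pod list.

def evicted_pods(pod_list: dict) -> list:
    """Return the names of pods whose status reason is 'Evicted'."""
    return [
        item["metadata"]["name"]
        for item in pod_list.get("items", [])
        if item.get("status", {}).get("reason") == "Evicted"
    ]

sample = {
    "items": [
        {"metadata": {"name": "orders-7f9c"},
         "status": {"phase": "Failed", "reason": "Evicted"}},
        {"metadata": {"name": "payments-2b1d"},
         "status": {"phase": "Running"}},
    ]
}

print(evicted_pods(sample))  # -> ['orders-7f9c']
```

A large and growing list from a check like this, combined with rising CloudWatch Logs ingestion, is exactly the pattern we were seeing.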

How we resolved it

To address the problem, we implemented the following solutions:

  • Once we identified the issue, we refreshed Kubernetes by replacing the existing nodes with a new set. This action cleared up all the disk space on the nodes, and since all our logs are stored in CloudWatch Logs, we resolved the log-related concerns.
  • Additionally, we implemented multi-stage builds, which reduced the overall image size for deployment.
  • Lastly, we set up CloudWatch alarms to trigger when the disk usage percentage exceeds a certain threshold.
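Such a disk-usage alarm could look roughly like the following sketch. The alarm name, threshold, and SNS topic ARN are placeholders, and the metric assumes the CloudWatch agent is publishing `disk_used_percent` into its default `CWAgent` namespace; you would pass this dict to `boto3.client("cloudwatch").put_metric_alarm(**alarm_params)`:

```python
# Sketch: parameters for a CloudWatch alarm on node disk usage, based on the
# CloudWatch agent's disk_used_percent metric in the CWAgent namespace.
# Names, threshold, and the SNS topic ARN are illustrative placeholders.
alarm_params = {
    "AlarmName": "eks-node-disk-usage-high",      # hypothetical name
    "Namespace": "CWAgent",                       # CloudWatch agent namespace
    "MetricName": "disk_used_percent",
    "Statistic": "Maximum",
    "Period": 300,                                # evaluate 5-minute windows
    "EvaluationPeriods": 2,                       # alarm after 2 breaching periods
    "Threshold": 80.0,                            # alert above 80% disk usage
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder
}
```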

InfraSecOps: Enable Monitoring and Automated Continuous Compliance of Security Groups Using CloudWatch and Lambda

· 5 min read

As DevOps engineers, we use various compute resources in the cloud to make sure different workloads run efficiently. To restrict the traffic reaching those compute resources (EC2/ECS/EKS instances in the case of AWS), we create stateful firewalls (such as security groups in AWS). As lead engineers, we often prescribe best practices for configuring security groups. But in a large organization working on the cloud, monitoring and ensuring that each team follows these best practices is a tedious task that eats up many productive hours. And it cannot simply be ignored, because it causes security compliance issues.

For example, a security group might be configured as follows by a new developer (or some rogue engineer). If we observe the event below, a security group that is supposed to restrict traffic to AWS resources is instead configured to allow all traffic, on all protocols, from the entire internet. This defeats the purpose of securing the resource with a security group; we might as well remove it.

```json
{
  "version": "0",
  "detail-type": "AWS API Call via CloudTrail",
  "responseElements": {
    "securityGroupRuleSet": {
      "items": [
        {
          "groupOwnerId": "XXXXXXXXXXXXX",
          "groupId": "sg-0d5808ef8c4eh8bf5a",
          "securityGroupRuleId": "sgr-035hm856ly1e097d5",
          "isEgress": false,
          "ipProtocol": "-1",      --> allows traffic on all protocols
          "fromPort": -1,          --> to all the ports
          "toPort": -1,
          "cidrIpv4": "0.0.0.0/0"  --> from the entire internet, which is a bad practice
        }
      ]
    }
  }
}
```

This kind of mistake can happen while building a proof of concept or testing a feature, and it can cost us dearly in terms of security. Monitoring for these misconfigurations manually takes a toll on cloud engineers and consumes a lot of time. What if we could automate this monitoring and create a self-healing mechanism that detects deviations from best practices and remediates them?

The solution I have built in AWS watches each security group ingress rule (it can be extended to egress rules too): the ports it allows, the protocol it uses, and the IP range it communicates with. These security group rules are compared against the baseline rules we define for our security compliance, and any deviations are automatically removed. The baseline rules are configured in the Python code (which can be modified to our liking, based on the requirement).
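The core of such a baseline comparison can be sketched as a small pure function. This is illustrative, not the repo's exact code: the baseline here flags only world-open ingress rules on all protocols, and the field names follow the CloudTrail event shown earlier:

```python
# Sketch: a baseline check of the kind the Lambda performs. A rule violates
# the baseline here if it is an ingress rule open to the entire internet on
# all protocols. The baseline is plain Python data, so it is easy to extend.
BASELINE_FORBIDDEN_CIDRS = {"0.0.0.0/0"}  # traffic from anywhere
ALL_PROTOCOLS = "-1"                      # AWS encoding for "all protocols"

def violates_baseline(rule: dict) -> bool:
    """Return True for an ingress rule open to the world on all protocols."""
    return (
        not rule.get("isEgress", False)
        and rule.get("cidrIpv4") in BASELINE_FORBIDDEN_CIDRS
        and rule.get("ipProtocol") == ALL_PROTOCOLS
    )

rogue = {"isEgress": False, "ipProtocol": "-1", "cidrIpv4": "0.0.0.0/0"}
print(violates_baseline(rogue))  # -> True
```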

Components used to build this system

  1. AWS CloudTrail

  2. Amazon EventBridge rule

  3. AWS Lambda

  4. Amazon SNS

  5. Amazon S3 bucket

Here is how the pieces fit together:

  1. Whenever a new activity (creation, modification, or deletion of a rule) is performed on a security group, it is not sent as an event log to CloudWatch but recorded as an API call in CloudTrail. So to monitor these events, we first need to enable CloudTrail. CloudTrail will record all the API calls from the EC2 source and save them in a log file in an S3 bucket.

  2. Once these API calls are being recorded, we need to filter out only those related to security groups. This can be done either by sending every API call directly to a Lambda function or via an EventBridge rule. The former is costly, since each API call would invoke the Lambda, so we create an EventBridge rule that matches only the relevant EC2 API calls.
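The EventBridge filtering can be sketched as an event pattern like the one below. The list of `eventName` values is illustrative and can be extended to whatever security-group API calls you want to react to:

```python
# Sketch: an EventBridge event pattern matching only security-group API calls
# recorded by CloudTrail, so the Lambda is invoked just for those events
# rather than for every EC2 API call.
import json

event_pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["ec2.amazonaws.com"],
        "eventName": [
            "AuthorizeSecurityGroupIngress",
            "ModifySecurityGroupRules",
            "RevokeSecurityGroupIngress",
        ],
    },
}

print(json.dumps(event_pattern, indent=2))
```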

  3. These filtered API events are sent to the Lambda, which checks the port, protocol, and traffic source against what we previously configured in the Python code. (In this example, I am checking for the wildcard IP, i.e. the entire internet, on all ports of an ingress rule. You can also filter on protocols you don't want to allow; refer to the code for details.)

  4. The Python code goes through the security groups, finds the rules that violate the baseline, and deletes them.
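The find-and-delete step can be sketched as below. Only the pure selection logic is shown inline (the rule data is a made-up sample); the actual deletion would use boto3's `revoke_security_group_ingress`, shown in a comment:

```python
# Sketch: pick the IDs of ingress rules open to the entire internet, which
# is the baseline violation used in this example. The sample rules are
# illustrative.

def rules_to_revoke(rules: list) -> list:
    """Return securityGroupRuleIds of world-open ingress rules."""
    return [
        r["securityGroupRuleId"]
        for r in rules
        if not r.get("isEgress", False) and r.get("cidrIpv4") == "0.0.0.0/0"
    ]

rules = [
    {"securityGroupRuleId": "sgr-aaa", "isEgress": False, "cidrIpv4": "0.0.0.0/0"},
    {"securityGroupRuleId": "sgr-bbb", "isEgress": False, "cidrIpv4": "10.0.0.0/8"},
]

print(rules_to_revoke(rules))  # -> ['sgr-aaa']

# In the Lambda you would then revoke the offending rules, e.g.:
# ec2 = boto3.client("ec2")
# ec2.revoke_security_group_ingress(
#     GroupId=group_id,
#     SecurityGroupRuleIds=rules_to_revoke(rules),
# )
```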

(Figures: creating a rogue security group rule; the Lambda taking action and deleting the rogue rule.)

  5. Once the violating rules are deleted, SNS is used to email the event details: the ARN of the security group rule, the role ARN of the person who created it, and the baseline rules it violated. This email alerting helps us identify the actors causing these deviations and provide proper training on security compliance. The details are also logged in the CloudWatch log groups created in this architecture.

For the complete Python code along with the Terraform code, please refer to the following GitHub repo. To replicate this system in your environment, change the baseline security rules you want to monitor in the Python code and run terraform apply in the terminal. Sit back and have a cup of coffee while Terraform builds this system in your AWS account.

Liked my content? Feel free to reach out to me on LinkedIn for interesting content and productive discussions.