I discovered that Airflow was consuming compute resources, so I created a dedicated node group for data orchestration tasks and added tolerations to the Airflow deployment so its pods would tolerate the new nodes' taint.
Workarounds
```sh
kubectl get nodes
kubectl taint nodes <NODE-NAME> reserved=dataTasks:NoSchedule
```
I had deployed Airflow with its Helm chart, so a simple change to values.yaml took care of the tolerations.
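A minimal sketch of that values.yaml change, assuming the official Apache Airflow Helm chart and a role=dataTasks label on the dedicated node group (the label is illustrative; the toleration matches the taint applied above):

```yaml
# values.yaml: pin Airflow pods to the dedicated data-tasks nodes
nodeSelector:
  role: dataTasks          # hypothetical label on the dedicated node group
tolerations:
  - key: "reserved"
    operator: "Equal"
    value: "dataTasks"
    effect: "NoSchedule"   # matches the taint applied above
```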
Diagnosing and fixing memory leaks in Python applications can also save your nodes; we used the following guide to work through our application issues.
Scenario 3: Users were hitting insecure-access errors (net::ERR_CERT_AUTHORITY_INVALID) on our web applications because expired SSL/TLS certificates were never renewed or reissued.
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    cert-manager.io/cluster-issuer: cert-manager-<output>
  name: web-app
  namespace: staging
spec:
  rules:
    - host: oluchiorji.com
      http:
        paths:
          - pathType: Prefix
            path: /
            backend:
              service:
                name: web-app
                port:
                  number: 80
  tls: # < placing a host in the TLS config will determine what ends up in the cert's subjectAltNames
    - hosts:
        - oluchiorji.com
      secretName: webapp-cert
```
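For reference, a minimal sketch of the ClusterIssuer that annotation points at, assuming Let's Encrypt with HTTP-01 validation through an NGINX ingress controller (the email, account-key Secret, and solver class are assumptions):

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: cert-manager-<output>          # must match the cluster-issuer annotation above
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@oluchiorji.com        # assumption: contact for expiry notices
    privateKeySecretRef:
      name: cert-manager-account-key   # hypothetical Secret holding the ACME account key
    solvers:
      - http01:
          ingress:
            class: nginx               # assumption: NGINX ingress controller
```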
Scenario 4: Applications were stuck in a CrashLoopBackOff state because of Error: secret <secret-name> not found
When we deployed applications via Helm, we noticed the following:

Helm adds the prefix sh.helm.release.<version-number>-<app-name> when it creates the external secrets and deployments, not the secret name we specified in our values files.
Workarounds
At first, we used the Kubernetes Opaque Secret API as a quick fix, but we ran into the following limitations:

- The Secret object is convenient to use, but it cannot store or retrieve secret data from external secret-management systems such as AWS Secrets Manager or Parameter Store.
- It meant too much hardcoding and too many YAML files on the local system.
The goal of the External Secrets Operator is to synchronize secrets from external APIs into Kubernetes. ESO is a collection of custom API resources (ExternalSecret, SecretStore, and ClusterSecretStore) that provide a user-friendly abstraction over the external API and manage the lifecycle of the secrets for you.
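Installing ESO itself is a single Helm release; this sketch follows the project's standard chart install (the namespace is the chart's usual suggestion):

```sh
helm repo add external-secrets https://charts.external-secrets.io
helm install external-secrets external-secrets/external-secrets \
  --namespace external-secrets --create-namespace
```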
Create a ClusterSecretStore, a global, cluster-wide SecretStore that can be referenced from all namespaces. We will create two of them, paramaterstore-cluster-secret and secretmanager-cluster-secret, in order to access both AWS secret providers.
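A minimal sketch of the Secrets Manager store, assuming IRSA-based auth through a hypothetical external-secrets-sa service account; the region is a placeholder, and the Parameter Store variant just swaps service: ParameterStore:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: secretmanager-cluster-secret
spec:
  provider:
    aws:
      service: SecretsManager        # use ParameterStore for paramaterstore-cluster-secret
      region: us-east-1              # assumption: your cluster's region
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets-sa      # hypothetical IRSA-annotated service account
            namespace: external-secrets
```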
```sh
aws secretsmanager create-secret --name ecommerce/staging \
  --description "My test secret created with the CLI." \
  --secret-string "{\"POSTGRESQL_USER\":\"admin\",\"POSTGRESQL_PASS\":\"4g#4gGGDG9OghjuE\"}"
```
Use an ExternalSecret to fetch the secrets. It references the ClusterSecretStore created above, so it can be used from any namespace.
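A minimal sketch that pulls the ecommerce/staging entry created above into a Kubernetes Secret (the target Secret name and refresh interval are illustrative):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: web-app
  namespace: staging
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: secretmanager-cluster-secret
  target:
    name: webapp-db-creds          # hypothetical Secret that ESO creates and keeps in sync
    creationPolicy: Owner
  data:
    - secretKey: POSTGRESQL_USER
      remoteRef:
        key: ecommerce/staging     # the Secrets Manager entry created above
        property: POSTGRESQL_USER
    - secretKey: POSTGRESQL_PASS
      remoteRef:
        key: ecommerce/staging
        property: POSTGRESQL_PASS
```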
Scenario 5: Node scaling issues (NodeNotReady)

The NodeNotReady issues could have been handled by the Kubernetes Cluster Autoscaler, but it does not automatically adjust the number of nodes in our cluster when pods fail or are rescheduled onto other nodes; we had to create and remove nodes manually via the console, e.g. by adjusting the node group's desired size.
Some other issues we encountered were:

- Using very small instances in node groups led to the groups maxing out, leaving pods unschedulable or evicted.
- Using large instances in node groups led to low resource utilization and increased cost.
We were able to fix the issue with Karpenter. This is a detailed guide on setting up Karpenter, depending on which tool (Terraform, eksctl, kOps) you want to use.
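As an illustration, a minimal Karpenter NodePool (API group and fields as of Karpenter v1; older releases used the Provisioner resource). Letting Karpenter pick instance sizes from an allowed set is what avoids both the too-small and too-large node-group problems above; all names and limits here are illustrative:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default          # hypothetical EC2NodeClass defining AMI, subnets, etc.
  limits:
    cpu: "100"                 # cap on total CPU this pool may provision
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m       # scale empty or underused nodes back down quickly
```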
Scenario 6: OpenSearch Shard Issues
The problem was the following error:

```
"reason"=>"Validation Failed: 1: this action would add [10] total shards, but this cluster currently has [594]/[600] maximum shards open;"
```
Workarounds
I wrote a detailed guide on how to delete old indices from Elasticsearch or OpenSearch using the Python Curator pip package; check it out here. The task can be wrapped in a cron job.
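As a rough sketch of what such a scheduled cleanup can look like with curator_cli (the host and the logstash- index prefix are assumptions; point them at your own cluster and naming scheme, and note the CLI flags vary across Curator versions):

```sh
# Delete indices whose date-stamped names are older than 30 days
curator_cli --host opensearch.example.com --port 9200 delete_indices \
  --filter_list '[
    {"filtertype": "pattern", "kind": "prefix", "value": "logstash-"},
    {"filtertype": "age", "source": "name", "direction": "older",
     "timestring": "%Y.%m.%d", "unit": "days", "unit_count": 30}
  ]'
```

Run it from a cron entry or a Kubernetes CronJob so old indices are pruned before the shard limit is hit.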
Scenario 7: Struggling to fix Kubernetes over-provisioning
We wanted the ability to set appropriate resource requests for the pods (applications) deployed in the cluster. The more precisely we set resource requests on our pods, the more reliably our applications run and the more room we save in the cluster. We installed Kubecost, which gave us access to a view that looks like this:
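From there, right-sizing means feeding the observed usage numbers back into the pod spec as requests and limits. A minimal sketch (the figures are illustrative, not actual Kubecost recommendations):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  namespace: staging
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web-app
          image: web-app:latest    # hypothetical image
          resources:
            requests:              # what the scheduler reserves; set close to observed usage
              cpu: 100m
              memory: 256Mi
            limits:                # hard ceiling; beyond this the container is throttled or OOM-killed
              cpu: 500m
              memory: 512Mi
```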