Troubleshooting Terraform EKS module provider dependencies


Consider this article a memoir of my endeavours provisioning an EKS cluster with Terraform instead of kubectl.

Most of the articles and tutorials I could find online focus on simple use cases where everything looks bright. I also noticed that one specific Terraform module is widely used by the community for spinning up new EKS clusters.

This article focuses on what can go wrong and how it can be solved.

Disclaimer: solid hands-on experience with EKS and Terraform is necessary to follow the technical content of this post.

The problem

Provisioning an EKS cluster involves a lot of AWS resources. Amazon abstracts away most of the scary parts of Kubernetes, like the master nodes. However, that does not mean everything else is smooth sailing: networking configuration, node groups and IAM roles are only some of the required resources one needs to deal with. Terraform offers great flexibility in defining your infrastructure as code, and modules can make this even easier. This post refers to this EKS module.

I personally prefer eksctl but recently happened to use this module for a new cluster. After the usual plumbing and quirks of every module, I started bumping into issues where Terraform had trouble reaching the cluster for seemingly simple updates to the manifests.

To be precise, this is the error I got in all my terraform plans from one point onward:

Error: Post "http://localhost/api/v1/namespaces/kube-system/configmaps": dial tcp connect: connection refused

  on .terraform/modules/deployment.eks/ line 65, in resource "kubernetes_config_map" "aws_auth":
  65: resource "kubernetes_config_map" "aws_auth" {

This doesn’t seem trivial to solve, especially since I never defined localhost as the address of my cluster anywhere, so it looks like a fallback to me. Moreover, a handful of users were reporting the same issue or similar ones. For example, some reported problems with deleting their cluster, but since I did not encounter those myself, I did not include them in my research at all.

Potential answers

To understand the problem a bit better, you need more context on the options the module offers. I will sum up everything you need to know here:

  • Access to an EKS cluster is managed by an aws-auth ConfigMap resource, which lists the roles and users that can access the cluster along with the relevant permissions.
  • The module opts to manage the aws-auth ConfigMap for you automatically, but does not force you to.
  • The operator can map users and roles via Terraform and make them available inside this ConfigMap. That keeps everything managed from within the module’s variables (map_users and map_roles).
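For illustration, a hedged sketch of what mapping users and roles through the module can look like. The variable names follow older major versions of the module (map_users / map_roles), and the ARNs and usernames here are entirely hypothetical:

```hcl
module "cluster" {
  source = "terraform-aws-modules/eks/aws"

  # ... cluster name, VPC settings, node groups, etc. ...

  # Grant a hypothetical admin user full access via the aws-auth ConfigMap.
  map_users = [
    {
      userarn  = "arn:aws:iam::123456789012:user/alice" # hypothetical ARN
      username = "alice"
      groups   = ["system:masters"]
    },
  ]

  # Map a hypothetical extra IAM role into the cluster as well.
  map_roles = [
    {
      rolearn  = "arn:aws:iam::123456789012:role/ops" # hypothetical ARN
      username = "ops"
      groups   = ["system:masters"]
    },
  ]
}
```

With this in place, the module renders the entries into the aws-auth ConfigMap on your behalf.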

The solution most frequently floated on public forums is to disable the module’s management of this resource altogether:

module "cluster" {
  source          = "terraform-aws-modules/eks/aws"
  manage_aws_auth = false
}
I do not recommend doing so unless one fully understands what this entails. Still, it can be a first step towards a better solution.

Another angle

The EKS module is not a HashiCorp-endorsed module, but rather a community effort. Specifically, it uses a pattern that Terraform discourages, since normally a provider config should

only reference values that are known before the configuration is applied

And this is the root of all evil.

As it turns out, Terraform simply does not support passing values into a provider configuration that are not known at plan time. So HashiCorp recommends separating the AWS provider resources from the Kubernetes provider resources. That essentially means two apply commands when needed, each affecting different resources.

A clean solution

The most reliable way to use the Kubernetes provider together with the AWS provider is by keeping two different states. By doing this, we can limit the scope of changes performed each time to either the EKS cluster or the Kubernetes resources. This prevents dependency issues between the two providers, since Terraform provider configurations need to be known before a configuration can be applied.

The practical steps

There is no one-size-fits-all solution, but hopefully at least one person will benefit from reading how I finally solved this problem.

Up to that point, a CI job was executing the terraform plan and terraform apply steps, which controlled all of our infrastructure.

Since we were not willing to use Terraform Workspaces or Terraform Cloud, and our state backend (S3) supports multiple states, we opted to split the states with the least amount of effort and impact on the codebase.

  1. In our Terraform infrastructure GitHub repo, everything was transferred under a new folder in the root path, named aws.
  2. We opted to take on the responsibility of maintaining the aws-auth ConfigMap manually, i.e. set manage_aws_auth = false in the EKS module definition.
  3. We created a new folder, e.g. kubernetes, in the root path of the repository, containing only the Terraform manifests for interacting with Kubernetes resources. In our cluster’s case, that’d be a single ConfigMap.
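With the S3 backend, the split boils down to each folder pointing at its own state key. A sketch of the two backend configurations (the bucket name, keys and region are hypothetical):

```hcl
# aws/backend.tf
terraform {
  backend "s3" {
    bucket = "example-terraform-state" # hypothetical bucket
    key    = "aws/terraform.tfstate"
    region = "eu-west-1"
  }
}

# kubernetes/backend.tf
terraform {
  backend "s3" {
    bucket = "example-terraform-state" # same bucket, different state key
    key    = "kubernetes/terraform.tfstate"
    region = "eu-west-1"
  }
}
```

The CI job then runs terraform plan and terraform apply twice, once per folder.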

Some example Terraform manifests that are quite close to the ones I actually ended up using:

variable "cluster_name" {
  type = string
}

variable "aws_region" {
  type = string
}

variable "aws_profile" {
  type = string
}

variable "master_users" {
  type = list(object({
    username = string
    arn      = string
  }))
  description = "List of user objects that will have master permissions on the cluster. Object consists of username and arn (strings)."
}

variable "nodes_role" {
  type        = string
  description = "The role that was created while provisioning the cluster and should be used by EC2 instances to interact with the cluster."
}

locals {
  # mapRoles expects a YAML list, so wrap the node role mapping in one.
  roles = [
    {
      rolearn  = var.nodes_role
      username = "system:node:{{EC2PrivateDNSName}}"
      groups = [
        "system:bootstrappers",
        "system:nodes",
      ]
    },
  ]

  users = [
    for user_obj in var.master_users : {
      userarn  = user_obj.arn
      username = user_obj.username
      groups = [
        "system:masters",
      ]
    }
  ]
}

resource "kubernetes_config_map" "aws_auth" {
  metadata {
    name      = "aws-auth"
    namespace = "kube-system"
    labels = {
      # Illustrative label; the label keys in the original snippet were lost.
      "app.kubernetes.io/managed-by" = "Terraform"
    }
  }

  data = {
    mapRoles = yamlencode(local.roles)
    mapUsers = yamlencode(local.users)
  }
}

The two providers would still need to interact, but this seems to be the cleanest solution at the moment:

data "aws_eks_cluster" "default" {
  name = var.cluster_name
}

data "aws_eks_cluster_auth" "default" {
  name = var.cluster_name
}

provider "aws" {
  region  = var.aws_region
  profile = var.aws_profile
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.default.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.default.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.default.token
}

While this is the best solution I have come up with so far, and even though it entails more work (the CI does it anyway), it is by no means perfect. The most important drawback is that Terraform cannot overwrite a ConfigMap that already exists in Kubernetes, which means the aws-auth ConfigMap needs to be manually deleted with kubectl -n kube-system delete configmap aws-auth before applying any change through Terraform.

One idea would be to add a preparatory step that automatically removes this resource from the Terraform state (and therefore from Kubernetes) before applying the plan, but I was lazy enough not to go that far and accepted my solution as-is.
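For completeness, one possible (untested) shape for such a preparatory step, using Terraform's null_resource with a local-exec provisioner. It assumes kubectl is configured against the cluster, and the ordering with the actual kubernetes_config_map resource would still need care:

```hcl
# Hypothetical, untested sketch of the preparatory step described above.
resource "null_resource" "delete_existing_aws_auth" {
  # timestamp() changes on every run, forcing this resource (and thus the
  # provisioner) to be re-created on every apply.
  triggers = {
    always_run = timestamp()
  }

  provisioner "local-exec" {
    # --ignore-not-found makes the deletion idempotent across runs.
    command = "kubectl -n kube-system delete configmap aws-auth --ignore-not-found"
  }
}
```

A plain CI shell step running the same kubectl command before terraform apply would achieve the same effect with less Terraform machinery.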


This was a great reminder that some solutions which appear hacky are in fact acceptable, given the level of the available tooling. Who knows, this could all become obsolete in the next version of Terraform :shrug:

Finally, this may be a controversial post, but I’d really like to know how other cloud operations folks are solving this issue. Feel free to DM me or reply here with a comment.
