FOCUS ON AWS TRANSIT GATEWAY - PART 3: AUTOMATION

This text was originally published in French in 2021 on the Revolve Blog by Jérémie RODON. Translation to English by Nicole Hage Chahine.

The solution described in this article is available as an open-source project! If you prefer a hands-on approach rather than just reading about it, you can find it on GitHub: Transit Gateway Automation.

In the previous article, I explained my method for systematically translating functional network requirements (which VPCs should communicate with which VPCs) into Transit Gateway configurations. While the method is effective, I cannot deny that it can be tedious to implement because of the large number of routes it generates. In this article, we will explore how to add automation using CloudFormation (it can also be done with Terraform, which is much easier but less fun).

But before we dive in, what exactly needs to be automated? Of course, the creation of the Transit Gateway and its route tables should be automated, but that’s a one-time operation and doesn’t present any particular challenges. A CloudFormation template can handle it without any issues.

However, adding new VPCs over time to selected “bubbles” is a more complex and interesting problem. In the following sections, we will assume that the creation of the Transit Gateway and its route tables (the bubbles) has already been addressed, and we will focus solely on adding a new VPC to our network.

What does adding a new VPC entail?

Let’s assume we already have a Transit Gateway and want to automate the addition of a new VPC. Therefore, we need to:

Create a new VPC (as expected).
Create an attachment to the Transit Gateway.
Add the appropriate routes to the Transit Gateway in the VPC’s route tables.
Associate the new attachment with its bubble, meaning its Transit Gateway Route Table (TGWRT).
Create the necessary routes and blackholes in the other bubbles/TGWRTs (refer to Part 2 for a refresher on these concepts).

Since we want to use CloudFormation, this sequence of actions is not so straightforward. Creating a VPC is trivial, but starting from the second step, a problem arises: How do we know the ID of the Transit Gateway?

With Terraform, this problem can be easily addressed using a “data” block to search for the Transit Gateway based on tags, for example. However, CloudFormation does not have an equivalent built-in mechanism.
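
Because CloudFormation has no such lookup, the search has to be written by hand somewhere, for example in a Lambda function. The following is a minimal sketch of what that lookup could look like, not the code from the reference implementation. It assumes the Transit Gateway carries a distinguishing tag, and it takes the EC2 client as an argument (in practice, something like `boto3.client("ec2")`). The function name and the tag used are hypothetical, and pagination of the results is ignored for brevity.

```python
from typing import Any


def find_transit_gateway_id(ec2_client: Any, tag_key: str, tag_value: str) -> str:
    """Return the ID of the only available Transit Gateway carrying the given tag.

    `ec2_client` is assumed to behave like boto3's EC2 client.
    """
    response = ec2_client.describe_transit_gateways()
    matches = [
        tgw["TransitGatewayId"]
        for tgw in response.get("TransitGateways", [])
        if tgw.get("State") == "available"
        and any(
            tag.get("Key") == tag_key and tag.get("Value") == tag_value
            for tag in tgw.get("Tags", [])
        )
    ]
    # Refuse to guess: zero or several matches is treated as a configuration error.
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one Transit Gateway tagged {tag_key}={tag_value}, "
            f"found {len(matches)}"
        )
    return matches[0]
```

Failing loudly when the tag matches zero or several gateways is a design choice of this sketch: a deployment attached to the wrong Transit Gateway would be harder to diagnose than an early error.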

Another problem is that the first three steps will occur in one AWS account, the account of the application or project for which we are creating the new VPC. However, the last two steps will take place in the “Network” account that hosts the Transit Gateway. Usually, these are two separate accounts. Again, while Terraform allows for cross-account operations using different providers, CloudFormation cannot do that.

Finally, the actions in steps 4 and 5 are dependent on the “bubble” of the new VPC and are completely different from one bubble to another. We could use CloudFormation’s “Conditions” for this purpose, but it would make the template difficult to read and maintain, especially when dealing with a large number of bubbles.

Faced with these challenges, there are two possible approaches.

First, CloudFormation allows the definition of “Custom Resources,” which are Lambda functions where we can write whatever we want. With Custom Resources, we can perform ID lookups and cross-account operations as needed. However, relying heavily on Custom Resources while maintaining the robustness of CloudFormation can be challenging. It’s best to avoid using them unless absolutely necessary.

Second, we can observe that our VPC deployment can be achieved with just two consecutive CloudFormation templates:

One to perform the actions in the “application” account (create the VPC, etc.).
One to perform the actions in the “network” account based on the chosen bubble.

Therefore, we “only” need a system capable of orchestrating the deployment of two CloudFormation templates in different AWS accounts and selecting the second template based on a variable. If this system could also fetch the ID of a Transit Gateway, we would have achieved our goal.

Step Functions to the rescue!

At this stage, using Step Functions becomes increasingly tempting. AWS Step Functions is a serverless service that allows the orchestration of any workflow by representing it as a finite state machine, where each state generally corresponds to the invocation of a Lambda function. The service enables the execution of lambdas in sequence or in parallel, as well as making choices based on the result of a lambda.

To solve our problem, we can create a Step Function that:

Searches for the ID of the Transit Gateway and other required parameters for steps 1 to 3 of adding a new VPC.
Assumes a predefined role in the target AWS account and deploys the CloudFormation template for the new VPC with the established parameters from the previous step.
Waits for the deployment to complete.
Searches for the ID of the new attachment (an output of the VPC deployment) and other parameters required for steps 4 and 5.
Assumes a predefined role in the AWS “network” account, selects a CloudFormation template based on the desired bubble, and deploys it with the established parameters from the previous step.
Waits for the deployment to complete.
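
The two “assume a role and deploy” steps in the list above could be handled by a Lambda along the following lines. This is only a sketch under assumptions: the function name, the session name, and the way the template is referenced (a URL) are hypothetical, and the clients are passed in so the logic stays independent of any SDK setup (in practice `sts_client` would be `boto3.client("sts")` and `client_factory` would be `boto3.client`).

```python
from typing import Any, Callable, Dict


def deploy_stack_in_account(
    sts_client: Any,
    client_factory: Callable[..., Any],
    role_arn: str,
    stack_name: str,
    template_url: str,
    parameters: Dict[str, str],
) -> str:
    """Assume a role in the target account and start a CloudFormation deployment.

    Returns the stack ID without waiting for completion; waiting is left
    to the state machine.
    """
    credentials = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName="tgw-automation",  # hypothetical session name
    )["Credentials"]

    # Build a CloudFormation client that acts inside the target account.
    cloudformation = client_factory(
        "cloudformation",
        aws_access_key_id=credentials["AccessKeyId"],
        aws_secret_access_key=credentials["SecretAccessKey"],
        aws_session_token=credentials["SessionToken"],
    )

    response = cloudformation.create_stack(
        StackName=stack_name,
        TemplateURL=template_url,
        Parameters=[
            {"ParameterKey": key, "ParameterValue": value}
            for key, value in parameters.items()
        ],
    )
    return response["StackId"]
```

Returning immediately with the stack ID keeps the Lambda short-lived, which matters for the wait steps in the list.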

All steps can be achieved with lambdas except for the wait steps, for which Lambda is not ideal for the following reasons:

It’s not clean to wait in a lambda, and it costs money for no reason.
A deployment might take longer than 15 minutes (although in reality it takes 2 minutes, let’s focus on the principles), exceeding the maximum execution time of a Lambda.

“Waiting” in a Step Function is actually quite easy using two special states provided by the service:

“Choice,” which allows us to choose the next step based on the value of a variable.
“Wait,” which is simply a configurable wait state, similar to a “sleep” in a program.

By using these two states together, we can easily set up a waiting logic.
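
To make the idea concrete, here is one way such a loop could be expressed in Amazon States Language, built as a Python dictionary and serialized to JSON. It is a sketch, not the state machine from the reference implementation: the state names are hypothetical, and it assumes a small Lambda (whose ARN is passed in) that returns the current CloudFormation stack status as a plain string.

```python
import json


def build_wait_loop(check_status_lambda_arn: str, poll_seconds: int = 30) -> dict:
    """Build a polling loop: check the status, then branch or wait and retry."""
    return {
        "StartAt": "CheckStackStatus",
        "States": {
            # Assumed Lambda: returns the stack status, e.g. "CREATE_IN_PROGRESS".
            "CheckStackStatus": {
                "Type": "Task",
                "Resource": check_status_lambda_arn,
                "ResultPath": "$.StackStatus",
                "Next": "IsDeploymentFinished",
            },
            # "Choice" picks the next state from the value stored above.
            "IsDeploymentFinished": {
                "Type": "Choice",
                "Choices": [
                    {
                        "Variable": "$.StackStatus",
                        "StringEquals": "CREATE_COMPLETE",
                        "Next": "DeploymentSucceeded",
                    },
                    {
                        "Variable": "$.StackStatus",
                        "StringEquals": "CREATE_IN_PROGRESS",
                        "Next": "WaitBeforeRetry",
                    },
                ],
                # Any other status (rollback, failure...) ends the loop in error.
                "Default": "DeploymentFailed",
            },
            # "Wait" pauses without running any Lambda, then loops back.
            "WaitBeforeRetry": {
                "Type": "Wait",
                "Seconds": poll_seconds,
                "Next": "CheckStackStatus",
            },
            "DeploymentSucceeded": {"Type": "Succeed"},
            "DeploymentFailed": {
                "Type": "Fail",
                "Error": "DeploymentFailed",
                "Cause": "The stack reached an unexpected status.",
            },
        },
    }


if __name__ == "__main__":
    arn = "arn:aws:lambda:eu-west-1:123456789012:function:check-status"  # placeholder
    print(json.dumps(build_wait_loop(arn), indent=2))
```

No Lambda is running while the “Wait” state is active, which addresses both concerns listed earlier: nothing is billed for idle waiting, and the 15-minute Lambda limit no longer applies to the deployment as a whole.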

For the deployment step in the “Network” account, we want to be able to choose the template based on the name of the “bubble”. One simple solution is to store the templates in an S3 bucket, ensuring that the key of each template can be derived from the name of the bubble. For example, if we have a variable “Bubble” = “shared-services”, we know that we should find “/templates/shared-services.yml” in a predefined S3 bucket. This approach is straightforward and allows for easy addition of new templates/bubbles.
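
Deriving the key from the bubble name can be as small as the function below. This is an illustrative sketch: the function name and the validation rule (lowercase letters, digits, and hyphens only) are assumptions, and the leading slash from the example above is dropped because S3 keys are conventionally written without one.

```python
import re


def template_key_for_bubble(bubble: str, prefix: str = "templates") -> str:
    """Map a bubble name to the S3 key of its CloudFormation template."""
    # Reject anything that could produce an unexpected key, such as "../other".
    if not re.fullmatch(r"[a-z0-9][a-z0-9-]*", bubble):
        raise ValueError(f"invalid bubble name: {bubble!r}")
    return f"{prefix}/{bubble}.yml"


if __name__ == "__main__":
    print(template_key_for_bubble("shared-services"))
```

With this convention, adding a bubble only requires uploading a new template under the matching key; the orchestration code does not change.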

Overall, the proposed solution looks like this:


Note that the Step Function is located in an AWS account referred to as the “Manager Account.” Indeed, since we need to operate in multiple accounts for our deployment, it makes sense to separate our automation into a dedicated account. However, it could also be logical for the “Network” account containing the Transit Gateway to be the account hosting the Step Function. The provided reference implementation on GitHub allows you to choose the approach that suits you best.

One Bubble, One Template

It’s worth noting that we use a different template for each bubble. In such cases, we usually face a problem: how to manage the fact that templates might have different sets of parameters?

It’s an excellent question and a potentially complex problem, but luckily… the parameters are always the same! In fact, the structure of the templates is also always the same. Let’s see why.

To configure a new attachment according to its bubble, what do we need to do? If we refer to the method described in the previous article, placing a VPC in bubble “A” means:

Associating the VPC attachment with the corresponding Transit Gateway Route Table (TGWRT) for bubble “A.”
Creating a route to this attachment in all the bubbles/TGWRTs with which bubble “A” needs to communicate.
Creating a blackhole route for this attachment in all the bubbles/TGWRTs with which bubble “A” should not communicate.

Since bubble “A” either communicates with a given bubble or does not (there’s no in-between like “I can partially communicate but not entirely”), we realize that we will have to add either a route or a blackhole route in each TGWRT. Thus, the templates are always similar, with the same number of resources:

One resource for making the association.
One resource for each TGWRT to create either a route or a blackhole route.

In any case, we need to know the same information in the template:

The ID of the new attachment.
The CIDR Block of the new VPC.
The IDs of all the TGWRTs.

For the IDs of the TGWRTs, we can rely on CloudFormation exports. The remaining task is to obtain the ID of the attachment and the CIDR Block of the VPC, which are the only parameters expected by all the “bubble templates.”

Here’s an example of the “shared-services.yml” template (available on GitHub):

Note the use of “ImportValue” to find the IDs of the existing route tables.
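
To illustrate why every bubble template ends up with the same shape, the sketch below generates one programmatically. It is not the author's template: the export naming scheme (`tgw-rt-<bubble>`), the parameter names, and the function names are all hypothetical, and bubble names are assumed to contain only letters, digits, and hyphens. The resource types and properties used are the standard CloudFormation ones for Transit Gateway associations and routes.

```python
import json
from typing import Iterable, Set


def export_name(bubble: str) -> str:
    """Hypothetical naming scheme for the export holding a bubble's TGWRT ID."""
    return f"tgw-rt-{bubble}"


def build_bubble_template(all_bubbles: Iterable[str], own_bubble: str, reachable: Set[str]) -> dict:
    """One association, plus one route or blackhole route per TGWRT."""
    resources = {
        "Association": {
            "Type": "AWS::EC2::TransitGatewayRouteTableAssociation",
            "Properties": {
                "TransitGatewayAttachmentId": {"Ref": "AttachmentId"},
                "TransitGatewayRouteTableId": {"Fn::ImportValue": export_name(own_bubble)},
            },
        }
    }
    for bubble in all_bubbles:
        properties = {
            "DestinationCidrBlock": {"Ref": "VpcCidr"},
            "TransitGatewayRouteTableId": {"Fn::ImportValue": export_name(bubble)},
        }
        if bubble in reachable:
            # This bubble may talk to the new VPC: route to its attachment.
            properties["TransitGatewayAttachmentId"] = {"Ref": "AttachmentId"}
        else:
            # This bubble must not reach the new VPC: drop the traffic.
            properties["Blackhole"] = True
        # Logical IDs must be alphanumeric: "shared-services" -> "SharedServicesRoute".
        logical_id = "".join(part.capitalize() for part in bubble.split("-")) + "Route"
        resources[logical_id] = {
            "Type": "AWS::EC2::TransitGatewayRoute",
            "Properties": properties,
        }
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {
            "AttachmentId": {"Type": "String"},
            "VpcCidr": {"Type": "String"},
        },
        "Resources": resources,
    }


if __name__ == "__main__":
    template = build_bubble_template(
        ["shared-services", "prod", "dev"], "prod", {"shared-services", "prod"}
    )
    print(json.dumps(template, indent=2))
```

Whatever the bubble, the output has the same two parameters and one resource per TGWRT plus the association; only the route-or-blackhole choice differs.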

Returning to a more native experience with Custom Resources

At this stage, we are discussing using a Step Function to deploy our new VPC, with the support of 2 CloudFormation templates. However, writing a JSON of parameters to trigger a Step Function is relatively far from the native CloudFormation experience.

And what happens when I want to delete the VPC? Or if I want to change its bubble?

The proposed solution suggests using other Step Functions to handle these scenarios, but that moves us even further away from the native IaC experience.

Here, a “Custom Resource” can help. To simplify things a bit, a CloudFormation Custom Resource is essentially a lambda function that CloudFormation calls, saying “I need to [CREATE | UPDATE | DELETE]” and passing the resource parameters directly.

For example, the following Custom Resource calls a lambda whose ARN is obtained through a pre-existing CloudFormation export, which is the “ServiceToken” parameter. “ServiceToken” is the only required parameter, and all other parameters are considered “custom” and passed to the lambda:

The entire logic behind it is in the hands of the lambda function. This makes Custom Resources extremely powerful tools: in theory, there is absolutely nothing impossible to do with this mechanism. For example, with a (very) significant effort, one could envision having a set of Custom Resources that enable deployments on Azure using CloudFormation… but just because something is possible doesn’t necessarily make it a good idea.

In the case at hand, a Custom Resource allows us to have a “dispatcher” that can call the appropriate Step Function with the right parameters based on the operation we want to perform:

Thanks to this, we return to a native CloudFormation experience to deploy our VPCs within our Transit Gateway bubbles: we simply need to deploy a CF template with a few parameters, as usual.
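
The dispatcher described above can be sketched as a small function that reads the request type sent by CloudFormation and starts the matching Step Function. This is an illustration under assumptions rather than the reference implementation: the function name and the shape of the payload are hypothetical, the Step Functions client is passed in (in practice `boto3.client("stepfunctions")`), and the started state machine is assumed to be the one that eventually reports success or failure to the `ResponseURL` provided by CloudFormation.

```python
import json
from typing import Any, Dict


def dispatch(event: Dict[str, Any], sfn_client: Any, state_machines: Dict[str, str]) -> str:
    """Start the Step Function matching the CloudFormation request type.

    `state_machines` maps "Create", "Update" and "Delete" to state machine ARNs.
    """
    request_type = event["RequestType"]
    if request_type not in state_machines:
        raise ValueError(f"unsupported request type: {request_type}")

    # Keep only the custom parameters; ServiceToken (if present) is not needed.
    properties = dict(event.get("ResourceProperties", {}))
    properties.pop("ServiceToken", None)

    payload = {
        "Properties": properties,
        # Assumption: the state machine uses these fields to answer CloudFormation.
        "Callback": {
            "ResponseURL": event["ResponseURL"],
            "StackId": event["StackId"],
            "RequestId": event["RequestId"],
            "LogicalResourceId": event["LogicalResourceId"],
        },
    }
    response = sfn_client.start_execution(
        stateMachineArn=state_machines[request_type],
        input=json.dumps(payload),
    )
    return response["executionArn"]
```

Delegating the answer to the state machine keeps the Lambda itself short, since CloudFormation waits for the response on `ResponseURL` rather than for the Lambda to finish the whole deployment.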

It sounds interesting, but it’s all very theoretical.

I understand that despite all my efforts, it is impossible to fully grasp such a complex solution with just words and four diagrams.

That’s why, if you want to delve deeper, I can only encourage you to visit the GitHub repository containing my reference implementation: a well-documented readme will guide you step-by-step through deploying the solution yourself, starting from scratch (although you will need at least an AWS account). This will allow you to better understand the solution and observe its benefits and limitations. The cost will be negligible in an enterprise sandbox environment. For those with only a personal account, testing the solution for a full working day (about 8 hours) with 2 or 3 VPCs attached to a Transit Gateway should not cost more than $5 in the Ireland region.