Cloudten techblog: Addressing HA Issues with NAT in AWS


So Far So Good …

So you’ve really put some thought into your secure and highly available cloud infrastructure running in AWS. You’ve followed best practice with your web tier: ELBs, auto-scaling with AMIs, and even configuration management tools that can rebuild your robot army from scratch. Your database tier is a thing of beauty that automatically replicates across multiple availability zones and offers point-in-time recovery to any five-minute period within your retention window.

Everything is humming along perfectly … and then it’s not. All services that require outbound access from a private subnet suddenly stop working. This can have some pretty dire consequences.

How are NAT Instances Used in AWS?

There are a number of ways that EC2 instances can gain outbound access (a brief CLI sketch of the second option follows the list):

  • Deploy the instance into a default subnet of a default VPC. Instances launched there are automatically assigned a public IP address and can route traffic through the internet gateway.
  • Attach an Elastic IP (EIP) address to the instance and ensure that its subnet is associated with a route table that references the internet gateway.
  • Use a NAT instance.
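As a rough illustration of the second option, the AWS CLI commands below allocate an EIP, associate it with an instance, and point the subnet’s route table at the internet gateway. This is only a sketch: all resource IDs are placeholders, and it assumes the CLI is already configured with suitable credentials.

    # Allocate an Elastic IP for use inside a VPC and note the returned allocation ID
    aws ec2 allocate-address --domain vpc

    # Associate the EIP with the instance (IDs are placeholders)
    aws ec2 associate-address --instance-id i-0123456789abcdef0 --allocation-id eipalloc-0abc1234

    # Ensure the subnet's route table sends internet traffic via the internet gateway
    aws ec2 create-route --route-table-id rtb-0abc1234 --destination-cidr-block 0.0.0.0/0 --gateway-id igw-0abc1234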

Note: It’s not just outbound internet services that are governed by an internet gateway or NAT instance; it’s all services that run outside your VPC. Critical AWS components such as S3 storage, CloudWatch monitoring and SES (AWS’ email relaying service) all sit outside your VPC. In the event that you lose your gateway, you’re also cut off from these services.

If you’re following security best practice and segregating your network, AWS recommends using a NAT instance for all services that need to initiate outbound connections. For example, a web application may need to connect to an external email gateway to send mail, or to a remote web service for data (e.g. stock prices). The main reason is that you don’t want to assign public-facing IP addresses to EC2 instances that don’t need them. The diagram below shows an active/active multi-AZ web app deployment. The internet-facing subnet contains an ELB (load balancer), which takes all inbound web connections and forwards them to the web app servers located in a private subnet. For the app servers to communicate out, their routing tables need to point to the NAT instance.

A NAT instance is basically an EC2 instance with source/destination traffic checking turned off and an Elastic Network Interface (ENI) with a public address attached to allow internet traffic. There are a number of ways to use NAT in a highly available configuration. This example article on the AWS site perhaps formed the basis of many people’s first experience with HA NAT, and it is also what a number of the approved AMIs available in the AWS console are based on.
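Setting one up by hand boils down to two CLI calls: turning off the source/destination check and pointing the private subnet’s default route at the instance. The sketch below uses placeholder IDs and assumes the NAT AMI itself already handles IP forwarding and masquerading.

    # Disable source/destination checking so the instance can forward traffic it doesn't own
    aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --no-source-dest-check

    # Point the private subnet's default route at the NAT instance's ENI
    aws ec2 create-route --route-table-id rtb-0abc1234 --destination-cidr-block 0.0.0.0/0 --network-interface-id eni-0abc1234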

The Case of the Self-Defeating NAT Instances

The diagram below illustrates the intended functionality of multi-AZ NAT according to the article. Figure 1, on the left, shows how traffic should flow under normal conditions. Each AZ has a dedicated NAT instance, and the routing tables of the private subnets use the ENI of the local NAT instance for internet traffic (i.e. 0.0.0.0/0 -> eni-id). It’s important to remember that an ENI can be detached and re-attached to an EC2 instance, and that multiple ENIs can be attached to a single EC2 instance.

The yellow lines between the two NAT instances represent a monitoring script that checks the health of the other instance via ICMP (ping). Should one of the two NAT instances go down, the monitor on the surviving instance will detect this and perform the following (a stripped-down sketch of this logic appears after the list):

  • Issue an EC2 stop command via the AWS CLI for the NAT instance in the other AZ.
  • Replace the internet route of the private web subnet in the other AZ with the ENI of the healthy NAT instance.
  • If the downed NAT instance starts back up again and is deemed healthy, the route is switched back to restore HA.
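The sketch below is a stripped-down approximation of that failover branch, not the actual nat_monitor.sh; all IDs, addresses and thresholds are placeholder assumptions.

    #!/bin/bash
    PEER_INSTANCE_ID="i-0peer0000000000"   # NAT instance in the other AZ
    PEER_IP="10.0.2.10"                    # its private IP, pinged for health
    PEER_ROUTE_TABLE="rtb-0peer000000"     # route table of the other AZ's private subnet
    MY_ENI="eni-0healthy000000"            # ENI of this (healthy) NAT instance

    # Health check: a few ICMP echoes with a short timeout
    if ! ping -c 3 -W 2 "$PEER_IP" > /dev/null 2>&1; then
        # Peer looks dead: stop it via the EC2 API ...
        aws ec2 stop-instances --instance-ids "$PEER_INSTANCE_ID"

        # ... and take over its private subnet's default route
        aws ec2 replace-route --route-table-id "$PEER_ROUTE_TABLE" \
            --destination-cidr-block 0.0.0.0/0 \
            --network-interface-id "$MY_ENI"
    fi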

Figure 2 shows how this should look. This does work. In most cases.

So What’s The Problem?

If you happen to have deployed your NAT instances from the AWS community AMIs (we’ve tried up to ami-996402a3), then you may have come across an issue whereby, all of a sudden, both NAT instances are cleanly brought down. It has nothing to do with load, there are no errors in the system logs, and you haven’t been hacked.

After months of stability with this solution, we noticed these outages start occurring for a number of AWS customers in the AP Southeast (Sydney) region around November 2014. Speaking to some friends and colleagues in the AWS community from across the globe, it appears that the issue is not isolated to any specific region.

The issue relates to some flawed logic in the default settings of the nat_monitor.sh script, which checks the health of the other NAT instance. Upon further investigation, we noticed intermittent packet loss between the two availability zones that caused the ping tests between the servers to fail.


This “outage” is extremely short, but in the event that a monitor runs at the exact time it occurs, it results in a rather comical (if the consequences weren’t so serious) Mexican standoff. Each NAT instance thinks the other one is in trouble, and they simultaneously initiate remote shutdown requests via an EC2 API call. Ironically, the communication channels used in this API call are different from the ICMP health checks, so this step works perfectly.

Figure 3 shows the results of this comedy of errors.


What’s The Solution?

Over time there have been quite a few new HA solutions suggested for NAT instances:

1) First of all, we don’t think there is any inherent flaw in the underlying design of the AWS-provided example. For the most part it is quite elegant; it’s just an issue with the monitor configuration that causes this mishap to occur. The nat_monitor.sh script in question is perfectly tunable, and you can definitely put in some additional safeguards to ensure that you never run into this issue.

We recommend one or more of the following (a sketch of these tweaks appears after the list):

  • Increase the number of pings (Num_Pings) used for each health check.
  • Increase the timeout (Ping_Timeout) so that each check waits longer for a reply.
  • Implement a random sleep of, say, 30 to 120 seconds for Wait_for_Instance_Stop, so that you won’t have a case of both instances trying to shoot each other at the same time.
  • Instead of stopping the remote instance, attempt to reboot it first. A NAT instance is a fairly simple piece of infrastructure and will come up quite quickly. In the event that there is no fundamental issue with the surrounding environment, a restart will in many cases fix the issue. In order to perform a reboot, you’ll need to add the appropriate IAM permission (i.e. ec2:RebootInstances).
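A sketch of how those tweaks might look in the script is shown below. The variable names match nat_monitor.sh, but the values and the reboot-first snippet are illustrative assumptions, not tested defaults ($PEER_INSTANCE_ID reuses the placeholder from the earlier sketch).

    # More pings per check and a longer timeout make the health check far
    # less sensitive to a momentary blip between AZs
    Num_Pings=5
    Ping_Timeout=3

    # Randomise the wait before acting so both monitors can't pull the
    # trigger at the same instant (30-120 seconds)
    Wait_for_Instance_Stop=$(( 30 + RANDOM % 91 ))

    # Try a reboot before resorting to a stop; this needs ec2:RebootInstances
    # in the NAT instance's IAM role
    aws ec2 reboot-instances --instance-ids "$PEER_INSTANCE_ID"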

2) A slight variation on the same theme uses basically the same logic, but rather than modifying the route tables of the affected private subnets, it detaches the ENI of the unhealthy NAT instance and attaches it to the healthy one.
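In CLI terms, that variant might look something like the sketch below (placeholder IDs again; the attachment ID must be looked up first because detach-network-interface requires it):

    # Find the attachment ID of the failed NAT's ENI
    ATTACHMENT_ID=$(aws ec2 describe-network-interfaces \
        --network-interface-ids eni-0failed000000 \
        --query 'NetworkInterfaces[0].Attachment.AttachmentId' --output text)

    # Force-detach it from the unhealthy instance ...
    aws ec2 detach-network-interface --attachment-id "$ATTACHMENT_ID" --force

    # ... then attach it to the healthy NAT as a secondary interface
    aws ec2 attach-network-interface --network-interface-id eni-0failed000000 \
        --instance-id i-0healthy000000 --device-index 1

Because the private subnets’ routes already point at that ENI, no route table changes are needed once the interface comes up on the healthy instance.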

3) The use of “self-healing” groups is a natural progression, as it incorporates one of the most useful AWS services to further improve high availability. A self-healing group is essentially an AWS auto-scaling group with the same minimum and maximum number of members; it ensures that an instance will be automatically replaced should it die. There are a number of possible ways to use this (a minimal sketch of the first follows the list):

  • Using one self-healing group per AZ, with a single NAT instance in each group. However, this could possibly result in a small outage while a dead NAT instance is being rebuilt.
  • Using multiple self-healing groups (one per AZ) and essentially having an active/standby NAT pair in each group.
  • Using a single self-healing group that spans multiple AZs and has a single NAT in each AZ. This would also leverage some form of the ENI/route table manipulation to handle NAT instance failure.
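As a minimal sketch of the first approach, the commands below pin one NAT instance to an auto-scaling group in a single AZ. Names, IDs and the instance type are placeholders, and a boot-time script would still need to disable the source/destination check and claim the private subnet’s route.

    # Launch configuration for the NAT instances
    aws autoscaling create-launch-configuration \
        --launch-configuration-name nat-lc \
        --image-id ami-996402a3 --instance-type t2.micro \
        --associate-public-ip-address

    # One group per AZ, pinned to exactly one instance (min = max = 1)
    aws autoscaling create-auto-scaling-group \
        --auto-scaling-group-name nat-asg-az-a \
        --launch-configuration-name nat-lc \
        --min-size 1 --max-size 1 --desired-capacity 1 \
        --vpc-zone-identifier subnet-0public000000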

Joe from Uberboxen presents a possible self-healing solution in his blog post An Alternative Approach to “HA” NAT on AWS.

4) Ben Whaley also presents another possible solution that utilises the third-party Packer utility with AWS CloudFormation in his blog post High Availability NAT Instances in AWS VPCs.

As always, if you have any further questions about this post, or about high availability and AWS in general, please don’t hesitate to contact us at info@cloudten.com.au.



Cloudten Industries © is an Australian cloud practice and a recognised consulting partner of AWS. We specialise in the design, delivery and support of cloud based solutions. Don’t hesitate to contact us if you have any queries about this post or any cloud related topic.