A week ago, we completed migrating our existing infrastructure from GCP to AWS. It was an important part of our ongoing effort to improve and extend our current product offering. In this post, we share our migration plan and the issues we encountered while executing it.
For the last two years ENPICOM had been running on GCP. However, we kept hearing from our customers that, for security and compliance reasons, they would strongly prefer their SaaS providers to be hosted on AWS. Furthermore, a few managed services that AWS provides (like Athena) seemed a good match for our new data layer architecture, whereas Google had no comparable service available. To get a better idea of our options, one of our founders, Dr. Nicola Bonzanni, reached out to a contact at AWS. From the get-go, we received an overwhelming amount of support from the AWS startup team, who were very interested in our platform and wanted to help us take it to the next level. At that point, the decision to move over to AWS did not require much discussion.
In case you are not familiar with ENPICOM yet, we develop software to manage and analyze immune repertoire sequencing data. Our platform runs mainly on Kubernetes (K8s), the container orchestrator. K8s has been at the center of many discussions about unnecessary complexity and overengineering, but with this migration we found that adopting K8s has been a net positive for us, leading to significant cost reductions. These savings derive directly from the fact that K8s is "universal": a K8s cluster on GCP works the same way as a K8s cluster on AWS (minus some specific differences, which we will point out later). Long story short: K8s allowed us to avoid vendor lock-in, cut costs, and move our infrastructure between clouds with relatively little effort.
Besides the K8s cluster, we also had to migrate a managed PostgreSQL database (Cloud SQL), a (relatively) large block storage volume of about 2 TB, and a few services such as our internal QA application and the downtime service.
If you are also thinking of migrating a K8s cluster between clouds, keep on reading - we are sure you will find something useful in this post.
From develop to production
At ENPICOM we work with three "environments", which will probably sound very familiar to you: develop, staging and production. Each environment contains the full infrastructure required to run our platform (a K8s cluster, a database, a file storage solution) and is continually deployed to by our CI/CD system. To catch any issues with our migration runbook or the AWS infrastructure setup, we decided to run the full migration process for each environment. We also left a week or two between migrations, so that any issues that would only appear after some time could be caught before we migrated production.
The migration plan
The migration plan was divided into three major parts: 1) Provisioning the infrastructure on AWS, 2) Moving over the data and 3) Executing the cut-over.
Everything started with setting up the required infrastructure on AWS, such as the K8s cluster and the database. Once that was done, we could start a planned downtime and move over the data from the old Google infrastructure. After all the data was present on AWS we updated the DNS (so the environment domain name pointed to the new AWS environment) and lifted the downtime.
As we could afford a significant amount of downtime while still meeting our SLAs, we did not require a zero-downtime migration. This allowed us to simplify the migration process by avoiding complex replication between environments (such as PostgreSQL logical replication), and to keep costs, engineering time and the size of the migration team to a minimum.
1. Provisioning the infrastructure
As mentioned above, before moving over data we had to set up the required infrastructure on AWS. On GCP we did this using shell scripts that directly used the GCP CLI. This approach was optimal for us at the time; our infrastructure was mainly static and did not require a lot of changes, and every now and then we would need to spin up a new platform environment.
Now that we were moving to AWS, we wanted to do it the "right" way. After looking at the available options we settled on AWS CloudFormation templates. The templates saved us a lot of time on manual resource provisioning: with just a few variable changes we could bring up a new environment within 30 minutes. They also worked great as an inventory of the infrastructure we had running and the settings each resource was using. Previously we had some issues with environment settings getting out of sync and causing bugs; the CloudFormation templates easily prevented this.
We did encounter some CloudFormation-specific issues, however. Renaming an existing resource is pretty much impossible: the only way to give a resource a new name is to recreate it, which can be a big problem for databases and similar stateful infrastructure. Another issue we ran into frequently was the initial template deployment failing midway through. The resources deployed so far are rolled back, and the only way to "fix" the issue is to start the whole deployment again; it is not possible to just fix the faulty CloudFormation template and continue with the resources that were already created during the previous, failed deployment.
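To give an idea of what such a template looks like, below is a minimal, hypothetical sketch of a parameterized per-environment stack. The parameter and resource names are ours for illustration and are not taken from our actual templates:

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Minimal sketch of a parameterized per-environment stack

Parameters:
  EnvironmentName:
    Type: String
    AllowedValues: [develop, staging, production]

Resources:
  # One EFS file system per environment, tagged so it is easy to
  # find back in the inventory
  FileSystem:
    Type: AWS::EFS::FileSystem
    Properties:
      Encrypted: true
      FileSystemTags:
        - Key: environment
          Value: !Ref EnvironmentName
```

Bringing up a new environment is then mostly a matter of passing a different `EnvironmentName` value when deploying the stack.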
2. Moving over the data
Before getting started on anything else, we reduced the Time-To-Live (TTL) of the A record pointing to our platform to about ten minutes. Once everything was migrated from GCP to AWS, this record would be modified to point to our AWS static IP address. We reduced the TTL so that we could still quickly switch back to our old platform (running on GCP) in case something went wrong during the migration.
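As an aside, if the zone is hosted in Route 53, such a TTL change is a single change batch passed to `aws route53 change-resource-record-sets`. In the hypothetical example below only the TTL changes; the record keeps pointing at the old (GCP) address. The domain and IP are placeholders:

```json
{
  "Comment": "Lower the TTL ahead of the migration",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "platform.example.com.",
        "Type": "A",
        "TTL": 600,
        "ResourceRecords": [{ "Value": "198.51.100.10" }]
      }
    }
  ]
}
```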
After this we enabled our downtime page. We had to make sure that no new data was being generated by the platform anymore, or we would run into issues while moving the old data to AWS. The downtime page was therefore activated to stop any incoming requests from customers. We had, of course, warned our customers well in advance of the coming downtime.
To move our files from Google Filestore to AWS Elastic File System (EFS) we used Rsync. On GCP we had an instance with access to the Filestore, and on AWS we had an instance ready with access to the EFS and with a large block volume mounted. These instances were some of the pricier ones, to prevent processing power from becoming the bottleneck instead of the network. On the AWS instance we ran this command to move over the data:
rsync -aHW user@<google_host>:/mnt/filestore/ /mnt/efs
- -a (--archive) is shorthand for the -rlptgoD options. It recurses into directories, copies symlinks, preserves permissions, preserves modification times and preserves groups and owners.
- -H (--hard-links) preserves hard links.
- -W (--whole-file) copies files whole, without the overhead of running the delta transfer algorithm. This speeds things up if the destination does not have any of the data to be transferred.
When all the data was moved over, we used Rsync's checksum functionality to verify that no random bits were flipped during the transfer. The command is largely the same as the previous one, with a few options changed.
rsync -niaHc user@<google_host>:/mnt/filestore/ /mnt/efs
- -n (--dry-run) does not perform any changes on the destination.
- -i (--itemize-changes) outputs a list of all updates.
- -c (--checksum) skips files based on checksum instead of modification time and file size.
A few days earlier, right after provisioning the AWS infrastructure, we had already run Rsync to move over most of the (static) data. Thanks to Rsync's delta transfer algorithm, this reduced the time needed to migrate the filestore data during the downtime to about 10% of what it would otherwise have been.
At the same time, we moved over the SQL data using the standard tools provided by PostgreSQL: pg_dump and pg_restore. To speed things up we used the -j command line option, which uses multiple concurrent connections to the database for faster data export or import.
# Dump the data from the Google Cloud SQL database to the mounted volume
pg_dump -h <google_host> -j 9 -Fd -f /mnt/pgdump <database>
# Restore the data into the AWS database
pg_restore -h <aws_database_host> -j 9 -d <database> /mnt/pgdump
3. Executing the cut-over
At this point we had provisioned the new infrastructure on AWS, moved over the data and started the IGX platform. The only thing left to do was to modify the A record in our DNS settings to point to the new AWS infrastructure. Thanks to the short TTL this change propagated quickly, and we had successfully completed the migration to AWS.
During the migration we got a good look at the differences between the two major cloud providers. Both clouds have very similar products, which makes for a very easy comparison.
How Amazon Elastic Kubernetes Service differs from Google Kubernetes Engine
While K8s works the same on AWS as it does on GCP, there are still noticeable differences between the two managed products. The situation is best described with the "batteries included" philosophy. Google Kubernetes Engine (GKE) has already handled everything that you might reasonably want to configure on a new K8s cluster, i.e., it comes with batteries included. These "batteries" include things like container logging, container metrics, load balancing, node pool autoscaling and network policies. Almost everything works without you having to configure or install anything in the cluster, and if you do need to configure something, it is probably just one checkbox away.
Amazon Elastic Kubernetes Service (EKS) is the complete opposite. Think of it as a mechanical tool that is sold without batteries, but which fits whatever battery you can find. With EKS you get a cluster "skeleton" that you have to configure on your own. This allows for a wide range of possible configurations, instead of the single "default" configuration you get with GKE. However, since we are dealing with a managed solution, one would expect a little more "managing" to be done by EKS. Most of the configuration work on our EKS clusters was easily automated, and one might wonder why AWS has not grabbed this low-hanging fruit yet. On the other hand, K8s originates directly from Google, so there might be more institutional K8s knowledge available there than at Amazon. Whichever is the case, we found the K8s workflow on AWS a bit more painful than it should have been.
Below is a list of features that came out-of-the-box with GKE, but which we had to install and configure ourselves on EKS, to give an idea of what we are talking about:
- Logging: On GKE we got container logging out-of-the-box. On EKS we had to configure Fluent Bit to grab logs from the cluster nodes and send them to CloudWatch.
- Metrics: To get metrics on container resource usage with EKS we had to install the CloudWatch agent on the cluster. GKE already took care of this.
- Network Policies: As mentioned before, with GKE this was simply clicking a checkbox in the GKE dashboard. For EKS we had to install a network policy engine, such as Calico.
- Load Balancing: K8s provides the Ingress resource as a way to provision load balancers without having to do cloud-specific API calls. Instead you install a load balancer controller for the cloud you are using. GKE clusters came with this controller already running, with EKS we had to install the controller ourselves.
- Cluster Autoscaling: Automatic scaling of node pools (or node groups, in AWS terms) worked out of the box on GKE. For EKS we had to install the AWS Cluster Autoscaler, after which we had to manually set tags on each AWS autoscaling group to make sure that the nodes it created got the correct taints and labels.
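To illustrate the load balancing point above: once the AWS Load Balancer Controller is installed, a plain Ingress resource provisions an Application Load Balancer. The manifest below is an illustrative sketch (the service name and settings are placeholders, not our actual manifests); on GKE an equivalent Ingress works without installing any controller:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: platform
  annotations:
    # Ask the controller for an internet-facing ALB
    alb.ingress.kubernetes.io/scheme: internet-facing
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: platform-frontend
                port:
                  number: 80
```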
On the bright side, AWS did provide pretty good documentation for each of these issues. Not having to figure these things out ourselves saved us a lot of time.
Besides the differences between AWS and GCP, we also learned a few more generally applicable things. One lesson was that (at least in our specific situation) it is sometimes better to do things in-house rather than outsourcing them to an external consulting company. When the idea of migrating the infrastructure first surfaced, we considered outsourcing the whole infrastructure migration and setup.
Outsourcing might have helped us free up some time for other projects, but it didn't make much sense considering our tight deadline. Just the negotiation phase with the consulting company we reached out to took weeks, and at that point, we just decided to do it ourselves. In the end, we managed to migrate in less than a month with just two employees working full-time on the project. And as an added bonus, we now also have extensive in-house knowledge of how our infrastructure works.
Does any of the above sound like your cup of tea? Come join us! Take a look at enpicom.com/careers for our open positions. We are currently looking for a Front-end Engineer and a Bioinformatician to join our team.