AWS Data Migration and ETL — How AWS Glue stands out
For use cases involving the AWS cloud, you will often end up with multiple AWS services that can solve the problem at hand. Many services overlap in features, and the lines separating their functionalities get blurred. This is especially the case where new, more efficient services have been built on the latest technologies and solutions instead of revamping the old ones. One such example, which I will refer to here, is AWS Data Pipeline vs. AWS Glue.
For a simple requirement of data migration in DynamoDB, I ended up spending a considerable amount of time exploring various options, running through a few AWS services and proofs of concept before proposing the right design solution to the team. The use cases in this context are:
- Copying DynamoDB data from one table to another
- Copying DynamoDB data from one table to another “with” ETL transformation
Following are a few solutions one would come across:
- Native DynamoDB backup and restore (without transformation)
- Using AWS Lambda Custom Implementation
- AWS Data Pipeline
- AWS Glue
** Other services like AWS DMS (Database Migration Service) are intended to bring external data from MongoDB, Oracle, etc. into AWS databases like RDS, DynamoDB, Aurora, Redshift and so forth. DMS therefore solves a different problem, focusing on external data ingestion, whereas my use case is about internal data movement within AWS.
1. Native DynamoDB backup and restore (without transformation)
This is a native on-demand backup utility from DynamoDB which allows taking a backup of an existing table and restoring it into a new table. It does not consume provisioned read or write capacity and has no impact on table performance or availability. It is a good option for point-in-time backups and archival use cases.
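In case it helps, here is a minimal sketch of driving this utility with boto3; the table and backup names are placeholders, not from the original exercise:

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Create an on-demand backup of the existing table.
backup = dynamodb.create_backup(
    TableName="orders",                       # placeholder source table
    BackupName="orders-backup-2020-01-01",    # placeholder backup name
)
backup_arn = backup["BackupDetails"]["BackupArn"]

# In practice, wait for the backup to reach AVAILABLE status, then restore it
# into a brand-new table; the target table must not already exist.
dynamodb.restore_table_from_backup(
    TargetTableName="orders-restored",        # placeholder new table
    BackupArn=backup_arn,
)
```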
The limitations are that it is time consuming, you cannot transform attributes or data before restoration, and you cannot restore into a user-defined, pre-existing table. Unfortunately, any use case which requires more control over the data backup, or requires data transformation, is not fulfilled by this utility.
2. Using AWS Lambda Custom Implementation
This option requires you to write an AWS Lambda function to do the magic of reading from the source table and writing to the target table. While this is easily achieved and performs well for small volumes of data, it does not scale to large volumes of data with transformation, not least because a single Lambda invocation is capped at 15 minutes of execution time.
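A minimal sketch of such a Lambda function, assuming a plain scan-and-rewrite between two hypothetical tables, could look like this:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
source_table = dynamodb.Table("orders")        # hypothetical source table
target_table = dynamodb.Table("orders-copy")   # hypothetical target table

def handler(event, context):
    """Scan the source table page by page and write every item to the target."""
    scan_kwargs = {}
    with target_table.batch_writer() as batch:
        while True:
            page = source_table.scan(**scan_kwargs)
            for item in page["Items"]:
                # An optional transformation could be applied to `item` here.
                batch.put_item(Item=item)
            last_key = page.get("LastEvaluatedKey")
            if not last_key:
                break
            scan_kwargs["ExclusiveStartKey"] = last_key
    return {"status": "done"}
```

A single invocation is bounded by Lambda's 15-minute timeout, which is exactly why this pattern breaks down for large tables.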
3. AWS Data Pipeline
The AWS Data Pipeline web service allows movement and transformation of data using EMR/EC2 clusters, relying on big data tools like Hive and Pig to do the work. You have to configure the EMR cluster size to run the ETL job. Pre-built templates exist, each composed of one or more activities, where an activity is a collection of tasks to be performed. Typical examples of these tasks would be “Import from DynamoDB and export to S3” and “Copy RDS table to Redshift”.
For the use case of DynamoDB-to-DynamoDB data copy, I used HiveCopyActivity. You have to provide the cluster size, source table, target table, data format and so forth in these templates. While the pipeline did the job and the data copy completed successfully, the following drawbacks were observed in this exercise (a sketch of the pipeline definition follows the list):
- High latency: The time taken to start an EMR cluster, irrespective of the number of nodes, was high. It usually took six to nine minutes to spin up a cluster before job execution could even begin. This adds up quickly when a chain of dependent pipelines is required to complete the end-to-end flow.
- DynamoDB WCUs not fully utilised: Neither the provisioned write capacity of the target DynamoDB table nor the underlying cluster nodes were fully utilised. This raised doubts about the feasibility and efficiency of the solution when tens of millions of records need to be copied.
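For reference, a pipeline of this shape can also be created programmatically. The sketch below uses boto3 with illustrative object ids, table names, instance types and throughput percentages, so treat it as an outline rather than the exact definition used in this exercise:

```python
import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")

pipeline_id = dp.create_pipeline(
    name="dynamo-copy", uniqueId="dynamo-copy-001"
)["pipelineId"]

def pipeline_object(obj_id, **fields):
    """Build one object in the format expected by put_pipeline_definition."""
    return {
        "id": obj_id,
        "name": obj_id,
        "fields": [
            {"key": k, "refValue": v["ref"]} if isinstance(v, dict)
            else {"key": k, "stringValue": v}
            for k, v in fields.items()
        ],
    }

objects = [
    pipeline_object("Default",
                    scheduleType="ondemand",
                    role="DataPipelineDefaultRole",
                    resourceRole="DataPipelineDefaultResourceRole"),
    pipeline_object("EmrClusterForCopy",
                    type="EmrCluster",
                    masterInstanceType="m3.xlarge",
                    coreInstanceType="m3.xlarge",
                    coreInstanceCount="1"),
    pipeline_object("SourceTable",
                    type="DynamoDBDataNode",
                    tableName="source-table",              # placeholder
                    readThroughputPercent="0.5"),
    pipeline_object("TargetTable",
                    type="DynamoDBDataNode",
                    tableName="target-table",              # placeholder
                    writeThroughputPercent="0.5"),
    pipeline_object("TableCopy",
                    type="HiveCopyActivity",
                    input={"ref": "SourceTable"},
                    output={"ref": "TargetTable"},
                    runsOn={"ref": "EmrClusterForCopy"}),
]

dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=objects)
dp.activate_pipeline(pipelineId=pipeline_id)
```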
Let us now turn to the second use case of transforming the attributes (columns) before the data is inserted into the target table. There is no out-of-the-box solution to achieve this. We have to first export the data from DynamoDB to S3 using EMRActivity in Data Pipeline, and then have S3 trigger an event to an AWS Lambda function which does the transformation. The last leg of Lambda processing is explained very well in this blog: https://aws.amazon.com/blogs/compute/creating-a-scalable-serverless-import-process-for-amazon-dynamodb/.
This solution demands quite a bit of custom coding for S3 data format handling and parallelization, and it requires monitoring of the DynamoDB table metrics during the process. I faced throttling on writes with provisioned throughput, which raised concerns about this approach, especially when the data volume is large.
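To give an idea of the last leg, here is a hedged sketch of an S3-triggered transformation Lambda. It assumes, purely for illustration, that the export landed in S3 as newline-delimited plain JSON (the actual Data Pipeline export format needs its own parsing), and the table and attribute names are hypothetical:

```python
import json
import boto3

s3 = boto3.client("s3")
target_table = boto3.resource("dynamodb").Table("orders-v2")  # hypothetical target

def handler(event, context):
    """Triggered by an S3 ObjectCreated event; transforms and loads each file."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        with target_table.batch_writer() as batch:
            for line in body.splitlines():
                if not line.strip():
                    continue
                item = json.loads(line)
                # Example transformation: rename an attribute before the write.
                item["orderStatus"] = item.pop("status", "UNKNOWN")
                batch.put_item(Item=item)
```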
Let me now walk you through the experience with AWS Glue for the same use cases.
4. AWS Glue
AWS Glue is a managed ETL service which provides capabilities well beyond those of AWS Data Pipeline. With no infrastructure to manage, it offers a higher level of abstraction. Basic ETL transformations can be achieved with a few lines of code in Python or Scala and very basic knowledge of Apache Spark.
While AWS Glue also provides Data Catalog and Crawler components, they were not needed for my use case. Crawlers record metadata of the data sources in the Data Catalog and are required for more complex data lake needs. Keeping those pieces aside, all that is left is a “script” which you write; the rest of the infrastructure work is handled by AWS Glue.
I ended up writing a few lines of PySpark which read from the source DynamoDB table and wrote to the target table. Glue infers the schema on the fly and does the task quite efficiently. For interim backup and transformation needs where I wanted the data in S3, Glue provides optimised export formats like Apache Parquet and ORC, apart from regular formats like JSON and CSV.
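A minimal sketch of such a Glue (PySpark) script is shown below; the table names, S3 path and throughput percentages are placeholders rather than the exact job I ran:

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the source table; Glue infers the schema on the fly.
source = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.input.tableName": "orders",            # placeholder
        "dynamodb.throughput.read.percent": "0.9",
    },
)

# Optional interim copy in S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-backup-bucket/orders/"},  # placeholder
    format="parquet",
)

# Write the data into the target DynamoDB table.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="dynamodb",
    connection_options={
        "dynamodb.output.tableName": "orders-copy",      # placeholder
        "dynamodb.throughput.write.percent": "0.9",
    },
)

job.commit()
```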
It is worth mentioning that with just a few additional lines of code, I was able to transform a few attributes (columns) and values (data) in the Glue script before writing the data into the target DynamoDB table.
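Continuing the sketch above, the transformation step could be as small as the following, where the attribute names and the derived value are made up for illustration; the resulting frame is then written to the target table in place of `source`:

```python
from awsglue.transforms import Map

def reshape(record):
    """Rename one attribute and derive one value before the write."""
    record["orderStatus"] = record.pop("status", "UNKNOWN")           # column rename
    record["totalWithTax"] = round(record.get("total", 0) * 1.18, 2)  # derived value
    return record

transformed = Map.apply(frame=source, f=reshape)
# ...then pass `transformed` (instead of `source`) to the DynamoDB write above.
```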
Recommendation
Although a bit more expensive than AWS Data Pipeline, AWS Glue not only does a great job for the data migration described here, it is a Swiss Army knife for ETL and analytics. You get the full power of Apache Spark on top of the serverless infrastructure that Glue puts at your disposal. The recent addition of Streaming ETL, where AWS Glue can process streaming data from Amazon Kinesis and Apache Kafka, is also hard to ignore: it brings truly serverless, near real-time analytics capability to the AWS ecosystem. For any use case where the ETL job frequency is not very high and fine-grained control over EMR clusters is not needed, AWS Glue is a good choice. The time saved on development, maintenance and monitoring of infrastructure compensates for its higher cost, and it gives you a ready-to-use playground to innovate at a rapid pace.