Continuous, Controlled Replication from Postgres to Redshift
Postgres or Redshift: which would you prefer for data-warehouse-grade querying?
At Synup, a simple query to count the records in a table with over 2 billion rows would take minutes in Postgres, while the same query finishes within a few seconds in Redshift.
This post covers our motivation, the challenges we faced, and how we overcame them while replicating data from Postgres to Redshift.
Why Replication?
In the microservices architecture at Synup, there are several major components, each with its own database hosted on an AWS RDS Postgres instance.
Any kind of report generation or product analysis involved federating these databases and fetching information. This approach had several drawbacks, a few of which are listed below:
- Overhead on the production databases from performing huge amounts of reads
- Network-related issues with federated databases, and so on
As the volume of data increased, querying became an even bigger challenge: the same queries would take minutes, sometimes hours, and occasionally never produce a result. As a consequence, we explored Redshift as an alternative for its ability to handle heavy querying.
The next question, however, was: how do we replicate the data from databases on different RDS Postgres instances to a single Redshift instance on a regular basis?
How did we approach the replication problem?
We explored several approaches to replicate the data on a regular basis from PostgreSQL to Redshift; some are listed below:
- Traditional trigger-based approach with intermediate S3:
  - This was a completely custom-built solution: triggers were set up after insert, update and delete on each of the tables in Postgres, the changes were captured in staging tables, and those staging tables were used for replication onto Redshift.
  - This added overhead on the production databases.
- Using Debezium and Kafka Connect to replicate data:
  - This approach required a lot of operational work in terms of maintaining a Kafka cluster, ensuring Kafka Connect was up and running, and so on. The wal2json plugin was used for setting up replication.
- Using AWS Glue to replicate data:
  - AWS Glue was set up, but again required considerable effort to maintain each table, and any customization on the tables involved significant work.
- Using AWS Database Migration Service (DMS) to replicate data from Postgres -> Redshift:
  - AWS DMS replicating data directly from Postgres tables to Redshift tables worked best, with minimal operational effort and comparatively lower cost.
  - However, with AWS DMS several batches of records were dropped when the data ingestion rate was higher; this was noticed only with direct replication from Postgres to Redshift.
  - One of the tables where this was observed had an ingestion rate of over 1000 records/minute.
- Hybrid solution with AWS DMS to replicate data from Postgres -> S3 and a custom solution to push it further to Redshift:
  - Unlike the direct database-to-database replication, AWS DMS replication from Postgres to S3 showed no data loss.
  - A Python Flask application was custom built to process the data in S3 and push it to Redshift.
The initial idea was also to implement SCD Type 2 for handling historical data in dimension tables in Redshift; however, as Synup's business case did not have such a need, denormalization was placed on hold.
Data Pipes finalized
The following approaches were finalized for implementation:
- Using AWS DMS to replicate data from Postgres -> Redshift: for tables with minimal DML.
PostgreSQL supports logical decoding, through which continuous replication of data from Postgres to Redshift is possible. AWS DMS implements logical decoding through the test_decoding plugin, so there is only a negligible delay before data is replicated onto Redshift.
The following operations are replicated onto Redshift continuously with this approach: Insert, Update and Delete, as well as Create (only if the corresponding table name is included) and Alter. This approach was straightforward; a small sketch of what test_decoding exposes follows this list.
- Hybrid solution with AWS DMS to replicate data from Postgres -> S3 -> Redshift: for tables with heavy DML operations.
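For illustration, here is a minimal sketch of what the test_decoding plugin exposes, assuming a logical replication slot named demo_slot already exists and psycopg2 is available; the connection details are placeholders, and in practice DMS creates and manages its own slot.

```python
# Sketch: peek at changes exposed by a test_decoding logical replication slot.
# Assumes a slot named "demo_slot" exists and the user has replication privileges;
# connection details are placeholders.
import psycopg2

conn = psycopg2.connect(host="my-postgres-host", dbname="mydb",
                        user="replication_user", password="secret")
conn.autocommit = True

with conn.cursor() as cur:
    # pg_logical_slot_peek_changes returns decoded WAL changes without
    # consuming them, so it is safe to run against an active slot.
    cur.execute(
        "SELECT * FROM pg_logical_slot_peek_changes(%s, NULL, NULL)",
        ("demo_slot",),
    )
    for lsn, xid, data in cur.fetchmany(10):
        # test_decoding emits human-readable lines such as:
        #   table public.employee: INSERT: id[integer]:101 name[text]:'Alice' ...
        print(lsn, xid, data)

conn.close()
```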
Hybrid Solution
Tables with heavy inserts, updates or deletes have an issue when set up on a data pipe where AWS DMS replicates directly between Postgres and Redshift: several batches of records get dropped.
Our final implementation at a high level looks like this:
To overcome the issue of dropped records, the data pipe was redesigned and S3 was brought in as a data lake. With S3 as the target for DMS, we have never seen a missed record in the data pushed from Postgres.
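For reference, the S3 target endpoint can also be scripted with boto3 (the AWS console works just as well); this is a sketch under assumed names, and the role ARN, bucket and identifiers below are placeholders rather than our actual configuration.

```python
# Sketch: create an S3 target endpoint for DMS with boto3.
# All identifiers, ARNs and bucket names below are placeholders.
import boto3

dms = boto3.client("dms", region_name="us-east-1")

response = dms.create_endpoint(
    EndpointIdentifier="postgres-to-s3-target",   # placeholder name
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        # Role that allows DMS to write to the bucket
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-access",
        "BucketName": "my-dms-data-lake",
        "BucketFolder": "postgres/production",    # one folder per source DB
        "CsvDelimiter": ",",
        "CsvRowDelimiter": "\n",
    },
)
print(response["Endpoint"]["EndpointArn"])
```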
Once the data landed in S3, the following options were evaluated for processing it:
- An AWS Lambda function with PySpark code
- A Flask application with custom code to process the data in S3. This was chosen with cost in mind, as the infrastructure required to run the Flask application was already available.
The Flask application exposes REST endpoints for individual tables, so replication can be controlled at the granularity of a table, along the lines of the sketch below.
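Here is a stripped-down sketch of what such an application might look like; the endpoint path, the table registry and the process_table helper are illustrative names, not the exact code we run.

```python
# Sketch of a per-table replication endpoint in Flask.
# The table registry and process_table() helper are illustrative placeholders.
from flask import Flask, jsonify

app = Flask(__name__)

# Tables we know how to replicate, mapped to their S3 prefixes (placeholders).
TABLES = {
    "employee": "postgres/production/public/employee/",
    "orders": "postgres/production/public/orders/",
}

def process_table(table, s3_prefix):
    """Read pending DMS change files for `table` from S3 and apply them to
    Redshift. See the merge sketch further below for one way to do this."""
    # ... load files, COPY into a staging table, merge, archive the files ...
    return {"table": table, "prefix": s3_prefix, "status": "processed"}

@app.route("/replicate/<table>", methods=["POST"])
def replicate(table):
    if table not in TABLES:
        return jsonify({"error": f"unknown table {table}"}), 404
    return jsonify(process_table(table, TABLES[table]))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

Replication for a single table can then be triggered with a plain HTTP POST to /replicate/<table>.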
The Flask application is invoked periodically, currently controlled through cron, though Airflow can also be used.
At Synup our systems run in a Kubernetes cluster, so it was a no-brainer to deploy the Flask application there as well.
The Flask application is structured as depicted below:
Complete Illustration with an example:
Let's say there is a table employee with three columns: Employee ID, Employee Name and Employee Salary. Three new employees (IDs 101, 102, 103) are added, and one existing employee's (ID 100) salary is updated to 1000 in the Postgres DB. The series of actions triggered is illustrated below.
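To make this concrete, here is a minimal sketch, assuming DMS writes CDC change files to S3 as CSV with an Op column (I/U/D) as the first field, of how the application could apply one such file for the employee table to Redshift; the connection string, IAM role and S3 path are placeholders, and the merge assumes at most one change per employee ID per batch.

```python
# Sketch: apply one DMS CDC change file for the `employee` table to Redshift.
# Placeholders: connection string, IAM role ARN, and S3 change-file path.
import psycopg2

REDSHIFT_DSN = "host=example.redshift.amazonaws.com port=5439 dbname=analytics user=loader password=secret"
IAM_ROLE = "arn:aws:iam::123456789012:role/redshift-s3-read"
CHANGE_FILE = "s3://my-dms-data-lake/postgres/production/public/employee/20240101-000000001.csv"

conn = psycopg2.connect(REDSHIFT_DSN)
cur = conn.cursor()

# Staging table mirrors the target table plus the DMS Op column (I/U/D).
cur.execute("""
    CREATE TEMP TABLE employee_stage (
        op              CHAR(1),
        employee_id     INTEGER,
        employee_name   VARCHAR(256),
        employee_salary NUMERIC(12, 2)
    )
""")

# Load the change file produced by DMS from S3.
cur.execute(
    f"COPY employee_stage FROM '{CHANGE_FILE}' IAM_ROLE '{IAM_ROLE}' FORMAT AS CSV"
)

# Remove every key touched in this batch, then re-insert the current version
# of inserted/updated rows; deleted rows simply stay removed.
cur.execute("""
    DELETE FROM employee
    USING employee_stage
    WHERE employee.employee_id = employee_stage.employee_id
""")
cur.execute("""
    INSERT INTO employee (employee_id, employee_name, employee_salary)
    SELECT employee_id, employee_name, employee_salary
    FROM employee_stage
    WHERE op IN ('I', 'U')
""")

conn.commit()
cur.close()
conn.close()
```

In the example above, employees 101, 102 and 103 arrive as I rows and the salary change for employee 100 arrives as a U row, so after the merge Redshift holds the three new employees and the updated salary for employee 100.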
Prerequisites for setting up replication
- The PostgreSQL DB must be v9.4.9 or above for logical decoding to work
- A replication slot must exist, or the IAM role should have access to both RDS and Redshift
- The Redshift cluster must exist
- A replication instance must be created on DMS
- Source and target endpoints must be created on DMS
- Tables must be created on Redshift
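As a quick sanity check of the source side, something like the following can be run against the RDS Postgres instance; the connection details are placeholders, and on RDS wal_level is switched to logical by setting rds.logical_replication to 1 in the instance's parameter group.

```python
# Sketch: quick checks that a source Postgres instance is ready for
# DMS logical decoding. Connection details are placeholders.
import psycopg2

conn = psycopg2.connect(host="my-postgres-host", dbname="mydb",
                        user="admin_user", password="secret")
conn.autocommit = True
cur = conn.cursor()

# Logical decoding requires wal_level = logical.
cur.execute("SHOW wal_level")
print("wal_level:", cur.fetchone()[0])

# Server version must be 9.4.9 or later.
cur.execute("SHOW server_version")
print("server_version:", cur.fetchone()[0])

# List existing replication slots; DMS creates its own slot, but this is a
# handy sanity check when debugging a stalled task.
cur.execute("SELECT slot_name, plugin, active FROM pg_replication_slots")
for slot_name, plugin, active in cur.fetchall():
    print(slot_name, plugin, active)

cur.close()
conn.close()
```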
Data Mapping Challenges from Postgres to Redshift:
One of the first hurdles in this process is creating the tables in Redshift, since Redshift doesn't support a number of the data types that Postgres does. A thorough check should therefore be done while creating the tables in Redshift; the following table provides the mapping of most data types between Postgres and Redshift.
Hint: use AWS DMS jobs to create the tables. However, AWS DMS changes Boolean in Postgres to varchar in Redshift; such cases need to be individually verified and the tables recreated.
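A small helper along these lines can catch unsupported types before the tables are created; the mapping below is illustrative, covers only common cases we are confident about, and is not an exhaustive or authoritative list.

```python
# Sketch: map common Postgres column types to Redshift-friendly equivalents
# before generating CREATE TABLE statements. Illustrative, not exhaustive.
PG_TO_REDSHIFT = {
    "text": "VARCHAR(65535)",
    "json": "VARCHAR(65535)",          # Redshift has no native JSON column type
    "jsonb": "VARCHAR(65535)",
    "uuid": "CHAR(36)",                # stored as its text representation
    "serial": "INTEGER",
    "bigserial": "BIGINT",
    "timestamp with time zone": "TIMESTAMPTZ",
    "boolean": "BOOLEAN",              # watch out: DMS may map this to varchar
}

def redshift_type(pg_type: str) -> str:
    """Return a Redshift type for a Postgres type, defaulting to the same name
    for types both engines share (INTEGER, BIGINT, VARCHAR(n), NUMERIC, DATE, ...)."""
    return PG_TO_REDSHIFT.get(pg_type.lower(), pg_type.upper())

print(redshift_type("jsonb"))    # VARCHAR(65535)
print(redshift_type("integer"))  # INTEGER
```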
Conclusion
Using this approach we transfer over 50 GB of data to Redshift on a weekly basis, at approximately $500 per month.
We also have the luxury of increasing the number of tables being replicated without a significant addition in cost.
In conclusion, data replication using traditional approaches has several limitations, but with features like logical decoding, data can be streamed to data warehouses continuously without noticeable performance overhead on the production databases.
This post is an attempt to share the hybrid approach developed at Synup for continuous, controlled replication of data from Postgres to Redshift.
If you have challenges similar to this or want to discuss it further, feel free to reach out to us.