A Data Migration Story, Part I: Architecture of a Databricks Migration

Photo by Jonathan Kemper on Unsplash

There has been an explosion of cutting edge data solutions in the 2010s, many of which target the growing problem of scalable storage with big data. Technologies like these are built on the shoulders of giants, immensely complex behind the scenes, but powerful and manageable to use once set up. The tricky part is that set up: how can we efficiently and reliably migrate data to these new powerful solutions?

The goal of this article is to outline a success story on one of those migration techniques, hopefully filling in the gaps where documentation is a bit too pedantic. This success story is admittedly very narrow in scope — each legacy source system (e.g. Oracle or SQLServer) comes with different rules and requirements, and this article only covers an Oracle migration. Disclaimer aside, Oracle databases are common and especially complex. Hopefully, this article helps guide other teams, to avoid the blunders we made and skip straight to the lessons learned.

With the goal of being both succinct and not missing any important details, I will break out my team’s success story into a few separate pieces. This is part I. Future parts will zoom in on the technical aspects and parameters of each tool in our new data pipeline.

How do you decide on your data migration solution?

My team had chosen Databricks long before I had any say in the matter, but given the choice I probably wouldn’t have changed anything. What is Databricks, and why did we choose it? Databricks is a unified analytics platform, comprised of two central components, compute and storage, packaged together with extra functionality on top.

The exceedingly helpful aspect of Databricks is that it fits a perfect niche between too-out-of-the-box and too-infrastructure-heavy. Sure, you could use Snowflake — but then you can’t code complex ETL Spark pipelines in the same platform as your storage. Or, you could set up your own Spark infrastructure, deal with a lot of moving parts to connect all the dots with scaling storage and compute, and probably end up paying more anyways. Databricks is the goldilocks zone.

Deciding on Databricks was really the easy part. The next step was planning how to get our data from a 10-terabyte Oracle 12.2 Database into our Databricks instance.

Here is another fork in the road, and unlike the decision for Databricks, this one was a bit more difficult. These are the questions we asked ourselves, in order to uncover the migration architecture that best fit our use case.

  1. Snapshot or Change Data Capture (CDC)? Are we migrating once and connecting current data collection systems to the new data system? Or is the plan to stream data, keeping your source database active as the permanent record keeper? Our plan required CDC, narrowing our options a bit.
  2. How much? How much data are we working with, and how many resources (engineers, capital, and time) do we have to work with? Our main constraint was our schedule, but a lack of engineers would have led us to GoldenGate and a lack of capital would have pushed us towards native JDBC.
  3. Where is it coming from? What sort of legacy data stores are we working with? More on this below.

Question three is really where the decision matrix grows complex: different legacy data stores truly need different solutions, and you absolutely need to consider the nitty gritty of the data store before finalizing your decision. For example:

Our source data is coming from an AWS Managed, Oracle Version 12.2, Read Replica, on a VPC. Each of these four components matters, because they create slight differences in architecture and set-up parameters. To elaborate, the documentation for Oracle specifies a radically different setup for DMS than, say, SQL Server, and if your database is a Read Replica then certain solutions are simply dead ends, with no good way to perform CDC. I wish I could deliver a full decision tree, but with at least a dozen potential data stores and many variables in each one, this sort of feat would be deserving of an entire textbook. For Oracle, at least, here are the specific questions to consider:

  1. Is the database AWS Managed or Self Served?
  2. What version of Oracle does the database use?
  3. Is the database a Read Replica, or do you have source access?
  4. Are you the DBA of your source database, and if not, can you request changes to the source database?
  5. Do you have any cloud-specific restrictions, like a VPC or multi-AZ migrations?

Our team quickly settled on the fact that AWS Database Migration Service (DMS) was probably our best option. We also considered a native JDBC connection and Oracle GoldenGate. DMS is the most documented solution and arguably the cheapest for streaming. JDBC is always a good backup because it’s free after compute costs, and Oracle GoldenGate would probably be fine, but expensive to license. For these reasons we chose to go with the following solution:

Outline of technical architecture for our Databricks Migration

Following from left to right, the data starts in a client-managed database on AWS RDS and uses Oracle Data Guard to transfer that to a read replica. These pieces are listed as “semi-out of our control”, because we do not own these systems. We can, however, request certain permissions, a necessary component for our migration, and one that took place at 3am on a weekday (Production databases, much to the displeasure of my sleep schedule, can only be modified during off hours). It should be noted that the ability to work with the team that manages your source data is almost always a requirement of a successful database migration, because streaming demands the enablement of certain logging permissions. If you can’t enable supplemental logging on your Oracle Source database, you most likely will not have success streaming data.

A DMS task then links the source endpoint read replica to our target endpoint S3 bucket. This S3 bucket is mounted to Databricks, and from there we can leverage Autoloader to stream the data from S3 into a Databricks Database.

The mounting is optional — it basically just means that your S3 bucket is a folder within the Databricks File Store (DBFS). The difference is having a path to your S3 bucket of dbfs://…/s3-bucket/ versus s3://…/s3-bucket/

At the end of this pipeline, our data is accessible in Databricks, streaming from the source database with a delay time of >1 minute without additional tuning. Considering the 2500 tables and multiple hoops our data needs to jump through to get from A to B, this is an incredible feat of modern technology. Furthermore, this is without any tuning — we are confident that properly partitioned tables and ideal DMS settings will bring us down even further.

This is the end of Part I. Part II will describe the exact processes and permissions needed for actually running this pipeline successfully. To reiterate some key pieces from Part I, a bit of “Things I wish I knew before starting my data migration”, here’s a recap:

  • Before deciding on your data migration solution, make sure you detail the specifications of your source database: Is it Oracle or SQLServer? Is it self-managed or hosted on RDS? Are you dealing with a read-replica, or the original data? These specifications are your boundaries for the “Art of the Possible”
  • After having the specifications of your source database, specify your solution requirements: do I need CDC? Is resiliency a high concern? do we have fiscal restraints? These specifications help outline where you should direct your focus.
  • After outlining your migration architecture, reexamine the restrictions of each component (Source endpoint, migration tool, target endpoint). Check and make sure you will not hit any dead ends, due to a source database that is incompatible with your proposed solution. I cannot stress this aspect enough.

I usually attempt development with a fast-and-loose mindset, preferring to deal with the problems when they get there rather than planning meticulously. Although many refer to these techniques with phrases such as “reckless development” or “careless coding”, I prefer the phrase “highly agile development”. With a database migration, however, a hybrid agile approach should absolutely be considered. This is because building a migration pipeline is similar to building the real sort of pipe: without a plan that explores every potential point of failure, every obstacle, your migration is likely to fail in the latest stages of development. When you realize a mountain prevents you from laying down the last pipes, the system might be too complex to fix. Then, it’s back to square one.

Chemical Engineer by education, Data Engineer by trade. Easily excited, mainly by topics from endurance sports, to pour-over coffee, to biotech trends

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store