Comparing Data Engineering Processes
ETL, ELT and more: What options are there and when do they work best?
In data engineering, a chief skill is building data pipelines. The pipeline processes take data from disparate sources and move it to the best location to enable data science and analysis teams to derive insights from that data. How is that done, though? In this post, I'm going to delve into a couple of approaches in more detail.
Extract Transform Load (ETL)
Extract Transform Load has traditionally been the method for data engineers to get data to where it needs to go. ETL has been around for decades and consists of three steps:
- Extract: This is the process of pulling the data from a source. This can range from another database, to JSON data from an API, to CSVs or other flat files. Usually, the extract step involves pulling data from more than a single source.
- Transform: This involves cleaning and preparing data for the target system. For example, one source may have US addresses with full state names, while another uses two-letter abbreviations. This data would need normalization to match the target as part of transformation. Transformation is also necessary if sensitive data is in the source (for example, Protected Health Information) and needs to be scrubbed before it goes to the target system. Transformation is the most complex piece of the puzzle, as there's a lot of analysis that needs to take place. It takes some critical thinking to determine how to clean the data, and is rarely a one-size-fits-all situation. Quality of the source data can also vary widely.
- Load: Once the data has been cleaned and transformed, it is loaded to the target system. Typically, this would be a data warehouse. Often, but not always, the transformed data is also first loaded to a staging table for a final review before it's officially loaded into the target.
ETL allows data from a variety of different sources to be unified in one place. Most often, the use case is to take data from various business systems and allow them to be parsed, analyzed, and acted upon. This final data can be used to build visualizations or machine learning processes, to name a couple of examples.
ETL is advantageous because it's predictable and repeatable. You can write code and schedule it to automatically run through the three steps, and you can easily audit the transformation process. Once it's done, the data scientists and analysts on the other end know they have complete data that's ready to be used for their purposes.
As a practical example of ETL, I've had to take data from an Oracle SQL database and unify it with data from a MS SQL Server database. The Oracle database was owned by another team, and the SQL Server database was owned by our team. I built multiple ETL pipelines that ran at various scheduled intervals, and each one would extract data from Oracle, combine with existing data from SQL Server, clean and transform into a final dataset, and then load to a table within SQL Server.
ETL doesn't necessarily have to be a sophisticated or complex process. Personally, I was doing what amounted to ETL in my job for awhile before I actually knew the term, by writing Python scripts to pull data from multiple DBs and land the data in a destination table. At the end of the day, you are moving data from points A, B, and C to Z, with one or more intermediate stops along the way.
To use an analogy, I liken ETL to having a long list of groceries and multiple stores to buy them from. The job of the data engineer would be to get the right groceries from the right locations (extract) in the proper quantities (transform), and bring them to the kitchen (load), so that the cook (data scientist or analyst) has everything they need in one place to make a meal.
ELT (Extract Load Transform)
Another data integration pattern is Extract Load Transform, where the final two steps in the process are reversed. While it's an approach that has really started to mature and gain broad adoption, it is much newer to the data engineering scene than ETL is. However, there are a lot of cases where it makes sense to tackle data pipelines in this way.
In an ELT approach, the extraction step again comes first. Since you have to have the data before you can put it anywhere else, this step really can't be moved. However, rather than first transform the data, an ELT process involves loading the data and transforming it later on.
ELT has gained popularity over the last decade thanks to the power of the cloud. Large data sets can be stored in powerful cloud-first data platforms such as Snowflake, BigQuery, or Redshift, and the transformation can happen later within the warehouse. When the data is loaded in an unstructured, pre-transformation status, this is known as a "data lake" rather than a data warehouse. The biggest advantages of ELT are the speed of data availability, and the flexibility of the process versus the tighter requirements of ETL.
In ELT, you don't have to wait for a process to run to transform and load the data. Instead, the data can be moved into the data lake as it becomes available. Transformation also happens only when needed, which can save you from transforming data that may never even be used for analysis. When there is a need for faster access to data, this can be highly beneficial, but it's important that the downstream teams are on the same page as far as understanding that the data is not in a fully-transformed state and may be less reliable.
Lack of reliability is the big trade-off with ELT. Having a tested, tried-and-true transformation step in the middle ensures you know what your data is going to look like when it gets to the warehouse. It's repeatable and testable. That is lost in the case of ELT.
As a very general rule, it's best to go the ELT route only if you have extremely large sets of structured and unstructured data that you need to move, and if the teams analyzing the data always need their hands on the latest data quickly. Otherwise, ETL tends to remain the best option. That being said, these are only patterns of data integration, and not necessary strict approaches. You don't need to pick one or the other for everything. They are different tools that serve different purposes.
Most of the time, the right approach to solving a data integration problem almost always involves a blend of ETL and ELT. This can even evolve to what you could call ETLT, where some transformation is done early on and more is done later. For example, within a largely ELT process, you might still need to scrub some sensitive data out between the extract and the load. Then, additional transformation can take place later on once the data is in the warehouse. As another example, perhaps as a team starts to use the data, they may find come common data points that are always needing to be transformed. It may then make sense to move that transformation step up in the process, prior to loading the data into the warehouse.
Deciding between ETL, ELT, or a hybrid ETLT approach requires asking the right questions upfront. Here is just a sampling of questions you will want to consider to help guide you to the right decision:
- How reliable does the data need to be? (higher reliability -> ETL)
- How fast do teams need access to the data? (faster access -> ELT)
- Are the downstream teams comfortable knowing the data they are working with may be in an "unfinished" state? (high comfort with unfinished data -> ELT)
- Is the data being moved as part of a nightly or weekly batch process, or is it a closer to real-time streaming process? (streaming -> ELT)
Once you pick the right mix of ETL and ELT for your use-case, it is a good idea to build out the pipeline and run it for a period of time as a test, to make sure it's working the way you expect and supporting the downstream teams' work in the way they expect. Having ample testing serves both to validate the data itself and the integration approach taken.
If downstream teams don't feel like the chosen approach is best supporting their efforts, it is best to modify the approach at this point in the process, rather than accumulating data and then trying to shift to a new strategy down the road. Refactoring any code can be challenging, but data-related code is particularly difficult to go back and modify because of the challenge of not just working with the code itself, but with the data it has created as well. If at all possible, avoid shortcuts and quick fixes, and focus on getting the right approach the first time, or the technical debt can be extremely costly later on.
Moderately spicy tech take of the day:— Alex Popoutsis (@apopoutsis) October 5, 2021
Database development is the area where technical debt is both the easiest to accumulate, and the hardest to pay off later.
Agree or disagree?
Other Integration Approaches
ETL and ELT tend to get thrown around the most these days, as far as data engineering practices are concerned, but these are far from the only patterns to know and be aware of. There's certainly more to cover than I can possibly get to in a single blog post, but I did want to highlight a few other terms to keep in mind.
- Data Replication: This is the copying of an entire table from one location to another. There isn't really an extraction step here since you're not running a query or hitting an API, you're just moving an entire table. This is useful, for example, if you have a production environment with a high load, and you want to copy data to a separate environment for data analysts to use without putting the live environment at risk. Otherwise, a badly-written query could run the risk of production downtime.
- Change Data Capture: This is the process of tracking data that has been changed, for example by monitoring a timestamp or using a table trigger, and acting based on those changes. This can allow near real-time awareness of changes to a source database, giving the ability to quickly get the changed data moved to a target system.
- Master Data Management (MDM): This is an approach to managing data from many source systems. In MDM, you have a set of master records with "golden values," and as data is loaded from various sources, it must be reconciled against the golden values to determine whether any records from a source should either modify an existing golden record or create a new one.
If you've made it this far, here are the key things to remember from this post:
- Extract Transform Load and Extract Load Transform are the two common patterns of data integration pipelines.
- You'll often need to combine the two approaches for a project, sometimes even combining them into Extract Transform Load Transform.
- It's important to ask good questions about your data and the desired end result to make sure you're choosing the right pattern or blending the two patterns appropriately.
- Beyond ETL and ELT, there are other data integration processes and patterns to keep in mind, such as data replication, change data capture, and master data management.
I hope you've found this post beneficial! I'm planning on sharing much more on data engineering in the weeks and months ahead, so keep up with me here and on my Twitter page for the latest. I'm always ready and willing to answer your data engineering questions as well, so feel free to post them in the comments below or tweet them to me!
Enjoyed this post? Consider buying me a coffee to fuel the creation of more content.