Learning about ETL - a founding engineer's personal account

in Data Engineering
 • Updated on 

There came a point in my journey working for hired.com where we began to struggle with our data architecture. In this article, I want to share some of the lessons learned in the hope that they are useful to you in your own engineering journey.

Hired started as a Ruby on Rails web application backed by a single Postgres database. All data (customer records, click logs etc… was stored in that database). The whole team was made up of product engineers — people who were adding features and who were intimately familiar with the application and the data that it generated. The team was also small so when the database needed to change or customer data needed to be updated everybody knew about it.

As time passed the business grew. The engineering team expanded and we added a real-replica read-only slave database to the architecture so that we could run analytics queries without impacting the customer experience… but this simple application structure worked well… for a while.

There was a distinct point in the company’s evolution where this simple setup started to break down and the lessons learned are worth sharing.

When things started to change

The simple data architecture we’d used really started to break down when we started adding teams.

The first team we added was the data analyst team. They were tasked with helping management make decisions by using data to answer business questions.

Their arrival represented a new dependency placed on our Postgres production databases. To do their job they needed a solid, unchanging database that they could build their work on top of and they didn’t have that.

Building on top of the production database meant —>

  • Data analysts’ dashboards would constantly break every time the engineering team needed to make a change to the db format.
  • Even if dashboards weren’t breaking they were often wrong because they were based on columns in tables whose meaning had changed as the product changed.
  • Queries were often SLOW. The production database schema was not optimized for analytics queries.

We soon added another data dependency — a data science team. Things got even worse.

Data scientists built on top of the production database so…

  • Their models constantly broke
  • Model quality was often not great because the team wasn’t clear about the quirks of how the data was being collected
  • Their transformation needs were greater than the analyst teams needs so long running queries were VERY painful

Clearly something had to be done so we started looking into building a more sophisticated data architecture.

We added a Data Warehouse

The first thing we tried to do to fix our data issues was add RedShift. Knowing nothing about data engineering we spent 15 minutes booting up a RedShift cluster and then assigned out of our web engineers to build a custom solution in Ruby that would execute on a cron and copy our production data tables into RedShift.

Note — this didn’t really SOLVE our problem… it just moved it. Production schema changes would now end up breaking the ETL system. We had better error handling capabilities and queries sometimes ran a bit faster since they were executing on a RedShift cluster but we hadn’t really de-coupled our web application database from our analytics data.

Looking back now I know that just copying tables over is not a very good solution for analysts and data scientists.

Analysts need to be able to run queries with a lot of joining and aggregating and these operations are fairly slow on a RedShift cluster if you have to join against several tables containing different pieces of data. It’s better to spend some time and design a database schema that makes business queries quick, simple and easy.

Data science teams usually need to do MUCH more complicated data manipulation. Often times they will store their datasets in files on services like S3 or google drive and process the datasets with tools like Spark only loading data into transactional databases or data warehouses when people want to interactively query some measure the data science team has come up with.

Ultimately the home-brewed ETL solution didn’t work because nobody maintained it. The analyst team was unhappy that they didn’t have access to the data they needed and the engineers were too busy building product to constantly tweak thee ETL code.

That’s what led us to purchasing an ETL solution.

We used managed ETL tools

We thought it would be a good idea to just buy an ETL solution rather than build our own. So we started doing research. We ultimately ended up purchasing a solution from a vendor and setting it up. It wasn’t a great experience.

Again, the only thing the vendor’s tool allowed was direct database to Redshift copy. This meant that we STILL had our analytics data coupled with our production database.

There were more issues as well —

The solution was managed by our vendor and the performance was flaky. Sometimes ETL jobs would fail or fail halfway through and we had no good way to troubleshoot issues.

We had to give the vendor access to our entire system and purchase a support contract.

Looking back I think we had picked a poor vendor, but also we had mis-diagnosed the problem and had picked a poor solution. Batch ETL processing is super tricky to get right and hard to troubleshoot. In normal engineering you write your program and immediately try it out. If something’s wrong you know almost instantly and you can go and fix it. With batch jobs each ‘run’ of your program can take hours to complete. Often you don’t realize that there’s a bug in your code until hours after you start your run. If you’re relying on a vendor and something about your data blows up your vendor’s ETL process you’re stuck sitting around waiting for their support team.

It’s hard to do quality engineering when cycle times are so slow. That’s why I think better solutions involve real-time event stream-based processing.

We started migrating to event based data pipelines

We did finally start solving that database coupling problem. The shift came as we began to understand that a lot of companies fill their data warehouse via event streams.

Event streams are another approach people take to data engineering. Instead of copying a database from one place to another, you introduce code into your application logic that emits events. These events are transmitted to a centralized place where they get processed. In some cases data in events gets written into a data warehouse like RedShift. In others it might end up in a file on S3 in preparation for a spark job.

There are some huge advantages of emitting events from application logic instead of copying data from the database.

  1. The event’s ONLY job is to get accurate data to the data warehouse. Contrast that with your application transactional database which exists to service and track customer transactional data and generally make the web application work. The database will be constantly changed by the engineering team every time they need to satisfy a new product requirement. The code that fires the event rarely (if ever) needs to be changed.

  2. Events don’t dictate eventual data structure. Your data engineers can design a schema suitable for your analysts and put the appropriate data into that scheme.

  3. Application code can be tested. You can write a test to make sure that your event always fires and contains the correct keys. If an engineer breaks your test they’ll know and they can fix it BEFORE rolling breaking changes out to production.

Events can be fired from anywhere… application code, client side mobile apps, anywhere. You can do much more complex data post-processing… load the event into Kafka and use python or Spark to parse complicated event payloads or run multi-step processes. It just works better. Your data gets processed in real-time. You can have results appear in the data warehouse without the delay involved in a lot of batch processing jobs. They’re easier to troubleshoot because you can collect failed events and examine (and then replay) them individually.

If you’re interested in learning more I recommend checking out Kafka.

Data is different and there’s a lot to learn. The point of this story is really that there’s a lot to learn about data engineering.

A lot of people who start trying to solve data problems come from the web world and try to apply their old skills to this new domain. Some skills and approaches are useful and some aren’t.

I hope my story is helpful. If you have questions or thoughts feel free to reach out to me on twitter — @wschlender

Get started. Try FlyData.
Quick setup. No credit card required. Unlimited sync during trial.
Fast and secure Enterprise-grade security and near real-time sync.
World-class support Proactive monitoring from technical experts, 24/7.
Straightforward pricing Pay for the rows you use, and nothing you don’t.