How we built our new modern ETL pipeline, part 1 — Inside GetYourGuide


Let’s first go through what each component does individually, and then tie everything together at the end.

Debezium


  • Open source distributed platform for change data capture (CDC), maintained by Red Hat

  • Works as a Kafka Connect connector

  • Streams the database’s event log to Kafka, e.g. the binlog in the case of MySQL
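As a rough illustration, a Debezium MySQL connector is registered with Kafka Connect via a JSON configuration along these lines (the connector name, hosts, credentials, and table list below are placeholders, and property names vary between Debezium versions):

```json
{
  "name": "mysql-booking-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql.internal",
    "database.port": "3306",
    "database.user": "cdc_user",
    "database.password": "********",
    "database.server.id": "5400",
    "database.server.name": "production",
    "table.include.list": "production.booking",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.production"
  }
}
```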

Avro Converter

  • Picks up the Avro files that the Kafka Connect sink writes to S3

  • Converts them to Parquet change logs for the downstream steps

Schema Service

  • Scala library

  • Keeps track of all schema changes applied to source tables

  • Holds tables’ PKs, timestamp, and partition columns

  • Prevents breaking changes from being introduced, e.g. a column type change


Upsert

  • Spark application

  • Reads in the Parquet change logs generated by the CDC pipeline (JDBC Loader or Avro Converter)

  • Compacts change logs based on the table’s PK

  • Creates a Hive table which contains an exact replica of the production database

Now that we’ve explored all of the components individually, let’s see how they work together.

Everything starts with the CDC pipeline. Currently we have two ways of ingesting data from source databases: Debezium or JDBC. Although we’re moving away from JDBC in favor of Debezium, we’ll cover both scenarios in this article.

What’s important to notice is that at the end of the CDC pipeline, we should have several Parquet files which represent the change logs from the source databases. These files may contain multiple records for the same PK, which is why we run an upsert afterwards to compact those records.
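For illustration, the change-log files at this stage can be pictured as rows like the following (the field names here are hypothetical, not the actual schema):

```python
from collections import Counter

# Illustrative change-log rows: the same PK can show up several times,
# once per captured change.
change_log = [
    {"id": 42, "status": "pending",   "updated_at": "2021-03-01T10:00:00"},
    {"id": 42, "status": "confirmed", "updated_at": "2021-03-01T10:05:00"},
    {"id": 43, "status": "pending",   "updated_at": "2021-03-01T10:01:00"},
]

# Counting rows per PK shows why compaction is needed downstream:
counts = Counter(row["id"] for row in change_log)
print(counts[42])  # 2 -> two versions of the same record await compaction
```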

CDC with JDBC Loader

The JDBC Loader task is triggered via Airflow for a specific table; it connects to the source database, reads in the data that has been updated (or all data for small tables), and then writes out Parquet files.
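A minimal sketch of that incremental-vs-full read decision might look like this (the timestamp column name and watermark handling are assumptions, not the actual implementation):

```python
from typing import Optional

def build_extract_query(table: str, timestamp_column: str,
                        last_watermark: Optional[str]) -> str:
    """Full extract when no watermark is given (small tables), else incremental."""
    base = "SELECT * FROM {}".format(table)
    if last_watermark is None:
        return base
    return "{} WHERE {} > '{}'".format(base, timestamp_column, last_watermark)

print(build_extract_query("booking", "updated_at", "2021-03-01 00:00:00"))
# SELECT * FROM booking WHERE updated_at > '2021-03-01 00:00:00'
```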

CDC with Debezium

As we mentioned before, Debezium is constantly reading the databases’ event logs and publishing them to Kafka. At the other end of these topics, we have another Kafka Connect sink which writes out Avro files to S3. The Avro Converter then picks up these files and converts them to Parquet.

Note: if there’s an inconsistent type change on any field, the Avro Converter or JDBC Loader will consult the Schema Service and drop that specific field, so that we don’t end up with schema problems in the final table.
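The field-dropping behavior can be sketched as follows; the real Schema Service is a Scala library, so the names and the type-checking approach below are illustrative assumptions:

```python
def drop_inconsistent_fields(records, registered_schema):
    """Drop fields whose incoming type disagrees with the registered schema."""
    cleaned = []
    for record in records:
        cleaned.append({
            field: value
            for field, value in record.items()
            if field in registered_schema and isinstance(value, registered_schema[field])
        })
    return cleaned

registered = {"id": int, "price": float}
# "price" arrived as a string: an inconsistent type change, so it gets dropped.
records = [{"id": 1, "price": "19.99"}]
print(drop_inconsistent_fields(records, registered))  # [{'id': 1}]
```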

After the CDC pipeline has completed, Upsert is the next component executed. The goal of this step is to generate, for each source table, another table in the data lake under the db_mirror schema. So for the source table booking, you’ll have another table called db_mirror.booking.

To do that, the Upsert component does the following:

  1. Reads in the Parquet change logs

  2. Compacts records based on the table’s PKs

  3. Reads in the current data from the db_mirror table

  4. Performs a FULL OUTER JOIN between the compacted logs and the current db_mirror data based on the table’s PKs

  5. Selects the latest record based on its timestamp

  6. Deletes old records that were marked as deletions

  7. Generates the final Hive table
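The steps above can be sketched on a single machine like this; the real component is a Spark application, so plain dicts keyed by the table’s PK stand in for DataFrames, and the field names are assumptions:

```python
def upsert(current, change_log, pk="id", ts="updated_at"):
    # Steps 1-2: compact the change logs, keeping the latest record per PK.
    compacted = {}
    for row in change_log:
        if row[pk] not in compacted or row[ts] > compacted[row[pk]][ts]:
            compacted[row[pk]] = row
    # Steps 3-5: "full outer join" against current data, picking the newest record.
    merged = {row[pk]: row for row in current}
    for key, row in compacted.items():
        if key not in merged or row[ts] > merged[key][ts]:
            merged[key] = row
    # Step 6: drop records marked as deletions.
    return [row for row in merged.values() if not row.get("deleted", False)]

current = [{"id": 1, "status": "pending", "updated_at": "t1"}]
changes = [
    {"id": 1, "status": "confirmed", "updated_at": "t2"},
    {"id": 2, "status": "pending", "updated_at": "t2"},
    {"id": 2, "updated_at": "t3", "deleted": True},
]
print(upsert(current, changes))
# [{'id': 1, 'status': 'confirmed', 'updated_at': 't2'}]
```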

For big tables, the Upsert component also optimizes the FULL OUTER JOIN operation by selecting only the partitions that have been affected by the incoming change logs.
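The partition-pruning idea amounts to deriving, from the incoming change logs, the set of db_mirror partitions that actually need to be read (the partition column name below is an assumption):

```python
def affected_partitions(change_log, partition_column="created_date"):
    """Return only the partition values touched by the incoming change logs."""
    return sorted({row[partition_column] for row in change_log})

changes = [
    {"id": 1, "created_date": "2021-03-01"},
    {"id": 2, "created_date": "2021-03-01"},
    {"id": 3, "created_date": "2021-02-15"},
]
# Only two partitions participate in the join instead of the whole table:
print(affected_partitions(changes))  # ['2021-02-15', '2021-03-01']
```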

At this point we have finished extracting data from the source databases using either Debezium or the JDBC Loader, and are now ready to dive into the transformations in part two, coming up. Stay tuned.

Are you interested in joining our engineering team? Check out our open positions.


