3 Essential Questions to Address When Building an API-Involved Incremental Data Loading Script

TLDR

This article explains both the conceptual framework and the practical code implementation for syncing data from API endpoints to your database using dlt (a Python library). By the end of this tutorial, you will understand (and know how to implement) the sync behavior in the following illustration, where we extract data incrementally and write to the destination table with the merge (dedup) strategy.

Image by Author

The context

- You want to sync data from an application (e.g., ads performance, sales figures) to your database.
- Your application provides API endpoints to retrieve its data.
- The data needs to be synced daily to your database.
- You want to load only the ❗“new” data (or the changes)❗ into your database. You do not want to load the entire set of data all over again every time you sync.

How would you do it in Python?

I will walk you through the solution by addressing the following 3 questions.

💥 Question 1: What do I need from an API to sync data incrementally?

Before developing an incremental loading script, we need to understand the behavior of the API endpoints we are working with.

❗Not all APIs can facilitate incremental loading.

👉 Answer: Query params that support incremental loading

Let’s look at an example of an application (or “source” application) that tracks your sales performance. In this application, each record represents a product along with its sales volume. The fields created_at and updated_at indicate when the record was created and updated.

Changes in the sales data typically occur in two main ways:

- New products are added to the list.
- Updates are made to the sales figures of existing records, which results in a new value for updated_at. This is what lets us track changes; without it, we cannot know which records have been modified.

👁️👁️ Below is the example sales table in the source application’s database.

↪️ Yesterday’s data: 2 records

Image by Author

↪️ Today’s data: a new record added and a change made to an existing record

Image by Author

🟢 Takeaways: If the API endpoint allows queries based on the updated_at parameter, you can implement incremental loading by making requests to retrieve only records that have an updated_at value later than the most recent updated_at value saved from the previous sync. In this context, updated_at is referred to as the incremental cursor, and its value, which persists through to the next sync, is known as the state.

The updated_at field is a common choice for an incremental cursor. Other query params, such as id or sales, cannot help us to request data incrementally as they cannot tell us which records have been added or updated since the last sync.
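To make this concrete, here is a minimal sketch of such an incremental request. The endpoint URL and the updated_at_gte filter param are hypothetical; check your API's documentation for the exact filter it supports.

```python
import requests

# State saved from the previous sync (hypothetical value)
last_updated_at = "2024-06-03T00:00:00Z"

# Hypothetical endpoint and filter param: ask only for records
# changed since the last sync.
response = requests.get(
    "https://yourbaseurl.myapp/v1/sales",
    params={"updated_at_gte": last_updated_at},
)
response.raise_for_status()
new_or_updated_records = response.json()

# After loading, persist the new state for the next sync.
if new_or_updated_records:
    last_updated_at = max(r["updated_at"] for r in new_or_updated_records)
```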

Which query param do you need to load data incrementally?

Image by Author

Since we're developing a data loading script that works with an API, I'll introduce two other important aspects of APIs for the code implementation: pagination and path parameters. They have nothing to do with incremental loading, though.

🤷 Pagination mechanism

APIs often return results in small chunks to enhance performance. For instance, instead of returning 10,000 records at once, an API might limit the response to a maximum of 100 records per request, requiring you to iterate through subsequent batches.

To manage this, you typically (not always) need to use two query parameters: limit and skip (or offset).

Here’s a simple example to illustrate:

For the first request:

limit=100
skip=0

For the second request, to skip the first 100 records we’ve already synced:

limit=100
skip=100

This pattern continues, incrementing the skip value by the limit after each batch until all records are retrieved.

🟢 Takeaways: You need to understand how APIs return responses so that you won't miss any records while extracting. There are many approaches an API can use to manage pagination beyond the commonly used limit/skip (offset) method. But that's a story for another day.
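As a rough sketch, a limit/skip pagination loop might look like this, assuming a hypothetical endpoint that returns an empty list once all records have been served:

```python
import requests

BASE_URL = "https://yourbaseurl.myapp/v1/sales"  # hypothetical endpoint
LIMIT = 100

def fetch_all_records():
    records, skip = [], 0
    while True:
        response = requests.get(BASE_URL, params={"limit": LIMIT, "skip": skip})
        response.raise_for_status()
        batch = response.json()
        if not batch:          # no more pages left
            break
        records.extend(batch)
        skip += LIMIT          # move the window forward by one page
    return records
```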

🤷 Path param

Path parameters are included directly in the URL of an API and are typically used to distinguish between different segments (partitions) of data. For example, they might specify different campaigns within your marketing account or different sub-accounts managed in the source application.

In the example below: the path params are applicationId and campaignId.

https://yourbaseurl.myapp/v1/applications/{applicationId}/campaigns/{campaignId}/sales

🟢 Takeaways: You need to decide whether you will sync data from the same API but with different path params to a single table or different tables (sales_campaign_1, sales_campaign_2, etc.).
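For illustration, here is a small sketch of how you might fill in those path params and decide on destination table names. The IDs and the naming scheme are made up; adapt them to your application.

```python
URL_TEMPLATE = (
    "https://yourbaseurl.myapp/v1/applications/{applicationId}"
    "/campaigns/{campaignId}/sales"
)

application_id = "app_123"                      # hypothetical IDs
campaign_ids = ["campaign_1", "campaign_2"]

for campaign_id in campaign_ids:
    url = URL_TEMPLATE.format(applicationId=application_id, campaignId=campaign_id)
    # Option A: one table per campaign, e.g. sales_campaign_1, sales_campaign_2
    table_name = f"sales_{campaign_id}"
    # Option B: a single "sales" table; add campaign_id as a column so rows
    # from different campaigns stay distinguishable.
    print(url, "->", table_name)
```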

💥 Question 2: How do I want to write the extracted records to the destination table?

Now let's say you have already extracted a batch of records by making API requests with the above-mentioned params; it's time to decide how you want to write them to the destination table.

👉 Answer: Merge/Dedup mode (recommended)

This question concerns the choice of Write disposition or Sync mode. The immediate answer is that, given you are looking to load your data incrementally, you will likely opt to write your extracted data in either append mode or merge mode (also known as deduplication mode).

However, let’s step back to examine our options more closely and determine which method is best suited for incremental loading.

Here are the popular write dispositions.

🟪 overwrite/replace: drop all existing records in the destination tables and then insert the extracted records.
🟪 append: simply append extracted records to the destination tables.
🟪 merge/dedup: insert new(*) records and update(**) existing records.

(*) How do we know which records are new? Usually, we use a primary key to determine that. If you use dlt, the merging strategy can be more sophisticated, including the distinction between merge_key and primary_key (one is used for merging and one for deduplication before merging) or dedup_sort (which determines which record is kept when several share the same key in the dedup process). I will leave that part for another tutorial.

(**) This is a simplified explanation; if you want to find out more about how dlt handles this merging strategy, read more here.
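In dlt, these options map directly onto the write_disposition argument of a resource. Here is a minimal sketch; the table name, key, and sample records are illustrative only.

```python
import dlt

@dlt.resource(
    table_name="sales",
    write_disposition="merge",   # alternatives: "replace" or "append"
    primary_key="id",            # used to decide which records are new vs. updated
)
def sales_resource():
    # In the real pipeline these records come from the API requests above.
    yield [
        {"id": 1, "product": "apple", "sales": 20, "updated_at": "2024-06-20"},
        {"id": 3, "product": "mango", "sales": 10, "updated_at": "2024-06-20"},
    ]
```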

👁️👁️ Here is an example to help us understand the results of different write dispositions.

↪️ On 2024.06.19: We make the first sync.

🅰️ Data in source application

Image by Author

🅱️ ️Data loaded to our destination database

No matter what sync strategy you choose, the table at the destination is literally a copy of the source table.

Image by Author

Saved state of updated_at = 2024-06-03, which is the latest updated_at among the 2 records we synced.

↪️ On 2024.06.20: We make the second sync.

🅰️ Data in source application

Image by Author

✍️ Changes in the source table:

- Record id=1 was updated (sales figure).
- Record id=2 was dropped.
- Record id=3 was inserted.

At this sync, we ONLY extract records with updated_at > 2024-06-03 (the state saved from the last sync). Therefore, we extract only records id=1 and id=3. Since record id=2 was removed from the source data, there is no way for us to recognize this change.

With the second sync, you will now see the difference among the write strategies.

🅱️ Data loaded to our destination database

Scenario 1: Overwrite

Image by Author

The destination table will be overwritten by the 2 records extracted this time.

Scenario 2: Append

Image by Author

The 2 extracted records will be appended to the destination table; the existing records are not affected.

Scenario 3: Merge or dedup

Image by Author

The extracted record with id=1 will replace the existing record at the destination, and the record with id=3 will be inserted. This process is what we call merging or deduplicating. Record id=2 in the destination table remains intact.

🟢 Takeaways: The merge (dedup) strategy can be effective in the incremental data loading pipeline, but if your table is very large, this dedup process might take a considerable amount of time.

💥 Question 3: How do I implement it in code?

👉 Answer: dlt — as it is lightweight, well-documented, and has an active community for support.

dlt is an excellent choice because it provides you with the right level of abstraction. In fact, you can choose how much abstraction you want. As you can see in my example code below, I have taken the liberty of writing my own request loop, but dlt offers helper functions that can do this for you with far fewer lines of code. This flexibility makes dlt stand out compared to other solutions.

You can refer to the diagram for a high-level view and then drill down to the code with detailed remarks below.

Quick note: dlt structures a pipeline using the terms source and resource. A resource typically corresponds to an API endpoint and writes data to a table in the destination database. A source is a collection of resources.
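In code, that structure looks roughly like this; the resource and source names are placeholders, and the yielded records stand in for real API responses.

```python
import dlt

@dlt.resource(write_disposition="merge", primary_key="id")
def sales():
    # Placeholder: the real implementation requests the /sales endpoint.
    yield {"id": 1, "product": "apple", "sales": 20}

@dlt.resource(write_disposition="append")
def campaigns():
    # Placeholder: the real implementation requests the /campaigns endpoint.
    yield {"campaign_id": "campaign_1"}

@dlt.source
def my_app_source():
    # A source is simply a collection of resources; each resource
    # becomes one table in the destination.
    return sales, campaigns
```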

In the illustration below, you can see the answers to the two questions we discussed:

- Answer to Question 1: Make requests to an API endpoint using a date cursor to get data incrementally (and persist the cursor value, also known as state, for subsequent runs).
- Answer to Question 2: Write data to the destination table using the merge strategy.

Now, you might wonder how to run this Python script. I suggest you go to this repository and try it out yourself. The repository also provides a mock API that you can deploy locally for testing purposes. Check out the README for the detailed execution guide.

💥 Here is a snippet of the dlt implementation with my remarks 💥

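Below is a condensed, standalone sketch of this pattern (incremental cursor, paginated requests, merge write). The endpoint URL and the updated_at_gte query param are the hypothetical ones used earlier; the full, annotated version lives in the repository linked below.

```python
import dlt
import requests

BASE_URL = "https://yourbaseurl.myapp/v1/sales"  # hypothetical endpoint
LIMIT = 100

@dlt.resource(table_name="sales", write_disposition="merge", primary_key="id")
def sales(
    updated_at=dlt.sources.incremental("updated_at", initial_value="2024-01-01T00:00:00Z"),
):
    # dlt persists updated_at.last_value between runs -- this is the "state".
    skip = 0
    while True:
        response = requests.get(
            BASE_URL,
            params={
                "updated_at_gte": updated_at.last_value,  # assumed filter param name
                "limit": LIMIT,
                "skip": skip,
            },
        )
        response.raise_for_status()
        batch = response.json()
        if not batch:   # stop once the API returns an empty page
            break
        yield batch
        skip += LIMIT

pipeline = dlt.pipeline(
    pipeline_name="sales_pipeline",
    destination="duckdb",      # swap in your own destination
    dataset_name="sales_data",
)

if __name__ == "__main__":
    load_info = pipeline.run(sales)
    print(load_info)
```

Running the script a second time only extracts records whose updated_at is later than the saved state, and the merge disposition deduplicates them against existing rows by id.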

🟣️ Full code can be seen at the repository here 🟣

✅ This concludes the tutorial. I hope you have learned about the different components that make up an incremental loading script, as well as the code implementation.

If you’re interested in finding out more about how to build an incremental loading script with dlt, check out their documentation here.

About me

I am Daniel Le, a Data Engineer based in Berlin, Germany, with a great passion for Machine Learning.

I am interested in new technologies and how they can be implemented to solve real-world problems.

Should you have any inquiries or wish to discuss these interests further, please do not hesitate to connect with me on LinkedIn.

Reference

https://dlthub.com/docs/general-usage/incremental-loading
