Building an eCommerce ELT Pipeline from the Ground Up


If you’re looking to build a data and analytics solution for your brand, you have three options:

1. Create an ELT+BI stack by connecting a series of data tools:

  • An extractor tool (e.g., Fivetran, Stitch)
  • A data warehouse (e.g. Snowflake, BigQuery, Redshift)
  • A BI/Viz tool (e.g. Looker, Tableau, Power BI)
  • (Optional) A Customer Data Platform (e.g., Segment, Blueshift, Simon Data)
  • (Optional) A reverse ETL tool (Hightouch or Census) to push data and enrich marketing channels
  • (Optional) A data management tool (e.g., Panoply.io, Integrate.io)

2. Build the entire ELT pipeline from the ground up and connect with a data warehouse & BI solution 

3. Leverage a modular data platform that includes ELT+BI & Reverse ETL/CDP

We’ve covered the first build option in a previous article. But in this piece, we’re going to cover option 2: what you must understand and consider if you’re looking to build a ground-up ELT pipeline for a hyper-custom data solution. 

First: on APIs 

As APIs will be where your ELT work starts, that’s where we’ll start. 

API Architectures and Rate Limits

Although we won’t compare the technical differences among API architectures (that’s a separate article altogether), we’ll call attention to different types that you’ll encounter.

API architectures vary among platforms: you’ll encounter REST (i.e., RESTful) APIs, GraphQL APIs, and others that can’t be as neatly categorized, such as some of Klaviyo’s endpoints.

For our API discussion going forward, we’ll introduce and stick to using Shopify’s APIs as an example.

Shopify offers REST and GraphQL Admin APIs for app and integration building.

  • Some companies offer more than one set of APIs, but they may have preferences about what APIs you use. In the case of extracting data from Shopify, the GraphQL option may be the way to go: Shopify actively promotes and “believes in the future of GraphQL.”

API Rate Limits

Shopify APIs use the classic “leaky bucket” algorithm to handle data requests, and they give specific figures for how often and how much you can hit the APIs.

  • It’s actually quite helpful that Shopify provides this information, as many apps offer minimal or no information about rate limits beyond guidance that amounts to “call less data to avoid tripping errors.”

As the metaphor goes, each tool/app has a “bucket” that can hold 40 API calls, and the bucket must not overflow. Every second, 2 API calls “leak” out of the bucket to avoid overflow, and each of your API calls goes toward filling the bucket. As long as your app doesn’t cause the bucket to overflow, it won’t trip a 429 error. 

The leaky bucket model lets a platform or app burst above the steady rate without penalty, as long as the bursts are infrequent enough for the bucket to drain, and each API has its own way of limiting calls.

For Shopify’s REST API, you can have up to 40 requests in the bucket per app, per store. The API replenishes (i.e., “empties the bucket”) at a rate of 2 requests per second.
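To make this concrete, here is a minimal Python sketch (not a production implementation) of a rate-limit-aware REST call: it watches the X-Shopify-Shop-Api-Call-Limit response header and backs off when the bucket is nearly full or a 429 comes back. The shop domain, access token, and API version below are placeholders you would swap for your own.

```python
import time
import requests

SHOP = "your-store.myshopify.com"   # placeholder shop domain
TOKEN = "shpat_xxx"                 # placeholder Admin API access token
API_VERSION = "2024-01"             # placeholder API version

def get_with_throttle(path, params=None):
    """GET a REST Admin API path while respecting the leaky bucket."""
    url = f"https://{SHOP}/admin/api/{API_VERSION}/{path}"
    headers = {"X-Shopify-Access-Token": TOKEN}

    while True:
        resp = requests.get(url, headers=headers, params=params)

        if resp.status_code == 429:
            # Bucket overflowed: wait as instructed, then retry.
            time.sleep(float(resp.headers.get("Retry-After", 2)))
            continue

        resp.raise_for_status()

        # Header looks like "32/40": calls currently in the bucket / bucket size.
        used, bucket = map(int, resp.headers.get("X-Shopify-Shop-Api-Call-Limit", "0/40").split("/"))
        if bucket - used < 4:
            # Close to overflowing; pause so calls can leak out at ~2/second.
            time.sleep(2)

        return resp.json()

orders_page = get_with_throttle("orders.json", params={"status": "any", "limit": 50})
```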

On the other hand, Shopify’s GraphQL API is limited by query cost (i.e., what fields you are calling) rather than number of queries. You can “spend” up to 1,000 cost points per app, per store at any one time, and the API replenishes at a rate of 50 cost points per second.

  • Most fields, like objects, cost 1 point. Mutations cost 10 points. 
  • These costs may vary, depending on how Shopify calculates your calls. 
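On the GraphQL side, Shopify reports the query cost and remaining budget in the response’s extensions block, so a similar throttling pattern applies. The sketch below assumes that structure (verify it against a real response) and uses placeholder credentials and an illustrative orders query.

```python
import time
import requests

SHOP = "your-store.myshopify.com"   # placeholder shop domain
TOKEN = "shpat_xxx"                 # placeholder Admin API access token
URL = f"https://{SHOP}/admin/api/2024-01/graphql.json"

ORDERS_QUERY = """
{
  orders(first: 50) {
    edges { node { id name totalPriceSet { shopMoney { amount } } } }
  }
}
"""

def run_query(query):
    """POST a GraphQL query and pause if the cost bucket is running low."""
    resp = requests.post(
        URL,
        json={"query": query},
        headers={"X-Shopify-Access-Token": TOKEN},
    )
    resp.raise_for_status()
    payload = resp.json()

    # Shopify reports the query cost and remaining budget in the extensions block.
    throttle = payload.get("extensions", {}).get("cost", {}).get("throttleStatus")
    if throttle and throttle["currentlyAvailable"] < 100:
        # Wait for the bucket to refill at roughly 50 points per second.
        time.sleep(2)

    return payload.get("data")

data = run_query(ORDERS_QUERY)
```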

Identifying Core Endpoints for Optimized Data Extraction

Given the variability of data available via API among platforms, your initial (and ongoing) challenge will be to identify the most efficient ways to call APIs to get the data you need.

Each platform has its own quirks, roadblocks, or challenges for a data extraction process, and you’ll need to thoroughly investigate documentation while always asking yourself, “Is there a better way to GET this data?”

What follows is a moderately deep dive on one of Shopify’s endpoints to illustrate the type of exploration that will be required for each extraction target.

Shopify currently has 18 endpoints, but from an analytics point of view, most of these will not be valuable to you. The most important by far is Orders, which we’ll cover here:

[Screenshot: REST Admin API with an arrow pointing to the non-expanded Shopify Order resource]

You might ask: “Why wouldn’t I want to call the Analytics endpoint if I’m building a custom analytics app? Or, why wouldn’t I hit the Billing, Customers, or Discounts endpoint to get relevant first-party data?”

Answer: because they’re not as relevant as they seem, and/or they’ll create significantly more work for you, which you certainly don’t want or need in a long-term project. 

For example:

  • The Analytics API only allows you to GET the same reports as found in your Shopify store UI, which is what you’re trying to get away from in the first place.
  • The data you would extract from the Billing, Customers, and Discounts APIs can be retrieved from the Orders API.

In short, you want to hit the fewest APIs that give you the most information. The Orders API will give you the most (the next most important ones are Inventory and Products, but we won’t cover those here).

Orders

The Orders endpoint provides a trove of data. With it, you will be able to build numerous custom reports as well as some analyses. 

[Screenshot: the Order resource in the Shopify.dev Admin API docs, bracketed in red]

Specifically, the Order resource within the Orders endpoint is going to give you the most bang for your API buck. And more specifically, the single GET “Retrieve a list of orders” is going to be your best starting point: it alone enables a range of custom reports.

For instance, although there is a Refund resource within the Orders endpoint (see above image), there is also a “refunds” object within “Retrieve a list of orders,” where you can GET refund data: 

[Screenshot: the refunds object from the “Retrieve a list of orders” GET]

This “Retrieve a list of orders” GET is the most data-packed call in Shopify’s APIs. Pictured below is an expanded view of the call (with sample data), showing everything from particular order info to discount, customer, currency, and tax data, and this is still only a portion of a full call. 

[Screenshot: the “Retrieve a list of orders” GET from the Shopify REST API]
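If you want a feel for what calling this GET looks like in practice, here is a minimal sketch that pages through “Retrieve a list of orders” using Shopify’s cursor-based pagination (the rel="next" Link header). The shop domain and token are placeholders, and 250 is the maximum page size.

```python
import requests

SHOP = "your-store.myshopify.com"   # placeholder shop domain
TOKEN = "shpat_xxx"                 # placeholder Admin API access token
FIRST_URL = f"https://{SHOP}/admin/api/2024-01/orders.json"

def fetch_all_orders():
    """Page through 'Retrieve a list of orders' via Link-header cursors."""
    headers = {"X-Shopify-Access-Token": TOKEN}
    params = {"status": "any", "limit": 250}   # 250 is the maximum page size
    orders, url = [], FIRST_URL

    while url:
        resp = requests.get(url, headers=headers, params=params)
        resp.raise_for_status()
        orders.extend(resp.json()["orders"])

        # Shopify returns the next page of results as a rel="next" Link header.
        next_link = resp.links.get("next")
        url = next_link["url"] if next_link else None
        params = None   # the page_info cursor is already baked into the next URL

    return orders

all_orders = fetch_all_orders()
print(f"Pulled {len(all_orders)} orders")
```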

From this single GET, you can summarize the data and build a large number of reports on areas such as the following (a minimal discount-summary sketch follows the list):

  • Discounts
  • Products purchased together
  • Fulfillment statuses

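To give a flavor of what “summarize” means here, the sketch below tallies discount usage from the discount_codes object on each order. The field names follow Shopify’s Order resource, but treat the exact structure as an assumption to confirm against a real response.

```python
from collections import defaultdict

def summarize_discounts(orders):
    """Total usage count and discount value per discount code."""
    totals = defaultdict(lambda: {"uses": 0, "amount": 0.0})
    for order in orders:
        # Each order carries a discount_codes list, e.g.
        # [{"code": "WELCOME10", "amount": "10.00", "type": "fixed_amount"}]
        for dc in order.get("discount_codes", []):
            totals[dc["code"]]["uses"] += 1
            totals[dc["code"]]["amount"] += float(dc["amount"])
    return dict(totals)

# Tiny stand-in for orders pulled from the "Retrieve a list of orders" GET
sample_orders = [
    {"discount_codes": [{"code": "WELCOME10", "amount": "10.00", "type": "fixed_amount"}]},
    {"discount_codes": [{"code": "WELCOME10", "amount": "10.00", "type": "fixed_amount"}]},
    {"discount_codes": []},
]
print(summarize_discounts(sample_orders))
# {'WELCOME10': {'uses': 2, 'amount': 20.0}}
```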
The next steps: Choosing a data warehouse (easy), denesting JSON (hard), and error handling (hard)

Warehouse options

Because Amazon Redshift does not handle semi-structured data easily without time-consuming manual work, we highly recommend selecting either Snowflake or BigQuery for your ELT pipeline; both natively support JSON.

In terms of the Snowflake vs. BigQuery debate, that’s an internal discussion to have and is more about preferences around querying and other tools you may use. E.g.:

  • If you’re comfortable with BigQuery’s SQL query syntax, and you use platforms in the GCP ecosystem, then BigQuery may be the way to go.
  • If you’re platform agnostic and you’d prefer to use standard SQL (which Snowflake supports), then Snowflake might be your best bet. 
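As one concrete illustration of what “natively support JSON” buys you, here is a minimal sketch that loads newline-delimited order JSON straight into BigQuery with schema auto-detection, using the google-cloud-bigquery client; the project, dataset, and file names are placeholders. Snowflake’s VARIANT column type plays a similar role on that side of the debate.

```python
from google.cloud import bigquery

client = bigquery.Client()                       # assumes GCP credentials are configured
table_id = "your-project.shopify_raw.orders"     # placeholder project.dataset.table

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,                  # let BigQuery infer nested/repeated fields
    write_disposition="WRITE_APPEND",
)

# orders.ndjson: one order object per line, dumped by the extraction step
with open("orders.ndjson", "rb") as f:
    job = client.load_table_from_file(f, table_id, job_config=job_config)

job.result()   # wait for the load job to finish
print(f"{client.get_table(table_id).num_rows} rows now in {table_id}")
```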

Denesting and error handling

Although there will be substantial research required to understand the most efficient ways to GET data, this will be a relatively small portion of this project. 

Ultimately, the most time-consuming elements will be dealing with the following complicated projects:

  • Denesting JSON from API calls (there is a lot of nested JSON, especially with Shopify)
  • Building an appropriate error-handling program to highlight any daily workflow errors in the ELT. 

Denesting JSON

When it comes to denesting, you have two options (the first of which will not be feasible, so in reality, you have one option):

  • Build a denester that dynamically denests any JSON: This would be a program that ingests any JSON and turns it into appropriate parent and child tables based on the structure of the JSON. Although this is the clear choice for a custom ELT setup, it is effectively the basis for a company in and of itself. Therefore, it is not feasible as an internal data project at a brand. This was, for example, Stitch Data’s initial product, and it took years to build with a full engineering team.
  • Build individual denesters for each API: This is a significantly faster time-to-value approach, although it will increase the resources required for upkeep. Any time an API changes (which happens at least once per year), you will need to keep your extractor up to date. A minimal sketch of this approach follows below.
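Here is a minimal sketch of what an individual (per-API) denester for Shopify orders might look like: the order’s scalar fields become a parent row, and nested lists such as line_items and refunds become child rows keyed by the order ID. It is deliberately simplified; deeper nesting would need further passes.

```python
def denest_order(order):
    """Split one Shopify order payload into a parent row and child rows."""
    # Parent table row: keep only scalar fields.
    parent = {k: v for k, v in order.items() if not isinstance(v, (list, dict))}

    # Child tables: one row per nested record, keyed back to the order.
    children = {"line_items": [], "refunds": []}
    for item in order.get("line_items", []):
        children["line_items"].append({"order_id": order["id"], **item})
    for refund in order.get("refunds", []):
        children["refunds"].append({"order_id": order["id"], **refund})

    return parent, children

sample = {
    "id": 123, "name": "#1001", "total_price": "49.00",
    "line_items": [{"id": 9, "title": "T-Shirt", "quantity": 2}],
    "refunds": [],
}
parent_row, child_rows = denest_order(sample)
```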

Error Handling in Your Workflow

On the error handling front, you will need to build a separate program that can diagnose issues with your daily ELT workflow. Getting even a crude build up and running is complicated. At absolute minimum, your error handling program will have to:

  • Indicate that you’ve made an invalid request
  • Indicate that you’ve requested too much data

The extra layer (and most important part) of the challenge here is having a program that indicates not only that something in your ELT malfunctioned, but also what specifically malfunctioned, and where it malfunctioned. 

A crude error handling program may indicate a general error, but that won’t necessarily help you triage the data issue. You could spend days (or even weeks) figuring out what really went wrong. 
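As a starting point only, the sketch below wraps each workflow step so that a failure records which step and which endpoint broke, and distinguishes an invalid request from a rate-limit (too much data) error. The step names and helper functions are illustrative, not part of any particular framework.

```python
import logging
import requests

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("elt")

def run_step(step_name, endpoint, call):
    """Run one workflow step and report exactly what failed, and where."""
    try:
        result = call()
        log.info("step=%s endpoint=%s status=ok", step_name, endpoint)
        return result
    except requests.HTTPError as err:
        status = err.response.status_code
        if status == 429:
            reason = "requested too much data (rate limit hit)"
        elif 400 <= status < 500:
            reason = "invalid request (check parameters and credentials)"
        else:
            reason = "server-side failure"
        log.error("step=%s endpoint=%s status=%s reason=%s",
                  step_name, endpoint, status, reason)
        raise

# Hypothetical usage, wrapping the orders extraction sketched earlier:
# orders = run_step("extract_orders", "orders.json", fetch_all_orders)
```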

Option 3: Don’t Build This ELT Pipeline Yourself – Seriously 

We led off this piece by highlighting the DIY-with-tools pipeline (option 1) and the DIY-everything build (option 2), which has been the subject of this article. 

We’d like to offer a compelling third option: don’t build the ELT pipeline yourself. Avoid the multi-year build, and avoid the endless headaches of dealing with this process. 

Shopify is the basis for your business, but you’ll likely have 10+ integrations at some point, both inside and outside the Shopify ecosystem. Building, updating, and maintaining those ELT pipelines is, to be technical, a nightmare. And if you throw in a custom Amazon ELT build, forget it. It’s not going to happen. 

On top of the challenges of building the API extraction itself, denesting the JSON, and building error-handling systems, you’re going to have to successfully orchestrate the workflows (and that is a whole other article) around daily data refreshes.

We know, because we have built all this. We = Daasity. (We’re Option #3.) 

Daasity is a complete data platform, purpose-built for omnichannel consumer brands selling via eCommerce, Amazon, retail, and/or wholesale: this includes the largest brands who are looking for the flexibility, customizability, and extensibility of a truly custom build. 

We support dozens of integrations, are quickly adding new ones, and we keep them updated as soon as APIs deprecate or change. 

Interested? We’d love to show you more about how we work. You can talk to an omnichannel data and technical expert here.
