In discussions and articles on building data stacks, ELT usually gets the spotlight. Which extractor will you use? What data warehouse will you choose? How are you planning to build your data models?
But to successfully implement and run a stack, to centralize fresh data every day from the four pillars of commerce (eCommerce, digital marketplaces, retail, and wholesale), the processes of your data tools must be granularly and precisely coordinated.
In short, your data orchestration must be in order.
What Is Data Orchestration?
Data orchestration is the programmatic logic that automates, coordinates, and manages data flows and processes across various systems and applications. A data orchestration tool or system will trigger all elements of a data pipeline, from start to finish, in order to bring data from source systems (e.g., Shopify Plus, Amazon Seller Central, or Klaviyo) to end destinations (e.g., a data warehouse). Furthermore, it will extend to processing within the data warehouse (i.e., data transformation).
Much like a symphony orchestra playing Mozart, if everything isn’t under control and perfectly timed, the whole performance will be a mess.
Orchestrating smaller amounts of data from fewer sources tends to be fairly manageable. However, managing millions or billions of rows of data from potentially dozens of sources becomes incredibly complicated, to the point that it may impede or halt the build of a successful data stack.
Why Does Data Orchestration Matter?
Data orchestration refreshes data at chosen intervals (e.g., hourly, daily, weekly) and ensures its completeness: data is transformed, cleaned, standardized, and integrated before teams leverage it for analysis or reporting.
If you’re trying to centralize order data, fulfillment data, marketing data (and so on), your extractions, loads, and transformations must trigger and complete in sequence.
If the replications or queries throw errors, what was supposed to be an automated process turns into a manual daily job wherein data engineers must locate the errors, troubleshoot, and retrigger the problem areas.
Orchestration errors delay reporting, which frustrates data teams and non-data teams alike.
As businesses rely on data to make decisions, data orchestration needs to have an all-star performance every day—but given the complexity of modern data stacks, that’s not always the case.
How Does Data Orchestration Work?
Data orchestration is generally built via a dedicated tool that integrates with the rest of a data stack.
Digging into the DAG
A data orchestration tool will enable teams to define dependencies within the data stack (i.e., the sequence of triggers needed to complete a data refresh) via a directed acyclic graph (DAG).
A DAG (simple example below) is a visual representation of a mapped out workflow. Nodes in the DAG represent individual tasks, and arrows represent dependencies between them.
DAGs are often written in Python, or they can be mapped out via the UI of a workflow dependency builder (or both). The UI version has become more common among newer tools.
Note: some newer tools distance themselves from the term DAG, which can read as a legacy label for a visual dependency graph (more on this later). Whatever the interface, however, what is ultimately created in any platform is still a DAG.
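To make the idea concrete, here is a minimal sketch of a DAG in plain Python, using the standard library's graphlib rather than any specific orchestration tool's API (the task names are hypothetical):

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
# The three extractions have no dependencies, the warehouse load waits on
# all of them, and the transformation waits on the load -- a simple DAG.
dag = {
    "extract_shopify": set(),
    "extract_amazon": set(),
    "extract_klaviyo": set(),
    "load_warehouse": {"extract_shopify", "extract_amazon", "extract_klaviyo"},
    "transform_models": {"load_warehouse"},
}

# A topological sort yields an execution order that respects every
# dependency (and raises CycleError if the graph contains a cycle).
order = list(TopologicalSorter(dag).static_order())
print(order)
```

An orchestration tool does essentially this, plus scheduling, retries, and monitoring on top of the sorted order.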
A Step Deeper: Two Ways to Build Dependencies
There are two fundamental ways that data orchestration can be built at a technical level: time-driven orchestration and event-driven orchestration.
What Is Time-Driven Orchestration?
Time-driven orchestration schedules data to refresh at specific intervals, such as hourly, daily, weekly, or monthly. In practice, this means each element of the workflow starts at the time the previous one is estimated to have completed, based on programmed logic.
This was the original way that data pipelines were orchestrated, and it remains in quite common use.
Time-Driven Orchestration Example
Consider a brand pulling data from Shopify Plus, Amazon Seller Central, and NetSuite. ELT from each of these data sources will likely complete within a particular time frame.
Building time-driven orchestration around these data sources means coding extraction, loading, and transformation to kick off at the latest likely time that each previous step will complete.
For instance (with made-up numbers):
- If Amazon Seller Central data usually takes 2 hours to be extracted, but its likely range is 1–3 hours…
- And if Shopify Plus data usually takes 1 hour to be extracted, but its likely range is 30 minutes–2 hours…
- And if NetSuite data usually takes 3 hours to be extracted, but its likely range is 2–4 hours…
The engineer might code the Shopify Plus ELT to kick off first, then have the Amazon Seller Central extractor kick off after 2 hours, and then have the NetSuite extractor kick off after an additional 3 hours.
This way, based on the expected time windows, each will kick off after the previous one completes.
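The staggered schedule above can be sketched in a few lines of Python. This is an illustration of the logic, not any tool's API; the worst-case windows are the made-up numbers from the example:

```python
from datetime import datetime, timedelta

# Worst-case extraction windows (hours) from the made-up example above.
worst_case_hours = {
    "shopify_plus": 2,            # usually 1 hr, range 30 min - 2 hrs
    "amazon_seller_central": 3,   # usually 2 hrs, range 1-3 hrs
    "netsuite": 4,                # usually 3 hrs, range 2-4 hrs
}

def staggered_schedule(start, order):
    """Kick each job off after the previous job's worst-case window."""
    schedule = {}
    t = start
    for job in order:
        schedule[job] = t
        t += timedelta(hours=worst_case_hours[job])
    return schedule

sched = staggered_schedule(
    datetime(2024, 1, 1, 0, 0),
    ["shopify_plus", "amazon_seller_central", "netsuite"],
)
for job, t in sched.items():
    print(job, t.strftime("%H:%M"))
```

Note the weakness this exposes: every start time is pinned to the previous job's worst case, so the whole refresh runs as slowly as the most pessimistic estimates, and any overrun beyond them breaks the chain.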
Time-Driven Orchestration Pros/Cons
- Pro: automated pipeline management and the ability to schedule around peak usage times.
- Con: inflexible, often manual process when extractor windows are only semi-predictable. If part of the ELT runs late, it throws off the entire schedule, and an engineer must restart part or all of the process depending on the errors.
- Con: depending on the tool, errors may not be easy to localize and can take time (sometimes hours) to troubleshoot.
What Is Event-Driven Orchestration?
In event-driven orchestration, workflows are initiated and updated in real time based on changes in data or the state of the systems, rather than running on a fixed schedule.
This allows for dynamic and accurate data processing, and it is the only way to achieve reliably automated data orchestration.
Event-Driven Orchestration Example
For the brand pulling data from Shopify Plus, Amazon Seller Central, and NetSuite, event-driven logic ensures that each set of extractions, loads, and transformations completes in sequence.
Ideally, even when an element of a workflow runs long, logic is built to keep the workflow progressing, skipping over the delay(s). This ensures that, every day, as much data is refreshed as possible.
So, if Shopify Plus extraction completed as expected, and if Amazon took twice as long as expected, then the workflow might skip over Amazon to kick off the NetSuite extraction.
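A minimal sketch of that skip-and-continue logic might look like the following. This is a hypothetical dispatcher, not any specific tool's API; the event names and statuses are assumptions for illustration:

```python
# Hypothetical event-driven dispatcher: each extraction emits a completion
# or timeout event, and the handler reacts immediately to the event rather
# than waiting on a fixed clock.
def on_event(event, state, pending_loads):
    """Handle a completion/timeout event for one source's extraction."""
    source, status = event
    if status == "completed":
        state[source] = "extracted"
        pending_loads.append(source)   # trigger this source's load right away
    elif status == "timed_out":
        state[source] = "skipped"      # skip it so the other sources keep flowing
    return state

state, loads = {}, []
on_event(("shopify_plus", "completed"), state, loads)
on_event(("amazon_seller_central", "timed_out"), state, loads)  # ran long
on_event(("netsuite", "completed"), state, loads)
print(state, loads)
```

Because each step reacts to an event instead of a clock, the Amazon delay holds up only Amazon's data; Shopify Plus and NetSuite still refresh on time.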
Event-Driven Orchestration Pros/Cons
- Pro: enables fully automated data orchestration, which is impossible via scheduling alone.
- Con: not offered by most data architecture tools/platforms.
- Con: impractical to build in-house, given the complexity and extended development timeline.
Bonus: Using Webhooks for Data Orchestration
A webhook is an endpoint that continuously listens for a SaaS platform to send you data. In some ways, webhooks are the inverse of API polling: instead of making an API call to a SaaS platform to retrieve data, the platform pushes the data to you, and you receive it via the webhook.
In the context of data orchestration, webhooks can be a great way to execute simple workflows that occur frequently. For example: sending order data to your fulfillment warehouse when you get an order.
Similarly, you could use a webhook from an extractor to kick off a workflow once a certain extraction is completed. This is a reliable way to enable frequent recurring actions, but it is not meant for complex workflows, as waiting for multiple events to complete (as would happen in a transformation workflow, for instance) requires substantial design on the sending side or receiving side.
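A typical webhook handler is small: verify that the payload really came from the platform, then kick off the next step. The sketch below uses Python's standard library only; the secret, event type, and field names are hypothetical, since every platform defines its own payload schema and signing scheme:

```python
import hashlib
import hmac
import json

# Assumed shared secret, exchanged when the webhook was registered.
SECRET = b"hypothetical-shared-secret"

def handle_webhook(raw_body, signature):
    """Verify a webhook's HMAC signature, then trigger the next workflow step.

    Most platforms sign each payload with a shared secret; checking the
    signature ensures the event really came from the sender before we act.
    """
    expected = hmac.new(SECRET, raw_body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        return "rejected"
    event = json.loads(raw_body)
    if event.get("type") == "extraction.completed":
        return f"kick off transform for {event['source']}"
    return "ignored"

body = json.dumps({"type": "extraction.completed",
                   "source": "shopify_plus"}).encode()
sig = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
print(handle_webhook(body, sig))
```

This one-event-in, one-action-out shape is exactly why webhooks suit simple, frequent workflows but strain under multi-event coordination.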
Data Orchestration Tools
There are numerous dedicated tools and data platforms with data orchestration functionality on the market today: notably, Airflow, Astronomer, Prefect, Dagster, Shipyard, and Daasity.
Individual Orchestration Tools
Originally built by Airbnb, Airflow remains a popular choice: in a poll of 400 data teams, 41.6% reported using it. Airflow has been around the longest and has a strong developer community. However, it primarily supports time-driven orchestration, which means a more hands-on data process will be required.
Astronomer (powered by Airflow), Prefect, and Dagster are “second generation” tools, launched to try to address some challenges posed by Airflow.
Shipyard is the newest entrant to the space, founded in 2020. Shipyard features a more “modern” DAG/workflow builder via its GUI.
Data Platform with Orchestration
Daasity is the only ELT & Analytics platform purpose-built for consumer brands and has built event-driven orchestration as out-of-the-box functionality. Daasity enables data teams to manage ELT + orchestration from the same platform and build custom workflows based on their orchestration needs.
For a deeper comparison of the different types of data orchestration tools, check out our in-depth article.
Data Orchestration Challenges
Getting data orchestration to work smoothly is extremely challenging, even for sophisticated data teams. It requires precise coding so that your data gets to where it needs to go, on time, and in the right format, giving the entire org confidence to use that data for strategic decisions.
Some of the main challenges include:
Data Integration
Integrating data from multiple sources can be difficult for consumer brands, as the data is invariably in many different formats, structures, and levels of quality. This can require significant effort to clean, standardize, and harmonize the data before it can be used.
Data Quality and Compliance
Ensuring that data is accurate, complete, and compliant with regulatory requirements throughout the ELT process can be a challenge, especially when integrating data from multiple sources.
Scalability
Orchestrating ELT pipelines can become extraordinarily complex and resource-intensive as the volume of data and number of data sources increase. This can make it difficult to scale the pipeline to meet growing demands and may cause latency issues.
Monitoring and Maintenance
Data orchestration requires ongoing monitoring, maintenance, and troubleshooting to ensure that it is running smoothly and that data is flowing correctly. Even with most tools, you will still likely need one engineer to stay on top of workflows and troubleshoot if necessary.
This is easier when your tool sends error notifications, so an engineer can leap on troubleshooting sooner rather than later.
Security
Data pipelines often involve moving sensitive data between systems. Businesses need sufficient security protocols in place to minimize risks and ensure sensitive personal information (PI) is handled appropriately.
Flexibility
Your orchestration needs to be flexible enough to handle changes in data sources, data structures, and business requirements. This is especially true if you have different teams, like marketing, ops, finance, and product owners, with each team asking for different reports.
Data Orchestration at Daasity
Daasity offers event-driven data orchestration out of the box and enables teams to build custom workflows that execute the extraction, loading, transformation, and operationalization (reverse ETL) of their data.
To find out how Daasity can save you time and effort on data orchestration, check out our article on Daasity’s data orchestration.
Ready to Take Control of Your Data Orchestration?
With the explosive growth of commerce, consumer brands are facing an ever-increasing amount of data that needs to be effectively managed and integrated. Data orchestration helps merchants handle the increased complexity and scale of their data environments, and it enables them to consistently leverage their data to drive growth.
Daasity was built for this.
Interested in learning more? Get in touch with a data specialist from Daasity to discuss your data orchestration needs.