Building a Data Stack for Consumer Brands

Insight

If you’re looking to build a data stack for your Consumer Brand (i.e., you sell via eCommerce, Amazon, retail, wholesale), you will need to connect and use the following types of tools:

  1. An extractor tool (e.g., Fivetran, Stitch)
  2. A transformation tool (e.g., dbt)
  3. A data warehouse (e.g., Snowflake, BigQuery, Redshift)
  4. A BI/Viz tool (e.g., Looker, Tableau)
  5. A data orchestration tool (e.g., Airflow)
  6. A reverse ETL/data operationalization tool (e.g., Hightouch)

When activated, the data stack—sometimes referred to as the Modern Data Stack (MDS)—will centralize data via ELT, visually present data for analysis, allow for queries and manipulations, and enable you to push data to your marketing channels/tools (e.g., Facebook, TikTok, ESPs).

Building, implementing, and maintaining your data stack, however, is a major and ongoing investment—especially if you’re part of a large brand—and the process is filled with challenges and considerations. 

Although there could be a full-length ebook dedicated to this topic, we want to cover some higher level information about what tools/platforms you can select from for the build as well as the challenges you’ll likely face along the way.

Extract and Load (EL)

There are several EL tools to choose from, but we’ll stick to covering Fivetran and Stitch, as they are the most well-known providers.

Extract and Load Functionality

Basic functions: Both tools copy raw data from data sources (e.g., CRMs, ERPs, ESPs, social media, ads platforms, CS tools, etc.) and load that data into a destination, such as a data warehouse.

Warehouse support: Both tools connect to the “major” data warehouses: Snowflake, Bigquery, Redshift. 

Amazon: Neither tool can connect to Amazon Seller Central.

eCommerce platforms: Both can connect to Shopify/Shopify Plus; Fivetran can connect to Salesforce Commerce Cloud (SFCC), but Stitch cannot (as of 2022); Stitch can connect to Magento, but Fivetran can only connect to Magento via Amazon RDS.

Operations: (There are many operations tools, but we’ll use Netsuite as an example:) Fivetran connects to NetSuite via the JDBC driver, and Stitch can connect via the JDBC driver and SuiteTalk.

Transformation: Fivetran has some in-platform pre-built transformation capacity for Sellers (via dbt core) and data models for some common tools used by brands (e.g., Klaviyo, TikTok Ads, Facebook Ads). However, this is only a starting point, and full use of dbt will be required. Stitch does not offer any pre-built transformation and will also require a transformation tool.

Orchestration: Stitch has built-in orchestration functionality (though its more advanced and hands-on functionality requires the Advanced or Enterprise pricing tiers), but Fivetran will require an orchestration tool (more on Fivetran orchestration in the data orchestration section later in this piece).

Extract and Load Pricing

Both Fivetran and Stitch offer pricing models based on MARs (monthly active rows). Stitch offers three pricing tiers, and Fivetran offers four pricing tiers. Actual cost estimates are highly variable and depend on your data use.

  • Stitch’s second and third pricing tiers require annual billing.

Further EL Analysis and Team Expectations/Requirements

The initial setup of the EL process is fairly straightforward, and a nontechnical person can link Fivetran or Stitch to connectors.

However, you will likely require extraction from unsupported data sources as part of your initial build, and you will certainly require unsupported data sources at some point, if you continue to use your data stack.

  • If you were using Fivetran, you could either use a Function or build the extractor entirely from scratch outside of Fivetran. 
  • If you were using Stitch, you could pay for their data engineers to build a custom extractor, use a third-party consulting firm that they recommend, or (as they recommend) use community/open-source information, or do a more fundamental DIY (extractor entirely from scratch outside of Stitch).

Data Warehouse

Just as we focused on Fivetran and Stitch as the “major” EL providers, we’ll focus on Snowflake (platform agnostic), BigQuery (Google Cloud Platform), and Redshift (Amazon Web Services) as the “major” (and more relevant) data warehouses. 

  • Please note our cautions around Redshift in the Redshift section below. 
  • Firebolt is a new entry (founded 2019) to the data warehouse space with some exciting functionality, but as of 2022, EL providers do not support it as a data destination. 

Data Warehouse Functionality

Snowflake, BigQuery, and Redshift are OLAP (online analytical processing) systems, which allow for queries of large data sets and are geared for BI and analytics. 

Snowflake

Snowflake is a cloud-agnostic data warehouse option, as it is available on AWS (Amazon Web Services) and GCP (Google Cloud Platform). 

Running a Snowflake data warehouse may improve time-to-value and lower overall team costs because Snowflake supports standard SQL, does not require its own software or hardware, and supports semi-structured data like JSON (which will be your preponderant data form, as you will be pulling a lot of data from APIs). 

Snowflake’s pricing

First, Snowflake pricing is based on Snowflake Credits, which are spent when you use any of the three services in their platform: virtual warehouse (i.e., computational resources), data storage, and cloud services (e.g., authentication, access control). 

Snowflake Credits pricing varies by region and the Snowflake pricing tier you’ve chosen (the higher the tier, the higher the cost per credit). The larger the Snowflake Virtual Warehouse, the more credits it will consume per hour (this ranges from 1 credit/hr for the smallest option to 128 credit/hr for the largest option) and the more credits it will require for cloud services.

Snowflake charges based on runtime, so knowing how much time your brand will need to run queries every day will be crucial to cost optimization. If you have a strong idea of your expected daily runtime, you may be able to opt for their prepaid pricing model, which will save you money every month. Additionally, it’s crucial to understand whether you will need to move data between regions because Snowflake may charge for transfers, depending on the location specifics.

BigQuery

BigQuery is a serverless, multi-cloud data warehouse that is part of GCP (Google Cloud Platform). 

Like Snowflake, BigQuery supports JSON, but unlike Snowflake, there are fairly substantial differences in the SQL syntax that is required to query data. So, an analyst on your team will either need to have prior BigQuery SQL experience or require training/self-education on how to best query data. 

BigQuery’s pricing

BigQuery offers one overall pricing model that charges for data storage and data analysis. BigQuery categorizes data analysis as querying and anything else that activates data, such as some data manipulation language (DML) and data definition language (DDL). 

The storage costs break down into two categories: active storage ($0.02/GB/mo) and long-term storage ($0.10/GB/mo). Data in BigQuery is charged as long-term when it has not been activated in 90 days. 

Redshift

Redshift is a “data warehouse service in the cloud,” and it is part of AWS. 

Redshift presents some significant challenges around storing and querying JSON: it did not offer JSON support until Q4 2020, and “recommend[s] using JSON sparingly.” In order to effectively use Redshift as a data warehouse, your JSON needs to be denested.

Redshift pricing

Redshift offers a different data storage approach than Snowflake and BigQuery, in that you’re actually renting nodes (physical servers) from Amazon when you purchase storage space. The nodes come in many sizes, which facilitates storage scaling and provides a predictable monthly storage cost. 

Storage costs vary by region, but in the United States, the price ranges from $0.024 to $0.027 per GB per month, and it offers considerable discounts for long-term pricing agreements. 

Team Requirements for the Data Warehouse

Managing a data warehouse, especially for larger brands, can require rather significant resources:

  • The warehouse must be set up properly (even in situations where numerous functions are automated, such as scaling, as is the case with Snowflake). 
  • Queries must be judicious so that costs are kept within a preferred range.
  • The analyst and/or data engineer must align warehouse functionality with changing business needs and API updates (which occur several times per year for many platforms).

Transformation

Transformation will be the longest and most complicated element when you’re configuring your data stack. Your data team will need to transform the data loaded into your warehouse and optimize it for your BI or Visualization tool, which will require building data models and mapping data. 

Transformation in the Consumer Brand context can be particularly tricky. The nuances of understanding how standard metrics like gross margin, CAC, and customer lifetime value are calculated can trip up even experienced analysts.

In addition to knowing how each metric needs to be calculated, the sheer number of tools that Consumer Brands use complicates the transformation process—particularly as you’re looking to understand cross-channel performance. For instance, even a “simple” first- or last-click attribution model or understanding product performance across distribution channels is quite intricate to accurately track and model.

dbt for Transformation

dbt is a popular cloud-based external transform tool (and it is becoming the de facto transformation solution as part of the Modern Data Stack). The tool facilitates elements of the transformation build by allowing significant work to be done in SQL and reduces the need for other languages.

dbt Pricing

If you only have one person using dbt, the tool is free. For smaller teams, pricing roughly starts at $50 per developer seat per month and scales from there. For additional features, via dbt’s Enterprise tier, pricing is custom.

Team Requirements for Transformation

With dbt, an analyst can build and test a data model largely using SQL SELECT statements. Additionally, they will be able to leverage CI/CD, version control, and some other best engineering practices. 

However, they will need to be SQL experts and highly proficient in data modeling to develop flexible data models that will be able to stand up to the data demands that scaling brands have. 

  • Depending on the amount of data you have, the number of tools that you’re extracting data from, the size of your data team (and the experience of your data team), you can expect a transformation build process that takes anywhere from 6 and 18 months, give or take a few months.

Data Orchestration

If you choose Fivetran as your EL provider, your data team will need a third-party workflow and data orchestration tool in order to schedule your data syncs, gain visibility into your daily ELT processes, schedule multiple syncs, and see where problems may arise. 

Without an orchestration solution, your team will not have visibility into the progress of data syncs, and when problems arise, they will not be able to effectively triage them (in fact, it may require days or weeks to do so).

Airflow is a free and open-source workflow management tool, and Fivetran does integrate with it, which means you will be able to see data sync progress via DAGs, which update to indicate whether workflows have been completed:

Team Requirements for Data Orchestration

In order to successfully use Airflow, someone on your data team must be proficient in Python, because Airflow is written in Python, and any workflow builds/changes must be updated in Python.

Visualization/BI Tools

For BI tools, we’ll talk a bit about Looker and Tableau, as they are a couple of the most popular visualization tools for Consumer Brands. 

Visualization/BI Functionality

Looker, Tableau, and other BI tools are designed to pull in a brand’s transform code into their platform. The quality of the code and logic determines the quality of reports that can be built in the platform. 

  • You want the newest and most accurate data to properly flow into the BI platform so the data tables, metrics, and visualizations that you create will be relevant, accurate, and filled with actionable insights.

Choosing a BI tool involves a number of variables. Individual analytic/aesthetic preference is a factor, as is budget (Tableau is generally much more affordable than Looker). There are also certain technical considerations to consider, including your eCommerce platform.

  • Looker is often considered to be more user-friendly than Tableau.
  • Looker is web-native.
  • Tableau is often considered easier to use, though Looker is considered to be more powerful from an analytics standpoint.
  • Tableau is owned by Salesforce, and it more easily integrates with Salesforce Commerce Cloud.

Visualization/BI Pricing

Pricing for BI tools ranges considerably, and it changes based on your needs. For some companies, a visualization solution can cost less than $1000, and for others, it may be over $100k. 

  • Looker requests that sales be contacted for any pricing, as it is custom for every company.
  • Tableau charges $70 per user per month (billed annually). 

Team Requirements for Visualization/BI

If your ELT process is well-built, then the team will spend more of their time in the BI tool of your choice creating reports and dashboards rather than troubleshooting data pipeline errors. 

An analyst will be responsible for creating reports and dashboards based on your data model, using SQL (or SQL+LookML in the case of Looker).

Reverse ETL

For Reverse ETL, we will stick to covering Hightouch as a tool option.

Reverse ETL Functionality

While it’s not always included in the Modern Data Stack, for Consumer Brands, it will be key to leverage a Reverse ETL tool to build lookalike audiences in your ads platforms, personalize messaging with hyper-targeted customer segments, and more. Hightouch will connect with your data warehouse and push chosen data to your marketing tools. 

Reverse ETL Pricing

Hightouch’s pricing is based on the number of destinations (i.e., data targets) that you’ll use. One destination is free, two destinations cost $350/month, four destinations cost $800/month, and more than four destinations require custom pricing.

  • Most Consumer Brands can expect to use at least four destinations. If you’re leveraging a reverse ETL solution, you want to be enriching multiple marketing channels. 

Team Requirements for Reverse ETL

In order to build segments and leverage customer data, an analyst will “shortlist tables" via SQL or pre-built dbt transforms (these will almost certainly require customization, and you will have your own transforms). Marketers then create segments or audiences on top of the data models, and they can sync that data to their tools. 

Replacing the (Modern) Data Stack for Consumer Brands with Daasity

This is the data stack that you’re looking to build:

The total time to implement this infrastructure will be (at minimum) one year and will cost (at minimum, excluding additional data team resources), $100k

However, the combination of tools, upkeep, customizations, and team salaries will be several (or many) times that. Some brands end up spending millions of dollars on their data stack every year, or they simply abandon their data project because the costs balloon out of budget.

We’d like to present you an alternative—Daasity. Daasity is a modular data platform, built for Consumer Brands, by analysts and data engineers from Consumer Brands. Now, you might be wondering what a modular data platform is, and that’s a reasonable question... Because we’re the first! 

A modular data platform will allow you to choose and use elements (or all) of our Extraction, Loading, Transformation, Visualization, Operationalization, and Orchestration functionality, in order to build your perfect custom data stack, centralize all your data, and run deep analytics... in days or weeks, not months or years.

We have extractors for dozens of integrations that will load data into your warehouse (or you can share our Snowflake instance), and we update them several times a year to stay ahead of API changes. This means your data team won’t spend time maintaining your pipeline, and they’ll have more time doing what they love: analyzing data and helping your brand grow.

Our data models/transformation code were purpose-built for Consumer Brands. They are truly future-proof and designed to accommodate the current and potential data of eCommerce platforms, Amazon, wholesale, and retail. 

  • To take the customization to another level, our transformation code can be fully adjusted to your needs.

You can leverage any visualization tool with our platform. We have a close relationship with Looker (and we will soon announce out-of-the-box Tableau support), which allows you to use our prebuilt sales, operations, marketing, LTV, site funnel (and other) dashboards as well as 120+ reports to share with your organization. All of our dashboards and reports can be customized, your analyst(s) can build new ones in your Looker, or our analysts can build custom dashboards and reports. 

  • In your visualization tool, you can combine all your data into one view.

We provide a transparent and customizable orchestration system, where your data team can schedule and adjust workflows as you see fit, use our out-of-the-box workflows based on your integrations, or work with our data team to find the best orchestration solution for you. 

  • You won’t need an orchestration or Python specialist to make a tool like Airflow work.

Lastly, you can operationalize your data (i.e., push your data via a reverse ETL solution) to key marketing channels (including Klaviyo, Google Ads, TikTok Ads, and Facebook) with Daasity. 

Another version of a data stack:

Your data stack could be a whole lot simpler. It could look like this:

We work with brands from under $1m to over $1b. Some use us for the entire ELTVO+Orchestration process, and other brands just use our extractors. The most important thing is that you’re happy with your data stack, and we’d love to help support you in any way we can.

We want to help put your data to work, reduce the headaches, and empower your data team (and brand) to uncover the insights that will take you to the next level.

That said, if you are interested in continuing the process of building your own data stack, we sincerely hope this guide has helped out; every Consumer Brand is different, and you may find true utility in building everything from the ground up. 

However, if you need to gain insights from your data in the next month or don’t want to build a data stack internally, give us a holler. You can talk to a data specialist here.

May interest you