Cleaning User-Entered Survey Data to Improve Marketing Attribution

Marcel Fafard
October 27, 2022

The Data Problem

Cleaning free-form text entry data poses significant challenges for omni-channel merchants across a variety of data collection strategies. In the case of using HDYHAU (“How did you hear about us?”) surveys to improve attribution, merchants often face major headaches around user-entered data.

Customers’ inputs can be highly variable, so brands must account for common misspellings, uncover related keywords, and manage page-long CASE statements. Yet despite best efforts, the end result tends to be the same: unusable or incomplete data as well as flawed attribution.
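To see why hand-written CASE statements balloon, here is a minimal Python sketch of the mapping-table idea that replaces them. The response values and the `normalize` helper are hypothetical; the point is that variants live in data rather than in code.

```python
# A hand-written CASE statement must enumerate every misspelling inline.
# A mapping table keeps the variants in data, so adding a new variant is a
# one-line change. (All values here are illustrative.)

MISSPELLING_MAP = {
    "tim ferris": "Tim Ferriss",
    "tim ferriss": "Tim Ferriss",
    "timothy ferris": "Tim Ferriss",
    "ferris podcast": "Tim Ferriss",
}

def normalize(response: str) -> str:
    """Return the canonical value for a raw survey response."""
    return MISSPELLING_MAP.get(response.strip().lower(), "Unmapped")

print(normalize("  Tim Ferris "))  # maps a typo to the canonical spelling
print(normalize("billboard"))      # unknown responses fall through
```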

The Data Solution

We created a solution that cleans up user-entered survey data in the post-purchase survey platform Fairing. The resulting data can be used for marketing attribution, customer cohort analysis, and more.

We built a mapping document and transformation code to help you make sense of your customer responses in Fairing or from other platforms that involve free-form text entry data. You’ll be able to answer questions like: 

  • How can I better manage complex CASE statements?
  • How can I easily map customer typos to the correct value?
  • How are customers spelling Tim Ferriss’ name? …Can I spend less time cleaning customer responses and start enjoying my own “4-Hour Work Week”? :)

Expectations, Requirements, and Table Construction

Estimated build time: 5 to 20 hours, depending on the platform. A Fairing build is faster; if criteria changes need to be made, it will take longer. This estimate assumes a flat data source that contains one or two columns with any combination of free-form text, numbers, or strings.

Necessary familiarity: SQL (common table expressions, window functions, and joins) and Google Sheets.

Table construction and notes:

  • The table drp_staging.mapping_raw_data should contain a dataset with a primary key as well as two criteria columns that need to be mapped.
  • A Google Sheet integration must be created that follows the mapping format, or is customized to a new use case (see example below).
  • The sheet is transformed into a table that is joined on the criteria specified in its drop-down menus.

Relevant Data and Considerations for This Analysis

For this transformation, you will need access to your Google Sheet, your source system’s data tables (in this example we are using Fairing), and the drp_staging schema.

  • Google Sheet Integration to program the mapped values and mapping criteria
  • Source System data
  • DRP_STAGING schema to create intermediary tables during the transformation

Google Sheet Integration

Create a Google Sheet Integration to specify the mapping criteria for the integration. A template example is shown in the screenshot below.

  • In the example, the data entry on row 2 maps customer surveys whose response is exactly equal to newsletter, podcast, influencer, online article, or online ad AND whose free-form customer-entered field contains any of the common misspellings of the author’s name “Tim Ferriss.”
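The row-2 semantics can be sketched in Python: an exact match on the primary response combined with a variant match on the free-form secondary response. The channel list and misspelling variants below are illustrative stand-ins for the sheet’s contents.

```python
# Exact-match primary criteria AND variant-match secondary criteria.
# Values are illustrative, not the actual sheet contents.

PRIMARY_VALUES = {"newsletter", "podcast", "influencer", "online article", "online ad"}
SECONDARY_VARIANTS = {"tim ferris", "tim ferriss", "timothy ferris"}

def matches(primary: str, secondary: str) -> bool:
    """True when both the channel and the free-form response satisfy the row."""
    return (
        primary.strip().lower() in PRIMARY_VALUES
        and secondary.strip().lower() in SECONDARY_VARIANTS
    )

print(matches("Podcast", "Tim Ferris"))  # known channel + known misspelling
print(matches("Podcast", "Joe Rogan"))   # unknown secondary response
```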

Source System Data (Fairing)

We then access data from the source system. With this Google Sheet, we are able to map all misspellings and variants of Tim Ferriss’s name, along with the various combinations of the primary criteria, to one Response Group and Sub Group.

Note: this solution prioritizes rows that match on both primary and secondary criteria, but it also supports mapping on only a primary or only a secondary criterion.
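The prioritization described above can be sketched in Python: rules matching both criteria win over single-criterion rules, with a fallback for unmapped responses. The rules, responses, and fallback label below are illustrative.

```python
# Mapping priority: a response satisfying both a primary and a secondary
# criterion beats one satisfying only a single criterion.
# Rules and responses are illustrative.

RULES = [
    # (primary, secondary, response_group); None means the criterion is not set
    ("podcast", "tim ferriss", "Podcast / Tim Ferriss"),
    ("podcast", None, "Podcast / Other"),
    (None, "tim ferriss", "Unknown / Tim Ferriss"),
]

def map_response(primary, secondary):
    # Pass 1: rules where both criteria must match.
    for p, s, group in RULES:
        if p == primary and s == secondary:
            return group
    # Pass 2: single-criterion rules.
    for p, s, group in RULES:
        if (p == primary and s is None) or (p is None and s == secondary):
            return group
    return "Missing from BSD"

print(map_response("podcast", "tim ferriss"))  # both criteria matched
print(map_response("podcast", "joe rogan"))    # primary-only fallback
print(map_response("billboard", "x"))          # no rule matched
```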

The Transformation Code 

Our templated solution allows quick implementation of the solution described above. Here is a quick overview of what the code does.

-- Assign a unique identifier to each row in the Google Sheet:

WITH raw_sheet AS (
  SELECT *
    , ROW_NUMBER() OVER (
        PARTITION BY response_1
        ORDER BY response_1, operator_1, value_1, response_2, operator_2, value_2,
          response_group, online_ad, podcast, influencer, newsletter,
          other_store, online_article, outdoor_billboard
      ) AS row_pk
  FROM gsheets.daasity_survey_mapping
  WHERE response_1 IS NOT NULL
)
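This row-numbering step can be sketched in Python: number the sheet rows within each response_1 group so every row carries a key that later steps can join back on. The rows below are made up for illustration.

```python
# Python sketch of ROW_NUMBER() OVER (PARTITION BY response_1 ...):
# number rows within each response_1 group. Data is illustrative.
from itertools import groupby

rows = [
    {"response_1": "podcast", "value_1": "podcast"},
    {"response_1": "podcast", "value_1": "spotify"},
    {"response_1": "newsletter", "value_1": "newsletter"},
]

# groupby requires its input sorted by the grouping key.
rows.sort(key=lambda r: r["response_1"])
for _, group in groupby(rows, key=lambda r: r["response_1"]):
    for n, row in enumerate(group, start=1):
        row["row_pk"] = n

print([(r["response_1"], r["row_pk"]) for r in rows])
```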

-- Split each row into multiple rows, allowing the multivalue inputs specified
-- in the Google Sheet to be joined to the source system later in the
-- transformation process (v1 and v2 are intermediate CTEs, not shown, that
-- isolate each criteria column's operator o and multivalue cell v):

, v1s AS (
  SELECT d.row_pk
    , TRIM(d.o) AS operator_1
    , TRIM(REPLACE(s.value, '"', '')) AS row_criteria_1
  FROM v1 d, LATERAL SPLIT_TO_TABLE(d.v, ',') s
)

, v2s AS (
  SELECT d.row_pk
    , TRIM(d.o) AS operator_2
    , TRIM(REPLACE(s.value, '"', '')) AS row_criteria_2
  FROM v2 d, LATERAL SPLIT_TO_TABLE(d.v, ',') s
)
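The split step above can be sketched in Python: fan one sheet row with a comma-separated criteria cell out into one row per value, stripping quotes and whitespace. The row contents are illustrative.

```python
# Python analog of Snowflake's LATERAL SPLIT_TO_TABLE: one output row per
# comma-separated value in the criteria cell. Data is illustrative.
row = {"row_pk": 1, "operator_2": "like", "v": '"tim ferris", "tim ferriss"'}

split_rows = [
    {
        "row_pk": row["row_pk"],
        "operator_2": row["operator_2"].strip(),
        "row_criteria_2": part.strip().replace('"', ""),
    }
    for part in row["v"].split(",")
]
print(split_rows)
```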

, v1s_op1 AS (
  SELECT v1s.row_pk
    , CASE WHEN v1s.operator_1 IN ('=', 'in') THEN '=' ELSE v1s.operator_1 END AS operator_1
    , CASE WHEN v1s.operator_1 IN ('like', 'not like') THEN '%' ELSE '' END AS prefix_suffix
    , prefix_suffix || TRIM(row_criteria_1) || prefix_suffix AS value_1
  FROM v1s
)

, v2s_op2 AS (
  SELECT v2s.row_pk
    , CASE WHEN v2s.operator_2 IN ('=', 'in') THEN '=' ELSE v2s.operator_2 END AS operator_2
    , CASE WHEN v2s.operator_2 IN ('like', 'not like') THEN '%' ELSE '' END AS prefix_suffix
    , prefix_suffix || TRIM(row_criteria_2) || prefix_suffix AS value_2
  FROM v2s
)

-- Combine subgroups into one column:

, response_groups AS (
  SELECT row_pk
    , response_group
    , COALESCE(online_ad, podcast, influencer, newsletter,
               other_store, online_article, outdoor_billboard) AS sub_group
    , ROW_NUMBER() OVER (PARTITION BY value_1, value_2, sub_group ORDER BY value_1) AS rn
  FROM raw_sheet
)
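The COALESCE across the sub-group columns can be sketched in Python: take the first non-null channel column as the row's sub_group. The column values below are illustrative.

```python
# Python sketch of COALESCE over the sub-group columns: the first non-null
# value wins. Column values are illustrative.
row = {
    "online_ad": None,
    "podcast": "Tim Ferriss",
    "influencer": None,
    "newsletter": None,
    "other_store": None,
    "online_article": None,
    "outdoor_billboard": None,
}

sub_group = next(
    (v for v in row.values() if v is not None),
    None,  # all columns null -> no sub-group
)
print(sub_group)
```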

, all_mappings AS (
  SELECT rg.response_group
    , rg.sub_group
    , v1s_op1.operator_1
    , v1s_op1.value_1
    , v2s_op2.operator_2
    , v2s_op2.value_2
  FROM response_groups rg
  LEFT JOIN v1s_op1 ON rg.row_pk = v1s_op1.row_pk
  LEFT JOIN v2s_op2 ON rg.row_pk = v2s_op2.row_pk
  WHERE rg.rn = 1
)

-- Map responses to the source system based on the criteria, while
-- prioritizing mappings with multiple criteria:

, prefin AS (
  SELECT map.*
    , COALESCE(am1.response_group, am2.response_group, am3.response_group,
               am4.response_group, am5.response_group, am6.response_group,
               'Missing from BSD') AS primary_group
    , COALESCE(am1.sub_group, am2.sub_group, am3.sub_group,
               am4.sub_group, am5.sub_group, am6.sub_group,
               'Missing from BSD') AS sub_group
  FROM drp_staging.mapping_raw_data map
  -- Primary and secondary criteria are equal to
  LEFT JOIN all_mappings am1 ON map.primary_criteria = am1.value_1
                             AND map.secondary_criteria = am1.value_2
                             AND am1.operator_1 = '='
                             AND am1.operator_2 = '='
  -- Primary criteria is equal to, secondary criteria contains
  LEFT JOIN all_mappings am2 ON map.primary_criteria = am2.value_1
                             AND map.secondary_criteria ILIKE am2.value_2
                             AND am2.operator_1 = '='
                             AND am2.operator_2 = 'like'
  -- Primary criteria is equal to, secondary criteria is blank
  LEFT JOIN all_mappings am3 ON map.primary_criteria = am3.value_1
                             AND am3.operator_1 = '='
                             AND am3.operator_2 IS NULL
  -- Primary criteria contains, secondary criteria is blank
  LEFT JOIN all_mappings am4 ON map.primary_criteria ILIKE am4.value_1
                             AND am4.operator_1 = 'like'
                             AND am4.operator_2 IS NULL
  -- Primary operator is null, secondary criteria is equal to
  LEFT JOIN all_mappings am5 ON am5.operator_1 IS NULL
                             AND am5.operator_2 = '='
                             AND map.secondary_criteria = am5.value_2
  -- Primary operator is null, secondary criteria contains
  LEFT JOIN all_mappings am6 ON am6.operator_1 IS NULL
                             AND am6.operator_2 = 'like'
                             AND map.secondary_criteria ILIKE am6.value_2
)
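The COALESCE over the six joins can be sketched in Python: each join contributes a candidate group (or None when it found no match), the first non-null candidate wins, and unmatched responses fall back to the placeholder label. The helper name and candidate values are illustrative.

```python
# Sketch of the priority COALESCE: take the first join (am1..am6) that
# produced a match, else fall back to the placeholder. Values illustrative.
def first_match(*candidates, default="Missing from BSD"):
    return next((c for c in candidates if c is not None), default)

# am1 (both criteria equal) missed; am2 (primary =, secondary contains) hit.
print(first_match(None, "Podcast", None, None, None, None))
print(first_match(None, None, None, None, None, None))  # no join matched
```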

Final Result

Any response that comes through as ‘Missing from BSD’ needs to be added to the Google Sheet to update the mapping. Thankfully, this is as simple as adding the value to the list that already exists.
