Data Wrangling
Data wrangling is the work of cleaning, restructuring, and transforming raw data into a format suitable for analysis. It covers type conversions, null handling, deduplication, date reformatting, string normalization, joining disparate sources, and resolving schema inconsistencies. The term names the work that closes the gap between data as it arrives and data as it needs to be.
Industry surveys consistently estimate that analysts spend 60–80% of their time on wrangling rather than analysis. The number has held steady for years despite advances in tooling. The reason is structural: data arrives messy because the systems that generate it were built for operations rather than analytics. A CRM stores dates in one format. A billing system uses another. A third-party API returns nested JSON with inconsistent field names. None of these systems were designed to feed a dashboard.
What wrangling involves
The specific tasks fall into predictable categories:
Type conversions. Dates stored as strings. Revenue stored as text with currency symbols. Boolean values represented as 0/1, "yes"/"no", or "true"/"false" depending on the source system. Every downstream tool expects consistent types.
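A minimal sketch of these coercions in Python. The field formats, the accepted boolean variants, and the list of date formats are illustrative assumptions, not drawn from any specific source system:

```python
# Hypothetical helpers for coercing common "stringly typed" source values.
from datetime import datetime
from decimal import Decimal

TRUTHY = {"1", "yes", "true", "y", "t"}
FALSY = {"0", "no", "false", "n", "f"}

def to_bool(raw: str) -> bool:
    """Map the 0/1, yes/no, true/false variants onto a real boolean."""
    value = raw.strip().lower()
    if value in TRUTHY:
        return True
    if value in FALSY:
        return False
    raise ValueError(f"unrecognized boolean: {raw!r}")

def to_amount(raw: str) -> Decimal:
    """Strip currency symbols and thousands separators from text revenue."""
    cleaned = raw.strip().lstrip("$€£").replace(",", "")
    return Decimal(cleaned)

def to_date(raw: str) -> datetime:
    """Try each date format a source system is known to emit."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y"):
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")
```

Raising on unrecognized values, rather than silently passing them through, surfaces new format variants the moment a source system starts emitting them.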
Null and missing value handling. Deciding whether a NULL means "unknown," "inapplicable," or "the source system failed to populate." The choice affects aggregation logic – does a NULL discount mean no discount (zero) or missing data (exclude from calculation)?
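The discount example can be made concrete. Both readings below are defensible, and they produce different averages from the same rows (the order data is made up for illustration):

```python
# A NULL discount under two interpretations.
discounts = [0.10, None, 0.20, None, 0.30]

# Reading 1: NULL means "no discount" -> treat as zero.
as_zero = [d if d is not None else 0.0 for d in discounts]
avg_as_zero = sum(as_zero) / len(as_zero)    # 0.60 / 5 ≈ 0.12

# Reading 2: NULL means "unknown" -> exclude from the calculation.
known = [d for d in discounts if d is not None]
avg_known = sum(known) / len(known)          # 0.60 / 3 ≈ 0.20
```

Neither answer is wrong in isolation; what matters is that the whole organization picks one reading and encodes it once, rather than each analyst deciding per notebook.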
Deduplication. Multiple records for the same entity, caused by system integrations that create duplicates, user error, or merge failures. Identifying which records represent the same customer or transaction requires matching logic that accounts for spelling variations, partial data, and conflicting values.
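A minimal dedup sketch: collapse records that normalize to the same key, keeping the most complete one. The choice of key (lowercased email) is an illustrative assumption; real entity resolution usually combines several fields and fuzzy comparisons:

```python
# Hypothetical customer records; the email-based key is an assumption.
def normalize_key(record: dict) -> str:
    return record["email"].strip().lower()

def completeness(record: dict) -> int:
    # Count populated fields so we can prefer the richer record.
    return sum(1 for v in record.values() if v not in (None, ""))

def dedupe(records: list) -> list:
    seen = {}
    for record in records:
        key = normalize_key(record)
        current = seen.get(key)
        # On a key collision, keep whichever record has more data.
        if current is None or completeness(record) > completeness(current):
            seen[key] = record
    return list(seen.values())

customers = [
    {"name": "Ada Lovelace", "email": "ada@example.com", "phone": ""},
    {"name": "A. Lovelace", "email": "ADA@example.com ", "phone": "555-0100"},
]
unique = dedupe(customers)  # one record survives: the more complete one
```

The "keep the most complete record" rule is one simple survivorship policy; others merge field-by-field or prefer the most recently updated source.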
Reformatting and normalization. Standardizing country names (US vs. United States vs. USA), cleaning phone number formats, splitting or combining address fields, converting time zones.
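Two of these tasks sketched with a lookup table and a digit filter. The alias mappings are illustrative and deliberately incomplete; a production table would be much longer and usually curated over time:

```python
# Illustrative alias table for country standardization.
COUNTRY_ALIASES = {
    "us": "United States",
    "usa": "United States",
    "u.s.": "United States",
    "united states": "United States",
}

def normalize_country(raw: str) -> str:
    # Fall back to the trimmed original when no alias matches.
    key = raw.strip().lower()
    return COUNTRY_ALIASES.get(key, raw.strip())

def normalize_phone(raw: str) -> str:
    # Reduce any phone formatting down to its digits.
    return "".join(ch for ch in raw if ch.isdigit())
```

Falling back to the original value on a miss keeps unrecognized countries visible in the output instead of silently dropping them, which makes gaps in the alias table easy to spot.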
Joining disparate sources. Connecting data across systems that don't share a common key. Matching Salesforce accounts to Stripe customers to product usage data often requires fuzzy matching, lookup tables, or manual mapping.
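A sketch of that matching pipeline, assuming company name is the only overlapping field. The normalization rules, the manual-mapping entries, and the fuzzy cutoff are all illustrative assumptions:

```python
import difflib

def name_key(name: str) -> str:
    # Crude normalization: lowercase and drop common legal suffixes.
    key = name.lower().strip()
    for suffix in (", inc.", " inc.", " llc", " ltd"):
        key = key.removesuffix(suffix)
    return key

# Hand-curated exceptions for names that normalization alone can't bridge.
MANUAL_MAP = {"acme intl": "acme international"}

def match(sf_name: str, stripe_names: list) -> str:
    key = MANUAL_MAP.get(name_key(sf_name), name_key(sf_name))
    candidates = {name_key(n): n for n in stripe_names}
    if key in candidates:
        return candidates[key]
    # Fall back to fuzzy matching above a conservative cutoff.
    close = difflib.get_close_matches(key, candidates, n=1, cutoff=0.85)
    return candidates[close[0]] if close else None
```

The exact-match-then-fuzzy ordering matters: exact matches on normalized keys are cheap and unambiguous, so the fuzzy pass (and its false-positive risk) only applies to the leftovers.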
The spectrum of wrangling tools
Wrangling happens at every point in the data stack, with varying levels of automation and governance.
Manual – spreadsheets. An analyst downloads a CSV, cleans it in Excel, and uploads the result. Fast for one-off tasks. Impossible to reproduce, audit, or scale.
Code-based – SQL and Python. Analysts write transformation scripts. Reproducible and version-controllable, but requires programming skills. Most data teams live here.
Transformation frameworks – dbt, Dataform. SQL-based tools that codify transformations into tested, versioned, scheduled pipelines. Wrangling logic becomes infrastructure rather than one-off scripts. This is where mature organizations push their recurring cleaning work.
Low-code wrangling tools. Visual interfaces for building transformation logic – drag-and-drop joins, point-and-click type conversions, GUI-based deduplication rules. Tools like Trifacta (now part of Alteryx), Datameer, or built-in prep features in BI platforms. These lower the barrier for business analysts who can't write SQL but need cleaner data than what the warehouse provides.
Pushing wrangling into the right layer
The core problem with ad-hoc wrangling is repetition. When each analyst cleans the same dataset independently, the work multiplies. Ten analysts cleaning the same customer table means ten slightly different cleaning approaches, ten sets of assumptions about nulls, ten undocumented transformations.
The fix is to push wrangling logic into a shared layer – either the transformation layer (dbt models that clean data once for everyone) or the semantic layer (definitions that handle type casting, null logic, and formatting as part of the metric definition). When wrangling runs once in a governed pipeline rather than ten times in ten notebooks, the output is consistent and the total effort drops.
A well-maintained data catalog accelerates this by showing analysts which datasets have already been cleaned and which transformations have been applied. Without that visibility, analysts often re-wrangle data that someone else has already prepared – or worse, use an uncleaned source when a clean version exists.
The wrangling burden is also one of the clearest signals in a time-to-insight scorecard. If the Prep stage scores low – meaning data consistently requires significant manual work before it's usable – that indicates a gap in the transformation or semantic layer worth investing in.
The Holistics Perspective
Holistics minimizes wrangling at the BI layer by pushing transformation logic into the semantic layer. Rather than cleaning data in spreadsheets or notebooks before analysis, teams define transformation rules in AML models that apply consistently across every report and dashboard.
See how Holistics approaches this →