Some Data Cleaning Steps They Don’t Warn You About

Some Data Cleaning Steps They Don’t Warn You About

We’ve all seen the “10 quick tips for cleaning your data” posts. Most of them leave out the tricky parts — the places where you can quietly ruin your dataset without even noticing.

A few that deserve more attention:

1. Format conversion is dangerous.

Getting data into the format you need (CSV, Excel, SQL dump…) — and out again — isn’t just file saving. It's a conversion – and every conversion step can introduce corruption: dropped characters, wrong date parsing, silent type changes.

2. Encodings aren’t optional.

Most guides skip character encodings entirely. Don’t. Mixed encodings — where only part of a text is wrong — are especially nasty. You need to know how to detect and fix them before you trust anything else. I've written about it in #006 First Encounters with Broken Text.

3. Find-and-replace can be a wrecking ball.

A careless replace can wipe out the wrong substring, change the wrong column, or mangle IDs that just happened to contain your target string. Always check scope and matches before hitting “replace all.”

4. Understand what your cleaning functions really do.

Excel’s CLEAN() sounds nice, but it happily leaves ASCII 127 untouched while deleting tabs — which might be the only thing separating fields in a cell. If your data uses tabs as ad-hoc delimiters, you don’t want to lose them.

5. “Remove duplicates” ≠ “problem solved.”

Removing duplicate entries is easy, but only if you're really sure what a duplicate is in your data set. See TMH issue #002 A Tale of Two Rows — sameness is often an illusion for some advice how to handle this tricky step.

6. Splitting columns isn’t trivial.

Names aren’t always “First Last.” Dates aren’t always YYYY-MM-DD. And product IDs aren't always given as "ABC-123456". Before splitting on a space, hyphen, or slash, make sure your data is actually consistent. Also read the TMH issue #009 for all the problems regarding names.

7. Document every cleaning step.

Data cleaning is highly individual to the dataset. Without a record of what you did and why, you can’t debug when something goes wrong later. And if you reuse a cleaning workflow blindly, expect surprises.

8. Know your data’s origin.

How was it collected? Which systems touched it? What quirks does it carry from those systems? These details often explain “mystery” issues faster than any amount of trial and error.

Good cleaning isn’t just about making the mess go away — it’s about knowing exactly what mess you had, and what you did to it.

Thanks for reading,
Stefan


🧮 The Missing Number

$15 — The going rate on Fiverr for cleaning up to 700 rows in Excel.

A reminder that “data cleaning” means very different things to different people — and that the real cost often shows up after the job is done.


PS: This is issue #014 of The Missing Header. You may be receiving this newsletter because you subscribed to the Tablecruncher Newsletter some time ago. I’ve rebranded it to avoid confusion with the software project.