A Tale of Two Rows (and the Illusion of Sameness)
Welcome to issue #002 of The Missing Header, formerly known as the Tablecruncher Newsletter.*
A user emailed me, rather frustrated.
They had a CSV full of customer data — and a clear goal: get rid of duplicates. They opened it in Excel, clicked Remove Duplicates, and… nothing happened.
Why? Because every “duplicate” row had one tiny difference: a last_updated timestamp. The data was compiled from multiple database backups, and, not surprisingly, the timestamps differed. Same database. Same customers. Different timestamp.
The result? Not a single row matched exactly — and the cleanup failed.
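Here’s the failure in miniature, as a small Python sketch. The column names and values are made up for illustration, but the shape of the problem is the same:

```python
# Two backups of the same customer record; only last_updated differs.
rows = [
    ("42", "jane@example.com", "Jane Doe", "2023-01-05 10:14:02"),
    ("42", "jane@example.com", "Jane Doe", "2023-03-18 22:41:57"),
]

# Exact full-row deduplication keeps both rows: the timestamp makes
# every "duplicate" unique, so nothing gets removed.
unique_full = set(rows)
print(len(unique_full))  # 2 -- no duplicates found

# Ignoring the timestamp column collapses them into one customer.
unique_without_ts = {row[:-1] for row in rows}
print(len(unique_without_ts))  # 1
```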
What counts as a duplicate?
This wasn’t a software bug. It was a definition problem.
If you want to find duplicates in a CSV, the first and most important step is deciding what makes two rows the same.
- Do you want to match the entire row? While this may seem like the right approach, it can fail completely — as that user story showed.
- Or just a specific column like email, user ID, or product number? If so, you need to be sure that this column truly defines uniqueness.
- Or maybe a composite key like first name + last name + date of birth? This might work with sparse data, but can lead to false positives. If your file contains millions of users, there may well be multiple John Smiths born on the same day.
So before anything else, you need clarity on the definition of a duplicate that fits your data.
Without it, any tool will struggle.
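To make that concrete, here’s a minimal sketch of key-based deduplication in Python. The file name, column names, and the choice of email as the key are all assumptions for illustration; the point is that the key columns are where your definition lives:

```python
import csv

def dedupe_by_key(in_path, out_path, key_columns):
    """Keep the first row seen for each key, write the rest away."""
    seen = set()
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            key = tuple(row[col] for col in key_columns)
            if key not in seen:
                seen.add(key)
                writer.writerow(row)

# A single-column key, or a composite one -- your definition, made explicit:
dedupe_by_key("customers.csv", "customers_unique.csv", ["email"])
# dedupe_by_key("customers.csv", "customers_unique.csv",
#               ["first_name", "last_name", "dob"])
```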
Before de-duplicating, inspect
Don’t underestimate how much you can discover just by looking at your data the right way. Often, a quick scan reveals more than any automated tool. Try this:
- Sort by the column you think might define duplicates
- Look for troublemakers: trailing spaces, inconsistent casing, dashes in product numbers, etc.
- Use Find to search for common formatting issues (dashes, spaces, non-breaking spaces)
Got no single key? Try a composite. Sorting by multiple columns often reveals hidden structure.
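If you’d rather count than scroll, a few lines of Python do the same inspection. The file and column names are hypothetical again; note the repr() trick, which makes trailing spaces and odd characters visible:

```python
import csv
from collections import Counter

# Count rows per candidate key value to surface suspects
# before deleting anything.
with open("customers.csv", newline="", encoding="utf-8") as f:
    counts = Counter(row["email"] for row in csv.DictReader(f))

# Key values that appear more than once, most frequent first:
for value, n in counts.most_common():
    if n > 1:
        print(f"{n}x  {value!r}")  # repr() exposes trailing spaces etc.
```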
The Cleaner
Once you think you’ve found your “key,” don’t trust it blindly.
- Strip whitespace
- Normalize casing
- Decide how much variation you tolerate (e.g. is “Montréal” the same as “Montreal”?)
Only when your data is clean and your definition is sound should you trust any automatic deduplication.
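Here’s one possible normalization pass, as a sketch rather than a recipe. Whether to strip accents is exactly the tolerance question above, so it’s an explicit switch:

```python
import unicodedata

def normalize_key(value: str, strip_accents: bool = False) -> str:
    value = value.strip()        # 1. drop leading/trailing whitespace
    value = value.casefold()     # 2. normalize casing (stricter than lower())
    if strip_accents:            # 3. optional: Montréal -> Montreal
        decomposed = unicodedata.normalize("NFKD", value)
        value = "".join(c for c in decomposed if not unicodedata.combining(c))
    return value

print(normalize_key("  Montréal "))                      # 'montréal'
print(normalize_key("  Montréal ", strip_accents=True))  # 'montreal'
```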
Cleaning messy data is a universe of its own, though. I’ll come back to it in future issues — this one’s just the surface.
One more thing…
If you're wondering: yes, Tablecruncher is now open source.
Same fast CSV editor. Now with a GitHub repo: https://github.com/Tablecruncher/tablecruncher
🧮 The Missing Number
80 — the number of different countries in which a Tablecruncher license has been bought
Have a favorite data fail? Or a nightmare deduplication story?
Just hit reply — I’d love to hear it.
Thanks for reading,
Stefan
* I’ve rebranded the newsletter to avoid confusion with my CSV app. It’s now The Missing Header. Same author, same focus — all about the weird and wonderful world of tabular data.