Seven Bad Data Habits (And Why I Still Do Them)
We all know the best practices. Write clean code. Use libraries. Document everything. But here’s the thing: in the real world, things are messy — and I’m no exception.
So today, let’s talk about my data engineering sins. Not as a confession, but as a mirror. Because chances are, you've done a few of these too.
1. I parse HTML with regex
Yes, I know. You’re not supposed to do that. But sometimes the structure is simple, the data is buried deep in the tags, and I just need to grab a few fields. A five-line regex script feels faster than pulling in a parser.
Of course, it’s brittle. And wrong. And still… I keep doing it.
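For comparison, here is a minimal sketch of what the parser route can look like when it stays small. It uses only Python's standard-library `html.parser`. The markup, the `price` class, and the names `FieldGrabber` and `extract_prices` are made up for illustration; this is one possible approach, not a recommendation for every page.

```python
from html.parser import HTMLParser


class FieldGrabber(HTMLParser):
    """Collects the text inside every <td class="price"> cell (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples, so attribute order does not matter
        if tag == "td" and ("class", "price") in attrs:
            self._in_price = True
            self.prices.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_price = False

    def handle_data(self, data):
        # Text may arrive in several chunks (e.g. around nested tags), so append
        if self._in_price:
            self.prices[-1] += data


def extract_prices(html):
    parser = FieldGrabber()
    parser.feed(html)
    parser.close()
    return [p.strip() for p in parser.prices]


# Invented sample: extra attributes, a line break, and a nested tag,
# the kind of variation that tends to break a quick regex
sample = (
    '<tr><td class="name">Tea</td><td class="price">4.50</td></tr>'
    '<tr><td id="x" class="price">\n  <b>12</b>.00 </td></tr>'
)
prices = extract_prices(sample)
```

It is longer than five lines, but it keeps working when someone reorders attributes or wraps a value in another tag.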
2. Regex creep
Speaking of regex: it’s an amazing tool — until it quietly takes over your entire workflow. It starts with one quick pattern, then another. Before you notice, every data transformation, every filter, every rename job involves a regex.
Regex is addictive. And the worst part? You don’t even notice how tangled things have become.
3. Manual scripts — in Excel
It starts with one small change: a filter here, a column removed there. Easy to do, easy to remember. The next week, there’s another step — maybe a new formula, or a sort, or a paste from another sheet.
Soon you're writing down the steps in a note somewhere so you can repeat them next time. Congratulations: you’ve written a script.
And you’re the interpreter.
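To make the point concrete, here is a rough sketch of what such a note could look like as an actual script, using only Python's standard `csv` module. The steps and the column names (`status`, `internal_note`, `customer`) are invented for illustration and assume the sheet was exported as CSV first.

```python
import csv
import io


def clean_sheet(csv_text):
    """Replays three hypothetical manual steps on an exported sheet."""
    rows = list(csv.DictReader(io.StringIO(csv_text, newline="")))

    # Step 1 from the note: keep only active rows (the "filter here")
    rows = [row for row in rows if row["status"] == "active"]

    # Step 2: drop a column nobody downstream needs (the "column removed there")
    for row in rows:
        row.pop("internal_note", None)

    # Step 3: sort by customer name
    rows.sort(key=lambda row: row["customer"])
    return rows


# Invented sample export
export = (
    "customer,status,internal_note\n"
    "Zed Ltd,active,call back\n"
    "Acme,inactive,\n"
    "Beta GmbH,active,vip\n"
)
cleaned = clean_sheet(export)
```

Once the steps live in a file, the interpreter is Python rather than your memory of last week.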
4. My text editor is a data tool
I always have TextMate (a great text editor for macOS) open. I use it for writing, scratch notes, quick CSV cleanup, and, of course, heavy regex usage. At any given moment, there are 20+ document windows open: some temporary, some important, all mixed together. I don’t even bother saving those documents, because macOS just reopens them after every reboot.
It’s fast. It’s comfortable. It’s chaos.
5. One-time scripts… that never die
This is a classic. You write a quick script — just for today — and move on. Except you don’t. You run it again next week. Then someone else asks for the same output. Soon it’s part of your routine.
But it’s still undocumented. Fragile. Ugly. A ghost in the machine.
6. Reused scripts with no memory
Worse than one-time scripts: “kind-of reusable” ones. They’re more complex, designed with reuse in mind — but I never bothered to explain how they work. So after a few weeks, even I don’t remember what they do. And good luck handing them to someone else.
7. CSVs by hand
I’ll admit it: I sometimes write CSV output without a library. I know what I’m doing, right? Just join some strings, throw in some commas, done.
Except the moment someone adds a value with a comma in it — or worse, a line break — everything breaks. And if this is inside one of those undocumented one-off scripts? Disaster.
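A small sketch of the difference, using Python's standard `csv` module. The sample rows and the helper name `rows_to_csv` are invented for illustration; the point is only that the library quotes commas, line breaks, and quotes, while the string-joining version does not.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize rows with the csv module, which handles quoting for us."""
    buffer = io.StringIO(newline="")  # let the csv writer control line endings
    writer = csv.writer(buffer)
    writer.writerows(rows)
    return buffer.getvalue()


# Invented sample data containing a comma, a line break, and a quote
rows = [
    ["name", "note"],
    ["Smith, Jane", "line one\nline two"],
    ["Bob", 'said "hi"'],
]

# The hand-rolled version: join some strings, throw in some commas
naive = "\n".join(",".join(row) for row in rows)

# The library version survives a round trip back through csv.reader
text = rows_to_csv(rows)
round_trip = list(csv.reader(io.StringIO(text, newline="")))
```

Here `round_trip` matches `rows` again, while `naive` has turned three rows into four lines and given the second row an extra column.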
These habits aren’t just technical problems. They’re human shortcuts. They sneak in when you’re in a hurry, when you’re tired, when you think “I’ll clean this up later.”
But here’s the thing: sometimes these shortcuts do save time. Sometimes they’re the only way forward. Just don’t pretend they’re best practices. Call them what they are: bad habits that work — until they don’t.
Have you committed a few of these sins yourself?
Send me an email — but don’t worry, you don’t have to confess.
Thanks for reading,
Stefan
🧮 The Missing Number
524: the number of functions in Excel for Microsoft 365, assuming my TextMate regex didn’t miss anything. (No guarantees.)
Here's the source: https://support.microsoft.com/en-us/office/excel-functions-alphabetical-b3944572-255d-4efb-bb96-c6d90033e188
PS: This is issue #012 of The Missing Header. You may be receiving this newsletter because you subscribed to the Tablecruncher Newsletter some time ago. I’ve rebranded it to avoid confusion with the software project. Same author, same scope — still all about solving messy data problems.