When Your CSV Starts Breaking the Rules
Most CSV headaches start the moment your data stops being truly flat.
CSV is the most common data format in the world. It’s simple, human-readable, and supported by almost every tool.
And yet, it’s one of the most heavily misused formats out there.
A CSV is built for one thing: storing flat, tabular data. Think of it as a single table in a relational database — rows for records, columns for fields, and every row having the same set of columns. That’s its home turf.
The problem?
Much of today’s data isn’t flat. And instead of changing the format, people try to cram complex structures into CSV.
Here are a few “creative” workarounds I’ve seen:
1. Embedding another format (JSON in a cell)
ID,First Name,Surname,Contact Data
123,John,Miller,"{telephone:["01234-56789","09876-654321"], email:"miller@example.com"}",New York
You now need a CSV parser and a JSON parser — and a prayer that all quoting rules were followed.
Many relational databases (MySQL, SQLite, PostgreSQL) now have a JSON column type. This means JSON-in-a-cell will appear more often — you should learn how to handle it. But if you’re creating the CSV yourself, don’t half-commit: store the entire dataset in JSON if possible.
2. Overloading a cell (pseudo-CSV inside CSV)
ID,First Name,Surname,Contact Data,City
123,John,Miller,"01234-56789,09876-654321,miller@example.com",New York
If it’s correctly quoted, parsers won’t get confused — but human readability is gone. And these constructs tend to take on a life of their own: sooner or later, someone will “just add one more value” manually, skip the quoting, and break every parser in sight.
Better: Again, using JSON would be a better fit. Or, at least, use JSON-in-a-cell — it’s more widely supported (especially in databases) than an ad-hoc “mini-CSV” inside a cell.
3. Expanding the table sideways (repeated fields)
ID,First Name,Surname,Telephone 1,Telephone 2,Telephone 3,City
123,John,Miller,01234-56789,09876-654321,,New York
Chances are, you’ve made the schema bigger than the data. ;-) And when someone adds a fifth number? New column, new headache.
Better: Use JSON (either as the full format or in a single column) to store variable-length lists. If you have large datasets with multiple related entities, consider using a relational database instead.
4. Breaking the table apart (linked CSVs)
Some systems — like Google Search Console exports — store related data in separate CSV files and expect you to join them later. It’s a valid technique, but now you’re manually rebuilding a database without the benefits of a database.
Better: If you control the export, choose a relational database format (SQLite, Parquet) or bundle the CSVs with clear join keys and a README.
Mental model
A CSV is a single flat table.
The moment your data needs multiple tables, nested structures, or varying fields per row — you’ve left CSV’s comfort zone. You can force it to fit, but every shortcut adds work later.
Practical takeaway
- If you’re consuming messy CSVs, watch for these patterns — they signal extra parsing work ahead.
- If you’re producing data, resist the urge to be “clever” inside a CSV. Either keep it truly flat or choose a format built for complexity.
In short: CSV is a great servant, but a bad contortionist.
Thanks for reading,
Stefan
🧮 The Missing Number
8259 — The number of the RFC that defines JSON. Successfully.
PS: This is issue #015 of The Missing Header. You may be receiving this newsletter because you subscribed to the Tablecruncher Newsletter some time ago. I’ve rebranded it to avoid confusion with the software project. Same author, same scope.