The best file format for long-term storage

Even the Library of Congress agrees: boring formats win in the long run.

In our daily data work, we often treat the data we get as throwaway material. We read it, interpret it, combine it with other data, and finally reach our desired result. Then we toss the intermediate data — and often even the source data — aside.

“Throwing away” doesn’t necessarily mean deleting files. Leaving them in some project folder we won’t remember next month? That’s the same thing.

For many use cases, that’s fine. But sometimes, we should take a step back and ask: Should I archive this data on purpose?

A few years ago, I had to store Google Search Console data for a client’s seven websites. There wasn’t a real project around it. No warehouse, no data pipeline. Just a vague idea that the data might be useful someday.

The IT department was supposed to build a “real data warehouse” — once they had time. In the meantime?
“Just store the files on your machine,” an executive told me.

The job grew fast: ten files per day, per site. That’s over 25,000 files per year. More than 5 GB of data. No structure. No budget. Just one person trying to make sure the data wouldn’t vanish.

I tried several database approaches — but they introduced more problems than they solved. One version used SQLite. Then one of my scripts failed to lock the database file properly and corrupted it. I lost almost a year of data. Luckily, Google let me re-download it.

After that, I switched to plain old CSV files.

Yes, they’re boring. But that’s precisely the point. They don’t break easily. You can read them in five years — or hand them off to someone else. They don’t need a server, a runtime, or documentation. You just write and forget.
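To make “write and forget” concrete, here is a minimal sketch of that kind of date-stamped CSV archiving. The folder layout, the site name, and the report name are all made up for illustration; only Python’s standard library is used.

```python
import csv
from datetime import date
from pathlib import Path


def archive_rows(root, site, report, rows, fieldnames, day=None):
    """Write one day's report for one site to a date-stamped CSV.

    Layout: <root>/<site>/<YYYY-MM-DD>_<report>.csv
    Plain text, no database, no server: write and forget.
    """
    day = day or date.today()
    out_dir = Path(root) / site
    out_dir.mkdir(parents=True, exist_ok=True)  # create folders on first run
    out_path = out_dir / f"{day.isoformat()}_{report}.csv"
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return out_path


# Example: one day's (made-up) query data for one hypothetical site
path = archive_rows(
    "archive", "example.com", "queries",
    rows=[{"query": "csv archiving", "clicks": 12, "impressions": 340}],
    fieldnames=["query", "clicks", "impressions"],
)
```

Each write creates an independent file, so a crashed script can at worst lose one day of one report, never the whole archive.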

Even the Library of Congress puts CSV at the top of its “Recommended Formats Statement” [1] for datasets:

Widely used as an exchange format for tabular data. Although very limited in functionality, there are many data exchange or data preservation contexts for which it is adequate (…) A simple text-based format that is very transparent, being both human-readable and easily machine-processable.

What’s good enough for the Library of Congress can’t be too bad for us, right?

(And yes, that “real data warehouse” still hasn’t been built. Five years later, my scattered CSV files are still the real data warehouse.)

Thanks for reading,
Stefan

P.S. The Library of Congress also lists SQLite as a recommended format for datasets — and for any well-defined archive, that’s a solid choice. But for ad-hoc collections, built on the fly with scripts and deadlines, I’d still go with simple text files every time.

[1] https://www.loc.gov/preservation/resources/rfs/data.html


🧮 The Missing Number

840,000 — Number of CSV files in the Library of Congress collections