First Encounters with Broken Text
This is issue #006 of The Missing Header
Have you ever gotten an email with a sentence like this:
“We meet at the cafÃ©”
You probably guessed the sender meant a coffeehouse. But where is this Ã© garbage coming from?
This is a classic symptom of broken character encoding. Somewhere, someone opened or saved text using the wrong encoding — and now the byte soup is showing.
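You can reproduce the effect in a couple of lines of Python, which is what I’ll use for the sketches in this issue:

```python
# A minimal sketch of how the mess happens: encode text as UTF-8,
# then decode the very same bytes as Latin-1. The two-byte UTF-8
# sequence for é (0xC3 0xA9) turns into the two Latin-1
# characters Ã and ©.
text = "We meet at the café"

raw = text.encode("utf-8")      # b'We meet at the caf\xc3\xa9'
print(raw.decode("latin-1"))    # We meet at the cafÃ©
```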
But isn't everything UTF-8 these days?
If only. While UTF-8 has become the standard for the web, data doesn’t always come from well-behaved sources. Legacy systems, careless exports, or tools with bad defaults can still mess things up. For example, Excel can export UTF-8 encoded CSVs — but only if you explicitly choose the right format when saving.
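If you then read such an export in Python, for example, it helps to know that Excel’s “CSV UTF-8” option prepends a byte order mark (the bytes EF BB BF). A minimal sketch; the filename is just a placeholder:

```python
# Read a CSV that Excel saved as "CSV UTF-8". The "utf-8-sig" codec
# strips the byte order mark Excel writes; plain "utf-8" would leave
# it glued to the first field of the first row.
import csv

with open("export.csv", encoding="utf-8-sig", newline="") as f:
    for row in csv.reader(f):
        print(row)
```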
And even correctly encoded UTF-8 might not behave the way you expect. Curious? Google Unicode normalization — just don’t blame me if you fall down that rabbit hole.
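Just a taste of that rabbit hole: the two strings below both display as café, yet they compare as different, because one uses the precomposed é and the other an e plus a combining accent.

```python
# Two ways to write "café": precomposed é (U+00E9) versus
# e + combining acute accent (U+0301). They only compare equal
# after normalizing both to the same form (here: NFC).
import unicodedata

a = "caf\u00e9"     # café, precomposed
b = "cafe\u0301"    # café, decomposed

print(a == b)                                     # False
print(unicodedata.normalize("NFC", a) ==
      unicodedata.normalize("NFC", b))            # True
```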
So where does the mess come from?
Computers store everything — even text — as a sequence of bytes. The simplest way to do this is to map one byte to one character. Since Western alphabets only need about 128–256 symbols, this works pretty well… until it doesn’t.
Early encodings like Latin-1 (ISO 8859-1) did exactly this. But they quickly ran into trouble: accented characters they didn’t cover, symbols from other languages, and entire non-Latin scripts simply don’t fit into 256 slots.
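You can watch the one-byte-per-character mapping at work:

```python
# Latin-1 stores é as the single byte 0xE9, so byte count equals
# character count. UTF-8 needs two bytes for the same é.
word = "café"

print(word.encode("latin-1"))   # b'caf\xe9'      -> 4 bytes
print(word.encode("utf-8"))     # b'caf\xc3\xa9'  -> 5 bytes
```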
The solution? Create more encodings!
The ISO-8859 family, for example, now has 15 different encodings, each incompatible with the others. And when you pick the wrong one? You get cafÃ© instead of café.
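A single byte can mean a different character in each family member. A quick sketch:

```python
# The same byte, three members of the ISO-8859 family, three results:
# 0xE9 is é in Latin-1 (ISO 8859-1), щ in ISO 8859-5 (Cyrillic),
# and ι in ISO 8859-7 (Greek).
byte = b"\xe9"

for enc in ("iso-8859-1", "iso-8859-5", "iso-8859-7"):
    print(enc, "->", byte.decode(enc))
```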
Practical tips for decoding the mess
Here’s how to approach weird text in a file:
- Start with UTF-8. It’s the most common encoding today.
- Look for “Ã” artifacts. If you see things like Ã¤, Ã¶, or Ã©, it’s likely the file is actually UTF-8, but was opened as Latin-1, Latin-9, or Windows-1252.
- Try different encodings. A good text editor (like Sublime Text or VS Code) lets you re-open files with various encodings. You can also script this; see the sketch after this list.
- When in doubt, open a hex editor. It shows you the actual bytes, so you can figure out what the file really contains.
Sometimes it’s detective work. You guess. You test. You squint at bytes. Eventually, you win.
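Here’s what that guess-and-test loop can look like in Python; the byte string stands in for whatever your file actually contains:

```python
# Try the likely suspects in order. UTF-8 first is a safe bet:
# Latin-1 text is almost never valid UTF-8 by accident, so a clean
# UTF-8 decode is a strong hint. (Latin-1 itself never fails, since
# every byte is a valid Latin-1 character, so put it last.)
raw = b"We meet at the caf\xc3\xa9"

print(raw.hex(" "))   # the hex-editor view: 57 65 20 ... c3 a9

for enc in ("utf-8", "windows-1252", "iso-8859-1"):
    try:
        print(f"{enc}: {raw.decode(enc)}")
    except UnicodeDecodeError:
        print(f"{enc}: not valid")
```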
We’ll return to this topic
This is just a quick look at one of the most misunderstood parts of working with data. Unicode was invented to fix this mess — and in many ways, it did. But it also made things more complicated.
I’ll return to this topic in future issues. If you want to explore on your own, here are some common myths about Unicode:
- UTF-8 and Unicode are the same thing.
- If I use UTF-8, I’ll never have encoding issues.
- UTF-16 always uses two bytes per character.
- A Unicode code point equals a character.
- A “character” is a well-defined thing.
- You know what a character is.
🧮 The Missing Number
98.3% — The percentage of websites that use UTF-8 as their encoding
Thanks for reading,
Stefan
PS: You may be receiving this newsletter because you subscribed to the Tablecruncher Newsletter some time ago. I’ve rebranded it to avoid confusion with the software project. Same author, same scope — still all about solving messy data problems.