What Makes a Name a Match—And Why It Matters For Your Data Handling

How do you know if two names refer to the same person?

That’s the question behind a new EU regulation that will soon require European banks to warn customers during money transfers if the payee’s name doesn’t match the account holder’s name. It’s meant to prevent fraud, but it also shines a light on a classic data wrangling headache: what makes a name a match?

The idea behind the regulation: if you try to send money to Thomas Müller and the bank account says Scammy Company Corp., you should get a warning.

And honestly, that sounds like a good idea.

But how do you define a match?

Is “Thomas Mueller” the same as “Thomas Müller”?
Is “JJ Beringer” the same as “James J. Beringer”?
What about “Robert Smith III” vs. “Robert Smith the Third”?

These few examples are enough to show: there’s no perfect answer. And that’s why the EU regulation doesn’t enforce name matches — it introduces a traffic light–inspired warning system instead.

That’s not just a banking problem. Names are everywhere: in SaaS user databases, customer records, CRMs, support tickets, comment sections, or shipping labels.

And yet: handling names is surprisingly tricky.

Searching for a user? Merging duplicate records? Cleaning up a messy export? The name field often seems like low-hanging fruit — until it breaks your logic.

If you’ve ever tried to deduplicate thousands of rows based on “similar” names, you’ve probably learned the hard way: even something as basic as “Who is who?” can quickly become a data nightmare.

Why name comparison is so hard

Let’s break down the most common troublemakers:

Accents and Umlauts

  • Müller vs. Mueller vs. Muller
  • Beyoncé vs. Beyonce
    Systems often strip accents or map them inconsistently.

Middle Names & Initials

  • Thomas A. Edison vs. Thomas Edison
  • JJ Beringer vs. James John Beringer
    Middle names may appear, disappear, or get abbreviated depending on the context.
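
One way to tolerate missing or abbreviated middle names is to compare names token by token and let a single letter stand in for any name starting with that letter. The following is only a rough sketch: the function names are made up for illustration, and it assumes Western name order (surname last), which the later sections show is not a safe assumption in general.

```python
def tokens_compatible(a: str, b: str) -> bool:
    """Two name tokens are compatible if equal, or if one is an initial of the other."""
    a, b = a.casefold(), b.casefold()
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b


def names_compatible(name_a: str, name_b: str) -> bool:
    """Hypothetical helper: surname must match, given names must not contradict."""
    # Treat dots as separators so "J.J." becomes two initials.
    tokens_a = name_a.replace(".", " ").split()
    tokens_b = name_b.replace(".", " ").split()
    if not tokens_a or not tokens_b:
        return False
    # Assumption: the last token is the surname.
    if tokens_a[-1].casefold() != tokens_b[-1].casefold():
        return False
    # Compare given names pairwise; extra middle names on one side are tolerated.
    for a, b in zip(tokens_a[:-1], tokens_b[:-1]):
        if not tokens_compatible(a, b):
            return False
    return True
```

With this sketch, "Thomas A. Edison" is compatible with "Thomas Edison", and "James J. Beringer" with "James John Beringer". "JJ Beringer" is not recognized, because "JJ" is neither a single initial nor equal to "James"; run-together initials would need extra handling.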

Suffixes and Numerals

  • Robert Smith III vs. Robert Smith
    Suffixes like Jr., Sr., II, III are tricky. Are they part of the name? Depends on the system.

Cultural Name Order

In China, Korea, Hungary, and parts of Eastern Europe, surnames typically come first. So in names like Zhang Wei or Kim Jong-un, Zhang and Kim, respectively, are the surnames.

Hyphenation and Spacing

  • Lee-Ann vs. Lee Ann
  • Knowles-Carter vs. Knowles Carter
    Simple formatting differences can break comparisons.

Toponymic Names and Nobility

  • Leonardo da Vinci
  • Charles de Gaulle
  • “Prinz von Preußen” as a legal surname in Germany
    Some names include particles or noble titles that defy rigid first–last name models.

Titles Treated as Names

In Germany, a doctoral title may legally appear in a passport:
Dr. med. Anna Schmidt
For many, “Dr.” isn’t just a title; it’s part of their name. And that’s how they enter it.

We can't solve the mess — but we can reduce it

Trying to solve all edge cases is a never-ending rabbit hole. But here’s what we can do:

▶️ Don’t assume names are unique identifiers

Never use names alone for de-duplication. Always combine them with more stable fields like birth date, address, email, or customer number.
Even in small datasets, name collisions happen more often than you'd think. If you want to know how easy that is, look up the “birthday paradox”.
A survey conducted in Texas found that in Harris County, there are 2,488 patients named Maria Garcia — and 231 of them share the same birth date.
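
To make both points concrete, here is a small sketch. The first function computes the classic birthday-paradox probability under the simplifying assumption that birthdays are spread evenly over the year. The second groups records by a composite key instead of by name alone. The field names (name, birth_date, email) are placeholders for illustration, not a recommendation for a specific schema.

```python
from collections import defaultdict


def collision_probability(n: int, days: int = 365) -> float:
    """Chance that at least two of n people share a birthday (uniform assumption)."""
    if n > days:
        return 1.0
    p_all_different = 1.0
    for i in range(n):
        p_all_different *= (days - i) / days
    return 1.0 - p_all_different


def duplicate_candidates(records, key_fields=("name", "birth_date", "email")):
    """Group records that agree on ALL key fields; return groups with 2+ members."""
    groups = defaultdict(list)
    for rec in records:
        # Hypothetical field names; trim and case-fold so trivial differences don't split groups.
        key = tuple(str(rec[field]).strip().casefold() for field in key_fields)
        groups[key].append(rec)
    return [group for group in groups.values() if len(group) > 1]
```

With only 23 people, collision_probability(23) is already slightly above 50%. And two patients with the same name and birth date but different email addresses form a candidate pair under key_fields=("name", "birth_date"), but not under the full three-field key.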

▶️ Store a normalized form

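Convert names to lowercase, strip accents, and optionally map characters (like ü → ue). If you do map characters, apply the mapping before stripping accents, otherwise the ü has already become a plain u. Store both the original and the normalized version.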
Use the normalized one for comparisons, but always display what the user typed.
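
A minimal sketch of such a normalization, using only Python's standard library. The transliteration table is an assumption that covers German umlauts only; other languages need other mappings, or none at all. Collapsing repeated whitespace is an extra step added here for convenience.

```python
import unicodedata

# Assumed mapping for German-style transliteration; extend or replace per locale.
# ("ß" needs no entry: str.casefold() already turns it into "ss".)
TRANSLITERATIONS = {"ä": "ae", "ö": "oe", "ü": "ue"}


def normalize_name(name: str, transliterate: bool = True) -> str:
    """Return a comparison key; keep the original string for display."""
    # Compose first so "u" + combining diaeresis becomes a single "ü".
    text = unicodedata.normalize("NFC", name).casefold()
    if transliterate:
        # Map before stripping accents, otherwise "ü" would already be "u".
        for source, target in TRANSLITERATIONS.items():
            text = text.replace(source, target)
    # Decompose and drop the combining marks (accents) that remain.
    decomposed = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Collapse runs of whitespace.
    return " ".join(text.split())
```

Here normalize_name("Thomas Müller") and normalize_name("Thomas Mueller") both give "thomas mueller". Note that "Muller" still normalizes to "muller": no single normalized form makes all three spellings equal, which is one reason to pair normalization with fuzzy matching.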

▶️ Provide fuzzy matching

Depending on your use case, consider fuzzy search using techniques like Levenshtein distance, n-grams, or Soundex.
Just make sure the results are clearly marked as approximate.
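
As an illustration, here is a plain-Python Levenshtein distance with a small search helper that labels every hit as exact or approximate. The helper name, the similarity formula (one minus distance divided by the longer length), and the 0.8 threshold are arbitrary choices for this sketch; n-grams or Soundex could be swapped in for the scoring step.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to 0..1, where 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def fuzzy_search(query, candidates, threshold=0.8):
    """Hypothetical helper: return (candidate, score, label) tuples, best first."""
    needle = query.casefold()
    hits = []
    for candidate in candidates:
        score = similarity(needle, candidate.casefold())
        if score >= threshold:
            label = "exact" if score == 1.0 else "approximate"
            hits.append((candidate, round(score, 3), label))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Searching for "Mueller" in ["Mueller", "Muller", "Smith"] returns "Mueller" as exact and "Muller" as approximate with a score of 0.857; "Smith" falls below the threshold. In practice you would run this on the normalized names rather than on the raw input.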

Names are one of the most human forms of data — and deeply rooted in culture. In our globalized systems, those origins are often hidden until they suddenly matter. That’s why regulations like EU 2024/886 are a good moment to revisit how we handle names in our systems.

Let’s make our data flows a bit more resilient — and a lot more respectful of human naming diversity.

🧮 The Missing Number

69,807: the number of pairs of patients in Harris County, Texas who share both name and birth date.

Thanks for reading,
Stefan


PS: This is issue #009 of The Missing Header. You may be receiving this newsletter because you subscribed to the Tablecruncher Newsletter some time ago. I’ve rebranded it to avoid confusion with the software project. Same author, same scope — still all about solving messy data problems.
