Fuzzy Matching & Auto-Correction in Pandas Using RapidFuzz

Last modified: August 03, 2025

Use case:

Fix messy strings like "Jon Smith" vs "John Smiht" vs "J. Smith"
Deduplicate entries with typos or different formats
Automatically map messy entries to a clean master list

Example Using `RapidFuzz`

import pandas as pd
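from rapidfuzz import process, fuzz, utils

# Messy data
df = pd.DataFrame({
    'name': ['Jon Smith', 'John Smith', 'J. Smith', 'John Smiht', 'Smith, John', 'Jane Doe']
})

# Master list of known correct names
known_names = ['John Smith', 'Jane Doe']

# Function to find best fuzzy match
def fuzzy_match(name, choices, score_cutoff=80):
    # RapidFuzz 3.x applies no preprocessing by default, so normalize explicitly:
    # default_process lowercases and replaces punctuation with spaces, which lets
    # "J. Smith" and "Smith, John" clear the cutoff against "John Smith".
    match = process.extractOne(
        name, choices,
        scorer=fuzz.token_sort_ratio,
        processor=utils.default_process,
        score_cutoff=score_cutoff,
    )
    return match[0] if match else name  # Return matched name or original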

# Apply the fuzzy matching
df['corrected_name'] = df['name'].apply(lambda x: fuzzy_match(x, known_names))

print(df)

📊 Output:

name	corrected_name
Jon Smith	John Smith
John Smith	John Smith
J. Smith	John Smith
John Smiht	John Smith
Smith, John	John Smith
Jane Doe	Jane Doe

Why This Hack Is Powerful:

Handles typos, abbreviations, and formatting issues
Uses RapidFuzz, which is implemented in C++ and is typically much faster than fuzzywuzzy (figures of 10–100x depend on the scorer and the data)
Easy to integrate into cleaning pipelines before ML or database sync

Cache Matches to Speed Up Repeated Processing

from functools import lru_cache

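# lru_cache keys on the function arguments, so only `name` needs to be hashable.
# Cached results assume known_names stays fixed; call cached_match.cache_clear() if it changes.
@lru_cache(maxsize=1000)
def cached_match(name):
    return fuzzy_match(name, known_names)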
