The Pythonic Way: semantic keyword clustering in python (a practical beginner’s guide)
I’ve been there. I’m staring at a 5,000-row CSV of keywords exported from Ahrefs or Semrush, and half of them seem to mean the exact same thing. “Best running shoes,” “top rated sneakers for running,” and “running shoes reviews” are technically different strings, but they share the same user intent.
Sorting this manually in a spreadsheet is a recipe for burnout. Using simple text matching (like grouping everything with the word “shoes”) misses the nuance of language. You need a way to group these keywords based on meaning, not just spelling.
In this guide, I’ll walk you through how I use semantic keyword clustering in python to turn messy lists into actionable content plans. We will cover the different approaches (SERP vs. embeddings), build a working pipeline using safe defaults, and frankly discuss the trade-offs I’ve learned the hard way. This isn’t theoretical; it’s the exact workflow I use to build scalable site architectures.
Quick answer: What is semantic keyword clustering in Python?
Semantic keyword clustering is the process of using Natural Language Processing (NLP) to group keywords based on their underlying meaning and search intent rather than just shared words. By converting keywords into embeddings (numerical vectors that represent meaning) or analyzing shared URLs in Google search results, Python scripts can automatically cluster thousands of keywords into tight topical groups. The output is a clean list of “parent” topics and supporting keywords that you can turn directly into content briefs.
What I’ll help you build (inputs → clusters → SEO actions)
If we follow this process, here is what we are building toward:
- Input: A raw CSV of keywords (exported from your favorite SEO tool).
- Cleaning: A preprocessing step to normalize data without stripping away intent.
- Embedding: Converting text into vectors using a model like `all-MiniLM-L6-v2`.
- Clustering: Applying algorithms (KMeans or HDBSCAN) to group similar vectors.
- Labeling: Automatically naming the clusters so humans can understand them.
- Quality Check: Validating the output with metrics and common sense.
- Export: A final CSV mapping every keyword to a `cluster_id` and `suggested_page_title`.
Why semantic keyword clustering matters for SEO (especially for business sites)
When you are managing a massive site—whether it’s a SaaS blog or an ecommerce category structure—manual grouping is impossible. But beyond saving time, clustering solves specific architectural problems.
Signals clustering can improve:
- Topical Authority: You cover a topic completely by hitting all the nuances in a cluster, rather than scattering them across five weak pages.
- Keyword Cannibalization: By mapping all related terms to a single URL, you stop your own pages from fighting for rankings.
- Site Architecture: It reveals the natural hierarchy of your content (Hubs vs. Spokes).
- Content Planning Speed: You can brief 50 pages in an afternoon instead of a week.
I once worked on a project for a local service directory. We had 20,000 keywords. Before clustering, the team was writing separate articles for “emergency plumber” and “24 hour plumbing service.” After running a semantic clustering pipeline, we realized these were the same intent. We consolidated them into robust location pages, and the crawl efficiency improved almost immediately. It wasn’t magic; it was just cleaning up the mess.
Semantic vs lexical grouping: why TF‑IDF often falls short
Old-school methods relied on lexical similarity. If two phrases shared the word “best,” they might get grouped. Tools using TF-IDF (Term Frequency-Inverse Document Frequency) look at word overlaps.
The problem? “Apple bank” and “Apple fruit” share a word but have zero semantic relationship. Conversely, “lawyer” and “attorney” share no words but mean the same thing. Embedding-based clustering catches these relationships because it understands that “jogging sneakers” and “running shoes” exist in the same vector space, even if the letters differ.
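To make the failure mode concrete, here is a tiny stdlib-only sketch using word-level Jaccard similarity (a simple lexical measure) on the examples above:

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity: shared words / total unique words."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

# Synonymous phrases score zero because they share no words...
print(jaccard("best running shoes", "top jogging sneakers"))  # 0.0

# ...while unrelated phrases score higher just for sharing "apple"
print(jaccard("apple bank", "apple fruit"))
```

Embeddings invert both results: the synonym pair lands close together in vector space, and the "apple" pair lands far apart.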
Pick your method: SERP-based vs embedding-based vs hybrid keyword clustering in Python
Before you write a line of code, you have a choice to make. There are generally three ways to do this in Python. I’ve used all three, and the right choice depends on your budget and dataset size.
| Method | Best For | Data Required | Pros | Cons |
|---|---|---|---|---|
| SERP-Based | High accuracy, high-stakes pages | Live Google Search Results | Extremely accurate reflection of Google’s current view. | Slow; expensive (API costs); hard to scale past 5k keywords. |
| Embedding-Based | Large scale (10k+ keywords), zero cost | Just the keywords | Fast; runs locally; free; great for topical discovery. | Can miss subtle SERP intent shifts (e.g., informational vs transactional). |
| Hybrid (BERTopic) | Interpretability & Research | Keywords + Preprocessing | Transparent; excellent labeling; “SDEC” style accuracy. | More complex code; harder to tune for beginners. |
SERP-based clustering (shared URLs → graph → communities)
This approach mimics Google. You fetch the top 10 results for every keyword. If Keyword A and Keyword B share 4 of the same URLs in the top 10, they belong together.
- Fetch SERPs for all keywords (using an API like SerpApi or DataForSEO).
- Build a network graph where keywords are nodes and shared URLs are edges.
- Run a community detection algorithm (like Louvain) to find the clusters.
The friction: API quotas. Pulling 10,000 SERPs costs money and takes time. I usually reserve this for validating my most important “money” keywords.
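The steps above can be sketched with stdlib Python alone. The SERP data below is hypothetical, and I use a simple union-find over shared-URL edges in place of Louvain; a production pipeline would typically build the graph in networkx and run a real community detection algorithm:

```python
from itertools import combinations

# Hypothetical SERP data: keyword -> set of top-10 result URLs
serps = {
    "emergency plumber": {"a.com", "b.com", "c.com", "d.com", "e.com"},
    "24 hour plumbing service": {"a.com", "b.com", "c.com", "d.com", "f.com"},
    "best running shoes": {"x.com", "y.com", "z.com"},
}
MIN_SHARED = 4  # URLs two keywords must share to count as an edge

# Union-find: merge keywords connected by shared-URL edges
parent = {k: k for k in serps}
def find(k):
    while parent[k] != k:
        parent[k] = parent[parent[k]]  # path compression
        k = parent[k]
    return k

for a, b in combinations(serps, 2):
    if len(serps[a] & serps[b]) >= MIN_SHARED:
        parent[find(a)] = find(b)

# Collect keywords by their root into final clusters
clusters = {}
for k in serps:
    clusters.setdefault(find(k), []).append(k)
print(list(clusters.values()))
```

The two plumber keywords share four URLs and merge into one cluster; the shoe keyword stays alone.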
Embedding-based clustering (SentenceTransformers → vectors → clusters)
This is the method I recommend for most intermediate users. You use a pre-trained model (like SentenceTransformers) to turn text into numbers, then use math to group the numbers. It’s fast, free, and runs on your laptop.
Rule of thumb: Use KMeans if you know exactly how many pages you want to create (e.g., “I need 50 blog posts”). Use HDBSCAN or community detection if you want the data to tell you how many clusters exist naturally (handling outliers better).
Hybrid pipelines for interpretability (e.g., BERTopic-style workflows)
If you need to explain to a CMO why “software” and “platform” are in the same cluster, hybrid models help. They combine embeddings with c-TF-IDF (class-based TF-IDF) to generate very clear, specific labels for each cluster. This is what I reach for when stakeholders demand transparency on the “why” behind the grouping.
Step-by-step: semantic keyword clustering in python using embeddings (beginner-friendly)
Let’s build a pipeline that actually runs. I’ll walk you through the logic using the SentenceTransformers library and standard clustering algorithms. Assume you have a CSV file named `keywords.csv` with a column header `keyword`.
Step 0: Define the SEO goal (what the clusters are for)
Before coding, pause. The algorithm doesn’t know your business strategy.
- Are these for landing pages? (You want tight, transactional clusters).
- Are these for blog posts? (You can afford broader, informational clusters).
- Is the intent mixed?
When I skip this step, I often end up with clusters that look mathematically perfect but are useless for the content team. Define your “granularity” goal first.
Step 1: Prepare and clean the keyword list
Garbage in, garbage out. But be careful not to over-clean.
- Deduping: Remove exact duplicates.
- Normalization: Lowercase everything.
- ASCII conversion: Remove weird characters.
One major gotcha: Don’t aggressively remove stop words or location modifiers blindly. In local SEO, “plumber near me” and “plumber” have different SERP features. I usually keep a raw_keyword column and a clean_keyword column for the model to read.
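A minimal pandas sketch of that two-column approach (the column names are my convention, not a standard):

```python
import pandas as pd

# Toy input standing in for your keywords.csv export
df = pd.DataFrame({"keyword": [
    "Best Running Shoes",
    "best running shoes ",   # trailing space + case duplicate
    "Plumber Near Me",
]})

# Keep the original string, normalize into a separate column
df["raw_keyword"] = df["keyword"]
df["clean_keyword"] = (
    df["keyword"]
    .str.lower()
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)  # collapse repeated whitespace
)

# Dedupe on the clean form, keeping the first raw spelling
df = df.drop_duplicates(subset="clean_keyword").reset_index(drop=True)
print(df[["raw_keyword", "clean_keyword"]])
```

Note that nothing here strips stop words or modifiers, so "plumber near me" survives intact.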
Step 2: Create embeddings (my safe default model and why)
We need to turn words into vectors. I recommend starting with the `all-MiniLM-L6-v2` model. It offers the best balance of speed and performance for English SEO tasks.
```python
import pandas as pd
from sentence_transformers import SentenceTransformer

# Load the keyword list from your export
keyword_list = pd.read_csv("keywords.csv")["keyword"].tolist()

# Load the model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Create embeddings (batching helps memory)
embeddings = model.encode(keyword_list, batch_size=64, show_progress_bar=True)
```
| Model | Pros | Cons | Use When |
|---|---|---|---|
| all-MiniLM-L6-v2 | Fast, lightweight | Smaller context window | Standard SEO tasks |
| all-mpnet-base-v2 | Higher accuracy | Slower, more RAM | Nuanced B2B topics |
Step 3: Cluster the vectors (KMeans vs DBSCAN/HDBSCAN in plain English)
Now we group the vectors.
KMeans forces every keyword into a cluster. You have to tell it “Make 50 clusters.” This is fine if you are mapping to a known set of page types.
Fast Clustering (Community Detection) is often better for SEO because it uses thresholds. You tell the script: “If two keywords are 75% similar (cosine similarity >= 0.75), group them.”
```python
from sentence_transformers.util import community_detection

# Threshold is the lever you pull to change granularity:
# 0.75 is a good starting point, 0.90 is very tight, 0.60 is very broad.
clusters = community_detection(embeddings, min_community_size=3, threshold=0.75)
```
If you expect a lot of “long-tail oddballs” that don’t fit anywhere, density-based clustering (HDBSCAN) is superior because it creates a “noise” bucket for irrelevant terms, keeping your main clusters clean.
Step 4: Label clusters so humans can use them
A cluster ID like Cluster 42 is useless to a writer. You need a label.
- Centroid Method: Find the keyword that is mathematically closest to the center of the cluster.
- Frequency Method: Find the most common bigram (two-word phrase) in the cluster.
I usually verify labels manually. If the label is “best,” the clustering was too loose. If the label is “best running shoes for flat feet 2024,” it’s nice and specific.
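The centroid method fits in a few lines of numpy. `label_cluster` is a hypothetical helper of my own, and the toy 2-D vectors stand in for real embeddings:

```python
import numpy as np

def label_cluster(keywords, vectors):
    """Name a cluster after the keyword closest to its centroid."""
    centroid = vectors.mean(axis=0)
    # Cosine similarity of each keyword vector to the centroid
    sims = (vectors @ centroid) / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(centroid)
    )
    return keywords[int(np.argmax(sims))]

# Toy vectors: the middle keyword sits between the other two,
# so it is nearest the centroid and wins the label
kws = ["running shoes", "best running shoes", "running shoes reviews"]
vecs = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]])
print(label_cluster(kws, vecs))  # best running shoes
```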
Step 5: Export outputs (CSV schema I actually use)
Future-you will thank you when you rerun this next quarter if you keep a clean schema. Here is the CSV structure I recommend:
| Column | Meaning | Example Value |
|---|---|---|
| cluster_id | Unique ID for the group | 104 |
| cluster_label | The representative name | email marketing software |
| keyword | The specific search term | best email tools for small business |
| intent | (Optional) User intent | Commercial |
| search_volume | From your SEO tool | 1,200 |
How I validate cluster quality (before I trust it for SEO decisions)
Running the script is easy. Trusting the output is hard. How do you know if the clusters are “good”? I don’t just trust the math; I use a mix of metrics and eye tests.
I approve a cluster set when:
- Silhouette Score is decent: This measures how similar an object is to its own cluster compared to other clusters. A score above 0.5 is generally solid for SEO data.
- Intents are consistent: I shouldn’t see “buy laptop” (Transactional) and “history of computers” (Informational) in the same group.
- Outliers are manageable: If 50% of my keywords are in the “Unclustered” bucket, my threshold (0.75) was likely too high.
- Cluster sizes look natural: I prefer a distribution where a few “head” terms have large clusters, and long-tail terms have smaller ones. If every cluster has exactly 5 keywords, something is artificial.
If I’m honest, I usually spot-check the top 10 clusters by volume manually. If those look sane, the rest of the dataset is usually safe enough to work with.
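Computing the silhouette score takes one scikit-learn call. This sketch uses synthetic, well-separated vectors, so it scores near the top of the range; real SEO data will land lower:

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two well-separated toy "clusters" standing in for keyword embeddings
vectors = np.vstack([
    rng.normal(0.0, 0.05, size=(30, 8)),
    rng.normal(1.0, 0.05, size=(30, 8)),
])
labels = np.array([0] * 30 + [1] * 30)

# Ranges from -1 (bad) to 1 (tight, well-separated clusters)
score = silhouette_score(vectors, labels)
print(round(score, 2))  # above ~0.5 is solid for real SEO data
```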
A practical review loop: metrics → manual spot checks → parameter tweaks
- Run with default threshold (0.75). Check the number of clusters generated.
- Spot check 5 random clusters. Are the keywords actually related?
- Adjust. If clusters are too mixed, raise threshold to 0.80. If clusters are too fragmented (splitting hairs), lower to 0.70. Rinse and repeat.
Scaling semantic clustering to 10,000+ keywords (without melting your laptop)
When you move from a sample dataset to a full site audit (10k, 50k, or 100k keywords), simple scripts will crash your RAM.
Performance Checklist:
- Batch Embeddings: Never encode the whole list at once. Process in batches of 64 or 128 rows.
- Use Faiss: For massive datasets, brute-force cosine similarity calculations are too slow. Facebook AI Similarity Search (Faiss) allows approximate nearest neighbor search, which is orders of magnitude faster.
- GPU Acceleration: If you have access to a CUDA-enabled GPU (like on Google Colab), use it. It can speed up embedding generation by 10x-20x.
- Cache your embeddings: Save the embeddings to a `.npy` or `.pickle` file. You don’t want to pay (in time) to re-encode the same text just because you changed a clustering parameter.
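The caching step is a single `np.save`/`np.load` pair (the filename here is a placeholder, and the random array stands in for real embeddings):

```python
import numpy as np

# Stand-in for embeddings you just paid (in time) to compute;
# 384 dimensions matches all-MiniLM-L6-v2 output
embeddings = np.random.rand(1000, 384).astype(np.float32)

# Cache to disk so a threshold tweak doesn't force re-encoding
np.save("embeddings.npy", embeddings)

# On the next run, load instead of re-encoding
cached = np.load("embeddings.npy")
assert np.array_equal(embeddings, cached)
```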
When SERP-based clustering becomes expensive (and what to do instead)
If you have 10,000 keywords, fetching live SERPs for all of them might cost you $20-$100 depending on your API provider, and it will take hours. For small teams, this adds up. I recommend a hybrid workflow: Use embeddings (free) to do the heavy lifting and group 90% of the keywords. Then, only use SERP data to validate the top 20 most critical clusters where accuracy is non-negotiable.
Using clusters for real SEO work: content briefs, internal linking, and automated publishing
Congratulations, you have a CSV with clean clusters. Now what? This is where data science meets content strategy. I use these clusters to drive three specific outcomes.
Once I have clusters, an AI SEO tool workflow helps me turn them into repeatable briefs and publish with editorial control. For example, if I generate 30 clusters for a new service line, I don’t want to manually write 30 document specs. I want to automate the handoff.
| Cluster Output | Page Type | Brief Elements | KPI |
|---|---|---|---|
| Informational (How-to) | Blog Post | FAQs, H2 structure, Semantic entities | Organic Traffic |
| Transactional (Best X) | Commercial Page | Comparison table, Features, Price | Conversion Rate |
AI article generators can ingest these structured clusters to produce first drafts, while an automated blog generator can help schedule them out. But the strategy starts with the cluster.
Template: the 1-page content brief per cluster
For every cluster, I generate a brief that includes:
- Primary Keyword: The highest volume term in the cluster.
- Page Goal: Defined by the dominant intent of the group.
- Secondary Keywords: The other members of the cluster (sprinkle these naturally).
- Suggested H2s: Questions found in the cluster (e.g., “keywords that start with ‘how'”).
- Internal Link Targets: Which other clusters is this one mathematically close to?
Internal linking map: hub-and-spoke from cluster relationships
Since we have vectors, we can calculate which clusters are “neighbors.” If Cluster A (Email Marketing) is close to Cluster B (Newsletter Tips), they should link to each other. I usually export a “Links Map” spreadsheet for my writers: “If you are writing page A, you must link to Page B.” It removes the guesswork.
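A minimal sketch of finding a cluster's nearest neighbor by centroid cosine similarity. The centroids here are toy 3-D vectors, and `nearest_cluster` is a hypothetical helper, not part of any library:

```python
import numpy as np

# Hypothetical centroids: the mean of each cluster's keyword vectors
centroids = {
    "email marketing": np.array([0.9, 0.1, 0.0]),
    "newsletter tips": np.array([0.8, 0.2, 0.1]),
    "running shoes": np.array([0.0, 0.1, 0.9]),
}

def nearest_cluster(name):
    """Return the other cluster whose centroid is most cosine-similar."""
    a = centroids[name]
    best, best_sim = None, -1.0
    for other, b in centroids.items():
        if other == name:
            continue
        sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        if sim > best_sim:
            best, best_sim = other, sim
    return best

print(nearest_cluster("email marketing"))  # newsletter tips
```

Export each cluster's nearest neighbor alongside the brief and the writer knows exactly which page to link to.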
Common mistakes, FAQs, and next steps
Even with the best scripts, things go wrong. Here is what typically trips me up.
Mistakes & fixes (the ones I see most often)
- Over-cleaning data:
Fix: Stop removing words like “best” or “2024” during preprocessing. They often define the intent.
- Choosing the wrong threshold:
Fix: Don’t set it and forget it. If your clusters are huge and vague, increase your similarity threshold (e.g., from 0.70 to 0.85).
- Ignoring Brand terms:
Fix: Separate brand keywords from non-brand keywords before clustering, or they will clutter your topical maps.
- Trusting the machine blindly:
Fix: Always have a human review the “Cluster Label.” If the label makes no sense, the cluster is likely garbage.
FAQs (tools, scale, and practical usage)
What Python libraries do I need?
At a minimum: pandas for data handling, sentence_transformers for embeddings, and scikit-learn (or hdbscan) for clustering. semantic-clustify is a great wrapper tool if you prefer a CLI approach.
Can I do this with 100,000 keywords?
Yes, but you need to use batching and likely FAISS for indexing. Don’t try to run a standard similarity matrix on 100k rows on a standard laptop; you will run out of memory.
Is Embedding clustering better than SERP clustering?
It is faster and cheaper, making it better for scale. SERP clustering is more accurate for distinguishing subtle intent differences but doesn’t scale well due to costs.
Recap + what I’d do next (a simple action plan)
Semantic keyword clustering in Python isn’t just a fancy trick; it’s a survival skill for modern SEO. It allows you to move from guessing to engineering your site architecture.
Your next moves:
- Start small: Export top 500 keywords for your site today.
- Run the pipeline: Use the `all-MiniLM-L6-v2` model and a standard threshold (0.75).
- Validate: Check the clusters. Do they make sense?
- Operationalize: Turn the top 5 clusters into content briefs and get them into production.
Once you trust the process, you can scale it to the rest of your domain. Good luck, and happy clustering.