Cosine similarity in search engine optimization

Last updated: Apr 01, 2026

Before diving in: you don’t have to be a math nerd to understand this topic. If you want your page ranked among the top results and featured in AI overviews, all you really need to understand is what the numbers between -1 and 1 mean.

A value of -1 means two items are completely dissimilar, while values between 0.7 and 1 mark keywords or pages worth using once you have checked their cosine similarity.

What is cosine similarity?

Cosine similarity is the cosine of the angle between two non-zero vectors; it measures how closely the vectors point in the same direction.

For a vector 𝐀 and a vector 𝐁 with an angle 𝚹 between them, the cosine similarity is Cos(𝚹).

The mathematical formula

Cos(𝚹) = (𝐀⋅𝐁) / (∥𝐀∥ ∥𝐁∥)

Where 𝐀⋅𝐁 is the dot product of the two vectors, which tells you how strongly one vector projects onto the other.

∥𝐀∥ ∥𝐁∥ is the product of the lengths (magnitudes) of vector 𝐀 and vector 𝐁.

Example: let’s find the cosine similarity of the two-dimensional vectors 𝐀 = (3, 4) and 𝐁 = (4, 2) in the diagram below.

graph for cosine similarity calculation

Cos(𝚹) = (𝐀⋅𝐁) / (∥𝐀∥ ∥𝐁∥)

𝐀⋅𝐁 = 3×4 + 4×2 = 20

∥𝐀∥ ∥𝐁∥ = √(3² + 4²) × √(4² + 2²) = √25 × √20 = 10√5

Cos(𝚹) = 20 / (10√5) ≈ 0.8944
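This arithmetic is easy to verify in a few lines of Python. The sketch below uses only the standard library and the same vectors 𝐀 = (3, 4) and 𝐁 = (4, 2) as the worked example:

```python
import math

def cosine_similarity(a, b):
    """Cos(theta) = (A . B) / (||A|| * ||B||)"""
    dot = sum(x * y for x, y in zip(a, b))          # dot product A . B
    norm_a = math.sqrt(sum(x * x for x in a))       # magnitude ||A||
    norm_b = math.sqrt(sum(x * x for x in b))       # magnitude ||B||
    return dot / (norm_a * norm_b)

A = (3, 4)
B = (4, 2)
print(round(cosine_similarity(A, B), 4))  # 0.8944
```

The same function works unchanged for vectors of any dimension, which matters once keywords are represented in high-dimensional spaces later in the article.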

The measurement always ranges from -1 to 1, where 1 means the vectors point in the same direction (very similar) and -1 means they point in opposite directions (completely dissimilar).

So when it comes to words, a cosine similarity of 0.5 indicates a moderate level of similarity: the words share some common features or terms but are not identical.

Cosine similarity significance in SEO

Applying this simple algebra during keyword research gives you closely related terms that match the same search intent, so you can optimize one page for them instead of picking at random from a broad keyword list.

link between cosine similarity and search results

Practical use of cosine similarity

Apart from keyword research analysis, cosine similarity is useful in the following SEO tasks:

  1. Comparing your page with competitors’ pages – Combine term frequency-inverse document frequency (TF-IDF) with cosine similarity to see which terms matter on their pages and how similar those pages are to yours.
  2. SERP analysis – Using queries from Google Search Console (GSC) that earn impressions, compare how similar search engines consider those queries to your page and to pages ranking above you. The results hint at whether to improve your content quality or work on authority.
  3. Avoiding keyword cannibalization – Identify pages on your site that compete for the same keyword and might confuse search engines.
  4. Interlinking decisions – Find words on one page that make good anchor text for a link pointing to a related page on your site.
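As an illustration of the first task, TF-IDF weighting plus cosine similarity can be sketched in plain Python. The page snippets below are made up for the example, and the smoothed IDF formula is one common choice (the one scikit-learn uses), not the only option:

```python
import math
from collections import Counter

# Toy "pages" standing in for your page and a competitor's
# (hypothetical snippets, not real site content).
docs = {
    "my_page": "affordable seo services for small business websites",
    "competitor": "seo services and seo audit for small business",
}

def tfidf_vectors(texts):
    """Build TF-IDF vectors over a vocabulary shared by all documents."""
    tokenized = {name: text.split() for name, text in texts.items()}
    vocab = sorted({w for toks in tokenized.values() for w in toks})
    n_docs = len(tokenized)
    # document frequency: in how many documents each word appears
    df = {w: sum(w in toks for toks in tokenized.values()) for w in vocab}
    vectors = {}
    for name, toks in tokenized.items():
        counts = Counter(toks)
        vectors[name] = [
            # term frequency * smoothed inverse document frequency
            (counts[w] / len(toks)) * (math.log((1 + n_docs) / (1 + df[w])) + 1)
            for w in vocab
        ]
    return vectors

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

vecs = tfidf_vectors(docs)
print(round(cosine(vecs["my_page"], vecs["competitor"]), 3))
```

A score close to 1 means the competitor covers largely the same weighted terms as your page; a low score points at term gaps you may want to close.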

How search engines use numerical vectors

Search engines are machines that work primarily with numbers, so text must be turned into numerical vectors before its semantic meaning can be compared. By finding words or phrases with the highest cosine similarity, you can optimize your page to rank for the queries related to those words.

Converting text to vectors

Phrases can be converted to vectors by counting how often each word occurs in them.

For example, if we have β€œSEO services” and β€œSEO audit” as keywords in our corpus, we can create a vocabulary {SEO, services, audit} from them after tokenization.

To get vectors of these two keywords, we will count how many times a word in our vocabulary appears in a keyword:

SEO services → [1, 1, 0]

SEO audit → [1, 0, 1]
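This conversion is a simple bag-of-words count, which can be sketched as:

```python
def bag_of_words(phrase, vocabulary):
    """Count how many times each vocabulary word appears in the phrase."""
    tokens = phrase.lower().split()
    return [tokens.count(word.lower()) for word in vocabulary]

# Vocabulary built from the two keywords after tokenization
vocabulary = ["SEO", "services", "audit"]
print(bag_of_words("SEO services", vocabulary))  # [1, 1, 0]
print(bag_of_words("SEO audit", vocabulary))     # [1, 0, 1]
```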

From the conversions above, we now have numerical vectors for our keywords and can calculate their cosine similarity to find how similar they are.

cosine similarity of two phrases: 'SEO services' and 'seo audit'

Let

SEO services → 𝐀

SEO audit → 𝐁

Cos(𝚹) = (𝐀⋅𝐁) / (∥𝐀∥ ∥𝐁∥)

𝐀⋅𝐁 = 1×1 + 1×0 + 0×1 = 1

∥𝐀∥ ∥𝐁∥ = √2 × √2 = 2

Cos(𝚹) = 1/2 = 0.5

Word embedding

Doing the math above manually would be tedious, and nearly impossible in real-world scenarios with high-dimensional vector spaces.

The manual approach is also weak for short keywords because it ignores important factors such as word weighting and subword relationships. This is where word embedding comes in.

Word embedding is the computational implementation of the distributional hypothesis, where computers use natural language processing (NLP) to convert words in a corpus into numerical vectors to understand their semantic meaning and relationships.

There are many tools and libraries for word embedding, such as Word2Vec, developed by Google researchers. Word2Vec uses the continuous bag-of-words (CBOW) and skip-gram approaches to learn static word representations based on context.

Implementing this technique is practical because Google itself has used word embeddings in its search systems.

Implementing word embedding

I have tried both Word2Vec and Transformer-based embedding models, but I settled on the Transformer-based approach because I had access to the pretrained model all-MiniLM-L6-v2, which encodes phrases efficiently via sentence-transformers.

So, how do you get the generated word embeddings?

For example, suppose I want to extract keywords from competitors in Kenya who offer SEO services, determine the similarities between these keywords, and use them to optimize my page.

To save time and effort, I have already written a Python script using a Transformer-based embedding model.

Script to generate keyword embeddings

The script extracts keywords from competitor websites, generates embeddings using a Transformer-based model, and calculates cosine similarity between the keywords.


import requests
import re
import nltk
import pandas as pd
from bs4 import BeautifulSoup
from nltk.util import ngrams
from collections import Counter
from sentence_transformers import SentenceTransformer, util

nltk.download("punkt")

# Competitor pages to mine for keywords
urls = {
    "competitor_1": "https://kenseo.co.ke/",
    "competitor_2": "https://artlydigitalmarketing.co.ke/top-seo-agency-in-kenya/",
    "competitor_3": "https://www.seosmart.co.ke/",
}

MIN_PHRASE_FREQ = 2   # keep phrases that appear at least this many times
MIN_PHRASE_LEN = 12   # keep phrases at least this many characters long
NGRAM_RANGE = (2, 4)  # extract 2- to 4-word phrases

def fetch_clean_text(url):
    """Download a page and return its visible text, lowercased."""
    headers = {"User-Agent": "Mozilla/5.0"}
    html = requests.get(url, headers=headers, timeout=15).text
    soup = BeautifulSoup(html, "lxml")
    # Drop non-content elements before extracting text
    for tag in soup(["script", "style", "nav", "footer", "header", "noscript"]):
        tag.decompose()
    text = soup.get_text(separator=" ")
    text = re.sub(r"\s+", " ", text)
    return text.lower()

def extract_phrases(text, min_n=2, max_n=4):
    """Return all n-grams (min_n to max_n words) found in the text."""
    tokens = re.findall(r"[a-zA-Z]+", text)
    phrases = []
    for n in range(min_n, max_n + 1):
        phrases.extend([" ".join(g) for g in ngrams(tokens, n)])
    return phrases

# Collect candidate phrases from every competitor page
all_phrases = []
for name, url in urls.items():
    text = fetch_clean_text(url)
    phrases = extract_phrases(text, *NGRAM_RANGE)
    all_phrases.extend(phrases)
    print(f"Processed: {name}")

# Keep only frequent, sufficiently long phrases as keywords
phrase_counts = Counter(all_phrases)
keywords = [
    phrase for phrase, count in phrase_counts.items()
    if count >= MIN_PHRASE_FREQ and len(phrase) >= MIN_PHRASE_LEN
]

# Embed the keywords and compute the pairwise cosine-similarity matrix
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(keywords, convert_to_tensor=True)
cosine_matrix = util.cos_sim(embeddings, embeddings)

# Export the keyword-by-keyword similarity matrix to Excel
similarity_df = pd.DataFrame(
    cosine_matrix.cpu().numpy(),
    index=keywords,
    columns=keywords
)
similarity_df.to_excel("competitor_keyword_similarity2.xlsx")


Keyword analysis using cosine similarity

So what I basically always do is search for something in my niche on Google, such as “SEO services in Kenya”, and note the top-ranking websites for that phrase. I then collect the exact URLs that ranked, maybe the top 5 or 10, because that is where you want to be.

You can also use tools such as Semrush to get these URLs, but having your own primary data is always better.

With the URLs available, just plug them into the script and generate a full list of keywords with the cosine similarity calculated between each pair.

Your job now is simply to sort out the ones with high cosine similarity and use them to optimize your page.
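With pandas, that filtering step takes only a few lines. The similarity matrix below is a made-up stand-in for the spreadsheet the script exports; the keyword names and scores are illustrative only:

```python
import pandas as pd

# Toy similarity matrix standing in for the exported Excel file
# (keywords and scores are invented for illustration).
keywords = ["affordable seo", "cheap seo services", "seo audit", "link building"]
similarity_df = pd.DataFrame(
    [
        [1.00, 0.82, 0.55, 0.31],
        [0.82, 1.00, 0.48, 0.29],
        [0.55, 0.48, 1.00, 0.40],
        [0.31, 0.29, 0.40, 1.00],
    ],
    index=keywords,
    columns=keywords,
)

def related_keywords(df, seed, threshold=0.5):
    """Return keywords whose similarity to `seed` meets the threshold,
    excluding the seed itself, sorted from most to least similar."""
    scores = df[seed].drop(seed)
    return scores[scores >= threshold].sort_values(ascending=False)

print(related_keywords(similarity_df, "affordable seo", threshold=0.5))
```

Raising the threshold shrinks the list to the most closely related keywords, which is exactly the filtering described below.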

For example, I got 774 keywords from three competitors. When I chose the one I thought was most relevant to my page, i.e., “affordable SEO,” and filtered for related keywords with a cosine similarity greater than or equal to 0.5, I ended up with just 191 keywords that I can use on the same page.

Excel sheet showing 191 relevant keywords filtered by cosine similarity from 774 competitor keywords for 'affordable SEO'

I can tighten my filter criteria to only the most closely related keywords, e.g., a cosine similarity greater than or equal to 0.7, and get 16 keywords that target the same search intent.

Excel sheet showing the 16 keywords with a cosine similarity of at least 0.7

Hopefully I am not the only math nerd who finds this interesting. I hope you find it helpful and time-saving as well. Cheers!