Moving Beyond Keywords with Vector Embeddings

Published on Thu Jul 10 2025

In my previous post, we built a simple search engine from scratch. We dove into the three pillars: the crawler, the indexer, and the ranker. While it was a great way to understand the fundamentals, our search engine had a critical limitation: it couldn't grasp meaning. It could find pages containing the exact word "dog," but it couldn't understand a query about "puppy" or "bark."

One way to patch this is by using a synonym list. We could manually tell our engine that "puppy" is related to "dog," or "laptop" is related to "computer". But this approach is fragile and doesn't scale. What about the subtle differences between "house" and "home"? Or the conceptual link between "king" and "queen"? Manually mapping every relationship is an impossible task. We need a system that can learn these relationships automatically.

This post tackles that problem. We'll explore the technology that powers semantic search: vector embeddings. This is where we move beyond simple keyword matching to understanding the intent and context behind a query. We're going to explore how a search engine can learn that "workforce restructuring" and "employee layoff" are talking about the same thing.

I’ve spent the last four years working in the search industry, with a big chunk of that time focused specifically on vector search. In the world of vector search, we're so often focused on performance, scalability, and the latest algorithms. We rarely step back to ask the most fundamental question: How does a list of numbers actually capture meaning in the first place?

So, What is a Vector?

Let's start with a simple analogy. Imagine you're describing a shoe. You might rate it on a few different scales:

  • How comfortable is it (from 1 to 10)?
  • How stylish is it (from 1 to 10)?
  • How durable is it (from 1 to 10)?
  • How expensive is it (from 1 to 10)?

Now you can represent any shoe as a vector, like {Comfort: 8, Style: 7, Durability: 6, Price: 5}. This is a four-dimensional vector where each dimension represents a different aspect of the shoe. So vectors are just lists of numbers that represent different features of an object.
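To make this concrete, here's a tiny sketch in Python (using NumPy, with shoe ratings I made up purely for illustration) that represents a few shoes as vectors and measures how close they are:

```python
import numpy as np

# Each shoe is a vector of [comfort, style, durability, price] ratings (1-10).
# The ratings are invented for illustration.
running_shoe = np.array([8, 7, 6, 5])
trail_shoe = np.array([7, 6, 8, 6])
dress_shoe = np.array([4, 9, 5, 8])

def cosine_similarity(a, b):
    """Similarity based on the angle between two vectors (closer to 1 = more alike)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(running_shoe, trail_shoe))  # higher: these shoes are alike
print(cosine_similarity(running_shoe, dress_shoe))  # lower: these shoes differ more
```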

If we plotted these on a graph, shoes with similar numbers, and therefore similar vectors, would be grouped together.

Now, let's apply this same idea to words. Instead of four dimensions like "Comfort" and "Style," we can use hundreds of dimensions that represent different aspects of a word's meaning. While we can't picture 300 dimensions in our heads, a computer has no problem working with the math.

A vector embedding is simply a word represented by a list of numbers (a vector) in this multi-dimensional space.

How Can Numbers Represent Meaning?

The core idea behind creating these vectors is based on a simple observation from linguist John Rupert Firth:

You shall know a word by the company it keeps.

You can understand someone by looking at the friends they hang out with. Similarly, you can understand a word by looking at the words that often appear around it. This is the foundation of vector embeddings.

Models like Word2Vec and GloVe are trained on massive amounts of text, like all of Wikipedia. They look at which words tend to appear in similar contexts. Words that show up in the same kinds of sentences will be given similar vectors.

  • The word "cat" often appears near "meow," "purr," "kitten," and "feline."
  • The word "dog" often appears near "bark," "puppy," and "canine."

Because of this, the vectors for "cat" and "kitten" will be very close to each other in this meaning space. The vector for "cat" will be further from "dog," and much further from "car."
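If you want to see this for yourself, here's a minimal sketch assuming you have gensim installed and can download its pretrained "glove-wiki-gigaword-50" word vectors (any similar pretrained embedding would do):

```python
import gensim.downloader as api

# Downloads pretrained GloVe vectors the first time (50 dimensions per word).
vectors = api.load("glove-wiki-gigaword-50")

print(vectors.similarity("cat", "kitten"))  # related words score high
print(vectors.similarity("cat", "car"))     # unrelated words score much lower
print(vectors.most_similar("cat", topn=5))  # nearest neighbors in the meaning space
```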

This creates a space where the distance and direction between vectors represent relationships between words. This allows for the famous example of vector math:

vector('King') - vector('Man') + vector('Woman') ≈ vector('Queen')

By taking the vector for 'King', subtracting the concept of 'Man', and adding the concept of 'Woman', we end up with a vector that is very close to the one for 'Queen'. This shows that these vectors capture the relationships between words, not just the words themselves. It's mind-blowing that we can do arithmetic on words the same way we do on numbers.
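Continuing with the same pretrained GloVe vectors as a sketch, gensim's most_similar exposes exactly this arithmetic: it adds the "positive" vectors, subtracts the "negative" ones, and returns the words nearest to the result:

```python
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # same pretrained vectors as before

# vector('King') - vector('Man') + vector('Woman') ≈ ?
for word, score in vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3):
    print(word, round(score, 3))
# 'queen' should show up at or near the top of the list.
```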

But What Do These 300 Dimensions Mean?

For our shoe vector, we defined the dimensions ourselves: "Comfort," "Style," and so on. But for word vectors, the dimensions are not so cleanly labeled. The model learns them automatically.

You can think of these as "latent" or hidden features of meaning. One dimension might, through training, end up representing the concept of "royalty". So king, queen and prince would all have high values on this dimension. Another dimension might represent "animal-ness", and another might represent "softness".

A word like "kitten" would have high values on the "animal-ness" and "softness" dimensions, while a word "throne" would have high values on the "royalty" dimension but low on "softness". This is how the vectors capture subtle meanings and relationships between words.

From Words to Documents

We can create vectors for single words, but what about the millions of pages from our crawler example? A common starting point is to create a single vector for a document by averaging the vectors of all the words it contains.

Now, each page we crawled in the last post - like "https://jasir.dev/cat_story" - can be represented by one vector in this shared meaning space.
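Here's a minimal sketch of that averaging approach, again assuming the pretrained GloVe vectors from earlier. The tokenization is deliberately naive, and real systems use more sophisticated document embeddings, but the idea is the same:

```python
import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")

def embed_document(text):
    """Average the vectors of all words the model knows about."""
    words = [w for w in text.lower().split() if w in vectors]
    if not words:
        return np.zeros(vectors.vector_size)
    return np.mean([vectors[w] for w in words], axis=0)

doc_vector = embed_document("the cat sat by the window and ate some fish")
print(doc_vector.shape)  # (50,) -- one 50-dimensional vector for the whole page
```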

Search with Vectors

This leads to a completely different way of searching. Here’s how it works:

  1. Indexing: First, we turn every crawled document into a vector and store them.
  2. Querying: When a user enters a query like "what do felines eat?", we don't look for keywords. Instead, we turn the query itself into a vector using the same model.
  3. Searching: The search engine's job is to find the document vectors that are closest to the query vector. "Closeness" is typically measured with a similarity function such as cosine similarity, which looks at the angle between two vectors; a smaller angle means a more similar meaning. (See the sketch after this list.)
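Here's a minimal end-to-end sketch of those three steps, reusing the averaging idea from above. The documents (and the second URL) are made up for illustration:

```python
import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")

def embed(text):
    """Turn a piece of text into one vector by averaging its word vectors."""
    words = [w.strip("?,.!") for w in text.lower().split()]
    known = [w for w in words if w in vectors]
    return np.mean([vectors[w] for w in known], axis=0)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# 1. Indexing: turn every crawled document into a vector.
documents = {
    "https://jasir.dev/cat_story": "the cat eats fish every morning",
    "https://example.com/laptop_review": "this laptop has a fast processor and a bright screen",
}
index = {url: embed(text) for url, text in documents.items()}

# 2. Querying: embed the query with the same model.
query_vector = embed("what do felines eat?")

# 3. Searching: rank documents by how close their vectors are to the query vector.
ranked = sorted(index.items(),
                key=lambda item: cosine_similarity(query_vector, item[1]),
                reverse=True)
for url, doc_vector in ranked:
    print(round(cosine_similarity(query_vector, doc_vector), 3), url)
```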

This solves our original problem. The vector for "what do felines eat?" will be very close to the vector for our "cat eats fish" document, because the model learned that "feline" is close to "cat" and "eat" is close to "eats." We get a relevant result without an exact keyword match.

What's Next?

We now have a way to search based on meaning. But this introduces a new technical challenge. How do you efficiently compare a query vector to billions of document vectors to find the closest ones instantly? Checking them one by one (brute force) is far too slow.

That is the problem we'll explore in the next post. We'll look at Approximate Nearest Neighbor (ANN) search and the algorithms that make searching through billions of vectors practical.

Stay tuned!