Coding a Search Engine from Scratch

Published on Fri Jun 20 2025

Tools abstract complexity, but fundamentals unlock mastery.

You've used Google search countless times, but have you ever thought about how it works? Of course, we're not building the next Google (the title is a bit of clickbait!), but by understanding the core components, you'll gain a deeper appreciation for what goes on underneath that neat search bar. In this post, I'll tear apart the search engine architecture and rebuild it in pure Python, covering the basics of web crawling, indexing, and searching.

The 3 Pillars of a Search Engine

  1. The Crawler (Collects raw data)
  2. The Indexer (Organizes the data)
  3. The Ranker (Finds relevant results)

The Crawler (Collecting Raw Data)

The crawler navigates the web, fetching pages and extracting links to follow next. This is the first step in building a search engine: the crawler is responsible for discovering and downloading the web pages that will later be indexed and searched. In the real world, crawlers are complex systems that handle billions of pages, but we can build a simple one in Python to understand the basics.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse


def crawl(start_page, crawled_pages, visited_urls=None):
    """
    A simple web crawler that fetches pages and follows links.

    Args:
        start_page (str): The URL to start crawling from.
        crawled_pages (dict): A dictionary to store crawled pages, with URLs
            as keys and their text content as values.
        visited_urls (set): A set to keep track of visited URLs to avoid
            revisiting them.
    """
    # Use None as the default to avoid sharing a mutable default argument
    # across separate calls
    if visited_urls is None:
        visited_urls = set()

    # Avoid revisiting
    if start_page in visited_urls:
        return
    visited_urls.add(start_page)

    print(f"Crawling: {start_page}")
    try:
        response = requests.get(start_page, timeout=10)
        # Skip if the request was not successful
        if response.status_code != 200:
            return

        # Use BeautifulSoup to parse the HTML content
        soup = BeautifulSoup(response.text, 'html.parser')

        # Store the crawled page content
        crawled_pages[start_page] = soup.get_text()

        # Find all links in the page
        for link in soup.find_all('a', href=True):
            raw_link = link['href']
            # Make links absolute
            absolute_link = urljoin(start_page, raw_link)

            # Only crawl links on the same domain
            if urlparse(absolute_link).netloc != urlparse(start_page).netloc:
                continue

            # Skip specific extensions
            exclude_extensions = ('.jpg', '.jpeg', '.png', '.gif', '.css', '.js',
                                  '.svg', '.ico', '.woff', '.woff2')
            if any(absolute_link.endswith(ext) for ext in exclude_extensions):
                continue

            # Recursively crawl the link
            crawl(absolute_link, crawled_pages, visited_urls)
    except requests.RequestException as e:
        print(f"Failed to crawl {start_page}: {e}")

This crawler is a simplified version of what search engines do, and it leaves out quite a lot. Here are a few things to consider for starters:

  1. Handling JavaScript-heavy sites: Most modern websites rely on JavaScript to render content, which this simple crawler won't handle.
  2. Respecting robots.txt: Search engines check this file to see which pages they are allowed to crawl.
  3. Rate limiting: To avoid overwhelming a server, crawlers should implement delays between requests (see the sketch after this list, which also covers robots.txt).
  4. Error handling and retries: Real crawlers implement robust error handling and retry mechanisms for failed requests.
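
For points 2 and 3, here is a minimal sketch of how you might check robots.txt with Python's standard urllib.robotparser and add a fixed delay between requests. The helper names is_allowed and polite_get, the user-agent string, and the one-second delay are illustrative choices, not part of the crawler above.

import time

import requests
from urllib import robotparser
from urllib.parse import urlparse


def is_allowed(url, user_agent="MiniCrawler"):
    """Check the site's robots.txt before fetching a URL."""
    parts = urlparse(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    parser = robotparser.RobotFileParser()
    parser.set_url(robots_url)
    try:
        parser.read()
    except OSError:
        # If robots.txt can't be fetched, optimistically allow the crawl
        return True
    return parser.can_fetch(user_agent, url)


def polite_get(url, delay_seconds=1.0):
    """Fetch a URL only if robots.txt allows it, pausing before each request."""
    if not is_allowed(url):
        return None
    time.sleep(delay_seconds)  # crude rate limiting: a fixed delay per request
    return requests.get(url, timeout=10)

Inside crawl(), you could swap requests.get(start_page, timeout=10) for polite_get(start_page) and simply return when it gives back None.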

The Indexer (Organizing the Data)

Modern search engines use several indexing techniques, but the most fundamental one is the inverted index, which allows for fast lookups of documents containing specific words.

Now that our crawler can populate the crawled_pages dictionary with URLs and their text content, we can build our inverted index from its output.

def index(crawled_pages):
    inverted_index = {}
    for url, text in crawled_pages.items():
        # Split the text into words
        for word in text.split():
            # Normalize the word (e.g., lowercase)
            word = word.lower()
            if word not in inverted_index:
                inverted_index[word] = set()
            inverted_index[word].add(url)
    return inverted_index

Let's test this function to see what an inverted index looks like.

# Example of crawled pages from the crawler we wrote above
# Note: This is a simplified, hypothetical example to demonstrate how the indexer works.
crawled_pages = {
    "https://jasir.dev/cat_story": "cat eats fish",
    "https://jasir.dev/fish_story": "fish swims in water",
}

inverted_index = index(crawled_pages)
print(inverted_index)
# Output:
# {
#     'cat': {'https://jasir.dev/cat_story'},
#     'eats': {'https://jasir.dev/cat_story'},
#     'fish': {'https://jasir.dev/cat_story', 'https://jasir.dev/fish_story'},
#     'swims': {'https://jasir.dev/fish_story'},
#     'in': {'https://jasir.dev/fish_story'},
#     'water': {'https://jasir.dev/fish_story'}
# }

Now we can instantly look up which pages contain a specific word. For example, if you search for "fish", you can quickly find both cat_story and fish_story pages.

print(inverted_index.get("fish", set()))
# Output: {"https://jasir.dev/cat_story", "https://jasir.dev/fish_story"}

This is a very basic indexer. Real-world search engines use more complex structures to handle synonyms, stemming, and other linguistic features. Here are some enhancements you can consider:

  • Better Tokenization: Instead of splitting by spaces, use a library like nltk or spaCy (or Lucene if you want to go full Java) to handle complex tokenization; a small plain-Python sketch follows this list.
  • Normalization: Convert words to lowercase, remove punctuation, and handle stemming or lemmatization.
  • Handling Stop Words: Common words like "the", "is", and "and" can be ignored to reduce index size and improve search relevance, although this is debatable in modern search engines, since even common words can carry meaning in some queries.
  • Phrase Indexing: Store phrases or n-grams to support phrase searches.
  • Vectorization: Use embeddings to represent documents and queries in a high-dimensional space, allowing for semantic search capabilities. This is where the search industry is heading, moving beyond simple keyword matching to understanding the meaning behind queries.
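
To make the first three points concrete, here is a minimal plain-Python sketch of better tokenization and normalization. The tiny stop-word list and the index_v2 name are purely illustrative; a real system would lean on nltk or spaCy for stemming and smarter tokenization.

import re

# A tiny, illustrative stop-word list; real lists are much longer
STOP_WORDS = {"the", "is", "and", "a", "an", "of", "to", "in"}


def tokenize(text):
    """Lowercase the text, strip punctuation, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())  # keep alphanumeric runs only
    return [word for word in words if word not in STOP_WORDS]


def index_v2(crawled_pages):
    """Same inverted index as before, built from normalized tokens."""
    inverted_index = {}
    for url, text in crawled_pages.items():
        for word in tokenize(text):
            inverted_index.setdefault(word, set()).add(url)
    return inverted_index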

The Ranker (Determining Relevance)

The final step is to search the index. A naive search simply returns all pages containing the query term, which is just a lookup in our inverted_index. Ideally, we would split the query into words and find pages that contain all of them, but for simplicity, let's assume the query is a single word.

def naive_search(query, inverted_index):
    # Normalize the query
    query = query.lower()
    return inverted_index.get(query, set())

However, this returns results in no particular order. To improve this, let's replace our naive search with one that ranks results based on term frequency. Term Frequency (TF) measures how often a term appears in a page. The more times a term appears, the more relevant the page is likely to be for that term.

def search(query, crawled_pages, inverted_index):
    # Normalize the query
    query = query.lower()
    if query not in inverted_index:
        return []

    # Get all pages that contain the query term
    matching_pages = inverted_index[query]

    scores = {}
    for url in matching_pages:
        # Count term frequency in the document
        page_text = crawled_pages[url]
        term_frequency = page_text.lower().split().count(query)
        scores[url] = term_frequency

    # Sort pages by term frequency in descending order
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

In the example above, we used term frequency to score results. However, this doesn't account for how common or rare a term is across the entire collection. That's where TF-IDF (Term Frequency-Inverse Document Frequency) comes in: a classic scoring scheme used in search engines, it increases the weight of rare but important terms while downplaying overly common ones.
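
As a rough illustration, here is a single-term TF-IDF scorer built on the same crawled_pages and inverted_index structures from above. The name tf_idf_search and the particular TF and IDF formulas are just one common choice; several variants exist.

import math


def tf_idf_search(query, crawled_pages, inverted_index):
    query = query.lower()
    matching_pages = inverted_index.get(query, set())
    if not matching_pages:
        return []

    total_docs = len(crawled_pages)
    # IDF: terms that appear in fewer documents get a higher weight
    idf = math.log(total_docs / len(matching_pages))

    scores = {}
    for url in matching_pages:
        words = crawled_pages[url].lower().split()
        # TF: term count normalized by document length
        tf = words.count(query) / len(words)
        scores[url] = tf * idf

    return sorted(scores.items(), key=lambda item: item[1], reverse=True)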

Modern search engines use a combination of techniques to rank documents, including machine learning models that take into account user behavior and other factors. As a next step beyond this oversimplified example, you can try digging into these more advanced ranking algorithms:

  • TF-IDF (Term Frequency-Inverse Document Frequency): As discussed above, this algorithm balances term frequency with document rarity.
  • BM25: A probabilistic model that improves upon TF-IDF by considering document length and term saturation (see the sketch after this list).
  • PageRank: Originally used by Google, this algorithm ranks pages based on the number and quality of links pointing to them.
  • Semantic Search: Leveraging embeddings and vector search to find documents based on meaning rather than exact matches.
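
To give a taste of BM25, here is a minimal single-term scorer using the commonly cited defaults k1 = 1.5 and b = 0.75. The name bm25_search is illustrative, and this sketch ignores multi-word queries and the many practical details a real implementation handles.

import math


def bm25_search(query, crawled_pages, inverted_index, k1=1.5, b=0.75):
    query = query.lower()
    matching_pages = inverted_index.get(query, set())
    if not matching_pages:
        return []

    total_docs = len(crawled_pages)
    avg_length = sum(len(text.split()) for text in crawled_pages.values()) / total_docs
    # IDF with the standard BM25 smoothing
    idf = math.log((total_docs - len(matching_pages) + 0.5) / (len(matching_pages) + 0.5) + 1)

    scores = {}
    for url in matching_pages:
        words = crawled_pages[url].lower().split()
        tf = words.count(query)
        # k1 controls term-frequency saturation; b controls length normalization
        denominator = tf + k1 * (1 - b + b * len(words) / avg_length)
        scores[url] = idf * tf * (k1 + 1) / denominator

    return sorted(scores.items(), key=lambda item: item[1], reverse=True)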

Putting It All Together

Now that we have our crawler, indexer, and ranker, we can put everything together in a simple search engine:

def mini_search_engine(start_page, query):
    print(f"Starting crawl from: {start_page}")

    # Dictionary to hold crawled pages
    crawled_pages = {}

    # Step 1: Crawl the web
    crawl(start_page, crawled_pages)
    print(f"Crawled {len(crawled_pages)} pages.")

    # Step 2: Index the crawled pages
    print("Indexing crawled pages...")
    inverted_index = index(crawled_pages)
    print(f"Indexed {len(inverted_index)} unique words.")

    # Step 3: Search the index for the query
    print(f"Searching for '{query}'...")
    results = search(query, crawled_pages, inverted_index)
    return results

Now we can run our mini search engine:

if __name__ == "__main__":
    # Example usage
    start_page = "https://jasir.dev"
    query = "fish"
    results = mini_search_engine(start_page, query)

    print("Search Results:")
    for url, score in results:
        print(f"{url} (Score: {score})")

Conclusion

In this post, we've covered the basic components of a search engine: crawling, indexing, and ranking. While we've simplified many aspects, the core principles remain the same in real-world systems. You can extend this example by:

  • Implementing a more robust crawler that respects robots.txt and handles JavaScript.
  • Building a more sophisticated indexer that supports better tokenization, phrase queries, synonyms, and stemming.
  • Enhancing the ranker with advanced algorithms like TF-IDF or BM25.

In an upcoming post, we'll go beyond keyword search and explore how modern search engines perform semantic search using vector embeddings, ANN search and neural rankers. Stay tuned!