mini-Search Implementation

Build a simplified search engine that can index documents and perform full-text search operations. This project combines data structures, algorithms, and system programming to create an efficient search solution.

Project Structure

The mini-search project is organized as follows:

phase4/mini-search/
├── CMakeLists.txt
├── include/
│   └── search_engine.h       # Search engine interface and classes
└── src/
    └── search_engine.cpp     # Implementation of search functionality

Learning Objectives

During this project, you will:

  • Implement inverted index data structures for efficient text search
  • Design tokenization and text processing algorithms
  • Build query parsing and execution systems
  • Apply performance optimization for search operations
  • Create document storage and retrieval mechanisms
  • Implement ranking algorithms for search results
  • Design memory-efficient storage for large document collections

Core Concepts

Inverted Index

The core data structure of search engines is an inverted index:

  • Forward Index: Document → Terms (words in each document)
  • Inverted Index: Term → Documents (documents containing each term)
  • Document Store: Storage of actual document content with metadata

Example structure:

Term "hello": [doc1, doc5, doc10, ...]
Term "world": [doc2, doc5, doc7, ...]

Text Processing

Implement a proper text processing pipeline:

  1. Tokenization: Split text into individual terms/words
  2. Normalization: Convert to lowercase, remove punctuation
  3. Stemming (optional): Reduce words to root forms
  4. Stop Word Removal (optional): Remove common words like "the", "and"

Implementation Architecture

SearchEngine Interface

class SearchEngine {
public:
    explicit SearchEngine(const std::string& data_dir);
    ~SearchEngine();

    // Indexing operations
    void add_document(int doc_id, const std::string& content,
                      const std::string& metadata = "");
    void remove_document(int doc_id);
    void commit();  // Persist changes to disk

    // Search operations
    std::vector<SearchResult> search(const std::string& query,
                                     size_t max_results = 10);

    // Utility operations
    size_t document_count() const;
    std::string get_document(int doc_id) const;

private:
    std::string data_dir_;

    // Core data structures
    std::unordered_map<std::string, std::vector<DocTermInfo>> inverted_index_;
    std::unordered_map<int, Document> documents_;
    std::unordered_map<std::string, int> term_frequencies_;

    // Synchronization for thread safety
    mutable std::shared_mutex index_mutex_;

    // Text processing utilities
    std::vector<std::string> tokenize(const std::string& text) const;
    std::string normalize(const std::string& token) const;
};

Supporting Structures

struct DocTermInfo {
    int doc_id;
    int term_frequency;          // How many times the term appears in the document
    std::vector<int> positions;  // Positions where the term appears
};

struct SearchResult {
    int doc_id;
    double score;         // Relevance score
    std::string snippet;  // Excerpt containing search terms
};

struct Document {
    std::string content;
    std::string metadata;
    std::chrono::system_clock::time_point timestamp;
};

Implementation Requirements

Core Features

  1. Document Indexing: Add, update, and remove documents from the index
  2. Full-Text Search: Find documents containing specified terms
  3. Phrase Search: Match exact multi-word sequences, not just co-occurring terms
  4. Ranking: Rank results by relevance (TF-IDF or similar)
  5. Persistence: Save index to disk and load on startup
  6. Memory Efficiency: Handle large document collections

Text Processing Pipeline

Implement a robust text processing system:

  • Tokenization: Split text using whitespace and punctuation as delimiters
  • Normalization: Convert to lowercase, remove common punctuation
  • Filtering: Optionally remove stop words (very common words)
  • Caching: Cache processed terms to avoid re-processing

Sample Implementation Approach

class SearchEngine {
private:
    // Inverted index: term -> list of documents containing the term
    std::unordered_map<std::string, std::vector<DocumentFrequency>> inverted_index_;

    // Forward index: document ID -> content and metadata
    std::unordered_map<int, Document> document_store_;

    // Statistics for ranking algorithms
    std::unordered_map<std::string, int> doc_frequency_;  // How many docs contain each term

    // Text processing utilities
    std::vector<std::string> tokenize_query(const std::string& query) const;
    double calculate_tf_idf_score(int doc_id, const std::vector<std::string>& terms) const;

public:
    void add_document(int doc_id, const std::string& content);
    std::vector<SearchResult> search(const std::string& query);
};

Search Algorithms

Basic Search Implementation

Start with a simple boolean search:

  1. Parse the query into terms
  2. Find all documents containing each term
  3. Return documents that match all terms (AND operation)

Advanced Ranking (Optional)

Implement TF-IDF (Term Frequency-Inverse Document Frequency) for better relevance:

  • TF: How frequently a term appears in a document
  • IDF: How rare the term is across all documents
  • Score: TF × IDF for each term in document

Performance Considerations

  • Index Size: Keep index in memory for fast access
  • Compression: Compress the index to reduce memory usage
  • Caching: Cache frequent searches and results
  • Query Optimization: Optimize query processing for common patterns
  • Memory Management: Use appropriate data structures to minimize memory overhead

Persistence Strategies

Consider implementing persistent storage options:

  1. Text-based format: Simple JSON or custom text format (easy to debug)
  2. Binary format: More efficient storage but harder to debug
  3. Database backend: Use SQLite or similar for persistence (advanced)
  4. Memory mapping: Map index files directly to memory for fast access

Testing Strategy

Test your search engine with:

  • Individual document indexing operations
  • Various query types (single terms, phrase queries)
  • Ranking accuracy of results
  • Performance with large document collections
  • Memory usage efficiency
  • Persistence and recovery from restart

Usage Example

#include "search_engine.h"
#include <iostream>

int main() {
    try {
        SearchEngine engine("./search_data");

        // Add documents to index
        engine.add_document(1, "The quick brown fox jumps over the lazy dog");
        engine.add_document(2, "A journey of a thousand miles begins with a single step");
        engine.add_document(3, "To be or not to be, that is the question");

        // Commit changes to disk
        engine.commit();

        // Perform search
        auto results = engine.search("fox dog");
        std::cout << "Found " << results.size() << " results:" << std::endl;
        for (const auto& result : results) {
            std::cout << "Doc " << result.doc_id
                      << " (score: " << result.score << ")" << std::endl;
        }
    } catch (const std::exception& e) {
        std::cerr << "Error: " << e.what() << std::endl;
        return 1;
    }

    return 0;
}


Advanced Features (Optional)

Consider implementing additional capabilities:

  1. Phrase Search: Support for "exact phrase" queries
  2. Boolean Queries: AND, OR, NOT operations between terms
  3. Fuzzy Search: Find similar terms using edit distance
  4. Faceted Search: Filter results by document categories or attributes
  5. Real-time Indexing: Update index without stopping the service
  6. Distributed Search: Split index across multiple nodes

Building and Testing

# Build the project
cd build
cmake ..
make -j4

# Create data directory
mkdir -p search_data

# The search engine can be used as a library in other applications
# or implemented with a simple command-line interface for testing

Next Steps

After completing the search engine, move on to the High-concurrency Crawler project to build a web crawler application.