Big Tech Coach

What is a Search Engine Database?

date
May 23, 2024
summary
Dive into the capabilities of search engine databases in NoSQL systems, exploring how they address complex search functionalities and handle large-scale data efficiently.
In this article, we dive into a distinct NoSQL database variant: the search engine database. Built to power sophisticated search functionality, it scales well and can quickly fetch relevant results, even from vast datasets.
For example, in the context of movie streaming systems, search is a central feature. While housing the movie metadata alongside the movie in object stores streamlined our process and maintained a simplified architecture, a pressing question arises: How do we efficiently present movies to users based on their search queries?
Object stores, much like key-value stores, lack support for query languages. Consequently, our current setup poses significant challenges for implementing a seamless text search.

Problem at Hand

Before delving into solutions, it's pivotal to ensure that we're aligned on the inherent challenges of constructing a text search feature using just the Object Store. Let's also explore how much traction we'd gain if we adopted a relational database.

Object Storage

Picture this: Our movie streaming system has been crafted based on our prior discussions, predominantly relying on an object store for metadata management.
A user is in the mood to dive into the "Rings of Power" series produced by Amazon.
Upon inputting "Rings of Power" into the search bar, the system commences a linear scan, navigating through the entire object storage, inspecting each metadata entry for the exact phrase "Rings of Power".
This approach is riddled with pitfalls:
  1. Efficiency Concerns: A linear scan has to inspect every entry in the store so that no matching entry is overlooked. In the worst case, the last entry is the one the user is looking for. Apart from the glaring design inefficiency, this leads to high latency, compromising the user experience. As our video library grows, this latency will only get worse.
  2. Relevance Dilemma: Using this methodology, our system would return both "The Fellowship of the Ring" and the desired "Rings of Power" series whenever the metadata of both contains the search term. The system has no way of discerning which result is more relevant to the user.
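
The two pitfalls above can be made concrete with a minimal sketch. The `OBJECT_STORE` dictionary and its keys are hypothetical stand-ins for metadata objects in an object store, and the matching rule (a case-insensitive substring test on the title) is an assumption chosen for illustration, not a description of any particular object store API.

```python
# Hypothetical in-memory stand-in for metadata objects kept in an object store.
OBJECT_STORE = {
    "movies/fellowship.json": {"title": "The Fellowship of the Ring"},
    "series/rings-of-power.json": {"title": "The Rings of Power"},
    "series/friends.json": {"title": "Friends"},
}


def linear_scan(store, query):
    """Return the keys of all objects whose title contains the query string."""
    needle = query.lower()
    matches = []
    for key, metadata in store.items():  # every object is visited: O(n)
        if needle in metadata["title"].lower():
            matches.append(key)
    return matches  # no ranking: results come back in storage order


# Both titles contain "ring", and nothing tells us which one the user wants.
print(linear_scan(OBJECT_STORE, "ring"))
# A single typo is enough to return nothing at all.
print(linear_scan(OBJECT_STORE, "Rings of Powr"))
```

The loop touches every object regardless of how early a match appears, and the result order reflects storage order rather than relevance.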
 
Indeed, while die-hard "Lord of the Rings" enthusiasts might delight in stumbling upon "The Fellowship of the Ring" while hunting for "Rings of Power," this serendipity wouldn't hold for other search terms. Imagine what would happen if someone searches for the series “Friends”.
 
Restricting matches to exact strings also means that a search request containing a typo returns no results at all. Combined with very long-running queries, this is a user experience nightmare. Now let's see whether a SQL database would allow us to implement a solution with a better user experience.

Relational Databases

Relational databases are adept at storing and manipulating structured data, especially when it's organized in a tabular format of rows and columns. These databases facilitate flexible searches across multiple record types, pinpointing values in specific fields.
 
Certain fields within a database's records might encompass free-form text, such as a movie description. A majority of relational databases extend support for keyword searches within these unstructured fields.
This functionality is readily accessible via SQL. Thus, a relational database addresses the issues we identified with object storage: not only would data retrieval be more expedient, but we'd also benefit from rudimentary text-search capabilities. Although search queries containing typographical errors might still yield no results, our position is significantly improved.
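
As a rough illustration of keyword search via SQL, the sketch below uses Python's built-in `sqlite3` module with a plain `LIKE` filter. The table layout, the sample descriptions, and the helper name `keyword_search` are assumptions made for this example; dedicated full-text features differ from one database to the next.

```python
import sqlite3

# In-memory database with a hypothetical movies table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT, description TEXT)"
)
conn.executemany(
    "INSERT INTO movies (title, description) VALUES (?, ?)",
    [
        ("The Fellowship of the Ring", "A hobbit sets out to destroy a powerful ring."),
        ("Friends", "Six friends navigate life in New York."),
    ],
)


def keyword_search(connection, keyword):
    """Return titles whose description contains the keyword."""
    rows = connection.execute(
        "SELECT title FROM movies WHERE description LIKE ? ORDER BY id",
        (f"%{keyword}%",),  # parameterized to avoid SQL injection
    )
    return [title for (title,) in rows]


print(keyword_search(conn, "hobbit"))  # ['The Fellowship of the Ring']
print(keyword_search(conn, "hobit"))   # [] because a typo still finds nothing
```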
However, we must not overlook the well-documented performance challenges associated with JOIN operations.
For clarity, a join operation is necessitated when data from two or more tables must be merged to satisfy a query. This becomes essential when a user seeks data dispersed across multiple tables that maintain one-to-many or many-to-many relationships.
Considering our movie database as an example, a query for "Ian McKellen" would mandate a join between the actor and movie tables. Regrettably, for expansive datasets, these join operations can be notoriously sluggish.
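
The "Ian McKellen" query could look roughly like the following sketch, again using `sqlite3`. The three-table schema (`movies`, `actors`, and a `movie_cast` link table for the many-to-many relationship) is an assumed layout for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE actors (id INTEGER PRIMARY KEY, name TEXT);
    -- Link table modelling the many-to-many relationship.
    CREATE TABLE movie_cast (movie_id INTEGER, actor_id INTEGER);
    INSERT INTO movies VALUES (1, 'The Fellowship of the Ring'), (2, 'X-Men');
    INSERT INTO actors VALUES (1, 'Ian McKellen'), (2, 'Elijah Wood');
    INSERT INTO movie_cast VALUES (1, 1), (1, 2), (2, 1);
    """
)


def movies_with_actor(connection, actor_name):
    """Return the titles of all movies featuring the given actor."""
    rows = connection.execute(
        """
        SELECT m.title
        FROM actors AS a
        JOIN movie_cast AS mc ON mc.actor_id = a.id
        JOIN movies AS m ON m.id = mc.movie_id
        WHERE a.name = ?
        ORDER BY m.title
        """,
        (actor_name,),
    )
    return [title for (title,) in rows]


print(movies_with_actor(conn, "Ian McKellen"))
```

Answering a single search already requires two joins here; on large tables this is where the cost discussed above shows up.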
Were our queries to be of a fixed nature, database schema denormalization could be a viable strategy to diminish the need for join operations, thereby optimizing performance. However, for a vast dataset accessed by a multitude of users, each armed with unique search terms, this isn't a practical choice.
While a relational database offers a superior alternative, it's far from perfect. Let's shift our attention to search engine databases to discern their distinctions and the ensuing benefits.

Search Engine Database

Search engine databases fall under the umbrella of NoSQL databases. While basic pattern matching has its merits, search engine databases truly shine at delivering pertinent results even when users make typos or their queries have no exact match. Additionally, full-text search can be harnessed to produce autocomplete suggestions as users type into the search bar. Thanks to efficient indexing, these tasks can be accomplished much faster than rudimentary pattern matching across vast datasets.
Let's delve deeper into full-text search and its functionalities:
  • Fuzzy Search
  • Auto Suggest
These features immensely enhance the user experience, and achieving them with a relational database would be exceedingly challenging, if not infeasible. To better appreciate the prowess of search engines, let's uncover the mechanics behind their remarkable search capabilities.
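
To give a feel for what fuzzy search and auto suggest do, here is a toy sketch. It uses `difflib` from the Python standard library as a stand-in for typo tolerance and a simple word-prefix test for suggestions. Real search engine databases rely on specialized index structures for both, so treat the function names and the similarity cutoff as assumptions for illustration.

```python
import difflib

TITLES = ["The Fellowship of the Ring", "The Rings of Power", "Friends"]


def fuzzy_search(query, titles, cutoff=0.6):
    """Return titles that are similar to the query, tolerating typos."""
    lowered = {title.lower(): title for title in titles}
    close = difflib.get_close_matches(
        query.lower(), list(lowered), n=3, cutoff=cutoff
    )
    return [lowered[match] for match in close]


def auto_suggest(prefix, titles):
    """Return titles that contain a word starting with the typed prefix."""
    typed = prefix.lower()
    return [
        title
        for title in titles
        if any(word.startswith(typed) for word in title.lower().split())
    ]


print(fuzzy_search("Rings of Powr", TITLES))  # the typo still finds the series
print(auto_suggest("fr", TITLES))             # suggestions while typing
```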

Architecture

To truly grasp the nuances of search engine databases, we need to acquaint ourselves with two pivotal elements of their design - Indexing and Querying.
During indexing, the database engine formulates a search-optimized data structure. When querying, user inputs are processed, and potential matches are subsequently fetched, arranged in order of relevance.
To lay a strong foundation, let's begin with a core concept - the document.

The Document

Envision a document as a row in a relational database, symbolizing a particular entity—the very thing you're scouring for. In our scenario, each movie and actor would be represented by distinct documents.
A document would encompass details such as the title, description, release year, and the cast for movies. Actor-centric documents would delineate their names and all movies they've featured in. Each document possesses a unique ID and a designated data type, which specifies the kind of entity the document symbolizes.
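
A document along these lines could be modelled as in the sketch below. The `Document` class and its field names (`doc_id`, `doc_type`, `fields`) are hypothetical and chosen for readability. Actual search engine databases typically accept documents as JSON-like structures.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class Document:
    doc_id: str    # unique ID of the document
    doc_type: str  # kind of entity the document represents
    fields: dict = field(default_factory=dict)  # searchable attributes


movie_doc = Document(
    doc_id="movie-1",
    doc_type="movie",
    fields={
        "title": "The Fellowship of the Ring",
        "description": "Placeholder description text.",
        "release_year": 2001,
        "cast": ["Elijah Wood", "Ian McKellen"],
    },
)

actor_doc = Document(
    doc_id="actor-1",
    doc_type="actor",
    fields={
        "name": "Ian McKellen",
        "movies": ["The Fellowship of the Ring"],
    },
)

# Documents serialize naturally to plain dictionaries.
print(asdict(movie_doc))
```

Note that the cast list is stored directly inside the movie document, so answering "who stars in this movie?" needs no join.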
Once the search request is executed, these documents relay the information to the user. Let’s delve into the mechanics of how this materializes.

Inverted Index

Much like key-value stores and caches, search databases lean heavily on dictionaries or hashmaps for their foundational structure. It’s evident that when rapid and efficient data retrieval is the endgame, dictionaries play an integral role. However, the application here diverges slightly.
Whereas key-value stores index values using a specific key, search engine databases employ an inverted index. This maps content to the corresponding documents that house it. In essence, an inverted index dissects each document into standalone search terms, subsequently mapping each term to its respective documents.
Returning to our earlier illustration will shed light on the anatomy of an inverted index.
All metadata pertaining to "The Fellowship of the Ring" is encapsulated within a document, as is the metadata for "Rings of Power". Likewise, every actor starring in either title receives an individual document. The inverted index is then constructed from all pertinent terms derived from these documents. Each term is assigned a unique ID and an array listing the documents it appears in.
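
A minimal sketch of such an inverted index follows. It simplifies in two ways that are assumptions of this example: the term string itself serves as the key instead of a separate term ID, and tokenization is a plain lowercase split without stemming, so "ring" and "rings" remain distinct terms.

```python
import re
from collections import defaultdict

# Hypothetical documents, reduced to a single text field each.
DOCUMENTS = {
    1: "The Fellowship of the Ring",
    2: "The Rings of Power",
    3: "Elijah Wood",
    4: "Ian McKellen",
}


def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map every term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def lookup(index, query):
    """Return the IDs of documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


index = build_inverted_index(DOCUMENTS)
print(index["the"])                     # {1, 2}
print(lookup(index, "rings of power"))  # {2}
```

A lookup now costs one dictionary access per query term plus a set intersection, instead of a scan over every document.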
Shifting focus from indexing, let's unravel the intricacies of data retrieval.
 
Among the many features of a search engine database, an effective search algorithm matters most. Because the inverted index provides such fine-grained access to the data, it is paramount that the returned documents align with user expectations. Ensuring relevance to the user is therefore vital.
If a user, while recalling The Fellowship of the Ring, can only remember "Elijah Wood", the search engine must discern and avoid presenting unrelated results like those linked to "Woody Allen", solely because both names share the substring "wood".
To address this relevance conundrum, search engine databases deploy a ranking algorithm. It scores the potentially relevant documents and hides those below a defined threshold from the user. Though this may sound straightforward, once a system scales to many users and vast databases, additional variables come into play, such as search history, which can profoundly influence an individual user's search results.
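
The idea of scoring and thresholding can be sketched with a heavily simplified TF-IDF style score. The sample documents, the scoring formula, and the `threshold` parameter are assumptions for illustration. Production engines use more elaborate scoring functions and many additional signals.

```python
import math
import re
from collections import Counter

# Hypothetical documents, reduced to a single text field each.
DOCUMENTS = {
    "actor-1": "Elijah Wood actor known for The Fellowship of the Ring",
    "actor-2": "Woody Allen director and actor",
    "movie-1": "The Fellowship of the Ring starring Elijah Wood and Ian McKellen",
}


def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, documents, threshold=0.0):
    """Return document IDs scoring above the threshold, best match first."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in documents.items()}
    n_docs = len(tokenized)
    scores = {}
    for doc_id, terms in tokenized.items():
        counts = Counter(terms)
        score = 0.0
        for term in tokenize(query):
            # Document frequency: in how many documents does the term occur?
            df = sum(1 for doc_terms in tokenized.values() if term in doc_terms)
            if df == 0:
                continue
            idf = math.log(1 + n_docs / df)  # rarer terms weigh more
            score += (counts[term] / len(terms)) * idf
        if score > threshold:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


print(rank("Elijah Wood", DOCUMENTS))
```

Because matching happens on whole terms, the "Woody Allen" document scores zero for the term "wood" and is filtered out, while the two documents that mention Elijah Wood are returned in order of their score.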
That provides a glimpse into the inner workings of search engine databases. While there's a trove of fascinating details to explore, such depth is beyond the confines of this course.
For those keen on diving deeper, I'd recommend "Search Engines: Information Retrieval in Practice". It’s available in the resources tab.

Popular Implementations

When it comes to search engine databases, there are numerous options to consider, encompassing both open-source projects and commercial solutions.
Undoubtedly, Apache's Lucene stands as a monumental implementation in this domain. At its core, Lucene is a Java-based library. While it doesn't furnish all the functionalities essential for applications autonomously, it offers intricate low-level features, with full-text indexing and searching being paramount among them.
Despite its initial release in 1999, Lucene's relevance has not waned; it continues to see active development. Its most profound contribution has been catalyzing the evolution of renowned open-source search engine databases such as Elasticsearch and Solr. Moreover, MongoDB harnesses Lucene's prowess for its Atlas Search feature. A noteworthy alternative is RediSearch, which leverages Redis as its foundational data store.
On the commercial side of the spectrum, the cloud giants aren't left behind. Amazon offers a managed service that began as Amazon Elasticsearch Service and is now named Amazon OpenSearch Service, and Microsoft offers Azure AI Search (formerly Azure Search) as its native cloud search engine solution.
Having delved into the popular search engine database implementations, let's now shift our focus to the advantages and potential constraints inherent to these systems.

Benefits & Limitations

Benefits

  1. High Scalability: At the outset, search engine databases leverage the advantages inherent to all NoSQL databases. Their innate design, devoid of transactions and rigid consistency assurances, facilitates effortless deployment across clusters, thus enabling horizontal scalability.
  2. Schemaless: Deploying relational databases often necessitates a painstaking, time-consuming process of data normalization to fit a tabular format. Conversely, schema-less databases eradicate this step, vastly simplifying the preliminary setup. The primary decision revolves around determining which entity attributes should be made searchable.
  3. Advanced Search Features Built-In: Crucially, these databases offer sophisticated search functionalities like full-text search, suggestions, and autocomplete natively, without any additional configurations.

Limitations

  1. Lack of ACID Guarantees: The intrinsic absence of ACID guarantees renders it suboptimal for scenarios demanding stringent consistency assurances.
  2. Not Optimal for General Data Operations: These databases are primarily tailored for search functionalities and are not the most efficient when it comes to standard write, read, or update operations. Consequently, there's often a need for a complementary database to handle the broader system state, while the search engine focuses on making select data subsets searchable.
  3. Management Challenges: Operating a search engine database in tandem with your principal database introduces new intricacies. Echoing the challenges with caches, ensuring synchronization between the search engine and the main database is pivotal to prevent presenting outdated data to users. The level of complexity is influenced by the frequency of state changes in the primary database.
  4. Converging Functionalities with SQL Databases: Many conventional relational databases now support features such as full-text search, so the boundaries between these database types have started to blur. Whether deploying and maintaining a dedicated search engine database is worthwhile depends on your specific use case and performance requirements.
For a system geared towards streaming, the aforementioned limitations aren't overly consequential. The need for strong consistency is negligible since newly introduced media files don't necessitate immediate universal accessibility. Given that the metadata is stored alongside media files in object storage, there's no compulsion to integrate another database, mitigating potential complexities. Furthermore, the infrequent alterations to metadata ensure that database synchronization remains a minimal concern.
 