Creating a scalable search architecture is a common and important task for many systems. There are different solutions for it, and choosing the right one depends on the requirements of your project.
Sometimes, as a project grows and its requirements change, you may run into new problems that you cannot solve with the search architecture you are using: handling a growing volume of data, supporting synonyms in search, adding multilingual search, and so on. In this case, you need to think about creating a new, more efficient, scalable search architecture.
The search architecture must support the rapid read and write scaling required for most use cases.
In this article, we will talk about the main challenges that an effective search architecture must solve. We will also look at the main ways to implement such an architecture and the tools you can use to do so. Finally, we will discuss how to speed up a search engine.
Building a scalable search architecture is a challenge for applications of all sizes and levels of complexity, and several different search architectures exist to address it.
However, due to the rapid development of information technology, the creation of new fast-growing marketplaces, and SaaS applications, existing solutions can no longer meet all requirements. Therefore, there is a need to build a new efficient scalable architecture that can solve modern problems.
Let’s take a look at the challenges developers of scalable search architectures are currently facing.
While search engines and relational databases have a lot in common, there are also key differences between them.
Relational databases store structured data in the form of interrelated tables. They allow you to process much more information than search engines. However, the advantage of search engines is that they can parse unstructured data. They store flat objects instead of interconnected tables. Search engines allow you to improve the performance of data reading and writing operations by parallelizing them.
In relational databases, information is well organized and more reliable. In contrast, in search engines, information is not systematized and is not stable, since its location and content can constantly change.
Search engines are easy and fast to implement. Unlike databases, search engines need to be updated frequently. Their main goal is to provide a relevant set of high-quality search results that respond quickly to customer needs.
A primary/replica architecture is used to support a large number of reads. It involves replicating the data of the main server to several replicas. This architecture allows you to process several times more requests than if you use only one copy of the data on one server. For example, if you use three copies of data on three servers, then you can process three times more requests than with one copy on one server.
To support more write operations, you need to split the data into several smaller parts (shards) and add more CPU capacity to handle them. Scaling reads and writes therefore means adding shards and keeping multiple copies of them on multiple machines.
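To make the idea of splitting writes across shards concrete, here is a minimal Python sketch of hash-based shard routing. The function name and shard count are hypothetical, and real engines typically add rebalancing on top of this.

```python
# Minimal sketch of hash-based shard routing; names and values are illustrative.
import hashlib

NUM_SHARDS = 4  # assumption: the index is split into four shards

def shard_for(doc_id: str) -> int:
    """Map a document ID deterministically to a shard."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Writes for the same document always land on the same shard, so adding
# shards spreads the indexing (write) load across machines.
print(shard_for("product-42"))
```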
To implement a primary/replica architecture that replicates the primary server’s data across multiple replicas, each shard needs exactly one copy that accepts writes: the primary. The other replicas use the primary as the source of truth. A log file stored on the primary shard is often used to synchronize the primary and its replicas. This log contains all the writes received by the primary shard in sequential order, and each replica shard reads writes from it and applies them locally.
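The following sketch illustrates this log-based synchronization in plain Python. All class and method names are hypothetical, and a real engine would use a durable, persisted log rather than an in-memory list.

```python
# Illustrative sketch of log-based primary/replica synchronization.

class PrimaryShard:
    def __init__(self):
        self.log = []    # sequential write log, the source of truth
        self.data = {}

    def write(self, key, value):
        self.log.append((key, value))   # record the write in order
        self.data[key] = value          # apply it locally

class ReplicaShard:
    def __init__(self):
        self.data = {}
        self.applied = 0                # position in the primary's log

    def sync(self, primary):
        # Read any writes we have not yet seen and apply them in order.
        for key, value in primary.log[self.applied:]:
            self.data[key] = value
        self.applied = len(primary.log)

primary, replica = PrimaryShard(), ReplicaShard()
primary.write("doc-1", {"title": "hello"})
replica.sync(primary)
assert replica.data == primary.data
```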
The main disadvantage of this approach is that indexing and search are co-located on the same machine. Since we need to keep multiple copies of the data, the CPU and memory used for indexing must be duplicated on every replica, which greatly increases costs. If you need to scale indexing and search at the same time, the replication factor grows and applies to ever more data, requiring a huge amount of additional resources.
Increases in CPU and memory usage for indexing can also hurt the end-user experience when the same machine handles both indexing and search queries. If the search traffic generated by end users rises sharply, the resources consumed by indexing may limit the capacity available to absorb the surge.
In addition, this approach limits auto-scaling, since adding a new replica requires pulling data from existing machines, which often takes several hours and puts extra load on those machines. As a result, you have to significantly overprovision your architecture in anticipation of growth in data or queries.
Another way to create a scalable search architecture is to replicate a binary data structure committed to disk after an indexing task has been completed.
This approach avoids duplicating the CPU and memory used for indexing. However, because the entire data structure is rewritten, the binary files can be large, which introduces some delay.
Most search architectures have to handle a large volume of indexing and search operations with latencies well under a minute, so in most cases this approach is not used.
In addition, search engines rely on generational data structures. This means that instead of a single binary file, there is a set of files. When a shard receives new indexing operations, they are first stored in small data structures on disk. New indexing operations land in generation zero until those files reach a certain size and must be merged into generation one. This merging is needed to remove duplicates and keep searches efficient. The disadvantage of this approach is that the files on disk keep changing, and each replica needs to fetch a new version containing all of the shard's data.
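A toy sketch of this generational scheme is shown below. The threshold and variable names are illustrative, and real engines manage many generations of segment files on disk rather than in-memory dictionaries.

```python
# Simplified sketch of generational segments: new writes go into small
# generation-0 segments, which are merged into a larger generation-1
# segment once enough of them accumulate.

GEN0_MERGE_THRESHOLD = 3   # assumption: merge after three small segments

gen0_segments = []          # small, freshly indexed segments
gen1_segments = []          # larger, merged segments

def merge():
    merged = {}
    for segment in gen0_segments:
        merged.update(segment)               # deduplicate by document ID
    gen1_segments.append(merged)
    gen0_segments.clear()                    # generation-0 files are rewritten

def index(doc):
    gen0_segments.append({doc["id"]: doc})   # each new document starts in gen 0
    if len(gen0_segments) >= GEN0_MERGE_THRESHOLD:
        merge()

for i in range(4):
    index({"id": i, "body": f"document {i}"})

print(len(gen0_segments), len(gen1_segments))  # 1 small segment, 1 merged segment
```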
Several factors influence how quickly data can be merged and transferred for replication.
To create a search architecture, you need three services: crawlers, web page processors, and an indexer.
Crawlers are bots that are used to visit a web page, get all the links that are on it, and follow those links. This allows search engines to constantly find new content.
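As a rough illustration, a minimal crawler could look like the sketch below. It assumes the requests and BeautifulSoup libraries; a production crawler would also need politeness rules (robots.txt, rate limiting), deduplication, and error handling.

```python
# Minimal crawler sketch: fetch a page, collect its links, and follow them.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(start_url: str, max_pages: int = 10):
    to_visit, seen = [start_url], set()
    while to_visit and len(seen) < max_pages:
        url = to_visit.pop()
        if url in seen:
            continue
        seen.add(url)
        html = requests.get(url, timeout=10).text
        for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            to_visit.append(urljoin(url, link["href"]))   # follow discovered links
        yield url, html
```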
Web page processors read the page content and metadata. They then break the content down into simpler forms that can be grouped by different criteria, such as topics or keywords. The metadata contains useful information such as keywords and descriptions.
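A hedged sketch of such a processor is shown below, again assuming BeautifulSoup; the output fields are only an example of what you might extract.

```python
# Sketch of a web page processor: extract the title, meta description,
# and a simple lowercase token list from raw HTML.
import re

from bs4 import BeautifulSoup

def process_page(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    description = soup.find("meta", attrs={"name": "description"})
    text = soup.get_text(separator=" ")
    tokens = re.findall(r"[a-z0-9]+", text.lower())   # simplest possible form
    return {
        "title": (soup.title.string or "") if soup.title else "",
        "description": description.get("content", "") if description else "",
        "tokens": tokens,
    }
```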
Indexing is used to organize found information so that it can be read quickly and easily. You can use keywords and page rank for this purpose. However, more efficient indexing requires some research and development.
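The core idea behind indexing can be illustrated with a toy inverted index, which maps each keyword to the pages that contain it so that queries become fast lookups instead of full scans. Everything below is a simplified sketch, not a production index.

```python
# Toy inverted index: keyword -> set of page IDs containing that keyword.
from collections import defaultdict

index = defaultdict(set)

def add_to_index(page_id: str, tokens: list[str]):
    for token in tokens:
        index[token].add(page_id)

def search(query: str) -> set[str]:
    terms = query.lower().split()
    results = [index[t] for t in terms if t in index]
    return set.intersection(*results) if results else set()

add_to_index("page-1", ["scalable", "search", "architecture"])
add_to_index("page-2", ["search", "engine"])
print(search("search architecture"))   # {'page-1'}
```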
Kubernetes is a scalable container orchestration platform that can run all the services needed for a search architecture. It lets you configure services so that they work regardless of the hardware underneath. In addition, you can scale each service independently according to your needs.
Kubernetes also lets you create services and assigns each of them its own IP address, so services can communicate with each other without any special wiring. In addition, it helps secure your services.
These are the main benefits of using Kubernetes for a search architecture.
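For example, scaling just one service independently of the others might look like the following sketch, which uses the official Kubernetes Python client; the deployment name, namespace, and replica count are assumptions about your cluster.

```python
# Hypothetical example: scale only the crawler deployment to five replicas,
# leaving the other services in the search architecture untouched.
from kubernetes import client, config

config.load_kube_config()            # or load_incluster_config() inside a pod
apps_v1 = client.AppsV1Api()

apps_v1.patch_namespaced_deployment_scale(
    name="crawler",                  # assumed deployment name
    namespace="search",              # assumed namespace
    body={"spec": {"replicas": 5}},
)
```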
Memphis is a real-time data processing platform. It lets you work with streaming data and supports asynchronous processing. It is well suited for engineers who work with large amounts of data and would otherwise have to write a great deal of complex code.
The Memphis platform is high-performance and fault-tolerant. It has built-in monitoring, which is very handy for troubleshooting.
Using Memphis solves most of the data flow control problems you face when building a search architecture.
Memphis uses a producer-consumer pattern and will control the entire data flow for the services you use in your search architecture.
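To illustrate the producer-consumer pattern itself, here is a plain Python sketch using the standard library queue module. It only demonstrates the flow that Memphis manages for you and does not use the Memphis SDK.

```python
# Producer-consumer pattern in plain Python: the crawler produces page URLs,
# the indexer consumes them. Memphis plays the role of the broker in between.
import queue
import threading

pages = queue.Queue()

def producer():
    for url in ["https://example.com/a", "https://example.com/b"]:
        pages.put(url)                 # crawler publishes work
    pages.put(None)                    # sentinel: no more work

def consumer():
    while (url := pages.get()) is not None:
        print(f"indexing {url}")       # indexer consumes work

threading.Thread(target=producer).start()
consumer()
```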
These are the main benefits of using Memphis.
Elasticsearch is an application that allows you to create a search index. It is often used for log analysis, full-text search, intelligent security systems, business intelligence, monitoring of ongoing processes, etc.
You can send data to Elasticsearch as JSON documents using its APIs or tools such as Logstash. Elasticsearch then automatically stores the document and adds a searchable reference to it in the cluster's index. You can also find and retrieve documents using the Elasticsearch API.
Elasticsearch provides horizontally scalable search and supports multithreading. You can search across several fields and give more weight to the fields that best indicate that a record matches the query.
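A hedged sketch of indexing and searching with the official Elasticsearch Python client (8.x style) is shown below; the index name, fields, and boost values are assumptions for illustration.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Send a JSON document; Elasticsearch stores it and makes it searchable.
es.index(index="pages", id="page-1", document={
    "title": "Building a scalable search architecture",
    "body": "Crawlers, page processors, and an indexer working together.",
})

# Search several fields, weighting title matches three times higher than body matches.
results = es.search(index="pages", query={
    "multi_match": {
        "query": "scalable search",
        "fields": ["title^3", "body"],
    }
})
print(results["hits"]["hits"])
```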
The development of cloud infrastructure and increased throughput allow for greater isolation between indexing and search without negatively impacting indexing latency. This makes more efficient and dynamic scaling possible.
Let's take a look at how this scalable architecture works.
This search architecture offers a number of advantages.
We have looked at how modern scalable search architectures work and what their advantages and disadvantages are. We also covered the main services a search architecture consists of and what they are used for. In addition, the article describes useful tools you can use when developing your own scalable search architecture.
Use the knowledge gained and design an efficient, scalable search architecture for your project.