Anatomy of a Search Solution

Krishna Nareddy
Windows NT Query Team
Microsoft Corporation

July 29, 1997

Introduction

The explosion of the Internet and intranets has led to enormous growth in the amount of textual information available to the masses. Any World Wide Web site with textual content needs a search solution to help its visitors find what they want with the least amount of effort. A wide array of search solutions, ranging from home-grown pattern matchers to specialized high-volume concept-based search engines, has appeared on the Web to meet that need.

Empowering your site with a search solution is as simple as spending a few hours to download and set up your free copy of Microsoft® Index Server. Before you realize it, you are the proud owner of a Web page brandishing your logo and a dialog box waiting to serve your visitors. But wait a minute! Are you making the most of it? Is it configured to be the right solution for your needs? How secure is your corpus? What if you need to scale up? What if you need to customize your solution to meet the needs of your diverse users? Heavy-duty search solutions such as Index Server and Microsoft Site Server Search (scheduled for release towards the end of 1997) are an intricate interplay of several components engineered to work seamlessly for you. Understanding as much as you can about them helps you mold the solutions to address your needs.

This is the first of a series of articles aimed at helping you understand and effectively deploy Microsoft’s search solutions on your Web sites and intranets. This article is designed to help you identify the various components and to enumerate potential features. That will help you convert your search need into a series of feature requests. The next article in this series will describe Microsoft Index Server 2.0 (scheduled for release in October 1997) and Microsoft Site Server Search against the framework provided in this article.

Pieces of the Puzzle

Let’s start with what you have—a bunch of documents in several file formats, structured in several ways, and possibly written in several languages. This is your data. To distinguish you from those who did not read this article, call it a corpus. Your corpus is spread out on your server(s), possibly across your intranet. You need an agent to feed individual documents to the assembly line. For lack of a better term, let’s call it a (document) gatherer. Each document has its own format, so you need a (document) filter to get rid of the unnecessary bits and extract the real content. The result is a stream of characters that needs to be run through a word breaker to extract the sequence of words to index.

What should the search engine do with these words? Whatever it takes to resolve your queries as quickly as possible. The words and related information, such as the relative position of words and their frequency of occurrence, are compiled into one or more persistent structures collectively called the index. The module that handles this task is the indexer.
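
As a rough illustration of what such a structure might hold, the following Python sketch builds an in-memory positional inverted index: each word maps to the documents and word positions where it occurs. The name build_index and the sample documents are hypothetical, and whitespace splitting stands in for a real word breaker. A production indexer would also persist the structure to disk.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to {doc_id: [positions]} (a positional inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumption: lowercase whitespace splitting stands in for a word breaker.
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(position)
    return index

# Hypothetical two-document corpus.
docs = {
    "a.txt": "search engines index words",
    "b.txt": "words words everywhere",
}
index = build_index(docs)

# The frequency of a word in a document is the length of its position list.
frequency_in_b = len(index["words"]["b.txt"])
```

Keeping positions rather than bare document IDs is what makes phrase and proximity queries possible later, at the cost of a larger index.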

How can your users express their information need and retrieve what they want? You need a query language and a user interface to help them compose queries. Queries are transmitted to the server, where a query processor uses the index to resolve queries and arrives at a set of matches. The list of matching documents, the hit list, is presented to the user, who uses a (document) browser to peruse any of those documents, preferably in their original form.

Any system you deploy should be easy to administer and configure. Hardware and network changes, disk corruption and capacity issues, reliable operation, round-the-clock availability, and a variety of ad hoc administration issues crop up on a regular basis. The solution you deploy should be designed to address these issues with as little human intervention as possible. Different pieces have different issues. We will address them as appropriate when we detail the pieces.

Description of the Pieces

The pieces are in place. Let’s look at each one of them in detail. When you are done digesting this section, you should be able to examine your search needs and identify how to define each piece of your solution.

Where Does the Web Fit In?

We are discussing search solutions in the context of the Web, which is the big blob gluing all the pieces of the puzzle. The Web provides a standard client/server infrastructure and this article assumes that you have some familiarity with that infrastructure. We will not discuss any part of this infrastructure, except when it is directly related to the implementation of a search solution.

The Document Corpus

What type of documents do you need to be able to search? The search solution you deploy should be able to handle your data. The following is a list of features to help you examine various aspects of your corpus.

Security: Select groups of individuals have access to select sets of documents. If a document cannot be accessed by a user using conventional means (through the file system, from a server, and so on), it should not be accessible through your search engine.
File Formats: Different applications generate documents in different file formats. What different file formats does your organization have? You should be able to index all of them. Most commercial-quality search solutions can handle common formats such as Microsoft Office documents and HTML.
Languages: Does your corpus have documents written in multiple languages? Does a single document have content written in multiple languages? Are most documents written primarily in one language?
Corpus Size: How large is your corpus? How large will it be in the future? Your search solution should be able to handle more than what you anticipate for the future. You should be able to handle your future data within acceptable performance parameters (disk space, index time, query response time, and so on).
Corpus Flux: How often does your corpus change? How often are documents deleted, modified, and added? You should not have to shut down your search service to handle changes to your corpus.
Location: Where are your documents located? On a file system? On your Web site? Served by a different server? In a proprietary storage? On fast read-write disks? On a CD-ROM?

Gathering Documents

Depending on the location of your data, you need to employ different techniques to channel your data to the indexer. Most Web-based solutions can pull all the documents served by a single, designated Web server. Others provide a Web crawler to gather documents from the Web. Some can work with your file system. Others provide specialized solutions to pull documents from proprietary stores such as the Microsoft Exchange Server.
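
For the file-system case, a gatherer can be little more than a directory walk. The sketch below is a minimal illustration, not how any particular product works; the function name gather and the default extension list are assumptions chosen for the example.

```python
import os

def gather(root, extensions=(".txt", ".htm", ".html", ".doc")):
    """Yield paths of files under root whose extension is in the allowed set."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            # Compare extensions case-insensitively (REPORT.DOC matches .doc).
            if os.path.splitext(name)[1].lower() in extensions:
                yield os.path.join(dirpath, name)
```

Because gather is a generator, documents can be handed to the filter one at a time instead of being listed up front, which matters for a large corpus.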

Filtering Documents

You have various file formats and you want your search solution to extract the content that matters to you. Most documents have more than content; they also have properties such as author and creation date. You want to extract all those properties that matter to you and index them. Document filters understand specific file formats and channel your content and properties to the indexer. Preferably, your solution should be able to handle a variety of common file formats right “out of the box.” You should be able to acquire or develop filters for the rest of the file formats and be able to communicate with the indexer through a standard interface.
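
To make the idea of a filter concrete, here is a minimal sketch of a filter for HTML, built on Python's standard html.parser module. It separates indexable text from properties (the title and any named meta tags) and discards script and style content. The class name HtmlFilter and the function filter_html are hypothetical; this does not represent the filter interface of any actual search product.

```python
from html.parser import HTMLParser

class HtmlFilter(HTMLParser):
    """Hypothetical filter: separates visible text from document properties."""

    def __init__(self):
        super().__init__()
        self.properties = {}
        self.chunks = []
        self._skip = 0          # nesting depth inside <script> or <style>
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "meta":
            name, content = attrs.get("name"), attrs.get("content")
            if name and content is not None:
                self.properties[name.lower()] = content

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.properties["title"] = data.strip()
        elif not self._skip and data.strip():
            self.chunks.append(data.strip())

def filter_html(markup):
    """Return (text, properties) for one HTML document."""
    f = HtmlFilter()
    f.feed(markup)
    f.close()
    return " ".join(f.chunks), f.properties
```

A filter for another format would expose the same two outputs, text and properties, so the indexer never needs to know which format it is reading.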

Recognizing Features of Your Content

Your documents are more than just a stream of text. They contain syntactic units such as words and sentences. Beyond that, they have some "meaning." The task of a search engine is to "understand" users' queries and find documents that best match those queries. The first step in that direction is an attempt to understand the content.

At the very least, individual words should be accurately recognized by the word breaker. The next level is to be able to extract conceptual features such as noun phrases (for example, "United Nations") and stems (for example, "swim" is the stem of "swimming," "swam," "swimmer," and so on). Beyond that, natural language processing has not matured enough to be of much use in a general-purpose search solution. Feature recognition is language-specific, so make sure the languages you care about are supported.

Documents generally contain many very commonly used words that are of little use in distinguishing one document from another. General examples are words such as “a,” “an,” and “the.” Certain words in a domain are used too frequently to be of much use in distinguishing documents from one another—for example, the word “From” in a corpus of mail messages. Eliminating such noise words serves the dual purpose of improving computational efficiency and improving the quality of documents returned in response to a query. You should be the judge of what words are noise words in your corpus.
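
A minimal sketch of word breaking followed by noise-word removal might look like the following. The regular expression is a crude English-only stand-in for a real word breaker, the noise-word set is an assumption you would tune for your own corpus, and the names break_words and index_terms are hypothetical. Stemming and phrase recognition are not attempted here.

```python
import re

# Assumption: a tiny noise-word list; "from" suits a corpus of mail messages.
NOISE_WORDS = {"a", "an", "the", "from"}

def break_words(text):
    """Split text into lowercase words; a crude stand-in for a word breaker."""
    return re.findall(r"[a-z0-9']+", text.lower())

def index_terms(text, noise_words=NOISE_WORDS):
    """Return (position, word) pairs, dropping noise words but keeping positions."""
    return [(i, w) for i, w in enumerate(break_words(text)) if w not in noise_words]
```

Positions are assigned before noise words are dropped, so the distance between the remaining words still reflects the original text. That keeps phrase matching meaningful even though some words are never indexed.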

Indexing Documents

An index is an efficient organization of the extracted features. Its purpose is to provide an efficient lookup at query time. You expect your indexing service to be reliable, to operate round-the-clock, to be amenable to easy configuration and administration, and to meet your performance requirements. These are essential features of any service you install on your servers. Therefore, this section will dwell only on features unique to the domain of search solutions.

Incremental Indexing: This is the ability to compile newly changed documents into an existing index. Without this, you may be forced to throw out your old index and completely re-index your corpus. You should not have to do this. Any commercial-grade search solution worth its salt will provide this feature.
Index Availability: You should always be able to resolve queries against the latest index. Some solutions may provide incremental indexing, but they may not be able to make the index available for searching while it is being updated. So they duplicate the index, incrementally index the copy, and swap the copies. This is resource-intensive and denies immediate availability of newer documents.
Robustness: Production servers often operate in a less than perfect environment. Networks go offline, disks run out of available space, hardware or power failures might occur, or the system may need to be shut down at short notice. Indexing services utilize a wide array of resources to process large amounts of data. They have several intermediate in-memory and persistent data structures that eventually get compiled into the index. If they cannot handle interruptions gracefully, they will end up recreating parts or the whole of the index. This is a drain on resources. Moreover, it delays document availability. A well-designed search solution will gracefully handle failures, recover from them, and need little or no human intervention.
Selective Indexing: As we learned earlier, an index contains whatever it takes to resolve queries on demand. Since indexing is a resource-intensive activity, the indexer should do only what is necessary to provide the desired search features.

Your users may not need all the features provided by the search solution. For example, they may never need to view or query the document modification time. Or they may never need to sort hit lists on a given property. If you need to provide only a select subset of search features, wouldn’t it be efficient to have the indexer work to provide only those features?
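
The idea behind incremental indexing can be sketched in a few lines. The hypothetical IncrementalIndex class below remembers a modification stamp per document and does work only when that stamp changes. It is an in-memory illustration under simplifying assumptions (whitespace word breaking, no persistence, no concurrency), not a description of how any shipping indexer is built.

```python
class IncrementalIndex:
    """Hypothetical in-memory index that re-indexes only changed documents."""

    def __init__(self):
        self.postings = {}   # word -> set of doc ids
        self.doc_words = {}  # doc id -> set of words currently indexed
        self.stamps = {}     # doc id -> last indexed modification stamp

    def remove(self, doc_id):
        """Drop every posting for a deleted or about-to-be-replaced document."""
        for word in self.doc_words.pop(doc_id, set()):
            self.postings[word].discard(doc_id)
            if not self.postings[word]:
                del self.postings[word]
        self.stamps.pop(doc_id, None)

    def update(self, doc_id, text, stamp):
        """Index the document only if its stamp changed; return True if work was done."""
        if self.stamps.get(doc_id) == stamp:
            return False
        self.remove(doc_id)
        words = set(text.lower().split())
        for word in words:
            self.postings.setdefault(word, set()).add(doc_id)
        self.doc_words[doc_id] = words
        self.stamps[doc_id] = stamp
        return True
```

The per-document word set is what makes removal cheap: the index can retract exactly the postings a document contributed without scanning everything else.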

The Query Language

A query language is the most direct link between the end user and the search solution. It should be sufficiently expressive to allow users to convey a wide variety of information needs. It should be intuitive enough to allow your least experienced user to frame the right queries without feeling overwhelmed by the syntax and semantics of the language. And it should allow your users to frame precise queries.

Query languages are generally closer to computer languages than they are to human languages. So you may often find yourself having to build layers between the query processor’s native language and your users. A well-designed query language should facilitate easy translation between your user’s needs and its syntactic idiosyncrasies.

Understanding different types of information needs and how they translate to queries will help you map your user needs to the query language features provided by the search solution. Because of the range of possible information needs, the following list is not exhaustive.

Document recall: You know that a document exists in your corpus and you want to retrieve it. You may know some of its properties, such as filename, author name, or title. Or you may remember nothing more than a few key words and phrases of its content. The more you know, the quicker you can retrieve it. You should be able to frame as precise a query as possible, specifying all the properties and content fragments you know exist in the document(s).
Document discovery: This represents a wide range of information-seeking activities. You need all the documents you can find to help you accomplish a task such as buying a house or a car. Or you may want to know more about the Second World War. The possibilities are endless. You know a few key words, phrases, and concepts. You should be able to mix words, phrases, and concepts in a single query and retrieve what is out there. Discovery is generally an iterative process with successive iterations of the query process retrieving more of what you want (recall) and just what you want (precision). Earlier iterations are likely to retrieve plenty of documents. For the user to make progress, queries in successive iterations should be more precise. A search solution should aid the user in this iterative process.
Find similar documents: You have an example of a document or a paragraph or a sentence fragment that is just right for you. Wouldn’t it be convenient and efficient to retrieve documents containing “similar” content? That could save a lot of time you would have otherwise spent framing queries and wading through irrelevant documents.
Relevance feedback: You browse through a list of matches and judge what is relevant and what is not. Imagine being able to submit your original query plus a list of your relevance judgments to have the search engine return with a better set of matches. That could drastically cut down your search effort.
Precision and recall: Sometimes you need to know as much as you can about a narrow area of interest. For example, lawyers researching a case would like to know about all prior cases and relevant laws pertinent to that case. Sometimes you don’t care about all the documents that are out there. You only want to retrieve a few representative documents with the least amount of effort. For example, if you are a new homebuyer, it is necessary to read only a few select articles to know most of what you need, although hundreds or even thousands of articles are available on the subject.
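
The two measures named in the last item have standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The sketch below computes both for a single query; the function name and the document identifiers are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query's results."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 documents returned, 3 documents actually relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
# p is 0.5 (2 of 4 returned are relevant); r is 2/3 (2 of 3 relevant were found)
```

In these terms, the lawyer in the example cares most about recall, while the homebuyer cares most about precision.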

The query language should be flexible enough to express these needs. It is easy to achieve high recall: you just have to throw a lot of words and phrases into your query. But high recall doesn’t help if all your relevant documents are buried in a sea of irrelevant ones. Therefore, you should strive to attain a high level of precision. You need to know what you want and the query language should allow you to express what you know. For example, if you know that your target documents should contain the phrases “mortgage broker” and “real estate,” but that the first phrase is more important than the second one, you should be able to express that in your query language to improve your precision.
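
One way to picture phrase weighting is as a score in which each phrase contributes in proportion to its weight. The sketch below is a deliberately naive ranking illustration, not the ranking formula of any real engine; the weights, the sample documents, and the function name score are all assumptions.

```python
def score(text, weighted_phrases):
    """Sum weight * occurrence count for each phrase (a naive ranking sketch)."""
    lowered = text.lower()
    return sum(weight * lowered.count(phrase.lower())
               for phrase, weight in weighted_phrases.items())

# Assumption: "mortgage broker" matters twice as much as "real estate".
query = {"mortgage broker": 2.0, "real estate": 1.0}
docs = {
    "d1": "A mortgage broker compares loans.",
    "d2": "Real estate listings for the region.",
    "d3": "Your mortgage broker knows real estate.",
}
ranked = sorted(docs, key=lambda d: score(docs[d], query), reverse=True)
```

Here the document containing both phrases ranks first, and a document with only the more heavily weighted phrase outranks one with only the lighter phrase.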

The Query Processor

Like the indexer, the query processor should be reliable, operate round-the-clock, be amenable to easy configuration and administration, and be able to meet your performance requirements. This section will dwell only on features unique to the domain of query processors.

Security: If users cannot access a document using other means, they should not be able to do so through your search solution. The query processor should not even acknowledge the existence of documents that are out-of-bounds to the user issuing the query.
Search scope: If users are interested only in searching a specific portion of the available corpus, they should be able to specify the scope along with the query. That improves the precision of the search and cuts down on the resources used to resolve the query. Likewise, if users want to target multiple indexes and collate the hit lists into one list, they should be able to specify all the indexes along with the query. This will be much more efficient than having users collate the hit lists from different indexes.
Optimized for type of access: When you retrieve hits, will you need them all at once? Can you let the search engine return hits as it finds them? Do you need to have the server cache the results so you can scroll back and forth within that set? If you can identify your mode of access and let the engine know your requirements, it should be able to utilize that information to optimize execution of the query. A query processor that will let you do this helps you maintain a well-tuned server.
Honor resource constraints: You should be able to impose constraints on resources utilized to resolve the query. This helps the administrator ensure that the server is configured to enforce resource-related policies. Examples of such resources are CPU time, total turn-around time, memory consumption, and priority of the search process (relative to the other processes executing on the server).
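
Three of these features (security, search scope, and a resource limit) can be sketched together in one small function. The index layout, the access-control mapping acl, and the name run_query are hypothetical, and a cap on the number of hits stands in for richer limits such as CPU time or memory.

```python
def run_query(index, word, user, acl, scope="", max_hits=None):
    """Return paths matching word that the user may read, within scope."""
    hits = []
    for path in sorted(index.get(word, ())):
        if not path.startswith(scope):
            continue  # outside the requested search scope
        if user not in acl.get(path, ()):
            continue  # never reveal documents the user cannot access
        hits.append(path)
        if max_hits is not None and len(hits) >= max_hits:
            break  # honor the resource constraint
    return hits
```

Filtering on access rights before a hit is counted means an unauthorized user cannot infer a document's existence even from the size of the hit list.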

The Hit List

The hit list contains the documents matching the query. It should contain all the information needed to help the user determine whether a document holds any promise of being relevant. Once that decision is made, the hit list usually serves as a launch pad for the document browser. Note that a hit list is not necessarily a list in the strict sense of the word, although that is the most commonly used format. It is any user interface (UI) that directly or indirectly helps the user walk through the available list of matches. The following is a list of features you might expect from such a UI.

Configurable set of properties: Users should be able to specify what properties they would like to have displayed in the hit list. A quick glance at those properties will help them identify obviously relevant and irrelevant documents.
Sort on demand: Hit lists are usually sorted on some criteria chosen by the user. Being able to change the sort order (without re-executing the query) helps the user identify promising candidates more efficiently.
Property-based grouping: When a query results in many hits it could take several iterations to find a set of relevant documents. If the hits are presented as a few groups of documents (clusters) related through common properties, it would be relatively easy for the user to narrow in on a promising group of documents and focus the next iteration within that group.
Content-based grouping: This is similar to property-based grouping, but the basis of the grouping is commonality of content. This is not a precise grouping, but it might be good enough to help the user identify promising groups of documents.
Aiding the information quest: Users usually go through several iterations to find what they are looking for. A hit list is a great place to help users refine their queries. Imagine being able to pick a relevant document and submitting that as a query or being able to provide relevance feedback to generate a better set of matches.
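
Sorting on demand and property-based grouping both operate on a result set that has already been retrieved, so neither requires re-executing the query. The sketch below shows the two operations on a small list of hits; the property names, the sample hits, and the function names are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical hit list: each hit carries the properties chosen for display.
hits = [
    {"path": "a.doc", "author": "Pat", "size": 300},
    {"path": "b.htm", "author": "Lee", "size": 120},
    {"path": "c.doc", "author": "Pat", "size": 50},
]

def sort_hits(hits, prop, descending=False):
    """Re-sort cached hits on any property without re-running the query."""
    return sorted(hits, key=itemgetter(prop), reverse=descending)

def group_hits(hits, prop):
    """Cluster hits that share a property value, e.g. all hits by one author."""
    ordered = sort_hits(hits, prop)  # groupby requires sorted input
    return {key: [h["path"] for h in group]
            for key, group in groupby(ordered, key=itemgetter(prop))}
```

For example, group_hits(hits, "author") clusters the two documents by Pat together, which lets a user narrow the next iteration of the search to that group.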

Browsing the Documents

Whenever possible, documents should be perused using viewers that can render a document with full fidelity. The nontextual context and the layout could provide important clues that help the reader quickly judge its relevance or make the best use of the information presented in the document.

Systems that can explain their actions inspire confidence in their users. In the context of a search solution, a simple way of accomplishing that is by highlighting words and concepts of the query that caused the document to be chosen as a match. Hit highlighting can also help the user judge the relevance of the document or find the information they are looking for by navigating between highlighted features.
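
A minimal form of hit highlighting wraps each whole-word, case-insensitive occurrence of a query word in markers. In the sketch below, the function name highlight and the bracket markers are assumptions; a Web page would more likely use HTML tags, and a real implementation would also highlight stems and phrases.

```python
import re

def highlight(text, query_words, left="[", right="]"):
    """Wrap whole-word, case-insensitive matches of query words in markers."""
    if not query_words:
        return text
    pattern = r"\b(" + "|".join(re.escape(w) for w in query_words) + r")\b"
    return re.sub(pattern, lambda m: left + m.group(0) + right,
                  text, flags=re.IGNORECASE)

marked = highlight("Index Server builds an index.", ["index"])
# marked == "[Index] Server builds an [index]."
```

The original capitalization is preserved inside the markers, so the document still reads as the author wrote it.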

Microsoft’s Web-based Search Solutions

Now that you have an understanding of the anatomy of a search solution, you can analyze your search needs and determine how well a given search solution stacks up against them. Your understanding will also help you make the most of the solutions you deploy. Future articles in this series will provide detailed overviews of Microsoft’s search solutions. The overview articles will be followed by in-depth technical articles to help administrators, solution providers, and developers make the most of the features provided by these products.

Microsoft provides a single-server Web solution, Microsoft Index Server version 2.0, which is scheduled for release in October 1997. For intranets that need a larger-scale solution, Site Server Search is scheduled for release towards the end of 1997. Subsequent articles in this series will provide a detailed overview of these products.