The amount of computerized data continues to grow exponentially. There are public and private databases on CD-ROMs and the Internet. As more and more data on more and more topics accumulates in databases, how to retrieve all the information on a subject readily, efficiently, and economically becomes a pressing question. Adding intelligence that will assist in the search will be of enormous use. In this article I outline the structure of Natural Language Processing (NLP) and Information Retrieval (IR) technology that will eventually help Web users.
Individual organizations used to construct private databases of accounting or scientific data that could be interrogated only by that organization's computer experts. Such mining required professional programming skill and technical knowledge of the structure of individual databases. As database technology improved, stylized query languages emerged that still demanded some software expertise, knowledge of individual database schemas, and a degree of logical rigor.
In those days you needed to know how to build sophisticated queries and how the search engine worked. You had to master its internals and use operators and modifiers precisely, like a logician. Further, the results of a search were often limited because non-logical connections between items were usually excluded.
You no longer need to be a computer expert to find the information you are looking for. This has been made possible by integrating Natural Language Processing capability into search engine pre- and post-processors. Two of the main elements of NLP are morphology, the study of word forms and their relationships to parts of speech, and syntax, the study of the ways in which words are combined to form phrasal structures.
With a natural language ranking method, all significant terms in a search query are used, and a search based on this approach can produce some results even if the terms in the query are misspelled. This would not be possible in a less-forgiving Boolean search. Novice users are familiar with the terms they want to find, but are not comfortable with Boolean searches. When a search engine has been augmented with NLP capability, there is no need for users to master a very complex query syntax because they can query data sources using everyday language.
An NLP system can add relationship information to the words in a database of indexes. NLP systems make indexes of words more sophisticated because Information Retrieval systems can differentiate between search terms on the basis of morphosyntactic relationships. "Morphosyntactic" means nothing more than taking a language's grammatical structure into consideration. For instance, "the White House" or "the Fourth Amendment" will be recognized as phrases with special meaning; therefore, the IR system will differentiate between documents containing "The White House" and documents containing phrases such as "my neighbor's white house."
NLP systems are capable of detecting noun phrases as syntactic components of a language. For instance, I can analyze the syntax (how words are arranged into a structure) of the sentence "Internet companies have a high market value." I will identify "Internet companies" as a noun phrase where "Internet" is the adjective and "companies" is the noun.
Why is noun phrase detection so fundamentally important? Because noun phrases capture the meaning of the information you want to find. In a content-rich and unstructured environment like the Web, this capability can save the user time and effort. The ability to distinguish noun phrases allows increased word discrimination that can greatly improve the relevance of returned documents. For instance, searching for "nuclear," "power," and "plant" separately instead of "nuclear power plant" would result in many irrelevant matches. Thus, IR systems that use noun phrases can enhance IR precision.
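To make the distinction concrete, here is a minimal sketch in Python of how indexing a noun phrase as a single term improves discrimination. The document set, phrase list, and function names are invented for illustration:

# Sketch: indexing noun phrases as single terms to improve precision.
# The documents, phrase list, and names below are hypothetical.

docs = {
    1: "the new nuclear power plant opened last year",
    2: "nuclear families visit the power company and the plant nursery",
}

PHRASES = {"nuclear power plant"}  # phrases an NP detector might produce

def index_terms(text):
    """Return the index terms for a document: its words plus any known phrases."""
    terms = set(text.lower().split())
    for phrase in PHRASES:
        if phrase in text.lower():
            terms.add(phrase)
    return terms

index = {doc_id: index_terms(text) for doc_id, text in docs.items()}

def search(term):
    return [d for d, terms in index.items() if term.lower() in terms]

# Word-level matching hits both documents...
print(set(search("nuclear")) & set(search("power")) & set(search("plant")))  # {1, 2}
# ...but the phrase term isolates the document actually about the topic.
print(search("nuclear power plant"))  # [1]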
Major search databases now exist in languages other than English, and NLP will work in them as well. Some vendors offer NLP capabilities bundled with their search and retrieval engines that are capable of handling languages such as French, German, Spanish, Italian, Dutch, Danish, Portuguese, and other European languages. These include Xerox/XSoft and Inso Corp. (previously InfoSoft, a spin-off of Houghton Mifflin, the company that provides thesauruses, spell checkers, and grammar checkers to several companies including Microsoft). Both Xerox and Inso are also developing support for double-byte character sets for languages such as Japanese, Chinese, and Korean.
The combination of language support, NLP, and IR systems has given users much more flexibility. Multilingual document collections can be queried through a multilanguage query syntax capability. Users can set up a collection of email, tech notes, or other documentation in French, German, and Italian (even the same document can contain different languages) and perform successful queries. You can mix terms in different languages when you query a data set containing multilingual documents. For example, you can use this query
"find all the documents regarding computer, computadora, logiciel,
elaboratore dati."
to retrieve any English document containing the word "computer," in addition to Spanish, French, and Italian documents containing the corresponding (if not translated) term.
Clearly, multinationals and international organizations can benefit from this multilingual approach. As a rule, such companies have settled on a few popular languages for searches. However, Internet users will soon expect to use their own languages when searching for information.
Indexing is the first task an IR system has to accomplish; you can only search and retrieve a document if it has been indexed into a database. Each indexed word is represented by a set of coordinates describing where it occurs: in what document, sentence, paragraph, heading, and so on. The resolution of the index controls whether retrieved pointers point to whole documents, subsections within documents, or exact word locations.
Ultimately, IR systems analyze queries to see if the words in the queries are in the indexes. If they are, the system displays the matching documents and ranks or scores them according to various scoring methods/algorithms. The documents that receive the highest score represent the best match, and the system presents them at the top of the list of retrieved documents.
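A minimal sketch of such a positional index and its querying, assuming a toy document set and a simple count-the-matches scoring rule (real systems use far more elaborate scoring algorithms):

from collections import defaultdict

# Sketch of a positional inverted index: each word maps to the
# coordinates (document id, word position) where it occurs.

docs = {
    1: "information retrieval systems index every word",
    2: "retrieval of information depends on the index resolution",
}

index = defaultdict(list)
for doc_id, text in docs.items():
    for pos, word in enumerate(text.lower().split()):
        index[word].append((doc_id, pos))

def search(query):
    """Score documents by how many query words they contain, highest first."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for doc_id, _pos in index.get(word, []):
            scores[doc_id] += 1
    return sorted(scores.items(), key=lambda item: -item[1])

print(search("information retrieval"))  # [(1, 2), (2, 2)]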
NLP tools need to explicitly address a range of computer-specific issues for indexing and querying: platforms, nongrammatical input, internationalization (support for non-English character sets), code footprint, and lexical coverage (the number of words recognized). The ability to customize lexicons (dictionaries, special vocabularies) and noun phrase grammars is useful too.
As programs and procedures for adding NLP capability to IR systems are developed, they are tested on a body of text. Sometimes a collection of news reports from Agence France-Presse (the French news agency) or Der Spiegel (the German newsmagazine) will serve as an informal test bed. In a formal setting, a recognized corpus of documents is used. A corpus is a standard set of texts used for study and comment by linguistic scholars. Examples include the British National Corpus (http://info.ox.ac.uk/bnc/) and the Brown University Corpus of American English.
One way to add NLP capability to a search and retrieval engine is to apply three analyses to the searcher's natural language query expression: morphosyntactic analysis, term expansion, and query construction. Each is described below. The combination of these analyses will generate a final form of the search query that, depending on the text to be retrieved, may be very complex, mirroring the input a computer analyst had to generate in the old days. This final form of the search query is then input to the search engine.
The first function, morphosyntactic analysis, tokenizes a query, recognizes parts of speech and noun phrases in free-form text, and tags them. Put another way, this function parses the user's expression, breaks it into recognizable words, then analyzes it. The IR system must provide part-of-speech information to construct noun phrases.
A word's part of speech may be ambiguous when taken out of context. For instance, the word "code" can be a noun or a verb, depending on the context. Thus, the IR system must provide a basic disambiguation mechanism that increases the accuracy of noun phrase construction by refining part of speech tags.
Term expansion expands the number of key terms by adding related words to the text of the query expression. These words may be related by stemming. Remember, the stem is the part of the word to which inflectional endings are added. This means that if you identify "think" as a key word, then "thinks" is a word related by stemming. By the same principle, you identify "fly" or "flight" as words related by stemming to the key word "flew." "Better" is related to "good" and "best," and "bought" to "buy." Stemming and uninflection must be sophisticated, particularly for such irregular verbs and adjectives. Take the word "thought": in addition to being a noun, it's two forms of the verb "to think" ("I thought," the simple past, and "I have thought," the past participle).
A stemming function can reduce all tokens in the index to a baseform and similarly reduce all tokens in queries to their baseforms. This can be complex; in Italian, for instance, "sono," "siamo," and "furono" are all inflected forms of the verb "essere" (to be). In Spanish, the verb "minar" has over sixty inflected forms. Full morphological analyses for Romance-family languages typically contain much more information than those for English; it is claimed that in some languages, such as Finnish, a verb can appear in over ten thousand forms!
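A minimal sketch of baseform reduction, assuming a toy exception table for irregular forms and a crude suffix rule (production systems rely on full morphological lexicons):

# Sketch of uninflection (baseform reduction). The exception table and
# suffix rule below are toy assumptions for illustration only.

IRREGULAR = {
    "thought": ["think", "thought"],  # verb baseform plus the noun itself
    "flew": ["fly"],
    "better": ["good"],
    "bought": ["buy"],
    "furono": ["essere"],             # Italian: "they were" -> "to be"
}

def baseforms(token):
    token = token.lower()
    if token in IRREGULAR:
        return IRREGULAR[token]       # ambiguous cases return a list
    if token.endswith("s") and len(token) > 3:
        return [token[:-1]]           # crude regular-inflection rule
    return [token]

for word in ("thinks", "thought", "flew", "furono"):
    print(word, "->", baseforms(word))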
A different implementation would relate words by thesaurus (a database of words related by sense) or context, so that all the words related to a particular subject defined in a thesaurus can be included. A key word such as "information retrieval" can be expanded to words such as "retrieval system" and "information database" when a thesaurus is used.
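A sketch of thesaurus-based expansion, with invented thesaurus entries:

# Sketch of term expansion by thesaurus; the entries are illustrative,
# not drawn from any real product.

THESAURUS = {
    "information retrieval": ["retrieval system", "information database"],
}

def expand(terms):
    expanded = list(terms)
    for term in terms:
        expanded.extend(THESAURUS.get(term, []))
    return expanded

print(expand(["information retrieval"]))
# ['information retrieval', 'retrieval system', 'information database']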
A third term expansion method is truncation, in which words are matched by using wildcard characters so that a prefix will match several words. For instance, "histo*" will find "history," "historian," "historians," and "historical." Conversely, a stoplist is used to eliminate a set of words that have no indexing value, such as articles like "the" and "a," and prepositions like "for" and "to."
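Both mechanisms are easy to sketch; the vocabulary and stoplist below are toy assumptions:

# Sketch of truncation (prefix wildcard matching) and stoplist filtering.

VOCABULARY = ["history", "historian", "historians", "historical"]
STOPLIST = {"the", "a", "for", "to"}

def truncation_match(pattern, vocabulary):
    """Expand a pattern like 'histo*' to every word sharing the prefix."""
    prefix = pattern.rstrip("*")
    return [w for w in vocabulary if w.startswith(prefix)]

def remove_stopwords(tokens):
    return [t for t in tokens if t.lower() not in STOPLIST]

print(truncation_match("histo*", VOCABULARY))
print(remove_stopwords("a query for the history of computing".split()))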
Back to natural language analysis methods. Query construction is the creation of a query language expression, which used to be built manually by a computer expert. The use of an NLP operator such as <LANGUAGE>, for instance, would allow users to find all the corresponding terms for "press" and "politicians" in other languages. So, the query
<LANGUAGE>press <AND> <LANGUAGE>politicians
where <LANGUAGE> is an NLP operator and <AND> is a Boolean operator, would find the American English terms "press" and "politicians" when the following query is used:
<American>press <and> <American>politicians.
The query
<French>presse <et> <French>politiciens
would retrieve the French terms "presse" and "politiciens." The query
<Italian>stampa <e> <Italian>politici
would retrieve the Italian terms "stampa" and "politici."
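A sketch of how such operator-tagged queries might be constructed automatically, assuming a hypothetical term-translation table and the operator syntax shown above:

# Sketch of automatic query construction with language operators.
# The translation table and operator strings are illustrative assumptions.

TRANSLATIONS = {
    "press": {"American": "press", "French": "presse", "Italian": "stampa"},
    "politicians": {"American": "politicians", "French": "politiciens",
                    "Italian": "politici"},
}
AND_OPERATOR = {"American": "<and>", "French": "<et>", "Italian": "<e>"}

def build_query(terms, language):
    tagged = [f"<{language}>{TRANSLATIONS[t][language]}" for t in terms]
    return f" {AND_OPERATOR[language]} ".join(tagged)

for lang in ("American", "French", "Italian"):
    print(build_query(["press", "politicians"], lang))
# <American>press <and> <American>politicians
# <French>presse <et> <French>politiciens
# <Italian>stampa <e> <Italian>politici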
Information Retrieval systems augmented by NLP capability can enhance two essential IR requirements: recall and precision. By enhancing recall, the user retrieves a larger number of documents on the searched topic. Improving precision translates into better relevance of the retrieved information, which means users are far less likely to retrieve documents that have no pertinence to the subject researched. Better precision alone could put an end to a very common complaint of Web users, who find that current Web search engines cannot retrieve precise and accurate information on a consistent basis.
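In standard IR terms, if A is the set of documents a query retrieves and R is the set of documents actually relevant to it, then

precision = |A ∩ R| / |A|        recall = |A ∩ R| / |R|

and the goal of NLP augmentation is to push both ratios toward 1.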
Two mathematical procedures are used in the development of NLP-based information retrieval systems: Markov Chains and scoring.
A Markov Model is a finite-state device (also called a word chain device) that, when faced with a choice between two or more lists, chooses among them according to specified probabilities. It is capable of creating an unlimited number of distinct combinations from a finite set of elements. The components of a word chain device are a number of lists of words or phrases and a set of directions for going from one list to another. A processor selects one word from the first list, another word from the second list, and so on to form a complete sentence.
A word chain is the simplest way to combine words. One implementation of a word chain device is the recorded voice that gives you a phone number when you dial directory assistance. In linguistics, there have been studies that model the English language as a very large word chain. Some psychologists have suggested that human language is based on a large word chain stored in the brain. They have theorized that a stimulus causes a human to respond with a word, which in turn constitutes the stimulus for another word out of many possibilities, and so on.
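A word chain device is easy to sketch; the word lists and probabilities below are invented for illustration:

import random

# Sketch of a word chain (finite-state) device: at each step the processor
# picks one word from the current list according to fixed probabilities.

CHAIN = [
    (["the", "a"],               [0.7, 0.3]),
    (["dog", "cat", "linguist"], [0.4, 0.4, 0.2]),
    (["sleeps", "runs"],         [0.5, 0.5]),
]

def generate_sentence():
    words = [random.choices(options, weights)[0] for options, weights in CHAIN]
    return " ".join(words)

for _ in range(3):
    print(generate_sentence())  # e.g., "the cat runs"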
The linguist Noam Chomsky has criticized the idea that a language can be reduced to a word chain, because a sentence in English is completely different from a string of words chained together according to the language's transition probabilities. Chomsky has shown that an improbable sequence of words can be grammatical, and that even nonsense can be grammatical. The fundamental difference between a language and a word chain is that a word chain can only remember which word list it has just chosen from. It cannot remember things such as an "if" said at the beginning of a sentence, while a human being can, and can therefore follow the "if" with the appropriate word categories later on.
A Markov Chain is also defined as a special type of dynamic statistical function. It is represented by a matrix whose elements are conditional probabilities, which is practical if the number of statistical states is small and manageable. One very simple use of Markov Chains is to record the probabilities that one word or word type follows another in a language. For instance, the word "thought" may be tagged ambiguously as both noun and verb; examination of the context allows the system to assign the most likely part of speech to the token. Estimates of the conditional probabilities can be refined in successive testing stages; this refinement process is often what is involved in training such networks.
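A minimal sketch of such disambiguation, assuming an invented transition matrix of conditional probabilities and a toy lexicon:

# Sketch of Markov-chain part-of-speech disambiguation. The transition
# probabilities P(tag | previous tag) and the lexicon are invented.

TRANSITIONS = {
    ("PRONOUN", "VERB"): 0.8,
    ("PRONOUN", "NOUN"): 0.2,
    ("ARTICLE", "NOUN"): 0.9,
    ("ARTICLE", "VERB"): 0.1,
}
LEXICON = {"thought": ["NOUN", "VERB"], "i": ["PRONOUN"], "the": ["ARTICLE"]}

def most_likely_tag(previous_tag, word):
    """Pick the candidate tag with the highest transition probability."""
    candidates = LEXICON[word.lower()]
    return max(candidates, key=lambda t: TRANSITIONS.get((previous_tag, t), 0.0))

print(most_likely_tag("PRONOUN", "thought"))  # VERB  ("I thought")
print(most_likely_tag("ARTICLE", "thought"))  # NOUN  ("the thought")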
A scoring function assigns weights to components of a query based on such variables as length, position within the paragraph, position of the paragraph within the document, presence in headings, and presence or absence of fixed phrases, theme words, and uppercase words. The resulting score, used for ranking, is the sum of these weights. The values of the weights differ among document types and languages. In weighting, each word that is indexed or used in a query is assigned a numerical value, usually based on the frequency of the word in the stored documents.
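A sketch of such a weighting scheme, with invented weight values and feature names (real systems tune weights per document type and language):

# Sketch of a weighted scoring function; all values are illustrative.

WEIGHTS = {"in_heading": 5.0, "theme_word": 3.0, "uppercase": 1.0}

def score_term(features):
    """Sum the weights of the features observed for one query term."""
    return sum(WEIGHTS[f] for f in features)

def score_document(term_features):
    """A document's score is the sum of its per-term scores."""
    return sum(score_term(features) for features in term_features.values())

doc = {"nuclear": ["in_heading", "theme_word"], "power": ["theme_word"]}
print(score_document(doc))  # 11.0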
Ranking models can be probabilistic. By using information about word frequency, IR systems can assign a probability of relevance to all the documents that can be retrieved. These documents are then ranked according to this probability. Ranking is very useful because document databases tend to be very large, and large return sets can be daunting if not sorted.
For this article, I examined XLT from Xerox/XSoft and Intelliscope from Inso. At the time of my analysis, XLT was available in English, French, German, Spanish, Italian, and Portuguese. Intelliscope was available for English, French, German, and Spanish. Both products will segment or tokenize, stem words, tag parts of speech, and detect noun phrases.
XLT linguistic database products perform sophisticated natural language analysis and have an underlying architecture that can provide support for all European languages with a single API.
What distinguishes XLT is that it is composed of algorithms rather than lists of words and relationships between the words. The data structure underlying this technology is called a finite-state transducer, and the code that runs these transducers (achieving morphological analysis, generation, tagging, tokenization, OCR guidance, spellchecking, and so on) is very small and language-independent, saving on storage. New language models can be plugged into the existing runtime code, and multiple finite-state transducers can be combined into one transducer.
Xerox uses two simple principles in creating lexical transducers: all inflected forms of the same word are mapped to the same canonical dictionary (citation) form, and morphological categories (parts of speech) and features are included in the lexical representation.
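The following sketch shows what a lexical transducer computes, in both the analysis and generation directions. A real transducer encodes this mapping as a compact finite-state network rather than the lookup table assumed here, and the entries are illustrative:

# Sketch of the input/output behavior of a lexical transducer: a mapping
# between surface forms and citation forms plus morphological tags.

ANALYSES = {
    "thought": ["think+Verb+Past", "think+Verb+PastPart", "thought+Noun+Sg"],
    "centres": ["centre+Noun+Pl"],
}

def analyze(surface):
    """Analysis direction: surface form -> lexical form(s)."""
    return ANALYSES.get(surface, [])

def generate(lexical):
    """Generation direction: lexical form -> surface form(s).
    Transducers are invertible, so we simply run the table backwards."""
    return [s for s, analyses in ANALYSES.items() if lexical in analyses]

print(analyze("thought"))
print(generate("think+Verb+Past"))  # ['thought']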
Oracle and America Online have signed licensing agreements for Xerox's XLT technology.
Intelliscope, a new natural language tool, is based on a simpler concept and design than Xerox's XLT, but is just as powerful in noun phrase detection, part-of-speech tagging, stemming, and uninflection. Its flexibility lets users build their own personal dictionaries and noise-word stop lists; matching this capability would require Xerox to build a new transducer, since Xerox doesn't provide a way for users to build their own. Intelliscope also offers custom lexicons and custom noun phrase grammars, so users can change the way in which noun phrases are detected. This is an important feature for customers such as pharmaceutical or chemical companies that have developed their own terminology through years of operation in their industry sector.
Intelliscope functions for index preprocessing and search preprocessing include language-sensitive tokenization. Intelliscope defines tokens as whitespace-delimited strings with leading and trailing punctuation removed. Intelliscope follows predetermined punctuation rules to resolve cases where attached punctuation is ambiguous. It also returns the beginning and ending points of each sentence in the input buffer and the number of tokens in each sentence.
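A rough approximation of that tokenization behavior in Python (the punctuation and sentence-boundary rules below are simplified assumptions, not Inso's actual rules):

import string

# Sketch of tokenization: whitespace-delimited strings with leading and
# trailing punctuation stripped, plus naive sentence boundary detection.

def tokenize(text):
    tokens, sentences, start = [], [], 0
    for raw in text.split():
        token = raw.strip(string.punctuation)
        if token:
            tokens.append(token)
        if raw.endswith((".", "?", "!")):           # naive boundary rule
            sentences.append((start, len(tokens)))  # (first, one-past-last)
            start = len(tokens)
    return tokens, sentences

tokens, sentences = tokenize('NLP helps users. "Queries" get simpler!')
print(tokens)     # ['NLP', 'helps', 'users', 'Queries', 'get', 'simpler']
print(sentences)  # [(0, 3), (3, 6)]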
Token normalization allows users to configure Intelliscope to perform customized token conditioning according to code and data size, speed, and accuracy requirements.
Uninflection and inflection identify the baseforms of inflected words and generate inflected forms from baseforms. In ambiguous cases, Intelliscope returns a list of baseforms.
Compound processing recognizes compound forms and their constituents. A compound is a word form made up of other constituent words. For instance, the German word "Eisenbahngesellschaft" could return "Eisenbahn" and "Gesellschaft." This process is available for languages such as Danish, Dutch, German, Norwegian, and Swedish.
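A sketch of dictionary-based compound splitting, assuming a toy lexicon (real compound processing also handles linking elements such as the German "-s-"):

# Sketch of compound splitting; the lexicon is a toy assumption.

LEXICON = {"eisenbahn", "gesellschaft", "eisen", "bahn"}

def split_compound(word):
    """Return the first two-part split whose halves are both in the lexicon,
    preferring the longest first constituent."""
    w = word.lower()
    for i in range(len(w) - 1, 1, -1):
        head, tail = w[:i], w[i:]
        if head in LEXICON and tail in LEXICON:
            return [head, tail]
    return [w]

print(split_compound("Eisenbahngesellschaft"))  # ['eisenbahn', 'gesellschaft']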
Clitic processing identifies clitics for their removal to find root words. Clitics are articles, pronouns, and prepositions attached directly or with punctuation to other words. For instance, Intelliscope can recognize the French word "l'enfant" as a clitic form and will strip the "l'" to return the word root "enfant." This process is available for Romance languages such as French, Italian, Portuguese, and Spanish.
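A sketch of clitic stripping, with an illustrative clitic list:

# Sketch of clitic processing for French; the clitic list is illustrative.

FRENCH_CLITICS = ("l'", "d'", "j'", "qu'")

def strip_clitic(token):
    lower = token.lower()
    for clitic in FRENCH_CLITICS:
        if lower.startswith(clitic):
            return lower[len(clitic):]
    return lower

print(strip_clitic("l'enfant"))  # enfant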
Intelliscope takes unverified tokens and returns spelling correction alternatives, generated from its database, as suggestions for possible replacement. This will allow users to avoid running IR queries with misspelled words.
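One common way to generate such suggestions, sketched here with a toy wordlist (not Inso's actual method), is to enumerate every string within edit distance one of the token and keep those found in the wordlist:

# Sketch of spelling correction by edit distance 1.

WORDLIST = {"retrieval", "language", "query"}
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one deletion, transposition, substitution, or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    return deletes | transposes | replaces | inserts

def suggest(token):
    token = token.lower()
    if token in WORDLIST:
        return [token]  # token verified; no correction needed
    return sorted(edits1(token) & WORDLIST)

print(suggest("retreival"))  # ['retrieval']
print(suggest("langauge"))   # ['language']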
Intelliscope databases are rich in linguistic information. Tokens are verified to ensure their existence in the language. Capitalization and parts of speech are examples of information about words returned from the database. Unrecognized words are flagged as unverified. To maximize language coverage, Intelliscope uses comprehensive wordlists in many languages.
The Intelliscope noise-word filter removes noise words, words that are not helpful in discriminating relevant documents from non-relevant documents. Microsoft Index Server, part of the Microsoft Commercial Internet System, uses Inso software in its implementation.
Verity, a well-known company in the Internet industry, has deployed Intelliscope in its sophisticated retrieval technology. The integration of Intelliscope will allow users to query data sources using their everyday language.
The Verity Natural Language Tool is capable of utilizing part-of-speech information and detecting noun phrases that capture the essence of a free-text query. The company plans to ship the free-text query parser as part of the standard product. The parser will transform a natural language query in any of the languages Verity supports (English, French, German, Spanish, Italian, and Dutch) into a text reference query expression (TRL). The tool can also perform spell checking in all of the languages Verity supports.
The basic morphosyntactic analysis of the input that the tool performs could be used with or without the help of Verity Topic thesaurus or other knowledge bases (information structures) to automatically construct a query in the TRL. In other words, the tool could auto-match any query input to a predefined Topic set.
In preliminary testing, Verity has found a consistent and significant improvement in the precision of free-text search results, with no negative effect on performance compared with queries run without the natural language tool.
Verity is currently pursuing proper name detection.
Since proper names add very specific knowledge, they are very important to the quality of free-text search. Verity is also planning to integrate a statistical algorithm to automatically create a corpus-derived thesaurus. This functionality will allow a free-text query to expand into a more specific search, thereby improving both precision and recall.