IFILTER_INIT

The IFILTER_INIT enumeration values control text canonicalization, attribute output, embedding scope, and IFilter access patterns. These flags are passed to IFilter::Init.

typedef enum tagIFILTER_INIT
{
    IFILTER_INIT_CANON_PARAGRAPHS          = 1,
    IFILTER_INIT_HARD_LINE_BREAKS          = 2,
    IFILTER_INIT_CANON_HYPHENS             = 4,
    IFILTER_INIT_CANON_SPACES              = 8,

    IFILTER_INIT_APPLY_INDEX_ATTRIBUTES    = 16,
    IFILTER_INIT_APPLY_OTHER_ATTRIBUTES    = 32,
    IFILTER_INIT_INDEXING_ONLY             = 64,

    IFILTER_INIT_SEARCH_LINKS              = 128
} IFILTER_INIT;
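
Because each value occupies a distinct bit, a client can combine several flags with bitwise OR before calling IFilter::Init. The following is a minimal sketch, not taken from any SDK sample: the enumeration is redeclared locally only so the fragment compiles on its own, and the helper names indexing_client_flags and has_flag are hypothetical.

```c
#include <stdbool.h>

/* Values copied from the enumeration above. In a real build they come
   from the platform header instead of being redeclared here. */
typedef enum tagIFILTER_INIT
{
    IFILTER_INIT_CANON_PARAGRAPHS       = 1,
    IFILTER_INIT_HARD_LINE_BREAKS       = 2,
    IFILTER_INIT_CANON_HYPHENS          = 4,
    IFILTER_INIT_CANON_SPACES           = 8,
    IFILTER_INIT_APPLY_INDEX_ATTRIBUTES = 16,
    IFILTER_INIT_APPLY_OTHER_ATTRIBUTES = 32,
    IFILTER_INIT_INDEXING_ONLY          = 64,
    IFILTER_INIT_SEARCH_LINKS           = 128
} IFILTER_INIT;

/* Hypothetical helper: one plausible flag set for an indexing client.
   It requests all four canonicalizations plus index attributes, and
   promises a single sequential pass over the chunks. */
unsigned long indexing_client_flags(void)
{
    return IFILTER_INIT_CANON_PARAGRAPHS
         | IFILTER_INIT_HARD_LINE_BREAKS
         | IFILTER_INIT_CANON_HYPHENS
         | IFILTER_INIT_CANON_SPACES
         | IFILTER_INIT_APPLY_INDEX_ATTRIBUTES
         | IFILTER_INIT_INDEXING_ONLY;
}

/* Hypothetical helper: returns true if the given flag bit is set. */
bool has_flag(unsigned long flags, IFILTER_INIT flag)
{
    return (flags & (unsigned long)flag) != 0;
}
```

A filter implementation could use a test like has_flag(flags, IFILTER_INIT_SEARCH_LINKS) to decide whether to follow links at all.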
 

Elements

IFILTER_INIT_CANON_PARAGRAPHS
Paragraph breaks should be marked with the Unicode PARAGRAPH SEPARATOR (0x2029).
IFILTER_INIT_HARD_LINE_BREAKS
Soft line breaks, such as end-of-line in Microsoft® Word, should be replaced by hard line breaks, LINE SEPARATOR (0x2028). Existing hard line breaks may be doubled. A carriage return (0x000D), a line feed (0x000A), or a carriage return and line feed combination should each be considered a hard line break. The intent is to let pattern-expression matching operate on the line breaks as they are observed in the document.
IFILTER_INIT_CANON_HYPHENS
Various word processors have forms of hyphens that are not represented in the host character set, such as optional hyphens (appearing only at end of line) and non-breaking hyphens. This flag indicates that optional hyphens are to be nulled out, and non-breaking hyphens are to be converted to normal hyphens, either HYPHEN (0x2010) or HYPHEN-MINUS (0x002D).
IFILTER_INIT_CANON_SPACES
Just as the previous flag canonicalizes hyphens, this one canonicalizes spaces. All special space characters, such as non-breaking spaces, are to be converted to the standard SPACE character (0x0020).
IFILTER_INIT_APPLY_INDEX_ATTRIBUTES
Indicates that the client wants text split into chunks representing pseudo-properties.
IFILTER_INIT_APPLY_OTHER_ATTRIBUTES
Any attributes not covered by the preceding flag should be emitted.
IFILTER_INIT_INDEXING_ONLY
Optimizes IFilter for indexing because the client will be calling Init only once and will not be calling BindRegion. This eliminates the possibility of accessing a chunk both before and after accessing another chunk.
IFILTER_INIT_SEARCH_LINKS
The text extraction process must recursively search all linked objects within the document. If a link is unavailable, the GetChunk call that would have obtained the first chunk of the link should return FILTER_E_LINK_UNAVAILABLE.

Remarks

Generally, text output by GetText should exactly match the actual text of the document, but to achieve maximum interoperability some canonicalization of common features is desirable. These features include paragraph breaks, line breaks, hyphens, and spaces. IFilter servers can also embed control characters in text, which clients will largely ignore: Unicode character 0x0000 is ignored completely, and 0x0001 is treated as a word break.
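
The handling of 0x0000 and 0x0001 described above can be sketched from the client side. This is an illustration under stated assumptions, not a documented routine: the function name strip_filter_controls is hypothetical, the text is modeled as a buffer of 16-bit code units, and substituting SPACE (0x0020) for 0x0001 is just one possible way to record a word break.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical client-side pass over text returned by GetText.
   Rewrites the buffer in place and returns the new length:
     - 0x0000 is dropped entirely;
     - 0x0001 becomes SPACE (0x0020), standing in for a word break
       (an assumption; a real client may record the break differently). */
size_t strip_filter_controls(uint16_t *text, size_t len)
{
    size_t out = 0;
    for (size_t i = 0; i < len; ++i) {
        if (text[i] == 0x0000) {
            continue;                      /* completely ignored */
        }
        text[out++] = (text[i] == 0x0001) ? 0x0020 : text[i];
    }
    return out;
}
```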

Four flags control canonicalization: IFILTER_INIT_CANON_PARAGRAPHS, IFILTER_INIT_HARD_LINE_BREAKS, IFILTER_INIT_CANON_HYPHENS, and IFILTER_INIT_CANON_SPACES.
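
On the filter side, the hyphen and space canonicalizations might look like the sketch below. Several details are assumptions rather than statements from this reference: the function name canonicalize is hypothetical, SOFT HYPHEN (0x00AD) stands in for the "optional hyphen", "nulled out" is read as replacement with 0x0000, and the set of special spaces handled (0x00A0, 0x2007, 0x202F) is only a sample.

```c
#include <stddef.h>
#include <stdint.h>

/* Flag values copied from the enumeration; redeclared so the sketch
   is self-contained. */
enum {
    IFILTER_INIT_CANON_HYPHENS = 4,
    IFILTER_INIT_CANON_SPACES  = 8
};

/* Hypothetical in-place canonicalization of a buffer of 16-bit code
   units, applied only for the flags the client requested. */
void canonicalize(uint16_t *text, size_t len, unsigned long flags)
{
    for (size_t i = 0; i < len; ++i) {
        uint16_t c = text[i];

        if (flags & IFILTER_INIT_CANON_HYPHENS) {
            if (c == 0x00AD) {
                c = 0x0000;   /* optional (soft) hyphen: nulled out */
            } else if (c == 0x2011) {
                c = 0x2010;   /* NON-BREAKING HYPHEN -> HYPHEN */
            }
        }

        if (flags & IFILTER_INIT_CANON_SPACES) {
            /* Sample of special spaces; a real filter would cover
               whatever its document format can produce. */
            if (c == 0x00A0 || c == 0x2007 || c == 0x202F) {
                c = 0x0020;   /* -> SPACE */
            }
        }

        text[i] = c;
    }
}
```

Keeping each conversion behind its own flag test means a client that passes no canonicalization flags receives the text unchanged.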

Different clients of IFilter will want different views of an object. Two flags, IFILTER_INIT_APPLY_INDEX_ATTRIBUTES and IFILTER_INIT_APPLY_OTHER_ATTRIBUTES, control the set of attributes that should be applied to chunks. In addition, specific attributes may be requested in IFilter::Init calls through the aAttributes array, whose size is given by cAttributes.

IFilter implementations need to store some chunk information when operations other than content indexing occur. IFILTER_INIT_INDEXING_ONLY tells the filter that the client will call Init only once and will not call BindRegion, so the filter can be optimized for indexing.

For viewing purposes, it may be desirable to search across links as well as in the document and any objects it embeds. IFILTER_INIT_SEARCH_LINKS specifies recursively searching all links.

See Also

IFilter::BindRegion, IFilter::GetChunk, IFilter::Init, IFilter::GetText