File Parsers


A file parser is a dynamic-link library (DLL) that provides the low-level parsing needed to generate a quick view for a file of a given type.

About File Parsers

File parsers work in conjunction with the file viewing components of the Microsoft® Windows® operating system. These components are the shell, the Quick View program (Quikview.exe), and display engines. When the user wants to generate a quick view for a file, the shell responds by calling the Quick View program. The program manages the process, directing one of the display engines to draw the Quick View window and fill it with a view of the file. The display engine uses a file parser to determine the contents of the file and to draw those contents correctly.

You can extend the file viewing capabilities of Windows by supplying additional file parsers. Each file parser is responsible for a specific type or class of file and is associated with one of the display engines. For example, you can allow a quick view to be generated for a .doc file by creating a file parser to support that file type and associating the file parser with the word processor display engine.

This overview describes the file parser interface and explains how to write file parsers for word processing documents, spreadsheets, databases, bitmapped graphics, and vector graphics. For information about extending the file viewing capabilities in other ways, see File Viewers.

The file viewing technology used in the Quick View feature was jointly developed by Microsoft Corporation and Systems Compatibility Corporation.

Adding or Removing File Parsers

For performance reasons, the file viewer builds a cache of the file parsers in the system the first time the Quick View feature is used. This cache is stored in the registry. If a file parser is added or removed, this cache must be rebuilt. To make the system rebuild the cache, set the "verify data" value under the following key to something other than zero:

\\HKEY_LOCAL_MACHINE\SOFTWARE\SCC\Viewer Technology\MS1 
 

Implementing a File Parser

Every file parser must implement the following functions from the Microsoft Platform SDK:

    VwStreamCloseFunc
    VwStreamOpenFunc
    VwStreamReadFunc
    VwStreamReadRecordFunc
    VwStreamSectionFunc
    VwStreamSeekFunc
    VwStreamTellFunc

The display engine calls these functions to display a file of the type supported by the file parser.

The display engine starts the file viewing process by calling VwStreamOpenFunc, sending the name of a file to the file parser. The first responsibility of any parser is to verify that the given file has the proper format and can be processed. If the file is viewable, the file parser returns a value to the display engine acknowledging the request.

Once the parser completes verification, the display engine calls VwStreamSectionFunc, directing the file parser to identify the type and name of the first section of the file to be processed. A section is a portion of the file in which all the data is of one type; it forms a logical breaking point for the processing of the file. The standard section types are word processing, spreadsheet, database, bitmapped graphics, and vector graphics. A file can consist of a single section, multiple sections of the same type, or a combination of sections of different types. The actions that the display engine takes to display the file depend on the type of section currently being processed. The file parser must call the SOPutSectionType and SOPutSectionName functions to output the section type and to set the section name.

Before the file parser returns from VwStreamSectionFunc, it may need to provide the display engine with additional information. If the portion to be processed is a word processing section, the file parser must set entries for the font table by using the SOPutFontTableEntry function. If it is a spreadsheet section, the file parser must set the column width by calling the SOPutColumnInfo function. If it is a database section, the file parser must set the field format by calling the SOPutFieldInfo function. The file parser can also set the date base used by spreadsheets and databases to calculate dates by using the SOSetDateBase function. In addition, the file parser can set header entries by calling the SOPutHdrEntry function.

After the section type and general information are set, the display engine requests data for the section by calling VwStreamReadFunc. The file parser fulfills this request by calling the stream output functions. These functions pass the data to the display engine in a form that is easiest for the engine to display, copy to the clipboard, or write to disk.

The stream output functions used by the file parser depend on the section type. For word processing sections, the file parser uses the SOPutParaSpacing, SOPutCharAttr, and SOPutChar functions to set the spacing for paragraphs, set the style attributes for characters, and output characters, respectively. For spreadsheet sections, the parser uses the SOPutDataCell and SOPutTextCell functions to output the content (data or text) of cells. For database sections, it uses the SOPutField and SOPutVarField functions to output the data of fields. The parser uses the SOPutBitmapHeader and SOPutScanLineData functions for bitmapped graphics sections and the SOVectorAttr and SOVectorObject functions for vector graphics sections.

To set a break for a paragraph, cell, or field, the file parser calls the SOPutBreak function with the appropriate value: SO_PARABREAK, SO_CELLBREAK, or SO_RECORDBREAK. The return value from SOPutBreak tells the file parser how to proceed. If it is the SO_STOP value, the file parser stops all processing and returns from VwStreamReadFunc.

The file parser continues to output data until it reaches the end of the section. The parser must end a section by calling SOPutBreak with the SO_SECTIONBREAK value. If this is the last section in the file, the file parser indicates that the end of the file has been reached by using the SO_EOFBREAK value instead.

If there are subsequent sections in the file, the display engine calls VwStreamSectionFunc again to request the type and name of the next section, and processing continues just as it did for the first section.

After the last section, the display engine calls VwStreamCloseFunc to indicate that processing is complete and that no further requests for data will be made. The file parser must close the file and any related files it has opened and clean up resources, such as freeing memory.
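
The call sequence described above can be pictured with a toy driver that plays the role of the display engine. This is only a sketch: the ToyParser type, the toy_* functions, and the TOY_* break values are hypothetical stand-ins, and the real VwStream* signatures from the SDK are not reproduced here. A real engine may also call the read function several times for one section; the sketch collapses that to one call.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins; these are NOT the SDK's types or values. */
enum { TOY_SECTIONBREAK = 1, TOY_EOFBREAK = 2 };

typedef struct {
    int  section_count;   /* sections in the pretend file            */
    int  current;         /* index of the section being processed    */
    int  is_open;         /* nonzero between open and close          */
    char log[128];        /* records the order of calls for the demo */
} ToyParser;

static void toy_log(ToyParser *p, const char *step)
{
    strncat(p->log, step, sizeof p->log - strlen(p->log) - 1);
}

/* Plays the role of VwStreamOpenFunc: verify the file, then acknowledge. */
int toy_open(ToyParser *p, int section_count)
{
    if (section_count <= 0)
        return 0;                      /* file is not viewable */
    memset(p, 0, sizeof *p);
    p->section_count = section_count;
    p->is_open = 1;
    toy_log(p, "open ");
    return 1;
}

/* Plays the role of VwStreamSectionFunc: announce section type and name. */
void toy_section(ToyParser *p)
{
    toy_log(p, "section ");
}

/* Plays the role of VwStreamReadFunc: output data, then end with a break. */
int toy_read(ToyParser *p)
{
    int last;
    assert(p->is_open);
    toy_log(p, "read ");
    last = (p->current == p->section_count - 1);
    p->current++;
    return last ? TOY_EOFBREAK : TOY_SECTIONBREAK;
}

/* Plays the role of VwStreamCloseFunc: release everything. */
void toy_close(ToyParser *p)
{
    p->is_open = 0;
    toy_log(p, "close");
}

/* Display-engine side: open, then section/read per section, then close. */
int toy_view_file(ToyParser *p, int section_count)
{
    int brk;
    if (!toy_open(p, section_count))
        return 0;
    do {
        toy_section(p);
        brk = toy_read(p);
    } while (brk == TOY_SECTIONBREAK);
    toy_close(p);
    return 1;
}
```

For a two-section file, the recorded order is open, section, read, section, read, close.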

If an error occurs while a file is parsed, the file parser should call the SOBailOut function to notify the display engine of the error condition. The parser must immediately return from VwStreamReadFunc after calling the SOBailOut function.

Restartable Parsing

You must design the file parser so that parsing can be efficiently restarted at discrete locations within a file. The goal is to give the display engine the best performance without it having to store a completely converted copy of a file.

To facilitate restartable parsing, the display engine incorporates a module, which is called the chunker, that essentially caches data from the parser. The chunker does not cache all the data—only the data that the display engine has most recently requested. However, it does cache state data for restartable locations in the file. This means that as long as the parser maintains its own internal data in a way that can be efficiently restarted, the display engine and the parser can work cooperatively to locate and restart processing at the cached locations.

The file parser is responsible for determining the best locations for restarting parsing. It does this by calling the SOPutBreak function. The chunker assumes that each break is a restartable location in the file. Before calling SOPutBreak, however, the file parser must save up-to-date data about the location so that it can quickly retrieve and begin processing the data at the location if requested to do so.

The display engine uses the VwStreamSeekFunc and VwStreamTellFunc functions to direct the file parser to a restartable location.
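
The following sketch illustrates the idea of restartable locations under simplifying assumptions. The RsState and RsChunker types and the rs_* functions are hypothetical; they only model a chunker that snapshots parser state at each break and a seek that restores one snapshot. The point it shows is that the parser updates its state before the break, so the saved copy describes exactly where to resume.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parser state: everything needed to resume parsing. */
typedef struct {
    long offset;      /* position of the next unread byte */
    int  paragraph;   /* paragraphs output so far         */
} RsState;

#define RS_MAX_MARKS 16

/* Hypothetical stand-in for the chunker's cache of restartable states. */
typedef struct {
    RsState marks[RS_MAX_MARKS];
    int     count;
} RsChunker;

/* Stand-in for a break call: the chunker snapshots the parser state. */
int rs_put_break(RsChunker *c, const RsState *s)
{
    if (c->count == RS_MAX_MARKS)
        return -1;
    c->marks[c->count] = *s;
    return c->count++;
}

/* Parse text, treating each '\n' as a paragraph break. */
void rs_parse(RsChunker *c, RsState *s, const char *text, size_t len)
{
    assert(s->offset >= 0);
    while ((size_t)s->offset < len) {
        char ch = text[s->offset];
        s->offset++;                 /* update state BEFORE the break ... */
        if (ch == '\n') {
            s->paragraph++;
            rs_put_break(c, s);      /* ... so the snapshot is resumable  */
        }
    }
}

/* Stand-in for a seek request: restore a cached state. */
int rs_seek(const RsChunker *c, int mark, RsState *s)
{
    if (mark < 0 || mark >= c->count)
        return 0;
    *s = c->marks[mark];
    return 1;
}
```

After a seek to the first mark in the text "ab\ncd\nef", the restored offset points at the character that starts the second paragraph.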

Word Processing Sections

Word processing sections contain text organized as paragraphs, tables, and subdocuments. Of these, paragraphs and tables can have attributes, such as indentation, tab stops, and spacing. The text in word processing sections consists of characters having attributes, such as font, height, and weight. Word processing sections can also include embedded objects, allowing bitmapped art and other graphics to be included with the text.

A file parser processes the text associated with a word processing section when the display engine calls the VwStreamReadFunc function. The file parser must set all attributes before calling the SOPutChar function or other text output functions. The file parser must never automatically set an attribute as a default. If the state of a current attribute is not known, the file parser must not set it.

Paragraph attributes

The file parser sets the attributes of a paragraph before outputting characters for the paragraph. The attributes are the alignment, indentation, spacing, tab stops, and margins.

The file parser sets the alignment to be left, right, centered, or justified by using the SOPutParaAlign function and sets the left, right, and first line indents by using the SOPutParaIndents function. The file parser sets the spacing before and after the paragraph and between lines of the paragraph by using the SOPutParaSpacing function. The file parser sets tab stops by using the SOPutTabStop function, calling the function once for each tab stop. To mark the start and end of a tab stop definition, the file parser calls the SOStartTabStops and SOEndTabStops functions. The file parser sets page margins for the paragraph by using the SOPutParaMargins function.

Tables

The file parser can add tables to a word processing section's text by using the SOBeginTable and SOEndTable functions to mark the start and end of the table definition. It can format the rows and cells in tables by using the SOPutTableRowFormat and SOPutTableCellInfo functions. The file parser uses the character and paragraph functions to output the text for each cell and set the attributes.

The file parser marks the end of each cell and each row by using the SOPutBreak function with the SO_TABLECELLBREAK and SO_TABLEROWBREAK values. A file parser must insert a cell break after each cell and a row break at the end of each row. If a file parser inserts a row break before inserting as many cells as were defined for the row, the remaining cells are assumed to be empty. Empty cells may be inserted in the middle of a row by inserting consecutive cell breaks.
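
The cell and row break rules can be modeled with a small sketch. It assumes a simplified event stream in which ordinary characters are cell text, '|' stands in for a break with SO_TABLECELLBREAK, and a newline stands in for a break with SO_TABLEROWBREAK. The tb_read_row function and the TB_* names are hypothetical and only show how a row might be interpreted, not how the display engine is implemented.

```c
#include <assert.h>

#define TB_CELLBREAK '|'   /* stands in for SO_TABLECELLBREAK */
#define TB_ROWBREAK  '\n'  /* stands in for SO_TABLEROWBREAK  */

/* Consume one row of events. has_text[i] is set to 1 when cell i
   received text and to 0 when it is empty. Returns the number of
   events consumed, including the row break. */
int tb_read_row(const char *events, int cells_defined, int has_text[])
{
    int cell = 0, i = 0, k;

    assert(cells_defined > 0);
    for (k = 0; k < cells_defined; k++)
        has_text[k] = 0;             /* cells never reached stay empty */

    for (; events[i] != '\0'; i++) {
        if (events[i] == TB_ROWBREAK) {
            i++;                     /* consume the row break itself */
            break;
        }
        if (events[i] == TB_CELLBREAK) {
            cell++;                  /* consecutive breaks skip a cell */
            continue;
        }
        if (cell < cells_defined)
            has_text[cell] = 1;
    }
    return i;
}
```

With four cells defined, the events "a||c|" followed by a row break give text in cells 0 and 2; cell 1 is empty because of the consecutive cell breaks, and cell 3 is empty because the row break arrived early.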

Row and cell formats must be defined before the last cell of a row. After defining the row properties by using the SOPutTableRowFormat function, the parser must call the SOPutTableCellInfo function for each cell in the row. After a row is defined, the row properties are assumed to apply to subsequent rows until new row properties are specified. Thus, a file parser can define an entire table by specifying the row and cell properties once and then using the appropriate row and cell breaks.

You can add borders to cells by setting the pLeftBorder, pRightBorder, pTopBorder, and pBottomBorder members of the SOTABLECELLINFO structure to appropriate values when setting the cell format.

You can add tabs to cells by using the special character, the SO_CHCELLTAB value. This character is defined for cells that are merged with their neighbors and acts as a tab that moves the current text position to the location of the next boundary that would have existed if the cells had not been merged.

Subdocuments

The file parser can add subdocuments to the document. Subdocuments consist of headers, footers, footnotes, and comments. Subdocuments are added to the document using the SOPutBreak function. To start a subdocument, the file parser calls SOPutBreak with the SO_SUBDOCBEGINBREAK value. To end the subdocument, the file parser calls SOPutBreak with the SO_SUBDOCENDBREAK value.

After ending a subdocument, the file parser must restore character and paragraph attributes to what they were before the subdocument was started. The file parser can use the SUUserPushData and SUUserPopData functions to save and restore nested subdocument information. A parser can nest subdocuments without limit. The following example shows when to save and restore this information.

This is a <Bold On> test

    // At this point, the filter should save its internal 
    // information to reflect the fact that bold is on.
    SOPutBreak(SO_SUBDOCBEGINBREAK);
    SOPutSubdocInfo(...);
<Subdoc Begin> This is a <Bold Off>subdocument<Subdoc End>

    // At this point, the filter should restore its internal 
    // information to reflect the fact that bold is on.
    SOPutBreak(SO_SUBDOCENDBREAK);

document <Bold Off>of mine. 
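
The save-and-restore pattern can also be sketched as a small stack. The SdAttrs and SdStack types and the sd_* functions below are hypothetical and do not reproduce SUUserPushData or SUUserPopData; the fixed depth is a simplification for the sketch, whereas subdocuments themselves can nest without limit.

```c
#include <assert.h>

/* Hypothetical attribute state that the parser tracks internally. */
typedef struct { int bold; int italic; } SdAttrs;

#define SD_MAX_DEPTH 8   /* sketch-only limit on nesting */

typedef struct {
    SdAttrs saved[SD_MAX_DEPTH];
    int     depth;
} SdStack;

/* Call just before the subdocument begin break: save the outer state. */
int sd_begin_subdoc(SdStack *st, const SdAttrs *current)
{
    assert(st->depth >= 0);
    if (st->depth == SD_MAX_DEPTH)
        return 0;
    st->saved[st->depth++] = *current;
    return 1;
}

/* Call just after the subdocument end break: restore the outer state. */
int sd_end_subdoc(SdStack *st, SdAttrs *current)
{
    if (st->depth == 0)
        return 0;                    /* no subdocument is open */
    *current = st->saved[--st->depth];
    return 1;
}
```

Whatever attribute changes happen inside the subdocument, the state restored at its end is the one that was current when it began.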
 

File parsers are not expected to correctly exit a subdocument when run from a regular paragraph break (with the SO_PARABREAK value) inside the subdocument. The display engine lets the file parser run to the subdocument's end break (that is, the SO_SUBDOCENDBREAK value) and returns the SO_STOP value.

Characters and character attributes

The file parser outputs characters by using the SOPutChar function. To specify extra properties for a character, such as grouped or hidden, it outputs the character by using the SOPutCharX function instead. The file parser outputs special characters, such as tabs, hard line breaks, hard page breaks, and hyphens, by using the SOPutSpecialCharX function.

Before outputting characters, the file parser sets character attributes by using the SOPutCharAttr, SOPutCharFontById, SOPutCharFontByName, and SOPutCharHeight functions. These functions set the style, font, height, and width of the character. The SOPutCharAttr function lets the file parser set style attributes, such as italic, underline, and strikeout. The SOPutCharFontById and SOPutCharFontByName functions can specify any font that the parser added to the font table during processing of the VwStreamSectionFunc function. The SOPutCharHeight function sets the character height, in half points.

Embedded graphics objects

The file parser can embed graphics objects in the text of a paragraph section by using the SOPutEmbeddedObject function. The function inserts the embedded graphics object at the current location in the document.

Spreadsheet Sections

The file parser outputs content (data or text) for cells in a spreadsheet by using the SOPutDataCell and SOPutTextCell functions. Before outputting cell data, the file parser must get the range of columns to be output by using the SOGetInfo function with the SOINFO_COLUMNRANGE value. When SOGetInfo returns, the low-order word of its pInfo parameter identifies the first column of data to generate output for, and the high-order word identifies the last column. The file parser should only call SOPutDataCell or SOPutTextCell for cells within the range indicated by a call to SOGetInfo. When there is no more data within a range of columns, the file parser must call the SOPutBreak function with either the SO_EOFBREAK or SO_SECTIONBREAK value, whichever applies. This must be done for each range of columns in the document.

For example, if the first column is 10 and the last column is 19, the file parser reads the file from its current position, but it calls SOPutDataCell or SOPutTextCell only for cells that belong in columns 10 through 19, inclusive. (Column numbers are zero-based.) The parser skips over cells that belong in columns outside of this range. The file parser must produce cells for all columns in the range, filling in with empty cells if necessary. As before, the file parser continues until SOPutBreak returns the SO_STOP value.
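
The low-order and high-order word layout of the column range can be sketched as follows. The CrRange type and the cr_* helpers are hypothetical; only the layout (first column in the low-order word, last column in the high-order word) comes from the description above, and the sketch assumes a 32-bit value.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { unsigned first; unsigned last; } CrRange;

/* Split a packed 32-bit value into first (low word) and last (high word). */
CrRange cr_unpack(uint32_t packed)
{
    CrRange r;
    r.first = packed & 0xFFFFu;
    r.last  = (packed >> 16) & 0xFFFFu;
    return r;
}

/* Inverse of cr_unpack, included only to build test values. */
uint32_t cr_pack(unsigned first, unsigned last)
{
    assert(first <= 0xFFFFu && last <= 0xFFFFu);
    return ((uint32_t)last << 16) | (uint32_t)first;
}

/* Nonzero when a zero-based column lies inside the range, inclusive. */
int cr_contains(CrRange r, unsigned column)
{
    return column >= r.first && column <= r.last;
}
```

For the range in the example, the packed value 0x0013000A unpacks to first column 10 and last column 19.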

In general, the file parser should carry out the following steps:

  1. Determine the desired range of columns.
  2. Determine the next cell available from the input file.
  3. Determine if the cell is in the given range of columns. If not, repeat step 2.
  4. If the cell is not empty, call SOPutDataCell or SOPutTextCell with the current data. Otherwise, call SOPutDataCell for a cell of the SO_CELLEMPTY type.
  5. Update local variables, such as row and column numbers.
  6. Call SOPutBreak with the SO_CELLBREAK value.
  7. If SOPutBreak returns the SO_STOP value, return from the VwStreamReadFunc function.
  8. If at the beginning of the next section, call SOPutBreak with the SO_SECTIONBREAK value and return.
  9. If at the end of the file, call SOPutBreak with the SO_EOFBREAK value and return.
  10. Repeat steps 2 through 9.
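
The steps above can be sketched as a loop over a sparse list of stored cells. Everything here is hypothetical: SsCell models the input file, SsSink models the display engine, and ss_emit stands in for a call to SOPutDataCell or SOPutTextCell followed by a cell break. The sketch does not resume after a stop; a real parser would keep its row and column counters in saved state for that.

```c
#include <assert.h>
#include <string.h>

enum { SS_CONTINUE, SS_STOP };               /* hypothetical return codes */

typedef struct { int row, col; char text; } SsCell;   /* one stored cell */

typedef struct {
    char out[64];     /* one char per emitted cell; '.' marks empty */
    int  n;           /* cells emitted so far                       */
    int  budget;      /* cells the "engine" accepts before stopping */
} SsSink;

/* Stand-in for outputting one cell and then issuing a cell break. */
static int ss_emit(SsSink *s, char c)
{
    assert(s->n < (int)sizeof s->out - 1);
    s->out[s->n++] = c;
    s->out[s->n] = '\0';
    return (s->n >= s->budget) ? SS_STOP : SS_CONTINUE;
}

/* Emit every cell of columns first..last for rows 0..rows-1.
   cells[] must be sorted by row, then column. Returns cells emitted. */
int ss_read_range(const SsCell *cells, int count, int rows,
                  int first, int last, SsSink *s)
{
    int i = 0, row, col;

    for (row = 0; row < rows; row++) {
        for (col = first; col <= last; col++) {
            /* Steps 2-3: skip stored cells that lie before this position. */
            while (i < count && (cells[i].row < row ||
                   (cells[i].row == row && cells[i].col < col)))
                i++;

            /* Step 4: output the stored cell, or an empty placeholder. */
            char c = (i < count && cells[i].row == row && cells[i].col == col)
                         ? cells[i].text : '.';

            /* Steps 6-7: break after the cell; return if told to stop. */
            if (ss_emit(s, c) == SS_STOP)
                return s->n;
        }
    }
    return s->n;   /* a real parser would now issue a section or EOF break */
}
```

Cells outside the requested columns are skipped, and missing cells inside the range are filled with the empty placeholder.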

When the chunker saves local data for various seek positions in a document, it does so within SOPutBreak when the break is of the SO_CELLBREAK type. So when a file parser has its local data restored for a random seek position, the data will reflect the state of the file parser during its call to SOPutBreak for the last cell of the previous chunk in the current range of cells. Any tracking done by the parser, such as the current row number, should be updated before SOPutBreak is called for each cell.

Every horizontal range of columns, specified by the dwExtraData parameter in each call to your VwStreamReadFunc function, must eventually be terminated by a call to SOPutBreak with the SO_EOFBREAK or SO_SECTIONBREAK value, whichever is applicable. The type of break depends on the input file. A file parser must not put a section break at the end of the file, and an end-of-file (EOF) break, of course, cannot occur anywhere but at the actual end of the file.

For example, if the input document contains a single spreadsheet that is 30 columns wide, the display engine can call the parser with three different ranges of columns: 0 to 11, 12 to 23, and 24 to 29. The file parser calls SOPutBreak with an EOF break three times, once for each time it reaches the end of the file while processing a given range.
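
As a rough illustration of the example above, the sketch below splits a sheet into column ranges of a fixed width. The ColRange type and split_columns function are hypothetical, and the width of 12 is taken only from the example; how the display engine actually chooses its ranges is not described here. The number of ranges is also the number of times the parser would reach the end of the file and issue an EOF break.

```c
#include <assert.h>

typedef struct { int first, last; } ColRange;

/* Split total_cols zero-based columns into ranges of at most width
   columns. Writes up to max_ranges entries and returns the count. */
int split_columns(int total_cols, int width, ColRange out[], int max_ranges)
{
    int n = 0, first;

    assert(total_cols >= 0 && width > 0);
    for (first = 0; first < total_cols && n < max_ranges; first += width) {
        int last = first + width - 1;
        if (last > total_cols - 1)
            last = total_cols - 1;   /* final range may be narrower */
        out[n].first = first;
        out[n].last = last;
        n++;
    }
    return n;
}
```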

When calling SOPutBreak with a section break, the file parser must be sure that the seek position is at the beginning of the next section. This ensures that the file position is where the file parser needs to be when VwStreamSectionFunc is next called. Any one of the calls to SOPutBreak for a section break may be the one that sets the seek position for the top of the next section.

Database Sections

The file parser outputs data and text for a database by using the SOPutField, SOPutMoreVarField, and SOPutVarField functions. The parser uses the SOPutField function for fields of a fixed size. The other functions are used for variable length fields. The parser sets field information by using the SOPutFieldInfo function while processing the VwStreamSectionFunc function.

Bitmapped Sections

The file parser starts a bitmapped section by calling the SOPutSectionType function with the SO_BITMAP value while processing the VwStreamSectionFunc function. The file parser must also set the bitmap header information for the section by using the SOPutBitmapHeader function before returning from VwStreamSectionFunc. The information in the bitmap header allows the chunker to allocate storage for other bitmap information, such as the palette. This means that the file parser must call SOPutBitmapHeader before any other bitmapped section functions.

Section palettes

The file parser must generate a palette for those sections that have the SO_COLORPALETTE value set in the wImageFlags member of the SOBITMAPHEADER structure. The parser uses the SOStartPalette, SOPutPaletteEntry, and SOEndPalette functions to define the color palette for a bitmapped section. Only one palette may be defined for a bitmapped section.

All members set during the stream read can use RGB values, palette index values, or palette-relative RGB values. These values must be set through the SOPALETTEINDEX, SORGB, or SOPALETTERGB macro. For more information about these types of color values, see the description of the COLORREF value.
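
The definitions of the SOPALETTEINDEX, SORGB, and SOPALETTERGB macros are not shown in this overview. As a point of comparison only, the sketch below reproduces the layout of the three Win32 COLORREF forms (the RGB, PALETTEINDEX, and PALETTERGB macros). The DemoColor type and demo_* functions are hypothetical names, and it is an assumption that the parser macros follow the same layout.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t DemoColor;   /* same layout as a Win32 COLORREF */

/* Explicit RGB: 0x00bbggrr. */
DemoColor demo_rgb(unsigned r, unsigned g, unsigned b)
{
    assert(r <= 255 && g <= 255 && b <= 255);
    return (DemoColor)(r | (g << 8) | (b << 16));
}

/* Palette index: high byte 1, index in the low-order word. */
DemoColor demo_palette_index(unsigned index)
{
    assert(index <= 0xFFFFu);
    return 0x01000000u | index;
}

/* Palette-relative RGB: high byte 2, followed by the RGB value. */
DemoColor demo_palette_rgb(unsigned r, unsigned g, unsigned b)
{
    return 0x02000000u | demo_rgb(r, g, b);
}

/* The high byte tells the three forms apart: 0, 1, or 2. */
unsigned demo_kind(DemoColor c)
{
    return c >> 24;
}
```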

Tiles and scan lines

A bitmap image in a bitmapped section consists of tiles and scan lines. A tile is a rectangular portion of an image, containing at least one scan line. An image is one or more tiles wide and one or more tiles long. A tile column is the horizontal positioning of a tile; the tiles that have their x-coordinate equal to zero belong to tile column zero, with tile column numbers incrementing in the direction of the increasing x-coordinates.

The file parser specifies its tile length in terms of scan lines. Once the length is specified, the display engine always requests bitmap data as whole tiles; that is, it tells the parser to stop only on integer multiples of the tile length. For formats that contain multiple tiles, file parsers should set the tile length to the minimum number of scan lines required for a single tile. Formats that are not stored in tiles should have the tile width set equal to the image width and the tile length set to one scan line.

The following values are expected to be valid when tiles are created.

TILESACROSS   = (ImageWidth + TileWidth - 1) / TileWidth
TILESDOWN     = (ImageLength + TileLength - 1) / TileLength
TILESPERIMAGE = TILESACROSS * TILESDOWN
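
These expressions are ceiling divisions, so a partial tile at the right or bottom edge still counts as a whole tile. A minimal sketch, using the hypothetical TileCounts type and tile_counts function:

```c
#include <assert.h>

typedef struct { long across, down, per_image; } TileCounts;

/* Compute the tile counts with integer ceiling division. */
TileCounts tile_counts(long image_w, long image_len, long tile_w, long tile_len)
{
    TileCounts t;

    assert(image_w > 0 && image_len > 0 && tile_w > 0 && tile_len > 0);
    t.across    = (image_w + tile_w - 1) / tile_w;
    t.down      = (image_len + tile_len - 1) / tile_len;
    t.per_image = t.across * t.down;
    return t;
}
```

A 100 by 50 image with 32 by 16 tiles needs 4 tiles across and 4 down. An untiled 640 by 480 image, with the tile width equal to the image width and a tile length of one scan line, gives 1 tile across and 480 down.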
 

To output bitmap data, the file parser outputs a scan line at a time, in sequential order, by using the SOPutScanLineData function. All of the scan lines must belong to the same tile column. After each scan line, the file parser calls the SOPutBreak function with the SO_SCANLINEBREAK value. As is normally the case, the return value from SOPutBreak indicates whether the file parser should return from the VwStreamReadFunc function.

Building scan lines

The file parser builds the scan line data as a continuous stream of bits that define each pixel. Each pixel is packed into an array of bytes in such a way that if the data were written out in hexadecimal or binary numbers, the pixels could be read in order from left to right. For example, in a 4-bit-per-pixel format, the first pixel is stored in the high-order bits of the first byte (bit 7, bit 6, bit 5, and bit 4), and the second pixel is stored in low-order bits of that byte (bit 3, bit 2, bit 1, and bit 0). So if the first eight pixels of a 4-bit-per-pixel scan line have the hexadecimal values of 0, 2, C, 9, A, 4, 3, and F, the first four bytes of scan line data would be 02, C9, A4, and 3F.
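
The 4-bit packing described above can be sketched as follows. The pack_4bpp function is a hypothetical helper, not part of the parser interface; it places the first pixel of each pair in the high-order four bits and the second in the low-order four bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack 4-bit pixels, first pixel of each pair in the high nibble.
   out must hold (count + 1) / 2 bytes. Returns the bytes written. */
size_t pack_4bpp(const uint8_t *pixels, size_t count, uint8_t *out)
{
    size_t i;

    assert(pixels != NULL && out != NULL);
    for (i = 0; i < count; i++) {
        uint8_t p = pixels[i] & 0x0F;
        if (i % 2 == 0)
            out[i / 2] = (uint8_t)(p << 4);   /* starts a new byte   */
        else
            out[i / 2] |= p;                  /* completes that byte */
    }
    return (count + 1) / 2;
}
```

Packing the eight pixel values from the example yields the bytes 02, C9, A4, and 3F. An odd pixel count leaves the low-order bits of the last byte zero.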

If the parser provides a palette for the image, the data for each pixel is interpreted as an index into the palette. If no palette exists for the image, the bits for each pixel specify either a true color (24-bit only) or a gray scale value. For 24-bit color, each 3 bytes of a scan line represent the intensities of red, green, and blue of a single pixel.

When the scan line has been completely specified, the parser must call SOPutBreak with the SO_SCANLINEBREAK value, except for the last line of the bitmap. The last line of the bitmap must end with a break of the SO_SECTIONBREAK or SO_EOFBREAK type, whichever applies.
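
The choice of break after each scan line can be expressed as a small helper. The SL_* values and the sl_break_for_line function are hypothetical stand-ins for the SO_SCANLINEBREAK, SO_SECTIONBREAK, and SO_EOFBREAK values; the sketch only encodes the rule stated above.

```c
#include <assert.h>

/* Hypothetical stand-ins for the three break values. */
enum { SL_SCANLINEBREAK, SL_SECTIONBREAK, SL_EOFBREAK };

/* Pick the break that follows scan line `line` (zero-based) of a bitmap
   with total_lines lines. more_sections is nonzero when another section
   follows this bitmap in the file. */
int sl_break_for_line(int line, int total_lines, int more_sections)
{
    assert(line >= 0 && line < total_lines);
    if (line < total_lines - 1)
        return SL_SCANLINEBREAK;              /* not the last line yet */
    return more_sections ? SL_SECTIONBREAK : SL_EOFBREAK;
}
```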

The following example illustrates the use of the bitmapped functions in the simplest possible case. In this case, the parser has scan line data stored one tile wide. The scan line data is already in the format that the parser is required to provide it in, so the data requires no additional processing after being read in. This example also does not check for EOF or read errors.

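WORD  wBytesRead;
WORD  wBufSize = Proc.ScanLineBufSize;

do
{
    ...

    // Read one scan line. The data is already in the required format.
    xread( hFile, Proc.ScanLineBuf, wBufSize, &wBytesRead );

    // Pass the scan line to the display engine.
    SOPutScanLineData( Proc.ScanLineBuf, hProc );

    ...

    // Continue until the display engine tells the parser to stop.
} while( SOPutBreak( SO_SCANLINEBREAK, 0, hProc ) == SO_CONTINUE );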
 

Vector Graphics Sections

The file parser starts a vector graphics section by calling the SOPutSectionType function with the SO_VECTOR value while processing the VwStreamSectionFunc function. The file parser must also set the vector header by using the SOPutVectorHeader function before returning from VwStreamSectionFunc. The information in the SOVECTORHEADER structure defines the size and attributes of the rectangle in which vector graphics are drawn.

The vector graphics functions are similar to the primitive graphics device interface (GDI) functions, but they include extensions based on the file formats being supported. All vector graphics objects are described in two-dimensional space on a logical coordinate system. The direction and resolution of the x- and y-axes are defined in SOVECTORHEADER.

The file parser uses two functions to transfer data. The SOVectorAttr function sets attributes related to drawing vector graphics objects, and the SOVectorObject function defines a vector graphics object to be drawn. The parser specifies an identifier, a data size, and the address of the data when it calls either function. The identifier specifies the action to take, and the size and data define the details of the action. Each action has a corresponding structure in which the data must be given. For example, to define a logical font, the parser must set the members of the SOLOGFONT structure and pass the structure to SOVectorAttr.
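
The identifier, size, and data pattern can be sketched in isolation. All names below are hypothetical: the VX_* identifiers, the VxPen and VxLine structures, and the vx_emit function are illustrations only and do not correspond to the actual identifiers or structures accepted by SOVectorAttr or SOVectorObject.

```c
#include <assert.h>
#include <string.h>

enum { VX_SET_PEN = 1, VX_DRAW_LINE = 2 };   /* hypothetical identifiers */

typedef struct { int width; unsigned color; } VxPen;    /* attribute data */
typedef struct { int x1, y1, x2, y2; } VxLine;          /* object data    */

/* One transferred record: what to do, how much data, and the data. */
typedef struct {
    int            id;
    unsigned short size;
    unsigned char  data[32];
} VxRecord;

typedef struct { VxRecord rec[8]; int count; } VxSink;

/* Stand-in for a transfer call: copy the caller's structure by size. */
int vx_emit(VxSink *s, int id, unsigned short size, const void *data)
{
    VxRecord *r;

    assert(data != NULL);
    if (s->count == 8 || size > sizeof s->rec[0].data)
        return 0;
    r = &s->rec[s->count++];
    r->id = id;
    r->size = size;
    memcpy(r->data, data, size);
    return 1;
}
```

Because the receiver sees only an identifier, a size, and a block of bytes, each identifier must always be paired with the structure defined for it.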

Although vector graphics functions are similar to the GDI functions, they are not exactly the same. For example, the members of the SOLOGFONT and LOGFONT structures are not necessarily the same.

The file parser should call the SOPutBreak function with the SO_VECTOROBJECTBREAK value after drawing every object.

Writing a File Parser

File parsers should be contained in a set of source and include files as follows, where XXX represents a mnemonic for the data format. For specific examples, see the sample ASCII filter files identified in the following table:

Generic file name   Contents                  Sample ASCII filter file
VS_XXX.C            Code                      VS_ASC.C
VSD_XXX.C           Data                      VSD_ASC.C
VS_XXX.H            Type definitions          VS_ASC.H
VSP_XXX.H           Portability information   VSP_ASC.H

The portability information file makes porting filters to other platforms easier. A set of include files is provided that allows conditional compilation to yield executable DLLs for each platform from the same set of source files.

Your VSP_XXX.H file should look something like the following. (For further information, see the corresponding ASCII filter file.)

The parser must not change the contents of the static data structure because it is shared among all instances of the parser.

VwStreamDynamicName is for consistency and has no real use because all dynamic data is accessed through the function pseudonym. Each instance of the parser has a separate copy of dynamic data.

VwStreamSaveName should reference an element in the VwStreamDynamicType structure. The data in this structure is saved after every call to VwStreamSectionFunc and VwStreamReadFunc and restored before every call to VwStreamReadFunc.

If neither VwStreamSectionType nor VwStreamSectionName is defined, the file parser is assumed to handle a single section only. VwStreamSectionName should reference an element in the VwStreamDynamicType structure. The data in this structure is saved after each call to VwStreamSectionFunc and is guaranteed to contain the current section's data on entry to VwStreamReadFunc.

This example shows the relationship of the various save areas to the dynamic data structure.

typedef struct {
    ...
} VwStreamSaveType;

typedef struct {
    ...
} VwStreamSectionType;

typedef struct {
    ...
    VwStreamSectionType VwStreamSectionName; // multisection only
    VwStreamSaveType VwStreamSaveName;
} VwStreamDynamicType;
 

VwStreamIdName is the name of the FILTER_DESC array in VSD_XXX.C, and VwStreamIdCount is the number of elements in this array. Like the static data, this data should never be changed by a parser.


© 1997 Microsoft Corporation. All rights reserved.