Spreadsheet Data Description Language and Tabulating Tool

Many persons work to put data into the text and tables of reports. Your job is to get that data out. You want to describe and aggregate the data so that you can query and analyze it flexibly, quickly, and efficiently. The Spreadsheet Data Description Language (SDDL) and the SDDL Tabulating Tool (STT) can help you. SDDL / STT helps to transform various human-generated data forms into a uniform, unified data list.

This document describes SDDL / STT in detail. Available elsewhere is a gentle introduction to SDDL / STT. To get running, get the source code for STT and instructions on how to run STT. The following description assumes SDDL tokens and parameters are set to the defaults shown in the distribution STT source code. These defaults can be changed to customize and localize SDDL.

An SDDL block is a rectangular block of spreadsheet cells that can be at any position in a spreadsheet. STT tabulates an SDDL block to another sheet in the same spreadsheet document. By default, the target tabulation sheet is named "tabulated", which will be created if it doesn't already exist. The SDDL directive "%tabto", which can occur only on the first row of an SDDL block, explicitly directs the tabulation to the sheet named in the cell to the right of that directive.

By default, STT tabulations are cumulative. The items in the SDDL block will be added to the target tabulation, with additional categories added if specified categories do not already exist. Clearing the tabulation sheet cells using spreadsheet clear functionality allows starting a new tabulation.

Tabulations include meta-data not explicitly specified in SDDL. The meta-data are placed in the leftmost columns of the tabulation. The default metadata shows the tabulation sheet row number for each row of the tabulation. Setting the OPTIONAL_META flag within the STT source code will add left columns with the spreadsheet ID, the tabulation sheet name, the SDDL source sheet name, and the SDDL source sheet row number that triggered the tabulation of that tabulation row. Operations of tabulating a record and clearing a record, including the explicit SDDL "clear" directive, have no effect on items in metadata categories.

STT processes SDDL in row order, cumulating data item into user-defined data categories. The basic SDDL row is a category followed by an item. This category-item pair tells STT to place in that category the following item. When an item is added to a category that already contains an item from a previous SDDL row, STT tabulates a record and clears all non-pinned categories.

Prefixes prepended to categories control the pinning and un-pinning of items in categories. If "*" is prefixed to a category in a category-item pair, then the corresponding item will remain in that category across subsequent record tabulations until a different item is placed in that category or the record is explicitly cleared. The category prefix "^" inserts in that category the corresponding item in that category-item pair, but does not trigger a record tabulation even if an item was already in that category. The prefix "&" behaves the same, but pins the inserted item. Using a category in a SDDL stack or twoway (see below) automatically unpins the category. Pinning/insertion prefixes in an implicit stack (on its own or in a twoway) are ignored. Pinning/insertion prefixes cannot be combined with the "1" and "2" prefixes.

An SDDL stack is one or more rows of categories typically followed by more than one item. If there are no missing values, then the count of items in each stack row is equal. One record is tabulated for each column of items, with the column items filling the corresponding categories. Each record includes all items placed in categories in the current record before the stack began. The rightmost item in the stack defines the column dimension of the stack. In tabulating the stack, empty cells are filled with the missing-value indicator. The stack ends when an category-item row is encountered, or a directive row, or a stack row that contains a category already included in the current stack. In the last case, the current stack is tabulated and a new stack begins. Categories prefixed with a 1 explicitly declare a stack. Prefixing a "1" to a stack category allows defining a single-column stack and avoids terminating a wider stack where some stack row has a missing second item.

An SDDL twoway form supports tabulating a grid of items. In the simplest case, a twoway form generates one tabulated record with each item in the grid. Each tabulated record includes all items placed in categories in the current record before the twoway began. The twoway form starts with the directive "%twoway" followed by precat categories. A stack, with items offset to the right by the number of precats, follows the twoway directive. The come "2cat" rows, which are rows that begin with category prefixed with a "2". The 2cat row consists first of precat items, then grid items. The precats define the categories of items included in 2cat rows before grid items. The record for grid item at position (r,c) is created from items in the "precat" section of row r, the items in the stack above the grid in column c, and the grid item. The grid item goes into the "2cat" category that begins the grid row. Different 2cat categories are permitted within a single twoway form. If precats items are duplicated across some rows with different 2cats, then the record set will be compressed appropriately. If all precat items are identical, then the number of records tabulated equals the highest-count 2cat times the number of columns in the grid/stack. As with the stack, the twoway handles empty cells within the form as missing values.

STT differentiates missing values in stacks and twoways, empty categories when a record is tabulated, and non-existent categories when a record is tabulated. The value tabulated for the first two cases are explicitly defined (separately) in the STT script. Because STT processes SDDL serially, a non-existent category at the time of a record tabulation necessarily appears as a blank cell when the category is subsequently added to the tabulation. The user can also define a missing value that the user explicitly places like any other item.

The STT directive "%gettab" adds a tabulation (data list with fields/categories in the first row) to the current tabulation. Parameters in cells to the right of this directive specify the location of the tabulation to be added. In the STT Google Apps Script implementation, the two cells to the right contain the Google Spreadsheet Doc ID and the sheet name within that spreadsheet.

Other SDDL directives:

Within the spreadsheet SDDL block selected for STT processing, blank rows are ignored up to the MAX_BLANK_ROWS parameter set in the script. If that number of consecutive blank rows are found, STT ends processing the block. Similarly, in processing a stack or twoway, STT will look across a maximum of MAX_BLANK_CELLS to find a non-blank item following (implicitly) missing items in that row. These constraints allow you to select a whole spreadsheet (clicking on the box at the corner of the row and column labels) and have STT process the sheet relatively efficiently.

STT processes a block according to the principle "do the best you can with what you're given." Processing choices made in response to irregular SDDL are typically logged to a log file. Check the log file after running STT to be informed of any recognized SDDL issues. Adding the optional meta-data will show you the row number in the SDDL source that triggered the given record line in the tabulation. That source row number can help you to understand how a record got into the tabulation.

SDDL / STT examples:

Return to Galbi Think! main page