http://www.spinellis.gr/pubs/trade/1998-Login-TextTools/html/paper.html
This is an HTML rendering of a working paper draft that led to a publication. The publication should always be cited in preference to this draft using the following reference:

Diomidis Spinellis. Small tools for automatic text generation. ;login:, 23(4):44–47, August 1998.

The document's metadata is available in BibTeX format.

This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.

Diomidis Spinellis Publications

Small Tools for
Automatic Text Generation

Diomidis Spinellis
SENA S.A.
Byzantiou 2
GR-142 34 Nea Ionia
Greece
e-mail: dds@senanet.com

March, 1998

1 The Problem

The textual description and analysis of large data sets can be a repetitive and error-prone task. Examples of data sets which may need to be textually described include population statistics, market research results, scientific and engineering data such as chemical compounds or earthquake structural damages, and literature surveys. In all such cases the data needs to be organised and presented in a meaningful way. In some cases this process is periodically repeated (e.g. market research results); in other cases the base data set is frequently updated (e.g. a literature survey). Automating the data presentation process can reduce the work required to generate the reports, eliminate a source of errors, enhance the outcome's consistency, and speed-up the generation process.

In the following paragraphs I describe a small set of special-purpose tools developed for the automatic production of a multiparadigm language literature survey. A set of 104 languages combining different programming paradigms needed to be presented in a meaningful way, tabulated, indexed, and cross-referenced. I decided to divide them according to the paradigms they supported and present them by listing certain characteristics of each language. During the course of my investigation the set of languages surveyed grew by 100% and was frequently revised; this made the hand editing of the text error-prone and unproductive.

Although the tools developed are special purpose, the methods used for the construction of the tools and the generation of the output can be applied in a number of different situations.

2 Functional Description

The results of the survey were entered into a simple database structured as a text file. The file is structured similar to a refer database: records are separated by empty lines, and record fields are identified by a letter following the percent character at the beginning of a line. A sample record from the database can be seen in figure 1. The record's fields are the name of the language N, significant references R, its characteristics C, its usual implementation I, the paradigms it supports, P and a short descriptive text D.

**Figure 1:** Sample record from the language database.
%N Modula-Prolog %R Mul86 %C UN RL RT nDT BR... ...a-2. The library includes term handling procedures.

The text generator scans the database and divides the languages into categories according to the paradigms supported by each language. Each such category (e.g. languages that support the logic and object-oriented paradigms) is formatted as a separate text section. The generator formats the title as in the following example:

2.2.5 Combinations of Imperative and Logic Paradigms

It then inserts some manually prepared descriptive text and appends text that tells the reader the number of languages in that category, provides pointers to the summarising tables (examples are tables 1 and 2), and introduces the paragraphs to follow. The following is an example of the automatically generated text:

We found ten languages that combine the imperative and logic programming paradigms. Their implementations are summarised in table 1 and their characteristics in table 2. In the following paragraphs we list the most important features of each language.

A description of the languages is then produced based on the R, D, and N record fields:

2.PAK
[Mel75] Block structured language offering user-defined pattern matching and backtracking.
C with Rule Extensions
[MS90] Based on the C programming language with an extended syntax, a richer set of data types, a flexible input/output system and a forward chaining [Ric83, p. 56] execution strategy.
Leda
[Bud91] Language with syntax similar to that of Pascal, with an additional code abstraction facility, the relation. The data-space for all entities contains the undefined value. Relations are coded as Prolog rules, and allow backtracking.
...

**Table 1:** Implementations combining the imperative and logic paradigms
Name	References	Implementation
2.PAK	[Mel75]	Language
C with Rule Extensions	[MS90]	Extension of C, preprocessor
Leda	[Bud91]	Language
Logicon	[LC86]	Prolog interpreter in Icon
...

**Table 2:** Characteristics of imperative and logic paradigm combinations
Name	Characteristics							Control
	BR	$\|\|$	R	RT	$U$	DT	RT
2.PAK	$v$				$v$	$v$	$v$	*
C with Rule Extensions	$v$		$v$	$v$	$v$	$v$	$v$	*, X
Leda	$v$		$v$	$v$	$v$	$v$	$v$	SLD, X
Logicon	$v$		$v$	$v$	$v$	$v$	$v$	X, SLD
...

The generator inserts at the end of each section another tailor-made paragraph containing concluding remarks on that section. A special section contains all languages that could not be fitted into one of the preceding sections together with a special table listing the paradigms each language supports (table 3). After all sections have been processed the generator produces a summary of the number of languages supporting each paradigm combination (table 4).

**Table 3:** Languages and paradigms they support
Name	Paradigms
DSM	Imperative, Object-oriented and Relational
Echidna	Constraint, Logic and Object-oriented
Educe	Database and Logic
Enhanced C	Imperative and Set
Fooplog	Functional, Logic and Object-oriented
Icon	Generators and Imperative
KE88	Functional, Logic and Object-oriented
Kaleidoscope	Constraint, Imperative and Object-oriented
Lex	Imperative and Regular-Expression
ML-Lex	Functional and Regular-Expression
...

**Table 4:** Summary example table
Functional	$o$		$o$	$o$			$o$		$o$
Imperative		$o$	$o$			$o$	$o$
Object-Oriented				$o$	$o$	$o$	$o$	$o$
Logic	$o$	$o$			$o$		$o$	$o$	$o$
Distributed								$o$
Constraint									$o$
Number of languages	24	10	9	8	11	7	5	4	5

3 Implementation

I implemented the generator as a set of ten small tools. A separate program was used to administer the database in cases where a text editor was not adequate. Each of the ten tools performs a small specialised task. The tools are implemented in the Perl and Bourne shell (sh) languages making use of additional Unix tools such as grep and sed.

The system's driver is the maketext program which given a list of interesting paradigm combinations generates the section title and the opening paragraph. It then calls the external programs desclist to create the description list and chartable/imptable to create the characteristics and implementation tables. After the whole database has been processed the same program generates the summary table. One other program, partable, generates the language/paradigm table for the last catch-all section. All programs take as argument the paradigm combination described in the section processed, or for the last section the paradigm combinations already processed. Maketext is implemented in Perl, making extensive use of regular expressions to parse the database.

The second level programs operate on the output of the pars program which given a paradigm combination generates a sorted list of all records exactly matching that paradigm combination. A similar program, chars, generates only the characteristics of the paradigm combination.

Pars depends on its operation on the dbgrep program which scans the database for records matching the search criterion specified as an expression possibly containing string regular expressions. Dbgrep also takes a flag to display all records not matching the criterion. This feature is used in the generation of the last section.

In order to match the record output order of these programs and create meaningful tables the records and fields are sorted in various phases of the text generation, either using the sort statement available in Perl, or two additional programs linesort and llinesort which sort the words on every input line. Llinesort performs this operation only on lines matching a specific pattern.

This division among the many specialised programs resulted in a modest implementation effort -- 697 lines of code divided as indicated in table 5. The largest program, maketext, is only 117 lines long including the text that is copied verbatim to the output file.

**Table 5:** Implementation effort and details.
Program	Implementation	Lines
chars	Sh	13
chartabl	Perl	59
dbgrep	Perl	27
desclist	Perl	17
imptable	Perl	38
linesort	Perl	19
llinesor	Perl	26
maketext	Perl	117
pars	Sh	13
partable	Perl	27
Total	Perl	697

The output of the text generator is in the LaTeX text markup language. The 1134 line language database is converted into 1464 lines of LaTex which also include commands to include another 371 lines of human-generated text.

4 Conclusions

The database described in the previous sections and the associated tools were used for a period of almost three years. During that time the database was frequently edited and revised. In some cases the output format was also modified. Both types of changes would have been very difficult if the data was statically embedded in a document. Using approach I described, changes were made in the structured database and a single command reflected them in the camera-ready output. The ease of adding new records and modifying existing ones encouraged me to keep the database current.

I have thus found that special purpose throw-away tools can be effectively used to automatically generate text out of structured data collections. For those who might be interested to use this approach in other domains the following suggestions may be of help:

The use of a markup language such as HTML, troff or TeX as tool output can be employed to focus on the problem to be solved and avoid issues of output formatting and device support. Troff preprocessors such as eqn and tbl have successfully employed this method.
Intermixing human with generator-produced text can be used to create more lively output and avoid the dry feel of generator output. Small tricks such as correctly spelling out numerals, and adding a final ``and'' to a generated list also help.
The leverage provided by the string handling, regular expressions, associative arrays and meta-expression evaluation capabilities of the Perl programming language was invaluable. The idea of the automatic generation of text may not have been practical without Perl as an implementation vehicle.
Although Perl has a number of useful capabilities the use of pipelines to divide the work among small units which process their input and produce some output in conjunction with the use of standard Unix tools was also critical to the implementation of the system.

Small Tools for Automatic Text Generation

1 The Problem

2 Functional Description

3 Implementation

4 Conclusions

Small Tools for
Automatic Text Generation