298x Filetype PDF File size 0.84 MB Source: www.cs.upc.edu
literate
Moderated by
Christopher J. Van Wyk programming
n May and June 1986, Programming worth an experiment to see whether interest
Pearls took up literate programming, an among readers, authors, and critics would be
I
approach to programming espoused by sustained. With his help, we commissioned
Donald Knuth. Knuth’s premise is that the Christopher Van Wyk of AT&T Bell Laborato-
best programs are meant for people as well as ries to be the moderator of a new column on
machines; they are meant to be read as well literate programming, inaugurated with
as run. He built a system, WEB, that scans this issue. Van Wyk received his Ph.D. from
specially structured technical reports to Stanford University in 1980 and has spent
extract Pascal statements interwoven the intervening years at Bell Labs
with the text, forming from them working on programming lan-
an executable program. Pascal is guages for graphics and algo-
not essential; WEB can work as rithms for computational
well with any other programming
language, such as Fortran. As moderator, Van Wyk
A literate program contains will pose the problems to be
not only the needed state- considered, solicit and
ments in a programming receive literate pro-
language, but also a precise grams, obtain cri-
problem statement, a summary tiques, and publish
of the background needed to the results within
understand the solution, an
evaluation of alternatives, constraints of a
assessments of trade-offs magazine-all
between the runnin tasks that will use
time and space,
or between run- 1
ning time and
programming
time, and sugges-
tions on how to modify the
program. Program code segments
are inserted in the text at points logical to
the intellectual development of the algorithm. renewal. This experiment can succeed only
A literate program pays careful attention to with your full support. If you are interested
lucidity of presentation and presents all argu- in being the author of a literate program, or
ments needed to understand why the pro- acting as the literary critic, contact Van Wyk
gram will actually work as intended. To at the address listed in the masthead:
emphasize the literary side, the second Pearl Christopher J. Van Wyk
included a critique of Knuth’s program that AT&T Bell Laboratories
evaluated it on these points. Room 2C-457
The response to both Pearls was strong 600 Mountain Avenue
and positive. Jon Bentley proposed that Murray Hill, NJ 07974
Communications renularlv devote snace to
interesting, literate programs andtheir well- Peter J. Denning
written critiques. I agreed that it would be Editor-in-Chief
July 1987 Volume 30 Number 7 Communications of the ACM 593
literate
Moderated by
Christopher J. Van Wyk programming
PRINTING COMMON WORDS
Moderator’s Introduction to Column 1 literate programming that involves much less machin-
Two Programming Pearls columns in 1986 presented ery than WEB. Finally, note that Hanson solved a
literate programs by Donald Knuth. The May 1986 col- slightly different problem than Knuth; although that
umn contained a WEB program to generate a sorted list makes little difference to our discussion of literate pro-
of M distinct numbers randomly chosen from 1 through grams, it highlights the importance of careful problem
N. Programming Pearls of April 1987 contained readers’ specification in the design of large systems.
comments on Knuth’s literate program, as well as an
unimplemented solution to the problem by David Gries.
The June 1986 Programming Pearls presented Knuth’s
WEB program to print the k most common words in a
file, in decreasing order by frequency. That column
also elicited comments from many readers. Printing Common Words
Several readers wondered how well a small example
can serve to illustrate the features of a system for liter- 1. Introduction. In describing Don Knuth’s WEB sys-
ate programming. The issue arose in two very different tem in one of his “Programming Pearls” [Communi-
ways: One reader who thought WEB worked acceptably cations of the ACM 29, 5 (May 1986), 364-3691, Jon
in the small doubted that it would serve a large project Bentley “aSsigned” the following programming problem:
well; another suggested that the WEB solution’s appar- “Given a text file and an integer k, you are to print the k
ent clumsiness in the small could be attributed to its most common words in the file (and the number of their
being designed for larger, more complex programs. occurrences) in decreasing frequency.”
A couple of readers very much appreciated Doug It is unclear from this problem statement what to do
McIlroy’s addition of a figure to the documentation. with ‘%ies,” that is, does k refer to words or word fre-
They also thought that trie-sorting in Sections 37-39 quencies? For example, in the problem statement, “the”
needed more comments than “The restructuring opera- occurs three times “k ” “in ” “and ” and “file” each oc-
tions are slightly subtle here.” curs twice, and td rek of t6e wor& each occurs once. If
One sharp-eyed reader noticed that, although the program is invoked with the statement as input and
Section 38 uses “the fact that count [ 01 = 0,” k = 2, which word should be output as the second most
count [ 01 is never initialized. Although some Pascal common word? A rephrasing of the problem removes the
compilers detect references to uninitialized variables, ambiguity: “Given a text file and an integer k, you are
apparently the compiler that Knuth used at SAIL does to print the words (and their frequencies of occurrence)
not. whose frequencies of occurrence are among the k largest
In this inaugural column, we present another solu- in order of decreasing frequency.”
tion to the problem of printing the k most common Using this problem statement, the output of ,the pro-
words in a file. The author of this solution is David gram with the original problem statement as input and
Hanson, a professor of computer science at Princeton with k = 2 is
University and an editor of the journal Software-Prac-
tice 6 Experience. Hanson’s program illustrates nicely 3 the
the use of abstract data types on the way to the solution 2 file
of a problem. It discusses the role of profiles of execu- 2and
tion in the design of good programs, a topic that is also 2 in
discussed in this month’s Programming Pearls. Han- 2k
son’s solution shows how one can design a system for
Tkii work WBB supported in part by the National Science Foundatian under
0 1987 ACM OOOl-0782/137/0700-0594 $1.50 Grant MCM2398.
594 Communications of the ACM luly 1987 Volume 30 Number 7
Literate Programming
Bentley posed this problem to present a “real” exam- /* initialize k */
ple of WEB usage. For more information about WEB, see /* initialize word table */
Knuth’s “Literate Programming,” The Computer Journal while (getwordcbuf , MAXWORD) != EOF)
67, 2 (May 1984), 97-111. Knuth’s solution appears in addword (buf > ;
Communications of the ACM 29, 6 (June 1986), 471-483, printwords (k) ;
along with a review by Doug McIlroy. where buf is a character array of MAXWORB characters,
The solution given here is writ,ten in the C program- and getword places the next word in the input in buf
ming language and presented using the loom system to and returns its length, or EOF at the end’of file. MAXWORD
generate the printed program and its explanation. loom is defined to be 101 to allow room for a terminating null
is a preprocessor whose input is a text file with embed- character. addword adds the word in buf to the table of
ded references to fragments of the program. loom re- words, and printwords prints the words with the k
trieves these fragments, optionally pushes them through largest frequencies.
arbitrary filters, and integrates the result into the out- Getting program arguments and environment vari-
put. ables, such as k and PAGESIZE, is a common feature of
loom’s output is usually input to a document format- many UNIX0 programs and the code is idiomatic. Exam-
ter, such as troff or ‘I&X. loom was originally written by ples can be found in The UNIX Programming Environ-
Janet Incerpi and Robert Sedgewick and used in prepa- ment by B. W. Kernighsn and R. Pike (Prentice-Hall,
ration of Sedgewick’s book Algorithms (Addison-Wesley, Englewood Cliffs, N.J., 1984).
Reading, Mass., 1983). Starting from their program, I
rewrote loom for use in writing a book and several pa- 4. Reading words. getword reads the next word from
pers. the input. This is accomplished by discarding characters
loom is not as ambitious or as comprehensive as WEB. up to the next occurrence of a letter, then gathering up
It does, however, have the virtue of independence from the letters into the argument buffer:
both formatting and programming languages. It does
not, for example, provide the comprehensive indexing, int getwordcbuf, size)
cross-referencing, or pretty printing facilities of WEB. With char *buf;
help from its associated filters, loom does provide index- int size;
ing of the identifiers used in the program fragments, al- c
though the index is omitted here for brevity. And since char *p;
it is not necessary to present the whole program, irrel- int c;
evant details can be omitted permitting the documen- p = buf;
tation to concentrate on the important aspects of the while ((c = getchar != EOF)
programs. I have formatted this program description in if (isletter( <
a style similar to WEB for comparison purposes, but the do C
formatting of loom’s output is not constrained to any one if (size > 1) C
style. Using loom also has an effect similar to WEB: De- *p++ = c;
veloping and writing about programs concurrently affect size--;
both activities dramatically. 1
c = getcharo;
2. Definitions. The problem statement does not give ) while (isletter(c.1) ;
a precise definition of a “Word” or of the details of pro- *p = ‘\O’;
gram invocation. Words are given by the set {w / w = return p - buf;
oa* and [WI 5 lOO} where a E {a...z,A...Z}; that return 1 EOF;
is, a word is a sequence of one or more upper- or lower- >
csse letters, up to a maximum of 100 letters. Only the
first 100 characters are considered for words longer than size is compared with 1 to ensure that there is room for
100 characters. the terminating null character. isletter is a macro that
The program, called common, is invoked with a single tests for upper- or lowercase letters:
optional argument that gives the value of k and reads #define isletter (c >= ‘a’ && c <= ‘z’ II \
its input from the standard input file. If the argument is c >= ‘A’ && c <= ‘Z’)
omitted, the value of the environment variable PAGESIZE
is used; the default is 22. 5. Storing the words. The words must be stored in
3. The main program. As suggested in Software a table along with the number of times they occur in the
Tools by Kernighan and Plauger (Addison-Wesley, Read- input. This table must handle two kinds of access: While
ing, Mass., 1976), the structure of the program can often the input is being read, the table is “indexed” with a
be derived from the structure of the input data. The in- word in order to increment its frequency count. After the
put to common is a sequence of zero or more words, which input has been read, the entries with the k largest fre-
suggests the following structure for the main program: UNIX is a registered trademark of AT&T Bell Laboratories.
]uly 1987 Volume 30 Number 7 Communications of the ACM 595
Literate Programming
quency counts must be located and printed in decreasing addword also increments a global integer, total., which
order of those counts. counts the number of distinct words in the table. This
These two kinds of access are disjoint; that is, initially, number is required in the second phase of the program.
all accesses to the table are of the first kind, followed by strcmp is a C library function that returns 0 if its two
only accesses of the second kind. Consequently, the table arguments point to identical strings, and strcpy is a C
representation can be designed to facilitate the first kind library function that copies the characters in its second
of access, and then changed to facilitate the second. argument into its first.
A hash table is appropriate for indexing the table with allot (n, size) allocates space for n contiguous ob-
words. Since the size of the input is unknown, a hash ta- jects of size bytes each by calling calloc, a C library
ble in which collisions are resolved by chaining is used. function that does the actual allocation and clears the
Space for both the word and the table entry can be allo- allocated space. allot’s primary purpose is to catch allo-
cated dynamically. The hash table itself, hashtable, is cation failures. Many C programmers erroneously assume
an array of pointers to word structures: that calloc cannot fail. On machines like the VAX, al-
#define HASHSIZE 07777 /* hash table size */ location rarely fails, but on smaller machines, failure is
struct word ( common.
char *word; /* the word */ 7. Printing the words. As suggested in the outline
int count; /* frequency count */ for main, given above, printwords prints th.e desired
struct w,ord *next; /* link to next entry */ output. To print the k most common words as specified,
) *haehtable [HASHSIZE+l] ; printwords must sort the contents of the table in de-
The bounds of hashtable are 0 to 2” - 1, where n is 12 creasing order of the count values, and print the first k
here. Using a power of two facilitates rapid computation entries. Since the frequencies range between 1 and N,
of the index into hashtable given a hash number: If h is where N is the number of words, sorting them can be
a hash number, the index is h&HASHSIZE. hashtable is accomplished in time proportional to N (assuming every-
initialized in main to NULL pointers. thing fits into memory) by allocating an array of pointers
to words that is indexed by the frequency of occurrence.
6. addword (buf ) adds the null-terminated string in buf Each element in the array points to the list of words with
to hashtable, if necessary, and increments its count the same count values; that is, list [il points to the list
field. To compute the index into hashtable, the contents of words with count fields equal to i.
of buf must be “hashed” to yield a hash number h, from printwords (k)
which the index is computed as described above. A sim- int k;
ple yet effective hash function is to sum the codes of the {
characters in buf. This function also yields the length of int i, mw:
the word, which is needed to add new words to the table. struct word *wp, **list, *q;
Putting this all together produces addword: list = (struct word **> alloc(tota1, sizeof wp)
addword(buf) max = 0;
char *buf; for (i = 0; i <= HASHSIZE; i++)
c for (wp = hashtable[il ; wp; wp = q> {
unsigned int h; q = wp->next;
int len; wp->next = list [wp-Xountl ;
char *s. *alloc() ; list [wp->count] = wp;
struct word *wp; if (wp->count > max)
max = wp->count ;
h = 0; /* compute hash number of buf [I. .I */ 3 = max; i >= 0 && k > 0; i-->
s = buf; for (i
for (len = 0; *a; len++) if ((wp = list[i]) && k-- > 0)
h += *s++; for ( ; wp; wp = wp->next)
for (wp = hashtable[h&HASHSIZE] ; wp; printf (“%d %s\n** , wp->count , wp->word) ;
wP = wp->next) 3
if (strcmp(wp->word, buf > == 0)
break ; max keeps track of the largest frequency count, which is
if (wp == NULL) { /* a new word */ usually much less than N, and provides a starting point
wp = (struct word *) alloc(1, sizeof *wp) ; for the reverse scan of list.
wp->word = alloc(len + 1, sizeof (char)) ;
strcpy(wp->word, buf); 8. Performance. Bentley did not give specific perfor-
wp-Xount = 0; mance criteria for common, but he did say that “a user
wp->next = hashtable [h&J+ASHSIZE] ; should be able to find the QIKYI most frequent words in a
hashtable [h&HASHSIZE] = wp; twenty-page technical papei without undue emotional
total++ ;
3 trauma.” To test common I concatenated seven of the
wp->count++; documents from volume 2 of the UNIX Programmer’s
3 Manual from the Berkeley 4.2 UNIX system to form a
596 Communications of the ACM July 1987 Volume 30 Number 7
^; ------ “,I, - I
no reviews yet
Please Login to review.