156x Filetype PDF File size 0.84 MB Source: www.cs.upc.edu
literate Moderated by Christopher J. Van Wyk programming n May and June 1986, Programming worth an experiment to see whether interest Pearls took up literate programming, an among readers, authors, and critics would be I approach to programming espoused by sustained. With his help, we commissioned Donald Knuth. Knuth’s premise is that the Christopher Van Wyk of AT&T Bell Laborato- best programs are meant for people as well as ries to be the moderator of a new column on machines; they are meant to be read as well literate programming, inaugurated with as run. He built a system, WEB, that scans this issue. Van Wyk received his Ph.D. from specially structured technical reports to Stanford University in 1980 and has spent extract Pascal statements interwoven the intervening years at Bell Labs with the text, forming from them working on programming lan- an executable program. Pascal is guages for graphics and algo- not essential; WEB can work as rithms for computational well with any other programming language, such as Fortran. As moderator, Van Wyk A literate program contains will pose the problems to be not only the needed state- considered, solicit and ments in a programming receive literate pro- language, but also a precise grams, obtain cri- problem statement, a summary tiques, and publish of the background needed to the results within understand the solution, an evaluation of alternatives, constraints of a assessments of trade-offs magazine-all between the runnin tasks that will use time and space, or between run- 1 ning time and programming time, and sugges- tions on how to modify the program. Program code segments are inserted in the text at points logical to the intellectual development of the algorithm. renewal. This experiment can succeed only A literate program pays careful attention to with your full support. If you are interested lucidity of presentation and presents all argu- in being the author of a literate program, or ments needed to understand why the pro- acting as the literary critic, contact Van Wyk gram will actually work as intended. To at the address listed in the masthead: emphasize the literary side, the second Pearl Christopher J. Van Wyk included a critique of Knuth’s program that AT&T Bell Laboratories evaluated it on these points. Room 2C-457 The response to both Pearls was strong 600 Mountain Avenue and positive. Jon Bentley proposed that Murray Hill, NJ 07974 Communications renularlv devote snace to interesting, literate programs andtheir well- Peter J. Denning written critiques. I agreed that it would be Editor-in-Chief July 1987 Volume 30 Number 7 Communications of the ACM 593 literate Moderated by Christopher J. Van Wyk programming PRINTING COMMON WORDS Moderator’s Introduction to Column 1 literate programming that involves much less machin- Two Programming Pearls columns in 1986 presented ery than WEB. Finally, note that Hanson solved a literate programs by Donald Knuth. The May 1986 col- slightly different problem than Knuth; although that umn contained a WEB program to generate a sorted list makes little difference to our discussion of literate pro- of M distinct numbers randomly chosen from 1 through grams, it highlights the importance of careful problem N. Programming Pearls of April 1987 contained readers’ specification in the design of large systems. comments on Knuth’s literate program, as well as an unimplemented solution to the problem by David Gries. The June 1986 Programming Pearls presented Knuth’s WEB program to print the k most common words in a file, in decreasing order by frequency. That column also elicited comments from many readers. Printing Common Words Several readers wondered how well a small example can serve to illustrate the features of a system for liter- 1. Introduction. In describing Don Knuth’s WEB sys- ate programming. The issue arose in two very different tem in one of his “Programming Pearls” [Communi- ways: One reader who thought WEB worked acceptably cations of the ACM 29, 5 (May 1986), 364-3691, Jon in the small doubted that it would serve a large project Bentley “aSsigned” the following programming problem: well; another suggested that the WEB solution’s appar- “Given a text file and an integer k, you are to print the k ent clumsiness in the small could be attributed to its most common words in the file (and the number of their being designed for larger, more complex programs. occurrences) in decreasing frequency.” A couple of readers very much appreciated Doug It is unclear from this problem statement what to do McIlroy’s addition of a figure to the documentation. with ‘%ies,” that is, does k refer to words or word fre- They also thought that trie-sorting in Sections 37-39 quencies? For example, in the problem statement, “the” needed more comments than “The restructuring opera- occurs three times “k ” “in ” “and ” and “file” each oc- tions are slightly subtle here.” curs twice, and td rek of t6e wor& each occurs once. If One sharp-eyed reader noticed that, although the program is invoked with the statement as input and Section 38 uses “the fact that count [ 01 = 0,” k = 2, which word should be output as the second most count [ 01 is never initialized. Although some Pascal common word? A rephrasing of the problem removes the compilers detect references to uninitialized variables, ambiguity: “Given a text file and an integer k, you are apparently the compiler that Knuth used at SAIL does to print the words (and their frequencies of occurrence) not. whose frequencies of occurrence are among the k largest In this inaugural column, we present another solu- in order of decreasing frequency.” tion to the problem of printing the k most common Using this problem statement, the output of ,the pro- words in a file. The author of this solution is David gram with the original problem statement as input and Hanson, a professor of computer science at Princeton with k = 2 is University and an editor of the journal Software-Prac- tice 6 Experience. Hanson’s program illustrates nicely 3 the the use of abstract data types on the way to the solution 2 file of a problem. It discusses the role of profiles of execu- 2and tion in the design of good programs, a topic that is also 2 in discussed in this month’s Programming Pearls. Han- 2k son’s solution shows how one can design a system for Tkii work WBB supported in part by the National Science Foundatian under 0 1987 ACM OOOl-0782/137/0700-0594 $1.50 Grant MCM2398. 594 Communications of the ACM luly 1987 Volume 30 Number 7 Literate Programming Bentley posed this problem to present a “real” exam- /* initialize k */ ple of WEB usage. For more information about WEB, see /* initialize word table */ Knuth’s “Literate Programming,” The Computer Journal while (getwordcbuf , MAXWORD) != EOF) 67, 2 (May 1984), 97-111. Knuth’s solution appears in addword (buf > ; Communications of the ACM 29, 6 (June 1986), 471-483, printwords (k) ; along with a review by Doug McIlroy. where buf is a character array of MAXWORB characters, The solution given here is writ,ten in the C program- and getword places the next word in the input in buf ming language and presented using the loom system to and returns its length, or EOF at the end’of file. MAXWORD generate the printed program and its explanation. loom is defined to be 101 to allow room for a terminating null is a preprocessor whose input is a text file with embed- character. addword adds the word in buf to the table of ded references to fragments of the program. loom re- words, and printwords prints the words with the k trieves these fragments, optionally pushes them through largest frequencies. arbitrary filters, and integrates the result into the out- Getting program arguments and environment vari- put. ables, such as k and PAGESIZE, is a common feature of loom’s output is usually input to a document format- many UNIX0 programs and the code is idiomatic. Exam- ter, such as troff or ‘I&X. loom was originally written by ples can be found in The UNIX Programming Environ- Janet Incerpi and Robert Sedgewick and used in prepa- ment by B. W. Kernighsn and R. Pike (Prentice-Hall, ration of Sedgewick’s book Algorithms (Addison-Wesley, Englewood Cliffs, N.J., 1984). Reading, Mass., 1983). Starting from their program, I rewrote loom for use in writing a book and several pa- 4. Reading words. getword reads the next word from pers. the input. This is accomplished by discarding characters loom is not as ambitious or as comprehensive as WEB. up to the next occurrence of a letter, then gathering up It does, however, have the virtue of independence from the letters into the argument buffer: both formatting and programming languages. It does not, for example, provide the comprehensive indexing, int getwordcbuf, size) cross-referencing, or pretty printing facilities of WEB. With char *buf; help from its associated filters, loom does provide index- int size; ing of the identifiers used in the program fragments, al- c though the index is omitted here for brevity. And since char *p; it is not necessary to present the whole program, irrel- int c; evant details can be omitted permitting the documen- p = buf; tation to concentrate on the important aspects of the while ((c = getchar != EOF) programs. I have formatted this program description in if (isletter( < a style similar to WEB for comparison purposes, but the do C formatting of loom’s output is not constrained to any one if (size > 1) C style. Using loom also has an effect similar to WEB: De- *p++ = c; veloping and writing about programs concurrently affect size--; both activities dramatically. 1 c = getcharo; 2. Definitions. The problem statement does not give ) while (isletter(c.1) ; a precise definition of a “Word” or of the details of pro- *p = ‘\O’; gram invocation. Words are given by the set {w / w = return p - buf; oa* and [WI 5 lOO} where a E {a...z,A...Z}; that return 1 EOF; is, a word is a sequence of one or more upper- or lower- > csse letters, up to a maximum of 100 letters. Only the first 100 characters are considered for words longer than size is compared with 1 to ensure that there is room for 100 characters. the terminating null character. isletter is a macro that The program, called common, is invoked with a single tests for upper- or lowercase letters: optional argument that gives the value of k and reads #define isletter (c >= ‘a’ && c <= ‘z’ II \ its input from the standard input file. If the argument is c >= ‘A’ && c <= ‘Z’) omitted, the value of the environment variable PAGESIZE is used; the default is 22. 5. Storing the words. The words must be stored in 3. The main program. As suggested in Software a table along with the number of times they occur in the Tools by Kernighan and Plauger (Addison-Wesley, Read- input. This table must handle two kinds of access: While ing, Mass., 1976), the structure of the program can often the input is being read, the table is “indexed” with a be derived from the structure of the input data. The in- word in order to increment its frequency count. After the put to common is a sequence of zero or more words, which input has been read, the entries with the k largest fre- suggests the following structure for the main program: UNIX is a registered trademark of AT&T Bell Laboratories. ]uly 1987 Volume 30 Number 7 Communications of the ACM 595 Literate Programming quency counts must be located and printed in decreasing addword also increments a global integer, total., which order of those counts. counts the number of distinct words in the table. This These two kinds of access are disjoint; that is, initially, number is required in the second phase of the program. all accesses to the table are of the first kind, followed by strcmp is a C library function that returns 0 if its two only accesses of the second kind. Consequently, the table arguments point to identical strings, and strcpy is a C representation can be designed to facilitate the first kind library function that copies the characters in its second of access, and then changed to facilitate the second. argument into its first. A hash table is appropriate for indexing the table with allot (n, size) allocates space for n contiguous ob- words. Since the size of the input is unknown, a hash ta- jects of size bytes each by calling calloc, a C library ble in which collisions are resolved by chaining is used. function that does the actual allocation and clears the Space for both the word and the table entry can be allo- allocated space. allot’s primary purpose is to catch allo- cated dynamically. The hash table itself, hashtable, is cation failures. Many C programmers erroneously assume an array of pointers to word structures: that calloc cannot fail. On machines like the VAX, al- #define HASHSIZE 07777 /* hash table size */ location rarely fails, but on smaller machines, failure is struct word ( common. char *word; /* the word */ 7. Printing the words. As suggested in the outline int count; /* frequency count */ for main, given above, printwords prints th.e desired struct w,ord *next; /* link to next entry */ output. To print the k most common words as specified, ) *haehtable [HASHSIZE+l] ; printwords must sort the contents of the table in de- The bounds of hashtable are 0 to 2” - 1, where n is 12 creasing order of the count values, and print the first k here. Using a power of two facilitates rapid computation entries. Since the frequencies range between 1 and N, of the index into hashtable given a hash number: If h is where N is the number of words, sorting them can be a hash number, the index is h&HASHSIZE. hashtable is accomplished in time proportional to N (assuming every- initialized in main to NULL pointers. thing fits into memory) by allocating an array of pointers to words that is indexed by the frequency of occurrence. 6. addword (buf ) adds the null-terminated string in buf Each element in the array points to the list of words with to hashtable, if necessary, and increments its count the same count values; that is, list [il points to the list field. To compute the index into hashtable, the contents of words with count fields equal to i. of buf must be “hashed” to yield a hash number h, from printwords (k) which the index is computed as described above. A sim- int k; ple yet effective hash function is to sum the codes of the { characters in buf. This function also yields the length of int i, mw: the word, which is needed to add new words to the table. struct word *wp, **list, *q; Putting this all together produces addword: list = (struct word **> alloc(tota1, sizeof wp) addword(buf) max = 0; char *buf; for (i = 0; i <= HASHSIZE; i++) c for (wp = hashtable[il ; wp; wp = q> { unsigned int h; q = wp->next; int len; wp->next = list [wp-Xountl ; char *s. *alloc() ; list [wp->count] = wp; struct word *wp; if (wp->count > max) max = wp->count ; h = 0; /* compute hash number of buf [I. .I */ 3 = max; i >= 0 && k > 0; i--> s = buf; for (i for (len = 0; *a; len++) if ((wp = list[i]) && k-- > 0) h += *s++; for ( ; wp; wp = wp->next) for (wp = hashtable[h&HASHSIZE] ; wp; printf (“%d %s\n** , wp->count , wp->word) ; wP = wp->next) 3 if (strcmp(wp->word, buf > == 0) break ; max keeps track of the largest frequency count, which is if (wp == NULL) { /* a new word */ usually much less than N, and provides a starting point wp = (struct word *) alloc(1, sizeof *wp) ; for the reverse scan of list. wp->word = alloc(len + 1, sizeof (char)) ; strcpy(wp->word, buf); 8. Performance. Bentley did not give specific perfor- wp-Xount = 0; mance criteria for common, but he did say that “a user wp->next = hashtable [h&J+ASHSIZE] ; should be able to find the QIKYI most frequent words in a hashtable [h&HASHSIZE] = wp; twenty-page technical papei without undue emotional total++ ; 3 trauma.” To test common I concatenated seven of the wp->count++; documents from volume 2 of the UNIX Programmer’s 3 Manual from the Berkeley 4.2 UNIX system to form a 596 Communications of the ACM July 1987 Volume 30 Number 7 ^; ------ “,I, - I
no reviews yet
Please Login to review.