TACT Design[1]

John Bradley

University of Toronto Computing Services

john.bradley@utoronto.ca

CHWP B.1, publ. May 1996. © Editors of CHWP 1996. [First published in CCH Working Papers, vol. 1 (1991).]


KEYWORDS
TACT, text retrieval, European languages, markup, display of results, reuse of results


I am the principal designer of TACT, and, with Lidio Presutti, I have been working on its design over several years. Of course, the design process cannot proceed without recognizing many realities: it is a tradeoff between what I would like to do and what we have had time to do. For this reason, TACT development is still an ongoing process.

TACT is a text retrieval program.[2] Like others of its kind, it focuses on the vocabulary of a work. Almost any question you can ask TACT begins: "Show me the words in my text where ...". Although people think they know what text retrieval software is, I often find in describing TACT that it is actually confused with two similar types of software. One related but different genre is represented by programs like PC-Browse, Gofer or GREP. These programs are useful when you have a large collection of relatively small documents -- say, for example, a group of letters that you have typed in using WordPerfect. If you ask these programs to find all places where the word love is found, they will simply scurry around, opening each document in turn, and searching the file for the characters in the word. Usually the procedure used to look for the characters has been optimized for speed, and is therefore relatively quick. However, if the document collection is large, the process will, of necessity, take some time. Furthermore, by nature of the procedure used, these programs do not know what words occur in the documents until they go out and look. They therefore cannot show you a vocabulary list, or assist you in other, similar ways.

TACT is also sometimes wrongly equated with concordance programs such as the Oxford Concordance Program (OCP). It is true that OCP is, in some ways, a closer fit than PC-Browse. Both OCP and TACT are designed to work with a single large work, or a large unchanging collection. OCP, however, is an example of a batch processor. Normally you ask it to process your text once, and it produces a large printed concordance (or its electronic equivalent) that you can consult exactly like a conventional printed concordance. If the text is large, you can easily wait hours for OCP to collect all the words in the text, sort them in alphabetical order, and print out the concordance.

TACT is not a concordance program, nor a file searcher (such as PC-Browse). Instead, it is closer in spirit to WordCruncher. TACT is interactive. It specializes in quickly answering questions related to a work's vocabulary. TACT achieves this relatively quick response time by working with a textual database, which contains not only the text, but a complete index of all the word forms in the text, with pointers to their position in the text. Since TACT needs this index, we have provided a program called MakBas [now called MakeBase] which takes your text and creates this index of all the word forms it finds (plus other types of information). TACT can then use this index to show you quickly all occurrences of any word form you select. The resulting textual database, which MakBas produces and which TACT uses, is significantly larger than your original text, usually three times as large in fact. In exchange, however, TACT can offer you relatively quick responses to your queries.
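The idea of an index of word forms with pointers into the text can be illustrated with a minimal Python sketch. It is not MakeBase's actual database format, which is not described here; the function names `build_index` and `lookup`, the tokenizing pattern, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def build_index(text):
    """Return (words, index): the word sequence of the text and a mapping
    from each word form to the list of positions where it occurs."""
    # Assumption: a "word" is a run of letters or apostrophes, lowercased.
    words = re.findall(r"[a-z']+", text.lower())
    index = defaultdict(list)
    for position, word in enumerate(words):
        index[word].append(position)
    return words, dict(index)


def lookup(index, word_form):
    """Answer "where does this word form occur?" without rescanning the text."""
    return index.get(word_form.lower(), [])


sample = "My bounty is as boundless as the sea, my love as deep"
words, index = build_index(sample)

# A vocabulary list with frequencies falls out of the index directly.
for word_form in sorted(index):
    print(word_form, len(index[word_form]))

# Every occurrence of one word form, found by a single dictionary lookup.
print(lookup(index, "as"))
```

The index is built once, and each later query is a lookup rather than a scan of the whole text; the cost is the extra storage for the position lists, which is the same tradeoff described above for the textual database.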

What types of texts can TACT work with? TACT works correctly with the Roman alphabet and with various diacritics, supporting a sufficient number of diacritics to handle most European languages that use the Roman alphabet. In addition, support is provided for Classical Greek. TACT does not limit the diacritics to certain diacritic/letter combinations. Thus, a database can be created containing words with, say, an acute a letter combination, even though the standard IBM PC cannot display this combination without special handling. Since TACT 'knows' that it is displaying an acute a, it is able both to display it if the computer is capable of it, and to handle the problem differently if it is not. Full support for such nonstandard accent/letter combinations is available on EGA and VGA screens.

Currently TACT does not work with Cyrillic or with Hebrew, although Cyrillic support could be added if we had a little detailed help from a Cyrillic language specialist. Hebrew would require more work because of the problems of displaying right-to-left text. However, it is still relatively easy to add a diacritic or two to the Roman and Greek alphabets to meet the needs of more languages.

TACT was designed to support texts with a rich structural markup. Within TACT you can code such things as page numbers, speakers in a play, or other types of structural divisions. Indeed, it is not uncommon to have 20 different structural entities tagged and available for use within TACT. Furthermore, the different tags do not need to fit into a single hierarchical structure: multiple hierarchical structures can be represented in parallel.[3] The structural information can be used in TACT in four ways: (a) as part of a citation: "this occurrence of the word earth is found on page 15", (b) to control the selection range: "find all places where earth and stars occur in the same sentence", (c) as a basis for distribution: "show me how the occurrences of 'earth' words are distributed among the chapters", and, finally, (d) as a basis for selection: "show me only the uses of 'earth' words as spoken by Juliet".

Clearly, if structure is to be used by TACT, it must first be present, coded appropriately within the text in ways the program can identify. We attempted to support the major markup schemes -- the so-called BYU and COCOA markup schemes -- that were current when TACT was being developed. Thus many TACT users can take advantage of the markup in texts from text archives with little or no modification required to the markup. Of course, although TACT is designed to do useful things with structural markup, you can also use it with a text containing no additional markup at all.
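To make the use of structural markup concrete, here is a minimal Python sketch of how tagged structure could supply citations, a distribution, and a selection. It assumes a simplified COCOA-like convention in which a tag such as `<S Juliet>` sets the value of entity S until the next `<S ...>` tag; the real markup rules accepted by MakeBase are richer and are not reproduced here. The function names, the entity letters P and S, and the page numbers in the sample are invented for illustration.

```python
import re
from collections import Counter

# Assumption: tags look like <ENTITY value>; anything else in angle
# brackets is ignored.
TAG = re.compile(r"<(\w+)\s+([^>]*)>")
TOKEN = re.compile(r"<[^>]*>|[A-Za-z']+")


def tag_positions(text):
    """Return one (word, references) pair per word position, where
    references records the entity values in force at that position."""
    state = {}
    tagged = []
    for token in TOKEN.findall(text):
        match = TAG.fullmatch(token)
        if match:
            state[match.group(1)] = match.group(2).strip()
        elif not token.startswith("<"):
            tagged.append((token.lower(), dict(state)))
    return tagged


def citations(tagged, word):
    """(a) Citation: each occurrence of word with its structural references."""
    return [(pos, refs) for pos, (w, refs) in enumerate(tagged) if w == word]


def distribution(tagged, word, entity):
    """(c) Distribution: count occurrences of word per value of an entity."""
    return Counter(refs.get(entity) for _, refs in citations(tagged, word))


def select(tagged, word, entity, value):
    """(d) Selection: positions of word where the entity has a given value."""
    return [pos for pos, refs in citations(tagged, word)
            if refs.get(entity) == value]


sample = ("<P 15> <S Romeo> Love is a smoke "
          "<S Juliet> My only love sprung from my only hate "
          "<P 16> <S Juliet> My love as deep")
tagged = tag_positions(sample)

print(citations(tagged, "love"))
print(distribution(tagged, "love", "S"))
print(select(tagged, "love", "S", "Juliet"))
```

Because each entity keeps its own current value, pages and speakers overlap freely in this sketch; neither has to nest inside the other, which is the sense in which several structures can run in parallel.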

Most people begin using TACT by examining the vocabulary of their text. An example of how TACT shows this is given in figure 1. This is the beginning of the vocabulary list for the Second Quarto edition of Romeo and Juliet (old spellings).[4] The number beside each word is the number of times it occurred; thus, the word able occurs twice in the text.

Central to the TACT program is the Selected List. The user, by selecting words from the vocabulary (or in other ways we shall touch on shortly), can put words or word positions into this list. The list can be changed at any time during a TACT session. Although the list often contains word forms, each word form implicitly carries with it a collection of word positions: the places in the text where that word occurs. Thus, associated with any selected list is a list of word positions in the text. In most of the examples below we have selected the words beginning with lou- as our list.

TACT can show you information about the word positions in the selected list in five displays. The Text display (fig. 2) shows the words in context; indeed, when you are using the text display it is as if you were viewing the text with a word processor, except that the word positions in the selected list are highlighted. Other displays are the KWIC [now called Variable Context], one-line Index [now called KWIC], and Distribution displays (fig. 3, fig. 4 and fig. 5). Notice that the distribution display also can show the distribution across some type of structural markup; in figure 6 we show the distribution by speaking character. The final display -- the Collocate display (fig. 7) -- shows all words that occurred near to the selected positions. These are sorted by Z-score, a statistical measure of the likelihood that their occurrence near the selected position is of significance.[5]
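The exact formula behind the Collocate display is not given here, so the following Python sketch uses one common formulation as an assumption: each word near a selected position is treated as a binomial trial, the expected count of a collocate is its overall relative frequency times the number of context words examined, and the Z-score is the difference between observed and expected counts divided by the binomial standard deviation. The function name `collocate_z_scores`, the `span` parameter, and the sample words are hypothetical.

```python
import math
from collections import Counter


def collocate_z_scores(words, node, span=2):
    """Rank the words found within `span` positions of each occurrence
    of `node` by Z-score, highest first."""
    total = len(words)
    overall = Counter(words)

    observed = Counter()
    context_size = 0  # number of context words examined in all windows
    for i, word in enumerate(words):
        if word != node:
            continue
        for j in range(max(0, i - span), min(total, i + span + 1)):
            if j != i:
                observed[words[j]] += 1
                context_size += 1

    scores = {}
    for collocate, count in observed.items():
        p = overall[collocate] / total          # chance of the word at any position
        expected = p * context_size             # count expected by chance alone
        deviation = math.sqrt(context_size * p * (1 - p))
        if deviation > 0:
            scores[collocate] = (count - expected) / deviation
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


words = ["a", "love", "death", "b", "c", "d", "e", "love", "death", "f"]
for collocate, z in collocate_z_scores(words, "love", span=1):
    print(collocate, round(z, 2))
```

A high score means the word turns up near the selected positions more often than its overall frequency would predict. In this sketch overlapping context windows are counted twice, a simplification that a production implementation would probably avoid.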

The displays are all scrollable: you can move through them in either direction. Furthermore, they are somewhat adjustable: you can change the amount of context in the KWIC display, for example. They can be printed out or stored in a DOS text file for later use.

As you can see, the five displays show information about the selected positions in five very different ways. It would be easy to become disoriented when trying to compare one display to another. For this reason, TACT provides you with the ability to choose one particular selected position, and to centre all the displays around it. In this way, for example, you could choose the last occurrence in the 'Juliet' line of the Distribution display. When you switch to the Text display, you will see this same selected position centred in the display. Similarly, if you are looking at the Collocate display, and you want to see the position in the text where the 'love' word is close to marke, you can move the selection pointer to point to marke, and when you switch to the KWIC display, it will be pointing at the KWIC entry for that particular item. This coordination helps prevent the TACT user from becoming lost: the focus remains on the words of the text itself, since the context can easily be seen whenever it is wanted.

As well as allowing you to choose words from the vocabulary as a way of creating your selected list, TACT lets you specify rules for the selection of words or textual positions. There are several different types of selections TACT can do for you:

  1. you can give it a pattern to use to select word forms;
  2. you can let it find words that are similar to a given word;
  3. you can ask it to select positions in the text based on co-occurrence of two events: the last example (fig. 8) shows a fragment of the result TACT produced when asked to find all places in the text where a 'love' word was close to a 'death' word;
  4. you can refine your word selection by structural information: "find all 'love' words spoken by Juliet";
  5. and by frequency: "find all words that occur more than 100 times".
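Three of the selection types listed above (by pattern, by co-occurrence, and by frequency) can be sketched in a few lines of Python working against a word-position index. This does not reproduce TACT's rule syntax; the function names, the shell-style wildcard pattern, the `span` parameter, and the old-spelling sample sentence are all invented for illustration.

```python
import fnmatch
import re
from collections import defaultdict


def build_index(text):
    """Map each word form to the list of positions where it occurs."""
    index = defaultdict(list)
    for position, word in enumerate(re.findall(r"[a-z']+", text.lower())):
        index[word].append(position)
    return dict(index)


def select_by_pattern(index, pattern):
    """Word forms matching a shell-style pattern such as 'lou*'."""
    return sorted(w for w in index if fnmatch.fnmatchcase(w, pattern))


def select_by_frequency(index, minimum):
    """Word forms that occur more than `minimum` times."""
    return sorted(w for w, positions in index.items() if len(positions) > minimum)


def select_cooccurrence(index, first_words, second_words, span):
    """Positions of any word in first_words that lie within `span` words
    of some occurrence of a word in second_words."""
    first = [p for w in first_words for p in index.get(w, [])]
    second = [p for w in second_words for p in index.get(w, [])]
    return sorted(p for p in first if any(abs(p - q) <= span for q in second))


sample = "loue and death are neere but louers sleepe and loue wakes"
index = build_index(sample)

love_words = select_by_pattern(index, "lou*")
print(love_words)
print(select_by_frequency(index, 1))
print(select_cooccurrence(index, love_words, ["death"], span=3))
```

Note that the pattern and frequency selections return word forms, while the co-occurrence selection returns positions in the text, matching the distinction between words and word positions in the selected list.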

Useful rules can be collected together, saved in files called Rule [now Query] libraries, and used again with another text, or with the same text at a later time. Rule libraries can either be typed in advance (with, say, a word processor), or be created by exporting useful rules from TACT itself. One use of a library is to build a thesaurus: you could look up a word in a thesaurus rule library and use the rule associated with that word to find its synonyms in your text.

Similarly, selected lists (or the results of rule searches) that are of likely future interest can be saved in entities called Categories [now Groups], and reused later.

I have already mentioned that TACT development is an ongoing process. We have many plans for new features and capabilities. From the beginning I have tried to accommodate the needs of the TACT user community, and, indeed, most of the features we are now working on arose out of requests from them. The next version of TACT will introduce a few changes in direction, including, for the first time, a collection of small independent programs to perform specific functions, such as producing a vocabulary list sorted by frequency of occurrence, or by reverse alphabetical order.

I am always interested in hearing suggestions from TACT users. Please tell me what you think, and how you think the program could be improved. We shall continue to rely on your comments.


Notes

[1] Editorial note. This article is as applicable to the functionality of the current version of TACT as it was when written in 1991. The names of a few of its elements have changed; these changes are noted at the relevant places in the text and figures. Cf. notice on availability of TACT.

[2] Editorial note. The term TACT is now used for the overall system of text retrieval programs, the two main components of which are MakeBase, which converts a text file into a textual database, and UseBase (formerly called TACT), which allows one to search the database.

[3] Cf. K.B. Steele, "'The Whole Wealth of thy Wit in an Instant': TACT and the Explicit Structures of Shakespeare's Plays", CCH Working Papers, vol. 1 (1991) and W. McCarty, "Finding Implicit Patterns in Ovid's Metamorphoses with TACT", ibid. [Editorial note. See the CHWP versions of Steele and McCarty.]

[4] All the examples use a text edition prepared by Ken Steele, a graduate student in the Department of English, University of Toronto.

[5] This unusual and useful display was originally the idea of Professor Ian Lancashire of the Department of English, University of Toronto.