CHWP A.3 | Winder, "Reading the text's mind" |
Those working in quantitative linguistics and lexicology associate lemmatisation with the methodology of frequency lists and concordances; the term "lemmatisation" is otherwise not widely known, and might be considered a technical term reserved to these fields.
But whether working in quantitative linguistics or not, all literate speakers have in fact a very practical experience with lemmatisation through the use of dictionaries. When we as readers turn to the dictionary for help with a word encountered while reading, we perform a lemmatisation. We associate a textual occurrence of a word with an idealised form of the word found in the dictionary. The dictionary headword is "idealised" because it stands for a range of possible occurrences. Thus, if we encounter the word "go" 10 times in a passage of Shakespeare, we will associate each occurrence with the same dictionary headword. On another level the headword is idealised because it represents obvious variants of a word: to know more about the words "went" and "gone", for example, we will look in the dictionary under "go". In English, we generally look for conjugated verb forms under infinitives, plural nouns under their singular form, abbreviations under their full form, lexicalised compounds under key words, etc.
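The dictionary look-up described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table `HEADWORDS` and the function `lemmatise` are hypothetical names, and the small hand-made table stands in for a real dictionary.

```python
# Hypothetical toy lexicon: each inflected or variant form points to its headword.
HEADWORDS = {
    "go": "go",
    "goes": "go",
    "went": "go",
    "gone": "go",
    "rock": "rock",
    "rocks": "rock",
}

def lemmatise(occurrence):
    """Return the headword for a textual occurrence, or None if it is not listed."""
    return HEADWORDS.get(occurrence.lower())

# Invented sample passage: three different forms fall under the single headword "go".
passage = "He went where others had gone and where none now go".split()
lemmas = [lemmatise(word) for word in passage]
```

Words absent from the table simply return `None`, which corresponds to a failed consultation: the reader finds no headword under which to file the occurrence.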
Dictionary headwords belong to a different world than the words we find in other texts. That fundamental difference is labelled differently in different philosophical traditions. For structuralists, dictionary headwords belong to "langue", the systematic side of language; text occurrences belong to "parole", the unsystematic side. For generative linguists, headwords are units of linguistic competence; occurrences, units of performance. For logicians, headwords represent logical classes, to which text occurrences belong as members; headwords are mentioned, occurrences used. For lexicologists, headwords are types and text occurrences are tokens.
The type/token distinction was originally given systematic study by Peirce. However, his distinction was between types, tokens, and tones.[3] The tone category of signs did not have the same success as its siblings, since it is a more delicate matter to distinguish tones, and they are not subject to the same quantitative treatment as tokens.
The three appear as manifestations of three modes of reality: existential reality (token), the reality of law (type), and the reality of qualities (tone). We will consider each in turn.[4]
A token belongs to the existential world. It is a sign that represents by way of its particular place in time and space. By definition a token is unique and different from anything else. Thus, if I point to the token "swounds" in my copy of Shakespeare's works and say that it is misspelled, I am pointedly not saying that "swounds" is by nature a misspelled word (though that may coincidentally be in some sense true), nor even that in all Shakespeare's plays it is misspelled (which may be coincidentally true also). By calling it a token, I only wish to say that this particular example in my book is misspelled, without any further, perhaps unwarranted, generalisation. To say that a sign is a token is simply to point to what is absolutely unique in the occurrence, i.e. its position in time and space.
Peirce's definition of a token is narrower than the modern definition. Peirce's tokens expressly do not belong to a given type. They are simply text [5] positions, before any lemmatisation has occurred. Peirce reserves the term "replica" for lemmatised tokens, and I will follow his terminology here.
A type belongs to the world of law. It is a sign that represents the law-like generality of a class. Again when I point to "swounds" in my copy of Shakespeare and say that "I know this word", though I may never have read that particular text in my life (and therefore have never had any existential contact with this particular instance), what I mean is that the occurrence can be subsumed under a general model. I recognise it as an expression of a law or convention, in the same way that I recognise the falling of a rock as an expression of gravity. Unlike tokens, types cannot be pointed to any more than can the law of gravity; types are real but do not belong to the existential world where pointing is possible. If a rock fell, it would be incongruous to say "There is gravity." But in the case of a textual replica, I can translate the pointing to another world, and indicate the headword of a dictionary. To say "I know this word" is tantamount to saying that if a replica were found as a headword in a dictionary, I could paraphrase the dictionary text that follows it. The shift in text worlds is an essential property of the type-replica relationship, as is the predictability of the defining text. But fundamentally, a type is by nature a semiotic item that is beyond considerations of time and space.
Tones belong to the world of qualities. Qualities are the fundamental perceptual units that cannot or will not be analysed in a given investigation. For example, when reading the barely legible handwriting of one of Shakespeare's manuscripts we might point to a word and say that it looks like "swounds", or perhaps "swound" or even "smounds". We may not recognise individual letters, but still have the impression that we recognise the general form of the scrawl. Whether we recognise the form or not, we do indeed "cognise" some quality, i.e. something that has a value as a first impression, whatever interpretation we may finally bring to it; something that is at the same time both this and that, something distinct from time and space, and distinct from a law. That "something" is a tone, precisely because it does not require that we know what it is, only that we be aware of it as something that is not constrained by time and space. In other words, the same tone is free to appear simultaneously in many places.
This gift of ubiquity that a tone shares with a type often leads us to confuse them. Even a knowledgeable reader of Peirce such as Savan can have misgivings:
The difficulty ... is that there are no criteria of identity for qualities. ... So a quality is identical with or similar to those qualities of which it is judged to be a sign. If Locke's blind man judges the blare of a trumpet to be like red, so be it. The sound is a qualisign of the colour. Such arguments led Peirce to adopt the hypothesis of synaesthesia, that all the sensory modalities form one continuum of qualities. It ought to have led him to ask whether the notion of a qualisign [tone] was in any significant way different from that of a legisign [type]. (Savan 1988: 24, emphasis added)
A tone concerns what is possible, independent of the laws that relate the instances of those possibles; a type concerns necessary relations between possibles. In some contexts, the two are interdefinable, and therefore may appear to be equivalent, as Savan suggests. Thus, in modal logic a necessary proposition is defined as one whose meaning is not possibly not true (i.e. not possibly false); in formula:
$$\Box p \equiv \neg \Diamond \neg p$$
However, that interdefinability depends on negation (a distinctive aspect of the token dimension) and the subtle principle of hierarchy of operators (a type distinction). Possibility is indeed involved in necessity, but nonetheless distinct from it: in Peircean terms, necessity (a third) is what determines individuals (seconds) to possess certain possibilities (firsts).
In practical terms, we recognise this distinction when we recognise the difference between a wildcard pattern and the class of target strings it delimits. The pattern has its own distinctive internal properties; the class, on the other hand, is an interpretation of the pattern in a given context. It depends on the expression of a rule. Thus, depending on the search context, two different patterns may delimit the same class of target strings. Programmers distinguish between these two dimensions when they recognise the difference between metacharacters (types) and escaped characters (tones).
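The difference between a pattern and the class it delimits can be made concrete with a short sketch. It uses Python regular expressions in place of a generic wildcard syntax, which is a substitution on my part; the list `context` and the function `delimited_class` are hypothetical names introduced for illustration.

```python
import re

# Hypothetical search context: a small list of candidate strings.
context = ["swounds", "swound", "smounds", "s.ounds", "sounds"]

def delimited_class(pattern, strings):
    """Return the set of strings that the pattern matches in full."""
    return {s for s in strings if re.fullmatch(pattern, s)}

# Two patterns with different internal properties...
class_a = delimited_class(r"s[wm]ounds?", context)
class_b = delimited_class(r"s[a-z]ounds?", context)
# ...delimit the same class in this context, but not in every context.
differs_elsewhere = (
    delimited_class(r"s[wm]ounds?", ["sbounds"])
    != delimited_class(r"s[a-z]ounds?", ["sbounds"])
)

# The same character read as a metacharacter and as an escaped literal.
as_meta = delimited_class(r"s.ounds", context)      # "." stands for any one character
as_literal = delimited_class(r"s\.ounds", context)  # "\." stands only for itself
```

Here `class_a` and `class_b` coincide only because of what the context happens to contain; add "sbounds" to the list and the two patterns part ways. The unescaped dot behaves as a rule over characters, while the escaped dot is just the character itself.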
What are the tones of a text? When we read, we may certainly be aware of much more than we actually use in our reading. A letter may be smeared, the spacing of the text may vary, a higher proportion of e's may be found in one passage, a sequence of words may have a particular rhythm, etc. All these qualities we may perceive, but may choose to ignore, or not, when we read. Carmina figurata --visual patterns made with text-- are good examples of how unique qualities can be found in any passage. Even in the more abstract medium of the electronic text, where the basic tone units are the set of ASCII characters, the possible number of tones is infinite, since even the shortest passage can be taken as the source of a combinatorial explosion which places it at the centre of an infinitely expanding cloud of associated patterns. However, the tones of a text are defined as those qualities that we do indeed wish to consider as fundamental and unanalysable in a given analysis. For instance, in most texts we will consider alphanumeric characters as unanalysable; we will not analyse them into the bars and curves, or distinguish them according to their pitch in proportional spacing.
Thus, following this triadic division, dictionary headwords are types; source text occurrences are tokens; and character combinations are tones.
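For an electronic text, this triadic division can be sketched as follows. The passage and the lexicon are invented for the purpose, and every name (`text`, `HEADWORDS`, `tokens`, `replicas`, `types`, `tones`) is a hypothetical choice; the sketch follows Peirce's narrower usage, in which a token is a bare text position and a replica is a lemmatised token.

```python
import re

# Invented sample passage and toy lexicon (both hypothetical).
text = "swounds he went and swounds they go"
HEADWORDS = {"swounds": "SWOUNDS", "went": "GO", "go": "GO"}

# Tokens: bare positions in the text, prior to any lemmatisation.
tokens = [(m.start(), m.end()) for m in re.finditer(r"[a-z]+", text)]

# Replicas: tokens that have been associated with a type (headword).
replicas = [
    (start, HEADWORDS[text[start:end]])
    for start, end in tokens
    if text[start:end] in HEADWORDS
]

# Types: the headwords under which the replicas fall.
types = {headword for _, headword in replicas}

# Tones: the unanalysed characters out of which the occurrences are built.
tones = set(text) - {" "}
```

The two occurrences of "swounds" are distinct tokens, since they occupy different positions, yet both are replicas of the single type "SWOUNDS"; the characters in `tones` are shared freely among all of them.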
These three sign classes are defined in relation to one another, so that, like dimensions of fractal geometry, their proportion is maintained on any structural level: macro, meso or micro. At the same time, the type/token/tone scale is hierarchical: a headword may subsume a set of occurrences; an occurrence may subsume a set of characters; and characters are unanalysed members of a tonal alphabet. This "sliding hierarchy" is represented in Peirce's phenomenology by the numbers 3, 2 and 1: a type is a third; a token is a second; a tone is a first.
In spite of this absolute hierarchical ordering of signs, a particular textual sign may belong to all three worlds at the same time. As the examples I have given show, an occurrence of "swounds" can be called a type, a token, or a tone, depending on what we wish to say about it. It is only in our discourse that we distinguish between them and highlight one aspect. In other words, the justification of that distinction cannot be found so much out there, in the text, as in our own discourse and in the critical practices we wish to describe.
In fact, we typically have difficulty maintaining a clear distinction between these three and we often use the term "word" to ambiguously refer to all three. There are at least two very good reasons for the confusion.
First, in a very real sense, type, token, and tone dimensions are indissociable; they are like the legs on which the word stands. When we consult a dictionary, we use the tonal qualities of the text to associate a textual token with a dictionary type. If ever there is a misprision at any of these levels, the lemmatisation fails as does the consultation. To return to the example of a misspelled "swounds": though we presented it as a case of tokenness, it is clear that all three worlds are involved. Misspelling concerns types as well as tones, since we must first recognise the occurrence as a word, and indeed see some resemblance between the misspelling and the correct spelling, and finally associate it with an occurrence of the type "swounds". So it is for all the examples we gave: there is no pure case where only one world appears alone; our critical discourse is needed to select a particular aspect for study.
Secondly, the highlighting we choose may be intentionally complex. We may wish to study signs that are intrinsically hybrid, i.e. that are more or less a tone, more or less a token, more or less a type. For instance, when we speak of the spelling of "swounds" in the first edition of Shakespeare's plays, we restrict the range of the type to a particular set of texts. In other words, the "swounds" type may be drawn from the special, restricted lexicon of that edition, and not all Shakespeare, nor all texts. At the same time, as laws, types can be more or less general, the more specific being instances of other, more general types; the occurrence "swounds" is a replica of the type "SWOUNDS", which in turn is a replica of the type "INTERJECTION", which is a replica of the type "WORD", etc.
Nor do tonal qualities have distinct borders: just as the colour red blends continuously into scarlet and purple, so do character combinations, which tend to resemble more or less other character combinations; wildcard, anagram, and fuzzy match functions are designed precisely to deal with this variability.
Finally, tokens, though particular, are not defined according to size: a word may be a token, but a fixed expression may be one as well, and so too a paragraph, or even a text.
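The wildcard, anagram, and fuzzy match functions mentioned above can each be approximated with Python's standard library. This is a minimal sketch, not a description of any particular text-analysis package: the candidate list, the helper `is_anagram`, and the similarity cutoff of 0.8 are all assumptions chosen for illustration.

```python
import difflib
import fnmatch

# Hypothetical candidate strings that resemble one another to varying degrees.
candidates = ["swounds", "swound", "smounds", "sounds", "wounds"]

# Wildcard match: "?" stands for exactly one character, "*" for any run of characters.
wildcard_hits = [c for c in candidates if fnmatch.fnmatchcase(c, "s?ound*")]

# Anagram match: the same characters in any order.
def is_anagram(a, b):
    return sorted(a) == sorted(b)

anagram_hits = [c for c in candidates if is_anagram(c, "swound")]

# Fuzzy match: keep up to three candidates whose similarity ratio to the
# target is at least the (arbitrary) cutoff, best match first.
fuzzy_hits = difflib.get_close_matches("swounds", candidates, n=3, cutoff=0.8)
```

Each function draws the border of "resemblance" in a different place, which is the point of the comparison with colours: none of the three borders is given by the characters themselves.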
While it is generally not a simple matter to preserve these distinctions, it is worth noting that lexicographic practice does distinguish very carefully between the three. I have drawn on that tradition by saying that the dictionary represents a separate world, the world of types. It is only a metaphor because, after all, a dictionary is a text. Yet, at the same time the tonal space of the dictionary is unique among texts. The lexicographer uses typography, format, and alphabetical order to set headwords apart, to detextualise them. Alphabetical order and headword format are lexicographic conventions that are used respectively to represent the tone and type spaces of texts.
[3] Peirce's terminology varied. He often used the designations legisign/sinsign/qualisign instead of type/token/tone. We will use the type/token/tone designations here, since they are better known and lexically less baroque. Peirce used them as late as his last letters to Lady Welby in 1908.
[4] Much of what follows is inspired by David Savan's interpretation of Peirce's writings (Savan 1988).
[5] The type/token/tone distinction is not limited to language and textual signs; it applies to any semiotic system. However, throughout this article we are describing this distinction only in the restricted context of textual signs, and particularly with respect to electronic texts.