
Numbering the streaks of the tulip?

Reflections on a Challenge to the Use of Statistical Methods in Computational Stylistics[1]

J. F. Burrows

Centre for Literary and Linguistic Computing,
University of Newcastle, Australia

lcjfb@cc.newcastle.edu.au | http://www.newcastle.edu.au/department/lc/burrows.html

CHWP A.2, publ. February 1996. © Editors of CHWP 1996.



KEYWORDS
computational stylistics, statistics, statistical analysis, randomness, probability, authorship, language, word-frequency


The last twenty years have brought remarkable advances in the many forms of research that lie within the province of computing in the humanities. Computational stylistics is among the areas where those advances have been most marked: the authorship of doubtful texts, formerly the staple of such work, is now only one of many worthwhile subjects of inquiry. The more reason, then, to pause sometimes in order to ask whether we know quite what we are doing and where we are going at such a giddy pace. The more reason, also, to face up to the objections raised by sceptical colleagues.

I am not referring to expert assessments on points of technique. It is both necessary and desirable that competent statisticians like M. W. A. Smith and David I. Holmes should examine the ways in which quantitative methods are being used.[2] I am referring to more general objections than theirs, ranging from minor reservations to wholesale onslaughts. Although my assessment of our achievements and prospects is more positive than those of Thomas N. Corns and Mark Olsen, chiefly because they want so much so soon, I believe that Corns serves us well by emphasising the importance of reaching beyond a narrow empiricism and that Olsen does so by drawing attention to valuable new lines of inquiry.[3] Discomfiting as it is to take one's turn as exemplar, we are all the better for being forced to answer strenuous challenges like those of Stanley Fish and Willie van Peer.[4] Although both objections prove, on examination, to be ill-founded, they undoubtedly repay the thought they require.

In the present paper, I have been asked to address myself to another of these seemingly fundamental challenges, one which I have not seen put forward in any published form but which I frequently encounter in discussions with colleagues in different parts of the world. Their basic position seems to be that it is inappropriate to use statistical methods, which take the potential randomness of data as their starting-point, in analysing something so highly systematic as the English language. I am inclined to believe that the arguments they put express a broad sense of unease rather than an exactly framed objection and so I shall try to look at the whole matter from several different angles.

One part of their position is indisputable. As the size of our textual archives and the subtlety of our computer programs increase, it is becoming ever more clear that (both in English and in other languages) patterns of word-occurrence and of word-frequency are far more systematic than it has hitherto been possible to demonstrate. That is not to say that the grammarians of former times would necessarily be surprised by our findings but only that we can now show plainly what they could, at most, have glimpsed.

For the purposes of this discussion, a systematic effect is present whenever the occurrence of any one word-type in a given text creates a better or worse than random likelihood that some other word-type will also be used there. Such an effect is also present whenever there are concomitant variations, across a range of texts, in the frequencies of two or more word-types.

These co-occurrences and these interrelationships of frequency fall into three broad classes. The first has to do with contextual or semantic probabilities. The occurrence of the word "field" creates a better than random likelihood that "magnetic", "green", or "academic" will also occur. If "green" also occurs, the odds in favour of "tree" increase while the odds in favour of "magnetic" (but not necessarily of "academic") diminish. The co-occurrence of ten or a dozen words will so sharply constrain the identity of their probable fellows that shrewd guesses about the subject, provenance, and other features of the text will often begin to be possible. (To speak, in this way, of contextual probabilities is not to deny that a conscious attempt to beat the odds might yield astonishing results.) This line of reasoning extends well beyond the range of what we ordinarily regard as lexical words. The occurrence of "hath" creates a better than random likelihood for "hast" and "doth". So, again, does "she" for "her" and "I" for "my" and "me". The study of such contextual probabilities, though not usually of colourless specimens like those last cited, lies within the province of discourse analysis.
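By way of illustration, the sketch below (in Python) estimates a contextual probability of this kind from a miniature corpus: it compares the baseline likelihood that a word-type occurs in a text with its likelihood once some other word-type is known to occur there. The texts and word choices are invented for the purpose; nothing here reproduces an established procedure.

```python
def cooccurrence_odds(texts, w1, w2):
    """Compare baseline P(w2 in text) with P(w2 in text | w1 in text)."""
    docs = [set(t.lower().split()) for t in texts]
    with_w1 = [d for d in docs if w1 in d]
    p_w2 = sum(w2 in d for d in docs) / len(docs)
    p_w2_given_w1 = sum(w2 in d for d in with_w1) / len(with_w1) if with_w1 else 0.0
    return p_w2, p_w2_given_w1

# A toy corpus: six one-line "texts".
texts = [
    "the green field lay beyond the tree line",
    "green grass covered the field by the tree",
    "a magnetic field surrounds the coil",
    "the academic field of stylistics is young",
    "the coil hums and the wires glow",
    "stylistics is a young discipline",
]
base, cond = cooccurrence_odds(texts, "field", "green")
print(f"P('green') = {base:.2f}; P('green' | 'field') = {cond:.2f}")
```

Whenever the conditional figure exceeds the baseline, a systematic effect of this first class is present.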

The second class of co-occurrences has to do with transitional probabilities. These have mostly been studied by computational linguisticians but (frequently through the use of TACT[5]) they are yielding literary-historical and literary-critical findings like those presented in Ian Lancashire's paper on phrasal repetends.[6] They are also proposed by B. J. R. Bailey as a means of establishing authorship (Bailey 1990). The best-known example of a transitional probability in English is among the strongest: the likelihood that "of" will be followed by "the" and that "of the", in turn, will be followed by a noun. Other such links include the vestiges of the English system of inflection in forms like "I am". Others again represent collocations like "to and fro" or "of course" and stock phrases like "I do not know" and "I cannot tell". The probability that, in these and innumerable other cases, x will be followed by y varies greatly in degree: but a systematic effect is present whenever that probability reflects better than random odds. So, too, with worse than random odds: that "the" will be followed by "of"; that a conjunction will end a sentence; and that "I" will be followed by "I", "I", and "I".
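A transitional probability can be estimated directly from bigram counts. The sketch below is a minimal illustration, not any published procedure: it tabulates, for each word-type in a toy sequence, the observed probabilities of its successors; on this material, "of" is followed by "the" every time.

```python
from collections import Counter, defaultdict

def transition_probabilities(tokens):
    """Estimate P(next word | current word) from observed bigrams."""
    successor_counts = defaultdict(Counter)
    for cur, nxt in zip(tokens, tokens[1:]):
        successor_counts[cur][nxt] += 1
    return {cur: {nxt: n / sum(nexts.values()) for nxt, n in nexts.items()}
            for cur, nexts in successor_counts.items()}

tokens = "the top of the hill and the foot of the cliff".split()
probs = transition_probabilities(tokens)
print(probs["of"])   # {'the': 1.0}: "of" is followed by "the" every time here
```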

The third class has to do with recursive probabilities. This is the main province of computational stylistics, the emerging field of study where my own work lies. In a text where "I" occurs even once, there is a better than random likelihood that it will occur again; and, up to a working-limit of about one word-token in ten or twelve, every occurrence increases the likelihood that it will occur yet again. In a text where "I" has not occurred within the first few hundred or few thousand words, there is a diminishing likelihood that it will occur at all. In a text where "like" is used as a conjunction ("like I said") or as a barbarous parenthesis ("I said, like, man, you know what I mean"), there is a strong likelihood that it will be used again and again and again. These simple forms of recursive probability are enriched by concomitant effects, cases where statistically significant correlation coefficients identify word-types whose frequencies rise and fall in unison across a range of texts. Where "I" abounds, "me" is seldom far away. Wherever "like, man" gets in, "you know" and "laid back" are likely to follow. In some texts (especially those written by educated Englishmen of the eighteenth century) there are subtle but powerful concomitances of frequency between such disparate forms as "which", "of", "by", "upon", "this", "than", and "very". Most of these occur significantly less often in the writings of their less educated contemporaries and also in the writings of their successors. For those other writers, however, "it", "there", "and", "then", and the preposition "like" all tend to higher frequencies. In any of these concomitant groups, an uncommonly high (or low) frequency for one word-type creates a better than random probability that the others will occur more (or less) often than usual.
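The concomitances of frequency described here are what a correlation coefficient measures. The sketch below computes Pearson's r for the relative frequencies of "I" and "me" across a handful of texts. The figures are invented for illustration, but the pattern they exhibit -- where "I" abounds, "me" is seldom far away -- is the one at issue.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical relative frequencies (per 1,000 words) of "I" and "me"
# in six texts.
freq_I  = [52.1, 48.3, 12.7, 9.4, 30.2, 44.8]
freq_me = [11.9, 10.2, 2.1, 1.8, 6.5, 9.7]
print(f"r = {pearson(freq_I, freq_me):.2f}")
```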

Evidence of these three kinds of systematic effect deserves to be pursued much further than it has been. But a countervailing truth should never be ignored. For all its complexity, the language is still a fuzzy system in which high probabilities do not amount to certainties. Even such strong and obvious partners as "I" and "me" or "I" and "am" seldom yield correlations above about 0.80. Even in a slot where a noun seems inevitable, another adjective can always intervene. Even when the noun itself occurs, it may be a complete neologism, not drawn from the existing stock of English nouns. Where else, in a continuing process of creation, are new nouns likely to occur but in syntactic noun-slots? Even when "I--I--I" seems to exhaust the likelihood of hesitant repetition, a stammerer can extend the odds. And there is no word in the language that cannot be introduced in an unlikely position when it is cited as a specimen: "the 'of' that should appear just there is missing from the text".

To the extent that statistical analysis takes the concept of randomness -- epitomized by coloured marbles drawn blind from a bag and scattered across a carpet -- as its basis, it may seem that the challenge I have described is well founded and that statistics has little or no place in the study of phenomena as systematic as these patterns of word-frequency. The rejoinder that, until we know much more than we do about their systematic aspect, we can treat word-frequencies as operating pretty much at random is untenable. Ignorant as we are, we already know too much to allow ourselves that cheap comfort. The rejoinder that a degree of genuine fluctuation (as noted in the preceding paragraph) leaves sufficient room for chance to operate cuts only at the edges of the question.

The strongest rejoinder is that the challenge has no force. For, unless the language is systematic in some quite unusual sense, the situation is just like that which attaches to the many applications of statistical method in which the question to be tested is framed as a "null hypothesis". The possibility that a given set of data may contain evidence favourable to a particular conclusion is tested by assuming that it contains none. Whenever such an hypothesis is falsified, it is because the data reveal some meaningful relationship, some sign of a pattern amidst the empty noise. Must we give way, in all such cases, to the argument that statistical analysis is inappropriate because it emerged, on inspection, that the data were not random? It would be strange to confine the use of statistics to fields where the randomness of the phenomena precluded any positive result. Or should we suppose that the language differs somehow from all those other systems, natural and artificial, in which the use of statistics is valid? While such a case might be put, I have never encountered it and cannot see what form it might take. When Johnson's Imlac tells Rasselas that it is not the business of the poet to number the streaks of the tulip, his objection to an emphasis on random particulars is sometimes misunderstood. The general truths for which Johnson's predilection is well known include those that derive from interrelated particulars. To understand phenomena of that kind, whether in the language or elsewhere, we usually rely nowadays upon statistical analysis.
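To make the null-hypothesis procedure concrete: the chi-squared test below (using SciPy; the counts are hypothetical) frames as its null hypothesis the claim that the frequency of "upon" does not depend on which of two texts we examine. A small p-value falsifies that hypothesis: the data reveal a meaningful relationship, a sign of pattern amidst the noise.

```python
from scipy.stats import chi2_contingency

# Hypothetical counts: occurrences of "upon" versus all other tokens
# in two texts of 50,000 tokens each.
table = [[64, 49_936],   # text A: 64 occurrences of "upon"
         [18, 49_982]]   # text B: 18 occurrences of "upon"
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-squared = {chi2:.1f}, p = {p:.2e}")
```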

It may well be that those who believe statistical analysis should not be used when the data exhibit evidence of system are merely confusing specimens with variables. Among the measurable characteristics (that is, variables) which distinguish most human specimens from each other, many have an obviously systematic aspect, expressive of various biological or cultural determinants. Some of these are mutually dependent. But there is nothing intrinsically inappropriate about an analysis designed to test whether a given set of specimens falls into intelligible clusters on one or more such variables. If we confine our analysis to the members of the set (and do not, for the moment, draw inferences about those larger populations from which they derive), we may reasonably ask whether height and/or body-weight and/or voice-pitch and/or earning power and/or proportion of time taken up by paid employment and/or proportion of time devoted to child-care differentiate, to a statistically significant extent, between given sub-sets of specimens. We might seek, for example, to compare adults and children, men and women, Europeans and North Americans, Lemuel Gulliver's Big-Endians and Little-Endians, those born under different signs of the zodiac, or even those whose names began with letters from one half of the alphabet and those from the other. The presence of systematic properties in the variables would not entail meaningful results for the more arbitrary sub-sets of specimens.
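The point can be made concrete with a two-sample test on an arbitrary division of specimens. In the sketch below (hypothetical figures throughout), the variable -- height -- is as systematic as one pleases, yet an alphabetic division by surname yields no significant difference: the systematic character of the variable does not rescue an arbitrary sub-setting.

```python
from scipy.stats import ttest_ind

# Hypothetical heights (cm) for two arbitrary sub-sets of specimens:
# those whose surnames begin with A-M and those with N-Z.
heights_a_m = [172, 165, 180, 158, 177, 169, 174, 161]
heights_n_z = [170, 168, 176, 163, 171, 166, 179, 160]
t, p = ttest_ind(heights_a_m, heights_n_z)
print(f"t = {t:.2f}, p = {p:.2f}")   # a large p: no significant difference
```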

Even if we were to suppose (or were to be convinced by evidence that nobody has yet adduced) that the systematic features of language are of so special a nature as to deny us the use of analytical methods appropriate to other biological and cultural systems, the challenge we are considering would still rest upon another error. The postulate of randomness that underlies the use of null hypotheses is not essential to every form of statistical analysis. It has no necessary place in the "descriptive" or "reconstructive" statistics that meets most of the requirements of computational stylistics. Means, standard deviations, z-scores, and correlation coefficients allow direct comparisons between different sets of data. The Mann-Whitney, or rank sum, test makes no assumptions about the forms of distribution it is used to analyse. Even when the distribution of specimens on one or more variables is matched against a probabilistic base through the use of Student's t-test or the chi-squared test, the relative degrees of divergence from a hypothetical randomness can be compared directly, one with another. (These methods are abused, of course, when a series of results for more or less interdependent variables is treated as if the variables were independent and some astronomical level of odds is then offered as an assurance that the outcome is not a chance-effect. At one stage of his career, the methods of A. Q. Morton were weakened, in this way, by the interdependence of some of the variables he used (Morton 1978).)
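To illustrate the descriptive mode: the sketch below (with hypothetical frequencies) places a single observation on a common scale by its z-score within one set of texts, and then compares two sets by the Mann-Whitney rank-sum test, which assumes no particular form of distribution.

```python
from scipy.stats import mannwhitneyu

def z_score(value, sample):
    """Standard score of a value against a sample's mean and standard deviation."""
    mean = sum(sample) / len(sample)
    sd = (sum((x - mean) ** 2 for x in sample) / (len(sample) - 1)) ** 0.5
    return (value - mean) / sd

# Hypothetical frequencies of "which" per 1,000 words in two sets of texts.
set_one = [4.1, 3.8, 5.2, 4.7, 3.9]
set_two = [2.2, 1.9, 2.8, 2.5, 2.0]
print(f"z-score of 5.0 within set one: {z_score(5.0, set_one):.2f}")
u, p = mannwhitneyu(set_one, set_two)   # rank-sum test, distribution-free
print(f"U = {u}, p = {p:.4f}")
```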

If the challenge we are considering has any force, it is in the territory of "predictive" statistics, where a new specimen is tested to establish the likelihood that it belongs to a named population. Direct comparisons between sets of specimens are no longer enough when the underlying question is not whether the new specimen resembles one set rather than another but whether the sets themselves are representative samples of the larger populations from which they are drawn. Comparisons between two sets of twenty or so writers, such as I have undertaken, show statistically significant differences between writers of different periods and different nationalities and also between male and female writers in the frequencies with which many common words are used.[7]

But one must ask whether sets of twenty times twenty, or twenty times again, would admit firm claims about the stylistic propensities of British, or North American, or Australian writers. The simplest approach to the difficulty is to enlarge the sample until its representative character is effectively unquestionable. Since that approach is likely to remain impracticable, others must be entertained. The notion of a random sample is attractive, especially perhaps to those who make the sort of challenge we have been considering. But a random sample of, say, British writers would need to be extremely large before it could be taken seriously. The notion of a constructed sample (in which the main sub-sets are carefully represented) seems more practicable but it is fraught with conceptual difficulties. In the familiar model of marbles scattered across the carpet, we are usually asked to examine the distribution of colours. But let us suppose that the marbles were also distinguished by size, shape, weight, transparency, and so on. After endeavouring to control all those other variables, we might feel confident in forming some conclusion about the distribution of the colours only to discover that certain of the marbles were magnetically charged.

I would like to propose (perhaps only as an expedient until such time as our samples are more truly representative) that, in the analysis of such profoundly systematic objects as literary texts, it may be sufficient to think in terms of specimens from a repertoire and not of samples from a population. If we are trying to establish whether a hitherto unknown text is likely to be the work of Smith, the concept of a population requires that we envisage the works that Smith might have written as well as those he did write. (If we do not, we are likely to flout those statistical prescripts that allow for differences between sample and population.) The concept of a repertoire, on the other hand, would allow us to assume--until the evidence itself proved otherwise--that a new text by Smith would be unlikely to differ greatly from his known work. (The meaning, here, of "unlikely" and "greatly" would be derived from our experience of cases where the truth is known.) The concept of a repertoire also allows for the marked but not absolute differences in Smith's work when he moves from one genre to another.
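One crude way of operationalising the repertoire -- offered only as a hypothetical sketch, not as an established procedure -- is to measure how far a new text's word-frequencies stray, on average, from Smith's known works, in units of the variability those works themselves display. "Greatly" and "unlikely" would then be calibrated against cases where the truth is known. Every name and figure below is invented.

```python
def repertoire_distance(known_profiles, new_profile):
    """Mean absolute z-score of a new text against a known repertoire."""
    n_words = len(new_profile)
    total = 0.0
    for i in range(n_words):
        column = [profile[i] for profile in known_profiles]
        mean = sum(column) / len(column)
        sd = (sum((x - mean) ** 2 for x in column) / (len(column) - 1)) ** 0.5
        total += abs((new_profile[i] - mean) / sd)
    return total / n_words

# Rows: Smith's known texts; columns: frequencies (per 1,000 words)
# of a few common word-types.
smith = [[45.2, 11.3, 6.1],
         [48.9, 12.0, 5.7],
         [43.7, 10.8, 6.4]]
new_text = [46.1, 11.5, 6.0]
print(f"distance = {repertoire_distance(smith, new_text):.2f}")
```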

Membership of a statistical population is assessed by conformity to a series of potentially independent variables. Membership of a repertoire might be assessed by compliance with a set of fuzzy and possibly interdependent rules. Many of the same statistical measures are appropriate in both cases; but, in the latter case, it would be absurd to calculate an overall probability by compounding the probabilities for each variable in a manner that implied their mutual independence. The notion of interdependent rules, expressed in the language of statistical description, fits well with the systematic qualities of the language, as sketched in my opening paragraphs.
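The arithmetic behind that prohibition is simply stated. Compounded odds are valid only under the first identity below; interdependent variables demand the second, the general chain rule of probability. Since, for instance, the probability of "me" given "I" far exceeds the marginal probability of "me", multiplying the marginals grossly overstates the combined improbability.

```latex
\[
  P(x_1, \dots, x_k) \;=\; \prod_{i=1}^{k} P(x_i)
  \qquad \text{(only if the $x_i$ are mutually independent)}
\]
\[
  P(x_1, \dots, x_k) \;=\; P(x_1)\, P(x_2 \mid x_1) \cdots
  P(x_k \mid x_1, \dots, x_{k-1})
  \qquad \text{(in general)}
\]
```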

The concept of repertoires, whether personal or "communal", may even continue to stand literary and linguistic statisticians in good stead when we have made what I regard as our most necessary advance. Once we have established a "grammar of probabilities" in which major interrelationships of word-frequency are shown, we shall be able to use it (rather than an abstract postulate like randomness) as a base for assessing the ways in which and the extent to which particular texts diverge from given norms. Existing "frequency dictionaries" show the relative frequencies of specified word-types in different sorts of texts. A "grammar of probabilities" would show a range of concomitant variations of frequency between such word-types in such texts. (They might, for example, be expressed as correlation coefficients.) That would make for subtler and more accurate statistical analyses of our texts. It would allow us to enrich our computational methods. And it might finally overcome the doubts of those whose challenge I have tried to answer.
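In miniature, such a "grammar of probabilities" might take the form of a matrix of correlation coefficients among common word-types, computed across a set of texts. The sketch below uses invented figures (and requires Python 3.10 or later for statistics.correlation); it prints the pairwise coefficients for three of the word-types mentioned above.

```python
from statistics import correlation   # standard library, Python 3.10+

# Hypothetical relative frequencies (per 1,000 words) of three common
# word-types in five texts; each row is one word-type across the texts.
words = ["which", "of", "upon"]
freqs = [
    [8.1, 7.4, 3.2, 2.9, 6.8],        # "which"
    [41.0, 39.5, 28.2, 27.1, 37.9],   # "of"
    [2.4, 2.1, 0.6, 0.5, 1.9],        # "upon"
]
for i in range(len(words)):
    for j in range(i + 1, len(words)):
        r = correlation(freqs[i], freqs[j])
        print(f"r({words[i]}, {words[j]}) = {r:.2f}")
```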



Notes

[1] Editorial note. The author was specifically requested to address the topic of this essay, which as he remarks is frequently raised in discussions among literary scholars. The editors would like to encourage submissions which discuss his suggestion that we "think in terms of specimens from a repertoire and not of samples from a population".

[2] For a recent specimen of the former, see Smith 1990; for the latter, Holmes 1994.

[3] Corns 1991; Olsen 1993, in a volume of Computers and the Humanities that contains nine articles responding to this paper and the author's rejoinder to the responses. A somewhat earlier version of the article is available online at http://humanities.uchicago.edu/homes/mark/Signs.html.

[4] Fish 1980, firmly and effectively rebutted in Milic 1985; Peer 1989, to which I offer a rejoinder in Burrows 1992a.

[5] Editorial note. Cf. J. Bradley, "TACT Design", CHWP, B.1; cf. notice on TACT distribution.

[6] Lancashire 1992; a final version will be published in Hockey and Ide.

[7] Burrows 1992b, forthcoming in Hockey and Ide.