CHWP B.12 | Lancashire, "English Renaissance Knowledge Base"
3. What are the Differences between COCOA Tags and SGML Tags?
3.1. Form and Span
COCOA tags normally have three parts in their form: (1) delimiter characters (the diamond
brackets, or some other symbols not found in the text), (2) a variable or type name, and (3) a
value or token name. It is a convention that the variable or type name may be dropped if the
delimiters themselves can carry that meaning. For example,
<AUTHOR John Palsgrave>
<<John Palsgrave>>
amount to the same thing. In the first form, single diamond brackets are the delimiters that
separate the variable-type AUTHOR and the value-token John Palsgrave
from the text. The variable or type may take any form. For example, other tags could be
TITLE, DATE, PUBLISHER, etc. The value or token
following it, John Palsgrave, may change, say, into other authors such as Cotgrave,
Florio or Thomas. In the second form, the double diamond brackets are understood to stand for
<AUTHOR > rather than <TITLE > or
<DATE >. Any other tags must use different and unique delimiters.
COCOA tags of a given type, like <AUTHOR John Palsgrave>, hold until
another tag of the same type occurs. That is, every word in the text following <AUTHOR
John Palsgrave> would be tagged as being written by Palsgrave until a subsequent
<AUTHOR > tag occurred. The span of such COCOA tags, then, is indefinite.
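The indefinite span described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any existing COCOA software: it handles the single-bracket form only, the function name tag_words is hypothetical, and the sample string (including the TITLE value) is invented for the example.

```python
import re

# Matches a single-bracket COCOA tag such as <AUTHOR John Palsgrave>.
COCOA_TAG = re.compile(r"<([A-Z]+) ([^<>]*)>")

def tag_words(text):
    """Pair every word with the COCOA values in force where it occurs."""
    state = {}    # current value for each variable, e.g. {"AUTHOR": "..."}
    tagged = []
    pos = 0
    for match in COCOA_TAG.finditer(text):
        # Words before this tag still carry the values set so far.
        for word in text[pos:match.start()].split():
            tagged.append((word, dict(state)))
        # A new tag replaces only the earlier value of the same variable.
        state[match.group(1)] = match.group(2).strip()
        pos = match.end()
    # Words after the last tag keep whatever values are still in force.
    for word in text[pos:].split():
        tagged.append((word, dict(state)))
    return tagged

# Invented sample: the words and the TITLE value are placeholders.
sample = ("<AUTHOR John Palsgrave> <TITLE Sample> one two "
          "<AUTHOR Randle Cotgrave> three")
result = tag_words(sample)
for word, values in result:
    print(word, values)
```

Note that the second AUTHOR tag changes the author for the last word only, while the TITLE value, never replaced, continues to hold.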
SGML/TEI tags differ in five principal ways, for my purposes.[8]
- SGML/TEI tags do not have an abbreviated form, in which the delimiters stand for
the tag itself. The variable name must always be present.
- SGML/TEI tags may have an indefinite span, prevailing until they are replaced by
another tag of the same kind (e.g., <page.break>), but they also may
take a closing tag, normally the opening tag with a forward slash preceding the tag
variable. Thus the SGML tag <col> would be concluded by
</col>. This is so for the following reason.
- All two-part tags of this kind surround text, and this text itself, not an editorial word
or phrase inside the tag, becomes the value or token for the tag. Thus the
title page of Cotgrave's dictionary states Compiled by RANDLE
COTGRAVE. This would be tagged in SGML/TEI as Compiled by
<author> RANDLE COTGRAVE </author>.
- SGML/TEI tags may take attributes inside their delimiters. For instance, the
<col> tag could have the attributes location=,
width=, etc.
- SGML/TEI allows for special tags called "entities", which function as
string substitutions for what is in the text. The form of an entity is always
&, followed by the entity's name, and closed by a semicolon
";". For our purposes, entities serve to represent infrequent but
unusual strings within an established ISO character set. For example,
&shy; and &emacr; could stand respectively for a soft
hyphen and an e with a macron over it.
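The idea of an entity as a string substitution can be sketched as follows. This is a minimal illustration, not an SGML parser: the two-entry table is an assumption standing in for whatever entity sets a real text would declare (the names shy and emacr follow the ISO entity sets), and the function name expand_entities is hypothetical.

```python
import re

# Assumed entity table for illustration only; a real text would draw on
# the entity sets declared for it.
ENTITIES = {
    "shy": "\u00ad",    # soft hyphen
    "emacr": "\u0113",  # e with a macron over it
}

# An entity reference: &, then the entity's name, then a semicolon.
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.]*);")

def expand_entities(text, table=ENTITIES):
    """Replace each &name; with its string; leave unknown names untouched."""
    def substitute(match):
        return table.get(match.group(1), match.group(0))
    return ENTITY_REF.sub(substitute, text)

# Invented sample string; &unknown; is deliberately absent from the table.
expanded = expand_entities("m&emacr;n &unknown; x")
print(expanded)
```

Leaving unknown references untouched is a design choice for the sketch; a validating SGML parser would instead report an undeclared entity as an error.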
By permitting closing tags, SGML, unlike COCOA-style tagging, employs the text itself to tag
the text, whereas the value or the token in a COCOA tag must always be added to the text
by the editor. Another way of putting this is that SGML recognizes the difference between tags
authorized by the text, and ones created by the editor from scratch. SGML tagging is textually
'conservative'.
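The point that a paired tag takes its value from the text it surrounds can be illustrated with the Cotgrave example. The sketch below assumes simple, non-nested tags without attributes, and the function names element_values and strip_tags are hypothetical; it is not a general SGML parser.

```python
import re

# Matches <name>content</name> for one non-nested element without attributes.
ELEMENT = re.compile(r"<([a-z.]+)>(.*?)</\1>", re.DOTALL)

def element_values(tagged):
    """Return (tag name, enclosed text) pairs; the text is the tag's value."""
    return [(m.group(1), m.group(2).strip()) for m in ELEMENT.finditer(tagged)]

def strip_tags(tagged):
    """Remove the tags themselves, leaving only the words of the text."""
    return re.sub(r"</?[a-z.]+>", "", tagged)

line = "Compiled by <author> RANDLE COTGRAVE </author>."
values = element_values(line)
words = strip_tags(line).split()
print(values)
print(words)
```

Removing the tags leaves the words of the title page as they stood, which is one way of seeing why such tagging adds nothing to the text beyond the tags themselves.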
3.2. Structure
COCOA-style tags in a text have no structure as a group; they can go anywhere, in any sequence,
and are relevant only to the words that follow. Often such tags are never explained within an
electronic text. In contrast, SGML tags are normally declared in a "document-type
definition" (DTD) at the start of the text. This declaration associates those tags in
hierarchies, trees or groups of tags. A DTD normally expects that a text is structured like a
Ukrainian doll or a Chinese box, one smaller piece or tag division completely inside another.
'Higher' or 'outer' structural units never overlap with 'lower' or 'inner' units. For example, a DTD
for a novel might tell us that paragraphs fall always inside chapters, and chapters always inside
books, i.e. that a paragraph is never split between two chapters. Or a DTD for a play might tell
us that speeches always occur inside scenes, and scenes inside acts; a speech that carries on over
two scenes would break the hierarchy. Texts that have flat 'lattice' structures, or several different
structures, may also be represented within this formalism.
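The nesting requirement for the play example can be sketched with a stack. The table below is a toy stand-in for a DTD, not DTD syntax: real content models are far richer, the element names act, scene and speech are chosen only to mirror the example above, and the function name check_nesting is hypothetical.

```python
import re

# Toy stand-in for a DTD: each element names the only parent it may have.
ALLOWED_PARENT = {"act": None, "scene": "act", "speech": "scene"}

TAG = re.compile(r"<(/?)([a-z]+)>")

def check_nesting(text, allowed_parent=ALLOWED_PARENT):
    """Return a list of problems; an empty list means the tags nest cleanly."""
    problems = []
    open_elements = []  # stack of currently open element names
    for match in TAG.finditer(text):
        closing, name = match.group(1) == "/", match.group(2)
        if not closing:
            parent = open_elements[-1] if open_elements else None
            if name not in allowed_parent:
                problems.append(f"undeclared element <{name}>")
            elif allowed_parent[name] != parent:
                problems.append(
                    f"<{name}> opened inside {parent!r}, "
                    f"expected {allowed_parent[name]!r}")
            open_elements.append(name)
        elif open_elements and open_elements[-1] == name:
            open_elements.pop()
        else:
            problems.append(f"</{name}> closes out of order")
    problems.extend(f"<{name}> never closed" for name in open_elements)
    return problems

good = "<act><scene><speech>words</speech></scene></act>"
# A speech carried over two scenes: the scene closes while the speech is open.
bad = "<act><scene><speech>words</scene><scene>more</speech></scene></act>"
print(check_nesting(good))
print(check_nesting(bad))
```

The first string nests cleanly; the second is reported as faulty from the moment the scene closes around a speech that is still open.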
Notes
[8] I emphasize that SGML is a much more complex system
than I am able to describe here. I touch only on certain characteristics obvious to the scholar doing
the tagging.