Recording Evidence

A number of features are required to correctly record source information in a transcription. This section illustrates how STEMMA deals with them.


  • Annotation. Text annotation is written information added outside the main body of a text. This may have been added by the author, or by hand after the text was printed or published. The following main types can be identified and are supported by the <Anom> semantic mark-up element:
    • Footnotes and Endnotes. Text added at the end of a page or section.
    • Marginalia. Text added in a margin.
    • Interlinear notes. Text added between lines.
    • Intralinear notes. Text inserted within a line, usually marked with a caret. This includes identification of sublinear and supralinear variants.
  • Uncertain characters. Sequences of characters may be unreadable or uncertain (i.e. there are several distinct possibilities). Recording this correctly is essential for accurate searching. See below.
  • Struck-out characters. Characters crossed out in the original. See the corresponding element in Presentational Mark-up. There is currently no support for different colours, although James Joyce apparently used different coloured markers.
  • Uncertain interpretation. Adding a suggested meaning or spelling correction to a word or phrase that is readable but not recognised. Supported via the <Alt> mark-up.
  • Original emphasis, such as bold, italic, or underline. See the respective elements in Presentational Mark-up.
  • Numbering of lines and pages. See <Line> and <Page> mark-up.



Some of these terms and concepts may be found in Editorial Methods for Journals, volume 1, and The Conventions of Textual Treatment, chapter five.


In the sections on Semantic Mark-up and Properties, the original source form of something is presented in element data rather than in attribute values so that these features may be employed syntactically. Attribute values, on the other hand, are significantly more restricted.


Traditional editorial notations for uncertain characters are not well-suited to digital text as they do not facilitate efficient and accurate searching within the limits of what is known. TEI has elements such as <choice> and <unclear>, and a comprehensive formalised notation may be found at: under Transcriptions. Although less comprehensive, perhaps the most compact is the UCF (Uncertain Character Format) devised by FreeUKGEN. This is based on the regex pattern-matching language, although it must be remembered that the notation exists within target strings rather than search strings. Regex, in turn, is an extension of traditional wildcard characters[1]. UCF is the basis of the notation used within STEMMA, and the following table is from the FreeBMD pages:



_ (Underscore)

A single uncertain character. It could be anything but is definitely one character. It can be repeated for each uncertain character.

* (Asterisk)

Several adjacent uncertain characters. A single * is used when there are 1 or more adjacent uncertain characters. It is not used immediately before or after a _ or another *. Note: If it is clear there is a space, then * * is used to represent 2 words, neither of which can be read.


[abc] (Square brackets)

A single character that could be any one of the contained characters and only those characters. There must be at least two characters between the brackets. For example, [79] would mean either a 7 or a 9, whereas [C_] would mean a C or possibly some other character.


{min,max} (Braces)

Repeat count - the preceding character occurs somewhere between min and max times. max may be omitted, meaning there is no upper limit. So _{1,} would be equivalent to *, and _{0,1} means that it is unclear if there is any character.


UCF also defines a ‘?’ character that is used to represent the situation where all of the characters have been read but you remain uncertain of the word, e.g. “RACHARD?” This is not used within STEMMA because it is ambiguous with ‘?’ representing an absent value, and the equivalent feature is supported by <Alt> mark-up.


Some examples:


[lt]      Can't tell if it's an l or a t.
___       Three unreadable characters.
[x_]      I think the character is an ‘x’.
_{2,3}    Two or three unreadable characters.
*         Unknown number of unreadable characters.
_{0,1}    Not sure if there's a letter or an ink blob.
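
Because UCF is modelled on regex, a transcribed UCF string can be converted into a pattern and tested against a candidate reading. That is the sense in which the notation lives in the target string rather than the search string. The following Python sketch is illustrative only and is not part of STEMMA or the FreeUKGEN tooling: the function names ucf_to_regex and ucf_matches are hypothetical, the input is assumed to be well-formed, a '_' inside brackets is taken to mean "any character", and matching is case-sensitive.

```python
import re


def ucf_to_regex(ucf: str) -> str:
    """Translate a UCF string into a roughly equivalent regular expression.

    Assumes well-formed UCF: every '[' and '{' has a matching close.
    """
    out = []
    i = 0
    while i < len(ucf):
        ch = ucf[i]
        if ch == "_":
            # Exactly one uncertain character.
            out.append(".")
            i += 1
        elif ch == "*":
            # One or more adjacent uncertain characters.
            out.append(".+")
            i += 1
        elif ch == "[":
            end = ucf.index("]", i)
            members = ucf[i + 1:end]
            if "_" in members:
                # e.g. [C_]: a C or possibly some other character.
                out.append(".")
            else:
                out.append("[" + re.escape(members) + "]")
            i = end + 1
        elif ch == "{":
            # Repeat count {min,max}; regex uses the same syntax.
            end = ucf.index("}", i)
            out.append(ucf[i:end + 1])
            i = end + 1
        else:
            # Any other character was read with certainty.
            out.append(re.escape(ch))
            i += 1
    return "".join(out)


def ucf_matches(ucf: str, candidate: str) -> bool:
    """Return True if the candidate reading is consistent with the UCF string."""
    return re.fullmatch(ucf_to_regex(ucf), candidate) is not None
```

For instance, ucf_matches("R_CHARD", "RICHARD") is true, whereas ucf_matches("JO*", "JO") is false because * stands for at least one character. Collapsing [C_] to "any character" loses the transcriber's preference for C; a fuller treatment might keep that preference for ranking results.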


Early STEMMA designs considered using an ANSI escape sequence to bracket a set of UCF characters. For instance, <APC>_12[68]<ST> where APC=0x9F and ST=0x9C. This was partly to avoid unconditionally reserving a whole set of characters but also to allow them in attribute values as well as element data. The current version accommodates them in a <Ucf> element:


<Ucf> ucf-sequence </Ucf>
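
As a rough illustration of how software might separate certain text from UCF sequences, the Python sketch below walks a small fragment and labels each piece. The fragment and the enclosing <Text> wrapper are made up for this example, the function name split_text_and_ucf is hypothetical, and trimming the whitespace around the UCF sequence is an assumption rather than a documented STEMMA rule.

```python
import xml.etree.ElementTree as ET

# Hypothetical transcription fragment; only the <Ucf> element comes from the text above.
FRAGMENT = "<Text>John Sm<Ucf>[iy]</Ucf>th, born 18<Ucf> _{1,2} </Ucf></Text>"


def split_text_and_ucf(xml_fragment: str):
    """Return a list of ("text", s) and ("ucf", s) pairs in document order."""
    root = ET.fromstring(xml_fragment)
    parts = []
    if root.text:
        parts.append(("text", root.text))
    for child in root:
        if child.tag == "Ucf":
            # Assumption: whitespace around the UCF sequence is not significant.
            parts.append(("ucf", (child.text or "").strip()))
        if child.tail:
            parts.append(("text", child.tail))
    return parts
```

Keeping the UCF sequences in their own element means the reserved characters (_, *, [ ], { }) retain their literal meaning everywhere else in the transcription.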

[1] Wildcard characters represent variable sequences. There are several schemes but most allocate a single character to represent 0-or-more unknown characters (e.g. ‘*’) and another to represent exactly one unknown character (e.g. ‘?’). These may be combined so that, for instance, ‘?*’ represents 1-or-more unknown characters. Note that since ‘*?’ ≡ ‘?*’ and ‘**’ ≡ ‘*’ then any contiguous sequence of ‘*’ and ‘?’ can be simplified to just [?...][*], i.e. 0-or-more ‘?’ followed by an optional ‘*’.
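
The simplification described in this note can be sketched in Python as follows. This is an illustration under the wildcard scheme described above (‘*’ for 0-or-more, ‘?’ for exactly one); the function name simplify_wildcards is hypothetical.

```python
import re


def simplify_wildcards(pattern: str) -> str:
    """Rewrite each contiguous run of '*' and '?' as its '?'s followed by at most one '*'."""

    def canonical(run: re.Match) -> str:
        text = run.group(0)
        # Each '?' demands exactly one character; any number of '*' adds "0 or more".
        return "?" * text.count("?") + ("*" if "*" in text else "")

    return re.sub(r"[*?]+", canonical, pattern)
```

For example, simplify_wildcards("a*?*b") gives "a?*b", and a run such as "?*?" becomes "??*".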