Introduction


Some of the primary goals of the STEMMA project were:


  • To define a data model and source format that represent my existing family history data (including micro-history data) accurately, without having to bend any rules.
  • To make the complete data (including history, evidence, reasoning, and transcription) searchable in a structured way, not just as plain text.
  • To store copious amounts of rich-text narrative in a structured way, including reference notes, semantic mark-up, transcription anomalies, and hyperlinks, rather than a simple Notes feature.
  • To clearly indicate the source of any data, and to separate objective information from subjective inference and conclusions.
  • To allow the data to be crafted by hand in the absence of a compliant software product.
  • To make the representation as globally applicable as possible (i.e. locale-independent and culturally neutral).
  • To allow datasets to be validated for conformity to a well-defined schema using standard tools.
  • To make the model easily extensible without requiring multiple versions of the main data schema.
  • To make extensive use of modern data standards.


Rather than getting mired in arguments over whether the data should be lineage-linked, event-linked, or evidence-linked, STEMMA strives to be able to represent all aspects of the data and leave the flavour of analysis or presentation to the software modules that manipulate it. This means focusing primarily on a complete and accurate representation of the data that was found, and providing comprehensive support for representing related inferences and conclusions. It deliberately does not mandate any specific research process, or strive for compatibility with any existing software product or data model.


A source format is a plain-text, machine-readable, definitive representation of data that can be used for multiple different purposes. The term is analogous to source code in a programming context: a definitive representation of computer instructions that can be compiled for different machines, and in any locale. Most people will think of file formats when seeing this term, although serialisation format is the more correct technical term; a serialisation format also covers other contexts where data is represented as a series of bytes, such as transmission over a communications network.
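As a minimal sketch of what serialisation means here, the following uses JSON and an invented record shape (deliberately not STEMMA syntax) to show the same definitive data being reduced to a series of bytes that could equally be written to a file or sent over a network:

```python
import json

# A hypothetical record (not actual STEMMA syntax) used purely to
# illustrate serialisation of structured data.
record = {"type": "Person", "key": "p1", "name": "Tony Proctor"}

# Serialise to a series of bytes, suitable for a file on disk or for
# transmission over a communications network.
payload = json.dumps(record, ensure_ascii=False).encode("utf-8")

# Deserialise back to an equivalent in-memory structure.
restored = json.loads(payload.decode("utf-8"))
assert restored == record
```

The file format and the network message are the same serialisation; only the destination of the bytes differs.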


A data model underpins the design of any source format. It dictates what entities need to be represented, and what relationships exist between those entities, but without specifying a particular syntax to use in a physical source format.
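To make that distinction concrete, here is a sketch of a data model as entities and relationships only, with no commitment to a physical syntax; the entity and field names are illustrative assumptions, not part of the STEMMA model:

```python
from dataclasses import dataclass, field
from typing import List

# Entities and their relationships, independent of any source-format
# syntax. The same model instance could later be serialised as XML,
# JSON, or any other physical representation.

@dataclass
class Person:
    key: str
    name: str

@dataclass
class Event:
    key: str
    date: str                              # e.g. an ISO 8601 date
    participants: List[str] = field(default_factory=list)  # Person keys

birth = Event(key="e1", date="1850-03-02", participants=["p1"])
```

The choice of angle brackets, braces, or indentation to write `birth` down is a source-format decision layered on top of this model.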


A run-time object model defines the structure of indexed data held in memory, and the software interfaces for accessing it. A standard model would allow run-time interoperability between products of different types (e.g. analysis or reporting) or from different vendors. It would also be required for connecting to Datasets published over the Internet, or in the ‘cloud’.


A data model hints at an object model but doesn’t actually define one. A run-time object model has to be optimised for data lookup and access, whereas a data model has to define a normalised copy of the data that is self-consistent and has no duplication. If the data model were ever used to form the basis of a commercial product then a subsequent project could be to define an associated run-time object model.
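The difference between the two optimisation goals can be sketched as follows, using invented record shapes (again, not STEMMA syntax): the normalised data stores each fact once, while the run-time model derives a lookup index that deliberately duplicates references for fast access:

```python
# Normalised data (data model): self-consistent, no duplication.
events = [
    {"key": "e1", "participants": ["p1", "p2"]},
    {"key": "e2", "participants": ["p1"]},
]

# Run-time object model: a derived index optimised for lookup, built
# from the normalised data. It repeats event keys per person, which is
# acceptable because it is a derivative form, not the definitive one.
events_by_person = {}
for ev in events:
    for person in ev["participants"]:
        events_by_person.setdefault(person, []).append(ev["key"])
```

Rebuilding the index from the normalised data is always possible; reconstructing clean normalised data from an arbitrary index is not, which is why the data model remains the definitive form.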


Neither a standard data model nor a standard object model would mandate a particular database type or format, including in-memory architectures. That would be a choice for commercial product designers. In other words, a database model describes a derivative data form rather than a definitive one.


In order to separate objective information from subjective inference and conclusions, the STEMMA data model has two notional sub-models: informational and conclusional (see Our Days of Future Passed — Part III).