1. The Data

1.1. Facilitating digital publication: ‘text markup’

The technology that underpins the project is called XML (Extensible Markup Language), a ubiquitous international standard for encoding and exchanging data. Although used nowadays in a wide range of operations involving data exchange, such as the transmission of information from one financial database to another, XML has firm roots in the humanities.

In research projects involving textual materials, XML can prove very useful for modelling humanities knowledge, for three main reasons: it is independent of any particular computer platform or software; it provides an extremely robust basis for encoding document-based materials; and it makes it possible to generate a wide variety of different representations of the encoded materials afterwards. This is no accident, since XML developed in part as a technology to facilitate digital publication.

One of the core principles of XML is that the representation of the structural and semantic content of a text should be kept separate from its presentation. The core information about the text is applied by means of a system of XML ‘tags’ that encode parts of the text, and any ‘visualisation’ of the text that is required for publishing purposes is then produced in a separate process. This is particularly useful in humanities scholarship, because it allows academics to concentrate on the structure and content of the source materials, and issues around scholarly interpretation of the text, leaving issues of presentation to the later publication processes.
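The separation of content from presentation can be illustrated with a minimal Python sketch. It is not the project's code: the element names `<line>` and `<del>`, the CSS class names and both function names are assumptions made for illustration. The same encoded fragment is rendered twice, once as a plain reading text and once as HTML that keeps the deletion visible.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical encoded fragment; the element names are illustrative only.
SOURCE = "<line>She was <del>very</del> happy</line>"


def render_plain(xml_text):
    """Reading view: drop deleted words and normalise the whitespace."""
    root = ET.fromstring(xml_text)
    parts = [root.text or ""]
    for child in root:
        if child.tag != "del":
            parts.append(child.text or "")
        parts.append(child.tail or "")
    return " ".join("".join(parts).split())


def render_html(xml_text):
    """Diplomatic view: keep deletions, marked up so that CSS can style them."""
    root = ET.fromstring(xml_text)
    parts = [escape(root.text or "")]
    for child in root:
        text = escape(child.text or "")
        if child.tag == "del":
            parts.append(f'<span class="del">{text}</span>')
        else:
            parts.append(text)
        parts.append(escape(child.tail or ""))
    return '<span class="line">' + "".join(parts) + "</span>"


print(render_plain(SOURCE))
print(render_html(SOURCE))
```

The encoded source is never changed; each view is produced by a separate process that reads it.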

1.2. Text Encoding Initiative: creating ‘added value’ for the core textual materials

In this project we elected to use a particular set of XML specifications called TEI (Text Encoding Initiative), an international and interdisciplinary standard that since 1994 has “been widely used by libraries, museums, publishers, and individual scholars to present texts for online research, teaching, and preservation.”1 TEI XML has the technical rigour which allows computers to carry out complex processing, while at the same time being flexible and relatively easy for the average scholar to use, whether or not they have experience in using computers.

TEI allows us to encode scholarly assertions about the source materials in a complex and fine-grained manner. The TEI supplies a default set of tags and a standard procedure for adapting it to meet a project's particular needs. For Jane Austen's Fiction Manuscripts: A Digital Edition we have created a customised TEI markup scheme2 that allows scholars to capture the layout of the text on a page, different types of addition, deletion, abbreviation and correction, along with various features of orthography and spelling. The project Guidelines are available for download.

1.3. Image processing

Manuscript images have been cropped, resized and optimised for the different uses required by the web publication (thumbnails, low-resolution views and zoomable images). Adobe Photoshop was used to convert the manuscript images for web use and to strip the embedded colour profile, in order to ensure correct handling of the colours by Photoshop. The images were then processed with Zoomify to make them zoomable, interactive and easy to view on the web.

1.4. Metadata

Metadata about the manuscript images is collected in METS (Metadata Encoding & Transmission Standard) documents. These documents provide a standard mechanism for preserving the technical information about the source and applied transformations of the image files. They also aid the maintenance of the site by representing the relationships among sets of images, and between images and the transcribed text.

The metadata was automatically extracted from the images using the ImageMagick and ExifTool software, and the METS files were generated from this output using XSLT, with a custom Python script acting as the glue between these steps.
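As a rough illustration of what such a METS file section might look like, the following Python sketch builds one directly from metadata that has already been extracted. It is not the project's script: it replaces the XSLT step with Python's standard library, and the function name, dictionary keys, identifiers and file paths are all hypothetical. Only the METS and XLink namespaces and the METS element and attribute names are taken from the standard.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_mets(images):
    """Build a minimal METS file section from already-extracted metadata.

    `images` is a list of dicts with the hypothetical keys
    id, page, use, mimetype, size and href.
    """
    root = ET.Element(f"{{{METS_NS}}}mets")
    file_sec = ET.SubElement(root, f"{{{METS_NS}}}fileSec")
    groups = {}  # one <fileGrp> per kind of use (thumbnail, zoomable, ...)
    for image in images:
        use = image["use"]
        if use not in groups:
            groups[use] = ET.SubElement(
                file_sec, f"{{{METS_NS}}}fileGrp", {"USE": use}
            )
        file_el = ET.SubElement(
            groups[use],
            f"{{{METS_NS}}}file",
            {
                "ID": image["id"],
                # GROUPID ties together the derivatives of one page image.
                "GROUPID": image["page"],
                "MIMETYPE": image["mimetype"],
                "SIZE": str(image["size"]),
            },
        )
        ET.SubElement(
            file_el,
            f"{{{METS_NS}}}FLocat",
            {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": image["href"]},
        )
    return root


# Invented sample records standing in for the extracted metadata.
records = [
    {"id": "img-001-thumb", "page": "page-001", "use": "thumbnail",
     "mimetype": "image/jpeg", "size": 10240, "href": "images/thumb/001.jpg"},
    {"id": "img-001-zoom", "page": "page-001", "use": "zoomable",
     "mimetype": "image/jpeg", "size": 2048000, "href": "images/zoom/001.jpg"},
]
mets = build_mets(records)
print(ET.tostring(mets, encoding="unicode"))
```

Grouping files by use and linking derivatives through a shared GROUPID is one way of representing the relationships among sets of images; the project's actual METS profile may differ.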

2. The website

2.1. The Diplomatic Display

The complex and detailed XML markup of this edition of Jane Austen's fiction manuscripts has proven particularly challenging to render in the web application. By definition, an XML document can only have a tree structure, so textual features that overlap one another (overlapping hierarchies) must be handled by specific means. The TEI approaches this problem by using empty elements (often called "milestones") and cross-referenced attributes. The TEI encoding of this edition by necessity makes extensive use of such mechanisms, which makes overlapping hierarchies difficult to process with XML-based technologies such as XSLT (Extensible Stylesheet Language Transformations). XSLT takes the base XML-encoded documents and transforms them into XHTML documents suitable for display by web browsers. This is a common problem for projects that use XML for the representation of texts, and several approaches have been discussed within the Digital Humanities community.

The project's technical team also contributed to the debate with a poster presented at Digital Humanities 2008 in Oulu, Finland (Pierazzo and Viglianti, 'XSLT (2.0) handbook for multiple hierarchies processing'). The poster discussed problematic transformations from TEI to XHTML and identified three approaches to the problem. One in particular dealt with line breaks in the text: in TEI they are marked by empty elements, but in the XHTML each line of text had to be contained by a <span> element in order to display the multiple interlinear insertions present in the manuscripts. The poster suggested using XSLT 2.0 to process the document more than once, first breaking elements around milestones and then placing the now separated elements into the necessary wrappers. Since 2008, these techniques have been greatly optimised, although the seemingly unavoidable multiple passes over the document still limit processing speed, and the problem of XML overlapping hierarchies remains an open debate in the field.
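The two-pass idea can be sketched in Python rather than XSLT 2.0. This is an analogue of the approach, not the code from the poster or the project: the sample fragment, the function names and the CSS class names are assumptions. The first pass flattens the tree into text pieces that remember their enclosing elements, which has the effect of breaking an element wherever a line-break milestone falls inside it. The second pass groups the pieces into lines and wraps each line in a `<span>`.

```python
import xml.etree.ElementTree as ET
from html import escape

LB = object()  # sentinel standing for a line-break milestone


def flatten(elem, ancestors=()):
    """Pass 1: yield (text, enclosing_tags) pieces and LB sentinels in order."""
    if elem.text:
        yield elem.text, ancestors
    for child in elem:
        if child.tag == "lb":
            yield LB
        else:
            yield from flatten(child, ancestors + (child.tag,))
        if child.tail:
            # Tail text belongs to the parent, so it keeps the parent's context.
            yield child.tail, ancestors


def to_line_spans(xml_text):
    """Pass 2: group pieces into lines and wrap each line in a <span>."""
    root = ET.fromstring(xml_text)
    lines, current = [], []
    for piece in flatten(root):
        if piece is LB:
            lines.append(current)
            current = []
        else:
            current.append(piece)
    lines.append(current)

    html_lines = []
    for line in lines:
        if not line:
            continue  # this sketch simply skips empty lines
        out = []
        for text, tags in line:
            inner = escape(text)
            # Re-open every enclosing element around this piece, innermost first.
            for tag in reversed(tags):
                inner = f'<span class="{tag}">{inner}</span>'
            out.append(inner)
        html_lines.append('<span class="line">' + "".join(out) + "</span>")
    return html_lines


# Hypothetical fragment: a deletion that runs across a line break.
sample = "<p>She was <del>very<lb/>much</del> pleased<lb/>indeed</p>"
for html_line in to_line_spans(sample):
    print(html_line)
```

In the sample, the deletion is closed at the end of the first line and re-opened at the start of the second, so each line can sit in its own wrapper without any overlap.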

The density of mark-up also brought other complications that challenged the ideal separation of content from presentation. Whenever possible, the XML contains meaningful and formalised identifiers for the encoding of presentational features. It is then the job of the XSLT processing to generate the corresponding XHTML styled by CSS and JavaScript. However, the transcription presented in the web application aims at being as close as possible to the manuscript and the amount of variation typical of manuscript documents forced us to occasionally introduce rendering information in the mark-up, such as dimensions of blank spaces and horizontal positioning of interlinear insertions. Despite these challenges, the XSLT processing remains an essential component of the diplomatic edition for the thorough rendering of all the textual and editorial features annotated in TEI by the editors.

The front-end development of the Austen website includes the mark-up of the page structure, which is achieved using xMod templates (built on XHTML). It also includes the styles that control the layout and appearance of the page, which are implemented with Cascading Style Sheets (CSS). A number of jQuery plugins have been used to enhance the visual experience of the site and to achieve some of the effects required for the project. The jQuery plugins and jQuery UI (user interface) widgets used include:

  • jquery.tipsy - provides a tooltip effect based on an anchor tag's title attribute
  • jquery.qtip - provides rounded corners and mouse over effect footnotes
  • ui.draggable - provides drag and drop features for the patches, as in The Watsons (Morgan Library & Museum), p. b[7]-7.
  • Accordion plugin - provides the expand and collapse functionality for block insertions, as in The Watsons (Morgan Library & Museum), p. b[9]-2.

The site has also been tested across several browsers (Firefox 3 and above, IE7, IE8, Chrome 4 and above, Safari 4 and above, and Opera 10.5) and operating systems (Windows, Mac and Linux) to make sure that it works and behaves in the same way across different browsers and platforms.

2.2. Facsimile View

The Facsimile view was developed using XSLT to create indices of all the manuscripts' images. Each of the images can be viewed in very high definition thanks to the integration of the Zoomify plugin with the web application.

2.3. Search

The Austen search facilities and interface were developed using Apache Solr and Ajax-Solr. Apache Solr is an open source search platform that uses the Apache Lucene search library for indexing and search. Ajax-Solr is a JavaScript framework for creating search user interfaces to Solr.

The search index is created by converting all the manuscripts into the Solr XML format using XSLT, and then adding those XML files to Solr. The search interface was then built using Ajax-Solr and Apache Cocoon.
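For orientation, the Solr XML update format wraps each document in a `<doc>` element inside `<add>`, with one `<field>` per indexed value. The following Python sketch produces that format from a list of page records. It stands in for the XSLT conversion rather than reproducing it, and the function name, field names and sample values are hypothetical; the project's actual Solr schema is not described here.

```python
import xml.etree.ElementTree as ET


def to_solr_add(pages):
    """Serialise page records as a Solr XML <add> message.

    `pages` is a list of dicts mapping field names to string values.
    """
    add = ET.Element("add")
    for page in pages:
        doc = ET.SubElement(add, "doc")
        for name, value in page.items():
            field = ET.SubElement(doc, "field", {"name": name})
            field.text = value
    return ET.tostring(add, encoding="unicode")


# Invented sample record; the field names are assumptions.
sample_pages = [
    {"id": "ms-p1", "manuscript": "The Watsons", "text": "She was happy"},
]
print(to_solr_add(sample_pages))
```

A file in this format can then be sent to Solr's update handler to be added to the index.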

3. Transcription Tool

The Transcription Tool was developed using Django, a high-level Python web development framework. On the back end, the tool uses a database as its datastore.

The tool allows teachers to upload images that students will later transcribe. After the transcription is complete, the teachers can review it and export it to HTML.