4.2 Data Archiving

What Is the Purpose of Archiving DSE Datasets?

Archiving data is essential for ensuring the long-term usability and preservation of the rich information contained in a DSE. As mentioned in the introduction to long-term preservation, it also enables various forms of reuse, such as distant reading approaches or computational linguistic methods. Additionally, archived data can be re-visualized for different purposes; for example, the emerging visualization and research tool ORD-Explore aims to support a wide range of DSEs by allowing the upload of their TEI/XML data.
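
As a hedged illustration of such reuse, the following Python sketch extracts plain text from an archived TEI/XML file, which is the typical first step of a distant reading pipeline. The file name and element paths are assumptions for illustration, not part of any specific DSE.

```python
# Minimal sketch: extract plain text from an archived TEI/XML file
# for reuse in distant reading. File name and element choice are
# illustrative assumptions.
from lxml import etree

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

tree = etree.parse("letter_001.xml")  # hypothetical archived TEI file
# Collect the text content of the edition body, ignoring the header.
body = tree.find(".//tei:text/tei:body", namespaces=TEI_NS)
plain_text = " ".join(" ".join(body.itertext()).split())
print(plain_text[:200])
```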

Furthermore, comprehensive data archiving makes it possible to republish datasets later using new tools within a redesigned or reconstructed front end. However, for such a "revitalization" of archived data to be successful, complete technical documentation is crucial. This documentation must clearly explain how the datasets interact to ensure their accurate reimplementation.

Archiving by Specialized Institutions

Institutions specializing in humanities data possess expert knowledge of standard data formats, ensuring that archived materials remain both accessible and searchable. They also typically guarantee a minimum preservation period of ten years; in practice, as long as these institutions continue to operate, it is reasonable to assume that the data will remain available considerably longer.

In Switzerland, the current standard archiving solution, supported by the Swiss National Science Foundation, is the Swiss National Data and Service Centre for the Humanities (DaSCH). Various projects are stored within the same database - the DaSCH Service Platform (DSP) - which, as data volume grows, facilitates a broad, cross-project search within DaSCH resources. In addition to DSEs, the platform also hosts other humanities datasets, including encyclopedias, photography collections, and bibliographies. To ensure long-term accessibility and academic citation, DaSCH assigns an Archival Resource Key (ARK) identifier to both entire projects and individual objects within them (e.g., an XML file). This persistent identifier guarantees that resources remain referenceable and accessible, even if datasets are modified over time. There are two ways to archive data in DaSCH:

  • Simple data model: TEI/XML files, along with basic metadata, are submitted to DaSCH. This metadata is then searchable via the DaSCH Service Platform (DSP).

  • Elaborated data model: In collaboration with the DSE project, DaSCH models part of the data as an RDF database on the DSP. This approach enables more complex searches for relationships between the data records; an example is the Bernoulli-Euler Letter Edition Online. In addition to displaying facsimiles (which can be stored on DaSCH's own IIIF server or integrated from external IIIF servers), the platform also allows for the generic publication of transcriptions and searchability by index. Although the transcriptions of the Bernoulli-Euler Letter Edition are not TEI/XML datasets, such datasets will be stored on the DSP in the future as part of other DSEs. The main limitations are that the presentation is only static and that minimal annotation, agreed upon with DaSCH from the outset, is recommended. The resulting structured front end (the DSP-APP) goes beyond simple data archiving but remains straightforward and generic compared to most DSE front ends.

Both archiving methods employed by DaSCH use the same data format, RDF triples; they differ primarily in the complexity of their structuring.
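
To illustrate what this shared format looks like, the following Python sketch (using the rdflib library) builds a few triples for a hypothetical DSE object. The ARK, namespace, and properties are illustrative assumptions, not DaSCH's actual data model.

```python
# Sketch of a DSE object expressed as RDF triples, the data format
# underlying both DaSCH archiving methods. All identifiers below are
# hypothetical.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

EX = Namespace("https://example.org/dse/")  # hypothetical project namespace
# A hypothetical ARK-style persistent identifier for one letter.
letter = URIRef("https://ark.example.org/ark:/99999/letter001")

g = Graph()
g.add((letter, RDF.type, EX.Letter))
g.add((letter, DCTERMS.title, Literal("Letter from A to B, 1750")))
g.add((letter, DCTERMS.created, Literal("1750-03-14")))

print(g.serialize(format="turtle"))
```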

In Austria, long-term archiving of data is available through the Humanities Asset Management System GAMS of the University of Graz, though this service is primarily intended for projects and collaborations at the University of Graz. Similarly, the Austrian Centre for Digital Humanities and Cultural Heritage (ACDH-CH) of the Austrian Academy of Sciences archives its DSE data in ARCHE.

In Germany, TextGrid is a major curated repository for XML data. TextGrid allows users to search for metadata and download the data but does not provide functionality for displaying it. Additionally, various universities and academies in Germany offer text repositories (not necessarily for DSEs). An overview of these repositories is provided by Text+, a consortium of the German National Research Data Infrastructure (NFDI).

Archiving as Data Backup

In addition to discipline-specific databases like DaSCH or GAMS, it is also possible to store data with minimal or no external curation in a general scientific repository. For this purpose, Zenodo, maintained by CERN and OpenAIRE, is a suitable option. Structuring or curating the data is only possible to a limited extent and must be carried out by the project itself. As a result, many projects opt to store 'database dumps' - uncurated snapshots of their own databases - on Zenodo. These dumps can be versioned and receive a DOI, providing a persistent, centrally registered access point on the Internet.
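
As a hedged sketch of how such a deposit can be automated, the following Python snippet uses Zenodo's documented REST API (https://developers.zenodo.org). The token, file name, and metadata are placeholders; check the current API documentation before relying on this.

```python
# Sketch: deposit a database dump on Zenodo via its REST API.
# Token, file name, and metadata are placeholders.
import requests

TOKEN = "..."  # personal access token, kept out of version control
API = "https://zenodo.org/api/deposit/depositions"

# 1. Create an empty deposition.
dep = requests.post(API, params={"access_token": TOKEN}, json={}).json()

# 2. Upload the dump into the deposition's file bucket.
with open("dse_dump.sql", "rb") as fp:  # hypothetical dump file
    requests.put(f"{dep['links']['bucket']}/dse_dump.sql",
                 params={"access_token": TOKEN}, data=fp)

# 3. Attach minimal metadata, then publish to mint a DOI.
metadata = {"metadata": {
    "title": "DSE database dump",
    "upload_type": "dataset",
    "description": "Uncurated snapshot of the project database.",
    "creators": [{"name": "Doe, Jane"}],
}}
requests.put(f"{API}/{dep['id']}", params={"access_token": TOKEN}, json=metadata)
requests.post(f"{API}/{dep['id']}/actions/publish",
              params={"access_token": TOKEN})
```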

One example is the database dump of the somewhat older project Cædmon's Hymn: A Multimedia Study, Edition, and Archive, which was archived on Zenodo under the DOI 10.5281/zenodo.1226549. For projects that use GitHub (which is not a FAIR repository but is suitable for various forms of backup in the workflow), there is also an integration with Zenodo for long-term data backup.

Other data repositories such as OLOS or, soon, SWISSUbase can be used in a similar way to Zenodo. One example already mentioned, in which data archiving and the static presentation of data go hand in hand, is the GitHub repository for the DSE Arthur Schnitzler's Letters, along with other projects that use the DSE-Static-Cookiecutter tool. This tool processes the source data on GitHub through another GitHub instance to create a static website (a simplified sketch of the principle follows below). As noted in connection with static presentations via GitHub, data privacy concerns may arise regarding the use of GitHub. Alternatives include GitLab instances run by public institutions, which operate on the same principles as GitHub but are hosted on institutional servers, offering full control over the data.
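
The following Python sketch illustrates the general principle of such a static publication step: archived TEI files are turned into plain HTML pages. It is a simplified stand-in, not the actual DSE-Static-Cookiecutter pipeline; all paths are hypothetical.

```python
# Simplified sketch of static publication: convert archived TEI files
# into minimal HTML pages. Real pipelines use templating and XSLT;
# this only shows the principle. Paths are hypothetical.
from html import escape
from pathlib import Path
from lxml import etree

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
PAGE = "<html><body><h1>{title}</h1><pre>{text}</pre></body></html>"

for tei_file in Path("data").glob("*.xml"):
    tree = etree.parse(str(tei_file))
    title = tree.findtext(".//tei:titleStmt/tei:title",
                          default=tei_file.stem, namespaces=TEI_NS)
    body = tree.find(".//tei:body", namespaces=TEI_NS)
    text = " ".join(" ".join(body.itertext()).split()) if body is not None else ""
    out = Path("site") / (tei_file.stem + ".html")
    out.parent.mkdir(exist_ok=True)
    out.write_text(PAGE.format(title=escape(title), text=escape(text)),
                   encoding="utf-8")
```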

The SNSF provides an overview of recommended repositories that fulfil its requirements for open research data standards.

Archiving/Sharing Transcription Data

For the specific reuse purpose of ATR model training, PAGE XML and ALTO XML data can be stored in the HTR-United repository (plain-text data is also welcome, but it does not typically appear in the workflows described here).
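
As a small, hedged example of working with such data, the following Python snippet extracts line-level transcriptions from an ALTO XML file, e.g. to inspect a dataset before contributing it. The file name is a placeholder, and the namespace assumes ALTO v4.

```python
# Sketch: pull line-level transcriptions out of an ALTO XML file.
import xml.etree.ElementTree as ET

ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v4#"}

tree = ET.parse("page_001.xml")  # hypothetical ALTO export
for line in tree.iterfind(".//alto:TextLine", ALTO_NS):
    # Each TextLine holds String elements whose CONTENT attribute
    # carries the transcribed words.
    words = [s.get("CONTENT", "") for s in line.iterfind("alto:String", ALTO_NS)]
    print(" ".join(words))
```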

Transcriptiones, which primarily enables historians to easily store and make accessible transcriptions of sources not published elsewhere, serves a broader reuse purpose. There are no restrictions on data formats, and the legal and technical barriers to publication are intentionally kept very low (generic presentation, no facsimiles). Since DSE projects typically already present and archive their data elsewhere, Transcriptiones should be seen more as an additional storage option.

Archiving/Sharing Metadata

Metadata aggregators connect metadata and link back to their respective resources. A particularly noteworthy aggregator is correspSearch, which gathers correspondence metadata (people, places, sending and receiving dates, etc.) from 490 DSE projects (as of 2024). Before being transferred to correspSearch, the metadata must be converted from the TEI/XML element correspDesc into the Correspondence Metadata Interchange Format (CMIF). It is therefore essential to mark up correspondence according to the encoding guidelines for the TEI/XML element correspDesc from the very beginning if metadata is to be shared with correspSearch.
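
The following Python sketch shows the kind of extraction such a conversion performs, assuming a typical correspDesc structure (sender, addressee, date in the TEI header). A real CMIF file is itself TEI/XML; this only gathers the values, and the file name is hypothetical.

```python
# Sketch: read sender/addressee metadata from a TEI correspDesc,
# the raw material of a correspDesc-to-CMIF conversion.
from lxml import etree

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

tree = etree.parse("letter_001.xml")  # hypothetical TEI letter
for action in tree.iterfind(".//tei:correspDesc/tei:correspAction", TEI_NS):
    kind = action.get("type")  # "sent" or "received"
    person = action.findtext("tei:persName", namespaces=TEI_NS)
    place = action.findtext("tei:placeName", namespaces=TEI_NS)
    date_el = action.find("tei:date", namespaces=TEI_NS)
    date = date_el.get("when") if date_el is not None else None
    print(kind, person, place, date)
```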

Another metadata aggregator is the Swiss platform Metagrid, which links biographical data from online humanities resources. It is especially beneficial for Swiss DSE projects, as several Swiss DSEs, databases, archives, and libraries already share their metadata here.

For place names, the geographical database GeoNames can be particularly useful. It compiles geographic data from various sources and allows users to edit and improve the entries.
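
As a hedged illustration, the following Python snippet queries GeoNames' free web service for a place name. A registered username is required; "demo" below is only a placeholder.

```python
# Sketch: look up a place name via the GeoNames search web service.
import requests

resp = requests.get("http://api.geonames.org/searchJSON",
                    params={"q": "Basel", "maxRows": 1, "username": "demo"})
for place in resp.json().get("geonames", []):
    print(place["name"], place["countryName"], place["lat"], place["lng"])
```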

Publicizing and Making Accessible

Sharing metadata not only helps to connect linked open data but also raises awareness of a project. Ultimately, project information should be disseminated as widely as possible to achieve this goal. This can be done by submitting the project, once completed, to the major overviews and collections of DSEs mentioned in this handbook. The Catalogue of Digital Editions is particularly important to us, as it connects edition data with the German Library Network (DBIS). We hope that similar solutions will be found for non-German library networks as well.

Additionally, it makes sense to report DSE projects directly to relevant library networks as online resources so that they can be easily found through library searches.