Digital preservation is a crucial undertaking for institutions managing digital collections, ensuring that those collections remain accessible and intact for future use. The growth of digital data publishing adds a new dimension to this work: institutions must account for the larger file sizes of research datasets, new platforms for managing the records, and thus new workflows to integrate with digital preservation systems. Manual preservation efforts are increasingly difficult to scale and maintain, which calls for robust, adaptable solutions that make these workflows as efficient as possible.
This case study summarizes how Virginia Tech University Libraries has addressed these challenges by automating data preservation actions for its Virginia Tech Data Repository. It is based on a presentation delivered by Brandie Pullen, a Resident Librarian specializing in Data Curation at Virginia Tech University Libraries, at the Figshare and Symplectic North American Virtual User Conference 2025.
Virginia Tech Libraries uses Figshare for the Virginia Tech Data Repository, which accepts data and code from Virginia Tech-affiliated researchers. The Figshare platform enables Virginia Tech to upload large files more easily, make records highly discoverable, and track the reuse of datasets.
Digital preservation requires a suite of tools combined with people-managed workflows. Rather than prescribing how an institution should conduct its digital preservation, the Figshare platform is designed to integrate with an institution's existing preservation solution. The primary mechanism for this is the Figshare API.
To address the need for robust data preservation, Virginia Tech Libraries developed a set of Python scripts that pull information using the Figshare API. These scripts are designed to create preservation copies of ingested and published datasets and transfer them to their preservation system, APTrust. The initial development of these scripts was based on those created at the University of Arizona Libraries.
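To illustrate the kind of API call these scripts rely on, here is a minimal sketch of retrieving a published item's record from the public Figshare API v2. The item ID and token handling are illustrative, not Virginia Tech's actual code.

```python
# Minimal sketch: fetch a published item's metadata record from the
# Figshare API v2. The item ID is hypothetical; a personal token is
# only needed for non-public items.
import requests

BASE_URL = "https://api.figshare.com/v2"

def get_item(item_id, token=None):
    headers = {"Authorization": f"token {token}"} if token else {}
    resp = requests.get(f"{BASE_URL}/articles/{item_id}", headers=headers, timeout=30)
    resp.raise_for_status()
    return resp.json()

item = get_item(1234567)  # hypothetical item ID
for f in item.get("files", []):
    print(f["name"], f["download_url"])
```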
Automated Published Dataset Process: The presentation detailed a six-step automated process for published datasets:
- Metadata Entry: Once a dataset is published, its metadata is manually entered into an internal spreadsheet.
- Configuration File Generation: The generate_config.py script is run. It pulls information from the spreadsheet to create a configurations.ini file, which is used in subsequent steps (a sketch of this step appears after the list).
- Publication Bag Download: The PubFolder_Download.py script is executed. It uses the newly created configurations.ini file and the Figshare API to download the publication bag, retrieving all data associated with the specified Figshare item (see the download sketch after the list).
- Curator Review and Addition: A curator then manually adds a provenance log and relevant email correspondence to the downloaded publication bag. The log documents changes made from ingest to publication.
- Transfer to APTrust: The PubBagDART_TransferBagAPTrust.py script is run. It sends the completed publication bag to APTrust via DART, adhering to the BagIt specification (see the packaging sketch after the list).
- Process Completion: Once the bag is successfully transferred, the preservation process for that dataset is complete.
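The configuration-generation step might look like the sketch below, using Python's standard csv and configparser modules. The spreadsheet columns and INI fields are assumptions for illustration; the actual contents of configurations.ini are not shown in the presentation.

```python
# Hedged sketch of the configuration step: read one dataset's row from
# a CSV export of the internal spreadsheet and write a configurations.ini
# file for the downstream scripts. Column names are hypothetical.
import configparser
import csv

def generate_config(spreadsheet, row_index, out_path="configurations.ini"):
    with open(spreadsheet, newline="") as fh:
        row = list(csv.DictReader(fh))[row_index]
    config = configparser.ConfigParser()
    config["item"] = {"figshare_id": row["figshare_id"], "title": row["title"]}
    with open(out_path, "w") as out:
        config.write(out)

generate_config("datasets.csv", 0)  # hypothetical spreadsheet export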
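The download step can be sketched as fetching the item record and streaming each listed file into a local staging folder. The staging path and item ID are assumptions; the real PubFolder_Download.py reads its parameters from configurations.ini.

```python
# Sketch of the publication bag download: stream every file attached to
# a Figshare item into a staging folder. Paths and the ID are illustrative.
import pathlib
import requests

API = "https://api.figshare.com/v2"

def download_item_files(item_id, dest):
    record = requests.get(f"{API}/articles/{item_id}", timeout=30).json()
    folder = pathlib.Path(dest)
    folder.mkdir(parents=True, exist_ok=True)
    for f in record.get("files", []):
        with requests.get(f["download_url"], stream=True, timeout=60) as resp:
            resp.raise_for_status()
            with open(folder / f["name"], "wb") as out:
                for chunk in resp.iter_content(chunk_size=1 << 20):
                    out.write(chunk)

download_item_files(1234567, "staging/item_1234567")  # hypothetical values
```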
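In the actual workflow, DART handles the packaging and transfer. To show what BagIt compliance means here, this sketch packages the staged folder with the Library of Congress bagit library (pip install bagit); the path and bag-info fields are assumptions.

```python
# Sketch of BagIt packaging: restructure the staged folder into a bag
# (payload under data/, checksum manifests alongside) and validate it.
import bagit

bag = bagit.make_bag(
    "staging/item_1234567",  # hypothetical staging path from the download step
    {"Source-Organization": "Virginia Tech University Libraries"},
)
bag.validate()  # recompute payload checksums against the manifests
```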
Data Backup and System Updates: Beyond preservation in APTrust, Virginia Tech also maintains a copy of each preservation bag on an SSD. This follows the LOCKSS (Lots Of Copies Keep Stuff Safe) principle, ensuring multiple backups in case APTrust encounters issues.
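A local backup step like this can be sketched as copying the finished bag to the SSD and re-validating its fixity. The mount point and the use of the bagit library for verification are assumptions, not details from the talk.

```python
# Sketch of the local backup: copy a finished bag to the SSD and confirm
# the copy's checksums still match its manifests. Paths are hypothetical.
import pathlib
import shutil
import bagit

SSD_MOUNT = "/mnt/preservation-ssd"

def copy_bag_to_ssd(bag_path):
    dest = shutil.copytree(bag_path, f"{SSD_MOUNT}/{pathlib.Path(bag_path).name}")
    bagit.Bag(dest).validate()

copy_bag_to_ssd("staging/item_1234567")
```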
The automated system has been in place since 2022 and is continuously updated to maintain stability and adapt to changes in Figshare or institutional needs. In the summer of 2024, significant updates included the development of batch processing capabilities for larger data collections, allowing for the simultaneous processing of multiple items. The project’s code is available on GitHub.
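The presentation does not show the batch implementation; as one illustration, multiple items could be processed concurrently with Python's concurrent.futures, where process_item stands in for the per-item pipeline sketched above and is entirely hypothetical.

```python
# Illustrative batch-processing sketch, not the repository's actual code:
# run a hypothetical per-item pipeline across many item IDs in parallel.
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_item(item_id):
    # A real pipeline would download, bag, and transfer the item here.
    return item_id

item_ids = [1234567, 1234568]  # hypothetical IDs from the internal spreadsheet

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(process_item, i) for i in item_ids]
    for fut in as_completed(futures):
        print(f"finished item {fut.result()}")
```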
Watch the recording: https://doi.org/10.6084/m9.figshare.29446829