For GFBio Wiki:
===== Publication of Type 1 Data via BioCASe Data Pipelines at ZFMK Data Center =====
The ZFMK Data Center is one of the seven GFBio Collection Data Centers that are part of, and form the backbone of, the GFBio Submission, Repository and Archiving Infrastructure. Data archiving and publication at ZFMK rely on the Diversity Workbench management systems, the online platform Morph·D·Base, and the digital asset management system easydb. The management tools and archiving processes used at the GFBio data center ZFMK are described under Technical Documentations. This includes services for documentation, processing and archiving of the provided original data and metadata sets (source data; SIP). Data producers are welcome to use the XLS templates provided under Templates for data submission.
The workflow for submission, archiving and publication of data at the ZFMK Data Center follows the standard for an Open Archival Information System (OAIS, https://www.iso.org/standard/57284.html and https://public.ccsds.org/pubs/650x0m2.pdf). This ISO standard distinguishes between different information packages for submission (SIP), archiving (AIP), and dissemination (DIP). For an overview of ISO standards for digital archives see: https://gfbio.biowikifarm.net/wiki/ISO_Standards_for_Digital_Archives.
The different modules from Diversity Workbench for specimen occurrence data, literature, taxonomies, and others are used at ZFMK for data and metadata import, metadata enrichment and data quality control (see https://www.gfbio.org/data/tools).
The workflow with these central components is illustrated in figure 1 and described in the text below.
Figure 1: The ZFMK Workflow, BioCASe data pipelines for GFBio Type 1 Data.
ABCD - Access to Biological Collections Data schema
SIP - Submission Information Package
AIP - Archival Information Package
DIP - Dissemination Information Package
VAT - Visualizing and Analysing Tool
Data providers submit their original research data and corresponding metadata via the GFBio Submission System to the ZFMK Data Center. The completeness of the data and metadata is checked, and missing data are requested from the data provider. A Submission Information Package (SIP according to OAIS) is built in several steps, including corrections, follow-up queries to the data provider, cleansing, and refinement of the original data. Changes to the data are tracked in a GitLab revision control system at the ZFMK Data Center, following a standard procedure documented in Data flow for Original Data in the internal Wiki of the ZFMK Data Center. Correspondence with data providers is stored and documented in a ticketing system. All relevant information is stored and archived on tape.
For multimedia data, Morph·D·Base is used: data providers receive a user account and can transfer their data directly. All available metadata are stored for each record.
Each SIP is imported into the management systems and prepared for dissemination by transforming the original research data and corresponding metadata to meet domain-specific requirements as well as requirements for data exchange, such as the ABCD standard.
Different types of data require different types of management systems for curation. At ZFMK, specialized software suites are used for the curation of the following data types:
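The completeness check mentioned above could be sketched roughly as follows. This is only an illustration: the field names in `REQUIRED_METADATA_FIELDS` and the function `missing_metadata_fields` are hypothetical and do not reflect the actual GFBio submission schema or the tooling used at ZFMK.

```python
# Hypothetical list of required fields; the real submission schema may differ.
REQUIRED_METADATA_FIELDS = ("title", "authors", "license", "description")


def is_empty(value):
    """Treat None, blank strings, and empty containers as missing."""
    if value is None:
        return True
    if isinstance(value, str):
        return not value.strip()
    if isinstance(value, (list, tuple, dict, set)):
        return len(value) == 0
    return False


def missing_metadata_fields(metadata, required=REQUIRED_METADATA_FIELDS):
    """Return the required fields that are absent or empty, in schema order."""
    return [field for field in required if is_empty(metadata.get(field))]


submission = {"title": "Fish collection", "authors": [], "license": "CC BY-SA 4.0"}
print(missing_metadata_fields(submission))
```

A non-empty result would be the list of items to request from the data provider before the SIP is built.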
Sensitive data: Each of the specialized systems listed above allows data to be withheld or blurred for publication. This can apply to a complete entry or to part of an entry, e.g. information about the exact sampling location of a specimen. All sensitive data are handled according to our Data Policy: Data provision for upload. For personal data, the GDPR applies as described in the ZFMK Privacy Policy (see https://datacenter.zfmk.de/wiki/internal/doku.php/gfbio:privacy_policy).
The data and metadata submitted to ZFMK can be enriched and annotated within the specialized management systems listed above. This is done manually by the ZFMK data curator in close cooperation with the data provider, or by domain experts with access to the management systems.
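To illustrate the idea of withholding and blurring, the following minimal sketch drops selected fields and reduces coordinate precision by rounding. It does not show how the management systems implement this. The record layout, the field names, and the function `prepare_for_publication` are assumptions made for this example, and rounding is only one possible way to blur a location.

```python
def prepare_for_publication(record, withheld_fields=(), blur_decimals=None):
    """Return a public copy of `record`; the original record is not modified.

    withheld_fields: names of fields to remove completely.
    blur_decimals:   if set, round latitude/longitude to this many decimals.
    """
    public = {k: v for k, v in record.items() if k not in withheld_fields}
    if blur_decimals is not None:
        for key in ("latitude", "longitude"):  # hypothetical field names
            if public.get(key) is not None:
                public[key] = round(public[key], blur_decimals)
    return public


specimen = {
    "catalog_number": "ZFMK-EXAMPLE-0001",  # made-up value for illustration
    "latitude": 50.72186,
    "longitude": 7.11353,
    "collector_notes": "exact nest site described here",
}
public = prepare_for_publication(
    specimen, withheld_fields={"collector_notes"}, blur_decimals=1
)
print(public)
```

Working on a copy keeps the full-precision record available for archiving while only the reduced version is disseminated.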
Insofar as they are part of the GFBio consensus documents, they will be published.
Identifiers: Identifiers provide unambiguous identification of information (e.g. unique identifiers for persons, such as ORCID) and interlink pieces of information with one another. Identifiers can be added to the (meta-)data by using controlled classifications (i.e. whether the identifier refers to sequence information, a person, a crossref for literature, etc.) and URLs.
Licenses: Different licenses can be applied to the data submitted to ZFMK. They are part of the metadata at unit or dataset level. All metadata stored and published by ZFMK receive the Creative Commons CC0 waiver (https://creativecommons.org/publicdomain/zero/1.0/deed.en). Creative Commons licenses are recommended by GFBio. The most frequently used license at ZFMK for specimen-related data and multimedia is CC BY-SA 4.0. An overview of all available CC licenses is available here.
All data uploaded, curated, and archived in the management systems of the ZFMK Data Center can be published. The publication of datasets is negotiated with the data provider. Aspects to consider are sensitive data to be withheld (see above) and publishing restrictions imposed by third parties.
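A typed identifier, i.e. a classification plus a URL, could be modelled as in the sketch below. The vocabulary in `IDENTIFIER_TYPES` and the class `Identifier` are hypothetical and are not the controlled classifications used at ZFMK. The ORCID check is included as an example of validating a person identifier; it uses the ISO 7064 MOD 11-2 check digit that ORCID iDs carry.

```python
from dataclasses import dataclass

# Hypothetical controlled vocabulary, for illustration only.
IDENTIFIER_TYPES = {"person", "sequence", "literature"}


@dataclass(frozen=True)
class Identifier:
    kind: str  # one of IDENTIFIER_TYPES
    url: str   # resolvable URL of the identifier

    def __post_init__(self):
        if self.kind not in IDENTIFIER_TYPES:
            raise ValueError(f"unknown identifier type: {self.kind!r}")
        if not self.url.startswith(("http://", "https://")):
            raise ValueError(f"identifier must be a URL: {self.url!r}")


def orcid_is_valid(orcid):
    """Check the ISO 7064 MOD 11-2 check digit of an ORCID iD."""
    chars = orcid.replace("-", "")
    if len(chars) != 16 or not (chars[:15].isascii() and chars[:15].isdigit()):
        return False
    total = 0
    for ch in chars[:15]:
        total = (total + int(ch)) * 2
    check = (12 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return chars[15].upper() == expected


orcid = "0000-0002-1825-0097"  # ORCID's documented example iD
if orcid_is_valid(orcid):
    person = Identifier(kind="person", url=f"https://orcid.org/{orcid}")
    print(person)
```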
Datasets containing occurrence data are published by creating a snapshot of the data and metadata of one dataset in Diversity Workbench. This is done with an external helper tool, available from ZFMK GitLab: VCAT-Transfer. The tool transfers the data and metadata to a MySQL database, where all data are mapped to the ABCD 2.1 standard using the BioCASe Provider Software. A Dissemination Information Package (DIP according to OAIS) is created and stored as a ZIP archive in the digital asset management system easydb at ZFMK. Each DIP is versioned; a version is identified by a date suffix and a version number consisting of a major and a minor version (e.g. 2.1). Major changes, such as the addition of further data, increment the major version. Minor changes, e.g. the correction of typing errors or changes in the metadata, increment the minor version.
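The major/minor versioning rule can be sketched as follows. The class `DipVersion` is hypothetical, and resetting the minor version to 0 on a major change is an assumption (it is consistent with the "Version: 2.0" citation example, but not stated explicitly). The date suffix is left out because its format is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DipVersion:
    major: int
    minor: int

    def bump(self, major_change):
        """Return the next version.

        major_change=True  -> e.g. further data were added.
        major_change=False -> e.g. typing errors or metadata were corrected.
        """
        if major_change:
            # Assumption: the minor version restarts at 0 after a major change.
            return DipVersion(self.major + 1, 0)
        return DipVersion(self.major, self.minor + 1)

    def __str__(self):
        return f"{self.major}.{self.minor}"


current = DipVersion(2, 1)
print(current.bump(major_change=False))  # metadata fix
print(current.bump(major_change=True))   # new data added
```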
Datasets stored and curated in Morph·D·Base or easydb are published from within the respective software.
A DOI is assigned to each published major version of an occurrence dataset. Datasets in Morph·D·Base or easydb receive a DOI on demand.
The ZFMK is registered at ZB MED and can therefore create DOIs via DataCite DOI Fabrica. The DOI is added to the corresponding version of the information package and is also part of the citation of the dataset (see below).
Published datasets are citable using direct URLs to the DIP or via their DOIs. Based on the data provider's input (submission metadata), the ZFMK data curator prepares the citation of the dataset and adjusts it to conform to the GFBio citation pattern. The citation is finalized in close collaboration with the data provider. For details see General part: GFBio publication of type 1 data via BioCASe data pipelines.
Example: ZFMK Ichthyology Working Group (2018). The Ichthyology collection at the Zoological Research Museum Alexander Koenig. [Dataset]. Version: 2.0. Data Publisher: Zoological Research Museum Koenig - Leibniz Institute for Animal Biodiversity. https://doi.org/10.20363/ZFMK-Coll.Ichthyology-2018-03.
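The structure of the example citation above can be reproduced with a small formatting helper. The pattern used here is inferred from that single example only and should not be read as the authoritative GFBio citation pattern; the function name `format_citation` and its parameters are hypothetical.

```python
def format_citation(creators, year, title, version, publisher, doi):
    """Assemble a dataset citation following the layout of the example above."""
    return (
        f"{creators} ({year}). {title}. [Dataset]. "
        f"Version: {version}. Data Publisher: {publisher}. "
        f"https://doi.org/{doi}."
    )


citation = format_citation(
    creators="ZFMK Ichthyology Working Group",
    year=2018,
    title="The Ichthyology collection at the Zoological Research Museum Alexander Koenig",
    version="2.0",
    publisher="Zoological Research Museum Koenig - Leibniz Institute for Animal Biodiversity",
    doi="10.20363/ZFMK-Coll.Ichthyology-2018-03",
)
print(citation)
```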
Archival Information Packages (AIPs according to OAIS) are created from all data and metadata submitted and curated within the ZFMK in-house management systems.
For detailed information about backups and recovery, see the ZFMK Preservation Plan (https://datacenter.zfmk.de/wiki/internal/doku.php/gfbio:digital_preservation_plan).
Indexed and faceted data are available in public portals such as GBIF, Europeana and GFBio, which are operated by national or international consortia. The ZFMK Data Center also develops and provides specialized web portals for access to the data. These include the online collection catalogue, the portal of the German Barcode of Life project (GBOL), and interfaces to the data that provide APIs for machine-readable formats and access to the data using CETAF stable identifiers.
The published data are provided with a recommended citation, license and DOI (see above).
We provide landing pages and direct download links to the datasets from within search results of the GFBio web portal, our GitLab installation at gitlab.zfmk.de (login required), the digital asset management system easydb (see above), and the BioCASe Provider Software (BPS) and local query tool of BPS as operated at ZFMK.
For GFBio Wiki only:
BioCASe Local Query Tool, landing page: All ZFMK datasets are accessible using the query tool of the BioCASe Provider Software. A landing page for each data package is additionally available under ZFMK easydb. In addition, dataset- or project-specific websites may be available as landing pages for the data.
The BioCASe Monitor service (BMS): See general part: GFBio publication of type 1 data via BioCASe data pipelines