====== Dataflow for Preservation of Digital Information at the SMNS ======

===== Data pipeline of research data and corresponding metadata using SMNS in-house-management systems (DWB) =====

The [[https://kb.gfbio.org/pages/viewpage.action?pageId=113905865|SMNS – Data Center]] is one of the ten [[https://kb.gfbio.org/display/KB/Data+Centers|GFBio Collection Data Centers]] that form the backbone of the GFBio Submission, Repository and Archiving Infrastructure. Data archiving and publication at the SMNS rely on the [[https://diversityworkbench.net/Portal/Diversity_Workbench|Diversity Workbench]] management systems. The management tools and archiving processes used at the data center are described under [[https://gfbio.biowikifarm.net/wiki/Technical_Documentations|Technical Documentations]]. These include services for documentation, processing and archiving of the original data and metadata sets provided (source data; SIP). Data producers are welcome to use the spreadsheet templates provided under [[https://gfbio.biowikifarm.net/wiki/Forms_and_Assessments|Templates for data submission]].

The workflow for submission, archiving and publication of data follows the standard for an __O__pen __A__rchival __I__nformation __S__ystem ([[https://www.iso.org/standard/57284.html|OAIS - Open archival information system]] and [[https://public.ccsds.org/pubs/650x0m2.pdf|Reference Model for an Open Archival Information System (pdf)]]). This ISO standard distinguishes between information packages for submission (SIP), archiving (AIP), and dissemination (DIP). For an overview of ISO standards for digital archives see [[https://gfbio.biowikifarm.net/wiki/ISO_Standards_for_Digital_Archives|ISO Standards for Digital Archives]].
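
As a rough illustration of the three OAIS package types named above, the following Python sketch models them as an enumeration and maps each workflow stage to the package type it handles. The class, dictionary and function names are hypothetical and are not part of Diversity Workbench or of any GFBio tool.

```python
from enum import Enum


class PackageType(Enum):
    """OAIS information package types (full names as used in the OAIS model)."""
    SIP = "Submission Information Package"
    AIP = "Archival Information Package"
    DIP = "Dissemination Information Package"


# Hypothetical mapping of workflow stages to package types, for illustration only.
STAGE_TO_PACKAGE = {
    "submission": PackageType.SIP,
    "archiving": PackageType.AIP,
    "dissemination": PackageType.DIP,
}


def package_for_stage(stage: str) -> PackageType:
    """Return the OAIS package type handled at the given workflow stage."""
    try:
        return STAGE_TO_PACKAGE[stage.lower()]
    except KeyError:
        raise ValueError(f"unknown workflow stage: {stage!r}") from None


if __name__ == "__main__":
    for stage in STAGE_TO_PACKAGE:
        print(stage, "->", package_for_stage(stage).value)
```
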
The different Diversity Workbench modules for specimen occurrence data, literature, taxonomies, and other data are used at the SMNS for data and metadata import, metadata enrichment and data quality control (see [[https://www.gfbio.org/data/tools|Tools & Workbenches for Data Management at GFBio]]). The workflow with these central components is illustrated in Figure 1 and described in the text below.

**Figure 1: The SMNS Data-Flow.**

{{ :it:SMNS_Workflow_20190211.jpg|Figure 1: The SMNS Data Workflow.}}

  ; ABCD : Access to Biological Collections Data schema
  ; SIP : Submission Information Package
  ; AIP : Archival Information Package
  ; DIP : Dissemination Information Package
  ; VAT : Visualizing and Analysing Tool

==== Submission and Ingestion of Data ====

Data providers submit their original research data and corresponding metadata to our data center via the [[https://submissions.gfbio.org/|GFBio Submission System]] or contact the data center directly by email. The data and metadata are checked for completeness, and missing data are requested from the data provider. A Submission Information Package (SIP according to OAIS) is built in several steps, including corrections, follow-up queries, cleansing, and refinement of the original data. Changes to the data are tracked in the GitLab revision control system at the SMNS, following a standard procedure documented in [[dataflow:raw_dataflow|Data flow for Original Data]]. Correspondence with data providers is stored and documented in text files. Each SIP is imported into the management systems and prepared for dissemination by transforming the original research data and corresponding metadata to meet domain-specific requirements as well as requirements for data exchange, such as the [[https://abcd.tdwg.org/|ABCD]] standard.

==== Curation of data and metadata ====

Different types of data require different types of management systems for curation.
At the SMNS, we use specialized software suites to curate the following data types:

  ; Occurrence data : All specimen-related data are integrated into the [[http://diversityworkbench.net/Portal/Diversity_Workbench|DiversityWorkbench]] (DWB) database suite via the integrated import wizard and can be actively curated and managed by domain experts and/or data providers (user account on request). The occurrence data (according to [[https://kb.gfbio.org/display/KB/GFBio+Consensus+Document+for+citation+pattern+of+ABCD+datasets|GFBio consensus documents]]) are stored at unit level in the DWB modules DiversityCollection, DiversityAgents, DiversityTaxonNames and DiversityReferences and are linked with each other. Metadata are cataloged in DiversityProjects. They are published insofar as the GFBio consensus documents list them as mandatory or recommended.
  ; Metadata : Metadata describing data and associated multimedia are either stored together with the data entries (unit level) or handled in separate management modules of DiversityWorkbench, such as DiversityProjects or DiversityAgents. The latter provide information about a set of entries, i.e. the dataset, or metadata.

**Sensitive data**: Each of the specialized systems listed above allows data to be withheld or blurred for publication. This can apply to a complete entry or to part of an entry, e.g. information about the exact sampling location of a specimen. All sensitive data are handled according to our [[:policies:datapolicy|Data Policy: Data provision for upload]]. For personal data, the GDPR applies as described in the [[:policies:privacypolicy|Privacy Policy]].

=== Enrichment and Annotation of Data and Metadata ===

The data and metadata submitted to the SMNS Execute Department for IT and Biodiversity Informatics can be enriched and annotated within the management systems listed above. This is done manually by one of the SMNS data curators in close cooperation with the data provider, or by domain experts with access to the management systems.

**Identifiers:** Identifiers are used to identify information unambiguously, e.g. unique identifiers for person names such as ORCID, or to interlink pieces of information. Identifiers can be added to the (meta-)data by using controlled classifications (i.e. whether the identifier is sequence information, a person identifier, a crossref for literature, etc.) and URLs.

**Licenses:** Different licenses can be applied to the submitted data. They are part of the metadata at unit or dataset level. All metadata stored and published by the data center receive the [[https://creativecommons.org/publicdomain/zero/1.0/deed.en|Creative Commons CC0 waiver]]. The most frequently used license for specimen-related data and multimedia is [[https://creativecommons.org/licenses/by-sa/4.0/|CC BY-SA 4.0]]. An overview of all available CC licenses is available [[https://creativecommons.org/about/cclicenses/|here]].

==== Publication of Data ====

All data uploaded, curated, and archived in the management systems of the SMNS can be published. Publication of datasets is negotiated with the data provider. Aspects to consider are sensitive data to be withheld (see above) and publishing restrictions imposed by third parties.

== Provision of versioned Datasets ==

Datasets containing occurrence data are published by creating a snapshot of the data and metadata in DiversityWorkbench for one dataset. This is done with the internal tools of DiversityCollection. All data are mapped to the [[https://www.bgbm.org/tdwg/codata/schema/ABCD_2.06/HTML/ABCD_2.06.html|ABCD 2.06 Standard]] using the [[https://wiki.bgbm.org/bps|BioCASe Provider Software]]. A Dissemination Information Package (DIP according to OAIS) is created and stored as a zip archive. Each DIP is versioned, and the version is identified by its date.

== DOI assignment ==

A DOI is assigned to each published major version of an occurrence dataset.
The SMNS is registered at [[https://www.zbmed.de/|ZB MED]] and can therefore create DOIs at [[https://doi.datacite.org/|DataCite DOI Fabrica]]. The DOI is added to the corresponding version of the information package and is also part of the citation of the dataset (see below).

== Citation ==

Published datasets are citable using direct URLs to the DIP or via the DOIs. Based on the data provider's input, the SMNS data curator prepares the citation of the dataset by adjusting the input (submission metadata) to conform to the GFBio citation pattern. The citation is finalized in close collaboration with the data provider. For details see the General part of [[https://gfbio.biowikifarm.net/wiki/Data_Publishing/General_part:_GFBio_publication_of_type_1_data_via_BioCASe_data_pipelines|GFBio publication of type 1 data via BioCASe data pipelines]].

Example: ''Staatliches Museum für Naturkunde Stuttgart. (2017). The Golden wasps collection at the Staatliches Museum für Naturkunde Stuttgart (Version 20200805) [Data set]. Staatliches Museum für Naturkunde Stuttgart. https://doi.org/10.35069/SMNS-COLL.GOLDENWASPS''

==== Archiving ====

Archival Information Packages (AIPs according to OAIS) are created from all data and metadata submitted and curated within the SMNS in-house-management systems.

  ; GitLab : All submitted files are archived in GitLab as they are. The import schemes used for DiversityWorkbench are also archived here.
  ; DWB : Occurrence data stored in DiversityWorkbench are exported on a regular basis as tab-separated CSV files and archived in the intranet file system of the SMNS.
  ; SMNS Intranet Filesystem : Backups stored in specific folders on the SMNS intranet file system are transferred to backup systems on a regular basis.
  ; SMNS backup system : The generated AIPs are archived in the SMNS backup system. These backups are stored as two identical copies in two different buildings of the SMNS.
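
One way to confirm that two archived copies of an AIP are in fact identical is to compare cryptographic checksums. The sketch below uses SHA-256 from the Python standard library. The source does not state which verification method the SMNS uses, so the choice of SHA-256 and the function names are assumptions made for illustration only.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so that large archives do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_are_identical(first: Path, second: Path) -> bool:
    """Compare two archive copies by checksum (hypothetical helper)."""
    return sha256_of_file(first) == sha256_of_file(second)
```

A periodic job could call ''copies_are_identical'' on each pair of copies and report any mismatch before the next backup cycle overwrites a good copy.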
For detailed information about backups and recovery see the [[it:preservationplan|Preservation Plan]].

==== Access to data via different portals ====

Indexed and faceted data are available in public portals such as GBIF, Europeana and GFBio, which are operated by national or international consortia. Specialized web portals for access to the data are developed and provided by the SMNS. These include the [[https://collections.smns-bw.org|SMNS digital collection catalogue]], the portal of the [[https://bolgermany.de|German Barcode of Life project (GBOL)]], and interfaces to the data that also provide APIs for machine-readable formats and access to the data using CETAF stable identifiers ([[https://id.smns-bw.org|id.smns-bw.org]]). The published data are provided with a recommended citation, license and DOI (see above).

=== Access to published data (unit level) ===

  ; GFBio, VAT, and LAND : GFBio has developed a web portal that provides search functionalities for biodiversity-related datasets and data. All uploaded data are annotated by GFBio's Terminology server, thus providing a richer search experience. A Visualization and Annotation Tool (VAT) allows for analysis and modelling of geo-referenced data. See the General part of [[https://gfbio.biowikifarm.net/wiki/Data_Publishing/General_part:_GFBio_publication_of_type_1_data_via_BioCASe_data_pipelines|GFBio publication of type 1 data via BioCASe data pipelines]]. The "Lebendiger Atlas - Natur Deutschland (LAND)" provides an overview of biodiversity data from Germany: [[https://land.gbif.de/|land.gbif.de]]. Data from Germany that are made available to GFBio can be found there.
  ; Europeana : The multimedia data are accessible via [[https://www.europeana.eu/|Europeana]].
  ; Digital Collection Catalogue : All data based on physical vouchers within the natural history collections of the SMNS are accessible via the [[https://collections.smns-bw.org/|SMNS Digital Collection Catalogue]].
  ; id.smns-bw.org : All occurrence data are accessible to humans and machines in HTML, JSON, or RDF format via the API at [[https://id.smns-bw.org/smns/collection/|id.smns-bw.org/smns/collection/]].

=== Access to original and raw data (dataset level) ===

We provide landing pages and direct download links to the datasets from within search results of the [[https://www.gfbio.org/search?q=smns+zip|GFBio web portal]], our GitLab installation at gitlab.smns-bw.org (login required), the BioCASe Provider Software (BPS) and the [[https://biocase.smns-bw.org/biocase/querytool/main.cgi|local query tool of BPS]] as operated at the SMNS.
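
For the unit-level access via id.smns-bw.org described above, a client could ask for a particular representation through HTTP content negotiation. The following Python sketch only builds the request and does not send it. It assumes that the service selects the format from the ''Accept'' header, which the source does not confirm. The media types chosen for HTML, JSON and RDF and the record identifier ''EXAMPLE-ID'' are hypothetical placeholders.

```python
from urllib.request import Request

BASE_URL = "https://id.smns-bw.org/smns/collection/"

# Assumed media types for the three formats named in the text.
MEDIA_TYPES = {
    "html": "text/html",
    "json": "application/json",
    "rdf": "application/rdf+xml",
}


def build_request(record_id: str, fmt: str) -> Request:
    """Build (but do not send) a GET request for one record in the given format."""
    if fmt not in MEDIA_TYPES:
        raise ValueError(f"unsupported format: {fmt!r}")
    return Request(BASE_URL + record_id, headers={"Accept": MEDIA_TYPES[fmt]})


if __name__ == "__main__":
    # EXAMPLE-ID is a placeholder, not a real record identifier.
    request = build_request("EXAMPLE-ID", "json")
    print(request.full_url, request.get_header("Accept"))
```

Passing the resulting object to ''urllib.request.urlopen'' would send the request. Whether the server honours the header in this way would need to be checked against the actual service.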