Data ingestion from dataprovider to GBOL webportal (harvest)

related: Webportal -> public (dissemination)

Recipe

Scripts

Application: sync_dwb_webportal.
Production: gaia:/var/data/sync_dwb_webportal
GIT: ssh://git@fredie.eu:6020/sync_dwb_webportal.git
BOLD Login: pgrobe •••••••••• für http://boldsystems.org/

Proposed program structure (future)

MySQL Prerequisite

Set mysql server to flush data from multiple queries together not for each query seperately. Otherwise it will take years.

edit /etc/mysql/my.cnf, put the following in section [mysqld]:

[mysqld]
[...]
innodb_flush_log_at_trx_commit = 0
[...]

see:

Transfer data from BOLD to gbol_bold

Skript zum Herunterladen der BOLD Daten

gaia:/var/data/sync_dwb_webportal/bold_data/downloads/download.sh

holt alle in gbol_bold.bold_projects definierten BOLD Projekte über deren API runter (API call im Script).
diese werden als .xml gespeichert und in utf-8 konvertioert
das Skript gaia:/var/data/sync_dwb_webportal/bold_data/xml2sql_2.py überträgt die einzelnen Dateien dann in gbol_bold.
anschließend wird die Tabelle gbol2_taxa gefüllt

Institute assignment

Every entry in GBOL is produced and published by one GBOL partner institute. In DiversityCollection this is documentes using the External Datasource fild for Collection Specimen.

BOLD data get the institute assigned accoring to their BOLD project (e.g BCDER ⇔ GBOL2 - Dermaptera = project_id 12). This does not work in all cases, therefore the view gbol_bold.all_new has a harcoded transformation according to the information stored in the table specimen_identifiers and the fields: institution_storing and catalognum.

Test Script for all_new:

SELECT
        `b_i`.`record_id` AS `specimen_id`,
        `b_i`.`record_id` AS `record_id`,
        `b_i`.`catalognum` AS `catalognum`,
        `b_i`.`fieldnum` AS `fieldnum`,
        IF(SUBSTR(b_i.sampleid,4,3)='ZSM' OR LEFT(b_i.sampleid,2)='FJ', 12,
			IF(LEFT(b_i.`institution_storing`,19)='Research Collection', 12, 
				IF(b_i.`institution_storing`='Biodiversity Institute of Ontario', 5, 
					IF(b_i.`institution_storing`='Friedrich-Schiller-University Jena', 47, 
						IF(LEFT(b_i.`institution_storing`,12)='Humboldt Uni', 15, 
							IF(b_i.`institution_storing`='National Museum of Scotland', 12, 
								IF(b_i.`institution_storing`='National Museum of Natural History, Prague', 12, 
									IF(b_i.`institution_storing`='Senckenberg Museum of Natural History Gorlitz' OR LEFT(b_i.catalognum,4)='SMNG', 22, 
										IF(LEFT(b_i.`institution_storing`,4)='SNSB', 12, 
											IF(b_i.`institution_storing`='Technische Universitat Muenchen', 12, 
												IF(LEFT(b_i.`institution_storing`,29)='Zoologisches Forschungsmuseum', 5, 
		r.project_id))))))))))) AS project_id,
        b_i.`institution_storing` AS `institution_storing`,
        `b_g`.`gbol_tax_id` AS `gbol_tax_id`
    FROM
        ((`gbol_bold`.`specimen_identifiers` `b_i`
        LEFT JOIN `gbol-python`.`GBOL_Data` `g_d` ON (((`g_d`.`term` = `b_i`.`catalognum`)
            AND (`g_d`.`field_id` = 1))))
        LEFT JOIN `gbol_bold`.`gbol2_taxa` `b_g` ON ((`b_g`.`SampleID` = `b_i`.`sampleid`)))
        INNER JOIN `gbol_bold`.`record` r ON r.record_id=b_i.record_id
WHERE r.record_id IN (
	4780949,  -- empty, via catalognum ~ '%ZSM%'
	1156551,  -- Research Collection...
	5889403,  -- Biodiversity Institute of Ontario
	1938912,  -- Friedrich-Schiller-University Jena
	3984066,  -- Humboldt Universitaet Berlin, Museum fuer Naturkunde, Provisional Research Collection
	1114004,  -- Mined from GenBank, NCBI (via catalognum ~ 'FJ%')
	4977873,  -- National Museum of Scotland (via catalognum ~ '%ZSM%')
	1747078,  -- National Museum of Natural History, Prague
	5512094,  -- Senckenberg Museum of Natural History Gorlitz
	4502392,3097284, -- SNSB, Zoologische Staatssammlung Muenchen
	3078574, -- Technische Universitat Muenchen
	6557029, -- Tiroler Landesmuseum Ferdinandeum
	4798975, -- Zentrum fur Biodokumentation des Saarlandes
	6292787  -- Zoologisches Forschungsmuseum Alexander Koenig, Bonn
);

Transfer data from DiversityCollection instances and gbol_bold

Script

Application: Transfer.py
Production: gaia:/var/data/sync_dwb_webportal
GIT: ssh://git@fredie.eu:6020/sync_dwb_webportal.git
IMPORTANT: use branch bjoerns_test with python3 environment

Use screen or byobu for a session that can be re-attached when connection to gaia is lost

screen -S sync

sudo su
su gbol-dev
cd /var/data/sync_dwb_webportal
git checkout master
git pull origin master
source bin/activate
python Transfer.py

script needs about one and half an hour for generation of the temporary database.

Copy temporary database to web portal production database

Exit from user gbol-dev to become root again as the script ./helper/copy_transfer.sh needs su permissions

exit

cd /var/data/sync_dwb_webportal/helper/
bash ./copy_transfer.sh

Create Blast database

Script

Application: blastdb.py
Production: gaia:/var/data/pyramidgbol/gbol/
GIT: ssh://git@fredie.eu:6020/sync_dwb_webportal.git
IMPORTANT: activate python3 environment, execute two times

become user gbol-dev again

su gbol-dev

activate the virtual environment for python 3 and execute the script two times

cd /var/data/pyramidgbol/gbol/
source ../bin/activate
python create_blastdb.py
python create_blastdb.py

Create Solr indices on public and restricted data

Connect to solr web page https://fredie.eu/solr with user solr_usr password ••••••••••

Select public_collection from Core Selector
Select Dataimport from tasks list below the Core Selector
press Execute

The progress can be watched by pressing the Refresh Status button. It takes about 30 minutes

When indexing is finished for public collection, the same steps must be done for the restricted collection:

Select restricted_collection from Core Selector
Select Dataimport from tasks list below the Core Selector
press Execute

The indexing of the restricted collection takes about 3 hours.