====== Data ingestion from dataprovider to GBOL webportal (harvest) ====== FIXME * related: [[gbol:data_disseminate|Webportal -> public (dissemination)]] ===== Recipe ===== * [[gbol:data_ingest#MySQL Prerequisite|MySQL Prerequisite]] * [[gbol:data_ingest#Transfer data from BOLD to gbol_bold|Transfer data from BOLD to gbol_bold]] * [[gbol:data_ingest#Transfer data from DiversityCollection instances and gbol_bold|Transfer data from DiversityCollection instances and gbol_bold]] * [[gbol:data_ingest#Copy temporary database to web portal production database|Copy temporary database to web portal production database]] * [[gbol:data_ingest#Create Blast database|Create Blast database]] * [[gbol:data_ingest#Create Solr indices on public and restricted data|Create Solr indices on public and restricted data]] ===== Scripts ===== ; Application : sync_dwb_webportal. ; Production : gaia:/var/data/sync_dwb_webportal ; GIT : [[ssh://git@fredie.eu:6020/sync_dwb_webportal.git]] ; BOLD Login : pgrobe U2FsdGVkX1/GZghnxjH6azQSYTLEvRQ6eTjBzG3BQYs= für http://boldsystems.org/ ---- ===== Proposed program structure (future) ===== digraph g { graph [labelloc="t", overlap=false, rankdir="LR", fontname="Courier New",fontsize=11]; ratio=auto; S [label="SourceStatements |- Identification \l|- Specimen \l|- Events \l|- Barcoding \l", shape=record, labeljust="l"]; P [label="PrepareTransfer(observable) |- interate over SourceStatements \l|- each statement becomes Observable \l|-use HSM: https://github.com/faif/python-patterns/blob/master/other/hsm/hsm.py \l", shape=record]; O [label="Observer", shape=record, URL="https://github.com/tylerlaberge/PyPattyrn#observer-pattern"]; C [label="Chain |- get from hash \l |- transformation hooks? \l |- write to target \l", shape=record, URL="https://github.com/tylerlaberge/PyPattyrn#chain-of-responsibility-pattern"]; P -> S [label="iterate"]; P -> O [label="place"]; O -> C [label="delegate"]; } ---- ===== MySQL Prerequisite ===== Set mysql server to flush data from multiple queries together not for each query seperately. Otherwise it will take years. edit /etc/mysql/my.cnf, put the following in section [mysqld]: [mysqld] [...] innodb_flush_log_at_trx_commit = 0 [...] see: * http://www.itworld.com/article/2695666/data-center/diagnosing-high-disk-utilization-in-mysql.html * https://dev.mysql.com/doc/refman/5.5/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit ---- ===== Transfer data from BOLD to gbol_bold ===== ==== Skript zum Herunterladen der BOLD Daten ==== ''gaia:/var/data/sync_dwb_webportal/bold_data/downloads/download.sh'' * holt alle in gbol_bold.bold_projects definierten BOLD Projekte über deren API runter (API call im Script). * diese werden als .xml gespeichert und in utf-8 konvertioert * das Skript gaia:/var/data/sync_dwb_webportal/bold_data/xml2sql_2.py überträgt die einzelnen Dateien dann in gbol_bold. * anschließend wird die Tabelle gbol2_taxa gefüllt ==== Institute assignment ==== Every entry in GBOL is produced and published by one GBOL partner institute. In DiversityCollection this is documentes using the External Datasource fild for Collection Specimen. BOLD data get the institute assigned accoring to their BOLD project (e.g BCDER <=> GBOL2 - Dermaptera = project_id 12). This does not work in all cases, therefore the view ''gbol_bold.all_new'' has a harcoded transformation according to the information stored in the table **specimen_identifiers** and the fields: **institution_storing** and **catalognum**. Test Script for all_new: SELECT `b_i`.`record_id` AS `specimen_id`, `b_i`.`record_id` AS `record_id`, `b_i`.`catalognum` AS `catalognum`, `b_i`.`fieldnum` AS `fieldnum`, IF(SUBSTR(b_i.sampleid,4,3)='ZSM' OR LEFT(b_i.sampleid,2)='FJ', 12, IF(LEFT(b_i.`institution_storing`,19)='Research Collection', 12, IF(b_i.`institution_storing`='Biodiversity Institute of Ontario', 5, IF(b_i.`institution_storing`='Friedrich-Schiller-University Jena', 47, IF(LEFT(b_i.`institution_storing`,12)='Humboldt Uni', 15, IF(b_i.`institution_storing`='National Museum of Scotland', 12, IF(b_i.`institution_storing`='National Museum of Natural History, Prague', 12, IF(b_i.`institution_storing`='Senckenberg Museum of Natural History Gorlitz' OR LEFT(b_i.catalognum,4)='SMNG', 22, IF(LEFT(b_i.`institution_storing`,4)='SNSB', 12, IF(b_i.`institution_storing`='Technische Universitat Muenchen', 12, IF(LEFT(b_i.`institution_storing`,29)='Zoologisches Forschungsmuseum', 5, r.project_id))))))))))) AS project_id, b_i.`institution_storing` AS `institution_storing`, `b_g`.`gbol_tax_id` AS `gbol_tax_id` FROM ((`gbol_bold`.`specimen_identifiers` `b_i` LEFT JOIN `gbol-python`.`GBOL_Data` `g_d` ON (((`g_d`.`term` = `b_i`.`catalognum`) AND (`g_d`.`field_id` = 1)))) LEFT JOIN `gbol_bold`.`gbol2_taxa` `b_g` ON ((`b_g`.`SampleID` = `b_i`.`sampleid`))) INNER JOIN `gbol_bold`.`record` r ON r.record_id=b_i.record_id WHERE r.record_id IN ( 4780949, -- empty, via catalognum ~ '%ZSM%' 1156551, -- Research Collection... 5889403, -- Biodiversity Institute of Ontario 1938912, -- Friedrich-Schiller-University Jena 3984066, -- Humboldt Universitaet Berlin, Museum fuer Naturkunde, Provisional Research Collection 1114004, -- Mined from GenBank, NCBI (via catalognum ~ 'FJ%') 4977873, -- National Museum of Scotland (via catalognum ~ '%ZSM%') 1747078, -- National Museum of Natural History, Prague 5512094, -- Senckenberg Museum of Natural History Gorlitz 4502392,3097284, -- SNSB, Zoologische Staatssammlung Muenchen 3078574, -- Technische Universitat Muenchen 6557029, -- Tiroler Landesmuseum Ferdinandeum 4798975, -- Zentrum fur Biodokumentation des Saarlandes 6292787 -- Zoologisches Forschungsmuseum Alexander Koenig, Bonn ); ---- ===== Transfer data from DiversityCollection instances and gbol_bold ===== Script ; Application : Transfer.py ; Production : gaia:/var/data/sync_dwb_webportal ; GIT : [[ssh://git@fredie.eu:6020/sync_dwb_webportal.git]] ; IMPORTANT : use branch bjoerns_test with python3 environment ---- Use [[https://wiki.ubuntuusers.de/Screen/|screen]] or [[https://wiki.ubuntuusers.de/byobu/|byobu]] for a session that can be re-attached when connection to gaia is lost screen -S sync sudo su su gbol-dev cd /var/data/sync_dwb_webportal git checkout master git pull origin master source bin/activate python Transfer.py script needs about one and half an hour for generation of the temporary database. ---- ===== Copy temporary database to web portal production database ===== Exit from user gbol-dev to become root again as the script ./helper/copy_transfer.sh needs su permissions exit cd /var/data/sync_dwb_webportal/helper/ bash ./copy_transfer.sh ===== Create Blast database ===== Script ; Application : blastdb.py ; Production : gaia:/var/data/pyramidgbol/gbol/ ; GIT : [[ssh://git@fredie.eu:6020/sync_dwb_webportal.git]] ; IMPORTANT : activate python3 environment, execute two times ---- become user gbol-dev again su gbol-dev activate the virtual environment for python 3 and execute the script two times cd /var/data/pyramidgbol/gbol/ source ../bin/activate python create_blastdb.py python create_blastdb.py ---- ===== Create Solr indices on public and restricted data ===== Connect to solr web page https://fredie.eu/solr with user ''solr_usr'' password U2FsdGVkX18QEbl0z1phI48XXjafijdr+CP/Vkm3b1Y= - Select ''public_collection'' from ''Core Selector'' - Select ''Dataimport'' from tasks list below the ''Core Selector'' - press ''Execute'' The progress can be watched by pressing the ''Refresh Status'' button. It takes about 30 minutes When indexing is finished for public collection, the same steps must be done for the restricted collection: - Select ''restricted_collection'' from ''Core Selector'' - Select ''Dataimport'' from tasks list below the ''Core Selector'' - press ''Execute'' The indexing of the restricted collection takes about 3 hours. {{:gbol:solr_choose_core.png?250 |}} {{:gbol:solr_import_public2.png?300|}}