Import GBIF taxonomy into a MySQL table

The GBIF taxonomy is available as tab separated values file (tsv) here: https://doi.org/10.15468/39omei

The download is about 700 MB in size and contains a file Taxon.tsv. This file can be imported into a MySQL table to use it as taxonomy.

Steps for import

Create database and table

sudo mysql -u root
 
 -- do not forget to set charset for the database
CREATE DATABASE gbif_taxonomy DEFAULT CHARSET utf8mb4 DEFAULT COLLATE utf8mb4_general_ci;
 -- for later access by another user
GRANT ALL ON gbif_taxonomy.* TO USER@localhost;
 
USE gbif_taxonomy;
 
 
 -- create a table that takes all the data
 -- the table first has varchar(11) definitions for NameUsage keys as they contain empty strings in the table. this will be fixed later
 -- there is also a doubled TaxonID that prevents the use of it as primary key
CREATE TABLE `Taxon` (
  `TaxonID` INT(11) NOT NULL AUTO_INCREMENT,
  `datasetID` VARCHAR(50) DEFAULT NULL,
  `parentNameUsageID` VARCHAR(11) DEFAULT NULL,
  `acceptedNameUsageID` VARCHAR(11) DEFAULT NULL,
  `originalNameUsageID` VARCHAR(11) DEFAULT NULL,
  `scientificName` VARCHAR(255) DEFAULT NULL,
  `scientificNameAuthorShip` VARCHAR(255) DEFAULT NULL,
  `canonicalName` VARCHAR(255) DEFAULT NULL,
  `genericName` VARCHAR(255) DEFAULT NULL,
  `specificEpithet` VARCHAR(255) DEFAULT NULL,
  `infraspecificEpithet` VARCHAR(255) DEFAULT NULL,
  `taxonRank` VARCHAR(255) DEFAULT NULL,
  `nameAccordingTo` VARCHAR(255) DEFAULT NULL,
  `namePublishedIn` text,
  `taxonomicStatus` VARCHAR(255) DEFAULT NULL,
  `nomenclaturalStatus` VARCHAR(255) DEFAULT NULL,
  `taxonRemarks` VARCHAR(255) DEFAULT NULL,
  `kingdom` VARCHAR(255) DEFAULT NULL,
  `phylum` VARCHAR(255) DEFAULT NULL,
  `class` VARCHAR(255) DEFAULT NULL,
  `order` VARCHAR(255) DEFAULT NULL,
  `family` VARCHAR(255) DEFAULT NULL,
  `genus` VARCHAR(255) DEFAULT NULL,
  KEY (TaxonID)
  );
 

Import the data from Taxon.tsv

The file Taxon.tsv must be copied to the directory that is allowed for use with the LOAD DATA command. This can be found with:

SHOW VARIABLES WHERE Variable_name = 'secure_file_priv';

Copy the Taxon.tsv file to the shown directory. Than use the LOAD DATA INFILE to import the data into the Taxon table. The first line must be ignored as it contains the column headers.

LOAD DATA INFILE '/var/lib/mysql-files/Taxon.tsv' INTO TABLE Taxon IGNORE 1 LINES;

This will take a while

Changing the column definitions and adding keys

There is a doubled entry for one TaxonID. As one entry contains only 'incertae sedis' as name values it can be deleted and the TaxonID column can be changed to a primary key column

SELECT COUNT(TaxonID), TaxonID FROM Taxon GROUP BY TaxonID HAVING COUNT(TaxonID) > 1;
 
SELECT * FROM Taxon WHERE TaxonID = 9819794;
 
DELETE FROM Taxon WHERE TaxonID = 9819794 AND scientificName = "incertae sedis";
 
ALTER TABLE Taxon ADD PRIMARY KEY (TaxonID);

The columns *NameUsageID contain integer values and should have keys. First, the empty sting values in some of the rows must be replaced with NULL:

UPDATE Taxon SET parentNameUsageID = NULL WHERE parentNameUsageID = '';
 
UPDATE Taxon SET acceptedNameUsageID = NULL WHERE acceptedNameUsageID = '';
 
UPDATE Taxon SET originalNameUsageID = NULL WHERE originalNameUsageID = '';

Change the datatype of the *NameUsageID columns to INT:

ALTER TABLE Taxon MODIFY parentNameUsageID INT(11);
 
ALTER TABLE Taxon MODIFY acceptedNameUsageID INT(11);
 
ALTER TABLE Taxon MODIFY originalNameUsageID INT(11);

Add keys to several columns to have faster access to the data:

ALTER TABLE Taxon ADD KEY(parentNameUsageID);
 
ALTER TABLE Taxon ADD KEY(acceptedNameUsageID);
 
ALTER TABLE Taxon ADD KEY(originalNameUsageID);
 
ALTER TABLE Taxon ADD KEY(kingdom);
 
ALTER TABLE Taxon ADD KEY(phylum);
 
ALTER TABLE Taxon ADD KEY(`class`);
 
ALTER TABLE Taxon ADD KEY(`order`);
 
ALTER TABLE Taxon ADD KEY(family);
 
ALTER TABLE Taxon ADD KEY(genus);
 
ALTER TABLE Taxon ADD KEY(scientificName);
 
ALTER TABLE Taxon ADD KEY(genericName);
 
ALTER TABLE Taxon ADD KEY(specificEpithet);

Clean up and preparation for exchange with other databases

If I want to transfer the data into another database system it might be necessary to resolve the parentUsageNameIDs, acceptedUsageNameIDs and originalUsageNameIDs to the scientific names that they refer to. This is for example needed when one wants to build a new tree in a database with its own primary keys for the scientific names.

Therefore, I added a name column behind each of the UsageNameID column that is filled with the referenced name. The update queries take several minutes.

ALTER TABLE Taxon ADD COLUMN parentScientificName VARCHAR(255) AFTER parentNameUsageID;
ALTER TABLE Taxon ADD COLUMN parentCanonicalName VARCHAR(255) AFTER parentNameUsageID;
 
UPDATE Taxon t INNER JOIN Taxon p ON (t.parentNameUsageID = p.TaxonID)
SET t.parentCanonicalName = p.canonicalName,
t.parentScientificName = p.scientificName
;
 
ALTER TABLE Taxon ADD KEY(parentScientificName);
ALTER TABLE Taxon ADD KEY(parentCanonicalName);
 
ALTER TABLE Taxon ADD COLUMN acceptedName VARCHAR(255) AFTER acceptedNameUsageID;
UPDATE Taxon t INNER JOIN Taxon p ON (t.acceptedNameUsageID = p.TaxonID)
SET t.acceptedName = p.scientificName
;
ALTER TABLE Taxon ADD KEY(acceptedName);
 
ALTER TABLE Taxon ADD COLUMN originalName VARCHAR(255) AFTER originalNameUsageID;
UPDATE Taxon t INNER JOIN Taxon p ON (t.originalNameUsageID = p.TaxonID)
SET t.originalName = p.scientificName
;
ALTER TABLE Taxon ADD KEY(originalName);

There are some problems in the dataset. Some scientific names occur multiple times with different TaxonIDs. Some of them are marked as synonyms, some as accepted names.

SELECT the datasets where a scientificName occurs more than once into a temporary table, start with accepted names:

CREATE TEMPORARY TABLE accepted_doubles
SELECT COUNT(scientificName), scientificName FROM Taxon 
WHERE taxonomicStatus = 'accepted'
GROUP BY scientificName
HAVING COUNT(scientificName) > 1;

Many of the doubles have the author name moved to scientificName and scientificNameAuthorShip contains only the a koma and the year:

SELECT DISTINCT 
-- t.TaxonID, t.scientificName, t.scientificNameAuthorship, t.canonicalName 
t.* FROM Taxon t 
INNER JOIN accepted_doubles d ON (t.scientificName = d.scientificName)
WHERE t.scientificNameAuthorship LIKE ', %'
;

delete them from the taxon table:

DELETE  
t FROM Taxon t 
INNER JOIN accepted_doubles d ON (t.scientificName = d.scientificName)
WHERE t.scientificNameAuthorship LIKE ', %'
;

There are a number of Taxa that occur in twice in Animalia and Plants or Fungi or Bacteria. These are in part doubled and wrongly connected to more than one parent taxa in the different kingdoms, e. g. Acantholepis. I am not sure how to handle that.

SELECT t1.scientificName, t1.kingdom, t2.kingdom FROM Taxon t1 
INNER JOIN accepted_doubles d ON (t1.scientificName = d.scientificName)
INNER JOIN Taxon t2 ON(t1.scientificName = t2.scientificName)
WHERE t1.kingdom != t2.kingdom;

Add a column to mark the scientific names that are used in different kingdoms:

ALTER TABLE Taxon ADD COLUMN different_kingdoms BOOL DEFAULT 0;

Choose all such taxa and set the marker for them to 1. First select all taxa that are doubled into a temporary table, as it fasten the process:

CREATE TEMPORARY TABLE taxon_doubles
SELECT COUNT(scientificName), scientificName FROM Taxon 
-- WHERE taxonomicStatus = 'synonym'
GROUP BY scientificName
HAVING COUNT(scientificName) > 1;
SELECT t1.scientificName, t1.kingdom, t2.kingdom FROM Taxon t1 
INNER JOIN taxon_doubles d ON (t1.scientificName = d.scientificName)
INNER JOIN Taxon t2 ON(t1.scientificName = t2.scientificName)
WHERE t1.kingdom != t2.kingdom;
UPDATE Taxon t1
INNER JOIN taxon_doubles d ON (t1.scientificName = d.scientificName)
INNER JOIN Taxon t2 ON(t1.scientificName = t2.scientificName)
SET t1.different_kingdoms = 1
WHERE t1.kingdom != t2.kingdom
;

Create table for vernacular names

Create the table

CREATE TABLE `VernacularName` (
  `TaxonID` INT(11) NOT NULL,
  `vernacularName` VARCHAR(255) DEFAULT NULL,
  `language` VARCHAR(2) DEFAULT NULL,
  `country` VARCHAR(255) DEFAULT NULL,
  `countryCode` VARCHAR(10) DEFAULT NULL,
  `sex` VARCHAR(50) DEFAULT NULL,
  `lifestage` VARCHAR(255) DEFAULT NULL,
  `source` VARCHAR(255) DEFAULT NULL,
  KEY `TaxonID` (`TaxonID`),
  KEY `vernacularName` (`vernacularName`),
  KEY `country` (`country`),
  KEY `countryCode` (`countryCode`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4

a single line (717874) in the tsv file must be deleted as it contains no valid data. The backslash can not be used as vernacular

2925549	\						The Woody Plants of Korea (한국의 수목)

copy the file Vernacular.tsv to directory allowed for loading files, e. g. /var/lib/mysql-files/

import the data:

LOAD DATA INFILE '/var/lib/mysql-files/VernacularName.tsv' INTO TABLE VernacularName IGNORE 1 LINES;

Fixing wrong taxa in Taxon table

NOTE: This only fixes the taxonomicStatus. As the taxon hierarchy is rebuild on the way to the webportal when using the sync_dwb_portal.py script, the hierarchy will taken from the taxon that is marked as 'accepted'

Some taxa are marked as accepted that are synonyms or unaccepted. This in part lead to completly wrong assignments of taxa, for example the Lepidopteran genus Epicimelia is considered as subtaxon of the Malacostracaen taxon Axiidae and the taxonimcStatus was set to accepted. This must be changed by hand, when such errors occur:

SELECT * FROM Taxon WHERE canonicalName LIKE 'Epicimelia';
 -- set the wrongly accepted taxon to doubtful
UPDATE Taxon SET taxonomicStatus='doubtful' WHERE TaxonID = 8128092;
 -- set the taxon marked as doubtful to accepted
UPDATE Taxon SET taxonomicStatus='accepted' WHERE TaxonID = 1860101;
SELECT * FROM Taxon WHERE canonicalName LIKE 'Eueides';
 -- set the wrongly accepted taxon to doubtful
UPDATE Taxon SET taxonomicStatus='doubtful' WHERE TaxonID = 7911822;
 -- set the taxon marked as doubtful to accepted
UPDATE Taxon SET taxonomicStatus='accepted' WHERE TaxonID = 1904294;

Fix taxa in Taxon table, the long correct way:

NOTE: This is the long way, fixing the taxonomy in Taxon table entries. Since the taxon hierarchy is rebuild in sync_dwb_webportal.py script, the former way to just change the taxonomicStatus value will work when the data are used there

Cepolidae is a Homonym in Mollusca und Actinopterygii, the fish species of genus Cepola are all put into the wrong molluscan family Cepolidae, the older family Cepolidae in fishes does not exist in GBIF

add Cepolidae entry:

INSERT INTO Taxon VALUES (NULL, UUID(), 587, 'Perciformes', 'Perciformes', NULL, NULL, NULL, NULL, 'Cepolidae Rafinesque, 1815', 'Rafinesque, 1815', 'Cepolidae', 'Cepolidae', '', '', 'family', 'WORMS', 'urn:lsid:marinespecies.org:taxname:125527', 'homonym', '', '', 'Animalia', 'Chordata', 'Actinopterygii', 'Perciformes', 'Cepolidae', '', 0);

get the id from the new family

SELECT * FROM Taxon WHERE canonicalName = 'Cepolidae' AND `order` != 'Stylommatophora';

add Cepola entry:

INSERT INTO Taxon VALUES (NULL, UUID(), 9825260, 'Cepolidae', 'Cepolidae Rafinesque, 1815', NULL, NULL, NULL, NULL, 'Cepola Linnaeus, 1764', 'Linnaeus, 1764', 'Cepola', 'Cepola', '', '', 'genus', 'WORMS', 'urn:lsid:marinespecies.org:taxname:125953', 'homonym', '', '', 'Animalia', 'Chordata', 'Actinopterygii', 'Perciformes', 'Cepolidae', 'Cepola', 0);

copy entries that are connected to the wrong genus Cepola to the new genus Cepola:

get the id from the new genus

SELECT * FROM Taxon WHERE canonicalName = 'Cepola' AND `order` != 'Stylommatophora';

copy the species entries

INSERT INTO Taxon SELECT NULL, UUID(), 9825324, 'Cepola', 'Cepola Linnaeus, 1764', acceptedNameUsageID, acceptedName, originalNameUsageID, originalName, scientificName, scientificNameAuthorShip, canonicalName, genericName, specificEpithet, infraspecificEpithet, taxonRank, nameAccordingTo, namePublishedIn, taxonomicStatus, nomenclaturalStatus, taxonRemarks, 'Animalia', 'Chordata', 'Actinopterygii', 'Perciformes', 'Cepolidae', 'Cepola', different_kingdoms FROM Taxon WHERE parentCanonicalName LIKE 'Cepola' AND `order` = 'Stylommatophora';

delete the old, wrongly connected lines

DELETE FROM Taxon WHERE parentCanonicalName = 'Cepola' AND `order` = 'Stylommatophora';

set the taxonomicStatus of the new Cepolidae and Cepola to accepted, set the old ones to homonym and 'wrong parent', respectively

UPDATE Taxon SET taxonomicStatus = 'homonym'  WHERE canonicalName = 'Cepolidae' AND `order` = 'Stylommatophora';
UPDATE Taxon SET taxonomicStatus = 'accepted'  WHERE canonicalName = 'Cepolidae' AND `order` != 'Stylommatophora';
 
UPDATE Taxon SET taxonomicStatus = 'accepted'  WHERE canonicalName = 'Cepola' AND `order` != 'Stylommatophora';
UPDATE Taxon SET taxonomicStatus = 'wrong parent'  WHERE canonicalName = 'Cepola' AND `order` = 'Stylommatophora';

Usage?

The scripts in [sync_dwb_webportal|https://gitlab.leibniz-lib.de/wwalbaum/sync_dwb_webportal] use the Taxon table to add the taxa to the taxa fetched from TaxonNames databases.

Thoughts:

copy the data to a table that resembles GBOL_Taxa in the GBOL-Portal? Perhaps the transfer can be adopted to check taxon names against this table instead of checking against TaxonNames databases.

CREATE TABLE `GBOL_Taxa` (
  `id` INT(10) UNSIGNED NOT NULL,
  `taxon` VARCHAR(100) NOT NULL,
  `author` VARCHAR(100) DEFAULT NULL,
  `rank` VARCHAR(25) NOT NULL,
  `parent_id` INT(10) UNSIGNED DEFAULT NULL,
  `lft` INT(10) UNSIGNED DEFAULT NULL,
  `rgt` INT(10) UNSIGNED DEFAULT NULL,
  `known` INT(10) UNSIGNED NOT NULL DEFAULT '0' COMMENT 'number of species',
  `collected` INT(10) UNSIGNED NOT NULL DEFAULT '0' COMMENT 'number of collected species',
  `barcode` INT(10) UNSIGNED NOT NULL DEFAULT '0' COMMENT 'number of species with barcodes',
  `collected_individuals` INT(10) UNSIGNED NOT NULL DEFAULT '0',
  `barcode_individuals` INT(10) UNSIGNED NOT NULL DEFAULT '0',
  `rank_code` INT(10) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`),
  KEY `left` (`lft`),
  KEY `right` (`rgt`),
  KEY `parent_id` (`parent_id`),
  KEY `taxon` (`taxon`),
  KEY `idx_rank_code` (`rank_code`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
;

alter some columns and add a column scientificName that can be used for comparison with parentScientificName column. Not sure if it is useful to compare the scientific names instead of canonical names. scientificName and parentScientificName contain the taxon name and author.

ALTER TABLE GBOL_Taxa MODIFY `id` INT(11) NOT NULL AUTO_INCREMENT;
 
ALTER TABLE GBOL_Taxa MODIFY author VARCHAR(255) DEFAULT NULL;
 
ALTER TABLE GBOL_Taxa ADD COLUMN scientificName VARCHAR(255);

Copy the data from Taxon table to GBOL_Taxa. This sets all `id` of the taxa names to a new value. The connection to parent taxa must be inserted later by an UPDATE

INSERT INTO GBOL_Taxa (taxon, author, `rank`, scientificName)
SELECT DISTINCT canonicalName, scientificNameAuthorShip, taxonRank, scientificName
FROM Taxon
WHERE taxonomicStatus = 'accepted' AND kingdom = 'animalia' AND different_kingdoms = 0
;

this took 1h 17min.

delete the rows with no taxa name in it

DELETE FROM GBOL_Taxa WHERE taxon = '';

update the parent_ids in GBOL_Taxa by comparing them to the parent names in Taxon table. The parentName column in the Taxon table contains the scientificName of the parent taxon.

UPDATE GBOL_Taxa gt
-- SELECT COUNT(*)
-- t.parentName, gt.taxon, gt2.`id`
-- FROM GBOL_Taxa gt
INNER JOIN Taxon t ON (gt.scientificName = t.scientificName)
INNER JOIN GBOL_Taxa gt2 ON(t.parentName = gt2.scientificName) 
SET gt.parent_id = gt2.`id`
;