Documentation for Neo4j Data Migration and Validation

This documentation covers the process of migrating fauna data from CSV files into a Neo4j graph database and then validating the integrity of the migration with automated tests. The process is divided into three main components: utility scripts for setting up the database and migrating node data (fauna_taxon_migration.py), relationship migration between the created nodes (fauna_taxon_migration_relationships.py), and test scripts for validating the data migration (test_taxon_data_migration.py).

1. Database Management: fauna_taxon_migration.py

This script contains utility functions for managing the graph database structure, such as creating indexes, creating nodes and processing batches of node data from CSV files.

There are seven node types:

  • Fauna_Taxon: Holds taxon metadata.
  • Area: Holds area metadata.
  • Users: Holds user metadata.
  • Status: Holds status information about a taxon entry; for example, an entry could be "Valid", "public", "New", and so on.
  • Rank: The taxon's rank, that is, its level in the taxonomic hierarchy.
  • Author: The person who discovered the taxon.
  • Reference_Papers: Reference papers that refer to the taxon entry.

Functionalities

  • Index Management: Functions create_indexs_for_graphs() and drop_indexes_for_graphs() manage indexes for faster query performance.
  • Node Creation: Separate functions are provided for batch processing different types of nodes (batch_taxon_transaction, batch_area_transaction, etc.), allowing for flexible and organized data migration.
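
The batch-oriented node creation described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the helper names (index_statements, read_batches, merge_nodes_query, load_nodes), the "id" key property, and the batch size are all assumptions. The session argument is expected to behave like a Neo4j driver session, that is, to expose run(query, **parameters).

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List

# Node labels from the list above. The indexed property "id" is an assumption.
NODE_LABELS = [
    "Fauna_Taxon", "Area", "Users", "Status",
    "Rank", "Author", "Reference_Papers",
]


def index_statements(labels: Iterable[str], prop: str = "id") -> List[str]:
    """Build one CREATE INDEX statement per label (Neo4j 4.x+ syntax)."""
    return [
        f"CREATE INDEX IF NOT EXISTS FOR (n:{label}) ON (n.{prop})"
        for label in labels
    ]


def read_batches(rows: Iterable[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Yield successive lists of at most batch_size rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def merge_nodes_query(label: str, key: str = "id") -> str:
    """Cypher that upserts one node per row, matched on the key property."""
    return (
        "UNWIND $rows AS row "
        f"MERGE (n:{label} {{{key}: row.{key}}}) "
        "SET n += row"
    )


def load_nodes(session, label: str, rows: Iterable[Dict], batch_size: int = 1000) -> int:
    """Send rows to the database in batches and return the number of rows sent."""
    query = merge_nodes_query(label)
    total = 0
    for batch in read_batches(rows, batch_size):
        session.run(query, rows=batch)  # one round trip per batch
        total += len(batch)
    return total
```

Rows can come straight from csv.DictReader, since it yields one dictionary per CSV line. Using MERGE rather than CREATE keeps the load idempotent if a batch is retried, at the cost of a lookup per row, which is where the indexes help.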

Execution

Run the script directly to perform database setup tasks or node data migrations as needed. Modify the main() function to include or exclude specific operations.

2. Data Relationship Migration: fauna_taxon_migration_relationships.py

Overview

TaxonRelationshipBuilder is a Python class that migrates taxonomic relationships from CSV files into a Neo4j graph database. It manages connections to the database, processes CSV files in batches, and creates the various types of relationships between the node types described above.

There are nine relationship types that connect fauna_taxon nodes to other nodes:

  • has_status: a relationship between a fauna_taxon node and a status node
  • found_in: a relationship between a fauna_taxon node and an area node
  • has_rank: a relationship between a fauna_taxon node and a rank node
  • has_author: a relationship between a fauna_taxon node and an author node
  • has_user: a relationship between a fauna_taxon node and a user node
  • has_parent: a relationship between two fauna_taxon nodes; it points to the taxon's parent, which is also a fauna_taxon node
  • has_family: a relationship between two fauna_taxon nodes; it points to the taxon's family, which is also a fauna_taxon node
  • has_genus: a relationship between two fauna_taxon nodes; it points to the taxon's genus, which is also a fauna_taxon node
  • has_reference_paper: a relationship between a fauna_taxon node and a reference_paper node
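
The nine relationship types above can be captured in a single lookup table from which the Cypher for each type is generated. The sketch below is illustrative only: the table name, the relationship_query helper, the row fields taxon_id and target_id, and the "id" property are assumptions, and the target labels are taken from the node list in section 1.

```python
# Relationship type -> label of the node it points to.
# Every relationship starts at a Fauna_Taxon node.
RELATIONSHIP_TARGETS = {
    "has_status": "Status",
    "found_in": "Area",
    "has_rank": "Rank",
    "has_author": "Author",
    "has_user": "Users",
    "has_parent": "Fauna_Taxon",
    "has_family": "Fauna_Taxon",
    "has_genus": "Fauna_Taxon",
    "has_reference_paper": "Reference_Papers",
}


def relationship_query(rel_type: str) -> str:
    """Build a batched MERGE query for one relationship type.

    Each row in $rows is assumed to carry 'taxon_id' and 'target_id';
    these field names are hypothetical, not taken from the CSV files.
    """
    target_label = RELATIONSHIP_TARGETS[rel_type]  # KeyError on unknown types
    return (
        "UNWIND $rows AS row "
        "MATCH (t:Fauna_Taxon {id: row.taxon_id}) "
        f"MATCH (o:{target_label} {{id: row.target_id}}) "
        f"MERGE (t)-[:{rel_type}]->(o)"
    )
```

Because both end nodes are found with MATCH, a row whose taxon or target is missing creates nothing instead of producing a dangling node, so nodes must be migrated before relationships.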

Key Features

  • Configurable Connection: Establishes a connection to a Neo4j database with retry logic for resilience.
  • Concurrent Batch Processing: Leverages ThreadPoolExecutor for parallel processing of data batches, improving efficiency.
  • Flexible Relationship Building: Supports the creation of multiple relationship types from different CSV sources.
  • Logging: Uses Python's logging module to produce informative output that aids debugging.
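
To illustrate how retry logic, ThreadPoolExecutor, and logging can fit together, here is a minimal sketch. It is not the class's actual implementation: with_retry, process_batches, the logger name, and the default attempt count, delay, and worker count are all assumptions. The handle_batch callable stands in for whatever writes one batch to the database and is expected to return the number of rows it processed.

```python
import logging
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

logger = logging.getLogger("taxon_migration")  # hypothetical logger name


def with_retry(operation, attempts=3, delay_seconds=1.0):
    """Call operation() up to `attempts` times, re-raising the final failure."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as error:  # broad on purpose; narrow this in real code
            logger.warning("Attempt %d/%d failed: %s", attempt, attempts, error)
            if attempt == attempts:
                raise
            time.sleep(delay_seconds)


def process_batches(batches, handle_batch, max_workers=4, attempts=3, delay_seconds=1.0):
    """Run handle_batch over all batches in parallel and sum the row counts."""
    processed = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            # b=batch binds the current batch; a bare closure would see the last one
            pool.submit(with_retry, lambda b=batch: handle_batch(b), attempts, delay_seconds)
            for batch in batches
        ]
        for future in as_completed(futures):
            processed += future.result()  # re-raises if a batch ultimately failed
    logger.info("Processed %d rows", processed)
    return processed
```

Threads suit this workload because each batch spends most of its time waiting on the database rather than computing. Note that concurrent writes touching the same nodes can contend for locks in Neo4j, so the worker count is worth tuning rather than maximizing.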

Usage

  1. Initialization: Create an instance of TaxonRelationshipBuilder with the Neo4j connection URI and optional credentials.
  2. Build Relationships: Call build_all_taxon_relationships() to start processing predefined CSV files and creating relationships in the database.
  3. Clean-up: The close() method is automatically called to close the database connection once operations are complete.

3. Data Migration Validation: test_taxon_data_migration.py

This script uses pytest to define and run tests that validate the integrity of the data migration process.

Key Components

  • Neo4j Connection Fixture: A pytest fixture neo4j_driver that sets up and tears down the Neo4j connection for tests.
  • Node Type Validation: The test_node_types_exist function checks for the existence of expected node labels in the database.
  • Relationship Type Validation: The test_relationship_types_exist function verifies that all expected relationship types are present.

Running Tests

Execute the tests using the pytest command. Ensure Neo4j is running and accessible at the specified URI.

Note: The development branch name is feature_taxon_data_migration.