Documentation for Neo4j Data Migration and Validation
This documentation covers the process of migrating fauna data from CSV files into a Neo4j graph database and then validating the integrity of the migration with automated tests. The process is divided into three main components: data migration for creating nodes and relationships (fauna_taxon_migration_relationships.py), utility scripts for setting up and managing the database (fauna_taxon_migration.py), and test scripts for validating the data migration (test_taxon_data_migration.py).
1. Database Management: fauna_taxon_migration.py
This script contains utility functions for managing the graph database structure, such as creating indexes, creating nodes and processing batches of node data from CSV files.
There are 7 node types:
- Fauna_Taxon: Holds taxon metadata.
- Area: Holds area metadata.
- Users: Holds user metadata.
- Status: Holds status information about the created taxon; for example, a taxon entry could be "Valid", "public", "New", and so on.
- Rank: The taxon rank, that is, a level in the taxonomic hierarchy.
- Author: The person who discovered the taxon.
- Reference_Papers: Reference papers that refer to the taxon entry.
Functionalities
- Index Management: Functions create_indexs_for_graphs() and drop_indexes_for_graphs() manage indexes for faster query performance.
- Node Creation: Separate functions are provided for batch processing different types of nodes (batch_taxon_transaction, batch_area_transaction, etc.), allowing for flexible and organized data migration.
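The index and batch-loading helpers described above might look roughly like the following sketch. It is not the code from fauna_taxon_migration.py: the helper names (index_statements, read_csv_rows, batched, node_merge_query, load_nodes), the taxon_id key property, and the default batch size are assumptions for illustration. Here, session stands for any object with a run(query, **params) method, such as a Neo4j driver session.

```python
import csv
from itertools import islice
from typing import Iterable, Iterator, Tuple


def index_statements(label: str, key: str) -> Tuple[str, str]:
    """Return (create, drop) Cypher statements for a named index (Neo4j 4.x+ syntax)."""
    name = f"idx_{label.lower()}_{key}"  # hypothetical naming scheme
    create = f"CREATE INDEX {name} IF NOT EXISTS FOR (n:{label}) ON (n.{key})"
    drop = f"DROP INDEX {name} IF EXISTS"
    return create, drop


def read_csv_rows(path: str) -> Iterator[dict]:
    """Yield each CSV row as a dict keyed by the header line."""
    with open(path, newline="", encoding="utf-8") as handle:
        yield from csv.DictReader(handle)


def batched(rows: Iterable[dict], batch_size: int) -> Iterator[list]:
    """Yield successive lists of at most batch_size rows."""
    iterator = iter(rows)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def node_merge_query(label: str, key: str) -> str:
    """Build a Cypher statement that upserts one node per row of $rows."""
    return (
        "UNWIND $rows AS row "
        f"MERGE (n:{label} {{{key}: row.{key}}}) "
        "SET n += row"
    )


def load_nodes(session, label: str, key: str, rows: Iterable[dict],
               batch_size: int = 1000) -> int:
    """Send the rows to the database in batches; return the number of batches sent."""
    query = node_merge_query(label, key)
    sent = 0
    for batch in batched(rows, batch_size):
        session.run(query, rows=batch)  # one round trip per batch
        sent += 1
    return sent
```

Using MERGE rather than CREATE keeps a rerun of the migration from duplicating nodes, at the cost of a lookup per row. That lookup is the reason for creating the index on the key property before loading.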
Execution
Run the script directly to perform database setup tasks or node data migrations as needed. Modify the main() function to include or exclude specific operations.
2. Data Relationship Migration: fauna_taxon_migration_relationships.py
Overview
TaxonRelationshipBuilder is a Python class designed to facilitate the migration of taxonomic relationships from CSV files into a Neo4j graph database. It manages connections to the database, processes CSV files in batches, and handles the creation of various types of relationships between nodes mentioned above.
There are 9 relationship types that connect the fauna_taxon nodes with other nodes:
- has_status: a relationship between a fauna_taxon node and a status node
- found_in: a relationship between a fauna_taxon node and an area node
- has_rank: a relationship between a fauna_taxon node and a rank node
- has_author: a relationship between a fauna_taxon node and an author node
- has_user: a relationship between a fauna_taxon node and a user node
- has_parent: a relationship between two fauna_taxon nodes; it points to the taxon's parent, which is also a fauna_taxon node
- has_family: a relationship between two fauna_taxon nodes; it points to the taxon's family, which is also a fauna_taxon node
- has_genus: a relationship between two fauna_taxon nodes; it points to the taxon's genus, which is also a fauna_taxon node
- has_reference_paper: a relationship between a fauna_taxon node and a reference_paper node
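The nine relationship types above can be summarized as a lookup table from relationship type to target node label, from which a batched Cypher statement can be generated. This is only a sketch. The label spellings follow the node list in section 1, while the id property and the source_id and target_id CSV column names are assumptions that would need to match the real data.

```python
# Target label for each relationship type; the source node is always Fauna_Taxon.
RELATIONSHIP_TARGETS = {
    "has_status": "Status",
    "found_in": "Area",
    "has_rank": "Rank",
    "has_author": "Author",
    "has_user": "Users",
    "has_parent": "Fauna_Taxon",
    "has_family": "Fauna_Taxon",
    "has_genus": "Fauna_Taxon",
    "has_reference_paper": "Reference_Papers",
}


def relationship_query(rel_type: str,
                       source_key: str = "source_id",   # assumed CSV column
                       target_key: str = "target_id",   # assumed CSV column
                       id_property: str = "id") -> str:  # assumed node property
    """Build a Cypher statement that links existing nodes for one batch of rows."""
    target_label = RELATIONSHIP_TARGETS[rel_type]  # KeyError for unknown types
    return (
        "UNWIND $rows AS row "
        f"MATCH (t:Fauna_Taxon {{{id_property}: row.{source_key}}}) "
        f"MATCH (o:{target_label} {{{id_property}: row.{target_key}}}) "
        f"MERGE (t)-[:{rel_type}]->(o)"
    )
```

Because the statement uses MATCH for both endpoints, rows that refer to a missing node create nothing. Nodes therefore have to be migrated before relationships.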
Key Features
- Configurable Connection: Establishes a connection to a Neo4j database with retry logic for resilience.
- Concurrent Batch Processing: Leverages ThreadPoolExecutor for parallel processing of data batches, improving efficiency.
- Flexible Relationship Building: Supports the creation of multiple relationship types from different CSV sources.
- Logging: Uses Python's logging module for informative and debuggable output.
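The combination of retry logic, concurrent batch processing, and logging listed above could be sketched as follows. This is not the TaxonRelationshipBuilder implementation: with_retry and process_batches are hypothetical helpers, and handle_batch stands for whatever function writes one batch to Neo4j and returns the number of rows it handled. The sketch retries on any exception for brevity; real code would more likely retry only on connection errors.

```python
import logging
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from functools import partial

logger = logging.getLogger("taxon_relationships_sketch")


def with_retry(action, attempts=3, delay_seconds=1.0):
    """Call action(); on failure, log, wait, and try again up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:  # broad on purpose in this sketch
            logger.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise
            time.sleep(delay_seconds)


def process_batches(batches, handle_batch, max_workers=4,
                    attempts=3, delay_seconds=1.0):
    """Run handle_batch on every batch in a thread pool; return the total row count."""
    total = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(with_retry, partial(handle_batch, batch),
                        attempts, delay_seconds)
            for batch in batches
        ]
        for future in as_completed(futures):
            total += future.result()  # re-raises if a batch failed every attempt
    logger.info("processed %d rows", total)
    return total
```

Threads suit this workload because each batch spends most of its time waiting on the database. If batches run in parallel against a real database, each worker should open its own session, since Neo4j driver sessions are not safe to share between threads.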
Usage
- Initialization: Create an instance of TaxonRelationshipBuilder with the Neo4j connection URI and optional credentials.
- Build Relationships: Call build_all_taxon_relationships() to start processing predefined CSV files and creating relationships in the database.
- Clean-up: The close() method is automatically called to close the database connection once operations are complete.
3. Data Migration Validation: test_taxon_data_migration.py
This script uses pytest to define and run tests that validate the integrity of the data migration process.
Key Components
- Neo4j Connection Fixture: A pytest fixture neo4j_driver that sets up and tears down the Neo4j connection for tests.
- Node Type Validation: The test_node_types_exist function checks for the existence of expected node labels in the database.
- Relationship Type Validation: The test_relationship_types_exist function verifies that all expected relationship types are present.
Running Tests
Execute the tests using the pytest command. Ensure Neo4j is running and accessible at the specified URI.
Note: The development branch name is feature_taxon_data_migration.