====== Documentation for Neo4j Data Migration and Validation ======

This documentation covers the process of migrating fauna data from CSV files into a Neo4j graph database and then validating the integrity of the migration with automated tests. The process is divided into three main components: database management and node creation (''fauna_taxon_migration.py''), relationship migration (''fauna_taxon_migration_relationships.py''), and test scripts that validate the migrated data (''test_taxon_data_migration.py'').

===== 1. Database Management: fauna_taxon_migration.py =====

This script contains utility functions for managing the graph database structure, such as creating indexes, creating nodes, and processing batches of node data from CSV files. There are seven node types:

  * Fauna_Taxon: Holds taxon metadata.
  * Area: Holds area metadata.
  * Users: Holds user metadata.
  * Status: Holds status information about the taxon entry; for example, an entry could be "Valid", "public", "New", and so on.
  * Rank: The taxon rank, a level in the taxonomic hierarchy.
  * Author: The person who discovered the taxon.
  * Reference_Papers: Reference papers that refer to the taxon entry.

==== Functionalities ====

  * **Index Management**: The functions ''create_indexs_for_graphs()'' and ''drop_indexes_for_graphs()'' create and drop the indexes used to speed up queries.
  * **Node Creation**: Separate functions are provided for batch processing each type of node (''batch_taxon_transaction'', ''batch_area_transaction'', etc.), so each node type can be migrated independently.

==== Execution ====

Run the script directly to perform database setup tasks or node data migrations as needed. Modify the ''main()'' function to include or exclude specific operations.

===== 2. Data Relationship Migration: fauna_taxon_migration_relationships.py =====

==== Overview ====

''TaxonRelationshipBuilder'' is a Python class that migrates taxonomic relationships from CSV files into a Neo4j graph database. It manages the connection to the database, processes CSV files in batches, and creates the relationships between the node types listed above. There are nine relationship types that connect the fauna_taxon nodes with other nodes:

  * has_status: a relationship between the fauna_taxon node and the status node
  * found_in: a relationship between the fauna_taxon node and the area node
  * has_rank: a relationship between the fauna_taxon node and the rank node
  * has_author: a relationship between the fauna_taxon node and the author node
  * has_user: a relationship between the fauna_taxon node and the user node
  * has_parent: a relationship between two fauna_taxon nodes; it points to the parent of the taxon, which is also a fauna_taxon node
  * has_family: a relationship between two fauna_taxon nodes; it points to the family of the taxon, which is also a fauna_taxon node
  * has_genus: a relationship between two fauna_taxon nodes; it points to the genus of the taxon, which is also a fauna_taxon node
  * has_reference_paper: a relationship between the fauna_taxon node and the reference_paper node

==== Key Features ====

  * **Configurable Connection**: Establishes a connection to a Neo4j database, with retry logic for resilience.
  * **Concurrent Batch Processing**: Uses ''ThreadPoolExecutor'' to process data batches in parallel, improving throughput.
  * **Flexible Relationship Building**: Supports the creation of multiple relationship types from different CSV sources.
  * **Logging**: Uses Python's ''logging'' module for informative, debuggable output.
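The concurrent batch processing listed under Key Features can be illustrated with a minimal sketch. It is not the actual ''TaxonRelationshipBuilder'' code: the names ''chunk_rows'', ''write_batch'' and ''process_in_batches'', the batch size, and the CSV column names are all assumptions made for illustration, and the database write is replaced by a stand-in so that the example runs without a Neo4j instance.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def chunk_rows(rows, batch_size):
    """Yield successive lists of at most batch_size rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def write_batch(batch):
    """Stand-in for the database write (hypothetical).

    A real implementation would send the whole batch to Neo4j in one
    query. Here the function only reports how many rows it received.
    """
    return len(batch)


def process_in_batches(rows, batch_size=2, max_workers=4):
    """Submit every batch to a thread pool and return the total row count."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the batch order and re-raises worker exceptions.
        return sum(pool.map(write_batch, chunk_rows(rows, batch_size)))


# Hypothetical CSV content; the real column names are not given here.
SAMPLE_CSV = "taxon_id,status_id\n1,10\n2,10\n3,11\n4,12\n5,10\n"
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
total = process_in_batches(rows)
```

Grouping rows into batches keeps the number of round trips to the database small, while the thread pool lets several batches be in flight at once. The actual batch size and worker count used by the script may differ.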
==== Usage ====

  - **Initialization**: Create an instance of ''TaxonRelationshipBuilder'' with the Neo4j connection URI and optional credentials.
  - **Build Relationships**: Call ''build_all_taxon_relationships()'' to process the predefined CSV files and create the relationships in the database.
  - **Clean-up**: The ''close()'' method is called automatically to close the database connection once operations are complete.

===== 3. Data Migration Validation: test_taxon_data_migration.py =====

This script uses ''pytest'' to define and run tests that validate the integrity of the data migration process.

==== Key Components ====

  * **Neo4j Connection Fixture**: A ''pytest'' fixture, ''neo4j_driver'', that sets up and tears down the Neo4j connection for the tests.
  * **Node Type Validation**: The ''test_node_types_exist'' function checks that the expected node labels exist in the database.
  * **Relationship Type Validation**: The ''test_relationship_types_exist'' function verifies that all expected relationship types are present.

==== Running Tests ====

Execute the tests using the ''pytest'' command. Ensure Neo4j is running and accessible at the specified URI.

Note: development branch name: [[https://gitlab.smns-bw.org/fauna_developers/fauna_org/-/tree/feature_taxon_data_migration?ref_type=heads|feature_taxon_data_migration]]
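The node and relationship checks described under Key Components for ''test_taxon_data_migration.py'' boil down to comparing the names expected by the migration with the names found in the database. The following is a minimal sketch of that comparison, not the actual test code: the helper ''missing_names'' is hypothetical, and the exact spelling and casing of labels and relationship types in the database is an assumption taken from the lists in this documentation.

```python
# Names taken from the node and relationship lists in this documentation;
# the exact casing stored in the database is an assumption.
EXPECTED_NODE_LABELS = {
    "Fauna_Taxon", "Area", "Users", "Status",
    "Rank", "Author", "Reference_Papers",
}
EXPECTED_RELATIONSHIP_TYPES = {
    "has_status", "found_in", "has_rank", "has_author", "has_user",
    "has_parent", "has_family", "has_genus", "has_reference_paper",
}


def missing_names(expected, found):
    """Return the expected names that are absent from found, sorted."""
    return sorted(set(expected) - set(found))


# In a real test, the found names would be read from the database, for
# example with the Neo4j procedures `CALL db.labels()` and
# `CALL db.relationshipTypes()`. Here they are hard-coded so the sketch
# runs without a database connection.
found_labels = {"Fauna_Taxon", "Area", "Users", "Status", "Rank"}
missing_labels = missing_names(EXPECTED_NODE_LABELS, found_labels)
```

A test would then assert that the returned list is empty, so that a failure message names exactly the labels or relationship types that the migration did not create.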