====== Documentation for Neo4j Data Migration and Validation ======

This documentation covers migrating fauna data from CSV files into a Neo4j graph database and then validating the migration with automated tests. The process has three main components: utility scripts that set up the database and create nodes (''fauna_taxon_migration.py''), a relationship migration script that links those nodes (''fauna_taxon_migration_relationships.py''), and test scripts that validate the migration (''test_taxon_data_migration.py'').
===== 1. Database Management: fauna_taxon_migration.py =====

This script contains utility functions for managing the graph database structure, such as creating indexes, creating nodes and processing batches of node data from CSV files.

There are seven node types:

  * Fauna_Taxon: holds taxon metadata.
  * Area: holds area metadata.
  * Users: holds user metadata.
  * Status: holds status information about a taxon entry; for example, an entry could be "Valid", "public", "New", and so on.
  * Rank: the taxon rank, i.e. a level in the taxonomic hierarchy.
  * Author: the person who discovered the taxon.
  * Reference_Papers: reference papers that refer to the taxon entry.

==== Functionalities ====

  * **Index Management**: The functions ''create_indexs_for_graphs()'' and ''drop_indexes_for_graphs()'' create and drop the indexes used to speed up queries.
  * **Node Creation**: Separate functions are provided for batch processing different types of nodes (''batch_taxon_transaction'', ''batch_area_transaction'', etc.), allowing for flexible and organized data migration.
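
The two functionalities above can be illustrated with a minimal sketch. It is not the script's actual code: the indexed property ''id'', the helper names, and the Cypher text (Neo4j 4.x or later syntax) are assumptions made for illustration. The helpers only build query strings and split CSV rows into batches, so they run without a database.

```python
import csv
import io
from itertools import islice

# Node labels from the list above. The indexed property "id" is an assumption;
# the real property names are not given in this documentation.
NODE_LABELS = ["Fauna_Taxon", "Area", "Users", "Status", "Rank", "Author", "Reference_Papers"]


def index_statement(label, prop="id"):
    """Build a Cypher statement that creates an index on one label/property pair."""
    return f"CREATE INDEX IF NOT EXISTS FOR (n:{label}) ON (n.{prop})"


def node_batch_query(label):
    """Build a Cypher statement that merges one batch of rows as nodes of `label`."""
    return f"UNWIND $rows AS row MERGE (n:{label} {{id: row.id}}) SET n += row"


def read_csv_batches(file_obj, batch_size):
    """Yield lists of at most `batch_size` row dicts from a CSV file object."""
    reader = csv.DictReader(file_obj)
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


# Small in-memory CSV standing in for a real export file.
sample = io.StringIO("id,name\n1,Aves\n2,Mammalia\n3,Insecta\n")
batches = list(read_csv_batches(sample, batch_size=2))
statements = [index_statement(label) for label in NODE_LABELS]
```

Each batch would then be passed as the ''$rows'' parameter of the query returned by ''node_batch_query'' inside a write transaction.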
==== Execution ====

Run the script directly to perform database setup tasks or node data migrations as needed. Modify the ''main()''  function to include or exclude specific operations.

===== 2. Data Relationship Migration: fauna_taxon_migration_relationships.py =====

==== Overview ====

''TaxonRelationshipBuilder'' is a Python class that migrates taxonomic relationships from CSV files into a Neo4j graph database. It manages the database connection, processes CSV files in batches, and creates the relationships between the node types described above.

There are nine relationship types that connect the fauna_taxon nodes with other nodes:

  * has_status: a relationship between a fauna_taxon node and a status node
  * found_in: a relationship between a fauna_taxon node and an area node
  * has_rank: a relationship between a fauna_taxon node and a rank node
  * has_author: a relationship between a fauna_taxon node and an author node
  * has_user: a relationship between a fauna_taxon node and a user node
  * has_parent: a relationship between two fauna_taxon nodes; it points to the taxon's parent, which is also a fauna_taxon node
  * has_family: a relationship between two fauna_taxon nodes; it points to the taxon's family, which is also a fauna_taxon node
  * has_genus: a relationship between two fauna_taxon nodes; it points to the taxon's genus, which is also a fauna_taxon node
  * has_reference_paper: a relationship between a fauna_taxon node and a reference_paper node
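
As a rough illustration of the list above, the relationship types can be written as a mapping from relationship type to target node label, from which a batch query is generated. This is only a sketch: the label spellings follow the node types in section 1, and the property names ''id'', ''taxon_id'', and ''target_id'' are hypothetical, since the CSV columns are not documented here.

```python
# Relationship type -> label of the node that the Fauna_Taxon node points to.
RELATIONSHIP_TARGETS = {
    "has_status": "Status",
    "found_in": "Area",
    "has_rank": "Rank",
    "has_author": "Author",
    "has_user": "Users",
    "has_parent": "Fauna_Taxon",
    "has_family": "Fauna_Taxon",
    "has_genus": "Fauna_Taxon",
    "has_reference_paper": "Reference_Papers",
}


def relationship_query(rel_type):
    """Build a Cypher statement that creates `rel_type` for one batch of rows.

    The row keys taxon_id and target_id are assumed names, not the real CSV columns.
    """
    target = RELATIONSHIP_TARGETS[rel_type]
    return (
        "UNWIND $rows AS row "
        "MATCH (t:Fauna_Taxon {id: row.taxon_id}) "
        f"MATCH (o:{target} {{id: row.target_id}}) "
        f"MERGE (t)-[:{rel_type}]->(o)"
    )


parent_query = relationship_query("has_parent")
```

Using ''MERGE'' rather than ''CREATE'' keeps the statement idempotent, so re-running a batch does not duplicate relationships.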

==== Key Features ====

  * **Configurable Connection**: Establishes a connection to a Neo4j database with retry logic for resilience.
  * **Concurrent Batch Processing**: Leverages ''ThreadPoolExecutor''  for parallel processing of data batches, improving efficiency.
  * **Flexible Relationship Building**: Supports the creation of multiple relationship types from different CSV sources.
  * **Logging**: Uses Python's ''logging'' module for informative, debuggable output.
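
The retry and concurrency features listed above could look roughly like the following. This is a sketch under assumptions, not the class's real implementation: the helper names ''with_retry'' and ''process_batches'', the attempt count, and the ''write_batch'' callback are all hypothetical. The callback stands in for the code that writes one batch to Neo4j and returns the number of rows written.

```python
import logging
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

logger = logging.getLogger("taxon_migration")


def with_retry(operation, attempts=3, delay=0.0):
    """Call operation(); on failure log a warning and retry, up to `attempts` calls."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as exc:
            logger.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise
            time.sleep(delay)


def process_batches(batches, write_batch, max_workers=4):
    """Write all batches in parallel and return the total number of rows written.

    With the Neo4j driver, each worker should open its own session, because
    sessions (unlike the driver object) are not safe to share across threads.
    """
    total = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(with_retry, lambda b=batch: write_batch(b))
            for batch in batches
        ]
        for future in as_completed(futures):
            total += future.result()
    return total


# Demonstration with a stand-in writer that just counts rows.
rows_written = process_batches([[1, 2], [3], [4, 5, 6]], write_batch=len)
```

The ''b=batch'' default argument binds each batch at submission time, which avoids every worker seeing only the last batch of the loop.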
==== Usage ====

  - **Initialization**: Create an instance of ''TaxonRelationshipBuilder''  with the Neo4j connection URI and optional credentials.
  - **Build Relationships**: Call ''build_all_taxon_relationships()''  to start processing predefined CSV files and creating relationships in the database.
  - **Clean-up**: The ''close()''  method is automatically called to close the database connection once operations are complete.
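
The lifecycle described in these steps can be sketched with a stand-in class. The class below is not ''TaxonRelationshipBuilder'' itself and never touches Neo4j; the constructor parameters ''uri'', ''user'', and ''password'', the example URI, and the credentials are assumptions. It only shows one plausible way ''close()'' can run automatically, namely via ''try''/''finally''.

```python
class RelationshipBuilderSketch:
    """Stand-in that mimics the builder's lifecycle only (no database access)."""

    def __init__(self, uri, user=None, password=None):
        self.uri = uri
        # Credentials are optional, as described in the Initialization step.
        self.auth = (user, password) if user is not None else None
        self.closed = False
        self.built = []

    def build_all_taxon_relationships(self):
        """Process each relationship type, then always release the connection."""
        try:
            for rel_type in ("has_status", "found_in", "has_rank"):
                self.built.append(rel_type)  # the real class would write to Neo4j here
        finally:
            self.close()

    def close(self):
        """Mark the (pretend) connection as closed."""
        self.closed = True


builder = RelationshipBuilderSketch("bolt://localhost:7687", user="neo4j", password="secret")
builder.build_all_taxon_relationships()
```

Because the clean-up sits in a ''finally'' block, the connection is released even if one of the relationship batches raises an error.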
===== 3. Data Migration Validation: test_taxon_data_migration.py =====

This script uses ''pytest''  to define and run tests that validate the integrity of the data migration process.

==== Key Components ====

  * **Neo4j Connection Fixture**: A ''pytest''  fixture ''neo4j_driver''  that sets up and tears down the Neo4j connection for tests.
  * **Node Type Validation**: The ''test_node_types_exist''  function checks for the existence of expected node labels in the database.
  * **Relationship Type Validation**: The ''test_relationship_types_exist''  function verifies that all expected relationship types are present.
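
The two validation checks can be sketched as small helpers that a test would call with a session obtained from the ''neo4j_driver'' fixture. The helper names and the expected sets are assumptions based on the lists in sections 1 and 2; ''db.labels()'' and ''db.relationshipTypes()'' are Neo4j's built-in procedures. The helpers accept any object with a ''run(query)'' method, so the example uses a fake session instead of a live database.

```python
EXPECTED_LABELS = {
    "Fauna_Taxon", "Area", "Users", "Status", "Rank", "Author", "Reference_Papers",
}
EXPECTED_RELATIONSHIPS = {
    "has_status", "found_in", "has_rank", "has_author", "has_user",
    "has_parent", "has_family", "has_genus", "has_reference_paper",
}


def missing_labels(session, expected=EXPECTED_LABELS):
    """Return the expected node labels that are absent from the database."""
    found = {record["label"] for record in session.run("CALL db.labels()")}
    return set(expected) - found


def missing_relationship_types(session, expected=EXPECTED_RELATIONSHIPS):
    """Return the expected relationship types that are absent from the database."""
    result = session.run("CALL db.relationshipTypes()")
    found = {record["relationshipType"] for record in result}
    return set(expected) - found


class FakeSession:
    """Stand-in for a Neo4j session that returns canned procedure results."""

    def run(self, query):
        if "labels" in query:
            return [{"label": name} for name in EXPECTED_LABELS - {"Author"}]
        return [{"relationshipType": name} for name in EXPECTED_RELATIONSHIPS]


absent_labels = missing_labels(FakeSession())
absent_relationships = missing_relationship_types(FakeSession())
```

In a ''pytest'' test, the check would reduce to asserting that each helper returns an empty set, which also makes the failure message list exactly what is missing.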
==== Running Tests ====

Execute the tests using the ''pytest''  command. Ensure Neo4j is running and accessible at the specified URI.

Note: the development branch is [[https://gitlab.smns-bw.org/fauna_developers/fauna_org/-/tree/feature_taxon_data_migration?ref_type=heads|feature_taxon_data_migration]].