Validation / Pre-pipeline Scripts

This subfolder contains three validation scripts designed to ensure the integrity and correctness of configuration, metadata, and RNA-seq dataframe files used in the REMIND-Cancer project.


1. validate_configuration_file.py

This script validates paths in the configuration file to ensure they exist and are correctly formatted.

Usage:

python src/utils/validate_configuration_file.py -c path/to/configuration_file.json

Checks Performed:

  • Ensures the configuration file exists.

  • Verifies that paths (except path_to_results) are valid.

  • Checks if path_to_fimo_executable is an executable file.

  • Reports missing or incorrectly formatted paths.


2. validate_metadata_file.py

This script validates a metadata CSV file by checking required columns and ensuring referenced files exist.

Usage:

python src/utils/validate_metadata_file.py -m path/to/metadata.csv

Checks Performed:

  • Ensures the metadata file exists.

  • Confirms required columns exist:pid, tumor_origin, path_to_wgs_file, cohort, ge_data_available.

  • Verifies whether all path_to_wgs_file entries exist.

  • Checks the validity of corresponding VCF files (expects .vcf filenames based on .wgs entries).

  • Prints the number of valid WGS and VCF files.


3. validate_rnaseq_dataframe.py

This script validates an RNA-seq dataframe to ensure correct formatting and column structure.

Usage:

python src/utils/validate_rnaseq_dataframe.py -r path/to/rnaseq.csv

Checks Performed:

  • Ensures the RNA-seq file exists.

  • Confirms that the file is comma-separated.

  • Verifies the presence of the pid column.

  • Counts and prints the number of PIDs and genes.


General Notes

  • Ensure you have Python 3.x installed.

  • Install dependencies using:

    pip install -r requirements.txt
    
  • Modify file paths accordingly before running scripts.

These validation scripts help maintain data integrity and prevent issues before running further analyses.