Validation / Pre-pipeline Scripts
This subfolder contains three validation scripts designed to ensure the integrity and correctness of configuration, metadata, and RNA-seq dataframe files used in the REMIND-Cancer project.
1. validate_configuration_file.py
This script validates paths in the configuration file to ensure they exist and are correctly formatted.
Usage:
python src/utils/validate_configuration_file.py -c path/to/configuration_file.json
Checks Performed:
Ensures the configuration file exists.
Verifies that paths (except path_to_results) are valid.
Checks if path_to_fimo_executable is an executable file.
Reports missing or incorrectly formatted paths.
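The checks above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function name validate_config is hypothetical, and it assumes the configuration is a flat JSON object in which every path is a string value whose key starts with path_to_.

```python
import json
import os
from pathlib import Path


def validate_config(config_path):
    """Return a list of problems found in a JSON configuration file."""
    config_file = Path(config_path)
    if not config_file.is_file():
        return [f"configuration file not found: {config_path}"]

    with config_file.open() as handle:
        config = json.load(handle)

    problems = []
    for key, value in config.items():
        # Assumption: only string values under "path_to_*" keys are paths.
        if not key.startswith("path_to_") or not isinstance(value, str):
            continue
        if key == "path_to_results":
            continue  # excluded from the path checks, as described above
        path = Path(value)
        if not path.exists():
            problems.append(f"{key}: path does not exist: {value}")
        elif key == "path_to_fimo_executable":
            # The FIMO entry must be a regular file with execute permission.
            if not (path.is_file() and os.access(value, os.X_OK)):
                problems.append(f"{key}: not an executable file: {value}")
    return problems
```

An empty list returned by validate_config("path/to/configuration_file.json") would mean that every checked path passed.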
2. validate_metadata_file.py
This script validates a metadata CSV file by checking required columns and ensuring referenced files exist.
Usage:
python src/utils/validate_metadata_file.py -m path/to/metadata.csv
Checks Performed:
Ensures the metadata file exists.
Confirms required columns exist: pid, tumor_origin, path_to_wgs_file, cohort, ge_data_available.
Verifies whether all path_to_wgs_file entries exist.
Checks the validity of corresponding VCF files (expects .vcf filenames based on .wgs entries).
Prints the number of valid WGS and VCF files.
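A rough sketch of these metadata checks, using only the standard library, might look like the following. The function name validate_metadata is hypothetical, and the rule for locating the VCF file (same path as the WGS entry with the suffix swapped to .vcf) is an assumption made for illustration; the actual script may derive the name differently.

```python
import csv
from pathlib import Path

REQUIRED_COLUMNS = ["pid", "tumor_origin", "path_to_wgs_file", "cohort", "ge_data_available"]


def validate_metadata(metadata_path):
    """Return (missing_columns, n_valid_wgs, n_valid_vcf) for a metadata CSV."""
    metadata_file = Path(metadata_path)
    if not metadata_file.is_file():
        raise FileNotFoundError(f"metadata file not found: {metadata_path}")

    with metadata_file.open(newline="") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames or []
        missing = [col for col in REQUIRED_COLUMNS if col not in fieldnames]
        if missing:
            return missing, 0, 0  # cannot check files without the columns
        rows = list(reader)

    n_wgs = n_vcf = 0
    for row in rows:
        entry = (row["path_to_wgs_file"] or "").strip()
        if not entry:
            continue  # empty entry: nothing to check
        wgs = Path(entry)
        if wgs.is_file():
            n_wgs += 1
        # Assumption: the VCF sits beside the WGS file, suffix swapped to .vcf.
        if wgs.with_suffix(".vcf").is_file():
            n_vcf += 1
    return missing, n_wgs, n_vcf
```

Returning the counts instead of printing them keeps the sketch easy to test; a command-line wrapper could print them as the script does.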
3. validate_rnaseq_dataframe.py
This script validates an RNA-seq dataframe to ensure correct formatting and column structure.
Usage:
python src/utils/validate_rnaseq_dataframe.py -r path/to/rnaseq.csv
Checks Performed:
Ensures the RNA-seq file exists.
Confirms that the file is comma-separated.
Verifies the presence of the pid column.
Counts and prints the number of PIDs and genes.
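The RNA-seq checks could be approximated as shown below. The function name validate_rnaseq is hypothetical, and the layout is an assumption: one row per PID, one pid column, and every other column treated as a gene. The comma check is a simple heuristic (a header that parses into a single field suggests another delimiter), not necessarily what the script does.

```python
import csv
from pathlib import Path


def validate_rnaseq(rnaseq_path):
    """Return (n_pids, n_genes) for a comma-separated RNA-seq table."""
    rnaseq_file = Path(rnaseq_path)
    if not rnaseq_file.is_file():
        raise FileNotFoundError(f"RNA-seq file not found: {rnaseq_path}")

    with rnaseq_file.open(newline="") as handle:
        reader = csv.reader(handle, delimiter=",")
        header = next(reader, [])
        # Heuristic: a single-field header suggests tabs or another delimiter.
        if len(header) < 2:
            raise ValueError("file does not appear to be comma-separated")
        if "pid" not in header:
            raise ValueError("missing required column: pid")
        # Assumption: each non-empty data row corresponds to one PID.
        n_pids = sum(1 for row in reader if row)

    # Assumption: every column other than pid is a gene.
    n_genes = len(header) - 1
    return n_pids, n_genes
```

For a file with the header pid,TP53,TERT and two data rows, this sketch would return (2, 2).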
General Notes
Ensure you have Python 3.x installed.
Install dependencies using:
pip install -r requirements.txt
Replace the example paths in the usage commands with your own file paths before running the scripts.
These validation scripts help maintain data integrity and prevent issues before running further analyses.