Setting Up The Pipeline
This script automates the setup of the REMIND-Cancer project’s folder structure and the creation of a JSON file to track results. It takes a configuration file as input and performs the following:
Creates a structured folder system for patient data.
Copies WGS (Whole Genome Sequencing) VCF files into the structured folder.
Generates a results JSON file to track processed data.
Optional: Run Validation Scripts Before This
Before running the main pipeline, ensure everything is set up correctly by validating the required files. Navigate to the utils subfolder and run the following scripts:
Validate Configuration File
python src/utils/validate_configuration_file.py -c path/to/configuration_file.json
Validate Metadata File
python src/utils/validate_metadata_file.py -m path/to/metadata.csv
Validate RNA-Seq Dataframe
python src/utils/validate_rnaseq_dataframe.py -r path/to/rnaseq.csv
These scripts will check for missing files, incorrect formats, and inconsistencies to prevent errors in the pipeline.
Usage
Run the script in order to set up the pipeline structure using:
python src/pipeline_setup/create_initial_structure.py -c path/to/configuration_file.json
What This Script Does
1. Load Configuration File
The script begins by reading a JSON configuration file, which contains:
The dataset name.
The path to the metadata CSV file.
The output location for patient folders.
The output path for the results JSON file.
2. Create Folder Structure for Patients
This step reads the metadata CSV file and generates a directory structure:
output_path_to_patient_folders/
├── patient_1_tumor/
│ ├── sample1_original.vcf
├── patient_2_tumor/
│ ├── sample2_original.vcf
Each patient’s folder is named as
pid_tumor_origin.Each VCF file is copied and renamed with
_original.vcf.If a VCF file is missing, a warning is logged.
3. Create a Results Tracking JSON
A JSON file is generated to track the presence of sequencing and expression data.
Example Output (results.json)
{
"results": {
"original": {
"primary_tumor_wgs": ["path/to/patient_1/sample1_original.vcf"],
"primary_tumor_wgs_and_rnaseq": [],
"metastasic_tumor_wgs": [],
"metastasic_tumor_wgs_and_rnaseq": []
}
}
}
The script:
Extracts
pidfrom each patient folder.Checks
tumor_originand whether RNA-seq data exists.Classifies files accordingly.
Detailed Breakdown of Functions
create_folder_structure(metadata_path, output_folder)
Reads the metadata CSV.
Iterates through each patient ID (
pid).Creates a patient subfolder if it doesn’t exist.
Copies the VCF file to the correct location.
Logs missing files.
create_results_json(metadata_path, patient_folders_path, results_json_path)
Reads metadata to extract tumor and expression data.
Scans patient folders for VCF files.
Categorizes each file into one of four groups:
primary_tumor_wgsprimary_tumor_wgs_and_rnaseqmetastasic_tumor_wgsmetastasic_tumor_wgs_and_rnaseq
Saves results as a JSON file.
main(config_path)
Parses the configuration file.
Calls
create_folder_structure().Calls
create_results_json().
Expected Console Output
When you run the script, you’ll see output similar to:
#### (1/2) Creating Folder Structure ####
Setting up folder structure: 100%|██████████████| 50/50 [00:03<00:00, 15.5it/s]
✅ Folder structure created at: /path/to/patient_folders
❗ Number of VCF files not copied: 3
#### (2/2) Creating Results JSON File ####
Generating JSON file: 100%|██████████████| 50/50 [00:02<00:00, 20.2it/s]
✅ Results JSON saved at: /path/to/results.json
Prerequisites
Python 3.x installed.
Install dependencies using:
pip install -r requirements.txt
Ensure all file paths in
configuration_file.jsonare correctly set.
Troubleshooting
Issue |
Solution |
|---|---|
|
Check that the metadata CSV path is correct in the config file. |
|
Ensure the |
|
Check that the metadata contains valid tumor and RNA-seq data references. |
This script automates the setup process, ensuring that data is structured properly before pipeline execution. 🚀