Setting Up The Pipeline

This script automates the setup of the REMIND-Cancer project’s folder structure and the creation of a JSON file to track results. It takes a configuration file as input and performs the following:

Creates a structured folder system for patient data.
Copies WGS (Whole Genome Sequencing) VCF files into the structured folder.
Generates a results JSON file to track processed data.

Optional: Run Validation Scripts First

Before running the main pipeline, validate the required input files using the scripts in the src/utils subfolder. The commands below use paths relative to the repository root:

Validate Configuration File

python src/utils/validate_configuration_file.py -c path/to/configuration_file.json

Validate Metadata File

python src/utils/validate_metadata_file.py -m path/to/metadata.csv

Validate RNA-Seq Dataframe

python src/utils/validate_rnaseq_dataframe.py -r path/to/rnaseq.csv

These scripts will check for missing files, incorrect formats, and inconsistencies to prevent errors in the pipeline.

Usage

Run the script to set up the pipeline structure:

python src/pipeline_setup/create_initial_structure.py -c path/to/configuration_file.json

What This Script Does

1. Load Configuration File

The script begins by reading a JSON configuration file, which contains:

The dataset name.
The path to the metadata CSV file.
The output location for patient folders.
The output path for the results JSON file.
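
A minimal sketch of this loading step, assuming the configuration is a flat JSON object. The key names below are illustrative guesses, not confirmed names from the project, so check them against your own configuration file.

```python
import json
from pathlib import Path

# Hypothetical key names; adjust them to match the actual configuration file.
REQUIRED_KEYS = (
    "dataset_name",
    "path_to_metadata",
    "output_path_to_patient_folders",
    "path_to_results_json",
)


def load_config(config_path):
    """Read a JSON configuration file and check that the expected keys exist."""
    with Path(config_path).open(encoding="utf-8") as handle:
        config = json.load(handle)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"Configuration is missing keys: {', '.join(missing)}")
    return config
```

Failing early on a missing key gives a clearer message than an error raised halfway through the folder setup.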

2. Create Folder Structure for Patients

This step reads the metadata CSV file and generates a directory structure:

output_path_to_patient_folders/
 ├── patient_1_tumor/
 │   ├── sample1_original.vcf
 ├── patient_2_tumor/
 │   ├── sample2_original.vcf

Each patient folder is named pid_tumor_origin, combining the patient ID (pid) and the tumor origin from the metadata.
Each VCF file is copied and renamed with the _original.vcf suffix.
If a VCF file is missing, a warning is logged.
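
The folder-creation step described above can be sketched as follows. This is not the project's implementation: it uses the standard-library `csv` module to stay self-contained, the function name `copy_vcfs_into_patient_folders` is hypothetical, and the column names `pid` and `tumor_origin` are assumed (`path_to_wgs_file` is the column named in the troubleshooting section). It also assumes uncompressed files ending in `.vcf`.

```python
import csv
import shutil
from pathlib import Path


def copy_vcfs_into_patient_folders(metadata_path, output_folder):
    """Copy each patient's VCF into <output_folder>/<pid>_<tumor_origin>/.

    Returns the list of source paths that could not be copied.
    """
    output_folder = Path(output_folder)
    missing = []
    with open(metadata_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Assumed column names: pid, tumor_origin, path_to_wgs_file.
            source = Path(row["path_to_wgs_file"])
            patient_dir = output_folder / f"{row['pid']}_{row['tumor_origin']}"
            patient_dir.mkdir(parents=True, exist_ok=True)
            if not source.is_file():
                # Log the missing file and carry on with the next patient.
                print(f"Warning: VCF file does not exist: {source}")
                missing.append(str(source))
                continue
            # sample1.vcf -> sample1_original.vcf
            shutil.copy(source, patient_dir / f"{source.stem}_original.vcf")
    return missing
```

Returning the missing paths lets the caller report how many VCF files were not copied, similar to the count shown in the console output.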

3. Create a Results Tracking JSON

A JSON file is generated to track the presence of sequencing and expression data.

Example Output (`results.json`)

{
    "results": {
        "original": {
            "primary_tumor_wgs": ["path/to/patient_1_tumor/sample1_original.vcf"],
            "primary_tumor_wgs_and_rnaseq": [],
            "metastasic_tumor_wgs": [],
            "metastasic_tumor_wgs_and_rnaseq": []
        }
    }
}

The script:

Extracts pid from each patient folder.
Checks tumor_origin and whether RNA-seq data exists.
Classifies files accordingly.
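
The classification rule can be illustrated with a small helper. The function name `classify_sample` is hypothetical, and the values stored in tumor_origin are not documented here, so this sketch assumes that metastatic samples have a value starting with "met" and treats everything else as primary. The group names keep the spelling used in the results JSON.

```python
def classify_sample(tumor_origin, has_rnaseq):
    """Return the results JSON group name for one sample."""
    # Assumption: a tumor_origin starting with "met" (any case) is metastatic;
    # any other value is treated as a primary tumor.
    is_metastatic = tumor_origin.strip().lower().startswith("met")
    prefix = "metastasic_tumor" if is_metastatic else "primary_tumor"
    suffix = "wgs_and_rnaseq" if has_rnaseq else "wgs"
    return f"{prefix}_{suffix}"
```

For example, `classify_sample("primary", False)` returns `"primary_tumor_wgs"`.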

Detailed Breakdown of Functions

`create_folder_structure(metadata_path, output_folder)`

Reads the metadata CSV.
Iterates through each patient ID (pid).
Creates a patient subfolder if it doesn’t exist.
Copies the VCF file to the correct location.
Logs missing files.

`create_results_json(metadata_path, patient_folders_path, results_json_path)`

Reads metadata to extract tumor and expression data.
Scans patient folders for VCF files.
Categorizes each file into one of four groups:
- primary_tumor_wgs
- primary_tumor_wgs_and_rnaseq
- metastasic_tumor_wgs
- metastasic_tumor_wgs_and_rnaseq
Saves results as a JSON file.
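
A rough sketch of the scan-and-save part, under stated assumptions: the function name `write_results_json` and the `group_by_folder` argument are hypothetical, and how each folder's group is derived from the metadata is left out. The output layout follows the example results JSON shown earlier.

```python
import json
from pathlib import Path

GROUPS = (
    "primary_tumor_wgs",
    "primary_tumor_wgs_and_rnaseq",
    "metastasic_tumor_wgs",
    "metastasic_tumor_wgs_and_rnaseq",
)


def write_results_json(patient_folders_path, group_by_folder, results_json_path):
    """Scan patient folders for *_original.vcf files and save them by group.

    group_by_folder maps a patient folder name to one of GROUPS.
    """
    original = {group: [] for group in GROUPS}
    # Look one level down: <patient_folders_path>/<patient folder>/<file>.
    for vcf in sorted(Path(patient_folders_path).glob("*/*_original.vcf")):
        group = group_by_folder.get(vcf.parent.name)
        if group in original:  # skip folders with no known group
            original[group].append(str(vcf))
    results = {"results": {"original": original}}
    with open(results_json_path, "w", encoding="utf-8") as handle:
        json.dump(results, handle, indent=4)
    return results
```

Starting every group as an empty list keeps the JSON shape stable even when a dataset has no samples in a category.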

`main(config_path)`

Parses the configuration file.
Calls create_folder_structure().
Calls create_results_json().
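
How the command line might feed `main` can be sketched with `argparse`. The `-c` flag comes from the usage command above; the `--config` long form, the description text, and the function name `parse_args` are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the command-line arguments for the setup script."""
    parser = argparse.ArgumentParser(
        description="Set up the patient folder structure and results JSON."
    )
    # -c matches the usage command; --config is an assumed long form.
    parser.add_argument(
        "-c", "--config", required=True,
        help="Path to the JSON configuration file.",
    )
    return parser.parse_args(argv)
```

With this in place, the entry point would call something like `main(parse_args().config)`.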

Expected Console Output

When you run the script, you’ll see output similar to:

#### (1/2) Creating Folder Structure ####
Setting up folder structure: 100%|██████████████| 50/50 [00:03<00:00, 15.5it/s]
✅ Folder structure created at: /path/to/patient_folders
❗ Number of VCF files not copied: 3

#### (2/2) Creating Results JSON File ####
Generating JSON file: 100%|██████████████| 50/50 [00:02<00:00, 20.2it/s]
✅ Results JSON saved at: /path/to/results.json

Prerequisites

Python 3.x installed.
Install dependencies using:
```
pip install -r requirements.txt
```
Ensure all file paths in configuration_file.json are correctly set.

Troubleshooting

| Issue | Solution |
| --- | --- |
| `FileNotFoundError: Metadata file not found` | Check that the metadata CSV path is correct in the config file. |
| `VCF file does not exist` | Ensure the `path_to_wgs_file` column in the metadata file contains valid file paths. |
| `Results JSON is empty` | Check that the metadata contains valid tumor and RNA-seq data references. |
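
To track down `VCF file does not exist` warnings before running the script, a quick check like the one below may help. It is a sketch, not part of the project: the function name `find_missing_vcfs` is hypothetical and the `pid` column name is assumed.

```python
import csv
from pathlib import Path


def find_missing_vcfs(metadata_path):
    """Return (pid, path) pairs whose path_to_wgs_file is not an existing file."""
    missing = []
    with open(metadata_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # An empty or absent value counts as missing too.
            path = (row.get("path_to_wgs_file") or "").strip()
            if not path or not Path(path).is_file():
                missing.append((row.get("pid"), path))
    return missing
```

An empty list means every row in the metadata points at a file that exists on disk.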

This script automates the setup process, ensuring that data is structured properly before pipeline execution. 🚀