Setting Up The Pipeline

This script automates the setup of the REMIND-Cancer project’s folder structure and the creation of a JSON file to track results. It takes a configuration file as input and performs the following:

  1. Creates a structured folder system for patient data.

  2. Copies WGS (Whole Genome Sequencing) VCF files into the structured folder.

  3. Generates a results JSON file to track processed data.


Optional: Run Validation Scripts Before This

Before running the main pipeline, ensure everything is set up correctly by validating the required files. Navigate to the utils subfolder and run the following scripts:

Validate Configuration File

python src/utils/validate_configuration_file.py -c path/to/configuration_file.json

Validate Metadata File

python src/utils/validate_metadata_file.py -m path/to/metadata.csv

Validate RNA-Seq Dataframe

python src/utils/validate_rnaseq_dataframe.py -r path/to/rnaseq.csv

These scripts will check for missing files, incorrect formats, and inconsistencies to prevent errors in the pipeline.


Usage

Run the script in order to set up the pipeline structure using:

python src/pipeline_setup/create_initial_structure.py -c path/to/configuration_file.json

What This Script Does

1. Load Configuration File

The script begins by reading a JSON configuration file, which contains:

  • The dataset name.

  • The path to the metadata CSV file.

  • The output location for patient folders.

  • The output path for the results JSON file.

2. Create Folder Structure for Patients

This step reads the metadata CSV file and generates a directory structure:

output_path_to_patient_folders/
 ├── patient_1_tumor/
 │   ├── sample1_original.vcf
 ├── patient_2_tumor/
 │   ├── sample2_original.vcf
  • Each patient’s folder is named as pid_tumor_origin.

  • Each VCF file is copied and renamed with _original.vcf.

  • If a VCF file is missing, a warning is logged.

3. Create a Results Tracking JSON

A JSON file is generated to track the presence of sequencing and expression data.

Example Output (results.json)

{
    "results": {
        "original": {
            "primary_tumor_wgs": ["path/to/patient_1/sample1_original.vcf"],
            "primary_tumor_wgs_and_rnaseq": [],
            "metastasic_tumor_wgs": [],
            "metastasic_tumor_wgs_and_rnaseq": []
        }
    }
}

The script:

  • Extracts pid from each patient folder.

  • Checks tumor_origin and whether RNA-seq data exists.

  • Classifies files accordingly.

Detailed Breakdown of Functions

create_folder_structure(metadata_path, output_folder)

  • Reads the metadata CSV.

  • Iterates through each patient ID (pid).

  • Creates a patient subfolder if it doesn’t exist.

  • Copies the VCF file to the correct location.

  • Logs missing files.

create_results_json(metadata_path, patient_folders_path, results_json_path)

  • Reads metadata to extract tumor and expression data.

  • Scans patient folders for VCF files.

  • Categorizes each file into one of four groups:

    • primary_tumor_wgs

    • primary_tumor_wgs_and_rnaseq

    • metastasic_tumor_wgs

    • metastasic_tumor_wgs_and_rnaseq

  • Saves results as a JSON file.

main(config_path)

  • Parses the configuration file.

  • Calls create_folder_structure().

  • Calls create_results_json().


Expected Console Output

When you run the script, you’ll see output similar to:

#### (1/2) Creating Folder Structure ####
Setting up folder structure: 100%|██████████████| 50/50 [00:03<00:00, 15.5it/s] Folder structure created at: /path/to/patient_folders
❗ Number of VCF files not copied: 3

#### (2/2) Creating Results JSON File ####
Generating JSON file: 100%|██████████████| 50/50 [00:02<00:00, 20.2it/s] Results JSON saved at: /path/to/results.json

Prerequisites

  • Python 3.x installed.

  • Install dependencies using:

    pip install -r requirements.txt
    
  • Ensure all file paths in configuration_file.json are correctly set.


Troubleshooting

Issue

Solution

FileNotFoundError: Metadata file not found

Check that the metadata CSV path is correct in the config file.

VCF file does not exist

Ensure the path_to_wgs_file column in the metadata file contains valid file paths.

Results JSON is empty

Check that the metadata contains valid tumor and RNA-seq data references.

This script automates the setup process, ensuring that data is structured properly before pipeline execution. 🚀