# Setting Up The Pipeline This script automates the setup of the **REMIND-Cancer** project’s folder structure and the creation of a JSON file to track results. It takes a **configuration file** as input and performs the following: 1. **Creates a structured folder system** for patient data. 2. **Copies WGS (Whole Genome Sequencing) VCF files** into the structured folder. 3. **Generates a results JSON file** to track processed data. --- ## **Optional: Run Validation Scripts Before This** Before running the main pipeline, ensure everything is set up correctly by validating the required files. Navigate to the `utils` subfolder and run the following scripts: ### **Validate Configuration File** ```bash python src/utils/validate_configuration_file.py -c path/to/configuration_file.json ``` ### **Validate Metadata File** ```bash python src/utils/validate_metadata_file.py -m path/to/metadata.csv ``` ### **Validate RNA-Seq Dataframe** ```bash python src/utils/validate_rnaseq_dataframe.py -r path/to/rnaseq.csv ``` These scripts will check for missing files, incorrect formats, and inconsistencies to prevent errors in the pipeline. --- ## **Usage** Run the script in order to set up the pipeline structure using: ```bash python src/pipeline_setup/create_initial_structure.py -c path/to/configuration_file.json ``` --- ## **What This Script Does** ### **1. Load Configuration File** The script begins by reading a JSON configuration file, which contains: - The dataset name. - The path to the **metadata CSV file**. - The output location for patient folders. - The output path for the results JSON file. ### **2. Create Folder Structure for Patients** This step reads the **metadata CSV file** and generates a directory structure: ``` output_path_to_patient_folders/ ├── patient_1_tumor/ │ ├── sample1_original.vcf ├── patient_2_tumor/ │ ├── sample2_original.vcf ``` - Each **patient’s folder** is named as `pid_tumor_origin`. - Each **VCF file** is copied and renamed with `_original.vcf`. - If a **VCF file is missing**, a warning is logged. ### **3. Create a Results Tracking JSON** A JSON file is generated to track the presence of sequencing and expression data. #### **Example Output (`results.json`)** ```json { "results": { "original": { "primary_tumor_wgs": ["path/to/patient_1/sample1_original.vcf"], "primary_tumor_wgs_and_rnaseq": [], "metastasic_tumor_wgs": [], "metastasic_tumor_wgs_and_rnaseq": [] } } } ``` The script: - Extracts `pid` from each patient folder. - Checks `tumor_origin` and whether RNA-seq data exists. - Classifies files accordingly. ## **Detailed Breakdown of Functions** ### **`create_folder_structure(metadata_path, output_folder)`** - Reads the metadata CSV. - Iterates through each **patient ID (`pid`)**. - Creates a **patient subfolder** if it doesn’t exist. - Copies the **VCF file** to the correct location. - Logs **missing files**. ### **`create_results_json(metadata_path, patient_folders_path, results_json_path)`** - Reads metadata to extract tumor and expression data. - Scans patient folders for VCF files. - Categorizes each file into one of four groups: - `primary_tumor_wgs` - `primary_tumor_wgs_and_rnaseq` - `metastasic_tumor_wgs` - `metastasic_tumor_wgs_and_rnaseq` - Saves results as a JSON file. ### **`main(config_path)`** - Parses the **configuration file**. - Calls `create_folder_structure()`. - Calls `create_results_json()`. --- ## **Expected Console Output** When you run the script, you’ll see output similar to: ```bash #### (1/2) Creating Folder Structure #### Setting up folder structure: 100%|██████████████| 50/50 [00:03<00:00, 15.5it/s] ✅ Folder structure created at: /path/to/patient_folders ❗ Number of VCF files not copied: 3 #### (2/2) Creating Results JSON File #### Generating JSON file: 100%|██████████████| 50/50 [00:02<00:00, 20.2it/s] ✅ Results JSON saved at: /path/to/results.json ``` --- ## **Prerequisites** - **Python 3.x** installed. - Install dependencies using: ```bash pip install -r requirements.txt ``` - Ensure all file paths in `configuration_file.json` are correctly set. --- ## **Troubleshooting** | Issue | Solution | | ---------------------------------------------- | -------------------------------------------------------------------------------------- | | `FileNotFoundError: Metadata file not found` | Check that the metadata CSV path is correct in the config file. | | `VCF file does not exist` | Ensure the `path_to_wgs_file` column in the metadata file contains valid file paths. | | `Results JSON is empty` | Check that the metadata contains valid tumor and RNA-seq data references. | This script automates the setup process, ensuring that data is structured properly before pipeline execution. 🚀