# Assembly workflow for pangenome
This workflow uses [Snakemake](https://snakemake.readthedocs.io/en/stable/) to quickly assemble genomes from PacBio HiFi data, producing an HTML report that summarizes the resulting assembly statistics.

A first script (`prejob.sh`) prepares the data until *fasta.gz* files are obtained. A second script (`job.sh`) runs the genome assembly and computes statistics.

![workflow DAG](fig/rule_dag.svg)

## Table of contents
[TOC]

## Repo directory structure

```
├── README.md
├── job.sh
├── prejob.sh
├── workflow
│   ├── rules
│   ├── modules
│   ├── scripts
│   ├── pre-job_snakefiles
│   └── Snakefile
└── .config
    ├── snakemake_profile
    │   └── slurm
    │       ├── cluster_config.yml
    │       ├── config.yaml
    │       ├── CookieCutter.py
    │       ├── settings.json
    │       ├── slurm_utils.py
    │       ├── slurm-jobscript.sh
    │       ├── slurm-status.py
    │       └── slurm-submit.py
    └── masterconfig.yaml
```

## Requirements
- snakemake >= 6.5.1
- singularity

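A quick sanity check that both requirements are available on your cluster:

```bash
# Snakemake must report version 6.5.1 or higher
snakemake --version
# Singularity is needed to pull and run the container images
singularity --version
```
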
## Workflow steps, programs & Docker images pulled by Snakemake
Snakemake pulls all of the images below automatically the first time you run the workflow; this may take some time. Each image is only downloaded once and reused on subsequent runs (a connectivity-test sketch follows the list below).
Images are stored on the project's container registry but come from various container libraries:

**Pre-assembly**
- Conversion of PacBio bam to fasta & fastq
    - **smrtlink** (https://www.pacb.com/support/software-downloads/)
        - image version: 9.0.0.92188 ([link](https://hub.docker.com/r/bryce911/smrtlink/tags))
- Fastq to fasta conversion
    - **seqtk** (https://github.com/lh3/seqtk)
        - image version: 1.3--dc0d16b ([link](https://hub.docker.com/r/nanozoo/seqtk))
- Raw data quality control
    - **fastqc** (https://github.com/s-andrews/FastQC)
        - image version: v0.11.5_cv4 ([link](https://hub.docker.com/r/biocontainers/fastqc/tags))
    - **LongQC** (https://github.com/yfukasawa/LongQC)
        - image version: latest (April 2022) ([link](https://hub.docker.com/r/grpiccoli/longqc/tags))
- Metrics
    - **genometools** (https://github.com/genometools/genometools)
        - image version: v1.5.9ds-4-deb_cv1 ([link](https://hub.docker.com/r/biocontainers/genometools/tags))
- K-mer analysis
    - **jellyfish** (https://github.com/gmarcais/Jellyfish)
        - image version: 2.3.0--h9f5acd7_3 ([link](https://quay.io/repository/biocontainers/kmer-jellyfish?tab=tags))
    - **genomescope** (https://github.com/tbenavi1/genomescope2.0)
        - image version: 2.0 ([link](https://hub.docker.com/r/abner12/genomescope))

**Assembly**
- Assembly
    - **hifiasm** (https://github.com/chhylp123/hifiasm)
        - image version: 0.16.1--h5b5514e_1 ([link](https://quay.io/repository/biocontainers/hifiasm?tab=tags))
- Metrics
    - **genometools** (same as Pre-assembly)
- Assembly quality control
    - **busco** (https://gitlab.com/ezlab/busco)
        - image version: v5.3.1_cv1 ([link](https://hub.docker.com/r/ezlabgva/busco/tags))
    - **kat** (https://github.com/TGAC/KAT)
        - image version: 2.4.1--py35h355e19c_3 ([link](https://quay.io/repository/biocontainers/kat))
- Error rate, QV & phasing
    - **meryl** and **merqury** (https://github.com/marbl/meryl, https://github.com/marbl/merqury)
        - image version: 1.3--hdfd78af_0 ([link](https://quay.io/repository/biocontainers/merqury?tab=tags))
- Detect assembled telomeres
    - **FindTelomeres** (https://github.com/JanaSperschneider/FindTelomeres)
        - **Biopython** image version: 1.75 ([link](https://quay.io/repository/biocontainers/biopython?tab=tags))
- Haplotig and overlap purging
    - **purge_dups** (https://github.com/dfguan/purge_dups)
        - image version: 1.2.5--h7132678_2 ([link](https://quay.io/repository/biocontainers/purge_dups?tab=tags))
        - **matplotlib** image version: v0.11.5-5-deb-py3_cv1 ([link](https://hub.docker.com/r/biocontainers/matplotlib-venn/tags))

**Report**
- **R markdown**
    - image version: 4.0.3 ([link](https://hub.docker.com/r/reslp/rmarkdown/tags))
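
If you want to check that your cluster can reach a container registry before launching the workflow, you can pull one of the images listed above by hand. This is only a hedged troubleshooting suggestion, not part of the documented setup; Snakemake manages its own image cache (by default under `.snakemake/singularity` in the working directory), so the file produced here is just a test artifact:

```bash
# Optional connectivity test: manually pull one of the images listed above
singularity pull docker://quay.io/biocontainers/hifiasm:0.16.1--h5b5514e_1
```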

## How to run the workflow

### Profile setup
The provided profile is made for SLURM. To run this workflow on another HPC, create another profile (https://github.com/Snakemake-Profiles) and add it to the `.config/snakemake_profile` directory, then change the `CLUSTER_CONFIG` and `PROFILE` variables in `job.sh` and `prejob.sh`.
If you are using the provided SLURM setup, set line 13 of the `cluster_config.yml` file to your email address (see the hypothetical excerpt below).
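In SLURM profiles of this kind, the email setting is usually a `mail-user` entry under the default job options. A hypothetical excerpt of `cluster_config.yml` (the exact keys and layout in this repo may differ):

```yaml
__default__:
  mail-user: your.name@example.org  # replace with your email address
  mail-type: FAIL                   # hypothetical: when SLURM sends mail
```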
### SLURM logs
The SLURM submission scripts `prejob.sh` and `job.sh` write their standard and error output to the `slurm_logs` directory. This directory must exist before running either script, otherwise SLURM will refuse to submit the jobs.

```
# create the directory if it does not already exist
mkdir -p slurm_logs
```

### Workflow execution
Go to the `Assemb_v2_Snakemake_FullAuto` directory to run the bash scripts.

1. **Data preparation**

Modify the following variables in the file `.config/masterconfig.yaml`:
- `root`
    - The absolute path where you want the output to be written.
- `data`
    - The path to the directory containing all input tar files.

This workflow can automatically detect the filenames in the specified `data` directory, or run only on given files:
- `get_all_tar_filename: True` will uncompress all tar files. If you want to choose which files to uncompress, use `get_all_tar_filename: False` and give the filenames as a list in `tarIDS` (see the sketch below).
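A hypothetical excerpt of `.config/masterconfig.yaml` for this step (paths and filenames are placeholders; check the shipped config for the exact expected format):

```yaml
root: /absolute/path/to/output      # where all results will be written
data: /absolute/path/to/input/tars  # directory containing the input tar files
get_all_tar_filename: False         # True = uncompress every tar file found
tarIDS: ["sample_A", "sample_B"]    # only used when the flag above is False
```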


Modify the `SNG_BIND` variable in `prejob.sh`; it must have the same value as the `root` variable in `.config/masterconfig.yaml`. Change line 17 to your email address.
If Singularity is not in the HPC environment, add `module load singularity` under "Module loading", as sketched below.
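A hypothetical excerpt of the edited `prejob.sh` (the path is a placeholder and must match `root`):

```bash
# bind path for Singularity; same value as `root` in .config/masterconfig.yaml
SNG_BIND="/absolute/path/to/output"

# Module loading
module load singularity  # add only if singularity is not already available
```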
Then run:
```bash
sbatch prejob.sh
```
This will create multiple directories to prepare the data for the workflow. You will end up with a `bam_files` directory containing all *bam* files (renamed after their tar file if the original file was named "ccs.bam") and a `fastx_files` directory containing all *fasta* and *fastq* files. The `extract` directory contains all other files that were in the tarball.
```
workflow_results
└── 00_raw_data
    ├── bam_files
    ├── extract
    └── fastx_files
```

2. **Running the workflow**

The `fastx_files` directory is the starting point for the assembly workflow. You can add other datasets, but the workflow needs a *fasta.gz* file for each of them. If *bam* or *fastq.gz* files are available, the workflow also runs the raw data quality control steps.

You will have to modify other variables in `.config/masterconfig.yaml` (a sketch of these keys follows the example below):
- `get_all_filename: True` will run the workflow with all *fasta.gz* in the `fastx_files` directory as input. If you want to choose the input for the workflow, use `get_all_filename: False` and give the fasta filenames as a list in `IDS`.
- Choose your run name with `run`.
- Specify the organism ploidy with `ploidy`.
- Choose the BUSCO lineage with `lineage`.
- There are 3 modes to run hifiasm. In all cases, the organism has to be sequenced in PacBio HiFi. To choose the mode, set the variable `mode` in `.config/masterconfig.yaml` to either:
    - `default` for a HiFi-only assembly.
    - `trio` if you have parental reads (either HiFi or short reads) in addition to the sequencing of the organism.
        - Add a key corresponding to your filename and modify the variables `p1` and `p2` to be the parental reads. Supported filetypes are *fasta*, *fasta.gz*, *fastq* and *fastq.gz*.
    - `hi-c` if the organism has been sequenced in paired-end Hi-C as well.
        - Add a key corresponding to your filename and modify the variables `r1` and `r2` to be the paired-end Hi-C reads. Supported filetypes are *fasta*, *fasta.gz*, *fastq* and *fastq.gz*.

For example:
```yaml
# trio datasets
sibling_1_dataset_placeholder:
  p1: path/to/parent/1/reads
  p2: path/to/parent/2/reads

sibling_2_dataset_placeholder:
  p1: path/to/parent/1/reads
  p2: path/to/parent/2/reads

# Hi-C datasets
hic_sample_dataset_1_name_placeholder:
  r1: path/to/r1/reads
  r2: path/to/r2/reads

hic_sample_dataset_2_name_placeholder:
  r1: path/to/r1/reads
  r2: path/to/r2/reads
```
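
And a hypothetical sketch of the remaining keys described above (all values are placeholders; `default`, `trio` and `hi-c` are the documented mode values):

```yaml
get_all_filename: True    # False = only assemble the files listed in IDS
IDS: ["sample_A"]         # input fasta names, used when get_all_filename is False
run: run01                # name of this run
ploidy: 2                 # organism ploidy
lineage: eudicots_odb10   # BUSCO lineage; pick the one matching your organism
mode: default             # default, trio or hi-c
```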

Modify the `SNG_BIND` variable in `job.sh`; it must have the same value as the `root` variable in `.config/masterconfig.yaml`. Change line 17 to your email address.
If Singularity is not in the HPC environment, add `module load singularity` under "Module loading".
Then run:
```bash
sbatch job.sh
```

All the SLURM output logs are in the `slurm_logs` directory. There are *.out* and *.err* files for the workflow (`snakemake.cortex*`) and for each rule (`rulename.cortex*`).

### Dry run
To check that the workflow will run correctly, you can do a dry run: uncomment line 56 in `job.sh` and comment line 59 (see the sketch after the command below), then run
```bash
sbatch job.sh
```
Check the `snakemake.cortex*.out` file in the `slurm_logs` directory; you should see a summary of the planned jobs.
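
As a hypothetical illustration of those two lines (the actual commands in `job.sh` differ; `-n` is Snakemake's dry-run flag):

```bash
# line 56: dry run, prints the planned jobs without executing anything
snakemake -n --profile .config/snakemake_profile/slurm
# line 59: the real execution command, commented out during the dry run
# snakemake --profile .config/snakemake_profile/slurm --use-singularity
```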

### Outputs
These are the directories for the data produced by the workflow:
- An automatic report is generated in the `RUN` directory.
- `01_raw_data_QC` contains all quality control run on the reads. FastQC and LongQC create HTML reports on *fastq* and *bam* files respectively, read statistics are computed by Genometools, and genome size and heterozygosity predictions are made by Genomescope (in directory `04_kmer`).
- `02_genome_assembly` contains 2 assemblies. The first, in `01_raw_assembly`, is the assembly obtained with hifiasm. The second, in `02_after_purge_dups_assembly`, is the hifiasm assembly after haplotig removal by purge_dups. Both assemblies have a `01_assembly_QC` directory containing assembly statistics from Genometools (in directory `assembly_stats`), BUSCO analyses (`busco`), k-mer profiles with KAT (`katplot`), completeness and QV stats with Merqury (`merqury`), and assembled telomeres found by FindTelomeres (`telomeres`).

```
workflow_results
├── 00_raw_data
└── FILENAME
    └── RUN
        ├── 01_raw_data_QC
        │   ├── 01_fastQC
        │   ├── 02_longQC
        │   ├── 03_genometools
        │   └── 04_kmer
        │       └── genomescope
        └── 02_genome_assembly
            ├── 01_raw_assembly
            │   ├── 00_assembly
            │   └── 01_assembly_QC
            │       ├── assembly_stats
            │       ├── busco
            │       ├── katplot
            │       ├── merqury
            │       └── telomeres
            └── 02_after_purge_dups_assembly
                ├── 00_assembly
                │   ├── hap1
                │   └── hap2
                └── 01_assembly_QC
                    ├── assembly_stats
                    ├── busco
                    ├── katplot
                    ├── merqury
                    └── telomeres
```

## Known problems/errors
### HPC
The workflow does not work if the HPC does not allow a running job to submit other jobs, since the main Snakemake job submits each rule as a separate SLURM job.
### BUSCO
The first time you run the workflow, if there are multiple samples, the BUSCO lineage might be downloaded multiple times. This can create a conflict between the jobs using BUSCO and may interrupt some of them. In that case, simply rerun the workflow once all jobs are done.
### Snakemake locked directory
When you try to rerun the workflow after cancelling a job, you may have to unlock the results directory. To do so, go to `.config/snakemake_profile/slurm` and uncomment line 14 of `config.yaml`, then run the workflow once to unlock the directory (this should only take a few seconds). Comment line 14 again; the workflow will then be able to run and create outputs. A hypothetical sketch of that line follows.
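Profile keys in `config.yaml` mirror Snakemake's command-line flags, so line 14 presumably looks like the following (hypothetical excerpt):

```yaml
unlock: true  # equivalent to `snakemake --unlock`; re-comment after the short run
```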