RNASeq Practice

Description	Hands On Lab Exercises for RNASeq
Related-course materials	Linux for Dummies
Authors	Julie Orjuela (julie.orjuela_AT_irf.fr), Pierre Larmande (pierre.larmande_AT_ird.fr), Christine Tranchant (christine.tranchant_AT_ird.fr)
Creation Date	04/02/2022
Last Modified Date	14/03/2022

Summary

Preamble: Dataset used during this practice

Practice 1: Connect to the cluster and prepare your working environment - ssh,srun,scp
Practice 2: Check Reads Quality - fastqc,multiqc
Practice 3: fastq cleaning - cutadapt
Practice 4: Using the workflow manager TOGGLe to execute cutadapt and fastqc on a large number of samples
Practice 5: Running Hisat2 and Stringtie with TOGGLe

TIP

TIP 1: renaming fastq files - bash
TIP 2: scripts "slurm"
TIP 3: How to choose a node to execute my analysis

Links
License

Preamble. Dataset used during this practice

Datasets used in this practical

Origin :

ref : https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3488244/
data : NCBI SRA database under accession number SRS307298 S. cerevisiae.
Genome size of S. cerevisiae : about 12 Mb (12,157,105 bp) (https://www.yeastgenome.org/strain/S288C#genome_sequence)

In this session, we will analyze RNA-seq data from S. cerevisiae (NCBI SRA SRS307298). The reads come from two different origins (CENPK and Batch), with three biological replicates for each origin (rep1, rep2 and rep3).

Where is this dataset on the cluster ?

The dataset was downloaded to the i-trop cluster here : /data2/formation/TP_read2count/RAW_DATA (nas server)

```
RAW_DATA/
├── adapt-125pbLib.txt
├── FASTQ
│   ├── SRR453566_1.fastq.gz
│   ├── SRR453566_2.fastq.gz
│   ├── SRR453567_1.fastq.gz
│   ├── SRR453567_2.fastq.gz
│   ├── SRR453568_1.fastq.gz
│   ├── SRR453568_2.fastq.gz
│   ├── SRR453569_1.fastq.gz
│   ├── SRR453569_2.fastq.gz
│   ├── SRR453570_1.fastq.gz
│   ├── SRR453570_2.fastq.gz
│   ├── SRR453571_1.fastq.gz
│   ├── SRR453571_2.fastq.gz
│   ├── SRR453578_1.fastq.gz
│   └── SRR453578_2.fastq.gz
├── REF
│   ├── GCF_000146045.2_R64_cds_from_genomic.fna
│   ├── GCF_000146045.2_R64_cds_from_genomic.fna.gz
│   ├── GCF_000146045.2_R64_genomic.fna
│   └── GCF_000146045.2_R64_genomic.gtf
└── samples.txt

2 directories, 20 files
```

Practice 1. Connect to the cluster and prepare your working environment - `ssh,srun,scp`

Connect to the cluster through `ssh`

We will work on the i-trop cluster, which uses the SLURM scheduler.

ssh login@bioinfo-master.ird.fr

Opening an interactive bash session on a node via slurm `srun -p partition --pty bash -i`

Read this survival guide to the basic SLURM commands (https://southgreenplatform.github.io/trainings/slurm/)

srun  --pty bash -i

Prepare your input files

Create your working directory in the scratch partition (/scratch).

Please replace LOGIN with your own user login.

cd /scratch
mkdir LOGIN
cd LOGIN

Copy the fastq files from the nas into your scratch directory.

scp -r nas:/data2/formation/TP_read2count/RAW_DATA/* /scratch/LOGIN/

Check that the files have been copied correctly with the command ls -alR or tree. You should see 14 gzipped fastq files, a samples.txt file and an adapt-125pbLib.txt file.

[orjuela@node25 RAWDATA]$ more samples.txt
CENPK   CENPK_rep1  PATH/SRR453569_1.fastq.gz   PATH/SRR453569_2.fastq.gz
CENPK   CENPK_rep2  PATH/SRR453570_1.fastq.gz   PATH/SRR453570_2.fastq.gz
CENPK   CENPK_rep3  PATH/SRR453571_1.fastq.gz   PATH/SRR453571_2.fastq.gz
Batch   Batch_rep1  PATH/SRR453566_1.fastq.gz   PATH/SRR453566_2.fastq.gz
Batch   Batch_rep2  PATH/SRR453567_1.fastq.gz   PATH/SRR453567_2.fastq.gz
Batch   Batch_rep3  PATH/SRR453568_1.fastq.gz   PATH/SRR453568_2.fastq.gz
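The copy check can also be scripted. The following is only a minimal sketch, not part of the original practical: the function name `check_fastq_pairs` is hypothetical, and it assumes the `_1.fastq.gz` / `_2.fastq.gz` naming shown in the RAW_DATA listing. It counts the gzipped fastq files and reports any forward file whose mate is missing.

```shell
#!/bin/sh
# Count paired fastq files in a directory and report R1 files without a mate.
# Assumption: files are named <run>_1.fastq.gz and <run>_2.fastq.gz.
check_fastq_pairs() (
    dir=$1
    total=0
    missing=0
    for r1 in "$dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue            # no match: the glob stays literal
        r2="${r1%_1.fastq.gz}_2.fastq.gz"   # expected name of the mate file
        total=$((total + 1))
        if [ -e "$r2" ]; then
            total=$((total + 1))
        else
            echo "missing mate for $r1" >&2
            missing=$((missing + 1))
        fi
    done
    echo "$total fastq files, $missing unpaired"
    [ "$missing" -eq 0 ]                    # non-zero status if a mate is missing
)

# Example call (LOGIN is a placeholder, as in the rest of this practice):
# check_fastq_pairs /scratch/LOGIN/FASTQ
```

With the dataset of this practice, the function would be expected to report 14 files and no unpaired read file. Reverse files without a forward mate are not detected by this sketch.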

Check the size of fastq directory

du -sh FASTQ_PATH

Practice 2. Check Reads Quality

FastQC runs a set of simple quality-control checks on raw sequencing data and flags problems or biases that could affect downstream analysis. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Create the directory `FASTQC` in your scratch directory

mkdir FASTQC
cd FASTQC

Load the FastQC software (latest version)

module load bioinfo/FastQC/0.11.9

Run fastqc on all the samples (it will take about 10 min)

fastqc /scratch/LOGIN/FASTQ/* -o /scratch/LOGIN/FASTQC/
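Each FastQC zip archive contains a `summary.txt` file with one tab-separated line per module (status, module name, file name). As a rough illustration, the sketch below counts the PASS, WARN and FAIL statuses read from standard input. The function name `count_fastqc_status` is hypothetical, and the two input lines are written by hand in the `summary.txt` layout rather than taken from a real run.

```shell
#!/bin/sh
# Count PASS / WARN / FAIL lines read from standard input.
# Input format assumed: status<TAB>module name<TAB>file name (FastQC summary.txt).
count_fastqc_status() {
    awk -F '\t' '
        { n[$1]++ }
        END { printf "PASS=%d WARN=%d FAIL=%d\n", n["PASS"], n["WARN"], n["FAIL"] }
    '
}

# Example on two hand-written lines in the summary.txt layout:
printf 'PASS\tBasic Statistics\tSRR453566_1.fastq.gz\nFAIL\tPer base sequence content\tSRR453566_1.fastq.gz\n' |
    count_fastqc_status
```

On a real result, and assuming `unzip` is available on the node, the summary could be streamed with something like `unzip -p SRR453566_1_fastqc.zip '*/summary.txt' | count_fastqc_status`.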

Run MultiQC

MultiQC is a modular tool that aggregates results from bioinformatics analyses across many samples into a single report. Use it to visualise all the fastqc results. https://multiqc.info/

# load the module
module load bioinfo/multiqc/1.9

# launch MultiQC to create an html report gathering the information generated by fastqc for each fastq file
multiqc /scratch/LOGIN/FASTQC

Transfer fastqc and multiqc files

Now transfer the results : node -> nas -> computer

Transfer from /scratch to the NAS

scp -r /scratch/LOGIN/FASTQC nas:/home/LOGIN/

Transfer from the NAS to your computer using scp or FileZilla

scp -r LOGIN@bioinfo-nas.ird.fr:/home/LOGIN/FASTQC .

Open the multiqc report `multiqc_report.html` in your favorite web browser

Remove data in scratch directory

rm -rf /scratch/LOGIN/FASTQC/

Your turn !

Launch the same analysis through a slurm script !

Practice 3. fastq cleaning

Using `cutadapt` to remove adapters and to trim reads based on quality

cutadapt website

cutadapt -q 30,30 -m 35 -B GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG -B GTTCGTCTTCTGCCGTATGCTCTAGCACTACACTGACCTCAAGTCTGCACACGAGAAGGCTAG -b GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG -b GTTCGTCTTCTGCCGTATGCTCTAGCACTACACTGACCTCAAGTCTGCACACGAGAAGGCTAG -o P1_R1.CUTADAPT.fastq.gz -p P1_R2.CUTADAPT.fastq.gz P1_R1.fq.gz P1_R2.fq.gz

Cutadapt supports trimming of multiple types of adapters:

Adapter type	Command-line option
3’ adapter	-a ADAPTER
5’ adapter	-g ADAPTER
5’ or 3’ (both possible)	-b ADAPTER

-q 30,30 : by default, only the 3’ end of each read is quality-trimmed. To trim the 5’ end as well, give the -q option two comma-separated cutoffs: the first applies to the 5’ end and the second to the 3’ end.

-m 35 discards reads that are shorter than 35 bases after trimming.

-p is the short form of --paired-output. The option -B is used here to specify an adapter sequence that cutadapt should remove from the second read in each pair.
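Because the full command is long, it can help to generate it from variables. The sketch below is an illustration only: the helper `cutadapt_cmd` and the second sample prefix `P2` are hypothetical, while the options and adapter sequences are copied from the command above. It prints the command for each sample prefix instead of running it (dry run), so it can be checked before launching cutadapt.

```shell
#!/bin/sh
# Adapter sequences copied from the cutadapt command above.
ADAPTER_A=GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG
ADAPTER_B=GTTCGTCTTCTGCCGTATGCTCTAGCACTACACTGACCTCAAGTCTGCACACGAGAAGGCTAG

# Print (do not run) the cutadapt command for one sample prefix, e.g. P1.
# Assumption: input files are named <prefix>_R1.fq.gz and <prefix>_R2.fq.gz.
cutadapt_cmd() {
    printf 'cutadapt -q 30,30 -m 35 -B %s -B %s -b %s -b %s -o %s -p %s %s %s\n' \
        "$ADAPTER_A" "$ADAPTER_B" "$ADAPTER_A" "$ADAPTER_B" \
        "${1}_R1.CUTADAPT.fastq.gz" "${1}_R2.CUTADAPT.fastq.gz" \
        "${1}_R1.fq.gz" "${1}_R2.fq.gz"
}

# Dry run for two sample prefixes (P2 is a made-up second sample).
for sample in P1 P2; do
    cutadapt_cmd "$sample"
done
```

Once the printed commands look right, they could be executed by piping them to a shell, for example `cutadapt_cmd P1 | sh`, on a node where cutadapt is available.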

Practice 4 : Using the workflow manager `TOGGLe` to execute cutadapt and fastqc on a large number of samples

Data used for this practice

Input data for this practical are accessible in the directory /data2/formation/TP_read2count/.

fastq files : RAW_DATA/FASTQ_RENAMED
configuration file used by TOGGLe : scripts/TOGGLE_CONFIG/cleaning.config.txt
slurm script used to launch our analysis workflow with TOGGLe : scripts/runTOGGLe_cleaning.sh

PATHTODATA="nas:/data2/formation/TP_read2count/"
* Input data : $PATHTODATA/RAW_DATA/FASTQ_RENAMED
* Config file: $PATHTODATA/scripts/TOGGLE_CONFIG/cleaning.config.txt
* Script : $PATHTODATA/scripts/runTOGGLe_cleaning.sh

Nb : You can also download a predefined TOGGLe configuration file such as fastqCheckQuality.config.txt and modify it.

Prepare your working environment

Opening an interactive bash session on a node via slurm

srun -p partition --pty bash -i

Create your working directory in the scratch partition (/scratch).

# Create a new directory in your /scratch
mkdir /scratch/$USER-TOGGLe
cd /scratch/$USER-TOGGLe/
mkdir cleaning
cd  cleaning

Import the fastq files from $PATHTODATA into your /scratch directory and configure TOGGLe for the cleaning analysis

# declare where the data is
PATHTODATA="nas:/data2/formation/TP_read2count/"

# Transfer the reads to clean to /scratch
scp -r $PATHTODATA/RAW_DATA/FASTQ_RENAMED .

Import TOGGLE configuration file into your /scratch directory

Copy the configuration file used by TOGGLe into the directory cleaning and display the content of this file

scp $PATHTODATA/scripts/TOGGLE_CONFIG/cleaning.config.txt .

[orjuela@nodeXX cleaning]$ more cleaning.config.txt 
$order
1=cutadapt
2=fastqc

$fastqc

$cutadapt
-q 30,30
-m 35
-u 9 
-B GATCGGAAGAGCACACGTCTGAACTCCAGTCACGGCTACATCTCGTATGC
-B ATCGGAAGAGCACACGTCTGAACTCCAGTCACGGCTACATCTCGTATGCC
-B GTCCATTATATGTCTCCCAAACCACCAAACTCTTTGACTCCGGTGTGTTG
-B GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCG
-b GATCGGAAGAGCACACGTCTGAACTCCAGTCACGGCTACATCTCGTATGC
-b ATCGGAAGAGCACACGTCTGAACTCCAGTCACGGCTACATCTCGTATGCC
-b GTCCATTATATGTCTCCCAAACCACCAAACTCTTTGACTCCGGTGTGTTG
-b GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCG

# PUT YOUR OWN SLURM CONFIGURATION HERE IF AVAILABLE RESSOURCES
$slurm
-p PARTITION
--nodelist=nodeXXX

# If your data are already in scratch don't activate this option
#$scp
#/scratch/

$env
module load bioinfo/TOGGLE-dev/0.3.7

Modify this configuration file

Using a text editor such as nano, modify the configuration file cleaning.config.txt and check all the parameters.

Adapt the parameters of Cutadapt and FastQC if necessary
Change the slurm configuration (key $slurm) by adding the correct partition and node names :
- -p PARTITION. The partition can be short, normal, highmem, etc.
- --nodelist=nodeX (node used by slurm)
If your fastq files are in a project space, activate the $scp key : TOGGLe will transfer the data from your project space into the scratch directory of the node used by slurm.

Nb : If your data are already in /scratch, deactivate the $scp key and activate the $env key by giving it the line module load bioinfo/TOGGLE-dev/0.3.7.
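The two $slurm placeholders can also be filled in without opening an editor. This is a minimal sketch under stated assumptions: the helper name `set_slurm_keys` is hypothetical, and it expects the placeholder lines exactly as displayed in cleaning.config.txt above (`-p PARTITION` and `--nodelist=nodeXXX`). It prints the modified configuration on standard output and leaves the original file untouched.

```shell
#!/bin/sh
# Replace the partition and node placeholders of a TOGGLe config file.
# $1 = config file, $2 = partition name, $3 = node name
# Assumption: the placeholders are written exactly "-p PARTITION" and
# "--nodelist=nodeXXX", each alone on its line.
set_slurm_keys() {
    sed -e "s/^-p PARTITION\$/-p $2/" \
        -e "s/^--nodelist=nodeXXX\$/--nodelist=$3/" "$1"
}

# Example call (partition and node names are placeholders to adapt):
# set_slurm_keys cleaning.config.txt normal node25 > cleaning.mine.config.txt
```

Writing to a new file, as in the commented example, keeps the original configuration available for comparison with `diff`.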

Create a slurm script to launch TOGGLe

This script is also available in the /data2/formation/TP_read2count/scripts/runTOGGLe_cleaning.sh

#!/bin/bash -l
#SBATCH -J TOGGLeCleaning
#SBATCH --export=ALL
#SBATCH -e toggle."%j".err
#SBATCH -o toggle."%j".out
#SBATCH -p PARTITION
#SBATCH --nodelist=nodeXX 

# Defining scratch and destination directories

REP="/scratch/$USER-TOGGLe/cleaning"
dir="$REP/FASTQ_RENAMED"
out="$REP/OUTPUT_TOGGLE-CLEANING"
config="$REP/cleaning.config.txt"

# Software-specific settings exported to user environment
module load bioinfo/TOGGLE-dev/0.3.7

# running toggleGenerator
toggleGenerator.pl -d $dir -c $config -o $out --nocheck;

echo "End of TOGGLe for cleaning ^^"

Make runTOGGLe_cleaning.sh executable with chmod +x runTOGGLe_cleaning.sh

Launch the script runTOGGLe_cleaning.sh through slurm

sbatch ./runTOGGLe_cleaning.sh

Check that your jobs are running correctly

Explore the TOGGLe output directory OUTPUT_TOGGLE-CLEANING and check that everything went well.
Check whether your jobs are running with : squeue -u LOGIN -i 10
Try the top command (keys 'c' and 'u')

Remove your data from the scratch directory

Practice 5 : Running Hisat2 and Stringtie with TOGGLe

Data used for this practice

Input data are available in the directory /data2/formation/TP_read2count :

fastq files : RAW_DATA/FASTQ_RENAMED
Genome sequence (fasta) : RAW_DATA/REF/GCF_000146045.2_R64_genomic.fna
Genome annotation (gtf) : RAW_DATA/REF/GCF_000146045.2_R64_genomic.gtf
configuration file used by TOGGLe : scripts/TOGGLE_CONFIG/RNASeqHisat2Stringtie.config.txt
slurm script used to launch our analysis workflow with TOGGLe : scripts/runTOGGLeRNASEQ.sh

 PATHTODATA="nas:/data2/formation/TP_read2count"
* Input data : $PATHTODATA/RAW_DATA/FASTQ_RENAMED
* Reference : $PATHTODATA/RAW_DATA/REF/GCF_000146045.2_R64_genomic.fna
* Annotation : $PATHTODATA/RAW_DATA/REF/GCF_000146045.2_R64_genomic.gtf
* config file : $PATHTODATA/scripts/TOGGLE_CONFIG/RNASeqHisat2Stringtie.config.txt
* Script to run TOGGLe: $PATHTODATA/scripts/runTOGGLeRNASEQ.sh

In this practice, we don't transfer data from the nas into /scratch manually ! TOGGLe can manage the data transfer for you. Let's try it!

Prepare your analysis

Opening an interactive bash session on a node via slurm

srun -p partition --pty bash -i

Copy the data used for this practice into your own project directory (NAS, NAS2 or NAS3 servers).

# 1 Go to your project on NASX
ssh NASX
cd /dataX/projects/MyProject/

# 2 Create a directory to work
mkdir TEST-TOGGLe
cd  TEST-TOGGLe

# 3 Declare a variable : to change if needed
PATHTODATA="nas:/data2/formation/TP_read2count/"

# 4 Transfer the test READS and REF to NAS, NAS2 or NAS3
scp -r $PATHTODATA/RAW_DATA/FASTQ_RENAMED .
scp -r $PATHTODATA/RAW_DATA/REF .

# 5 Make a copy of the TOGGLe configuration file adapted to RNAseq
scp $PATHTODATA/scripts/TOGGLE_CONFIG/RNASeqHisat2Stringtie.config.txt .

# 6 Transfer the script used to run TOGGLe
scp $PATHTODATA/scripts/runTOGGLeRNASEQ.sh .

Adapt the configuration file used by `TOGGLe`

Using a text editor, adapt the file RNASeqHisat2Stringtie.config.txt and check all the parameters. This predefined configuration file can be obtained from RNASeqReadCount.config.txt

Change the SLURM key $slurm by giving -p PARTITION. The partition can be short, normal, highmem, etc.
Complete the $scp key with /scratch. TOGGLe will use this key to copy data from a nas server onto a node (scratch partition), then run the jobs on the data in the scratch directory.
Nb : If your data are already in /scratch, comment out this key
Add, after the $env key, the line module load bioinfo/TOGGLE-dev/0.3.7.
Check the parameters of every step in RNASeqHisat2Stringtie.config.txt against the recommendations of the StringTie manual http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual.

$ more RNASeqHisat2Stringtie.config.txt 
$order
1=hisat2
2=samtoolsView
3=samtoolsSort
4=stringtie

$hisat2
--dta

$samtoolsview
-b
-h

$stringtie
-e
-B

$cleaner
1
2

$slurm
--job-name=TOGGLe
--partition normal
--export=ALL

$scp
/scratch/

$env
module load bioinfo/TOGGLE-dev/0.3.7

HISAT2 and StringTie in a few words :

Before mapping, a reference genome index is required. TOGGLe automatically builds the genome index if no index is found in the reference folder.

Mapping is performed with HISAT2, and the mapped reads are assembled into transcripts with StringTie. Transcript assembly can be done with or without a reference annotation.

StringTie can estimate the expression levels of the "reference" transcripts provided. Its output can then be converted into a matrix of read counts mapped to particular genomic features (e.g., genes).
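To make the four configured steps more concrete, the sketch below prints hand-written command lines that roughly correspond to them. This is an assumption about an equivalent manual run, not the commands TOGGLe actually generates: the index prefix `REF/R64_index`, the intermediate file names and the two helper functions are hypothetical. Only the options `--dta`, `-b`, `-h`, `-e` and `-B` come from the configuration file above. Nothing is executed, so the block runs without the tools installed.

```shell
#!/bin/sh
# Dry run: print (do not execute) manual equivalents of the TOGGLe steps.
REF=REF/GCF_000146045.2_R64_genomic.fna
GTF=REF/GCF_000146045.2_R64_genomic.gtf
INDEX=REF/R64_index     # hypothetical index prefix

# Index the reference genome once, before mapping.
print_index_cmd() {
    echo "hisat2-build $REF $INDEX"
}

# Map, convert to BAM, sort, then quantify one sample against the annotation.
print_sample_cmds() {
    s=$1
    echo "hisat2 --dta -x $INDEX -1 ${s}_R1.fastq.gz -2 ${s}_R2.fastq.gz -S $s.sam"
    echo "samtools view -b -h -o $s.bam $s.sam"
    echo "samtools sort -o $s.sorted.bam $s.bam"
    echo "stringtie $s.sorted.bam -e -B -G $GTF -o $s.STRINGTIE.gtf"
}

print_index_cmd
for sample in Batch-rep1 CENPK-rep1; do
    print_sample_cmds "$sample"
done
```

The StringTie option -e restricts the estimation to the transcripts of the annotation given with -G, which is why a reference annotation is required in this mode.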

Create a slurm script to launch your analysis - `runTOGGLeRNASEQ.sh`

This script is also available in the data directory used for this practice.

$ more runTOGGLeRNASEQ.sh
#!/bin/bash -l
#SBATCH -J TOGGLeRNASeq
#SBATCH --export=ALL
#SBATCH -e toggle."%j".err
#SBATCH -o toggle."%j".out
#SBATCH -p PARTITION

# Defining my project and destination directories
REP="/dataX/projects/MyProject/TEST-TOGGLe"
dir="$REP/FASTQ_RENAMED"
out="$REP/OUTPUT_TOGGLE-H2SSM"
config="$REP/RNASeqHisat2Stringtie.config.txt"
ref="$REP/REF/GCF_000146045.2_R64_genomic.fna"
gff="$REP/REF/GCF_000146045.2_R64_genomic.gtf"

# Software-specific settings exported to user environment
module load bioinfo/TOGGLE-dev/0.3.7

# running toggleGenerator
toggleGenerator.pl -d $dir -c $config -r $ref -g $gff -o $out --report --nocheck;

echo "End of TOGGLe - RNASEQ ^^"

Launch the TOGGLE analysis - `runTOGGLeRNASEQ.sh`

Make runTOGGLeRNASEQ.sh executable with `chmod +x runTOGGLeRNASEQ.sh`

chmod +x ./runTOGGLeRNASEQ.sh

Launch runTOGGLeRNASEQ.sh in sbatch mode

sbatch ./runTOGGLeRNASEQ.sh

Check that your jobs are running correctly

Explore the output directory OUTPUT_TOGGLE-H2SSM and check that everything went well.
Check the finalResults directory and look at the .gtf files

Convert `GTF` file into `COUNTS` file

Create a `samples_gtf.txt` file

Go to the finalResults directory and create samples_gtf.txt as follows :

cd /dataX/projects/MyProject/TEST-TOGGLe/OUTPUT_TOGGLE-H2SSM/finalResults
ls *gtf | sed 's/\.STRINGTIE\.gtf$//' > names.txt
realpath *gtf > paths.txt
paste names.txt paths.txt > samples_gtf.txt
rm names.txt paths.txt

samples_gtf.txt now looks like this (no header) :

Batch-rep1  /dataX/projects/MyProject/TEST-TOGGLe/OUTPUT_TOGGLE-H2SSM/finalResults/Batch-rep1.STRINGTIE.gtf
Batch-rep2  /dataX/projects/MyProject/TEST-TOGGLe/OUTPUT_TOGGLE-H2SSM/finalResults/Batch-rep2.STRINGTIE.gtf
Batch-rep3  /dataX/projects/MyProject/TEST-TOGGLe/OUTPUT_TOGGLE-H2SSM/finalResults/Batch-rep3.STRINGTIE.gtf
CENPK-rep1  /dataX/projects/MyProject/TEST-TOGGLe/OUTPUT_TOGGLE-H2SSM/finalResults/CENPK-rep1.STRINGTIE.gtf
CENPK-rep2  /dataX/projects/MyProject/TEST-TOGGLe/OUTPUT_TOGGLE-H2SSM/finalResults/CENPK-rep2.STRINGTIE.gtf
CENPK-rep3  /dataX/projects/MyProject/TEST-TOGGLe/OUTPUT_TOGGLE-H2SSM/finalResults/CENPK-rep3.STRINGTIE.gtf
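The same two-column file can be produced in a single loop, without temporary files. The following is a sketch rather than the method of the practical: the function name `make_samples_gtf` is hypothetical, and it assumes that the result files end in `.STRINGTIE.gtf`, as in the listing above.

```shell
#!/bin/sh
# Print "sample name<TAB>absolute path" for every *.STRINGTIE.gtf file
# found in the directory given as first argument.
make_samples_gtf() (
    cd "$1" || exit 1
    for f in *.STRINGTIE.gtf; do
        [ -e "$f" ] || continue     # no match: the glob stays literal
        printf '%s\t%s\n' "${f%.STRINGTIE.gtf}" "$PWD/$f"
    done
)

# Example call from the directory that contains finalResults:
# make_samples_gtf finalResults > samples_gtf.txt
```

The function runs in a subshell, so the `cd` does not change the working directory of the calling shell.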

Convert GTF to counts

module load system/python/3.7.2
python3 /data2/formation/TP_read2count/scripts/prepDE.py3 -i samples_gtf.txt

gene_count_matrix.csv and transcript_count_matrix.csv have been created!
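A quick sanity check on the count matrix is to sum the counts of each sample. The sketch below assumes that the csv file has a header line whose first field is the gene identifier, followed by one column per sample. This layout is an assumption to verify on your own file with `head`, and the function name `column_totals` is hypothetical.

```shell
#!/bin/sh
# Sum every numeric column of a comma-separated count matrix.
# Assumption: line 1 is a header, column 1 holds the feature identifier.
column_totals() {
    awk -F ',' '
        NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; n = NF; next }
        { for (i = 2; i <= n; i++) sum[i] += $i }
        END { for (i = 2; i <= n; i++) printf "%s\t%d\n", name[i], sum[i] }
    ' "$1"
}

# Example call:
# column_totals gene_count_matrix.csv
```

Very different totals between replicates of the same origin would be worth investigating before running a differential expression analysis.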

You can use DESeq2 and edgeR for analyzing differential expression in R or some shiny packages such as DIANE https://oceanecsn.github.io/DIANE/ or PIVOT https://github.com/kimpenn/PIVOT

TIP

TIP 1 : Renaming fastq files

There are several ways to rename files. Here we propose a solution based on a csv file. We will use symbolic links so that the original names stay unchanged, and we will follow the naming convention defined in TOGGLe https://toggle.ird.fr/manual/quickManual/

[orjuela@master0 FASTQ]$ pwd
/data2/formation/TP_read2count/RAW_DATA/FASTQ

[orjuela@master0 FASTQ]$ realpath *gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453566_1.fastq.gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453566_2.fastq.gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453567_1.fastq.gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453567_2.fastq.gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453568_1.fastq.gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453568_2.fastq.gz
...
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453578_1.fastq.gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453578_2.fastq.gz

[orjuela@master0 FASTQ]$ realpath *gz > ../names_tmp.txt

names_tmp.txt can then be edited, in Excel for example, to add a second column with the new name, and saved as a new file (named file4renaming.csv for example) :

[orjuela@master0 RAW_DATA/]$ more file4renaming.csv
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453566_1.fastq.gz  Batch-rep1_R1.fastq.gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453566_2.fastq.gz  Batch-rep1_R2.fastq.gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453567_1.fastq.gz  Batch-rep2_R1.fastq.gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453567_2.fastq.gz  Batch-rep2_R2.fastq.gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453568_1.fastq.gz  Batch-rep3_R1.fastq.gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453568_2.fastq.gz  Batch-rep3_R2.fastq.gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453569_1.fastq.gz  CENPK-rep1_R1.fastq.gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453569_2.fastq.gz  CENPK-rep1_R2.fastq.gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453570_1.fastq.gz  CENPK-rep2_R1.fastq.gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453570_2.fastq.gz  CENPK-rep2_R2.fastq.gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453571_1.fastq.gz  CENPK-rep3_R1.fastq.gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453571_2.fastq.gz  CENPK-rep3_R2.fastq.gz

Now we will use file4renaming.csv to create the symbolic links.

# create and go to the new FASTQ directory
mkdir FASTQ_RENAMED
cd FASTQ_RENAMED

# first do a dry run : print the commands and check them before creating any link
while read -r line; do echo "ln -s $line";  done < ../file4renaming.csv

ln -s /data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453566_1.fastq.gz    Batch-rep1_R1.fastq.gz
ln -s /data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453566_2.fastq.gz    Batch-rep1_R2.fastq.gz
ln -s /data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453567_1.fastq.gz    Batch-rep2_R1.fastq.gz
ln -s /data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453567_2.fastq.gz    Batch-rep2_R2.fastq.gz
ln -s /data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453568_1.fastq.gz    Batch-rep3_R1.fastq.gz
ln -s /data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453568_2.fastq.gz    Batch-rep3_R2.fastq.gz

# once everything is ok, create the links
while read -r line; do ln -s $line;  done < ../file4renaming.csv

# verify links by using ls -l command
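The loop above relies on word splitting of an unquoted variable. A slightly more defensive variant reads the two columns into separate variables and quotes them. This is only a sketch: the function name `link_from_table` is hypothetical, and it assumes that neither the paths nor the new names contain spaces, since whitespace is the column separator.

```shell
#!/bin/sh
# Create symbolic links from a two-column table: <source path> <new name>.
# $1 = table file, $2 = destination directory (created if needed)
# Assumption: the columns are separated by spaces or tabs and contain none.
link_from_table() (
    table=$1
    dest=$2
    mkdir -p "$dest"
    # "|| [ -n "$src" ]" also handles a last line without a final newline.
    while read -r src new || [ -n "$src" ]; do
        # skip empty or incomplete lines
        [ -n "$src" ] && [ -n "$new" ] || continue
        ln -s "$src" "$dest/$new"
    done < "$table"
)

# Example call, from the RAW_DATA directory:
# link_from_table file4renaming.csv FASTQ_RENAMED
```

As with the original loop, the result can be checked with `ls -l` in the destination directory.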

TIP 2 : scripts

Slurm script to run fastqc and multiqc

A first script with all the commands used in practice 2 : `run_fastqc-multiqc.sh`

The command to execute this script

sbatch /data2/formation/TP_read2count/scripts/run_fastqc-multiqc.sh

Before running this script, don't forget to update it (e.g. the email address used by slurm). The script works as is with the data of this practical; to use it with another dataset, update the READS variable in the script.

```
#!/bin/bash

######## SLURM CONFIGURATION ##
# Set the job name
#SBATCH --job-name=FastQC

# Set the partition (and optionally the node) where the job runs
#SBATCH -p normal
####SBATCH -w nodeX

# Number of cores
#SBATCH -c 2

# Email address used for notifications
#SBATCH --mail-user=PUT_YOUR_EMAIL@ird.fr

# Type of email to receive :
# BEGIN, END, FAIL or ALL
#SBATCH --mail-type=ALL
############################################

READS="nas:/data2/formation/TP_read2count/RAW_DATA/FASTQ/"

# Print the node where the job is running
echo "###### Running on node : $HOSTNAME"

# Create a personal directory in the scratch of the node
cd /scratch
mkdir $USER-$SLURM_JOB_ID
cd $USER-$SLURM_JOB_ID

# Fetch the data from the nas
echo "##### Copying the reads to the node ..."
echo "## scp -r $READS /scratch/$USER-$SLURM_JOB_ID"
scp -r $READS /scratch/$USER-$SLURM_JOB_ID

## Create the FastQC output directory
mkdir FastQC

## Load the software
module load bioinfo/FastQC/0.11.9

# Run the analysis
echo "#### Running FastQC ..."
cd FastQC

echo "## fastqc /scratch/$USER-$SLURM_JOB_ID/FASTQ/* -o /scratch/$USER-$SLURM_JOB_ID/FastQC -t 2"
fastqc /scratch/$USER-$SLURM_JOB_ID/FASTQ/* -o /scratch/$USER-$SLURM_JOB_ID/FastQC -t 2

# Load the MultiQC module
module load bioinfo/multiqc/1.9

echo "#### Running MultiQC ..."
multiqc /scratch/$USER-$SLURM_JOB_ID/FastQC

# Copy the results back
echo '#### Transferring the results to the project ...'
echo "## scp -r FastQC $READS"
scp -r /scratch/$USER-$SLURM_JOB_ID/FastQC $READS

# Remove the data from the scratch
#cd /scratch
#rm -rf $USER-$SLURM_JOB_ID

# Job statistics
seff $SLURM_JOB_ID
```

An improved script with all the commands used in practice 2 : `run_fastqc-multiqc-arg.sh`

The command to execute this script

sbatch /data2/formation/TP_read2count/scripts/run_fastqc-multiqc-arg.sh PUT_path2reads PUT_output_directory_name

```
#!/bin/bash
#SBATCH -p normal
#SBATCH -J fastqc-multiqc
#SBATCH -c 4

if [[ -z $1 ]]; then
echo "this script expects 2 arguments : pathtoreads and outputname";
exit 1;
fi

if [[ -z $2 ]]; then
echo "this script expects outputname as its 2nd argument";
exit 1;
fi

# load the modules
#module load bioinfo/FastQC/0.11.9
module load bioinfo/FastQC/0.11.5
module load bioinfo/multiqc/1.9

# ==> variable declarations
# absolute path of the directory that contains the reads
DIRFASTQ=$(realpath $1)
TMP=/scratch/$USER-readsquality
READS=$TMP/READS
OUTFASTQC=$TMP/OUTPUTFASTQC
OUTMQC=$TMP/$2
FINAL=$(dirname $DIRFASTQ)

# create a working directory in the scratch of the node
mkdir -p $TMP
# create a directory for the reads
mkdir -p $READS
# create a directory for the FastQC output
mkdir -p $OUTFASTQC
# create a directory for the MultiQC output
mkdir -p $OUTMQC

# copy the reads into the working directory
echo "=> Transferring the reads from the project to /scratch ..."
echo "scp nas:$DIRFASTQ/*gz $READS"
scp nas:$DIRFASTQ/*gz $READS

# FastQC
echo "The fastq files are in : $READS on the node"
echo "=> Running FastQC ..."
echo "fastqc $READS/* -o $OUTFASTQC -t 4"
fastqc $READS/* -o $OUTFASTQC -t 4

# MultiQC
echo "=> Running MultiQC ..."
echo "multiqc $OUTFASTQC --outdir $OUTMQC"
multiqc $OUTFASTQC --outdir $OUTMQC

# copy the results back
echo "=> Transferring the results back to the project ..."
echo "scp -r $OUTMQC nas:$FINAL/"
scp -r $OUTMQC nas:$FINAL/

# remove the data from /scratch
echo "=> Removing the data from /scratch . bye!!!"
rm -rf $TMP

```

TIP 3: How to choose a node to execute my analysis

Links

slides https://southgreenplatform.github.io/trainings//files/RNAseq_ouaga_2019_10102019-short.pdf

Practice : https://bioinfo.ird.fr/index.php/tutorials-fr/rnaseq1/

License

Legal notice
