RNASeq Practice
Description | Hands On Lab Exercises for RNASeq |
---|---|
Related-course materials | Linux for Dummies |
Authors | Julie Orjuela (julie.orjuela_AT_irf.fr), Pierre Larmande (pierre.larmande_AT_ird.fr), Christine Tranchant (christine.tranchant_AT_ird.fr) |
Creation Date | 04/02/2022 |
Last Modified Date | 14/03/2022 |
Summary
Preamble: Dataset used during this practice
Practice 1: Connect to the cluster and prepare your working environment - ssh, srun, scp
Practice 2: Check Reads Quality - fastqc, multiqc
Practice 3: fastq cleaning - cutadapt
Practice 4: Using the workflow manager TOGGLe to execute cutadapt and fastqc on a large number of samples
Practice 5: Running Hisat2 and Stringtie with TOGGLe
- TIP 1: renaming fastq files - bash
- TIP 2: scripts "slurm"
- TIP 3: How to choose a node to execute my analysis
Preamble. Dataset used during this practice
Datasets used in this practical
Origin :
- ref : https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3488244/
- data : NCBI SRA database under accession number SRS307298 S. cerevisiae.
- Genome size of S. cerevisiae : 12M (12.157.105) (https://www.yeastgenome.org/strain/S288C#genome_sequence)
In this session, we will analyze RNA-seq data from S. cerevisiae (NCBI SRA SRS307298). The samples come from two different origins (CENPK and Batch), with three biological replicates for each origin (rep1, rep2 and rep3).
Where is this dataset on the cluster ?
The dataset was downloaded to the i-trop cluster here : /data2/formation/TP_read2count/RAW_DATA (server nas)
```
RAW_DATA/
├── adapt-125pbLib.txt
├── FASTQ
│ ├── SRR453566_1.fastq.gz
│ ├── SRR453566_2.fastq.gz
│ ├── SRR453567_1.fastq.gz
│ ├── SRR453567_2.fastq.gz
│ ├── SRR453568_1.fastq.gz
│ ├── SRR453568_2.fastq.gz
│ ├── SRR453569_1.fastq.gz
│ ├── SRR453569_2.fastq.gz
│ ├── SRR453570_1.fastq.gz
│ ├── SRR453570_2.fastq.gz
│ ├── SRR453571_1.fastq.gz
│ ├── SRR453571_2.fastq.gz
│ ├── SRR453578_1.fastq.gz
│ └── SRR453578_2.fastq.gz
├── REF
│ ├── GCF_000146045.2_R64_cds_from_genomic.fna
│ ├── GCF_000146045.2_R64_cds_from_genomic.fna.gz
│ ├── GCF_000146045.2_R64_genomic.fna
│ └── GCF_000146045.2_R64_genomic.gtf
└── samples.txt
2 directories, 20 files
```
Practice 1. Connect to the cluster and prepare your working environment - ssh, srun, scp
Connection to the cluster through ssh
We will work on the i-trop cluster using SLURM scheduler.
ssh login@bioinfo-master.ird.fr
Open an interactive bash session on a node via slurm with srun -p partition --pty bash -i
Read this survival guide to basic SLURM commands (https://southgreenplatform.github.io/trainings/slurm/)
srun --pty bash -i
Prepare your input files
Create your working directory in the scratch partition (/scratch).
Please replace LOGIN with your own user login.
cd /scratch
mkdir LOGIN
cd LOGIN
Copy fastq files from the nas into your scratch directory.
scp -r nas:/data2/formation/TP_read2count/RAW_DATA/* /scratch/LOGIN/
Check that the files have been correctly copied with the command ls -alR or tree. You should see 14 gzipped fastq files, a samples.txt file and an adapt-125pbLib.txt file.
[orjuela@node25 RAWDATA]$ more samples.txt
CENPK CENPK_rep1 PATH/SRR453569_1.fastq.gz PATH/SRR453569_2.fastq.gz
CENPK CENPK_rep2 PATH/SRR453570_1.fastq.gz PATH/SRR453570_2.fastq.gz
CENPK CENPK_rep3 PATH/SRR453571_1.fastq.gz PATH/SRR453571_2.fastq.gz
Batch Batch_rep1 PATH/SRR453566_1.fastq.gz PATH/SRR453566_2.fastq.gz
Batch Batch_rep2 PATH/SRR453567_1.fastq.gz PATH/SRR453567_2.fastq.gz
Batch Batch_rep3 PATH/SRR453568_1.fastq.gz PATH/SRR453568_2.fastq.gz
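The layout of samples.txt (origin in the first column, one line per replicate) makes it easy to confirm that each origin has its three replicates. The function below is only a sketch: count_replicates is a hypothetical helper name, and it assumes the whitespace-separated columns shown above.

```shell
# Count how many lines (replicates) each origin has in a samples file.
# Assumes the origin (e.g. CENPK or Batch) is the first whitespace-separated column.
count_replicates() {
    awk '{ n[$1]++ } END { for (c in n) print c "\t" n[c] }' "$1" | sort
}
```

Calling count_replicates samples.txt on the file above should list Batch and CENPK, each followed by the number of lines found for that origin.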
Check the size of fastq directory
du -sh FASTQ_PATH
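Besides the total size, it can be useful to check that every forward read file has its reverse mate. This is a minimal sketch, assuming the _1.fastq.gz / _2.fastq.gz naming of the dataset; check_fastq_pairs is a hypothetical helper, not part of the course scripts.

```shell
# Print the number of complete FASTQ pairs found in a directory.
# Returns a non-zero status if any *_1.fastq.gz file has no *_2.fastq.gz mate.
check_fastq_pairs() {
    local dir="$1" pairs=0 missing=0 r1 r2
    for r1 in "$dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue              # nothing matched: skip the literal pattern
        r2="${r1%_1.fastq.gz}_2.fastq.gz"     # name of the expected mate file
        if [ -e "$r2" ]; then
            pairs=$((pairs + 1))
        else
            echo "missing mate for $r1" >&2
            missing=$((missing + 1))
        fi
    done
    echo "$pairs"
    [ "$missing" -eq 0 ]
}
```

For example, check_fastq_pairs /scratch/LOGIN/FASTQ prints the number of pairs it found in that directory.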
Practice 2. Check Reads Quality
FastQC performs simple quality control checks to make sure the raw data look good and to detect problems or biases that could affect downstream use. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Create the FASTQC directory in your scratch directory
mkdir FASTQC
cd FASTQC
Load the FastQC software (latest version)
module load bioinfo/FastQC/0.11.9
Run fastqc on all samples (this takes about 10 minutes)
fastqc /scratch/LOGIN/FASTQ/* -o /scratch/LOGIN/FASTQC/
Run MultiQC
MultiQC is a modular tool that aggregates results from bioinformatics analyses across many samples into a single report. Use it to visualise all the fastqc results. https://multiqc.info/
# load the module
module load bioinfo/multiqc/1.9
# launch MultiQC to create an html report gathering the information generated by fastqc for each fastq file
multiqc /scratch/LOGIN/FASTQC
Transfer fastqc and multiqc files
Now, transfer the results from : node -> nas -> computer
- Transfer from /scratch to NAS
scp -r FASTQC nas:/home/LOGIN/
- Transfer from NAS to your computer using scp or filezilla
scp -r LOGIN@bioinfo-nas.ird.fr:/home/LOGIN/FASTQC .
Open the multiqc report multiqc_report.html in your favorite web browser
Remove data in scratch directory
rm -rf /scratch/LOGIN/FASTQC/
Your turn !
- Launch the same analysis through a slurm script !
Practice 3. fastq cleaning
Using cutadapt to remove adapters and to trim reads based on quality
cutadapt -q 30,30 -m 35 -B GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG -B GTTCGTCTTCTGCCGTATGCTCTAGCACTACACTGACCTCAAGTCTGCACACGAGAAGGCTAG -b GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG -b GTTCGTCTTCTGCCGTATGCTCTAGCACTACACTGACCTCAAGTCTGCACACGAGAAGGCTAG -o P1_R1.CUTADAPT.fastq.gz -p P1_R2.CUTADAPT.fastq.gz P1_R1.fq.gz P1_R2.fq.gz
Cutadapt supports trimming of multiple types of adapters:
Adapter type | Command-line option |
---|---|
3’ adapter | -a ADAPTER |
5’ adapter | -g ADAPTER |
5’ or 3’ (both possible) | -b ADAPTER |
-q 30,30 : by default, only the 3’ end of each read is quality-trimmed. To trim the 5’ end as well, give the -q option two comma-separated cutoffs (5’ cutoff first, then 3’ cutoff).
-p is the short form of --paired-output. The option -B is used here to specify an adapter sequence that cutadapt should remove from the second read in each pair.
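To apply the same options to every pair of files, one possibility is to generate the command lines first and inspect them before running anything. The sketch below only prints the commands; print_cutadapt_commands is a hypothetical helper, ADAPTER is a placeholder for the real adapter sequences, and the _1/_2 input naming is assumed from the dataset.

```shell
# Print (do not run) one cutadapt command per FASTQ pair found in a directory.
# $1 = input directory, $2 = output directory, $3 = adapter sequence (placeholder by default).
print_cutadapt_commands() {
    local dir="$1" outdir="$2" adapter="${3:-ADAPTER}" r1 r2 sample
    for r1 in "$dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue                  # nothing matched: skip the literal pattern
        r2="${r1%_1.fastq.gz}_2.fastq.gz"         # mate file of the pair
        sample="$(basename "$r1" _1.fastq.gz)"    # sample name without the read suffix
        echo "cutadapt -q 30,30 -m 35 -b $adapter -B $adapter" \
             "-o $outdir/${sample}_R1.CUTADAPT.fastq.gz" \
             "-p $outdir/${sample}_R2.CUTADAPT.fastq.gz $r1 $r2"
    done
}
```

Once the printed lines look right, they can be saved to a file and executed, or placed in a slurm script.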
Practice 4 : Using the workflow manager TOGGLe to execute cutadapt and fastqc on a large number of samples
Data used for this practice
Input data for this practice are accessible in the directory /data2/formation/TP_read2count/.
- fastq files : RAW_DATA/FASTQ_RENAMED
- configuration file used by TOGGLe : scripts/TOGGLE_CONFIG/cleaning.config.txt
- slurm script used to launch our analysis workflow with TOGGLe : scripts/runTOGGLe_cleaning.sh
PATHTODATA="nas:/data2/formation/TP_read2count/"
* Input data : $PATHTODATA/RAW_DATA/FASTQ_RENAMED
* Config file: $PATHTODATA/scripts/TOGGLE_CONFIG/cleaning.config.txt
* Script : $PATHTODATA/scripts/runTOGGLe_cleaning.sh
Nb : You can also download a predefined TOGGLe configuration file such as fastqCheckQuality.config.txt and modify it.
Prepare your working environment
Opening an interactive bash session on a node via slurm
srun -p partition --pty bash -i
Create your working directory in the scratch partition (/scratch).
# Create a new directory in your /scratch
mkdir /scratch/$USER-TOGGLe
cd /scratch/$USER-TOGGLe/
mkdir cleaning
cd cleaning
Import the fastq files from $PATHTODATA into your /scratch directory and configure TOGGLe for the cleaning analysis
# declare where data is
PATHTODATA="nas:/data2/formation/TP_read2count/"
# Transfer the reads to clean into /scratch
scp -r $PATHTODATA/RAW_DATA/FASTQ_RENAMED .
Import TOGGLE configuration file into your /scratch directory
Copy the configuration file used by TOGGLe into the directory cleaning and display the content of this file
scp $PATHTODATA/scripts/TOGGLE_CONFIG/cleaning.config.txt .
[orjuela@nodeXX cleaning]$ more cleaning.config.txt
$order
1=cutadapt
2=fastqc
$fastqc
$cutadapt
-q 30,30
-m 35
-u 9
-B GATCGGAAGAGCACACGTCTGAACTCCAGTCACGGCTACATCTCGTATGC
-B ATCGGAAGAGCACACGTCTGAACTCCAGTCACGGCTACATCTCGTATGCC
-B GTCCATTATATGTCTCCCAAACCACCAAACTCTTTGACTCCGGTGTGTTG
-B GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCG
-b GATCGGAAGAGCACACGTCTGAACTCCAGTCACGGCTACATCTCGTATGC
-b ATCGGAAGAGCACACGTCTGAACTCCAGTCACGGCTACATCTCGTATGCC
-b GTCCATTATATGTCTCCCAAACCACCAAACTCTTTGACTCCGGTGTGTTG
-b GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCG
# PUT YOUR OWN SLURM CONFIGURATION HERE IF AVAILABLE RESSOURCES
$slurm
-p PARTITION
--nodelist=nodeXXX
# If your data are already in scratch don't activate this option
#$scp
#/scratch/
$env
module load bioinfo/TOGGLE-dev/0.3.7
Modify this configuration file
Using a text editor such as nano, modify the configuration file cleaning.config.txt and check all the parameters.
- Adapt the parameters of Cutadapt and FastQC if necessary.
- Change the slurm configuration (key $slurm) by adding the correct partition and node names :
  - -p PARTITION. The partition can be short, normal, highmem, etc.
  - --nodelist=nodeX (node used by slurm)
- If your fastq files are in a project space, activate the $scp key : TOGGLe will transfer data from your project space into the scratch directory of the node used by slurm.
Nb : If your data are already in /scratch, deactivate this option and activate the $env key, giving it the line module load bioinfo/TOGGLE-dev/0.3.7.
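As an alternative to editing by hand, the two $slurm placeholders can be filled in with sed. This is a sketch under the assumption that the file still contains the exact placeholder lines -p PARTITION and --nodelist=nodeXXX shown above; set_slurm_placeholders is a hypothetical helper name.

```shell
# Replace the PARTITION and nodeXXX placeholders of a TOGGLe configuration file.
# The result is written to standard output, so the original file is left untouched.
# $1 = config file, $2 = partition name, $3 = node name (neither may contain a "/").
set_slurm_placeholders() {
    local config="$1" partition="$2" node="$3"
    sed -e "s/^-p PARTITION\$/-p $partition/" \
        -e "s/^--nodelist=nodeXXX\$/--nodelist=$node/" "$config"
}
```

For instance, set_slurm_placeholders cleaning.config.txt short node25 > my.config.txt writes an edited copy; the partition and node names here are examples only, so use the ones that suit your job.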
Create a slurm script to launch TOGGLe
This script is also available in the /data2/formation/TP_read2count/scripts/runTOGGLe_cleaning.sh
#!/bin/bash -l
#SBATCH -J TOGGLeCleaning
#SBATCH --export=ALL
#SBATCH -e toggle."%j".err
#SBATCH -o toggle."%j".out
#SBATCH -p PARTITION
#SBATCH --nodelist=nodeXX
# Defining scratch and destination repertories
REP="/scratch/$USER-TOGGLe/cleaning"
dir="$REP/FASTQ_RENAMED"
out="$REP/OUTPUT_TOGGLE-CLEANING"
config="$REP/cleaning.config.txt"
# Software-specific settings exported to user environment
module load bioinfo/TOGGLE-dev/0.3.7
# running toggleGenerator
toggleGenerator.pl -d $dir -c $config -o $out --nocheck;
echo "TOGGLe cleaning finished ^^"
- Make runTOGGLe_cleaning.sh executable with
chmod +x runTOGGLe_cleaning.sh
Launch the script runTOGGLe_cleaning.sh through slurm
sbatch ./runTOGGLe_cleaning.sh
Check your jobs are correctly running
- Explore the TOGGLe output directory (OUTPUT_TOGGLE-CLEANING in the script above) and check that everything went well.
- Check whether your jobs are running with : squeue -u login -i 10
- Try the top command (keys 'c' and 'u')
Remove your data in the scratch directory
Practice 5 : Running Hisat2 and Stringtie with TOGGLe
Data used for this practice
Input data are available in the directory /data2/formation/TP_read2count :
- fastq files : RAW_DATA/FASTQ_RENAMED
- Genome sequence (fasta) : RAW_DATA/REF/GCF_000146045.2_R64_genomic.fna
- Genome annotation (gtf) : RAW_DATA/REF/GCF_000146045.2_R64_genomic.gtf
- configuration file used by TOGGLe : scripts/TOGGLE_CONFIG/RNASeqHisat2Stringtie.config.txt
- slurm script used to launch our analysis workflow with TOGGLe : scripts/runTOGGLeRNASEQ.sh
PATHTODATA="nas:/data2/formation/TP_read2count"
* Input data : $PATHTODATA/RAW_DATA/FASTQ_RENAMED
* Reference : $PATHTODATA/RAW_DATA/REF/GCF_000146045.2_R64_genomic.fna
* Annotation : $PATHTODATA/RAW_DATA/REF/GCF_000146045.2_R64_genomic.gtf
* config file : $PATHTODATA/scripts/TOGGLE_CONFIG/RNASeqHisat2Stringtie.config.txt
* Script to run TOGGLe: $PATHTODATA/scripts/runTOGGLeRNASEQ.sh
In this practice, we don't transfer data from the nas into /scratch manually! TOGGLe is able to manage the data transfer for you. Let's try it!
Prepare your analysis
Opening an interactive bash session on a node via slurm
srun -p partition --pty bash -i
Copy the data used for this practice into your own project directory (NAS, NAS2 or NAS3 servers).
# 1 Go to your project on NASX
ssh NASX
cd /dataX/projects/MyProject/
# 2 Create a working directory
mkdir TEST-TOGGLe
cd TEST-TOGGLe
# 3 Declare a variable : change it if needed
PATHTODATA="nas:/data2/formation/TP_read2count/"
# 4 Transfer the test READS and REF to NAS, NAS2 or NAS3
scp -r $PATHTODATA/RAW_DATA/FASTQ_RENAMED .
scp -r $PATHTODATA/RAW_DATA/REF .
# 5 Make a copy of the TOGGLe configuration file adapted to RNAseq
scp $PATHTODATA/scripts/TOGGLE_CONFIG/RNASeqHisat2Stringtie.config.txt .
# 6 Transfer the script used to run TOGGLe
scp $PATHTODATA/scripts/runTOGGLeRNASEQ.sh .
Adapt the configuration file used by TOGGLe
Using a text editor, adapt the file RNASeqHisat2Stringtie.config.txt and check all the parameters. This predefined configuration file can be obtained from RNASeqReadCount.config.txt.
- Change the SLURM key $slurm, giving -p PARTITION. The partition can be short, normal, highmem, etc.
- Complete the $scp key with /scratch. TOGGLe will use this key to copy data from a nas server onto a node (scratch partition), then run the jobs using the data in the scratch directory.
Nb : If your data are already in /scratch, comment out this key.
- Add, after the $env key, the line module load bioinfo/TOGGLE-dev/0.3.7.
- Check the parameters of every step in RNASeqHisat2Stringtie.config.txt against the recommendations of the StringTie manual : http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual
$ more RNASeqHisat2Stringtie.config.txt
$order
1=hisat2
2=samtoolsView
3=samtoolsSort
4=stringtie
$hisat2
--dta
$samtoolsview
-b
-h
$stringtie
-e
-B
$cleaner
1
2
$slurm
--job-name=TOGGLe
--partition normal
--export=ALL
$scp
/scratch/
$env
module load bioinfo/TOGGLE-dev/0.3.7
HISAT2 and StringTie in a few words :
Before mapping, a reference genome index is required. TOGGLe automatically creates the genome index if the index files are absent from the reference folder.
Mapping is performed using HISAT2 and mapped reads are assembled into transcripts with StringTie. Transcript assemblies can be done with or without a reference annotation.
Stringtie is able to estimate the expression levels of the "reference" transcripts provided and generate a matrix of read counts mapped to particular genomic features (e.g., genes).
Create a slurm script to launch your analysis - runTOGGLeRNASEQ.sh
This script is also available in the data directory used for this practice.
$ more runTOGGLeRNASEQ.sh
#!/bin/bash -l
#SBATCH -J TOGGLeRNASeq
#SBATCH --export=ALL
#SBATCH -e toggle."%j".err
#SBATCH -o toggle."%j".out
#SBATCH -p PARTITION
# Defining my project and destination repertories
REP="/dataX/projects/MyProject/TEST-TOGGLe"
dir="$REP/FASTQ_RENAMED"
out="$REP/OUTPUT_TOGGLE-H2SSM"
config="$REP/RNASeqHisat2Stringtie.config.txt"
ref="$REP/REF/GCF_000146045.2_R64_genomic.fna"
gff="$REP/REF/GCF_000146045.2_R64_genomic.gtf"
# Software-specific settings exported to user environment
module load bioinfo/TOGGLE-dev/0.3.7
# running toggleGenerator
toggleGenerator.pl -d $dir -c $config -r $ref -g $gff -o $out --report --nocheck;
echo "TOGGLe RNASEQ finished ^^"
Launch the TOGGLE analysis - runTOGGLeRNASEQ.sh
Make runTOGGLeRNASEQ.sh executable with chmod +x
chmod +x ./runTOGGLeRNASEQ.sh
Launch runTOGGLeRNASEQ.sh in sbatch mode
sbatch ./runTOGGLeRNASEQ.sh
Check your jobs are correctly running
- Explore the output directory OUTPUT_TOGGLE-H2SSM and check that everything went well.
- Check the finalResults directory and look at the .gtf files.
Convert GTF files into count files
Create a samples_gtf.txt file
Go to the finalResults directory and create a samples_gtf.txt file as follows :
cd /dataX/projects/MyProject/TEST-TOGGLe/OUTPUT_TOGGLE-H2SSM/finalResults
ls *gtf | sed 's/\.STRINGTIE\.gtf$//' > names.txt
realpath *gtf > paths.txt
paste names.txt paths.txt > samples_gtf.txt
rm names.txt paths.txt
samples_gtf.txt now looks like this (no header) :
Batch-rep1 /dataX/projects/MyProject/TEST-TOGGLe/OUTPUT_TOGGLE-H2SSM/finalResults/Batch-rep1.STRINGTIE.gtf
Batch-rep2 /dataX/projects/MyProject/TEST-TOGGLe/OUTPUT_TOGGLE-H2SSM/finalResults/Batch-rep2.STRINGTIE.gtf
Batch-rep3 /dataX/projects/MyProject/TEST-TOGGLe/OUTPUT_TOGGLE-H2SSM/finalResults/Batch-rep3.STRINGTIE.gtf
CENPK-rep1 /dataX/projects/MyProject/TEST-TOGGLe/OUTPUT_TOGGLE-H2SSM/finalResults/CENPK-rep1.STRINGTIE.gtf
CENPK-rep2 /dataX/projects/MyProject/TEST-TOGGLe/OUTPUT_TOGGLE-H2SSM/finalResults/CENPK-rep2.STRINGTIE.gtf
CENPK-rep3 /dataX/projects/MyProject/TEST-TOGGLe/OUTPUT_TOGGLE-H2SSM/finalResults/CENPK-rep3.STRINGTIE.gtf
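The same two-column file can also be produced in a single loop, without the temporary names.txt and paths.txt files. This is a sketch assuming the .STRINGTIE.gtf suffix shown above; make_samples_gtf is a hypothetical helper name.

```shell
# Print "sample name <TAB> absolute path" for every StringTie GTF file in a directory.
make_samples_gtf() {
    local dir="$1" gtf
    for gtf in "$dir"/*.STRINGTIE.gtf; do
        [ -e "$gtf" ] || continue     # nothing matched: skip the literal pattern
        printf '%s\t%s\n' "$(basename "$gtf" .STRINGTIE.gtf)" "$(realpath "$gtf")"
    done
}
```

From the finalResults directory, make_samples_gtf . > samples_gtf.txt writes the file in the same layout as the listing above.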
Convert GTF to counts
module load system/python/3.7.2
python3 /data2/formation/TP_read2count/scripts/prepDE.py3 -i samples_gtf.txt
gene_count_matrix.csv and transcript_count_matrix.csv have been created!
You can use DESeq2 and edgeR for analyzing differential expression in R or some shiny packages such as DIANE https://oceanecsn.github.io/DIANE/ or PIVOT https://github.com/kimpenn/PIVOT
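Before moving to R, a quick sanity check is to sum the counts of each sample in the matrix. The sketch below assumes the matrix is a comma-separated file whose first line is a header (gene identifier, then one column per sample); sum_counts_per_sample is a hypothetical helper, not part of prepDE.

```shell
# Print the total count of each sample column of a comma-separated count matrix.
# Assumes a header line and the gene identifier in the first column.
sum_counts_per_sample() {
    awk -F',' '
        NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; n = NF; next }
        { for (i = 2; i <= n; i++) total[i] += $i }
        END { for (i = 2; i <= n; i++) printf "%s\t%d\n", name[i], total[i] }
    ' "$1"
}
```

A sample whose total is far lower than the others is worth investigating before any differential expression analysis.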
TIP
TIP 1 : Renaming read files
Several ways exist to rename files. We propose here a solution based on a csv file. We will use symbolic links to avoid changing the original names, and follow the naming convention defined in TOGGLe : https://toggle.ird.fr/manual/quickManual/
[orjuela@master0 FASTQ]$ pwd
/data2/formation/TP_read2count/RAW_DATA/FASTQ
[orjuela@master0 FASTQ]$ realpath *gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453566_1.fastq.gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453566_2.fastq.gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453567_1.fastq.gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453567_2.fastq.gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453568_1.fastq.gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453568_2.fastq.gz
...
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453578_1.fastq.gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453578_2.fastq.gz
[orjuela@master0 FASTQ]$ realpath *gz > ../names_tmp.txt
names_tmp.txt can be edited, in Excel for example, to add a second column with the new name, and saved as a new file (named file4renaming.csv for example) :
[orjuela@master0 RAW_DATA/]$ more file4renaming.csv
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453566_1.fastq.gz Batch-rep1_R1.fastq.gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453566_2.fastq.gz Batch-rep1_R2.fastq.gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453567_1.fastq.gz Batch-rep2_R1.fastq.gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453567_2.fastq.gz Batch-rep2_R2.fastq.gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453568_1.fastq.gz Batch-rep3_R1.fastq.gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453568_2.fastq.gz Batch-rep3_R2.fastq.gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453569_1.fastq.gz CENPK-rep1_R1.fastq.gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453569_2.fastq.gz CENPK-rep1_R2.fastq.gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453570_1.fastq.gz CENPK-rep2_R1.fastq.gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453570_2.fastq.gz CENPK-rep2_R2.fastq.gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453571_1.fastq.gz CENPK-rep3_R1.fastq.gz
/data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453571_2.fastq.gz CENPK-rep3_R2.fastq.gz
Now we will use this file4renaming.csv to create symbolic links.
# create and enter the new FASTQ directory
mkdir FASTQ_RENAMED
cd FASTQ_RENAMED
# first check that the command lines are correct before running them
while read -r line; do echo "ln -s $line"; done < ../file4renaming.csv
# once everything looks right, create the links
while read -r line; do ln -s $line; done < ../file4renaming.csv
ln -s /data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453566_1.fastq.gz SRR453566_R1.fastq.gz
ln -s /data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453566_2.fastq.gz SRR453566_R2.fastq.gz
ln -s /data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453567_1.fastq.gz SRR453567_R1.fastq.gz
ln -s /data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453567_2.fastq.gz SRR453567_R2.fastq.gz
ln -s /data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453568_1.fastq.gz SRR453568_R1.fastq.gz
ln -s /data2/formation/TP_read2count/RAW_DATA/FASTQ/SRR453568_2.fastq.gz SRR453568_R2.fastq.gz
# verify links by using ls -l command
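A slightly more defensive variant of the loop reads the two columns into separate variables and quotes them. This is a sketch assuming file4renaming.csv has two whitespace-separated columns (original path, new name) as shown above; link_renamed_fastq is a hypothetical helper name.

```shell
# Create one symbolic link per line of a two-column table (original path, new name).
# Links are created in the current directory.
link_renamed_fastq() {
    local table="$1" src new
    while read -r src new; do
        [ -n "$src" ] || continue     # skip empty lines
        ln -s "$src" "$new"
    done < "$table"
}
```

From inside FASTQ_RENAMED, the call would be link_renamed_fastq ../file4renaming.csv, followed by ls -l to check the links.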
TIP 2 : scripts
Slurm script to run fastqc and multiqc
A first script with all commands used for practice 2 : run_fastqc-multiqc.sh
- The command to execute this script
sbatch /data2/formation/TP_read2count/scripts/run_fastqc-multiqc.sh
- Before running this script, don't forget to update it (e.g. the email address used by slurm). The script runs with the practice data; update the READS variable if you want to use it with other datasets.
```
#!/bin/bash
######## SLURM CONFIGURATION ##
# Job name
#SBATCH --job-name=FastQC
# Partition (and optionally the node) where the job runs
#SBATCH -p normal
####SBATCH -w nodeX
# Number of cores
#SBATCH -c 2
# Email address for notifications
#SBATCH --mail-user=PUT_YOUR_EMAIL@ird.fr
# Type of notification to receive:
# BEGIN, END, FAIL or ALL
#SBATCH --mail-type=ALL
############################################
READS="nas:/data2/formation/TP_read2count/RAW_DATA/FASTQ/"
# Print the node where the job is running
echo "###### Running on node: $HOSTNAME"
# Create a personal directory in the scratch of the node
cd /scratch
mkdir $USER-$SLURM_JOB_ID
cd $USER-$SLURM_JOB_ID
# Fetch the data from the nas
echo "##### Copying the reads to the node with scp ..."
echo "## scp -r $READS /scratch/$USER-$SLURM_JOB_ID"
scp -r $READS /scratch/$USER-$SLURM_JOB_ID
## Create a FastQC directory
mkdir FastQC
## Load the software
module load bioinfo/FastQC/0.11.9
# Run the analysis
echo "#### Launching FastQC ..."
cd FastQC
echo "## fastqc /scratch/$USER-$SLURM_JOB_ID/FASTQ/* -o /scratch/$USER-$SLURM_JOB_ID/FastQC -t 2"
fastqc /scratch/$USER-$SLURM_JOB_ID/FASTQ/* -o /scratch/$USER-$SLURM_JOB_ID/FastQC -t 2
# Load the MultiQC module
module load bioinfo/multiqc/1.9
echo "#### Launching MultiQC ..."
multiqc /scratch/$USER-$SLURM_JOB_ID/FastQC
# Copy the results back
echo '#### Transferring the results to the nas ...'
echo "## scp -r FastQC $READS"
scp -r /scratch/$USER-$SLURM_JOB_ID/FastQC $READS
# Remove the data from scratch (uncomment to clean up)
#cd /scratch
#rm -rf $USER-$SLURM_JOB_ID
# Job statistics
seff $SLURM_JOB_ID
```
An improved script with all commands used for practice 2 : run_fastqc-multiqc-arg.sh
- The command to execute this script
sbatch /data2/formation/TP_read2count/scripts/run_fastqc-multiqc-arg.sh PUT_path2reads PUT_output_directory_name
```
#!/bin/bash
#SBATCH -p normal
#SBATCH -J fastqc-multiqc
#SBATCH -c 4
if [[ -z $1 ]]; then
echo "this script expects 2 arguments : pathtoreads and outputname";
exit 1;
fi
if [[ -z $2 ]]; then
echo "this script expects outputname as its 2nd argument";
exit 1;
fi
# load the modules
#module load bioinfo/FastQC/0.11.9
module load bioinfo/FastQC/0.11.5
module load bioinfo/multiqc/1.9
# ==> variable declarations
# absolute path of the directory containing the reads
DIRFASTQ=$(realpath $1)
TMP=/scratch/$USER-readsquality
READS=$TMP/READS
OUTFASTQC=$TMP/OUTPUTFASTQC
OUTMQC=$TMP/$2
FINAL=$(dirname $DIRFASTQ)
# create a working directory in the scratch of the node
mkdir -p $TMP
# create a directory for the reads
mkdir -p $READS
# create a directory for the FastQC output
mkdir -p $OUTFASTQC
# create a directory for the MultiQC output
mkdir -p $OUTMQC
# copy the reads into the temporary READS directory
echo "=> Transferring the reads from the project to /scratch ..."
echo "scp nas:$DIRFASTQ/*gz $READS"
scp nas:$DIRFASTQ/*gz $READS
# FastQC
echo "The fastq files are in : $READS on the node"
echo "=> Launching FastQC ..."
echo "fastqc $READS/* -o $OUTFASTQC -t 4"
fastqc $READS/* -o $OUTFASTQC -t 4
# MultiQC
echo "=> Launching MultiQC ..."
echo "multiqc $OUTFASTQC --outdir $OUTMQC"
multiqc $OUTFASTQC --outdir $OUTMQC
# copy the results back
echo "=> Transferring the results back to the project ..."
echo "scp -r $OUTMQC nas:$FINAL/"
scp -r $OUTMQC nas:$FINAL/
# remove the data from /scratch
echo "=> Removing the data from /scratch . bye!!!"
rm -rf $TMP
```
TIP 3: How to choose a node to execute my analysis
Links
slides https://southgreenplatform.github.io/trainings//files/RNAseq_ouaga_2019_10102019-short.pdf
Practice : https://bioinfo.ird.fr/index.php/tutorials-fr/rnaseq1/