bioinfo@ird.fr

Tutorials – Polymorphism detection

Name Polymorphism detection from fastq files on a linux terminal.
Description This page describes a serie of tools and linux commands used to detect polymorphism from raw data (fastq files) to polymorphism file (vcf).
Authors christine Tranchant-Dubreuil (christine.tranchant@ird.fr)
Creation Date 10/03/2017
Last Modified Date 06/08/2018

We need, in this tutorial:

  • 2 fastq files paired
  • a reference file used for the mapping step.

Keywords

fastq-stats, fastqc, cutadapt, bwa, samtools, gatk, picardtools

Files format

fastq, sam, bam, vcf

Summary

Getting basic informations about FASTQ files

How to get fastq file size? du -sh

[tranchant@master0 ~]$ du -sh *.fastq
311M    C3KB2ACXX_5_12_11_debar.fastq
525M    C3KB2ACXX_5_12_12_debar.fastq

How to get sequences number by fastq files (uncompressed files only)?

[tranchant@master0 ~]$ wc -l *.fastq | awk '{ print $2" \t "$1/4}'
C3KB2ACXX_5_12_11_debar.fastq    1354891
C3KB2ACXX_5_12_12_debar.fastq    2249353

Getting quickly a report with various statistics about fastq file with ea-utils

This software runs quickly, faster than fastqc and the output can be parsed and formatted with some basics linux command
ea-utils website

How to use fastq-stats with one fastq file
[tranchant@node11 FASTQ]$ fastq-stats -D PdFIE94_R1.fq.gz
reads   15034838
len 144
len mean    144.0000
len stdev   0.0000
len min 144
phred   33
window-size 2000000
cycle-max   35
qual min    2
qual max    42
qual mean   40.2191
qual stdev  4.8129
%A  25.5314
%C  26.3730
%G  22.4570
%T  25.6209
%N  0.0178
total bases 2165016672
How to use fastq-stats on several fastq files
[tranchant@node2 tranchant]$ ls
DV.UNMAPPED_1.fastq              DV.UNMAPPED_2.fastq
DT.UNMAPPED_1.fastq              DT.UNMAPPED_2.fastq

# you can execute every fastq-stats command by using for loop (bash)
[tranchant@node2 tranchant]$ for file in *fastq; do fastq-stats -D $file > $file.fastq-stats ; done;

#Check that the files have been created
[tranchant@node2 tranchant]$ ls -l *stats
-rw-r--r-- 1 tranchant ggr 246  8 juin  14:55 DT.UNMAPPED_1.fastq.fastq-stats
-rw-r--r-- 1 tranchant ggr 247  8 juin  14:55 DT.UNMAPPED_2.fastq.fastq-stats
-rw-r--r-- 1 tranchant ggr 246  8 juin  14:55 DV.UNMAPPED_1.fastq.fastq-stats
-rw-r--r-- 1 tranchant ggr 246  8 juin  14:55 DV.UNMAPPED_2.fastq.fastq-stats

#check one file
[tranchant@node2 tranchant]$ cat DT.UNMAPPED_1.fastq.fastq-stats
reads   11043989
len 101
len mean    99.3377
len stdev   7.4780
len min 30
phred   33
window-size 2000000
cycle-max   35
qual min    2
qual max    41
qual mean   37.7341
qual stdev  3.8379
%A  29.2366
%C  20.8381
%G  20.7269
%T  29.1967
%N  0.0017
total bases 1097084447
How to get a tab file from all files generated by the command ls | cut | sed | sh
# To generate a header
[tranchant@node11 FASTQ]$ cut -f1 PdFIE94_R1.fq.gz.fastq-stats | paste -s - - > all.stats

# To display header file
[tranchant@node11 FASTQ]$ head all.stats
reads   len len mean    len stdev   len min phred   window-size cycle-max   qual min    qual max    qual mean   qual stdev  %A  %C  %G  %T  %N  total bases

# To parse every file and to create a tab-delimited file
[tranchant@node11 FASTQ]$ awk 'BEGIN { num=0;  }  { num++; if (num == 1) { print FILENAME }  print $NF; if (num == 18) {print "\n"; num=0; } } END {  } ' *fastq-stats   |   awk ' BEGIN { RS="\n\n"; ;} { print $1,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19;}' >> all.stats

# Display 10 first lines of the file all.stats
[tranchant@node11 FASTQ]$ head all.stats
reads   len len mean    len stdev   len min phred   window-size cycle-max   qual min    qual max    qual mean   qual stdev  %A  %C  %G  %T  %N  total bases
PdFIE94_R1.fq.gz.fastq-stats 144 144.0000 0.0000 144 33 2000000 35 2 42 40.2191 4.8129 25.5314 26.3730 22.4570 25.6209 0.0178 2165016672
PdFIE94_R2.fq.gz.fastq-stats 151 151.0000 0.0000 151 33 2000000 35 2 42 34.8602 10.4600 26.6998 23.6108 22.8715 26.4631 0.3548 2270260538
PdFIE95_R1.fq.gz.fastq-stats 143 143.0000 0.0000 143 33 2000000 35 2 42 40.1422 4.9354 25.2075 26.6116 22.5737 25.5869 0.0202 1873211197
PdFIE95_R2.fq.gz.fastq-stats 151 151.0000 0.0000 151 33 2000000 35 2 42 35.0387 10.3787 26.5003 24.0084 23.0377 26.0990 0.3546 1978006229
PdFIE96_R1.fq.gz.fastq-stats 143 143.0000 0.0000 143 33 2000000 35 2 42 40.1133 4.9931 25.0246 26.6680 22.6115 25.6756 0.0204 1488540911
PdFIE96_R2.fq.gz.fastq-stats 151 151.0000 0.0000 151 33 2000000 35 2 42 34.9494 10.4096 26.5905 24.0838 23.1924 25.7780 0.3554 1571815927
PdFIE98_R1.fq.gz.fastq-stats 143 143.0000 0.0000 143 33 2000000 35 2 42 40.2191 4.8173 25.3874 26.4649 22.4244 25.7031 0.0201 1443785200
PdFIE98_R2.fq.gz.fastq-stats 151 151.0000 0.0000 151 33 2000000 35 2 42 35.0267 10.3824 26.6316 23.8662 22.8668 26.2791 0.3564 1524556400

Getting various statistics about fastq and performing a quality control check with fastqc

  • to perform some simple quality control checks to ensure that the raw data looks good and there are no problems or biases in data which may affect how user can usefully use it. fastqc website
[tranchant@master0 ~]$ fastqc *.fastq -o .
[tranchant@master0 ~]$ ls
total 543577236
drwxr-xr-x 4 tranchant ggr                4096  9 mars  22:11 PdFIE94_R1.fq_fastqc
-rw-r--r-- 1 tranchant ggr              225516  9 mars  22:11 PdFIE94_R1.fq_fastqc.zip

FASTQ cleaning

Using cutadapt to remove adapters and to trim reads based on quality

cutadapt  -q 30,30 -m 35  -B GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG -B GTTCGTCTTCTGCCGTATGCTCTAGCACTACACTGACCTCAAGTCTGCACACGAGAAGGCTAG -b GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG -b GTTCGTCTTCTGCCGTATGCTCTAGCACTACACTGACCTCAAGTCTGCACACGAGAAGGCTAG-o P1_R1.CUTADAPT.fastq.gz -p P1_R2.CUTADAPT.fastq.gz P1_R1.fq.gz P1_R2.fq.gz
  • Cutadapt supports trimming of multiple types of adapters:
Adapter type Command-line option
3’ adapter -a ADAPTER
5’ adapter -g ADAPTER
5’ or 3’ (both possible) -b ADAPTER

-q 30, 30 : by default, only the 3’ end of each read is quality-trimmed. If you want to trim the 5’ end as well, use the -q option with two comma-separated cutoffs

-p is the short form of --paired-output. The option -B is used here to specify an adapter sequence that cutadapt should remove from the second read in each pair.


Mapping step with bwa

Creating an index of the reference genome with bwa index

bwa index -a is

is for short genome
bwtsw for genome >2Gb

Performing mapping with bwa aln and bwa sampe/samse

Getting sai files with bwa aln
bwa aln reference file_forward.fastq > file_forward.sai
bwa aln reference file_reverse.fastq > file_reverse.sai
Getting sam file with bwa sampe (paired)

Here is a description for the contents of the SAM file: https://samtools.github.io/hts-specs/SAMv1.pdf

bwa sampe reference file_forward.sai file_reverse.sai  file_forward.fastq file_reverse.fastq -f file.sam  -r '@RG        ID:RG  SM:RG  PL:Illumina'
Getting sam file with bwa samse (single)
bwa samse reference file.sai file.fastq file_reverse.fastq -f file.sam  -r '@RG        ID:RG  SM:RG  PL:Illumina'

Performing mapping with bwa mem

bwa mem  reference file.fastq  file_reverse.fastq  -R '@RG\tID:RC3\tSM:RC3\tPL:Illumina' > file.sam

Processing sam file with picardtools and samtools

Creating Reference dictionary with picardtools CreateSequenceDictionary

/usr/bin/java -Xmx8g  -jar /us/local/picard-tools-2.5.0/picard.jar CreateSequenceDictionary REFERENCE=Reference.fasta OUTPUT=Reference.dict

Creating Reference indexes with samtools faidx

samtools faidx Rreference.fasta

Converting the sam file into a bam file and sorting the bam file with picardtools SortSam

/usr/bin/java -Xmx12g -jar /usr/local/picard-tools-2.5.0/picard.jar SortSam  CREATE_INDEX=TRUE VALIDATION_STRINGENCY=SILENT SORT_ORDER=coordinate  INPUT=file.BWASAMPE.sam OUTPUT=file.PICARDTOOLSSORT.bam

The bam index (.bai) is created auomtically by picardtools (samtools index command unnecessary)

Getting some stats from bam file with samtools flagstat

#command
$samtools flagstat file.PICARDTOOLSSORT.bam

#output
14524094 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplimentary
0 + 0 duplicates
11219302 + 0 mapped (77.25%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 prorubyy paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Getting some stats from bam file with samtools idxstats

#command
$samtools idxstats file.PICARDTOOLSSORT.bam

#The output is TAB-delimited with each line consisting of reference sequence name, sequence length, # mapped reads and # unmapped reads. #It is written to stdout.
Chromosome_8.1  7978604 2274010 0
Chromosome_8.2  8319966 2274086 0
Chromosome_8.3  6606598 1774006 0
Chromosome_8.4  5546968 1483879 0
Chromosome_8.5  4490059 1217779 0
Chromosome_8.6  4133993 1132535 0
Chromosome_8.7  3415785 945505  0
Chromosome_8.8  535760  117502  0
*   0   0   3304792

Getting coverage from bam file with bedtools coverage

#command
# -bg option : reports in BEDGRAPH format only for those regions of the genome that actually have coverage. 

# This option produces genome-wide coverage output in BEDGRAPH format much more concise since consecutive positions with the same coverage are reported as a single output line describing the start and end coordinate of the interval having the coverage level, followed by the coverage level itself.

# -bga option : a complete report including the regions with zero coverage.
$bedtools genomecov -ibam AAOSW.PICARDTOOLSSORT.bam -g /scratch/tranchant/referenceFiles/OglaRS2ADWL02.fa -bga > AAOSW.PICARDTOOLSSORT.bam.BEDTOOLSGENOMECOV

#The output is TAB-delimited with each line consisting of reference sequence name, start position, end position,  coverage level
$less AAOSW.PICARDTOOLSSORT.bam.BEDTOOLSGENOMECOV
Chr01   0       2       1
Chr01   2       7       3
Chr01   7       13      6
Chr01   13      14      8
Chr01   14      15      9
Chr01   15      16      11
Chr01   16      18      12
Chr01   18      19      13
Chr01   19      23      14
Chr01   23      29      16
Chr01   29      32      18
Chr01   32      33      19
Chr01   33      34      20
Chr01   34      36      21
Chr01   36      38      22
Chr01   38      39      24
Chr01   39      40      26

Remove multimapping and incorrectly mapping reads with samtools view

# Reads correctly mapped extracted
samtools view -h -b -f=0*02 -o file.SAMTOOLSVIEW-MAPPED.bam file.PICARDTOOLSSORT.bam

# Unmapped reads extracted
samtools view -h -b -F=0*02 -o file.SAMTOOLSVIEW-UNMAPPED.bam file.PICARDTOOLSSORT.bam

Creating index of the last bam generated with samtools index

#Command
samtools index file.SAMTOOLSVIEW-MAPPED.bam

Adding read group with picardtools AddOrReplaceReadGroups

... if they haven't been added precedently

#Command
/usr/bin/java -Xmx12g -jar /usr/local/picard-tools-2.5.0/picard.jar AddOrReplaceReadGroups INPUT=20135.SAMTOOLSVIEW-MAPPED.bam OUTPUT=20135.PICARDTOOLSADDORREPLACEREADGROUPS.bam RGPL=Illumina RGLB=20135 RGPU=20135 RGSM=20135 CREATE_INDEX=True VALIDATION_STRINGENCY=LENIENT > 20135.PICARDTOOLSADDORREPLACEREADGROUPS.log
  • RGPL= Read Group Platform (i.e. Illumina)
  • RGLB= Read Group Library (i.e. DNA preparation library identity i.e "HG19")
  • RGPU= Read Group Platform Unit (i.e Barcode)
  • RGSM= Read Group Sample Name (i.e. Sample Name)

</a

Marking duplicates with picardtools MarkDuplicates

... if it is necessary

#Command
/usr/bin/java -Xmx12g -jar /usr/local/picard-tools-2.5.0/picard.jar MarkDuplicates INPUT=20135.SAMTOOLSVIEW-MAPPED.bam OUTPUT=20135.PICARDTOOLSMARKDUPLICATES.bam METRICS_FIL=metrics.txt CREATE_INDEX=True VALIDATION_STRINGENCY=LENIENT > 20135.PICARDTOOLSADDORREPLACEREADGROUPS.log

Local realignment around INDELS using GATK

Creating a target list of intervals which need to be realigned with gatk realignerTargetCreator

#command
/usr/bin/java -Xmx12g -jar /usr/local/gatk-3.7/GenomeAnalysisTK.jar -T RealignerTargetCreator  -R reference -I file.bam -o file.GATKREALIGNERTARGETCREATOR.intervals

Performing realignment of the target intervals with gatk indelRealigner

#Command
/usr/bin/java -Xmx12g -jar /usr/local/gatk-3.7/GenomeAnalysisTK.jar -T IndelRealigner  -R reference -I file.bam -targetIntervals file.GATKREALIGNERTARGETCREATOR.intervals -o file.GATKINDELREALIGNER.bam

SNP/INDEL calling

with GATK haplotypeCaller

/usr/bin/java -Xmx12g -jar /usr/local/gatk-3.7/GenomeAnalysisTK.jar -rf BadCigar  -T HaplotypeCaller  -R reference -I file-ind1.PICARDTOOLSMARKDUPLICATES.bam -I  file-ind2.PICARDTOOLSMARKDUPLICATES.bam  -o file.GATKHAPLOTYPECALLER.vcf

with GATK unifiedGenotyper

/usr/bin/java -Xmx12g -jar /usr/local/gatk-3.7/GenomeAnalysisTK.jar -rf BadCigar   -T UnifiedGenotyper -R reference -I file.bam  -o file.GATKUNIFIEDGENOTYPER.vcf

Basic Filtering vcf files using GATK

with gatk variantFiltration

/usr/bin/java -Xmx12g -jar /usr/local/gatk-3.7/GenomeAnalysisTK.jar -T VariantFiltration --filterName 'FILTER-DP' --filterExpression 'DP<10 || DP>600' --filterName 'LowQual' --filterExpression 'QUAL<30'   -R /reference -o file.GATKVARIANTFILTRATION.vcf --variant file.GATKHAPLOTYPECALLER.vcf

with gatk selectVariants

/usr/bin/java -Xmx12g -jar /usr/local/gatk-3.7/GenomeAnalysisTK.jar -selectType SNP -T SelectVariants  -R reference --variant file.GATKVARIANTFILTRATION.vcf -o file.GATKSELECTVARIANT.vcf

More informations

NGS workflow Managers and linux trainings realized by the South Green Platform


License

The resource material is licensed under the Creative Commons Attribution 4.0 International License (here).