bioinfo@ird.fr

How to manage my data after an analysis?

Description: NGS files (raw or generated by an analysis) can be enormous, often ranging from tens to hundreds of gigabytes. Storing them takes a lot of disk space and requires significant hardware resources. Compressing NGS data is therefore a natural way to reduce the cost of the storage infrastructure, and it can speed up analyses.
Authors: Christine Tranchant-Dubreuil and Julie Orjuela - DIADE, IRD
Creation Date: 09/02/2022
Last Modified Date: 13/02/2022


Some NGS file formats

Almost all the high-throughput sequencing data you will deal with come in one of the following formats: fastq (or fq), sam, bam or vcf.

Format | Description | Used by
fastq | a text-based format that stores a nucleotide sequence and its corresponding quality sequence, generally delivered by the sequencing providers | QC tools, cleaning tools, mapping tools
sam | (Sequence Alignment/Map format) a tab-delimited text file that contains sequence alignment data, generally produced by mapping fastq sequences to a reference | samtools
bam | binary version of a SAM file (saves storage space and is faster to manipulate) | samtools, gatk
vcf | (Variant Call Format) a text format used to describe single nucleotide variants (SNVs) as well as insertions, deletions, and other sequence variations | vcftools, BCFtools

Don't forget to compress your large NGS files, including VCF, FASTQ, and SAM files. File compression is a simple way to store the same content in a smaller file.

How to get the size of my files or directories?

A directory

You can use the du (disk usage) command. To get the size of a specific directory, use du -h path2dir, which reports every subdirectory, or du -sh path2dir, which prints only the total:

nas3$ du -h /data3/projects/riceAnnot/
1,1G    /data3/projects/riceAnnot/abyssContigsFiltered/Obarthii
861M    /data3/projects/riceAnnot/abyssContigsFiltered/Oglaberrima
2,0G    /data3/projects/riceAnnot/abyssContigsFiltered
...
16G     /data3/projects/riceAnnot/VCFs
33G     /data3/projects/riceAnnot/ALLVCFs/tmp
38G     /data3/projects/riceAnnot/ALLVCFs
58G     /data3/projects/riceAnnot/
[tranchant@master0 scratch-scripts]$ du -sh /data3/projects/riceAnnot/
58G     /data3/projects/riceAnnot/

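To spot which subdirectories take the most space, the du output can be limited to one level and sorted by size. The sketch below is only an illustration: it assumes GNU coreutils (du --max-depth and sort -h), as found on most Linux clusters, and the helper name biggest_dirs and the demo directory are hypothetical.

```shell
# biggest_dirs DIR [N]: print the N largest entries directly under DIR
# (hypothetical helper; assumes GNU du and GNU sort).
biggest_dirs() {
    dir="${1:-.}"
    n="${2:-10}"
    # --max-depth=1 reports DIR and its immediate subdirectories only;
    # sort -rh orders human-readable sizes (K, M, G) from largest to smallest.
    du -h --max-depth=1 "$dir" | sort -rh | head -n "$n"
}

# Small demo on a throwaway directory
demo=$(mktemp -d)
mkdir -p "$demo/small" "$demo/big"
head -c 4096 /dev/zero > "$demo/small/a.txt"
head -c 4194304 /dev/zero > "$demo/big/b.txt"
biggest_dirs "$demo" 3
```

On a real project, biggest_dirs /data3/projects/riceAnnot/ 5 would list the five largest entries, the project total included.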
My projects on the cluster

To list the size of all your projects on nas, nas2 and nas3, use the following script:

nas3$ /opt/scripts/Data/project.sh 

### OWNER - project1 1,5T   /data/projects/project1
### OWNER - project2 352G   /data2/projects/project2

----------------- Projects as participants
toggle
irigin
====> TOTAL : 2

Be careful: running this script can be time-consuming, depending on the number of projects and their size!

How to know if I have fastq, sam or vcf files in my project directory?

Use the following script: /opt/scripts/Data/searchFile.sh path2project extension

$ /opt/scripts/Data/searchFile.sh /data3/projects/myproject/ vcf
    final_biallelic_random_10perc.vcf
    final_chr01_C013_C034.recode.vcf
    Filtered_biallelic.recode.vcf
    final_chr01.recode.vcf
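The internals of searchFile.sh are not shown here. On a machine where the script is not available, a comparable listing can be sketched with the standard find command. The function name find_by_ext and the demo files are hypothetical, and -iname assumes GNU or BSD find.

```shell
# find_by_ext DIR EXT: list files under DIR whose name ends with .EXT
# (case-insensitive), one path per line, sorted.
# Hypothetical helper, not the cluster script.
find_by_ext() {
    dir="$1"
    ext="$2"
    find "$dir" -type f -iname "*.${ext}" | sort
}

# Demo on a throwaway directory
proj=$(mktemp -d)
mkdir -p "$proj/calls"
touch "$proj/calls/chr01.vcf" "$proj/calls/chr02.vcf.gz" "$proj/reads.fastq"
find_by_ext "$proj" vcf      # uncompressed VCF files only
```

Because the pattern matches the end of the file name, already compressed files such as chr02.vcf.gz are not reported, which makes the listing a convenient to-do list of files that still need compressing.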

How to compress fastq or vcf files?

List the content of the directory that contains the fastq files
bash-4.2# ls -lh
-rwxr-xr-x 1 root formation 514M  4 févr. 13:32 SRR453571_1.fastq.gz
-rwxr-xr-x 1 root formation 541M  4 févr. 13:32 SRR453571_2.fastq.gz
-rwxr-xr-x 1 root formation 1,3G  4 févr. 13:32 SRR453578_1.fastq.gz
-rwxr-xr-x 1 root formation 4,9G  4 févr. 13:33 SRR453578_2.fastq
bash-4.2# du -h SRR453578_2.fastq
8,1G    SRR453578_2.fastq
Compress a fastq file with the command gzip -9
bash-4.2# du -h SRR453578_2.fastq
8,1G    SRR453578_2.fastq

bash-4.2# gzip -9 SRR453578_2.fastq

bash-4.2# ls -lh
total 9,3G
-rwxr-xr-x 1 root formation 514M  4 févr. 13:32 SRR453571_1.fastq.gz
-rwxr-xr-x 1 root formation 541M  4 févr. 13:32 SRR453571_2.fastq.gz
-rwxr-xr-x 1 root formation 1,3G  4 févr. 13:32 SRR453578_1.fastq.gz
-rwxr-xr-x 1 root formation 1,3G  4 févr. 13:33 SRR453578_2.fastq.gz

bash-4.2# du -h SRR453578_2.fastq.gz
1,3G    SRR453578_2.fastq.gz
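When a directory holds many uncompressed fastq files, the same gzip -9 step can be applied in a loop. The following is a minimal sketch, not part of the original tutorial: the function name compress_fastq and the demo read are made up, and gzip -t is used to verify each archive after compression.

```shell
# compress_fastq DIR: gzip every *.fastq / *.fq file directly inside DIR
# and verify each resulting archive (hypothetical helper).
compress_fastq() {
    dir="$1"
    for f in "$dir"/*.fastq "$dir"/*.fq; do
        [ -f "$f" ] || continue          # skip patterns that matched nothing
        gzip -9 "$f" || return 1         # replaces FILE with FILE.gz
        gzip -t "$f.gz" || return 1      # integrity test of the archive
        echo "compressed: $f.gz"
    done
}

# Demo with a tiny made-up read
work=$(mktemp -d)
printf '@read1\nACGTACGT\n+\nIIIIIIII\n' > "$work/sample_1.fastq"
compress_fastq "$work"
gzip -l "$work/sample_1.fastq.gz"        # compressed vs uncompressed size
```

Note that gzip removes the original file once the .gz file has been written, so no extra cleanup is needed.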
How to compress a vcf file?
# get the size of the vcf file
$ du -sh ALL.Chr01.F4.GenotypeGVCFS.MQ0.vcf
7,0G    ALL.Chr01.F4.GenotypeGVCFS.MQ0.vcf

# compress vcf file
$ gzip -9 ALL.Chr01.F4.GenotypeGVCFS.MQ0.vcf

# get size of the compressed vcf file
$ du -sh ALL.Chr01.F4.GenotypeGVCFS.MQ0.vcf.gz
901M    ALL.Chr01.F4.GenotypeGVCFS.MQ0.vcf.gz
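A compressed VCF does not have to be decompressed on disk before it is read: its content can be streamed. The sketch below illustrates this with made-up data, using only gzip and grep; the function name count_variants and the demo file are hypothetical.

```shell
# count_variants FILE.vcf.gz: count the non-header records of a gzipped VCF
# without writing the decompressed file to disk (hypothetical helper).
count_variants() {
    gzip -dc "$1" | grep -vc '^#'
}

# Demo with a made-up two-variant VCF
tmp=$(mktemp -d)
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > "$tmp/demo.vcf"
printf 'Chr01\t100\t.\tA\tG\t50\tPASS\t.\nChr01\t250\t.\tT\tC\t60\tPASS\t.\n' >> "$tmp/demo.vcf"
gzip -9 "$tmp/demo.vcf"
count_variants "$tmp/demo.vcf.gz"
gzip -dc "$tmp/demo.vcf.gz" | head -n 2   # peek at the header lines
```

If the file has to be indexed (for example with tabix), block compression with bgzip from htslib is usually required instead of plain gzip.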

How to compress sam files?

A SAM file can be converted to its compressed binary form, BAM, with samtools (http://www.htslib.org/):

samtools view -bh file.sam > file.bam

Most tools recognize this format. Converting to BAM can reduce the size of a file by up to a factor of 3, freeing up space in your project.
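The conversion can be combined with a sanity check so that the SAM file is deleted only when the BAM file looks intact. This is a sketch under assumptions: the function name sam_to_bam and its return codes are hypothetical, and samtools (with its quickcheck subcommand) must be available in the PATH.

```shell
# sam_to_bam FILE.sam: convert to BAM, check the result, then delete the SAM.
# Hypothetical helper; requires samtools in the PATH.
sam_to_bam() {
    sam="$1"
    bam="${sam%.sam}.bam"
    [ -f "$sam" ] || { echo "no such file: $sam" >&2; return 1; }
    command -v samtools >/dev/null 2>&1 || { echo "samtools not found" >&2; return 2; }
    samtools view -b -o "$bam" "$sam" || return 3   # SAM -> BAM
    samtools quickcheck "$bam" || return 4          # header and EOF block present?
    rm -- "$sam"                                    # keep only the BAM
    du -h "$bam"
}

# Usage (path is an example): sam_to_bam mapping/sample1.sam
```

Deleting the SAM file only after the check passes avoids losing data when the conversion is interrupted, for instance because the disk quota is reached.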

After a mapping, which files should I keep? Which files should I remove?

After mapping, you can decide to store only the BAM files. In this case, the fastq files are no longer needed, because BAM files can be converted back to fastq files, for example with the bedtools bamtofastq command:
https://bedtools.readthedocs.io/en/latest/content/tools/bamtofastq.html.

bedtools bamtofastq [OPTIONS] -i file.bam -fq file.fastq

In any case, keeping both is redundant!
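For paired-end data, the conversion back to fastq needs one extra step, because bedtools expects the two mates of a pair on consecutive records. A minimal sketch, assuming samtools and bedtools are in the PATH; the function name bam_to_paired_fastq, the output naming and the return codes are hypothetical.

```shell
# bam_to_paired_fastq FILE.bam PREFIX: write PREFIX_1.fastq and PREFIX_2.fastq.
# Hypothetical helper; requires samtools and bedtools in the PATH.
bam_to_paired_fastq() {
    bam="$1"
    prefix="$2"
    [ -f "$bam" ] || { echo "no such file: $bam" >&2; return 1; }
    for tool in samtools bedtools; do
        command -v "$tool" >/dev/null 2>&1 || { echo "$tool not found" >&2; return 2; }
    done
    # Sort by read name so that both mates of a pair are adjacent.
    samtools sort -n -o "$prefix.byname.bam" "$bam" || return 3
    bedtools bamtofastq -i "$prefix.byname.bam" \
        -fq "${prefix}_1.fastq" -fq2 "${prefix}_2.fastq" || return 4
    rm -- "$prefix.byname.bam"                      # remove the temporary BAM
}

# Usage (names are examples): bam_to_paired_fastq sample1.bam sample1
```

The reads can only be recovered if they are still in the BAM file, so this approach assumes that no reads (unmapped ones, for example) were filtered out before the fastq files were deleted.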

Licence

The resource material is licensed under the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/).