How to manage my data after an analysis?
Description | NGS files (raw or generated by an analysis) can be enormous, often ranging from tens to hundreds of gigabytes. They therefore take up a lot of storage space and require significant hardware resources. Compressing NGS data is a natural way to significantly reduce the cost of the storage infrastructure and to speed up analyses. |
---|---|
Authors | Christine Tranchant-Dubreuil and Julie Orjuela - DIADE, IRD |
Creation Date | 09/02/2022 |
Last Modified Date | 13/02/2022 |
Summary
- Some NGS file formats
- How to know if I have fastq, sam or vcf files in my project directory ?
- How to get the size of my files or directories ?
- How to compress fastq or vcf files ?
- How to compress sam files ?
- After a mapping, which files should I keep ? which files should I remove ?
- License
Some NGS file formats
Almost all the high-throughput sequencing data you will deal with will be in one of the following formats: fastq (or fq), sam, bam or vcf.
Format | Description | Used by |
---|---|---|
fastq | a text-based format to store a nucleotide sequence and its corresponding quality sequence, generally given by the sequencing providers | QC tools, cleaning tools, mapping tools |
sam | a tab-delimited text file (Sequence Alignment/Map format) that contains sequence alignment data, generally produced by mapping tools after aligning fastq sequences to a reference | samtools |
bam | binary version of a SAM file (saving storage and faster manipulation) | samtools, gatk |
vcf | The vcf (Variant Call Format) format is a text format used to describe single nucleotide variants (SNVs) as well as insertions, deletions, and other sequence variations. | vcftools, BCFtools |
Don't forget to compress your large NGS files, including VCF, FASTQ and SAM files. File compression is a simple way to store the same data in a smaller file.
How to get the size of my files or directories ?
A directory
You can use the du command (disk usage). To find the size of a specific directory, use du -h path2dir, which lists the size of every subdirectory, or du -sh path2dir, which prints only the total:
nas3$ du -h /data3/projects/riceAnnot/
1,1G /data3/projects/riceAnnot/abyssContigsFiltered/Obarthii
861M /data3/projects/riceAnnot/abyssContigsFiltered/Oglaberrima
2,0G /data3/projects/riceAnnot/abyssContigsFiltered
...
16G /data3/projects/riceAnnot/VCFs
33G /data3/projects/riceAnnot/ALLVCFs/tmp
38G /data3/projects/riceAnnot/ALLVCFs
58G /data3/projects/riceAnnot/
[tranchant@master0 scratch-scripts]$ du -sh /data3/projects/riceAnnot/
58G /data3/projects/riceAnnot/
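To see at a glance which subdirectories take the most space, the output of du can be sorted by size. The following is a minimal sketch, assuming GNU coreutils (du --max-depth and sort -h); the function name largest_dirs is only illustrative and is not a script provided on the cluster.

```shell
#!/usr/bin/env bash
# largest_dirs DIR [N]: print the N largest entries directly under DIR
# (default 10), biggest first. Assumes GNU du and GNU sort.
largest_dirs() {
    local dir="${1:-.}"
    local n="${2:-10}"
    # --max-depth=1 reports DIR itself and each direct subdirectory only;
    # sort -rh orders human-readable sizes (K, M, G) from largest to smallest.
    du -h --max-depth=1 "$dir" | sort -rh | head -n "$n"
}

# Example (path taken from the listing above):
# largest_dirs /data3/projects/riceAnnot/ 5
```

Because the sizes are rounded for display, two directories of almost the same size may appear in either order.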
My projects on the cluster
To list the size of all your projects on nas, nas2 and nas3, use the following script :
nas3$ /opt/scripts/Data/project.sh
### OWNER - project1 1,5T /data/projects/project1
### OWNER - project2 352G /data2/projects/project2
----------------- Projects as participants
toggle
irigin
====> TOTAL : 2
Be careful : running this script can be time-consuming, depending on the number of projects and their size !
How to know if I have fastq, sam or vcf files in my project directory ?
Use the following script : /opt/scripts/Data/searchFile.sh path2project extension
$ /opt/scripts/Data/searchFile.sh /data3/projects/myproject/ vcf
final_biallelic_random_10perc.vcf
final_chr01_C013_C034.recode.vcf
Filtered_biallelic.recode.vcf
final_chr01.recode.vcf
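The searchFile.sh script is specific to the cluster. On a machine where it is not available, a similar search can be sketched with the standard find command. The function name find_ngs_files is hypothetical and this is not the implementation of searchFile.sh; unlike the listing above, it also reports gzip-compressed files and prints the size of each file.

```shell
#!/usr/bin/env bash
# find_ngs_files DIR EXT: list regular files under DIR whose name ends in
# .EXT or .EXT.gz, each with its human-readable size.
find_ngs_files() {
    local dir="$1"
    local ext="$2"
    # -type f keeps regular files only; du -h prints "size<TAB>path"
    find "$dir" -type f \( -name "*.${ext}" -o -name "*.${ext}.gz" \) \
        -exec du -h {} +
}

# Example (same arguments as the searchFile.sh call above):
# find_ngs_files /data3/projects/myproject/ vcf
```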
How to compress fastq or vcf files ?
List the content of the directory that contains the fastq files
bash-4.2# ls -lh
-rwxr-xr-x 1 root formation 514M 4 févr. 13:32 SRR453571_1.fastq.gz
-rwxr-xr-x 1 root formation 541M 4 févr. 13:32 SRR453571_2.fastq.gz
-rwxr-xr-x 1 root formation 1,3G 4 févr. 13:32 SRR453578_1.fastq.gz
-rwxr-xr-x 1 root formation 4,9G 4 févr. 13:33 SRR453578_2.fastq
bash-4.2# du -h SRR453578_2.fastq
8,1G SRR453578_2.fastq
Compress a fastq file with the command gzip -9
bash-4.2# du -h SRR453578_2.fastq
8,1G SRR453578_2.fastq
bash-4.2# gzip -9 SRR453578_2.fastq
bash-4.2# ls -lh
total 9,3G
-rwxr-xr-x 1 root formation 514M 4 févr. 13:32 SRR453571_1.fastq.gz
-rwxr-xr-x 1 root formation 541M 4 févr. 13:32 SRR453571_2.fastq.gz
-rwxr-xr-x 1 root formation 1,3G 4 févr. 13:32 SRR453578_1.fastq.gz
-rwxr-xr-x 1 root formation 1,3G 4 févr. 13:33 SRR453578_2.fastq.gz
bash-4.2# du -h SRR453578_2.fastq.gz
1,3G SRR453578_2.fastq.gz
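When a directory contains many uncompressed fastq files, the gzip -9 step above can be repeated for each of them. The sketch below is one possible way to do it, assuming bash and GNU find; the function name compress_fastq is hypothetical. Note that gzip replaces each file with file.gz and removes the original only when the compression succeeds.

```shell
#!/usr/bin/env bash
# compress_fastq DIR: gzip every uncompressed .fastq/.fq file under DIR.
compress_fastq() {
    local dir="${1:-.}"
    # -print0 / read -d '' keep file names with spaces intact
    find "$dir" -type f \( -name "*.fastq" -o -name "*.fq" \) -print0 |
        while IFS= read -r -d '' f; do
            # -9 is the best (and slowest) compression level;
            # gzip -t then tests the integrity of the compressed file
            if gzip -9 "$f" && gzip -t "$f.gz"; then
                echo "compressed: $f.gz"
            else
                echo "failed: $f" >&2
            fi
        done
}

# Example:
# compress_fastq /data3/projects/myproject/
```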
How to compress a vcf file ?
#get vcf size
$ du -sh ALL.Chr01.F4.GenotypeGVCFS.MQ0.vcf
7,0G ALL.Chr01.F4.GenotypeGVCFS.MQ0.vcf
# compress vcf file
$ gzip -9 ALL.Chr01.F4.GenotypeGVCFS.MQ0.vcf
# get size of the compressed vcf file
$ du -sh ALL.Chr01.F4.GenotypeGVCFS.MQ0.vcf.gz
901M ALL.Chr01.F4.GenotypeGVCFS.MQ0.vcf.gz
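A compressed vcf file does not need to be uncompressed on disk to be inspected: gzip -cd (equivalent to zcat) streams its content to standard output. As a minimal sketch, the hypothetical helper count_variants below counts the variant records, i.e. the lines that do not start with #.

```shell
#!/usr/bin/env bash
# count_variants FILE.vcf.gz: number of variant records in a gzipped vcf.
count_variants() {
    # -c writes to stdout, -d decompresses; the file on disk is untouched.
    # grep -vc '^#' counts the lines that are not header lines.
    gzip -cd "$1" | grep -vc '^#'
}

# Example (file name from the listing above):
# count_variants ALL.Chr01.F4.GenotypeGVCFS.MQ0.vcf.gz
# Peek at the first lines without decompressing the file:
# gzip -cd ALL.Chr01.F4.GenotypeGVCFS.MQ0.vcf.gz | head
```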
How to compress sam files ?
The SAM format can be compressed into its binary form, BAM, using samtools (http://www.htslib.org/):
samtools view -bh file.sam > file.bam
Most tools recognize this format. Converting to BAM can reduce the size of a file by up to 3 times, which frees up space in your project.
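Space is only saved once the original SAM file is removed, so it is worth checking the BAM file first. The sketch below assumes a samtools version that provides the quickcheck subcommand (1.3 or later); the function name sam_to_bam is hypothetical.

```shell
#!/usr/bin/env bash
# sam_to_bam FILE.sam: convert to FILE.bam, then delete the SAM file only
# if the BAM file passes a basic integrity check.
sam_to_bam() {
    local sam="$1"
    local bam="${sam%.sam}.bam"
    if [ ! -f "$sam" ]; then
        echo "no such file: $sam" >&2
        return 1
    fi
    # -b: BAM output, -h: keep the header, -o: output file
    samtools view -b -h -o "$bam" "$sam" || return 1
    # quickcheck verifies that the BAM file looks intact (not truncated)
    if samtools quickcheck "$bam"; then
        rm "$sam"
    else
        echo "BAM check failed, keeping $sam" >&2
        return 1
    fi
}

# Example:
# sam_to_bam file.sam
```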
After a mapping, which files should I keep ? which files should I remove ?
After mapping, you can decide to store the BAM files. In this case, the fastq files are no longer needed.
It is possible to convert BAM files back to fastq files with the bedtools bamtofastq command (https://bedtools.readthedocs.io/en/latest/content/tools/bamtofastq.html):
bedtools bamtofastq [OPTIONS] -i file.bam -fq fastq
In any case, keeping both is redundant !
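For paired-end data, the bedtools documentation indicates that the BAM file should be sorted by read name before using the -fq2 option, so that both mates of a pair are adjacent. A possible sketch, assuming samtools and bedtools are installed; the function name bam_to_paired_fastq and the output file names are hypothetical.

```shell
#!/usr/bin/env bash
# bam_to_paired_fastq FILE.bam PREFIX: write PREFIX_1.fastq and PREFIX_2.fastq
bam_to_paired_fastq() {
    local bam="$1"
    local prefix="$2"
    local sorted="${prefix}.namesorted.bam"
    if [ ! -f "$bam" ]; then
        echo "no such file: $bam" >&2
        return 1
    fi
    # Sort by read name (-n) so that both mates of a pair are adjacent
    samtools sort -n -o "$sorted" "$bam" || return 1
    # -fq receives the first mates, -fq2 the second mates
    bedtools bamtofastq -i "$sorted" -fq "${prefix}_1.fastq" -fq2 "${prefix}_2.fastq"
    local rc=$?
    # The name-sorted copy is only an intermediate file
    rm -f "$sorted"
    return "$rc"
}

# Example:
# bam_to_paired_fastq file.bam sample
```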
License
The resource material is licensed under the Creative Commons Attribution 4.0 International License.