Trainings 2019 - Linux for Jedi - Practice

Advanced Linux Practice page

Description	Hands On Lab Exercises for Linux
Related-course materials	Linux for Jedi
Authors	Christine Tranchant-Dubreuil (christine.tranchant@ird.fr) & Gautier Sarah (gautier.sarah
Creation Date	11/03/2018
Last Modified Date	14/04/2019

Summary

Preamble: Software to install before connecting to a remote Linux server
Practice 1: Connecting to a Linux server with ssh
Practice 2: Preparing the working environment
Practice 3: Monitoring processes with w, ps, kill, top
Practice 4: Using the && separator
Practice 5: Searching for text using grep
Practice 6: Displaying lines with sed
Practice 7: Deleting lines with sed
Practice 8: Parsing files with sed using regexp
Practice 9: Modifying files with sed
Practice 10: Manipulating files with awk
Links
License

Preamble

List of software to install before connecting to a remote Linux server
Directory tree image:

Practice 1 : Connecting to a Linux server with `ssh`

In MobaXterm:

Click the Session button, then click SSH.
- In the Remote host text box, type the HOSTNAME (see table below)
- Check the Specify username box and enter your user name
In the console, enter the password when prompted.
Once you are successfully logged in, you will use this console for the rest of the lecture.

Cluster HPC	hostname
IRD HPC	bioinfo-master.ird.fr
AGAP HPC	cc2-login.cirad.fr

Connect to the HPC

Practice 2 : Preparing the working environment

Type qrsh to connect to a compute node
Move into the directory /scratch
Create a working directory such as Formation-X (X corresponds to your login id/number)
Move into the directory you just created and check the current working directory by looking at the prompt
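
The directory steps above can be sketched as follows. This is only an illustration: a temporary directory stands in for /scratch so that the commands run anywhere, and the login number 42 is hypothetical. `qrsh` is left out because it only exists on the cluster.

```shell
# Stand-in for /scratch (assumption); on the cluster, use /scratch itself.
scratch=$(mktemp -d)
login_id=42                      # hypothetical login number

cd "$scratch"                    # move into the scratch space
mkdir "Formation-$login_id"      # create the working directory
cd "Formation-$login_id"         # move into it
workdir=$(pwd)                   # pwd prints the current working directory
echo "$workdir"
```

`pwd` gives the same information as the prompt, which is handy when the prompt shows only the last path component.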

Practice 3 : Monitoring processes

Displaying the list of processes

Type the command w in 2 consoles: one connected to bioinfo-master, the other connected to a node
Type (on the node) the command ps without options, then with the options u, ua, uax
Type the command top on the node
Then press c to display the complete command line of each process
Then press u to display only your processes

Kill a process - downloading files from SRA through two ways

We want to download one fastq file from NCBI SRA (available here https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR304976 ) using SRAtoolkit as below :

module load sratoolkit
fastq-dump SRRXXXX

This will download the SRA file (in sra format) and then convert it to a fastq file for you. More details at https://isugenomics.github.io/bioinformatics-workbook/dataAcquisition/fileTransfer/sra.html

Download the fastq file into the directory previously created in /scratch
Open a new terminal and connect to bioinfo-master.ird.fr
From this terminal, display the list of processes running on the node where you are downloading the fastq file with fastq-dump
From this terminal, kill your "fastq-dump" process directly from bioinfo-master
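
A minimal local sketch of the list-then-kill sequence. It assumes a background `sleep` as a stand-in for the long-running fastq-dump job; on the cluster the same `ps` and `kill` commands would have to be run on the node itself, for instance through `ssh` from bioinfo-master.

```shell
# Stand-in for a long-running fastq-dump job (assumption: any background process will do).
sleep 300 &
pid=$!                                # PID of the background job

# Show the process; on the cluster you would look for fastq-dump in the ps output.
ps -p "$pid" -o pid,user,args

# Send the default TERM signal, then collect the exit status of the job.
kill "$pid"
wait "$pid"
status=$?
echo "exit status: $status"           # 128 + signal number when a signal ended the job
```

If a process ignores TERM, `kill -9 PID` sends KILL instead, which cannot be caught; it is usually kept as a last resort.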

Practice 4 : Using the && separator

On the console, type the following 2 Linux commands to get the data needed for the next exercises (we will explain the two commands later):

# download the archive from the web, then extract the gzipped tar file
wget http://sg.ird.fr/LINUX-TP/LINUX-TP.tar.gz && tar -xzvf LINUX-TP.tar.gz

Check the content of your home directory on the server now (cf. filetree just below)
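
As a small illustration of what `&&` does in the command above: the command on the right runs only if the command on the left succeeds (exit status 0). The directory and file names below are made up for the sketch.

```shell
workdir=$(mktemp -d)

# mkdir succeeds, so touch runs.
mkdir "$workdir/ok" && touch "$workdir/ok/created"

# ls fails on a missing file, so touch is skipped.
ls "$workdir/missing" 2>/dev/null && touch "$workdir/skipped"

ls "$workdir"      # only "ok" is listed
```

This is why `wget ... && tar ...` is safer than two separate lines: tar is never run on a missing or failed download.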

Practice 5 : Searching for text using `grep`

From a gff file

Go to the following page: http://rice.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_7.0/
Copy the url of the rice genome annotation file (gff format); we will use it to download the file directly on the server
Go to the bank directory and type the following command:

wget gff_url

Print the number of lines containing the word gene in the gff file - grep -P
Count the number of genes - grep -c
Search for the nbs-lrr genes - grep -i
Remove the lines containing the word putative - grep -v
Count the number of mRNA on chromosome 1 - grep -c regexp
Count the number of mRNA on the first five chromosomes - grep -c regexp
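
One possible way to approach these questions, shown on a made-up miniature annotation rather than the real rice file (the `demo` source column and the `g1`, `g2`, `g3` identifiers are invented). The sketch assumes GNU grep, whose `-P` option understands `\t`.

```shell
cd "$(mktemp -d)"
# Made-up miniature gff; "|" is converted into the tab that gff uses.
tr '|' '\t' > mini.gff <<'EOF'
##gff-version 3
Chr1|demo|gene|2903|10817|.|+|.|ID=g1;Note=expressed protein
Chr1|demo|mRNA|2903|10817|.|+|.|ID=g1.1;Parent=g1
Chr2|demo|gene|1000|3500|.|-|.|ID=g2;Note=NBS-LRR disease resistance protein, putative
Chr2|demo|mRNA|1000|3500|.|-|.|ID=g2.1;Parent=g2
Chr10|demo|gene|500|900|.|+|.|ID=g3;Note=hypothetical protein
Chr10|demo|mRNA|500|900|.|+|.|ID=g3.1;Parent=g3
EOF

n_genes=$(grep -c -P '\tgene\t' mini.gff)                 # "gene" as a whole column
grep -i 'nbs-lrr' mini.gff                                # case-insensitive search
n_kept=$(grep -v -c 'putative' mini.gff)                  # lines without "putative"
n_chr1=$(grep -c -P '^Chr1\t.*\tmRNA\t' mini.gff)         # the tab keeps Chr10 out
n_chr1to5=$(grep -c -P '^Chr[1-5]\t.*\tmRNA\t' mini.gff)  # chromosomes 1 to 5
echo "$n_genes $n_kept $n_chr1 $n_chr1to5"                # prints: 3 6 1 2
```

Anchoring the pattern on the surrounding tabs matters on real files, where the word gene can also appear inside the attribute column of other feature types.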

From a fasta file

Get from the same website the cDNA sequences of the rice genome (fasta format) - wget
Get the help of the infoseq program - infoseq --h
Run the infoseq program on the fasta file just downloaded
```
infoseq -sequence FASTA_FILE | head
```
Display only the accession, length and pgc columns using the options of infoseq
What is the shortest sequence (accession and length)? infoseq, sort, head
What is the longest sequence (accession and length)? infoseq, sort, head
Count the number of sequences with a length between 1000 and 9999 with grep
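
The sort and grep parts of these questions can be sketched on a small made-up table. The two-column "name, length" layout below is an assumption standing in for the real infoseq output, which has to be reduced to the relevant columns first.

```shell
cd "$(mktemp -d)"
# Made-up "name<TAB>length" table (not real infoseq output).
printf '%s\t%s\n' seqA 532 seqB 1000 seqC 9999 seqD 10450 seqE 87 > lengths.tsv

shortest=$(sort -k2,2n lengths.tsv | head -n 1)    # numeric sort, ascending
longest=$(sort -k2,2nr lengths.tsv | head -n 1)    # numeric sort, descending

# A length between 1000 and 9999 is exactly four digits with a non-zero first digit.
n_4digits=$(grep -c -E '[[:space:]][1-9][0-9]{3}$' lengths.tsv)

printf '%s\n%s\n%s\n' "$shortest" "$longest" "$n_4digits"
```

The grep pattern works because the length is the last column; if it sat in the middle of the line, the `$` anchor would have to be replaced by the following separator.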

Practice 6 : Displaying lines with `sed`

For this exercise, you will work on the fastq file LINUX-TP/Data/fastq/pairedTwoIndividusGzippedIrigin/irigin1_1.fastq.gz

Print the first 8 lines
Print lines 5 to 12
Print only the sequence ids
Print only the sequence ids and nucleotide sequences
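
A possible approach, sketched on a made-up three-read fastq file that is gzipped like the training file. The `first~step` address is a GNU sed extension, so this assumes GNU sed.

```shell
cd "$(mktemp -d)"
# Made-up fastq: 4 lines per read; one quality line starts with "@" on purpose.
cat > mini.fastq <<'EOF'
@read1
ACGT
+
IIII
@read2
GGCC
+
@HHH
@read3
TTAA
+
JJJJ
EOF
gzip mini.fastq                                           # creates mini.fastq.gz

first8=$(gzip -cd mini.fastq.gz | sed -n '1,8p')          # lines 1 to 8
lines5to12=$(gzip -cd mini.fastq.gz | sed -n '5,12p')     # lines 5 to 12
ids=$(gzip -cd mini.fastq.gz | sed -n '1~4p')             # every 4th line from line 1
ids_seqs=$(gzip -cd mini.fastq.gz | sed -n '1~4p;2~4p')   # ids and nucleotide lines
printf '%s\n' "$ids"
```

Selecting ids by line position rather than with `/^@/` avoids picking up quality lines that happen to start with `@`, as line 8 does here.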

Practice 7 : Deleting lines with `sed`

For this exercise, you will work on the fastq file LINUX-TP/Data/fastq/pairedTwoIndividusIrigin/irigin1_1.fastq

Delete everything from line 9 to the end of the file
Delete the lines containing only a +
Delete the lines containing only a + and the quality sequences
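
One way to write these deletions, again on a made-up three-read file. The `addr,+N` range used in the last command is a GNU sed extension.

```shell
cd "$(mktemp -d)"
# Made-up fastq with three reads.
cat > mini.fastq <<'EOF'
@read1
ACGT
+
IIII
@read2
GGCC
+
@HHH
@read3
TTAA
+
JJJJ
EOF

head8=$(sed '9,$d' mini.fastq)            # delete from line 9 to the last line ($)
no_plus=$(sed '/^+$/d' mini.fastq)        # delete lines made of a single +
ids_seqs=$(sed '/^+$/,+1d' mini.fastq)    # delete each + line and the line after it
printf '%s\n' "$ids_seqs"
```

Without `-i`, sed writes to standard output and leaves the input file untouched, which makes it safe to experiment before redirecting to a new file.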

Practice 8 : File parsing with `sed` using regexp

From the gff file previously downloaded

Count the number of genes

From a vcf file

Download the vcf file available at this url http://sg.ird.fr/LINUX-TP/OgOb-all-MSU7-CHR6.GATKVARIANTFILTRATION-100000.vcf.tar.gz
How many polymorphisms were considered bad and filtered out (count the lines that neither contain the PASS tag nor start with #)?
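
A sketch of the counting idea on a made-up four-variant VCF (positions, qualities and the `LowQual` filter name are invented; the real file may use other filter names). Deleting the header lines and the PASS lines leaves exactly the filtered-out variants.

```shell
cd "$(mktemp -d)"
# Made-up miniature VCF; "|" is converted into tabs.
tr '|' '\t' > mini.vcf <<'EOF'
##fileformat=VCFv4.1
#CHROM|POS|ID|REF|ALT|QUAL|FILTER|INFO|FORMAT|ind1|ind2
Chr6|1204|.|A|G|310.5|PASS|DP=40|GT:DP|0/0:20|0/1:20
Chr6|2310|.|C|T|25.1|LowQual|DP=6|GT:DP|0/1:3|1/1:3
Chr6|5872|.|G|A|412.0|PASS|DP=55|GT:DP|1/1:30|0/0:25
Chr6|9033|.|T|C|18.7|LowQual|DP=4|GT:DP|./.|0/1:4
EOF

# Two delete commands: header lines first, then lines carrying the PASS tag.
n_filtered=$(sed -e '/^#/d' -e '/PASS/d' mini.vcf | wc -l)
echo "$n_filtered"
```

The same `sed ... | wc -l` pattern can be used for the gene count on the gff file, with `sed -n '/regexp/p'` printing only the matching lines.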

Practice 9 : File modification with `sed`

From the vcf file OgOb-all-MSU7-CHR6.GATKVARIANTFILTRATION-100000.vcf

Transform the vcf file into a coordinate file chr\tpos\tpos
Now, in the VCF file, we would like to replace the genotypes by the allelic dose. This means that we should replace the whole field by 0 when the genotype is 0/0, by 1 when the genotype is 0/1 and by 2 when the genotype is 1/1
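
Both transformations can be sketched with sed substitutions on a made-up miniature VCF. The sketch assumes GNU sed (for `\t` in patterns and replacements) and unphased genotypes written exactly as 0/0, 0/1 or 1/1; other encodings such as ./. are left as they are.

```shell
cd "$(mktemp -d)"
# Made-up miniature VCF; "|" is converted into tabs.
tr '|' '\t' > mini.vcf <<'EOF'
##fileformat=VCFv4.1
#CHROM|POS|ID|REF|ALT|QUAL|FILTER|INFO|FORMAT|ind1|ind2
Chr6|1204|.|A|G|310.5|PASS|DP=40|GT:DP|0/0:20|0/1:20
Chr6|2310|.|C|T|25.1|LowQual|DP=6|GT:DP|0/1:3|1/1:3
Chr6|5872|.|G|A|412.0|PASS|DP=55|GT:DP|1/1:30|0/0:25
Chr6|9033|.|T|C|18.7|LowQual|DP=4|GT:DP|./.|0/1:4
EOF

# Coordinate file: drop header lines, capture columns 1 and 2, print chr, pos, pos.
sed -e '/^#/d' -e 's/^\([^\t]*\)\t\([^\t]*\)\t.*/\1\t\2\t\2/' mini.vcf > mini.coords

# Allelic dose: on non-header lines, replace each whole genotype field by 0, 1 or 2.
sed -E '/^#/!{ s#\t0/0[^\t]*#\t0#g; s#\t0/1[^\t]*#\t1#g; s#\t1/1[^\t]*#\t2#g; }' \
    mini.vcf > mini.dose.vcf

head -n 1 mini.coords
sed -n '3p' mini.dose.vcf
```

Using `#` as the delimiter of the `s` command avoids escaping the `/` inside the genotypes.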

From fasta files in `LINUX-TP/Fasta`

In the Fasta directory, there are two files: C_AllContigs.fasta and contig_tgicl.fasta. Before generating a single file with the 2 libraries, we would like to tag each sequence with its origin. In each file, add the respective tag VS1- / VS2- just before the identifier.

# File C_AllContigs.fasta initially
>C_pseu_c1
AAAAATGTTTGAAATCCACTTGGCATTMAATGGTGAAAGAATTTTAGATTTTTATATACT
CCCTCGGTAAGGAAATTGTTGTCTCATTTTGGGATTCACAATTATTACCAACAGTGCAAG
GGTTTT

# File C_AllContigs.fasta after tagging
>VS1-C_pseu_c1
AAAAATGTTTGAAATCCACTTGGCATTMAATGGTGAAAGAATTTTAGATTTTTATATACT
CCCTCGGTAAGGAAATTGTTGTCTCATTTTGGGATTCACAATTATTACCAACAGTGCAAG
GGTTTT

Note: first test the sed command on one file with the output on STDOUT, then store the results in new files named RN-VS.MID1.clean.sff.fasta …

Generate a file named all-contigs.fasta with all the sequences - cat file1 file2 > file3
Count the number of sequences in the fasta file just created - grep -c ">"
Count the number of sequences of each library in this file
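
The tagging, merging and counting steps can be sketched as below. The two input files are tiny made-up stand-ins with the same names as the training files, and the output names VS1.tagged.fasta and VS2.tagged.fasta are hypothetical, chosen only for this example.

```shell
cd "$(mktemp -d)"
# Tiny made-up stand-ins for the two libraries.
printf '>C_pseu_c1\nAAAAATG\n>C_pseu_c2\nCCCTCGG\n' > C_AllContigs.fasta
printf '>contig_1\nGGTTTT\n' > contig_tgicl.fasta

# Preview on STDOUT first, then write to new files (hypothetical names).
sed 's/^>/>VS1-/' C_AllContigs.fasta | head -n 2
sed 's/^>/>VS1-/' C_AllContigs.fasta > VS1.tagged.fasta
sed 's/^>/>VS2-/' contig_tgicl.fasta > VS2.tagged.fasta

# Merge both libraries, then count header lines overall and per tag.
cat VS1.tagged.fasta VS2.tagged.fasta > all-contigs.fasta
n_all=$(grep -c '>' all-contigs.fasta)
n_vs1=$(grep -c '^>VS1-' all-contigs.fasta)
n_vs2=$(grep -c '^>VS2-' all-contigs.fasta)
echo "$n_all $n_vs1 $n_vs2"      # prints: 3 2 1
```

Anchoring the substitution with `^>` restricts it to header lines, so sequence lines are never modified.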

From fastq files in `Data/fastq/pairedTwoIndividusIrigin`

In the directory Data/fastq/pairedTwoIndividusIrigin, convert the fastq file irigin1_1.fastq into fasta format
In one command line, convert all the fastq files of the directory into fasta (back up the files first)
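
A minimal sketch of the conversion, assuming GNU sed and strictly 4-line fastq records. The two input files are made-up stand-ins (irigin1_2.fastq is an assumed name for the second file). Writing to new .fasta files leaves the original fastq files untouched.

```shell
cd "$(mktemp -d)"
# Made-up two-read fastq files.
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\n@HHH\n' > irigin1_1.fastq
cp irigin1_1.fastq irigin1_2.fastq

# Line 1 of each record: replace the leading @ by > and print; line 2: print as is.
sed -n '1~4s/^@/>/p;2~4p' irigin1_1.fastq > irigin1_1.fasta

# Same command for every fastq file; ${fq%.fastq} strips the extension.
for fq in *.fastq; do sed -n '1~4s/^@/>/p;2~4p' "$fq" > "${fq%.fastq}.fasta"; done

cat irigin1_1.fasta
```

Addressing by line position keeps the quality line starting with `@` (line 8 here) out of the output.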

Practice 10 : Manipulating files with `awk`

From the gff file previously downloaded

Extract the coordinates from the gff file
Calculate the mean gene length
Calculate the mean gene length for chromosome 1
Count the number of genes longer than 2000 bp
Bonus: calculate the mean gene length for each chromosome in one command line
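
One possible set of awk commands, run on a made-up miniature gff with three genes (source column and identifiers are invented). The gene length is taken as end - start + 1, since gff coordinates are inclusive.

```shell
cd "$(mktemp -d)"
# Made-up miniature gff; "|" is converted into tabs.
tr '|' '\t' > mini.gff <<'EOF'
##gff-version 3
Chr1|demo|gene|1001|4000|.|+|.|ID=g1
Chr1|demo|mRNA|1001|4000|.|+|.|ID=g1.1;Parent=g1
Chr1|demo|gene|5001|6000|.|-|.|ID=g2
Chr2|demo|gene|2001|4500|.|+|.|ID=g3
EOF

# Coordinates of the genes: chromosome, start, end.
awk -F'\t' '$3 == "gene" { print $1, $4, $5 }' OFS='\t' mini.gff

# Mean gene length, overall and for chromosome 1.
mean_all=$(awk -F'\t' '$3 == "gene" { s += $5 - $4 + 1; n++ }
                       END { if (n) printf "%.2f", s / n }' mini.gff)
mean_chr1=$(awk -F'\t' '$3 == "gene" && $1 == "Chr1" { s += $5 - $4 + 1; n++ }
                        END { if (n) printf "%.2f", s / n }' mini.gff)

# Number of genes longer than 2000 bp.
n_long=$(awk -F'\t' '$3 == "gene" && $5 - $4 + 1 > 2000 { n++ } END { print n + 0 }' mini.gff)

# Bonus: one pass, one sum and one counter per chromosome.
per_chr=$(awk -F'\t' '$3 == "gene" { s[$1] += $5 - $4 + 1; n[$1]++ }
                      END { for (c in s) printf "%s\t%.2f\n", c, s[c] / n[c] }' mini.gff | sort)

echo "$mean_all $mean_chr1 $n_long"     # prints: 2166.67 2000.00 2
printf '%s\n' "$per_chr"
```

The `for (c in s)` loop visits keys in no guaranteed order, hence the final `sort`.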

From the result of a nucmer analysis

We want to rapidly align an assembly (contigs) against an entire reference genome using nucmer. Type the three following commands:

# Compare the multifasta of assembled contigs against the reference genome
nucmer --mum reference.fasta contigs.fasta -p ctgVSref.NUCMER

#The previous command produces a file named ctgVSref.NUCMER.delta that can then be filtered using delta-filter and formatted using show-coords to produce a human-readable table of overlapping alignments between the two multifastas.

#Filtering the nucmer results 
#The -l in delta-filter sets the minimum alignment length to 300. The -q “Maps each position of each query to its best hit in the reference, allowing for reference overlaps”.
delta-filter -l300 -q ctgVSref.NUCMER.delta > ctgVSref.filter300.delta

#Generate results (tab format)
#The -c and -l in show-coords indicate that percent identity and sequence length information, respectively, should be included in the output. -L sets the minimum alignment length to display, -r sorts the output lines by reference IDs and coordinates, and -T switches the output to tab-delimited format.
show-coords -c -l -L 300 -r -T ctgVSref.filter300.delta > ctgVSref.filter300.delta.coords.txt

Count the number of contigs in the fasta file
Count the number of alignments performed by nucmer
Count the number of contigs that have been aligned
Sort by alignment percent, in ascending order
Count the number of alignments with an alignment % > 50, then > 80
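
A sketch of how such a table could be queried with awk, sort and cut. Everything about the input is an assumption: the rows are made up, the header lines that a real show-coords file starts with have already been removed, the column order is taken to be S1, E1, S2, E2, LEN1, LEN2, %IDY, LEN_R, LEN_Q, COV_R, COV_Q, reference id, query id, and "alignment %" is read as the query coverage in column 11. Check the header of your own file before reusing the column numbers.

```shell
cd "$(mktemp -d)"
# Made-up rows in the assumed 13-column layout; "|" is converted into tabs.
tr '|' '\t' > mini.coords <<'EOF'
100|1099|1|1000|1000|1000|98.50|50000|1200|2.00|83.33|Chr1|ctg1
3000|3599|1|600|600|600|95.00|50000|1000|1.20|60.00|Chr1|ctg2
9000|9399|601|1000|400|400|91.20|50000|1000|0.80|40.00|Chr1|ctg2
EOF

n_aln=$(grep -c '' mini.coords)                          # one line per alignment
n_ctg=$(cut -f13 mini.coords | sort -u | grep -c '')     # distinct query ids
sorted=$(sort -k11,11n mini.coords)                      # ascending on column 11
n_50=$(awk -F'\t' '$11 > 50 { n++ } END { print n + 0 }' mini.coords)
n_80=$(awk -F'\t' '$11 > 80 { n++ } END { print n + 0 }' mini.coords)
echo "$n_aln $n_ctg $n_50 $n_80"
```

The number of contigs in the fasta file itself can be counted as before with `grep -c ">"`.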

Practice 11

Go into the directory LINUX-TP/Data/fastq/pairedTwoIndividusGzippedIrigin - cd
List the directory content
Run the fastq-stats program to get stats about the fastq file irigin1_1.fastq.gz
```
fastq-stats -D irigin1_1.fastq.gz
```

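Use a for loop to run fastq-stats on every fastq file in the directory

# the files in this directory are gzipped, so match *.fastq.gz; quote $file in case a name contains spaces
for file in *.fastq.gz; do
    fastq-stats -D "$file" > "$file.fastq-stats"
done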
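
The same loop pattern can be tried without fastq-stats. In the sketch below a simple read count (number of lines divided by 4) stands in for the real program, and the two gzipped files are made-up stand-ins (irigin1_2.fastq.gz is an assumed name).

```shell
cd "$(mktemp -d)"
# Two made-up gzipped fastq files with two reads each.
for i in 1 2; do
    printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nHHHH\n' | gzip > "irigin1_$i.fastq.gz"
done

# One output file per input file; the read count stands in for fastq-stats.
for file in *.fastq.gz; do
    gzip -cd "$file" | awk 'END { print NR / 4 }' > "$file.reads"
done

cat irigin1_1.fastq.gz.reads
```

Building the output name from `$file` gives each input its own result file, so no result overwrites another.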

License

The resource material is licensed under the Creative Commons Attribution 4.0 International License.
