Thursday, March 2, 2017

Alignment


After the data has been combined and the reads have been trimmed to remove adapter contamination, the reads have to be aligned, in my case to a reference genome.
The reference genome is in FASTA format, and I usually rename the scaffolds or the chromosomes (if they are defined). I also edit the gff3 file containing the gene annotation, for further analysis.
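The renaming itself can be done with a one-liner. Here is a sketch on a toy FASTA; the scaffold names and the "chr1, chr2, ..." scheme are made up, and the new names have to match whatever is used in the gff3:

```shell
# Toy reference with two scaffolds (made-up headers)
printf '>scaffold_0001 length=1200\nACGT\n>scaffold_0002 length=800\nTTGA\n' > ref.fa

# Replace every FASTA header line with a short sequential name
awk '/^>/ {print ">chr" ++i; next} {print}' ref.fa > ref.renamed.fa

head -n 1 ref.renamed.fa   # prints ">chr1"
```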

Here is some basic information about alignment and a list of many alignment methods I could try:
https://en.wikibooks.org/wiki/Next_Generation_Sequencing_%28NGS%29/Alignment

Why am I trying different methods? Because with the following method I got about 90% coverage when I was expecting around 95%. There is also a big difference between the results from the TruSeq samples and the Nextera one: the TruSeq samples contain much more structural variation. Why is that?

The first method I tried is "speedseq":



https://raw.githubusercontent.com/hall-lab/speedseq/master/etc/speedseq_workflow.png
In my case, because my genomes are very small, it takes around 20 min/sample.
For some reason, it gives very different alignment coverage results for the 6 samples (data mentioned in my previous post).

It uses BWA-MEM to do the alignment and a combination of many other tools to produce the different outputs.
Here is an example of its use for different purposes:
https://github.com/hall-lab/speedseq/blob/master/example/run_speedseq.sh

The resulting files after running it are:
  • example.bam
  • example.discordants.bam
  • example.splitters.bam
  • example.vcf.gz
  • example.sv.vcf.gz
The second tool I used is NextGenMap. It is very easy to use.
I have two files but I read this in its documentation:
IMPORTANT: NextGenMap does not check/compare read names. It assumes that the reads are in the correct order. Thus, the input file must only contain pairs of reads. Mixing single end and paired end reads is not supported and will corrupt the mapping results.
Because I am not sure of the order of the reads in the files, I interleaved them first:


ngm-utils interleave -1 forward_reads.fq -2 reverse_reads.fq \
  -o shuffled_reads.fq

The benefit of using ngm-utils interleave is that the read names of the 
pairs are checked. If there are reads without a mate present, they will 
be filtered.
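Since NextGenMap trusts the read order blindly, it is worth checking the pairing yourself first. Below is a minimal sketch with toy files and standard Unix tools; the `/1` and `/2` read-name suffixes are an assumption about the naming scheme, so adjust the `sub()` call to match your data:

```shell
# Toy paired files whose reads are in matching order (made-up names)
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nACGT\n+\nIIII\n' > forward_reads.fq
printf '@r1/2\nTTGA\n+\nIIII\n@r2/2\nTTGA\n+\nIIII\n' > reverse_reads.fq

# Extract the header of each read (1st of every 4 lines)
# and strip the /1 or /2 suffix so the IDs become comparable
awk 'NR % 4 == 1 {sub(/\/[12]$/, ""); print}' forward_reads.fq > fwd.ids
awk 'NR % 4 == 1 {sub(/\/[12]$/, ""); print}' reverse_reads.fq > rev.ids

# diff prints nothing when the two files agree position by position
diff fwd.ids rev.ids && echo "read order matches"
```

If diff prints anything, the files are out of sync and need interleaving (or re-sorting) before mapping.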

NextGenMap produces a SAM file, but I want the same output as with "speedseq", so I use this samtools command to convert the SAM file into a BAM file:
samtools view -bT $ref  $sam > $output
The next script produces all the needed output files from a BAM file (the script is not mine, but I can't find where I copied it from anymore). It uses samblaster to generate them the way speedseq does.

#!/bin/bash
# Produce the splitters/discordants BAM files from an existing BAM,
# the way speedseq does, using samblaster and sambamba.

BAM=$1
OUT_DIR=$2

PYTHON=python
LUMPY_HOME=src/lumpy-sv
SAMBAMBA=bin/sambamba
SAMBLASTER=bin/samblaster

# paths kept from the original script; not all of them are used below
BAMLIBS=$LUMPY_HOME/scripts/bamkit/bamlibs.py
PAIREND_DISTRO=$LUMPY_HOME/scripts/pairend_distro.py
BAMGROUPREADS=$LUMPY_HOME/scripts/bamkit/bamgroupreads.py
BAMFILTERRG=$LUMPY_HOME/scripts/bamkit/bamfilterrg.py

SAMTOBAM="$SAMBAMBA view -S -f bam -l 0"
SAMSORT="$SAMBAMBA sort -m 1G --tmpdir "

MAX_SPLIT_COUNT=2
MIN_NON_OVERLAP=20

if [ ! -d "$OUT_DIR" ]; then
    mkdir "$OUT_DIR"
fi

BAM_BASE=`basename $BAM .bam`

# make named pipes for the splitter and discordant read streams
TEMP_DIR=$OUT_DIR/$BAM_BASE
mkdir -p $TEMP_DIR/spl $TEMP_DIR/disc
mkfifo $TEMP_DIR/spl_pipe
mkfifo $TEMP_DIR/disc_pipe

# group the reads by name and pipe them through samblaster, which
# writes split and discordant reads to the two pipes
$PYTHON $BAMGROUPREADS \
    -i $BAM \
| $SAMBLASTER \
    --acceptDupMarks \
    --excludeDups \
    --addMateTags \
    --maxSplitCount $MAX_SPLIT_COUNT \
    --minNonOverlap $MIN_NON_OVERLAP \
    --splitterFile $TEMP_DIR/spl_pipe \
    --discordantFile $TEMP_DIR/disc_pipe \
> /dev/null &

# read from the pipes, convert SAM to BAM and sort
$SAMTOBAM $TEMP_DIR/spl_pipe \
| $SAMSORT $TEMP_DIR/spl \
    -o $TEMP_DIR/$BAM_BASE.splitters.bam \
    /dev/stdin &

$SAMTOBAM $TEMP_DIR/disc_pipe \
| $SAMSORT $TEMP_DIR/disc \
    -o $TEMP_DIR/$BAM_BASE.discordants.bam \
    /dev/stdin

wait

# clean up pipes and temporary sort directories
rm $TEMP_DIR/spl_pipe
rm $TEMP_DIR/disc_pipe
rmdir $TEMP_DIR/spl
rmdir $TEMP_DIR/disc

Using this alignment I don't get any SVs for any of the samples, and interleaving the reads doesn't change the results much. Why is the splitters.bam file empty for all of them? Is the algorithm filtering those reads out?


Trimming Illumina paired-end reads



First I will explain a little about my data in the current project.

I have one sample sequenced in 2015:


Workflow:    GenerateFASTQ
Application: NextSeq FASTQ Only
Assay:       Nextera XT
Chemistry:   Amplicon


Adapter CTGTCTCTTATACACATCT

And 5 other samples sequenced in 2016:


Workflow:    GenerateFASTQ
Application: NextSeq FASTQ Only
Assay:       TruSeq LT
Chemistry:   Default


Adapter AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
AdapterRead2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT


Each sample comes in 8 separate gz files (one per lane and read) that have to be combined into 2 files, because they are paired-end reads.

This would be my bash script file "combineFiles.sh":


#!/bin/bash

# SAMPLE_DIR: directory containing the 8 per-lane fastq.gz files
# FASTQDIR:   output directory
# pre1, pre2: input file prefix and output sample name

for R in 1 2; do
    zcat "$SAMPLE_DIR/${pre1}"*R${R}*.fastq.gz | gzip -n9 > \
        "$FASTQDIR/${pre2}_R${R}.fastq.gz" &
done
wait
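
To make sure the combining step doesn't lose reads, the counts before and after can be compared (a FASTQ read is always 4 lines). Here is a sketch with toy per-lane files; the file names are invented:

```shell
# Two toy per-lane files for R1
mkdir -p parts
printf '@a/1\nACGT\n+\nIIII\n' | gzip > parts/L001_R1.fastq.gz
printf '@b/1\nTTGA\n+\nIIII\n' | gzip > parts/L002_R1.fastq.gz

# Combine them the same way combineFiles.sh does
zcat parts/*R1*.fastq.gz | gzip -n9 > sample_R1.fastq.gz

# Read counts must match (4 lines per read)
before=$(( $(zcat parts/*R1*.fastq.gz | wc -l) / 4 ))
after=$(( $(zcat sample_R1.fastq.gz | wc -l) / 4 ))
echo "$before reads in, $after reads out"   # prints "2 reads in, 2 reads out"
```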


Then the reads need to be trimmed to remove possible adapter contamination. There are many tools for that, and I've tried some of them.

The first tool I used was “skewer-0.1.127-linux-x86_64”:


#-l minimum length of read
#-z gzip output
#-x forward adapter(s)/primer(s)
#-y reverse adapter(s)/primer(s)
#in Nextera, x and y are the same sequence


skewer-0.1.127-linux-x86_64 -l 20 -z -x $adaptor -y $adaptor2 $file1 $file2


Skewer generated a report that for the TruSeq samples said this:


10625038 read pairs processed; of these: 
5 ( 0.00%) short read pairs filtered out after trimming by size control
0 ( 0.00%) empty read pairs filtered out after trimming by size control
10625033 (100.00%) read pairs available; of these: 
2639092 (24.84%) trimmed read pairs available after processing 
7985941 (75.16%) untrimmed read pairs available after processing


And for the Nextera sample this:


11606799 read pairs processed; of these: 
1576791 (13.59%) short read pairs filtered out after trimming by size control 
1 ( 0.00%) empty read pairs filtered out after trimming by size control
10030007 (86.41%) read pairs available; of these:
7759731 (77.37%) trimmed read pairs available after processing 
2270276 (22.63%) untrimmed read pairs available after processing

The second tool was “Trimmomatic” (I tested it only on the 5 TruSeq samples):


java -jar Trimmomatic-0.36/trimmomatic-0.36.jar PE -phred33 $file1 $file2 \
$output1 $output2 ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 \
SLIDINGWINDOW:4:15 MINLEN:36

This run created two files corresponding to the two reads of each pair, but one file was very big and the other very small, and the alignment step failed when I tried to use them. (Trimmomatic in PE mode actually expects four output files, paired and unpaired for each read, so the two-output invocation above was probably not doing what I assumed.) The percentage of trimmed reads was also very low, so I decided to try yet another tool.

The third tool I tried was “cutadapt”:


cutadapt-1.12/cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC -o $output1 \
-p $output2 $file1 $file2

The report of cutadapt for one of the TruSeq samples was:

Total read pairs processed: 10,625,038
Read 1 with adapter: 123,448 (1.2%)
Read 2 with adapter: 125,400 (1.2%)
Pairs written (passing filters): 10,625,038 (100.0%)


To search for the adapters in the data manually, I used this command:

zcat SRR026762.fastq.gz | head -n 20000 | grep ATCTCGTATGCCGTCTTCTGCTTG
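
That grep has two caveats: it also scans the quality lines, and it prints whole reads rather than a count. Here is a variant that restricts the search to the sequence lines, shown on a toy file:

```shell
# Toy FASTQ: read r1 contains the adapter fragment, read r2 does not
printf '@r1\nACGTATCTCGTATGCCGTCTTCTGCTTGACGT\n+\nIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII\n@r2\nACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIII\n' > reads.fq

# Keep only the sequence lines (2nd of every 4) and count the matches
awk 'NR % 4 == 2' reads.fq | grep -c ATCTCGTATGCCGTCTTCTGCTTG   # prints 1
```

On the real data the same idea works after zcat, without the head limit, to count adapter hits across the whole file.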

I found this link very useful to learn more about trimming illumina data: http://www.ark-genomics.org/events-online-training-eu-training-course/adapter-and-quality-trimming-illumina-data