Trimming Illumina paired-end reads
I found this link very useful to learn more about trimming illumina data: http://www.ark-genomics.org/events-online-training-eu-training-course/adapter-and-quality-trimming-illumina-data
First I will explain a little about my data in the current project.
I have one sample sequenced in 2015:
And 5 other samples sequenced in 2016:
Each of the files comes in 8 separated gz files that have to be combined in 2 files because they paired-end reads.
This would be my bash script file "combineFiles.sh":
The first tool I've used was “skewer-0.1.127-linux-x86_64”:
Skewer geneterated a report where for the TruSeq samples said this:
And for the Nextera sample this:
This algorithm created two files corresponding to the two pair reads but one of the files was very big and the other very small. When I tried to use them in the alignment step it failed. The percentage of trimmed reads was very low so I decided to try still another tool:
The third tool I tried was “cutadapt”:
The report of cutadapt for one of the TruSeq samples was:
When I tried to search manually the adapters in the data I used this command:
zcat
SRR026762.fastq.gz | head -n 20000 | grep ATCTCGTATGCCGTCTTCTGCTTGI have one sample sequenced in 2015:
| Workflow | GenerateFASTQ |
| Application | NextSeq FASTQ Only |
| Assay | Nextera XT |
| Description Chemistry | Amplicon |
| Adapter | CTGTCTCTTATACACATCT |
And 5 other samples sequenced in 2016:
| Workflow | GenerateFASTQ |
| Application | NextSeq FASTQ Only |
| Assay | TruSeq LT |
| Description Chemistry | Default |
| Adapter | AGATCGGAAGAGCACACGTCTGAACTCCAGTCA |
| AdapterRead2 | AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT |
Each of the files comes in 8 separated gz files that have to be combined in 2 files because they paired-end reads.
This would be my bash script file "combineFiles.sh":
#!/bin/bash
for R in 1 2; do
zcat $SAMPLE_DIR/$pre1""*R$R*.fastq.gz | gzip -n9 > \
$FASTQDIR/"$pre2"_R$R.fastq.gz &
done
Then they need to be trimmed to remove possible adaptor contamination. There are many tools to do that and I've tried some of them.The first tool I've used was “skewer-0.1.127-linux-x86_64”:
#-l minimum length of read
#-z gzip output
#-x forward adapter(s)/primer(s)
#-y forward adapter(s)/primer(s)
#in nextera x and y are the same
skewer-0.1.127-linux-x86_64 -l 20 -z -x $adaptor -y $adaptor2 $file1 $file2
Skewer geneterated a report where for the TruSeq samples said this:
10625038 read pairs processed; of these:
5 ( 0.00%) short read pairs filtered out after trimming by size control
0 ( 0.00%) empty read pairs filtered out after trimming by size control
10625033 (100.00%) read pairs available; of these:
2639092 (24.84%) trimmed read pairs available after processing
7985941 (75.16%) untrimmed read pairs available after processing
And for the Nextera sample this:
11606799 read pairs processed; of these:
1576791 (13.59%) short read pairs filtered out after trimming by size control
1 ( 0.00%) empty read pairs filtered out after trimming by size control
10030007 (86.41%) read pairs available; of these:
7759731 (77.37%) trimmed read pairs available after processing
2270276 (22.63%) untrimmed read pairs available after processing
The second tool was “Trimmomatic” (I tested it only for the 5 TruSeq samples) :
java -jar Trimmomatic-0.36/trimmomatic-0.36.jar PE -phred33 $file1 $file2 \
$output1 $output2 ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 \
SLIDINGWINDOW:4:15 MINLEN:36
This algorithm created two files corresponding to the two pair reads but one of the files was very big and the other very small. When I tried to use them in the alignment step it failed. The percentage of trimmed reads was very low so I decided to try still another tool:
The third tool I tried was “cutadapt”:
cutadapt-1.12/cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC -o $output1 \
-p $output2 $file1 $file2
The report of cutadapt for one of the TruSeq samples was:
Total read pairs processed: 10,625,038
Read 1 with adapter: 123,448 (1.2%)
Read 2 with adapter: 125,400 (1.2%)
Pairs written (passing filters): 10,625,038 (100.0%)
When I tried to search manually the adapters in the data I used this command:
I found this link very useful to learn more about trimming illumina data: http://www.ark-genomics.org/events-online-training-eu-training-course/adapter-and-quality-trimming-illumina-data
No comments:
Post a Comment