#!/bin/bash
shopt -s expand_aliases
alias ~~~=":<<'~~~bash'"
:<<'~~~bash'
Usage
$ removeDuplicates.sh fastqR1 [fastqR2 [fastqR3 ...]]
---
title: removeDuplicates.sh
---
flowchart TD
R1[(fastqR1)] --> RD[removeDuplicates.sh]
R2[(fastqR2)] --> RD
RN[(...)] --> RD
RD --> UNIQUE[(stdout)]

| R1 | R2 | ... | # |
|---|---|---|---|
| ... | ... | ... | ... |
- The input fastq files can be gz or zip compressed.
- Multiple fastq files, even more than two, are supported in theory. In practice, however, one usually has only `R1` and `R2`.
- Two reads are duplicates if they are identical across all fastq files. The duplication count `#` follows the tandem, tab-separated list of reads from the fastq files in the `stdout` of `removeDuplicates.sh`.
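
The duplicate rule and output layout above can be illustrated with a minimal sketch. The file names `demo_R1.fastq` and `demo_R2.fastq` and their reads are made up for illustration, the input is assumed to be uncompressed, and `awk 'NR % 4 == 2'` stands in as a portable equivalent of the GNU `sed -n '2~4p'` step used in the source.

```shell
# Hypothetical toy data: three read pairs; pairs @a and @c are identical in both files.
tmp=$(mktemp -d)
printf '@a\nACGT\n+\nIIII\n@b\nACGT\n+\nIIII\n@c\nACGT\n+\nIIII\n' > "$tmp/demo_R1.fastq"
printf '@a\nTTTT\n+\nIIII\n@b\nGGGG\n+\nIIII\n@c\nTTTT\n+\nIIII\n' > "$tmp/demo_R2.fastq"

# Join the files line by line, keep the sequence line of each 4-line record,
# count identical R1/R2 combinations, then move the count to the last column.
result=$(paste "$tmp/demo_R1.fastq" "$tmp/demo_R2.fastq" |
    awk 'NR % 4 == 2' | LC_ALL=C sort | uniq -c |
    awk '{ for (i = 2; i <= NF; ++i) printf("%s\t", $i); print $1 }')

printf '%s\n' "$result"
rm -r "$tmp"
```

With this toy input, the pair `ACGT`/`TTTT` is reported with a count of 2 and the pair `ACGT`/`GGGG` with a count of 1, even though `R1` is the same in all three records.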
Why use several fastq files as input of removeDuplicates.sh?
Paired-end next-generation sequencing (NGS) is quite common. Although the mappable segment may be in only R1 or only R2, the other end still helps to determine the locus of the sequence. See `demultiplex.sh`.
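
One way to see the effect of the second read on duplicate counting is the small sketch below. The sequences are hypothetical, and the comparison is only an illustration of the idea, not part of `removeDuplicates.sh` itself.

```shell
# Hypothetical sequence lines from three read pairs (R1 <tab> R2).
pairs=$(printf 'ACGT\tTTTT\nACGT\tGGGG\nACGT\tTTTT\n')

# Deduplicating on R1 alone collapses all three reads into a single entry.
r1_unique=$(printf '%s\n' "$pairs" | cut -f1 | sort -u | wc -l | tr -d ' ')

# Deduplicating on the R1/R2 combination keeps the two distinct fragments apart.
pair_unique=$(printf '%s\n' "$pairs" | sort -u | wc -l | tr -d ' ')

echo "unique by R1 only: $r1_unique"
echo "unique by R1+R2: $pair_unique"
```

Here the R1-only view sees one unique read, while the paired view sees two, so passing both files avoids merging fragments that merely share one end.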
Should I input raw fastq files directly, or remove adapters, barcodes and so on before passing them to removeDuplicates.sh?
The `stdout` of `removeDuplicates.sh` is aligned to the so-called markers in `demultiplex.sh` to determine the loci of lines. If you keep adapters, barcodes and so on in the input fastq files, it is suggested to provide them in the markers as well.
Source
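fqlist=""
for fq in "$@"
do
    # Shell-quote the path so that eval below handles spaces and special characters.
    printf -v fqq '%q' "$fq"
    if file "$fq" | grep -q compressed
    then
        # Decompress on the fly through process substitution.
        fqlist="$fqlist <(zcat $fqq)"
    else
        fqlist="$fqlist $fqq"
    fi
done
# Join the files line by line, keep the sequence line of each 4-line record,
# count identical lines, then move the count from the first to the last column.
eval "paste $fqlist" | sed -n '2~4p' | sort | uniq -c | gawk '
{
    for (i = 2; i <= NF; ++i)
        printf("%s\t", $i)
    print $1
}
'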
alias ~~~=":" # This suppresses a warning and is not part of source.