Usage

$ removeDuplicates.sh fastqR1 [fastqR2 [fastqR3 ...]]

---
title: removeDuplicates.sh
---
flowchart TD
    R1[(fastqR1)] --> RD[removeDuplicates.sh]
    R2[(fastqR2)] --> RD
    RN[(...)] --> RD
    RD --> UNIQUE[("stdout: R1, R2, ..., #")]

R1	R2	...	#
...	...	...	...

The input fastq files may be gz or zip compressed.
In principle any number of fastq files is supported, even more than two. In practice, one usually has only R1 and R2.
Two records are duplicates if their reads are identical across all fastq files. In the stdout of removeDuplicates.sh, each line lists the reads from the fastq files in tandem, separated by tabs, followed by the duplication count #.
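
The output format described above can be illustrated with a small sketch. It builds two toy fastq files (invented data, not from any real run) and applies the same pipeline as the Source section. As assumptions for illustration, it reads uncompressed files directly with `paste` and uses `awk` in place of `gawk`.

```shell
# Toy data: three read pairs; the first and third are identical in both files.
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n@r3\nACGT\n+\nIIII\n' > "$tmp/R1.fastq"
printf '@r1\nGGCC\n+\nIIII\n@r2\nAAAA\n+\nIIII\n@r3\nGGCC\n+\nIIII\n' > "$tmp/R2.fastq"

# Join the files column-wise, keep the sequence line of every 4-line record,
# count identical rows, then move the count to the last column.
result=$(paste "$tmp/R1.fastq" "$tmp/R2.fastq" | sed -n '2~4p' | sort | uniq -c | awk '
    {
        for (i = 2; i <= NF; ++i)
            printf("%s\t", $i)
        print $1
    }
')
printf '%s\n' "$result"
rm -r "$tmp"
```

With this toy input the pair ACGT/GGCC is reported once with a count of 2, and TTTT/AAAA once with a count of 1.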

Why use several `fastq` files as input of `removeDuplicates.sh`?

Paired-end next-generation sequencing (NGS) is quite common. Although the mappable segment may lie only in R1 or only in R2, the other end still helps to determine the locus of the sequence. See demultiplex.sh.
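
A minimal sketch of why the second file matters, using invented toy reads: two records share the same R1 read but differ in R2. The helper `count_dups` is hypothetical. It mirrors the pipeline of the Source section for uncompressed files, with `awk` standing in for `gawk`.

```shell
# Hypothetical helper: count duplicate rows across the given uncompressed fastq files.
count_dups() {
    paste "$@" | sed -n '2~4p' | sort | uniq -c | awk '
        {
            for (i = 2; i <= NF; ++i)
                printf("%s\t", $i)
            print $1
        }
    '
}

tmp=$(mktemp -d)
# Both records have the same R1 read but different R2 reads.
printf '@a\nACGT\n+\nIIII\n@b\nACGT\n+\nIIII\n' > "$tmp/R1.fastq"
printf '@a\nGGCC\n+\nIIII\n@b\nGGTT\n+\nIIII\n' > "$tmp/R2.fastq"

r1_only=$(count_dups "$tmp/R1.fastq")
paired=$(count_dups "$tmp/R1.fastq" "$tmp/R2.fastq")
printf 'R1 only:\n%s\nR1 and R2:\n%s\n' "$r1_only" "$paired"
rm -r "$tmp"
```

With R1 alone, the two records collapse into one line with a count of 2. With both files, they stay separate, each with a count of 1.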

Should I input raw `fastq` files directly, or remove the `adapter`, `barcode` and so on before passing them to `removeDuplicates.sh`?

The stdout of removeDuplicates.sh is aligned to the so-called markers in demultiplex.sh to determine the loci of its lines. If you keep the adapter, barcode and so on in the input fastq files, you should provide them in the markers as well.

Source

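# Build the argument list for paste; requires bash (process substitution).
fqlist=""
for fq in "$@"
do
    # Decompress on the fly when file(1) reports compressed data.
    if file "$fq" | grep -q compressed
    then
        fqlist="$fqlist <(zcat $(printf '%q' "$fq"))"
    else
        fqlist="$fqlist $(printf '%q' "$fq")"
    fi
done

# Keep the sequence line of each 4-line record (GNU sed), count identical
# rows, and move the count from the first column to the last.
eval paste $fqlist | sed -n '2~4p' | sort | uniq -c | gawk '
    {
        for (i = 2; i <= NF; ++i)
            printf("%s\t", $i)
        print $1
    }
'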

alias ~~~=":" # This suppresses a warning and is not part of source.
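
The `first~step` address in `sed -n '2~4p'` is a GNU sed extension. Where GNU sed is not available, the same selection of sequence lines can be sketched with `awk`. This is a swapped-in alternative for illustration, not part of the source above, and the toy record below is invented.

```shell
# Two toy fastq records held in a variable.
fastq=$(printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n')

# Keep line 2 of every 4-line record, i.e. the sequence line.
sequences=$(printf '%s\n' "$fastq" | awk 'NR % 4 == 2')
printf '%s\n' "$sequences"
```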
