#!/bin/bash

shopt -s expand_aliases

alias ~~~=":<<'~~~bash'"

:<<'~~~bash'

Usage

$ removeDuplicates.sh fastqR1 [fastqR2 [fastqR3 ...]]
---
title: removeDuplicates.sh
---
flowchart TD
    R1[(fastqR1)] --> RD[removeDuplicates.sh]
    R2[(fastqR2)] --> RD
    RN[(...)] --> RD
    RD --> UNIQUE[("stdout<br/>R1 R2 ... #<br/>... ... ... ...")]
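
Because the count sits in the last column of stdout, downstream steps can rank the unique rows by abundance. The following is only a minimal sketch: `rank_by_count` is a hypothetical helper, not part of removeDuplicates.sh, and it assumes tab-separated rows with the count as the final field.

```bash
# Hypothetical helper: order removeDuplicates.sh-style rows by the count
# in the last column, largest first. Reads rows from stdin.
rank_by_count() {
    awk -F '\t' '{ print $NF "\t" $0 }' |   # copy the count to the front as a sort key
        sort -k1,1nr |                      # numeric sort, descending
        cut -f2-                            # drop the temporary key again
}

# Toy input in the assumed layout: R1, R2, count (tab-separated).
ranked=$(printf 'ACGT\tGGGG\t1\nACGT\tTTTT\t2\n' | rank_by_count)
printf '%s\n' "$ranked"
```

With real data the same function could be fed from a pipe, for example `removeDuplicates.sh fastqR1 fastqR2 | rank_by_count | head`.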

Why use several fastq files as input to removeDuplicates.sh?

Paired-end next-generation sequencing (NGS) is quite common. Although the mappable segment may lie in only R1 or only R2, the other end still helps to determine the locus of the sequence. See demultiplex.sh.

Should I input the raw fastq files directly, or remove adapters, barcodes and so on before passing them to removeDuplicates.sh?

The stdout of removeDuplicates.sh is aligned to the so-called markers in demultiplex.sh to determine the loci of the lines. If you keep adapters, barcodes and so on in the input fastq files, you should provide them in the markers as well.
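
To illustrate why the mate matters, here is a small sketch that counts unique R1/R2 pairs on toy data. It mirrors the pipeline of this script but is not the script itself: `count_unique_pairs` and the file names are hypothetical, the inputs are assumed to be uncompressed, plain `awk` stands in for `gawk`, and `sed -n '2~4p'` assumes GNU sed.

```bash
# Hypothetical helper: count unique R1/R2 sequence pairs in two uncompressed FASTQ files.
count_unique_pairs() {
    # Join the files line by line, keep the sequence line of each record,
    # count identical pairs, and move the count to the last column.
    paste "$1" "$2" | sed -n '2~4p' | sort | uniq -c |
        awk '{ printf "%s\t%s\t%s\n", $2, $3, $1 }'
}

# Toy data: all three reads share the same R1 sequence, but read b differs in R2.
tmpdir=$(mktemp -d)
printf '@a\nACGT\n+\nIIII\n@b\nACGT\n+\nIIII\n@c\nACGT\n+\nIIII\n' > "$tmpdir/R1.fastq"
printf '@a\nTTTT\n+\nIIII\n@b\nGGGG\n+\nIIII\n@c\nTTTT\n+\nIIII\n' > "$tmpdir/R2.fastq"

pairs=$(count_unique_pairs "$tmpdir/R1.fastq" "$tmpdir/R2.fastq")
printf '%s\n' "$pairs"
rm -r "$tmpdir"
```

Judged by R1 alone, the three reads would collapse into a single sequence. With R2 included, reads a and c form one pair with count 2, while read b remains a separate pair with count 1.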

Source

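fqlist=""
for fq in "$@"
do
    # Shell-quote the path so that the eval below copes with spaces and special characters.
    printf -v fqq '%q' "$fq"
    if file "$fq" | grep -q compressed
    then
        # Compressed input: decompress on the fly through process substitution.
        fqlist="$fqlist <(zcat $fqq)"
    else
        fqlist="$fqlist $fqq"
    fi
done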

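# Put the inputs side by side, keep only the sequence line of each record
# (line 2 of every 4), then count identical rows.
eval paste $fqlist | sed -n '2~4p' | sort | uniq -c | gawk '
    {
        # uniq -c puts the count first; print the sequences first and the count last.
        for (i = 2; i <= NF; ++i)
            printf("%s\t", $i)
        print $1
    }
'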
alias ~~~=":" # This suppresses a warning and is not part of source.