Usage

$ sxCutR2AdapterFilterCumulate.sh \
    demultiplex_file \
    minToMapShear

---
title: sxCutR2AdapterFilterCumulate.sh
---
flowchart TD
    ONTARGET[("demultiplex_file (columns: R2, R1, #, id, rstart2, rend2, qstart2, qend2, rstart1, rend1, qstart1, qend1, ...)")] --> sxCRAFC[sxCutR2AdapterFilterCumulate.sh]
    sxCRAFC --> QUERY[("stdout (columns: query, #, id)")]

demultiplex_file (tab-separated):
R2	R1	#	id	rstart2	rend2	qstart2	qend2	rstart1	rend1	qstart1	qend1	...
...	...	...	...	...	...	...	...	...	...	...	...	...

stdout (tab-separated):
query	#	id
...	...	...

The stdout of demultiplex.sh (demultiplex_file) needs further post-processing before it is fed to the core part of rearr. Note that for in-house data we put R2 before R1 when calling demultiplex.sh, because our in-house data have the query on R2.

R2 = primer(21bp) + barcode(18bp) + 3bp + query(44bp) + RCscaffold(83/93bp). We extract the query from R2 as follows.
1. Trim the 3' RCscaffold with cutadapt.
2. Extract the query, which starts 3bp downstream of qend2 (the alignment end position of the barcode).
3. Filter out queries shorter than minToMapShear.
4. Accumulate adjacent duplicate queries with sxCumulateToMapCutAdaptMarker.awk.
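
Steps 2 and 3 can be tried on a toy record without running cutadapt. The sketch below is illustrative only: extractQuery is a hypothetical helper name, the input is assumed to be already trimmed, and its columns (trimmed R2, #, id, qend2) mirror what the script builds with paste. It uses plain awk rather than gawk so that it runs on most systems.

```shell
# Hypothetical helper mirroring steps 2 and 3.
# stdin columns (tab-separated): trimmed R2, #, id, qend2
# $1 of the function: minToMapShear
extractQuery() {
    awk -F '\t' -v OFS='\t' -v minToMapShear="$1" '
    $4 + 3 + minToMapShear <= length($1) {
        # substr is 1-based: skip the first qend2 bases plus the 3bp spacer
        print substr($1, $4 + 4), $2, $3
    }'
}

# Toy record: 5bp standing in for primer+barcode (qend2 = 5), 3bp spacer, 8bp query.
printf 'AAAAACCCGGGGTTTT\t7\tread1\t5\n' | extractQuery 8
# prints: GGGGTTTT<TAB>7<TAB>read1

# With minToMapShear = 9 the 8bp query is too short, so nothing is printed.
printf 'AAAAACCCGGGGTTTT\t7\tread1\t5\n' | extractQuery 9
```

The filter compares against the length of the trimmed read, so a read whose remaining query is exactly minToMapShear long is kept.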

Source

cutadaptPlain()
{
    # Usage: cutadaptPlain <plainseq 3'adapter
    # cutadapt does not accept plainseq (one bare sequence per line). This function converts plainseq to FASTA before feeding it to cutadapt, and then converts the FASTA output back to plainseq.
    # Input (stdin): plainseq
    # Output (stdout): 3'-trimmed plainseq
    # Note: the 1~2 address is a GNU sed extension.
    sed '=' | sed '1~2s/^/>s/' | cutadapt -a "$1" - 2> /dev/null | sed '1~2d'
}

rmDupFile=$1
minToMapShear=$2
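# Columns after paste: $1 = trimmed R2, $2 = #, $3 = id, $4 = qend2
cut -f1 "$rmDupFile" | cutadaptPlain GCACCGACTCGGTGCCACTTTTTCAAGTTGATAACGGACTAGCCTTATTTTAACTTGCTATTTCTAGCTCTAAAAC | paste - <(cut -f3-4,8 "$rmDupFile") | gawk -F "\t" -v OFS="\t" -v minToMapShear="$minToMapShear" '
{
    # Keep the read only if at least minToMapShear bases remain after qend2 and the 3bp spacer
    if ($4 + 3 + minToMapShear <= length($1)) {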
        print substr($1, $4 + 4), $2, $3
    }
}' | gawk -f sxCumulateToMapCutAdaptMarker.awk

alias ~~~=":" # This suppresses a warning and is not part of source.
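
The plainseq-to-FASTA round trip inside cutadaptPlain can be checked on its own. The sketch below is an illustration, not part of the script: toFasta and fromFasta are hypothetical names for the two sed stages, the cutadapt call in between is simply left out, and GNU sed is assumed because of the 1~2 address.

```shell
# Hypothetical names for the two sed stages of cutadaptPlain (GNU sed assumed).
toFasta() {
    # sed '=' prints each line number on its own line before the sequence;
    # the second sed turns every odd line (the number) into a ">s<number>" header.
    sed '=' | sed '1~2s/^/>s/'
}

fromFasta() {
    # Drop every odd line (the headers), leaving one sequence per line.
    sed '1~2d'
}

printf 'ACGT\nTTGA\n' | toFasta
# >s1
# ACGT
# >s2
# TTGA

printf 'ACGT\nTTGA\n' | toFasta | fromFasta
# ACGT
# TTGA
```

Dropping headers by line parity only works if each record takes exactly two lines. The later paste step also relies on the trimmed output having the same number of lines, in the same order, as the input.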