Shi Xing post-process from demultiplex to rearr
#!/bin/bash
shopt -s expand_aliases
alias ~~~=":<<'~~~bash'"
:<<'~~~bash'
Usage
sxCutR2AdapterFilterCumulate.md demultiplexFile minToMapShear >toMapFile
Introduction
The output of demultiplex.md needs further post-processing before it can be fed to rearr. For Shi Xing's data, this in-house script does that. Its output matches the input format of rearr, one tab-separated record per line:
query<tab>count<tab>refId<newline>
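As a rough illustration of this format, the sketch below counts the lines of a file that do not have exactly three tab-separated fields with an integer count. The helper name checkToMap and the sample records are hypothetical, plain awk stands in for gawk, and whether rearr itself enforces these constraints is an assumption.

```bash
# checkToMap: hypothetical helper, not part of this script.
# Reads query<tab>count<tab>refId records on stdin and prints the number
# of malformed lines (wrong field count or non-integer count).
checkToMap() {
    awk -F '\t' '
        NF != 3 || $2 !~ /^[0-9]+$/ { bad++ }
        END { print bad + 0 }
    '
}

# Two well-formed records with made-up values.
printf 'ACGTACGT\t12\t0\nTTGACA\t3\t1\n' | checkToMap
```

A result of 0 means every line matched the assumed layout.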
The post-processing consists of four steps.
- Remove the adapter from the 3' end of R2.
- Remove the primer, the barcode and a 3bp gap from the 5' end of R2.
- Discard the read if the remainder of R2 is shorter than minToMapShear.
- Accumulate the adjacent duplicates with sxCumulateToMapCutAdaptSpliter.awk.
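The 5' trimming, filtering and accumulation steps can be sketched on toy data as follows. The adapter step is left out because it needs cutadapt. The function names, the four-column input layout (read, count, refId, 5' offset) and the accumulation rule (summing the counts of adjacent lines that share query and refId) are assumptions for illustration. The actual sxCumulateToMapCutAdaptSpliter.awk is not shown here and may behave differently. Plain awk is used in place of gawk.

```bash
# trimAndFilter MIN: hypothetical helper.
# Input columns: read, count, refId, offset (bases to drop from the 5-prime end).
# Keeps a read only if at least MIN bases remain after the trim.
trimAndFilter() {
    awk -F '\t' -v OFS='\t' -v min="$1" '
        $4 + min <= length($1) { print substr($1, $4 + 1), $2, $3 }
    '
}

# sumAdjacent: hypothetical stand-in for the accumulation step.
# Merges consecutive lines with the same query and refId, summing counts.
sumAdjacent() {
    awk -F '\t' -v OFS='\t' '
        NR > 1 && $1 == q && $3 == r { c += $2; next }
        NR > 1 { print q, c, r }
        { q = $1; c = $2; r = $3 }
        END { if (NR > 0) print q, c, r }
    '
}

# Toy input: offset 3, minimum remaining length 4.
# The third read is too short after trimming and is dropped.
# Prints: CCCCGG<tab>7<tab>0
printf 'AAACCCCGG\t2\t0\t3\nTTTCCCCGG\t5\t0\t3\nAAACC\t1\t0\t3\n' |
    trimAndFilter 4 | sumAdjacent
```

Only adjacent duplicates are merged in this sketch, so identical queries separated by a different line stay as separate records.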
Source
~~~bash
cutadaptPlain()
{
# Usage: cutadaptPlain 3'adapter <plainseq
# cutadapt does not accept plain sequences (one per line). This function wraps
# each line in a FASTA record, feeds it to cutadapt, then strips the FASTA
# headers from the output again.
# Input (stdin): plain sequences, one per line
# Output (stdout): 3'-trimmed plain sequences, one per line
# Note: the 1~2 address is a GNU sed extension.
sed '=' | sed '1~2s/^/>s/' | cutadapt -a "$1" - 2> /dev/null | sed '1~2d'
}
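rmDupFile=$1
minToMapShear=$2
# Columns after paste: $1 = adapter-trimmed R2, $2 and $3 = input fields 3-4
# (count and refId, going by the output format), $4 = input field 8 (number of
# bases to cut from the 5' end).
cut -f1 "$rmDupFile" |
cutadaptPlain GCACCGACTCGGTGCCACTTTTTCAAGTTGATAACGGACTAGCCTTATTTTAACTTGCTATTTCTAGCTCTAAAAC |
paste - <(cut -f3-4,8 "$rmDupFile") |
gawk -F "\t" -v OFS="\t" -v minToMapShear="$minToMapShear" '
{
# Keep the read only if at least minToMapShear bases remain after the cut.
if ($4 + minToMapShear <= length($1)) {
print substr($1, $4 + 1), $2, $3
}
}' | gawk -f sxCumulateToMapCutAdaptSpliter.awk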
alias ~~~=":" # This suppresses a warning and is not part of source.
~~~