`rearrangement`

rearrangement is the core chimeric alignment engine of rearr.

$ rearrangement -h
### Basic Usage
rearrangement <input_file 3<reference_file

### Parameters
-h, -help, --help: Display help.
# Aligning Parameters
-s0: Mismatching score. (default: -6)
-s1: Matching score for non-extension reference part. (default: 4)
-s2: Matching score for extension reference part. (default: 2)
-u: Gap-extending penalty. (default: -3)
-v: Gap-opening penalty. (default: -9)
-ru: Gap-extending penalty for unaligned reference ends. (default: 0)
-rv: Gap-opening penalty for unaligned reference ends. (default: 0)
-qu: Gap-extending penalty for unaligned query parts. (default: 0)
-qv: Gap-opening penalty for unaligned query parts. (default: -5)

---
title: rearrangement
---
flowchart TD
    QUERY[("input_file: query, #, id")] --> REARR[rearrangement]
    REF[("reference_file: start1, ref1, end1, start2, ref2, end2")] --> REARR
    REARR --> ALG[("stdout: header line (idx, #, score, id), reference line (ref1, ref2), query line")]

query	#	id
...	...	...

start1	ref1	end1	start2	ref2	end2
...	...	...	...	...	...

`input_file`

query is the query sequence.
# is the number of times the query occurs (its duplication count).
id is the 0-based line number of the corresponding reference in reference_file.
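
As an illustration of the three columns described above, the following minimal sketch collapses identical reads into `query`, `#`, `id` rows. It assumes the columns are tab-separated and that every read is aligned against the same reference line; the function name `build_input_lines` and the sample reads are hypothetical and not part of rearr.

```python
from collections import Counter


def build_input_lines(reads, ref_id=0):
    """Collapse identical reads into `query<TAB>count<TAB>id` lines.

    Assumption: columns are tab-separated and all reads share one
    reference line (ref_id is its 0-based line number in reference_file).
    """
    counts = Counter(reads)
    # Counter keeps first-seen order, so output order follows the input.
    return [f"{query}\t{count}\t{ref_id}" for query, count in counts.items()]


# Hypothetical reads, for illustration only.
reads = ["ACGT", "TTGA", "ACGT", "ACGT"]
lines = build_input_lines(reads, ref_id=0)
print("\n".join(lines))
```

Writing `lines` to a file, one per line, gives something shaped like `input_file`.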

`reference_file`

rearr takes two references (ref1 and ref2) as input. This is useful for analyzing double-cleavage CRISPR experiments, for example those that delete a large portion of the genome.
The upstream part of query is aligned to ref1, which in the case of two-cleavage deletion is the sequence around the upstream cleavage site. The actual ref1 may depend on the repair junction (deletion, inversion or duplication).
The downstream part of query is aligned to ref2, which in the case of two-cleavage deletion is the sequence around the downstream cleavage site.
For the single cleavage case, ref2 just repeats ref1.
The region between start1 and end1 is the non-extension region of ref1. The regions upstream of start1 or downstream of end1 are extension regions of ref1. The matching score (s2) for extension regions is lower than that (s1) for the non-extension region.
The region between start2 and end2 is the non-extension region of ref2. The regions upstream of start2 or downstream of end2 are extension regions of ref2.
The reason to distinguish extension and non-extension regions is to avoid spurious templated insertions induced by small mutations away from the cleavage site in the single cleavage case, as shown in the following video.
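
The split between extension and non-extension regions can be pictured as a per-position matching score. The sketch below is only an illustration: it assumes that start and end are 0-based and that the non-extension region is the half-open interval [start, end), which the text above does not state. The defaults for `s1` and `s2` are the ones listed in the help output; the function name `match_scores` is hypothetical.

```python
def match_scores(ref_len, start, end, s1=4, s2=2):
    """Return the matching score that would apply at each reference position.

    Assumption: positions are 0-based and [start, end) is the
    non-extension region; everything outside it is extension.
    """
    return [s1 if start <= pos < end else s2 for pos in range(ref_len)]


# A hypothetical reference of length 10 with non-extension region [3, 7).
scores = match_scores(ref_len=10, start=3, end=7)
print(scores)
```

Under these assumptions a match inside the non-extension region contributes s1, while a match in either flanking extension region contributes only s2.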

`stdout`

Every three lines of the standard output represent a single alignment.

The first line is a header line.
- idx is the 1-based line number of query in input_file.
- # is the duplication number of the query.
- score is the alignment score.
- id is the 0-based line number of reference in reference_file.
The second line is the sequence of the reference.
The third line is the query with idx.
The second and third lines together form the actual alignment, as shown in the following example.

1       1       157     9300
---aGTTGGCTAGTCAATACCTGAAGAGAGATTGGCCTGGAGTAAAAGC-TGAtaAAAGCTGATGATCGGAATGATTACAGGTAAATTAGTAGTTTTTGCCTATTTTCTTTAGAAACGGTTTTACTTAAAGCTATGTTACATATAGATAATGTAACACTCTAGt-------
CTG----------------------------TTGGCCTGGAGTAAAAGCATGAT----------GATCGGAATGATTACAGGTAAA------------------------------------------------------------------------------CAAAAAA
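
The three-line layout described above can be read back with a small parser. This is a minimal sketch, assuming the header fields are separated by whitespace and that the score may be parsed as a number; the `Alignment` type, the `parse_alignments` function and the shortened sample record are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class Alignment(NamedTuple):
    idx: int        # 1-based line number of the query in input_file
    count: int      # duplication number of the query
    score: float    # alignment score (parsed as float to be permissive)
    ref_id: int     # 0-based line number of the reference
    reference: str  # gapped reference line
    query: str      # gapped query line


def parse_alignments(lines: Iterable[str]) -> Iterator[Alignment]:
    """Group the output into (header, reference, query) triples."""
    stripped = (line.rstrip("\n") for line in lines)
    # zip over the same generator three times yields consecutive triples.
    for header, reference, query in zip(stripped, stripped, stripped):
        idx, count, score, ref_id = header.split()[:4]
        yield Alignment(int(idx), int(count), float(score), int(ref_id),
                        reference, query)


# Shortened, hypothetical record in the same shape as the example above.
sample = ["1\t1\t157\t9300\n", "--aGTTG-\n", "CT--TTGA\n"]
records = list(parse_alignments(sample))
print(records[0].idx, records[0].score, records[0].ref_id)
```

Because the second and third lines form the alignment, they should have the same length once gaps (`-`) are included, which makes a cheap sanity check.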

`correct_micro_homology.awk`

Microhomology is common in CRISPR editing outcomes. When microhomology occurs, rearrangement cannot determine how to align query to ref1 and ref2, as shown in the following video.

correct_micro_homology.awk allows one to specify which end of the double strand break should be corrected toward the cleavage site up to the microhomology equivalence.

$ gawk -f correct_micro_homology.awk -- \
    reference_file \
    direction_file \
    < rearrangement_file
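
The ambiguity caused by microhomology can be illustrated with a small sketch. It is not the logic of correct_micro_homology.awk; it only shows why several junction positions yield the same query. The function `junction_shifts`, its arguments `pos1` and `pos2` (0-based positions where the query leaves ref1 and enters ref2) and the sample sequences are hypothetical.

```python
def junction_shifts(ref1, pos1, ref2, pos2):
    """Return (left, right): how far the junction ref1[:pos1] + ref2[pos2:]
    can slide in each direction without changing the joined sequence."""
    left = 0
    # Sliding left by one needs the base before the junction to agree.
    while (pos1 - left > 0 and pos2 - left > 0
           and ref1[pos1 - left - 1] == ref2[pos2 - left - 1]):
        left += 1
    right = 0
    # Sliding right by one needs the base after the junction to agree.
    while (pos1 + right < len(ref1) and pos2 + right < len(ref2)
           and ref1[pos1 + right] == ref2[pos2 + right]):
        right += 1
    return left, right


# Hypothetical references sharing the microhomology "CGT" at the junction.
ref1 = "AAACGTTTT"
ref2 = "GGGCGTCCC"
left, right = junction_shifts(ref1, 6, ref2, 6)
# Every junction position in the equivalence class gives the same query.
queries = {ref1[:6 + k] + ref2[6 + k:] for k in range(-left, right + 1)}
print(left, right, queries)
```

All junction positions in the range `-left .. right` are equivalent for the aligner, which is why a direction has to be chosen to pick one of them.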

---
title: correct_micro_homology.awk
---
flowchart TD
    REF[("reference_file: start1, ref1, end1, start2, ref2, end2")] --> CMH[correct_micro_homology.awk]
    DIRECTION[("direction_file: up/down")] --> CMH
    ALG[("rearrangement_file: header line (idx, #, score, id), reference line (ref1, ref2), query line")] --> CMH
    CMH --> CORRECTED[("stdout: header line (idx, #, score, id, udangle, rstart1, qstart1, rend1, qend1, random, rstart2, qstart2, rend2, qend2, ddangle, cut1, ref1+cut2), reference line (ref1, ref2), query line")]

start1	ref1	end1	start2	ref2	end2
...	...	...	...	...	...

up/down
...

reference_file is the same as that taken by rearrangement.
Each line in direction_file corresponds to the line with the same line number in reference_file and contains the string up or down, which specifies whether the upstream or the downstream DSB end should be corrected toward the cleavage site.
rearrangement_file is the stdout of rearrangement.
stdout of correct_micro_homology.awk is similar to stdout of rearrangement but with an extended header line.
- udangle is the upstream unaligned part of query.
- rstart1 and rend1 specify the ref1 range for the upstream block of the chimeric alignment.
- qstart1 and qend1 specify the query range for the upstream block of the chimeric alignment.
- random is the unaligned part of query between the upstream and downstream blocks of the chimeric alignment.
- rstart2 and rend2 specify the ref2 range for the downstream block of the chimeric alignment.
- qstart2 and qend2 specify the query range for the downstream block of the chimeric alignment.
- ddangle is the downstream unaligned part of query.
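
The extended header can be turned into named fields with a few lines of code. The sketch below assumes the header is tab-separated (so that empty unaligned parts survive as empty strings) and that the fields appear in the order listed for the two-block stdout; the last two names, `cut1` and `ref1+cut2`, are copied from that field list without interpretation. The function name and the sample line are hypothetical.

```python
HEADER_FIELDS = [
    "idx", "count", "score", "ref_id",
    "udangle",
    "rstart1", "qstart1", "rend1", "qend1",
    "random",
    "rstart2", "qstart2", "rend2", "qend2",
    "ddangle",
    "cut1", "ref1+cut2",
]


def parse_extended_header(line):
    """Map one extended header line to a dict of field name -> raw string.

    Assumption: fields are tab-separated and follow HEADER_FIELDS order.
    Values are kept as strings; convert them as needed.
    """
    values = line.rstrip("\n").split("\t")
    if len(values) != len(HEADER_FIELDS):
        raise ValueError(
            f"expected {len(HEADER_FIELDS)} fields, got {len(values)}")
    return dict(zip(HEADER_FIELDS, values))


# Hypothetical header with an empty `random` field, for illustration only.
line = "1\t1\t157\t9300\tCTG\t3\t3\t50\t50\t\t60\t50\t90\t80\tCAAAAAA\t50\t150\n"
header = parse_extended_header(line)
print(header["udangle"], repr(header["random"]), header["ddangle"])
```

Splitting on a literal tab rather than on arbitrary whitespace matters here, because an empty udangle, random or ddangle would otherwise shift every later field.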

Core

rearrangement and correct_micro_homology.awk form the core part of rearr. They are generally piped together.

$ rearrangement \
    < input_file \
    3< reference_file |
  gawk -f correct_micro_homology.awk -- \
    reference_file \
    direction_file
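
When the pipeline above has to be launched from a script, one option is to assemble the same shell command in Python and hand it to bash, which keeps the `3<` redirection intact. This is a sketch under assumptions: `rearrangement`, `gawk` and `bash` are on the PATH, and the helper names `build_pipeline_command` and `run_pipeline` are hypothetical.

```python
import shlex
import subprocess


def build_pipeline_command(input_file, reference_file, direction_file,
                           awk_script="correct_micro_homology.awk"):
    """Return the shell command for the core pipeline, with quoted paths."""
    q = shlex.quote
    return (
        f"rearrangement < {q(input_file)} 3< {q(reference_file)} | "
        f"gawk -f {q(awk_script)} -- {q(reference_file)} {q(direction_file)}"
    )


def run_pipeline(input_file, reference_file, direction_file):
    """Run the pipeline through bash and return its stdout as text."""
    cmd = build_pipeline_command(input_file, reference_file, direction_file)
    result = subprocess.run(["bash", "-c", cmd], check=True,
                            capture_output=True, text=True)
    return result.stdout


# Only the command string is built here; nothing is executed.
cmd = build_pipeline_command("input.tsv", "ref.tsv", "direction.tsv")
print(cmd)
```

Going through `bash -c` is a design shortcut: the reference file must arrive on file descriptor 3, and letting the shell set that up avoids duplicating descriptors by hand.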

More than two blocks

rearrangement and correct_micro_homology.awk support chimeric alignments with more than two blocks. The core part of rearr for multiple blocks is as follows.

---
title: core
---
flowchart TD
    QUERY[("input_file: query, #, id")] --> REARR[rearrangement]
    REF[("reference_file: start1, ref1, end1, start2, ref2, end2, ..., startN, refN, endN")] --> REARR
    REARR --> ALG[("rearrangement_file: header line (idx, #, score, id), reference line (ref1, ref2, ...), query line")]

    REF --> CMH[correct_micro_homology.awk]
    DIRECTION[("direction_file: up/down:1:2, up/down:2:3, ..., up/down:N-1:N")] --> CMH
    ALG --> CMH
    CMH --> CORRECTED[("stdout: header line (idx, #, score, id, dangle0, rstart1, qstart1, rend1, qend1, dangle1, rstart2, qstart2, rend2, qend2, dangle2, ..., rstartN, qstartN, rendN, qendN, dangleN), reference line (ref1, ref2), query line")]

query	#	id
...	...	...

start1	ref1	end1	start2	ref2	end2	...	startN	refN	endN
...	...	...	...	...	...	...	...	...	...

up/down:1:2	up/down:2:3	...	up/down:N-1:N
...	...	...	...

Each row of direction_file has multiple fields corresponding to the junctions of adjacent references. The extended header of stdout of correct_micro_homology.awk contains information for all alignment blocks and all unaligned parts of query.
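
The relation between the two files in the multi-block case (N references per row, N-1 junction directions per row) can be checked before running the pipeline. The sketch below assumes both files are tab-separated and that each direction field is exactly `up` or `down`; the function name `count_blocks` and the sample rows are hypothetical.

```python
def count_blocks(reference_row, direction_row):
    """Validate one row of each file and return the number of blocks N.

    Assumption: reference_row holds N (start, ref, end) triples and
    direction_row holds N-1 fields, one per junction of adjacent references.
    """
    ref_fields = reference_row.rstrip("\n").split("\t")
    if len(ref_fields) < 6 or len(ref_fields) % 3 != 0:
        raise ValueError("reference row must hold at least two "
                         "(start, ref, end) triples")
    n_blocks = len(ref_fields) // 3
    directions = direction_row.rstrip("\n").split("\t")
    if len(directions) != n_blocks - 1:
        raise ValueError(f"expected {n_blocks - 1} directions, "
                         f"got {len(directions)}")
    for direction in directions:
        if direction not in ("up", "down"):
            raise ValueError(f"invalid direction: {direction!r}")
    return n_blocks


# Hypothetical three-block row with two junctions, for illustration only.
n = count_blocks("0\tACGT\t4\t0\tTTGA\t4\t0\tCCAA\t4\n", "up\tdown\n")
print(n)
```

Running such a check on every pair of rows catches a mismatched direction_file early, before any alignment is computed.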