id is the 0-based line number of reference in reference_file.
reference_file
rearr takes two references (ref1 and ref2) as input. This is useful for analyzing double-cleavage CRISPR experiments, for example those designed to delete a large portion of the genome.
The upstream part of query is aligned to ref1, which in the case of two-cleavage deletion is the sequence around the upstream cleavage site. The actual ref1 may depend on the repair junction (deletion, inversion or duplication).
The downstream part of query is aligned to ref2, which in the case of two-cleavage deletion is the sequence around the downstream cleavage site.
For the single cleavage case, ref2 just repeats ref1.
The region between start1 and end1 is the non-extension region of ref1. The regions upstream of start1 or downstream of end1 are extension regions of ref1. The matching score for extension regions (s2) is lower than that for the non-extension region (s1).
The region between start2 and end2 is the non-extension region of ref2. The regions upstream of start2 or downstream of end2 are extension regions of ref2.
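The two-tier scoring described above can be sketched in a few lines of Python. This is only an illustration, not the scoring code of rearrangement: the function name match_score, the half-open interval convention, and the numeric values of s1 and s2 are all assumptions made for the example.

```python
def match_score(pos, start, end, s1, s2):
    """Score a match at reference position `pos`.

    Assumption: positions are 0-based and [start, end) is the
    non-extension region; everything outside it is extension.
    """
    return s1 if start <= pos < end else s2


# Hypothetical values: non-extension region [10, 20), s1 > s2.
start1, end1 = 10, 20
s1, s2 = 5, 3

# Positions 5 and 20 fall in extension regions, 10 and 19 do not.
scores = [match_score(p, start1, end1, s1, s2) for p in (5, 10, 19, 20)]
print(scores)
```

Because s2 is lower than s1, a match in an extension region contributes less than a match near the cleavage site.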
The reason for distinguishing extension and non-extension regions is to avoid spurious (dummy) templated insertions induced by small mutations away from the cleavage site in the single cleavage case, as shown in the following video.
stdout
Every three lines of the standard output represent a single alignment.
The first line is a header line.
idx is the 1-based line number of query in input_file.
# is the duplication number of the query.
score is the alignment score.
id is the 0-based line number of reference in reference_file.
The second line is the sequence of the reference.
The third line is the query with idx.
The second and third lines together form the actual alignment, as shown in the following example.
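As a hedged aside for readers who want to post-process this output, the three-line records can be grouped with a short Python script. The names Alignment, read_alignments and count_differences are hypothetical, the header is kept as a raw string because its exact field layout is not assumed here, and the demo lines are made-up placeholders rather than real output of rearrangement.

```python
from typing import Iterable, Iterator, NamedTuple


class Alignment(NamedTuple):
    header: str      # first line: idx, #, score, id (kept unparsed)
    reference: str   # second line: the reference sequence
    query: str       # third line: the query


def read_alignments(lines: Iterable[str]) -> Iterator[Alignment]:
    """Group consecutive lines into three-line alignment records."""
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 3:
            yield Alignment(*record)
            record = []
    if record:
        raise ValueError("input does not contain a multiple of three lines")


def count_differences(rec: Alignment) -> int:
    """Count columns where reference and query differ.

    Assumption: both lines have the same length, column for column.
    """
    return sum(r != q for r, q in zip(rec.reference, rec.query))


# Made-up placeholder records, not real tool output.
demo = [
    "placeholder header 1\n", "ACGTACGT\n", "ACGAACGT\n",
    "placeholder header 2\n", "TTGGCC\n", "TTGGCC\n",
]
records = list(read_alignments(demo))
for rec in records:
    print(rec.header, count_differences(rec))
```

In practice the same generator could be fed sys.stdin when the script sits at the end of a pipe.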
Microhomology is common in CRISPR editing outcomes. When microhomology occurs, rearrangement cannot determine how to align query to ref1 and ref2, as shown in the following video.
correct_micro_homology.awk allows one to specify which end of the double strand break should be corrected toward the cleavage site up to the microhomology equivalence.
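The ambiguity behind this correction can be illustrated with a minimal Python sketch. It is not the logic of correct_micro_homology.awk; the function junction_shift_range and the toy sequences are assumptions made for the example. The idea is that a junction joining ref1[:end1] to ref2[start2:] can slide along any stretch where ref1 and ref2 agree without changing the joined sequence.

```python
def junction_shift_range(ref1, end1, ref2, start2):
    """Return (left, right): how many bases the junction
    ref1[:end1] + ref2[start2:] can move upstream / downstream
    while producing exactly the same joined sequence."""
    right = 0
    while (end1 + right < len(ref1) and start2 + right < len(ref2)
           and ref1[end1 + right] == ref2[start2 + right]):
        right += 1
    left = 0
    while (end1 - 1 - left >= 0 and start2 - 1 - left >= 0
           and ref1[end1 - 1 - left] == ref2[start2 - 1 - left]):
        left += 1
    return left, right


# Toy references sharing the microhomology "CGT" at the junction.
ref1 = "AAACGTTT"
ref2 = "GGGCGTCC"
end1, start2 = 3, 3

left, right = junction_shift_range(ref1, end1, ref2, start2)

# Every placement inside the range yields the same query sequence.
joins = {ref1[:end1 + k] + ref2[start2 + k:] for k in range(-left, right + 1)}
print(left, right, joins)
```

All placements in the reported range are equivalent, so a rule such as "correct the upstream end" or "correct the downstream end" is needed to pick one of them.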
reference_file is the same as that taken by rearrangement.
Each line in direction_file corresponds to the line at the same position in reference_file and contains the string up or down, which specifies whether the upstream DSB end or the downstream DSB end should be corrected towards the cleavage site.
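A small sanity check of the line-by-line correspondence can be sketched as follows. The helper read_directions is hypothetical and only covers the two-reference case with a single up or down per line; the reference lines in the demo are placeholders, since the exact layout of a reference_file line is not assumed here.

```python
def read_directions(reference_lines, direction_lines):
    """Map each 0-based reference line number to 'up' or 'down'.

    Assumption: one direction string per line (two-reference case).
    """
    references = [line.rstrip("\n") for line in reference_lines]
    directions = [line.strip() for line in direction_lines]
    if len(references) != len(directions):
        raise ValueError(
            f"{len(references)} reference lines but "
            f"{len(directions)} direction lines"
        )
    for number, direction in enumerate(directions):
        if direction not in ("up", "down"):
            raise ValueError(f"line {number}: expected up or down, got {direction!r}")
    return dict(enumerate(directions))


# Placeholder reference lines; only their count matters here.
demo_refs = ["reference line 0 (placeholder)\n", "reference line 1 (placeholder)\n"]
demo_dirs = ["up\n", "down\n"]
direction_by_id = read_directions(demo_refs, demo_dirs)
print(direction_by_id)
```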
rearrangement_file is the stdout of rearrangement.
stdout of correct_micro_homology.awk is similar to stdout of rearrangement but with an extended header line.
udangle is the upstream unaligned part of query.
rstart1 and rend1 specify the ref1 range for the upstream block of the chimeric alignment.
qstart1 and qend1 specify the query range for the upstream block of the chimeric alignment.
random is the unaligned part of query between the upstream and downstream blocks of the chimeric alignment.
rstart2 and rend2 specify the ref2 range for the downstream block of the chimeric alignment.
qstart2 and qend2 specify the query range for the downstream block of the chimeric alignment.
ddangle is the downstream unaligned part of query.
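The way these fields partition the query can be sketched in Python. This is an illustration under explicit assumptions: query coordinates are taken to be 0-based and half-open, which the text above does not state, and the class ChimericHeader, the function split_query and the toy query are hypothetical. The order of the fields in the real header line is not assumed.

```python
from dataclasses import dataclass


@dataclass
class ChimericHeader:
    udangle: str
    qstart1: int
    qend1: int
    random: str
    qstart2: int
    qend2: int
    ddangle: str


def split_query(query: str, h: ChimericHeader) -> dict:
    """Cut the query into its five parts.

    Assumption: [qstart1, qend1) and [qstart2, qend2) are 0-based,
    half-open ranges on the query.
    """
    return {
        "udangle": query[:h.qstart1],
        "block1": query[h.qstart1:h.qend1],
        "random": query[h.qend1:h.qstart2],
        "block2": query[h.qstart2:h.qend2],
        "ddangle": query[h.qend2:],
    }


# Toy query: TT | ACGT | NN | GGCC | AA
query = "TTACGTNNGGCCAA"
header = ChimericHeader("TT", 2, 6, "NN", 8, 12, "AA")
parts = split_query(query, header)

# Under the stated assumption the unaligned parts match the header fields.
consistent = (
    parts["udangle"] == header.udangle
    and parts["random"] == header.random
    and parts["ddangle"] == header.ddangle
)
print(parts, consistent)
```

If the real coordinates turn out to be 1-based or closed, only the slicing in split_query needs to change.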
Core
rearrangement and correct_micro_homology.awk form the core part of rearr. They are generally piped together.
rearrangement and correct_micro_homology.awk support chimeric alignments with more than two blocks. The core part of rearr for multiple blocks is as follows.
Each row of direction_file has multiple fields corresponding to the junctions of adjacent references. The extended header of stdout of correct_micro_homology.awk contains information for all alignment blocks and all unaligned parts of query.