The chain_self_alignments module detects duplications by chaining alignment hits from a self-sequence alignment.
The chain_self_alignments module expects coordinate data for alignment hits, for example from SegMantX's generate_alignments module:
Query start | Query end | Subject start | Subject end | Percent sequence identity |
---|---|---|---|---|
133470 | 147930 | 64534 | 78969 | 95.1 |
… | … | … | … | … |
329875 | 330416 | 326586 | 327127 | 93 |
Click here to visit an example file containing alignment hit coordinate data
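To make the five-column layout concrete, here is a minimal Python sketch that reads such tab-delimited data into records. The `Hit` type and `read_hits` function are hypothetical helpers for illustration, not part of SegMantX, and the sketch assumes that any non-numeric line (such as a header, if the file has one) can be skipped.

```python
import csv
import io
from typing import NamedTuple


class Hit(NamedTuple):
    """One local alignment hit (field names are illustrative, not SegMantX's)."""
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    identity: float


def read_hits(handle):
    """Parse five tab-delimited columns into Hit records, skipping non-numeric rows."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 5:
            continue  # blank or malformed line
        try:
            coords = [int(value) for value in row[:4]]
            identity = float(row[4])
        except ValueError:
            continue  # e.g. a header line, if the file has one
        hits.append(Hit(*coords, identity))
    return hits


# The two complete rows from the example table above.
example = "133470\t147930\t64534\t78969\t95.1\n329875\t330416\t326586\t327127\t93\n"
hits = read_hits(io.StringIO(example))
for hit in hits:
    print(hit)
```

For a real file, the same function could be called with an open file handle instead of the in-memory string.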
The input data for chain_self_alignments should be supplied as a tab-delimited file in the five-column format shown above. Alternatively, a BLAST output format 7 file can be used, for example:
Click here to visit an example file containing a blast output format 7
SegMantX chain_self_alignments --input_file tests/NC_018218.1.alignment_coordinates.tsv --output_file tests/NC_018218.1.chains.tsv
-i or --input_file
: Input file produced by generate_alignments (five columns: q.start, q.end, s.start, s.end, identity). Alternatively, provide a BLAST output format 7 file and set the --blast_outfmt7 flag.
-o or --output_file
: Filename of the chaining output file (Default: chaining_output.tsv).
Output of the self-sequence alignment chaining module:
(Default) output filename | Description |
---|---|
chaining_output.tsv | Main output file of the chaining procedure containing chaining coordinates and metrics |
chaining_output.tsv.indices | Output file to trace back original local alignment hits that have been chained |
Click here to visit an example table for ‘chaining_output.tsv’
Click here to visit an example table for ‘chaining_output.tsv.indices’
To use a file derived from BLAST in output format 7 as input you can use the following flag:
-B or --blast_outfmt7
: Indicates that the input file is in BLAST output format 7 (Default: False).
Example:
SegMantX chain_self_alignments --input_file tests/NC_018218.1.blast.x7 --output_file tests/NC_018218.1.chains.tsv --blast_outfmt7
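For orientation, the sketch below shows how the five coordinate columns relate to a BLAST output format 7 file. It assumes BLAST's default tabular column set (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and that comment lines start with `#`. The function name `blast7_to_coordinates` is hypothetical; how SegMantX parses the file internally is not described here.

```python
import io


def blast7_to_coordinates(handle):
    """Yield (q_start, q_end, s_start, s_end, identity) from default BLAST outfmt 7 text."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # outfmt 7 comment lines start with '#'
        fields = line.split("\t")
        # Default columns: qseqid sseqid pident length mismatch gapopen
        #                  qstart qend sstart send evalue bitscore
        yield (
            int(fields[6]),
            int(fields[7]),
            int(fields[8]),
            int(fields[9]),
            float(fields[2]),
        )


# Illustrative record: the coordinates and identity come from the example table
# in this section; the remaining fields are made-up placeholder values.
blast_text = (
    "# illustrative comment line\n"
    "seqA\tseqA\t95.1\t14461\t700\t10\t133470\t147930\t64534\t78969\t0.0\t20000\n"
)
rows = list(blast7_to_coordinates(io.StringIO(blast_text)))
print(rows)
```

If the BLAST run used a custom column specification, the field indices would need to be adjusted accordingly.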
To set a threshold for the max. gap size (in nucleotides) between alignment hits for chaining:
-G or --max_gap
: Maximum gap size between alignment hits for chaining (Default: 5000).
To set the maximum gap size to 6000 (in nucleotides):
SegMantX chain_self_alignments --input_file tests/NC_018218.1.alignment_coordinates.tsv --output_file tests/NC_018218.1.chains.tsv --max_gap 6000
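The effect of a maximum gap threshold can be pictured with a small sketch. It rests on assumptions: it treats the gap as the distance from the end of one hit to the start of the next, measured separately on the query and the subject, and it only considers forward-oriented hits. The exact gap definition used by SegMantX is not specified in this section, and `gap_between` and `can_chain` are hypothetical names.

```python
def gap_between(prev_hit, next_hit):
    """Return (query_gap, subject_gap) for hits given as (q_start, q_end, s_start, s_end)."""
    query_gap = next_hit[0] - prev_hit[1]
    subject_gap = next_hit[2] - prev_hit[3]
    return query_gap, subject_gap


def can_chain(prev_hit, next_hit, max_gap=5000):
    """Assumed rule: chain when neither gap exceeds max_gap (overlaps count as gap 0)."""
    query_gap, subject_gap = gap_between(prev_hit, next_hit)
    return max(query_gap, 0) <= max_gap and max(subject_gap, 0) <= max_gap


# Made-up coordinates for illustration only.
hit_a = (1000, 3000, 50000, 52000)
hit_b = (8500, 9500, 57500, 58500)  # starts 5500 nt after hit_a ends on both axes

print(can_chain(hit_a, hit_b))                # False with the default of 5000
print(can_chain(hit_a, hit_b, max_gap=6000))  # True with the relaxed threshold
```

Under these assumptions, raising the threshold from 5000 to 6000 is what allows the two hits to fall into the same chain.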
To set a threshold for the scaled gap size between alignment hits for chaining:
-SG or --scaled_gap
: Minimum scaled gap between alignment hits for chaining (Default: 1.0).
To set the scaled gap to 2:
SegMantX chain_self_alignments --input_file tests/NC_018218.1.alignment_coordinates.tsv --output_file tests/NC_018218.1.chains.tsv --scaled_gap 2
Choosing the correct sequence topology ensures that alignment hits on circular sequences (e.g., most plasmids or viral genomes) are chained correctly, even when they are fragmented by the linear representation in FASTA files. This avoids the discontinuous alignments that can occur when a circular sequence is aligned in a linear format (i.e., FASTA format). The sequence topology is set to linear by default.
The sequence topology for the chaining can be set to circular using:
-Q or --is_query_circular
: Indicates a circular sequence topology (Default: False).
Note that for circular sequence topologies the sequence length must be supplied to SegMantX (e.g., via --sequence_length or --fasta_file). The options for providing the sequence length are described below.
To set a circular sequence topology for the query:
SegMantX chain_self_alignments --input_file tests/NZ_AP022172.1.alignment_coordinates.tsv --output_file tests/NZ_AP022172.1.chains.tsv --is_query_circular --fasta_file tests/NZ_AP022172.1.fasta
The sequence length is required for correct alignment chaining on sequences with circular sequence topology.
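A small sketch can show why the length matters. On a circular sequence, two hits on either side of the origin look far apart in linear coordinates, and the true distance can only be recovered with the sequence length. The functions `linear_gap` and `circular_gap` are hypothetical illustrations of this modular arithmetic, not SegMantX internals, and the hit positions are made up.

```python
def linear_gap(end_a, start_b):
    """Gap from the end of hit A to the start of hit B on a linear sequence."""
    return start_b - end_a


def circular_gap(end_a, start_b, sequence_length):
    """Gap walking forward from end_a to start_b around a circle of the given length."""
    return (start_b - end_a) % sequence_length


sequence_length = 187669  # sequence length used in the examples in this section
end_a = 187000            # made-up hit ending near the end of the linear representation
start_b = 500             # made-up hit starting just after the origin

print(linear_gap(end_a, start_b))                     # -186500
print(circular_gap(end_a, start_b, sequence_length))  # 1169
```

Without the sequence length, the second value cannot be computed, so hits separated only by the origin could not be recognized as neighbors.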
To set the sequence length manually:
-L or --sequence_length
: Size of the sequence. Required with circular sequence topology unless a FASTA file is provided via --fasta_file (Default: None).
To set the sequence length manually to 187669:
SegMantX chain_self_alignments --input_file tests/NZ_AP022172.1.alignment_coordinates.tsv --output_file tests/NZ_AP022172.1.chains.tsv --sequence_length 187669
To determine the sequence length automatically from a FASTA file:
-f or --fasta_file
: FASTA file from which the sequence length is read. Required if the sequence topology is circular and --sequence_length is not provided manually.
To set the sequence length automatically to 187669 by providing the FASTA file:
SegMantX chain_self_alignments --input_file tests/NZ_AP022172.1.alignment_coordinates.tsv --output_file tests/NZ_AP022172.1.chains.tsv --fasta_file tests/NZ_AP022172.1.fasta
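Reading a sequence length from a FASTA file amounts to summing the lengths of the sequence lines of a record. The sketch below is one simple way to do this in plain Python. It assumes only the first record is relevant; how SegMantX handles multi-record FASTA files is not stated here, and `fasta_sequence_length` is a hypothetical helper.

```python
import io


def fasta_sequence_length(handle):
    """Return the length of the first record in a FASTA stream."""
    length = 0
    seen_header = False
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if seen_header:
                break  # stop at the second record
            seen_header = True
        elif seen_header:
            length += len(line)
    return length


# Made-up toy records for illustration only.
fasta_text = ">example_record made-up sequence\nACGTACGTAC\nGGTTAA\n>second\nAAAA\n"
print(fasta_sequence_length(io.StringIO(fasta_text)))  # 16
```

The value obtained this way could equally be passed by hand via --sequence_length.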
To discard alignment hits for chaining according to their length:
-ml or --min_length
: Minimum length of alignment hits for chaining (Default: 200).
To set the minimum alignment hit length to 300:
SegMantX chain_self_alignments --input_file tests/NC_018218.1.alignment_coordinates.tsv --output_file tests/NC_018218.1.chains.tsv --min_length 300
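The length filter can be sketched as follows. The sketch assumes that hit length is measured on the query as an inclusive coordinate span and that hits exactly at the threshold are kept. Both are assumptions rather than documented SegMantX behavior, and `hit_length` and `filter_by_length` are hypothetical names.

```python
def hit_length(hit):
    """Query length of a hit given as (q_start, q_end, s_start, s_end, identity)."""
    return abs(hit[1] - hit[0]) + 1


def filter_by_length(hits, min_length=200):
    """Keep hits whose query length is at least min_length."""
    return [hit for hit in hits if hit_length(hit) >= min_length]


hits = [
    (133470, 147930, 64534, 78969, 95.1),    # from the example table, 14461 nt
    (329875, 330416, 326586, 327127, 93.0),  # from the example table, 542 nt
    (1000, 1249, 9000, 9249, 99.0),          # made-up 250 nt hit
]

print(len(filter_by_length(hits)))                  # 3 with the default of 200
print(len(filter_by_length(hits, min_length=300)))  # 2 after raising the threshold
```

In this toy input, raising the threshold to 300 removes only the made-up 250 nt hit.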