The generate_alignments
module facilitates sequence alignment generations files that can be further processed by the chaining modules for duplication detection and sequence comparison.
The generate_alignments
alignment module expects a plain FASTA-file (e.g., genomic sequence):
>NZ_AP022172.1 Escherichia coli strain WP5-S18-ESBL-09 plasmid pWP5-S18-ESBL-09_1, complete sequence
CTATGATAGAACTACAAACTAACGGAGTAGAACGGCAACTTTTAGATAATTCTTGTGGCATAAGTTCTGG
GGTATTTAAATAAAATATAAGACAGGCTGTCTATCTTACAGACAGCCTTTTTATAATAACAGAAAGAATA
TATATCATACTTAACGGAAGAGATAATATGAAAACAGTACCCGTATAAACAGCAAGCAATCCCATTTTTC
CAGGTGTATCAGTAAAAAAACCGCTGTTCCAGAAATCAGGCCGGGTAAATTTCAGAGCTGTATCTTCAAT
ATACCATTTTGCAACCGGATACAAAACCATTCCGCAGAGAGAAATACACCAGAACAGTAATCTGTATTTA
AAATCATAATCCCATGACATATACAGCATATATCCCCCCGTCACCCATCCCCACCACATATTATTAAAAT
AATATTTTCTGTTCATCTGCCAGTTTTCCTTTCTTTTGCACTTTTGTCAGTGTACTGATGCATGACAACA
...
SegMantX generate_alignments --query_file tests/NZ_AP022172.1.fasta --blast_output_file tests/NZ_AP022172.1.blast.x7 --alignment_hits_file tests/NZ_AP022172.1.alignment_coordinates.tsv --self_sequence_alignment
-q or --query_file
: Path to the query nucleotide FASTA file (required).-b or --blast_output_file
: Path to the output file for BLASTn results (Default: blast_output.txt).-a or --alignment_hits_file
: Path to the output file containing (main) alignment hit data for the chaining process (Default: alignment_hits.tsv).-SA or --self_sequence_alignment
: Indicates to perform a self-sequence alignment (Default: False).SegMantX generate_alignments --query_file tests/NZ_CP018634.1.fasta --subject_file tests/NZ_CP022004.1.fasta --blast_output_file tests/NZ_CP018634.1_vs_NZ_CP022004.1.blast.x7 --alignment_hits_file tests/NZ_CP018634.1_vs_NZ_CP022004.1.alignment_coordinates.tsv
-q or --query_file
: Path to the query nucleotide FASTA file (required).-s or --subject_file
: Path to the subject nucleotide FASTA file (required).-b or --blast_output_file
: Path to the output file for BLASTn results (Default: blast_output.txt).-a or --alignment_hits_file
: Path to the output file containing (main) alignment hit data for the chaining process (Default: alignment_hits.tsv).(Default) output filename | Description |
---|---|
blast_output.txt | Output file of BLASTn search in output format 7 |
alignment_hits.tsv | Output file restricted to q.start, q.end, s.start, s.end, and percentage sequence identity from BLASTn search, which is sufficient for the chaining modules |
Click here to visit an example file containing a blast output format 7
Click here to visit an example file containing alignments hits coordinate data
Note, that these files have been created using --is_query_circular
option (see below).
Choosing the correct sequence topology ensures that alignment hits on circular sequences (e.g., most plasmids or viral genomes) are correctly chained, even when fragmented due to their linear representation in FASTA-files. This is important for avoiding discontinuous alignments that can occur when aligning circular sequences in a linear format (i.e., FASTA format). The sequence topology is set to linear by default.
The sequence topology for the alignment can be set to circular using:
-Q or --is_query_circular
: Indicates if the query sequence is circular (Default: False).-S or --is_subject_circular
: Indicates if the subject sequence is circular (Default: False).Circular sequence topology in a self-sequence alignment towards duplication detection:
SegMantX generate_alignments --query_file tests/NZ_AP022172.1.fasta --blast_output_file tests/NZ_AP022172.1.blast.x7 --alignment_hits_file tests/NZ_AP022172.1.alignment_coordinates.tsv --is_query_circular --self_sequence_alignment
Circular sequence topology in an sequence alignment towards sequence comparison:
SegMantX generate_alignments --query_file tests/NZ_CP018634.1.fasta --subject_file tests/NZ_CP022004.1.fasta --blast_output_file tests/NZ_CP018634.1_vs_NZ_CP022004.1.blast.x7 --alignment_hits_file tests/NZ_CP018634.1_vs_NZ_CP022004.1.alignment_coordinates.tsv --is_query_circular --is_subject_circular
The E-value in BLAST estimates the number of random alignments expected by chance. It helps filtering low-confidence matches and retaining biologically relevant seeds for chaining.
1e-9
(Fewer, more reliable seeds) - RECOMMENDED1e-5
(Balance of sensitivity and specificity)1e-1
(More seeds, but may introduce noise—use cautiously) - NOT RECOMMENDEDThe E-value for BLASTn computation can be set using:
-e or --evalue
: E-value threshold for BLASTn (Default: 1e-9).Example run setting E-value to 1e-10
:
SegMantX generate_alignments --query_file tests/NZ_AP022172.1.fasta --blast_output_file tests/NZ_AP022172.1.blast.x7 --alignment_hits_file tests/NZ_AP022172.1.alignment_coordinates.tsv --self_sequence_alignment --evalue 1e-10
The word size defines the length of the initial exact match (seed) required for alignment extension. Choose a higher word size to speed up the alignment generation and chaining on larger sequences.
Word size = 7-11
(More sensitive, detects weak matches)Word size = 20-30
(Faster, focuses on strong matches)The word size for BLASTn computation can be set using:
-W or --word_size
: Word size for BLASTn search (increase it for large genomes) (Default: 11).Example run setting word size to 28:
SegMantX generate_alignments --query_file tests/NZ_AP022172.1.fasta --blast_output_file tests/NZ_AP022172.1.blast.x7 --alignment_hits_file tests/NZ_AP022172.1.alignment_coordinates.tsv --self_sequence_alignment --word_size 28
Threads refer to the number of CPU cores used to parallelize BLASTn computations. More threads allow for faster processing by handling multiple alignment tasks simultaneously, improving overall speed and efficiency.
The number of threads for BLASTn computation can be set using:
-T or --number_of_threads
: Number of threads for BLASTn search (Default: 1).Example run setting threads to 4:
SegMantX generate_alignments --query_file tests/NZ_AP022172.1.fasta --blast_output_file tests/NZ_AP022172.1.blast.x7 --alignment_hits_file tests/NZ_AP022172.1.alignment_coordinates.tsv --self_sequence_alignment --number_of_threads 4
This parameter sets a threshold for sequence alignments hits being considered for chaining according to a percentage sequence identity (i.e., alignment hits below the threshold are excluded).
The percentage sequence identity for BLASTn computation can be set using:
-i or --min_identity_percentage
: Minimum percentage identity for BLASTn matches (Default: 60).Example run setting minimumg percentage sequence identity to 70:
SegMantX generate_alignments --query_file tests/NZ_AP022172.1.fasta --blast_output_file tests/NZ_AP022172.1.blast.x7 --alignment_hits_file tests/NZ_AP022172.1.alignment_coordinates.tsv --self_sequence_alignment --min_identity_percentage 70