MafFilter Manual 1.3.1

1 Introduction
2 Filters

The MafFilter Manual

This is the manual of MafFilter, version 1.3.1.


• Introduction:
• Filters:
— The Detailed Node Listing — Introduction
• Description:
• Run:
• General:
Filters
• Extracting:
• Cleaning:
• Analyzing:
• Exporting:
• Miscelaneous:
Extracting data
• Subset:
• SelectOrphans:
• Merge:
• Concatenate:
• OrderFilter:
• ExtractFeature:
• SelectChr:
• WindowSplit:
Cleaning alignment blocks
• MinBlockLength:
• MinBlockSize:
• XFullGap:
• RemoveEmptySequences:
• AlnFilter:
• AlnFilter2:
• EntropyFilter:
• MaskFilter:
• QualFilter:
• FeatureFilter:
Statistical analysis
• Descriptive:
• Phylogenetics:
Arguments:
• BlockSize:
• BlockLength:
• SequenceLength:
• AlnScore:
• BlockCounts:
• PairwiseDivergence:
• SiteStatistics:
• FourSpeciesSitePatternCounts:
• SiteFrequencySpectrum:
• PolymorphismStatistics:
• DiversityStatistics:
• CountClusters:
• ModelFit:
Phylogenetics
• DistanceEstimation:
• DistanceBasedPhylogeny:
• NewOutgroup:
• DropSpecies:
• TreeFilter:
Exporting blocks and data
• Output:
• OutputAlignments:
• OutputAsTable:
• VcfOutput:
• MsmcOutput:
• PlinkOutput:
• OutputCoordinates:
• OutputTrees:
• OutputDistanceMatrices:
Miscelaneous filters
• LiftOver:
• SystemCall:

1 Introduction

MafFilter is a program for post-processing genome alignments in the Multiple Alignment Files (MAF) format. MAF files consist of a series of "blocks", displaying the alignment of subsequences of the original genomes. In a typical MAF file, such blocks represent synteny blocks. See the UCSC website for a detailed description of the file format.

How to obtain MAF files is not covered in this manual. MAF files are typically generated by packages such as BlastZ, MultiZ, Last or Galaxy, and you should refer to the manual of these resources for more detailed help. These packages also contain tools for manipulating MAF files, which you might find useful in combination to the program described here.

Note that several examples are provided along with the source code of the program, and can serve as good training starts. This manual intends to provide an exhaustive description of the options used in these examples, as well as to provide general information about the way the program works and its underlying algorithms.

1.1 Description of the program

The main goal of MafFilter is to apply a series of "filters" to the original input file. A “filter” can be seen as a small program that takes as input one or several maf blocks, and outputs one or several maf block, after performing some actions on them:

                         +--------+
   Alignment block(s)    |        |   Alignment block(s)
------------------------>+ Filter +------------------------>
                         |        |
                         +--------+

We can roughly distinguish four types of filters:

Filters that take one block as input, and output one single block, possibly after removing some sequences, or even the full block if it did not match a certain criterion,
Filters that take one block as input, and split them into several smaller blocks, for instance of a fixed size,
Filters that take several blocks as input and output a larger one, typically after concatenating the input blocks,
Filters that do not modify the input block and simply forward it, yet after having performed some computations, for instance calculating various statistics.

A run of MafFilter typically consists in a workflow of filters, applied in a serial way, the output of one filter being taken as input as the next one. In the current version of MafFilter, only linear workflows are possible, without branching:

    +----------+    +----------+    +----------+
    |          |    |          |    |          |
--->+ Filter 1 +--->+ Filter 2 +--->+ Filter 3 +---> Etc.
    |          |    |          |    |          |
    +----------+    +----------+    +----------+

Branching processes can however be achieved by outputing the alignment blocks at the branching point, and run various instances of MafFilter on the resulting files:

     Run 1
    +----------+    +----------+    +----------+
    |          |    |          |    |          |
--->+ Filter 1 +--->+ Filter 2 +--->+ Filter 3 +---> Etc.
    |          |    |          |    |          |
    +----------+    +----+-----+    +----------+
                         |
                         |    +------------+
                         |    |            |
     Run 2               +--->+ Filter 2.1 +---> Etc.
                              |            |
                              +------------+

Some filters allow to generate additional data associated to each alignment block, such as phylogenetic trees. Such data cannot be written to file together with the blocks, as the MAF format does not support this. These data are however stored in memory by MafFilter together with the block, as “extra-data”. Extra-data are associated with a special tag which is used for refering to them, including for writting them in appropriate output files.

The next section describes how to specify the series of filters to be applied, and the next chapter provides an exhaustive list of available filters.

1.2 How to run the program

MafFilter is a program written in C++ with the Bio++ libraries. It is a command line program (that is, without graphical user interface) which uses the Bio++ Options (BppO) syntax. This syntax is of type ‘keys’=‘values’, which can be passed directly as arguments of the maffilter program

maffilter option1=value1 option2=value2 ... optionN=valueN

or more conveniently, stored in an “options” file (specified with param=option_file), or a combination of both:

maffilter param=option_file
maffilter param=option_file option1=value1 option2=value2

See the Bio++ Program Suite (BppSuite) manual for more information.

MafFilter typically takes as input a MAF file and a series of commands which determine the output file(s) that will be written and where. Additional input files might be specified, such as gene annotation files (typically in GFF/GTF format). Here is the structure of a minimal option file:

DATA=example.file //A user-defined variable, representing the input maf file, without extension
input.file=$(DATA).maf.gz  //Input maf file, gzipped
input.file.compression=gzip
output.log=$(DATA).maffilter.log //Output log file
maf.filter= //A coma separated list of filters, here empty (the programs only read the input file and exits).

The list of filters can be split on multiple lines for convenience, providing each line is escaped by a ’\’ at the end to indicate that the next line is a continuation of the current one:

maf.filter=Filter1,\
           Filter2,\
           Filter3

Please refer to the Bio++ Program Suite manual for more details, including the use of variables, priority of option values, etc.

1.3 General options

Synopsis:

input.file=mydata.maf.gz
input.file.compression=gzip
input.format=Maf
output.log=mydata.maffilter.log
...

input.file=mydata.fasta
input.file.compression=none
input.format=Fasta(zero_based=yes)
output.log=mydata.maffilter.log
...

Detailed description of commands:

input.file={path}: The MAF file to parse
input.file.compression=[none|zip|gzip|bzip2]: The compression format of the input file.
input.format={Maf|Fasta}: MafFilter takes by default a MAF file. It is, however, possible to use a fasta file as input, which will lead MafFilter to run in "single sequence" mode. In this mode, each sequence of the original file is considered as a distinct block. This mode might be useful to extract annotations and estimate simple statistics. Fasta files can be compressed in the same way as MAF files. The sequence names are expected to be of the form Species:Chromosome:start:strand:length, with possible empty fields. Coordinates are expected to be 0-based (that is, the first nucleotide is position 0). However, passing the zero_based=no argument to the format name will assume that the first nucleotide is position 1.
input.dots={error|as_gaps|as_unresolved}: If set to as_gaps, dotted characters in sequences will be converted to gaps and treated as such in any further step of the analysis, including output to files. If set to as_unresolved, dots will be converted to ’N’.
input.check_sequence_size={boolean}: Tell if sequence sizes are consistent (default: true). An error will be returned in case of inconsistency. Desabling size check can be useful in combination with input.dots=as_gaps. In case of inconsistency, a warning will be displayed unless verbose is set to false.
output.log={path}: The file where to write log messages.
maf.filter={Filter1, Filter2, ...}: The main command, specifying the list of filter commands to apply, in the given order (that is, the input of filter 2 will be the output of filter 1). Filters take a series of arguments, specified within parentheses: Filter(arg1=opt1, arg2=opt2, etc). The next section present all available filters and their corresponding arguments.

2 Filters

2.1 Extracting data

2.1.1 Subset: remove sequences from blocks by species name

The Subset filter allows to filter sequences for each block by keeping only a selection of species.

Synopsis:

maf.filter=                                 \
    [...],
    Subset(                                 \
        species=(species1, species2, etc),  \
        strict=yes,                         \
        keep=no,                            \
        remove_duplicates=yes),             \
    [...]

Arguments:

species=(species1, species2, etc): A coma separated, within parentheses, list of species. Only sequences corresponding to the selected species will be kept. Block without sequences from the selection will be discarded.
strict=yes/no: If set to yes, only blocks containing at least one sequence from each species in the selection will be kept.
keep=yes/no: If strict is set to yes, tells if additional sequences should be kept in the block, or remove so that the output blocks only contain species from the selection.
remove_duplicates=yes/no: If set to yes, only keep blocks for which only one sequence for each species is present. Blocks with paralogous sequences in at least one species will therefore be discarded.

2.1.2 SelectOrphans: keep only block with the specified species

The SelectOrphans filter allows to discard all block with species out of the specified range. This is typically used to extract blocks with no homolog ("orphan").

Synopsis:

maf.filter=                                 \
    [...],
    SelectOrphans(                          \
        species=(species1, species2, etc),  \
        strict=yes,                         \
        remove_duplicates=yes),             \
    [...]

Arguments:

species=(species1, species2, etc): A coma separated, within parentheses, list of species. All block containing at least one species outside this list will be discarded.
strict=yes/no: If set to yes, only block containing at least one sequence from each species in the selection will be kept.
remove_duplicates=yes/no: If set to yes, only keep blocks for which only one sequence for each species is present. Blocks with paralogous sequences in at least one species will therefore be discarded.

2.1.3 Merging: merge synthenic blocks

The Merge filter fuses consecutive blocks based on their coordinates. Blocks will be fused only if the sequences from a specified set of species can be considered as syntenic. Two sequences are considered as syntenic if:

they are from the same species
they are from the same chromosome
they are from the same strand
they are distant by less than a defined threshold (the start of sequence 2 - the end of sequence 1 < threshold in bp).

Synopsis:

maf.filter=                                 \
    [...],
    Merge(                                  \
        species=(species1, species2, etc),  \
        dist_max=0,                         \
        ignore_chr=(Random,Unknown),        \
        rename_chimeric_chromosomes=yes),   \
    [...]

Arguments:

species=(species1, species2, etc): A coma separated, within parentheses, list of species. These species only will be used for the synteny check. Other species will be concatenated in any situation, and their coordinates removed.
dist_max={int}: Maximum distance (in nt) to consider sequences as syntenic. If dist_max is greater than 0, sequences will be filled with ’N’ to preserve coordinates when they are fused.
ignore_chr={none | (list)}: An optional parenthetic list of chromosomes to ignore (typically "Unknown", or "Random", etc.). Sequences annotated with such chromosomes will not be checked for synteny and the corresponding block will not be fused.
rename_chimeric_chromosomes={boolean}: When distinct chromosomes are merged (in non specified species), their names are concatenated. If this option is set to "yes", then merged chromosomes will be renamed as "chimtigXXX" where XXX is a number auto-incremented for each species.

2.1.4 Concatenating: merge consecutive blocks up to a certain size

The Concatenate filter fuses consecutive blocks until the concatenated block reaches a minimal size.

Synopsis:

maf.filter=                                 \
    [...],
    Concatenate(minimum_size=10000, ref_species=Hsapiens), \
    [...]

Arguments:

minimum_size={int}: The minimum size for the blocks to be reached.
ref_species={string}: If given, only blocks with identical chromosome tags in the reference species will be merged.

2.1.5 OrderFilter: check and eventually discard unordered or overlapping blocks

The OrderFilter filter check if blocks are sorted according to previous block.

Synopsis:

maf.filter=                                 \
    [...],
    OrderFilter(                            \
        reference=ref_genome,               \
        do_unsorted=discard,                \
        do_overlapping=discard),            \
    [...]

Arguments:

reference={string}

The sequence for which coordinates should be checked. Block where the reference sequence is absent will be considered as ordered.

Important note: the alignment is supposed to be projected according to the reference sequence. Using the filter on another species might lead to unexpected behavior, in particular as the the reference species is supposed to be mapped on the + strand only.

do_unsorted={none|discard|error}

Action to perform in case an unsorted block is found (that is, when the right coordinate of the current block is before the left coordinate of the previous one). If ’discard’, the block will be removed. If ’error’, the maffilter with stop with an error.

do_overlapping={none|discard|error}

Action to perform in case an ovelrapping block is found (that is, when the left coordinate of the current block is before the right coordinate of the previous one, and the right coordinate of the current block is not before the left coordinate of the previous one). If ’discard’, the block will be removed. If ’error’, the maffilter with stop with an error.

2.1.6 Extract features from the alignment

The ExtractFeature extracts parts of the alignment corresponding to certain features. The features to extract are typically provided via a GFF file.

Synopsis:

maf.filter=                                 \
    [...]
    ExtractFeature(                         \
        ref_species=species1,               \
        feature.file=species1.gff3.gz,      \
        feature.file.compression=gzip,      \
        feature.format=GFF,                 \
        feature.type=mRNA,                  \
        compression=gzip,                   \
        complete=yes,                       \
        ignore_strand=no),                  \
    [...]

Arguments:

ref_species={string}: The name of the species for which the coordinates of the features are provided.
feature.file={path}: The file where the features are described.
feature.file.compression={none|gzip|zip|bzip2}: Compression format for the feature file.
feature.format={GFF|GTF|BedGraph}: Format for the feature file, currently GFF (v3.0), GTF or BedGraph.
feature.type={all|(list)}: Specifies which type of feature should be extracted from the feature file, if several are available.
compression={none|gzip|zip|bzip2}: Compression format for output file (if file != none).
complete={yes|no}: Tell if only complete features should be extracted or if partial extraction is allowed.
ignore_strand={yes|no}: If yes, features will be extracted “as is”. Otherwise, the invert-complement sequence will be returned if it is located on the negative strand.

2.1.7 Extract blocks for a given chromosome/contig/scaffold

The SelectChr filter keeps only block from a given chromosome (or contig or scaffold...) for a given reference species.

Synopsis:

maf.filter=                                 \
    [...],
    SelectChr(                              \
        ref_species=species1,               \
        chromosome=chr01),                  \
    [...]

Arguments:

ref_species={string}: The species name for which chromosomes should be extracted.
chromosome={string}: The name of the chromosome/contig/scaffold to be filtered.

2.1.8 Split alignment blocks into smaller parts

The WindowSplit filter subdivides blocks into smaller ones of a regular size, typically for window-based analyses. It is possible to generate overlapping windows.

Synopsis:

maf.filter=                                 \
    [...],
    WindowSplit(                            \
        preferred_size=1000,                \
        align=center,                       \
        keep_small_blocks=no),              \
    [...]

Arguments:

preferred_size={int>0}

The preferred size for the output windows.

align={ragged_left|ragged_right|center|adjust}

Specifies how to align windows on the block:

keep_small_blocks = {boolean}

Tell if blocks with an original size smaller that the window size should be kept (and untouched). This option only works if the align option is set to adjust. The combination of "align=adjust" and "keep_small_blocks=yes" allows to ensure that blocks do not exceed a given size, without losing any data.

ragged_left

Block   |----------------------------|
Windows |----|----|----|----|----|XXXX

ragged_right

Block   |----------------------------|
Windows XXXX|----|----|----|----|----|

center

Block   |----------------------------|
Windows XX|----|----|----|----|----|XX

adjust

Block   |----------------------------|
Windows |-----|-----|-----|-----|----|

In the first three case, some parts of the original block (shown with X) will be discarded. In the last case, the windows will have at least the specified size, but might be larger.

2.2 Cleaning alignment blocks

2.2.1 Minumum block length

The MinBlockLength filter discards blocks with less sites than a given threshold.

Synopsis:

maf.filter=                                 \
    [...],
    MinBlockLength(min_length=1000),        \
    [...]

Arguments:

min_length={int>0}: The minimum length.

2.2.2 Minumum block size

The MinBlockSize filter discards blocks with less sequences than a given threshold.

Synopsis:

maf.filter=                                 \
    [...],
    MinBlockSize(min_size=3),               \
    [...]

Arguments:

min_size={int>0}: The minimum size.

2.2.3 Exclude full gap columns

Remove gap-only columns from blocks. This does not modify coordinates and might be necessary after a Subset filter.

Synopsis:

maf.filter=                                 \
    [...],
    XFullGap(
        species=(species1,species2,etc)),   \
    [...]

Arguments:

species=(species1, species2, etc): A coma separated, within parentheses, list of species. Any column containing a gap in the specified species will be removed. If other species are present in the block, the corresponding character will be removed, whether it is a gap or not. The coordinates of these species will then be discarded.

2.2.4 Remove empty sequences

The RemoveEmptySequences filter remove sequences containing only gap (or unresolved characters) in each block.

Synopsis:

maf.filter=                                 \
    [...],
    RemoveEmptySequences(unresolved_as_gaps=yes),        \
    [...]

Arguments:

unresolved_as_gaps={boolean}: Tell if unresolved characters count as "empty" as well.

2.2.5 Alignment filtering

Split alignment blocks by removing regions with ambiguous alignments. The local uncertainty in the alignment is determined through a sliding window based approach. For each window, the number of gap characters and the total entropy are computed. Any window for which both the entropy and number of gaps exceed the given thresholds will be removed from the alignment, and the corresponding block split accordingly.

Synopsis:

maf.filter=                                 \
    [...],
    AlnFilter(                              \
        species=(species1,species2,etc),    \
        window.size=10,                     \
        window.step=1,                      \
        max.gap=9,                          \
        max.ent=0.2,                        \
        missing_as_gap=yes,                 \
        relative=no,                        \
        file=data.trash_aln.maf.gz,         \
        compression=gzip),                  \
    [...]

Arguments:

species=(species1, species2, etc): A coma separated, within parentheses, list of species. All calculations will be performed on the sub-alignment corresponding to these species only.
window.size={int>0}: The width, in bp, of the sliding window.
window.step={int>0}: The step by which the window is moved, in bp.
relative={boolean}: Tell if maximum amount of gap is relative (that is, as a proportion of the total amount of character in each window).
max.gap={int>0|1>double>0}: The maximum number of gaps allowed in each window (if relative is set to no), or the maximum proportion of gaps (if relative is set to yes)
max.ent={float}: The maximum entropy allowed in each window.
missing_as_gap={yes/no}: Tell if missing sequences should be considered as gaps.
file={none|{path}}: An optional file were removed alignment parts will be stored, in the MAF format. This can be helpful for visual inspection and fine tuning of the filter parameters.
compression={none|gzip|zip|bzip2}: Compression format for output file (if file != none).

2.2.6 Alignment filtering 2

This is another algorithm for cleaning alignment blocks (see AlnFilter), using sliding windows. The number of gaps in each alignment column in the window is counted, and the column is masked if it contains more than a given threshold of gaps. consecutive patterns in the window are only counted ones. In the follwing 10nt window:

AATCGGGCGT
AA---GCGGA
AA---CGGGT
CA---CGGGA

positions 3, 4 and 5 will be masked if the maximum number of gaps allowed is 2 or less. The three columns will however count as only one indel event. The window is then discarded if it contains more than a given number of indel events.

Synopsis:

maf.filter=                                 \
    [...],
    AlnFilter2(                             \
        species=(species1,species2,etc),    \
        window.size=10,                     \
        window.step=1,                      \
        max.gap=1,                          \
        max.pos=1,                          \
        relative=no,                        \
        missing_as_gap=yes,                 \
        file=data.trash_aln.maf.gz,         \
        compression=gzip),                  \
    [...]

Arguments:

species=(species1, species2, etc): A coma separated, within parentheses, list of species. All calculations will be performed on the sub-alignment corresponding to these species only.
window.size={int>0}: The width, in bp, of the sliding window.
window.step={int>0}: The step by which the window is moved, in bp.
relative={boolean}: Tell if maximum amount of gap is relative (that is, as a proportion of the total amount of character in each site).
max.gap={int>0|1>double>0}: The maximum number of gaps allowed in each site (if relative is set to no), or the maximum proportion of gaps (if relative is set to yes)
max.pos={int>0}: The maximum number of positions with gaps (“indel events”).
missing_as_gap={yes/no}: Tell if missing sequences should be considered as gaps.
file={none|{path}}: An optional file were removed alignment parts will be stored, in the MAF format. This can be helpful for visual inspection and fine tuning of the filter parameters.
compression={none|gzip|zip|bzip2}: Compression format for output file (if file != none).

2.2.7 Entropy filtering

This filter removes highly variable regions and split the blocks accordingly. It uses a sliding windows, and compute the entropy for each site in the window. The window is then discarded if it containes more than ’p’ sites with an entropy higher that a user-specified threshold. The alignment block is then split into separate block accordingly.

Synopsis:

maf.filter=                                 \
    [...],
    EntropyFilter(                          \
        species=(species1,species2,etc),    \
        window.size=10,                     \
        window.step=1,                      \
        max.ent=0.2,                        \
        max.pos=3,                          \
        missing_as_gap=yes,                 \
        ignore_gaps=yes,                    \
        file=data.trash_ent.maf.gz,         \
        compression=gzip),                  \
    [...]

Arguments:

species=(species1, species2, etc): A coma separated, within parentheses, list of species. All calculations will be performed on the sub-alignment corresponding to these species only.
window.size={int>0}: The width, in bp, of the sliding window.
window.step={int>0}: The step by which the window is moved, in bp.
max.ent={float}: The maximum entropy allowed at each site.
max.pos={int>0}: The maximum number of positions with an entropy higher than the given threshold.
missing_as_gap={yes/no}: Tell if unresolved characters should be counted as gaps.
ignore_gaps={yes/no}: Tell if gaps should not be counted in entropy calculation. If no, then gaps are counted as a “fifth” state.
file={none|{path}}: An optional file were removed alignment parts will be stored, in the MAF format. This can be helpful for visual inspection and fine tuning of the filter parameters.
compression={none|gzip|zip|bzip2}: Compression format for output file (if file != none).

2.2.8 Masked sequences filtering

The MaskFilter split alignment blocks by removing regions with too many masked positions (typically showing repeat-content), using sliding windows. A masked position is identified by a lower case letter in the original sequence. Windows with more than a given amount of lower case characters will be discarded, and the corresponding block split.

Synopsis:

maf.filter=                                 \
    [...],
    MaskFilter(                             \
        species=(species1,species2,etc),    \
        window.size=10,                     \
        window.step=1,                      \
        max.masked=2,                       \
        file=data.trash_msk.maf.gz,         \
        compression=gzip),                  \
    [...]

Arguments:

species=(species1, species2, etc): A coma separated, within parentheses, list of species. All calculations will be performed on the sub-alignment corresponding to these species only.
window.size={int>0}: The width, in bp, of the sliding window.
window.step={int>0}: The step by which the window is moved, in bp.
max.masked={int>0}: The maximum number of lower-case characters allowed in each window.
file={none|{path}}: An optional file were removed alignment parts will be stored, in the MAF format. This can be helpful for visual inspection and fine tuning of the filter parameters.
compresion={none|gzip|zip|bzip2}: Compression format for output file (if file != none).

2.2.9 Quality filtering

The QualFilter split alignment blocks by removing regions with too low quality score. Windows with an average quality below a given threshold will be discarded, and the corresponding block split.

Synopsis:

maf.filter=                                 \
    [...],
    QualFilter(                             \
        species=(species1,species2,etc),    \
        window.size=10,                     \
        window.step=1,                      \
        min.qual=0.8,                       \
        file=data.trash_qual.maf.gz,        \
        compression=gzip),                  \
    [...]

Arguments:

species=(species1, species2, etc): A coma separated, within parentheses, list of species. All calculations will be performed on the sub-alignment corresponding to these species only. The corresponding sequences must have a quality score line in the original file.
window.size={int>0}: The width, in bp, of the sliding window.
window.step={int>0}: The step by which the window is moved, in bp.
min.qual={double}: The minimum average quality allowed for each window.
file={none|{path}}: An optional file were removed alignment parts will be stored, in the MAF format. This can be helpful for visual inspection and fine tuning of the filter parameters.
compression={none|gzip|zip|bzip2}: Compression format for output file (if file != none).

2.2.10 Remove sequences from features from the alignment

The FeatureFilter remove parts of the alignment corresponding to certain features. The features to extract are typically provided via a GFF file. See ExtractFeature in order to keep only the feature parts.

Synopsis:

maf.filter=                                 \
    [...],
    FeatureFilter(                          \
        ref_species=species1,               \
        feature.file=species1.gff3,         \
        feature.file.compression=none,      \
        feature.format=GFF,                 \
        feature.type=(mRNA,tRNA),           \
        file=data.trash_features.maf.gz,    \
        compression=gzip),                  \
    [...]

Arguments:

ref_species={string}: The name of the species for which the coordinates of the features are provided.
feature.file={path}: The file where the features are described.
feature.file.compression={none|gzip|zip|bzip2}: Compression format for the feature file.
feature.format={GFF|GTF|BedGraph}: Format for the feature file, currently GFF (v3.0), GTF or BedGraph.
feature.type={all|(list)}: Specifies which type of feature should be extracted from the feature file, if several are available.
file={none|{path}}: An optional file were extracted features will be stored, in the MAF format.
compression={none|gzip|zip|bzip2}: Compression format for output file (if file != none).

2.3 Statistical analysis

2.3.1 Descriptive statistics

The section describes how to compute simple statistics for each alignment block. For the sake of computer efficiency, several statistics can be computed simultaneously on each alignment block. The results are written in a text file, with one line per block and one column per statistics. In addition, the coordinates of the block will be reported according to a specified reference species. The choice of statistics is specified by the user. Some of them output several results, which will appear each in a column in the output file.

Descriptive statistics are computed through the SequenceStatistics filter, which takes the following arguments:

Synopsis:

maf.filter=                                 \
    [...],
    SequenceStatistics(                     \
        statistics=(\                       \
            BlockLength,                    \
            AlnScore,                       \
            BlockCounts),                   \
        ref_species=species1,               \
        file=data.statistics.csv,           \
        compression=none),                  \
    [...]

Arguments:

statistics={list of statistics functions}: See below for the list of possible functions and their detailed description.
ref_species={string}: The species to use to report block coordinates in the output file. For block where the reference species is missing, NA will be output.
file={path}: A file path for the output file.
compression={none|gzip|zip|bzip2}: Compression format for output file.

The statistics to compute take the form of functions (just like filters themselves), which can potentially take arguments. Here is the list of currently available statistical functions:

2.3.1.1 Size of alignment blocks

The BlockSize statistics reports the number of sequences in the block.

(No argument)

2.3.1.2 Length of alignment blocks

The BlockLength statistics reports the number of sites in the block.

(No argument)

2.3.1.3 Length of sequence for a species

The SequenceLength statistics computes the length (in bp) of the sequence of one species. It returns 0 if no sequence for this species is found in the current block. It will send an error however if the block has more than one sequence for this species (be sure to use the Subset filter before, if needed).

Synopsis:

maf.filter=                                 \
    [...],                                   
    SequenceStatistics(                     \
        statistics=(\                       \
            [...],                                                    
            SequenceLength(                 \
                species=species1),          \
            [...]),                         \
        ref_species=species1,               \
        file=data.statistics.csv),          \
    [...]

Arguments:

species={string}: The species for which lengths should be reported.

2.3.1.4 Alignment score

The AlnScore statistics reports the alignment score associated to the block, if any.

(No argument)

2.3.1.5 Character frequencies

The BlockCounts statistics reports the count of each character found in the block.

Note: this statistics does not work in case of duplicated sequences (multiple hits from the same chromosome / scaffold / contig).

Arguments:

species={list}: A list of species to be considered in the counts calculations. If no species is given, all sequences are used.
suffix={string}: An (optional) suffix to append to the statistics name. Useful in case the statistics should be computed for several species independently.

2.3.1.6 Pairwise divergence

The PairwiseDivergence computes the divergence (percentage of mistmatch) between two species. If at least one of the species is missing, NaN is returned. It will send an error however if the block has more than one sequence for this species (be sure to use the Subset filter before, if needed).

Synopsis:

maf.filter=                                 \
    [...],                                   
    SequenceStatistics(                     \
        statistics=(\                       \
            [...],                                                    
            PairwiseDivergence(             \
                species1=species1,          \
                species2=species1),         \
            [...]),                         \
        ref_species=species1,               \
        file=data.statistics.csv),          \
    [...]

Arguments:

species1={string}: The first species for which divergence should be reported.
species2={string}: The second species for which divergence should be reported.

2.3.1.7 Site statistics

SiteStatistics computes site-wise statistics, including:

Number of sites without gap,
Number of complete sites (no gap, no unresolved character),
Number of complete sites with only one state (constant sites),
Number of complete sites with only two states (biallelic sites),
Number of complete sites with only three states (triallelic sites),
Number of complete sites with all four states (quadriallelic sites),
Number of parsimony-informative sites.

The statistics are computed for a given subset of species (typically forming a ingroup).

Synopsis:

maf.filter=                                 \
    [...],                                   
    SequenceStatistics(                     \
        statistics=(\                       \
            [...],                                                    
            SiteStatistics(                 \
                species=species1),          \
            [...]),                         \
        ref_species=species1,               \
        file=data.statistics.csv),          \
    [...]

Arguments:

species={list}: A list of species to be considered in the statistics calculations.

2.3.1.8 Counts of parsimony informative sites for a species quartet

The FourSpeciesSitePatternCounts computes the frequency of each type of informative sites. With four species, 3 types are expected: 1100, 0110 and 1010. On entry will be output for each of these types. The remaining sites will be pulled in a column called “Ignored”.

Synopsis:

maf.filter=                                 \
    [...],                                   
    SequenceStatistics(                     \
        statistics=(\                       \
            [...],                                                    
            FourSpeciesSitePatternCounts(   \
                species1=sp1,               \
                species2=sp2,               \
                species3=sp3,               \
                species4=sp4                \
            [...]),                         \
        ref_species=sp1,                    \
        file=data.statistics.csv),          \
    [...]

Arguments:

species1=species: First species to consider (resp. second, third and fourth).

2.3.1.9 Site frequency spectrum

The SiteFrequencySpectrum computes the site frequency spectrum for each block. Only positions in the alignment with only two states are considered. The proportions of bi-allelic sites are then computed by bins. Lets consider the following example with 7 sequences:

ACGT
ACTT
AGTT
AGTT
TCTT
TCTT
TCTT

It contains one site with 4/7 (eq 3/7), one site with 2/7 (eq 5/7), one site with 1/7 (eq 6/7) and one site 0/7 (eq 7/7). With seven sequences, there are actually 4 possibles frequencies: 1/7, 2/7, 3/7 and 4/7 (plus the 0/7 frequency, corresponding to a site with no-mutation). It is possible to compute all these frequencies individually by settingthe bounds parameter to

SiteFrequencySpectrum(bounds=(-0.5,0.5,1.5,2.5,3.5,4.5), ...)

which will output the number of site with 0, 1, 2, 3 or 4 minor states. Sites with more than 2 states are always counted separately, as well as sites containing unresolved characters. If one want to count only constant sites for instance, one can simply type

SiteFrequencySpectrum(bounds=(-0.5,0.5), ...)

The remaining sites will be pulled in a column called “Ignored”.

Synopsis:

maf.filter=                                 \
    [...],                                   
    SequenceStatistics(                     \
        statistics=(\                       \
            [...],                                                    
            SiteFrequencySpectrum(          \
                bounds=(-0.5, 0.5, 1.5),    \
                ingroup=(pop1, pop2, pop3), \
                outgroup=species2,          \
            [...]),                         \
        ref_species=pop1,                   \
        file=data.statistics.csv),          \
    [...]

Arguments:

bounds={list of double}: The bounds delimiting the bins of the spectrum to be computed.
ingroup={list}: A list of species forming the ingroup and on which the statistics should be calculated.
outgroup=species: A species name saying which sequence should be used to determine the ancestral state. If non-empty, the unfolded frequency spectrum will be computed. If empty, the folded frequency spectrum will be returned (1/5 and 4/5 frequences will be pulled, so that the maximum frequency will be 1/2).

2.3.1.10 Polymorphism statistics

Compute various statistics describing sequence polymorphism. Two aligned sets of “species” are compared, and the number of polymorphic / fixed sites are computed:

F: Number of sites fixed in the two sets
FF: Number of sites fixed in the two sets, yet with a distinct state
P: Number of sites that are polymorphic in the two sets
FP: Number of sites that are fixed in set 1 but polymorhic in set 2
PF: Number of sites that are fixed in set 2 but polymorphic in set 1

Positions containing a gap or an unresolved character in one set are considered ambiguous. Such positions are counted separately in the following quantities:

X: Number of sites unresolved in the two sets
FX: Number of sites fixed in set 1 and unresolved in set 2
XF: Number of sites fixed in set 2 and unresolved in set 1
PX: Number of sites polymorphic in set 1 and unresolved in set 2
XP: Number of sites polymorphic in set 2 and unresolved in set 1

Synopsis:

maf.filter=                                 \
    [...],                                   
    SequenceStatistics(                     \
        statistics=(\                       \
            [...],                                                    
            SiteFrequencySpectrum(          \
                bounds=(-0.5, 0.5, 1.5),    \
                ingroup=(pop1, pop2, pop3), \
                outgroup=species2,          \
            [...]),                         \
        ref_species=pop1,                   \
        file=data.statistics.csv),          \
    [...]

Arguments:

species1={list}: A list of species for set 1.
species2={list}: A list of species for set 2.

Note that the “species” terminology relates to multispecies alignments, as originally implemented in the MultiZ aligner. These statistics will however be most relevant when the aligned sequences are actually from individuals from the same population / species. The term “species” is here therefore to be taken in terchnical terms (a sequence id in the alignment), and not biological.

2.3.1.11 Sequence diversity statistics

DiversityStatistics computes sequence diversity statistics, including:

Number of segregating sites,
Watterson’s theta.
Tajima’s pi (average pairwise dissimilarity).
Tajima’s D.

Synopsis:

maf.filter=                                   \
    [...],                                   
    SequenceStatistics(                       \
        statistics=(\                         \
            [...],                                                    
            DiversityStatistics(              \
                ingroup=(samp1,samp2,samp3)), \
            [...]),                           \
        ref_species=samp1,                    \
        file=data.statistics.csv),            \
    [...]

Arguments:

ingroup={list}: A list of sequences to be considered in the statistics calculations.

2.3.1.12 Number of sequence clusters.

CountClusters counts the number of sequence clusters. This requires that a clustering tree was previously computed (see below).

Synopsis:

maf.filter=                                 \
    [...],                                   
    SequenceStatistics(                     \
        statistics=(\                       \
            [...],                                                    
            CountClusters(                  \
                tree=bionj,                 \
                threshold=0.001),           \
            [...]),                         \
        ref_species=species1,               \
        file=data.statistics.csv),          \
    [...]

Arguments:

tree={string}: The tag name used to store the previously computed tree.
threshold={double}: The threshold to use in order to compute clusters. A threshold of 0.001 for instance means that sequences within each cluster will be divergent by maximum 0.001, while sequences in distinct clusters will be divergent by at least 0.001.

2.3.1.13 Fit a substitution model and estimate parameters

ModelFit fits a substitution model to the input block, given a phylogenetic tree. All nucleotide homogeneous models can be used, with or without rate across sites variation.

Synopsis:

maf.filter=                                 \
    [...],                                   
    SequenceStatistics(                     \
        statistics=(\                       \
            [...],                                                    
            ModelFit(                       \
                model=HKY85(                \
                     kappa=1,               \
                     theta=0.5,             \
                     theta1=0.5,            \
                     theta2=0.5),           \
                rate_distribution=Gamma(    \
                     n=4,                   \
                     alpha=0.5),            \
                tree=BioNJ,                 \
                parameters_output=(         \
                     HKY85.theta,           \
                     HKY85.theta1,          \
                     HKY85.theta2,          \
                     HKY85.kappa),          \
                fixed_parameters=(),        \
                reestimate_brlen=yes,       \
                max_freq_gaps=0.3,          \
                gaps_as_unresolved=yes)),   \
            [...]),                         \
        ref_species=species1,               \
        file=data.statistics.csv),          \
    [...]

Arguments:

model={string}: Substitution model to use. See the Bio++ Program Suite manual for an extensive description of available models. All nucleotide models can be used.
rate_distribution={string}: The distribution for rates across sites. See the Bio++ Program Suite manual for all available distributions.
root_freq={None|Full|GC}: Allow root frequencies to be different (non-stationary model). Root frequencies can be fully parametrized, or parametrized with GC content.
tree={string|none}: The property name under which trees are stored for each block. If set to “none”, then an input file should be given.
tree.file={path}[tree=none]: Path for tree file, in case no property is set.
tree.format={Newick|Nhx}[tree=none]: Format for tree file, in case no property is set.
parameters_output={list}: A list of parameter names to output as statistics.
fixed_parameters={list}: A list of parameters which should not be optimized, but fixed to their initila values.
reestimate_brlen={boolean}: Tell if the branches of the tree should be reestimated for each block together with other model parameters.
max_freq_gaps={float}: The maximum proportion of gaps for a site to be included in the analysis.
gaps_as_unresolved={yes/no}: Tell if remaining gaps should be converted to ’N’ before likelihood computation. This should be ’yes’ unless you specify a substitution model which explicitely allows for gaps.
global_clock={yes/no}: Assume a global clock for branch lengths.
reparametrize={yes/no}: Transform parameters to remove constraints (can improve optimization, but is usually slower).

2.3.2 Phylogenetics

2.3.2.1 Distance matrix estimation

Estimates a pairwise distance matrix.

Synopsis:

maf.filter=                                 \
    [...],
    DistanceEstimation(                     \
        method=count,                       \
        gap_option=no_double_gap,           \
        unresolved_as_gap=no,               \
        extended_names=yes),                \
    [...]

maf.filter=                                 \
    [...],
    DistanceEstimation(                     \
        method=ml,                          \
        model=K80(kappa=2),                 \
        rate=Gamma(n=4, alpha=0.5),         \
        parameter_estimation=initial,       \
        max_freq_gaps=0.33,                 \
        gaps_as_unresolved=yes,             \
        profiler=none,                      \
        message_handler=none,               \
        extended_names=yes),                \
    [...]

Arguments:

method={count|ml}: Method used to estimate distance, either observed count or maximum likelihood estimate.

Further arguments for the observed counts:

gap_option={string}

Specifies how to deal with gaps:

all: All positions are used. Gaps are considered as a fifth character.
no_full_gap: Positions only made of gaps in the alignment block are ignored. Alternatively, a gap in the two sequences is ocnsidered as a match (gap are a “fifth” charcater).
no_double_gap: For each pairwise comparison, positions where a gap is found in both sequences are ignored.
no_gap: For each pairwise comparison, any gap-containing position is ignored. This is the recommended option for building phylogenies.

unresolved_as_gap={yes|no}

Tell is unresolved characters should be treated as gaps (usually in order to be ignored).

extended_names={boolean}

Tell if sequence coordinates should be included in the sequence names stored in the output matrix.

Further arguments for the ML method:

model={substitution model description}

See the Bio++ Program Suite manual for a description of substitution models available. Only nucleotide models can be used.

model=JC
model=K80(kappa=2)
model=T92(kappa=2, theta=0.5)
model=GTR

rate={rate distribution description}

See the Bio++ Program Suite manual for a description of available options.

rate=Constant
rate=Gamma(n=4, alpha=0.5)

profiler={none|std|{path}}

Where to print optimization steps (nowhere, standard output or to a given file).

message_handler={none|std|{path}}

Where to log optimization (nowhere, standard output or to a given file).

parameter_estimation={initial|pairwise}

How to estimate substitution process parameters (for instance kappa and alpha). Available options are either to leave them to their initial values, or to estimate them for each pair of sequences.

max_freq_gaps={float}

The maximum proportion of gaps for a site to be included in the analysis.

gaps_as_unresolved={yes/no}

Tell if remaining gaps should be converted to ’N’ before likelihood computation. This should be ’yes’ unless you specify a substitution model which explicitely allows for gaps.

Extra-data:

CountDistance or MLDistance: The estimated pairwise distance matrix.

2.3.2.2 Hierarchical clustering based on distances

Performs hierarchical clustering from a previously computed distance matrix.

Synopsis:

maf.filter=                                 \
    [...],
    DistanceBasedPhylogeny(                 \
        method=bionj,                       \
        dist_mat=MLDistance),               \
    [...]

Arguments:

method={string}

Method to use for reconstructing the tree:

upgma: Unweighted pair group method using arithmetic means
wpgma: Weighted pair group method using aritmetic means
nj: Neighbor-Joining (Saitou and Nei, 1987)
bionj: BioNJ method (Gascuel, 1997)

dist_mat={string}

The tag name of the distance matrix to use.

Extra-data:

UPGMA|WPGMA|NJ|BioNJ: The resulting tree.

2.3.2.3 Root trees

The NewOutgroup filter takes an input tree and (re)root it according to a given species name.

Synopsis:

maf.filter=                                 \
    [...],
    NewOutgroup(                            \
        tree_input=BioNJ,                   \
        tree_output=NioNJ_rooted,           \
        outgroup=outgroupSpecies),          \
    [...]

Arguments:

tree_input={string}: The tag name of the tree to reroot.
tree_output={string}: The tag name to use to store the rerooted tree. If the same as tree_input, the rerooted tree will replace the old one.
outgroup={string}: The name of the species to use as outgroup. One and only one sequence for this species should be present in each block for which rerooting is to be performed.

Extra-data:

(user-defined tag): A rooted tree.

2.3.2.4 Remove a species from a tree

The DropSpecies removes a leaf from a tree.

Synopsis:

maf.filter=                                 \
    [...],
    DropSpecies(                            \
        tree_input=BioNJ,                   \
        tree_output=BioNJ_subtree,          \
        species=species1),                  \
    [...]

Arguments:

tree_input={string}: The tag name of the tree to edit.
tree_output={string}: The tag name to use to store the edited tree. If the same as tree_input, the edited tree will replace the old one.
species={string}: The name of the species to remove. All leaves in the tree, if any, corresponding to a sequence for this species will be removed.

Extra-data:

(user-defined tag): An edited tree.

2.3.2.5 Filter blocks based on their associated trees

The TreeFilter (experimental) filters blocks based on their associated trees. Currently, filter is only performed based on the branch lengths.

Synopsis:

maf.filter=                                 \
    [...],
    TreeFilter(                             \
        tree=BioNJ,                         \
        max_brlen=0.01,                     \
        file=data.trash_tree.maf.gz,        \
        compression=gzip),                  \
    [...]

Arguments:

tree={string}: The tag name of the tree to analyse.
max_brlen={float>0}: Branch length threshold: any block with a tree containing at least one branch lenght longer than the given threshold will be filtered out.
file={none|{path}}: An optional file were removed alignment parts will be stored, in the MAF format. This can be helpful for visual inspection and fine tuning of the filter parameters.
compression={none|gzip|zip|bzip2}: Compression format for output file (if file != none).

2.4 Exporting blocks and data

2.4.1 Write alignment blocks to a MAF file

The Output filter writes all blocks to a MAF file.

Synopsis:

maf.filter=                                 \
    [...],
    Output(                                 \
        file=data.filtered.maf.gz,          \
        compression=gzip,                   \
        mask=yes),                          \
    [...]

Arguments:

file={none|{path}}: A file path where to write alignment blocks, in the MAF format.
compression={none|gzip|zip|bzip2}: Compression format for output file.
mask={yes|no}: Tell if sequences should be masked (if a mask annotation is available), or if masking information should be ignored.

2.4.2 Write alignment blocks to an external alignment file.

The OutputAlignments filter writes all blocks to an external alignment file, potentially losing some information such as coordinates and scores. Sequence names will be formated following the Ensembl convention when coordinates are available.

Synopsis:

maf.filter=                                 \
    [...],
    OutputAlignments(                       \
        file=data.filtered.aln,             \
        compression=none,                   \
        format=Clustal,                     \
        mask=no,                            \
        coordinates=yes),                   \
    [...]

Arguments:

file={none|{path}}: A file path where to write alignment blocks. If the file name contains the %i code, one alignment file per block will be generated, where %i will be replaced by the block index. In addition to the %i special code, additional variables can be used (in combination with %i, in order to avoid putative identical file names): chromosome number (%c), begin (%b) and end (%e) of reference sequence.
format={string}: One of the alignment format supported by Bio++ (this includes Fasta, Mase and Clustal, see the Bio++ Program suite reference manual for a complete list of formats and options).
compression={none|gzip|zip|bzip2}: Compression format for output file.
mask={yes|no}: Tell if sequences should be masked (if a mask annotation is available), or if masking information should be ignored.
coordinates={yes|no}: Tell if coordinates of sequences should be written in the header of each sequence (followinf Ensembl syntax). If set to no, or if sequences have no coordinates, only the species name will be used as a sequence name.
ldhat_header={yes|no}: Tell if a header line should be added for each file, in order to be used with the LDhat convert program. This line contains the nimber of sequences, the sequence lengths, and the number 1 (for haploid).
reference={string}: The name of reference sequence, which will only be used when %c, %b or %e codes are used in file names. In case a block does not contain a sequence for the reference species, "ChrNA", "StartNA" and "StopNA" tags will be used.

2.4.3 Export given positions into a table format.

The OutputAsTable filter writes the nucleotide content of given positions in a table format.

Synopsis:

maf.filter=                                 \
    [...],
    OutputTable(                            \
        file=data.txt,                      \
        species=Species1,Species2,Species3, \
        reference=RefSpecies,               \
        compression=none),                  \
    [...]

Arguments:

file={none|{path}}: A file path where to write date in table format.
compression={none|gzip|zip|bzip2}: Compression format for output file.
species={list of {string}}: Select species for which data should be output (one column per species).
reference={string}: Which species should be used for outputting coordinates.

2.4.4 Call SNPs from alignment blocks and export to VCF file

The VcfOutput filter call SNPs from each block and output them to a VCF file. Only substitutions are called for now.

Synopsis:

maf.filter=                                 \
    [...],
    VcfOutput(                              \
        file=snp.vcf.gz,                    \
        compression=gzip,                   \
        reference=Ref,                      \
        genotypes=(sample1,sample2),        \
        all=no),                            \
    [...]

Arguments:

file={none|{path}}: A file path for the output VCF file.
compression={none|gzip|zip|bzip2}: Compression format for output file.
reference={species name}: A species name corresponding to the sequence to use as reference.
genotypes={list of species}: A list of species for which genotypes informations should be written. If not, void, a ’FORMAT’ column will be added to the output, as well as oneextra column per species to genotype. If the species is not present in this particular, or if a N or Gap is present, a dot (’.’) will be written in the column. Please note that species should be unique in each block, otherwise an error will occur.
all={boolean}: If set to ’yes’, also output non-variable positions in the VCF file.

2.4.5 Export segregating sites to MSMC format

The MsmcOutput filter calls all segregating sites and export them to the MSMC format, to be used with the MSMC program https://github.com/stschiff/msmc (Multiple Sequentially Markovian Coalescent). Sites with gap or unresolved characters are not called.

Synopsis:

maf.filter=                                 \
    [...],
    MsmcOutput(                             \
        file=snp.msmc,                      \
        compression=none,                   \
        reference=Ref,                      \
        genotypes=(sample1,sample2)),       \
    [...]

Arguments:

file={none|{path}}: A file path for the output MSMC file.
compression={none|gzip|zip|bzip2}: Compression format for output file.
reference={species name}: A species name corresponding to the sequence to use as reference.
genotypes={list of species}: A list of species for which segregating sites should be called. It might not contain the reference species, which is then only used for coordinates output. Blocks with do not contain all species will not be called, as well as block not contianing the reference species. Note that the alignment has to be projected againt the reference species, otherwise an error will be reported.

2.4.6 Export segregating sites to PLINK format

The PlinkOutput filter calls all biallelic segregating sites and export them to the PLINK format, creating on Ped file and one Map file. Sites with gap or unresolved characters are not called.

Synopsis:

maf.filter=                                 \
    [...],
    PlinkOutput(                            \
        ped_file=snp.ped,                   \
        map_file=snp.map,                   \
        compression=none,                   \
        reference=Ref,                      \
        genotypes=(sample1,sample2),        \
        map3=no)                            \
    [...]

Arguments:

ped_file={none|{path}}: A file path for the output Ped file.
map_file={none|{path}}: A file path for the output Map file.
compression={none|gzip|zip|bzip2}: Compression format for output file.
reference={species name}: A species name corresponding to the sequence to use as reference.
genotypes={list of species}: A list of species for which segregating sites should be called. It might not contain the reference species, which is then only used for coordinates output. Blocks with do not contain all species will not be called, as well as block not contianing the reference species. Note that the alignment has to be projected againt the reference species, otherwise an error will be reported.
map3={boolean}: Tell if genetic distance column should be ommitted in the Map file. The resulting file should be sued together with the –map3 option in PLINK. If set to no, then a 0 genetic distance is written.

2.4.7 Output coordinates of a set of species for each block.

The OutputCoordinates takes as input a list of species and output the coordinates of each of the corresponding sequences in each block. This filter only works if no more than one sequence per species are present, otherwise an error is casted. If some species are missing in some blocks, NA are produced in the output file.

Synopsis:

maf.filter=                                 \
    [...],
    OutputCoordinates(                      \
        file=coordinates.txt.gz,            \
        compression=gzip,                   \
        species=(species1,species2),        \
        output_src_size=yes)                \
    [...]

Arguments:

file={none|{path}}: A file path for the output text file.
compression={none|gzip|zip|bzip2}: Compression format for output file.
species={list of species}: A list of species for which coodinates should be written.
output_src_size={boolean}: Tell if the size of source sequences should be output.

2.4.8 Write trees to a file.

The OutputTrees filter writes trees associated to blocks to a file, in the Newick format. All trees will be stored in the same file, one per line.

Synopsis:

maf.filter=                                 \
    [...],
    OutputTrees(                            \
        tree=BioNJ,                         \
        file=data.trees.dnd,                \
        compression=gzip),                  \
    [...]

Arguments:

tree={string}: The tag used to store previously computed trees.
file={none|{path}}: A file path where to write trees, in the Newick format.
compression={none|gzip|zip|bzip2}: Compression format for output file.
strip_names={boolean}: Tell if leaf names should be stripped. If so, all text after the first dot will be deleted. This is useful to get rid of the sequence coordinates. The option will have no effect in case the names do not contain coordinates.

2.4.9 Write distance matrices to a file.

The OutputDistanceMatrices filter writes distance matrices associated to blocks to a file, in the Phylip format. All matrices will be stored in the same file, one after the other.

Synopsis:

maf.filter=                                 \
    [...],
    OutputDistanceMatrices(                 \
        distance=MLdistance,                \
        file=data.distmat.ph,               \
        compression=gzip),                  \
    [...]

Arguments:

distance={string}: The tag used to store previously computed distance matrices.
file={none|{path}}: A file path where to write distance matrices, in the Phylip format.
compression={none|gzip|zip|bzip2}: Compression format for output file.
strip_names={boolean}: Tell if sequence names should be stripped. If so, all text after the first dot will be deleted. This is useful to get rid of the sequence coordinates. The option will have no effect in case the names do not contain coordinates.

2.5 Miscelaneous filters

2.5.1 Convert coordinates from one genome to another

The LiftOver mimmics the behavior of the famous UCSC liftover utility. Features are read from any supported format, and a convertion table is generated for all features included in the alignment.

Synopsis:

maf.filter=                                 \
    [...]
    LiftOver(                               \
        ref_species=species1,               \
        target_species=species2,            \
        target_closest_position=yes,        \
        feature.file=species1.gff3.gz,      \
        feature.file.compression=gzip,      \
        feature.format=GFF,                 \
        file=sp1_to_sp2.tln,                \
        compression=none),                  \
    [...]

Arguments:

ref_species={string}: The name of the species for which the coordinates of the features are provided.
target_species={string}: The name of the species to which the coordinates of the features should be converted.
target_closest_position={boolean}: In case the target sequence has a gap at the given position, outputs the coordinate of the previous non-gap position, otherwise NA.
feature.file={path}: The file where the features are described.
feature.file.compression={none|gzip|zip|bzip2}: Compression format for the feature file.
feature.format={GFF|GTF|BedGraph}: Format for the feature file, currently GFF (v3.0), GTF or BedGraph.
compression={none|gzip|zip|bzip2}: Compression format for output file.

2.5.2 Call an external program on each block.

The SystemCall filter runs an external program on each block. For each block, the alignment is written to a temporary file, whose name and format can be specified. This input file is then read by the external program. The called program is expected to generate results in the form of an alignment with the same sequence names, so that it can be read by maffilter. In most case, the external program will be a wrapper script.

Synopsis:

maf.filter=                                 \
    [...],
    SystemCall(                             \
        name=myextprog,                     \
        input.file=tmp.fa,                  \
        input.format=Fasta,                 \
        call=myscript.sh,                   \
        output.file=output.fa,              \
        output.format=Fasta),               \
    [...]

Arguments:

name={string}: A label to use, might be useful in case several tools are called.
input.file={none|{path}}: A file path where to write the block alignment, to be used as input by the external program.
input.format={string}: Alignment format for input file.
output.file={none|{path}}: A file path where to read the processed block alignment, output by the external program.
output.format={string}: Alignment format for output file.
call={string}: The system call to run external program.

MafFilter Manual 1.3.1

Table of Contents

The MafFilter Manual

1 Introduction

1.1 Description of the program

1.2 How to run the program

1.3 General options

Synopsis:

Detailed description of commands:

2 Filters

2.1 Extracting data

2.1.1 Subset: remove sequences from blocks by species name

Synopsis:

Arguments:

2.1.2 SelectOrphans: keep only block with the specified species

Synopsis:

Arguments:

2.1.3 Merging: merge synthenic blocks

Synopsis:

Arguments:

2.1.4 Concatenating: merge consecutive blocks up to a certain size

Synopsis:

Arguments:

2.1.5 OrderFilter: check and eventually discard unordered or overlapping blocks

Synopsis:

Arguments:

2.1.6 Extract features from the alignment

Synopsis:

Arguments:

2.1.7 Extract blocks for a given chromosome/contig/scaffold

Synopsis:

Arguments:

2.1.8 Split alignment blocks into smaller parts

Synopsis:

Arguments:

2.2 Cleaning alignment blocks

2.2.1 Minumum block length

Synopsis:

Arguments:

2.2.2 Minumum block size

Synopsis:

Arguments:

2.2.3 Exclude full gap columns

Synopsis:

Arguments:

2.2.4 Remove empty sequences

Synopsis:

Arguments:

2.2.5 Alignment filtering

Synopsis:

Arguments:

2.2.6 Alignment filtering 2

Synopsis:

Arguments:

2.2.7 Entropy filtering

Synopsis:

Arguments:

2.2.8 Masked sequences filtering

Synopsis:

Arguments:

2.2.9 Quality filtering

Synopsis:

Arguments:

2.2.10 Remove sequences from features from the alignment

Synopsis:

Arguments:

2.3 Statistical analysis

2.3.1 Descriptive statistics

Synopsis:

Arguments:

2.3.1.1 Size of alignment blocks

2.3.1.2 Length of alignment blocks

2.3.1.3 Length of sequence for a species

Synopsis:

Arguments:

2.3.1.4 Alignment score

2.3.1.5 Character frequencies

Arguments:

2.3.1.6 Pairwise divergence

Synopsis: