CoreTracker

A Codon reassignment Tracker

UdeM-LBIT/CoreTracker

About CoreTracker

CoreTracker is a package written in python for the prediction of codon reassignments. It detects evidence of codon reassignments from the protein repertoire of a set of genomes by successively applying multiple algorithms. It’s therefore a filtering pipeline that explore all potential reassignments in genomes of the input set, and finally retains the most probable. CoreTracker is the first method for detecting codon reassignments, that accounts for the phylogenetic context of considered genomes.

The algorithm requires as input a groups of orthologous genes (nucleotide and amino acid sequences) from the genomes of interest, a species phylogenetic tree and a preliminary genetic code. Users can either provide their own protein alignment or use CoreTracker's default alignment pipeline which consists of a preliminary alignment with muscle/mafft followed by a refinement using provided HMM profile with HMMER

A thorough description of CoreTracker can be found in our paper :

Noutahi, E., Calderon, V., Blanchette, M., Lang, F. B., & El-Mabrouk, N. (2017). CoreTracker: accurate codon reassignment prediction, applied to mitochondrial genomes. Bioinformatics, 33(21), 3331-3339.

Installation

CoreTracker is compatible with python 2.7 and has only been tested on linux. Installation requires the following external application : gfortran, PyQt4, muscle, mafft and HMMER. Python dependencies are listed in the requirements.txt file.

Install CoreTracker using the Anaconda/Miniconda environment

The easiest and recommended way to install CoreTracker is through conda. If you are already familiar with conda, I recommend to setup a new environment.

Important : For a variety of reason, pip won't work sometimes in a conda environment (for example, if you have $PYTHONPATH setup in your shell profile). In that case, please use the full path to pip in the root environment (ex : /path/to/conda/bin/pip instead of just pip )

# Install Miniconda  (you can ignore this step if you already have Anaconda/Miniconda)
wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O miniconda.sh
bash miniconda.sh -b -p ~/anaconda/
# Add anaconda to PATH
# you should add this to your .bashrc so you won't have to run it everytime
export PATH=~/anaconda/bin:$PATH;

# Setup a new environement and activate it
conda create -n crt python=2.7 pip libgfortran
source activate crt
# Install dependencies (if required)
conda config --add channels r
conda config --add channels bioconda
conda install hmmer muscle mafft 
conda install -c anaconda pyqt=4 numpy
conda install -c jlmenut weasyprint=0.23
# Install CoreTracker (remaining dependencies should be installed with coretracker)
conda install -c maclandrol coretracker
Alternatively, a working conda environment is also provided here (env.yml). Common installation problems are due to : a wrong version of numpy, or scipy (compilation errors caused by BLAS/LAPACK or the lack of gcc/gfortran for C/fortran extensions). If you have a segmentation fault after installing CoreTracker, try reinstalling numpy and scipy.
# Using conda, create a new environment based on the provided env.yml file
conda env create -f env.yml
source activate crt
# Now install CoreTracker using pip or conda install
pip install coretracker

Native installation

First install the external dependencies if they are missing. If you plan to use the alignment pipeline implemented in CoreTracker, you will need to install mafft and/or mafft and HMMER package. An up to date numpy with libgfortran support is required. To install the required dependencies, you might need sudo for this task. Depending on your distribution, you could also have to compile and install from sources. From experience, it's sometimes better to install numpy using distribution specific sources.
# Install dependencies (debian distributions)
sudo apt-get update
sudo apt-get install libgfortran python-numpy python-qt4 python-lxml python-six
sudo apt-get install muscle mafft hmmer
Now install CoreTracker with pip or setup.py. If you are familiar with virtualenv, I recommend setting up a virtual environment for coretracker.
pip install coretracker
# Download the latest source from github or PyPi
python setup.py install
# You may want to install seaborn (mentionned in the the requirements.txt file) if you want prettier plots
# This is totally optionnal but recommended
pip install --upgrade seaborn

Usage

To invoke CoreTracker, just type coretracker -h after installation in a terminal. Its command line options will be printed

CoreTracker v:1.4.1 Copyright (C) 2017 Emmanuel Noutahi
usage: coretracker.py [-h] [--wdir OUTDIR] [--version] [--gapfilter GAPFILTER]
                      [--idfilter IDFILTER] [--icfilter ICCONTENT] -t TREE
                      --protseq SEQ --dnaseq DNASEQ [--debug] [--rmconst]
                      [--norefine] [--novalid] [--nofilter] [--align [ALIGN]]
                      [--use_tree [USETREE]] [--expos] [--hmmdir HMMDIR]
                      [--params PARAMS] [--parallel [PARALLEL]]
                      [--imformat {pdf,png,svg}] [--restrict_to RESTRICT_TO]
                      [--codemap CODEMAP]

CoreTracker, A codon reassignment tracker

optional arguments:
  -h, --help            show this help message and exit
  --wdir OUTDIR, --outdir OUTDIR
                        Working directory
  --version             show program's version number and exit
  --gapfilter GAPFILTER, --gap GAPFILTER
                        Remove position with more than `gapfilter` gap from
                        the alignment, using gapfilter as threshold (default
                        :0.6)
  --idfilter IDFILTER, --id IDFILTER
                        Conserve only position with at least `idfilter`
                        residue identity (default : 0.5)
  --icfilter ICCONTENT, --ic ICCONTENT
                        Shannon entropy threshold (default : 0.5 ). This will
                        be used to discard column where IC <
                        max(icvector)*icfilter)
  -t TREE, --intree TREE
                        Input specietree in newick format
  --protseq SEQ, --prot SEQ, -p SEQ
                        Protein sequence input in core format
  --dnaseq DNASEQ, --dna DNASEQ, -n DNASEQ
                        Nucleotides sequences input in core format
  --debug               Enable debug printing
  --rmconst             Remove constant site from filtered alignment.
  --norefine            Do not refine the alignment. By default the alignment
                        will be refined. This option should never be used if
                        you have made your own multiple alignment and
                        concatenate it. Else you will have absurd alignment
                        (TO FIX)
  --novalid             Do not validate prediction by retranslating and
                        checking alignment improvement
  --nofilter, --nf      Do not filter sequence alignment.
  --align [ALIGN]       Choose a program to align your sequences
  --use_tree [USETREE]  This is helpfull only if the mafft alignment is
                        selected. Perform multiple alignment, using species
                        tree as guide tree. A scaling value is needed to
                        compute the branch format for the '--treein' option of
                        mafft. If you're unsure, use the default value. The
                        tree must be rooted and binary. See
                        http://mafft.cbrc.jp/alignment/software/treein.html
  --expos, --export_position
                        Export a json file with the position of each
                        reassignment in the corresponding genome.
  --hmmdir HMMDIR       Link a directory with hmm files for alignment. Each
                        hmmfile should be named in the following format :
                        genename.hmm
  --params PARAMS       Use A parameter file to load parameters. If a
                        parameter is not set, the default will be used
  --parallel [PARALLEL]
                        Use Parallelization during execution for each
                        reassignment. This does not guarantee an increase in
                        speed. CPU count will be used if no argument is
                        provided
  --imformat {pdf,png,svg}
                        Image format to use for output (Codon_data file)
  --restrict_to RESTRICT_TO
                        Restrict analysis to list of genomes only
  --codemap CODEMAP     A tab delimited file that map each species to its
                        genetic code. Default code (--gcode) will be used for
                        remaining species

  

Those options can be confusing for a new user, so here is a break down, depending on what you want to do. Please see the tutoral section for any issues not addressed here (dataset preparation, other tools, output interpretation)

Beginner level
# Vanilla mode
coretracker -t "tree.nw" -p "protfile.core" -n "nucfile.core" 

# Specify output directory
coretracker -t "tree.nw" -p "protfile.core" -n "nucfile.core" --wdir "path/to/outdir"

# Control sequence filtering
coretracker -t "tree.nw" -p "protfile.core" -n "nucfile.core" --gapfilter 0.3 --icfilter 0.3  --idfilter 0.6 --wdir "path/to/outdir"

# Specify "Codon_data" file output format
coretracker -t "tree.nw" -p "protfile.core" -n "nucfile.core" --gapfilter 0.3 --icfilter 0.3  --idfilter 0.6 --wdir "path/to/outdir" --imformat svg

# Remove constant site from alignment
coretracker -t "tree.nw" -p "protfile.core" -n "nucfile.core" --gapfilter 0.3 --icfilter 0.3  --idfilter 0.6 --wdir "path/to/outdir" --rmconst

# Use mafft to perform alignment
coretracker -t "tree.nw" -p "protfile.core" -n "nucfile.core" --gapfilter 0.3 --icfilter 0.3  --idfilter 0.6 --wdir "path/to/outdir" --align mafft
Intermediate level
# Show verbose for debugging purpose
coretracker -t "tree.nw" -p "protfile.core" -n "nucfile.core" --gapfilter 0.3 --icfilter 0.3  --idfilter 0.6 --wdir "path/to/outdir" --debug

# Use parallel processing on 4 CPU
coretracker -t "tree.nw" -p "protfile.core" -n "nucfile.core" --gapfilter 0.3 --icfilter 0.3  --idfilter 0.6 --wdir "path/to/outdir" --parallel 4

# Perform alignment according to input tree
coretracker -t "tree.nw" -p "protfile.core" -n "nucfile.core" --gapfilter 0.3 --icfilter 0.3  --idfilter 0.6 --wdir "path/to/outdir" --align mafft --use_tree --scale 1

# Specify a directory with HMM files for each gene
coretracker -t "tree.nw" -p "protfile.core" -n "nucfile.core" --gapfilter 0.3 --icfilter 0.3  --idfilter 0.6 --wdir "path/to/outdir" --hmmdir "path/to/hmmdir"

# Deactivate non-conserved position filtering in the alignment (if you have use trimAl for example) 
coretracker -t "tree.nw" -p "protfile.core" -n "nucfile.core" --wdir "path/to/outdir" --nofilter

# Deactivate automatic alignment refinement step based on HMM
coretracker -t "tree.nw" -p "protfile.core" -n "nucfile.core" --gapfilter 0.3 --icfilter 0.3  --idfilter 0.6 --wdir "path/to/outdir" --norefine

Wizard level
# Export a json file with the postitions of validated reassigned codon in each genome
coretracker -t "tree.nw" -p "protfile.core" -n "nucfile.core" --gapfilter 0.3 --icfilter 0.3  --idfilter 0.6 --wdir "path/to/outdir" --expos

# More control over parameters
coretracker -t "tree.nw" -p "protfile.core" -n "nucfile.core" --gapfilter 0.3 --icfilter 0.3  --idfilter 0.6 --wdir "path/to/outdir" ---params param.yml

"param.yml" is a parameter file in YAML format used to provide additional control over CoreTracker's internal algorithm. An example of the content of such file is provided below. For a complete list of the hidden parameters, please see this file : parameters

# amino acid to exclude
# this will speed up a lot the analysis
# If you exclude an amino acid, there won't be a reassignment to that aa
EXCLUDE_AA: "CDEHLPTVWY"

# If you exclude a aa here, there won't be a reassignment from that aa
EXCLUDE_AA_FROM: "CDEHLPTVWY"

# Threshold require to determine if an aa is majoritary in a column
AA_MAJORITY_THRESH: 0.5

# Mode to compute suspected species
# possible values : wilcoxon, mannwhitney, kmean, ttest
# Keep this to wilcoxon
MODE: 'wilcoxon'

# Distance matrice: identity or known substitution matrix
MATRIX: 'blosum62'

# Codon must appear at least CODON_COUNT_THRESHOLD time in the alignment
# to be considered for codon reassignment prediction
CODON_COUNT_THRESHOLD: 1

# Codon to amino acid likelihood. 
# Use either consensus or complete alignment for Telford score
USE_CONSENSUS_FOR_LIKELIHOOD: False

# Show codon in non conserved position for its amino acid on the output 'Codon_data' file
SHOW_MIXTE_CODONS: False

# Show global codon data
SHOW_GLOBAL_CODON_DATA: False

# Add the total count in each pie chart in the codon detail figure
ADD_NUMBER_PIE: False

# alpha : rejection of the null hypothesis
CONF: 0.05

Tutorial

The data used in this tutorial is available in the examples folder of the github repository or as a zip file from google drive (ct_examples.zip). In the tutorial dataset, you will find the following files and folders :

  • metazoan_hmm : contains a list of HMM files for several metazoan mitochondrial genes
  • test_convert : contains several nucleotide sequences of mitochondrial metazoan genes in fasta format. The sequences are organised by gene families
  • test_extract : contains two files : genelist and speclist
  • test_corefus : contains two core files : mtzoa1.core and mtzoa2.core
  • test_data : contains mitochondrial metazoan data for an example run of oretracker

Dataset preparation

In this section, we will learn how to prepare dataset for a typical coretracker run. We will introduce the following tools : corextract, coretrans, corefus and coreconvert shipped with CoreTracker.

  1. The core sequence format
  2. The corefasta or core sequence format is the default sequence format used by CoreTracker. It is adapted from the fasta format and its purpose is to keep annoted genes from multiple different genomes in the same file. This format support both aligned and unaligned sequences and simplify the representation of concatenated alignment by allowing clear delimitation of genes start and end.

    The first line in a core file starts with a ">>", taken as a comment containing an identifiant shared by a group or sequences (usually gene name for a list of orthologous sequences). This line is followed by multiple other lines which either start by a sequence identifiant ">" or represent the actual sequence itself in the standard one-letter code. Thus the core sequence format is just a concatenation of multiples fasta files separated by a line starting by ">>" which represent a group identifiant. Except for the absence of limitation on the number of characters per line, the original fasta format specification should be used. The core format support both nucleotide and amino-acid sequences

    An example of such file (with random sequences) is provided bellow :

    >>Gene_1
    >species_1
    ATGTCGTGAAGTAGGGATTCCATTAGTACAATACTTATTGTCTTAAGTATTTTAGTAATCTTATTGAGTGGTATTATTATAATAGAAAGGAACCGAAATAATTATATAATAATATCCTTTTTCTTTTCTATACTCTTTTTAATCTTAAATACTACCTATGAACGTATATTTTCCAATTATTATTT
    >species_2
    GTGAATAATACGCCTCACACTCATTCTCAACCCCCTGACAAAACACATAGCCTACCCCTTCCTTGTACTATCCCTATGAGGCATAATTATAACAAGCTCCATCTGCCTACGACAAACAGACCTAAAATCGCTCATTGCATACTCTTCAAAATCGCCCACGGGCTTACCTAGCAAACTCAAACTACGAACGC
    >species_3
    GTGGGGCAATTTCAAGCCATTAAAAATTTTGAGTGAGTTCCGGGGTCGGCACTTTTAATTTTAACTGCTTTATTAACCCCGATTTGTGTTTTAATAAGTCGAAATTCGATTAAATCTTTGCTTTCTTTTTCTATACGTTTATTGGGTCCGTATTTATGTTACTAGCCATATGACCAAGTAGGAACTACTGACTATCAAGCCCTATGTTGTCAACA
    
    >>Gene_2
    >species_1
    ATGCCACAATTAGAAGCTACAGAAATGAAGAAAGAAAGAATTAAACTGGAGACAATTTTAATAATATAA
    >species_2
    ATGCCCCAACTAAAACAAACCCTGAGAACCAAAATGAACGAAAATCTGTTCGCTTCATTCATTGCCCCCACAATCCTAG
    >species_3
    ATGCCACAATTAAACAGTAATCTCTTTTTCTATATTGATGATCACTTTTAAATATGAGCCTATTTTAGTACCTGTTTATTCTAAATAA
    >species_4
    ATTCCACAAATAGCACCTATTAGATGTTCTATTAACTATTATTCTTATATACCAAATCAATAAATTGAAAATGATAA
    

    Another example with aligned protein sequences :

    >>Gene_1
    >species_1
    MATEILTAAKFVGAGAASIGAAGSGAGIGTVFGNLIIGYARNPSLKQQLFTYAILGFAISEAMGLFCLMMAFLILFAL
    >species_2
    MATEILTAAKFVGAG--SIGAAGSGAGIGTVFGNLIIGYARNPSLKQQLFTYAILGFAISEAMGLFCLMMAFLILFGL
    
    >>Gene_2
    >species_1
    -------------I-PQIAPI------RWLLL-FIIFSITFILF---------------------------------CSIN-------------------Y------------------YSYIPN-SPKSNELKNIN-----------------------------------LNSINW--KW----------------------------------
    >species_2
    --------------MPQLN----------------------------------------------------------------------------------------------------LF------S--------------------------------------------FFYISVISFSILMI---------TFKYEPIL------VPVY-SK
    >species_3
    --------------MPQLNTT------VWPTIITPI---LLTLFLITQLKILNTNYHLPP------------------SP----------KP---------IKIKNYN-------------------------------KPWEPKWTKICSLHSLPPQS---------------------------------------------------------
    >species_4
    --------------MPQIAPICH-LIYP-------------------------------------------------SIIILLIIVW---------------------------V-----------------------------ALGR---WKTARPTKTSQTSGRKLPS------------------------------PILNLPGKSW--NW--
    
    

  3. Converting between format with coreconvert
  4. CoreTracker accepts already concatenated protein alignment in fasta format or (un)aligned sequences in core format. If your sequences are in other formats, you might want to convert them in an appropriate format. For this purpose, you can use coreconvert, installed with CoreTracker.

    coreconvert support the following format for aligned sequence : fasta, nexus, stockholm, clustal, core and fasta, core for unaligned sequence. Run coreconvert -h for help.

    In the following example, we will try to convert multiple fasta files into a single core file. The input fasta files are available in the folder test_convert of the tutorial data.

    unzip examples.zip
    ls test_convert
    coreconvert --outfmt=core --infmt=fasta  --outfile out.core test_convert/*.fasta
    less out.core
    

    From the output file, you can see that sequences are grouped by orthologous genes. You can flip the grouping by using the option --order_reverse. The new grouping by species does however not make sense for aligned sequences. CoreTracker take as input a grouping of the sequences by orthologous gene. The identifiant following ">>" should thereby be the gene identifiant.

  5. Build a local sequences database from refseq gbff file with corextract
  6. We will start by first downloading mitochondrial genomes from refseq release.

    mkdir genomes
    wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/mitochondrion/mitochondrion.1.genomic.gbff.gz -P genomes/
    gunzip genomes/mitochondrion.1.genomic.gbff.gz

    Now, for fast data extraction later, we will build a local database containing mitochondrial genomes from the genbank file. For this purpose, we will use corextract build. You can either go grab a coffee or work on other things since this step can take some time to finish (up to 5min).

    corextract build -db genomes/name.db -ib genomes/index.db -G genomes/mitochondrion.1.genomic.gbff

    We have now two databases, name.db which will help to search faster for genomes based on species name and index.db which link species to their genomes. Now we are going to attempt to construct a dataset from a list of genomes of interest by using they name. We will be using the following two files : test_extract/speclist which contains a list of species name and test_extract/genelist which contains a list of the genes we are trying to extract for each genome. Note that empty lines and lines starting with "#" (comments) are ignored from those files and regular expressions are accepted for the species list. Gene names are queried against a local large database of mitochondrial gene aliases, during sequences extraction, to make sure the gene is not missed. Next versions of the package will also include aliases from chloroplast and nuclear genes, or offer the possibility of searching by sequences similarity.

    In this example, we will search only fully-annoted genomes by using the optional argument --complete. Note that if -db name.db is not used, http requests will be made on ncbi taxonomy in order to guess the correct species name. The closest match to the provided name will be used, otherwise, the species will be skipped.

    corextract extract -G genomes/mitochondrion.1.genomic.gbff -ib genomes/index.db -db genomes/name.db --genelist test_extract/genelist --speclist test_extract/speclist --complete

    You can also provide NCBI taxonomy identifier for all species in your speclist file, then use the option --taxid. I recommend using this option whenever possible to ensure accurate search. For a complete list of the available options, run the help with corextract {build|extract} -h.

    The default output of corextract extract is two files in the core sequence format : output.core which contains the nucleotide sequences and output.core_prot which contains the amino acid sequences. The program also print, in the terminal, warnings about possible frame-shifting and undefined nucleotides (N). Attempts are made to correct partial termination codons where TAA stop codon should completed by the addition of 3' A residues to the mRNA. The success of such corrections relies on the validness of the genbank file and the accuracy of its annotation. Therefore, manual correction might be needed. All sequence corrections are mentionned in the terminal, with the label "FIXED" for easier confirmation.

    The number of genomes sharing a specific gene is shown. So is the genetic code of each extracted genomes. I don't recommend trusting this genetic code however.

  7. Sequence translation with coretrans
  8. CoreTracker requires the a priori amino-acid sequences matching the CDS as input. You can obtain this sequence by translating the nucleotide sequences according to a genetic code. This task can be done by coretrans from the package. coretrans is not only able to translate dna sequences using a specified genetic code, but also align the resulting protein sequences. Current version of the package only accept one of the several genetic code listed at NCBI. In the next releases, it might be possible to pass a user-made genetic table as input.

    coretrans takes as mandatory input, the nucleotide sequences in core format. Here we are going to translate the input file using the metazoan genetic code (4), filter out genes with frame-shifting that we were unable to correct, then align the translated sequences with muscle, and refine them using HMM of each genes.

    coretrans --nuc output.core --align --refine --hmmdir metazoan_hmm --filter --gcode '4' --wdir outdir

    The output files should be listed in the specified output folder, by default, this folder is trans_output.

    The sequence alignment pipeline use either muscle or mafft to align protein sequences. The alignments can be refined by building a preliminary Hidden Markov Model (HMM) model using hmmbuild from HMMER then use it to align the sequences with hmmalign from the same package. This last step can be done on multiple iterations at the end of which only the best alignment according to the posterior probability will be kept. Note that from experimental data, alignment quality often stopped improving after a certain number of iteration, so we use an early stopping criterion which stop the n-loop at iteration i with probability i/n if the alignment quality stop improving. Alternatively, if HMM files are provided, they will be used to align the input sequences with hmmalign on only one iteration.

    It is possible to use a space/tab delimited file (an example here : ) that map each genome to a specific genetic code with the argument --codemap. An example of such file can be found h

    coretrans --nuc output.core --align --refine --hmmdir metazoan_hmm --filter --gcode '4' --wdir outdir --codemap trans_map.txt

    The output files should be listed in the specified output folder, by default, this folder is trans_output.

  9. Basic dataset merging with corefus
  10. We also offer corefus, a basic tool for merging two datasets before running CoreTracker. Merging two datasets, can be considered for example if you want to include new genomes in the analysis in order to either measure the impact of adding species to the phylogeny or use a group of genomes with known genetic code as reference. corefus take as input two nucleotide sequences in corefasta format and as optional argument the corresponding two species trees in newick format.

    corefus --nucs test_corefus/mtzoa1.core test_corefus/mtzoa2.core --out mtzoa
    ls outdir

    The trees are joined by creating a new node and adding the roots of the two input trees as its children. True phylogenetic relationship and branch lengths are therefore ignored. Since this might not be the desired we recommend to build a new species phylogenetic tree with your favorite phylogenetic program. You can for example:

    • Merge the nucleotide sequences with corefus
    • Translate and align the sequences using coretrans
    • Convert the aligned protein sequences from the core format into fasta files with coreconvert
    • Use those alignments to build a new species tree based on a supermatrix or supertree approach

Running CoreTracker

In this section, we will run coretracker on a test dataset and learn how to interpret its ouptut. You are supposed to use the all conserved proteins in the proteome of each species for accurate predictions.

  1. Run coretracker on the metazoan test dataset provided
  2. coretracker -t test_data/species_tree.nwk -p test_data/test_prot.core -n test_data/test_nuc.core --wdir test_data/output --params test_data/param.yml --norefine --parallel --expos --debug --imformat svg

    You can check the parameters used in the param.yml file. With this command, we are going to run CoreTracker in parallel mode, with the default filtering parameters, no alignment refinement and debug printing enabled. We are also going to export predicted reassigned positions in a json file. CoreTracker also accept the --codemap argument to specify the genetic code of some of the genomes in the dataset. This can be used conjointly with --restrict_to that allow user to restrict prediction to only one genome, taking the remaining as reference. --restrict_to take either a single species name or a file containing one species name per line

  3. Interpreting coretracker's output
  4. CoreTracker return several files, but only a handful of them are really useful depending on what you're trying to do :

    # listing the output directory
    ls test_data/output
    
    tree.nwk    filt_alignment.fasta  Ile_to_Gly           Phe_to_Asn
    Arg_to_Asn  gap_filt.fasta        Ile_to_Lys           Phe_to_Gly
    Arg_to_Gln  Gln_to_Arg            Ile_to_Met           Phe_to_Ile
    Arg_to_Gly  Gln_to_Gly            Ile_to_Phe           Phe_to_Ser
    Arg_to_Ile  Gln_to_Lys            Ile_to_Ser           positions.json
    Arg_to_Lys  Gln_to_Ser            Lys_to_Arg           reassignment.json
    Arg_to_Phe  Gly_to_Arg            Lys_to_Asn           Ser_to_Arg
    Arg_to_Ser  Gly_to_Asn            Lys_to_Gln           Ser_to_Asn
    Asn_to_Arg  Gly_to_Ile            Lys_to_Gly           Ser_to_Gln
    Asn_to_Gln  Gly_to_Phe            Lys_to_Ile           Ser_to_Gly
    Asn_to_Gly  Gly_to_Ser            Lys_to_Phe           Ser_to_Ile
    Asn_to_Ile  ic_filt.fasta         Lys_to_Ser           Ser_to_Lys
    Asn_to_Lys  Ile_to_Arg            ori_alignment.core   Ser_to_Phe
    Asn_to_Phe  Ile_to_Asn            ori_alignment.fasta  similarity.json
    Asn_to_Ser  Ile_to_Gln            Phe_to_Arg  
    
    # listing one of the several folder containing specific reassignment
    ls test_data/output/Arg_to_Gly
    
    AGA_codons.svg  AGA_violin.svg  AGG_rea.svg          Codon_data.svg
    AGA_ori.svg     AGG_codons.svg  AGG_violin.svg       Report_Arg_to_Gly.pdf
    AGA_rea.svg     AGG_ori.svg     Arg_to_Gly_data.txt
    
    • ori_alignment.core : original protein alignment in core format after removing species neither in the nucleotide sequences nor in the input tree.
    • ori_alignment.fasta : same protein alignment in fasta format. Here, alignment are concatenated to form a supermatrix.
    • gap_filt.fasta : protein alignment in fasta format, after filtering by gap percentage
    • ic_filt.fasta : protein alignment in fasta format, after filtering by information content (IC).
    • filt_alignment.fasta : final filtered protein alignment (after gap, IC and amino acid identity percent per columns) and pruning to remove species that does not appear in both.
    • tree.nwk : contains the input tree in newick format after pruning to match the species list in the input sequence files.
    • similarity.json : a json file that contains all the pairwise distance between genomes in both the input alignment ('global') and filtered alignment ('filtered').

      {
          "Schistosoma": [
              {
                  "filtered": 0.4472573839662447, 
                  "global": 0.4292988640814728, 
                  "species": "Paracentrotus"
              }
          ], 
          "Fasciola": [
              {
                  "filtered": 0.4694092827004219, 
                  "global": 0.4782608695652174, 
                  "species": "Paracentrotus"
              }, 
              {
                  "filtered": 0.6561181434599156, 
                  "global": 0.6400313356835096, 
                  "species": "Schistosoma"
              }
          ],
          ...
      
      }
      

    • positions.json : a json file that contains the positions, in the genes, likely affected by a given reassignment for each genome. This file is only returned if the flag --expos is set when running coretracker

      {
          "Halocynthia": {
              "AGG:G": [
                  [
                      "cox2", 
                      163
                  ], 
      
                  ...
              ], 
      
              ...
          }, 
      
          ...
      
          "Lumbricus": {
              "AGA:S": [
                  ...
      
                  [
                      "cob", 
                      150
                  ]
              ]
          }
      }
      

    • predictions.txt : a text file that contains a summary of all positive predictions made by the Random Forest (RF) and their performance on the validation tests. This is one of the output of interest.

      #Reference Genetic Code : 4 (Mold Mitochondrial)
      #Exception: 
      AGG (R, S)
              Schistosoma     0.504   Both
              Caenorhabditis  0.993   Both
              Katharina       0.765   Both
      AAA (K, N)
              Paracentrotus   0.727   Both
              Schistosoma     0.833   Both
              Fasciola        0.753   Both
              Patiria         0.917   Both
              ...
      

      The first line of that file start by a # and indicate the reference genetic code used according to NCBI notation. Here the translation table 4 was used. The rest of the file list all predicted exceptions to the translation table. For each codon reassignment, the list of affected genomes (first column), the prediction probability (second column) and the validation output (third column) is show. Validation can take 4 values : Both indicating that both the clade-aware validation and the validation based on alignment improvement are passed — Clad when only the clade-aware validation is passed — Algn when only the alignment validation alignment is passed and None when neither was a success. Predictions that do not pass both validation are shown because it might be of interest to perform supplemental analysis in order to confirm if they are true positives.

      As we can see from this file, for example, AAA codons are predicted to be reassigned from Lysine to Asparagine in 4 genomes, and those predictions were all validated. Indeed, AAA(Lys, Asp) is a known reassignment in Platyhelminthes and Echinodermata and was correctly inferred by coretracker.

    • reassignment.json : a dump of all informations extracted by CoreTracker from the input data in a json file. This file is used to build the input matrix for the Random Forest prediction and is useful if you want to build your own predictor based on informations extracted by CoreTracker. You can also use it as input for reaplot.
    • AA1_to_AA2 : a folder that contains detailed informations about each reassignment. Here the content of Arg_to_Ser is shown as example.

      # listing the content of Arg_to_Ser
      ls test_data/output/Arg_to_Ser 
      AGA_codons.svg  AGA_violin.svg  AGG_rea.svg          Codon_data.svg
      AGA_ori.svg     AGG_codons.svg  AGG_violin.svg       Report_Arg_to_Ser.pdf
      AGA_rea.svg     AGG_ori.svg     Arg_to_Ser_data.txt
      

      • Arg_to_Ser_data.txt is a text file that contains the raw information on the Arg to Ser reassignment. It's divided in 3 sections. The first one Random Forest prediction contains the features values, the prediction of the Random Forest and the results of the validation steps. The second section ( Alignment improvement) contains the p-values of the improvement in alignment for each considered codon reassignment. The final section (Prediction Prob. range for positive) contains, as its name suggest, the range of probability of the predicted true positives.
        ### Random Forest prediction
        genome         codon  ori_aa  rea_aa  fitch  Fisher_pval        Gene_frac       N._rea           N._used          Cod._count      Sub._count       G._len  codon_lik        id    prediction  probability  clad_valid  trans_valid
        Paracentrotus  AGA    R       S       0.0    9.29434047659e-08  1.0             1.0              0.0              0.2             0.2              948.0   2.50632911392    8.0   1.0         0.995        1           1
        Schistosoma    AGG    R       S       0.0    3.90905936669e-05  0.666666666667  0.666666666667   0.0              0.166666666667  0.222222222222   829.0   2.1338028169     10.0  1.0         0.504        1           1
        Schistosoma    AGA    R       S       0.0    3.90905936669e-05  0.666666666667  0.307692307692   0.0769230769231  0.361111111111  0.222222222222   829.0   0.787581699346   8.0   0.0         0.316        0           1
        Fasciola       AGG    R       S       0.0    0.0357142733838    0.5             0.2              0.0              0.208333333333  0.0416666666667  861.0   0.239669421488   10.0  0.0         0.203        0           1
        Patiria        AGA    R       S       0.0    1.58906666891e-05  1.0             0.8              0.0              0.15625         0.125            948.0   2.19469026549    8.0   1.0         0.595        1           1
        
        ...
        
        ### Alignment improvement:
        AGG Ser 1.28e-03
        AGA Ser 3.23e-09
        
        ### Prediction Prob. range for Positive :[ 0.504 - 0.998 ]
        
      • Report_Arg_to_Ser.pdf: contains a report in pdf format that summarize the RF results and the validation output for the reassignment of Arginine codons to Serine.
      • Codon_data.svg : visual of the usage of Arginine codons in columns where Serine is predominant (column I ) vs columns where Arginine is predominant (column II).

        The difference in codons usage is measured by a Fisher's Exact test whose p-values are shown on the image. A color code is applied on leaves to differentiate positive and negative prediction. Red leaves are positives predicted by the Random Forest that also passed the Fisher's Exact test. Green leaves are predicted negatives that passed the Fisher's test. Orange leaves are predicted positives that failed the Fisher's test. Remaining leaves (not selected for the RF prediction of failed both the RF and the Fisher's test) are colored Black.

        The values under and above the branch of each leaves respectively represent the count of Arginine (in the Arg to Ser reassignment) in the global alignment and in the Arginine-filtered alignment for the genome represented by the leaf.

        Each node of the tree is labeled either by Arg or Ser, from the initial leave labeling by identifying interesting genomes, followed by Fitch ancestral reconstruction (see the method section of the article). Internal nodes colors are set according to the Fitch ancestral state reconstruction (green circle for ambiguous states, and red square for suspected reassignments

        .


      • AGA_violin.svg : a violin plot that represent the distribution of the Sum-of-Pairs score SP-score in the filtered protein alignment, when the original code is used vs when AGA is used as a Serine codons in genomes predicted to be affected by AGA(Arg, Ser). For each alignment, the SP-score is computed per columns, in order to have a distribution of score, then the two distributions are compared using a statistical test. The objectif here is to assess whether the alignment quality improve with the translation correction brought by the inferred reassignment. A violin plot is returned for each Arginine codons predicted to be reassigned.

      • AGA_ori.svg : original filtered protein alignment, restricted to positions where at least one AGA codons appear. The total SP-score and information content of this alignment is shown at the top of the image. If the input sequence was in core format, the alignment is delimited by the genes, with each selected positions of the protein alignments displayed. Above the alignment, the information content per column is shown. The predicted genomes with the reassignment that also pass the clade-aware validation test are highlighted in red.


      • AGA_rea.svg : same as above, except that genomes in the protein alignment are corrected according to their new genetic code (codon reassignment). You should generally notice an improvement in the alignment here (from comparing for example its information content and SP-score to the previous values). This is not always true however, for reassignment involving similar amino acids.


      • AGA_codons.svg : same as above, except this is a codon alignment (using the protein alignment as template). Here, neither the alignment statistics, nor the positions in the protein alignment are displayed.


        As you can see here from this image, at several positions where most genomes use Serine codons (black boxes), genomes with the AGR(Arg, Ser) reassigment use AGR codons instead.

  5. coretracker-prep and coretracker-run
  6. For convenience, the main pipeline coretracker has also been splitted into two separate binaries: coretracker-prep and coretracker-run. The former is used to make preparation for predictions, which include sequences alignment, codon alignment, sequence filtering, potential reassignment identification, etc, then save the result on disk, which can be loaded and executed by coretracker-run. An interesting feature of coretracker-prep is to perform all preliminary computation in parallel for a list of dataset then merge them into one giant dataset. This allow the user to split its dataset into smaller chunks that can be analysed simultaneously in parallel, making the whole process faster. You should keep in mind that using this option will also increase memory usage for larger dataset. A similar procedure is also implemented in coretracker-run, which allow the prediction of reassignment, by independently considering pair of amino acids.

Supplemental analysis

Here we will perform some additional analysis with available scripts in order to collect more data on predictions. These supplemental analysis are not always informative and are optional.

  1. Genome clustering with codonclust
  2. codonclust cluster genomes according to their codons content. It's a simple way to find outlier genomes that could introduce errors in the inference. Input data is the reassignment.json file returned by coretracker, so a first run of coretracker is needed. A principal component analysis is first performed in order to reduce the data dimensionality. Help can be obtained with codonclust -h

    Let's use the output file returned by our previous execution of coretracker in the last section

    codonclust -i test_data/output/reassignment.json  -o cluster_output --plot_aa_usage --pca_component 5 --gcode 4 --aar 'CDEHQWPY'

    Here we are using the translation table 4 and 5 components for the PCA. The default algorithm for clustering will be MeanShift. We will plot the amino acid usage at the same time, but the following residues will be discarded : C, D, E, H, Q, W, P, Y, L, M, I, V. Since we did not ask for the codons count (see option --csv in the help), only two files are returned : cluster_output.svg and cluster_output_aa_usage.svg. If a file extension is not set, the default image format will be svg.


    From this plot, we can see that the dataset can be split into two clusters according to their codons usage. It does not appear, however, that some genomes exhibit unique codon usage compared to the rest, which make this dataset somewhat robust to errors emanating from extreme codon usage bias. It should be noted that this does not mean that all predictions made on this dataset will be accurate, as other type of errors such as improper taxa-sampling, RNA editions, etc could be present.

    AA usage after correction for AGR and AAA codon reassignment

    AA usage after correction for AGR and AAA codon reassignment

    The amino acid usage show some variability per genomes in both the non-corrected and the corrected dataset. However, you can notice that after correction, the usage of some amino acids including Arginine (R) is more similar between genomes (tigher proximity of the frequency of Arginine usage in the bottom image). Note that the whole mitochondrial proteome was not used in this case.

  3. Ancestral state reconstruction with ansrec
  4. Once you have your predictions, you could be interested in reconstructing their history. You can perform this task with by using R ancestral state reconstruction packages or phylogenetic tools such as Mesquite and MrBayes. You can also use ansrec a script available on the github repository here : to perform a quick ancestral reconstruction using parsimony (Fitch or Dollo). ansrec takes as mandatory input a json or yaml file containing the codon reassignments. The format of the json file should be Genome --> Amino Acid --> Codons as seen below. The yaml file structure should also obey this format.

    {
       "Halocynthia":{
          "G":[
             "AGG",
             "AGA"
          ]
       },
       "Schistosoma":{
          "S":[
             "AGG"
          ],
          "N":[
             "AAA"
          ]
       },
       "Caenorhabditis":{
          "S":[
             "AGG",
             "AGA"
          ]
       }
    }
    

    Alternatively, you can provide a file in a format similar to coretracker's output (see previous section and predictions.txt). An example run of ansrec with Fitch's parsimony and the resulting output is shown below

    python ansrec -c test_data/out_example.txt --gcode 4 --out ancestral.svg --tree test_data/species_tree.nwk -a 'fitch'

  5. Draw a figure showing all predicted reassignments
  6. An additional script to draw all predicted reassignments on the input tree is provided. This scripts use the reassignment.json file returned by coretracker and a file in a format similar to the predictions.txt

    reaplot -r  test_data/output/reassignment.json -i test_data/output/predictions.txt --gcode 4 --tree test_data/species_tree.nwk --out test_data/output/reaplot.svg --valid both
    

    The complete list of options can be found in the help (reaplot -h. The resulting image is a vector graphics that can be easily edited to suit your needs.

  7. Checking reassignments positions with domaintester
  8. We also provide a script to check the positions of each reassignment against Pfam domains. The main objectif here was to verify if the predicted codon reassignments affects structurally/functionally important positions. domaintester is available on the github repository here :

    # run pfamprepare.sh if you don't have
    # a local version of the pfam database
     bash pfamprepare.sh
     # run domaintester using the previous output of coretracker
     # could take some time
    python domaintester -c test_data/output/ori_alignment.core -p test_data/output/positions.json --outdir test_data/domaintester --hmm Pfam/Pfam-A.hmm -o domains.csv --format 'csv'
    
    ls test_data/domaintester
    atp6  atp8  cob  cox1  cox2  cox3  domains.csv
     # look at the first line
    head -9 test_data/output/domains.csv | column -t
    Species         Gene  Codon  AA  Pos  Domain       Pfam        start  end  description
    Paracentrotus   cox2  AGA    S   168  COX2         PF00116.18  94     213  Cytochrome   C  oxidase  subunit  II,  periplasmic  domain
    Patiria         cox2  AGA    S   168  COX2         PF00116.18  94     213  Cytochrome   C  oxidase  subunit  II,  periplasmic  domain
    Lumbricus       cox2  AGA    S   168  COX2|94-213  PF00116.18  94     213  Cytochrome   C  oxidase  subunit  II,  periplasmic  domain
    Caenorhabditis  cox2  AGA    S   112  COX2         PF00116.18  97     216  Cytochrome   C  oxidase  subunit  II,  periplasmic  domain
    Caenorhabditis  cox2  AGA    S   154  COX2         PF00116.18  97     216  Cytochrome   C  oxidase  subunit  II,  periplasmic  domain
    Caenorhabditis  cox2  AGA    S   168  COX2         PF00116.18  97     216  Cytochrome   C  oxidase  subunit  II,  periplasmic  domain
    Caenorhabditis  cox2  AGA    S   230  NID          NID         -      -    NID
    

    NID (Not In Domains) is used when the position does not appear in a domain, whereas, N/A (not available) is used when no hits were found.


  9. Training Random Forest models with new data
  10. This is not yet offered, but planned for the next version release of CoreTracker


License

This software is licensed under GPLv3, a copy of which can be found on the github repository. Note that this software is provided "as is" without warranty of any kind, express or implied to the extent permitted by applicable law.