Content
- About CoreTracker
- Installation
- Usage
- Tutorial
- License terms
A Codon Reassignment Tracker

CoreTracker is a Python package for the prediction of codon reassignments. It detects evidence of codon reassignment in the protein repertoire of a set of genomes by successively applying multiple algorithms. It is therefore a filtering pipeline that explores all potential reassignments in the genomes of the input set and retains the most probable ones. CoreTracker is the first codon reassignment detection method that accounts for the phylogenetic context of the considered genomes.
The algorithm requires as input a set of orthologous gene groups (nucleotide and amino acid sequences) from the genomes of interest, a species phylogenetic tree and a preliminary genetic code. Users can either provide their own protein alignment or use CoreTracker's default alignment pipeline, which consists of a preliminary alignment with muscle/mafft followed by a refinement with HMMER using the provided HMM profiles.
A thorough description of CoreTracker can be found in our paper:
Noutahi, E., Calderon, V., Blanchette, M., Lang, F. B., & El-Mabrouk, N. (2017). CoreTracker: accurate codon reassignment prediction, applied to mitochondrial genomes. Bioinformatics, 33(21), 3331-3339.
CoreTracker is compatible with Python 2.7 and has only been tested on Linux. Installation requires the following external applications: gfortran, PyQt4, muscle, mafft and HMMER. Python dependencies are listed in the requirements.txt file.
The recommended way to install CoreTracker is through conda. If you are already familiar with conda, I recommend setting up a new environment. Important: for a variety of reasons, pip sometimes won't work inside a conda environment (for example, if you have $PYTHONPATH set in your shell profile). In that case, please use the full path to the pip of the root environment (e.g. /path/to/conda/bin/pip instead of just pip).
# Install Miniconda (you can ignore this step if you already have Anaconda/Miniconda)
wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O miniconda.sh
bash miniconda.sh -b -p ~/anaconda/

# Add anaconda to PATH
# You should add this to your .bashrc so you won't have to run it every time
export PATH=~/anaconda/bin:$PATH;

# Set up a new environment and activate it
conda create -n crt python=2.7 pip libgfortran
source activate crt

# Install dependencies (if required)
conda config --add channels r
conda config --add channels bioconda
conda install hmmer muscle mafft
conda install -c anaconda pyqt=4 numpy
conda install -c jlmenut weasyprint=0.23

# Install CoreTracker (remaining dependencies should be installed with coretracker)
conda install -c maclandrol coretracker

Alternatively, a working conda environment is also provided here (env.yml). Common installation problems are due to a wrong version of numpy or scipy (compilation errors caused by BLAS/LAPACK or the lack of gcc/gfortran for C/Fortran extensions). If you get a segmentation fault after installing CoreTracker, try reinstalling numpy and scipy.
# Using conda, create a new environment based on the provided env.yml file
conda env create -f env.yml
source activate crt

# Now install CoreTracker using pip or conda install
pip install coretracker
If you are not using conda, install the external dependencies with your distribution's package manager; you might need sudo for this task. Depending on your distribution, you may also have to compile and install some of them from source. From experience, it is sometimes better to install numpy from distribution-specific sources.
# Install dependencies (debian distributions)
sudo apt-get update
sudo apt-get install libgfortran python-numpy python-qt4 python-lxml python-six
sudo apt-get install muscle mafft hmmer

Now install CoreTracker with pip or setup.py. If you are familiar with virtualenv, I recommend setting up a virtual environment for coretracker.
pip install coretracker
# Download the latest source from github or PyPi
python setup.py install
# You may want to install seaborn (mentioned in the requirements.txt file) if you want prettier plots
# This is totally optional but recommended
pip install --upgrade seaborn
To invoke CoreTracker, just type coretracker -h in a terminal after installation. Its command line options will be printed:
CoreTracker v:1.4.1 Copyright (C) 2017 Emmanuel Noutahi

usage: coretracker.py [-h] [--wdir OUTDIR] [--version] [--gapfilter GAPFILTER]
                      [--idfilter IDFILTER] [--icfilter ICCONTENT] -t TREE
                      --protseq SEQ --dnaseq DNASEQ [--debug] [--rmconst]
                      [--norefine] [--novalid] [--nofilter] [--align [ALIGN]]
                      [--use_tree [USETREE]] [--expos] [--hmmdir HMMDIR]
                      [--params PARAMS] [--parallel [PARALLEL]]
                      [--imformat {pdf,png,svg}] [--restrict_to RESTRICT_TO]
                      [--codemap CODEMAP]

CoreTracker, A codon reassignment tracker

optional arguments:
  -h, --help            show this help message and exit
  --wdir OUTDIR, --outdir OUTDIR
                        Working directory
  --version             show program's version number and exit
  --gapfilter GAPFILTER, --gap GAPFILTER
                        Remove position with more than `gapfilter` gap from the
                        alignment, using gapfilter as threshold (default :0.6)
  --idfilter IDFILTER, --id IDFILTER
                        Conserve only position with at least `idfilter` residue
                        identity (default : 0.5)
  --icfilter ICCONTENT, --ic ICCONTENT
                        Shannon entropy threshold (default : 0.5 ). This will be
                        used to discard column where IC < max(icvector)*icfilter)
  -t TREE, --intree TREE
                        Input specietree in newick format
  --protseq SEQ, --prot SEQ, -p SEQ
                        Protein sequence input in core format
  --dnaseq DNASEQ, --dna DNASEQ, -n DNASEQ
                        Nucleotides sequences input in core format
  --debug               Enable debug printing
  --rmconst             Remove constant site from filtered alignment.
  --norefine            Do not refine the alignment. By default the alignment
                        will be refined. This option should never be used if you
                        have made your own multiple alignment and concatenate
                        it. Else you will have absurd alignment (TO FIX)
  --novalid             Do not validate prediction by retranslating and checking
                        alignment improvement
  --nofilter, --nf      Do not filter sequence alignment.
  --align [ALIGN]       Choose a program to align your sequences
  --use_tree [USETREE]  This is helpfull only if the mafft alignment is
                        selected. Perform multiple alignment, using species tree
                        as guide tree. A scaling value is needed to compute the
                        branch format for the '--treein' option of mafft. If
                        you're unsure, use the default value. The tree must be
                        rooted and binary. See
                        http://mafft.cbrc.jp/alignment/software/treein.html
  --expos, --export_position
                        Export a json file with the position of each
                        reassignment in the corresponding genome.
  --hmmdir HMMDIR       Link a directory with hmm files for alignment. Each
                        hmmfile should be named in the following format :
                        genename.hmm
  --params PARAMS       Use A parameter file to load parameters. If a parameter
                        is not set, the default will be used
  --parallel [PARALLEL]
                        Use Parallelization during execution for each
                        reassignment. This does not guarantee an increase in
                        speed. CPU count will be used if no argument is provided
  --imformat {pdf,png,svg}
                        Image format to use for output (Codon_data file)
  --restrict_to RESTRICT_TO
                        Restrict analysis to list of genomes only
  --codemap CODEMAP     A tab delimited file that map each species to its
                        genetic code. Default code (--gcode) will be used for
                        remaining species
Those options can be confusing for a new user, so here is a breakdown, depending on what you want to do. Please see the tutorial section for any issues not addressed here (dataset preparation, other tools, output interpretation).
Beginner level
# Vanilla mode
coretracker -t "tree.nw" -p "protfile.core" -n "nucfile.core"

# Specify output directory
coretracker -t "tree.nw" -p "protfile.core" -n "nucfile.core" --wdir "path/to/outdir"

# Control sequence filtering
coretracker -t "tree.nw" -p "protfile.core" -n "nucfile.core" --gapfilter 0.3 --icfilter 0.3 --idfilter 0.6 --wdir "path/to/outdir"

# Specify "Codon_data" file output format
coretracker -t "tree.nw" -p "protfile.core" -n "nucfile.core" --gapfilter 0.3 --icfilter 0.3 --idfilter 0.6 --wdir "path/to/outdir" --imformat svg

# Remove constant site from alignment
coretracker -t "tree.nw" -p "protfile.core" -n "nucfile.core" --gapfilter 0.3 --icfilter 0.3 --idfilter 0.6 --wdir "path/to/outdir" --rmconst

# Use mafft to perform alignment
coretracker -t "tree.nw" -p "protfile.core" -n "nucfile.core" --gapfilter 0.3 --icfilter 0.3 --idfilter 0.6 --wdir "path/to/outdir" --align mafft
Intermediate level
# Show verbose output for debugging purposes
coretracker -t "tree.nw" -p "protfile.core" -n "nucfile.core" --gapfilter 0.3 --icfilter 0.3 --idfilter 0.6 --wdir "path/to/outdir" --debug

# Use parallel processing on 4 CPUs
coretracker -t "tree.nw" -p "protfile.core" -n "nucfile.core" --gapfilter 0.3 --icfilter 0.3 --idfilter 0.6 --wdir "path/to/outdir" --parallel 4

# Perform alignment according to the input tree
coretracker -t "tree.nw" -p "protfile.core" -n "nucfile.core" --gapfilter 0.3 --icfilter 0.3 --idfilter 0.6 --wdir "path/to/outdir" --align mafft --use_tree --scale 1

# Specify a directory with HMM files for each gene
coretracker -t "tree.nw" -p "protfile.core" -n "nucfile.core" --gapfilter 0.3 --icfilter 0.3 --idfilter 0.6 --wdir "path/to/outdir" --hmmdir "path/to/hmmdir"

# Deactivate non-conserved position filtering in the alignment (if you have used trimAl for example)
coretracker -t "tree.nw" -p "protfile.core" -n "nucfile.core" --wdir "path/to/outdir" --nofilter

# Deactivate the automatic HMM-based alignment refinement step
coretracker -t "tree.nw" -p "protfile.core" -n "nucfile.core" --gapfilter 0.3 --icfilter 0.3 --idfilter 0.6 --wdir "path/to/outdir" --norefine
Wizard level
# Export a json file with the positions of validated reassigned codons in each genome
coretracker -t "tree.nw" -p "protfile.core" -n "nucfile.core" --gapfilter 0.3 --icfilter 0.3 --idfilter 0.6 --wdir "path/to/outdir" --expos

# More control over parameters
coretracker -t "tree.nw" -p "protfile.core" -n "nucfile.core" --gapfilter 0.3 --icfilter 0.3 --idfilter 0.6 --wdir "path/to/outdir" --params param.yml
"param.yml" is a parameter file in YAML format used to provide additional control over CoreTracker's internal algorithm. An example of the content of such file is provided below. For a complete list of the hidden parameters, please see this file : parameters
# Amino acids to exclude
# This will speed up the analysis a lot
# If you exclude an amino acid, there won't be a reassignment to that aa
EXCLUDE_AA: "CDEHLPTVWY"
# If you exclude an aa here, there won't be a reassignment from that aa
EXCLUDE_AA_FROM: "CDEHLPTVWY"
# Threshold required to determine if an aa is majoritary in a column
AA_MAJORITY_THRESH: 0.5
# Mode to compute suspected species
# Possible values: wilcoxon, mannwhitney, kmean, ttest
# Keep this to wilcoxon
MODE: 'wilcoxon'
# Distance matrix: identity or a known substitution matrix
MATRIX: 'blosum62'
# A codon must appear at least CODON_COUNT_THRESHOLD times in the alignment
# to be considered for codon reassignment prediction
CODON_COUNT_THRESHOLD: 1
# Codon to amino acid likelihood.
# Use either consensus or complete alignment for Telford score
USE_CONSENSUS_FOR_LIKELIHOOD: False
# Show codons in non-conserved positions for their amino acid in the output 'Codon_data' file
SHOW_MIXTE_CODONS: False
# Show global codon data
SHOW_GLOBAL_CODON_DATA: False
# Add the total count in each pie chart in the codon detail figure
ADD_NUMBER_PIE: False
# alpha: rejection of the null hypothesis
CONF: 0.05
The data used in this tutorial is available in the examples folder of the github repository or as a zip file from Google Drive (ct_examples.zip).
In this section, we will learn how to prepare a dataset for a typical CoreTracker run. We will introduce the following tools shipped with CoreTracker: corextract, coretrans, corefus and coreconvert.
The corefasta or core sequence format is the default sequence format used by CoreTracker. It is adapted from the fasta format, and its purpose is to keep annotated genes from multiple different genomes in the same file. This format supports both aligned and unaligned sequences and simplifies the representation of concatenated alignments by allowing a clear delimitation of gene start and end.

A core file starts with a line beginning with ">>", taken as a comment containing an identifier shared by a group of sequences (usually the gene name for a list of orthologous sequences). This line is followed by multiple other lines which either start with a sequence identifier (">") or represent the actual sequence itself in the standard one-letter code. Thus the core sequence format is just a concatenation of multiple fasta files, each preceded by a line starting with ">>" that holds the group identifier. Except for the absence of a limit on the number of characters per line, the original fasta format specification applies. The core format supports both nucleotide and amino acid sequences.
An example of such a file (with random sequences) is provided below:
>>Gene_1
>species_1
ATGTCGTGAAGTAGGGATTCCATTAGTACAATACTTATTGTCTTAAGTATTTTAGTAATCTTATTGAGTGGTATTATTATAATAGAAAGGAACCGAAATAATTATATAATAATATCCTTTTTCTTTTCTATACTCTTTTTAATCTTAAATACTACCTATGAACGTATATTTTCCAATTATTATTT
>species_2
GTGAATAATACGCCTCACACTCATTCTCAACCCCCTGACAAAACACATAGCCTACCCCTTCCTTGTACTATCCCTATGAGGCATAATTATAACAAGCTCCATCTGCCTACGACAAACAGACCTAAAATCGCTCATTGCATACTCTTCAAAATCGCCCACGGGCTTACCTAGCAAACTCAAACTACGAACGC
>species_3
GTGGGGCAATTTCAAGCCATTAAAAATTTTGAGTGAGTTCCGGGGTCGGCACTTTTAATTTTAACTGCTTTATTAACCCCGATTTGTGTTTTAATAAGTCGAAATTCGATTAAATCTTTGCTTTCTTTTTCTATACGTTTATTGGGTCCGTATTTATGTTACTAGCCATATGACCAAGTAGGAACTACTGACTATCAAGCCCTATGTTGTCAACA
>>Gene_2
>species_1
ATGCCACAATTAGAAGCTACAGAAATGAAGAAAGAAAGAATTAAACTGGAGACAATTTTAATAATATAA
>species_2
ATGCCCCAACTAAAACAAACCCTGAGAACCAAAATGAACGAAAATCTGTTCGCTTCATTCATTGCCCCCACAATCCTAG
>species_3
ATGCCACAATTAAACAGTAATCTCTTTTTCTATATTGATGATCACTTTTAAATATGAGCCTATTTTAGTACCTGTTTATTCTAAATAA
>species_4
ATTCCACAAATAGCACCTATTAGATGTTCTATTAACTATTATTCTTATATACCAAATCAATAAATTGAAAATGATAA
Another example with aligned protein sequences:

>>Gene_1
>species_1
MATEILTAAKFVGAGAASIGAAGSGAGIGTVFGNLIIGYARNPSLKQQLFTYAILGFAISEAMGLFCLMMAFLILFAL
>species_2
MATEILTAAKFVGAG--SIGAAGSGAGIGTVFGNLIIGYARNPSLKQQLFTYAILGFAISEAMGLFCLMMAFLILFGL
>>Gene_2
>species_1
-------------I-PQIAPI------RWLLL-FIIFSITFILF---------------------------------CSIN-------------------Y------------------YSYIPN-SPKSNELKNIN-----------------------------------LNSINW--KW----------------------------------
>species_2
--------------MPQLN----------------------------------------------------------------------------------------------------LF------S--------------------------------------------FFYISVISFSILMI---------TFKYEPIL------VPVY-SK
>species_3
--------------MPQLNTT------VWPTIITPI---LLTLFLITQLKILNTNYHLPP------------------SP----------KP---------IKIKNYN-------------------------------KPWEPKWTKICSLHSLPPQS---------------------------------------------------------
>species_4
--------------MPQIAPICH-LIYP-------------------------------------------------SIIILLIIVW---------------------------V-----------------------------ALGR---WKTARPTKTSQTSGRKLPS------------------------------PILNLPGKSW--NW--
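To make the format concrete, here is a minimal sketch of a reader for core files; the function name and the returned data structure are illustrative only and are not part of CoreTracker's API.

from collections import OrderedDict

def read_core(path):
    """Return a {gene: {species: sequence}} mapping from a core file (sketch only)."""
    genes = OrderedDict()
    gene, species = None, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">>"):      # group (gene) identifier
                gene = line[2:].strip()
                genes[gene] = OrderedDict()
            elif line.startswith(">"):     # sequence (species) identifier
                species = line[1:].strip()
                genes[gene][species] = ""
            else:                          # sequence data, possibly spread over several lines
                genes[gene][species] += line
    return genes

# Example usage on the file produced later in this tutorial:
# core = read_core("out.core")
# for gene, seqs in core.items():
#     print(gene, sorted(seqs))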
CoreTracker accepts either an already concatenated protein alignment in fasta format or (un)aligned sequences in core format. If your sequences are in other formats, you might want to convert them to an appropriate format. For this purpose, you can use coreconvert, installed with CoreTracker.

coreconvert supports the following formats for aligned sequences: fasta, nexus, stockholm, clustal and core; and fasta and core for unaligned sequences. Run coreconvert -h for help.
In the following example, we will try to convert multiple fasta files into a single core file. The input fasta files are available in the folder test_convert of the tutorial data.
unzip examples.zip
ls test_convert
coreconvert --outfmt=core --infmt=fasta --outfile out.core test_convert/*.fasta
less out.core
From the output file, you can see that sequences are grouped by orthologous gene. You can flip the grouping by using the option --order_reverse. The resulting grouping by species does, however, not make sense for aligned sequences. CoreTracker takes as input a grouping of the sequences by orthologous gene, so the identifier following ">>" should be the gene identifier.
We will start by downloading the mitochondrial genomes from the RefSeq release.
mkdir genomes
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/mitochondrion/mitochondrion.1.genomic.gbff.gz -P genomes/
gunzip genomes/mitochondrion.1.genomic.gbff.gz
Now, for fast data extraction later, we will build a local database containing the mitochondrial genomes from the genbank file. For this purpose, we will use corextract build. You can either go grab a coffee or work on other things, since this step can take some time to finish (up to 5 min).
corextract build -db genomes/name.db -ib genomes/index.db -G genomes/mitochondrion.1.genomic.gbff
We now have two databases: name.db, which speeds up searching for genomes based on species name, and index.db, which links species to their genomes. Next, we are going to construct a dataset from a list of genomes of interest using their names. We will be using the following two files: test_extract/speclist, which contains a list of species names, and test_extract/genelist, which contains the list of genes we are trying to extract for each genome. Note that empty lines and lines starting with "#" (comments) are ignored in those files, and regular expressions are accepted in the species list. During sequence extraction, gene names are queried against a large local database of mitochondrial gene aliases to make sure no gene is missed. Future versions of the package will also include aliases for chloroplast and nuclear genes, or offer the possibility of searching by sequence similarity.
In this example, we will search only fully annotated genomes by using the optional argument --complete. Note that if -db name.db is not used, HTTP requests will be made to the NCBI Taxonomy service in order to guess the correct species name. The closest match to the provided name will be used; otherwise, the species will be skipped.
corextract extract -G genomes/mitochondrion.1.genomic.gbff -ib genomes/index.db -db genomes/name.db --genelist test_extract/genelist --speclist test_extract/speclist --complete
You can also provide NCBI taxonomy identifiers for all species in your speclist file and then use the option --taxid. I recommend using this option whenever possible to ensure an accurate search. For a complete list of the available options, run the help with corextract {build|extract} -h.
The default output of corextract extract is two files in the core sequence format: output.core, which contains the nucleotide sequences, and output.core_prot, which contains the amino acid sequences. The program also prints warnings in the terminal about possible frame-shifts and undefined nucleotides (N). Attempts are made to correct partial termination codons, where a TAA stop codon is completed by the addition of 3' A residues to the mRNA. The success of such corrections relies on the validity of the genbank file and the accuracy of its annotation, so manual correction might be needed. All sequence corrections are reported in the terminal with the label "FIXED" for easier confirmation. The number of genomes sharing a specific gene is also shown, as is the genetic code of each extracted genome. I don't recommend trusting this genetic code, however.
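The partial stop codon fix mentioned above can be illustrated with a small sketch: a CDS ending in "T" or "TA" is padded with 3' A residues so that it ends in a complete TAA stop. This is only an illustration of the idea, not corextract's actual code.

def complete_partial_stop(cds):
    """Pad a CDS ending in 'T' or 'TA' with A's to complete the TAA stop (sketch only)."""
    cds = cds.upper()
    remainder = len(cds) % 3
    if remainder and cds.endswith(("T", "TA")):
        return cds + "A" * (3 - remainder)
    return cds

assert complete_partial_stop("ATGCCT") == "ATGCCT"        # already a multiple of 3
assert complete_partial_stop("ATGCCTTA") == "ATGCCTTAA"   # TA -> TAA
assert complete_partial_stop("ATGCCTT") == "ATGCCTTAA"    # T  -> TAA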
CoreTracker requires as input the a priori amino acid sequences matching the CDS. You can obtain these sequences by translating the nucleotide sequences according to a genetic code. This task can be done by coretrans from the package. coretrans is not only able to translate DNA sequences using a specified genetic code, but can also align the resulting protein sequences. The current version of the package only accepts one of the genetic codes listed at NCBI. In future releases, it might be possible to pass a user-made genetic table as input.
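As an illustration of what translating with an NCBI table means (independently of how coretrans is implemented internally), Biopython can translate a coding sequence with any of those tables:

# Illustration only; this is not coretrans' internal code.
from Bio.Seq import Seq

cds = Seq("ATGTGATTTTAA")
print(cds.translate(table=1))   # standard code: TGA is a stop   -> M*F*
print(cds.translate(table=4))   # NCBI table 4: TGA codes for W  -> MWF*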
coretrans takes as mandatory input the nucleotide sequences in core format. Here we are going to translate the input file using genetic code 4, filter out genes with frame-shifts that could not be corrected, then align the translated sequences with muscle and refine them using the HMM of each gene.
coretrans --nuc output.core --align --refine --hmmdir metazoan_hmm --filter --gcode '4' --wdir outdir
The output files will be written to the specified output folder; by default, this folder is trans_output.
The sequence alignment pipeline uses either muscle or mafft to align the protein sequences. The alignments can be refined by building a preliminary Hidden Markov Model (HMM) with hmmbuild from HMMER, then using it to realign the sequences with hmmalign from the same package. This last step can be run over multiple iterations, at the end of which only the best alignment according to the posterior probability is kept. Note that, based on experimental data, alignment quality often stops improving after a certain number of iterations, so we use an early stopping criterion that stops the n-iteration loop at iteration i with probability i/n if the alignment quality stops improving. Alternatively, if HMM files are provided, they will be used to align the input sequences with hmmalign in a single iteration.
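The early stopping rule can be sketched as follows; the function and score names are placeholders, not CoreTracker internals. At iteration i of n, if the refined alignment does not improve the score, the loop is abandoned with probability i/n.

import random

def refine_with_early_stop(align_once, score, alignment, n=10, seed=None):
    """Iteratively refine `alignment`, keeping the best-scoring one (sketch only)."""
    rng = random.Random(seed)
    best, best_score = alignment, score(alignment)
    for i in range(1, n + 1):
        candidate = align_once(best)            # one hmmbuild/hmmalign round
        cand_score = score(candidate)
        if cand_score > best_score:
            best, best_score = candidate, cand_score
        elif rng.random() < i / float(n):       # no improvement: stop with probability i/n
            break
    return best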
It is possible to map each genome to a specific genetic code using a space/tab delimited file passed with the argument --codemap (an example of such a file is sketched below).
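A codemap file is simply one genome name and its NCBI translation table number per line. The content below is purely illustrative (the species names come from the tutorial dataset, but the table numbers are examples, not values you should copy):

Halocynthia    13
Schistosoma    9
Lumbricus      5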
coretrans --nuc output.core --align --refine --hmmdir metazoan_hmm --filter --gcode '4' --wdir outdir --codemap trans_map.txt
We also offer corefus, a basic tool for merging two datasets before running CoreTracker. Merging two datasets can be useful, for example, if you want to include new genomes in the analysis, either to measure the impact of adding species to the phylogeny or to use a group of genomes with a known genetic code as reference. corefus takes as input two nucleotide sequence files in corefasta format and, as an optional argument, the two corresponding species trees in newick format.
corefus --nucs test_corefus/mtzoa1.core test_corefus/mtzoa2.core --out mtzoa ls outdir
The trees are joined by creating a new node and adding the roots of the two input trees as its children. True phylogenetic relationships and branch lengths are therefore ignored. Since this might not be what you want, we recommend building a new species phylogenetic tree with your favorite phylogenetic program.
In this section, we will run coretracker on a test dataset and learn how to interpret its output. For accurate predictions, you are expected to use all the conserved proteins of the proteome of each species.
coretracker -t test_data/species_tree.nwk -p test_data/test_prot.core -n test_data/test_nuc.core --wdir test_data/output --params test_data/param.yml --norefine --parallel --expos --debug --imformat svg
You can check the parameters used in the param.yml file. With this command, we are going to run CoreTracker in parallel mode, with the default filtering parameters, no alignment refinement and debug printing enabled. We are also going to export the predicted reassigned positions in a json file. CoreTracker also accepts the --codemap argument to specify the genetic code of some of the genomes in the dataset. This can be used together with --restrict_to, which allows the user to restrict prediction to a single genome, taking the remaining genomes as reference. --restrict_to takes either a single species name or a file containing one species name per line.
CoreTracker returns several files, but only a handful of them are really useful, depending on what you're trying to do:
# listing the output directory
ls test_data/output
tree.nwk  filt_alignment.fasta  Ile_to_Gly  Phe_to_Asn  Arg_to_Asn  gap_filt.fasta
Ile_to_Lys  Phe_to_Gly  Arg_to_Gln  Gln_to_Arg  Ile_to_Met  Phe_to_Ile
Arg_to_Gly  Gln_to_Gly  Ile_to_Phe  Phe_to_Ser  Arg_to_Ile  Gln_to_Lys
Ile_to_Ser  positions.json  Arg_to_Lys  Gln_to_Ser  Lys_to_Arg  reassignment.json
Arg_to_Phe  Gly_to_Arg  Lys_to_Asn  Ser_to_Arg  Arg_to_Ser  Gly_to_Asn
Lys_to_Gln  Ser_to_Asn  Asn_to_Arg  Gly_to_Ile  Lys_to_Gly  Ser_to_Gln
Asn_to_Gln  Gly_to_Phe  Lys_to_Ile  Ser_to_Gly  Asn_to_Gly  Gly_to_Ser
Lys_to_Phe  Ser_to_Ile  Asn_to_Ile  ic_filt.fasta  Lys_to_Ser  Ser_to_Lys
Asn_to_Lys  Ile_to_Arg  ori_alignment.core  Ser_to_Phe  Asn_to_Phe  Ile_to_Asn
ori_alignment.fasta  similarity.json  Asn_to_Ser  Ile_to_Gln  Phe_to_Arg

# listing one of the several folder containing specific reassignment
ls test_data/output/Arg_to_Gly
AGA_codons.svg  AGA_violin.svg  AGG_rea.svg  Codon_data.svg
AGA_ori.svg  AGG_codons.svg  AGG_violin.svg  Report_Arg_to_Gly.pdf
AGA_rea.svg  AGG_ori.svg  Arg_to_Gly_data.txt
{ "Schistosoma": [ { "filtered": 0.4472573839662447, "global": 0.4292988640814728, "species": "Paracentrotus" } ], "Fasciola": [ { "filtered": 0.4694092827004219, "global": 0.4782608695652174, "species": "Paracentrotus" }, { "filtered": 0.6561181434599156, "global": 0.6400313356835096, "species": "Schistosoma" } ], ... }
The positions.json file shown below is only generated when --expos is set when running coretracker:
{ "Halocynthia": { "AGG:G": [ [ "cox2", 163 ], ... ], ... }, ... "Lumbricus": { "AGA:S": [ ... [ "cob", 150 ] ] } }
The prediction summary returned by CoreTracker looks like this:

#Reference Genetic Code : 4 (Mold Mitochondrial)
#Exception:
AGG (R, S)
    Schistosoma       0.504    Both
    Caenorhabditis    0.993    Both
    Katharina         0.765    Both
AAA (K, N)
    Paracentrotus     0.727    Both
    Schistosoma       0.833    Both
    Fasciola          0.753    Both
    Patiria           0.917    Both
...
The first line of that file starts with a # and indicates the reference genetic code used, following the NCBI notation; here, translation table 4 was used. The rest of the file lists all predicted exceptions to that translation table. For each codon reassignment, the list of affected genomes (first column), the prediction probability (second column) and the validation outcome (third column) are shown. Validation can take four values: Both, indicating that both the clade-aware validation and the validation based on alignment improvement passed; Clad, when only the clade-aware validation passed; Algn, when only the alignment validation passed; and None, when neither was a success. Predictions that do not pass both validations are still shown, because it might be of interest to perform supplemental analyses in order to confirm whether they are true positives.
As we can see from this file, for example, AAA codons are predicted to be reassigned from Lysine to Asparagine in 4 genomes, and those predictions were all validated. Indeed, AAA(Lys, Asn) is a known reassignment in Platyhelminthes and Echinodermata and was correctly inferred by coretracker.
CoreTracker also saves, in a json file (reassignment.json), the features it computes from the input data. This file is used to build the input matrix for the Random Forest prediction and is useful if you want to build your own predictor based on information extracted by CoreTracker. You can also use it as input for reaplot.
# listing the content of Arg_to_Ser
ls test_data/output/Arg_to_Ser
AGA_codons.svg  AGA_violin.svg  AGG_rea.svg  Codon_data.svg
AGA_ori.svg  AGG_codons.svg  AGG_violin.svg  Report_Arg_to_Ser.pdf
AGA_rea.svg  AGG_ori.svg  Arg_to_Ser_data.txt
### Random Forest prediction
genome  codon  ori_aa  rea_aa  fitch  Fisher_pval  Gene_frac  N._rea  N._used  Cod._count  Sub._count  G._len  codon_lik  id  prediction  probability  clad_valid  trans_valid
Paracentrotus  AGA  R  S  0.0  9.29434047659e-08  1.0  1.0  0.0  0.2  0.2  948.0  2.50632911392  8.0  1.0  0.995  1  1
Schistosoma  AGG  R  S  0.0  3.90905936669e-05  0.666666666667  0.666666666667  0.0  0.166666666667  0.222222222222  829.0  2.1338028169  10.0  1.0  0.504  1  1
Schistosoma  AGA  R  S  0.0  3.90905936669e-05  0.666666666667  0.307692307692  0.0769230769231  0.361111111111  0.222222222222  829.0  0.787581699346  8.0  0.0  0.316  0  1
Fasciola  AGG  R  S  0.0  0.0357142733838  0.5  0.2  0.0  0.208333333333  0.0416666666667  861.0  0.239669421488  10.0  0.0  0.203  0  1
Patiria  AGA  R  S  0.0  1.58906666891e-05  1.0  0.8  0.0  0.15625  0.125  948.0  2.19469026549  8.0  1.0  0.595  1  1
...

### Alignment improvement:
AGG  Ser  1.28e-03
AGA  Ser  3.23e-09

### Prediction Prob. range for Positive :[ 0.504 - 0.998 ]
The difference in codon usage is measured by a Fisher's exact test whose p-values are shown on the image. A color code is applied to the leaves to differentiate positive and negative predictions. Red leaves are predicted positives from the Random Forest that also passed the Fisher's exact test. Green leaves are predicted negatives that passed the Fisher's test. Orange leaves are predicted positives that failed the Fisher's test. The remaining leaves (not selected for the RF prediction, or having failed both the RF and the Fisher's test) are colored black.
The values under and above the branch of each leaf respectively represent the count of Arginine (in the Arg to Ser reassignment example) in the global alignment and in the Arginine-filtered alignment for the genome represented by the leaf.
Each node of the tree is labeled either Arg or Ser, starting from the initial leaf labeling obtained by identifying interesting genomes, followed by Fitch ancestral reconstruction (see the method section of the article). Internal node colors are set according to the Fitch ancestral state reconstruction (green circle for ambiguous states, and red square for suspected reassignments).
As you can see from this image, at several positions where most genomes use Serine codons (black boxes), genomes with the AGR(Arg, Ser) reassignment use AGR codons instead.
For convenience, the main coretracker pipeline has also been split into two separate binaries: coretracker-prep and coretracker-run. The former performs the preparation steps for prediction (sequence alignment, codon alignment, sequence filtering, identification of potential reassignments, etc.) and saves the result to disk, where it can be loaded and executed by coretracker-run. An interesting feature of coretracker-prep is that it can perform all preliminary computations in parallel for a list of datasets and then merge them into one large dataset. This allows the user to split a dataset into smaller chunks that can be analysed simultaneously in parallel, making the whole process faster. Keep in mind that using this option will also increase memory usage for larger datasets. A similar procedure is also implemented in coretracker-run, which allows the prediction of reassignments by considering each pair of amino acids independently.
Here we will perform some additional analyses with the available scripts in order to collect more data on the predictions. These supplemental analyses are not always informative and are optional.
codonclust clusters genomes according to their codon content. It's a simple way to find outlier genomes that could introduce errors in the inference. The input data is the reassignment.json file returned by coretracker, so a first run of coretracker is needed. A principal component analysis is first performed in order to reduce the data dimensionality. Help can be obtained with codonclust -h.

Let's use the output file returned by our execution of coretracker in the previous section:
codonclust -i test_data/output/reassignment.json -o cluster_output --plot_aa_usage --pca_component 5 --gcode 4 --aar 'CDEHQWPY'
Here we are using translation table 4 and 5 components for the PCA. The default clustering algorithm is MeanShift. We will plot the amino acid usage at the same time, but the following residues will be discarded: C, D, E, H, Q, W, P, Y, L, M, I, V. Since we did not ask for the codon counts (see the --csv option in the help), only two files are returned: cluster_output.svg and cluster_output_aa_usage.svg. If a file extension is not set, the default image format will be svg.
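codonclust's own code is not reproduced here, but the general approach it describes (dimensionality reduction with PCA followed by MeanShift clustering of the genome codon-usage profiles) can be sketched with scikit-learn on placeholder data:

# Illustration of the PCA + MeanShift idea; the input matrix is random placeholder data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import MeanShift

rng = np.random.RandomState(0)
codon_counts = rng.poisson(20, size=(30, 64)).astype(float)       # 30 genomes x 64 codons
codon_freqs = codon_counts / codon_counts.sum(axis=1, keepdims=True)

reduced = PCA(n_components=5).fit_transform(codon_freqs)          # as with --pca_component 5
labels = MeanShift().fit_predict(reduced)
print("cluster assignment per genome:", labels)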
From this plot, we can see that the dataset can be split into two clusters according to codon usage. It does not appear, however, that any genome exhibits a unique codon usage compared to the rest, which makes this dataset somewhat robust to errors emanating from extreme codon usage bias. It should be noted that this does not mean that all predictions made on this dataset will be accurate, as other types of errors, such as improper taxon sampling, RNA editing, etc., could still be present.
The amino acid usage shows some variability across genomes in both the uncorrected and the corrected dataset. However, you can notice that after correction the usage of some amino acids, including Arginine (R), is more similar between genomes (tighter grouping of the Arginine usage frequencies in the bottom image). Note that the whole mitochondrial proteome was not used in this case.
Once you have your predictions, you might be interested in reconstructing their history. You can perform this task by using R ancestral state reconstruction packages or phylogenetic tools such as Mesquite and MrBayes. You can also use ansrec, a script available on the github repository, to perform a quick ancestral reconstruction using parsimony (Fitch or Dollo). ansrec takes as mandatory input a json or yaml file containing the codon reassignments. The structure of the json file should be Genome --> Amino Acid --> Codons, as seen below. The yaml file structure should also obey this format.
{ "Halocynthia":{ "G":[ "AGG", "AGA" ] }, "Schistosoma":{ "S":[ "AGG" ], "N":[ "AAA" ] }, "Caenorhabditis":{ "S":[ "AGG", "AGA" ] } }
Alternatively, you can provide a file in a format similar to coretracker's output (see the previous section and predictions.txt). An example run of ansrec with Fitch's parsimony is shown below; the resulting output is written to ancestral.svg.
python ansrec -c test_data/out_example.txt --gcode 4 --out ancestral.svg --tree test_data/species_tree.nwk -a 'fitch'
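To make the parsimony step concrete, here is a toy illustration of the Fitch bottom-up pass on a made-up topology: leaves are labelled with a state ("rea" for a reassigned codon, "std" for the standard code), and each internal node receives the intersection of its children's state sets when that intersection is non-empty, otherwise their union. This is only a sketch, not ansrec's actual code.

from ete3 import Tree

# Topology and leaf states are invented for illustration only
tree = Tree("((Halocynthia,Caenorhabditis),(Schistosoma,(Fasciola,Paracentrotus)));")
leaf_states = {
    "Halocynthia": {"rea"}, "Caenorhabditis": {"rea"},
    "Schistosoma": {"std"}, "Fasciola": {"std"}, "Paracentrotus": {"std"},
}

for node in tree.traverse("postorder"):
    if node.is_leaf():
        node.add_feature("states", leaf_states[node.name])
    else:
        left, right = [child.states for child in node.children]
        inter = left & right
        node.add_feature("states", inter if inter else left | right)

print(tree.get_tree_root().states)   # ambiguous root: set(['rea', 'std'])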
An additional script, reaplot, is provided to draw all predicted reassignments on the input tree. This script uses the reassignment.json file returned by coretracker and a file in a format similar to predictions.txt.
reaplot -r test_data/output/reassignment.json -i test_data/output/predictions.txt --gcode 4 --tree test_data/species_tree.nwk --out test_data/output/reaplot.svg --valid both
The complete list of options can be found in the help (reaplot -h). The resulting image is a vector graphic that can easily be edited to suit your needs.
We also provide a script, domaintester, to check the positions of each reassignment against Pfam domains. The main objective here is to verify whether the predicted codon reassignments affect structurally/functionally important positions. domaintester is available on the github repository.
# run pfamprepare.sh if you don't have
# a local version of the pfam database
bash pfamprepare.sh

# run domaintester using the previous output of coretracker
# could take some time
python domaintester -c test_data/output/ori_alignment.core -p test_data/output/positions.json --outdir test_data/domaintester --hmm Pfam/Pfam-A.hmm -o domains.csv --format 'csv'

ls test_data/domaintester
atp6  atp8  cob  cox1  cox2  cox3  domains.csv

# look at the first lines
head -9 test_data/output/domains.csv | column -t
Species         Gene  Codon  AA  Pos  Domain       Pfam        start  end  description
Paracentrotus   cox2  AGA    S   168  COX2         PF00116.18  94     213  Cytochrome C oxidase subunit II, periplasmic domain
Patiria         cox2  AGA    S   168  COX2         PF00116.18  94     213  Cytochrome C oxidase subunit II, periplasmic domain
Lumbricus       cox2  AGA    S   168  COX2|94-213  PF00116.18  94     213  Cytochrome C oxidase subunit II, periplasmic domain
Caenorhabditis  cox2  AGA    S   112  COX2         PF00116.18  97     216  Cytochrome C oxidase subunit II, periplasmic domain
Caenorhabditis  cox2  AGA    S   154  COX2         PF00116.18  97     216  Cytochrome C oxidase subunit II, periplasmic domain
Caenorhabditis  cox2  AGA    S   168  COX2         PF00116.18  97     216  Cytochrome C oxidase subunit II, periplasmic domain
Caenorhabditis  cox2  AGA    S   230  NID          NID         -      -    NID
NID (Not In Domains) is used when the position does not fall within a domain, whereas N/A (not available) is used when no hits were found.
This feature is not yet offered, but it is planned for the next release of CoreTracker.
This software is licensed under GPLv3, a copy of which can be found on the github repository. Note that this software is provided "as is", without warranty of any kind, express or implied, to the extent permitted by applicable law.