Tutorial on Phylogenetic (Super-)Reconciliation

by Mattéo Delabre and Nadia El-Mabrouk

The goal of this hands-on session is to experiment with phylogenetic tree reconstruction and (super-)reconciliation, using a case study on CRISPR-Cas systems. Our final result will be a hypothetical evolutionary scenario for a subset of Cas gene clusters. We will go through a complete pipeline, starting with the retrieval of Cas cluster information, up to building the scenario using reconciliation techniques.

Prerequisites

To follow along this tutorial, you will need a Unix-like environment with the latest versions of the tools we are going to use. The easiest way to set this up is by running the provided Docker container. If needed, refer to one the following webpages for instructions on how to install Docker, depending on your operating system.

Next, download the container image (about 370 MB).

1
$ docker pull ghcr.io/udem-lbit/superrec-tutorial:latest

Finally, run the following command from a terminal to start the container. Replace <DATA_PATH> with the path to a folder on your machine where you want to store the results of the tutorial (e.g., /home/user/superrec-data or C:\superrec-data).

1
2
3
$ docker run -it --name reconciliation \
    --mount=type=bind,src=<DATA_PATH>,dst=/home/tts/data \
    ghcr.io/udem-lbit/superrec-tutorial

If at any point you leave the container (e.g., by closing the terminal), you can get back in with the following command.

1
$ docker start -i reconciliation

If you prefer not to use Docker, you can also manually install the required tools: PostgreSQL 15, Python 3.11, MUSCLE 5.1.0, RAxML-NG 1.2.0, ecceTERA c600321a, NCBI Datasets 15.7.0, and superrec2 0.1.0.

Using Data From CRISPRCasDb

CRISPRCasDb is a publically-available database of CRISPR loci detected in over 36,000 bacterial and archaeal genomes. This database was constructed by the I2BC institute using CRISPRCasFinder (Couvin et al., Nucleic Acids Research, 2018).

Retrieving a copy of the dataset

If you are following along using Docker, the container already includes a copy of the database dump as a file named ccpp_db.zip. Otherwise, you can download a copy from the CRISPRCasDb website.

Let’s start by loading the database dump into PostgreSQL.

1
2
3
$ createdb tts
$ unzip -d ccpp_db ccpp_db.zip
$ psql --file ccpp_db/*/*/*.sql 2>/dev/null

Understanding the dataset structure

There are five main tables in the database which will be of interest to us today.

taxon
A taxonomic entity (species, strain, …).
.parent references the parent taxon.
strain
A strain of a taxon.
.taxon references the associated taxon.
.genbank holds a GenBank ID for the strain genome assembly.
sequence
A sequence in a genome assembly.
.strain references the sequenced strain.
clustercas
A cluster of Cas genes.
.sequence references the sequence containing this cluster.
.start and .length are the cluster extents in the sequence.
clustercas_gene
A Cas gene in a cluster.
.clustercas references the containing cluster.
.gene is the gene family name.
.start and .length are the gene extents in the sequence.
.orientation is 2 if the gene is backwards in the sequence, 1 otherwise.

To get more details on the tables’s structure, you can start a PostgreSQL session and use the \d and \d <TABLE> commands.

1
2
3
4
5
$ psql
psql (15.3)
Type "help" for help.

tts=# \d taxon

Running basic queries on the dataset

Here are a few questions and corresponding SQL queries to get more familiar with the dataset. You can execute them in your PostgreSQL session, as before.

Count the total number of Cas genes in the dataset.

1
select count(*) from clustercas_gene;

Count the number of Cas genes for each Cas family in the dataset.

1
2
3
4
5
6
select
    regexp_substr(gene, '^[^_]+') as family,
    count(*) as occurrences
from clustercas_gene
group by family
order by count(*) desc;

Count the total number of Cas clusters in the dataset.

1
select count(*) from clustercas;

List the gene contents of a specific Cas cluster.

1
2
3
4
5
6
7
8
select
    regexp_substr(gene, '^[^_]+') as family,
    start,
    length,
    orientation
from clustercas_gene
where clustercas = '8aed4995-e0c3-41e5-839b-bce8997c6752'
order by start;

Notice how this cluster contains multiple copies of the Cas3, 4, 5 and 7 genes.

For each Cas gene family, compute the proportion of clusters containing at least one gene from that family.

1
2
3
4
5
6
7
8
select
    regexp_substr(gene, '^[^_]+') as family,
    count(distinct clustercas)::decimal
    / (select count(*) from clustercas) * 100
    as clusters_percent
from clustercas_gene
group by family
order by clusters_percent desc;

Now that we are more familiar with the dataset, let’s start extracting the data we need.

Extracting the taxa tree

To run a reconciliation, we will need a tree of the taxa inside which the genes evolve. The dataset contains a subset of the NCBI taxonomy, which we will extract and use. Doing this using SQL only is a bit awkward, so we will use Python instead.

1
$ python build_taxa_tree.py > data/taxa.nh

Selecting a subset of taxa and clusters to study

The dataset is too large to be analyzed entirely within the timeframe of this tutorial. Hence, we will only consider Cas clusters from a sample of the available taxa.

We first subsample the taxa set by removing parts of the tree until we get a fully binary tree. Such a binary tree is required for running the reconciliation step. Then, we take a random sample of 40 taxa among the remaining ones.

1
$ python prune_taxa_tree.py 40 < data/taxa.nh > data/pruned_taxa.nh

The script also creates views named *_sample in the database, which mirror the tables of interest with only the sampled records. You can upload the resulting pruned_taxa.nh tree to the Ete tree viewer to visualize its structure.

Building Gene Trees

Our next step on the way to running the reconciliation tools is to build gene trees for the selected gene families. We’ll use the usual multiple alignment and maximum likelihood approaches.

Retrieving Genome Assemblies

Since CRISPRCasDb does not include the actual DNA sequences, we’ll have to start by retrieving them from NCBI’s servers. We start by listing all the assembly IDs to be retrieved.

1
2
3
$ psql --command 'select genbank from strain_sample;' \
    --no-align --tuples-only \
    > data/assembly_ids.txt

The retrieval of sequence data is done in two parts. The first step is to prepare a folder structure in which the sequences will be stored.

1
2
3
4
5
6
7
$ datasets download genome accession \
    --inputfile data/assembly_ids.txt \
    --filename data/ncbi_dataset.zip \
    --include genome \
    --dehydrated
$ unzip -d data data/ncbi_dataset.zip
$ rm data/ncbi_dataset.zip data/README.md

Then, we run a second command to populate this structure. If this command fails or is interrupted, you can simply start it again to fetch the missing sequences.

1
$ datasets rehydrate --directory data

The downloaded sequences can be found in the data/ncbi_dataset/data subfolders.

Extracting Gene Sequences

Since we are only interested in the genes, not complete genome sequences, we need to extract those specifically. The following script uses information from the database and the downloaded sequences to compile one multi-FASTA file for each gene family.

1
$ python collect_cas_genes.py

The resulting files are stored in data/align.

Aligning the Sequences

Next, we proceed to performing multiple sequence alignment on Cas1 and Cas2 genes, using MUSCLE (Edgar, 2022).

1
2
$ muscle -align data/align/Cas1.fna -output data/align/Cas1.afa
$ muscle -align data/align/Cas2.fna -output data/align/Cas2.afa

Searching for Maximum-Likelihood Trees

Finally, we use RAxML-NG (Kozlov et al., 2019) to look for the best maximum-likelihood trees for the multiple alignment results. Here the GTR+G evolution model is used, which is suited for nucleotidic data. For more advanced uses of RAxML-NG, you can refer to this tutorial.

1
2
$ raxml-ng --model GTR+G --msa data/align/Cas1.afa --seed 42 --search
$ raxml-ng --model GTR+G --msa data/align/Cas2.afa --seed 42 --search

The resulting trees are stored in data/align/Cas1.afa.raxml.bestTree and data/align/Cas2.afa.raxml.bestTree. Upload them to the Ete tree viewer to visualize their structure, and compare them to the species tree obtained above.

Computing Reconciliations on Individual Gene Trees

To obtain an scenario explaining the discrepancies between the gene trees and the species tree, we can compute a most-parsimonious classical reconciliation. Many reconciliation tools exist which implement various reconciliation models: Jane, TreeRecs, RANGER-DTL, Notung, among others.

Running ecceTERA

Here, we chose to use ecceTERA (Jacox et al., 2016).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
$ ecceTERA species.file=data/pruned_taxa.nh \
    gene.file=data/align/Cas1.afa.raxml.bestTree \
    dated=0 \
    print.reconciliations=1 \
    recPhyloXML.reconciliation=true \
    output.dir=data \
    output.prefix=Cas1_
$ ecceTERA species.file=data/pruned_taxa.nh \
    gene.file=data/align/Cas2.afa.raxml.bestTree \
    dated=0 \
    print.reconciliations=1 \
    recPhyloXML.reconciliation=true \
    output.dir=data \
    output.prefix=Cas2_

The results are stored as recPhyloXML files inside the data folder (recPhyloXML is a standard interchange format for phylogenetic reconciliations, see Duchemin, 2018). You can use ThirdKind to visualize them.

When there are multiple optimal reconciliations, ecceTERA outputs three different solutions: a randomly selected one, and two median reconciliations representative of the space of all possible solutions.

Tweaking the costs

By default, ecceTERA assigns a cost of 1 to losses, 2 to duplications, and 3 to HGTs. To redefine those costs, you can use the loss.cost, dupli.cost and HGT.cost arguments. Try assigning a very high cost to HGTs (so that they are effectively forbidden). What do you notice?

Computing Super-Reconciliations

Finally, we are interested in computing a cluster-level scenario with segmental events. To that end, we will use superrec2 (Anselmetti et al., 2022).

Building a supertree

The trees obtained above for Cas1 and Cas2 overlap on some of their nodes. These are the nodes corresponding to clusters with both a Cas1 and a Cas2 copy. To run the super-reconciliation, we need a supertree that encompasses the topology of both of these trees. Unfortunately, the trees are not consistent (their topologies contradict in some places).

Here, we will use an heuristic approach: we remove random disagreeing triplets until the two trees become consistent. We repeat this a few times and keep the most resolved solution, i.e., the run where we had to remove the least number of triplets.

1
$ python build_super_input.py > data/super_input.json

Running superrec2

1
2
3
$ superrec2 reconcile \
    --input data/super_input.json \
    superdtl > data/super_output.json

To view the results, you can use the following command. This requires a working LaTeX installation, which is unfortunately not included in the Docker container due to space constraints.

1
2
3
$ superrec2 draw \
    --input data/super_output.json \
    --output data/super_output.pdf