#### Read Phylogeny Tree text version

ZiJia (Zinnia) Zheng

Binary tree

LEAF INTERNAL NODE EDGE LENGTH Species Evolutionary event Evolutionary distance

Describe evolutionary history of species Deduce gene function & origin

Figure Source: CS262 lecture 11 . 2009.

Neighbor joining Fast neighbor Joining

Elias & Lagergren. Fast neighbor joining. Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP) 2005.

Gene tree reconstruction

Rasmussen, Kellis. Accurate gene-tree reconstruction by learning gene- and species-specific substitution rates across multiple complete genomes. Genome Res 2007.

Many methods Distance methods

Neighbor joining Input: matrix D estimates distance between all pairs of species Output: phylogenetic tree whose distance matrix DT is close to D

GENOME SEQUENCES

DISTANCE MATRIX D

PHYLOGENETIC TREE

DISTANCE MATRIX DT

Multiple Sequence Alignment

Reconstruction (distance method)

Phylogenetic tree T DT(i, j) = sum of lengths of edges on path from i to j

D is additive

Distance between any pair of leaves is equal to the sum of the lengths of edges connecting them

D is nearly additive

maxx,y|D(x, y) DT (x, y)| < ½ mine in E(T) length(e)

X Y Z X Y Z

X

Y Z

0

3

0

7

4 0

X

Y Z

0

3

0

6

4 0

Additive

Not Additive

Input: D

1. 2.

Greedy, bottom-up clustering

Find pair of nodes to join Join the nodes together by new parent node p Repeat algorithm, consider joined nodes as single entity

Find

no yes

Join

3.

Single cluster?

Output: T

Terminate when there's a single cluster/tree

Unique tree

Additive Nearly-additive

Neighbor joining reconstructs this tree

B

C

A

D

Want to join the pair of nodes that's further away from any other node Pair (i, j) that maximizes

Keep track of values in matrix Q

i

j

...

N1

N

i

j

...

N1

N

i

j

...

N1

N

i

j

...

N1

N

Define new node p Distance from p to i and j

H

H I

0

I

11 0

J

10 9

K

10 15

Update distance matrix D for next iteration Let D' be the updated matrix

Remove entries that reference i or j Add entries that relate p to

J

K

0

14

0

remaining nodes m

H P

6

0

K

10

10 0

H

P K

0

Define distance matrix Q' for next iteration Q' = current matrix Q

Remove entries that reference i or j

Add entries that relate p to remaining nodes m

Initialization

O(N3) to compute Q(i, j) for all pairs (i, j)

N iterations

To begin with, N nodes each in its own cluster

In each iteration, decrease clusters by one O(N2) to find the closest pair O(N) to introduce new parent node and update D O(N2) to update Q

O(N3 + N*N2) O(N3) total

Q matrix

O(N2) in size O(N) computation per entry

Consequence

O(N3) initialization Per iteration O(N2) to find minimum entry O(N2) to update Q for new parent node

Isaac Elias and Jens Lagergren Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP) 2005.

Keep track of R

Row sums of D O(N) in size

Cost

O(N2) to initialize O(N) to update, O(1) per entry

O(1) to compute an entry in Q

Definition

Pair (i, j) is visible from i if Pair (i, j) is in visible set if it is

i

visible from i or j

j

max Q(i, j)

O(N) in size O(N2) to initialize

In every iteration

Closest pair is in visible set O(n) to find this pair

Update

Remove pair (i, j) that's been joined O(N) to find the pair of nodes visible from p

Initialization

O(N2) to compute R O(N2) to find the visible set

N iterations

O(N) to find the closest pair O(N) to introduce new parent node and update D O(N) to update R O(N) to update visible set

O(2*N2 + N*N) O(N2) total

Additive or nearly additive

If so, FNJ and NJ output the same tree Else, FNJ output almost as good as that of NJ

In practice, most distance matrices are not nearly additive

Figure Source: Elias & Lagergren. Fast neighbor joining.

Matthew D. Rasmussen and Manolis Kellis Genome Res. 2007

Genes with similar characteristic due to common ancestry

Ortholog

Speciation event Species evolution

Paralog

Duplication event Gene family expansion within species

Figure Source: Rasmussen & Kellis.

Maps each node in gene tree to closest common ancestor node in species tree Gene tree combine orthologs and paralogs across species

Figure Source: Rasmussen & Kellis.

Species tree is known Gene tree is correct

Not always true!

Inaccuracy in phylogenetic reconstruction

Correct gene tree

Figure Source: Rasmussen & Kellis.

Outputs wrong gene-tree because...

Method not good enough to model our

understanding? Our understanding does not match reality?

Various phylogenetic methods Genomes

12 Drosophila 9 fungal 5154 orthologs from Drosophila 739 orthologs from fungus

Genes from regions with synteny

Gene order is preserved since common ancestor Phylogeny congruent to species phylogeny

Compare against known species tree

T1: benchmark, known species topology No alternate topology systematically favored

Most abundant recovered gene trees across 12 Drosophila genomes

Figure Source: Rasmussen & Kellis.

Accuracy increases with sequence length

Figure Source: Rasmussen & Kellis.

Accuracy correlates with the divergence rate of genes

Slow: lack sufficient events to resolve divergence

order Fast: independent substitution doesn't distinguish between different topologies

SLOW ACCURACY SEQUENCE IDENTITY 25% 70% MODERATE 48% 40%-50% FAST 35% 20%

Simulated phylogenies

Phylogeny model based on existing knowledge Exclude "unknown biological process"

Result

Same alternate topologies with similar frequencies T1 recovery

72%, ~30% increase over real data Simulation and reconstruction used the same model of evolution

Still large portion incongruence not explained

Existing methods

Lack information View each gene in isolation

Phylogeny topology of many genes identical to species tree common property shared amongst different gene trees

Substitution rate bi

One for each species bi = (gene-specific rate g) * (species-specific rate si )

5154 orthologs from 12 fly genomes

Gene tree topology congruent to species topology Infer branch lengths

bi

g

si

Figure Source: Rasmussen & Kellis.

Figure Source: Rasmussen & Kellis.

Learning

Orthologs with phylogeny likely to be identical to

species tree Estimate gene- and species-specific distributions

Reconstruction for remaining genes

Search through gene tree topologies

Estimate branch lengths

Compute likelihood of branch lengths

Figure Source: Rasmussen & Kellis.

Species-specific rates are independent of each other and of the genespecific rate

Gene trees are scaled

versions of species tree Evolutionary rates strongly correlated due to g Species-specific rates independent of each other, since correlation is primarily a result of g

Figure Source: Rasmussen & Kellis.

#### Information

#### Report File (DMCA)

Our content is added by our users. **We aim to remove reported files within 1 working day.** Please use this link to notify us:

Report this file as copyright or inappropriate

1080030

### You might also be interested in

^{BETA}