Computational Methods in Evolutionary Biology, WS 2021/2022

Direct links to teaching material: Lecture with Exercises (Vorlesung mit Übung):

Language: English

Instructors:
Prof. Dr. Dirk Metzler (lectures and exercises),
Dr. Ulrich Knief (practical exercises on phylogenetics),
Dr. Ricardo Pereira (practical exercises on population genetics)

Time:

Due to the current Covid-19 situation, lectures will be provided as videos on this website.
Online sessions for questions and exercises will be held via Zoom (Meeting ID: 940 7030 1918; find the access key on Moodle) each Wednesday at 10:00 a.m. and each Friday at 9:00 a.m. (times may change during the semester, depending on demand).

Additional exercises/practicals for block courses Phylogenetics I/II and Comp. Pop. Gen. I/II: Tuesday 9:00 to 12:00 taught by Dr. Ulrich Knief and Dr. Ricardo Pereira.
Note: the exact time slots for the population genetics practicals taught by Ricardo Pereira may be subject to change to avoid overlap with other courses.

Target group (Zielgruppe) and ECTS points:

Master's and PhD students in EES/MEME, Bioinformatics, Biostatistics, Biology, Mathematics, Statistics, ...
Please find the module description for this course in the official module catalog of the bioinformatics master's program.
Students can obtain 8 ECTS for passing the exam of the entire course Computational Methods in Evolutionary Biology.

Students in block-structured programs like EES, MEME and the Master's program in Biology can participate block by block.
For each block, combined with the additional practical part, students can obtain 3 ECTS by passing the exam and giving a presentation in the practical part.

Phylogenetics I (Block I, 18. Oct. 2021 - 5. Nov. 2021)
Phylogenetics II (Block II, 8. Nov. 2021 - 1. Dec. 2021)
Computational Methods in Population Genetics I (Block III, 3. Dec. 2021 - 7. Jan. 2022)
Computational Methods in Population Genetics II (Block IV, 10. Jan. 2022 - 11. Feb. 2022)

Contents

Data sets of DNA, RNA or protein sequences contain a lot of hidden information about the history of evolution, about evolutionary processes and about the roles of particular genes in evolutionary adaptation. It is a challenge to develop methods to uncover this information. Methods that are based on explicit models for evolutionary processes and on the application of statistical principles (like likelihood maximization or Bayesian inference) are most promising. Some of these methods, however, can be very demanding, both computationally and intellectually. A thorough understanding of the models and methods is crucial, not only for those who aim to contribute to the further development of such methods but also for those who want to apply these methods to their datasets and have to decide which method to choose, how to set its optional parameters and how to interpret the outcome.
In the first half of the semester we will focus on computational methods in phylogenetics. In the second half of the semester we will turn to population genetics.

Bayesian and likelihood-based Phylogenetics

We discuss methods from computational statistics and their applications in phylogenetic tree reconstruction. First we compare maximum-likelihood (ML) methods to parsimony and distance-based methods. Then we turn to Bayesian methods that are based on Markov-Chain Monte-Carlo (MCMC) approaches like the Metropolis-Hastings algorithm and Gibbs sampling. Such methods make it possible to sample phylogenies (approximately) according to their posterior probability, i.e. conditioned on the given sequence data. Thus, it is also possible to assess the uncertainty of the estimation. Among the special applications that we discuss are phylogeny estimation with time-calibration (e.g. according to the fossil record) and methods for the reconciliation of gene trees and species trees. Statistical methods are always based on probabilistic models for the origin of the data. Therefore, we discuss evolution models for biological sequences (Jukes-Cantor, PAM, F81, HKY, F84, GTR, Gamma-distributed rates, ...) and the fundamentals of Markov processes that are necessary to understand these models. Furthermore, we will discuss relaxed molecular-clock models and Brownian-motion models for the evolution of quantitative traits along phylogenetic trees. Another topic is statistical sequence-alignment methods that are based on explicit sequence evolution models with insertions and deletions (TKF91, TKF92, ...). Software: PHYLIP, Seq-Gen, R with the ape package, RAxML, MrBayes, BEAST, Bali-Phy, ...
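To give a flavor of what a model-based method looks like in the simplest case, the following sketch computes Jukes-Cantor transition probabilities, the log-likelihood of a gap-free pairwise alignment, and the resulting maximum-likelihood distance. It is only an illustration and not taken from the course material: the function names are hypothetical, the distance d is assumed to be measured in expected substitutions per site, and the first sequence is assumed to be drawn from the uniform stationary distribution.

```python
import math


def jc_transition_prob(same: bool, d: float) -> float:
    """Jukes-Cantor probability that a site shows a given base after
    evolutionary distance d (expected substitutions per site).

    same=True  -> probability of ending in the starting base
    same=False -> probability of ending in one specific other base
    """
    e = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e


def jc_log_likelihood(seq_a: str, seq_b: str, d: float) -> float:
    """Log-likelihood of a gap-free pairwise alignment for distance d > 0."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned and of equal length")
    loglik = 0.0
    for x, y in zip(seq_a, seq_b):
        # 0.25 is the stationary probability of the base in seq_a.
        loglik += math.log(0.25 * jc_transition_prob(x == y, d))
    return loglik


def jc_ml_distance(seq_a: str, seq_b: str) -> float:
    """Closed-form ML estimate of d from the fraction p of differing sites."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned and of equal length")
    p = sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)
    if p >= 0.75:
        raise ValueError("sequences too divergent for a finite JC distance")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    a, b = "ACGTACGTAC", "ACGTACGAAC"  # toy data, one mismatch in ten sites
    d_hat = jc_ml_distance(a, b)
    print("ML distance:", d_hat)
    print("log-likelihood at ML distance:", jc_log_likelihood(a, b, d_hat))
```

For a whole tree, the same transition probabilities would be combined over all branches, for example with Felsenstein's pruning algorithm; the two-sequence case is used here only because it has a closed-form solution.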

Computational methods in population genetics

Given population genetic data, how can we infer evolutionary and ecological features like population substructure, change of population size, recent speciation, natural selection and adaptation? Many computational methods for this purpose have been proposed and most of them are freely available in software packages. In this course we will discuss the theoretical and practical aspects of these methods. The theoretical aspects are the underlying models, statistical principles and computational strategies. In the practical part we will analyze these methods. We will also try out various software packages and explore under which circumstances they are appropriate. Among the models that we discuss are the coalescent process and its variants with structure and demography, the ancestral selection graph, and the ancestral recombination graph. Among the parameter estimation strategies are full-likelihood and full-Bayesian methods, methods based on summary statistics, and Approximate Bayesian Computation. These methods use computational strategies like importance sampling and variants of MCMC. Software: LAMARC, Hudson's ms, IM/IMa, Beast2, STRUCTURE, etc.
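As a rough illustration of how a coalescent model and a summary-statistics method fit together, the sketch below simulates the number of segregating sites S under Kingman's coalescent with infinite-sites mutations and uses it in a plain ABC rejection sampler for θ. It is a minimal sketch under stated assumptions, not the approach of any particular software package: the function names are hypothetical, time is measured in coalescent units (so that each pair of lineages coalesces at rate 1 and mutations fall on the tree at rate θ/2 per unit branch length), the prior on θ is assumed uniform on (0, theta_max), and S is the only summary statistic.

```python
import random


def total_tree_length(n: int, rng: random.Random) -> float:
    """Total branch length of a Kingman coalescent tree for n sampled lineages."""
    length = 0.0
    for k in range(n, 1, -1):
        # While k lineages exist, the next coalescence happens at rate k(k-1)/2.
        waiting_time = rng.expovariate(k * (k - 1) / 2.0)
        length += k * waiting_time
    return length


def poisson(lam: float, rng: random.Random) -> int:
    """Poisson(lam) draw: count rate-1 exponential arrivals before time lam."""
    count, t = 0, rng.expovariate(1.0)
    while t < lam:
        count += 1
        t += rng.expovariate(1.0)
    return count


def simulate_segregating_sites(n: int, theta: float, rng: random.Random) -> int:
    """Number of segregating sites under the infinite-sites model."""
    return poisson(0.5 * theta * total_tree_length(n, rng), rng)


def abc_rejection(s_obs: int, n: int, theta_max: float, n_draws: int,
                  tol: int, rng: random.Random) -> list:
    """Keep prior draws of theta whose simulated S is within tol of s_obs."""
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(0.0, theta_max)  # assumed uniform prior
        if abs(simulate_segregating_sites(n, theta, rng) - s_obs) <= tol:
            accepted.append(theta)
    return accepted


if __name__ == "__main__":
    rng = random.Random(1)
    # Toy setting: 10 sampled sequences with 14 observed segregating sites.
    sample = abc_rejection(s_obs=14, n=10, theta_max=20.0,
                           n_draws=3000, tol=1, rng=rng)
    print("accepted draws:", len(sample))
    print("approximate posterior mean of theta:", sum(sample) / len(sample))
```

The accepted values approximate the posterior distribution of θ given that S is close to the observed value. Refinements such as local regression adjustment or sequential ABC aim to use the simulations more efficiently than this plain rejection scheme.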

Handouts

The following handouts contain only a summary of the contents of the slides shown in the lecture. More detailed explanations are given on the whiteboard during the lectures. The handouts will be updated during the semester. We may not have time for all the topics that appear in the handouts (as in WS 2020/21, the blocks for the population genetics part are a bit shorter), but we may also add newer topics, such as aspects of the analysis of whole-genome data.
Handout on Phylogenetics: PhyloHandout.pdf, handout on Computational Population Genetics: CMPG_handout.pdf

Videos

Most parts of the following videos have been made in winter term 2020/2021. Some of the videos will be updated during the semester. Dates in front of the links to the videos mean that questions about the video contents will be discussed in the question session on this day. Therefore it is recommended to watch the videos before these sessions. Questions on video contents can, of course, be asked in any later sessions as well. (Software videos are considered as belonging to tutorial sessions.)

General Introduction
Oct 20: Structure of the course and some general remarks (duration: 19:05 [Min:Sec])

Videos on Phylogenetics

Classical phylogenetics methods:
Oct 20: Phylogenetics lectures overview and notations (26:22)
Oct 20: Distance-based phylogeny reconstruction with UPGMA (37:27)
Oct 20 or 22: Software: Simple phylogenetic analysis in R and how to use a Linux server (46:53)
Oct 22: Distance-based phylogeny reconstruction with Neighbor Joining (28:11)
Oct 22: Parsimony and how to calculate it for a given tree (and given data) (17:31)
Oct 22: Searching for the most parsimonious tree (44:30)
Oct 27: Difference measures for phylogenetic trees (14:34)
Oct 27: Software: PHYLIP (33:58)

Maximum-Likelihood based methods:
The videos in this section require some basics from probability theory. If you need a brief reminder of the basic terms of probability theory and the law of total probability, you can watch this video: Basic concepts of probability theory (22:13)
Oct 27: Maximum-likelihood principle and Felsenstein's pruning algorithm (46:02)
Oct 27 or 29: Software: Seq-gen and a simulation study with phylip (43:06)
Oct 29: The Jukes-Cantor model of sequence evolution (49:25)
Oct 29: Markov chains and their equilibria (28:34)
Nov 3: Idea of proof of convergence of irreducible aperiodic Markov chains (13:36)
Nov 3: Reversibility of Markov chains (14:01)
Nov 3: Software: maximum-likelihood phylogeny reconstruction and bootstrapping with RAxML (27:36)
Nov 3: Searching for the maximum-likelihood phylogeny (29:34)
Nov 5: Insights from ML for parsimony and distance-based phylogeny (22:50)
Nov 5: Consistency of ML tree reconstruction (22:42)
Nov 5: Bootstrapping (34:14)

Bayesian sampling with MCMC:
Bayesian statistics and MCMC (47:03)
MCMCMC and Gibbs sampling (37:34)
Software: Bayesian phylogenetic analysis with Beast (43:55)
Effective sample sizes in MCMC, problematic priors and mixtures of gene trees (48:43)

Models for substitution processes in sequence evolution:
Rate matrices and exponential waiting times (48:20)
Calculating transition matrices (may contain traces of linear algebra) (63:29)
Substitution-rate heterogeneity, relaxed-clock models and time-calibration using fossils (47:46)
Software: fossil-based time calibration with BEAST (20:59)

Phylogenomics and gene families
Some remarks on phylogenomics and on paralogy and orthology (41:04)

Quantitative traits and independent contrasts:
The Brownian motion model for (neutral) quantitative trait evolution and some basics on normally distributed vectors (48:42)
Reduced ML and Felsenstein's pruning algorithm for quantitative traits (44:41)
Software: phylip contrasts (8:25)

Statistical Alignment
Introduction to statistical alignment (69:40)
Statistical alignment with longer gaps (11:55)
Simultaneous sampling of phylogenies and alignments (48:15)

Model-selection strategies
Model selection (64:50)

Statistical Tests for Phylogenetic Trees (Extra material, not part of the course in WS 21/22)
The Kishino–Hasegawa Test (13:03)
The Shimodaira–Hasegawa Test (19:14)
The SOWH Test (10:41)
Anisimova and Gascuel's approximate likelihood-ratio test (23:25)

Videos on Computational Methods in Population Genetics

Introduction to basic models of population genetics
General remarks, Wright-Fisher model and Kingman's Coalescent (80:02)
Population-scale mutation rate θ, nucleotide diversity and Tajima's D (44:04)
Overview of coalescent-based parameter estimation approaches (27:18)
Software: Population genetic data simulation with ms and scrm and summary statistic calculations with R (44:26)

Likelihood-based inference with importance sampling and MCMC
Likelihoods in population genetics and the idea of importance sampling (49:14)
The importance sampling method of B. Griffiths and S. Tavaré (29:20)
LAMARC's approach of using MCMC for Importance Sampling (43:01)
Software: A simple LAMARC analysis (52:39)
MCMC sampling of genealogies (60:20)
Ancestral Recombination Graphs and some more aspects of MCMC sampling in LAMARC (40:32)
Software: How to simulate data for testing LAMARC (44:51)
MCMC methods for the isolation-migration model as implemented in IM/IMa/IMa2 (54:54)

Approximate Bayesian Computation (ABC) and other summary-statistics based methods
ABC with local regression and MCMC without likelihoods (77:35)
Software: an ABC analysis with the abc package in R (41:49)
Sequential/Adaptive ABC (ABC-PMC) (24:08)
Combining summary statistics with partial least squares (42:49)
Composite-likelihood approaches and the Joint site-frequency spectrum (58:40)

Detecting population structure with programs like STRUCTURE
Method in STRUCTURE with and without admixture (77:24)
Software: getting started with STRUCTURE (19:20)
STRUCTURE model variants and alternative tools (56:38)

Selection: Models and statistics
Basic population genetic model of directional selection (33:53)
Weak selection and the ancestral selection graph (26:35)
Selective sweeps and how they are modeled and simulated in msms (43:18)
Statistics for detecting selective sweeps (33:34)
Soft sweeps, incomplete sweeps and population demography (41:21)
A statistic for detecting balancing selection (30:42)

Li and Stephens' PAC approach
The PAC approach (with an excursus on HMMs) (84:25)

Exercises

Exercises on Phylogenetics

phylo01.pdf
phylo02.pdf
phylo03.pdf
phylo04.pdf
phylo05.pdf, pruning.R, PAM_rate_matrix.txt, pfold_rate_matrix.txt
phylo06.pdf, QuantTraitsA.csv, QuantTraitsB.csv, QuantTraitsC.csv, QuantTraits_Tree.txt
phylo07.pdf
(Exercises on statistical tests for phylogenies; not part of the course in WS21/22: phylo08.pdf)

Phylogenetics example files

primates.nex, primates.phylip, primates.R, NJvsMPvsML.zip

Exercises on Computational Population Genetics

sheet01.pdf
sheet02.pdf
sheet03.pdf
sheet04.pdf
sheet05.pdf
sheet06.pdf
sheet07.pdf, cheater.txt, cpg_islands.txt.zip

Comp PopGen software example files

drift.R (Wright-Fisher and finite-population coalescent simulations), Tajimas_D.R, abc_example.R, coala_jsfs.R
SortSequences.R (Example R file to convert ms/seq-gen output to a Migrate input file, which can be read by the LAMARC input file converter)

Linux

In the practical part of the course(s) we will use Linux. If you are new to Linux/Unix, you may be interested in some online tutorials such as http://www.ee.surrey.ac.uk/Teaching/Unix/ or https://www.codecademy.com/learn/learn-the-command-line. It may be a good idea to go through one of these tutorials even before the course starts.

Announcement for bioinformaticians in official LMU course overview


web page last updated: Dirk Metzler, 22. Oct. 2021