The Structural Inference of Large Regulatory Networks (DPhil thesis)

Abstract: Networks are increasingly being used in a wide range of fields, including genetics, epidemiology and social sciences, and this means that machine learning research in network studies may be useful in several otherwise unrelated application fields. This thesis considers the problem of large network structural inference, and uses genetic regulatory networks (GRNs) as an example application field. As well as having direct biological functions, genes regulate each others' expression levels. Thus, the genes in an organism can be viewed as a directed graph of regulatory relationships. This is a GRN, and structural inference is inference of the edges in the graph; i.e., of its graphical structure.

The thesis selects dynamic Bayesian networks (DBNs) as a graphical model for representing GRNs and extends existing methods for their structural inference. BNs and DBNs have many advantages, including non-linearity, robustness with noise and missing data, and causal inference in some cases. However, the structural inference of BNs and DBNs is NP-hard. Colloquially, this hardness is due to the curse of dimensionality, and the curse of dimensionality also affects the amount of data necessary to infer a good model; i.e., the sample complexity. Thus, good structural inference of large DBN is very difficult. However, the value of genome-wide GRN structural inference is increasingly recognised, and large network structural inference may be valuable in other fields as well. This need and the problem's difficulty motivates this thesis.

Some of its contributions are as follows. The thesis presents Dense Structural Expectation Maximisation (DSEM), an extended version of the Structural Expectation Maximisation (SEM) algorithm, and it parallelises structural inference across a cluster of computers. Both DSEM and parallelisation can accelerate structural inference by an order-of-magnitude or more, and their multiplicative combination together is investigated. The thesis also considers the use of clustering to reduce the sample complexity of structural inference. It systematically investigates how best to cluster for structural inference, analyses the necessary algorithmic characteristics and proposes three novel measures for empirically comparing clustering algorithms. A range of algorithms are evaluated and the two algorithms which best possessed the necessary characteristics, FLAME and KMART, are empirically compared using the proposed measures. KMART appears to be slightly better when clustering for structural inference.

Parallelised DSEM and clustering with KMART are applied to biological data, and a GRN for Saccharomyces cerevisiae with more than 6000 genes is inferred in just a few hours. The thesis considers how to use the inferred network to direct biological experimentation, and looks at building a gene-level model from an inferred network of clusters. It also suggests using another more time-efficient method for the necessary parameter inference during structural inference, and uses it illustratively on the biological network. Finally, realistic simulation of large GRNs was necessary to comprehensively evaluate the thesis's proposed techniques. No existing simulator appeared suitable, and so this thesis also presents a network simulator which uses several novel techniques for the automatic generation and realistic simulation of of GRNs with up to 104 genes. All code from this thesis is publicly available to other researchers here:

  author = 	 {Christopher Fogelberg},
  title = 	 {The Structural Inference of Large Regulatory Networks},
  school = 	 {Computing Laboratory, Oxford University},
  year = 	 2010,
  month = 	 {September}

Available here.