Biostatistics and Data Science


The cross-cutting theme on Biostatistics and Data Science aims to provide statistical support and develop novel quantitative methods to enable researchers to probe the rich data resources available at the Centre, to analyse multiple complex data sources in a coherent and robust manner, and to reveal the complex interactions and pathways between exposures and outcomes in environment and health studies.

Our research in this area includes the development of statistical methodology and innovative data analytics, anchored in the Bayesian hierarchical modelling paradigm and machine learning, to improve statistical inference. We employ flexible, spatio-temporal semi- and non-parametric models for large complex data from environmental and epidemiological studies and multi-omic high-throughput platforms, and strategies to address the computational challenges of high-dimensional inference and estimation.



Theme Leader: Professor Marta Blangiardo (ICL)

Principal teams: M Blangiardo, T Ebbels, M Evangelou, F Piel, M Pirani.

Associated teams: M Chadeau-Hyam, P Elliott, S Filippi, K Katsouyanni.


Key Projects
  • Integration of data sources in spatio-temporal modelling to: i) combine ground monitors at point level and numerical model output at grid level to better characterise exposure to air pollution, allowing propagation of uncertainty between estimates of environmental exposure and health outcomes [1]; ii) augment population-based registries with cohorts/surveys to improve confounder adjustment of the association between exposures/risk factors and health outcomes, while handling missing data via propensity score adjustment (MRC Methodology grant, PI Blangiardo) [2].
  • We compared statistical profiling approaches for exposome data [3], including ways to incorporate interactions in high-dimensional profiling [4]. We devised a multivariate normal approach to analyse exposome data generated using complex study designs with multiple observations per participant and applied it to EXPOsOMICS data [5]. We proposed a series of partial least squares (PLS) models to explore the multivariate effect of exposure mixtures on inflammatory markers [6].
  • We contributed to the development of the Metabolome Wide Significance Level as an approach to correct for multiple testing in metabolome-wide association studies [7] and a new method for power and sample size calculations for metabolomic studies [8]. We also contributed to the EU-funded PhenoMeNal programme cloud-based infrastructure for computational metabolomics [9].
  • We produced the first reference book on the Integrated Nested Laplace Approximation (INLA) for spatial and spatio-temporal applications [10], as well as the R packages R2GUESS for Bayesian variable selection [11], and binda and sda for multi-class discriminant analysis with variable selection [12,13]. We provide statistical code and simulated data to ensure reproducibility of our research.
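The multiple-testing correction mentioned above can be illustrated with a minimal sketch of a permutation-based significance threshold for correlated features. This is an assumption-laden illustration, not the procedure published in [7]: the function names (`pearson_r`, `p_value`, `permutation_threshold`) are hypothetical, the p-values use a normal approximation via the Fisher z-transformation, and the data are simulated. The idea is to shuffle the outcome many times, record the smallest p-value across all features each time, and take the alpha-quantile of those minima as the per-test threshold.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def p_value(r, n):
    """Two-sided p-value for a correlation (Fisher z, normal approximation)."""
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def permutation_threshold(features, outcome, alpha=0.05, n_perm=200, seed=1):
    """Return (per-test threshold, implied effective number of tests).

    `features` is a list of feature columns, each of the same length as
    `outcome`. The threshold is the empirical alpha-quantile of the minimum
    p-value observed across features under permutation of the outcome.
    """
    rng = random.Random(seed)
    n = len(outcome)
    min_p = []
    for _ in range(n_perm):
        shuffled = outcome[:]
        rng.shuffle(shuffled)  # break any feature-outcome association
        min_p.append(min(p_value(pearson_r(f, shuffled), n) for f in features))
    min_p.sort()
    k = max(int(math.floor(alpha * n_perm)) - 1, 0)
    threshold = min_p[k]
    return threshold, alpha / threshold


# Simulated example: 40 features sharing one latent factor, so they are
# strongly correlated with each other; the outcome is pure noise.
rng = random.Random(0)
n, m = 60, 40
latent = [rng.gauss(0, 1) for _ in range(n)]
features = [[v + 0.5 * rng.gauss(0, 1) for v in latent] for _ in range(m)]
outcome = [rng.gauss(0, 1) for _ in range(n)]

threshold, effective_tests = permutation_threshold(features, outcome)
print(f"per-test threshold: {threshold:.5f}")
print(f"implied effective number of tests: {effective_tests:.1f}")
```

Dividing alpha by the resulting threshold gives an implied effective number of independent tests. When features are strongly correlated, this would typically be smaller than the nominal number of features, which is why a permutation-based threshold can be less conservative than a Bonferroni correction.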


References
  1. Two-stage Bayesian model to evaluate the effect of air pollution on chronic respiratory diseases using drug prescriptions. Blangiardo M, Finazzi F, Cameletti M. Spat Spatiotemporal Epidemiol. 2016 Aug;18:1-12.
  2. Bayesian spatiotemporal modelling for the assessment of short-term exposure to particle pollution in urban areas. Pirani M, Gulliver J, Fuller GW, Blangiardo M. J Expo Sci Environ Epidemiol. 2014 May-Jun;24(3):319-27.
  3. A Systematic Comparison of Linear Regression-Based Statistical Methods to Assess Exposome-Health Associations. Agier L, Portengen L, Chadeau-Hyam M, Basagaña X, Giorgis-Allemand L, Siroux V, Robinson O, Vlaanderen J, González JR, Nieuwenhuijsen MJ, Vineis P, Vrijheid M, Slama R, Vermeulen R. Environ Health Perspect. 2016 Dec;124(12):1848-1856.
  4. A systematic comparison of statistical methods to detect interactions in exposome-health associations. Barrera-Gómez J, Agier L, Portengen L, Chadeau-Hyam M, Giorgis-Allemand L, Siroux V, Robinson O, Vlaanderen J, González JR, Nieuwenhuijsen M, Vineis P, Vrijheid M, Vermeulen R, Slama R, Basagaña X. Environ Health. 2017 Jul 14;16(1):74.
  5. Blood transcriptional and microRNA responses to short-term exposure to disinfection by-products in a swimming pool. Espín-Pérez A, Font-Ribera L, van Veldhoven K, Krauskopf J, Portengen L, Chadeau-Hyam M, Vermeulen R, Grimalt JO, Villanueva CM, Vineis P, Kogevinas M, Kleinjans JC, de Kok TM. Environ Int. 2018 Jan;110:42-50.
  6. A multivariate approach to investigate the combined biological effects of multiple exposures. Jain P, Vineis P, Liquet B, Vlaanderen J, Bodinier B, van Veldhoven K, Kogevinas M, Athersuch TJ, Font-Ribera L, Villanueva CM, Vermeulen R, Chadeau-Hyam M. J Epidemiol Community Health. 2018 Jul;72(7):564-571.
  7. Improving Visualization and Interpretation of Metabolome-Wide Association Studies: An Application in a Population-Based Cohort Using Untargeted 1H NMR Metabolic Profiling. Castagné R, Boulangé CL, Karaman I, Campanella G, Santos Ferreira DL, Kaluarachchi MR, Lehne B, Moayyeri A, Lewis MR, Spagou K, Dona AC, Evangelos V, Tracy R, Greenland P, Lindon JC, Herrington D, Ebbels TMD, Elliott P, Tzoulaki I, Chadeau-Hyam M. J Proteome Res. 2017 Oct 6;16(10):3623-3633.
  8. Power Analysis and Sample Size Determination in Metabolic Phenotyping. Blaise BJ, Correia G, Tin A, Young JH, Vergnaud AC, Lewis M, Pearce JT, Elliott P, Nicholson JK, Holmes E, Ebbels TM. Anal Chem. 2016 May 17;88(10):5179-88.
  9. Ebbels TM, Pearce JTM, Sadawi N, Gao J, Glen RC. Chapter 11 - Big Data and Databases for Metabolic Phenotyping. The Handbook of Metabolic Phenotyping. Editor(s): Lindon JC, Nicholson JK, Holmes E. Elsevier, 2019. Pages 329-367.
  10. Blangiardo M, Cameletti M. Spatial and Spatio-temporal Bayesian Models with R-INLA. Wiley, 2015.
  11. R2GUESS: A Graphics Processing Unit-Based R Package for Bayesian Variable Selection Regression of Multivariate Responses. Liquet B, Bottolo L, Campanella G, Richardson S, Chadeau-Hyam M. J Stat Softw. 2016 Jan 29;69(2).
  12. Optimal Whitening and Decorrelation. Kessy A, Lewin A, Strimmer K. Amer Statistician. 2018 Jan 26;309-314.
  13. Differential protein expression and peak selection in mass spectrometry data by binary discriminant analysis. Gibb S, Strimmer K. Bioinformatics. 2015 Oct 1;31(19):3156-62.
  14. A Bayesian mixture modeling approach for public health surveillance. Boulieri A, Bennett JE, Blangiardo M. Biostatistics. 2018 Sep 25. doi: 10.1093/biostatistics/kxy038
  15. Dynamics of the risk of smoking-induced lung cancer: a compartmental hidden Markov model for longitudinal analysis. Chadeau-Hyam M, Tubert-Bitter P, Guihenneuc-Jouyaux C, Campanella G, Richardson S, Vermeulen R, De Iorio M, Galea S, Vineis P. Epidemiology. 2014 Jan;25(1):28-34.
  16. A hierarchical modelling approach to assess multi pollutant effects in time-series studies. Blangiardo M, Pirani M, Kanapka L, Hansell A, Fuller G. PLoS One. 2019 Mar 4;14(3):e0212565.
  17. Bayesian modeling for spatially misaligned health and air pollution data through the INLA-SPDE approach. Cameletti M, Gomez-Rubio V, Blangiardo M. Spatial Statistics 31, 2019 April.
  18. Bayesian spatial modelling for quasi-experimental designs: An interrupted time series study of the opening of Municipal Waste Incinerators in relation to infant mortality and sex ratio. Freni-Sterrantino A, Ghosh RE, Fecht D, Toledano MB, Elliott P, Hansell AL, Blangiardo M. Environ Int. 2019 Jul;128:109-115.
  19. Error in air pollution exposure model determinants and bias in health estimates. Vlaanderen J, Portengen L, Chadeau-Hyam M, Szpiro A, Gehring U, Brunekreef B, Hoek G, Vermeulen R. J Expo Sci Environ Epidemiol. 2019 Mar;29(2):258-266.
  20. Analysing the health effects of simultaneous exposure to physical and chemical properties of airborne particles. Pirani M, Best N, Blangiardo M, Liverani S, Atkinson RW, Fuller GW. Environ Int. 2015 Jun;79:56-64.