Abstract
Type 2 diabetes (T2D) is a global health burden that would benefit from personalised risk prediction. We aimed to identify longitudinal predictors of glycaemic traits relevant for T2D by applying machine learning (ML) to multi-omics data from the Northern Finland Birth Cohort 1966, collected at ages 31 (T1) and 46 (T2) years. We predicted fasting glucose and insulin (FG/FI), glycated haemoglobin (HbA1c) and 2-hour glucose and insulin from an oral glucose tolerance test (2hGlu/2hIns) at T2 in 595 individuals from 1,010 variables measured at T1 and T2: body mass index (BMI), waist-hip ratio and sex; nine blood plasma measurements; 454 NMR-based metabolites (228 at T1 and 226 at T2); and 542 methylation probes established for BMI/FG/FI/HbA1c/T2D/2hGlu/2hIns (277 at T1 and 264 at T2). Metabolic and methylation data were used either in raw form (Mb-R, Mh-R) or summarised as scores (Mb-S, Mh-S). We used six ML approaches: random forest (RF), boosted trees (BT) and support vector regression (SVR) with four kernels (linear, linear with L2 regularisation, polynomial and radial basis function). RF and BT performed consistently, whereas most SVRs struggled with high-dimensional data. Prediction was most accurate for FG and FI (mean R² across the six ML models with Mb-S data: 0.47 and 0.30, respectively). With Mb-S/Mb-R data, the top predictors included sex, branched-chain and aromatic amino acids, HDL-cholesterol, VLDL, glycoprotein acetyls, glycerol and ketone bodies at T2, as well as measures of obesity already at T1. Adding methylation data did not improve the predictions (P>0.3, model comparison); however, 15/17 markers were amongst the top 25 predictors of FI/FG when using Mb-S+Mh-R data. Using ML, we narrowed hundreds of variables down to a clinically relevant set of predictors and demonstrated the importance of longitudinal changes for prediction.
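As an illustration of the model comparison summarised above, the following is a minimal sketch of fitting six regressors and averaging their R² values. It rests on several assumptions: scikit-learn is used only for illustration (the abstract does not name the software), the data are synthetic and unrelated to the cohort, hyperparameters are left at or near library defaults, and `LinearSVR` is taken as the closest scikit-learn analogue of a linear SVR with L2 regularisation. The names `scaled`, `models`, `r2` and `mean_r2` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR, LinearSVR

# Synthetic stand-in for predictors and one glycaemic trait (NOT cohort data).
X, y = make_regression(n_samples=300, n_features=30, n_informative=8,
                       noise=15.0, random_state=0)
y = (y - y.mean()) / y.std()  # standardise the target so SVR defaults are sensible

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)


def scaled(model):
    """Wrap a model so that features are standardised before fitting."""
    return make_pipeline(StandardScaler(), model)


# Six approaches mirroring the abstract: RF, BT and four SVR variants.
models = {
    "RF": RandomForestRegressor(n_estimators=200, random_state=0),
    "BT": GradientBoostingRegressor(random_state=0),
    "SVR-linear": scaled(SVR(kernel="linear")),
    "SVR-linear-L2": scaled(LinearSVR(max_iter=10000, random_state=0)),  # assumed analogue
    "SVR-poly": scaled(SVR(kernel="poly")),
    "SVR-rbf": scaled(SVR(kernel="rbf")),
}

# Fit each model on the training split and score it on the held-out split.
r2 = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    r2[name] = r2_score(y_test, model.predict(X_test))

# Average R² across the six models, as reported per trait in the abstract.
mean_r2 = float(np.mean(list(r2.values())))

for name, value in r2.items():
    print(f"{name:>14}: R2 = {value:.2f}")
print(f"{'mean':>14}: R2 = {mean_r2:.2f}")
```

The single train/test split keeps the sketch short; a real analysis would more likely use cross-validation and tuned hyperparameters, so the numbers printed here say nothing about the reported results.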