High-Dimensional Data and Machine Learning
Motivated by the data explosion in scientific areas such as genomics and proteomics, high-dimensional statistics deals with data in which the number of variables p is large compared with the sample size n. In contrast to the classical low-dimensional regime, where p is fixed, high-dimensional statistics typically assumes that p grows with the sample size n, which creates new challenges for estimation and inference. Over the past two decades, rapid progress has been made in the computation, methodology, and theory of high-dimensional statistics, yielding the fast-growing areas of selective inference, post-selection inference, and multiple testing.
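As a minimal sketch of why the p > n regime is challenging (the data and numbers below are simulated for illustration only, not taken from the publications listed on this page): when p exceeds n, the ordinary least-squares normal equations are singular, while adding a small ridge penalty restores a unique, computable solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200                    # high-dimensional setting: p > n
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 2.0                    # sparse true signal: only 5 active variables
y = X @ beta + 0.1 * rng.standard_normal(n)

# OLS breaks down: X'X is at most rank n, so it is singular when p > n
rank = np.linalg.matrix_rank(X.T @ X)
print(rank < p)                   # True

# Ridge regression: X'X + lam*I is positive definite for any lam > 0,
# so a unique estimate exists even though p > n
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_ridge.shape)           # (200,)
```

The ridge penalty is one of the simplest regularization devices; sparsity-inducing penalties such as the lasso or MCP (used in the publications below) follow the same principle of trading a small bias for a well-posed estimation problem.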
Machine learning (ML) is an emerging area at the intersection of statistics and computer science that aims to develop algorithms for data mining tasks such as classification, prediction, and clustering. Statisticians play important roles in ML, not only in developing novel algorithms and applying them to real-world data challenges, but also in providing theoretical guarantees on the statistical and computational properties of those algorithms. Machine learning has several subareas:
- Supervised learning: Learning using labeled data (in statistical language, there is a response variable in the data). Typical supervised learning problems include classification and prediction.
- Unsupervised learning: Learning using unlabeled data. Learning on its own to find structure in its inputs. Unsupervised learning can be a goal itself (discovering hidden patterns in data) or a means towards an end (feature learning).
- Reinforcement learning: A computer program interacts with a dynamic environment in which it must perform a certain task; the program receives feedback analogous to rewards, which it tries to maximize over time.
- Deep learning: A subarea of ML built on neural networks with multiple (typically more than three) layers. Some deep learning algorithms use more complex structures and architectures that can work directly with unprocessed data, such as text or imaging data.
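To make the supervised-learning bullet concrete, here is a hedged sketch of a k-nearest-neighbor classifier written in plain NumPy; the function name and toy data are invented for illustration. Given labeled training points, each test point is assigned the majority label among its k closest training points.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=1):
    """Predict a label for each row of X_test by majority vote
    among the k nearest training points (Euclidean distance)."""
    preds = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)  # distance to every training point
        nearest = np.argsort(dists)[:k]              # indices of the k closest points
        votes = y_train[nearest]
        preds.append(np.bincount(votes).argmax())    # most common label wins
    return np.array(preds)

# Toy labeled data: two well-separated clusters with labels 0 and 1
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.05, 0.1], [5.05, 5.0]])
print(knn_predict(X_train, y_train, X_test, k=3))  # [0 1]
```

Because the labels (the response variable) are supplied at training time, this is supervised learning; clustering the same points without y_train would be the unsupervised analogue.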
Our department faculty (Drs. Dai, R., Dai, H., Dong, Zhang, and Zheng) have conducted innovative methodological research in this area.
- Zhang, H., Zheng, Y., Hou, L., Zheng, C., and Liu, L. (2021) Mediation analysis for survival data with high-dimensional mediators. Bioinformatics. doi: 10.1093/bioinformatics/btab564.
- Dai, R., Song, H., Raskutti, G., and Barber, RF. (2020) The bias of isotonic regression. Electronic Journal of Statistics. 14: 801-874
- Song, H., Dai, R., Barber, RF., and Raskutti, G. (2020) Convex and non-convex approaches for statistical inference with noisy labels. Journal of Machine Learning Research. 21: 1-58.
- Wu, L., Jin, Q., Chen, J., He, J., and Dong, J. (2020) Diagnostic Accuracy of Chest Computed Tomography Scans for Suspected Patients With COVID-19: Receiver Operating Characteristic Curve Analysis. JMIR Public Health and Surveillance. 6(4): e19424. doi: 10.2196/19424.
- Liu, Y. and Zheng, C. (2019). Deep latent variable models for generating knockoffs. STAT. 8: e260.
- Dong, J., Wang, L., Gill, J., and Cao, J. (2017) Functional Principal Component Analysis of GFR Curves after Kidney Transplant. Statistical Methods in Medical Research. 27(12): 3785-3796.
- Dai, Wu, G., Wu, M., and Zhi, D. (2016) An Optimal Bahadur-efficient Method in Detection of Sparse Signals with Applications to Pathway Analysis in Sequencing Association Studies. PLoS ONE. doi: 10.1371/journal.pone.0152667.
- Su, X., Wijayasinghe, CS., Fan, J., and Zhang, Y. (2016) Sparse estimation of proportional hazards models via approximated information criteria. Biometrics. 72: 751-759.
- Jiang, DF., Huang, J., and Zhang, Y. (2013) The cross-validated AUC for MCP-logistic regression with high-dimensional data. Statistical Methods in Medical Research. 22(5): 505-518.