Instructor. Jianqing Fan, Frederick L. Moore '18 Professor of Finance.
Office: 205 Sherred Hall. Phone: 258-7924. E-mail: firstname.lastname@example.org
Office Hours: Monday 4:00pm--5:00pm (Zoom meeting), Wednesday 10:00am--11:00am (Zoom meeting), or by appointment.
Arranged by the AI as needed
Assistants in Instruction
Igor Silin email@example.com, 258-9433, Office: 228 Sherred Hall
Office Hours (Zoom meeting):
--- Tuesday: 3:00pm--4:00pm
--- Thursday: 10:00am--11:00am
- Financial Econometric Lab, 222 Sherred Hall, 258-9433,
213 Sherred Hall, 258-8787
- Fan, J., Li, R., Zhang, C.-H., and Zou, H. (2020).
Statistical Foundations of Data Science. CRC Press.
Lectures are primarily based on the lecture notes and parts of the textbook,
supplemented by the following references.
- James, G., Witten, D., Hastie, T.J. and Tibshirani, R. (2013).
An Introduction to Statistical Learning with Applications in R. Springer, New York.
- Hastie, T.J., Tibshirani, R. and Friedman, J. (2009).
The Elements of Statistical Learning: Data Mining, Inference,
and Prediction (2nd ed.). Springer, New York.
- Buehlmann, P. and van de Geer, S. (2011).
Statistics for High-Dimensional Data:
Methods, Theory and Applications. Springer, New York.
- Hastie, T., Tibshirani, R., and Wainwright, M. (2015).
Statistical Learning with Sparsity. CRC Press, New York.
- Wainwright, M. J. (2019).
High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press.
Syllabus: This course gives an in-depth introduction to statistics and machine learning theory, methods, and algorithms for data science. It covers multiple regression, kernel learning, sparse regression, sure screening, generalized linear models and quasi-likelihood, covariance learning and factor models, principal component analysis, supervised and unsupervised learning, deep learning, and other related topics such as community detection, item ranking, and matrix completion. The applicability and limitations of these methods will be illustrated using mathematical statistics, a variety of modern real-world data sets, and hands-on use of the statistical software R.
The course will cover the following topics; some topics will be assigned as reading material.
1. Rise of Big Data and Dimensionality*
- Impact of Big Data
- Impact of Dimensionality
- Aims of High-dimensional Statistical Learning
- Aims of Big Data
2. Multiple and Nonparametric Regression
3. Penalized Least Squares
- Best Subset and L_0 Penalty
- Folded-concave Penalized Least Squares
- Lasso and L_1-regularization
- Numerical Algorithms
- Regularization Parameters
- Refitted Cross-validation
- Extensions to Nonparametric Modeling
Lecture Notes 2, Homework 2
4. Generalized Linear Models and Penalized Likelihood
- Generalized Linear Models
- Variable Selection via Penalized Likelihood
- Numerical Algorithms
- Statistical Properties
5. Feature Screening
- Correlation Screening
- Generalized and Rank Correlation Screening
- Nonparametric Screening
- Sure Screening and False Selection
6. Supervised Learning
- Model-based Classifiers
- Kernel Density Classifiers and Naive Bayes
- Nearest Neighbor Classifiers
- Classification Trees and Ensemble Classifiers
- Support Vector Machine
- Sparse Classifiers
- Sparse Discriminant Analysis
- Sparse Additive Classifiers
7. Unsupervised Learning
- Cluster Analysis
- Variable Selection in Clustering
- Choice of Number of Clusters
- Sparse PCA
8. Introduction to Deep Learning
- CNN and RNN
- Generative Adversarial Networks
- Training Algorithms
- A Glimpse of Theory
9. Covariance Regularization and Graphical Models
- Sparse Covariance Matrix Estimation
- Robust Covariance Inputs
- Sparse Precision Matrix and Graphical Models
- Latent Gaussian Graphical Models
10. Covariance Learning and Factor Models
- Principal Component Analysis
- Factor Models and Structured Covariance Learning
- Covariance and Precision Learning with Known Factors
- Augmented Factor Models and Projected PCA
- Asymptotic Properties
11. Applications of PCA and Factor Models
- Factor-adjusted Regularized Model Selection
- Factor-adjusted Robust Multiple Testing
- Augmented Factor Regression
- Applications to Statistical Machine Learning
The software package for this class is R
or RStudio. See R-labs
Most of the computation in this class can be done on a laptop.
Laptops with wireless communication off can be used during the exams,
and so can calculators.
Class attendance is required
and essential. The course materials are mainly from the lecture notes.
Many conceptual issues and aspects of statistical thinking are taught
only in class. They will appear in the midterm and final exams.
Problems will be assigned through Canvas approximately biweekly and submitted online. No late homework will be accepted.
Missed homework will receive a grade of zero.
The homework will be graded, and each assignment carries equal weight.
You are allowed to work with other students on the homework problems;
however, verbatim copying of homework is absolutely forbidden.
Each student must therefore ultimately produce his or her own homework
to be handed in and graded.
There will be one in-class midterm exam, and a final
exam. All exams are required and there will be no make-up exams. Missed
exams will receive a grade of zero. All exams are open-book and
open-notes. Laptops with wireless off and calculators may be used during the exams.
Schedules and Grading Policy:
Homework ........................ Various due dates (approx. 5 sets)
Midterm Exam (20%) .............. Monday, March 22, 2021 (1:30pm--2:50pm, in class)
Final Exam (50%) (tentative) .... Monday, May 10, 2021 (9:00am--12:00pm)
The following files are intended to help you become familiar
with the R-lab commands.
Here are some other useful materials.
An Introduction to R, by W. N. Venables, D. M. Smith and the R Core Team.
YouTube video: An introduction to R