


Instructor: Jianqing Fan, Frederick L. Moore '18 Professor of Finance. Office: 205 Sherrerd Hall. Phone: 258-7924. Email: jqfan@princeton.edu
Office Hours: Monday 4:00pm-5:00pm (Zoom meeting), Wednesday 10:00am-11:00am (Zoom meeting), or by appointment.
Precept:
Arranged by the AI as needed
Assistants in Instruction (AIs):

Igor Silin, isilin@princeton.edu, 258-9433, Office: 228 Sherrerd Hall
Office Hours (Zoom meeting):
 Tuesday: 3:00pm-4:00pm
 Thursday: 10:00am-11:00am
 Financial Econometric Lab, 222 Sherrerd Hall, 258-9433
 Statistics Lab, 213 Sherrerd Hall, 258-8787
Textbook:
 Fan, J., Li, R., Zhang, C.-H., and Zou, H. (2020). Statistical Foundations of Data Science. CRC Press.

Lectures are primarily based on the lecture notes and parts of the textbook, supplemented by the following references.
Reference Books:
 James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. Springer, New York.
 Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer, New York.
 Buehlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, New York.
 Hastie, T., Tibshirani, R., and Wainwright, M. (2015). Statistical Learning with Sparsity. CRC Press, New York.
 Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press.
Syllabus: This course gives an in-depth introduction to statistics and machine learning theory, methods, and algorithms for data science. It covers multiple regression, kernel learning, sparse regression, sure screening, generalized linear models and quasi-likelihood, covariance learning and factor models, principal component analysis, supervised and unsupervised learning, deep learning, and other related topics such as community detection, item ranking, and matrix completion. The applicability and limitations of these methods will be illustrated using mathematical statistics, a variety of modern real-world data sets, and hands-on use of the statistical software R.
The course will cover the following topics; some topics will be assigned as reading material.
1. Rise of Big Data and Dimensionality*
2. Multiple and Nonparametric Regression
3. Penalized Least Squares
 Best subset and L_0 penalty
 Folded-concave Penalized Least Squares
 Lasso and L_1 regularization
 Numerical Algorithms
 Regularization parameters
 Refitted Cross-validation
 Extensions to Nonparametric Modeling
Lecture Notes 2, Homework 2
4. Generalized Linear Models and Penalized Likelihood
 Generalized Linear Models
 Variable Selection via Penalized Likelihood
 Numerical Algorithms
 Statistical Properties
5. Feature Screening
 Correlation Screening
 Generalized and Rank Correlation Screening
 Nonparametric Screening
 Sure Screening and False Selection
6. Supervised Learning
 Model-based Classifiers
 Kernel Density Classifiers and Naive Bayes
 Nearest Neighbor Classifiers
 Classification Trees and Ensemble Classifiers
 Support Vector Machine
 Sparse Classifiers
 Sparse Discriminant Analysis
 Sparse Additive Classifiers
7. Unsupervised Learning
 Cluster Analysis
 Variable Selection in Clustering
 Choice of Number of Clusters
 Sparse PCA
8. Introduction to Deep Learning
 CNN and RNN
 Generative Adversarial Networks
 Training Algorithms
 A Glimpse of Theory
9. Covariance Regularization and Graphical Models
 Sparse Covariance Matrix Estimation
 Robust Covariance Inputs
 Sparse Precision Matrix and Graphical Models
 Latent Gaussian Graphical Models
10. Covariance Learning and Factor Models
 Principal Component Analysis
 Factor Models and Structured Covariance Learning
 Covariance and Precision Learning with Known Factors
 Augmented Factor Models and Projected PCA
 Asymptotic Properties
11. Applications of PCA and Factor Models
 Factor-adjusted Regularized Model Selection
 Factor-adjusted Robust Multiple Testing
 Augmented Factor Regression
 Applications to Statistical Machine Learning
Computation:
The software package for this class is R (or RStudio). See R labs below.
Most of the computation in this class can be done on a laptop.
Laptops with wireless communication turned off may be used during the exams, as may calculators.
Attendance:
Class attendance is required and essential. The course material comes mainly from the lecture notes.
Many conceptual issues and aspects of statistical thinking are taught only in class, and they will appear in the midterm and final exams.
Homework:
Problems will be assigned through Canvas approximately biweekly and submitted online. No late homework will be accepted.
Missed homework will receive a grade of zero.
The homework will be graded, and each assignment carries equal weight.
You are allowed to work with other students on the homework problems;
however, verbatim copying of homework is absolutely forbidden.
Each student must therefore ultimately produce his or her own homework
to be handed in and graded.
Exams:
There will be one in-class midterm exam and a final
exam. All exams are required and there will be no make-up exams. Missed
exams will receive a grade of zero. All exams are open-book and
open-notes. Laptops with wireless off and calculators may be used during the exams.
Schedules and Grading Policy:
Homework (30%): various due dates (approx. 5 sets)
Midterm Exam (20%): Monday, March 22, 2021 (1:30pm-2:50pm, in class)
Final Exam (50%) (tentative): 9:00am-12:00pm, Monday, May 10, 2021
R labs:
The following files are intended to help you become familiar
with R commands. Here are some other useful materials.
An Introduction to R, by W. N. Venables, D. M. Smith and the R Core Team.
YouTube video: An introduction to R

