ORF 525: Statistical Foundations of Data Science

Spring Semester, 2026
MW 2:55 pm - 4:15 pm

Text Book
Fan, J., Li, R., Zhang, C.-H., and Zou, H. (2020).
Statistical Foundations of Data Science.
CRC Press. 

Homepage of the book 
The book can be ordered from amazon.com or from CRC Press.

General Information

Instructor: Jianqing Fan, Frederick L. Moore '18 Professor of Finance.
Office: 205 Sherrerd Hall
Phone: 609-258-7924
E-mail: [email protected]

Office Hours: Monday 4:30 pm--5:30 pm (205 Sherrerd), Wednesday 11:00 am--12:00 pm (205 Sherrerd), or by appointment.

Precept: Arranged by the AI as needed

Teaching Assistants:

  • Jinhang Chai, [email protected], 258-8787, Office: 222 Sherrerd Hall
    • Office Hours and Locations.
      • Tuesday 3:00pm-4:00pm, Sherrerd Hall 003
      • Thursday 2:00pm-3:00pm, Sherrerd Hall 003
  • Alessandro Chiusolo, [email protected] (½ TA)
    • Office Hour: Friday 10:00am-11:00am, Sherrerd Hall 002
  • Financial Econometric Lab, 222 Sherrerd Hall, 258-9433
  • Statistics Lab, 213 Sherrerd Hall, 258-8787

Reference Books

  • James, G., Witten, D., Hastie, T.J., and Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. Springer, New York.
  • Hastie, T.J., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed). Springer, New York.
  • Buehlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, New York.
  • Hastie, T., Tibshirani, R., and Wainwright, M. (2015). Statistical Learning with Sparsity. CRC Press, New York.
  • Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press.

Syllabus

This course gives an in-depth introduction to statistics and machine learning theory, methods, and algorithms for data science. It covers multiple regression, kernel learning, sparse regression, sure screening, generalized linear models and quasi-likelihood, supervised and unsupervised learning, deep learning, covariance learning and factor models, principal component analysis, and other related topics such as community detection, item ranking, topic modeling, and matrix completion. The applicability and limitations of these methods will be illustrated using mathematical statistics, a variety of modern real-world data sets, and manipulation of the statistical software R. 

The course material covers the following topics; some will be assigned as reading.

  1. Rise of Big Data and Dimensionality* (self reading)
    • Impact of Big Data
    • Impact of Dimensionality
    • Aims of High-dimensional Statistical Learning
    • Aims of Big Data
    • Chapters 1--3
  2. Multiple and Nonparametric Regression
  3. Penalized Least Squares
    • Best subset and L_0 penalty
    • Folded-concave Penalized Least Squares
    • Lasso and L_1-regularization
    • Folded-concave regularization
    • Concentration Inequalities
    • Refitted Cross-validation
    • Extensions to Nonparametric Modeling
    • Lecture Notes 2, Homework 2
  4. Generalized Linear Models and Penalized Likelihood
    • Generalized Linear Models (ORF 245, Chap. 12)
    • Variable Selection via Penalized Quasi-Likelihood
    • Numerical Algorithms
    • One-step estimation
    • Statistical Properties
    • Homework 3
  5. Feature Screening
    • Sure Independence Screening
    • Iterative Sure Independence Screening
  6. Supervised Learning
    • Model-based Classifiers
    • Kernel Density Classifiers and Naive Bayes
    • Nearest Neighbor Classifiers
    • Classification Trees and Ensemble Classifiers
    • Support Vector Machine
    • Sparse Classifiers and Feature Augmentation
  7. Unsupervised Learning
    • Cluster Analysis
    • Variable Selection in Clustering
    • Choice of Number of Clusters
    • Principal Component Analysis
  8. Introduction to Deep Learning
    • CNNs and RNNs
    • Generative Adversarial Networks
    • Transformers and LLMs
    • Training Algorithms
    • A Glimpse of Theory
  9. Covariance Regularization and Graphical Models
    • Sparse Covariance Matrix Estimation
    • Robust Covariance Inputs
    • Sparse Precision Matrix and Graphical Models
    • Latent Gaussian Graphical Models*
  10. Factor Models and Covariance Learning
    • Principal Component Analysis
    • Factor Models and Structured Covariance Learning
    • Robust Covariance and Precision Learning
    • Asymptotic Properties
  11. Applications of PCA and Factor Models
    • Factor-adjusted Regularized Model Selection
    • Factor-adjusted Robust Multiple Testing
    • Augmented Factor Regression
    • Applications to Statistical Machine Learning
      • Community Detection
      • Matrix Completion
      • Topic Modeling
      • Item Ranking

Computation

The software package for this class is R or RStudio. See R-labs below. Most of the computation in this class can be done on a laptop. Laptops with wireless communication turned off may be used during exams, as may calculators.

Attendance

Attendance at the class is required and essential. The course materials are drawn mainly from the lecture notes. Many conceptual issues and aspects of statistical thinking are taught only in class, and they will appear in the midterm and final exams.

Homework

Problems will be assigned through Canvas approximately biweekly and must be submitted online. No late homework will be accepted. Missed homework will receive a grade of zero. The homework will be graded, and each assignment carries equal weight. You are allowed to work with other students on the homework problems; however, verbatim copying of homework is absolutely forbidden. Each student must therefore ultimately produce his or her own homework to be handed in and graded.

Exams

There will be one in-class midterm exam and a final exam. All exams are required, and there will be no make-up exams. Missed exams will receive a grade of zero. All exams are open-book and open-notes. Laptops with wireless off and calculators may be used during the exams.

Schedules and Grading Policy

Assignment            Schedule
Homework (25%)        Various due dates (approx. 5 problem sets)
Midterm Exam (25%)    Wednesday, March 18, 2026 (2:55pm--4:15pm, in class)
Final Exam (45%)      9:30am--12:30pm, Thursday, April 30, 2026 at McCosh 62
Participation (5%)    This will be measured by a signup sheet or random quizzes

R-labs

The following files are intended to help you become familiar with the use of R-lab commands.

Here are some useful materials, too.

Data Sets used in the class