ORF 245: Fundamentals of Statistics

Fall Semester, 2021
MW 3:00pm--4:20pm in Room 101, Friends Center

General Information

Instructor: Jianqing Fan, Frederick L. Moore '18 Professor of Finance, Professor of Operations Research and Financial Engineering
Office: 205 Sherrerd Hall.
Phone: 258-7924.
E-mail: [email protected]

Assistants in Instruction:

Office Hours

The AIs' office hours will be held in room 107 Sherrerd Hall. The instructor's office hours will be held in room 205 Sherrerd Hall.

Instructor: Mondays 10:00am--11:00am and Wednesdays 4:30pm--5:30pm, or by appointment
Assistants in Instruction (AIs): Office hours are held every day by AIs and UCAs. Please see Canvas for details. The AIs' office hours are in room 007 Sherrerd Hall.

Text and Reference Books

The course textbook is not required for the class. However, it does contain many examples and practice questions and can serve as a good accompaniment to lecture and precept materials:

  • Jay Devore, Probability and Statistics for Engineering and the Sciences, 9th Edition.

Syllabus

A first introduction to probability, statistics, and machine learning. This course provides the background needed to understand and produce rigorous statistical analyses, including estimation, confidence intervals, hypothesis testing, regression, logistic regression, and a brief introduction to machine learning. The applicability and limitations of these methods will be illustrated using a variety of modern real-world data sets and the statistical software R. Precepts are based on real data analysis using R.

The course will cover the following topics; some topics will be assigned as reading.

  1. Descriptive statistics
    • Statistics vs. probability, sample vs population;
    • Summary statistics: Mean, SD, Median, IQR;
    • Graphical Summary: Pie Charts, Histograms, Box-plots
      Lecture Notes 1, Homework 1
  2. Probability
    • Sample space, event, probability
    • Conditional Probability, Bayes's Theorem
    • Independence
    • Monte Carlo Simulations
      Lecture Notes 2, Homework 2
  3. Random variables and probability distributions
    • Random variables and probability distribution
    • Expected values and standard deviation
    • Probability density functions 
  4. Commonly used distributions
    • Binomial distribution
    • Hypergeometric, negative binomial
    • Poisson distributions
    • Normal distributions
    • Normal approximations to data histograms
    • Exponential and Gamma distributions
    • Quantile-Quantile plot 
  5. Joint Distributions and Random Samples
    • Discrete joint distribution
    • Joint densities
    • Covariance and correlation
    • Multivariate random variables
    • Square root law
    • Central limit theorem 
  6. Concepts and Methods of Estimation
    • Point Estimation
    • Methods of Estimation
    • Standard error
    • Bootstrap
  7. Confidence intervals
    • Basic Concept
    • Precision, sample size
    • Bootstrap
    • Intervals based on normal population
    • One-sided confidence bounds 
  8. Hypothesis Testing
    • Basic concept
    • Test for population mean
    • t-test
    • Test for population proportion 
  9. Comparisons of two treatments
    • Inference based on two samples
    • Two-sample z-test
    • Two-sample t-test
    • Difference between two proportions
    • Analysis of paired data 
  10. Simple linear regression
    • Models and summary statistics
    • Estimation of model parameters
    • Regression effect and goodness of fit
    • Inference of model parameters
    • Prediction
    • Inference of Correlation 
  11. Multiple and Nonlinear Regression
    • Parameter estimation
    • Variable Selection
    • Statistical inference and ANOVA
    • Model diagnostics
    • Training and Testing
    • Cross-validation and Prediction errors
    • Polynomial and nonlinear regression
    • Model building using dummies
  12. Introduction to Machine Learning
    • Logistic Regression
    • Supervised learning and Bayesian classifiers
    • Fisher and nearest-neighbor classification
    • Support vector machine
    • Unsupervised learning

Computation

The software package for this class is R. Implementing statistical machine learning ideas is essential to this class. Laptops may be used as calculators during exams; however, internet access and other communication tools must be turned off.

Attendance

We encourage active participation in lectures and precepts. These sessions cover many conceptual and practical issues and hone statistical thinking that cannot be learned from reading the textbook and lecture notes alone. Material from these sessions will appear on the midterm and final exams.

Homework

There will be 10 homework assignments throughout the semester. Problems will be posted on Canvas and are due at 11:30pm EDT/EST on the Wednesday of the following week. You must show all your work, including your R code, and submit a single PDF file containing your answers in the order the questions are presented. Missed homework will receive a grade of zero. All homework assignments carry equal weight, except the one on which you achieve the lowest score, which carries 40% of its original weight. You are encouraged to work with other students in small groups on the homework problems, as some of them may be challenging. However, verbatim copying of homework is absolutely forbidden: each student must ultimately produce his or her own homework to be handed in and graded.
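
As an illustration of the lowest-score rule above, here is a minimal R sketch. The function name homework_average and the scores are hypothetical, and the sketch assumes the weights are renormalized to sum to one after the lowest score is down-weighted. The syllabus does not spell out the normalization, so treat this as one plausible reading rather than the official computation.

```r
# Hypothetical helper: weighted homework average on a 0-100 scale.
# Assumption: nine homeworks keep weight 1, the lowest gets weight 0.4,
# and the weights are renormalized so that they sum to one.
homework_average <- function(scores, low_weight = 0.4) {
  weights <- rep(1, length(scores))
  # which.min() returns the first minimum, so ties down-weight only one score
  weights[which.min(scores)] <- low_weight
  sum(weights * scores) / sum(weights)
}

# Made-up scores for ten homeworks; a missed homework would enter as 0
scores <- c(90, 85, 100, 70, 95, 88, 92, 60, 97, 91)
homework_average(scores)
```

Under this reading, a single bad or missed homework still counts, but it moves the average less than any other assignment would.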

Exams

There will be a midterm during midterm week that covers the first half of the course, and a cumulative final exam at the end of the semester that covers the entire course. All exams are required, and there will be no make-up exams. Missed exams will receive a grade of zero. All exams are open-book and open-notes. Laptops with wireless off and calculators may be used during exams.

Schedules and Grading

Assignment            Schedule
Homework (30%)        11:30pm on due dates
Midterm Exam (25%)    Wed, Oct 13, 2021, 3:00pm--4:20pm
Final Exam (45%)      Dec 19, 1:30pm--4:30pm, McDonnell Hall A02
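
The percentages above can be combined as in the following R sketch. The function name course_grade and the example scores are hypothetical, and the sketch assumes each component is reported on the same 0-100 scale and combined as a plain weighted sum, with no curve or other adjustment.

```r
# Hypothetical helper: overall score from the three graded components.
# Assumption: all inputs are on a 0-100 scale and no curve is applied.
course_grade <- function(homework, midterm, final) {
  weights <- c(homework = 0.30, midterm = 0.25, final = 0.45)
  sum(weights * c(homework, midterm, final))
}

# Made-up component scores
course_grade(homework = 88, midterm = 80, final = 90)
```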

R-Labs

Understanding how to implement statistical machine learning ideas is becoming increasingly important for many career paths. R is one of the most widely used languages among data scientists and statisticians. In this class, we will apply the theory by analyzing real-world datasets using R. Please install R and RStudio before the first class. You may choose to use the R console, but RStudio offers a friendlier user interface and is recommended.

Datasets Used in Class