Fall Semester, 2021

MW 3:00pm--4:20pm in Room 101, Friends Center

## General Information

**Instructor**: Jianqing Fan, Frederick L. Moore '18 Professor of Finance, Professor of Operations Research and Financial Engineering

**Office**: 205 Sherrerd Hall

**Phone**: 258-7924

**E-mail**: [email protected]

**Assistants in Instruction**:

- Head AI: Tony Ye, [email protected]
- Other AIs: Onat Aydinhan ([email protected]), Yihong Gu ([email protected]), Jikai Hou ([email protected]), Bingyan Wang ([email protected]), Mengxi Wang ([email protected])
- Precept: Precepts are a good way to solidify your understanding of lecture material through practice questions and discussions in a small group. They are weekly, according to the schedule. We ask that you attend the precept section you are officially enrolled in.

### Office Hours

AIs' office hours will be held in room 107 Sherrerd Hall. The instructor's office hours will be held in room 205 Sherrerd Hall.

**Instructor**: Mondays 10:00am--11:00am and Wednesdays 4:30pm--5:30pm, or by appointment

**Assistants in Instruction (AIs)**: Every day, there are office hours held by AIs and UCAs. Please see Canvas for details. AIs' office hours are in room 007 Sherrerd Hall.

### Text and Reference Books

The course textbook is not required for the class. However, it does contain many examples and practice questions and can serve as a good accompaniment to lecture and precept materials:

- Jay Devore, Probability and Statistics for Engineering and the Sciences, 9th Edition. Download Here

## Syllabus

A first introduction to probability, statistics and machine learning. This course provides the background to understand and produce rigorous statistical analysis, including estimation, confidence intervals, hypothesis testing, regression, logistic regression and a brief introduction to machine learning. The applicability and limitations of these methods will be illustrated using a variety of modern real-world data sets and the statistical software R. Precepts are based on real data analysis using R.

The course will cover the following topics; some topics will be assigned as reading materials.

- Descriptive statistics
  - Statistics vs. probability, sample vs. population
  - Summary statistics: Mean, SD, Median, IQR
  - Graphical Summary: Pie Charts, Histograms, Box-plots

Lecture Notes 1, Homework 1

- Probability
  - Sample space, event, probability
  - Conditional Probability, Bayes's Theorem
  - Independence
  - Monte Carlo Simulations

Lecture Notes 2, Homework 2

- Random variables and probability distributions
  - Random variables and probability distribution
  - Expected values and standard deviation
  - Probability density functions

- Commonly used distributions
  - Binomial distribution
  - Hypergeometric, negative binomial
  - Poisson distributions
  - Normal distributions
  - Normal approximations to data histograms
  - Exponential and Gamma distributions
  - Quantile-Quantile plot

- Joint Distributions and Random Samples
  - Discrete joint distribution
  - Joint densities
  - Covariance and correlation
  - Multivariate random variables
  - Square root law
  - Central limit theorem

- Concepts and Methods of Estimation
  - Point Estimation
  - Methods of Estimation
  - Standard error
  - Bootstrap

- Confidence intervals
  - Basic Concept
  - Precision, sample size
  - Bootstrap
  - Intervals based on normal population
  - One-sided confidence bounds

- Hypothesis Testing
  - Basic concept
  - Test for population mean
  - t-test
  - Test for population proportion

- Comparisons of two treatments
  - Inference based on two samples
  - Two-sample z-test
  - Two-sample t-test
  - Difference between two proportions
  - Analysis of paired data

- Simple linear regression
  - Models and summary statistics
  - Estimation of model parameters
  - Regression effect and goodness of fit
  - Inference of model parameters
  - Prediction
  - Inference of Correlation

- Multiple and Nonlinear Regression
  - Parameter estimation
  - Variable Selection
  - Statistical inference and ANOVA
  - Model diagnostics
  - Training and Testing
  - Cross-validation and Prediction errors
  - Polynomial and nonlinear regression
  - Model building using dummies

- Introduction to Machine Learning
  - Logistic Regression
  - Supervised learning and Bayesian classifiers
  - Fisher and nearest-neighbor classification
  - Support vector machine
  - Unsupervised learning

### Computation

The software package for this class is *R*. The implementations of the statistical machine learning ideas are essential to this class. Laptops can be used during the exam as a calculator; however, internet and other communication tools should be turned off.

Popularity of Data Science Software, Popularity of Programming Languages

### Attendance

We encourage active participation in lectures and precepts. These sessions cover many conceptual and practical issues and hone statistical thinking that cannot be learned from reading the textbook and lecture notes alone. Material from these sessions will appear on the midterm and final exams.

### Homework

There will be 10 homework assignments throughout the semester. Problems will be posted on Canvas and will be due at 11:30pm EDT/EST on the Wednesday of the following week. You must show all your work, including your R code, and submit a single PDF file containing your answers in the order the questions are presented. Missed homework will receive a grade of zero. All homework assignments carry equal weight, except for the one on which you achieve the lowest score, which will carry 40% of its original weight. You are encouraged to work with other students in small groups on the homework problems, as some of the problems could be challenging. However, verbatim copying of homework is absolutely forbidden: each student must ultimately produce his or her own homework to be handed in and graded.
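As a worked illustration of the weighting rule above, here is one possible reading. It assumes the weights are renormalized to sum to one, which the rule does not state explicitly, so treat it as a sketch rather than the official formula. The symbols are introduced here for illustration only: $s_1, \dots, s_{10}$ are the ten homework scores on a common scale, and $s_{\min}$ is the lowest of them.

$$
\bar{s} \;=\; \frac{\sum_{i=1}^{10} s_i \;-\; 0.6\, s_{\min}}{9 + 0.4}
$$

For example, with nine scores of 90 and one score of 50, this reading gives $(860 - 30)/9.4 \approx 88.3$, compared with a plain average of $86$.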

### Exams

There will be a midterm during midterm week that covers the first half of the course, and a cumulative final exam at the end of the semester that covers the entire course. All exams are required and there will be no make-up exams. Missed exams will receive a grade of zero. All exams are open-book and open-notes. Laptops with wireless off and calculators may be used during exams.

### Schedules and Grading

Assignment | Schedule |
---|---|
Homework (30%) | 11:30pm on due dates |
Midterm Exam (25%) | Wed (3:00--4:20pm), Oct 13, 2021 |
Final Exam (45%) | 1:30--4:30pm, Dec 19, McDonnell Hall A02 |

### R-Labs

Understanding how to implement statistical machine learning ideas is becoming increasingly important for many career paths. R is one of the most prominently used languages among data scientists and statisticians. In this class, we will apply the theory by analyzing real-world datasets using R. Please install R and RStudio before the first class: R or RStudio. You may choose to use the R console, but RStudio offers a friendlier user interface and is recommended.

- Materials for Quick Start of R installation and learning.
- The following files are intended to help you become familiar with the use of R-lab commands.
  - YouTube video: An introduction to R
- Here are useful materials for R.
  - An Introduction to R, by W. N. Venables, D. M. Smith and the R Core Team.
  - Quick Labs 1-5: Basic skills and their associated data set (Boston housing data). The materials cover basic skills for R. Do the first 3 labs for now and do the remaining ones when we get to Multiple Regression.
- The following extended skills are not used in the class, but are provided here for your convenience.
  - Extended Skills: GLIM and their associated data set (burn data). Description of the data set
  - Extended Skills: ANOVA and their associated data set (labor data).

### Datasets Used in Class

- Daily Stock Prices from 1/1/2000 to 9/8/2016: SP500, IBM, Johnson & Johnson, Apple Inc.
- Tax data in years 2014 and 2000
- Salary data for 253 MBAs' first jobs (in pounds) in the UK in 2010: Jobs.txt
- 129 macroeconomic monthly time series from 1959 to 2016: macro.csv
- Motorcycle data: motordata.txt
- Autism data: autism.csv
- Boston Housing Data: boston.housing.dat
- Image Data: 500 photos with people in the pictures and 500 photos without people in the pictures, and the associated R code to preprocess the data: human.r