# Data Mining and Computational Statistics - Structure

GIANCARLO MANZI, responsible for the course

Degree in FINANCE AND ECONOMICS (MEF) - Class LM-16, enrolled from the 2017/2018 academic year - Master's degree (Laurea Magistrale) - 2018/2019

Compulsory course or activity: yes; 1s; 3rd term; SECS-S/01 - Statistica; 9 credits

## General information

Aims and objectives: The course objectives are:
• To introduce students to the expanding world of big data analysis.
• To introduce students to the basic concepts, techniques and applications of computational statistics and data mining used in finance and economics.
• To develop skills in using the R software to solve practical problems.
• To develop the skills needed for independent study and research.

Language of instruction: English

Teaching methods: 75% lecture-style lessons
25% classroom teaching activities focused on examples and applications in R

## Syllabus

### Web:

http://ariel.unimi.it

Syllabus: Part I
(i) Introduction to data mining and statistical learning.
(ii) Exploratory data analysis and visualization.
(iii) Supervised vs. unsupervised methods: introduction.
(iv) Parametric vs. nonparametric methods: introduction.
(v) Quick review of maximum likelihood methods.
(vi) Multiple linear regression.
(vii) Classification methods: logistic regression, linear discriminant analysis and the K-nearest neighbors method. The Bayes classifier.
(viii) Resampling methods: cross-validation and the bootstrap.
(ix) Shrinkage methods: Ridge regression, the Lasso and other shrinkage methods.
(x) Regression splines and local regression.
(xi) Tree-based methods: random forest, bagging and boosting. Introduction to Bayesian networks.
(xii) Support vector machines.
(xiii) Unsupervised learning: PCA, clustering and multidimensional scaling methods; correspondence analysis. Principal component regression.
(xiv) Introduction to Bayesian methods in data mining.
(xv) Elementary text mining.
(xvi) Data mining in finance.
Part II
(i) Computer-intensive statistical methods: overview.
(ii) Pseudo-random number and variable generation.
(iii) Monte Carlo methods for numerical integration.
(iv) Simulation-based inference.
(v) MCMC methods: overview.
(vi) MCMC methods: Metropolis-Hastings and Gibbs sampling.

Syllabus - non-attending students: Same as the syllabus above (Part I and Part II).

### Short course description

The aim of this course is to provide a basic understanding of supervised and unsupervised statistical learning from data. It will help students acquire the basic methodology and the most popular tools used in applications. Data mining topics include: a review of basic likelihood theory; multiple linear and non-linear regression, and other parametric and classification methods; variable selection; logistic regression; the Bayes classifier; linear and quadratic discriminant analysis; regression shrinkage methods (Ridge, Lasso and other methods); dimension reduction (principal component analysis, multidimensional scaling and correspondence analysis); k-nearest neighbors; decision trees; support vector machines; clustering. Computational statistics topics include pseudo-random number and variate generation, Monte Carlo methods for numerical integration and basic Markov chain Monte Carlo (MCMC) methods. Students' practice will focus on the use of statistical software packages (mainly R). Applications of data mining in finance (time series clustering) and text mining will also be presented.

Readings: Main textbooks:
(i) An Introduction to Statistical Learning, with applications in R (2013) by G. James, D. Witten, T. Hastie, R. Tibshirani, Springer.
(ii) Introducing Monte Carlo Statistical Methods with R (2010) by C.P. Robert, G. Casella, Springer.
Suggested reading for insights into some topics in main textbooks:
(i) The Elements of Statistical Learning, 2nd edition (2009), T. Hastie, R. Tibshirani, J. Friedman, Springer.
(ii) Machine Learning: a Probabilistic Perspective (2012), K.P. Murphy, The MIT Press.
(iii) Monte Carlo Statistical Methods (2004) by C.P. Robert, G. Casella, Springer.

Further reading will be suggested during the course.

Readings - non-attending students: Same main textbooks and suggested reading as listed above.

## Prerequisites, exams and assessment

Exam: single exam, with the grade recorded on a thirty-point scale.

Prerequisites, exams and assessment: A basic knowledge of the fundamentals of statistics and probability is required. Familiarity with basic regression methods is useful for moving quickly through the first part of the course.
Matrix algebra and multivariate calculus are beneficial but not strictly required.
A basic knowledge of R and some programming skills are also useful but not required.
Evaluation is through an oral examination on both theoretical topics and possible applications. Homework and assignments will be given during the course.

Prerequisites, exams and assessment - non-attending students: Same as above.

Propaedeutic courses: No prerequisite courses are mandatory, but a good knowledge of basic statistical and mathematical topics is welcome.

## Structure of the course

Scientific fields

• SECS-S/01 - Statistica - Credits: 9

Activities

• Lectures: 60 hours

## Teachers' office hours

Teacher: GIANCARLO MANZI, responsible for the course
Office hours: Wednesday 16:30-19:30. Office hours are suspended for the Christmas break and will resume regularly on 9 January 2019.
Office location: DEMM, room 37, 3rd floor.