Executive Software Engineering Program

EE380L – Data Mining-SE

Spring 2006

 

Instructor:

 

Joydeep Ghosh, Ph.D., Professor

Email address: ghosh@ece.utexas.edu

URL: http://www.lans.ece.utexas.edu/~ghosh

 

 

Course Title and Description:

           

Many companies that gather huge amounts of electronic data have begun applying data mining techniques to their data warehouses to discover and extract “hidden” patterns useful for making smart business decisions. Effective data mining requires an understanding of concepts from exploratory data analysis, pattern recognition, machine learning/AI, heterogeneous databases, parallel processing and data visualization, in addition to knowledge of the application domain. I will focus on basic techniques for data mining, including methods useful for analyzing information from the World Wide Web. Demos will be given using the public-domain Java package WEKA, along with some demos using industrial-strength software (SAS), and some applications/case studies will be discussed. The course involves a mid-term exam, a paper presentation and a term project. There will be no final exam.

 

 

Textbooks:

 

Author:  Pang-Ning Tan, Michael Steinbach, and Vipin Kumar (TSK)

Title:  Introduction to Data Mining

Publisher:   Addison-Wesley (2005)

ISBN: 0-321-32136-7
 

Author:  Witten and Frank (WF)

Title:  Data Mining (2nd Ed)

Publisher:   Morgan Kaufmann (2005)

ISBN:  0-12-088407-0

 

 

 

Course Expectations:

 

This course requires students to have very basic knowledge of Java. An undergraduate-level understanding of probability/statistics, data analysis and linear algebra is assumed. This is a graduate course, so the workload will be medium to heavy.

While studying techniques for database representation/modeling, clustering, classification, finding associations and sequence processing, emphasis will be placed on the issues of algorithm scalability, performance, interpretability and the ability to deal with garbage data. Student talks of 10-15 minutes each will be interwoven with the lectures, depending on class size. The last two classes will largely consist of student term-project presentations, followed by active discussion.

 
Class outline:

            Introduction – January 20 and 21

Reading Assignment:  TSK ch 1-3; WF ch 1, 2

 

Area of study: overview, SAS demos, data warehousing, OLAP; data quality and pre-processing

 

            February 17 and 18

Reading Assignment: TSK ch 8, 9, 6; WF 3.4, 3.9, 4.5, 4.8, 6.6; 7.1-7.3

 

Area of study: clustering/segmentation; market basket analysis; intro to finding association rules

 

March 10 and 11

Reading Assignment: TSK ch (6), 4, 5; WF rest of ch 4-6

 

Area of study: association rules (contd.); classification; prediction/forecasting

 

            April 7 and 8

Reading Assignment: from papers/notes, also WF 7.5 and 8.3

 

Area of study: combining multiple models; web analytics: analyzing hyperlink structure and content of websites

 

            May 12 and 13

Reading Assignment: TSK ch 9; notes

 

Area of study: web analytics (contd.): analyzing usage of websites; project presentations; course wrap-up; the future of data mining

 

Grading Information:

 

40% final project

25% written homework and critique

20% mid-term

10% brief presentation of research paper (groups of 2)

 5% class participation