Executive Software Engineering Program
EE380L – Data Mining-SE
Spring 2008
Instructor: Joydeep Ghosh, Ph.D., Professor
Email address: ghosh@ece.utexas.edu
URL: http://www.ideal.ece.utexas.edu/~ghosh
Course Title and Description:
Many companies that gather huge amounts of electronic data have now begun applying data mining techniques to their data warehouses to discover and extract “hidden” patterns useful for making smart business decisions. Effective data mining requires an understanding of concepts from exploratory data analysis, pattern recognition, machine learning/AI, heterogeneous databases, parallel processing and data visualization, in addition to knowledge of the application domain. I will focus on basic techniques for data mining, including methods useful for analyzing information from the World Wide Web. Demos using the public-domain Java package WEKA, along with some demos using industrial-strength software (SAS), will be given, and some applications/case studies will be discussed. The course involves a mid-term exam, a paper presentation and a term project. There will be no final exam. New for this semester is a Matlab-based series of demos and notes for you to play with. Since many students may not know Matlab, this is an optional "value add" component; I will not give any assignments in Matlab.
Textbooks:
Author: Pang-Ning Tan, Michael Steinbach, and Vipin Kumar (TSK)
Title: Introduction to Data Mining
Publisher: Addison-Wesley (2005)
ISBN: 0-321-32136-7.
Author:
Title: Data Mining
Publisher: Morgan Kaufmann
ISBN: 0-12-088407-0
Course Expectations:
This course requires students to have very basic knowledge of Java. An undergraduate-level understanding of probability/statistics, data analysis and linear algebra is assumed. This is a graduate course, so the workload will be medium to heavy.
While studying techniques for database representation/modeling, clustering, classification, finding associations and sequence processing, emphasis will be placed on the issues of algorithm scalability, performance, interpretability and the ability to deal with garbage data. Student talks of 10-15 minutes will be interwoven with the lectures, depending on class size. The last two classes will largely consist of student term-project presentations, followed by active discussion.
Class outline:
Introduction – January 18 and 19
Reading Assignment: TSK ch 1-3; WF ch 1, 2
Area of study: overview, SAS demos, data warehousing, OLAP; data quality and pre-processing
February 15 and 16
Area of study: classification; finding association rules
March 14 and 15
Reading Assignment: TSK ch 8, 9; WF 3.9, 4.8, 6.6
Area of study: clustering/segmentation; market basket analysis.
April 11 and 12
Reading Assignment: from papers/notes
Area of study: prediction/forecasting
May 9 and 10
Reading Assignment: TSK ch 9; notes
Area of study: web analytics (contd): analyzing content and usage of web sites; project presentations; course wrap-up; the future of data mining.
Grading Information:
40% final project
25% written homework and critique
20% mid-term
10% brief presentation of research paper (groups of 2)
5% class participation