# Data Mining

Amos Fiat (fiat@tau.ac.il)
1st Semester, 2005/2006 - Tuesday 1500-1800
School of Computer Science,
Tel-Aviv University

Relevant Papers/Courses:

John Kleinberg: The Structure of Information Networks Computer Science (Class at Cornell)

Rajeev Motwani: Advanced Algorithms for Internet Applications (Class at Stanford)

Yishay Mansour: Machine Learning: Foundations (Class at Tel Aviv)

Moses Charikar: Algorithms for Massive Data Sets (Class at Princeton)

Avrim Blum: Machine learning theory (Class at CMU)

Manoel Mendonca, Nancy L. Sunderhaft: Mining Software Engineering Data: A Survey (Report)

Computing the Support Vector Machines in Maple, see examples of using the Lagrange Multiplier Method in Maplesoft.

• The Hub/Authority and PageRank algorithms
• Eigenvector/Singular Vector Analysis
• Association Rules
• Data Streams
• Clustering
• Privacy Preserving Data Mining

Homework Assignments 1:

Take the census-income dataset from the UCI machine learning depository and try to derive if the income is above \$50K using the other fields of the dataset.

Part 1: Perform an SVD of 1000 records, and take a low rank approximation (for various small values of the rank). Then, take an additional 1000 records (setting this coordinate to zero) and project the record onto the lower dimensional subspace spanned by the singular vectors. See how well this performs.

Part 2: Construct a decision tree for this problem using 1000 records to derive a tree and test it on the next 1000. Explain how you are constructing the tree, what you do about errors, etc. Read all about pruing the tree.

Part 3: Do itemset mining on this data and try to derive association rules whose consequent is the income field.

Part 4: Build a support vector machine for this dataset.

Final Project: Due March 10, 2006

• Read Rank Aggregation Methods for the Web by C. Dwork, R. Kumar, M. Naor and D. Sivakumar.
• Define
• Kemeny optimal aggregation
• Extended Condorcet criterion
• Footrule optimal aggregation
• Locally Kemeny optimal aggregation
• Borda ordering
• Scaled footrule aggregation
• What are the connections between the above? What can be computed efficiently? How?
• Read What's New: Finding Significant Differences in Network Data Streams by G. Cormode and S. Muthukrishnan.
• Define
• Absolute, relative and variational differences
• Exact deltoids
• Approximate deltoids
• In your own words, and explained as simply as possible, how does the paper test for absolute deltoids?
• Read A Divide-and-Merge Methodology for Clustering, by David Cheng, Ravi Kannan, Santosh Vempala, and Grant Wang.
• In your own words, explain the Divide and Merge approach.
• Describe the spectral partitioning algorithm used in the Divide phase.
• What can you say about the complexity of this method?
• Cite the relevant references in Privacy preserving Data mining.
• What are the open problems from the above papers?

• Test files parsed by Lior Kapelushnik:

Scribe notes of class of 2002: