Data Mining
Amos Fiat (fiat@tau.ac.il)
1st Semester, 2005/2006  Tuesday 15001800
School of Computer Science,
TelAviv University
Relevant Papers/Courses:
John Kleinberg: The Structure of Information Networks
Computer Science (Class at Cornell)
Rajeev Motwani: Advanced Algorithms for Internet Applications (Class at Stanford)
Yishay Mansour: Machine Learning: Foundations (Class
at Tel Aviv)
Moses Charikar: Algorithms for Massive Data Sets
(Class
at Princeton)
Avrim Blum: Machine learning theory
(Class
at CMU)
Manoel Mendonca, Nancy L. Sunderhaft: Mining Software Engineering Data: A Survey
(Report)
Computing the Support Vector Machines in Maple, see examples of using the Lagrange Multiplier Method in Maplesoft.
 The Hub/Authority and PageRank algorithms
 Eigenvector/Singular Vector Analysis

 Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W. and
Harshman, R. A.
Indexing by latent semantic analysis. Journal of the Society for
Information Science, 41(6), 391407 (1990).

 Christos Papadimitriou, Prabhakar Raghavan Hisao Tamaki, Santosh
Vempala. Latent Semantic
Indexing: A Probabilistic Analysis. 17th ACM Symposium on the Principles
of Database Systems, 1998.

 D. Gibson, J. Kleinberg, P. Raghavan.
Inferring Web communities from link topology. Proc. 9th ACM Conference
on Hypertext and Hypermedia, 1998.

 P. Drineas, Ravi Kannan, Alan Frieze, Santosh Vempala and V. Vinay
"Clustering
in large graphs and matrices." Proc. of the 10th ACMSIAM Symposium
on Discrete Algorithms, Baltimore, 1999.

 Yossi Azar, Amos Fiat, Anna Karlin, Frank McSherry and Jared Saia.
Spectral Analysis
of Data. 33rd ACM Symposium on Theory of Computing, 2001.

 Dimitris Achlioptas, Amos Fiat, Anna Karlin, Frank McSherry,
Web Search via Hub Synthesis. 42nd IEEE Symposium on Foundations
of Computer Science, 2001, p.611618.
 Association Rules
 Data Streams

 Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, Jennifer
Widom,
Models and Issues in Data Stream Systems, Proceedings of the 21st
ACM Symposium on Principles of Database Systems, 2002.

 M. Datar, A. Gionis, P. Indyk, AND R. Motwani, Maintaiing
Stream Statistics Over Sliding Windows., SICOMP, 2002, Vol. 31,
No. 6, pp. 1794–1813.

 Brian Babcock, Mayur Datar, Rajeev Motwani, and Liadan O'Callaghan,
Maintaining
Variance and kMedians over Data Stream Windows, Proceedings of
the 22nd ACM Symposium on Principles of Database Systems, 2003.

 See Yossi Matias list
of publications on massive Data Sets
 Clustering
 Privacy Preserving Data Mining
Homework Assignments 1:
Take the censusincome
dataset from the UCI machine learning depository
and try to derive if the income is above $50K using the other fields of
the dataset.
Part 1: Perform an SVD of 1000 records, and take a low rank approximation
(for various small values of the rank). Then, take an additional 1000 records
(setting this coordinate to zero) and project the record onto the lower
dimensional subspace spanned by the singular vectors. See how well this
performs.
Part 2: Construct a decision tree for this problem using 1000 records to
derive a tree and test it on the next 1000. Explain how you are constructing
the tree, what you do about errors, etc. Read all about pruing the tree.
Part 3: Do itemset mining on this data and try to derive association rules
whose consequent is the income field.
Part 4: Build a support vector machine for this dataset.
Final Project: Due March 10, 2006
 Read Rank Aggregation
Methods for the Web by C. Dwork, R. Kumar, M. Naor and D. Sivakumar.
 Define
 Kemeny optimal aggregation
 Extended Condorcet criterion
 Footrule optimal aggregation
 Locally Kemeny optimal aggregation
 Borda ordering
 Scaled footrule aggregation
 What are the connections between the above? What can be computed efficiently?
How?
 Read What's
New: Finding Significant Differences in Network Data Streams by G.
Cormode and S. Muthukrishnan.
 Define
 Absolute, relative and variational differences
 Exact deltoids
 Approximate deltoids
 In your own words, and explained as simply as possible, how does
the paper test for absolute deltoids?
 Read A
DivideandMerge Methodology for Clustering, by David Cheng, Ravi Kannan, Santosh Vempala, and Grant Wang.
 In your own words, explain the Divide and Merge approach.
 Describe the spectral partitioning algorithm used in the Divide phase.
 What can you say about the complexity of this method?
 Cite the relevant references in Privacy preserving Data mining.
 What are the open problems from the above papers?
 Test files parsed by Lior Kapelushnik:
Scribe notes
of class of 2002:
 Scribe
notes, Lecture 1
 Scribe
notes, Lecture 2
 Scribe
notes, Lecture 3
 Scribe
notes, Lecture 4
 Scribe
notes, Lecture 5
 Scribe
notes, Lecture 6
 Scribe
notes, Lecture 7
 Scribe
notes, Lecture 8
 Scribe
notes, Lecture 9
 Scribe
notes, Lecture 10