Wang, Kraska, Franklin, Feng
VLDB'12
Presented by: Miki Shifman
ER (Entity Resolution) is a key topic in data integration & cleaning
The task is to find different records in a table that refer to the same entity
records r1 and r2 have different text but refer to the same entity
Measuring the similarity between records
Training a classifier to label every pair of records as DUPLICATE or NON-DUPLICATE
The techniques still remain far from perfect
Jaccard similarity is a common measure of similarity between two sets of tokens.
Formula: $$ J(A,B) = \frac{|A\cap B|}{|A\cup B|}$$
Example: $$J(r1, r2) = \frac{|\{iPad, 16GB, WiFi, White\}|}{|\{iPad, 16GB, WiFi, White, Two, 2nd, generation\}|} = 0.57$$
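To make the measure concrete, here is a minimal Python sketch of the computation; the record strings are a plausible reconstruction of the iPad example above and are illustrative only.

```python
def jaccard(a: set, b: set) -> float:
    """J(A, B) = |A intersect B| / |A union B|."""
    return len(a & b) / len(a | b)

# Token sets reconstructed from the example (illustrative strings):
r1 = set("iPad Two 16GB WiFi White".split())
r2 = set("iPad 2nd generation 16GB WiFi White".split())

print(round(jaccard(r1, r2), 2))  # 0.57 -- 4 shared tokens out of 7 total
```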
Recent human-based approaches use AMT (Amazon Mechanical Turk) as their platform
Combining machine-based algorithms with crowdsourcing
Attempts to gain the positives from both worlds
Proven effective in image search and machine translation
This is the approach used in the paper
Given a table with n records, the workflow is the following: machine-based techniques first compute pairwise similarities and prune the pairs that fall below a threshold; the surviving candidate pairs are batched into HITs; the HITs are published to the crowd (e.g. on AMT), and the workers' answers determine the duplicates.
HIT generation is a key component of the workflow, and we will show it can dramatically affect the results
As we've mentioned, HIT generation is a crucial part of the workflow. We will deal with two types of HIT generation (a small pair-based batching sketch follows the list):
Pair-based HIT generation
Cluster-based HIT generation
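For intuition, a minimal sketch of the pair-based variant, assuming the simplest strategy of batching candidate pairs into fixed-size HITs; the `pairs_per_hit` parameter is an illustrative name. The cluster-based variant, formalized next, groups records instead, so a single HIT can cover many pairs at once.

```python
from itertools import islice

def pair_based_hits(pairs, pairs_per_hit):
    """Batch the candidate pairs into HITs of at most `pairs_per_hit` pairs."""
    it = iter(pairs)
    while batch := list(islice(it, pairs_per_hit)):
        yield batch

pairs = [("r1", "r2"), ("r1", "r3"), ("r2", "r3"), ("r4", "r5")]
print(list(pair_based_hits(pairs, 2)))
# [[('r1', 'r2'), ('r1', 'r3')], [('r2', 'r3'), ('r4', 'r5')]]
```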
Given a set of pairs of records, P, and a cluster-size threshold, k, the cluster-based HIT generation problem is to generate the minimum number of cluster-based HITs, \(H_1, H_2, \ldots, H_h\), that satisfy two requirements: each HIT contains at most k records, and every pair in P appears together in at least one HIT.
The problem as stated is NP-hard, so we'll have to use approximation algorithms to solve it
The cluster-based HIT generation problem is NP-Hard
We will show that from a solution to our problem we can obtain a solution to the k-clique covering problem.
Based on the k-clique covering approximation algorithm by Goldschmidt et al.
Observation: any k-1 consecutive edges in SEQ, \(\{e_i, e_{i+1}, \ldots, e_{i+k-2}\}\), contain at most k distinct vertices.
Therefore, this algorithm can split the graph into \(\lceil \frac{|SEQ|}{k-1}\rceil\) cliques
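A sketch of the splitting step only, assuming SEQ is already given as a list of edges in which consecutive edges share a vertex; constructing SEQ itself (the heart of Goldschmidt et al.'s algorithm) is not shown here.

```python
def split_seq(seq, k):
    """Split SEQ into chunks of k-1 consecutive edges; by the observation
    above, each chunk touches at most k vertices, i.e. fits one clique/HIT."""
    cliques = []
    for i in range(0, len(seq), k - 1):
        vertices = {v for edge in seq[i:i + k - 1] for v in edge}
        cliques.append(vertices)
    return cliques

# A walk a-b-c-d-e with k = 3 yields ceil(4 / 2) = 2 cliques:
seq = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
print(split_seq(seq, 3))  # two cliques: {a, b, c} and {c, d, e}
```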
In experiments, the clique-covering algorithm proved inefficient (it generated many HITs), even in comparison to the naive algorithm
We'll use a greedy algorithm to partition each large connected component (LCC) into small connected components (SCCs)
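The slides don't spell the heuristic out, so the sketch below is one plausible greedy, not the paper's exact algorithm: grow each SCC up to k vertices, always adding the outside vertex with the most edges into the current component, then repeat on what's left.

```python
from collections import defaultdict

def greedy_partition(edges, k):
    """Partition the vertices of one LCC into components of at most k vertices."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    remaining = set(adj)
    components = []
    while remaining:
        comp = {remaining.pop()}          # seed a new small component
        while len(comp) < k:
            # Unassigned neighbours of the current component.
            cands = {v for u in comp for v in adj[u] if v in remaining}
            if not cands:
                break
            best = max(cands, key=lambda v: len(adj[v] & comp))
            comp.add(best)
            remaining.discard(best)
        components.append(comp)
    return components
```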
Given a set of SCCs, pack them into the minimum number of cluster-based HITs such that the size of each HIT is at most k.
We can formulate the problem as an integer linear program: $$ \min \quad \sum_{i=1}^{m}x_i \quad \text{s.t.} \quad \sum_{i=1}^{m} a_{ij} x_i \ge c_j \quad \forall j\in[1,k] $$ where each \(x_i \ge 0\) is an integer and \(c_j\) is the number of SCCs that contain j vertices; here \(x_i\) counts the HITs that use packing pattern i, and \(a_{ij}\) is the number of size-j SCCs that pattern i holds.
The objective minimizes \(\sum_{i=1}^{m}x_i\) (the number of HITs) while the constraints ensure every SCC is packed into some HIT.
The solution satisfies the constraints, e.g. for j = 2: \(\sum_{i=1}^{3} a_{i2} x_i = 0 \cdot 2 + 2 \cdot 0 + 1 \cdot 2 = 2 = c_2\)
The given solution is not optimal, as there is a better solution: \(x_1 = 2, x_2 = 1, x_3 = 0\)
Optimal solutions can be computed with Branch & Bound or Column Generation algorithms.
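As a minimal sketch, the program below solves the ILP with the PuLP library (its bundled CBC solver does the branch & bound), assuming the pattern reading described above: `a[i][j]` counts the size-j SCCs that pattern i holds. The helper names are illustrative.

```python
from pulp import LpMinimize, LpProblem, LpVariable, lpSum

def enumerate_patterns(k):
    """All multisets of SCC sizes 1..k whose total size fits in one size-k HIT."""
    def rec(size, budget):
        if size > k:
            yield ()
            return
        for n in range(budget // size + 1):
            for rest in rec(size + 1, budget - n * size):
                yield (n,) + rest
    return [p for p in rec(1, k) if any(p)]

def min_hits(scc_counts, k):
    """scc_counts[j-1] = c_j = number of SCCs with exactly j vertices."""
    patterns = enumerate_patterns(k)
    prob = LpProblem("hit_packing", LpMinimize)
    x = [LpVariable(f"x{i}", lowBound=0, cat="Integer")
         for i in range(len(patterns))]
    prob += lpSum(x)                       # minimize the number of HITs
    for j in range(1, k + 1):              # cover every SCC size class
        prob += lpSum(p[j - 1] * xi for p, xi in zip(patterns, x)) >= scc_counts[j - 1]
    prob.solve()
    return int(sum(xi.value() for xi in x))

# One size-3, two size-2 and one size-1 SCC, HITs of size k = 4 -> 2 HITs.
print(min_hits([1, 2, 1, 0], 4))
```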
People may be tempted to think that a cluster-based HIT with n records requires \(\frac{n \cdot (n-1)}{2}\) comparisons.
Let \(e_1, \ldots, e_m\) be the set of distinct entities. For each \(i\in[1,m]\), \(e_i\) contains all the distinct records that refer to entity \(e_i\)
The process of answering the cluster-based HIT consists of identifying each such entity.
Example: once \(e_1\) has been identified, identifying all the records of \(e_2\) requires \(n - 1 - |e_1|\) comparisons.
Combined, the process of identifying all entities requires \(\sum_{i=1}^{m}\left(n - 1 - \sum_{j=1}^{i-1} |e_j|\right)\) comparisons.
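A worked instance of the formula, with illustrative entity sizes: n = 10 records resolving into entities of sizes 4, 3, 2 and 1 need only 16 comparisons, versus n(n-1)/2 = 45 for exhaustive pairing.

```python
def hit_comparisons(entity_sizes):
    """Evaluate sum_i (n - 1 - sum_{j<i} |e_j|) for the given entity sizes."""
    n = sum(entity_sizes)
    covered, total = 0, 0
    for size in entity_sizes:
        total += n - 1 - covered   # compare one record of e_i to all the rest
        covered += size            # e_i's records are now identified
    return total

print(hit_comparisons([4, 3, 2, 1]))  # 16, not 45
```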
The workflow needs a machine-based technique for filtering out pairs that don't pass a certain similarity threshold.
The technique used in the experiments is called simjoin:
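A hedged sketch of the filtering semantics only: a naive nested-loop join that keeps pairs whose Jaccard similarity reaches the threshold. Real similarity-join implementations avoid the quadratic loop with prefix/length filtering, which is omitted here.

```python
from itertools import combinations

def jaccard(a, b):
    return len(a & b) / len(a | b)

def simjoin(records, threshold):
    """Return index pairs of records whose token Jaccard similarity >= threshold."""
    tokens = [set(r.split()) for r in records]
    return [(i, j) for i, j in combinations(range(len(records)), 2)
            if jaccard(tokens[i], tokens[j]) >= threshold]

records = ["iPad Two 16GB WiFi White", "iPad 2nd generation 16GB WiFi White"]
print(simjoin(records, 0.5))  # [(0, 1)]
```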
Methods used to improve the result quality:
The approximation algorithm performs worse than the random algorithm
simjoin
SVM: a state-of-the-art machine learning technique. The classifier was trained with two features, edit distance and cosine similarity, on 500 pairs randomly selected from the pairs whose Jaccard similarity was above 0.1
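A minimal sketch of such a baseline with scikit-learn, assuming stand-ins for the two features: difflib's ratio as an edit-distance proxy and a token-count cosine similarity. The tiny training set below is a placeholder, not the 500 sampled pairs from the experiments.

```python
import math
from collections import Counter
from difflib import SequenceMatcher
from sklearn.svm import SVC

def edit_sim(a, b):
    # difflib's ratio stands in for a normalized edit distance here.
    return SequenceMatcher(None, a, b).ratio()

def cosine_sim(a, b):
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) \
         * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def features(a, b):
    return [edit_sim(a, b), cosine_sim(a, b)]

# Placeholder training pairs labelled DUPLICATE (1) / NON-DUPLICATE (0):
train_pairs = [("iPad 2 16GB", "iPad 2nd gen 16GB"), ("iPad 2", "Kindle Fire")]
labels = [1, 0]

clf = SVC().fit([features(a, b) for a, b in train_pairs], labels)
print(clf.predict([features("iPad Two 16GB", "iPad 2 16GB")]))
```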
Many recent crowdsourcing studies
Data blocking