CrowdER: Crowdsourcing Entity Resolution

Wang, Kraska, Franklin, Feng
VLDB'12


Presented by: Miki Shifman

Presenter Notes

Introduction

Presenter Notes

What is ER?

  • ER (Entity Resolution) is a key topic in data integration & cleaning

  • The task is to find different records in a table that refer to the same entity

Examples

  • \(O_2\) == Oxygen
  • Michael Shifman == Miki Shifman
  • Benjamin Netanyahu == Bibi

Presenter Notes

Data sources

Presenter Notes

Example

Products Table

Records r1 and r2 have different text, but refer to the same entity

Presenter Notes

Machine Learning Techniques

  • Measuring the similarity between records

  • Training a classifier to label every pair of records as DUPLICATE or NON-DUPLICATE

  • These techniques still remain far from perfect

Presenter Notes

Jaccard Similarity

  • Jaccard similarity is a common measure of the similarity between 2 sets of tokens.

  • Formula: $$ J(A,B) = \frac{|A\cap B|}{|A\cup B|}$$

  • Example: $$J(r1, r2) = \frac{|\{iPad, 16GB, WiFi, White\}|}{|\{iPad, 16GB, WiFi, White, Two, 2nd, generation\}|} = 0.57$$
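The example can be reproduced in a few lines of Python (token lists taken from the r1/r2 records above):

```python
def jaccard(a, b):
    """Jaccard similarity between two token sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Tokens of the two iPad records from the example
r1 = ["iPad", "Two", "16GB", "WiFi", "White"]
r2 = ["iPad", "2nd", "generation", "16GB", "WiFi", "White"]
print(round(jaccard(r1, r2), 2))  # 4 shared tokens / 7 distinct -> 0.57
```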

Presenter Notes

Human-based approaches

Latest human-based approaches use AMT (Amazon Mechanical Turk) as their platform

Various Systems

CrowdSQL

  • select * from Product p, Product q where p.product_name ~= q.product_name
  • Comparing all pairs in a table results in \(O(n^2)\) comparisons

Qurk (from Slava's lecture)

  • Reduces comparisons by batching
  • If a HIT consists of k records, the number of HITs can be reduced to \(O(\frac{n^2}{k})\) with batching or \(O(\frac{n^2}{k^2})\) with 'Smart Batching'. For a table of 10,000 records with k=20, the system generates 5M / 250K HITs, which cost $50K / $2,500 respectively
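The slide's figures can be checked directly; the per-HIT price of $0.01 used below is not stated explicitly but is implied by the totals:

```python
n, k, cost_per_hit = 10_000, 20, 0.01  # $0.01/HIT inferred from the slide's totals

batched = n**2 // k     # k pairs per HIT
smart = n**2 // k**2    # k x k comparisons per HIT ('Smart Batching')

print(batched, smart)  # 5000000 250000
print(round(batched * cost_per_hit), round(smart * cost_per_hit))  # 50000 2500
```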

Presenter Notes

Hybrid Approach

  • Combining machine-based algorithms with crowdsourcing

  • Attempts to gain the positives from both worlds

  • Proven as effective in image search and machine translation

  • The approach that is used in the paper

Presenter Notes

Hybrid algorithm workflow

Given a table with n records, the workflow is the following:

  1. Initially there are \(\frac{n \cdot (n-1)}{2}\) pairs, and most pairs are likely dissimilar
  2. We define a threshold for a machine-based algorithm, and prune the pairs that do not reach the threshold
  3. We generate HITs out of the remaining pairs, to reach a decision

HIT generation is a key component of the workflow, and we will show it can affect the results dramatically

Presenter Notes

Hybrid algorithm workflow

Workflow

Presenter Notes

HIT Generation

As we've mentioned, HIT generation is a crucial part of the workflow. We will deal with 2 types of HIT generation:

  • Pair-based HIT generation

  • Cluster-based HIT generation

Presenter Notes

Pair-based HIT Generation

  • Multiple record pairs batched into a single query
  • Suppose a pair-based HIT can contain at most k pairs; then we generate \(\lceil{\frac{|P|}{k}}\rceil\) HITs.

Pair-based Hits

Presenter Notes

Cluster-based HIT Generation

  • Each HIT consists of individual records
  • Workers are asked to find all duplicate records in the group

Cluster-based HITs

Presenter Notes

Cluster-based HIT Generation

Formal definition

Given a set of pairs of records, P, and a cluster-size threshold, k, the cluster-based HIT generation problem is to generate the minimum number of cluster-based HITs, \(H_1, H_2, \ldots, H_h\), that satisfy two requirements:

  1. \(|H_l| \le k\) for any \(l \in [1, h]\), where \(|H_l|\) denotes the number of records in \(H_l\);
  2. \(\forall (r_i, r_j) \in P\), there exists \(l \in [1, h]\) s.t. \(r_i \in H_l\) and \(r_j \in H_l\).

The problem as stated is NP-Hard, so we'll have to use approximation algorithms to solve it
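The two requirements translate directly into a small checker (a sketch; HITs are modelled as sets of record ids and P as a set of pairs):

```python
def valid_generation(hits, pairs, k):
    """Check the two requirements of cluster-based HIT generation:
    (1) every HIT holds at most k records;
    (2) every pair in P co-occurs in at least one HIT."""
    size_ok = all(len(h) <= k for h in hits)
    cover_ok = all(any(ri in h and rj in h for h in hits) for ri, rj in pairs)
    return size_ok and cover_ok

pairs = {(1, 2), (2, 3), (4, 5)}
print(valid_generation([{1, 2, 3}, {4, 5}], pairs, k=3))  # True
print(valid_generation([{1, 2}, {4, 5}], pairs, k=3))     # False: (2, 3) uncovered
```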

Presenter Notes

Cluster-based HIT Generation

Claim

The cluster-based HIT generation problem is NP-Hard

Proof

We will show that a solution to the k-clique covering problem can be obtained from a solution to our problem.

  • Let \(H_1,..,H_h\) be the solution to the Cluster-based HIT generation problem
  • h is the minimal number of cliques (of size \(\le k\)) covering the graph
  • For each \(H_i\) we generate a clique \(C_i\) in the graph, and complete each \(C_i\) to a k-clique by adding \(k - |H_i|\) new vertices

Presenter Notes

Transformation of records into a graph

records to graph

Presenter Notes

Approximation Algorithm

Based on the k-clique covering approximation algorithm by Goldschmidt et al.

  1. SEQ = []
  2. Pick up a random vertex v
  3. Add v and all its edges to SEQ and remove them from the graph
  4. Stop if no vertices or edges left, else repeat from 2

Observation - for every k-1 consecutive elements in SEQ, the edges in the subsequence \(\{e_i, e_{i+1}, \ldots, e_{i+k-2}\}\) contain at most k different vertices.

Therefore, this algorithm can split the graph into \(\lceil \frac{|SEQ|}{k-1}\rceil\) cliques
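The steps above can be sketched as follows (assuming an undirected graph given as an adjacency dict; the splitting step simply chunks SEQ into groups of k-1 edges):

```python
import random

def edge_sequence(adj):
    """Build SEQ: repeatedly pick a random vertex, append all its
    remaining edges to SEQ, and remove them from the graph."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    seq = []
    while any(adj.values()):
        v = random.choice([u for u in adj if adj[u]])
        for u in list(adj[v]):
            seq.append((v, u))
            adj[v].discard(u)
            adj[u].discard(v)
    return seq

def split_into_hits(seq, k):
    """Chunk SEQ into groups of k-1 edges; each chunk's vertices form one HIT."""
    return [{v for e in seq[i:i + k - 1] for v in e}
            for i in range(0, len(seq), k - 1)]

adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: {5}, 5: {4}}
seq = edge_sequence(adj)
hits = split_into_hits(seq, k=3)
# every edge's two endpoints land in the same HIT
```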

This algorithm proved inefficient in the experiments (it generated many HITs), even in comparison to the naive algorithm

Presenter Notes

Two-Tiered Approach

Presenter Notes

Motivation

  • Similar to the approximation algorithm, but takes the size of the connected components relative to the cluster size (k) into account
  • We define:
    • LCC (Large Connected Component): a component with more than k vertices
    • SCC (Small Connected Component): a component with k vertices or fewer
  • We would like to create SCCs that are highly connected, in order to increase the number of edges covered by each component
  • LCCs must be partitioned into SCCs
  • We would like to batch several SCCs into a single HIT when possible

Presenter Notes

Design

records to graph

Presenter Notes

LCC Partitioning (Top Tier)

We'll use a greedy algorithm to partition the LCC into SCCs

  • Given an LCC, initialize a set scc with the vertex of maximal degree in the LCC
  • For each vertex in the LCC (not yet in scc), we compute its indegree (# of edges to vertices in scc) and outdegree (# of edges to vertices not in scc)
  • We add to scc the vertex with the maximum indegree. In case of a tie, we pick the one with the minimum outdegree
  • The algorithm runs until scc is of size k, or no remaining vertices connect to scc
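The greedy step can be sketched as follows (assuming the LCC is given as an adjacency dict; the outer loop, which repeats the procedure until every vertex is placed, is my addition for completeness):

```python
def partition_lcc(adj, k):
    """Greedily carve an LCC into SCCs of at most k vertices: seed with the
    max-degree vertex, then repeatedly add the candidate with maximum
    indegree (# edges into scc), breaking ties by minimum outdegree."""
    remaining = set(adj)
    sccs = []
    while remaining:
        seed = max(remaining, key=lambda v: len(adj[v] & remaining))
        scc = {seed}
        remaining.discard(seed)
        while len(scc) < k:
            candidates = [v for v in remaining if adj[v] & scc]
            if not candidates:
                break
            best = max(candidates,
                       key=lambda v: (len(adj[v] & scc),    # max indegree
                                      -len(adj[v] - scc)))  # then min outdegree
            scc.add(best)
            remaining.discard(best)
        sccs.append(scc)
    return sccs

adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"},
       "d": {"a", "e"}, "e": {"d"}}
sccs = partition_lcc(adj, k=3)  # disjoint SCCs of size <= 3 covering the LCC
```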

Presenter Notes

LCC Partitioning (Top Tier)

Top tier example

Presenter Notes

SCC Packing (Bottom Tier)

The problem

Given a set of SCCs, pack them into the minimum number of cluster-based HITs s.t. the size of each HIT is not larger than k.

  • Let \(p = \{a_1, a_2, \ldots, a_k\}\) denote a pattern of a cluster-based HIT, where \(a_j\) (\(j \in [1, k]\)) is the number of SCCs in the HIT that contain j vertices
  • A pattern is feasible only if \(\sum_{j=1}^{k} j \cdot a_j \le k\)
  • Denote by \(\{p_1, \ldots, p_m\}\) the set of feasible patterns, where each \(p_i = \{a_{i1}, \ldots, a_{ik}\}\)
  • Let \(x_i\) denote the number of cluster-based HITs whose pattern is \(p_i\)

Presenter Notes

SCC Packing (Bottom Tier)

We can formulate the problem as an integer linear program: $$ \min \quad \sum_{i=1}^{m}x_i \quad s.t. \quad \sum_{i=1}^{m} a_{ij} x_i \ge c_j \quad \forall j\in[1,k] $$ where each \(x_i \ge 0\) is an integer, and \(c_j\) is the number of SCCs that contain j vertices.

The objective minimizes \(\sum_{i=1}^{m}x_i\) (the number of HITs), while the constraints ensure every SCC is contained in some HIT.

Presenter Notes

SCC Packing - Example

  • Given a set of SCCs: {r3, r4, r5, r6}, {r1, r2, r3, r7}, {r4, r7} and {r8, r9}, and k = 4
  • \(c_1 = 0, c_2 = 2, c_3 = 0, c_4 = 2\)
  • A possible solution is: \(x_1 = 2, x_2 = 0, x_3 = 2\), giving \(\sum_{i=1}^{3}x_i = 4\)
  • The solution satisfies the constraints, e.g. for j = 2: \(\sum_{i=1}^{3} a_{i2} x_i = 0 \cdot 2 + 2 \cdot 0 + 1 \cdot 2 = 2 = c_2\)

  • This solution is not optimal, as there is a better one: \(x_1 = 2, x_2 = 1, x_3 = 0\)

  • The optimal solution can be computed with Branch & Bound or Column Generation algorithms.
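For an instance this small, the ILP can also be brute-forced; the sketch below (my verification, not the paper's solver) confirms that 3 HITs are optimal for this example:

```python
from itertools import combinations_with_replacement, product

def feasible_patterns(k):
    """All nonzero patterns (a_1, ..., a_k) with sum_j j * a_j <= k."""
    pats = []
    for a in product(*(range(k // j + 1) for j in range(1, k + 1))):
        if 0 < sum(j * a[j - 1] for j in range(1, k + 1)) <= k:
            pats.append(a)
    return pats

def min_hits(c, k):
    """Smallest h such that h feasible patterns cover all SCC counts c_j."""
    pats = feasible_patterns(k)
    h = 1
    while True:
        for combo in combinations_with_replacement(pats, h):
            if all(sum(p[j] for p in combo) >= c[j] for j in range(k)):
                return h
        h += 1

print(min_hits((0, 2, 0, 2), k=4))  # 3: two size-4 SCC HITs + one HIT with both size-2 SCCs
```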

Presenter Notes

Cluster-based HITs properties

People may be tempted to think a cluster-based HIT requires \(\frac{n \cdot (n-1)}{2}\) comparisons for n records in the HIT.
Let \(e_1, \ldots, e_m\) be the set of distinct entities. For each \(i \in [1, m]\), \(e_i\) contains all the distinct records that refer to entity \(e_i\)

The process of answering the cluster-based HIT consists of identifying each such entity.

Example: identifying all the records of \(e_2\), after \(e_1\) has been identified, requires \(n - 1 - |e_1|\) comparisons.

Combined, the process of identifying all entities requires: (1) \(\sum_{i=1}^{m}{(n - 1 - \sum_{j=1}^{i-1} |e_j|)}\)

Observations:

  • The number of comparisons (the value of equation (1)) decreases as the \(|e_i|\), \(i \in [1, m]\), increase
  • Equation (1) can be rewritten as: \((n-1)\cdot m - \sum_{i=1}^{m-1}(m - i) \cdot |e_i|\)
    • \((n-1)\cdot m\) is constant, and (m - i) decreases as i increases. Therefore, it's best if entities are identified in decreasing order of size (largest first)
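Equation (1) is easy to evaluate directly, and shows the effect of identification order (a sketch; `entity_sizes` lists the \(|e_i|\) in identification order):

```python
def comparisons(entity_sizes):
    """Evaluate equation (1): identifying entity e_i costs
    n - 1 - (# records already identified) comparisons."""
    n = sum(entity_sizes)
    total, identified = 0, 0
    for size in entity_sizes:
        total += n - 1 - identified
        identified += size
    return total

print(comparisons([3, 2, 1]))  # 7: largest entity identified first
print(comparisons([1, 2, 3]))  # 11: largest entity identified last
```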

Presenter Notes

Cluster-based HITs properties

Illustration of the series of comparisons

Cluster-based HITs

Presenter Notes

Experiments

Presenter Notes

Experimental Setup

  • Restaurant Dataset
    • 858 records
    • (858 * 857) / 2 = 367,653 pairs
    • 106 pairs refer to the same entity
    • 4 attributes: [name, address, city, type]
    • Example record: ["oceana", "55 e. 54th st.", "new york", "seafood"].
  • Product Dataset
    • 1081 records from abt + 1092 records from buy
    • 1081 * 1092 = 1,180,452 pairs of records
    • 1097 pairs refer to the same entity
    • 2 attributes: [name, price]
    • Example record: ["Apple 8GB Black 2nd Generation iPod Touch - MB528LLA", "$229.00"].

Presenter Notes

Machine-based Technique

The workflow needs a machine-based technique for filtering pairs that don't pass a certain threshold.

The technique used in the experiments is called simjoin:

  • Split each record into a token list that consists of the tokens of all its attributes
  • Compute the Jaccard similarity between every pair of token lists
  • Crowdsource only the pairs that pass a certain threshold
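The three steps can be sketched as follows (the sample rows mimic the Restaurant records; the attribute values and threshold are illustrative, not the paper's data):

```python
def simjoin(records, threshold):
    """Tokenize each record's attributes, compute pairwise Jaccard
    similarity, and keep only pairs at or above the threshold."""
    tokens = [set(" ".join(map(str, r)).lower().split()) for r in records]
    kept = []
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            sim = len(tokens[i] & tokens[j]) / len(tokens[i] | tokens[j])
            if sim >= threshold:
                kept.append((i, j, sim))
    return kept

rows = [("oceana", "55 e. 54th st.", "new york", "seafood"),
        ("oceana restaurant", "55 east 54th street", "new york", "seafood"),
        ("via veneto", "1 main st.", "boston", "italian")]
print(simjoin(rows, threshold=0.3))  # only the two oceana records survive
```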

Presenter Notes

Likelihood results on the tables

Presenter Notes

Amazon Mechanical Turk

  • Workers are paid $0.02 for each HIT, plus $0.005 paid to Amazon for publishing the HIT
  • Over 8,000 HITs were executed, at a total cost of $600
  • All HITs were published between 18:00 and 24:00

Methods that were used to improve the result quality:

  • Each HIT was given to 3 different workers, and the result of the HIT was combined from all three assignments
  • A qualification test for workers, consisting of 3 pairs of records

Presenter Notes

Cluster-based HIT Generation Techniques

  • Random: generate cluster-based HITs randomly out of the record pairs
  • BFS: start at a vertex, and run BFS until a cluster of k records is completed (then repeat to find the next one)
  • DFS: same as the previous method, but with DFS
  • Approximation algorithm: the k-clique covering approximation algorithm described earlier

Presenter Notes

Fixed cluster size experiment

HIT Generation experiments - fixed cluster size (10)

The approximation algorithm performs worse than the random algorithm

Presenter Notes

Various cluster sizes experiment

HIT Generation experiments - various cluster size

Presenter Notes

Entity Resolution techniques comparison

The metrics that are used to evaluate results

  • Precision: the percentage of correctly identified matching pairs out of all pairs identified as matches
  • Recall: the percentage of correctly identified matching pairs out of all matching pairs in the dataset
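Both metrics reduce to set operations over matching pairs (a sketch with made-up pair ids):

```python
def precision_recall(predicted, truth):
    """Precision = TP / # predicted matches; Recall = TP / # true matches."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)  # correctly identified matching pairs
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall

p, r = precision_recall(predicted={(1, 2), (3, 4), (5, 6)},
                        truth={(1, 2), (3, 4), (7, 8)})
print(p, r)  # 2 of 3 predictions correct, 2 of 3 true matches found
```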

The techniques

  • simjoin

  • SVM: a state-of-the-art machine learning technique. The classifier was trained with 2 features, edit distance and cosine similarity, using 500 pairs randomly selected from the pairs whose Jaccard similarities were above 0.1

Presenter Notes

Presenter Notes

Pair-based vs. Cluster-based

  • An additional dataset is used: Product+Dup, with more matching pairs than the previous datasets
    • It is generated by adding 0-9 dummy matching records to each record in the original Product dataset
    • It has 157,641 pairs of records, of which 1713 pairs are matches
  • The likelihood threshold was set to 0.2
  • The cluster size was set to k = 10
  • For the Product dataset:
    • There were 8315 pairs that needed to be crowdsourced, which resulted in 508 cluster-based HITs
    • 8315/508 pairs were placed in each pair-based HIT in order to match the number of HITs

Presenter Notes

Presenter Notes

Presenter Notes

Related Work

  • Many recent crowdsourcing research projects

    • CrowdDB
    • CrowdSearch
    • Qurk
    • ...
  • Data blocking

    • Partitioning a table so that matching pairs co-occur in the same partitions
    • Differs from our problem: here the size of a block is bounded by the amount of work a turker will do for the payment

Presenter Notes

Future Work

  • UI improvements both to pair-based and cluster-based HITs
  • Budget-based approach (trading off latency, cost, and quality)
  • Make better use of machine-based techniques, to offload crowd resources
  • Take privacy into consideration, in order to be able to process confidential data

Presenter Notes

Thank You!

Presenter Notes