Papers
On agnostic boosting and parity learning
Adam Tauman Kalai, Yishay Mansour and Elad Verbin
Presented at STOC '08
Available: [local ps]
The motivating problem is agnostically learning parity functions, i.e., parity with arbitrary or adversarial noise. Specifically, given random labeled examples from an *arbitrary* distribution, we would like to produce an hypothesis whose accuracy nearly matches the accuracy of the best parity function. Our algorithm runs in time 2O(n/log n), which matches the best known for the easier cases of learning parities with random classification noise (Blum et al, 2003) and for agnostically learning parities over the uniform distribution on inputs (Feldman et al, 2006).
Our approach is as follows. We give an agnostic boosting theorem that is capable of nearly achieving optimal accuracy, improving upon earlier studies (starting with Ben David et al, 2001). To achieve this, we circumvent previous lower bounds by altering the boosting model. We then show that the (random noise) parity learning algorithm of Blum et al (2000) fits our new model of agnostic weak learner. Our agnostic boosting framework is completely general and may be applied to other agnostic learning problems. Hence, it also sheds light on the actual difficulty of agnostic learning by showing that full agnostic boosting is indeed possible.
Boosting topic-based publish-subscribe systems with
dynamic clustering
Tova Milo, Tal
Zur and Elad Verbin
Proceedings of SIGMOD Conference 2007
Available: [here]
We consider in this paper a class of Publish-Subscribe (pub-sub) systems called topic-based systems, where users subscribe to topics and are notified on events that belong to those subscribed topics. With the recent flourishing of RSS news syndication, these systems are regaining popularity and are raising new challenging problems.
In most of the modern topics-based systems, the events in each topic are delivered to the subscribers via a supporting, distributed, data structure (typically a multicast tree). Since peers in the network may come and go frequently, this supporting structure must be continuously maintained so that "holes" do not disrupt the events delivery. The dissemination of events in each topic thus incurs two main costs: (1) the actual transmission cost for the topic events,and (2) the maintenance cost for its supporting structure. This maintenance overhead becomes particularly dominating when a pub-sub system supports a large number of topics with moderate event frequency; a typical scenario in nowadays news syndication scene.
The goal of this paper is to devise a method for reducing this maintenance overhead to the minimum. Our aim is not to invent yet another topic-based pub-sub system, but rather to develop a generic technique for better utilization of existing platforms. Our solution is based on a novel distributed clustering algorithm that utilizes correlations between user subscriptions to dynamically group topics together, into virtual topics (called topic-clusters), andt hereby unifies their supporting structures and reduces costs. Our technique continuously adapts the topic-clusters and the user subscriptions to the system state, and incurs only very minimal overhead. We have implemented our solution in the Tamara pub-sub system. Our experimental study shows this approach to be extremely effective, improving the performance by an order of magnitude.
Sorting and Selection in Posets
Constantinos Daskalakis, Richard M. Karp, Elchanan Mossel, Samantha Riesenfeld and Elad Verbin
Submitted.
Available: at [CoRR]
Classical problems of sorting and searching assume an underlying linear ordering of the objects being compared. In this paper, we study a more general setting, in which some pairs of objects are incomparable. This generalization is relevant in applications related to rankings in sports, college admissions, or conference submissions. It also has potential applications in biology, such as comparing the evolutionary fitness of different strains of bacteria, or understanding input-output relations among a set of metabolic reactions or the causal influences among a set of interacting genes or proteins. Our results improve and extend results from two decades ago of Faigle and Tur\'{a}n.
A measure of complexity of a partially ordered set (poset) is its width. Our algorithms obtain information about a poset by queries that compare two elements. We present an algorithm that sorts, i.e. completely identifies, a width w poset of size n and has query complexity O(wn + nlog(n)), which is within a constant factor of the information-theoretic lower bound. We also show that a variant of Mergesort has query complexity O(wn(log(n/w))) and total complexity O((w^2)nlog(n/w)). Faigle and Tur\'{a}n have shown that the sorting problem has query complexity O(wn(log(n/w))) but did not address its total complexity.
For the related problem of determining the minimal elements of a poset, we give efficient deterministic and randomized algorithms with O(wn) query and total complexity, along with matching lower bounds for the query complexity up to a factor of 2. We generalize these results to the k-selection problem of determining the elements of height at most k. We also derive upper bounds on the total complexity of some other problems of a similar flavor.
Most Burrows-Wheeler Based Compressors are Not Optimal
Haim Kaplan and Elad Verbin
Presented at CPM '07
Available: [local ps]
Extra Materials (maple code, etc.)
We present a technique for proving lower bounds on the compression ratio of algorithms which are based on the Burrows-Wheeler Transform (BWT). We study three well known BWT-based compressors: the original algorithm suggested by Burrows and Wheeler; BWT with distance coding; and BWT with run-length encoding. For each compressor, we show a Markov source such that for asymptotically-large text generated by the source, the compression ratio divided by the entropy of the source is a constant greater than 1. This constant is 2-epsilon, 1.26, and 1.29, for each of the three compressors respectively. Our technique is robust, and can be used to prove similar claims for most BWT-based compressors (with a few notable exceptions). This stands in contrast to statistical compressors and Lempel-Ziv-style dictionary compressors, which are long known to be optimal, in the sense that for any Markov source, the compression ratio divided by the entropy of the source asymptotically tends to 1.
We experimentally corroborate our theoretical bounds. Furthermore, we compare BWT-based compressors to other compressors and show that for "realistic" Markov sources they indeed perform bad and often worse than other compressors. This is in contrast with the well known fact that on English text, BWT-based compressors are superior to many other types of compressors.
Counting Colors in Boxes
Haim Kaplan, Natan Rubin, Micha Sharir and Elad Verbin
Proceedings of SODA '07
Available: [local pdf]
Let P be a set of n points in Rd, so that each point is colored by one of C given colors. We present algorithms for preprocessing P into a data structure that efficiently supports queries of the form: Given an axis-parallel box Q, count the number of distinct colors of the points of P∩Q. We present a general and relatively simple solution that has polylogarithmic query time and worst-case storage about O(nd). It is based on several interesting structural properties of the problem that we derive. We also show that for random inputs, the data structure requires almost linear expected storage.
We then present several techniques for achieving space-time tradeoff. In R2, the most efficient solution uses fast matrix multiplication in the preprocessing stage. In higher dimensions we obtain a tradeoff using simpler mechanisms. We give a reduction from matrix multiplication to the offline version of problem, which shows that in R2 our time-space tradeoffs are close to optimal in the sense that improving them substantially would improve the best exponent of matrix multiplication. Finally, we present a generalized matrix multiplication problem and show its intimate relation to counting colors in boxes in any dimension.
Compact Samples for Data Dissemination
Tova Milo, Assaf Sagi and Elad Verbin
J. Comput. Syst. Sci. 74(5): 697-720, 2008
Preliminary version was published in Proceedings of ICDT '07
Available: [local pdf of conference version], [journal version at publisher's website]
We consider data dissemination in a peer-to-peer network, where each user wishes to obtain some subset of the available information objects. In most of the modern algorithms for such data dissemination, the users periodically obtain samples of peer IDs (possibly with some summary of their content). They then use the samples for connecting to other peers and downloading data pieces from them. For a set O of information objects, we call a sample of peers, containing at least k possible providers for each object oÎO, a k-sample.
In order to balance the load, the k-samples should be fair, in the sense that for every object, its providers should appear in the sample with equal probability. Also, since most algorithms send fresh samples frequently, the size of the k-samples should be as small as possible, to minimize communication overhead. We describe in this paper two novel techniques for generating fair and small k-samples in a P2P setting. The first is based on a particular usage of uniform sampling and has the advantage that it allows to build on standard P2P uniform sampling tools. The second is based on non-uniform sampling and requires more particular care, but is guaranteed to generate the smallest possible fair k-sample. The two algorithms exploit available dependencies between information objects to reduce the sample size, and are proved, both theoretically and experimentally, to be extremely effective.
Matrix Tightness: A Linear-Algebraic Framework for
Sorting by Transpositions
Tzvika Hartman and Elad Verbin
Proceedings of String Processing and Information Retrieval (SPIRE) 2006
Available: [local ps]
PowerPoint presentation (openoffice compatible)
We study the problems of sorting signed permutations by reversals (SBR) and sorting unsigned permutations by transpositions (SBT), which are central problems in computational molecular biology. While a polynomial-time solution for SBR is known, the computational complexity of SBT has been open for more than a decade and is considered a major open problem.
In the first efficient solution of SBR, Hannenhalli and Pevzner used a graph-theoretic model for representing permutations, called the interleaving graph. This model was crucial to their solution. Here, we define a new model for SBT, which is analogous to the interleaving graph. Our model has some desirable properties that were lacking in earlier models for SBT. These properties make it extremely useful for studying SBT.
Using this model, we give a linear-algebraic framework in which SBT can be studied. Specifically, for matrices over any algebraic ring, we define a class of matrices called tight matrices. We show that an efficient algorithm which recognizes tight matrices over a certain ring, M, implies an efficient algorithm that solves SBT on an important class of permutations, called simple permutations. Such an algorithm is likely to lead to an efficient algorithm for SBT that works on all permutations.
The problem of recognizing tight matrices is also a generalization of SBR and of a large class of other "sorting by rearrangements'" problems, and seems interesting in its own right as well. We give an efficient algorithm for recognizing tight symmetric matrices over any field of characteristic 2. We leave as an open problem to find an efficient algorithm for recognizing tight matrices over the ring M.
A Simpler Analysis of Burrows-Wheeler Based Compression
Haim Kaplan, Shir Landau and Elad Verbin
received best paper award in Combinatorial Pattern Matching (CPM) 2006.
To be published in Theoretical Computer Science, special issue on The Burrows-Wheeler Transform and its Applications (expected publication February 2007)
(preliminary version published in CPM '06)
Available: [local ps]
PowerPoint presentation; Another powerpoint presentation
In this paper we present a new technique for worst-case analysis of compression algorithms which are based on the Burrows-Wheeler Transform. We deal mainly with the algorithm proposed by Burrows and Wheeler in their first paper on the subject, called bw0. This algorithm consists of the following three essential steps: 1) Obtain the Burrows-Wheeler Transform of the text, 2) Convert the transform into a sequence of integers using the move-to-front algorithm, 3) Encode the integers using Arithmetic code or any order-0 encoding (possibly with run-length encoding).
We achieve a strong upper bound on the worst-case compression ratio of this algorithm. This bound is significantly better than bounds known to date and is obtained via simple analytical techniques. Specifically, we show that for any input string s, and μ > 1, the length of the compressed string is bounded by μ|s|Hk(s) + log(ζ(μ))|s| + μgk + O(log n) where Hk is the k-th order empirical entropy, gk is a constant depending only on k and on the size of the alphabet, and
ζ(μ) = 1/1^ μ + 1/2^ μ + … is the standard zeta function. As part of the analysis we prove a result on the compressibility of integer sequences, which is of independent interest.
Finally, we apply our techniques to prove a worst-case bound on the compression ratio of a compression algorithm based on the Burrows-Wheeler Transform followed by distance coding, for which worst-case guarantees have never been given. We prove that the length of the compressed string is bounded by 1.7286|s|Hk(s) + gk + O(log n). This bound is better than the bound we give for bw0.
Colored Intersection Searching via Sparse Rectangular
Matrix Multiplication
Haim Kaplan, Micha Sharir and Elad Verbin
Proceedings of Symposium on Computational Geometry (SoCG) 2006
Available: [local ps]
PowerPoint presentations: 25 minutes; 40 minutes:
In a Batched Colored Intersection Searching Problem (CI), one is given a set of n geometric objects (of a certain class). Each object is colored by one of c colors, and the goal is to report all pairs of colors (c1,c2) such that there are two objects, one colored c1 and one colored c2, that intersect each other. We also consider the bipartite version of the problem, where we are interested in intersections between objects of one class with objects of another class (e.g., points and halfspaces).
In a Sparse Rectangular Matrix Multiplication Problem (SRMM), one is given an n1 by n2 matrix A and an n2 by n3 matrix B, each containing at most m non-zero entries, and the goal is to compute their product AB.
In this paper we present a technique for solving CI problems over a wide range of classes of geometric objects. The basic idea is first to use some decomposition method, such as geometric cuttings, to represent the intersection graph of the objects as a union of bi-cliques. Then, in each of these bi-cliques, contract all vertices of the same color. Finally, use an algorithm for sparse matrix multiplication (adapted from Yuster and Zwick) to compute the union of the bi-cliques. We apply the technique to segments in R1, to segments in R2, to points and halfplanes in R2, and, more generally, to points and halfspaces in Rd, for any fixed d. However, the technique extends to colored intersection searching in any class (or pair of classes) of geometric objects of constant descriptive complexity.
In particular, using our technique we obtain an algorithm that reports all the pairs of intersecting colors for n points and n halfplanes in R2, that are colored by c colors, in O(n4/3c0.46) time when n≥c1.44, and in O(n1.04c0.9+c2) time when n≤c1.44.
The algorithms that we give for CI use the algorithm for SRMM as a black box, which means that any improved algorithm for SRMM immediately leads to an improved algorithm for all colored intersection problems that our method applies to. We also show that the complexity of computing all intersecting colors in a set of segments on the real line is identical, up to a polylogarithmic multiplicative factor, to the complexity of SRMM with the appropriate parameters.
Sorting Signed Permutations by Reversals, Revisited
Haim Kaplan and Elad Verbin
Journal of Computer and System Sciences, Volume 70, Issue 3, May 2005, Pages 321-341
(Preliminary version published in CPM '03)
The problem of sorting signed permutations by reversals (SBR) is a fundamental problem in computational molecular biology. The goal is, given a signed permutation, to find a shortest sequence of reversals that transforms it into the positive identity permutation, where a reversal is the operation of taking a segment of the permutation, reversing it, and flipping the signs of its elements.
In this paper we describe a randomized algorithm for SBR. The algorithm tries to sort the permutation by repeatedly performing a random oriented reversal. This process is in fact a random walk on the graph where permutations are the nodes and an arc from π to π' corresponds to an oriented reversal that transforms π to π'. We show that if this random walk stops at the identity permutation, then we have found a shortest sequence. We give empirical evidence that this process indeed succeeds with high probability on a random permutation.
To implement our algorithm we describe a data structure to maintain a permutation, that allows to draw an oriented reversal uniformly at random, and perform it in sub-linear time. With this data structure we can implement the random walk in O(n3/2log1/2n) time, thus obtaining an algorithm for SBR that almost always runs in sub-quadratic time. The data structures we present may also be of independent interest for developing other algorithms for SBR, and for other problems.
Finally, we present the first efficient parallel algorithm for SBR. We obtain this result by developing a fast implementation of the recent algorithm of Bergeron for sorting signed permutations by reversals that is parallelizable. Our implementation runs in O(n2logn) time on a regular RAM, and in O(nlogn) time on a PRAM using n processors.
On the Complexity of Cell Flipping in Permutation
Diagrams, and Multiprocessor Scheduling Problems
Martin Charles Golumbic, Haim Kaplan and Elad Verbin
Discrete Mathematics, Volume 296, Issue 1, 28 June 2005, Pages 25-41
Available here
Permutation diagrams have been used in circuit design to model a set of single point nets crossing a channel, where the minimum number of layers needed to realize the diagram equals the clique number ω(G) of its permutation graph, the value of which can be calculated in O(nlogn) time. We consider a generalization of this model motivated by “standard cell” technology in which the numbers on each side of the channel are partitioned into consecutive subsequences, or cells, each of which can be left unchanged or flipped (i.e., reversed). We ask, for what choice of flippings will the resulting clique number be minimum or maximum. We show that when one side of the channel is fixed (no flipping), an optimal flipping for the other side can be found in O(nlogn) time for the maximum clique number, and that when both sides are free this can be solved in O(n2) time. We also prove NP-completeness of finding a flipping that gives a minimum clique number, even when one side of the channel is fixed, and even when the size of the cells is restricted to be less than a small constant. Moreover, since the complement of a permutation graph is also a permutation graph, the same complexity results hold for the stable set (independence) number. In the process of the NP-completeness proof we also prove NP-completeness of a restricted variant of a scheduling problem. This new NP-completeness result may be of independent interest.
Between Genome Rearrangments and Delta-Matroids
Haim Kaplan and Elad Verbin
In preparation.
Some
Powerpoint Presentations that are not about my work
Powerpoint
presentation on Asymmetric Communication Complexity
Powerpoint
presentation on Graph Decompositions (given 28/3/04 at the seminar of Asaf
Nachmias)