We revisit a fundamental problem in string matching: given a pattern of length m and a text of length n, both over an alphabet of size sigma, compute the Hamming distance (i.e., the number of mismatches) between the pattern and the text at every location. Several randomized (1+eps)-approximation algorithms have been proposed in the literature (e.g., by Karloff (Inf. Proc. Lett., 1993), Indyk (FOCS 1998), and Kopelowitz and Porat (SOSA 2018)), with running time of the form O(eps^{-O(1)} n log n log m), all using fast Fourier transform (FFT). We describe a simple randomized (1+eps)-approximation algorithm that is faster and does not need FFT. Combining our approach with additional ideas leads to numerous new results (all Monte-Carlo randomized) in different settings:
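For reference, the problem stated at the start of this abstract can be pinned down with a brute-force baseline. The sketch below is not the algorithm of the paper and uses neither FFT nor sampling; it computes the exact Hamming distance at every alignment in O(nm) time. The function name `hamming_distances` is a hypothetical choice made for illustration.

```python
def hamming_distances(pattern: str, text: str) -> list[int]:
    """Exact Hamming distance between the pattern and every length-m window of the text.

    Brute-force baseline in O(n * m) time; returns an empty list when the
    pattern is longer than the text.
    """
    m, n = len(pattern), len(text)
    distances = []
    for i in range(n - m + 1):
        # Count positions where the pattern disagrees with the window text[i : i + m].
        mismatches = sum(1 for j in range(m) if pattern[j] != text[i + j])
        distances.append(mismatches)
    return distances
```

A (1+eps)-approximation algorithm would return, for each alignment, a value within a factor of 1+eps of the corresponding entry of this list.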
A set of strings, called a string dictionary, is a basic string data structure. The most primitive query, which asks whether a pattern exists in the dictionary, is called a lookup query. Approximate lookup, i.e., determining whether the dictionary contains the pattern up to a bounded number of errors, is a fundamental string problem. Several data structures have been proposed to do so efficiently. Almost all solutions consider a single error, as does this work. Recently, Belazzougui and Venturini (CPM 2013) raised the question of whether one can construct efficient indexes that support lookup queries with one error in optimal query time, that is, O(|p|/w + occ), where p is the query, w is the machine word size, and occ is the number of occurrences.
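To make the query semantics concrete, here is a minimal sketch of a lookup with one mismatch that simply scans the dictionary. It is a brute-force illustration, not the index discussed in this abstract: it takes time proportional to the total length of the candidate strings rather than O(|p|/w + occ). The name `lookup_one_mismatch` is hypothetical, and the sketch assumes that "one error" means at most one mismatch, so exact matches are reported as well.

```python
def lookup_one_mismatch(dictionary: list[str], p: str) -> list[str]:
    """Return every dictionary string that equals p up to at most one mismatch."""
    hits = []
    for s in dictionary:
        # A mismatch (substitution) never changes the length.
        if len(s) != len(p):
            continue
        mismatches = 0
        for a, b in zip(s, p):
            if a != b:
                mismatches += 1
                if mismatches > 1:
                    break  # too many errors, stop comparing this string
        if mismatches <= 1:
            hits.append(s)
    return hits
```

In this sketch, occ corresponds to the length of the returned list.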
Specifically, for the problem of one mismatch and constant alphabet size, we obtain optimal query time. For a dictionary of d strings our proposed index uses O(w d log^{1+eps}d) additional bit space (beyond the dictionary which can be maintained in compressed form). Our results are parameterized for a space-time tradeoff.
We also present results for lookup queries with one insertion or deletion on dictionaries over a constant-sized alphabet. These results are especially effective for long patterns.
We present a collection of new results on problems related to 3SUM, including:
Jumbled indexing is the problem of indexing a text T for queries that ask whether there is a substring of T matching a pattern represented as a Parikh vector, i.e., the vector of frequency counts for each character. Jumbled indexing has garnered a lot of interest in the last four years. There is a naive algorithm that preprocesses all answers in O(n^2 |Sigma|) time allowing quick queries afterwards, and there is another naive algorithm that requires no preprocessing but has O(n log |Sigma|) query time. Despite a tremendous amount of effort there has been little improvement over these running times.
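As an illustration of the no-preprocessing approach, the sketch below answers a single jumbled query by sliding a window over the text and maintaining the difference between the window's character counts and the queried Parikh vector. It is a hash-based variant written for clarity and is not claimed to match the O(n log |Sigma|) bound quoted above. The name `jumbled_match` and the representation of the Parikh vector as a dict from characters to counts are assumptions made for this example.

```python
from collections import defaultdict


def jumbled_match(text: str, parikh: dict[str, int]) -> bool:
    """Return True if some substring of text has exactly the given character counts."""
    m = sum(parikh.values())  # length of any matching substring
    n = len(text)
    if m == 0:
        return True  # the empty substring matches the all-zero vector
    if m > n:
        return False

    # diff[c] = (count of c in the current window) - (requested count of c)
    diff = defaultdict(int)
    for c, k in parikh.items():
        diff[c] -= k
    for c in text[:m]:
        diff[c] += 1
    nonzero = sum(1 for v in diff.values() if v != 0)
    if nonzero == 0:
        return True

    for i in range(m, n):
        # Slide the window: add text[i], drop text[i - m].
        for c, delta in ((text[i], 1), (text[i - m], -1)):
            if diff[c] == 0:
                nonzero += 1
            diff[c] += delta
            if diff[c] == 0:
                nonzero -= 1
        if nonzero == 0:
            return True
    return False
```

The other naive strategy mentioned above would run such a computation ahead of time for all substrings and store the answers, trading preprocessing time for fast queries.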
In this paper we provide good reason for this. We show that, under a 3SUM-hardness assumption, jumbled indexing for alphabets of size omega(1) requires Omega(n^{2-epsilon}) preprocessing time or Omega(n^{1-delta}) query time for any epsilon, delta > 0. In fact, under a stronger 3SUM-hardness assumption, for any constant alphabet size r >= 3 there exist describable fixed constants epsilon_r and delta_r such that jumbled indexing requires Omega(n^{2-epsilon_r}) preprocessing time or Omega(n^{1-delta_r}) query time.
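For readers unfamiliar with the underlying problem, 3SUM asks whether a list of n integers contains three entries (at distinct positions) that sum to zero. The sketch below is the textbook quadratic-time approach based on sorting and two pointers; it is included only to show the problem and is not taken from the paper. The name `has_3sum` is hypothetical. Hardness assumptions of the kind used above posit, roughly, that no algorithm improves on this quadratic running time by a polynomial factor.

```python
def has_3sum(nums: list[int]) -> bool:
    """Return True if three entries at distinct positions sum to zero (O(n^2) time)."""
    a = sorted(nums)
    n = len(a)
    for i in range(n - 2):
        lo, hi = i + 1, n - 1
        # Two-pointer scan over the sorted suffix for a pair summing to -a[i].
        while lo < hi:
            s = a[i] + a[lo] + a[hi]
            if s == 0:
                return True
            if s < 0:
                lo += 1  # sum too small, move the left pointer up
            else:
                hi -= 1  # sum too large, move the right pointer down
    return False
```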