Volume: 16, Issue: 6(2005)
pp. 1121-1134 DOI: 10.1142/S0129054105003698
|
|
Abstract |
Full Text (PDF, 762KB)
|
References
|
 |
| Title: |
BDD-BASED ANALYSIS OF GAPPED q-GRAM FILTERS |
| Author(s): |
MARC FONTAINE Max-Planck-Institut für Informatik,
Stuhlsatzenhausweg 85, 66123 Saarbrücken, GermanySTEFAN BURKHARDT Google Inc, 1600 Amphitheatre Parkway,
94043 Mountain View, CA, USAJUHA KÄRKKÄINEN Department of Computer Science,
P.O. Box 68 (Gustaf Hällströmin katu 2 B), FI-00014
University of Helsinki, Finland
|
| History: |
Received 18 October 2004 Accepted 25 February 2005
|
| Abstract: |
Recently, there has been a surge of interest in gapped q-gram filters
for approximate string matching. Important design parameters for filters are for
example the value of q, the filter-threshold and in particular the shape
(aka seed) of the filter. A good choice of parameters can improve the performance
of a q-gram filter by orders of magnitude and optimizing these parameters
is a nontrivial combinatorial problem. We describe a new method for analyzing gapped
q-gram filters. This method is simple and generic. It applies to a variety of
filters, overcomes many restrictions that are present in existing
algorithms and can easily be extended to new filter variants. To implement our approach,
we use an extended version of BDDs (Binary Decision Diagrams), a data structure that
efficiently represents sets of bit-strings. In a second step, we define a new class
of multi-shape filters and analyze these filters with the BDD-based approach.
Experiments show that multi-shape filters can outperform the best single-shape filters,
which are currently in use, in many aspects. The BDD-based algorithm is crucial for the
design and analysis of these new and better multi-shape filters. Our results apply to
the k-mismatches problem, i.e. approximate string matching with Hamming
distance. |
| Keywords: |
approximate string matching; filter; gapped q-gram; BDD
|
|
|