Appraisal of Efficient Techniques for Online Record Linkage and Deduplication using Q-Gram Based Indexing
Journal: International Journal of Computer Science and Mobile Computing - IJCSMC (Vol. 3, No. 5)
Publication Date: 2014-05-30
Authors : M.V Shiva Prasad; Ch.Krishna Prasad; B.Rambabu;
Page : 404-414
Keywords : q-samples; substrings; BKV; q-grams; Record identifiers;
We present new indexing techniques for approximate string matching. The index collects text q-samples, that is, disjoint text substrings of length q taken at fixed intervals, and stores their positions. At search time, part of the text is filtered out by observing that any occurrence of the pattern must be reflected in the presence of some text q-samples that approximately match inside the pattern. The aim of this technique is to index the database so that records with a similar, not just identical, blocking key value (BKV) are inserted into the same block. Assuming the BKVs are strings, the basic idea is to create variations of each BKV using q-grams (substrings of length q) and to insert each record identifier into more than one block.
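The blocking idea described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`qgrams`, `key_variations`, `build_index`), the `threshold` parameter that bounds how many q-grams may be dropped, and the sample records are all hypothetical. Variations are formed by concatenating order-preserving sub-lists of a BKV's q-gram list, and each record identifier is inserted under every resulting key.

```python
from collections import defaultdict
from itertools import combinations


def qgrams(value, q=2):
    """Return the overlapping substrings of length q in value, in order."""
    return [value[i:i + q] for i in range(len(value) - q + 1)]


def key_variations(bkv, q=2, threshold=0.5):
    """Build index keys from sub-lists of the BKV's q-gram list.

    Assumption: `threshold` sets the shortest sub-list kept, as a fraction
    of the full q-gram list length. Lower values create more variations.
    """
    grams = qgrams(bkv, q)
    if not grams:
        return {bkv}  # value shorter than q: use the value itself as the key
    min_len = max(1, int(len(grams) * threshold))
    keys = set()
    for length in range(min_len, len(grams) + 1):
        # combinations() preserves the original q-gram order
        for sub_list in combinations(grams, length):
            keys.add("".join(sub_list))
    return keys


def build_index(records, q=2, threshold=0.5):
    """Map each key variation to the set of record identifiers in that block."""
    index = defaultdict(set)
    for rec_id, bkv in records.items():
        for key in key_variations(bkv, q, threshold):
            index[key].add(rec_id)
    return index


# Hypothetical records: identifier -> blocking key value
records = {"r1": "smith", "r2": "smyth", "r3": "jones"}
index = build_index(records)
```

With these sample values, "smith" and "smyth" both produce the key "smth" (from the q-grams "sm" and "th"), so `r1` and `r2` land in a common block even though their BKVs differ, while `r3` does not. The number of sub-lists grows quickly with BKV length, so a real index would likely need a higher threshold or short keys.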
Last modified: 2014-05-21 20:59:03