Fast text filtration with modified Bloom filter

Бесплатный доступ

The main goal of this paper is to address the problem of fast comparison of text arrays. The main attention has been paid to the adaptation of Bloom filters for the tasks related to fast filtering of text data. Comparative analysis of various Bloom filters has been performed to study the potential applicability for text filtration. Improved algorithms and data structures based on matrix Bloom filters have been proposed for fast filtration. Also various approaches for text storage in the filter have been studied. It was proposed to add complimentary index to the filter in order to solve search localization problem. In addition to the localization, this index improves the density of filter population thus reducing the memory requirements while keeping the computational costs required for document comparisons at the same level. The potential limitations of the proposed approach have been reviewed. Such limitations include challenges in processing of highly obfuscated texts and additional garbage collection mechanisms required to maintain data density in the filter. The potential application areas for the proposed model are text plagiarism detection and data leak protection systems. Experimental results shown in the paper indicated acceptable levels of false positives and false negatives results for these application areas. They also outlined noticeable reduction in memory consumption in comparison with the original matrix Bloom filter. Based on the experiments the proposed approach may be recommended for filtering texts that included 5% or more of the copied text.

Еще

Bloom filter, text search, information filtration, algorithms, data structures, plagiarism detection

Короткий адрес: https://sciup.org/142179131

IDR: 142179131

Статья научная