Record linkage in data integration problem under big data conditions

Authors: Papoyan Vladimir, Korenkov Vladimir, Kadochnikov Ivan

Journal: Online scientific journal "System Analysis in Science and Education" @journal-sanse

Issue: 3, 2019.

Free access

The problem of identifying records that refer to the same entity arises when data from multiple sources are integrated. Probabilistic record linkage is one of the key approaches to solving this problem. This article establishes and experimentally verifies that applying locality-sensitive hashing and the vector space model at the blocking stage allows an efficient implementation of this approach. The implementation is tested in Apache Spark on two company registers, GLEIF and Companies House.
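The blocking idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not use Apache Spark: it is plain Python, and it assumes a particular combination (hashed character-trigram term-frequency vectors as the vector space model, random-hyperplane LSH with banding) that the abstract does not specify. All names and constants (`DIM`, `N_BITS`, `N_BANDS`, `candidate_pairs`, and so on) are hypothetical.

```python
import random
import zlib
from collections import defaultdict
from itertools import combinations

# All constants below are illustrative assumptions, not values from the article.
DIM = 256      # size of the hashed trigram vector space
N_BITS = 16    # signature length in bits
N_BANDS = 4    # number of bands; each band holds N_BITS // N_BANDS bits


def normalize(name):
    """Lowercase, replace non-alphanumeric characters by spaces, collapse spaces."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def trigram_vector(name):
    """Sparse term-frequency vector over hashed character trigrams."""
    text = f"  {normalize(name)} "
    vec = defaultdict(float)
    for i in range(len(text) - 2):
        # crc32 is used because it is stable across processes (built-in hash() is not).
        vec[zlib.crc32(text[i:i + 3].encode("utf-8")) % DIM] += 1.0
    return vec


def make_hyperplanes(seed=0):
    """Draw N_BITS random Gaussian hyperplanes in the DIM-dimensional space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(N_BITS)]


def signature(vec, planes):
    """One bit per hyperplane: 1 if the dot product is non-negative, else 0."""
    return tuple(
        1 if sum(w * plane[i] for i, w in vec.items()) >= 0 else 0
        for plane in planes
    )


def candidate_pairs(names, planes):
    """Return index pairs (i, j), i < j, whose signatures agree on at least one band."""
    rows = N_BITS // N_BANDS
    buckets = defaultdict(list)
    for idx, name in enumerate(names):
        sig = signature(trigram_vector(name), planes)
        for b in range(N_BANDS):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(members, 2))
    return pairs
```

Only the pairs returned by `candidate_pairs` would then be passed to the probabilistic comparison step, instead of every pair of records from the two registers. Names that normalize to the same string always share a block; for merely similar names, sharing a block is likely but not guaranteed, and the trade-off is controlled by the assumed `N_BITS` and `N_BANDS` values.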

Apache Spark, big data, record linkage, vector space model, locality-sensitive hashing

Short URL: https://sciup.org/14122702

IDR: 14122702

Research article