Record linkage in data integration problem under big data conditions
Authors: Papoyan Vladimir, Korenkov Vladimir, Kadochnikov Ivan
Journal: Network scientific publication "System Analysis in Science and Education" @journal-sanse
Published in issue: 3, 2019.
Free access
The problem of identifying records that refer to the same entity arises when integrating data from multiple sources. Probabilistic record linkage is one of the key approaches to solving this problem. This article proposes and tests the use of locality-sensitive hashing and the vector space model at the blocking stage, which allows an efficient implementation of this approach. The implementation is tested in Apache Spark on two company registers, GLEIF and Companies House.
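To make the blocking idea concrete, the following is a minimal sketch in plain Python, not the authors' Spark implementation. It assumes that company names are represented as sets of character trigrams (a binary vector space model) and that the locality-sensitive hashing family is MinHash with banding; the abstract does not say which LSH family or vectorisation was used. All function names and parameters (`shingles`, `minhash_signature`, `lsh_blocks`, `candidate_pairs`, `num_hashes`, `bands`) are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(name, n=3):
    """Character n-grams of a normalised name (a binary vector-space view)."""
    text = " ".join(name.lower().split())  # lowercase, collapse whitespace
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def _hash(seed, token):
    """Deterministic seeded hash (Python's built-in hash() is salted per run)."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, num_hashes=32):
    """MinHash signature: the minimum hash value of the set per seeded hash."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_hashes))


def lsh_blocks(records, num_hashes=32, bands=16):
    """Group record ids into blocks; records sharing any band share a block."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for rec_id, name in records.items():
        sig = minhash_signature(shingles(name), num_hashes)
        for b in range(bands):
            key = (b, sig[b * rows:(b + 1) * rows])
            buckets[key].add(rec_id)
    return buckets


def candidate_pairs(records, num_hashes=32, bands=16):
    """Pairs of ids that fall into at least one common block."""
    pairs = set()
    for ids in lsh_blocks(records, num_hashes, bands).values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


if __name__ == "__main__":
    # Illustrative, made-up names; not taken from GLEIF or Companies House.
    registry = {
        "g1": "Acme Holdings Limited",
        "c1": "ACME  Holdings  Limited",
        "c2": "Borealis Shipping Company",
    }
    print(candidate_pairs(registry))
```

Only the pairs returned by `candidate_pairs` would then be passed to the probabilistic comparison step, which avoids comparing every record of one register with every record of the other. In Apache Spark, a comparable building block is available as `MinHashLSH` in `pyspark.ml.feature`, although the abstract does not state whether the authors relied on it.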
Apache Spark, big data, record linkage, vector space model, locality-sensitive hashing
Short URL: https://sciup.org/14122702
IDR: 14122702