Data Optimization through Compression Methods Using Information Technology

Authors: Igor V. Malyk, Yevhen Kyrychenko, Mykola Gorbatenko, Taras Lukashiv

Journal: International Journal of Information Technology and Computer Science @ijitcs

Article in issue: No. 5, Vol. 17, 2025.

Free access

Efficient comparison of heterogeneous tabular datasets is difficult when sources are unknown or weakly documented. We address this problem by introducing a unified, type-aware framework that builds compact data representations (CDRs), concise summaries sufficient for downstream analysis, and a corresponding similarity graph (and tree) over a data corpus. Our novelty is threefold: (i) a principled vocabulary and procedure for constructing CDRs per variable type (factor, time, numeric, string), (ii) a weighted, type-specific similarity metric we call Data Information Structural Similarity (DISS) that aggregates distances across heterogeneous summaries, and (iii) an end-to-end, cloud-scalable realization that supports large corpora. Methodologically, factor variables are summarized by frequency tables; time variables by fixed-bin histograms; numeric variables by moment vectors (up to the fourth order); and string variables by TF-IDF vectors. Pairwise similarities use Hellinger, Wasserstein (p = 1), total variation, and L1/L2 distances, with MAE/MAPE for numeric summaries; the DISS score combines these via learned or user-set weights to form an adjacency graph whose minimum-spanning tree yields a similarity tree. In experiments on multi-source CSVs, the approach enables accurate retrieval of closest datasets and robust corpus-level structuring while reducing storage and I/O. This contributes a reproducible pathway from raw tables to a similarity tree, clarifying terminology and providing algorithms that practitioners can deploy at scale.
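The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it covers only two of the four variable types (factor and numeric), and the function names, the use of central moments, the plain weighted sum used for the DISS-style score, the equal weights, and the toy datasets are all assumptions made for illustration. Time histograms, TF-IDF vectors, and the Wasserstein and total variation distances are omitted.

```python
import math
from collections import Counter

def factor_cdr(values):
    """Frequency table (relative frequencies) for a factor variable."""
    counts = Counter(values)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}

def numeric_cdr(values):
    """Moment vector up to order four: mean plus 2nd to 4th central moments.
    Central (rather than raw) moments are an assumption of this sketch."""
    n = len(values)
    mean = sum(values) / n
    return [mean] + [sum((v - mean) ** k for v in values) / n for k in (2, 3, 4)]

def hellinger(p, q):
    """Hellinger distance between two discrete distributions given as dicts."""
    keys = set(p) | set(q)
    total = sum((math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2
                for k in keys)
    return math.sqrt(total / 2.0)

def mae(a, b):
    """Mean absolute error between two equal-length summary vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def diss_like(cdr_a, cdr_b, weights):
    """Weighted sum of per-type distances (hypothetical aggregation form)."""
    return (weights["factor"] * hellinger(cdr_a["factor"], cdr_b["factor"])
            + weights["numeric"] * mae(cdr_a["numeric"], cdr_b["numeric"]))

def minimum_spanning_tree(names, dist):
    """Prim's algorithm on a complete graph; dist is keyed by frozenset pairs."""
    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        u, v = min(((a, b) for a in in_tree for b in names if b not in in_tree),
                   key=lambda edge: dist[frozenset(edge)])
        edges.append((u, v))
        in_tree.add(v)
    return edges

# Toy corpus (invented data): one factor column and one numeric column each.
datasets = {
    "a": {"colour": ["red", "red", "blue"], "price": [1.0, 2.0, 3.0]},
    "b": {"colour": ["red", "blue", "blue"], "price": [1.0, 2.0, 4.0]},
    "c": {"colour": ["green", "green", "green"], "price": [10.0, 20.0, 30.0]},
}
cdrs = {name: {"factor": factor_cdr(cols["colour"]),
               "numeric": numeric_cdr(cols["price"])}
        for name, cols in datasets.items()}

weights = {"factor": 0.5, "numeric": 0.5}  # user-set weights, chosen arbitrarily
names = sorted(cdrs)
dist = {frozenset((a, b)): diss_like(cdrs[a], cdrs[b], weights)
        for i, a in enumerate(names) for b in names[i + 1:]}

tree = minimum_spanning_tree(names, dist)
print(tree)
```

Only the compact summaries in `cdrs` are needed once they are built, which is where the storage and I/O savings mentioned in the abstract would come from. In this toy corpus the first edge of the tree links `a` and `b`, the two datasets with overlapping categories and similar price moments.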


Information Technology, Data Similarity, Compressed Copy of Tabular Data, Compact Data Representation

Short URL: https://sciup.org/15020019

IDR: 15020019   |   DOI: 10.5815/ijitcs.2025.05.07