Application of topic modeling methods to identify groups of internet resources in order to reduce the risk of cyber threats

Free access

Internal network security is an important aspect of a successful enterprise. Various tools exist to prevent cyber threats and analyze visited Internet resources, but their speed and applicability depend strongly on the volume of input data. This article discusses existing methods for detecting network threats by analyzing proxy server logs, and proposes a method for clustering Internet resources that reduces the volume of input data by excluding groups of safe Internet resources or by selecting only suspicious ones. The proposed method consists of three stages: data preprocessing, data analysis, and interpretation of the results. The input to the method is a set of proxy server log entries. In the first stage, the fields useful for analysis are selected from the source data, after which the continuous data stream is divided into short sessions using kernel density estimation. In the second stage, soft clustering of the visited Internet resources is performed by applying topic modeling. The output of the second stage is a set of unlabeled groups of Internet resources. In the third stage, an expert interprets the results by examining the most frequently visited Internet resources in each group. Each stage has a number of settings, so the method can be configured for the format and specifics of the input data. The method is not restricted to a particular application domain. It can be used as an additional preprocessing step to reduce the amount of input data.
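The session-splitting step of the first stage can be illustrated with a minimal sketch. The abstract does not specify the kernel, the bandwidth, or the splitting rule, so everything below is an assumption made for illustration: a Gaussian kernel, a fixed bandwidth in seconds, and session boundaries placed at local minima of the estimated density of request times. The function names `gaussian_kde` and `split_sessions` are hypothetical and are not taken from the article.

```python
import math


def gaussian_kde(timestamps, bandwidth):
    """Return a function that estimates the density of request times.

    Assumption: Gaussian kernel with a fixed bandwidth (in seconds).
    """
    n = len(timestamps)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(t):
        return norm * sum(
            math.exp(-0.5 * ((t - x) / bandwidth) ** 2) for x in timestamps
        )

    return density


def split_sessions(timestamps, bandwidth=60.0, step=10.0):
    """Split request timestamps (seconds) into sessions.

    Assumption: a session boundary is placed at every local minimum of
    the density estimate, evaluated on a regular grid with spacing `step`.
    """
    if not timestamps:
        return []
    ts = sorted(timestamps)
    density = gaussian_kde(ts, bandwidth)

    # Evaluate the density on a regular grid spanning the data.
    grid = []
    t = ts[0]
    while t <= ts[-1]:
        grid.append(t)
        t += step
    values = [density(g) for g in grid]

    # Interior grid points lower than the left neighbour and not higher
    # than the right neighbour are treated as cut points.
    cuts = [
        grid[i]
        for i in range(1, len(grid) - 1)
        if values[i] < values[i - 1] and values[i] <= values[i + 1]
    ]

    # Assign each timestamp to the session between consecutive cuts.
    sessions, current, cut_idx = [], [], 0
    for x in ts:
        while cut_idx < len(cuts) and x > cuts[cut_idx]:
            if current:
                sessions.append(current)
                current = []
            cut_idx += 1
        current.append(x)
    sessions.append(current)
    return sessions


if __name__ == "__main__":
    # Two bursts of requests separated by a long pause.
    for session in split_sessions([0, 20, 40, 1000, 1020, 1040]):
        print(session)
```

In this sketch the bandwidth controls how long a pause must be before it separates two sessions: a larger value merges nearby bursts, a smaller one splits them. In practice it would need to be tuned to the log data at hand.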


Topic modeling, cyber security, data analysis

Short URL: https://sciup.org/148324798

IDR: 148324798   |   DOI: 10.31772/2712-8970-2022-23-2-148-155

Research article