Adaptive model for dynamic and temporal topic modeling from big data using deep learning architecture

Authors: Ajeet Ram Pathak, Manjusha Pandey, Siddharth Rautaray

Journal: International Journal of Intelligent Systems and Applications @ijisa

Issue: No. 6, Vol. 11, 2019.

Free access

Due to the freedom to express views, opinions, and news, and the ease of disseminating information to a large population worldwide, social media platforms are inundated with big streaming data comprising both short text and long normal text. Getting a glimpse of the ongoing events happening over social media is quintessential for understanding trends, and for this, topic modeling is the most important step. With the increasing proliferation of big data streaming from social media platforms, it is crucial to perform large-scale topic modeling to extract topics dynamically, in an online manner. This paper proposes an adaptive framework for dynamic topic modeling from big data using a deep learning approach. An approach based on the approximation of online latent semantic indexing constrained by regularization has been put forth. The model is designed using a deep network of feed-forward layers. The framework works in an adaptive manner in the sense that the model is updated incrementally according to the streaming data and retrieves dynamic topics. In order to capture the trends and evolution of topics, the framework supports temporal topic modeling, and it also enables detection of implicit and explicit aspects from sentences.


Keywords: Aspect detection, big data, deep learning, latent semantic indexing, online learning, regularization, topic modeling

Short address: https://sciup.org/15016598

IDR: 15016598   |   DOI: 10.5815/ijisa.2019.06.02

Text of the scientific paper: Adaptive model for dynamic and temporal topic modeling from big data using deep learning architecture

Published Online June 2019 in MECS

The emergence of social media platforms has led to an increase in the posting of text in the form of reviews and opinions on the web, and contributes heavily to the unprecedented growth of big data [1]. Many natural language processing applications such as summarization, user profiling, product recommendation, event tracking, text classification, collaborative filtering, similarity finding, and sentiment analysis need to discover latent semantic topics from large text corpora. In such applications, topic modeling is the foremost step. Extracting latent topics at large scale is challenging due to the sparseness of text, spelling and grammatical errors, slang and jargon, unstructured data, and interrelated data discussed under different domains.

Microblogging sites such as Twitter, Tumblr, Pinterest, Reddit, and Yammer stream large amounts of short and long normal text, with substantial growth over time. Streaming data are characterized by temporal order. Temporal information is necessary to get a notion of the evolution and spread of domain-specific latent topics. Moreover, instead of processing large collections of time-stamped datasets offline in batch mode, it is more crucial for many natural language applications to analyze, summarize, and extract valuable insights on the go, in an online manner. Batch algorithms are not suitable for extracting topics from large-scale and streaming data; such algorithms need to repeatedly scan the data for topic learning and must keep the model up to date when new data arrive. Therefore, online algorithms are preferred for topic learning.

Online algorithms are able to handle large-scale data efficiently since they only store small chunks of data for updating the model when new data arrive. This makes them more efficient than their batch counterparts. For example, during a worldwide event, many social media platforms get flooded with people's comments, news feeds, etc., and this requires automated systems to extract and track current topics of specific interest and to identify emerging trends discussed on social media platforms. If the extracted topics correspond to suspicious activities or alarming scenarios, quick actions can be taken by authorized personnel as proactive measures. Hence, a temporal topic model which works in online mode to infer dynamically generated topics from streaming data is the need of the hour.

Considering the aforementioned motivation, this paper proposes an adaptive framework for deep learning based dynamic and temporal topic modeling from big data. The proposed approach works in an online manner for topic modeling and is therefore intrinsically scalable to large datasets. The contributions of this work are as follows.

•    We have proposed a deep learning model for detection of dynamically generated topics from streaming data by an online version of Latent Semantic Indexing (LSI) constrained by regularization.

•    The approach is scalable to large collections of datasets. It is flexible enough to support both long normal text and short text for modeling the topics.

•    The model is adaptive: it is updated incrementally and performs temporal topic modeling to get a notion of the evolution and trends of topics over time.

•    It also supports extraction of implicit and explicit topics from sentences.

The rest of the paper is organized as follows. Section II discusses conventional topic models, topic models based on the deep learning paradigm, and the relation of the existing work to the proposed approach. Section III focuses on the statistical environment, the proposed architecture, and the algorithms for dynamic and temporal topic modeling and user query evaluation. The experimentation details, encompassing exploratory data analysis, correspondence analysis, and results, are discussed in Section IV. Section V concludes and gives future directions of the research.

II. Related Work

Topic modeling is a statistical technique that provides an automated approach for extracting latent semantic topics from documents. In classical settings, a document is considered a mixture of latent topics, i.e., a multinomial distribution over topics, and a topic is viewed as a probability distribution over words.
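In symbols, this classical view factors a document's word distribution over K latent topics (a standard formulation, stated here for clarity rather than taken from the paper):

$$p(w \mid d) = \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d)$$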

Topic modeling algorithms have been extensively developed for text analysis over the past decade [2]. Manually identifying topics is neither efficient nor scalable due to the huge size of the data, its wide variation, and the dynamically changing nature of topics. Therefore, topic models such as Latent Dirichlet Allocation (LDA) [3], probabilistic Latent Semantic Analysis (PLSA) [4], and Latent Semantic Indexing (LSI) [5] have been put forth for automatically extracting topics at large scale. Various topic modeling algorithms have been used for inferring hidden topics from short texts [6-8] and normal long texts [9-10].

The LDA model put forth by Blei et al. [3] is the most popular probabilistic generative model for topic modeling. An approximate technique, namely a convexity-based variational method, is used for inference, since exact inference is intractable. For estimating the Bayes parameters, the expectation maximization algorithm is used in the LDA model. Due to their probabilistic and modular nature, LDA models can be easily fit into complex architectures. This property is not supported by the LSI [5] model.

The LSI model [5] uses Singular Value Decomposition (SVD) to capture the variance in a document collection. This approach captures the implicit semantic structure among the terms in documents for identifying relevant documents based on the terms present in queries. It maps high-dimensional count vectors to a lower-dimensional latent semantic space.
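As a small illustration of this mapping (our sketch, not the authors' code): truncated SVD projects term-document vectors and queries into a K-dimensional latent space; the matrix, query, and K below are random placeholders.

```python
import numpy as np

# Toy LSI: map high-dimensional count/TF-IDF vectors (terms x docs)
# into a K-dimensional latent semantic space via truncated SVD.
D = np.random.rand(500, 40)           # placeholder term-document matrix
K = 5                                 # number of latent topics (assumed)
U, s, Vt = np.linalg.svd(D, full_matrices=False)
U_k, s_k, Vt_k = U[:, :K], s[:K], Vt[:K, :]   # rank-K truncation
doc_topic = (np.diag(s_k) @ Vt_k).T   # each document in the topic space
query = np.random.rand(500)           # placeholder query term vector
q_hat = np.diag(1 / s_k) @ U_k.T @ query  # fold the query into latent space
```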

An improved version of the standard latent semantic analysis model has been put forth in [4]. The probabilistic latent semantic analysis model, PLSA (the aspect model), follows a statistical latent class model and is an unsupervised learning method. A method for generalization of maximum likelihood estimation, namely tempered expectation maximization, has also been proposed in [4].

Many topic modeling approaches have been put forth by modifying basic topic models like LDA, PLSA, and LSI. Hoffman et al. [11] extended LDA by proposing online variational Bayes (VB) algorithm for topic modeling over streaming data.

Scaling to large document collections is one of the most challenging tasks in topic modeling. Topic modeling approaches based on LDA and LSI pose scalability challenges when employed to solve real-world tasks. For instance, it is very difficult to update the term-topic matrix simultaneously while satisfying the criterion of a probability distribution when the dataset is large. In the case of LSI, due to the orthogonality assumption, the problem needs to be solved using SVD, and it is difficult to parallelize the SVD procedure. Also, topic models like LDA and PLSA assume a document to be a mixture of topics and model document-level word co-occurrences. Wang et al. [12] came up with a novel model based on regularized latent semantic indexing (RLSI) for scalable topic modeling. RLSI differs from the LSI method: it uses regularization to constrain the solutions instead of the orthogonality adopted by LSI techniques.
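For reference, the RLSI objective of Wang et al. [12] is commonly stated with an $\ell_1$ penalty on the topics and an $\ell_2$ penalty on the document representations (our rendering, not quoted from the paper):

$$\min_{U,\, V}\; \|D - UV\|_F^2 \;+\; \lambda_1 \sum_{k=1}^{K} \|u_k\|_1 \;+\; \lambda_2 \sum_{n=1}^{N} \|v_n\|_2^2$$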

The online version of LDA has been put forth in [13]. This approach works on non-Markovian Gibbs sampling. The weight-matrix history is maintained in the generative process of the method according to the homogeneity of the domain. It does not handle inter-topic differences and drifts within the same topics. A topic model for temporally sequenced data has been proposed in [14]. This model dynamically predicts future trends for data and is scalable in nature.

Some topic modeling approaches assume that words have equal weights. This results in the selection of topics having the highest frequency of terms in documents. However, this may cause the selection of meaningless words, like domain-specific stop words, which are not useful for further processing. Li et al. addressed this issue by proposing a conditional entropy based term weighting scheme in which entropy is measured by word co-occurrences [15]. To infer more effective topics during the topic inference phase, meaningless words are assigned lower weights and informative words are assigned higher weights. This scheme is applied with the Dirichlet Multinomial Mixture (DMM) model [8] and the LDA model [3] to infer topics from short texts and normal long texts respectively.

Kuhn put forth a structural topic modeling approach and captured the correlation among topics pertaining to a single domain [16]. Brody and Elhadad [17] employed an unsupervised approach for aspect detection. They used a local version of LDA working at the sentence level and treated each sentence as a document.

With the success of deep learning approaches in computer vision tasks [18-20], deep learning models have also been devised for natural language processing tasks. The language model based on a recurrent neural network (TopicRNN) [21] follows a semi-supervised approach to capture the syntactic and semantic dependencies of a document using an RNN and a latent topic model respectively. This model can be considered an alternative to LDA for topic modeling. Li et al. [22] used an attention mechanism of neural networks for modeling contents and topics to recommend hashtags.

Generative topic models usually do not consider contextual information while performing topic extraction. The Document Informed Neural Autoregressive Distribution Estimator (iDocNADE) [23] takes contextual information into account using language models with backward and forward references. The LDA-based generative topic model proposed in [24] performs incremental updating of parameters over consecutive windows and enables faster processing by an adaptive window length.

Word embeddings have been found useful for distributed representation of words and for capturing semantic and syntactic information in many natural language processing tasks such as part-of-speech tagging, parsing, and named entity recognition. Motivated by this, word embedding models have also been used for topic modeling. Zhang et al. [25] used the word2vec embedding model for feature extraction from a large range of bibliometric data and coupled it with the k-means algorithm to improve the performance of topic extraction. Topic modeling approaches working on short texts from social media platforms suffer from data sparsity, noisy words, and word sense disambiguation problems. Gao et al. [26] addressed the issue of word sense disambiguation by utilizing the local and global semantic correlation provided by a word embedding model; a conditional random field is used in the inference phase for short text modeling. The approach in [27] introduced a common semantic topic model designed using a mixture of unigram models for capturing the semantic and noisy words in short texts. A Weibull distribution based hybrid autoencoding inference process for deep LDA has been put forth in [28] to get a hierarchical latent representation of big data for scalable topic modeling.

Considering the relation of the proposed work to the existing literature, this paper proposes a scalable topic modeling approach based on deep learning. The proposed model is capable of inferring dynamic topics from streaming data and provides a notion of the evolution and trends of topics over time.

III. Methodology

This section describes the statistical environment and the proposed architecture for topic modeling.

A. Statistical environment

The following matrices are used in the proposed approach.

•    Terms $\Gamma$, $\Gamma \in \mathbb{R}^{M}$

$\Gamma = [\Gamma_1, \Gamma_2, \ldots, \Gamma_M]$

where M denotes the number of terms.

•    Sentences $S_i$

A sentence can be represented as a set of terms

$S_i = [\Gamma_1, \Gamma_2, \ldots, \Gamma_J]$

where $i = 1, 2, \ldots, P$ and P is the number of sentences. The number of terms J in a sentence is less than the total number of terms, i.e., $J < M$.

•    Topics $u_i$, $u_i \in \mathbb{R}^{M \times 1}$

$u_i \in \{u_1, u_2, \ldots, u_K\}$

where $i = \{1, 2, \ldots, K\}$ and K denotes the number of topics.

•    Term-document matrix $D$, $D \in \mathbb{R}^{M \times N}$

$D = [d_1, d_2, \ldots, d_N]$

where M denotes the number of terms and N denotes the number of documents. The values in the term-document matrix are calculated using the TF-IDF score.

•    Sentence-term matrix $W$, $W \in \mathbb{R}^{P \times M \times N}$

$W = [w_1, w_2, \ldots, w_N]$

where P denotes the number of sentences, M the number of terms, and N the number of documents. For documents $d_1, d_2, \ldots, d_N$, the sentence-term matrix W is a three-dimensional matrix. The values in W are calculated in one of two ways (a construction sketch is given after these definitions):

- presence or absence of a term in each sentence, represented by 1 or 0 respectively;
- the term frequency-inverse document frequency (TF-IDF) score.


•    Topic-document matrix $V$, $V \in \mathbb{R}^{K \times N}$

$V = [v_1, v_2, \ldots, v_N]$

where K denotes the number of topics and N the number of documents; $v_{kn}$ denotes the weight of the k-th topic in document $d_n$. Each column $v_1, v_2, \ldots, v_N$ gives the representation of document $d_1, d_2, \ldots, d_N$ respectively in the topic space.

•    Term-topic matrix $U$, $U \in \mathbb{R}^{M \times K}$

where M is the number of terms and K is the number of topics. The entry $u_{mk}$ of the term-topic matrix is the weight of the m-th term in topic k. At the sentence level, a term $\Gamma$ gives the name of the aspect if the term belongs to the topic, i.e., its weight $u_{mk}$ is high for the given topic $u_k$. The term-topic matrix shows the strength of belongingness of each term to each topic.

•    Sentence-topic matrix $O$, $O \in \mathbb{R}^{P \times K \times N}$

$O = [o_1, o_2, \ldots, o_N]$

where P denotes the number of sentences, K the number of topics, and N the number of documents. For documents $d_1, d_2, \ldots, d_N$, the sentence-topic matrix is three-dimensional, and $o_{pk}$ denotes the weight of the k-th topic in sentence p.
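As a construction sketch for 'D' and the binary variant of 'W', the snippet below uses scikit-learn's TfidfVectorizer; the toy corpus, sentence list, and whitespace tokenization are illustrative assumptions rather than the paper's actual pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

docs = ["bitcoin price rises", "ethereum gas fees fall"]  # placeholder corpus
vec = TfidfVectorizer()
D = vec.fit_transform(docs).T.toarray()   # term-document matrix (M x N)

# Binary sentence-term slice of 'W' for one document (P sentences x M terms).
sentences = ["bitcoin price rises", "price rises fast"]   # placeholder
vocab = vec.vocabulary_
W_doc = np.zeros((len(sentences), len(vocab)))
for p, sent in enumerate(sentences):
    for tok in sent.split():
        if tok in vocab:
            W_doc[p, vocab[tok]] = 1.0    # presence/absence encoding
```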

B. Proposed architecture

Fig. 1 shows the proposed methodology of topic modeling. The architecture is divided into 4 main modules. The model works on streaming data. For building the prototype model, we collected the dataset. Initially, data cleaning and tokenization have been performed using R, Python, and SpaCy packages. After data preprocessing, exploratory data analysis (EDA) and correspondence analysis (CA) have been performed. EDA is used to infer useful information from the data, understand the behaviour of the data, and check the usefulness of the data for the further phase of topic modeling.
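A hedged sketch of the cleaning and tokenization step with spaCy is given below; the model name is the standard small English pipeline, which the paper does not specify.

```python
import spacy

# Cleaning and tokenization, mirroring the paper's Python/SpaCy step.
nlp = spacy.load("en_core_web_sm")  # assumed pipeline, not from the paper

def preprocess(text):
    doc = nlp(text.lower())
    # Keep alphabetic, non-stopword tokens; lemmatize to reduce sparsity.
    return [tok.lemma_ for tok in doc if tok.is_alpha and not tok.is_stop]

tokens = preprocess("Bitcoin prices are rising faster than ever! #bitcoin")
```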

We have done correspondence analysis as a generalization of principal component analysis (PCA) and SVD. We analysed the data using heat maps, scree plots, factor scores, and the most contributing variables. We designed the topic model for temporal analysis by online latent semantic indexing constrained by regularization using a deep learning approach. The model is designed using a dense network of feed-forward layers.

Sentence-level topic modeling makes it possible to detect both implicit and explicit topics mentioned in sentences. We applied Algorithm 1 for training the model incrementally.

Due to online learning, only one document remains in memory at a time. Therefore, the space complexity is given by $DocLength + K$, where DocLength stands for the document length and K is the number of topics. For the initial construction of the 'U' and 'V' matrices, the space complexity is $DocLength \times N + KN$. For processing the document at time t, the time for updating the 'U' and 'V' matrices as shown in equation (3) dominates; therefore, the time complexity is given by $C \times M \times K^2$, where C denotes the number of iterations of the algorithm, M is the number of terms, and K is the number of topics.

When user sends a query for inferring the topics, algorithm 2 is followed. The topics associated with terms mentioned in query and other implied latent topics are returned by model as output of the query evaluation.

Fig. 1 depicts the four main modules of the architecture: (i) preprocessing of streaming documents 1, 2, ..., N (data cleaning and tokenization using R, Python, and SpaCy packages for enhancing the effectiveness of the data); (ii) exploratory data analysis, with correspondence analysis as a generalization of PCA, singular value decomposition, and analysis using heat maps, scree plots, and factor scores; (iii) topic modeling by the online version of latent semantic indexing constrained by regularization (finding the distribution of documents, calculating TF-IDF scores, drawing the term-document matrix 'D' and sentence-term matrix 'W' for each document, approximating 'D' as the product of the term-topic matrix 'U' and topic-document matrix 'V', drawing the sentence-topic matrix 'O', minimizing the difference between the actual document and its approximated representation by regularization, and updating 'U' and 'V' for each document collected at time t); and (iv) input query processing, which analyzes the user query, performs explicit topic detection via 'U', falls back to 'W' and 'O' for implicit detection, and returns the top topics, so that user queries on streaming data infer dynamically generated Topics 1, 2, ..., K using online learning.

Fig.1. Proposed architecture for topic modeling

Algorithm 1: Temporal Topic Modeling by Online Latent Semantic Indexing constrained by regularization

Model Training

Input: Streaming Data

Output:  Trained model obtained using online learning

•    Find the distribution of documents $d_1, d_2, \ldots, d_N$.

  •    Calculate TF-IDF score for terms in document collection using equation (1).

$TFIDF(\Gamma, d) = \dfrac{c(\Gamma, d)}{|d|} \times \log \dfrac{|D|}{|\{d \in D : \Gamma \in d\}|} \quad (1)$

where $c(\Gamma, d)$ is the count of occurrences of term $\Gamma$ in document d, $|d|$ is the length of document d, $|D|$ is the total number of documents in the document collection, and $|\{d \in D : \Gamma \in d\}|$ is the total number of documents in which term $\Gamma$ occurs.

•    Draw the term-document matrix 'D' and sentence-term matrix 'W' for each document.

•    Approximate the term-document matrix 'D' as a product of the term-topic matrix 'U' and topic-document matrix 'V'.

$d_n \approx U v_n \quad (2)$

where $d_n$ is a document, U is the term-topic matrix, and $v_n$ is the representation of document $d_n$ in the topic space.

•    Based on the 'U' and 'V' matrices, draw the sentence-topic matrix 'O'.

•    Minimize the difference between the actual document and its approximated representation by $\ell_2$ regularization according to Eq. (3):

$\min_{U,\, v_n} \| d_n - U v_n \|_2^2 \quad (3)$

•    For each document collected at time t, update the matrices 'U' and 'V' by approximating the latent semantic indexing model using Eq. (3).

•    Retrieve the term-topic matrix 'U' and topic-document matrix 'V'.
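A minimal runnable sketch of the online update in Algorithm 1 follows. The closed-form ridge solve for $v_n$, the gradient steps on U, and the hyperparameters lam_u, lam_v, and lr are illustrative assumptions; the paper does not specify the optimizer.

```python
import numpy as np

class OnlineRegularizedLSI:
    """Sketch of online LSI constrained by l2 regularization (Eq. (3))."""

    def __init__(self, n_terms, n_topics, lam_u=0.1, lam_v=0.1, lr=0.01):
        self.U = np.random.rand(n_terms, n_topics) * 0.01  # term-topic matrix
        self.lam_u, self.lam_v, self.lr = lam_u, lam_v, lr

    def partial_fit(self, d, n_iter=50):
        """Update the model with one TF-IDF document vector d (length M)."""
        K = self.U.shape[1]
        # Solve for v_n with U fixed: ridge regression in closed form.
        v = np.linalg.solve(self.U.T @ self.U + self.lam_v * np.eye(K),
                            self.U.T @ d)
        # Gradient steps on U for the regularized reconstruction error.
        for _ in range(n_iter):
            grad_U = -2 * np.outer(d - self.U @ v, v) + 2 * self.lam_u * self.U
            self.U -= self.lr * grad_U
        return v  # representation of the document in the topic space
```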

Algorithm 2: Query Evaluation

Input: User Query

Output: Dynamically generated topics

  •    Analyze the query

  •    Retrieve terms explicitly mentioned in the query

•    Perform explicit topic detection with the help of the term-topic matrix 'U'

•    If a term is not explicitly mentioned in the input query, then consult the sentence-term matrix 'W' and sentence-topic matrix 'O' for implicit detection of topics.

  •    Choose top topics having highest scores/weights
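The following sketch mirrors Algorithm 2 under stated assumptions: dense numpy matrices 'U' (M x K), 'W' (P x M), and 'O' (P x K) for a single document, and a vocab dict mapping terms to rows of 'U'; all names and the top-n cut-off are illustrative, not from the paper.

```python
import numpy as np

def evaluate_query(query_terms, U, W, O, vocab, top_n=5):
    """Return topic indices for a query: explicit lookup via 'U',
    implicit lookup via the sentences in 'W' and their topics in 'O'."""
    topics = set()
    for term in query_terms:
        if term not in vocab:
            continue
        m = vocab[term]
        # Explicit detection: top topics for the term from 'U'.
        topics.update(np.argsort(U[m])[::-1][:top_n].tolist())
        # Implicit detection: sentences containing the term ('W'),
        # then the strongest topics of those sentences ('O').
        for p in np.nonzero(W[:, m])[0]:
            topics.update(np.argsort(O[p])[::-1][:top_n].tolist())
    return sorted(topics)
```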

IV. Experimentation Details


For topic modeling, a real-world dataset related to 3 hashtags has been collected using the Twitter APIs. Tweets associated with the hashtags #bitcoin, #ethereum, and #facebook were captured from 3-3-2018 to 3-5-2018. Tweets are analyzed weekly according to the duration of the data collected. For experimentation, the Python and R programming languages have been used. We also used SpaCy for advanced natural language processing tasks and developed the model in the TensorFlow framework.


Based on the collected dataset in .csv format covering the 3 major hashtags, we converted the collection period into weeks, i.e., from week 9 to week 18 of the year. Out of the 22 attributes of the dataset (viz. Tweet ID, Conversation ID, Author Id, Author Name, isVerified, DateTime, Tweet Text, Replies, Retweets, Favorites, Mentions, Hashtags, Permalink, URLs, isPartOfConversation, isReply, isRetweet, Reply To User ID, Reply To User Name, Quoted Tweet ID, Quoted Tweet User Name, Quoted Tweet User ID), we focused only on the attributes DateTime, Tweet Text, and Hashtags. The reason for choosing these 3 attributes out of 22 is that our aim is to perform temporal topic modeling on the tweet text to get the notion of evolution and trends of topics discussed under various hashtags over time; a selection sketch follows.
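A possible selection step in pandas, keeping only the 3 attributes and deriving the week number; the file name follows the description above and the column labels are taken from the attribute list, otherwise this is an assumption.

```python
import pandas as pd

# Keep only the 3 attributes used for temporal topic modeling and
# convert timestamps to ISO week numbers (weeks 9..18 of the year).
df = pd.read_csv("tweets.csv")                    # illustrative file name
df = df[["DateTime", "Tweet Text", "Hashtags"]]
df["DateTime"] = pd.to_datetime(df["DateTime"])
df["week"] = df["DateTime"].dt.isocalendar().week
```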


A. Exploratory Data Analysis

As a first step towards topic modeling, exploratory data analysis has been performed. All the steps in EDA have been carried out on the aforementioned dataset covering the 3 main hashtags. The main objective of EDA is to understand how much useful information the dataset holds. Therefore, we first calculated the five-number summary statistics for the 'DateTime' attribute to understand the distribution of words over weeks. After that, we calculated the frequency of words for each month. Fig. 2 (a), (b), and (c) shows the frequency of words for each month from March to May for the hashtags #ethereum, #facebook, and #bitcoin respectively. The frequency of occurrence of words is calculated using Eq. (4). Table 1 shows the frequency of words per week.

Fig.2. Frequency of words against each month for (a) #ethereum, (b) #facebook, and (c) #bitcoin (X-axis represents time in terms of date and Y-axis represents count of tweets)

$Frequency = \dfrac{n}{total\_words} \quad (4)$

where n denotes the number of times a specific word occurs in a week, and total_words denotes the total number of words appearing in that week.
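Eq. (4) translates directly into code; the weekly_tokens mapping below is an assumed input structure (week number to list of tokens), not an artifact of the paper.

```python
from collections import Counter

def weekly_frequencies(weekly_tokens):
    """Eq. (4): relative frequency of each word within each week."""
    out = {}
    for week, tokens in weekly_tokens.items():
        counts, total = Counter(tokens), len(tokens)
        out[week] = {word: n / total for word, n in counts.items()}
    return out
```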

Table 1. Weekly Frequency of Words for #Ethereum

| week | word | freq | n | total |
|---|---|---|---|---|
| 9 | #ethereum | 0.039225512 | 29118 | 742323 |
| 9 | #eth | 0.033369571 | 24771 | 742323 |
| 9 | #btc | 0.028740858 | 21335 | 742323 |
| 9 | #bitcoin | 0.028646560 | 21265 | 742323 |
| 9 | #ico | 0.023175895 | 17204 | 742323 |
| 9 | #blockchain | 0.022731340 | 16874 | 742323 |
| 9 | #cryptocurr | 0.022020064 | 16346 | 742323 |


To get a clear notion of the appearance of words in the corresponding weeks, Tables 1, 2, and 3 show the count and frequency of words per week for the hashtags #ethereum, #facebook, and #bitcoin respectively. From Tables 1, 2, and 3, it can be observed that the frequency values are very low. This is a major issue related to big data. To overcome this issue, correspondence analysis has been done considering the count of terms occurring in a week instead of the frequency of terms. Tables 4, 5, and 6 show the count of words discussed under the hashtags #ethereum, #facebook, and #bitcoin respectively in tweets per week.

Table 2. Weekly Frequency of Words for #Facebook

| week | word | freq | n | total |
|---|---|---|---|---|
| 9 | #facebook | 0.073163210 | 949 | 12971 |
| 9 | #twitter | 0.012335209 | 160 | 12971 |
| 9 | #instagram | 0.011795544 | 153 | 12971 |
| 9 | en | 0.009251407 | 120 | 12971 |
| 9 | de | 0.008711742 | 113 | 12971 |
| 9 | facebook | 0.007246935 | 94 | 12971 |
| 9 | #socialmedia | 0.006475985 | 84 | 12971 |

Table 3. Weekly Frequency of Words for #Bitcoin

| week | word | freq | n | total |
|---|---|---|---|---|
| 9 | #bitcoin | 0.067669116 | 26952 | 398291 |
| 9 | #cryptocurr | 0.023540577 | 9376 | 398291 |
| 9 | #blockchain | 0.021306030 | 8486 | 398291 |
| 9 | #ethereum | 0.020575408 | 8195 | 398291 |
| 9 | #crypto | 0.019721761 | 7855 | 398291 |
| 9 | #btc | 0.018283115 | 7282 | 398291 |

B. Correspondence Analysis

For extracting useful information from the dataset represented as a contingency table, reducing the data by focusing on the important information, and analyzing the patterns in the data, we have performed singular value decomposition and decomposition of positive semidefinite matrices. For handling the qualitative variables, principal component analysis is generalized as correspondence analysis [29].

Such clustering would mean that something important happened during a week that brought different words closer to each other; therefore, we simply use the highest-frequency word as the topic.

For CA, the whole dataset with the 3 hashtags #ethereum, #facebook, and #bitcoin and their associated terms is displayed in row-column form, in which rows represent terms associated with hashtags and columns represent weeks. Entries in the contingency table represent how many times the given terms were discussed on Twitter in a given week. Table 7 shows the contingency table of terms associated with the 3 hashtags along with their occurrences in weeks 9 to 18.

1) Singular Value Decomposition

For reducing the data size, SVD finds new components derived from the original variables using linear combinations. The first component exhibits as much variance as possible and explains the largest part of the inertia of the table. Each subsequent principal component is obtained with as much of the remaining variance as possible under the constraint that it is orthogonal to the preceding components. The new variables used for deriving the components are called factor scores. Factor scores can be viewed as projections of the observed data onto the principal components. We have used Correspondence Analysis via the ExPosition package. To reduce dimensions, we used SVD and obtained new axes for all weeks, as shown in Table 8.
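A hedged sketch of this step: standard correspondence analysis of the week-by-term contingency table via SVD. ExPosition is an R package; the numpy equivalent below is our illustration, and the file path is a placeholder.

```python
import numpy as np

# Correspondence analysis of a terms x weeks contingency table (cf. Table 7).
counts = np.loadtxt("contingency.csv", delimiter=",")   # placeholder path
P = counts / counts.sum()                  # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)        # row and column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
U, sv, Vt = np.linalg.svd(S, full_matrices=False)
row_scores = (U * sv) / np.sqrt(r)[:, None]      # factor scores for terms
col_scores = (Vt.T * sv) / np.sqrt(c)[:, None]   # factor scores for weeks
inertia = sv**2 / (sv**2).sum()   # proportion explained (for the scree plot)
```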

2) Heat Map Analysis

It can be noted from the heat map (Fig. 3) that the counts of the terms #bitcoin1, #ethereum, and #facebook are very high (shown in red) compared to the other terms.

3) Scree Plots

As we only need to infer the useful information, the problem is how many components need to be considered for correspondence analysis. Scree plots give an intuition about which components represent the data best. Scree plots may or may not yield the best components, since the procedure is somewhat subjective.

For scree plot analysis, eigenvalues are plotted in decreasing order of size. Then an elbow point is identified where the slope of the graph turns from steep to flat. The points before this elbow point are kept for further analysis, as they represent the data in the best possible manner. The three points above the elbow point best represent the data, as shown in Fig. 4. Based on the scree plot, only two dimensions possessing a large amount of the data variability are selected for further analysis.

4) Factor Scores

Factor scores represent the proportion of the total inertia "explained" by a dimension. The factor scores obtained from the scree plot are plotted asymmetrically and symmetrically for both dimensions in Figs. 5 and 6 respectively, where $\lambda$ represents the eigenvalue and $\tau$ represents the percentage of data explained by the dimension.

Table 4. Weekly Count of Words for #Ethereum (weeks 9 to 18)

| Word | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
|---|---|---|---|---|---|---|---|---|---|---|
| #ada | 4909 | 8017 | 22093 | 8335 | 7913 | 4790 | 4459 | 5966 | 5662 | 3024 |
| #airdrop | 8803 | 26098 | 52520 | 35142 | 36588 | 23695 | 23437 | 17425 | 22093 | 10432 |
| #altcoin | 5252 | 15007 | 20565 | 19074 | 24754 | 15769 | 16140 | 15630 | 14965 | 7212 |
| #bch | 1676 | 5532 | 0 | 0 | 5837 | 4847 | 0 | 4335 | 6812 | 3517 |
| #binanc | 2100 | 6885 | 5777 | 6301 | 5311 | 5388 | 6872 | 4604 | 8544 | 3573 |
| #bitcoin | 21265 | 74625 | 86687 | 80112 | 88349 | 80468 | 78405 | 75880 | 74117 | 35899 |
| #bitcoincash | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5292 | 7042 | 4591 |
| #blockchain | 16874 | 63634 | 77100 | 76298 | 86977 | 73230 | 74537 | 62033 | 70407 | 33964 |
| #bounti | 4349 | 13399 | 23040 | 18379 | 23916 | 12641 | 15873 | 12443 | 12209 | 4913 |
| #btc | 21335 | 63269 | 85576 | 62158 | 66565 | 55042 | 56945 | 48847 | 55188 | 25308 |
| #bts | 0 | 0 | 0 | 5602 | 0 | 0 | 0 | 4577 | 0 | 0 |
| #coin | 0 | 0 | 0 | 0 | 8583 | 4176 | 0 | 0 | 0 | 0 |
| #crowdfund | 1649 | 0 | 0 | 0 | 0 | 0 | 6617 | 4170 | 4572 | 0 |
| #crowdsal | 0 | 4511 | 0 | 5034 | 4876 | 5252 | 6068 | 4020 | 4487 | 0 |
| #crypto | 16057 | 49975 | 64539 | 57773 | 64473 | 57169 | 63050 | 54821 | 60671 | 28176 |
| #cryptocurr | 16346 | 55146 | 67646 | 65178 | 72530 | 64913 | 65206 | 60007 | 58497 | 27540 |
| #cryptonew | 0 | 0 | 0 | 4878 | 0 | 4196 | 4725 | 0 | 0 | 0 |
| #dash | 2216 | 6034 | 6165 | 6053 | 6340 | 6508 | 4518 | 4540 | 5847 | 2968 |
| #digitizecoin | 0 | 0 | 0 | 0 | 0 | 0 | 4733 | 0 | 0 | 0 |
| #earn | 0 | 0 | 0 | 0 | 5741 | 0 | 0 | 0 | 0 | 0 |
| #elsalvador | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3205 |
| #energytoken | 5921 | 8852 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| #eo | 4481 | 9439 | 0 | 4646 | 6158 | 4802 | 0 | 4102 | 8049 | 4669 |
| #erc20 | 1385 | 0 | 5729 | 6245 | 7331 | 4942 | 4706 | 4441 | 5897 | 3412 |
| #escort | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3786 | 0 | 3628 |
| #etc | 1643 | 6167 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| #eth | 24771 | 74637 | 99115 | 77422 | 87220 | 71144 | 73107 | 58282 | 67597 | 32730 |
| #ether | 3120 | 11848 | 13683 | 14143 | 18291 | 19046 | 16220 | 11351 | 11154 | 5596 |
| #ethereum | 29118 | 108179 | 123037 | 114655 | 121189 | 113522 | 110730 | 99528 | 100171 | 48708 |
| #fintech | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3928 | 0 |
| #xrp | 9686 | 26943 | 24497 | 20066 | 19455 | 17269 | 16319 | 14031 | 22588 | 10513 |
| #xvg | 5386 | 10978 | 30172 | 11820 | 9724 | 6645 | 5888 | 8167 | 5656 | 3131 |
| airdrop | 3652 | 8952 | 17883 | 10533 | 7212 | 6779 | 7201 | 4515 | 6496 | 2585 |
| blockchain | 0 | 0 | 0 | 4972 | 4746 | 4063 | 4967 | 4187 | 4662 | 0 |
| btc | 4896 | 16187 | 15886 | 18275 | 16913 | 14503 | 16235 | 14673 | 18964 | 9716 |
| chanc | 0 | 0 | 10243 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| crypto | 0 | 5167 | 0 | 4677 | 4995 | 4536 | 4821 | 4134 | 4342 | 0 |
| cryptocurr | 0 | 0 | 0 | 0 | 0 | 0 | 4375 | 0 | 0 | 0 |
| de | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3674 |
| earn | 2223 | 5665 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| eth | 2629 | 8605 | 8347 | 9307 | 8228 | 6775 | 6539 | 5530 | 6726 | 3568 |
| free | 4869 | 12371 | 22354 | 9599 | 5688 | 4401 | 4254 | 0 | 0 | 0 |
| friend | 1510 | 0 | 11701 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| goal | 0 | 0 | 9690 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ico | 0 | 4522 | 0 | 5096 | 0 | 4069 | 5552 | 3911 | 0 | 0 |
| join | 5105 | 15739 | 15012 | 11259 | 10457 | 8495 | 8946 | 7508 | 6947 | 2836 |
| link | 0 | 0 | 8352 | 5435 | 0 | 0 | 0 | 0 | 0 | 0 |
| mani | 0 | 0 | 10111 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| offer | 0 | 0 | 5691 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| peopl | 0 | 0 | 10731 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| platform | 0 | 4424 | 0 | 4653 | 4636 | 4337 | 4483 | 3521 | 0 | 0 |
| price | 0 | 4429 | 0 | 0 | 4457 | 0 | 0 | 0 | 0 | 0 |
| project | 2154 | 8793 | 19084 | 11368 | 10908 | 10298 | 10595 | 8079 | 8053 | 3245 |
| reach | 0 | 0 | 9778 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| refer | 0 | 0 | 5695 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| regist | 0 | 0 | 13344 | 4883 | 0 | 0 | 0 | 0 | 0 | 0 |
| share | 0 | 0 | 14946 | 4499 | 0 | 0 | 0 | 0 | 0 | 0 |
| start | 0 | 0 | 7559 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| time | 0 | 0 | 7462 | 4589 | 0 | 0 | 0 | 0 | 0 | 0 |
| token | 4541 | 13866 | 41710 | 19947 | 13665 | 11690 | 13544 | 7998 | 11945 | 4653 |

Table 5. Weekly Count of Words for #Facebook (weeks 9 to 18)

| Word | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
|---|---|---|---|---|---|---|---|---|---|---|
| #actu | 33 | 1887 | 1719 | 0 | 2248 | 2178 | 0 | 2046 | 2078 | 1070 |
| #amazon | 0 | 985 | 706 | 0 | 0 | 0 | 0 | 0 | 1223 | 0 |
| #busi | 29 | 1283 | 1041 | 0 | 0 | 0 | 0 | 1566 | 1457 | 712 |
| #cambridgeanalyt | 0 | 0 | 0 | 5004 | 1658 | 0 | 0 | 0 | 0 | 0 |
| #cambridgeanalytica | 0 | 0 | 780 | 13947 | 4700 | 4102 | 6515 | 1851 | 1307 | 826 |
| #congress | 0 | 0 | 0 | 0 | 0 | 0 | 2309 | 0 | 0 | 0 |
| #data | 0 | 0 | 0 | 3958 | 2986 | 2558 | 3721 | 1867 | 1233 | 0 |
| #date | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 832 |
| #deletefacebook | 0 | 0 | 0 | 5732 | 3050 | 1797 | 2762 | 0 | 0 | 0 |
| #digitalmarket | 0 | 895 | 710 | 0 | 0 | 0 | 0 | 1356 | 1218 | 0 |
| #f8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1318 |
| #facebook | 949 | 50748 | 41961 | 130670 | 86932 | 76678 | 128772 | 63718 | 57189 | 32773 |
| #facebookdatabreach | 0 | 0 | 0 | 0 | 1635 | 0 | 2683 | 0 | 0 | 0 |
| #facebookdataleak | 0 | 0 | 0 | 0 | 0 | 0 | 2801 | 0 | 0 | 0 |
| #facebookg | 0 | 0 | 0 | 3196 | 0 | 0 | 0 | 0 | 0 | 0 |
| #faitsdiv | 33 | 1882 | 1717 | 0 | 2241 | 2173 | 0 | 2037 | 2069 | 1067 |
| #follow | 18 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| #gdpr | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1499 | 0 | 0 |
| #googl | 42 | 2779 | 2532 | 4691 | 5555 | 3372 | 4228 | 3154 | 3154 | 1144 |
| #hiphop | 21 | 1249 | 950 | 0 | 0 | 0 | 0 | 0 | 1102 | 0 |
| #info | 35 | 1918 | 1747 | 2435 | 2295 | 2217 | 0 | 2101 | 2118 | 1077 |
| #instagram | 153 | 7778 | 5601 | 8764 | 7736 | 7760 | 8403 | 7887 | 7987 | 3791 |
| #justic | 35 | 1962 | 1773 | 2452 | 2316 | 2263 | 2288 | 2205 | 2266 | 1125 |
| #linkedin | 23 | 1091 | 948 | 0 | 0 | 0 | 0 | 1384 | 1390 | 0 |
| #maga | 0 | 0 | 0 | 0 | 0 | 0 | 2999 | 0 | 0 | 0 |
| #market | 36 | 2763 | 1954 | 3032 | 2726 | 2457 | 2865 | 2591 | 2902 | 1449 |
| #markzuckerberg | 0 | 0 | 0 | 3952 | 0 | 1805 | 7903 | 0 | 0 | 0 |
| #music | 46 | 2458 | 2201 | 3045 | 2798 | 2846 | 3027 | 2820 | 2985 | 1497 |
| #new | 0 | 974 | 803 | 0 | 1616 | 1734 | 0 | 1525 | 1459 | 763 |
| le | 0 | 0 | 0 | 2617 | 1695 | 0 | 2817 | 1396 | 1059 | 834 |
| les | 19 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 725 |
| live | 18 | 917 | 808 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| los | 0 | 0 | 0 | 0 | 0 | 1652 | 0 | 0 | 0 | 0 |
| mark | 0 | 0 | 0 | 2944 | 0 | 0 | 6420 | 0 | 0 | 0 |
| market | 30 | 990 | 0 | 0 | 0 | 0 | 0 | 0 | 1092 | 0 |
| media | 33 | 1383 | 1208 | 3226 | 2058 | 1885 | 3156 | 1705 | 1494 | 672 |
| million | 0 | 0 | 0 | 0 | 0 | 1853 | 0 | 0 | 0 | 0 |
| moment | 26 | 1072 | 964 | 0 | 0 | 0 | 0 | 1193 | 1192 | 0 |
| news | 0 | 1296 | 827 | 0 | 1615 | 1580 | 0 | 0 | 0 | 703 |
| page | 66 | 2644 | 2094 | 3646 | 2885 | 2674 | 3505 | 2364 | 2321 | 1154 |
| para | 20 | 1126 | 805 | 0 | 1709 | 1636 | 0 | 1655 | 1425 | 1028 |
| peopl | 18 | 0 | 0 | 3429 | 1921 | 2004 | 3763 | 0 | 0 | 0 |
| person | 0 | 0 | 0 | 2847 | 0 | 0 | 0 | 0 | 0 | 0 |
| post | 32 | 1751 | 1211 | 2450 | 2007 | 1685 | 2311 | 1629 | 1722 | 898 |
| privaci | 0 | 0 | 0 | 2567 | 2799 | 2136 | 4267 | 1673 | 0 | 780 |
| question | 0 | 0 | 0 | 0 | 0 | 0 | 3826 | 0 | 0 | 0 |
| radiocapitol | 25 | 997 | 922 | 0 | 0 | 0 | 0 | 0 | 1091 | 0 |
| scandal | 0 | 0 | 0 | 3077 | 1805 | 1632 | 0 | 0 | 0 | 0 |
| senat | 0 | 0 | 0 | 0 | 0 | 0 | 3569 | 0 | 0 | 0 |
| share | 0 | 0 | 0 | 2457 | 0 | 1833 | 2849 | 0 | 0 | 0 |
| social | 60 | 2131 | 1766 | 5031 | 3234 | 2631 | 4526 | 2540 | 2228 | 1072 |
| su | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1254 | 0 | 0 |
| sur | 0 | 914 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| time | 28 | 0 | 0 | 3166 | 1627 | 0 | 2645 | 0 | 0 | 0 |
| tip | 19 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| user | 0 | 0 | 0 | 4479 | 2942 | 3965 | 5084 | 2144 | 1281 | 895 |
| video | 18 | 1236 | 766 | 0 | 0 | 0 | 0 | 0 | 1042 | 0 |
| zuckerberg | 0 | 0 | 0 | 4690 | 1625 | 2486 | 11162 | 1224 | 0 | 763 |

Table 6. Weekly Count of Words for #Bitcoin (weeks 9 to 18)

| Word | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
|---|---|---|---|---|---|---|---|---|---|---|
| #ada | 533 | 0 | 3477 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| #airdrop | 2638 | 10038 | 12507 | 11880 | 7421 | 5107 | 7985 | 3490 | 3742 | 2120 |
| #altcoin | 3324 | 12295 | 13478 | 14894 | 10800 | 7325 | 13349 | 6072 | 7680 | 3825 |
| #bch | 580 | 0 | 0 | 0 | 0 | 1866 | 0 | 0 | 1953 | 0 |
| #binanc | 867 | 3603 | 3463 | 4556 | 2974 | 3130 | 4372 | 2906 | 3326 | 1713 |
| #bitcoin | 26952 | 129250 | 130504 | 142416 | 93379 | 88906 | 139196 | 70279 | 84378 | 40098 |
| #bitcoincash | 805 | 3329 | 3176 | 3487 | 2468 | 2912 | 3852 | 2703 | 4746 | 3084 |
| #bittrex | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1327 | 0 | 0 |
| #blockchain | 8486 | 41274 | 44567 | 49860 | 32235 | 28574 | 45937 | 20584 | 28864 | 15041 |
| #bounti | 1602 | 5557 | 9347 | 9114 | 5793 | 3129 | 6389 | 2043 | 0 | 1003 |
| #btc | 7282 | 32320 | 34138 | 33660 | 24708 | 21475 | 34979 | 15436 | 21689 | 10554 |
| #bts | 0 | 0 | 0 | 4123 | 0 | 0 | 0 | 0 | 0 | 0 |
| #busi | 0 | 0 | 0 | 0 | 1989 | 0 | 0 | 0 | 0 | 0 |
| #coin | 544 | 2564 | 0 | 0 | 3963 | 0 | 0 | 0 | 0 | 0 |
| #coinbas | 674 | 0 | 0 | 0 | 0 | 1647 | 2644 | 0 | 0 | 0 |
| #costarica | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1258 |
| #crowdfund | 0 | 0 | 0 | 0 | 0 | 0 | 2577 | 0 | 0 | 0 |
| #crowdsal | 0 | 0 | 0 | 2824 | 0 | 0 | 3536 | 0 | 0 | 0 |
| #crypto | 7855 | 34970 | 36987 | 38793 | 23964 | 23958 | 39727 | 20486 | 26705 | 14038 |
| #cryptocurr | 9376 | 42126 | 46774 | 52489 | 35009 | 30402 | 51495 | 26833 | 33424 | 16409 |
| #cryptonew | 0 | 2919 | 2865 | 3296 | 2009 | 1834 | 3759 | 0 | 1773 | 0 |
| #cybersecur | 2396 | 8491 | 9800 | 12639 | 9689 | 11301 | 15782 | 8265 | 4301 | 0 |
| #dash | 645 | 2524 | 2959 | 2779 | 0 | 1904 | 0 | 1410 | 1925 | 0 |
| #earn | 0 | 0 | 0 | 0 | 2774 | 0 | 0 | 0 | 0 | 0 |
| #elsalvador | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1816 |
| #eo | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2121 | 1056 |
| #escort | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1298 | 2142 | 2168 |
| #eth | 4133 | 15872 | 19390 | 18999 | 14174 | 11231 | 17540 | 6999 | 9486 | 4917 |
| #ether | 1081 | 5281 | 5903 | 6822 | 5362 | 5166 | 8506 | 2842 | 3459 | 1954 |
| #ethereum | 8195 | 37482 | 40265 | 45831 | 32256 | 27772 | 47159 | 23838 | 28119 | 15209 |
| #vip | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1822 |
| #xrp | 1483 | 6590 | 4916 | 5638 | 3961 | 3662 | 5279 | 3034 | 3997 | 2064 |
| #xvg | 651 | 0 | 3361 | 3226 | 0 | 0 | 2917 | 1353 | 0 | 0 |
| 000guarium | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1311 | 0 | 0 |
| airdrop | 620 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| bitcoin | 2711 | 14780 | 14408 | 13655 | 7537 | 7178 | 11675 | 5826 | 8137 | 3793 |
| blockchain | 679 | 3830 | 3165 | 4987 | 2127 | 1920 | 3561 | 1643 | 2414 | 1368 |
| btc | 3013 | 12069 | 13080 | 14361 | 8529 | 7151 | 11433 | 6156 | 8969 | 4928 |
| buy | 0 | 0 | 3049 | 3092 | 2252 | 1773 | 2703 | 2144 | 2452 | 1038 |
| crypto | 1098 | 5655 | 5945 | 5495 | 3300 | 3129 | 4997 | 2410 | 3177 | 1661 |
| cryptocurr | 956 | 5092 | 5437 | 5222 | 3318 | 3026 | 4341 | 2091 | 2740 | 1309 |
| de | 0 | 0 | 2536 | 2867 | 2198 | 1581 | 2924 | 2074 | 2880 | 2359 |
| eth | 612 | 2817 | 2844 | 3870 | 0 | 0 | 2595 | 1287 | 1797 | 1191 |
| exchang | 0 | 3077 | 0 | 2945 | 0 | 1639 | 2622 | 0 | 0 | 0 |
| free | 939 | 4028 | 3615 | 3340 | 0 | 1680 | 0 | 1327 | 1846 | 0 |
| hour | 0 | 0 | 0 | 0 | 0 | 1838 | 2616 | 0 | 0 | 0 |
| ico | 0 | 2604 | 2803 | 2955 | 0 | 0 | 2969 | 1251 | 0 | 0 |
| invest | 0 | 2545 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| join | 974 | 4141 | 4132 | 3385 | 2055 | 1808 | 2598 | 1706 | 1813 | 0 |
| market | 792 | 3984 | 4242 | 4148 | 3030 | 2923 | 3997 | 1837 | 3124 | 1311 |
| mine | 0 | 2674 | 2724 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| price | 1243 | 6543 | 6949 | 7200 | 4701 | 4847 | 7262 | 3282 | 4641 | 1921 |
| project | 0 | 2565 | 2527 | 3030 | 2006 | 1927 | 2912 | 1275 | 0 | 0 |
| secur | 564 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| sell | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1423 | 0 | 0 |
| start | 0 | 2745 | 0 | 0 | 0 | 0 | 0 | 1323 | 0 | 0 |
| telegram | 557 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| token | 808 | 3766 | 3547 | 3890 | 2380 | 1864 | 2717 | 1329 | 1852 | 0 |
| trade | 730 | 3604 | 3056 | 2933 | 0 | 1966 | 0 | 1375 | 1859 | 0 |
| world | 0 | 0 | 0 | 3240 | 0 | 0 | 0 | 0 | 0 | 0 |

Table 7. Contingency table of hashtags along with their occurrence in weeks from 9 to 18

| Words | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
|---|---|---|---|---|---|---|---|---|---|---|
| #bitcoin | 21265 | 74625 | 86687 | 80112 | 88349 | 80468 | 78405 | 75880 | 74117 | 35899 |
| #blockchain | 0 | 63634 | 0 | 76298 | 86977 | 73230 | 74537 | 62033 | 70407 | 33964 |
| #btc | 21335 | 63269 | 85576 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| #crypto | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 60671 | 28176 |
| #cryptocurr | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 60007 | 0 | 0 |
| #eth | 24771 | 74637 | 99115 | 77422 | 87220 | 71144 | 73107 | 58282 | 67597 | 32730 |
| #ethereum | 29118 | 108179 | 123037 | 114655 | 121189 | 113522 | 110730 | 99528 | 100171 | 48708 |
| #ico | 17204 | 0 | 80290 | 75305 | 76068 | 68794 | 70205 | 0 | 0 | 0 |
| #cambridgeanalytica | 0 | 0 | 0 | 13947 | 0 | 0 | 0 | 0 | 0 | 0 |
| #facebook | 949 | 50748 | 41961 | 130670 | 86932 | 76678 | 128772 | 63718 | 57189 | 32773 |
| #instagram | 153 | 7778 | 5601 | 0 | 0 | 7760 | 0 | 7887 | 7987 | 3791 |
| #twitter | 160 | 8199 | 6384 | 0 | 9690 | 8526 | 0 | 8848 | 8827 | 3895 |
| #zuckerberg | 0 | 0 | 0 | 0 | 0 | 0 | 18855 | 0 | 0 | 0 |
| #Data | 0 | 0 | 0 | 13700 | 8052 | 0 | 11498 | 0 | 0 | 0 |
| #De | 113 | 6790 | 5513 | 21954 | 13265 | 12369 | 19778 | 10902 | 8529 | 5478 |
| #En | 120 | 5497 | 0 | 0 | 0 | 0 | 0 | 7677 | 6994 | 0 |
| #Facebook | 0 | 0 | 4555 | 15673 | 10760 | 8853 | 15829 | 0 | 0 | 3817 |
| #bitcoin1 | 29118 | 154708 | 159088 | 165500 | 107381 | 109707 | 176790 | 81169 | 95962 | 43737 |
| #blockchain1 | 9196 | 49535 | 54362 | 58028 | 37199 | 35756 | 59092 | 24153 | 32865 | 16337 |
| #btc1 | 0 | 0 | 0 | 0 | 28262 | 0 | 0 | 0 | 0 | 0 |
| #crypto1 | 8461 | 41888 | 44723 | 44898 | 0 | 29718 | 50492 | 23863 | 30424 | 15276 |
| #cryptocurr1 | 10076 | 49910 | 57015 | 60767 | 40541 | 37868 | 65620 | 31307 | 37879 | 17766 |
| #ethereum1 | 8774 | 45341 | 49824 | 54141 | 37640 | 35016 | 60999 | 28025 | 32138 | 16575 |

Table 8. SVD applied for weeks from 9 to 18

| Week | Dim 1 | Dim 2 | Dim 3 | Dim 4 | Dim 5 | Dim 6 | Dim 7 | Dim 8 | Dim 9 |
|---|---|---|---|---|---|---|---|---|---|
| 9 | -2.39704392 | 0.7504260 | 0.7867228 | -0.135401766 | 0.29029054 | -1.24858593 | 2.06928320 | -0.3540769 | -5.07620094 |
| 10 | -1.18154968 | -0.8508375 | -0.2612808 | 0.352639099 | 0.96122721 | 1.79196491 | -0.95563582 | -0.7434918 | -0.14423827 |
| 11 | -1.71436241 | 0.5768373 | 0.4915485 | -0.111930135 | -0.00433863 | -1.20364721 | -0.27991729 | 0.3793571 | 1.32622634 |
| 12 | 0.96792692 | 0.7903582 | -0.2512851 | -1.511792640 | 1.47251661 | -0.20953824 | -0.11017653 | -0.1786082 | -0.03855474 |
| 13 | -0.07089537 | 1.0562556 | -0.9800645 | 0.003870922 | -0.94707575 | 1.37464399 | 1.31150418 | 0.9377378 | 0.31156602 |
| 14 | 0.41491830 | 0.3527319 | -0.3414070 | -0.452840703 | -1.97879656 | -0.30870141 | -1.33951583 | -1.3002880 | -0.48019706 |
| 15 | 0.99273302 | 0.6564433 | 1.1705694 | 1.782179233 | 0.32616232 | -0.01323828 | -0.02534604 | -0.1235730 | -0.02066347 |
| 16 | 0.39084956 | -1.2953340 | -2.1881072 | 1.090122086 | 0.36505431 | -1.40700008 | 0.10876275 | 0.4659193 | -0.16846302 |
| 17 | 0.38071378 | -1.8370036 | 1.0445893 | -0.724527422 | -0.47709338 | -0.03915036 | 1.68439157 | -0.8721054 | 0.65668085 |
| 18 | 0.53355439 | -1.4318975 | 1.3527783 | -0.899883061 | -0.64115617 | 0.32671801 | -1.53862175 | 3.3660138 | -1.01199859 |

Fig.3. Heat map analysis

Fig.4. Scree plot (X-axis: dimensions)

Fig.5. Factor Score (Asymmetric plot). Hashtags and color key: #Ethereum in blue, #Facebook in green, #Bitcoin in red. Dimension 1: $\lambda = 0.119$, $\tau = 31\%$.

Fig.6. Factor Score (Symmetric plot)

Fig.7. Most Contributing Variables (X-axis: contributions, from -0.4 to 0.4)

Fig.8. Contribution of all weeks

Find the topics associated with the term "market":

$u_1 = \{shares\} \rightarrow \{Sentences\ S_h\}$
$u_2 = \{bitcoin\} \rightarrow \{Sentences\ S_h\}$
$u_3 = \{prizes\} \rightarrow \{Sentences\ S_h\}$

For each sentence in $S_h$, retrieve the associated topics $\{u_k\}$ from the sentence-topic matrix O:

$\{u_{11}, u_{12}, \ldots, u_{1l}\}$, $\{u_{21}, u_{22}, \ldots, u_{2l}\}$, $\{u_{h1}, u_{h2}, \ldots, u_{hl}\}$

Return the retrieved topics.

Fig.9. Query evaluation scenario for topic extraction

Dimensions 1 and 2 explain 31% and 25% of the total inertia respectively. The farther data points are from the center, the more inertia (variance) they possess, and larger distances between data points indicate stronger patterns among them. For example, #cambridgeanalytica was discussed more in weeks 12 and 15 under the hashtag #facebook. This visualization thus gives a notion of the information and patterns among the data points. In order to check whether the variables associated with the data points really possess good information, the plot of the most contributing variables is depicted in Fig. 7.

The variables associated with #cryptocurr (shown in blue) under the #facebook category and #crypto (shown in red) under #bitcoin are the most contributing variables.

Fig. 8 shows the plot of weeks according to their contribution to the topics discussed in the respective weeks. Weeks 9, 10, 11, 12, and 15 are important: something happened in the news during these weeks that caused the words close to these weeks to behave similarly.

C. Results Discussion

After performing EDA and CA, we applied our proposed approach, online latent semantic indexing constrained by regularization, to the Twitter data. Tweets are associated with the hashtags #bitcoin, #ethereum, and #facebook. Table 9 shows the top 5 topics extracted in each week from 9 to 18 for the hashtags #bitcoin, #ethereum, and #facebook.

Fig. 9 shows the query evaluation scenario when a user sends a query to get the topics associated with a given term. Let us say a user sends the query: Find the topics associated with the term "market". Initially, the query is analyzed and the terms directly mentioned in the query are retrieved. For the given example, the term-topic matrix 'U' is searched and the top topics associated with the term, i.e., {shares, bitcoin, prizes}, are retrieved. For each such topic, the sentences are retrieved from the sentence-term matrix 'W', treating each topic as a term. For each sentence, the topics $\{u_k\}$ are chosen from the sentence-topic matrix 'O' and returned. These are the topics related to the term $\Gamma$ = market. The extracted topics are given as

$\{u_{11}, u_{12}, \ldots, u_{1l}\}$
$\{u_{21}, u_{22}, \ldots, u_{2l}\}$
$\{u_{31}, u_{32}, \ldots, u_{3l}\}$

With the help of both the sentence-term matrix 'W' and the sentence-topic matrix 'O', the approach also performs extraction of implicitly present topics.

Table 9. Top 5 topics extracted using proposed approach for each week from 9 to 18

(a) Topics extracted with #bitcoin

| Week 9 | Week 10 | Week 11 | Week 12 | Week 13 | Week 14 | Week 15 | Week 16 | Week 17 | Week 18 |
|---|---|---|---|---|---|---|---|---|---|
| Dominance | Market | croatia | recent | market | bitcoin | destroy | supporters | Lacklustre | Blockchain |
| project | Change | Bitcoin | audit | Bounce | scammers | value | Contactez | Markets | Telephone |
| CRYPTO | Masterminds | rank | payment | water | account | Crypto | reward | Bitcoin | Electron |
| successs | Bitcoin | Space | sagolsun | Analysis | news | Madness | Hublot | Cash | future |
| investors | Ripple | quality | cryptocurrency | Dogecoin | livestramers | podcast | decentralisedsystem | crypto | investment |

(b) Topics extracted with #facebook

| Week 9 | Week 10 | Week 11 | Week 12 | Week 13 | Week 14 | Week 15 | Week 16 | Week 17 | Week 18 |
|---|---|---|---|---|---|---|---|---|---|
| download | Clarification | Google | Shkreli | deactivated | Censoring | fight | Cryptocorner | Unbearably | technologies |
| Hype | guide | failed | Brotheers | Optimization | virus | content | Market | Snapchat | Market |
| Facebook | teenager | edumacated | Zuckerberg | socialmedia | catie | brother | drop | video | fund |
| archive | Business | broadcasting | friend | Regulating | Scanning | privacy | Facebook | lol | suspend |
| Retargeting | uploaded | live | million | Cryptocurrency | Messenger | graduation | friday | data | Stock |

(c) Topics extracted with #ethereum

| Week 9 | Week 10 | Week 11 | Week 12 | Week 13 | Week 14 | Week 15 | Week 16 | Week 17 | Week 18 |
|---|---|---|---|---|---|---|---|---|---|
| Altcoin | CryptoCashbackRebate | pembeli | BitLicense | Support | akan | Transtoken | cash | trading | History |
| airdropping | Kapsus | Cryptocurrency | terjual | VertChain | wallet | free | future | Darico | currency |
| hedgefund | Localcoin | Food | crypto | Airdrop | cryptocurrency | market | RAXOM | crypto | Binance |
| Limited | Bitcoin | Ico | price | happy | sidechain | Casper | airdrop | Ethereum | Bitcoin |
| investment | price | Blockchain | drop | shopping | beacon | revolutionary | Truffle | Mining | Exchange |

V. Conclusion

We have proposed a deep learning model for explicit and implicit detection of dynamically generated topics from streaming data using an online version of Latent Semantic Indexing constrained by regularization. The approach is scalable to large datasets. It is flexible enough to support both long normal text and short text for modeling the topics. The model is adaptive: it is updated incrementally and performs temporal topic modeling to get a notion of the evolution and trends of topics over time. The approach also supports extraction of implicit and explicit topics from sentences. This model can be treated as a first step towards implicit and explicit aspect detection for aspect-based sentiment analysis on social media data.

We have performed exploratory data analysis and correspondence analysis on a real-world Twitter dataset. The results show that our approach works well to extract topics associated with a given hashtag. Given a query, the approach is able to extract both implicit and explicit topics associated with the terms mentioned in the query. The next step would be to perform a performance analysis with reference to standard performance metrics.

References

  • A. R. Pathak, M. Pandey, and S. Rautaray, “Construing the big data based on taxonomy, analytics and approaches,” Iran J. Comput. Sci., vol. 1, no. 4, pp. 237–259, Dec. 2018.
  • D. M. Blei, “Probabilistic Topic Models,” Commun. ACM, vol. 55, no. 4, pp. 77–84, Apr. 2012.
  • D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” J. Mach. Learn. Res., vol. 3, no. Jan, pp. 993–1022, 2003.
  • T. Hofmann, “Probabilistic latent semantic analysis,” in Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, 1999, pp. 289–296.
  • S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, “Indexing by latent semantic analysis,” J. Am. Soc. Inf. Sci., vol. 41, no. 6, pp. 391–407, 1990.
  • X. Cheng, X. Yan, Y. Lan, and J. Guo, “Btm: Topic modeling over short texts,” IEEE Trans. Knowl. Data Eng., no. 1, p. 1, 2014.
  • Y. Zuo, J. Zhao, and K. Xu, “Word network topic model: a simple but general solution for short and imbalanced texts,” Knowl. Inf. Syst., vol. 48, no. 2, pp. 379–398, Aug. 2016.
  • K. Nigam, A. K. Mccallum, S. Thrun, and T. Mitchell, “Text Classification from Labeled and Unlabeled Documents using EM,” Mach. Learn., vol. 39, no. 2, pp. 103–134, May 2000.
  • P. Xie and E. P. Xing, “Integrating document clustering and topic modeling,” arXiv Prepr. arXiv1309.6874, 2013.
  • D. M. Blei, J. D. Lafferty, and others, “A correlated topic model of science,” Ann. Appl. Stat., vol. 1, no. 1, pp. 17–35, 2007.
  • M. Hoffman, F. R. Bach, and D. M. Blei, “Online learning for latent dirichlet allocation,” in advances in neural information processing systems, 2010, pp. 856–864.
  • Q. Wang, J. Xu, H. Li, and N. Craswell, “Regularized latent semantic indexing,” in Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, 2011, pp. 685–694.
  • L. AlSumait, D. Barbará, and C. Domeniconi, “On-line lda: Adaptive topic models for mining text streams with applications to topic detection and tracking,” in Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on, 2008, pp. 3–12.
  • Y. Wang, E. Agichtein, and M. Benzi, “TM-LDA: efficient online modeling of latent topic transitions in social media,” in Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, 2012, pp. 123–131.
  • X. Li, A. Zhang, C. Li, J. Ouyang, and Y. Cai, “Exploring coherent topics by topic modeling with term weighting,” Inf. Process. Manag., 2018.
  • K. D. Kuhn, “Using structural topic modeling to identify latent topics and trends in aviation incident reports,” Transp. Res. Part C Emerg. Technol., vol. 87, pp. 105–122, 2018.
  • S. Brody and N. Elhadad, “An Unsupervised Aspect-sentiment Model for Online Reviews,” in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2010, pp. 804–812.
  • A. R. Pathak, M. Pandey, S. Rautaray, and K. Pawar, “Assessment of Object Detection Using Deep Convolutional Neural Networks,” in Intelligent Computing and Information and Communication, 2018, pp. 457–466.
  • A. R. Pathak, M. Pandey, and S. Rautaray, “Deep Learning Approaches for Detecting Objects from Images: A Review,” in Progress in Computing, Analytics and Networking, 2018, pp. 491–499.
  • A. R. Pathak, M. Pandey, and S. Rautaray, “Application of Deep Learning for Object Detection,” Procedia Comput. Sci., vol. 132, pp. 1706–1717, 2018.
  • A. B. Dieng, C. Wang, J. Gao, and J. Paisley, “Topicrnn: A recurrent neural network with long-range semantic dependency,” arXiv Prepr. arXiv1611.01702, 2016.
•    Y. Li, T. Liu, J. Hu, and J. Jiang, “Topical Co-Attention Networks for hashtag recommendation on microblogs,” Neurocomputing, vol. 331, pp. 356–365, 2019.
  •    P. Gupta, F. Buettner, and H. Schütze, “Document informed neural autoregressive topic models,” arXiv Prepr. arXiv1808.03793, 2018.
  •    K. Giannakopoulos and L. Chen, “Incremental and Adaptive Topic Detection over Social Media,” in International Conference on Database Systems for Advanced Applications, 2018, pp. 460–473.
  •    Y. Zhang et al., “Does deep learning help topic extraction? A kernel k-means clustering method with word embedding,” J. Informetr., vol. 12, no. 4, pp. 1099–1117, 2018.
  •    W. Gao, M. Peng, H. Wang, Y. Zhang, Q. Xie, and G. Tian, “Incorporating word embeddings into topic modeling of short text,” Knowl. Inf. Syst., pp. 1–23, 2018.
  •    X. Li, Y. Wang, A. Zhang, C. Li, J. Chi, and J. Ouyang, “Filtering out the noise in short text topic modeling,” Inf. Sci. (Ny)., vol. 456, pp. 83–96, 2018.
  •    H. Zhang, B. Chen, D. Guo, and M. Zhou, “WHAI: Weibull Hybrid Autoencoding Inference for Deep Topic Modeling,” in International Conference on Learning Representations, 2018.
  • H. Abdi and L. J. Williams, “Principal component analysis,” Wiley Interdiscip. Rev. Comput. Stat., vol. 2, no. 4, pp. 433–459, 2010.
  • H. Abdi, “Multivariate analysis,” Encycl. Res. methods Soc. Sci. Thousand Oaks Sage, pp. 699–702, 2003.
  • H. Abdi and L. J. Williams, “Correspondence analysis,” Neil Salkind (Ed.), Encyclopedia of Research Design. Thousand Oaks, CA: Sage. 2010.