Automatic Extraction of Driver Attributes from Taxi Mobile Application Logs

Seleznev N.K.; Irkhin I.A.; Kantor V.V.

Journal: Proceedings of Moscow Institute of Physics and Technology (Trudy MIPT)

Section: Informatics and Control

Published in issue: 3 (39), vol. 10, 2018.


Many tasks solved at Yandex.Taxi with machine learning, whether ordinary user segmentation, prediction of the number of trips in the following month, or other tasks, require representing an application user as a feature vector. One of the main data sources for building such a vector is the mobile application logs, which, however, are weakly structured. Manual feature extraction from this kind of data is complicated by its nature: it requires serious knowledge of human behavior and, in addition, a deep understanding of the technical details of log generation. We developed a method that automatically constructs an n-dimensional vector representation of a user based on his or her activity in the mobile application. The resulting representation can be used as a feature set in supervised and unsupervised learning tasks. Experiments show that the tested models successfully extract important information about the user. We tested our method on supervised learning tasks solved in the service, and the results show that the resulting user representation is useful both on its own and in combination with hand-crafted features from the user’s order history.

Keywords: multi-criteria optimization, representation learning, log analysis, mobile application logs, automatic feature extraction

Short URL: https://sciup.org/142220442

IDR: 142220442 | UDC: 004.85

1. Introduction

Yandex.Taxi is a service that allows its users to order an official taxi at an affordable rate without calling a dispatcher. One can order a taxi on the site or through the Yandex.Taxi application for iOS or Android. Yandex.Taxi users generate a substantial amount of data, mainly coming from their order history and application activity logs. These data are used extensively for machine learning throughout the company, for example to recommend destination points for a given trip or to estimate taxi demand in a given area.

Both streams of data (history of orders and application activity logs) contain crucial information about the users and complement each other in various user-oriented machine learning tasks. However, analyzing them together is difficult. Users’ history of orders is well-structured and, in many ways, straightforward to extract features from. In contrast, logs of users’ activity in the application are much less accessible without extensive study of the data. Besides, feature extraction from application logs requires some expertise in human behavior, cognitive abilities and psychology, specifically as applied to understanding user activity in mobile applications. Overall, extracting features from application logs is extremely labor-intensive and, as a consequence, the company uses its data less efficiently than it could if application logs were easier to work with.

To help machine learning practitioners throughout Yandex.Taxi analyze technical and weakly structured application logs, we propose a method for automatic construction of a user’s vector representation based on her mobile application activity. The proposed representation is an n-dimensional dense vector constructed from a given Yandex.Taxi user’s mobile application log history. It maps all users to the same vector space and acts as a feature set for both supervised and unsupervised machine learning tasks.

Later in this paper we will refer to the aforementioned n-dimensional dense vector constructed from a given user’s mobile application activity logs as «user representation», «user-embedding», «representation» or «user-vector».

2. Related Work

The construction of user representations from weakly structured or unstructured data has been studied for some time. A popular setup is to bring the users of a service into the same vector space as its products and to make product recommendations for users based on some distance metric or more sophisticated techniques [3,11,12]. This paper is not concerned with recommendation systems; it aims to solve supervised learning tasks as in [1] and to find similar users as in [3].

Although our approach is closely related to the model presented in [1], one of the main differences is that mobile application activity data is more technical and less interpretable than website activity data. Moreover, we are interested not only in a user representation that is explicitly trained on some number of supervised learning tasks, but equally in the ability of this representation to generalize to previously unseen tasks. For that reason, apart from supervised learning tasks, we employ various techniques to improve generalization in an ordinary multitask fashion [2]. Furthermore, we test various models capable of word-level embedding and compare their performance against each other on a set of mimic tasks. We also show that our method may be applied to a real-world production task. Finally, we study the relationship between the method’s performance on supervised tasks and the configuration of auxiliary tasks it was trained on.

The method employed in this paper is comparable with the one suggested in [3], as one of the goals of our approach is to identify similar users. In some of the tested models we use a similar strategy to obtain the user representation, except that we are interested not in representing a user’s individual log sessions but in the aggregated history of her sessions. Nevertheless, we test the idea of averaging the word-level embeddings that belong to a user’s application activity log history, which is close in spirit to the approach of [3]. One of the distinctive features of our setup is that the notion of context in Yandex.Taxi mobile application logs is ill-defined. Therefore, it is not immediately justified to use word2vec [4] and other context-based embedding techniques to obtain word-level embeddings.

Parts of the presented approach may be used for categorical feature embedding as in [5]. During the course of training, some of the tested models learn representations for mobile application activity logs’ event names (identifiers of some event happening, e.g. start of the application or tap on the «order button»). After training, one may use the Euclidean space representations of said event names for machine learning tasks.

3. Methodology

The user representation is expected to:

1) be able to act as a feature set for business-oriented supervised learning tasks, such as estimation of a user’s Lifetime Value or identification of a user’s service preferences (e.g. a child seat requirement). Below, this property is referred to as «predictive power»;

2) help identify similar users in terms of business metrics, such as willingness to accept surge pricing or tariff preferences. Below, this property is referred to as «similarity».

3.1. Data Description and Preprocessing

The main source of users’ data is their logged activity in the mobile application. The log is represented by a series of consecutive events, some of which contain detailed descriptions regarding the event. Each event has: event_name, event_value (description), event_timestamp, event_region, session_id and event_coordinates.

Example 1. If event_name is «accept_order_button_is_clicked», then its description might be «tariff: economy, surge value: 1.5, source coordinate: (10, 10), target coordinate: (20, 20)».

After the manual selection process, there are 169 unique event names, 40 of which contain event values. The selection of event names for user-text (concatenated event names and event values) creation was done manually based on the amount of useful information they bear.

Example 2. Event «application_started» is ignored, because it, seemingly, bears no relevant information about the user except for the fact that she started the app, and that information is logged seconds later on the first screen she sees.

Preprocessing of event values aims to extract useful information from raw logs and to help text-embedding models distinguish between events that look similar at first glance.

Example 3. event_value «surge 1.2» is transformed to «surge_yes surge_value_1_2», while event_value «surge: 1.0» is transformed to «surge_no surge_value_1_0» to enable text-embedding models to tell the difference between the situation in which surge price was accepted and the opposite.
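The transformation in Example 3 can be sketched as follows. This is a minimal illustration rather than the actual preprocessing pipeline: the function name `preprocess_surge`, the regular expression, and the rule that a multiplier above 1.0 means surge pricing was in effect are assumptions inferred from the example.

```python
import re

# Hypothetical pattern: the word "surge", an optional colon, then a decimal number.
SURGE_RE = re.compile(r"surge:?\s*(\d+)\.(\d+)")


def preprocess_surge(event_value: str) -> str:
    """Turn a raw surge description into tokens for a text-embedding model."""
    match = SURGE_RE.search(event_value)
    if match is None:
        # Values without recognizable surge information are left untouched.
        return event_value
    whole, frac = match.groups()
    # Assumption: a multiplier above 1.0 means surge pricing was in effect.
    flag = "surge_yes" if float(f"{whole}.{frac}") > 1.0 else "surge_no"
    return f"{flag} surge_value_{whole}_{frac}"


print(preprocess_surge("surge 1.2"))
print(preprocess_surge("surge: 1.0"))
```

Encoding the surge flag and the surge value as separate tokens lets a bag-of-words or word-embedding model treat "surge was present" and "how large it was" as distinct signals.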

3.2. Predictive Power and Similarity Evaluation

To evaluate the performance of user-embeddings on the predictive power and similarity tasks, we collected a set of business metrics that are used as target values in these tasks. The predictive power of a user representation is measured by its ability to predict the collected business metrics associated with the user. For the experiments, we chose 8 business metrics (Table 1). The symbol «*» in the column «Metric Name» indicates the presence of information directly associated with the metric value in the user’s application logs.

Table 1

User’s Business Metrics Description

Metric Name	Description	Performance Evaluation Metric
accepts surge*	1 if user has accepted surge pricing at least once, 0 - otherwise.	Accuracy
tariff*	The distribution of user’s total taxi orders among available tariffs.	Categorical cross entropy
card system	The distribution of user’s total taxi orders among available card systems.	Categorical cross entropy
payment type*	The distribution of user’s total taxi orders among available payment types.	Categorical cross entropy
mean cost	Average cost of user’s order.	RMSE
mean travel time	Average travel time of user’s trip.	RMSE
cancel frequency*	The number of cancelled orders divided by the total number of orders user had.	RMSE
num orders	Total number of orders user had.	RMSE

The ability of a user-embedding to help identify similar users is measured as follows. First, for each user we find the top-n most similar users (n = 5 in our experiments) by cosine similarity. Second, for each business metric from Table 1 we evaluate the variance within the group of selected users. For the real-valued metrics and the binary one we use the regular variance; for the other metrics, which are essentially distributions, we measure the average pairwise Hellinger distance within the group.
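The similarity evaluation described above can be sketched in plain Python. The sketch rests on several assumptions that the text does not settle: the group consists of the n neighbours only (the query user is excluded), population variance is used, embeddings are non-zero vectors, and all function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import pvariance


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_n_similar(embeddings, user, n=5):
    """Indices of the n users closest to `user` by cosine similarity."""
    others = [i for i in range(len(embeddings)) if i != user]
    others.sort(key=lambda i: cosine(embeddings[user], embeddings[i]), reverse=True)
    return others[:n]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)


def group_spread(values):
    """Variance for scalar metrics, mean pairwise Hellinger distance for distributions."""
    if isinstance(values[0], (int, float)):
        return pvariance(values)
    pairs = list(combinations(values, 2))
    return sum(hellinger(p, q) for p, q in pairs) / len(pairs)


# Toy data: four users with 2-dimensional embeddings and a binary metric.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
accepts_surge = [1, 1, 0, 0]
neighbours = top_n_similar(embeddings, user=0, n=2)
print(neighbours, group_spread([accepts_surge[i] for i in neighbours]))
```

Averaging `group_spread` over all users gives one number per business metric; lower values mean that neighbours in the embedding space are more alike with respect to that metric.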

3.3. Models

In our experiments, we evaluated the performance of 5 models of different nature, some of which have both unsupervised and supervised versions (* indicates the presence of a supervised version of a model):

• Word2Vec (W2V)*
• FastText (FT)
• Doc2Vec (D2V)
• Autoencoder (AE)*
• ARTM*

Before diving deeper into the models’ architectures, it is crucial to define the concept of a «guide». A guide is a task additional to the model’s original unsupervised objective. With this additional task, we encourage the model to pay more attention to the textual features that are indicative of, e.g., the user’s tariff preferences or the card system she uses. We employ guides explicitly in an ordinary multitask learning fashion, i.e. we add auxiliary losses to the original unsupervised loss, while the embedding stage is shared among all of the tasks. As an example, consider an autoencoder that takes a user-text as input (in a bag-of-words representation) and transforms it into a dense fixed-length vector with the objective of minimizing both the reconstruction loss and an auxiliary binary cross entropy loss for classifying whether the user accepts surge pricing. In this setup, the autoencoder tries to learn a user representation that preserves both the information important for reconstruction (the original objective) and the patterns indicative of surge pricing acceptance. In the remainder of the paper we refer to any auxiliary task as a guide. All supervised models were trained with 4 guides: payment type, tariff, mean cost and num orders.

All the models except for unsupervised versions of W2V, FT and D2V are trained to obtain 100-dimensional user representation. Unsupervised W2V, FT and D2V obtain 200-dimensional representation. The choice of dimensionality is guided by each model’s performance in the predictive power task.

Word2Vec

As the unsupervised version of the model (W2V simple) we used the gensim [6] implementation of Skip-Gram word2vec trained on the full corpus of user-texts. To obtain a given user’s representation, all word vectors from his or her user-text are averaged. The supervised word2vec model (CBOW) has multi-layer perceptrons attached to the embedding layer of word2vec, one for each guide we introduce to the model. At each epoch of training, the regular word2vec model is first trained on the whole corpus of user-texts; then the embedding layer is taken out and trained simultaneously with multiple classifiers (MLPs) on top of it to minimize the loss associated with the prediction of the user’s business metrics. The whole process is repeated until the classification converges. In the described architecture the original word2vec objective acts as a regularizer that helps the model generalize better to unseen tasks (e.g. prediction of business metrics that were not used as guides during training). The supervised model has 2 versions: one with a global average pooling layer on top of the embedding layer (W2V POOL) and the other with an LSTM [10] layer in that place (W2V LSTM).
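The averaging step used by W2V simple can be illustrated with a short sketch. The dictionary `word_vectors` is a stand-in for trained Skip-Gram vectors, the token `tariff_economy` is a hypothetical example, and silently skipping tokens that have no vector is an assumption rather than a documented detail of the method.

```python
def average_embedding(user_text, word_vectors):
    """Mean of the vectors of all known tokens in a whitespace-separated user-text."""
    tokens = [t for t in user_text.split() if t in word_vectors]
    if not tokens:
        raise ValueError("no known tokens in user-text")
    dim = len(word_vectors[tokens[0]])
    sums = [0.0] * dim
    for token in tokens:
        for k, x in enumerate(word_vectors[token]):
            sums[k] += x
    return [s / len(tokens) for s in sums]


# Toy 2-dimensional vectors standing in for a trained word2vec model.
word_vectors = {"surge_yes": [1.0, 0.0], "tariff_economy": [0.0, 1.0]}
user_vec = average_embedding("surge_yes tariff_economy surge_yes", word_vectors)
print(user_vec)
```

Because repeated tokens are counted each time they occur, frequent events pull the user vector toward their own embeddings.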

Fig. 1. General supervised word2vec architecture:

$w$ - word, $t$ - index of the current central word in the word2vec window (CBOW), $s$ - half-length of the word2vec window, $e_i$ - embedding of word $i$, $e_{av}$ - embedding of user-text, $I$ - number of words in user-text, $\hat{y}_j$ - prediction of guide $j$ (business metric), $m$ - number of guides used for training

ARTM

We chose ARTM as a topic modeling approach (and BigARTM [7] as a tool) because it is a generalized topic modeling method: with 2 different sets of parameters it reduces to 2 of the most popular approaches to the task, namely LDA and PLSA. The unsupervised version of ARTM (ARTM simple) is a simple LDA model implemented in the BigARTM library and trained on the full corpus of user-texts. The supervised ARTM (ARTM guided) is a regular ARTM model with guides represented as modalities added to the original word modality. Intuitively, the model takes, for example, the user’s mean order cost or her willingness to accept surge pricing as an additional modality in the topic modeling task.

Autoencoder

The unsupervised autoencoder (AE simple) aims to encode a bag-of-words representation of user-text to fixed-length vector and then reconstruct the original input from it. The encoded representation is used for subsequent tasks. The supervised Autoencoder (AE guided) is a regular autoencoder model with output from the encoder being fed to dense layers for business metrics prediction. The model is trained in a regular multitask fashion with total loss being a weighted sum of reconstruction loss and all guides’ losses. Contrary to supervised word2vec, this model is trained in an end-to-end fashion.
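The weighted-sum objective of the guided autoencoder can be written out explicitly. The notation below is an assumption introduced for illustration: $x$ is the bag-of-words input, $\mathrm{enc}$ and $\mathrm{dec}$ are the encoder and decoder, $h_j$ is the dense head for guide $j$, and $\lambda_0, \ldots, \lambda_m$ are loss weights whose values are not given in the text; $e$, $\hat{y}_j$ and $m$ follow the caption of Fig. 2.

```latex
\mathcal{L}_{\text{total}}
  = \lambda_0 \, \mathcal{L}_{\text{rec}}\bigl(x, \hat{x}\bigr)
  + \sum_{j=1}^{m} \lambda_j \, \mathcal{L}_j\bigl(y_j, \hat{y}_j\bigr),
\qquad
e = \mathrm{enc}(x), \quad \hat{x} = \mathrm{dec}(e), \quad \hat{y}_j = h_j(e)
```

Since every term depends on the shared embedding $e$, gradients from the reconstruction loss and from all guide losses update the encoder jointly, which is what makes the training end-to-end.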

Fig. 2. General supervised autoencoder architecture:

$[wc_0 \ldots wc_n]$ - bag-of-words representation of user-text, $n$ - number of words in the vocabulary, $wc_i$ - number of times word $i$ appeared in user-text, $dim$ - dimensionality of user-text embedding, $e$ - embedding of user-text, $\hat{y}_j$ - prediction of guide $j$ (business metric), $m$ - number of guides used for training

Doc2Vec

As the unsupervised version of the model we used gensim implementation of DBOW doc2vec [13] trained on full corpus of user-texts with each user-text treated as a document. In order to obtain user representation, the corresponding document vector is inferred. There is no supervised version of this model.

FastText

The original Facebook Research fastText [8] implementation is used. There is no supervised version of this model.

4. Results
4.1. Mock Evaluation: Predictive Power

Table 2

Predictive Power Evaluation Results

Each row in the table presents a model, each column a guide (business metric) for which performance was evaluated. The row «constant» shows the performance of the best constant prediction fitted on the training set for each business metric. The row «W2V untrained» refers to the supervised word2vec model that was not trained, only initialized with random weights. All performance evaluation metrics are taken from Table 1. In all columns except the first one (accuracy), lower is better.

	accepts surge	card system	payment type	tariff	cancel frequency	mean travel time	mean cost	num orders
W2V untrained	0.8578	0.719	0.4959	0.2129	0.1629	11.8916	198.809	52.986
W2V simple	0.8758	0.7388	0.4393	0.2152	0.1688	11.8363	192.227	49.9414
W2V POOL	0.8657	0.5608	0.3958	0.1537	0.1635	11.7972	175.895	47.2452
W2V LSTM	0.7769	0.6904	0.4797	0.5538	0.1758	11.9295	184.502	41.8545
FT	0.8736	0.6248	0.4314	0.1541	0.1691	11.7443	191.327	50.5823
D2V	0.6072	0.7749	0.5243	0.2185	0.192	12.6536	220.606	55.9574
ARTM simple	0.9112	0.8008	0.4681	0.1836	0.1378	11.7484	202.655	38.3039
ARTM guided	0.9162	0.7478	0.5148	0.2165	0.1641	11.7736	194.085	38.8015
AE simple	0.8664	0.6767	0.4776	0.1605	0.1722	11.9182	200.173	36.0736
AE guided	0.87	0.5838	0.4319	0.1632	0.1693	11.8151	183.072	36.7271
constant	0.5819	0.7823	0.5259	0.2152	0.1932	12.6442	220.624	60.0897

The results suggest that our supervised W2V model shows best performance in 4 out of 8 prediction tasks. The guides used for supervision are: payment type, tariff, mean cost and num orders. The W2V POOL model outperforms others in 3 out of 4 tasks it was explicitly supervised on. However, in the card system distribution estimation it shows best result despite the fact it was not supervised with respect to this metric. The opposite is true for the num orders metric, on which W2V POOL was supervised, yet it struggles to beat the other models.

In 2 of the tasks (cancel frequency and mean travel time) the best results are shown by unsupervised models (LDA and FastText).

The largest variance in the predictive power among the models is present for tariff and num orders tasks, the smallest - for mean travel time and mean cost.

Additionally, constant prediction is not the best in any of the tasks, which indicates that nearly all models have learned to extract meaningful information about users’ business metrics.

4.2. Mock Evaluation: Similarity

In the similarity task, Table 3 shows the supervised autoencoder (AE guided) achieving the lowest value in 4 out of 8 tasks (accepts surge, payment type, card system and tariff), while W2V untrained wins 3 out of 8 tasks. For now, we cannot suggest a reasonable explanation of the latter phenomenon. Of the tasks won by the supervised autoencoder, two (payment type and tariff) are ones it was supervised on, and two (accepts surge and card system) are ones for which it had no guide.

The largest variance in the similarity task performance is present for tariff and mean cost tasks, the smallest - for card system and payment type.

Table 3

Similarity Evaluation Results

Each row in the table presents a model, each column a guide (business metric) for which performance was evaluated. The row «random» presents variances calculated on n random samples from the dataset (i.e. the selection of the n users most similar to the given user is replaced with uniform random sampling of n users). The row «W2V untrained» refers to the word2vec model that was not trained, only initialized with random weights. The details of the evaluation are described in Section 3.2 of this paper. In each column, lower is better.

	accepts surge	payment type	card system	tariff	cancel frequency	mean cost	num orders	mean travel time
W2V untrained	0.0984	0.2792	0.6893	0.0879	0.0203	26951.9	3103.32	82.0377
W2V simple	0.0976	0.2689	0.6563	0.0951	0.0223	29554.4	2238.73	104.411
W2V POOL	0.0956	0.2258	0.5596	0.0907	0.0209	29819.8	2167.41	92.5748
W2V LSTM	0.1259	0.2645	0.6188	0.0894	0.0239	30774.1	773.741	108.91
FT	0.1024	0.2685	0.6562	0.0935	0.0224	35447	2420.46	103.427
D2V	0.1762	0.319	0.7217	0.1375	0.0289	67849.9	2483.63	127.114
ARTM simple	0.1073	0.233	0.5595	0.0893	0.023	57348.9	1704.41	116.845
ARTM guided	0.1111	0.2805	0.6674	0.0969	0.0243	65342.3	2375.38	104.782
AE simple	0.0967	0.283	0.6833	0.0865	0.0229	39925	1943.59	102.506
AE guided	0.0928	0.223	0.555	0.0785	0.0214	49074.9	1405.92	108.226
random	0.1966	0.3165	0.7271	0.1356	0.0293	67277.2	2107.31	121.262

It is important to note that supervised models beat the others in 5 out of 8 tasks; however, in 3 tasks the untrained word2vec with randomly initialized weights wins.

Furthermore, random grouping of users is not the best in any of the tasks, which indicates that nearly all the models have learned to place users that are similar in terms of business metrics closer to each other in terms of cosine distance.

4.3. Guide Validation

We also studied the relationship between the addition of different guides to the autoencoder model and its performance on the predictive power task. To estimate this relationship, we trained and evaluated the supervised autoencoder model with 255 possible combinations of guides. Then we created a set of 255 examples, each represented by a vector of 8 variables indicating whether the model was trained with guide g (guides[g]=1) or without it (guides[g]=0); this is our feature set. The target values are the performance measures on the 8 business metric prediction tasks from the predictive power evaluation stage. We train 8 regression models separately, one per business metric, to predict performance on that metric for every possible set of guides.
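The feature set for the guide validation can be enumerated in a few lines. The sketch assumes that the 255 combinations are exactly the non-empty subsets of the 8 guides (2^8 - 1 = 255); the function name `guide_design_matrix` and the column order are illustrative choices.

```python
from itertools import product

GUIDES = [
    "tariff", "payment type", "accepts surge", "card system",
    "cancel frequency", "num orders", "mean cost", "mean travel time",
]


def guide_design_matrix(guides):
    """All non-empty on/off guide configurations as rows of 0/1 indicators."""
    rows = [list(bits) for bits in product([0, 1], repeat=len(guides))]
    # Assumption: the configuration with no guides at all is excluded.
    return [row for row in rows if any(row)]


X = guide_design_matrix(GUIDES)
print(len(X), X[0])
```

Each row of `X` is one training configuration; pairing it with the measured performance on a single business metric gives the data for one of the 8 regressions, so that a coefficient estimates the effect of switching a guide on while the others are held fixed.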

Table 4

Guide Validation Results

Each coefficient with coordinates (g, m) in the table shows the effect of introducing the g’th guide on the performance of the model on the m’th business metric prediction task. The coefficients are normalized to the scale of the dependent variable (all coefficients show relative percentage changes in the target variable if the guide is present, cet. par.). Empty cells are coefficients that did not pass the 95% significance level as measured by regular significance tests applied to OLS regression coefficients.

	tariff	payment type	accepts surge	card system	cancel frequency	num orders	mean cost	mean travel time
tariff	0.0831						0.0039
payment type		-0.0459		-0.0705				0.0018
accepts surge			-0.0173				0.0064
card system		-0.0740	0.0033	-0.1069
cancel frequency			0.0021				0.0036
num orders							0.0041
mean cost		-0.0110	0.0019		0.0073	-0.005	-0.0451	-0.0040
mean travel time								-0.0086

Table 4 offers some insights about the guides and their effects on the separate predictive power tasks. One of them is that, as expected, the introduction of a guide to the model boosts its performance on the corresponding prediction task cet. par. (for example, if we add the guide for the mean cost prediction task, the RMSE on this task falls by 4.5% cet. par.). The only exception is the tariff task, which demonstrates the opposite.

Another observation is that some of the guides appear to contribute not only to the predictive power for their own metric, but to others as well. An example is the card system guide, which not only helps to predict the credit card system type better, but also boosts the performance on the payment type task. We speculate that this phenomenon may be explained as follows: if one gives the model information about the card system of the user (e.g. MasterCard), then it may infer that this user’s payment type is likely to be card and not cash. A less intuitive relation is seen between mean cost and payment type, where the introduction of the mean cost guide improves the model’s performance on the estimation of the payment type distribution.

Overall, if each row is summed up, one may see that some guides improve the total performance of the model and some do not. This information may be useful to select guides for models’ training.

4.4. Application to Production Task

In this part of the section we investigate how the obtained user representation may be applied in a real-world setup.

The task is to predict the number of users’ trips up to a given date based on their activity in the first month. The given date is fixed for all users, while the starting date may vary. We use both available data streams, namely, history of orders and mobile app activity logs. There are 3,999 users in the dataset.

First, we extract features from users’ history of orders (94 features in total). We fit a boosting model (CatBoost [9]) on 2,999 samples from the dataset and evaluate it using 1,000 samples as the test set. Second, we construct user representations from users’ first month of mobile application activity and fit the same model on the constructed vectors. Then, we fit the model on the combined feature set, which includes both the hand-crafted features generated from history of orders and the user-embeddings obtained by our method. To construct the user-embeddings we use the unsupervised autoencoder model so as to prevent leaks indicative of users’ future trips; moreover, the autoencoder model is trained using only the first month of users’ mobile application activity.

Table 5

Production Task Results

The task is to predict the number of a user’s trips up to a given date based on his or her activity in the first month. The row «Best of median / mean» shows the performance of the best constant prediction on this dataset.

Feature Set	MAE	RMSE
Best of median / mean	13.93	32.46
Hand-crafted features (HC)	12.11	27.06
User-embeddings (UE)	11.79	26.2
Combined (HC + UE)	11.2	24.6
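One plausible reading of the «Best of median / mean» row is sketched below: both the training median and the training mean are tried as constant predictions, and the better score is reported for each metric. This interpretation, the function names and the toy numbers are assumptions; the only general fact relied on is that the median minimizes MAE and the mean minimizes RMSE on the data they are computed from.

```python
import math
from statistics import mean, median


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def constant_baseline(y_train, y_test):
    """Best MAE and best RMSE over the two constant predictors (median, mean)."""
    candidates = [median(y_train), mean(y_train)]
    best_mae = min(mae(y_test, [c] * len(y_test)) for c in candidates)
    best_rmse = min(rmse(y_test, [c] * len(y_test)) for c in candidates)
    return best_mae, best_rmse


# Toy trip counts; not data from the paper.
best_mae, best_rmse = constant_baseline([1, 2, 3, 10], [2, 3])
print(best_mae, best_rmse)
```

A feature set is only informative if a model trained on it beats this baseline on both metrics.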

It is evident from Table 5 that our method performs better than hand-crafted feature extraction. Moreover, the combined representation yields the best results. It is important to note that in the combined version we use both available data streams while avoiding manual feature extraction from user mobile application activity logs, which is a very labor-intensive procedure. Overall, we believe that our method may improve existing production processes by enriching them with automatic feature extraction from mobile application activity logs.

5. Conclusions and Future Work

We show how various models of different nature may be used to obtain user representation through her Yandex.Taxi mobile application activity. Such representation is capable of acting as a feature set for supervised learning tasks and is helpful to identify similar users in terms of their business metrics. We also studied the relation between the method’s performance and configuration of guides it was fed with. The findings suggest that some guides are complementary to each other and some are the opposite. One can tune the configuration of guides in order to achieve best overall performance.

Our method is not yet deployed in the company, as the process faces various challenges. The main obstacle is that Yandex.Taxi is growing rapidly and the existing process of mobile application log generation is constantly being improved (event names are changed or merged, etc.). So, in order to keep the method’s performance at the same level, one needs to retrain it constantly. However, training the best models is quite time-consuming: on a machine with 16 CPU cores at 2.5 GHz each, the supervised word2vec takes almost 10 hours to converge with a training set of 5,000 users. Our aim is to scale training up to around 10,000,000 users. Our approach is going to be deployed as soon as we optimize it for faster training. The current state of the method is sufficient for a one-time improvement of various models used in Yandex.Taxi; for continuous usage in production processes, however, the challenge outlined above needs to be overcome.

Future work may concentrate around context-based embedding models’ performance under the conditions of context absence. Also, the study of word-level embeddings change in the course of training looks promising for the discovery of methods to separate training of words that benefit from context-based approach from ones that do not, which might be helpful to learn better representations.

We would like to thank Tatiana Saveleva, Arsenii Ashukha and Anton Pankratov for their contributions to reviewing and drafting the paper and for their thoughts on algorithm design and evaluation.

References

1. Żołna K., Romański B. User Modeling Using LSTM Networks // Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17). 2017. P. 5025-5026.
2. Ruder S. An overview of multi-task learning in deep neural networks // arXiv preprint arXiv:1706.05098. 2017.
3. Arora S., Warrier D. Decoding fashion contexts using word embeddings // KDD Workshop on Machine learning meets fashion. 2016.
4. Mikolov T., Sutskever I., Chen K., Corrado G.S., Dean J. Distributed representations of words and phrases and their compositionality // Advances in neural information processing systems. 2013. P. 3111-3119.
5. Guo Ch., Berkhahn F. Entity embeddings of categorical variables // arXiv preprint arXiv:1604.06737. 2016.
6. Rehurek R., Sojka P. Software framework for topic modelling with large corpora // Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. 2010. P. 361-369.
7. Vorontsov K., Frei O., Apishev M., Romov P., Dudarenko M. BigARTM: Open source library for regularized multimodal topic modeling of large collections // International Conference on Analysis of Images, Social Networks and Texts. 2015. P. 370-381.
8. Bojanowski P., Grave E., Joulin A., Mikolov T. Enriching word vectors with subword information // arXiv preprint arXiv:1607.04606. 2016.
9. Dorogush A.V., Ershov V., Gulin A. CatBoost: gradient boosting with categorical features support. 2017.
10. Hochreiter S., Schmidhuber J. Long short-term memory // Neural computation. 1997. V. 9, N 8. P. 1735-1780.
11. Liu H., Wu L., Zhang D., Jian M., Zhang X. Multi-perspective User2Vec: Exploiting re-pin activity for user representation learning in content curation social network // Signal Processing. 2018. V. 142. P. 450-456.
12. Ozsoy M.G. From word embeddings to item recommendation // arXiv preprint arXiv:1601.01356. 2016.
13. Le Q., Mikolov T. Distributed representations of sentences and documents // International Conference on Machine Learning. 2014. P. 1188-1196.
