Active Learning and Crowdsourcing: A Survey of Optimization Methods for Data Labeling

Authors: R. A. Gilyazev, D. Yu. Turdakov

Journal: Trudy ISP RAN / Proceedings of the Institute for System Programming of the RAS

Published in: vol. 30, issue 2, 2018.


High-quality annotated collections are a key element in building machine-learning systems. In most cases, creating such collections requires human annotators, and the process itself is expensive and tedious for them. A number of methods based on active learning and crowdsourcing have been proposed to optimize this process. This paper surveys existing approaches, discusses their combined application, and describes existing software systems designed to simplify the data labeling process.
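To make the active-learning side of the survey concrete, here is a minimal sketch of pool-based active learning with uncertainty sampling, the family of strategies covered in Settles' survey cited in the references below. The synthetic dataset, logistic-regression classifier, seed-set size, and labeling budget are illustrative assumptions, not details taken from the article.

```python
# Minimal sketch of pool-based active learning with uncertainty sampling.
# All concrete choices (dataset, model, budget) are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Seed the labeled set with a few examples of each class so the first fit
# is well-posed; everything else starts in the unlabeled pool.
labeled = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])
pool = [i for i in range(len(y)) if i not in labeled]
budget = 50  # how many annotations we can afford to request

model = LogisticRegression(max_iter=1000)
for _ in range(budget):
    model.fit(X[labeled], y[labeled])
    # Uncertainty sampling: query the pool example the current model is
    # least confident about (max predicted class probability closest to 0.5).
    proba = model.predict_proba(X[pool])
    query = pool.pop(int(np.argmax(1.0 - proba.max(axis=1))))
    labeled.append(query)  # here a human annotator would label X[query]

print(f"labeled {len(labeled)} of {len(y)} examples")
```

In the crowdsourced setting that the survey combines with active learning, the `labeled.append(query)` step would instead dispatch the queried example to several workers and aggregate their noisy answers, for example by majority vote or the Dawid-Skene model (both covered in the references below).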

Keywords: active learning, crowdsourcing, corpus annotation, crowd computing

Short URL: https://sciup.org/14916522

IDR: 14916522   |   DOI: 10.15514/ISPRAS-2018-30(2)-11

References

  • Rubi Boim, Ohad Greenshpan, Tova Milo, Slava Novgorodov, Neoklis Polyzotis, and Wang-Chiew Tan. Asking the right questions in crowd data sourcing. In Proc. of the 28th International Conference on Data Engineering (ICDE), 2012, pp. 1261-1264.
  • Anthony Brew, Derek Greene, and Pádraig Cunningham. Using crowdsourcing and active learning to track sentiment in online media. In Proc. of the 19th European Conference on Artificial Intelligence, 2010, pp. 145-150.
  • A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, vol. 28, № 1, 1979, pp. 20-28.
  • Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux. ZenCrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In Proc. of the 21st International Conference on World Wide Web, 2012, pp. 469-478.
  • Ju Fan, Guoliang Li, Beng Chin Ooi, Kian-Lee Tan, and Jianhua Feng. iCrowd: an adaptive crowdsourcing framework. In Proc. of the ACM SIGMOD International Conference on Management of Data, 2015, pp. 1015-1030.
  • Ju Fan, Meihui Zhang, Stanley Kok, Meiyu Lu, and Beng Chin Ooi. CrowdOp: query optimization for declarative crowdsourcing systems. IEEE Transactions on Knowledge and Data Engineering, vol. 27, № 8, 2015, pp. 2078-2092.
  • Meng Fang, Xingquan Zhu, Bin Li, Wei Ding, and Xindong Wu. Self-taught active learning from crowds. In Proc. of the 12th International Conference on Data Mining (ICDM), 2012, pp. 858-863.
  • Paul Felt, Robbie Haertel, Eric K Ringger, and Kevin D Seppi. MomResp: a Bayesian model for multi-annotator document labeling. In Proc. of the Ninth International Conference on Language Resources and Evaluation, 2014, pp. 3704-3711.
  • Arpita Ghosh, Satyen Kale, and Preston McAfee. Who moderates the moderators?: crowdsourcing abuse detection in user-generated content. In Proc. of the 12th ACM conference on Electronic commerce, 2011, pp. 167-176.
  • Daniel Haas, Jiannan Wang, Eugene Wu, and Michael J. Franklin. CLAMShell: speeding up crowds for low-latency data labeling. Proc. of the VLDB Endowment, vol. 9, № 4, 2015, pp. 372-383.
  • Shuji Hao, Steven C. H. Hoi, Chunyan Miao, and Peilin Zhao. Active crowdsourcing for annotation. In Proc. of the IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, vol. II, 2015, pp. 1-8.
  • Gang Hua, Chengjiang Long, Ming Yang, and Yan Gao. Collaborative active learning of a kernel machine ensemble for recognition. In Proc. of the IEEE International Conference on Computer Vision (ICCV), 2013, pp. 1209-1216.
  • Nguyen Quoc Viet Hung, Nguyen Thanh Tam, Lam Ngoc Tran, and Karl Aberer. An evaluation of aggregation techniques in crowdsourcing. In Proc. of the International Conference on Web Information Systems Engineering, 2013, pp. 1-15.
  • Hiroshi Kajino, Yuta Tsuboi, and Hisashi Kashima. A convex formulation for learning from crowds. Transactions of the Japanese Society for Artificial Intelligence, vol. 27, № 3, 2012, pp. 133-142.
  • David R Karger, Sewoong Oh, and Devavrat Shah. Iterative learning for reliable crowdsourcing systems. In Proc. of the Neural Information Processing Systems 2011 Conference (Advances in neural information processing systems 24), 2011, pp. 1953-1961.
  • Faiza Khan Khattak. Toward a Robust and Universal Crowd Labeling Framework. PhD Thesis, Columbia University, 2017, 168 p.
  • Faiza Khan Khattak and Ansaf Salleb-Aouissi. Quality control of crowd labeling through expert evaluation. In Proc. of the 2nd NIPS Workshop on Computational Social Science and the Wisdom of Crowds, vol. 2, 2011, 5 p.
  • Adam Kilgarriff. Gold standard datasets for evaluating word sense disambiguation programs. Computer Speech and Language, vol. 12, № 3, 1998, pp. 453-472.
  • Hyun-Chul Kim and Zoubin Ghahramani. Bayesian classifier combination. In Proc. of the Fifteenth International Conference on Artificial Intelligence and Statistics, 2012, pp. 619-627.
  • Florian Laws, Christian Scheible, and Hinrich Schütze. Active learning with Amazon Mechanical Turk. In Proc. of the Conference on Empirical Methods in Natural Language Processing, 2011, pp. 1546-1556.
  • Kyumin Lee, James Caverlee, and Steve Webb. The social honeypot project: protecting online communities from spammers. In Proc. of the 19th International Conference on World Wide Web, 2010, pp. 1139-1140.
  • Guoliang Li, Chengliang Chai, Ju Fan, Xueping Weng, Jian Li, Yudian Zheng, Yuanbing Li, Xiang Yu, Xiaohang Zhang, and Haitao Yuan. CDB: optimizing queries with crowd-based selections and joins. In Proc. of the ACM SIGMOD International Conference on Management of Data, 2017, pp. 1463-1478.
  • Guoliang Li, Jiannan Wang, Yudian Zheng, and Michael J Franklin. Crowdsourced data management: A survey. IEEE Transactions on Knowledge and Data Engineering, vol. 28, № 9, 2016, pp. 2296-2319.
  • Yaliang Li, Jing Gao, Chuishi Meng, Qi Li, Lu Su, Bo Zhao, Wei Fan, and Jiawei Han. A survey on truth discovery. ACM SIGKDD Explorations Newsletter, vol. 17, № 2, 2016, pp. 1-16.
  • Qiang Liu, Jian Peng, and Alexander T Ihler. Variational inference for crowdsourcing. In Proc. of the Neural Information Processing Systems 2012 Conference (Advances in neural information processing systems 25), 2012, pp. 692-700.
  • Xuan Liu, Meiyu Lu, Beng Chin Ooi, Yanyan Shen, Sai Wu, and Meihui Zhang. CDAS: a crowdsourcing data analytics system. Proc. of the VLDB Endowment, vol. 5, № 10, 2012, pp. 1040-1051.
  • Yang Liu and Mingyan Liu. An online learning approach to improving the quality of crowd-sourcing. ACM SIGMETRICS Performance Evaluation Review, vol. 43, 2015, pp. 217-230.
  • Adam Marcus, Eugene Wu, David R Karger, Samuel Madden, and Robert C Miller. Crowdsourced databases: query processing with people. In Proc. of the 5th Conference on Innovative Data Systems Research (CIDR), 2011, pp. 211-214.
  • Barzan Mozafari, Purna Sarkar, Michael Franklin, Michael Jordan, and Samuel Madden. Scaling up crowdsourcing to very large datasets: a case for active learning. Proc. of the VLDB Endowment, vol. 8, № 2, 2014, pp. 125-136.
  • An Thanh Nguyen, Byron C Wallace, and Matthew Lease. Combining crowd and expert labels using decision theoretic active learning. In Proc. of the Third AAAI Conference on Human Computation and Crowdsourcing, 2015, pp. 120-129.
  • Stefanie Nowak and Stefan Rüger. How reliable are annotations via crowdsourcing: a study about inter-annotator agreement for multi-label image annotation. In Proc. of the International Conference on Multimedia Information Retrieval, 2010, pp. 557-566.
  • Aditya Ganesh Parameswaran, Hyunjung Park, Hector Garcia-Molina, Neoklis Polyzotis, and Jennifer Widom. Deco: declarative crowdsourcing. In Proc. of the 21st ACM International Conference on Information and Knowledge Management, 2012, pp. 1203-1212.
  • Carl Edward Rasmussen. Gaussian processes in machine learning. In Advanced Lectures on Machine Learning, Lecture Notes in Computer Science, vol 3176, 2004, pp. 63-71.
  • Vikas C Raykar, Shipeng Yu, Linda H Zhao, Anna Jerebko, Charles Florin, Gerardo Hermosillo Valadez, Luca Bogoni, and Linda Moy. Supervised learning from multiple experts: whom to trust when everyone lies a bit. In Proc. of the 26th Annual international conference on machine learning, 2009, pp. 889-896.
  • Vikas C Raykar, Shipeng Yu, Linda H Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. Learning from crowds. Journal of Machine Learning Research, vol. 11, 2010, pp. 1297-1322.
  • Filipe Rodrigues, Francisco Pereira, and Bernardete Ribeiro. Gaussian process classification and active learning with multiple annotators. In Proc. of the International Conference on Machine Learning, 2014, pp. 433-441.
  • Burr Settles. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin-Madison, 2009, 65 p.
  • Victor S. Sheng, Foster J. Provost, and Panagiotis G. Ipeirotis. Get another label? improving data quality and data mining using multiple, noisy labelers. In Proc. of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 614-622.
  • Aashish Sheshadri and Matthew Lease. SQUARE: a benchmark for research on computing crowd consensus. In Proc. of the First AAAI Conference on Human Computation and Crowdsourcing, 2013, pp. 156-164.
  • Rion Snow, Brendan O’Connor, Daniel Jurafsky, and Andrew Y. Ng. Cheap and fast - but is it good?: evaluating non-expert annotations for natural language tasks. In Proc. of the Conference on Empirical Methods in Natural Language Processing, 2008, pp. 254-263.
  • Long Tran-Thanh, Sebastian Stein, Alex Rogers, and Nicholas R Jennings. Efficient crowdsourcing of unknown experts using bounded multi-armed bandits. Artificial Intelligence, vol. 214, issue 1, 2014, pp. 89-111.
  • Fabian L Wauthier and Michael I Jordan. Bayesian bias mitigation for crowdsourcing. In Proc. of the Neural Information Processing Systems 2011 Conference (Advances in neural information processing systems 24), 2011, pp. 1800-1808.
  • Peter Welinder and Pietro Perona. Online crowdsourcing: rating annotators and obtaining cost-effective labels. In Proc. of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2010, pp. 25-32.
  • Jacob Whitehill, Ting-fan Wu, Jacob Bergsma, Javier R Movellan, and Paul L Ruvolo. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Proc. of the Neural Information Processing Systems 2009 Conference (Advances in neural information processing systems 22), 2009, pp. 2035-2043.
  • Yan Yan, Romer Rosales, Glenn Fung, and Jennifer G Dy. Active learning from crowds. In Proc. of the 28th International Conference on Machine Learning, 2011, pp. 1161-1168.
  • Chicheng Zhang and Kamalika Chaudhuri. Active learning from weak and strong labelers. In Proc. of the Neural Information Processing Systems 2015 Conference (Advances in neural information processing systems 28), 2015, pp. 703-711.
  • Jing Zhang, Victor S Sheng, Jian Wu, and Xindong Wu. Multi-class ground truth inference in crowdsourcing with clustering. IEEE Transactions on Knowledge and Data Engineering, vol. 28, № 4, 2016, pp. 1080-1085.
  • Jing Zhang, Xindong Wu, and Victor S Sheng. Learning from crowdsourced labeled data: a survey. Artificial Intelligence Review, vol. 46, № 4, 2016, pp. 543-576.
  • Yuchen Zhang, Xi Chen, Denny Zhou, and Michael I Jordan. Spectral methods meet EM: a provably optimal algorithm for crowdsourcing. In Proc. of the Neural Information Processing Systems 2014 Conference (Advances in neural information processing systems 27), 2014, pp. 1260-1268.
  • Yudian Zheng, Reynold Cheng, Silviu Maniu, and Luyi Mo. On optimality of jury selection in crowdsourcing. In Proc. of the 18th International Conference on Extending Database Technology, 2015, pp. 193-204.
  • Yudian Zheng, Guoliang Li, and Reynold Cheng. DOCS: a domain-aware crowdsourcing system using knowledge bases. Proc. of the VLDB Endowment, vol. 10, № 4, 2016, pp. 361-372.
  • Yudian Zheng, Guoliang Li, Yuanbing Li, Caihua Shan, and Reynold Cheng. Truth inference in crowdsourcing: is the problem solved? Proc. of the VLDB Endowment, vol. 10, № 5, 2017, pp. 541-552.
  • Yudian Zheng, Jiannan Wang, Guoliang Li, Reynold Cheng, and Jianhua Feng. QASCA: a quality-aware task assignment system for crowdsourcing applications. In Proc. of the ACM SIGMOD International Conference on Management of Data, 2015, pp. 1031-1046.
  • Jinhong Zhong, Ke Tang, and Zhi-Hua Zhou. Active learning from crowds with unsure option. In Proc. of the 24th International Conference on Artificial Intelligence, 2015, pp. 1061-1068.
  • A.V. Ponomarev. Quality control methods in crowd computing: literature review. SPIIRAS Proceedings, issue 54, 2017, pp. 152-184. DOI: 10.15622/sp.54.7
  • A. Korshunov, A. Gomzin. Topic modeling in natural language texts. Trudy ISP RAN/Proc. ISP RAS, vol. 23, 2012, pp. 215-244. DOI: 10.15514/ISPRAS-2012-23-13