Improving the Efficiency of Term Weighting in Set of Dynamic Documents
Authors: Mehdi Jabalameli, Ala Arman, Mohammadali Nematbakhsh
Journal: International Journal of Modern Education and Computer Science (IJMECS)
Issue: No. 2, Vol. 7, 2015.
Open access
In real information systems, few documents are static. Most documents change over time, and these changes can be treated as signals for improving the quality of information retrieval. Unfortunately, considering all of these changes can be time-consuming. This paper proposes a method that significantly reduces the time needed to analyze them. The main idea is to analyze only a selected part of the changes, chosen so that the quality of information retrieval is not noticeably affected while the analysis time is reduced. To evaluate the proposed method, three different datasets were selected from Wikipedia. Different term weighting factors were assessed, and the effect of the proposed method on these factors was investigated. The results of the empirical experiments show that the proposed method keeps the quality of the retrieved information at an acceptable level while reducing the document analysis time.
Document Revision, Term Frequency, Term Weightings, Ranked Terms, Information retrieval process
Short URL: https://sciup.org/15014731
IDR: 15014731
Text of the research article: Improving the Efficiency of Term Weighting in Set of Dynamic Documents
Published Online February 2015 in MECS DOI: 10.5815/ijmecs.2015.02.06
Nowadays, web pages are dynamic and their information changes over time. In many applications, such as search engines, each change to a page is treated as a new page. However, these pages are actually different versions of the same page. For example, Wikipedia pages have many versions, created by different people to improve the content of the pages. Previous research [1-11] has shown that analyzing these changes can improve the efficiency of information retrieval systems.
In information retrieval systems, a document is usually represented as a vector of terms in which each component holds the weight of one term. Reference [12] provides the popular term weighting formula called TF-IDF, which weights a term using the number of times it occurs in a document and the number of documents in the collection that contain it. Most of the methods proposed for term weighting treat a document as static. However, the content of many pages changes over time, and these changes may carry information that is important for the retrieval process.
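The TF-IDF idea described above can be illustrated with a minimal sketch. It assumes one common variant, raw term count multiplied by log(N / df); the tokenization by whitespace, the function name `tf_idf`, and the sample documents are illustrative assumptions and are not taken from the paper or from [12].

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: weight = (count of term in document) * log(N / df),
    where N is the number of documents and df is the number of documents
    that contain the term.
    """
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical three-document collection.
docs = ["wiki page cats", "wiki page dogs", "wiki cats chase dogs"]
weights = tf_idf(docs)
```

In this variant a term that occurs in every document (here `wiki`) receives weight zero, while a term found in a single document (here `chase`) receives the largest inverse document frequency.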
In recent years, several studies have taken the changes in a document (or a page) into account to improve information retrieval quality and term weighting. Some of these studies are reviewed in the second section of this paper.
In all of these studies, the whole document and its entire revision history must be analyzed to determine the weight of each term and the relevance of the retrieved documents to the query. This paper considers, for the first time, the run time of these algorithms. In the proposed method, instead of analyzing the whole history of each document, only a selected part of the history is analyzed to determine the term weights.
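The trade-off between analyzing a full revision history and analyzing only part of it can be sketched as follows. This introduction does not state which part of the history the proposed method selects, so the sketch assumes, purely for illustration, that only the most recent revisions are kept; the function `revision_term_weights`, the `keep_last` parameter, and the averaging of relative term frequencies are all hypothetical and should not be read as the authors' algorithm.

```python
from collections import Counter

def revision_term_weights(revisions, keep_last=None):
    """Average each term's relative frequency over the selected revisions.

    revisions: list of revision texts, oldest first.
    keep_last: if given, only the newest `keep_last` revisions are analyzed
               (an assumed selection rule, used here only for illustration).
    """
    if keep_last is not None and keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    selected = revisions if keep_last is None else revisions[-keep_last:]

    totals = Counter()
    for text in selected:
        tokens = text.lower().split()
        if not tokens:
            continue  # an empty revision contributes nothing
        for term, count in Counter(tokens).items():
            totals[term] += count / len(tokens)

    # Divide by the number of selected revisions to obtain the average.
    return {term: total / len(selected) for term, total in totals.items()}

# Hypothetical history of one page, oldest revision first.
history = ["cats", "cats dogs", "cats dogs dogs dogs"]
full = revision_term_weights(history)                # all three revisions
partial = revision_term_weights(history, keep_last=2)  # newest two only
```

The partial run reads fewer revisions, so it does less work, and the question the paper examines is whether the resulting term weights stay close enough to the full-history weights for retrieval purposes.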
Empirical experiments were carried out on three different Wikipedia datasets. The results show that the quality of term weighting with the proposed method is almost the same as with existing methods, while the time needed to process the documents' revision records is significantly reduced because only a selected part of each history is analyzed.
The rest of the paper is organized as follows. Section II reviews related work on term weighting using different methods. Section III presents the proposed method for improving term weighting efficiency. Section IV describes the empirical evaluation of the proposed method. Finally, conclusions and future work are discussed in Section V.
II. Related Works
A. Global Term Weighting in a Set of Documents
References
[1] E. Adar, J. Teevan, and S. T. Dumais, "Resonance on the Web: Web Dynamics and Revisitation Patterns," in Proceedings of CHI 2009, 2009, doi: 10.1145/1871437.1871519.
[2] E. Adar, J. Teevan, S. T. Dumais, and J. L. Elsas, "The Web Changes Everything: Understanding the Dynamics of Web Content," in Proceedings of the Second ACM International Conference on Web Search and Data Mining, 2009, pp. 282–291, doi: 10.1145/1498759.1498837.
[3] A. Aji, Y. Wang, E. Agichtein, and E. Gabrilovich, "Using the Past to Score the Present: Extending Term Weighting Models Through Revision History Analysis," in CIKM, 2010, pp. 629–638, doi: 10.1145/1871437.1871519.
[4] R. Campos, G. Dias, A. M. Jorge, and A. Jatowt, "Survey of Temporal Information Retrieval and Related Applications," ACM Comput. Surv., vol. 47, no. 2, pp. 15:1–15:41, 2014, doi: 10.1145/2619088.
[5] M. Efron, "Linear Time Series Models for Term Weighting in Information Retrieval," JASIST, vol. 61, no. 7, pp. 1299–1312, 2010, doi: 10.1002/asi.21315.
[6] J. L. Elsas and S. T. Dumais, "Leveraging Temporal Dynamics of Document Content in Relevance Ranking," in WSDM, 2010, pp. 1–10, doi: 10.1145/1718487.1718489.
[7] N. Kanhabua, "Time-aware Approaches to Information Retrieval," SIGIR Forum, vol. 46, no. 1, p. 85, 2012, doi: 10.1145/2215676.2215691.
[8] Nunes, C. Ribeiro, and G. David, "Term Weighting Based on Document Revision History," JASIST, vol. 62, no. 12, pp. 2471–2478, 2011, doi: 10.1002/asi.21597.
[9] Nunes, C. Ribeiro, and G. David, "Term Frequency Dynamics in Collaborative Articles," in Proceedings of the 10th ACM Symposium on Document Engineering, 2010, pp. 267–270, doi: 10.1145/1860559.1860620.
[10] K. Radinsky, F. Diaz, S. Dumais, M. Shokouhi, A. Dong, and Y. Chang, "Temporal Web Dynamics and Its Application to Information Retrieval," in Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, 2013, pp. 781–782.
[11] A. Zubiaga, "Enhancing Navigation on Wikipedia with Social Tags," CoRR, vol. abs/1202.5, 2012.
[12] G. Salton and C. Buckley, "Term-Weighting Approaches in Automatic Text Retrieval," Inf. Process. Manag. an Int. J., vol. 24, no. 5, pp. 513–523, 1988, doi: 10.1016/0306-4573(88)90021-0.