Relevant XML Documents - Approach Based on Vectors and Weight Calculation of Terms

Authors: Abdeslem DENNAI, Mohammed Yacine DENNAI, Sidi Mohammed BENSLIMANE

Journal: International Journal of Information Technology and Computer Science (IJITCS)

Article in issue: No. 11, Vol. 8, 2016.

Free access

Three classes of documents circulate on the web, distinguished by how their data are organized: unstructured documents (.doc, .html, .pdf, ...), semi-structured documents (.xml, .owl, ...) and structured documents (database tables, for example). A semi-structured document is organized around tags that are either predefined or defined by its author. However, many studies classify documents by their textual content alone and underestimate their structure. In this paper we propose a representation of semi-structured web documents based on weighted vectors, which allows their content to be exploited in later processing. Term weights are calculated using the normal frequency for a document, TF-IDF (Term Frequency - Inverse Document Frequency), and the logic (Boolean) frequency for a set of documents. To assess and demonstrate the relevance of the proposed approach, we carry out several experiments on different corpora.
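The three weighting schemes named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's exact formulas, so the sketch assumes the common textbook definitions: term frequency as count divided by document length, IDF as the natural logarithm of N / df, and a Boolean weight of 1 when a term occurs in a document and 0 otherwise. The function names (`terms`, `tf`, `tf_idf`, `boolean_weights`) and the sample documents are hypothetical, and XML structure is ignored here except as a source of text nodes.

```python
import math
import re
import xml.etree.ElementTree as ET
from collections import Counter


def terms(xml_text):
    """Return lowercased word tokens taken from all text nodes of an XML document."""
    root = ET.fromstring(xml_text)
    text = " ".join(root.itertext())
    return re.findall(r"[a-z0-9]+", text.lower())


def tf(doc_terms):
    """Normalized term frequency (assumed definition: count / document length)."""
    counts = Counter(doc_terms)
    total = len(doc_terms)
    return {t: c / total for t, c in counts.items()}


def tf_idf(corpus_terms):
    """TF-IDF weights per document (assumed definition: tf * ln(N / df))."""
    n_docs = len(corpus_terms)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(t for doc in corpus_terms for t in set(doc))
    return [
        {t: w * math.log(n_docs / df[t]) for t, w in tf(doc).items()}
        for doc in corpus_terms
    ]


def boolean_weights(doc_terms, vocabulary):
    """Boolean weight: 1 if the term occurs in the document, 0 otherwise."""
    present = set(doc_terms)
    return {t: int(t in present) for t in vocabulary}


# Hypothetical two-document corpus, for illustration only.
corpus = [
    "<book><title>XML retrieval</title><body>XML vectors</body></book>",
    "<note><body>term vectors</body></note>",
]
corpus_terms = [terms(doc) for doc in corpus]
vocabulary = sorted({t for doc in corpus_terms for t in doc})

tf_vectors = [tf(doc) for doc in corpus_terms]
tfidf_vectors = tf_idf(corpus_terms)
bool_vectors = [boolean_weights(doc, vocabulary) for doc in corpus_terms]
```

Under these assumed definitions, a term that occurs in every document of the corpus (here `vectors`) receives a TF-IDF weight of zero, whereas its Boolean weight is 1 in both documents; this is one reason the schemes can rank the same documents differently.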


Semi-structured web document, term weighting, term frequency, TF-IDF and logic frequency

Short URL: https://sciup.org/15012585

IDR: 15012585

Research article