A domain specific key phrase extraction framework for email corpuses

Authors: I. V. S. Venugopal, D. Lalitha Bhaskari, M. N. Seetaramanath

Journal: International Journal of Information Technology and Computer Science (IJITCS)

Issue: Vol. 10, No. 7, 2018.


Despite the growth of Internet communication via short messages, messaging services and chat, email remains the most preferred communication method. Thousands of emails are exchanged every day across different service providers. Email, being the most effective communication method, also attracts a great deal of spam and irrelevant information. Spam emails are annoying and consume considerable time to filter. Needless to say, spam emails also consume the allocated inbox space and at the same time cause heavy network traffic. Existing filtration methods are miles away from perfection, as most of these filters depend on standard rules and thus mark valid emails as spam. The first step of any email filtration should be extracting the key phrases from the emails; the filters should then be activated based on these key phrases or the most frequently used phrases. A number of parallel researches have demonstrated key phrase extraction policies. Nonetheless, those methods focus on other domain specific corpuses and have not addressed email corpuses. Thus, this work demonstrates a key phrase extraction process specifically for email corpuses. The extracted key phrases reflect the frequency of the words used in each email. This analysis can simplify further analysis such as sentiment analysis or spam detection, and can also cater to the need for text summarization. The proposed component based framework demonstrates nearly 95% accuracy.


Email Corpus, Key Phrase Extraction, Domain Specific Extraction, Modified Term Frequency, Modified Inverse Document Frequency

Short URL: https://sciup.org/15016280

IDR: 15016280   |   DOI: 10.5815/ijitcs.2018.07.06

Text of the scientific article: A domain specific key phrase extraction framework for email corpuses

Published Online July 2018 in MECS

Traditional communication methods between humans consisted of spoken languages, sign languages and, finally, written languages. These are usually categorised as natural languages [1]. Nevertheless, communication methods have crossed the barrier of human-to-human exchange and extended to communication between human and machine and between machine and machine. Communication between machines such as computers poses fewer challenges, as it is backed by the binary system. Communication between humans and computer systems, however, faces the major challenge of converting human understandable languages into computer understandable language. The well accepted process of language conversion for these purposes is called natural language processing, or NLP [2].

Natural language processing is a widely accepted technique for various tasks such as content summarization, information retrieval and information extraction. The content summarization process mainly focuses on preparing a summary of a given text. This application of NLP can reduce the time needed to process the complete text and, needless to say, has multiple application usages. It was first introduced in the novel work by Jusoh et al. [3] in 2011. Another application of NLP is information retrieval. This process mainly focuses on query processing and the conversion of natural language queries into content specific terms, as elaborated in the notable work of Zukerman et al. [4] in 2002. Yet another parallel research application of NLP is information extraction, whose primary focus is to reduce the time needed to extract meaningful information from a given corpus. This process offers the benefits of correlation based information extraction, where related terms can be inferred from the corpus and considered as information gain towards the extraction process. The notable work by Sekine et al. [5][6] has demonstrated the use and benefits of information extraction. Other popular outcomes of parallel NLP research are question answering methods as demonstrated by Bernhard et al. [7], machine based translation as proposed by Zhou et al. [8][9], text to speech generation as formulated by Kaji et al. [10] and sentence compression by Zhao et al. [11][12][13][14]. Nonetheless, the basis of all these applications and processes is key phrase extraction.

Thus, it is natural to understand that key phrase extraction is the major pre-processing step for machine learning tasks ranging from summarization to email filtration. Nonetheless, key phrase extraction processes strongly depend on two factors, the language and the domain:

  •    The influence of language on key phrase extraction cannot be ignored, due to the inferences present in languages and grammars.

  •    Also, the impact of domain specific vocabularies is strong in terms of key phrase extraction.

Thus, this work proposes a domain specific key phrase or key word extraction process for email corpuses.

The rest of the paper is organised as follows: in Section II the current outcomes of parallel researches are elaborated, in Section III the domain specific extraction methods are discussed, in Section IV the framework is elaborated, Section V comprises the driving algorithms of the proposed framework, the results are discussed in Section VI and this work finally presents the conclusion in Section VII.

  • II.    Current State of the Art

The initial attempts at collecting key phrases were manual, as stated by Barzilay et al. [15]. Key phrase extraction is denoted as a complex process by Hasegawa et al. [16] and is expected to be highly time consuming, as demonstrated by Ibrahim et al. [17]. During the extraction of key phrases, possible elaborations in terms of synonyms must also be considered. The variation of results under the influence of parallel words is demonstrated by Shinyama et al. [18][19]. Yet another factor that makes key phrase extraction difficult is that compiling a complete list of parallel words for the key phrases is itself difficult, as shown in the work by Lin et al. [20].

Further in this section of the work, the factors for key phrase extraction are elaborated.

  • A.    Availability of Corpuses

The WWW is an instance of free corpora that represents the largest public repository of natural language texts, as defined by Ringlstetter et al. [21]. This argument is supported by Zhao et al. [11], who write: “First, the web is not domain limited. Almost all kinds of topics and contexts can be covered. Second, the scale of the web is extremely large, which makes it feasible to find any specific context on it. In addition, the web is dynamic, which means that new words and concepts can be retrieved from the web”.

  • B.    Validation of Results

The results of any key phrase extraction process depend on their validation. It is needless to mention that considering words with similar meanings during the extraction process can improve the results. This idea is called the distributional hypothesis, first introduced by Harris et al. [22]. Extensions to this hypothesis are carried out in the work of Bhagat et al. [23][24].

  • C.    Key Phrase Extraction

It is natural to understand that the most important phase of the process is the extraction of the key phrases from the corpuses. The dilemma in the research attempts is the basis of the extraction process, as the key phrases can be extracted either from syntax based features or from semantic based features. The work of Ho et al. [25] demonstrates that the use of semantic based features can be useful during the extraction process. Nevertheless, the two are complementary to each other.

Henceforth, this work summarizes the research challenges in key phrase extraction:

  •    The extracted key phrases are to be evaluated on both syntax based and semantic based features for differences in accuracy.

  •    The extraction methods used by a few popular approaches are to be analysed.

  •    During the extraction process, synonyms and idioms are to be considered.

  •    The domain specific extraction processes are to be analysed, and email based key phrase extraction is to be addressed.

Thus in the next section this work analyses the domain specific key phrase extraction processes.

  • III.    Domain Specific Extraction Methods

The domain specific key phrase extraction process is different from general purpose key phrase extraction. A domain specific list of text must be available in the framework for reference as the training text rather than the testing text. The novel algorithm elaborated by Bannard et al. [26] is furnished here:

Algorithm 1 : Existing Domain Specific Key Phrase Extraction

Step -1. Calculate the word frequencies from the training text

Step -2. Measure the threshold of word frequency

Step -3. Further analyse the testing corpus

  • a.    Calculate the word frequencies

  • b.    Compare with the threshold

  • c.    If word frequency > threshold, then accept the key phrase

  • d.    Else, reject the key phrase

Step -4. Present the final list of key phrases
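Algorithm 1 can be sketched in a few lines. This is an illustrative reading of the steps, not Bannard et al.'s implementation; in particular, using the mean word frequency as the Step 2 threshold is an assumption, since the paper does not fix how the threshold is measured.

```python
from collections import Counter

def build_threshold(training_tokens):
    """Step 1-2: derive a frequency threshold from the domain (training)
    text; here it is the mean frequency over distinct words (assumed)."""
    counts = Counter(training_tokens)
    return sum(counts.values()) / len(counts)

def extract_key_phrases(testing_tokens, threshold):
    """Step 3-4: accept a word as a key phrase if its frequency in the
    testing corpus exceeds the training-derived threshold."""
    counts = Counter(testing_tokens)
    return [word for word, freq in counts.items() if freq > threshold]

train = "payment card atm card payment bank card".split()
test = "your atm card payment is due card card payment".split()
keys = extract_key_phrases(test, build_threshold(train))
```

Words that are frequent in the training (domain) text raise the bar, so only testing-corpus words that are genuinely prominent survive.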

The algorithm is analysed visually as well [Figure – 1].

Fig.1. Existing Domain Specific Key Phrase Extraction

Thus, with this understanding of the domain specific key phrase extraction process, and with the knowledge that no existing key phrase extraction method is available for email, the next section furnishes the framework for the intended purpose.

  • IV.    Framework Elaboration

The major motivation of this work is the minimal availability of domain specific key word or key phrase extraction and, at the same time, the lack of any method for email key phrase extraction. This results in the proposed framework furnished in this section.

Considering the limitations of the parallel research outcomes, this work elaborates the components of the proposed framework [Figure – 2].

Fig.2. Domain Specific Key Phrase Extraction Process

The components of the framework are elaborated here:

  • A.    Email Corpus Reader Service

The first component of the framework is the email reader service component. Due to the component based nature of this framework, any email service can be connected to this framework. This service needs to be preconfigured with the following parameters [Table – 1].

Table 1. Email Reader Service Configuration Parameters

| Configuration Parameter | Purpose |
|---|---|
| Email_Address | Email address of the receiver |
| Passwd | Password for the email account of the receiver |
| Server_Name | Name of the email server |
| Port | The port for receiving email |
| Record_Size | Number of emails to be fetched per minute |

The purpose of this component is to read the emails and build the testing corpus.
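As a sketch, the Table 1 parameters could drive a small IMAP based reader. The class name, field defaults and the use of IMAP are assumptions for illustration, not part of the paper:

```python
from dataclasses import dataclass

@dataclass
class ReaderConfig:
    """Mirrors the Table 1 parameters; the defaults are assumptions."""
    email_address: str
    passwd: str
    server_name: str
    port: int = 993          # common IMAP-over-SSL port (assumed)
    record_size: int = 60    # emails to fetch per minute

def fetch_batch(cfg: ReaderConfig):
    """Pull one batch of raw messages to build the testing corpus
    (sketch only; no error handling or rate limiting)."""
    import imaplib
    conn = imaplib.IMAP4_SSL(cfg.server_name, cfg.port)
    conn.login(cfg.email_address, cfg.passwd)
    conn.select("INBOX")
    _, data = conn.search(None, "ALL")
    ids = data[0].split()[: cfg.record_size]   # honour Record_Size
    return [conn.fetch(i, "(RFC822)")[1] for i in ids]

cfg = ReaderConfig("user@example.com", "secret", "imap.example.com")
```

Because the framework is component based, any other mail protocol could be substituted behind the same configuration record.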

  • B.    Pre-Processor

The second service or component in the framework is the pre-processor component. This component is responsible for cleaning the text and removing stop words. After the initial pre-processing, this component converts the complete text into a tokenized set of text. The algorithm used in this component is elaborated in the next section.

  • C.    Domain Corpus

The incorporated domain specific email corpus is used during the extraction of key phrase frequencies, from which the weighted average for the threshold is calculated. The description of the email domain corpus used in this work is furnished here [Table – 2].

This corpus is a training corpus rather than a testing corpus.

  • D.    Thresholding Service

The thresholding service is the component in the framework to calculate the threshold of each word present in the training corpus. The algorithm for threshold calculation is elaborated in the next section of this work.

Table 2. Email Domain Corpus Description

| Meta Information | Description |
|---|---|
| Number of users | 158 |
| Number of Emails | 619446 |
| Number of Email Threads | 7520 |
| Number of Emails per user | 3920 |
| Corpus Major Properties | To, From, Text, Date_Time |

  • E.    Training Corpus Reader

The training corpus reader component is responsible for reading the tokenized words and passing them to the word frequency analyser service.

  • F.    Word Frequency Analyser Service

The word frequency analyser component is the implementation of term frequency and inverse document frequency calculator. The elaborated algorithm is analysed in the next section of the work.

  • G.    Key Phrase Ranking Service

The final ranking of the keywords is assigned based on the thresholds obtained from the thresholding service. If the thresholds of the key words extracted from the testing corpus are close to the thresholds of the key words extracted from the training corpuses, the keywords are listed by the final summarization service.

  • H.    Summarization Service

The final service or component in this framework is the summarization service. This service provides the key phrases or the key words in terms of actual phrase and ranks. This information can further be used to calculate the sentiment or the spam factors of the emails.

  • V. Proposed Algorithm

This section of the work elaborates the driving algorithms of the framework. The fourfold algorithm is elaborated and analysed in this section.

  • A.    Pre-Processing

The algorithm used in this component is elaborated here:

Algorithm 2 : Pre-Processing Algorithm

Step -1. Accept the Email Corpus

Step -2.   For Each Sentence Convert the stop words into ","

  • a.    Convert punctuations

  • b.    Convert the Braces

  • c.    Convert the question marks

  • d.    Convert the forward and backward slashes

  • e.    Convert "and" and "or"

Step -3.   Convert all words into lower case

Step -4.   Find the initial token

Step -5.   For Each Sentence

  • a.    Extract the tokens based on separator

  • b.    Build the final token sets

Step -6.   Generate the final token set for the corpus

The algorithm is visualized graphically [Figure – 3].


Fig.3. Pre-Processing Algorithm
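The pre-processing steps can be sketched as follows. The stop-word list is an illustrative subset, and directly dropping stop words is equivalent to the listing's convert-to-"," step, since "," only serves as a token separator:

```python
import re

# Illustrative subset; a real deployment would use a full stop-word list.
STOP_WORDS = {"and", "or", "the", "a", "an", "is", "are"}

def preprocess(text):
    """Algorithm 2 sketch: lower-case the text, map punctuation, braces,
    question marks and slashes to the ',' separator, then split into the
    final token set with stop words removed."""
    text = text.lower()
    text = re.sub(r"[.!?;:(){}\[\]/\\]", ",", text)
    tokens = []
    for segment in text.split(","):
        for word in segment.split():
            if word not in STOP_WORDS:
                tokens.append(word)
    return tokens

toks = preprocess("Your ATM card is ready. Collect the card and documents!")
```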

  • B.    TF-IDF

The second driving algorithm of this framework is the modified term frequency and inverse document frequency algorithm as elaborated here:

Algorithm 3 : Modified TF-IDF Algorithm

Step -1. Generate the term count in the email corpus

Step -2. For each term in the list

  • a. Calculate the term frequency as (term count / total terms count in the document)

Step -3. For each document in the corpus

  • a. Calculate the inverse document frequency as log (total number of documents / number of documents that include the term)

Step -4. For each term in the document

  • a. Calculate the term frequency with respect to inverse document frequency as term frequency X inverse document frequency

Step -5. Present the final list of terms per document

  • C.    Thresholding

The next algorithm is the threshold calculation algorithm, computed from the training corpus:

Algorithm 4 : Thresholding Algorithm

Step -1. For each document in the training corpus

  • a.    Accept the TF-IDF values for each keyword

  • b.    Calculate the moving average for the keywords

Step -2.   Build the weighted average for all the terms
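Algorithms 3 and 4 can be sketched as follows. This is a minimal illustration rather than the authors' implementation; the helper names are assumptions, the IDF uses the standard log(N / df) ratio (the inverted ratio in the listing would yield non-positive scores, unlike the positive values reported in Table 4), and a plain mean stands in for the paper's weighted average, which is not fully specified:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Algorithm 3 sketch: corpus is a list of token lists, one per
    document; returns a {term: tf * idf} map for each document."""
    n_docs = len(corpus)
    df = Counter()                       # documents containing each term
    for doc in corpus:
        df.update(set(doc))
    scores = []
    for doc in corpus:
        counts = Counter(doc)
        scores.append({t: (c / len(doc)) * math.log(n_docs / df[t])
                       for t, c in counts.items()})
    return scores

def thresholds(training_scores):
    """Algorithm 4 sketch: average each keyword's TF-IDF over the
    training documents to obtain its threshold."""
    sums, hits = Counter(), Counter()
    for doc_scores in training_scores:
        for term, s in doc_scores.items():
            sums[term] += s
            hits[term] += 1
    return {t: sums[t] / hits[t] for t in sums}

docs = [["atm", "card", "card"], ["payment", "card"], ["date", "payment"]]
scores = tf_idf(docs)
th = thresholds(scores)
```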

The modified TF – IDF algorithm is visualized graphically here [Figure – 4].

Fig.4. Modified TF – IDF Algorithm

  • D. Ranking

The final algorithm in this framework is the ranking algorithm. As an outcome of this algorithm, the documents will be summarized for further analysis. The algorithm is elaborated here:

Algorithm 5 : Ranking Algorithm

Step -1. Accept the TF - IDF for each term in the testing corpus

Step -2.   For each document

  • a.    Build the Array List with all the terms

  • b.    Sort the elements in the array list

Step -3. Generate ranking for all the key words or key phrases
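Algorithm 5 amounts to sorting each document's terms by score. A minimal sketch follows; the function name is an assumption, and the sample values are taken from the Email-3 column of Table 4:

```python
def rank_key_phrases(doc_scores):
    """Algorithm 5 sketch: sort a document's terms by TF-IDF, highest
    first, and return them with 1-based ranks (stable for ties)."""
    ordered = sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)
    return {term: rank for rank, (term, _) in enumerate(ordered, start=1)}

ranks = rank_key_phrases({"last": 0.037032, "your": 0.024688,
                          "april": 0.037032, "date": 0.024688})
```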

Thus this fourfold algorithm in the framework generates the final ranking of the key phrases for further analysis.

The results obtained from this framework are discussed in the next section.

  • VI. Results and Discussion

The results obtained from this framework are highly satisfactory and are discussed in this section. This work evaluates three email corpuses collected from the Google spam filtration sample domain.

  • A. Corpus Length

Firstly the Corpuses used for generating the results are evaluated here [Table – 3]:

Table 3. Corpus Description Analysis

| Parametric Information | Email - 1 | Email - 2 | Email - 3 |
|---|---|---|---|
| Word Count | 394 | 73 | 89 |
| Number of Lines | 53 | 12 | 20 |
| Number of Paragraphs | 19 | 5 | 9 |
| Time to Pre-Process (Sec) | 0.2 | 0.11 | 0.10 |

Thus, it is natural to understand that the proposed pre-processing algorithm is time efficient and caters to the need for reduced time complexity.

The results are also analysed graphically [Figure – 5].

Fig.5. Corpus Analysis

  • B.    TF – IDF

Secondly, the term frequency and inverse document frequency of the corpuses used are analysed [Table – 4].

Table 5. TF – IDF Analysis

| Rank | Key Phrases (Email - 1) | Key Phrases (Email - 2) | Key Phrases (Email - 3) |
|---|---|---|---|
| (1) | YOU | TEXT | LAST |
| (2) | ATM | MANY | APRIL |
| (3) | CARD | QUALITY | DATE |
| (4) | THIS | WHAT | APPLICATION |
| (5) | DOCUMENTS | | PAYMENT |
| (6) | | | YOU |

Henceforth, based on the key phrase analysis, the nature of the emails can be identified [Table – 6].

Table 4. TF – IDF Analysis

| Key Phrases | Email - 1 | Email - 2 | Email - 3 |
|---|---|---|---|
| last | 0.002726 | 0 | 0.037032 |
| april | 0 | 0 | 0.037032 |
| your | 0.043617 | 0 | 0.024688 |
| payment | 0.008178 | 0 | 0.024688 |
| date | 0 | 0 | 0.024688 |

Thus, this key phrase extraction process can be helpful in many domains for email corpus analysis.

  • D.    Accuracy Analysis

Finally, the accuracy of key phrase extraction is evaluated for this framework [Table – 7].

Here, this TF – IDF analysis demonstrates the stability of the proposed framework in extracting the key phrases based on term frequency.

The results are visualised graphically [Figure – 6].

Fig.6. TF – IDF Analysis

Table 7. Accuracy Analysis

| | Email - 1 | Email - 2 | Email - 3 |
|---|---|---|---|
| Number of Actual Key Phrases | 190 | 60 | 65 |
| Number of Extracted Key Phrases | 188 | 53 | 65 |
| Accuracy (%) | 98.94 | 88.33 | 100 |

It is natural to understand that the framework provides nearly 95% accuracy in extraction on average.
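The accuracy figures in Table 7 can be reproduced directly from the key phrase counts. The per-email definition (extracted / actual, as a percentage) is inferred from the reported numbers rather than stated explicitly in the paper:

```python
def accuracy(actual, extracted):
    """Extraction accuracy as the percentage of actual key phrases
    recovered; this definition reproduces Table 7 up to rounding."""
    return round(100 * extracted / actual, 2)

per_email = [accuracy(190, 188), accuracy(60, 53), accuracy(65, 65)]
overall = sum(per_email) / len(per_email)   # roughly 95.8, the "nearly 95%"
```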

The result is evaluated graphically [Figure – 7].

Fig.7. Accuracy Analysis

  • C.    Ranking

Further, the extracted key words are ranked for each document in the corpus [Table – 5].

Table 6. Corpus Description Analysis

| | Email - 1 | Email - 2 | Email - 3 |
|---|---|---|---|
| Nature of the Email | Email from Bank regarding ATM Card | Email regarding text quality | Email regarding payment or application |

  • VII. Conclusion

Key phrase or key word extraction is a primary task for further processing of a corpus, whether to analyse its meaning, summarize it or cluster it. The extracted key phrases can be justified by their term frequency in the corpus. Nevertheless, term frequencies depend on the writing style of each author and must be validated against domain specific terms. This work provides a framework for an email domain specific key phrase extraction process. The accuracy demonstrated by this framework is highly satisfactory, at nearly 95%. The final outcome of this work is to provide email domain specific key phrase extraction and to make the world of email analysis better.

References (A domain specific key phrase extraction framework for email corpuses)

[1] Azmi Murad MA, Martin TP, “Using fuzzy sets in contextual word similarity”, Intell Data Eng Autom Learn (IDEAL), LNCS 3177, pp. 517–522, 2004.
[2] Bannard C, Callison-Burch C, “Paraphrasing with bilingual parallel corpora”, In Proceedings of the 43rd annual meeting of the Association for Computational Linguistics, pp. 597–604, 2005.
[3] Jusoh S, Masoud AM, Alfawareh HM, “Automated text summarization: sentence refinement approach”, Commun Comput Inf Sci Digit Inf Process Commun 189(8), pp. 207–218, 2011.
[4] Zukerman I, Raskutti B, Wen Y, “Experiments in query paraphrasing for information retrieval”, Adv Artif Intell, LNCS 2557, pp. 24–35, 2002.
[5] Sekine S, “Automatic paraphrase discovery based on context and keywords between NE pairs”, In Proceedings of IWP, 2005.
[6] Sekine S, “On-demand information extraction”, In Proceedings of the COLING/ACL main conference poster sessions, pp. 731–738, 2006.
[7] Bernhard D, Gurevych I, “Answering learners' questions by retrieving question paraphrases from social Q&A sites”, In Proceedings of the 3rd workshop on innovative use of NLP for building educational applications, pp. 44–52, 2008.
[8] Zhou L, Lin C, Munteanu DS, Hovy E, “ParaEval: using paraphrases to evaluate summaries automatically”, In Proceedings of the human language technology conference of the North American chapter of the ACL, pp. 447–454, 2006.
[9] Wu H, Zhou M, “Optimizing synonym extraction using monolingual and bilingual resources”, In Proceedings of the second international workshop on paraphrasing (IWP), pp. 72–79, 2003.
[10] Kaji N, Kurohashi S, “Lexical choice via topic adaptation for paraphrasing written language to spoken language”, Inf Retr Technol, LNCS 4182, pp. 673–679, 2006.
[11] Zhao SQ, Wang HF, Liu T, Li S, “Pivot approach for extracting paraphrase patterns from bilingual corpora”, In Proceedings of ACL–HLT, pp. 780–788, 2008.
[12] Zhao SQ, Lan X, Liu T, Li S, “Application-driven statistical paraphrase generation”, In Proceedings of the 47th annual meeting of the ACL and the 4th IJCNLP of the AFNLP, pp. 834–842, 2009.
[13] Zhao SQ, Wang HF, Liu T, Li S, “Extracting paraphrase patterns from bilingual parallel corpora”, Nat Lang Eng 15(4), pp. 503–526, 2009.
[14] Zhao SQ, Wang HF, Liu T, “Paraphrasing with search engine query logs”, In Proceedings of the 23rd international conference on computational linguistics (COLING), pp. 1317–1325, 2010.
[15] Barzilay R, McKeown KR, “Extracting paraphrases from a parallel corpus”, In Proceedings of the 39th annual meeting of the Association for Computational Linguistics, pp. 50–57, 2001.
[16] Hasegawa T, Sekine S, Grishman R, “Unsupervised paraphrase acquisition via relation discovery”, Technical Report 05-012, Proteus Project, Computer Department, New York University, 2005.
[17] Ibrahim A, Katz B, Lin J, “Extracting structural paraphrases from aligned monolingual corpora”, In Proceedings of ACL, pp. 10–17, 2003.
[18] Shinyama Y, Sekine S, Sudo K, “Automatic paraphrase acquisition from news articles”, In Proceedings of HLTR, pp. 313–318, 2002.
[19] Shinyama Y, Sekine S, “Paraphrase acquisition for information extraction”, In Proceedings of IWP, pp. 65–71, 2003.
[20] Lin D, Pantel P, “DIRT - discovery of inference rules from text”, In Proceedings of ACM SIGKDD, pp. 323–328, 2001.
[21] Ringlstetter C, Schulz KU, Mihov S, “Orthographic errors in web pages: toward cleaner web corpora”, J Comput Linguist 32(3), pp. 295–340, 2006.
[22] Harris Z, “Distributional structure. Structural and transformational linguistics”, pp. 775–794, 1970.
[23] Bhagat R, Ravichandran D, “Large scale acquisition of paraphrases for learning surface patterns”, In Proceedings of ACL–HLT, pp. 674–682, 2008.
[24] Bhagat R, Hovy E, Patwardhan S, “Acquiring paraphrases from text corpora”, In Proceedings of the 5th international conference on knowledge capture (K-CAP), pp. 161–168, 2009.
[25] Ho CF, Azmi Murad MA, Doraisamy S, Abdul Kadir R, “Comparing two corpus-based methods for extracting paraphrases to dictionary-based method”, Int J Semant Comput (IJSC) 5(2), pp. 133–178, 2011.
[26] Bannard C, Callison-Burch C, “Paraphrasing with bilingual parallel corpora”, In Proceedings of ACL, pp. 597–604, 2005.