Научные статьи \ Прикладные науки. Медицина. Технология \ Mедицинские науки \ Фармакология. Общая терапия. Токсикология

Feature Selection in Detection of Adverse Drug Reactions from the Health Improvement Network (THIN) Database

Автор: Yihui Liu, Uwe Aickelin

Журнал: International Journal of Information Technology and Computer Science(IJITCS) @ijitcs

Статья в выпуске: 3 Vol. 7, 2015 года.

Бесплатный доступ

Adverse drug reaction (ADR) is widely concerned for public health issue. ADRs are one of most common causes to withdraw some drugs from market. Prescription event monitoring (PEM) is an important approach to detect the adverse drug reactions. The main problem to deal with this method is how to automatically extract the medical events or side effects from high-throughput medical events, which are collected from day to day clinical practice. In this study we propose a novel concept of feature matrix to detect the ADRs. Feature matrix, which is extracted from big medical data from The Health Improvement Network (THIN) database, is created to characterize the medical events for the patients who take drugs. Feature matrix builds the foundation for the irregular and big medical data. Then feature selection methods are performed on feature matrix to detect the significant features. Finally the ADRs can be located based on the significant features. The experiments are carried out on three drugs: Atorvastatin, Alendronate, and Metoclopramide. Major side effects for each drug are detected and better performance is achieved compared to other computerized methods. The detected ADRs are based on computerized methods, further investigation is needed.

Еще

Adverse Drug Reactions, Feature Matrix, Feature Selection, Atorvastatin, Alendronate, Metoclopramide

Короткий адрес: https://sciup.org/15012265

IDR: 15012265

Текст научной статьи Feature Selection in Detection of Adverse Drug Reactions from the Health Improvement Network (THIN) Database

Published Online February 2015 in MECS

Adverse drug reaction (ADR) is widely concerned for public health issue. ADRs are one of most common causes to withdraw some drugs from market [1]. Now two major methods for detecting ADRs are spontaneous reporting system (SRS) [2-5], and prescription event monitoring (PEM) [6-10]. The World Health Organization (WHO) defines a signal in pharmacovigilance as “any reported information on a possible causal relationship between an adverse event and a drug, the relationship being unknown or incompletely documented previously” [11]. For spontaneous reporting system, many machine learning methods are used to detect ADRs, such as Bayesian confidence propagation neural network (BCPNN) [12，13], decision support method [14], genetic algorithm [15], knowledge based approach [16，17], etc. One limitation is the reporting mechanism to submit ADR reports [14], which has serious underreporting and is not able to accurately quantify the corresponding risk. Another limitation is hard to detect ADRs with small number of occurrences of each drug-event association in the database. Prescription event monitoring has been developed by the Drug Safety Research Unit (DSRU) and is the first large-scale systematic post-marketing surveillance method to use event monitoring in the UK [8]. The health data are widely and routinely collected. It is a key step for PEM methods to automatically detect the ADRs from thousands of medical events. In paper [18，19], MUTARA and HUNT, which are based on a domain-driven knowledge representation Unexpected Temporal Association Rule, are proposed to signal unexpected and infrequent patterns characteristic of ADRs, using Queensland Hospital morbidity data, more commonly referred to as the Queensland Linked Data Set (QLDS) [20]. But their methods achieve low accuracies for detecting ADRs. In the paper [21], four existing ADR detecting algorithms were compared by applying them to The Health Improvement Network (THIN) database for a range of drugs. The results show that the existing algorithms are not capable of detecting rare ADRs. Until now no successful algorithms are proposed to detect the ADRs automatically for PEM methods.

In this paper we propose a novel approach using feature matrix to detect ADRs from The Health Improvement Network (THIN) database. First feature matrix which represents the medical events for the patients before and after taking drugs, is created by linking patients’ prescriptions and corresponding medical events together. Then significant features are selected based on feature selection methods, comparing the feature matrix before patients take drugs with one after patients take drugs. Finally the significant ADRs can be detected from thousands of medical events based on corresponding features. Experiments are carried on three drugs of Atorvastatin, Alendronate, and Metoclopramide. Good performance is achieved.

The rest of paper is organized as follows. In section 2, we introduce feature matrix and how to extract it from THIN database. Two feature selection methods are described in section 3. The experiments are conducted in section 4 based on three drugs, followed by discussion in section 5. Conclusions are made in section 6.

II. Feature Matrix

A novel concept of feature matrix is proposed to represent data and detect ADRs, using feature selection methods. Normally patients take drugs for different periods of time, and have different numbers of repeated prescriptions. The longer the period of time, the more the medical events related to the drug. How to represent the high-throughput data related to prescriptions and medical events, is a key step to detect the ADRs. Feature matrix builds the basis of saving data and comparing data. Two kinds of feature matrix are built in this study, based on medical events using Readcodes at level 1-5 and Readcodes at level 1-3, respectively. The medical events using Readcodes at level 1-3 give the general term, whereas those using Readcodes at level 1-5 give more specific descriptions.

A. The THIN database

The Health Improvement Network (THIN) is a collaboration product between two companies of EPIC and InPS. EPIC is a research organisation, which provides the electronic database of patient care records from UK and other countries. InPS continue to develop and supply the widely-used Vision general practice computer system. The anonymised patient data are collected from the practice's Vision clinical system on a regular basis without interruption to the running of the GP’s system and sent to EPIC who supplies the THIN data to researchers for studies. Research studies for publication using THIN Data are approved by a nationally accredited ethics committee which has also approved the data collection scheme.

There are ‘Therapy’ and ‘Medical’ databases in THIN data. The “Therapy” database contains the details of prescriptions issued to patients. Information of patients and the prescription date for the drug can be obtained. The “Medical” database contains a record of symptoms, diagnoses, and interventions recorded by the GP and/or primary care team. Each event for patients forms a record. By linking patient identifier, their prescriptions, and their corresponding medical events or symptoms together, feature matrix to characterize the symptoms during the period before or after patients take drugs is built.

B. Readcodes and feature matrix

Medical events or symptoms are represented by medical codes, named Readcodes. There are 103387 types of medical events in “Readcodes” database. The Read Codes used in general practice (GP), were invented and developed by Dr James Read in 1982. The NHS (National Health Service) has expanded the codes to cover all areas of clinical practice. The code is hierarchical from left to right or from level 1 to level 5. It means that it gives more detailed information from level 1 to level 5. Table 1 shows the medical symptoms based

on Readcodes at level 3 and at level 5. ‘Other soft tissue disorders’ is general description using Readcodes at level 3. ‘Foot pain’, ‘Heel pain’, etc., give more details using Readcodes at level 5. ‘Other extrapyramidal disease and abnormal movement disorders’ is general term; ‘Restless legs syndrome’, ‘Essential and other specified forms of tremor’, etc., are detailed descriptions using Readcodes at level 5.

In this research, two kinds of feature matrix are built to characterize the symptoms of patients who take drugs. One is based on Readcodes at level 1-5, which cover all the symptoms and detailed information which occur when patients take drugs. Another one is based on Readcodes at level 1-3, which is created by combining the detailed symptoms using Readcodes at level 4-5 into the general term using Readcodes at level 3. For the drug of Alendronate, ‘Temporomandibular joint disorders’ of ‘J046.00’ at level 4 is a typical ADR, but ‘Dentofacial anomalies’ of ‘J04..00’ at level 3 is more general term for this particular ADR. If patients have the symptom of ‘Temporomandibular joint disorders’ (‘J046.00’) in feature matrix based Readcodes at 1-5, after we merge the detailed descriptions into the general term, a new feature matrix based on Readcodes at level 1-3 is rebuilt instead of using ‘Dentofacial anomalies’ (‘J04..00’) .

C. The extraction of feature matrix

To detect the ADRs of drugs, first feature matrix is extracted from THIN database, which describes the medical events that patients occur before or after taking drugs. Then feature selection methods of Student’s t-test and Wilcoxon rank sum test are performed to select the significant features from feature matrix containing thousands of medical events. Figure 1 shows the process to detect the ADRs using feature matrix. Feature matrix A (Figure 2) describes the medical events for each patient during 60 days before they take drugs. Feature matrix B is one after patients take drugs. For repeated prescriptions, all the medical events related to each prescription of patients are input to build the feature matrix. When the interval between two adjacent prescriptions is less than 60 days, we put all the medical events between two adjacent into feature matrix after patients take drugs. When the interval between two prescriptions is more than 60 days, the medical events within 60 days after the first prescription are used build the feature matrix after patients take drugs, and the medical events beyond 60 days are put into the feature matrix before patients take drugs. Figure 3 shows how to extract feature matrix. Figure 4 shows the flowchart of building feature matrix.

In order to reduce the computation time and save the computation space, we set 100 patients as a group. Matrix X and Y are feature matrix after patients are divided into groups.

Feature matrix A and B are defined as follows:

a = ( a y ) mxn , i = 0,1,..., m , j = 1,..., n ⁽¹⁾

aij ^ {0,1}

B = ( b j ) m x n , i = 0,1,..., m , j = 1,..., n ⁽²⁾

bj e {0,1} where variable i and j represent patients and medical events. Variable m and n represent the number of patients and medical events respectively. atj = 1 is for the state where ith patient has jth medical event; otherwise atj = 0 . Feature matrix A and B describe the medical events during 60 days before or after patients take drugs.

Fig. 2. Feature matrix. The row of feature matrix represents the patients, and the column of feature matrix represents the medical events. The element of events is 1 or 0, which represents that patients have or do not have the corresponding medical symptom or event.

Fig. 1. The process to detect ADRs. Matrix A and B are feature matrix before or after patients take drugs. The time period of observation is set to 60 days. Matrix X and Y are feature matrix after patients are divided into groups. We set 100 patients as one group.

Fig. 3. The building process of feature matrix. Feature matrix A reflects the medical events for each patient during 60 days before they take drugs. Feature matrix B is one after patients take drugs. Time axis covers time points of prescriptions and medical events during the whole time for the patients taking drug. Variable p1,p2,p3 and p4 ( ☆ ) represent the time points of prescriptions for the patient taking the drug.

Variable e1,e2,..., and e8 (○) represent the time points of medical symptoms for the patient. The interval T1 between p1 and p2 is 60120 days.

The interval T3 between p3 and p4 is <=60 days.

Fig. 4. The flowchart of building feature matrix. Feature matrix A reflects the medical events for each patient during 60 days before they take drugs.

Feature matrix B is one after patients take drugs. Two datasets of medical events E and prescriptions P include all the medical events and prescriptions of patients taking the drug. The datasets of E and P are extracted from THIN database, and all the data are sorted in ascending order along the time. Variable p l and e k represent the time points for the prescription (l) and the medical event (k) respectively. Variable k (k=1,...,N) and l (l=1,...,M) represent the number or position of the medical event in dataset E and the prescriptions in dataset P respectively. Variable S represents the dataset including all the types of medical events in A and B. Variable r represents the type number of medical events in S. Variable i and j represent ith patient and jth medical event respectively. Variable m k represents the kth medical event in dataset E.

Feature matrix X and Y are defined as follows:

x = ( x/ g x n , ^k = ^0,1,.-., g ; j = 1,-,ⁿ ⁽³⁾

xkj = ik I 1 aj, , ik = k X d, k = 0,...,g i = ik y = (y^ gxn,k=w,-, g; j=1,..., n (4)

ik + d - 1

ykj = I by, , ik = k x d , k = 0,...,g i=ik where variable g represents the number of groups. Variable x represents the patient number of kth group having jth medical event. In this study, each group contains 100 patients, and d is set to 100.

III. Feature Selection

After two feature matrixes are obtained, which represent the medical events before and after patients take drugs, feature selection methods are performed to detect the significant features that reflect the significant changes. Feature selection methods normally have two classes. One class is “wrapper” methods, which select feature subsets based on classification performance, such as genetic algorithm [22 ， 23]. Another one is filter methods, which select features based on between-class discriminant criterion, give the ranked features, and are not involved in classification performance in the process of feature selection, such as Student’s t-test and Wilcoxon rank-sum test.

Obviously, in this study, it is not the classification problem. The main idea is to find which medical event changes significantly and give their ranks. We employ feature selection methods of Student’s t-test and Wilcoxon rank-sum test, which are based on the data assumption having normal distribution and no assumptions for data distributions, respectively. Based on doctors’ suggestion, the ratio of the patient number after taking the drug to one before taking the drug for having one particular symptom is also used to represent the changes of symptoms.

A. Student’s t-test

Student’s t-test [24] feature selection method is used to detect the significant ADRs from thousands of medical events. Student’s t-test is a kind of statistical hypothesis test based on a normal distribution, and is used to measure the difference between two kinds of samples. Student’s t-test calculates a score t to measure the difference between feature matrix X and Y , which represents the importance of jth medical event or ADR among all groups of patients.

Student’s t-test is defined as follows:

xi — У ty= j = 1,2,..., n (5)

22 ^S xj ^S yj

Mx My where x and y represent the means of two groups of samples, S and S represent the standard deviations of two groups of samples. M and M represent the number of two groups of samples.

B. Wilcoxon rank-sum test

Student’s t-test assumes that the data follow a normal distribution. If no assumption can be made about the shape of the population distribution, non-parametric statistical methods can be used. The Wilcoxon rank-sum test is a non-parametric analogue to Student’s t-test for two independent samples. It is used for the determination of equality of the means of two non-normal samples. The test is based on the rank of the individual data within feature matrix. As it operates on rank-transformed data, it also is a robust choice for detecting the significant medical symptoms. In this test, the ranking of data is performed first by giving the highest rank (equal to the number of features) to the feature with the highest value, and the lowest rank to the feature with lowest value. The ranks for remaining features are assigned using the ranksum test as explained in details in [25].

C. Other parameters

The variable of ratio R is defined to evaluate significant changes of the medical events, using ratio of the patient number after taking the drug to one before taking the drug. The variable R represents the ratio of patient number after taking the drug to the number of whole population having one particular medical symptom.

The ratio variables R and R are defined as follows:

' Na/

R =1 /N_B ifN_B * 0; (6)

t N a ifN B = 0;

R 2 = nA / N where N and N represent the numbers of patients before or after they take drugs for having one particular medical event respectively. The variable N represents the number of whole population who take drugs.

We compare our proposed method with published results in [19]. Same evaluation standard is used in our research. Accuracy is defined as follows:

ACC = Nadr / Nall where ACC represents the accuracy to predict the ADRs. N represents the number of true positive

(ADRs), and N represents the number of whole number for prediction. In this research we only evaluate top 20 results of prediction, and N is set to 20.

IV. Experiments and Results

Three drugs of Atorvastatin, Alendronate, and Metoclopramide are used to test our proposed method, using 20 GPs data in THIN database. The drug of Atorvastatin is one of ‘statin’ class and has the particular ‘muscle pain’ side effects. For the drug of Alendronate, the ‘Temporomandibular joint disorders’ is a typical ADR. The drug of Metoclopramide has the typical ADR of ‘extrapyramidal effects’ or ‘abnormal movement disorders’. So in this study, three drugs that have different typical ADRs, are used to test our proposed method. Jin et al. also use the drugs of Atorvastatin and Alendronate to test their proposed methods of MUTARA and HUNT [18,19].

Student’s t-test and Wilcoxon rank-sum test are performed to select the significant features from feature matrix, which represent the medical events having the significant changes after patients take the drugs. In experiments, we use two kinds of feature matrix. One feature matrix is based on all medical events using Readcodes at level 1-5 in order to observe the detailed symptoms. Another one is based on the medical events using Readcodes at leve1 1-3, by combining the detailed information of Readcodes at level 4 and 5 into general terms using Readcodes at level 3.

Some common abbreviations used in the ‘Readcodes’ dictionary are shown as below:

C/O Complains of

H/O History of

NOS Not otherwise specified

F/H Family history

O/E on examination

[M] Morphology of neoplasms
[D] Working diagnosis

A. Atorvastatin

Atorvastatin [26], under the trade name Lipitor, is one of the drugs known as statins. It is used for lowering blood cholesterol. There are seven currently prescribed forms of statin drugs. They are Rosuvastatin, Atorvastatin, Simvastatin, Pravastatin, Fluvastatin, Lovastatin, Pitavastatin. Muscle pain and musculoskeletal events are two of the main side effects of statin drugs [27].

Drugs.com provides access to healthcare information, sourced solely from the most trusted, well-respected and independant agents such as the Food and Drug Administration (FDA), American Society of HealthSystem Pharmacists, Wolters Kluwer Health, Thomson Micromedex, Cerner Multum and Stedman's. The side effects for Atorvastatin [28] by Drugs.com are severe allergic reactions (rash; hives; itching; difficulty breathing or swallowing; tightness in the chest; swelling of the mouth, face, lips, throat, or tongue); burning, numbness, or tingling; change in the amount of urine produced; confusion; memory problems; mental or mood problems (eg, depression); muscle pain, tenderness, or weakness (with or without fever or fatigue); painful, difficult, or frequent urination; persistent pain, soreness, redness, or swelling of a tendon or joint; red, swollen, blistered, or peeling skin; severe stomach or back pain (with or without nausea or vomiting); symptoms of liver problems (eg, dark urine; pale stools; severe or persistent nausea, loss of appetite, or stomach pain; unusual tiredness; yellowing of the skin or eyes).

We only use the data for patients, who registered on GP for more than one year, in order to obtain the reliable data. 6803 patients take Atorvastatin from 20GP data in THIN database. Based on Readcodes at level 1-5, totally 10528 medical events are obtained before or after 6803 patients take the drug. So 6803x10528 feature matrix is obtained. Based on Readcodes at level 1-3, we combine the medical events which have the same first three codes into one medical event, just using first three codes. 10528 medical events based on Readcodes at level 1-5 are combined into 2350 medical events based on Readcodes at level 1-3, and a 6803x2350 feature matrix is created. After grouping patients, 68x10528 and 68x2350 feature matrix are formed to select the significant features, which reflect the significant change of medical events after patients take drugs.

Table 2 and Table 3 show the top 20 ADRs for Atorvastatin based on Student’s t-test and Wilcoxon ranksum test, using medical events based on Readcodes at level 1-5 and level 1-3. From Table 2 and 3, it is clear that the detected ADRs are consistent with the published side effects for Atorvastatin. The difference for the detected ADRs between Student’s t-test and Wilcoxon rank-sum test is that the rank of the ADRs is different. For example the medical event of ‘Other soft tissue disorder’ is rank 3 based on Student’s t-test, and rank 12 based on Wilcoxon rank-sum test. ‘Other soft tissue disorder’ is major side effect for statin. From this view, the detected ADRs based on Student’s t-test are better than those based on Wilcoxon rank-sum test.

Our detected ADRs are classified into several categories, such as muscle pain, muscle weakness and neuropathy, gastrointestinal events, musculoskeletal events [27], skin and subcutaneous tissue events [28], bronchitis and bronchiolitis [29] diabetes mellitus events [30 ， 31]. Good performance of 100% accuracy is achieved for the top 20 ADRs based on Student’s t-test and Wilcoxon rank-sum test.

MUTARA and HUNT [18 ， 19] based on Unexpected Temporal Association Rule are proposed to signal unexpected and infrequent patterns characteristic of

ADRs, using The Queensland Linked Data Set (QLDS). They indicated that “HUNT can reliably shortlist statistically significantly more ADRs than MUTARA”. For the drug of Atorvastatin, HUNT detects 4 ADRs of ‘urinary tract infection’, ‘stomach ulcer’, ‘diarrhoea’, and ‘bronchitis’ from top 20 results based on 13712 patient records, and only obtains 30% average accuracy [19]. The major side effects of Atorvastatin, such as muscle pain and musculoskeletal events, are not detected. Table 4 shows the performance of HUNT and MUTARA. Our proposed method detects ‘muscle and musculoskeletal’ events not only based on Readcodes at level 1-5, but also based on Readcodes at level 1-3. ‘Peripheral enthesopathies and allied syndromes’ and ‘Other soft tissue disorders’ are detected at rank 2 and 3 based on Readcodes at level 1-3, using feature selection method of Student’s t-test.

B. Alendronate

Alendronate sodium, sold as Fosamax by Merck, is a bisphosphonate drug used for osteoporosis and several other bone disease [32]. The side effects for Alendronate [ 33] are severe allergic reactions (rash; hives; itching; difficulty breathing; tightness in the chest; swelling of the mouth, face, lips, throat, or tongue); black, tarry, or bloody stools; chest pain; coughing or vomiting blood; difficult or painful swallowing; mouth sores; new, worsening, or persistent heartburn; red, swollen, blistered, or peeling skin; severe bone, muscle, or joint pain (especially in the hip, groin, or thigh); severe or persistent sore throat or stomach pain; swelling of the hands, legs, or joints; swelling or pain in the jaw; symptoms of low blood calcium (eg, spasms, twitches, or cramps in your muscles; numbness or tingling in your fingers, toes, or around your mouth).

In [32], side effects include ulceration of the esophagus, gastric and duodenal ulceration, skin rash, rarely manifesting as Stevens–Johnson syndrome and toxic epidermal necrolysis, eye problems (uveitis, scleritis), generalized muscle, joint, and bone pain, osteonecrosis of the jaw or deterioration of the Temporomandibular Joint (TMJ), auditory hallucinations, visual disturbances. Studies suggest that users of alendronate have an increase in the numbers of osteoclasts and develop giant, more multinucleated osteoclasts. Fosamax has been linked to a rare type of leg fracture that cuts straight across the upper thigh bone after little or no trauma.

3346 patients from 20GP data in THIN database are taking Alendronate, and 7260 medical events are obtained based on Readcodes at level 1-5. After grouping them, 33x7260 feature matrix is obtained. For Readcodes at level 1-3, 33x1964 feature matrix is obtained. Table 5 and 6 show the top 20 ADRs based on Student’s t-test and Wilcoxon rank-sum test. Our results show that most detected ADRs are consistent with the published results [32]. ‘Chronic kidney disease stage 3’, ‘Type 2 diabetes mellitus’, and ‘Essential hypertension’ are not listed in [32]. 85% accuracy is achieved. But in [34], ’acute renal failure’ is reported after the patient takes Alendronate. ‘Based on the study created by eHealthMe from U.S. Food and Drug Administration (FDA) and user community [35], 0.60% patients have ‘Type 2 diabetes mellitus’. In [36], 0.62% patients have ‘Essential hypertension’. For HUNT, only ‘headache’, ‘swelling’, wheeze’, ‘breathing difficult’, ‘heart burn’, and ‘constipation’ are detected in top 20 ADRs based on 7569 patients. Totally 30% average accuracy is obtained. The major side effect of ‘muscle or joint pain’ is not detected.

‘Peripheral enthesopathies and allied syndromes’ and ‘Other soft tissue disorders’ with rank 14 and 15 are detected to describe the ADRs related to ‘muscle or joint pain’, which are listed in Table 5 using Readcodes at level 1-3 based on Student’s t-test. But they are not list in Table 6 based on Wilcoxon rank-sum test in top 20 ADRs. From this view, Student’s t-test method is better than Wilcoxon rank-sum test.

After we obtain the results using Student’s t-test with p<0.05, which reflect the significant changes after patients take drug, then we also sort the order of results according to descending order of R1 value. Table 7 shows top 20 ADRs based on the descending order of R1 value. From Table 7, we notice that the typical side effect ‘Temporomandibular joint disorders’ for Alendronate’ is detected at rank 11 using Readcodes at level 1-5.

C. Metoclopramide

Metoclopramide is used to treat nausea and vomiting associated with conditions such as uremia, radiation sickness, malignancy, labor, infection, migraine headaches, and emetogenic drugs [37]. The side effects [38] for Metoclopramide are severe allergic reactions (rash; hives; itching; difficulty breathing; tightness in the chest; swelling of the mouth, face, lips, or tongue; unusual hoarseness); abnormal thinking; confusion; dark urine; decreased balance or coordination; decreased sexual ability; fast, slow, or irregular heartbeat; fever; hallucinations; loss of bladder control; mental or mood changes (eg, depression, anxiety, agitation, jitteriness); seizures; severe or persistent dizziness, headache, or trouble sleeping; severe or persistent restlessness, including inability to sit still; shortness of breath; stiff or rigid muscles; sudden increased sweating; sudden, unusual weight gain; suicidal thoughts or actions; swelling of the arms, legs, or feet; uncontrolled muscle spasms or movements (eg, of the arms, legs, tongue, jaw, cheeks; twitching; tremors); vision changes; yellowing of the skin or eyes. In [37] common adverse drug reactions associated with metoclopramide therapy include restlessness, drowsiness, dizziness, fatigue, and focal dystonia. Infrequent ADRs include hypertension, hypotension, hyperprolactinaemia leading to galactorrhea, constipation, depression, headache, and extrapyramidal effects such as oculogyric crisis.

8320 patients from 20 GPs take Metoclopramide. 8094 medical events are obtained using Readcodes at level 1-5, and 8320x8094 feature matrix is created. After 8094 medical events are combined into 2117 medical symptoms based on Readcodes at level 1-3, 8320x2117 feature matrix is created. After patients are grouped, 83x8094 feature matrix is formed using Readcodes at level 1-5, and 83x2117 feature matrix is obtained using Readcodes at level 1-3. Table 8 and 9 show the top 20

detected results in ascending order of p value of Student’s t-test and Wilcoxon rank-sum test, using Readcodes at level 1-5 and at level 1-3, respectively.

From Table 8 and 9, it is shown that the typical side effect ‘Other extrapyramidal disease and abnormal movement disorders’ for Metoclopramide is detected at rank 3 and rank 4 based on Student’s t-test and Wilcoxon rank-sum test. The top 20 medical events cover most important side effects for the drug of Metoclopramide. The side effect ‘Cholangitis’ is detected at rank 14 in Table 8 based on Student’s t-test using Readcodes at level 1-5. In [38], a 22-year-old female, treated with metoclopramide for 7 to 8 months for abdominal pain, developed hepatic hemangiomatosis with arteriovenous shunting and cholestasis.

The ADRs having top ranks (rank 1,2,3,5) are ‘patient died’ based on Student’s t-test using Readcodes at level 1-5, and they are not listed in [37 ， 38]. So our result achieves 80% accuracy. But based on the study created by eHealthMe from FDA and user community, 1.93% patients have ‘Death’ [39]. We notice that three patients are ‘died’ before they took the drug. Actually, the real state for these patients is ‘expected died’.

V. Discussion

A. Feature matrix

In this research, we propose the novel concept of feature matrix. For the medical data which are collected from day to day clinical practice, they are irregular and magnanimity data. Each patient has different medical information. Even for the same drug, different patients have taken drugs at different time points and different periods, and have different medical symptoms. How to extract the useful information after patients take drugs is first and key step to detect the adverse drug reactions. A novel concept of feature matrix is proposed to store and characterize the common information that only deal with symptoms before and after taking drugs for patients. For each drug, its corresponding feature matrix is built to cover all the symptoms or medical events occurred before or after taking the drug for all patients. Feature matrix transfers the unrelated information of different patients to common saving format, and builds the foundation to detect the adverse drug reactions. For example, some patients take the “statin” drugs for years, and the corresponding prescriptions and medical events are “mass” information. But we only interest the information before and after taking drugs, feature matrix are built to characterize the representative features extracted from THIN database, which is collected day to day clinical practice. All the prescriptions during the period of taking drugs are checked. The medical events during 60 days before or after each prescription are used to build the feature matrix for this drug. This is reason that chronic side effects can be detected. For example, for the drug of Alendronate, one chronic side effect is “Temporomandibular joint disorders”, which is detected by our proposed method.

Two kinds of feature matrix are created to represents the medical events or symptoms for patients before and after taking drugs. One type of feature matrix is to use original data based on Readcodes at level 1-5. Another one is based on Readcodes at level 1-3, combining the medical symptoms using Readcodes at level 4-5 into one using Readcodes at level 3. For example, ‘Type 2 diabetes mellitus’ has Readcodes of ‘C10F.00’ at level 4, and Readcodes of ‘C10..00’ at level 3 represents the general term of ‘Diabetes mellitus’.

B. Feature selection and the performance

In this research, we compare our proposed method with published results in [19]. In paper [19], two drugs of Atovastatin and Alendronate are used to test their proposed methods, and average 30% accuracy is obtained in [19]. Their proposed methods do not detect two major side effects of muscle pain and musculoskeletal events.

Same evaluation standard is used in our research. Our proposed methods based on extracted feature matrix and feature selection, achieve good results. For drug Atovastatin and Alendronate, major ADRs are detected, and 100% and 85% accuracy are obtained. For Atorvastatin, the major side effect of ‘muscle pain’ is detected at rank 3, which is ‘Other soft tissue disorders’ using Readcodes at level 1-3. For the drug of Alendronate, the typical side effect of ‘Temporomandibular joint disorders’ is detected at rank 1 based on the descending order of the ratio of the patient number after taking the drug to one before taking the drug, which is ‘Dentofacial anomalies’ using Readcodes at level 1-3. For the drug of Metoclopramide, the typical ADR of ‘abnormal movement disorders’ is detected at rank 3, which is ‘Other extrapyramidal disease and abnormal movement disorders’ using Readcodes at level 1-3.

Feature selection is widely used to reduce data dimensionality and detect the significant features [40,41]. In this research feature selection methods of Student’s t-test and Wilcoxon rank-sum test are employed to detect the significant changes of feature matrix. Student’s t-test assumes that the data follow a normal distribution. Wilcoxon rank-sum test has no assumption about the shape of data distribution. Experimental results show that Student’s t-test achieves better performance than Wilcoxon rank-sum test. The variable of ratio of the patient number after taking the drug to one before taking the drug is complementary part for Student’s t-test method. The ratio variable is useful to detect the some typical ADRs.

Because of the creation of feature matrix, its unique saving format makes it possible to combine medical symptoms using Readcodes at level 4-5 into one using Readcodes at level 3. Feature matrix using Readcodes at level 1-5 gives the detail information, and feature matrix using Readcode at level 1-3 can extracts the general symptoms, which may be not noticed using feature matrix using Readcodes at level 1-5, for example the ADR of “extrapyramidal disease and abnormal movement disorder” for drug Metoclopramide.

C. Rare ADRs, chronic ADRs, drug interaction, and advantages

Our proposed method has successfully shortlisted the rare ADRs of “Temporomandibular joint disorders” for drug Alendronate, “extrapyramidal disease and abnormal movement disorder” for drug Metoclopramide, and the research work in [21] indicated that the existing algorithms are not capable of detecting rare ADRs.

The ADRs are rare, and it means the frequency to appear is low. If the methods detect the ADRs only based on the frequency of ADR appearing, there is never sufficient data to support the hypothesis of a causal relationship between the drug and the reaction.

The rare ADRs of “Temporomandibular joint disorders” for drug Alendronate and “extrapyramidal disease and abnormal movement disorder” for drug Metoclopramide are from 3346 and 8320 patients. Enough data is used to support the detection of ADRs. Our proposed method is not only based on the frequency of ADR appearing, and but also is based on the significant change of medical symptoms. The rare ADRs may not appear frequently, but make the significant change. So our proposed method can detect rare ADRs.

When feature matrix is built, all the prescriptions of each patient for this drug are checked to extract the information, no matter how long the patient takes this drug. Because of the characteristics of feature matrix, chronic side effects can be detected. Exiting algorithms [18 ， 19] only check the prescriptions during some period of time, and lose part of the information of side effects.

Because of the characteristics of feature matrix, it is easy to trace the information of patients having the particular ADRs, and extract the information of other drugs taken by these patients. This gives the foundation to do further research of drug interactions.

Two advantages are achieved in this research compared with exiting algorithms. One is to propose feature matrix, which transfers the irregular information of each patient during whole period of taking drugs into standard saving format. This makes it possible to detect chronic ADRs. Another one is based on the changing degree of symptoms to detect the ADRs. This makes it possible to detect rare ADRs.

VI. Conclusions

In this study we propose a novel method to successfully detect the ADRs by introducing feature matrix and feature selection to detect the significant changes after patients take drugs. A feature matrix, which characterizes the medical events before patients take drugs or after patients take drugs, is created from THIN database. Feature matrix transfers the irregular and high-throughput medical data collected from daily basis into feature matrix of standard saving format, and is a foundation to perform feature selection methods. Feature selection methods based on Student’s t-test and Wilcoxon rank-sum test are used to detect the significant features from high dimensional feature matrix. The significant

ADRs, which are corresponding to significant features, are detected. Experiments are performed on three drugs: Atorvastatin, Alendronate, and Metoclopramide. Compared to other computerized method, our proposed method achieves better performance. The detected ADRs are based on computerized methods, further investigation is needed.

Список литературы Feature Selection in Detection of Adverse Drug Reactions from the Health Improvement Network (THIN) Database

G. Severino and M.D. Zompo, “Adverse drug reactions: role of pharmacogenomics,” Pharmacological Research, 49: 363-373, 2004.
Y. Qian, X. Ye, W. Du, J. Ren, Y. Sun, H. Wang, B. Luo, Q. Gao, M. Wu, and J. He, “A computerized system for signal detection in spontaneous reporting system of Shanghai China,” Pharmacoepidemiology and Drug Safety, 18: 154-158, 2009.
K.S. Park and O. Kwon, “The state of adverse event reporting and signal generation of dietary supplements in Korea,” Regulatory Toxicology and Pharmacology, 57: 74-77, 2010.
A. Farcas, A. Sinpetrean, C. Mogosan, M. Palage, O. Vostinaru, M. Bojita, D. Dumitrascu, “Adverse drug reactions detected by stimulated spontaneous reporting in an internal medicine department in Romania,” European Journal of Internal Medicine, 21: 453-457, 2010.
A. Szarfman , S.G. Machado and R.T. O'Neill, “Use of screening algorithms and computer systems to efficiently signal higher-than-expected combinations of drugs and events in the US FDA’s spontaneous reports database,” Drug Saf., 25: 381-392, 2002.
R. Kasliwal, L.V. Wilton, V. Cornelius, B. Aurich-Barrera and S.A.W. Shakir, “Safety profile of Rosuvastatin-results of a prescription-event monitoring study of 11680 patients,” Drug Safety, 30: 157-170, 2007.
R.D. Mann, K. Kubota, G. Pearce and L. Wilton, “Salmeterol: a study by prescription-event monitoring in a UK cohort of 15,407 patients,” J Clin Epidemio, 49: 247-250, 1996.
N.S. Rawson, G.L. Pearce and W.H. Inman, “Prescription-event monitoring: methodology and recent progress,” Journal of Clinical Epidemiology, 43: 509-522, 1990.
B. Aurich-Barrera, L. Wilton, D. Brown and S. Shakir, “Paediatric postmarketing pharmacovigilance using prescription-event monitoring: comparison of the adverse event profile of lamotrigine prescribed to children and adults in England,” Drug Saf., 33: 751-763, 2010.
T.J. Hannan, “Detecting adverse drug reactions to improve patient outcomes,” International Journal of Medical Informatics, 55: 61-64, 1999.
R.H. Meyboom, M. Lindquist, A.C. Egberts and I.R. Edwards, “Signal detection and follow-up in pharmacovigilance,” Drug Saf., 25: 459-465, 2002.
A. Bate, M. Lindquist, I.R. Edwards, S. Olsson, R. Orre, A. Lansner and R.M. De Freitas, “A Bayesian neural network method for adverse drug reaction signal generation,” Eur J Clin Pharmacol, 54: 315-321, 1998.
R. Orre, A. Lansner, A. Bate and M. Lindquist, “Bayesian neural networks with confidence estimations applied to data mining,” Computational Statistics & Data Analysis, 34: 473-493, 2000.
M. Hauben and A. Bate, “Decision support methods for the detection of adverse events in post-marketing data,” Drug Discovery Today, 14: 343-357, 2009.
Y. Koh, C.W. Yap and S.C. Li, “A quantitative approach of using genetic algorithm in designing a probability scoring system of an adverse drug reaction assessment system,” International Journal of Medical Informatics, 77: 421-430, 2008.
C. Henegar, C. Bousquet, A.L. Lillo-Le, P. Degoulet and M.C. Jaulent, “A knowledge based approach for automated signal generation in pharmacovigilance,” Stud Health Technol Inform, 107: 626-630, 2004
C. Bousquet , C. Henegar , A.L. Louët , P. Degoulet and M.C. Jaulent, “Implementation of automated signal generation in pharmacovigilance using a knowledge-based approach,” International Journal of Medical Informatics, 74: 563-571, 2005.
H. Jin, J. Chen, H. He, G.J. Williams, C. Kelman and C.M. O'Keefe, “Mining unexpected temporal associations: applications in detecting adverse drug reactions,” IEEE Trans Inf Technol Biomed, 12: 488-500, 2008.
H. Jin, J. Chen, H. He, C. Kelman, D. McAullay and C.M. O'Keefe, “Signaling Potential Adverse Drug Reactions from Administrative Health Databases,” IEEE Trans. Knowl. Data Eng., 22: 839-853, 2010.
G. Williams , D. Vickers , C. Rainsford , L. Gu , H. He , R. Baxter and S. Hawkins, Bias in the Queensland linked data set, CSIRO Math. Inf. Sci., 2002.
J. Reps, J. Garibaldi, U. Aickelin, D. Soria, J. Gibson and R. Hubbard, “Comparison of algorithms that detect drug side effects using electronic healthcare databases,” Soft Computing, 2013
Y. Liu, U. Aickelin, Jan. Feyereisl and L. Durrant, “Wavelet feature extraction and genetic algorithm for biomarker detection in colorectal cancer data,” Knowledge-Based Systems, 37: 502-514, 2013.
H. Uguz, “A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm,” Knowledge-Based Systems, 24: 1024-1032, 2011.
E. Kreyszig, Introductory Mathematical Statistics, John Wiley, New York, 1970.
B. Rosner, Fundamentals of Biostatistics, Duxbury press, 2000.
http://en.wikipedia.org/wiki/Atorvastatin.
http://www.statinanswers.com/effects.htm.
http://www.drugs.com/sfx/atorvastatin-side-effects.html.
A.B. Fernández, R.H. Karas, A.A. Alsheikh-Ali and P.D. Thompson, “Statins and interstitial lung disease: a systematic review of the literature and of food and drug administration adverse event reports,” Chest, 134: 824-830, 2008.
N. Sattar, D. Preiss, and H.M. Murray, et al., “Statins and risk of incident diabetes: a collaborative meta-analysis of randomized statins trials,” Lancet, 375: 735-742, 2010.
D. Preiss, S.R.K. Seshasai, and P. Welsh, et al., “Risk of incident diabetes with intensive-dose compared with moderate-dose statin therapy,” JAMA, 305: 2556-2564, 2011.
http://en.wikipedia.org/wiki/Alendronic_acid.
http://www.drugs.com/sfx/alendronate-side-effects.html.
N. Miura, N. Mizuno, R. Aoyama, W. Kitagawa, H. Yamada, K. Nishikawa and H. Imai, “Massive proteinuria and acute renal failure after oral bisphosphonate (alendronate) administration in a patient with focal segmental glomerulosclerosis,” Clin Exp Nephrol, 13: 85-88, 2009.
http://www.ehealthme.com/ds/alendronate+sodium/type+2+diabetes+mellitus.
http://www.ehealthme.com/ds/alendronate+sodium/hypertension+-+essential.
http://en.wikipedia.org/wiki/Metoclopramide.
http://www.drugs.com/sfx/metoclopramide-side-effects.html
http://www.ehealthme.com/ds/metoclopramide/death
R. Parimal and R. Nallaswamy ,"Feature selection using a novel particle swarm optimization and It’s Variants," I.J. Information Technology and Computer Science, 5: 16-24, 2012
S. Chittineni and R. Bhogapathi ,"Determining contribution of features in clustering multidimensional data using neural network," I.J. Information Technology and Computer Science, 10: 29-36, 2012.

Еще