An Efficient Generalized Ridge Estimator for the Logistic Regression Model


In the logistic regression model (LRM), the unknown parameters are traditionally estimated by the method of maximum likelihood (MLE). In the presence of substantial multicollinearity between the explanatory variables, however, the MLE parameter estimates become unstable, have large variances, and lead to wide confidence intervals and reduced statistical power of tests. To overcome these drawbacks, this paper considers a generalized ridge estimator for logistic regression (GRL) based on introducing a matrix of ridge parameters K, which controls the degree of bias and reduces the variance of the estimated regression coefficients. The parameters of the GRL model are estimated by the maximum likelihood procedure, after which the efficiency of MLE and GRL is compared under various multicollinearity scenarios by Monte Carlo simulation. The numerical experiment analyzes a number of recently proposed methods for selecting the ridge parameter k and assesses their effect on the mean squared error (MSE) of the coefficient estimates. The simulation results demonstrate that the generalized ridge logistic regression estimator yields lower MSE values than the classical MLE in all considered configurations of correlation between the variables and noise levels, confirming its practical suitability for classification and prediction problems under multicollinearity.


Logistic regression, ridge estimation, generalized ridge estimation, multicollinearity, Monte Carlo simulation.

Short address: https://sciup.org/14135108

IDR: 14135108   |   DOI: 10.47813/2782-5280-2026-5-1-1033-1041


Logistic regression is a common method for modeling binary data in the health sciences and biostatistics. Frisch (1934) first discussed the problem that arises when explanatory variables are jointly correlated: in multiple linear regression, correlated independent variables make it difficult to obtain definitive answers to the research questions, because the variances of the estimates are too high and the t-values too low. This state is known as the multicollinearity problem [1].

Logistic regression, also known as the logistic model, is frequently used in classification and predictive analytics. It measures the probability of an event, such as presence/absence or success/failure, given a dataset of independent variables. Because the independent variables may be correlated, the ridge regression method can be combined with the logistic regression model. For further details on logistic regression and the ridge regression method, we refer the reader to [1-5], among others.

In many applications of regression models, the explanatory variables are correlated. When these correlations are high, they lead to unstable estimates of the regression coefficients and make the estimates difficult to interpret: under multicollinearity it is hard to isolate the individual effect of each explanatory variable, and the variability of the coefficients affects both inference and prediction. Several methods have been proposed to address multicollinearity [6]. The MLE is the method most commonly used to estimate the unknown coefficients of the LRM. One assumption of multiple regression models is that the explanatory variables are independent and uncorrelated, and the MLE performs best in that case [7]. In practice, however, near-linear relationships between the explanatory variables are common. The multicollinearity problem, introduced by [8], has several disadvantages for parameter estimation by MLE. The parameter estimates often have large variances, making reliable results difficult to obtain, and the estimated coefficients can be unstable. There is also the problem of wide confidence intervals and low statistical power, which increases the probability of type II errors in hypothesis tests on the regression coefficients.

Several methods exist for addressing the multicollinearity problem. One of the most common is ridge regression, developed by [9]. Studies of the linear regression (LR) model have sought the best shrinkage value of the ridge coefficient k; see, for example, [9-11], among many others. [12] developed the ridge regression estimator for the generalized linear model (GLM), and by extending the idea of [12], many researchers have proposed ridge regression approaches for other models; see, for example, [1, 4, 13].

The ridge method, first introduced in [9], addresses this problem by choosing a shrinkage value, denoted k, that reduces the variance of the coefficient estimates at the cost of increased bias. Researchers have shown that there exists a non-zero value of k for which the MSE of the ridge regression coefficients is smaller than the variance of the maximum likelihood (MLE) coefficients; among these are [11] and [14-23]. Several methods for generalized ridge regression have also been proposed; among these are [24-27].

The purpose of this article is to study several parameters of the generalized ridge logistic regression (GRL) estimator, estimated by the MLE method, under conditions of high correlation between the explanatory variables. The article is organized as follows. In the first section, we present the model under analysis and formally define several parameters for logistic ridge regression. In Section 2 we present the generalized ridge estimator (GRE). In Section 3 we describe the simulation experiment, including the factors that can influence the sample characteristics for the proposed parameters. In Section 4 we report the results for the different coefficients in terms of MSE. The conclusions of the article are presented in Section 5.

MATERIALS AND METHODS

Logistic Ridge Regression Model (LRRM)

In this section, we introduce logistic regression, first proposed by [2], and discuss some of the parameters that have been used in ridge estimators by researchers [18, 21, 22].

Logistic regression analysis is a commonly applied statistical method when the value $y_i$ of the dependent variable in the regression model follows a Bernoulli distribution, $y_i \sim \mathrm{Be}(p_i)$, with

$$p_i = \frac{\exp(x_i^\top \beta)}{1 + \exp(x_i^\top \beta)}, \qquad (1)$$

where $\beta$ is the $(k+1)\times 1$ vector of coefficients and $x_i$ is row $i$ of $X$, the $n \times (k+1)$ data matrix. The most common estimation method is the MLE, in which the following log-likelihood function is maximized:

$$\ell(\beta) = y^\top \log(p) + (1 - y)^\top \log(1 - p). \qquad (2)$$

Setting the first derivative to zero, the MLE solves the equation

$$\frac{\partial \ell}{\partial \beta} = X^\top (y - p) = 0. \qquad (3)$$

The equations resulting from the first derivative are non-linear and have no closed-form solution, so they are solved by numerical methods, the most common of which is the Newton-Raphson algorithm. Using the iteratively weighted least squares (IWLS) algorithm, the solution of equation (3) is obtained as

$$\hat\beta_{ML} = (X^\top W X)^{-1} X^\top W z, \qquad (4)$$

where $W = \mathrm{diag}[\hat p_i (1 - \hat p_i)]$ and $z$ is the vector whose element $i$ equals

$$z_i = \mathrm{logit}(\hat p_i) + \frac{y_i - \hat p_i}{\hat p_i (1 - \hat p_i)}. \qquad (5)$$
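To make the IWLS iteration behind Eqs. (4)-(5) concrete, here is a minimal sketch in Python/NumPy; the function name, tolerance, and iteration cap are ours, not part of the paper.

```python
import numpy as np

def logistic_mle_iwls(X, y, tol=1e-8, max_iter=100):
    """Maximum likelihood fit of a logistic model via IWLS, Eqs. (4)-(5)."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta                                           # linear predictor
        p = np.clip(1 / (1 + np.exp(-eta)), 1e-10, 1 - 1e-10)    # Eq. (1)
        w = p * (1 - p)                                          # diagonal of W
        z = eta + (y - p) / w                                    # working response, Eq. (5)
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X),
                                   X.T @ (w * z))                # Eq. (4)
        if np.max(np.abs(beta_new - beta)) < tol:
            break
        beta = beta_new
    return beta_new
```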

The asymptotic covariance matrix of the MLE equals the inverse of the expected negative second-derivative matrix (inverse Hessian matrix):

$$\mathrm{Cov}(\hat\beta_{ML}) = \left[E\!\left(-\frac{\partial^2 \ell}{\partial \beta\, \partial \beta^\top}\right)\right]^{-1} = (X^\top W X)^{-1}, \qquad (6)$$

and the asymptotic MSE is

$$E(L_{ML}) = E\!\left[(\hat\beta_{ML} - \beta)^\top (\hat\beta_{ML} - \beta)\right] = \mathrm{tr}\!\left[(X^\top W X)^{-1}\right] = \sum_j \frac{1}{\lambda_j}. \qquad (7)$$

Here, $\lambda_j$ denotes eigenvalue $j$ of the $X^\top W X$ matrix. One drawback of maximum likelihood estimation is that the variance becomes large when there is a strong correlation between the independent variables, because some eigenvalues will then be small; for instance, a single eigenvalue $\lambda_j = 0.01$ alone contributes 100 to the sum in (7). The ridge estimator of ridge regression can be directly extended to logistic ridge regression [1, 2] as follows:

$$\hat\beta_{LRR} = (X^\top W X + kI)^{-1} X^\top W X\, \hat\beta_{ML} = Z \hat\beta_{ML}, \qquad (8)$$

where $W$ and $\hat\beta_{ML}$ are the ML estimates derived from Eq. (4). The mean squared error of the logistic ridge estimator is:

$$
\begin{aligned}
E(L_{LRR}) &= E\!\left[(\hat\beta_{LRR} - \beta)^\top (\hat\beta_{LRR} - \beta)\right] \\
&= E\!\left[(\hat\beta_{ML} - \beta)^\top Z^\top Z (\hat\beta_{ML} - \beta)\right] + (Z\beta - \beta)^\top (Z\beta - \beta) \\
&= \mathrm{tr}\!\left[(X^\top W X)^{-1} Z^\top Z\right] + k^2 \beta^\top (X^\top W X + kI)^{-2} \beta \\
&= \sum_j \frac{\lambda_j}{(\lambda_j + k)^2} + k^2 \beta^\top (X^\top W X + kI)^{-2} \beta, \qquad (9)
\end{aligned}
$$

where $k > 0$. The special case of Eq. (8) with $k = 0$ is the ML estimator.
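A direct transcription of Eq. (8) on top of the IWLS fit might look as follows; this is a sketch using `logistic_mle_iwls` from the snippet above, with the function name ours.

```python
def logistic_ridge(X, y, k):
    """Logistic ridge estimator of Eq. (8) for a given ridge parameter k."""
    beta_ml = logistic_mle_iwls(X, y)
    p = np.clip(1 / (1 + np.exp(-(X @ beta_ml))), 1e-10, 1 - 1e-10)
    XtWX = X.T @ ((p * (1 - p))[:, None] * X)        # X'WX at the ML fit
    # (X'WX + kI)^{-1} (X'WX) beta_ML  =  Z beta_ML
    return np.linalg.solve(XtWX + k * np.eye(X.shape[1]), XtWX @ beta_ml)
```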

Generalized Ridge Estimator

The generalized ridge estimator (GRE) differs from the ridge regression (RR) model in that it allows separate values $k_1, \ldots, k_s$ of the ridge parameter:

$$\hat\beta_{GRE} = (X^\top X + K)^{-1} X^\top y, \qquad (10)$$

where $K = \mathrm{diag}(k_1, k_2, \ldots, k_s)$. It is useful to find the optimal values of $k_i$ when using the GRE, because the MSE is then better than with the ordinary ridge estimator or the MLE.

The GRR definition for the logistic regression model (LGRR) is:

$$\hat\beta_{LGRR} = (X^\top W X + K)^{-1} X^\top W X\, \hat\beta_{ML} = Z \hat\beta_{ML}. \qquad (11)$$

The selection of the matrix $K$ must be considered carefully. Several approaches are adapted to estimate $K$ in this study; they are listed below, in order.

The classical Hoerl-Kennard choice is

$$\hat k_i^{(HK)} = \frac{\hat\sigma^2}{\hat\alpha_i^2}, \qquad i = 1, 2, \ldots, s, \qquad (12)$$

where $\hat\alpha_i$ is element $i$ of $\gamma^\top \hat\beta_{LRR}$, $\gamma$ is the eigenvector matrix of $X^\top W X$, and the dispersion coefficient $\sigma^2$ is estimated by $\hat\sigma^2 = \sum_i (y_i - \hat p_i)^2 / (n - s)$.

We use eleven estimators of $K$ adapted from [27], specifically the forms presented in Eqs. (74)-(84) there, denoted $\hat k_1, \ldots, \hat k_{11}$ (our Eqs. (13)-(23)). Each is a variant of the Hoerl-Kennard ratio in Eq. (12) rescaled by the factor $2/\lambda_{\max}$, where $\lambda_{\max}$ is the largest eigenvalue of $X^\top W X$, with the individual terms combined through operations such as the maximum, the median, and arithmetic or geometric averaging. Representative forms are

$$\hat k_1 = \frac{2\hat\sigma^2}{\lambda_{\max}\,\hat\alpha_{\max}^2}, \qquad \hat k_2 = \max_i\!\left(\frac{2\hat\sigma^2}{\lambda_{\max}\,\hat\alpha_i^2}\right), \qquad \hat k_6 = \operatorname{median}_i\!\left(\frac{2\hat\sigma^2}{\lambda_{\max}\,\hat\alpha_i^2}\right),$$

and the remaining estimators $\hat k_3, \hat k_4, \hat k_5, \hat k_7, \ldots, \hat k_{11}$ are defined analogously in [27]. In the tables below, the LGRR estimator built from $\hat k_j$ is denoted GRR_Lukmanj.
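Putting Eqs. (11)-(12) together, the sketch below computes the canonical coefficients $\hat\alpha$, the dispersion estimate $\hat\sigma^2$, and the resulting LGRR estimate. It implements only the Hoerl-Kennard rule (12) as an illustration; any of the rules $\hat k_1, \ldots, \hat k_{11}$ from [27] can be substituted at the marked line. The function name is ours, and `logistic_mle_iwls` is the helper defined earlier.

```python
import numpy as np

def logistic_generalized_ridge(X, y):
    """LGRR estimator of Eq. (11) with per-component k_i from Eq. (12)."""
    n, s = X.shape
    beta_ml = logistic_mle_iwls(X, y)
    p = np.clip(1 / (1 + np.exp(-(X @ beta_ml))), 1e-10, 1 - 1e-10)
    XtWX = X.T @ ((p * (1 - p))[:, None] * X)
    lam, gamma = np.linalg.eigh(XtWX)          # eigenvalues and eigenvectors
    alpha = gamma.T @ beta_ml                  # canonical coefficients
    sigma2 = np.sum((y - p) ** 2) / (n - s)    # dispersion estimate
    k = sigma2 / alpha ** 2                    # Eq. (12); swap in k_1..k_11 here
    K = gamma @ np.diag(k) @ gamma.T           # ridge matrix back in the original basis
    return np.linalg.solve(XtWX + K, XtWX @ beta_ml)
```

Because the $k_i$ are defined on the canonical (eigenvector) scale, the diagonal matrix $\mathrm{diag}(k_1, \ldots, k_s)$ is rotated back by $\gamma$ before being added to $X^\top W X$.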

RESULTS

Table 1. Average MSE values when n = 100 and β₀ = 0.

| Method | p = 5, ρ = 0.85 | p = 5, ρ = 0.95 | p = 5, ρ = 0.99 | p = 10, ρ = 0.85 | p = 10, ρ = 0.95 | p = 10, ρ = 0.99 |
|---|---|---|---|---|---|---|
| MLE | 2.5640 | 8.0337 | 40.4974 | 11.1561 | 44.3506 | 210.4606 |
| GRR_Lukman1 | 2.0123 | 5.6261 | 21.7362 | 8.1675 | 28.3941 | 109.3064 |
| GRR_Lukman2 | 1.0306 | 2.1555 | 7.0780 | 1.3152 | 2.4829 | 8.1120 |
| GRR_Lukman3 | 1.4339 | 3.3882 | 11.0671 | 3.1056 | 8.0597 | 29.4210 |
| GRR_Lukman4 | 2.0730 | 6.3484 | 29.2969 | 8.2033 | 32.7290 | 150.6817 |
| GRR_Lukman5 | 1.4426 | 4.6235 | 27.7736 | 1.6170 | 12.5240 | 159.1039 |
| GRR_Lukman6 | 1.7778 | 4.6801 | 18.5824 | 5.3897 | 16.3831 | 67.9481 |
| GRR_Lukman7 | 1.1269 | 2.2643 | 7.1078 | 3.7734 | 8.8272 | 28.1793 |
| GRR_Lukman8 | 2.0247 | 5.1535 | 15.6500 | 7.3830 | 21.2948 | 59.5732 |
| GRR_Lukman9 | **0.7515** | **1.1128** | **1.9983** | **1.2124** | **2.1293** | **3.3574** |
| GRR_Lukman10 | 1.4345 | 3.2204 | 10.7390 | 4.7142 | 13.1398 | 39.8583 |
| GRR_Lukman11 | 1.1615 | 2.2072 | 6.2601 | 2.7405 | 5.3214 | 13.6279 |

Table 2. Average MSE values when n = 200 and β₀ = 0.

| Method | p = 5, ρ = 0.85 | p = 5, ρ = 0.95 | p = 5, ρ = 0.99 | p = 10, ρ = 0.85 | p = 10, ρ = 0.95 | p = 10, ρ = 0.99 |
|---|---|---|---|---|---|---|
| MLE | 1.0442 | 3.3817 | 16.5013 | 3.3025 | 11.0761 | 59.2111 |
| GRR_Lukman1 | 0.9321 | 2.7027 | 10.7266 | 2.8679 | 8.9635 | 41.6266 |
| GRR_Lukman2 | 0.7296 | 1.4919 | 4.7454 | 0.9928 | 2.0051 | 7.0140 |
| GRR_Lukman3 | 0.8855 | 2.0516 | 7.4390 | 1.9074 | 4.5544 | 19.4522 |
| GRR_Lukman4 | 0.9627 | 2.9010 | 13.3834 | 2.8315 | 9.2649 | 47.9272 |
| GRR_Lukman5 | 0.8856 | 2.1618 | 11.4463 | 1.0028 | 2.3699 | 38.5917 |
| GRR_Lukman6 | 0.9024 | 2.5009 | 9.8090 | 2.2084 | 6.8076 | 32.6604 |
| GRR_Lukman7 | 0.6180 | 1.3725 | 4.1141 | 1.4811 | 3.5308 | 12.7774 |
| GRR_Lukman8 | 0.9765 | 2.7784 | 9.4083 | 2.8747 | 8.4569 | 32.9171 |
| GRR_Lukman9 | **0.6009** | **0.8848** | **1.5433** | **0.9010** | **1.5181** | **2.7522** |
| GRR_Lukman10 | 0.7713 | 1.7973 | 5.3324 | 1.9129 | 4.9953 | 19.2595 |
| GRR_Lukman11 | 0.7240 | 1.4570 | 3.3814 | 1.3110 | 2.9221 | 9.7536 |

Table 3. Average MSE values when n = 300 and β₀ = 0.

| Method | p = 5, ρ = 0.85 | p = 5, ρ = 0.95 | p = 5, ρ = 0.99 | p = 10, ρ = 0.85 | p = 10, ρ = 0.95 | p = 10, ρ = 0.99 |
|---|---|---|---|---|---|---|
| MLE | 0.7026 | 2.1128 | 10.9040 | 1.9295 | 6.4616 | 33.7800 |
| GRR_Lukman1 | 0.6486 | 1.8067 | 8.0236 | 1.7516 | 5.4456 | 26.1889 |
| GRR_Lukman2 | 0.6068 | 1.1835 | 4.1618 | 0.8440 | 1.7057 | 5.4734 |
| GRR_Lukman3 | 0.6694 | 1.5601 | 6.1269 | 1.4356 | 3.4610 | 14.0352 |
| GRR_Lukman4 | 0.6622 | 1.9106 | 9.4135 | 1.7402 | 5.5250 | 28.9724 |
| GRR_Lukman5 | 0.7314 | 1.5641 | 7.9563 | 0.9479 | 1.4907 | 18.8884 |
| GRR_Lukman6 | 0.6414 | 1.7482 | 7.6224 | 1.4887 | 4.4498 | 21.8357 |
| GRR_Lukman7 | 0.5835 | 1.0167 | 3.2071 | 1.0468 | 2.4077 | 8.5328 |
| GRR_Lukman8 | 0.6735 | 1.9004 | 7.6975 | 1.7914 | 5.3346 | 23.0749 |
| GRR_Lukman9 | **0.5482** | **0.8337** | **1.3530** | **0.8208** | **1.3343** | **2.5168** |
| GRR_Lukman10 | 0.5665 | 1.3308 | 4.4193 | 1.3282 | 3.3835 | 13.0538 |
| GRR_Lukman11 | 0.5693 | 1.1555 | 3.1013 | 1.0628 | 2.2775 | 6.7911 |

Table 4. Average MSE values when n = 100 and β₀ = 1.

| Method | p = 5, ρ = 0.85 | p = 5, ρ = 0.95 | p = 5, ρ = 0.99 | p = 10, ρ = 0.85 | p = 10, ρ = 0.95 | p = 10, ρ = 0.99 |
|---|---|---|---|---|---|---|
| MLE | 2.8853 | 9.0213 | 39.3057 | 7.5031 | 23.3707 | 38.3376 |
| GRR_Lukman1 | 2.4003 | 6.6659 | 28.2197 | 6.1558 | 17.3545 | 19.5704 |
| GRR_Lukman2 | 1.4377 | 3.3739 | 12.0741 | 2.2103 | 4.4228 | 6.6966 |
| GRR_Lukman3 | 1.9670 | 5.0477 | 19.8731 | 4.1927 | 9.9221 | 11.5442 |
| GRR_Lukman4 | 2.5588 | 7.7533 | 28.9280 | 6.5001 | 19.6348 | 25.6908 |
| GRR_Lukman5 | 2.0190 | 6.4329 | 28.2873 | 3.5663 | 13.5733 | 17.2433 |
| GRR_Lukman6 | 2.3001 | 6.2631 | 28.0022 | 5.2918 | 14.4088 | 19.6552 |
| GRR_Lukman7 | 1.6270 | 3.5627 | 10.8803 | 3.6640 | 7.6278 | 8.9798 |
| GRR_Lukman8 | 2.5380 | 6.7096 | 22.9903 | 6.3420 | 16.2860 | 17.4482 |
| GRR_Lukman9 | **1.0476** | **1.6246** | **3.5449** | **1.7779** | **2.5738** | **3.8658** |
| GRR_Lukman10 | 1.8426 | 4.2598 | 15.7734 | 4.3020 | 10.2747 | 11.8678 |
| GRR_Lukman11 | 1.4828 | 2.9591 | 9.2960 | 2.9339 | 5.8681 | 8.7304 |

Table 5. Average MSE values when n = 200 and β₀ = 1.

| Method | p = 5, ρ = 0.85 | p = 5, ρ = 0.95 | p = 5, ρ = 0.99 | p = 10, ρ = 0.85 | p = 10, ρ = 0.95 | p = 10, ρ = 0.99 |
|---|---|---|---|---|---|---|
| MLE | 1.1343 | 3.5181 | 17.9053 | 3.6015 | 12.1673 | 62.9317 |
| GRR_Lukman1 | 1.0455 | 2.9787 | 12.9131 | 3.2352 | 10.3182 | 47.3147 |
| GRR_Lukman2 | 0.8379 | 1.9236 | 7.3975 | 1.2874 | 2.8962 | 11.0290 |
| GRR_Lukman3 | 1.0018 | 2.6092 | 10.7833 | 2.3911 | 6.8679 | 27.8553 |
| GRR_Lukman4 | 1.0831 | 3.2344 | 15.9179 | 3.2767 | 10.9151 | 54.7164 |
| GRR_Lukman5 | 0.9942 | 2.7171 | 14.5964 | **1.1480** | 4.4925 | 47.5701 |
| GRR_Lukman6 | 1.0323 | 2.9789 | 13.2040 | 2.8057 | 8.8001 | 41.3562 |
| GRR_Lukman7 | 0.8180 | 1.8122 | 5.9311 | 1.9173 | 4.6245 | 16.5334 |
| GRR_Lukman8 | 1.0995 | 3.1639 | 12.5449 | 3.3397 | 10.2538 | 40.7839 |
| GRR_Lukman9 | **0.7700** | **1.1871** | **2.1919** | 1.2827 | **2.0791** | **3.6526** |
| GRR_Lukman10 | 0.8953 | 2.1650 | 7.7458 | 2.3931 | 6.4305 | 23.8907 |
| GRR_Lukman11 | 0.8112 | 1.6878 | 5.2282 | 1.7178 | 3.9754 | 11.6601 |

Table 6. Average MSE values when n = 300 and β₀ = 1.

| Method | p = 5, ρ = 0.85 | p = 5, ρ = 0.95 | p = 5, ρ = 0.99 | p = 10, ρ = 0.85 | p = 10, ρ = 0.95 | p = 10, ρ = 0.99 |
|---|---|---|---|---|---|---|
| MLE | 0.7207 | 2.1623 | 10.9962 | 2.0544 | 7.0762 | 34.4338 |
| GRR_Lukman1 | 0.6837 | 1.9178 | 8.6506 | 1.9137 | 6.2860 | 27.8912 |
| GRR_Lukman2 | 0.6490 | 1.3828 | 5.4834 | 1.2596 | 2.2104 | 8.4649 |
| GRR_Lukman3 | 0.7137 | 1.7440 | 7.6352 | 1.6483 | 4.5553 | 19.6154 |
| GRR_Lukman4 | 0.6984 | 2.0469 | 10.0677 | 1.9301 | 6.5065 | 31.3165 |
| GRR_Lukman5 | 0.7794 | 1.7665 | 8.9426 | **0.9451** | 2.3031 | 24.0508 |
| GRR_Lukman6 | 0.6786 | 1.9157 | 8.8958 | 1.7323 | 5.5408 | 25.8418 |
| GRR_Lukman7 | 0.5699 | 1.2850 | 4.2769 | 1.2475 | 3.0282 | 11.2455 |
| GRR_Lukman8 | 0.7071 | 2.0398 | 8.9884 | 1.9753 | 6.4112 | 26.4480 |
| GRR_Lukman9 | **0.5107** | **1.0233** | **1.8671** | 1.1845 | **1.7906** | **3.0205** |
| GRR_Lukman10 | 0.6188 | 1.4839 | 5.5180 | 1.5486 | 4.2778 | 15.3369 |
| GRR_Lukman11 | 0.6091 | 1.2434 | 3.8175 | 1.2469 | 2.8739 | 8.3472 |

Table 7. Average MSE values when n = 100 and β₀ = -1.

| Method | p = 5, ρ = 0.85 | p = 5, ρ = 0.95 | p = 5, ρ = 0.99 | p = 10, ρ = 0.85 | p = 10, ρ = 0.95 | p = 10, ρ = 0.99 |
|---|---|---|---|---|---|---|
| MLE | 2.9467 | 8.3180 | 42.4476 | 6.9314 | 23.8698 | 29.5804 |
| GRR_Lukman1 | 2.2975 | 5.4689 | 21.3817 | 5.4325 | 16.3420 | 19.1137 |
| GRR_Lukman2 | 1.2327 | 2.4029 | 7.3129 | 1.6253 | 3.1657 | 11.1405 |
| GRR_Lukman3 | 1.6177 | 3.2421 | 11.7059 | 3.0771 | 7.3726 | 27.3403 |
| GRR_Lukman4 | 2.3394 | 5.9420 | 28.9126 | 5.4588 | 17.8676 | 19.4914 |
| GRR_Lukman5 | 1.7009 | 4.5919 | 29.9182 | 2.5120 | 12.2020 | 13.8735 |
| GRR_Lukman6 | 2.0291 | 4.5375 | 18.8591 | 4.2049 | 12.0259 | 15.8053 |
| GRR_Lukman7 | 1.1560 | 2.2522 | 6.5854 | 2.5469 | 5.7325 | 9.0577 |
| GRR_Lukman8 | 2.2976 | 4.9186 | 14.5954 | 5.2753 | 13.7368 | 14.3222 |
| GRR_Lukman9 | **0.8749** | **1.1090** | **2.1244** | **1.4134** | **2.0125** | **3.7456** |
| GRR_Lukman10 | 1.6501 | 3.1216 | 9.2303 | 3.4325 | 8.1922 | 10.5109 |
| GRR_Lukman11 | 1.3906 | 2.3528 | 5.5576 | 2.4372 | 4.7003 | 12.6191 |

Table 8. Average MSE values when n = 200 and β₀ = -1.

| Method | p = 5, ρ = 0.85 | p = 5, ρ = 0.95 | p = 5, ρ = 0.99 | p = 10, ρ = 0.85 | p = 10, ρ = 0.95 | p = 10, ρ = 0.99 |
|---|---|---|---|---|---|---|
| MLE | 1.2170 | 3.4007 | 19.1084 | 3.4705 | 11.5113 | 62.1224 |
| GRR_Lukman1 | 1.0572 | 2.4911 | 11.9543 | 3.0123 | 9.0812 | 42.3467 |
| GRR_Lukman2 | 0.8354 | 1.4736 | 5.0611 | 1.1765 | 2.3169 | 7.3114 |
| GRR_Lukman3 | 0.9406 | 1.8121 | 7.3855 | 2.0518 | 4.9306 | 19.3170 |
| GRR_Lukman4 | 1.0760 | 2.6376 | 14.6526 | 2.9965 | 9.3444 | 49.2212 |
| GRR_Lukman5 | 0.9304 | 1.9319 | 13.0337 | 1.0732 | 3.1636 | 42.8344 |
| GRR_Lukman6 | 1.0004 | 2.2707 | 10.5528 | 2.4440 | 7.1221 | 32.7479 |
| GRR_Lukman7 | **0.6580** | 1.2425 | 3.9162 | 1.4477 | 3.5208 | 11.7012 |
| GRR_Lukman8 | 1.0902 | 2.4790 | 9.7868 | 3.0087 | 8.4983 | 31.8784 |
| GRR_Lukman9 | 0.6815 | **0.8013** | **1.4840** | **1.0468** | **1.5884** | **2.8113** |
| GRR_Lukman10 | 0.8547 | 1.5684 | 5.5986 | 2.0808 | 4.7607 | 17.3599 |
| GRR_Lukman11 | 0.8371 | 1.4306 | 3.7911 | 1.5613 | 2.9232 | 9.0367 |

Table 9. Average MSE values when n = 300 and β₀ = -1.

| Method | p = 5, ρ = 0.85 | p = 5, ρ = 0.95 | p = 5, ρ = 0.99 | p = 10, ρ = 0.85 | p = 10, ρ = 0.95 | p = 10, ρ = 0.99 |
|---|---|---|---|---|---|---|
| MLE | 0.7154 | 2.2026 | 11.1116 | 2.0765 | 6.7260 | 34.4314 |
| GRR_Lukman1 | 0.6542 | 1.8230 | 7.5016 | 1.8686 | 5.6814 | 25.6538 |
| GRR_Lukman2 | 0.6581 | 1.2301 | 3.6960 | 0.9910 | 1.8927 | 5.6851 |
| GRR_Lukman3 | 0.6899 | 1.5461 | 5.4742 | 1.5366 | 3.6920 | 13.3017 |
| GRR_Lukman4 | 0.6666 | 1.9237 | 8.9121 | 1.8533 | 5.7939 | 28.3836 |
| GRR_Lukman5 | 0.7541 | 1.5646 | 7.6838 | 0.9905 | 1.7285 | 20.5645 |
| GRR_Lukman6 | 0.6562 | 1.6968 | 6.8684 | 1.5623 | 4.6231 | 21.4535 |
| GRR_Lukman7 | 0.6435 | 0.9295 | 2.6926 | 1.0122 | 2.2516 | 7.7297 |
| GRR_Lukman8 | 0.6760 | 1.8802 | 6.7105 | 1.8957 | 5.5830 | 22.1709 |
| GRR_Lukman9 | **0.5874** | **0.8249** | **1.1946** | **0.9870** | **1.4071** | **2.2339** |
| GRR_Lukman10 | 0.6863 | 1.2994 | 3.6031 | 1.3793 | 3.3887 | 12.2619 |
| GRR_Lukman11 | 0.6371 | 1.2430 | 2.6854 | 1.1468 | 2.3586 | 7.0557 |

DISCUSSION

The estimators were compared by Monte Carlo simulation of a logistic regression model (LRM) whose explanatory variables suffer from the multicollinearity problem; a minimal sketch of such an experiment is given below. The results are summarized in Tables 1-9, where the MSE criterion is reported as a function of several factors: the number of explanatory variables p, the sample size n, and the correlation ρ. The following conclusions were obtained.
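The paper does not reproduce its simulation code, so the following is a sketch of one standard way to set up such an experiment. The regressor-generation scheme, the unit-norm slope vector, and the replication count are our assumptions, and `logistic_mle_iwls` is the helper from the methods section; the same loop can be pointed at any of the GRR estimators instead of the MLE.

```python
import numpy as np

rng = np.random.default_rng(0)

def average_mse_mle(n=100, p=5, rho=0.95, beta0=0.0, reps=1000):
    """Monte Carlo estimate of the average MSE of the MLE slope vector."""
    beta = np.ones(p) / np.sqrt(p)                  # unit-norm slopes (assumption)
    total = 0.0
    for _ in range(reps):
        # correlated regressors: x_ij = sqrt(1 - rho^2) z_ij + rho z_i,p+1
        Z = rng.standard_normal((n, p + 1))
        X = np.sqrt(1 - rho ** 2) * Z[:, :p] + rho * Z[:, [p]]
        prob = 1 / (1 + np.exp(-(beta0 + X @ beta)))    # Eq. (1)
        y = (rng.random(n) < prob).astype(float)        # Bernoulli responses
        Xd = np.column_stack([np.ones(n), X])           # add intercept column
        b = logistic_mle_iwls(Xd, y)
        total += np.sum((b[1:] - beta) ** 2)
    return total / reps
```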

In Tables 1-9, the best (smallest) mean squared error (MSE) in each column is highlighted in bold. The GRR_Lukman9 estimator performs better than the competing estimators under almost all conditions, whereas the MLE exhibits the worst performance, since it is the estimator most affected by the multicollinearity problem. Furthermore, increasing the correlation coefficient raises the MSE of all estimators when n and p are held constant; this is particularly pronounced when the correlation coefficient ρ is 0.99. Similarly, increasing the number of explanatory variables p increases the MSE of all the estimators used.

Performance of ρ

As the correlation increases, the estimators are in general negatively affected, while the estimators K1, K2, ..., K11 proposed by [27] are only slightly affected. In fact, the mean squared error (MSE) values of these estimators sometimes decrease as the correlation increases. Thus, the usefulness of ridge logistic regression grows as the correlation increases. Generally, the best choice is the K9 estimator at correlations of 0.85, 0.95, and 0.99.

Performance of n

When the sample size n increases while the correlation ρ and the number of explanatory variables p remain constant, the MSE values decrease. This indicates that increasing the sample size has a positive effect on the performance of all estimators; in particular, the MSE of the GRR_Lukman9 estimator decreases relative to the other estimators. This suggests that sufficiently large sample sizes lead to stable estimation.

Performance of the number of explanatory variables (p)

Increasing the number of explanatory variables also changes the effective size of the model, making direct comparisons of MSE values difficult. Nevertheless, increasing the number of variables leads to an increase in the MSE values. Furthermore, the advantage of applying ridge logistic regression grows with the number of explanatory variables, because the MLE outperforms the ridge estimators less and less frequently: with 10 independent variables, ML estimation is never superior to the ridge estimators except in very large or weakly correlated samples.

CONCLUSION

The simulation results indicate that three important factors influence the performance of the estimators: the number of observations n, the number of explanatory variables p, and the correlation ρ between the variables. In most cases, the mean squared error decreases as n increases and grows as the other factors increase. The conclusion of this article is therefore that the maximum likelihood estimator should not be used when there is a high degree of correlation between the explanatory variables, because it leads to a high mean squared error. Ridge logistic regression should be preferred instead: all of the ridge parameter estimators provide some reduction in the mean squared error, and some are markedly better than others. The optimal choice is K9, at both low and high correlations; it substantially reduces the variance in all the different cases investigated in this article.