Formal validation of data warehouse complexity metrics using distance framework
Authors: Gargi Aggarwal, Sangeeta Sabharwal
Journal: International Journal of Intelligent Systems and Applications
Issue: No. 10, Vol. 9, 2017
Free access
A data warehouse is the cornerstone for organizations that base their strategic decisions on the large-scale processing of numerical data. The success of the organization depends on these decisions, so a high-quality data warehouse is extremely important. Conceptual models have been widely recognized as a key determinant of data warehouse quality during the early stages of design. Recently, hierarchy-based metrics have been proposed to quantify the complexity, and in turn the quality, of data warehouse conceptual models. Their authors formally corroborated the measures against Briand’s property-based framework to ensure their validity. However, Briand’s properties for software measures are a set of necessary but not sufficient measure axioms: they are useful for refuting software metrics but not for validating them. Thus, we focus on the theoretical validation of the data warehouse conceptual model metrics using the Distance framework, whose sufficiency is ensured by measurement theory. The results indicate that the metrics are valid measures of the complexity of data warehouse conceptual models. In addition, validation by the Distance framework assures that the metrics are on the ratio scale, which further aids data analysis.
Keywords: Distance framework, metrics, theoretical validation, data warehouse quality, multidimensional models
Short URL: https://sciup.org/15016426
IDR: 15016426 | DOI: 10.5815/ijisa.2017.10.06
Article text: Formal validation of data warehouse complexity metrics using distance framework
Published Online October 2017 in MECS
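$$\mathrm{abs}_{\mathrm{NMH}} : \mathrm{UDWCM} \to P(\mathrm{UMH}) : \mathrm{DWCM} \to \mathrm{MH}(\mathrm{DWCM}) \tag{1}$$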
This abstraction function determines the extent to which a DWCM is characterized by the number of multiple hierarchies. By comparing such abstractions we can deduce whether a conceptual model is more, equally or less characterized by the number of multiple hierarchies. For better understanding, an example has been taken in which we have the set of multiple hierarchies of DWCM A and DWCM B (Fig. 2):
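$$\text{DWCM A}: \ \mathrm{abstraction}(\mathrm{NMH}) = \mathrm{MH}(\text{DWCM A}) = \{\text{Store\_hierarchy}, \text{Time\_hierarchy}\} \tag{2}$$
$$\text{DWCM B}: \ \mathrm{abstraction}(\mathrm{NMH}) = \mathrm{MH}(\text{DWCM B}) = \{\text{Store\_hierarchy}\} \tag{3}$$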
Step 2. In this step we model the distances between the elements of the measurement abstractions. It is essential to determine a set of elementary transformations for P(UMH) such that any set of multiple hierarchies in P(UMH) can be transformed into any other by a finite sequence of such transformations. Since the elements of P(UMH) are sets of multiple hierarchies, the set Trans needs only two types of elementary transformation: one for the inclusion of a multiple hierarchy in a set and the other for the removal of a multiple hierarchy from a set. The shortest sequences of elementary transformations between two sets are taken as the models of distance. Thus, Trans = {t1, t2}, where t1 and t2 are defined as follows:
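$$\forall\, \mathit{mh} \in P(\mathrm{UMH}): \ t_1(\mathit{mh}) = \mathit{mh} \cup \{m\}, \quad \text{where } m \in \mathrm{UMH} \tag{4}$$
$$\forall\, \mathit{mh} \in P(\mathrm{UMH}): \ t_2(\mathit{mh}) = \mathit{mh} - \{m\}, \quad \text{where } m \in \mathrm{UMH} \tag{5}$$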
where mh is an element of P(UMH), i.e. a set of multiple hierarchies, which can be transformed into any other element of P(UMH) by adding and removing the corresponding hierarchies.
In our example, the distance between abstraction(NMH) for DWCM A and DWCM B is determined by a sequence of transformations from the set Trans. MH(DWCM A) can be transformed into MH(DWCM B) by a single elementary transformation, namely the removal of Time_hierarchy from MH(DWCM A). Other sequences of elementary transformations also make the two sets equal, but we consider this one because it is the shortest.
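The two elementary transformations and the shortest sequence between two sets can be sketched in Python as follows. This is a minimal illustration, not part of the original framework: the function names `t1`, `t2`, `shortest_sequence` and `apply_sequence` are assumptions chosen for this sketch, and the sets are the example abstractions quoted in (2) and (3).

```python
from typing import FrozenSet, List, Tuple

Hierarchies = FrozenSet[str]
Step = Tuple[str, str]  # (transformation name, hierarchy it acts on)


def t1(mh: Hierarchies, m: str) -> Hierarchies:
    """Elementary transformation t1: include hierarchy m in the set."""
    return mh | {m}


def t2(mh: Hierarchies, m: str) -> Hierarchies:
    """Elementary transformation t2: remove hierarchy m from the set."""
    return mh - {m}


def shortest_sequence(src: Hierarchies, dst: Hierarchies) -> List[Step]:
    """One shortest sequence of elementary transformations from src to dst.

    Every element in exactly one of the two sets needs exactly one step,
    so no sequence can be shorter than this one.
    """
    removals = [("t2", m) for m in sorted(src - dst)]
    additions = [("t1", m) for m in sorted(dst - src)]
    return removals + additions


def apply_sequence(src: Hierarchies, steps: List[Step]) -> Hierarchies:
    """Apply a sequence of (name, hierarchy) steps to src."""
    current = src
    for name, m in steps:
        current = t1(current, m) if name == "t1" else t2(current, m)
    return current


mh_a = frozenset({"Store_hierarchy", "Time_hierarchy"})
mh_b = frozenset({"Store_hierarchy"})

steps = shortest_sequence(mh_a, mh_b)
print(steps)  # [('t2', 'Time_hierarchy')]
```

Applying the returned steps to MH(DWCM A) yields MH(DWCM B), and the length of the list is the number of transformations used.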
Step 3. In this step, the distance between two sets of multiple hierarchies in P(UMH) is quantified. This distance is the length of the shortest sequence of elementary transformations that makes the two sets equal. Given two sets mh, mh’ ∊ P(UMH), each element contained in either mh or mh’ but not in both requires exactly one transformation. Thus, the distance between the sets equals the cardinality of the symmetric difference of mh and mh’:
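$$\forall\, \mathit{mh}, \mathit{mh}' \in P(\mathrm{UMH}): \ \delta_{\mathrm{NMH}}(\mathit{mh}, \mathit{mh}') = |\mathit{mh} - \mathit{mh}'| + |\mathit{mh}' - \mathit{mh}| \tag{6}$$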
For our example, the symmetric difference model gives a distance of 1 between the sets of multiple hierarchies of DWCM A and DWCM B. Formally,
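$$\begin{aligned}
&\delta_{\mathrm{NMH}}(\mathrm{abstraction}_{\mathrm{NMH}}(\text{DWCM A}), \mathrm{abstraction}_{\mathrm{NMH}}(\text{DWCM B})) \\
&\quad = |\{\text{Store\_hierarchy}, \text{Time\_hierarchy}\} - \{\text{Store\_hierarchy}\}| + |\{\text{Store\_hierarchy}\} - \{\text{Store\_hierarchy}, \text{Time\_hierarchy}\}| \\
&\quad = |\{\text{Time\_hierarchy}\}| + |\{\}| = 1
\end{aligned} \tag{7}$$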
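The symmetric difference distance can be checked with a few lines of Python. This is a sketch only; the function name `distance` is an assumption made for illustration, and the two sets are the example abstractions of DWCM A and DWCM B given in the text.

```python
from typing import FrozenSet


def distance(mh: FrozenSet[str], mh_prime: FrozenSet[str]) -> int:
    """Cardinality of the symmetric difference: |mh - mh'| + |mh' - mh|."""
    return len(mh - mh_prime) + len(mh_prime - mh)


mh_a = frozenset({"Store_hierarchy", "Time_hierarchy"})
mh_b = frozenset({"Store_hierarchy"})

print(distance(mh_a, mh_b))  # 1
# Python's built-in symmetric difference operator gives the same count.
print(len(mh_a ^ mh_b))      # 1
```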
Step 4. In this step, a reference abstraction is identified for the attribute of interest. In this study, the empty set of multiple hierarchies forms the reference abstraction, since a DWCM has the lowest value of the NMH metric when it has no multiple hierarchies. Thus, we can define the following function:
$$\mathrm{Ref}_{\mathrm{NMH}} : \mathrm{UDWCM} \to P(\mathrm{UMH}) : \mathrm{DWCM} \to \emptyset \tag{8}$$
Step 5. The software measure is defined in this step. The number of multiple hierarchies of a DWCM is given by the distance between its set of multiple hierarchies, MH(DWCM), and the empty set, that is, by the length of the shortest sequence of elementary transformations between MH(DWCM) and Ø. Thus, the NMH metric can be perceived as a function that returns, for any DWCM ∊ UDWCM, the value of the measure $\delta_{\mathrm{NMH}}$ for the sets MH(DWCM) and Ø:
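$$\begin{aligned}
\forall\, \mathrm{DWCM} \in \mathrm{UDWCM}: \ \mathrm{NMH}(\mathrm{DWCM}) &= \delta_{\mathrm{NMH}}(\mathrm{MH}(\mathrm{DWCM}), \emptyset) \\
&= |\mathrm{MH}(\mathrm{DWCM}) - \emptyset| + |\emptyset - \mathrm{MH}(\mathrm{DWCM})| \\
&= |\mathrm{MH}(\mathrm{DWCM})|
\end{aligned} \tag{9}$$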
This proves that the NMH metric is theoretically valid.
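As a quick check, and assuming that the two models contain exactly the multiple hierarchies listed in (2) and (3), applying (9) to the running example gives:

```latex
\begin{align*}
\mathrm{NMH}(\text{DWCM A}) &= |\{\text{Store\_hierarchy}, \text{Time\_hierarchy}\}| = 2 \\
\mathrm{NMH}(\text{DWCM B}) &= |\{\text{Store\_hierarchy}\}| = 1
\end{align*}
```

Because MH(DWCM B) is a subset of MH(DWCM A) here, the difference of the two values, 2 - 1 = 1, coincides with the distance obtained in (7). This need not hold for sets that are not nested.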
B. Validation of the NAPMH Metric
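Step 1. The measurement abstraction for the attribute of interest, i.e. the number of alternate paths in multiple hierarchies (NAPMH), can be defined as:
$$\mathrm{abs}_{\mathrm{NAPMH}} : \mathrm{UDWCM} \to P(\mathrm{UAPMH}) : \mathrm{DWCM} \to \mathrm{SAPMH}(\mathrm{DWCM}) \tag{10}$$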
This abstraction function ascertains to what extent a DWCM is characterized by the set of alternate paths in multiple hierarchies. Again as an example we will consider the set of alternate paths in multiple hierarchies of DWCM A and DWCM B (Fig. 2):
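$$\text{DWCM A}: \ \mathrm{abstraction}(\mathrm{NAPMH}) = \mathrm{SAPMH}(\text{DWCM A}) = \{\text{City\_Province\_State}, \text{Sales\_Area\_State}, \text{Week\_Year}, \text{Month\_Season\_Year}\} \tag{11}$$
$$\text{DWCM B}: \ \mathrm{abstraction}(\mathrm{NAPMH}) = \mathrm{SAPMH}(\text{DWCM B}) = \{\text{City\_Province\_State}, \text{Sales\_Area\_State}\} \tag{12}$$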
Step 2. The definition of the set Trans of elementary transformation types on P(UAPMH) that is both constructively and inverse constructively complete is:
Trans = {t1, t2}, where
$$\forall\, \mathit{apmh} \in P(\mathrm{UAPMH}): \ t_1(\mathit{apmh}) = \mathit{apmh} \cup \{m\}, \quad \text{where } m \in \mathrm{UAPMH} \tag{13}$$
$$\forall\, \mathit{apmh} \in P(\mathrm{UAPMH}): \ t_2(\mathit{apmh}) = \mathit{apmh} - \{m\}, \quad \text{where } m \in \mathrm{UAPMH} \tag{14}$$
where apmh is an element of P(UAPMH), i.e. a set of alternate paths, which can be transformed into any other element of P(UAPMH) by adding and removing the corresponding alternate paths.
In the example that we have considered, the shortest sequence of elementary transformations that can determine the distance between the abstraction(NAPMH) for DWCM A and DWCM B is the removal of the paths Week_Year and Month_Season_Year from SAPMH(DWCM A).
Step 3. This step determines the metric space (P(UAPMH),ẟ). The distance between any two sets of alternate paths in multiple hierarchies in P(UAPMH) can be quantified by the smallest series of elementary transformations that makes both the sets equal.
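$$\forall\, \mathit{apmh}, \mathit{apmh}' \in P(\mathrm{UAPMH}): \ \delta_{\mathrm{NAPMH}}(\mathit{apmh}, \mathit{apmh}') = |\mathit{apmh} - \mathit{apmh}'| + |\mathit{apmh}' - \mathit{apmh}| \tag{15}$$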
For our example, the symmetric difference model gives a distance of 2 between the sets of alternate paths in multiple hierarchies of DWCM A and DWCM B. Formally,
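$$\begin{aligned}
&\delta_{\mathrm{NAPMH}}(\mathrm{abstraction}_{\mathrm{NAPMH}}(\text{DWCM A}), \mathrm{abstraction}_{\mathrm{NAPMH}}(\text{DWCM B})) \\
&\quad = |\{\text{City\_Province\_State}, \text{Sales\_Area\_State}, \text{Week\_Year}, \text{Month\_Season\_Year}\} - \{\text{City\_Province\_State}, \text{Sales\_Area\_State}\}| \\
&\qquad + |\{\text{City\_Province\_State}, \text{Sales\_Area\_State}\} - \{\text{City\_Province\_State}, \text{Sales\_Area\_State}, \text{Week\_Year}, \text{Month\_Season\_Year}\}| \\
&\quad = |\{\text{Week\_Year}, \text{Month\_Season\_Year}\}| + |\{\}| = 2
\end{aligned} \tag{16}$$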
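That the symmetric difference count behaves as a metric on a power set can be checked by brute force on a small universe. The sketch below is illustrative only: it assumes, purely for the sake of the check, that the universe of alternate paths consists of just the four paths named in the example, and the helper names `powerset`, `distance` and `is_metric` are hypothetical.

```python
from itertools import combinations, product
from typing import FrozenSet, Iterable, List


def powerset(universe: Iterable[str]) -> List[FrozenSet[str]]:
    """All subsets of the universe, as frozensets."""
    items = sorted(universe)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]


def distance(a: FrozenSet[str], b: FrozenSet[str]) -> int:
    """Cardinality of the symmetric difference of a and b."""
    return len(a - b) + len(b - a)


def is_metric(universe: Iterable[str]) -> bool:
    """Exhaustively check the metric axioms on the power set."""
    sets = powerset(universe)
    for a, b in product(sets, repeat=2):
        d = distance(a, b)
        if d < 0:                      # non-negativity
            return False
        if (d == 0) != (a == b):       # identity of indiscernibles
            return False
        if d != distance(b, a):        # symmetry
            return False
    for a, b, c in product(sets, repeat=3):
        if distance(a, c) > distance(a, b) + distance(b, c):
            return False               # triangle inequality violated
    return True


# Assumption: the universe holds only the four paths from the example.
universe = {"City_Province_State", "Sales_Area_State",
            "Week_Year", "Month_Season_Year"}
print(is_metric(universe))  # True
```

An exhaustive check of this kind covers only the chosen universe; the general result follows from the properties of the symmetric difference rather than from enumeration.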
Step 4. This step determines the reference abstraction RefNAPMH ∊ P(UAPMH) for the number of alternate paths in multiple hierarchies. There exist conceptual models with no multiple hierarchies and hence no alternate paths in multiple hierarchies. Thus, RefNAPMH is the empty set.
Step 5. The software measure is defined in this step. The number of alternate paths in multiple hierarchies of a DWCM is given by the distance between its set of alternate paths in multiple hierarchies, SAPMH(DWCM), and the empty set, that is, by the length of the shortest sequence of elementary transformations between SAPMH(DWCM) and Ø. Thus, the NAPMH metric can be perceived as a function that returns, for any DWCM ∊ UDWCM, the value of the measure $\delta_{\mathrm{NAPMH}}$ for the sets SAPMH(DWCM) and Ø:
$$\begin{aligned}
\forall\, \mathrm{DWCM} \in \mathrm{UDWCM}: \ \mathrm{NAPMH}(\mathrm{DWCM}) &= \delta_{\mathrm{NAPMH}}(\mathrm{SAPMH}(\mathrm{DWCM}), \emptyset) \\
&= |\mathrm{SAPMH}(\mathrm{DWCM}) - \emptyset| + |\emptyset - \mathrm{SAPMH}(\mathrm{DWCM})| \\
&= |\mathrm{SAPMH}(\mathrm{DWCM})|
\end{aligned} \tag{17}$$
This proves that the NAPMH metric is theoretically valid.
C. Validation of the NDSH Metric
Step 1. The measurement abstraction for the attribute of interest i.e. the number of dimensions participating in shared hierarchies (NDSH) can be defined as:
$$\mathrm{abs}_{\mathrm{NDSH}} : \mathrm{UDWCM} \to P(\mathrm{UDSH}) : \mathrm{DWCM} \to \mathrm{SDSH}(\mathrm{DWCM}) \tag{18}$$
As an example we will consider the set of dimensions participating in shared hierarchies of DWCM A and DWCM B (Fig. 2):
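$$\text{DWCM A}: \ \mathrm{abstraction}(\mathrm{NDSH}) = \mathrm{SDSH}(\text{DWCM A}) = \{\text{Product}, \text{Store}, \text{Customer}\} \tag{19}$$
$$\text{DWCM B}: \ \mathrm{abstraction}(\mathrm{NDSH}) = \mathrm{SDSH}(\text{DWCM B}) = \{\text{Store}, \text{Customer}\} \tag{20}$$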
Step 2. The definition of the set Trans of elementary transformation types on P(UDSH) that is both constructively and inverse constructively complete is:
Trans = {t1, t2}, where
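$$\forall\, \mathit{dsh} \in P(\mathrm{UDSH}): \ t_1(\mathit{dsh}) = \mathit{dsh} \cup \{m\}, \quad \text{where } m \in \mathrm{UDSH} \tag{21}$$
$$\forall\, \mathit{dsh} \in P(\mathrm{UDSH}): \ t_2(\mathit{dsh}) = \mathit{dsh} - \{m\}, \quad \text{where } m \in \mathrm{UDSH} \tag{22}$$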
where dsh is an element of P(UDSH), i.e. a set of dimensions participating in shared hierarchies, which can be transformed into any other element of P(UDSH) by adding and removing the corresponding dimensions.
In the example that we have considered, the shortest sequence of elementary transformations that can determine the distance between the abstraction(NDSH) for DWCM A and DWCM B is the removal of the Product dimension from SDSH(DWCM A).
Step 3. This step determines the metric space (P(UDSH), ẟ). The distance between any two sets of dimensions participating in shared hierarchies in P(UDSH) can be quantified by the smallest series of elementary transformations that makes both the sets equal.
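$$\forall\, \mathit{dsh}, \mathit{dsh}' \in P(\mathrm{UDSH}): \ \delta_{\mathrm{NDSH}}(\mathit{dsh}, \mathit{dsh}') = |\mathit{dsh} - \mathit{dsh}'| + |\mathit{dsh}' - \mathit{dsh}| \tag{23}$$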
For our example, the symmetric difference model gives a distance of 1 between the sets of dimensions participating in shared hierarchies of DWCM A and DWCM B. Formally,
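$$\begin{aligned}
&\delta_{\mathrm{NDSH}}(\mathrm{abstraction}_{\mathrm{NDSH}}(\text{DWCM A}), \mathrm{abstraction}_{\mathrm{NDSH}}(\text{DWCM B})) \\
&\quad = |\{\text{Product}, \text{Store}, \text{Customer}\} - \{\text{Store}, \text{Customer}\}| + |\{\text{Store}, \text{Customer}\} - \{\text{Product}, \text{Store}, \text{Customer}\}| \\
&\quad = |\{\text{Product}\}| + |\{\}| = 1
\end{aligned} \tag{24}$$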
Step 4. This step determines the reference abstraction RefNDSH ∊ P(UDSH) for the number of dimensions participating in shared hierarchies. There exist conceptual models with no shared hierarchies and hence no dimensions participating in shared hierarchies. Thus, RefNDSH is the empty set.
Step 5. The software measure is defined in this step. The number of dimensions participating in shared hierarchies of a DWCM is given by the distance between its set of dimensions participating in shared hierarchies, SDSH(DWCM), and the empty set, that is, by the length of the shortest sequence of elementary transformations between SDSH(DWCM) and Ø. Thus, the NDSH metric can be perceived as a function that returns, for any DWCM ∊ UDWCM, the value of the measure $\delta_{\mathrm{NDSH}}$ for the sets SDSH(DWCM) and Ø:
$$\begin{aligned}
\forall\, \mathrm{DWCM} \in \mathrm{UDWCM}: \ \mathrm{NDSH}(\mathrm{DWCM}) &= \delta_{\mathrm{NDSH}}(\mathrm{SDSH}(\mathrm{DWCM}), \emptyset) \\
&= |\mathrm{SDSH}(\mathrm{DWCM}) - \emptyset| + |\emptyset - \mathrm{SDSH}(\mathrm{DWCM})| \\
&= |\mathrm{SDSH}(\mathrm{DWCM})|
\end{aligned} \tag{25}$$
This proves that the NDSH metric is theoretically valid.
The measure construction and theoretical validation process of NSH and NLDH is analogous to that of the NMH, NAPMH and NDSH metrics and is summarized in Table 3. Since the Distance framework has been used to define the measures, they can all be described as distances. This guarantees that they are all characterised by the ratio scale and hence are formally sound software measures.
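To illustrate that all of these measures are distances of the same form, the following Python sketch applies one generic function to the example abstractions of DWCM A and DWCM B quoted in the text. The names `distance`, `measure` and `models`, and the dictionary layout, are assumptions made for this illustration rather than part of the framework.

```python
from typing import Dict, FrozenSet

EMPTY: FrozenSet[str] = frozenset()  # reference abstraction (empty set)


def distance(a: FrozenSet[str], b: FrozenSet[str]) -> int:
    """Cardinality of the symmetric difference of a and b."""
    return len(a - b) + len(b - a)


def measure(abstraction: FrozenSet[str]) -> int:
    """Metric value: distance from the abstraction to the empty set."""
    return distance(abstraction, EMPTY)


# Example abstractions taken from the text for DWCM A and DWCM B.
models: Dict[str, Dict[str, FrozenSet[str]]] = {
    "DWCM A": {
        "NMH": frozenset({"Store_hierarchy", "Time_hierarchy"}),
        "NAPMH": frozenset({"City_Province_State", "Sales_Area_State",
                            "Week_Year", "Month_Season_Year"}),
        "NDSH": frozenset({"Product", "Store", "Customer"}),
    },
    "DWCM B": {
        "NMH": frozenset({"Store_hierarchy"}),
        "NAPMH": frozenset({"City_Province_State", "Sales_Area_State"}),
        "NDSH": frozenset({"Store", "Customer"}),
    },
}

values = {
    model: {metric: measure(s) for metric, s in abstractions.items()}
    for model, abstractions in models.items()
}

print(values["DWCM A"])  # {'NMH': 2, 'NAPMH': 4, 'NDSH': 3}
print(values["DWCM B"])  # {'NMH': 1, 'NAPMH': 2, 'NDSH': 2}
```

On a ratio scale, statements about ratios are meaningful; in this example NAPMH(DWCM A) / NAPMH(DWCM B) = 2, so DWCM A has twice as many alternate paths in multiple hierarchies as DWCM B.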
VI. Conclusion
In this paper, we have used the measurement-theory-based Distance framework to formally validate the data warehouse hierarchy metrics. These metrics had previously been validated using Briand’s property-based framework. However, that framework offers a set of properties that are necessary but not sufficient to validate software metrics. We have therefore employed the Distance framework, which offers a set of necessary and sufficient measure axioms. Measures validated using this framework are above the ordinal scale, so a wide range of data analysis techniques can be used to analyse them. All five hierarchy measures (NMH, NAPMH, NLDH, NSH and NDSH) have been successfully corroborated using the Distance framework. Thus, they are valid measures of data warehouse conceptual model complexity.
Table 3. Abstraction functions for the remaining hierarchy metrics
| Metric | Abstraction function |
|---|---|
| NSH | $\mathrm{abs}_{\mathrm{NSH}} : \mathrm{UDWCM} \to P(\mathrm{USH}) : \mathrm{DWCM} \to \mathrm{SSH}(\mathrm{DWCM})$ (26), where UDWCM is the universe of data warehouse conceptual models, USH is the universe of shared hierarchies relevant to a UoD, and SSH(DWCM) ⊆ USH is the set of shared hierarchies in a DWCM. |
| NLDH | The NLDH metric is represented at the class level as $\mathrm{abs}_{\mathrm{NLDH}} : \mathrm{UC} \to P(\mathrm{UC}) : C \to \mathrm{LongestPath}(C)$ (27), where UC is the universe of classes and LongestPath(C) ⊆ UC is the set of classes that are part of the dimension hierarchy. When multiple hierarchies are considered, only the classes in the longest hierarchy are taken into consideration. The NLDH metric of a DWCM is the largest value of NLDH computed over all the classes present in the DWCM. |
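The two-level definition of NLDH in Table 3 (a class-level distance followed by a maximum over classes) can be sketched as follows. The class names and the contents of each longest path below are hypothetical values invented for this sketch, not data from the paper, and whether a dimension class itself belongs to LongestPath(C) is left as an assumption of the input.

```python
from typing import Dict, FrozenSet

EMPTY: FrozenSet[str] = frozenset()


def distance(a: FrozenSet[str], b: FrozenSet[str]) -> int:
    """Cardinality of the symmetric difference of a and b."""
    return len(a - b) + len(b - a)


def class_level_nldh(longest_path: FrozenSet[str]) -> int:
    """Class-level value: distance from LongestPath(C) to the empty set."""
    return distance(longest_path, EMPTY)


def nldh(longest_paths: Dict[str, FrozenSet[str]]) -> int:
    """Model-level NLDH: largest class-level value (0 if no classes)."""
    return max((class_level_nldh(p) for p in longest_paths.values()),
               default=0)


# Hypothetical LongestPath(C) sets for three dimension classes.
longest_paths = {
    "Store": frozenset({"City", "Province", "State"}),
    "Time": frozenset({"Month", "Season", "Year"}),
    "Product": frozenset({"Category"}),
}

print(nldh(longest_paths))  # 3
```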