DATA REFINEMENT APPROACH FOR ANSWERING THE WHY-NOT PROBLEM OVER K-MOST PROMISING PRODUCT (K-MPP) QUERIES

K-Most Promising Product (K-MPP) is a product selection strategy used in the process of determining the products most demanded by consumers. The basic computations used to perform K-MPP are two types of skyline queries: the dynamic skyline and the reverse skyline. K-MPP selection is done on the application layer, the last layer of the OSI model, one of whose functions is providing services according to the user's preferences. In the K-MPP implementation, there exist situations in which a manufacturer may be dissatisfied with the query results generated by the database search process (the why-not question), and wants to know why the database gives query results that do not match their expectations. For example, manufacturers want to know why a particular data point (unexpected data) appears in the query result set, and why an expected product does not appear in the result. A further problem is that traditional database systems cannot provide the data analysis and solutions needed to answer such why-not questions. To improve the usability of the database system, this study aims to answer the why-not question on K-MPP and to provide data refinement solutions that take user feedback into account, so users can find out why the result set does not meet their expectations. Moreover, it may help users understand the result by providing analysis information and data refinement suggestions.


I. INTRODUCTION
The development of information and communication technology has resulted in the emergence of various information digitization processes in recent decades. By processing data on a computer system using a certain algorithm, new knowledge and information that were previously unrealized can emerge. For example, the sales data of a company can be evaluated by performing computation and analysis processes automatically over a database system. In that way, the company may gain insight into marketing strategies that can be used to sell its products.
Based on the example above, the database system now plays an important role in data processing and evaluation. With the continuous development of architecture and evaluation techniques in database systems, the execution and processing of queries can now be provided in real time without being constrained by the amount of data that needs to be evaluated. Generally, a query expresses a question or an information need of the user.
Research on database systems mainly discusses the efficiency of query execution and resource sharing in order to provide the best system capabilities. However, most end users do not have knowledge of database system internals. Therefore, a problem arises when a query result evaluation is required on the database system but only certain users can perform that task [1].
To improve the usability of the database system, it is important to understand the user's expectation of an interactive and informative database system. If the query result is not the desired one, the user should be able to perform further evaluation without any knowledge of the database system. Moreover, the database system should provide brief and informative explanations so that users are able to understand and evaluate the problems that caused the query result. Given such an explanation, the user may easily evaluate and determine a refinement of the query so that the search results provided by the database system are in accordance with the expected result. This also improves search efficiency for the user and saves database resources, since the user does not have to perform multiple searches until the desired result appears.
In [2], Islam & Liu formulate the K-Most Promising Product (K-MPP) framework as a product selection strategy used in the search for the products most demanded by customers. The basic computations used to perform K-MPP calculations are two types of skyline queries, namely the dynamic skyline and the reverse skyline. The skyline operator was first proposed in [3]. In query processing with the skyline operator, the non-dominated data points are collected using one of three selectable functions: MIN (minimum), MAX (maximum), and DIFF (different).
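The dominance test underlying these operators can be sketched as follows. This is a generic illustration, not the authors' implementation, and it assumes MIN semantics (smaller values preferred); the dynamic skyline is shown as the skyline of coordinate-wise distances to a query point:

```python
def dominates(a, b):
    """True if a dominates b under MIN semantics: a is no worse in every
    dimension and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(points):
    """Return the points not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

def dynamic_skyline(query, points):
    """Dynamic skyline w.r.t. a query point: skyline of |p_i - q_i| vectors."""
    transformed = {p: tuple(abs(x - y) for x, y in zip(p, query)) for p in points}
    sky = skyline(list(transformed.values()))
    return [p for p, t in transformed.items() if t in sky]

print(skyline([(1, 9), (3, 3), (9, 1), (5, 5)]))  # (5, 5) is dominated by (3, 3)
```

The reverse skyline of a product q then consists of the customers that have q in their dynamic skyline; that outer loop is omitted here for brevity.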
In the implementation of K-MPP, there exist situations in which manufacturers may be dissatisfied with the query result generated in the search process, so they want to know why the database system provides query results that do not match their expectations. For example, a producer wants to know why an unexpected data point appears in the query result set (hereinafter called the why point), and why the expected product does not appear as a query result (hereinafter called the why-not point). Another problem is that traditional database systems do not provide data analysis and solution facilities to answer the why-not questions submitted by the user, as illustrated above.
Based on the above problems, Liu et al. [4] identify causality, i.e. the cause of expected and unexpected data in the query result, and responsibility, i.e. the degree of influence of the expected or unexpected data on the probability of the reverse skyline query. As a further development, that research also implemented the identification of causality and responsibility on reverse skyline queries, evaluating the effectiveness and efficiency of the identification process. However, to solve the why-not question, further steps such as data or query modification are required so that the expected data can appear in the query result, as discussed in [5], [6], and [7].
This research analyzes why the why-not point does not appear in the K-MPP result and answers the why-not question arising from a K-MPP query result by modifying the data values in the query (data refinement). The data modification is also expected to incur the least possible cost of change.

II. RELATED WORK
Several methods have been proposed to answer why-not questions on query results. [8] introduces a method that can identify the responsible data points that eliminate the user's desired tuples on Select-Project-Join (SPJ) queries, while [9] resolves the why-not question on Select-Project-Join-Union-Aggregation (SPJUA) queries. In [10], [11], [12], and [13], data modifications are provided so that the missing tuples can appear in the query result. [10] and [11] answer the why-not problem on SPJ queries; on the other hand, [12] and [13] focus on SPJUA queries. Query refinement methods can also be applied to revise query results in top-k queries, as in [14] and [15], and in reverse skyline queries [6].
Islam [5] proposed a framework named FlexIQ to answer the why-not and why questions on SPJ query results. By taking user input as feedback, a new query is determined which includes the why-not point and eliminates the unexpected why point that appears in the query result. For the efficiency evaluation, the paper proposes two different query determination methods, namely the baseline algorithm (TBA) and the trade-off algorithm (TOA).
Solutions for answering the why-not question on reverse skyline queries are discussed in [6]. The proposed solution consists of three parts: identification of the data points that cause the expected data not to appear in the reverse skyline result, data point modification or query modification to make the expected data appear in the result, and combined modification of data points and query. In that research, the evaluation was performed on datasets with two attribute values (2-dimensional data), and its purpose was to compare the effectiveness and performance of the three proposed modifications across data cardinalities.
Liu et al. in [7] discuss solutions to answer the why-not question on reverse top-k queries. The proposed solution is similar to that of [6], i.e. a combination of three different modifications: query modification, point weight modification, and k value modification. The study used five different dimensionality settings in its evaluation to assess performance across data dimensions; the effectiveness and performance of the proposed methods were also evaluated over various data cardinalities.
III. RESEARCH PROBLEM
K-MPP requires two evaluation models on the customer and product datasets. The first model is product selection, which searches for the skyline result of a product using the reverse skyline query; its result is the set of customers interested in each product. The second model is product adoption, which performs dynamic skyline computation over the dataset so that the set of products preferred by each customer can be obtained. Finally, the most promising products are determined by taking the best k products ranked by overall market contribution. The market contribution (MC) of a product is the sum of the probability values (obtained from the product adoption model) of all customers who are members of that product's reverse skyline. If the K-MPP query results are not the expected ones, no solution currently exists to answer the why-not question in K-MPP, although it is especially needed by manufacturers and database system users to evaluate why this problem occurs. Assuming the user's area of expertise differs from the database system designer's, an additional informative solution is needed to keep the database system usable.
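The selection step described above can be illustrated with a small sketch. Here `rsl_of` and `prob` are hypothetical inputs standing in for the results of the reverse-skyline and product-adoption computations; this is not the authors' implementation:

```python
def market_contribution(rsl, prob):
    """MC of a product: sum of adoption probabilities of its RSL members."""
    return sum(prob[c] for c in rsl)

def k_mpp(products, rsl_of, prob, k):
    """Return the k products with the highest market contribution."""
    ranked = sorted(products, key=lambda p: market_contribution(rsl_of[p], prob),
                    reverse=True)
    return ranked[:k]

# Toy data (assumption): three products, three customers.
rsl_of = {"p1": {"c1", "c2"}, "p2": {"c3"}, "p3": {"c1", "c3"}}
prob = {"c1": 0.5, "c2": 0.4, "c3": 0.3}
print(k_mpp(["p1", "p2", "p3"], rsl_of, prob, 2))  # MC: p1=0.9, p3=0.8, p2=0.3
```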
Based on this brief explanation of K-MPP and the why-not question problem in database systems, this study answers the why-not question that appears in K-MPP, which has not been discussed in previous research. To answer the why-not question on a K-MPP query result, as the second contribution, we propose a data refinement approach that lists the best query modification solutions with minimum modification cost. For the proposed list of data refinements, a validation task then checks whether each refinement is able to answer the why-not question on K-MPP or not. With the contributions proposed in this research, it is expected that the data refinement approach can provide an alternative solution to answer the why-not question so that the expected data may join the K-MPP result.

IV. PROPOSED METHOD
Answering why-not K-MPP consists of several stages, which can be seen in Figure 1. The main stages in answering the K-MPP why-not question are: identifying the k rank of the products, increasing the market contribution value by modifying the query value of the why-not point (data refinement), and validation. Before modifying the query, it is necessary to identify why the why-not point does not appear as a member of K-MPP. After the cause is found, the query modification or data refinement can be done by evaluating the list of data points that appear as members of K-MPP. The query modification process then generates possible combinations of data refinements, each changing one dimension of the why-not point's data.
Having obtained a list of possible data refinement combinations, a validation process is performed to ensure the correctness of the provided data refinements. Validation checks whether a provided data refinement solution makes the why-not point become one of the K-MPP members.

A. Determining The Why-not Point
In this research, the why-not question is illustrated as a situation where the user is dissatisfied with the results of the K-MPP query because the preferred product is not among the top k promising products. The product to be evaluated for not appearing in a K-MPP result is referred to as the why-not point. The why-not point is therefore the user feedback that will be evaluated in the next stage.

B. Identifying Rank of Why-not Point (nK-MPP)
Identifying the value of the why-not point is the first step in determining the rank of a product among all products contained in the dataset. Since K-MPP only displays the k products with the best market contribution values, this stage evaluates the market contribution value of the why-not point relative to all products. With this step, the cause of a why-not question can begin to be answered by providing the first informative solution: the ranking information of the why-not point.
The rank of the market contribution value of the why-not point q among all products, denoted k′, can be determined by increasing k until the k-th ranked market contribution value equals the market contribution value of the query point q. Therefore, we define k′ = rank(MC(q)). The value k′ and the query result of k′-MPP will be used in the next evaluation stage.
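The rank computation can be sketched as follows; `mc` is a hypothetical mapping from products to their market contribution values, with illustrative numbers:

```python
def rank_of(why_not, mc):
    """1-based rank of the why-not product under descending market contribution."""
    return 1 + sum(1 for v in mc.values() if v > mc[why_not])

# Toy MC values (assumption): p5 is outranked by p1 and p11 only.
mc = {"p1": 0.9, "p11": 0.8, "p5": 0.7, "p2": 0.6}
print(rank_of("p5", mc))  # 3, i.e. k' = 3
```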
Example 1. Based on the dataset in Table I, the market contribution values in Table II are obtained after computing the DSL, RSL, and probability values. The two best-rated products, which form the 2-MPP result, are p1 and p11; only these products are shown in the 2-MPP query result. If the manufacturer of product p5 gets this result, the question arises why their product does not appear in the 2-MPP result. Therefore, as the first informative solution, the rank of p5 is checked and taken as the value of k′. Based on Table II, k′ = 3. After k′ is identified, the query result of 3-MPP and its MC values are also reported.

C. Identifying the Cause
The higher the market contribution value, the greater the chance of a product appearing in the K-MPP result. Since the market contribution value is the sum of the probability values of the RSL members, the cause of a why-not question can be identified by evaluating the RSL members of the k-MPP and k′-MPP results.
Example 2. From Table II, it can be seen that p5 does not appear in the 2-MPP result because the number of RSL members of p5 is less than that of p1 and p11, which have the minimum number of RSL members among all 2-MPP members.
Example 3. p13 has a number of RSL members equal to that of p5 and p8, but does not appear in the 3-MPP result because c10, which has the best probability value in the 3-MPP result, is not a member of RSL(p13).
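The two causes illustrated by Examples 2 and 3 can be classified with a small sketch; the RSL sets below are hypothetical, and the MVC case is reported whenever the RSL-count case does not apply:

```python
def why_not_cause(rsl_q, kmpp_rsls):
    """Classify the cause of a why-not point.

    rsl_q:      RSL members of the why-not product.
    kmpp_rsls:  list of RSL member sets, one per K-MPP product.
    """
    if len(rsl_q) < min(len(r) for r in kmpp_rsls):
        return "too few RSL members"
    return "missing most valuable customer (MVC) in RSL"

# Example 2 pattern: the why-not point has fewer RSL members than any K-MPP member.
print(why_not_cause({"c1"}, [{"c1", "c2"}, {"c3", "c4", "c5"}]))
```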

D. Query Modification
The query point modification process is the data refinement approach proposed in this research. By modifying an attribute value of the record q to obtain q′, the expected output is the emergence of q′ as a member of K-MPP.

Definition 3. Query point modification is performed by considering the RSL members of each promising product p in the K-MPP result. Among the customers c that are members of RSL(p) but not of RSL(q), the customer value whose difference from the corresponding value of q in some dimension is smallest is defined as v_min; the modified query q′ is then obtained by replacing the value of the query point in that dimension with v_min.

Based on Definition 3, the data refinement process consists of three stages, as shown in Figure 2. The first step, before the data modification task, is identifying the RSL of the query q and of each promising product p in the K-MPP result. Pre-processing begins with collecting the set of customers c in RSL(p) that are not in RSL(q). We then calculate, for each such customer and each dimension, the difference between the customer's value and the value of q, and store the results in a difference table. Because the query modification process must consider the least possible cost of change, the final pre-processing step is sorting this table by the minimal difference over all data dimensions.

Example 4. Based on the dataset in Table I, the preference value of q = p5 (the why-not point of 2-MPP) is (16, 6), while the RSL members of the 2-MPP products and their preference values are c1 (10, 10), c2 (4, 10), c3 (20, 13), c4 (12, 2), c6 (2, 8), c8 (6, 16), and c10 (18, 6). The value differences between p5 and all RSL members of the 2-MPP products are depicted in Table III.

After the table is obtained in pre-processing, the query modification is performed by changing the query value in one of its dimensions to the RSL member's value with the minimal difference. By modifying the query toward a customer preference that appears as a member of RSL(p), that customer is also expected to appear as a member of RSL(q′). Modification of the query value changes the probability scores, which in turn change the MC scores and the K-MPP ranking.

Example 5. Based on the results in Table IV, the data points c6 and c10 have the closest values to the why-not point in their 2nd and 1st dimensions, respectively. However, since c10 is already a member of RSL(p5), the value chosen is 8, the preference value of c6 in its 2nd dimension. The new query is then determined by changing the value of q in its 2nd dimension, so that q′ = (16, 8) is obtained. This process is repeated over all data in the table, producing the refinement list shown in Table V, which represents the data refinement candidates that may resolve the why-not point as a K-MPP result.
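The candidate-generation step can be sketched as follows, assuming numeric preference tuples; this is an illustrative reading of Definition 3, not the authors' code, and it emits every one-dimension change ordered by its cost:

```python
def candidate_refinements(q, customers):
    """Generate (q', cost) pairs, each changing q in exactly one dimension
    to a customer's value, sorted by ascending cost of change."""
    cands = []
    for c in customers:
        for d, (qv, cv) in enumerate(zip(q, c)):
            if qv != cv:
                q_new = list(q)
                q_new[d] = cv          # adopt the customer's value in dimension d
                cands.append((tuple(q_new), abs(qv - cv)))
    return sorted(cands, key=lambda t: t[1])

# q = (16, 6) as in Example 4; two hypothetical customer preferences.
print(candidate_refinements((16, 6), [(2, 8), (18, 6)]))
# [((16, 8), 2), ((18, 6), 2), ((2, 6), 14)]
```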
Not all members of the refinement list can resolve the why-not question, so a validation process is required as the next step. The purpose of validation is to compute K-MPP with the new, modified value of the why-not point taken from the refinement list. Since the list already contains the query modification values ordered by the smallest data change in one dimension, the validation process stops as soon as one valid data refinement is found, i.e. a q′ from the list that is a member of the K-MPP result. Through the three stages of query modification, the data refinement with the smallest change of value is obtained, so the cost incurred is also the least.
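The early-stopping validation can be sketched as below; `compute_kmpp` is a hypothetical stand-in for the full K-MPP evaluation over the modified dataset:

```python
def first_valid_refinement(cands, compute_kmpp, k):
    """Check candidates in ascending cost order; return the first q' that
    appears in the K-MPP result, so the returned refinement is minimal-cost."""
    for q_new, cost in cands:
        if q_new in compute_kmpp(q_new, k):
            return q_new, cost
    return None

# Dummy evaluator (assumption): accepts any query whose 2nd value is >= 8.
dummy = lambda q, k: [q] if q[1] >= 8 else []
print(first_valid_refinement([((18, 6), 2), ((16, 8), 2), ((2, 6), 14)], dummy, 2))
# ((16, 8), 2)
```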

V. EXPERIMENT
A. Experimental Data
As an experiment, this research uses three data types: the Forest Cover type (FC) dataset, the independent dataset (IND), and the anti-correlated dataset (ANT). Each data type varies in cardinality and in the number of data dimensions.
Independent data (IND) is a synthetic dataset whose attribute values are randomly distributed and do not affect one another. The use of this data aims to test the performance of the algorithms on data whose attribute values are unrelated. The range of values for each attribute is between 1 and 100.
Anti-correlated data (ANT) is a synthetic dataset with an opposing distribution of attribute values: a point with a high value in one attribute has very low values in the other attributes, and vice versa. The use of this data aims to test the performance of the algorithm on data whose attribute values are contradictory, which yields the fewest dominance relationships among the data types. The range of values for each attribute is between 1 and 100.
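A generator for these two synthetic types can be sketched as follows. This is an assumption about how such data is commonly produced, not the authors' generator: the anti-correlated points are drawn on a constant-sum plane, so a high value in one dimension forces low values elsewhere:

```python
import random

def gen_ind(n, d):
    """Independent data: each attribute drawn uniformly from 1..100."""
    return [[random.randint(1, 100) for _ in range(d)] for _ in range(n)]

def gen_ant(n, d, total=101):
    """Anti-correlated data: attributes in 1..100 that sum to `total` per point,
    built from d-1 random cut points on the interval [0, total]."""
    data = []
    for _ in range(n):
        cuts = sorted(random.sample(range(1, total), d - 1))
        vals = [b - a for a, b in zip([0] + cuts, cuts + [total])]
        data.append(vals)
    return data
```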
The Forest Cover type (FC) dataset is a real dataset derived from actual observations. The use of this data aims to measure the performance of the algorithms on data whose attribute values have interrelated distributions and ranges.

B. Experimental Scenario
The experiment was performed on each dataset type (independent, anti-correlated, and Forest Cover type) with the following variations of each independent variable (cardinality, number of dimensions d, and ΔK):
a. Cardinality of data: 5,000, 10,000, 20,000, 30,000, and 50,000.
b. Number of dimensions (d): 2, 3, 5, 7, and 10.
c. ΔK, the difference between the rank of the why-not point and k in K-MPP: 1, 3, 5, 7, and 10.
In addition to these variations, each variable has a default value that is held fixed whenever it is not the independent variable of the experiment:
a. Cardinality of data: 20,000.
b. Number of dimensions: 3.
c. ΔK: 3.
For example, in the first scenario, where data cardinality is the independent variable and varies according to the predetermined values (5,000, 10,000, 20,000, 30,000, and 50,000), the other variables are set to their fixed values d = 3 and ΔK = 3.
There are three metrics to be analyzed. The first is the number of candidate refinements proposed as data refinement suggestions; the second is the success rate of the data refinement approach as evaluated in the query modification validation stage; the last is the average time needed to find the proposed data refinement with minimal cost that resolves the why-not point as a member of K-MPP.
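The one-factor-at-a-time design above can be expressed as a small configuration sketch; the variable names are illustrative, not taken from the paper:

```python
# Default values held fixed when a variable is not under test.
DEFAULTS = {"cardinality": 20_000, "dims": 3, "delta_k": 3}

# The values each independent variable takes in its own scenario.
VARIATIONS = {
    "cardinality": [5_000, 10_000, 20_000, 30_000, 50_000],
    "dims": [2, 3, 5, 7, 10],
    "delta_k": [1, 3, 5, 7, 10],
}

def scenarios():
    """Yield one experiment configuration per (variable, value) pair."""
    for var, values in VARIATIONS.items():
        for v in values:
            cfg = dict(DEFAULTS)
            cfg[var] = v
            yield cfg

print(sum(1 for _ in scenarios()))  # 15 runs across the three scenarios
```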

C. Experimental Result: Data Cardinality Variation
As depicted in Table VI, Table VII, and Table VIII, the number of data refinement variations produced increases with data cardinality. Similarly, the validation time required for the final check of whether a formed data refinement resolves the why-not K-MPP issue also increases.
In addition, from Table VII and Table VIII it can be concluded that the number of data refinement variations generated on the IND and FC data types tends to be similar, whereas the ANT data yields a much different number of variations, as in Table VI; the number of refinements it produces is also more constant than for the other two data types. This is due to the distribution characteristics of each data type: in IND and FC data, points tend to be more spread over each dimension than in ANT data.
Across all the tests performed, it can also be seen that the time required to determine the data refinement is relatively constant, with only a slight increase at larger data amounts. In the three tables, the average execution time ranges from a minimum of 1.22 s to a maximum of 1.9 s.

D. Experimental Result: Data Dimension Variation
In Table IX, Table X, and Table XI, the number of data refinements generated decreases as the number of data dimensions grows. Similarly, the validation time required for the final check of whether a formed data refinement resolves the why-not K-MPP issue also decreases.
In contrast to the previous scenario, the number of generated data refinements shows the same trend over all data types: the higher the number of data dimensions, the fewer data refinements are generated. This is because skyline computation yields fewer results at higher dimensionality. In addition, the time needed to determine the data refinement variations also decreases with a higher number of data dimensions.

E. Experimental Result: ΔK Variation
The results of this scenario are depicted in Table XII, Table XIII, and Table XIV. The number of generated data refinements and the validation time increase with larger ΔK. The number of data refinement variations generated on the IND and FC data types again tends to be similar, whereas the ANT data, as in Table XII, has a relatively constant number compared to the other two data types. This condition is due to the distribution characteristics of each data type: in IND and FC, data tend to be more spread over each dimension.
In the overall evaluation results, it can also be seen that the time needed to determine the data refinement is relatively constant, with a slight increase for greater ΔK.

VI. CONCLUSION
The why-not question that appears in a given K-MPP query result can be identified and resolved by evaluating the RSL members of the why-not point (the product that is not a member of K-MPP) and the RSLs of the K-MPP members. If the number of RSL members of the why-not point is less than the minimum number of RSL members among the K-MPP members, the cause is a lack of RSL members. Conversely, if the number of RSL members of the why-not point equals or exceeds that minimum, the cause is the absence of the Most Valuable Customer (MVC) among its RSL members. The query point modification process is the data refinement approach proposed in this research: by modifying the value of q to q′, the expected output is the appearance of q′ as a member of K-MPP. Query point modification is done by considering the RSL members of the K-MPP products; the customer value whose difference from the corresponding query value in some dimension is minimal is chosen, and q′ is determined by changing the query point in that dimension to this value. The data refinement approach was evaluated under three scenarios (variation of data cardinality, variation of data dimensions, and variation of ΔK), with the following results: (a) the time required to find data refinement variations tends to be constant and increases with data cardinality; (b) under varying data dimensions, the time required to find data refinements is shorter at higher dimensionality because fewer data refinements are generated; and (c) the validation process still requires a long time, which grows with data cardinality and ΔK but decreases with a higher number of data dimensions.