To answer this question, we generated another type of data modelled after a multidimensional chessboard data set.


The X matrix contained 1, observations and 2, 3 attributes drawn from a uniform distribution on the range [0, 1]. There is clearly strong attribute dependence, but since all parts of the decision boundary are parallel to one of the attributes, this kind of data can be modelled with a decision tree with no bias. Figure 4 presents complexity curves and error curves for different dimensionalities of the chessboard data.
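A minimal sketch of how such a chessboard set can be generated (the grid granularity of four cells per dimension and the sample size below are our own illustrative assumptions, not taken from the text):

```python
import numpy as np

def make_chessboard(n_samples, n_dims, n_cells=4, seed=0):
    """Hypothetical multidimensional chessboard data: attributes are
    uniform on [0, 1] and the binary class alternates between adjacent
    grid cells, so every piece of the decision boundary is axis-parallel."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n_samples, n_dims))
    cells = np.floor(X * n_cells).astype(int)   # cell index along each dimension
    y = cells.sum(axis=1) % 2                   # class = parity of cell indices
    return X, y

X, y = make_chessboard(1000, 2)
```

Because the class depends on the parity of all cell indices jointly, every attribute interacts with every other, which is exactly the dependence structure discussed below.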

Here the classification error becomes larger than indicated by the complexity curve. The more dimensions there are, the more dependencies between attributes violate the complexity curve assumptions. For a three-dimensional chessboard, the classification problem becomes rather hard and the observed error decreases slowly, but the complexity curve remains almost the same as in the two-dimensional case.

This shows that the complexity curve is not expected to be a good predictor of classification accuracy in problems where many high-dimensional attribute dependencies occur, for example in epistatic domains, in which the importance of one attribute depends on the values of the others. The results of experiments with controlled artificial data sets are consistent with our theoretical expectations. Based on these results, we can introduce a general interpretation of the difference between the complexity curve and the learning curve: a learning curve below the complexity curve indicates that the algorithm is able to build a good model without sampling the whole domain, limiting the variance error component.

On the other hand, a learning curve above the complexity curve indicates that the algorithm includes complex attribute dependencies in the constructed model, promoting the variance error component. To evaluate the impact of the proposed preprocessing techniques, whitening and Independent Component Analysis (ICA), on complexity curves, we performed experiments with artificial data. In one data set all attributes were independent; in the other, the same attribute was repeated eight times.

Small Gaussian noise was added to both sets. Figure 5 shows complexity curves calculated before and after the whitening transform. We can see that whitening had no significant effect on the complexity curve of the independent set. In the case of the dependent set, the complexity curve calculated after whitening decreases visibly faster and the area under the curve is smaller.
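The effect can be reproduced with a small whitening sketch (ZCA whitening is our choice here; the text does not specify the variant). A set built by repeating one attribute has a nearly singular covariance matrix; after whitening, the covariance becomes the identity, so the duplicates no longer inflate the measured complexity:

```python
import numpy as np

def whiten(X, eps=1e-8):
    """ZCA whitening: decorrelate attributes and scale them to unit variance."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T  # whitening matrix
    return Xc @ W

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 1))
# "Dependent" set: the same attribute repeated four times plus small noise.
X_dep = np.hstack([base + 0.01 * rng.normal(size=(500, 1)) for _ in range(4)])
Xw = whiten(X_dep)
```

The small `eps` guards against division by zero for near-duplicate attributes whose covariance eigenvalues are close to zero.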

This is consistent with our intuitive notion of complexity: a data set with highly correlated or duplicated attributes should be significantly less complex. In the second experiment, two data sets with observations and four attributes were generated. The first data set was generated from the continuous uniform distribution on the interval [0, 2], the second one from the discrete (categorical) uniform distribution on the same interval.

Figure 6 presents complexity curves for the original, whitened and ICA-transformed data. Among the original data sets, the intuitive notion of complexity is preserved: the area under the complexity curve for the categorical data is smaller. The difference disappears for the whitened data but is again visible in the ICA-transformed data. These simple experiments are by no means exhaustive, but they confirm the usefulness of the chosen signal processing techniques, data whitening and Independent Component Analysis, in complexity curve analysis.
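As an illustration of the ICA step (with an arbitrary 2x2 mixing matrix of our own choosing), scikit-learn's FastICA recovers components that are statistically independent, which is the representation the complexity curve assumes:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
S = rng.uniform(-1.0, 1.0, size=(1000, 2))   # independent non-Gaussian sources
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])                   # example mixing matrix
X = S @ A.T                                  # observed, correlated attributes
ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)                 # estimated independent components
```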

The complexity curve is based on the expected Hellinger distance and the estimation procedure includes some variance. The natural assumption is that the variability caused by the sample size is greater than the variability resulting from a specific composition of a sample. Otherwise, averaging over samples of the same size would not be meaningful.
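A sketch of the quantity behind the curve, under the attribute-independence assumption (the histogram binning and per-attribute averaging here are our own simplifications): the Hellinger distance between a subsample's distribution and the full data's distribution shrinks as the subsample grows.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete probability vectors."""
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

def subsample_distance(X, idx, bins=10):
    """Average per-attribute Hellinger distance between the subsample X[idx]
    and the full data set X, using shared histogram bins."""
    dists = []
    for j in range(X.shape[1]):
        edges = np.histogram_bin_edges(X[:, j], bins=bins)
        p, _ = np.histogram(X[:, j], bins=edges)
        q, _ = np.histogram(X[idx, j], bins=edges)
        dists.append(hellinger(p / p.sum(), q / q.sum()))
    return float(np.mean(dists))

rng = np.random.default_rng(0)
X = rng.uniform(size=(2000, 3))
d_small = subsample_distance(X, rng.choice(2000, 50, replace=False))
d_large = subsample_distance(X, rng.choice(2000, 1000, replace=False))
```

Averaging this quantity over many random subsamples of each size traces out a curve of the same general shape as the complexity curve.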

This assumption is already present in the standard learning curve methodology, where classifier accuracy is plotted against training set size. We expect that the exact variability of the complexity curve will be connected with the presence of outliers in the data set. Such influential observations will have a huge impact depending on whether or not they are included in a sample. To verify whether these intuitions were true, we constructed two new data sets by artificially introducing outliers to the wine data set.

Figure 7 presents conditional complexity curves for all three data sets. Adding so much noise increased the overall complexity of the data set significantly. The results support our hypothesis that large variability of the complexity curve signifies the occurrence of highly influential observations in the data set.

This makes the complexity curve a valuable diagnostic tool for such situations. However, it should be noted that our method is unable to distinguish between important outliers and plain noise. To obtain this kind of insight, one has to employ different methods. We decided to compare the complexity curve experimentally with those measures. Descriptions of the measures used are given in Table 1. According to our hypothesis, the conditional complexity curve should be robust in the context of class imbalance.

To demonstrate this property, we used for the comparison 81 imbalanced data sets used previously in the study by Diez-Pastor et al. We selected only binary classification problems. As the T2 measure seemed to have non-linear characteristics destroying the correlation, an additional column, logT2, was added to the comparison. Results are presented in Fig. This is to be expected, as both measures are concerned with sample size in relation to attribute structure. The difference is that T2 takes into account only the number of attributes, while AUCC also considers the shape of the distributions of the individual attributes.
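The logT2 column can be motivated with a toy calculation (the numbers below are synthetic stand-ins, not the paper's measurements): when one variable depends exponentially on another, the Pearson correlation of the log-transformed values is markedly higher than that of the raw values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
aucc = rng.uniform(0.5, 1.0, size=81)               # stand-in for 81 AUCC scores
t2 = np.exp(6.0 * aucc + rng.normal(0.0, 0.3, 81))  # exponential dependence
r_raw = stats.pearsonr(aucc, t2)[0]                 # weakened by non-linearity
r_log = stats.pearsonr(aucc, np.log(t2))[0]         # restored on the log scale
```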

Correlations of AUCC with the other measures are much lower; it can be assumed that they capture different aspects of data complexity and may potentially be complementary.


The next step was to show that the information captured by AUCC is useful for explaining classifier performance. To do so, we trained a number of different classifiers on the 81 benchmark data sets and evaluated their performance using a random train-test split with proportion 0. The performance measure used was the area under the ROC curve.
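The evaluation protocol can be sketched as follows (a stand-in data set is used here; the 81 benchmark sets are not reproduced, and the 0.5 split below is an arbitrary choice since the exact proportion is cut off in the text):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Single random train-test split, AUC ROC as the performance measure.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```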

The intuition was as follows: linear classifiers do not model attribute interdependencies, which is in line with the complexity curve assumptions. The selected non-linear classifiers, on the other hand, are, depending on the parametrisation, more prone to variance error, which should be captured by the complexity curve.

Most of the correlations are weak and do not reach statistical significance; however, some general tendencies can be observed. This may be explained by the high-bias and low-variance nature of these classifiers: they are not strongly affected by data scarcity but their performance depends on other factors. This is especially true for the LDA classifier, which has the weakest correlation among linear classifiers.

In k-NN, classifier complexity depends on the k parameter: with low k values, it is more prone to variance error; with larger k, it is prone to bias if the sample size is not large enough (Domingos). The depth parameter in a decision tree also regulates complexity: the larger the depth, the more the classifier is prone to variance error and the less to bias error. This suggests that AUCC should be more strongly correlated with the performance of deeper trees. On the other hand, complex decision trees explicitly model attribute interdependencies ignored by the complexity curve, which may weaken the correlation.
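The effect of k can be seen on a small synthetic example of our own construction (an XOR-like target with 20% label noise; the sizes and k values are arbitrary): a moderate k averages the noise away, while a k close to the sample size collapses towards the majority class.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(400, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)        # XOR-like, interdependent attributes
y = np.where(rng.random(400) < 0.2, 1 - y, y)  # 20% label noise
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in (1, 15, 301)}                # high-variance, balanced, high-bias
```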

This is observed in the obtained results: for a decision stump (a tree of depth 1), which is a low-variance, high-bias classifier, the correlation with AUCC and logT2 is very weak. It should be noted that with large tree depth, as with small k values in k-NN, AUCC has a stronger correlation with classifier performance than logT2. A slightly more sophisticated way of applying data complexity measures is to attempt to explain classifier performance relative to some other classification method.

Table 3 presents correlations of both measures with classifier performance relative to LDA. Here we can see that the correlations for AUCC are generally higher than for logT2 and reach significance for the majority of classifiers. Especially in the case of the decision tree, AUCC explains relative performance better than logT2 (correlation 0.).

Results of the presented correlation analyses demonstrate the potential of the complexity curve to complement the existing complexity measures in explaining classifier performance. It is worth noting that despite the attribute independence assumption, the complexity curve method proved useful for explaining the performance of complex non-linear classifiers. There is a special category of machine learning problems in which the number of attributes p is large with respect to the number of samples n, perhaps even orders of magnitude larger.

To test how our complexity measure behaves in such situations, we calculated AUCC scores for a few microarray data sets and compared them with AUC ROC scores of some simple classifiers.


Classifiers were evaluated as in the previous section. Detailed information about the data sets is given in Supplemental Information 1 as Table S3. Results of the experiment are presented in Table 4. As expected, with the number of attributes much larger than the number of observations, the data is considered by our metric as extremely scarce: values of AUCC are in all cases above 0. On the other hand, the AUC ROC classification performance varies greatly between data sets, with scores approaching or equal to 1.

This is because, despite the large number of dimensions, the form of the optimal decision function can be very simple, utilising only a few of the available dimensions.

The complexity curve does not consider the shape of the decision boundary at all, and thus does not reflect differences in classification performance. From this analysis we concluded that the complexity curve is not a good predictor of classifier performance for data sets containing a large number of redundant attributes, as it does not differentiate between important and unimportant attributes. The logical way to proceed in such a case would be to perform some form of feature selection or dimensionality reduction on the original data, and then calculate the complexity curve in the reduced dimensions.
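A sketch of that remedy on invented microarray-like data (60 samples, 2,000 attributes, only the first five informative; all of these choices are ours): univariate selection reduces the space before any complexity estimate is made.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n, p = 60, 2000                      # p >> n, as in the microarray setting
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, p))
X[:, :5] += 2.0 * y[:, None]         # only the first five attributes carry signal
selector = SelectKBest(f_classif, k=10).fit(X, y)
X_red = selector.transform(X)        # complexity curve would be computed on this
```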

The sets were chosen only as illustrative examples. The basic properties of the data sets are given in Supplemental Information as Table S4. For each data set, we calculated the conditional complexity curve. The curves are presented in Fig. On most of the benchmark data sets, we can see that the complexity curve upper-bounds the DT learning curve.

The bound is relatively tight in the case of glass and iris, and looser for the breast-cancer-wisconsin and wine data sets. A natural conclusion is that a lot of the variability contained in the last of these data sets and captured by the Hellinger distance is irrelevant to the classification task. The most straightforward explanation would be the presence of unnecessary attributes not correlated with the class, which can be ignored altogether.

This is consistent with the results of various studies in feature selection (Choubey et al.). On monks-1 and car, the complexity curve is no longer a proper upper bound on the DT learning curve. This indicates models relying heavily on attribute interdependencies to determine the correct class. This is not surprising: both monks-1 and car are artificial data sets with discrete attributes devised for the evaluation of rule-based and tree-based classifiers (Thrun et al.).

Classes are defined with logical formulas utilising relations between multiple attributes rather than single values; clearly the attributes are interdependent.

In this context, the complexity curve can be treated as a baseline for the independent-attribute situation and the generalisation curve as a diagnostic tool indicating the presence of interdependencies. Besides the slope of the complexity curve, we can also analyse its variability. We can see that the shape of the wine complexity curve is very regular, with small variance at each point, while the glass curve displays much higher variance.

This means that the observations in the glass data set are more diverse, and some observations or their combinations are more important for representing the data structure than others. This short analysis demonstrates how to use complexity curves to compare properties of different data sets.

Here only a decision tree was used as the reference classifier. The method can easily be extended to include multiple classifiers and compare their performance. We present such an extended analysis in Supplemental Information 2. The problem of data pruning in the context of machine learning is defined as reducing the size of the training sample in order to reduce classifier training time while still achieving satisfactory performance.

It becomes extremely important as the data grows and (a) no longer fits in the memory of a single machine, and (b) training times of more complex algorithms become very long. A classic method for performing data pruning is progressive sampling: training the classifier on data samples of increasing size as long as its performance increases. Geometric sampling uses samples of sizes a^i * n_0, where n_0 is the initial sample size, a is the multiplier, and i is the iteration number.
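For concreteness, a geometric schedule with n_0 = 100 and a = 2 (illustrative values of our choosing) produces the following sample sizes:

```python
def geometric_schedule(n0, a, n_max):
    """Sample sizes a**i * n0 used by geometric progressive sampling,
    truncated at the full data set size n_max."""
    sizes, i = [], 0
    while n0 * a ** i <= n_max:
        sizes.append(n0 * a ** i)
        i += 1
    return sizes

sizes = geometric_schedule(100, 2, 5000)   # [100, 200, 400, 800, 1600, 3200]
```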


## Complexity curve: a graphical measure of data complexity and classifier performance

In our method, instead of training a classifier on the drawn data sample, we probe the complexity curve. We are not trying to detect the convergence of classifier accuracy, but simply search for a point on the curve corresponding to some reasonably small Hellinger distance value. This point designates the smallest data subset which still contains the required amount of information. In this setting, we were not interested in calculating the whole complexity curve but just in finding the minimal data subset which still contains most of the original information.

The search procedure should be as fast as possible, since the goal of data pruning is to save the time spent on training classifiers.

We used the classic Brent method (Brent) to find a root of the criterion function. To speed up the procedure even further, we used the standard complexity curve instead of the conditional one and settled on the whitening transform as the only preprocessing technique. To verify whether this idea is of practical use, we performed an experiment with three larger data sets from the UCI repository. Their basic properties are given in Supplemental Information 1 as Table S5.
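The search can be sketched with SciPy's Brent root-finder on a stand-in curve (the exponential form of h and the 0.01 threshold are our illustrative assumptions; in the method itself, h(s) would be the empirically probed Hellinger distance at sample size s):

```python
import numpy as np
from scipy.optimize import brentq

N = 100_000                                   # full data set size
h = lambda s: np.exp(-s / 5000.0)             # toy distance-versus-size curve
threshold = 0.01                              # "reasonably small" target distance
# Root of h(s) - threshold gives the smallest sufficient subset size.
s_min = brentq(lambda s: h(s) - threshold, 1, N)
```

Brent's method needs only a handful of curve evaluations, which matters because each evaluation of the real criterion requires estimating a Hellinger distance on a subsample.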

For all data sets, we performed a stratified 10-fold cross-validation experiment. Achieving the same accuracy as with CC pruning was used as the stop criterion for progressive sampling. Classifiers were trained on pruned and unpruned data and evaluated on the testing part of each cross-validation split. Standard errors were calculated for the obtained values. We used machine learning algorithms from the scikit-learn library (Pedregosa et al.). Table 5 presents measured times and obtained accuracies. As can be seen, the difference in classification accuracies between pruned and unpruned training data is negligible.

The CC compression rate is rather stable, with only a small standard deviation, but the PS compression rate is characterised by huge variance.

The complexity measures used for comparison, computed with library routines, are as follows. Measures of overlap of individual feature values:

- F1v: the directional-vector maximum Fisher's discriminant ratio
- F2: the overlap of the per-class bounding boxes
- F3: the maximum individual feature efficiency
- F4: the collective feature efficiency

Measures of class separability:

- L1: the minimized sum of the error distances of a linear classifier
- L2: the training error of a linear classifier
- N1: the fraction of points on the class boundary
- N3: the leave-one-out error rate of the one-nearest-neighbor classifier

Measures of geometry and topology:

- L3: the nonlinearity of a linear classifier
- T1: the fraction of maximum covering spheres
- T2: the average number of points per dimension
