Łukasz Struski
Schedae Informaticae, Volume 26, 2017, pp. 23 - 35
https://doi.org/10.4467/20838476SI.17.001.6807
Applying machine learning methods to incomplete data is an important task in many scientific fields, such as medicine, biology, and face recognition. Typically, missing values are substituted with artificial values estimated from the known samples, and classical machine learning algorithms are then applied. Although this methodology is very common, it produces less informative data, because artificially generated values are treated in the same way as the known ones. In this paper, we consider a probabilistic representation of missing data, where each vector is identified with a Gaussian probability density function that models the uncertainty of the absent attributes. This representation makes it possible to construct an analogue of the RBF kernel for incomplete data. We show that such a kernel can be successfully used in regression SVM. Experimental results confirm that our approach captures relevant information that is not captured by traditional imputation methods.
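The idea of an RBF-like kernel over Gaussian-represented points can be illustrated with a minimal sketch. It is not taken from the paper: it assumes independent Gaussians with diagonal covariances and uses the standard closed form for the expected value of exp(-gamma * ||X - Y||^2). The function names `expected_rbf` and `as_gaussian`, the use of `None` for a missing attribute, and the fill-in means and variances are all hypothetical choices made for illustration.

```python
import math


def expected_rbf(mean_x, var_x, mean_y, var_y, gamma=1.0):
    """Expected RBF value E[exp(-gamma * ||X - Y||^2)].

    Assumes X and Y are independent Gaussians with diagonal covariances,
    so the expectation factorizes over coordinates.
    """
    value = 1.0
    for mx, vx, my, vy in zip(mean_x, var_x, mean_y, var_y):
        # The difference X_i - Y_i is Gaussian with variance vx + vy.
        scale = 1.0 + 2.0 * gamma * (vx + vy)
        value *= math.exp(-gamma * (mx - my) ** 2 / scale) / math.sqrt(scale)
    return value


def as_gaussian(point, fill_mean, fill_var):
    """Turn a point with missing attributes (None) into (mean, variance) lists.

    Observed attributes get variance 0; missing ones take a hypothetical
    per-attribute mean and variance, e.g. estimated from the known samples.
    """
    mean = [fill_mean[i] if v is None else v for i, v in enumerate(point)]
    var = [fill_var[i] if v is None else 0.0 for i, v in enumerate(point)]
    return mean, var
```

When both points are fully observed, every variance is zero and the value coincides with the ordinary RBF kernel; uncertainty in a missing attribute flattens the kernel along that coordinate instead of treating an imputed value as if it were known.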
Łukasz Struski
Schedae Informaticae, Volume 24, 2015, pp. 133 - 142
https://doi.org/10.4467/20838476SI.15.013.3035
We present a new subspace clustering method called SuMC (Subspace Memory Clustering), which efficiently divides a dataset $D \subset \mathbb{R}^N$ into $k \in \mathbb{N}$ pairwise disjoint clusters of possibly different dimensions. Since our approach is based on memory compression, we do not need to specify the dimensions of the groups explicitly: we only need to specify the mean number of scalars used to describe a data point. In the case of one cluster, our method reduces to the classical Karhunen-Loève (PCA) transform. We test our method on some typical data from the UCI repository and on data from real-life experiments.