
Boštjan Brumen

University of Maribor
Faculty of Electrical Engineering and Computer Science
Institute of Informatics

Smetanova 17
SI-2000 Maribor

Slovenia, Europe
bostjan.brumen@uni-mb.si




ORCID.org Author ID: 0000-0002-0560-1230
Scopus Author ID: 6603364721

Boštjan Brumen is Assistant Professor at the Institute of Informatics, Faculty of Electrical Engineering and Computer Science, University of Maribor, Slovenia. He graduated in informatics in 1996 at the University of Maribor, where he received his Ph.D. in informatics in 2004.

Brumen's research areas include intelligent data analyses, data mining, supervised machine learning, learning curve modeling, data security and privacy, and data quality.



COPYRIGHT POLICY: All the material (html, PDF, doc, ppt files) is included in the web site to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here or elsewhere electronically.

The entire content of this site is subject to copyright protection. All rights reserved. You may download and retain on your disk or in hard copy form a single copy of material published on this site for your personal use only, provided that you do not remove any proprietary notices. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.


Journal articles

2007
Boštjan Brumen, Matjaž B Jurič, Tatjana Welzer, Ivan Rozman, Hannu Jaakkola, Apostolos Papadopoulos (2007)  Assessment of classification models with small amounts of data   Informatica (Vilnius) 18: 3. 343-362  
Abstract: One of the tasks of data mining is classification, which provides a mapping from attributes (observations) to pre-specified classes. Classification models are built by using underlying data. In principle, models built with more data yield better results. However, the relationship between the available data and the performance is not well understood, except that the accuracy of a classification model shows diminishing improvements as a function of data size. In this paper, we present an approach for an early assessment of the extracted knowledge (classification models) in terms of performance (accuracy), based on the amount of data used. The assessment is based on observing the performance on smaller sample sizes. The solution is formally defined and used in an experiment. In the experiments we show the correctness and utility of the approach.
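The assessment idea can be illustrated with a short sketch (not the paper's code): train a classifier on growing subsamples and record its held-out accuracy, making the diminishing improvements the abstract refers to directly visible. The dataset, classifier, and sample sizes below are illustrative assumptions.

    from sklearn.datasets import load_breast_cancer
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    for n in (25, 50, 100, 200, len(X_pool)):  # growing training-set sizes
        model = DecisionTreeClassifier(random_state=0).fit(X_pool[:n], y_pool[:n])
        acc = accuracy_score(y_test, model.predict(X_test))
        print(f"n={n:4d}  accuracy={acc:.3f}")  # improvements flatten out as n grows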
2005
Boštjan Brumen, Tatjana Welzer, Marjan Družovec, Izidor Golob, Hannu Jaakkola, Ivan Rozman, Jiři Kubalík (2005)  Protecting Medical Data for Decision-Making Analyses   Journal of Medical Systems 29: 1. 65-80  
Abstract: In this paper, we present a procedure for data protection, which can be applied before any model-building analyses are performed. In medical environments, abundant data exist, but because of the lack of knowledge they are rarely analyzed, although they hide valuable and often life-saving knowledge. To be able to analyze the data, the analyst needs full access to the relevant sources, but this may be in direct contradiction with the demand that the data remain secure and, more importantly in the medical domain, private. This is especially the case if the data analyst is outsourced and not directly affiliated with the data owner. We address this issue and propose a solution where the model-building process is still possible while the data are better protected. We consider the case where the distributions of the original data values are preserved while the values themselves change, so that the resulting model is equivalent to the one built with the original data.
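A minimal sketch of the distribution-preserving idea, under the assumption that each numeric attribute is masked with a secret order-preserving affine map (this is an illustration, not the paper's procedure): the values change, their shape and ranking do not, so rank-based learners such as decision trees are unaffected and the data owner can invert the map. The attribute and key below are hypothetical.

    import numpy as np

    rng = np.random.default_rng(42)

    def protect(column, key):
        a, b = key                  # secret per-attribute parameters, a > 0
        return a * column + b       # order- and shape-preserving substitution

    def unprotect(column, key):
        a, b = key
        return (column - b) / a     # the data owner can invert the map

    blood_pressure = np.array([118.0, 135.0, 122.0, 150.0, 110.0])
    key = (float(rng.uniform(0.5, 2.0)), float(rng.uniform(-50, 50)))
    masked = protect(blood_pressure, key)
    assert np.allclose(unprotect(masked, key), blood_pressure)
    assert (np.argsort(masked) == np.argsort(blood_pressure)).all()  # ranking kept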
2004
Boštjan Brumen, Matjaž Jaušovec, Tatjana Welzer, Hannu Jaakkola, Lenka Lhotska (2004)  An advanced approach to estimation of consumption of electric energy   IIAS Transactions on Systems Research and Cybernetics 4: 2. 47-52  
Abstract: In Slovenia, the liberalization of the energy market started in 1999. Since 2001, the eligible market players on the buying side must purchase the energy at the organized energy exchange where supply and demand meet. This change requires very detailed planning of the energy consumption, since the energy is purchased one day in advance. A penalty for over- or underestimation of the next day's consumption is levied. In the paper, we present a data mining approach to the estimation of the energy consumption based on history data and weather data. The results show that for regular (working) weekdays the approach is up to 30% better than the existing ones. For exceptions, such as holidays, Saturdays, Sundays and other non-uniform days, the approach is as useless as any other, including an experienced expert's educated guess.
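A minimal sketch of the kind of estimator the abstract describes, assuming synthetic history and weather data and an off-the-shelf regressor; none of the features or figures below come from the paper.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    days = 365
    temp = 10 + 15 * np.sin(np.arange(days) * 2 * np.pi / 365) + rng.normal(0, 2, days)
    weekday = np.arange(days) % 7
    load = 500 - 8 * temp + 40 * (weekday < 5) + rng.normal(0, 10, days)  # synthetic MWh

    # predict tomorrow's load from today's load, tomorrow's forecast temperature, weekday
    X = np.column_stack([load[:-1], temp[1:], weekday[1:]])
    y = load[1:]
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[:300], y[:300])
    err = np.abs(model.predict(X[300:]) - y[300:]).mean()
    print(f"mean absolute error on held-out days: {err:.1f} MWh")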
2003
Boštjan Brumen, Izidor Golob, Tatjana Welzer, Marjan Družovec, Hannu Jaakkola (2003)  Postopek ocenitve napake klasifikacijskih algoritmov [A procedure for estimating the error of classification algorithms]   Elektrotehniški vestnik 70: 1-2. 34-39  
Abstract: Data hide important knowledge. The amount of data available today has increased to such an extent that new techniques are required for knowledge extraction. Data mining (DM) is a relatively young discipline and its results are very promising. It is part of a comprehensive process, knowledge discovery in databases (KDD), defined in [1] as a "nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data". Fig. 1 depicts the process. To accomplish the data-mining task, several techniques have been devised (or re-introduced to the field). One of the techniques is classification, which provides a mapping from attributes (observations) to pre-specified groupings or classes [2]. Classification is considered to be supervised learning. The records (also called examples) must belong to a small set of classes that an expert has predefined. These records are also called a training set. The induced model consists of patterns, essentially generalizations over the records from a training set that are useful for distinguishing the classes. Once a model is induced, it can be used to automatically predict the class of other unclassified records. A generated model can classify a record into a wrong class. The number of misclassified records versus the total number of records is the classification error of the induced model (Eq. 1). If the induction algorithm used to build a model is capable of capturing the patterns that underlie the data, then the error rate tends to decrease as the number of records used to build the model increases. A plot of the error rate versus the sample size (number of records used) is called a learning curve (Fig. 2). In general, we cannot foretell how the chosen approach (a combination of the selected learning algorithm and available data) will be reflected in a learning curve, in other words, what error rate will be achieved using a given number of training records. Some approaches do not yield the desired results, such as a pre-specified error rate. Since DM using classification is a costly process, we developed a procedure to estimate the accuracy early in the process.
First, we build a partial learning curve. The building process is terminated when exit conditions are met. Finally, the full learning curve is calculated. To build a learning curve, we need to repeatedly measure the performance as the amount of training records increases. For this reason we modified the incremental k-fold cross-validation developed in [7]. The modification is presented in Procedure 1. In order to observe the learning curve, we need to be able to describe it. A learning curve can be described by a generic function e = f(n), where e denotes the performance and n the number of records in a training set. In Eq. (2) we added the constant a because we believe that the learning algorithm never acquires 100% accuracy (0% error) on the data. We believe the performance of the learning algorithm approaches asymptotically a certain (but unknown) number for several reasons (e.g. noisy data). Additionally, the form we use is still valid even in the case when the error rate actually reaches 0%. Several functions can be used to describe the learning curves, such as Eqs. (3) to (6). Which of the functions will actually be used depends on how well the measured points fit the function parameters. When building the partial learning curve, we need to know when to terminate the process. The first condition is that the error rate needs to be decreasing, or formally that Eq. 7 is fulfilled. Additionally, as with an ideal learning curve, the graph of the error rate should be concave up (Eq. 8). When the points are obtained and the exit conditions are met, we calculate the full learning curve. The actual learning curve is described by the selected equation, where the unknown parameters are fit by using one of the non-linear least squares methods. Once the parameters are calculated, we have a description of the full learning curve for the given data set and for the selected algorithm. Such a model can be used for estimation of the error rate of a learning algorithm for an arbitrary size of the learning set.
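The procedure lends itself to a compact sketch: after checking the two exit conditions (decreasing error, concave-up graph), fit a candidate curve with an asymptote, e(n) = a + b*n^(-c), to the points of a partial learning curve via non-linear least squares. This power-law form is one plausible candidate among the several forms the paper lists (Eqs. 3 to 6), and the measured points below are illustrative.

    import numpy as np
    from scipy.optimize import curve_fit

    def learning_curve(n, a, b, c):
        return a + b * n ** (-c)    # a = asymptotic (irreducible) error floor

    n_pts = np.array([50, 100, 200, 400, 800], dtype=float)
    err_pts = np.array([0.31, 0.24, 0.19, 0.16, 0.14])  # partial-curve measurements

    # exit conditions before fitting: error decreasing and graph concave up
    assert (np.diff(err_pts) < 0).all()
    assert (np.diff(err_pts, 2) > 0).all()

    (a, b, c), _ = curve_fit(learning_curve, n_pts, err_pts, p0=(0.1, 1.0, 0.5))
    print(f"estimated error at n=10000: {learning_curve(10000.0, a, b, c):.3f}")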
Boštjan Brumen, Izidor Golob, Tatjana Welzer, Ivan Rozman, Marjan Družovec, Hannu Jaakkola (2003)  An algorithm for protecting knowledge discovery data   Informatica (Vilnius) 14: 3. 277-288  
Abstract: In the paper, we present an algorithm that can be applied to protect data before a data mining process takes place. Data mining, a part of the knowledge discovery process, is mainly about building models from data. We address the following question: can we protect the data and still allow the data modelling process to take place? We consider the case where the distributions of the original data values are preserved while the values themselves change, so that the resulting model is equivalent to the one built with the original data. The presented formal approach is especially useful when the knowledge discovery process is outsourced. The application of the algorithm is demonstrated through an example.
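One way to picture the core of such an algorithm is a minimal sketch under the assumption that categorical values are recoded through a secret bijection held by the data owner; the attribute values and seed below are made up for illustration.

    import random

    def make_codebook(names, seed=7):
        rng = random.Random(seed)
        codes = [f"v{idx}" for idx in range(len(names))]
        rng.shuffle(codes)
        forward = dict(zip(names, codes))           # owner's secret recoding
        inverse = {c: n for n, c in forward.items()}
        return forward, inverse

    forward, inverse = make_codebook(["smoker", "non-smoker", "ex-smoker"])
    masked = [forward[v] for v in ["smoker", "smoker", "ex-smoker"]]
    print(masked)                          # value frequencies survive, meaning hidden
    print([inverse[v] for v in masked])    # the owner recovers the original values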
2002
Boštjan Brumen, Izidor Golob, Tatjana Welzer, Ivan Rozman, Marjan Družovec, Hannu Jaakkola (2002)  Data protection for outsourced data mining   Informatica (Ljubljana) 26: 2. 205-210  
Abstract: In the paper, we present data mining from the data protection point of view. In many cases, companies lack expertise in data mining and are required to get help from outside. In this case the data leave the organization and need to be protected against misuse, both legally and technically. In the paper, a formal framework for protecting the data that leave the organization's boundary is presented. The data and the data structure are modified so that the data modeling process can still take place and the results can be obtained, but the data content itself is hard to reveal. Once the data mining results are returned, the inverse process discloses the meaning of the model to the data owners. The approach is especially useful for model-based data mining.
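The disclosure step mentioned at the end of the abstract can be sketched as follows, assuming the outsourced miner returns a rule-based model expressed over masked identifiers; the rule and the mapping are hypothetical, for illustration only.

    # the owner's inverse mapping from masked identifiers back to real meanings
    inverse = {"a1": "age", "a2": "blood_pressure", "v1": "high_risk"}

    returned_rule = "IF a1 > 60 AND a2 > 140 THEN class = v1"  # model from the miner

    disclosed = returned_rule
    for code, meaning in inverse.items():
        disclosed = disclosed.replace(code, meaning)
    print(disclosed)   # IF age > 60 AND blood_pressure > 140 THEN class = high_risk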