Publications

Copyright law permitting, I aim to make available on this page any publications that I am involved with.

Conference/Journal

Title: Improving the Reliability of Naive Bayes and Decision Tree Learners, David Lindsay and Sian Cox, 4th IEEE International Conference on Data Mining, November 2004.
Abstract: The C4.5 Decision Tree and Naive Bayes learners are known to produce unreliable probability forecasts. We have used simple Binning (Zadrozny et al., 2001) and Laplace Transform (Cestnik et al., 1990) techniques to improve the reliability of these learners and compare their effectiveness with that of the newly developed Venn Probability Machine (VPM) meta-learner (Vovk et al., 2003). We assess improvements in reliability using loss functions, Receiver Operator Characteristic (ROC) curves and Empirical Reliability Curves (ERC). The VPM outperforms the simple techniques at improving reliability, although at the cost of increased computational intensity and a slight increase in error rate. These trade-offs are discussed.
short_ReliableJ48andNB_ICDM2004.ps 198 KB
short_ReliableJ48andNB_ICDM2004.pdf 293 KB

Technical Reports/Working Papers

Title: D. Lindsay. Visualising and improving reliability - a machine learning perspective. CLRC-TR-04-01, Technical Report, Computer Learning Research Centre, Royal Holloway, University of London, Egham, Surrey, UK, 2004.
Abstract: Reliable, or well-calibrated, probability forecasts are those which do not 'lie': for example, a learner is reliable if the events to which it assigns predicted probability p occur with relative frequency close to p. Machine learning studies often use techniques such as loss functions and Receiver Operator Characteristic (ROC) curves to assess reliability. However, these techniques cannot indicate the over- or under-estimation (unreliability) of probability forecasts. To address this, I present a method for constructing an Empirical Reliability Curve (ERC), developed from those used in psychological and meteorological studies. The ERC approach has been used to visualise the reliability of probability forecasts made by several well-known learners on benchmark datasets, revealing that some of the accurate learners tested do not produce reliable forecasts and vice versa. In addition, the ERC approach has been used to re-calibrate (improve the reliability of) the probability forecasts of several base learners. This study compares the resulting improvement in the base learners' probability forecasts with that achieved by various meta-learning techniques, including Boosting, Bagging, Pairwise Coupled Logistic Regression, Find Best Weights and Laplace Smoothing. Experiments have been conducted using both binary and multi-class real-life datasets of varying size, noise and complexity. PASSWORD PROTECTED
visualReliableCLRC.pdf 11,052 KB
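
As a rough illustration of what reliability means in the abstract above, the following Python sketch groups forecasts into equal-width bins and compares the mean predicted probability in each bin with the observed frequency of the event. It is a generic binned calibration table, not the ERC construction from the report; the function name `reliability_table` and the `n_bins` parameter are assumptions made for this example.

```python
from typing import List, Tuple


def reliability_table(
    probs: List[float], outcomes: List[int], n_bins: int = 5
) -> List[Tuple[float, float, int]]:
    """Return (mean predicted probability, observed frequency, count)
    for every non-empty bin. Illustrative only, not the ERC method."""
    sums = [0.0] * n_bins   # sum of predicted probabilities per bin
    hits = [0] * n_bins     # number of times the event occurred per bin
    counts = [0] * n_bins   # number of forecasts per bin
    for p, y in zip(probs, outcomes):
        # Equal-width bins on [0, 1]; p == 1.0 falls in the last bin.
        b = min(int(p * n_bins), n_bins - 1)
        sums[b] += p
        hits[b] += y
        counts[b] += 1
    return [
        (sums[b] / counts[b], hits[b] / counts[b], counts[b])
        for b in range(n_bins)
        if counts[b] > 0
    ]
```

For a reliable learner the first two values in each row should be close; a mean forecast well above the observed frequency suggests over-estimation, and well below suggests under-estimation.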

 

Title: D. Lindsay. Reliable Probability Forecasting Using the Venn Probability Machine Learner. CLRC-TR-04-01, Technical Report, Computer Learning Research Centre, Royal Holloway, University of London, Egham, Surrey, UK, 2004.
Abstract: The Venn Probability Machine (VPM) is a meta-learning technique introduced by Vovk et al. (2003) that generates provably valid bounds for conditional probabilities in the online learning setting. We present strong empirical results suggesting that these bounds also work well in the traditionally studied offline learning setting. We demonstrate how these bounds can be used to provide useful bounds on the test error rate. This paper presents a simple modification to the existing VPM algorithm to output reliable probability forecasts for each class label. We have tested a variety of new implementations of the VPM, based on Neural Networks, Support Vector Machines, C4.5 Decision Trees and K-Nearest Neighbours. We verify the validity of our new VPM's probability forecasts using Empirical Reliability Curves (ERC), Receiver Operator Characteristic (ROC) curves and loss functions. We compare the results with those taken from a recent survey of existing methods for producing probability forecasts (Lindsay, 2004). Central to the creation of a VPM learner is the need for a fixed mechanism of clustering examples into types. In this paper we demonstrate the traditionally used 'embedded' methodology for defining types and compare it to an 'easy' implementation using simple discretisation methods. Our results demonstrate that the VPM is very effective at producing reliable probability forecasts. PASSWORD PROTECTED
empiricalVPMOfflineCLRC.pdf 253 KB

 

Title: Vladimir Vovk, David Lindsay, Ilia Nouretdinov and Alex Gammerman. Mondrian Confidence Machine, On-line Compression Modelling Project, http://vovk.net/kp, Working Paper #4
Abstract: Mondrian Confidence Machine (MCM) is an on-line prediction algorithm that, given a split of all examples into a finite number of types k and, for each type, a significance level δ_k, outputs as its prediction the set of labels deemed possible at the level δ_k. MCM includes as special cases Transductive Confidence Machine (TCM) and Inductive Confidence Machine (ICM) and is designed to take care of such issues as different risks of false positive and false negative predictions, conditional inference, and a slow teacher. In this paper we generalize known results about TCM and ICM, showing that each MCM is type-wise well-calibrated, in the sense that predictions at significance levels δ_k will be wrong with relative frequency at most δ_k for each type k in the long run. Our experimental results show advantages of MCM over the previously known algorithms.
mcm.ps.zip 157 KB
mcm.pdf 276 KB
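
The idea of a type-wise significance level can be sketched in Python as follows. This is a simplified illustration and not the algorithm from the working paper: it assumes the nonconformity scores have already been computed by some underlying method, and the names `type_conditional_p_value`, `prediction_set` and the layout of `candidates` are hypothetical. The p-value for a candidate label is computed only against earlier examples of the same type k, and the label is kept if that p-value exceeds δ_k.

```python
from typing import Dict, List, Set, Tuple


def type_conditional_p_value(same_type_scores: List[float], new_score: float) -> float:
    """Fraction of examples of this type (including the new one) whose
    nonconformity score is at least as large as the new example's score."""
    at_least = sum(1 for a in same_type_scores if a >= new_score) + 1
    return at_least / (len(same_type_scores) + 1)


def prediction_set(
    candidates: Dict[str, Tuple[str, List[float], float]],
    delta: Dict[str, float],
) -> Set[str]:
    """candidates maps each label to (type k, scores of earlier examples
    of type k, score of the new example under that label); delta maps
    each type k to its significance level. Hypothetical data layout."""
    kept = set()
    for label, (k, scores, new_score) in candidates.items():
        # Keep the label if it is not rejected at the level for its type.
        if type_conditional_p_value(scores, new_score) > delta[k]:
            kept.add(label)
    return kept
```

Because each type has its own level, a stricter δ_k can be set for the type where errors are more costly, which is one way to reflect different risks of false positive and false negative predictions.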

 

Title: D. Lindsay and S. Cox, Learning From String Sequences, Working Paper #1, November 2002.
Abstract: The Universal Similarity Metric (USM) has been demonstrated to give practically useful measures of "similarity" between sequence data. Here we have used the USM as an alternative distance metric in a K-Nearest Neighbours (K-NN) learner to allow effective pattern recognition of variable-length sequence data. We compare this USM approach with the commonly used string-to-word vector approach. Our experiments have used two data sets from divergent domains: (1) spam email filtering and (2) protein subcellular localisation. Our results with these data reveal that the USM-based K-NN learner (1) gives predictions with higher classification accuracy than those output by techniques that use the string-to-word vector approach, and (2) can be used to generate reliable probability forecasts.
usmLearnerALT2004.pdf 190 KB
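
A minimal Python sketch of the general idea, under stated assumptions: the USM is defined in terms of Kolmogorov complexity, which is not computable, so it is commonly approximated with a real compressor. Here `zlib` is used to compute a normalised compression distance, which is then plugged into a plain majority-vote K-NN. The choice of compressor, the function names and the toy data are assumptions for illustration and are not taken from the working paper.

```python
import zlib
from collections import Counter
from typing import List, Tuple


def csize(data: bytes) -> int:
    """Compressed length in bytes, used as a stand-in for complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalised compression distance between two byte sequences."""
    cx, cy = csize(x), csize(y)
    cxy = csize(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def knn_predict(train: List[Tuple[bytes, str]], query: bytes, k: int = 3) -> str:
    """Majority vote among the k training sequences closest to the query."""
    nearest = sorted(train, key=lambda item: ncd(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up for this example): sequences of different lengths.
train = [
    (b"win cash now, click here to claim your free prize today", "spam"),
    (b"cheap pills online, click here for your free offer now", "spam"),
    (b"MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAV", "protein"),
    (b"MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERSGARSKQRRPQGLPNNTASWFTA", "protein"),
]
query = b"click here to claim your free cash prize now"
print(knn_predict(train, query, k=1))
```

No feature extraction or fixed-length encoding is needed, which is what makes a compression-based distance convenient for variable-length sequences.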

Other

Here is a copy of my undergraduate third-year project, which was developed in collaboration with St Bartholomew's Hospital, London.

Title: Transductive Confidence Machine and its Application to Medical Datasets
Abstract: The Transductive Confidence Machine Nearest Neighbours (TCMNN) algorithm and a simple supporting user interface were developed. Different settings of the TCMNN algorithm's parameters were tested on medical data sets, in addition to the use of different Minkowski metrics and polynomial kernels. The effect of increasing the number of nearest neighbours and marking results with significance was also investigated. An SVM implementation of the Transductive Confidence Machine was compared with the Nearest Neighbours implementation. The application of neural networks was investigated as a useful comparison to the transductive algorithms.
dissertation.pdf 1,453 KB

 

Last modified: 8 December, 2004 10:42 PM By: DL