Publications
Copyright laws permitting, I aim to use this page to distribute
any publications that I am involved with.
Conference/Journal
| Title: Improving
the Reliability of Naive Bayes and Decision Tree Learners, David
Lindsay and Sian Cox, 5th IEEE International Conference on Data
Mining, November 2004. |
| Abstract: The C4.5 Decision Tree and Naive
Bayes learners are known to produce unreliable probability
forecasts. We have used simple Binning (Zadrozny et al., 2001)
and Laplace Transform (Cestnik et al., 1990) techniques to improve
the reliability of these learners and compare their effectiveness
with that of the newly developed Venn Probability Machine (VPM)
meta-learner (Vovk et al., 2003). We assess improvements in reliability
using loss functions, Receiver Operator Characteristic (ROC)
curves and Empirical Reliability Curves (ERC). The VPM improves
reliability more than the simple techniques, although at the
cost of increased computational intensity and a slight increase
in error rate. These trade-offs are discussed. |
| short_ReliableJ48andNB_ICDM2004.ps | 198 KB |
| short_ReliableJ48andNB_ICDM2004.pdf | 293 KB |
Technical Reports/Working Papers
| Title: D. Lindsay. Visualising and improving
reliability - a machine learning perspective. CLRC-TR-04-01,
Technical Report, Computer Learning Research Centre, Royal Holloway,
University of London, Egham, Surrey, UK, 2004. |
| Abstract: Reliable, or well calibrated,
probability forecasts are those which do not 'lie'; for
example, a learner is reliable if the events to which it assigns
predicted probability p occur with relative frequency
p. Machine learning studies often use techniques such
as loss functions and Receiver Operator Characteristic curves
(ROC) to assess reliability. However, these techniques cannot
indicate the over- or under-estimation (unreliability) of probability
forecasts. To address this, I present a method for constructing
an Empirical Reliability Curve (ERC) which has been developed
from those used in psychological and meteorological studies.
The ERC approach has been used to visualise the reliability
of probability forecasts made by several well-known learners
on benchmark datasets to reveal that some of the accurate learners
tested do not produce reliable forecasts and vice versa. In
addition, the ERC approach has been utilised to re-calibrate
(improve reliability of) probability forecasts of several base
learners. This study compares its improvement of probability
forecasts output by base learners with that of various meta-learning
techniques including Boosting, Bagging, Pairwise Coupled Logistic
Regression, Find Best Weights and Laplace Smoothing. Experiments
have been conducted using both binary and multi-class real life
datasets, with varying size, noise and complexity.
PASSWORD PROTECTED |
| visualReliableCLRC.pdf | 11,052 KB |
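As a rough illustration of the idea behind a reliability curve (not the code used in the report), the sketch below groups forecasts into equal-width bins and compares the mean predicted probability in each bin with the observed relative frequency. The function name `reliability_points`, the binning scheme and the toy data are all assumptions made for this example; the report's ERC construction may differ.

```python
from collections import defaultdict


def reliability_points(probs, outcomes, n_bins=10):
    """Return (mean predicted probability, observed frequency, count) per bin.

    probs    -- predicted probabilities of the positive event, each in [0, 1]
    outcomes -- 1 if the event occurred, 0 otherwise
    """
    bins = defaultdict(list)
    for p, y in zip(probs, outcomes):
        # Equal-width bins; a forecast of exactly 1.0 goes into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    points = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_p = sum(p for p, _ in pairs) / len(pairs)
        freq = sum(y for _, y in pairs) / len(pairs)
        points.append((mean_p, freq, len(pairs)))
    return points


# Toy data (hypothetical): the forecasts of 0.85 are over-confident,
# because the event occurred only half of the time.
probs = [0.15, 0.15, 0.85, 0.85, 0.85, 0.85]
outcomes = [0, 0, 1, 1, 0, 0]
for mean_p, freq, n in reliability_points(probs, outcomes):
    print(f"predicted {mean_p:.2f}  observed {freq:.2f}  (n={n})")
```

Points lying on the diagonal (predicted equals observed) indicate reliable forecasts; points below it indicate over-estimation and points above it under-estimation.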
| Title: D. Lindsay. Reliable Probability
Forecasting Using the Venn Probability Machine Learner. CLRC-TR-04-01,
Technical Report, Computer Learning Research Centre, Royal Holloway,
University of London, Egham, Surrey, UK, 2004. |
| Abstract: The Venn Probability Machine
(VPM) is a meta-learning technique introduced by Vovk et
al. (2003) that generates provably valid bounds for conditional
probabilities in the online learning setting. However, we present
strong empirical results suggesting that these bounds also work
well in the traditionally studied offline learning setting.
We demonstrate how these bounds can be used to provide
useful bounds on the
test error rate. This paper presents a simple modification to
the existing VPM algorithm to output reliable probability forecasts
for each class label. We have tested a variety of new implementations
of the VPM, based on Neural Networks, Support Vector Machines,
C4.5 Decision Trees and K-Nearest
Neighbours. We verify the validity of our new VPM's probability
forecasts using Empirical Reliability Curves (ERC), Receiver
Operator Characteristic (ROC) Curves and loss functions. We
compare the results with those taken from a recent survey of
existing methods for producing probability forecasts (Lindsay,
2004). Central to the creation of a VPM learner is the need
for a fixed mechanism for clustering examples into types. In this paper
we demonstrate the traditionally used 'embedded' methodology
for defining types and compare it to an 'easy' implementation
using simple discretisation methods. Our results demonstrate
that the VPM is very effective at producing reliable probability
forecasts.
PASSWORD PROTECTED |
| empiricalVPMOfflineCLRC.pdf | 253 KB |
| Title: Vladimir
Vovk, David Lindsay, Ilia Nouretdinov and Alex Gammerman. Mondrian
Confidence Machine, On-line Compression Modelling Project, http://vovk.net/kp,
Working Paper #4 |
| Abstract: Mondrian Confidence Machine
(MCM) is an on-line prediction algorithm that, given a split
of all examples into a finite number of types k and for
each type a significance level \delta_k, outputs as its
prediction the set of labels deemed possible at the level \delta_k.
MCM includes as special cases Transductive Confidence Machine
(TCM) and Inductive Confidence Machine (ICM) and is designed
to take care of such issues as
different risks of false positive and false negative predictions,
conditional inference, and a slow teacher. In this paper we
generalize known results about TCM and ICM showing that each
MCM is type-wise well-calibrated, in the sense that predictions
at significance levels \delta_k will be wrong with relative
frequency at most \delta_k for each type k in
the long run. Our experimental results show advantages of MCM
over the previously known algorithms. |
| mcm.ps.zip | 157 KB |
| mcm.pdf | 276 KB |
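The type-wise prediction step described in the MCM abstract can be sketched roughly as follows. This is a simplified illustration, not the algorithm as specified in the working paper: it assumes nonconformity scores have already been computed by some underlying method, and the names `mondrian_p_value` and `prediction_set`, the scores and the types are all hypothetical.

```python
def mondrian_p_value(scores, types, new_score, new_type):
    """p-value of a new example, computed only within its own type.

    It is the fraction of examples of the same type (counting the new
    example itself) whose nonconformity score is at least as large.
    """
    same_type = [s for s, t in zip(scores, types) if t == new_type]
    at_least = sum(1 for s in same_type if s >= new_score)
    return (at_least + 1) / (len(same_type) + 1)


def prediction_set(scores, types, candidates, delta):
    """Labels deemed possible at the type-specific significance levels.

    candidates -- maps each label to the (score, type) the new example
                  would receive if it had that label
    delta      -- maps each type k to its significance level delta_k
    """
    return {
        label
        for label, (score, k) in candidates.items()
        if mondrian_p_value(scores, types, score, k) > delta[k]
    }


# Hypothetical nonconformity scores for eight earlier examples of two types.
scores = [0.1, 0.2, 0.3, 0.9, 0.2, 0.4, 0.5, 0.8]
types = ["a", "a", "a", "a", "b", "b", "b", "b"]
candidates = {"pos": (0.25, "a"), "neg": (0.95, "b")}
delta = {"a": 0.2, "b": 0.3}
print(prediction_set(scores, types, candidates, delta))  # {'pos'}
```

Because each p-value is computed against examples of the same type only, a separate significance level can be set per type, which is what allows different risks to be attached to, say, false positives and false negatives.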
| Title: D. Lindsay and S. Cox, Learning
From String Sequences, Working Paper #1, November 2002. |
| Abstract: The Universal Similarity Metric
(USM) has been demonstrated to give practically useful measures
of "similarity" between sequence data. Here we have
used the USM as an alternative distance metric in a K-Nearest
Neighbours (K-NN) learner to allow effective pattern
recognition of variable length sequence data. We compare this
USM approach with the commonly used string-to-word vector approach.
Our experiments have used two data sets from divergent domains:
(1) spam email filtering and (2) protein subcellular localisation.
Our results with this data reveal that the USM-based K-NN
learner (1) gives predictions with higher classification accuracy
than those output by techniques that use the string-to-word
vector approach, and (2) can be used to generate reliable probability
forecasts. |
| usmLearnerALT2004.pdf | 190 KB |
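To give a flavour of a compression-based distance inside a K-NN learner, here is a minimal sketch. It assumes the USM is approximated by the normalised compression distance computed with zlib, which is a common practical stand-in and not necessarily what the paper used; the function names and the toy messages are invented for the example.

```python
import zlib
from collections import Counter


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, a crude proxy for its complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalised compression distance: small when x helps to compress y."""
    cx, cy = compressed_size(x), compressed_size(y)
    return (compressed_size(x + y) - min(cx, cy)) / max(cx, cy)


def knn_predict(train, query: bytes, k: int = 1):
    """Majority label among the k training sequences closest to the query."""
    nearest = sorted(train, key=lambda item: ncd(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical variable-length sequences; no feature vectors are needed.
train = [
    (b"buy cheap pills now " * 20, "spam"),
    (b"meeting agenda for monday " * 20, "ham"),
]
query = b"cheap pills, buy now " * 15
print(knn_predict(train, query, k=1))
```

The point of the design is that the distance works directly on raw sequences of any length, so no string-to-word vector conversion is required before the K-NN step.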
Other
Here is a copy of my third-year undergraduate
project, which was developed in collaboration with St
Bartholomew's Hospital, London.
| Title: Transductive Confidence Machine
and its Application to Medical Datasets |
| Abstract: The Transductive Confidence Machine
Nearest Neighbours (TCMNN) algorithm and a simple supporting
user interface were developed. Different settings of the TCMNN
algorithm's parameters were tested on medical data sets, in
addition to the use of different Minkowski metrics and polynomial
kernels. The effect of increasing the number of nearest neighbours
and marking results with significance was also investigated.
An SVM implementation of the Transductive Confidence Machine was
compared with the Nearest Neighbours implementation. The application
of neural networks was investigated as a useful comparison to
the transductive algorithms. |
| dissertation.pdf | 1,453 KB |