Robust Distance-Based Clustering

with Applications to Spatial Data Mining

by Vladimir Estivill-Castro and Michael Houle

to appear in 2000 in

height 7pt

**KEVIN PULO**

**24 May 2000**

[-0.5in]Spatial Data Mining

height 3pt

- Large spatial datasets of points.
- Too large to visualise.
- Possibly higher dimensional.

- Want to find sets of similar points within the dataset.
- Exploratory analysis desires techniques which are automatic and fast.
- Avoid human bias.
- Fast response times aid the exploration.

- Most common application is 2 dimensional spatial data in Geographical Information Systems (GIS).

[-0.5in]Clustering Methods

height 3pt

- Clustering is a key technique in Spatial Data Mining.
- Good quality clustering algorithms are usually expensive, ie. or worse.
- -MEANS is a simple clustering algorithm taking time.
- -MEDOIDS (
*-median*) is a more constrained version of -MEANS, taking time. - The paper presents a variant of a -MEDOIDS heuristic, using the Delaunay triangulation to require only time.

[-0.5in]Clustering

height 3pt

- clusters, each represented by a
*representative*point. - The most appropriate number of clusters may or may not be determined by the algorithm.
- Representative point may or may not be a data point of the dataset.
- Points need not belong to a cluster, eg. noise and outlier points.
- Clustering algorithms give approximate solutions to the problem of minimising
the sum of the distances of each data point from their nearest representative
point.

[-0.5in]Sample Dataset

height 3pt

Sample points lying in the 2 dimensional - plane.

[-0.5in]-MEANS

height 3pt

- Take an initial (perhaps random) clustering.
- For each cluster, find the new representative by computing the average of the cluster points.
- Using these new representatives, find a new clustering (ie. assign points to their nearest, possibly different, representative).
- Repeat these two steps until no change occurs.
- This is ``non-combinatorial'' reclassification, the ``combinatorial'' reclassification variant recomputes the clustering each time a new representative is computed.

[-0.5in]-MEANS:

Advantages and Disadvantages

height 3pt

[-0.5in]-MEDOIDS

height 3pt

Similar to -MEANS, but has the additional restriction that the representative points must be chosen from the set of data points.

The TB heuristic, by Teitz and Bart, 1968, is the best known benchmark:

- Maintains a selection of representative points from the data points.
- Considers the data points in a fixed circular ordering .
- If is a representative, it is skipped, otherwise, it considers interchanging with , for all possible representatives .
- If any of these yield an improvement, the interchange is performed, and the search continues from .
- The search stops with a local optimum when a full cycle of the data points gives no improvement.

The circular ordering of points balances the need to explore various interchanges against the greed of improving the solution quickly.

[-0.5in]-MEDOIDS:

Advantages and Disadvantages

height 3pt

[-0.5in]Delaunay Triangulation

height 3pt

Delaunay triangulations succinctly encapsulate the proximity information of a set of points.

- Points and are joined by the Delaunay edge iff they are ``nearest neighbours''.
- Delaunay edge some circle through and containing no other points.
- For data points, at most edges.
- Average number of neighbours of a point is less than 6.
- Minimum angle of all triangles is the maximum possible.
- Robust computation in time.
- nearest neighbours of a point found in time.

[-0.5in]Sample Dataset:

Delaunay Triangulation

height 3pt

Sample points lying in the 2 dimensional - plane
and their Delaunay triangulation.

[-0.5in]-MEDOIDS Delaunay Variant

height 3pt

- Evaluating exactly requires time per interchange possibility.
- Greatest contribution to is made by outliers (and initially there are more outlying representatives).
- Increase performance not by limiting the points examined by TB, but by approximating with .
- considers only the nearest neighbours, precomputed for all points in time using the Delaunay triangulation.
- Evaluating takes only time per interchange possibility.
- If is chosen to be , then the overall time bound simplifies to . The choice of determines a speed-quality tradeoff.

[-0.5in]Sample Dataset:

Clustering Results

height 3pt

The clustering of sample points produced by the algorithm.

[-0.5in]-MEDOIDS Delaunay Variant:

Advantages and Disadvantages

height 3pt

[-0.5in]Finding an Initial Clustering

height 3pt

The following method is used to find the initial clustering in time.

- Put each data point into its own set.
- Sort the edges of the Delaunay triangulation by increasing length.
- For each edge, merge the sets connected by its endpoints.
- Stop when sets of size at least (say, 2 or 3) remain.
- These sets form the initial clustering, representatives are chosen arbitrarily from them.

[-0.5in]Finding Automatically

height 3pt

This method can be adapted to find a suitable value of for the dataset in time.

Consider the profile of sorted edge lengths.

Merge sets as before until just before the inflexion point, these are all the short edges which are ``obviously'' within clusters.

The number of sets of size at least is then taken as , and the initial representatives chosen from these sets.

[-0.5in]Comparison with Other

Subquadratic Algorithms

height 3pt

- DBSCAN (1996) --
expected

- STING (1997) -- construction, queries

[-0.5in]Experiments

height 3pt

- Clustered dataset generator program
- Additive noise (random points through the dataset).
- Multiplicative noise (random perturbation of points).
- Guarantees two close clusters, relative to point spacings within clusters.

- Compared -MEANS (random and Delaunay-based initialisations),
original TB (-MEDOIDS), modified TB (both given and
not given ).
- -MEANS only competitive when no noise.
- Modified TB competitive with original TB when no multiplicative noise.
- With multiplicative noise, modified TB worse than original TB, but still better than the Delaunay-based initialisation of -MEANS.
- Modified TB, when not given , performs at least as well as the Delaunay-based initialisation of -MEANS.

[-0.5in]Blank Slide

height 3pt

2000-08-23