
Paper Presentation:
Robust Distance-Based Clustering
with Applications to Spatial Data Mining
by Vladimir Estivill-Castro and Michael Houle
to appear in 2000 in Algorithmica

KEVIN PULO

24 May 2000

Spatial Data Mining

Large spatial datasets of points.
- Too large to visualise.
- Possibly higher dimensional.
Want to find sets of similar points within the dataset.
Exploratory analysis calls for techniques that are automatic and fast.
- Avoid human bias.
- Fast response times aid the exploration.
The most common application is 2-dimensional spatial data in Geographical Information Systems (GIS).

Clustering Methods

Clustering is a key technique in Spatial Data Mining.
Good quality clustering algorithms are usually expensive, i.e. $O(n^{2})$ or worse.
$k$-MEANS is a simple clustering algorithm taking $O(n)$ time per iteration (for fixed $k$).
$k$-MEDOIDS ($p$-median) is a more constrained version of $k$-MEANS, taking $\Omega (n^{2})$ time.
The paper presents a variant of a $k$-MEDOIDS heuristic, using the Delaunay triangulation to require only $O(n\log n)$ time.

Clustering

$k$ clusters, each represented by a representative point.
The most appropriate number of clusters $k$ may or may not be determined by the algorithm.
Representative point may or may not be a data point of the dataset.
Points need not belong to a cluster, e.g. noise and outlier points.
Clustering algorithms give approximate solutions to the problem of minimising the sum of the distances of each data point from its nearest representative point.

$$\textrm{minimise } M(C)=\sum ^{n-1}_{i=0}d(s_{i},\textrm{rep}[s_{i},C])$$
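
As a concrete reading of the objective, the following is a minimal sketch that evaluates $M(C)$ directly. It assumes Euclidean distance for $d$ and represents points as coordinate tuples; the function names `nearest_rep` and `objective` are illustrative, not taken from the paper.

```python
import math

def nearest_rep(point, reps):
    """Return the representative closest to `point` (Euclidean distance)."""
    return min(reps, key=lambda r: math.dist(point, r))

def objective(points, reps):
    """M(C): sum over all points of the distance to the nearest representative."""
    return sum(math.dist(p, nearest_rep(p, reps)) for p in points)

# Four points, two representatives: two points coincide with a representative
# and the other two each lie at distance 1 from one.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
reps = [(0.0, 0.0), (5.0, 0.0)]
print(objective(points, reps))  # 2.0
```

Each evaluation visits every point once, which is the $O(n)$ cost per candidate solution (for fixed $k$) that the later slides try to avoid.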

Sample Dataset

[Figure: dataset.eps]

Sample points lying in the 2-dimensional $x$-$y$ plane.

$k$-MEANS

Take an initial (perhaps random) clustering.
For each cluster, find the new representative by computing the average of the cluster points.
Using these new representatives, find a new clustering (i.e. assign points to their nearest, possibly different, representative).
Repeat these two steps until no change occurs.
This is ``non-combinatorial'' reclassification; the ``combinatorial'' reclassification variant recomputes the clustering each time a new representative is computed.
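
The non-combinatorial loop above can be sketched as follows. This is a minimal illustration, not the implementation evaluated in the paper: it assumes Euclidean distance, starts from initial representatives rather than an initial clustering, and keeps the old representative if a cluster becomes empty. The name `k_means` and the `max_iter` safeguard are assumptions made here.

```python
import math

def k_means(points, initial_reps, max_iter=100):
    """Non-combinatorial k-MEANS: alternate reassignment and averaging."""
    reps = list(initial_reps)
    assign = None
    for _ in range(max_iter):
        # Assign every point to its nearest representative.
        new_assign = [min(range(len(reps)), key=lambda j: math.dist(p, reps[j]))
                      for p in points]
        if new_assign == assign:
            break  # no change occurred, so the loop has converged
        assign = new_assign
        # Recompute each representative as the average of its cluster points.
        for j in range(len(reps)):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # assumption: an empty cluster keeps its old representative
                reps[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return reps, assign

reps, assign = k_means([(0, 0), (0, 2), (10, 0), (10, 2)], [(0, 0), (10, 0)])
print(reps)    # [(0.0, 1.0), (10.0, 1.0)]
print(assign)  # [0, 0, 1, 1]
```

Note that the returned representatives are averages and need not be data points, which is the property $k$-MEDOIDS removes.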

$k$-MEANS: Advantages and Disadvantages

Advantages:
- Fast: $O(n)$ linear time.
- Simple.

Disadvantages:
- Requires $k$ to be specified.
- Representative... ...oor local optimum.
- Uses the square of the Euclidean metric.

$k$-MEDOIDS

Similar to $k$-MEANS, but has the additional restriction that the representative points must be chosen from the set of data points.

The TB heuristic, by Teitz and Bart, 1968, is the best known benchmark:

Maintains a selection of $k$ representative points from the $n$ data points.
Considers the data points in a fixed circular ordering $(s_{0},s_{1},s_{2},\ldots ,s_{n-1})$.
If $s_{i}$ is a representative, it is skipped; otherwise, the heuristic considers interchanging $s_{i}$ with $s_{j}$, for every current representative $s_{j}$.
If any of these yields an improvement, the interchange is performed, and the search continues from $s_{i+1}$.
The search stops with a local optimum when a full cycle of the data points gives no improvement.

The number of points considered for interchange per iteration is typically constant. However, all $n$ points must be examined for termination.

The circular ordering of points balances the need to explore various interchanges against the greed of improving the solution quickly.
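
A minimal sketch of the interchange loop described above, evaluating $M(C)$ exactly for every candidate. It assumes Euclidean distance and identifies representatives by their index into the point list. When several interchanges improve the solution it takes the best one, which is one possible reading of the slide rather than a detail confirmed by it. The names `cost` and `teitz_bart` are illustrative.

```python
import math

def cost(points, rep_idx):
    """M(C) for representatives given by their indices into `points`."""
    return sum(min(math.dist(p, points[r]) for r in rep_idx) for p in points)

def teitz_bart(points, rep_idx):
    """Interchange heuristic over a fixed circular ordering of the points."""
    reps = list(rep_idx)
    best = cost(points, reps)
    n = len(points)
    i = 0          # current position in the circular ordering
    unchanged = 0  # consecutive points examined without an improvement
    while unchanged < n:  # stop after a full cycle with no improvement
        improved = False
        if i not in reps:  # representatives are skipped
            # Try replacing each current representative s_j by s_i.
            trials = [(cost(points, reps[:j] + [i] + reps[j + 1:]), j)
                      for j in range(len(reps))]
            trial_cost, j = min(trials)
            if trial_cost < best:  # assumption: take the best improving swap
                reps[j], best = i, trial_cost
                improved = True
        unchanged = 0 if improved else unchanged + 1
        i = (i + 1) % n
    return reps, best

pts = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
reps, best = teitz_bart(pts, [0, 1])
print(sorted(reps), best)  # [1, 4] 4.0
```

Every candidate interchange here costs a full $O(nk)$ evaluation, which is the expense the Delaunay variant reduces by approximating $M(C)$.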

$k$-MEDOIDS: Advantages and Disadvantages

Advantages:
- Representative points are taken from the dataset poin...
- ... Higher quality local optima found, even with noise and outliers.

Disadvantages:
- Slow: $\Omega (n^{2})$ superquadratic time.
- ...
- Exact solution is NP-hard, i.e. no polynomial-time algorithm is known.

Delaunay Triangulation

Delaunay triangulations succinctly encapsulate the proximity information of a set of points.

Points $a$ and $b$ are joined by the Delaunay edge $ab$ iff they are ``nearest neighbours''.
Delaunay edge $ab$ $\Leftrightarrow$ $\exists$ some circle through $a$ and $b$ containing no other points.
For $n$ data points, at most $3n-6$ edges.
Average number of neighbours of a point is less than 6.
Minimum angle of all triangles is the maximum possible.
Robust computation in $O(n\log n)$ time.
$u$ nearest neighbours of a point found in $O(u\log u)$ time.
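
The edge bound and the average number of neighbours are linked by a short counting argument, sketched here assuming $n\geq 3$ and using the standard fact that a planar graph on $n$ vertices has at most $3n-6$ edges. The symbols $E$ (number of Delaunay edges) and $\deg (s_{i})$ (number of neighbours of $s_{i}$) are introduced here for illustration. Since every edge contributes to the neighbour count of exactly two points:

$$\frac{1}{n}\sum ^{n-1}_{i=0}\deg (s_{i})=\frac{2E}{n}\leq \frac{2(3n-6)}{n}=6-\frac{12}{n}<6$$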

Sample Dataset: Delaunay Triangulation

[Figure: delaunay.eps]

Sample points lying in the 2-dimensional $x$-$y$ plane and their Delaunay triangulation.

$k$-MEDOIDS Delaunay Variant

Evaluating $M(C)$ exactly requires $O(n)$ time per interchange possibility.
Greatest contribution to $M(C)$ is made by outliers (and initially there are more outlying representatives).
Increase performance not by limiting the points examined by TB, but by approximating $M(C)$ with $M'(C)$.
$M'(C)$ considers only the $u$ nearest neighbours, precomputed for all points in $O(n\log n+un\log u)$ time using the Delaunay triangulation.
Evaluating $M'(C)$ takes only $O(uk)$ time per interchange possibility.
If $u$ is chosen to be $\Theta (\frac{\log n}{\log \log n})$, then the overall time bound simplifies to $O(n\log n)$. The choice of $u$ determines a speed-quality tradeoff.
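
To see why this choice of $u$ keeps the neighbour precomputation within $O(n\log n)$, here is a short derivation. It assumes the $O(u\log u)$ per-point neighbour search quoted for the Delaunay triangulation, and $n$ large enough that $\log \log n\geq 1$. Taking $u=\frac{\log n}{\log \log n}$:

$$\log u=\log \log n-\log \log \log n\leq \log \log n\quad \Rightarrow \quad u\log u\leq \frac{\log n}{\log \log n}\cdot \log \log n=\log n$$

so finding the $u$ nearest neighbours of all $n$ points costs $O(nu\log u)=O(n\log n)$, the same order as computing the triangulation itself.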

Sample Dataset: Clustering Results

[Figure: clustering.eps]

The clustering of sample points produced by the algorithm.

$k$-MEDOIDS Delaunay Variant: Advantages and Disadvantages

Advantages:
- Fast: $O(n\log n)$ subquadratic time.
- Rep... ...he actual Euclidean metric.
- Discovers $k$ automatically.

Disadvantages:
- Complicated.
- Does not classify outliers.
- Sensitive to initial representatives.

Finding an Initial Clustering

The following method is used to find the initial clustering in $O(n\log n)$ time.

Put each data point into its own set.
Sort the edges of the Delaunay triangulation by increasing length.
For each edge, merge the sets containing its endpoints.
Stop when $k$ sets of size at least $\nu$ (say, 2 or 3) remain.
These sets form the initial clustering; representatives are chosen arbitrarily from them.

This technique maximises the minimum distance between initial clusters.
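
The merging procedure can be sketched with a union-find structure. This is a sketch under several assumptions: the Delaunay edges are supplied by the caller as index pairs (computing the triangulation is out of scope here), and the stopping rule is read as "the last moment at which exactly $k$ sets of size at least $\nu$ exist". The two-pass structure and the names `merge_prefix` and `initial_clusters` are illustrative choices, not the paper's.

```python
import math
from collections import defaultdict

def merge_prefix(points, edges, nu, stop=None):
    """Merge along the `stop` shortest edges (all edges if `stop` is None).

    Returns the sets of size >= nu, plus the number of such sets after
    each processed edge.
    """
    order = sorted(edges, key=lambda e: math.dist(points[e[0]], points[e[1]]))
    parent = list(range(len(points)))
    size = [1] * len(points)

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    big = sum(s >= nu for s in size)  # current number of sets of size >= nu
    counts = []
    for a, b in order[:stop]:
        ra, rb = find(a), find(b)
        if ra != rb:
            big -= (size[ra] >= nu) + (size[rb] >= nu)
            parent[rb] = ra
            size[ra] += size[rb]
            big += size[ra] >= nu
        counts.append(big)
    groups = defaultdict(list)
    for i in range(len(points)):
        groups[find(i)].append(i)
    return [g for g in groups.values() if len(g) >= nu], counts

def initial_clusters(points, edges, k, nu=2):
    """Return k initial clusters as lists of point indices."""
    _, counts = merge_prefix(points, edges, nu)
    # Assumed stopping rule: last step with exactly k sufficiently large sets.
    # Raises ValueError if that count is never reached.
    stop = max(i for i, c in enumerate(counts) if c == k) + 1
    clusters, _ = merge_prefix(points, edges, nu, stop)
    return clusters

pts = [(0, 0), (1, 0), (2.1, 0), (10, 0), (11.2, 0), (12.5, 0)]
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
print(initial_clusters(pts, edges, k=2))  # [[0, 1, 2], [3, 4, 5]]
```

Because a Delaunay triangulation has at most $3n-6$ edges, the sort dominates at $O(n\log n)$. A representative can then be picked arbitrarily from each returned list.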

Finding $k$ Automatically

This method can be adapted to find a suitable value of $k$ for the dataset in $O(n\log n)$ time.

Consider the profile of sorted edge lengths.

Merge sets as before until just before the inflexion point; the edges merged up to there are all the short edges which are ``obviously'' within clusters.

The number of sets of size at least $\nu$ is then taken as $k$, and the initial representatives are chosen from these sets.

[Figure: edgelengths.eps]
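
A rough sketch of estimating $k$ from the edge-length profile follows. The slides do not say how the inflexion point is located, so this example substitutes a crude stand-in: the largest jump between consecutive sorted edge lengths. That substitution, the caller-supplied edge list, and the name `estimate_k` are all assumptions made for illustration. At least two edges are required.

```python
import math
from collections import Counter

def estimate_k(points, edges, nu=2):
    """Count the sets of size >= nu after merging along the short edges."""
    order = sorted(edges, key=lambda e: math.dist(points[e[0]], points[e[1]]))
    lengths = [math.dist(points[a], points[b]) for a, b in order]
    # ASSUMPTION: the largest gap between consecutive sorted lengths stands
    # in for the inflexion point of the edge-length profile.
    cut = max(range(1, len(lengths)), key=lambda i: lengths[i] - lengths[i - 1])
    parent = list(range(len(points)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merge along every edge shorter than the cut.
    for a, b in order[:cut]:
        parent[find(a)] = find(b)
    sizes = Counter(find(i) for i in range(len(points)))
    return sum(1 for s in sizes.values() if s >= nu)

pts = [(0, 0), (1, 0), (2.1, 0), (10, 0), (11.2, 0), (12.5, 0)]
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
print(estimate_k(pts, edges))  # 2
```

On real data the profile is much smoother than in this toy example, so a more careful inflexion test would be needed.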

Comparison with Other Subquadratic Algorithms

DBSCAN (1996) -- $\Theta (n\log n)$ expected

Advantages:
- Finds clusters of arbitrary shape, not just convex.
- Good quality clusterings produced.

Disadvantages:
- Requires user assistance in removing outliers.

STING (1997) -- $O(n)$ construction, $\Theta (n)$ queries

Advantages:
- After initial processing, clustering time is sublinea... ...ar to multidimensional database, thus suited to SQL-type queries.

Disadvantages:
- Grid size is hard to determine.
- Assumes mixture... ...ogonal polygons.
- Gives poor, non-robust clustering results.

Experiments

Clustered dataset generator program
- Additive noise (random points through the dataset).
- Multiplicative noise (random perturbation of points).
- Guarantees two close clusters, relative to point spacings within clusters.
Compared $k$-MEANS (random and Delaunay-based initialisations), original TB ($k$-MEDOIDS), modified TB (both given and not given $k$).
- $k$-MEANS only competitive when no noise.
- Modified TB competitive with original TB when no multiplicative noise.
- With multiplicative noise, modified TB worse than original TB, but still better than the Delaunay-based initialisation of $k$-MEANS.
- Modified TB, when not given $k$, performs at least as well as the Delaunay-based initialisation of $k$-MEANS.

Kevin Pulo
2000-08-23
