next up previous
Next: About this document ... Up: Database Internals

Paper Presentation:
Robust Distance-Based Clustering
with Applications to Spatial Data Mining
by Vladimir Estivill-Castro and Michael Houle
to appear in 2000 in Algorithmica





height 7pt

KEVIN PULO

24 May 2000

[-0.5in]Spatial Data Mining


height 3pt

[-0.5in]Clustering Methods


height 3pt

[-0.5in]Clustering


height 3pt

[-0.5in]Sample Dataset


height 3pt



\resizebox*{15cm}{!}{\includegraphics{dataset.eps}}



Sample points lying in the 2 dimensional \( x \)-\( y \) plane.

[-0.5in]\( k \)-MEANS


height 3pt

[-0.5in]\( k \)-MEANS:
Advantages and Disadvantages


height 3pt


\begin{dinglist}{52}
\item Fast - \( O(n) \)linear time.
\item Simple.
\end{dinglist}

\begin{dinglist}{56}
\item Requires \( k \)to be specified.
\item Representative...
...oor local optimum.
\item Uses the square of the Euclidean metric.
\end{dinglist}

[-0.5in]\( k \)-MEDOIDS


height 3pt

Similar to \( k \)-MEANS, but has the additional restriction that the representative points must be chosen from the set of data points.

The TB heuristic, by Teitz and Bart, 1968, is the best known benchmark:

The number of points considered for interchange per iteration is typically constant. However, all \( n \) points must be examined for termination.

The circular ordering of points balances the need to explore various interchanges against the greed of improving the solution quickly.

[-0.5in]\( k \)-MEDOIDS:
Advantages and Disadvantages


height 3pt


\begin{dinglist}{52}
\item Representative points are taken from the dataset poin...
... Higher quality local optima found, even with noise and outliers.
\end{dinglist}

\begin{dinglist}{56}
\item Slow - \( \Omega (n^{2}) \)superquadratic time.
\it...
...m Exact solution is NP-hard, ie. superpolynomial time complexity.
\end{dinglist}

[-0.5in]Delaunay Triangulation


height 3pt

Delaunay triangulations succinctly encapsulate the proximity information of a set of points.

[-0.5in]Sample Dataset:
Delaunay Triangulation


height 3pt



\resizebox*{15cm}{!}{\includegraphics{delaunay.eps}}



Sample points lying in the 2 dimensional \( x \)-\( y \) plane and their Delaunay triangulation.

[-0.5in]\( k \)-MEDOIDS Delaunay Variant


height 3pt

[-0.5in]Sample Dataset:
Clustering Results


height 3pt



\resizebox*{15cm}{!}{\includegraphics{clustering.eps}}



The clustering of sample points produced by the algorithm.

[-0.5in]\( k \)-MEDOIDS Delaunay Variant:
Advantages and Disadvantages


height 3pt


\begin{dinglist}{52}
\item Fast - \( O(n\log n) \)subquadratic time.
\item Rep...
...he actual Euclidean metric.
\item Discovers \( k \)automatically.
\end{dinglist}

\begin{dinglist}{56}
\item Complicated.
\item Does not classify outliers.
\item Sensitive to initial representatives.
\end{dinglist}

[-0.5in]Finding an Initial Clustering


height 3pt

The following method is used to find the initial clustering in \( O(n\log n) \) time.

This technique maximises the minimum distance between initial clusters.

[-0.5in]Finding \( k \) Automatically


height 3pt

This method can be adapted to find a suitable value of \( k \) for the dataset in \( O(n\log n) \) time.

Consider the profile of sorted edge lengths.

Merge sets as before until just before the inflexion point, these are all the short edges which are ``obviously'' within clusters.

The number of sets of size at least \( \nu \) is then taken as \( k \), and the initial representatives chosen from these \( \nu \) sets.



\resizebox*{13cm}{!}{\includegraphics{edgelengths.eps}}



[-0.5in]Comparison with Other
Subquadratic Algorithms


height 3pt

[-0.5in]Experiments


height 3pt

[-0.5in]Blank Slide


height 3pt

 




next up previous
Next: About this document ... Up: Database Internals
Kevin Pulo
2000-08-23