DATABASE INTERNALS 2000 - PROJECT PROPOSAL
==========================================
Kevin Pulo, kev@hons.cs.usyd.edu.au
Garrick Welsh, gaz@hons.cs.usyd.edu.au
TOPIC
-----
(Efficient) Clustering in Spatial Databases (ie. Data Mining)
TASK
----
We plan to implement a k-medoid method of clustering spatial data. This
will most likely be the algorithm described in [1], herein referred to ECH,
or the heuristic described in [2], herein referred to as TB. Both are
quite challenging k-medoid hill-climbing heuristics, which we feel are
appropriate for data-mining of large spatial databases.
REASON
------
Clustering is a crucial part of data mining and spatial databases are often
extremely large in terms of the number of points in the database. This
means that efficient clustering methods are very important to finding sets
of points which are similar in some way - the basic task in data mining.
K-medoid methods are also more useful for data-mining than the related
method of k-means. This is for two main reasons:
1) k-means chooses representative points which are not neccessarily data
points, which is bad for categorically spatial data.
2) k-means is usually based on the square of the Euclidean distance between
points, which is less useful for geographic data sets (such as GIS).
The k-means method is quite efficient with expected complexity of O(n),
while the ECH heuristic is O(nlogn) expected and the TB heuristic is
O(n^2).
Despite the poorer complexity, we feel that k-medoid methods are more
useful for data-mining and spatial database applications, and so are more
appropriate for the project.
PAPER PRESENTATION
------------------
Kevin Pulo will present [1] to the class for the paper presentation, as he
is required to read (and possibly also implement) the ECH heuristic for his
Honours project.
Garrick Welsh expects to present [2] to the class for the paper
presentation. However, we have been unable to obtain a copy of this paper
so far (it is in Fisher library's storage). Presenting [3] (which is in
Fisher Research), which deals with general k-medoid issues, is also a
possibility.
REFERENCES
----------
[1] V. Estivill-Castro and M. E. Houle. "Robust distance-based clustering
with applications to spatial data mining." Algorithmica (Special Issue:
Algorithms for Geographical Information). To appear in 2000.
http://www.cs.usyd.edu.au/~meh/papers/gisclust.ps.gz
[2] M. B. Teitz and P. Bart. "Heuristic methods for estimating the
generalized vertex median of a weighted graph." Operations Research,
16:955-961, 1968.
[3] R. T. Ng and J. Han. "Efficient and effective clustering methods for
spatial data mining." In J. Bocca, M. Jarke, and C. Zaniolo, editors,
Proceedings of the 20th Conference on Very Large Data Bases (VLDB), pages
144-155, San Francisco, Ca, 1994. Santiago, Chile, Morgan Kaufmann
Publishers.