DATABASE INTERNALS 2000 - PROJECT PROPOSAL ========================================== Kevin Pulo, kev@hons.cs.usyd.edu.au Garrick Welsh, gaz@hons.cs.usyd.edu.au TOPIC ----- (Efficient) Clustering in Spatial Databases (ie. Data Mining) TASK ---- We plan to implement a k-medoid method of clustering spatial data. This will most likely be the algorithm described in [1], herein referred to ECH, or the heuristic described in [2], herein referred to as TB. Both are quite challenging k-medoid hill-climbing heuristics, which we feel are appropriate for data-mining of large spatial databases. REASON ------ Clustering is a crucial part of data mining and spatial databases are often extremely large in terms of the number of points in the database. This means that efficient clustering methods are very important to finding sets of points which are similar in some way - the basic task in data mining. K-medoid methods are also more useful for data-mining than the related method of k-means. This is for two main reasons: 1) k-means chooses representative points which are not neccessarily data points, which is bad for categorically spatial data. 2) k-means is usually based on the square of the Euclidean distance between points, which is less useful for geographic data sets (such as GIS). The k-means method is quite efficient with expected complexity of O(n), while the ECH heuristic is O(nlogn) expected and the TB heuristic is O(n^2). Despite the poorer complexity, we feel that k-medoid methods are more useful for data-mining and spatial database applications, and so are more appropriate for the project. PAPER PRESENTATION ------------------ Kevin Pulo will present [1] to the class for the paper presentation, as he is required to read (and possibly also implement) the ECH heuristic for his Honours project. Garrick Welsh expects to present [2] to the class for the paper presentation. However, we have been unable to obtain a copy of this paper so far (it is in Fisher library's storage). Presenting [3] (which is in Fisher Research), which deals with general k-medoid issues, is also a possibility. REFERENCES ---------- [1] V. Estivill-Castro and M. E. Houle. "Robust distance-based clustering with applications to spatial data mining." Algorithmica (Special Issue: Algorithms for Geographical Information). To appear in 2000. http://www.cs.usyd.edu.au/~meh/papers/gisclust.ps.gz [2] M. B. Teitz and P. Bart. "Heuristic methods for estimating the generalized vertex median of a weighted graph." Operations Research, 16:955-961, 1968. [3] R. T. Ng and J. Han. "Efficient and effective clustering methods for spatial data mining." In J. Bocca, M. Jarke, and C. Zaniolo, editors, Proceedings of the 20th Conference on Very Large Data Bases (VLDB), pages 144-155, San Francisco, Ca, 1994. Santiago, Chile, Morgan Kaufmann Publishers.