
2.1.1 Data Format

The data formats were chosen to be simple ASCII formats that are easily parsed by both programs and humans. Two formats are used: an unclustered dataset and a clustered dataset.

The unclustered dataset simply specifies points followed by their annotations, one per line. Each line has the format:

(<real>,<real>) <string>
The two real (decimal) numbers specify the location of the point in the plane, and the string is a string representation of the point's annotation.
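
As an illustration, a line in this format could be parsed with a short routine like the one below. This is only a sketch: the function name parse_unclustered_line is hypothetical, and the tolerance for extra whitespace and the use of float() to read the real numbers are assumptions, since the text above does not spell out those details.

```python
import re

# Assumed grammar: "(<real>,<real>) <string>".
# Optional whitespace around the numbers is tolerated (an assumption).
_POINT_LINE = re.compile(r"^\(\s*([^,()\s]+)\s*,\s*([^,()\s]+)\s*\)\s*(.*)$")


def parse_unclustered_line(line):
    """Return (x, y, annotation) for one line of an unclustered dataset."""
    match = _POINT_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a point line: {line!r}")
    x, y, annotation = match.groups()
    # float() raises ValueError if a coordinate is not a valid real number.
    return float(x), float(y), annotation
```

Because the annotation is taken to be everything after the closing parenthesis, it may itself contain spaces.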

The clustered dataset uses the same format as the unclustered dataset, except that each point is prefixed with the number of the cluster it belongs to. That is,

<int> (<real>,<real>) <string>
The integer at the start of the line denotes the cluster the point belongs to, or -1 if the point has been designated as an outlier.

The one exception is a line whose annotation is "representative". This indicates that the point is actually the representative point of the cluster.
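
The clustered line format, together with the outlier and representative conventions, might be read along the following lines. This is a sketch under stated assumptions: the names ClusteredPoint and parse_clustered_line are hypothetical, the annotation is compared against the exact string "representative", and extra whitespace is tolerated, none of which is specified in the text.

```python
import re
from dataclasses import dataclass

# Assumed grammar: "<int> (<real>,<real>) <string>".
_CLUSTERED_LINE = re.compile(
    r"^(-?\d+)\s+\(\s*([^,()\s]+)\s*,\s*([^,()\s]+)\s*\)\s*(.*)$"
)


@dataclass(frozen=True)
class ClusteredPoint:
    cluster: int      # cluster number, or -1 for an outlier
    x: float
    y: float
    annotation: str

    @property
    def is_outlier(self):
        return self.cluster == -1

    @property
    def is_representative(self):
        # Exact string match is an assumption about the format.
        return self.annotation == "representative"


def parse_clustered_line(line):
    """Parse one line of a clustered dataset into a ClusteredPoint."""
    match = _CLUSTERED_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a clustered point line: {line!r}")
    cluster, x, y, annotation = match.groups()
    return ClusteredPoint(int(cluster), float(x), float(y), annotation)
```

Keeping the outlier and representative checks as properties of the parsed record means callers do not need to know about the -1 and "representative" markers themselves.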



Kevin Pulo
2000-08-23