3.2.1 Restrictions on datasets


As noted in Section 2.1, the datasets to be analysed were subject to several restrictions, most notably on the number of instances and attributes, and on the number of possible values each attribute may take. The reason is that this project is limited in scale and required a human to analyse the data; overly complicated data would simply "clog" the human's mind, making it substantially harder to find rules for the dataset.

For example, the www.microsoft.com web server log dataset (also from [Blake et al, 1998, the UCI ML dataset repository]) has approximately 160 attributes, most of which are nominally valued. For a human to make enough sense of these to classify instances accurately would be a formidable task, well beyond the reach of this project.

A second dilemma concerns whether the datasets should have continuous or discrete attributes and classes. Fuzzy logic has its greatest advantage over bivalent techniques when applied to continuous values; applied to discrete values, it reduces to a somewhat advanced version of bivalent logic. However, datasets with continuous values are considerably more complex and involved than those with discrete values. Thus, as above, discrete-valued datasets were chosen in the interests of restraining complexity.
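
To illustrate the difference on a continuous value, the following is a minimal sketch only and is not taken from the project's code. The function names, the threshold of 50 and the ramp bounds of 40 and 60 are all hypothetical, chosen purely for illustration. A bivalent test jumps from 0 to 1 at a single cut-off, whereas a simple fuzzy membership function (here a linear ramp, one common choice among many) changes gradually across a range.

```python
def crisp_membership(x, threshold=50.0):
    """Bivalent membership: 1.0 at or above the (hypothetical) threshold, else 0.0."""
    return 1.0 if x >= threshold else 0.0


def fuzzy_membership(x, low=40.0, high=60.0):
    """Linear ramp membership: 0.0 at or below `low`, 1.0 at or above `high`,
    and a proportional degree in between. The bounds are assumed values."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


if __name__ == "__main__":
    # Compare the two functions on a few continuous values.
    for value in (35.0, 45.0, 50.0, 55.0, 65.0):
        print(value, crisp_membership(value), fuzzy_membership(value))
```

Under these assumed bounds, values just either side of 50 receive similar fuzzy degrees but opposite bivalent results, which is the kind of advantage on continuous data referred to above.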

Kevin Pulo
2000-08-22