In their comparison of datasets, [Lim et al, 1999] used a ten-fold cross-validation technique. In basic terms, they split the dataset into a 90% training set and a 10% validation set for each of 10 disjoint splits, ran the algorithm over each of the 10 splits, and averaged the results. However, there is certainly not enough time for the human to perform the rule construction 10 times over, especially when in theory the rules for the dataset should be the same (and the human is aware of this fact). It is therefore virtually impossible for the human to run several truly distinct iterations of the rule generation on the different training sets (as required for cross-validation), as human memory will skew and bias the later results.
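As an illustration only, the following sketch shows the general shape of such a ten-fold cross-validation loop; the build_rules and evaluate functions are hypothetical stand-ins for the learning algorithm and its scoring procedure, and are not taken from [Lim et al, 1999].

    import random

    def ten_fold_cv(instances, build_rules, evaluate, seed=0):
        # Shuffle a copy of the data, then deal it into 10 disjoint folds.
        shuffled = list(instances)
        random.Random(seed).shuffle(shuffled)
        folds = [shuffled[i::10] for i in range(10)]
        scores = []
        for i in range(10):
            validation = folds[i]  # the held-out ~10%
            training = [x for j, fold in enumerate(folds) if j != i for x in fold]  # the remaining ~90%
            rules = build_rules(training)
            scores.append(evaluate(rules, validation))
        # Average the 10 validation scores, as in the comparison above.
        return sum(scores) / len(scores)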
Instead, a single split of 70% training data and 30% validation data was used in an attempt to partially compensate for the loss of the multiple validation runs: the larger validation set should provide a better test of the constructed rules without sacrificing too many training instances.
A related problem was that the dataset was split sequentially: the training set was simply the first 70% of the instances and the validation set the last 30%. It is not known whether the data is ordered in a way that introduces drifts or trends (though there is no reason to assume so). Ideally, the instances should have been randomized before being split into the training and validation sets; however, any effect of not doing so is believed to be negligible compared with the overall results.
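A minimal sketch of the difference, assuming the instances are held in a plain list, is given below; the sequential split actually used is shown alongside the randomized split that would have been preferable.

    import random

    def split_sequential(instances, train_fraction=0.7):
        # The split actually used: the first 70% of instances for training,
        # the last 30% for validation, in the order they appear in the dataset.
        cut = int(train_fraction * len(instances))
        return instances[:cut], instances[cut:]

    def split_randomised(instances, train_fraction=0.7, seed=0):
        # The preferable alternative: shuffle a copy of the instances first,
        # so any ordering effects in the dataset cannot bias the split.
        shuffled = list(instances)
        random.Random(seed).shuffle(shuffled)
        cut = int(train_fraction * len(shuffled))
        return shuffled[:cut], shuffled[cut:]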
A related problem is that the feedback from the training set is often not particularly helpful, for the same reasons discussed in connection with overfitting. Once the rules have been run on the validation set, the results cannot be used to improve the rules further, as this would effectively merge the validation set into the training set. Thus the rules may be run on the validation set once only, at which point they cannot be developed further. This leaves the human in a very awkward and counter-intuitive position of having little information with which to decide whether the rules are complete. The training set results are further suspect because of overfitting: large errors do not necessarily indicate that the rules are particularly bad. More importance must therefore be placed on how the human feels about the rules, that is, whether the human thinks they accurately represent the underlying structure of the problem, usually judged in terms of their elegance, their being sufficiently large (and yet not overly large), and other measures of the human's satisfaction, all of which can be hard to gauge.