Thursday, January 31, 2008

The shortcut of SPSS C5: fast in execution, and high in false positives

The C5 implementation in SPSS is very fast.

I also found one problem (there may be others, and I will keep reporting them as I discover them): it tends to allocate most of the observations to the highest-incidence group and stop sooner, which produces naive assignments. The result is almost 100% correct assignment for the highest-incidence class but almost 100% error for the other classes.
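To make the symptom concrete, here is a minimal Python sketch of what such a naive assignment looks like in a confusion matrix. It is not SPSS code: the class labels, the 95/5 split, and the helper names are hypothetical, chosen only to illustrate the pattern.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

def per_class_recall(matrix):
    """Fraction of each actual class that was assigned correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]

# Hypothetical data: 95 observations in the common class, 5 in the rare one.
labels = ["common", "rare"]
actual = ["common"] * 95 + ["rare"] * 5

# A naive model that assigns everything to the highest-incidence class.
predicted = ["common"] * 100

matrix = confusion_matrix(actual, predicted, labels)
recall = per_class_recall(matrix)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(actual)

print(matrix)    # counts by actual (rows) and predicted (columns)
print(recall)    # per-class share assigned correctly
print(accuracy)  # overall share assigned correctly
```

The overall accuracy looks good only because the common class dominates; the per-class figures are what expose the problem.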

One possibility is to use the misclassification cost matrix. You can look at the confusion matrices for training and testing to readjust the cost matrix, but this is not easy. It also seems that C5 is not using some of the basic principles of modeling that rely on the correlation and information content of the covariates.
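As a rough illustration of why a cost matrix can help, the sketch below picks the class with the lowest expected misclassification cost instead of the most probable class. This is a generic cost-sensitive decision rule written in plain Python, not how SPSS applies its cost matrix internally; the probabilities, cost values, and function name are assumptions for illustration.

```python
def min_expected_cost_class(probs, cost):
    """Return (best_class_index, expected_costs).

    probs[t]   : estimated probability that the true class is t
    cost[t][p] : cost of predicting class p when the true class is t
    """
    n = len(probs)
    expected = [sum(probs[t] * cost[t][p] for t in range(n))
                for p in range(n)]
    best = min(range(n), key=expected.__getitem__)
    return best, expected

# Hypothetical node: 80% common class (index 0), 20% rare class (index 1).
probs = [0.8, 0.2]

# Equal costs: every error costs 1, so the majority class wins.
equal_cost = [[0, 1],
              [1, 0]]

# Skewed costs (assumed values): missing a rare case costs 10 times more.
skewed_cost = [[0, 1],
               [10, 0]]

best_equal, exp_equal = min_expected_cost_class(probs, equal_cost)
best_skewed, exp_skewed = min_expected_cost_class(probs, skewed_cost)

print(best_equal, exp_equal)
print(best_skewed, exp_skewed)
```

With equal costs the rule picks the common class; once missing a rare case is made expensive enough, the same probabilities lead to the rare class. Finding the right cost values still takes trial and error against the training and testing confusion matrices.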
