BIG Data Mining, Predictive Solutions and Training Using SAS, R, SPSS, ..., Hadoop, Hive, Pig,...: The short cut of SPSS C5 - fast in execution and so is high false positives

Thursday, January 31, 2008

The short cut of SPSS C5 - fast in execution and so is high false positives

The C5 implementation in SPSS is very fast.

I did find one problem (there may be others, and I will report them as I discover them): it tends to allocate most of the observations to the highest-incidence group and stops splitting too soon, which creates naive assignments. The result is almost 100% correct assignment for the highest-incidence class but almost 100% error for the other classes.
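To make the symptom concrete, here is a minimal sketch in plain Python. The class labels and counts (900 / 80 / 20) are invented for illustration and are not from any SPSS run; the point is only that a model which labels everything with the most frequent class can look accurate overall while being wrong on every minority observation.

```python
from collections import Counter

def per_class_recall(actual, predicted):
    """Return {class: fraction of that class's observations predicted correctly}."""
    totals = Counter(actual)
    hits = Counter(a for a, p in zip(actual, predicted) if a == p)
    return {c: hits[c] / totals[c] for c in totals}

# Hypothetical data: 900 observations of class "A", 80 of "B", 20 of "C".
actual = ["A"] * 900 + ["B"] * 80 + ["C"] * 20

# Stand-in for a tree that stops early and assigns the most frequent class everywhere.
majority = Counter(actual).most_common(1)[0][0]
predicted = [majority] * len(actual)

recall = per_class_recall(actual, predicted)
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

print(recall)    # per-class share of correct assignments
print(accuracy)  # overall share of correct assignments
```

Looking at the per-class figures rather than the overall accuracy is what exposes the problem.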

One possibility is to use the misclassification cost matrix. You can look at the confusion matrices for the training and testing data to readjust the cost matrix, but this is not easy. It also seems that C5 does not use some of the basic principles of modeling, such as the correlation and information content of the covariates.
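The general idea behind a misclassification cost matrix can be sketched as follows. This is an illustration of the textbook minimum-expected-cost rule, not a description of how SPSS applies costs internally; the function name `min_cost_class`, the leaf probabilities, and the cost values are all assumptions made up for the example.

```python
def min_cost_class(probs, cost):
    """Pick the predicted class with the lowest expected misclassification cost.

    probs: {actual_class: probability of that class at this leaf}
    cost:  {actual_class: {predicted_class: cost}}, with 0 on the diagonal
    """
    classes = list(probs)
    expected = {
        pred: sum(probs[act] * cost[act][pred] for act in classes)
        for pred in classes
    }
    return min(expected, key=expected.get), expected

# Hypothetical leaf: 80% of its observations are "A", 20% are "B".
leaf_probs = {"A": 0.8, "B": 0.2}

# Equal costs reproduce the plain majority rule.
equal_cost = {"A": {"A": 0, "B": 1}, "B": {"A": 1, "B": 0}}

# Assumed adjustment: calling a true "B" an "A" is five times as costly.
raised_cost = {"A": {"A": 0, "B": 1}, "B": {"A": 5, "B": 0}}

label_equal, _ = min_cost_class(leaf_probs, equal_cost)
label_raised, expected_raised = min_cost_class(leaf_probs, raised_cost)

print(label_equal, label_raised)
```

With equal costs the leaf is labeled with the majority class; raising the cost of missing the minority class flips the label. The hard part, as noted above, is choosing those costs: each adjustment shifts errors from one cell of the confusion matrix to another, so the matrix has to be rechecked on both training and testing data after every change.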
