Saturday, March 12, 2011

Why not utilize PROC VARCLUS as a standard variable reduction method for every problem?


The key inspirational excerpts,

"The VARCLUS procedure attempts to divide a set of variables into nonoverlapping
clusters in such a way that each cluster can be interpreted as essentially unidimensional..."

"A large set of variables can often be replaced by the set of cluster components with little loss of information. A given number of cluster components does not generally explain as much variance as the same number of principal components on the full set of variables, but the cluster components are usually easier to interpret than the principal components, even if the latter are rotated.

For example, an educational test might contain fifty items. PROC VARCLUS can be
used to divide the items into, say, five clusters. Each cluster can then be treated as a subtest, with the subtest scores given by the cluster components.

If the cluster components are centroid components of the covariance matrix, each
subtest score is simply the sum of the item scores for that cluster."

are quoted directly from http://www.okstate.edu/sas/v8/saspdf/stat/chap68.pdf
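As a minimal sketch of the fifty-item example above (the dataset name test_scores and the item names item1-item50 are hypothetical placeholders), a call along these lines asks PROC VARCLUS for five clusters built from centroid components of the covariance matrix, so that each subtest score is simply the sum of its cluster's item scores:

proc varclus data=test_scores
             maxclusters=5   /* stop splitting once five clusters have been formed */
             centroid        /* use centroid components ...                        */
             cov             /* ... of the covariance matrix                       */
             short;          /* suppress the lengthy cluster structure output      */
   var item1-item50;
run;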

The reduction works because a single primary variable can indeed be good enough to represent all of the variables within its cluster: the distances among the members of a cluster are distinctly smaller than their distances to the members of other clusters.
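One hedged sketch of how that representative can be chosen in practice: PROC VARCLUS reports, for every variable, its R-squared with its own cluster, with the next closest cluster, and the resulting 1-R**2 ratio; the member with the lowest ratio in each cluster is tightly tied to its own cluster and only weakly tied to its neighbours. The dataset and variable names below (mydata, x1-x50) are made up, and the ODS table and column names (RSquare, Cluster, RSquareRatio) are assumptions worth confirming with ODS TRACE.

ods output rsquare=rsq;            /* capture the R-squared table             */
proc varclus data=mydata maxclusters=10 short;
   var x1-x50;
run;

proc sort data=rsq;                /* smallest 1-R**2 ratio first per cluster */
   by Cluster RSquareRatio;
run;

data cluster_reps;                 /* keep one representative per cluster     */
   set rsq;
   by Cluster;
   if first.Cluster;
run;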

However, if we are faced with a data mining situation where a variable covering even just 3%-5% of the records might still be included as an important variable in the predictive equation, this variable reduction strategy can lead to a tug of war between the number of clusters (and hence the number of variables retained from the original analysis) and the total variance explained from the original set of variables.

The stopping rule for clustering (that is, deciding when to stop splitting) is typically a challenge that hovers somewhere between the science and the art of prediction (art of prediction? ...mm, what is it?), and it will need to be determined by balancing the following parameters (see the sketch after this list).

- The total variance explained among the original set of variables
- The correlation of each member within a cluster with its own cluster component
- The correlation of each member within a cluster with the members of other clusters
- Whether the solution explains a reasonable share, possibly up to 90%, of the variation in the original set of variables (because this is also a prediction problem)
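As a rough sketch of how that balancing act maps onto the procedure's own options (the dataset mydata and variables x1-x100 are hypothetical), the split criterion can be tightened or relaxed with MAXEIGEN= (or, alternatively, with the PROPORTION= variance-explained criterion), and the printed cluster summary reports the total and proportional variation explained, which is where the 90% target in the last bullet can be checked:

proc varclus data=mydata
             maxeigen=0.7    /* keep splitting a cluster while its second     */
                             /* eigenvalue exceeds 0.7; PROPORTION= is the    */
                             /* alternative, variance-explained criterion     */
             outtree=tree    /* tree dataset for reviewing how the solution   */
                             /* grows as more clusters are allowed            */
             short;
   var x1-x100;
run;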

2 comments:

Anonymous said...

Do you know the equivalent/similar procedure in R?

Data Monster - Insight Monster said...

There is a varclus function available in R.

"...Take a look at the varclus function in the Frank Harrel's Hmisc package.

require(Hmisc)
? varclus

..."

http://stats.stackexchange.com/a/9042