BIG Data Mining, Predictive Solutions and Training Using SAS, R, SPSS, ..., Hadoop, Hive, Pig,...
Saturday, April 28, 2012
SAS - Some Key References that Works on Rows of Variables List - Useful for Data Mining
The first important tool you need to consider rows of variables as data, which is not easy in the usual data structure implied in SAS.
The first one that aids in that direction is PROC SQL which creates rows of variables, not as a data set, but as a macro variables list. That is good, because you can call a macro to execute each of the variable name that is provided by the macro to work on specific columns of a sas data set.
A good reference here is http://support.sas.com/documentation/cdl/en/sqlproc/62086/HTML/default/viewer.htm#a001360983.htm
Suppose you want to work on collection of variables in a call, which may be dynamically changing then you need a tool to extract collection of variables. For example, you may extract all possible combinations of variables, each time as a subset, from a data set, to find out which is the best combination.
To do this you may use the following two macros depending upon whether you need combinations of variables or permutations of variables.
All credit goes to SAS. This is written, supplied, supported by SAS. I just dumped here for my reference and usage.
"""""""
%macro permute(r) / parmbuff; /* the parmbuff option assigns */
%let i=2; /* the invocation parameter list to the */
%let things=; /* macro variable &syspbuff */
%do %while (%Qscan(&syspbuff,&i,%STR(,%))) ne ); /* scan the syspbuff */
%let p&i="%Qscan(&syspbuff,&i,%STR(,%)))"; /* to determine r */
%if &i=2 %then %let things=&&p&i; /* and count the number */
%else %let things=&things,&&p&i; /* of elements, n */
%let i=%eval(&i+1);
%end;
%let n=%eval(&i-2);
data permute;
drop i j copy;
array check (*) $ 10 r1-r&r; /* create a total of r */
%do m=1 %to &r; /* variables for looping */
do r&m = &things;
%end;
copy=0;
do i=2 to &r; /* look for duplicate items */
do j=1 to i-1; /* and keep the unique ones */
if check(j)=check(i) then copy+1;
end;
end;
if copy = 0 then output; /* writes to a SAS data set */
if copy = 0 then put r1-r&r; /* writes to the log */
%do m=1 %to &r;
end; /* end the r DO LOOPS */
%end;
run;
proc print uniform data=permute;
title "permutations of &n items taken &r at a time ";
run;
%mend permute;
%macro combo(r)/parmbuff;
%let i=2;
%let things=;
%do %while (%Qscan(&syspbuff,&i,%STR(,%))) ne );
%let p&i="%Qscan(&syspbuff,&i,%STR(,%)))";
%if &i=2 %then %let things=&&p&i;
%else %let things=&things,&&p&i;
%let i=%eval(&i+1);
%end;
%let n=%eval(&i-2);
data combo;
keep v1-v&r;
array word $8 w1-w&n (&things);
array rr (*) r1-r&r;
array v $8 v1-v&r;
%do i=1 %to &r; /* create the DO LOOPs */
%if &i=1 %then %do;
do r&i=1 to &n-(&r-&i);
%end;
%else %do;
do r&i=r%eval(&i-1)+1 to &n-(&r-&i);
%end;
%end;
do k=1 to &r; /* select subscripted items */
v(k)=word (rr(k)); /* for a SAS data set */
put v(k) ' ' @; /* for log */
end;
put; /* writes to log */
output; /* writes to a SAS data set */
%do i=1 %to &r;
end; /* create ENDs for the DO LOOPs */
%end;
put;
run;
proc print uniform data=combo;
title "combinations of &n items taken &r at a time ";
run;
%mend combo;
""""""""
Reference: http://support.sas.com/techsup/technote/ts498.html
So the idea here is create the combinations of variables using this macro and use the output as a dataset and read one row at a time and execute your macro.
Sunday, February 19, 2012
SAS 1001 Tips - Tip 11 - How to read varying length character variable
Some times the data comes with character variables which are of different lengths and with in each variable, the records also change their lengths because of the nature of data (for example, the character variable is say capturing the web site page url name) or collection of items mentioned in the variable values itself. Say the data comes like this in a file called F:\import.txt
origin products TotalUnits
China printer, printer ink 5,678
Mexico stem tomoto 22,212
USA pears apples 121,425
South Africa
The right way to read this is as follows
data import;
infile 'F:\import.txt' delimiter = ' ' dsd;
input origin $12 products & $25. units;
;
proc print;
run;
The & symbol says that this dsd file needs to be read until it finds the next variable. The alternative symbol is :, which means give maximum character length of 25 characters. & goes extra step to read upto 25 characters and also treat consequetive spaces as single space.
Thursday, February 16, 2012
SAS 1001 Tips - Tip 10 - How to find index of max or min in a vector (array)
data _null_;
/*data rc5;*/
/*set rc5 ; */
set rc5(obs=10);
array _i{5} rcseg1 -- rcseg5;
minperc=min(of _i[*]);
Pred_SEG=whichN(minperc,of _i[*]);
/*put rcseg1 -- rcseg5;*/
/*put (minperc pred_seg)(=/);*/
run;
Reference:
Wednesday, March 30, 2011
SAS 1001 Tips - Tip 9 - Saving format into a library
Saturday, March 12, 2011
Why not utilize PROC VARCLUS as a standard variable reduction method for every problem?
The key inspirational clips,
"The VARCLUS procedure attempts to divide a set of variables into nonoverlapping
clusters in such a way that each cluster can be interpreted as essentially unidimensional..."
"A large set of variables can often be replaced by the set of cluster components with little loss of information. A given number of cluster components does not generally explain as much variance as the same number of principal components on the full set of variables, but the cluster components are usually easier to interpret than the principal components, even if the latter are rotated.
For example, an educational test might contain fifty items. PROC VARCLUS can be
used to divide the items into, say, five clusters. Each cluster can then be treated as a subtest, with the subtest scores given by the cluster components.
If the cluster components are centroid components of the covariance matrix, each
subtest score is simply the sum of the item scores for that cluster."
are directly credited to http://www.okstate.edu/sas/v8/saspdf/stat/chap68.pdf
This is because primary variable representing as one variable for all the variables within a cluster can indeed be good enough representing all of the variables within that cluster, becuase the distances among them are distinctly smaller than other collections of variables.
However, if we are faced with a datamining situation where possibly even a 3%-5% coverage of a variable might be included as important variable in predictive equation, this strategy of variable reduction can lead to tug of war between the number of clusters (and hence number of variables in the original analysis) and the total explained variance from the original set of variables.
The stopping rule for clustering (on the basis of when to stop) is typically a challenge which easily hovers around science and art of prediction (art of prediction?...mm what is it?), will need to be determined on the basis of balancing following parameters.
- The total variance among the original set of variables explained
- The correlations of each of the members with in a cluster to the cluster dimension
- The correlations of each of the members with in a cluster with that of the other members of other clusters
- Explain reasonably and possibly upto 90% of the variation in the original set of variables (because this is also a prediction problem)
Sunday, January 9, 2011
SAS Tips - Tip 8 - A Nice Full Coverage Training for Statistical Softwares
http://www.ats.ucla.edu/stat/
Friday, December 31, 2010
SAS 1001 Tips - Tip 6 - Macro variable - How to put it in title?
/* &maxIncome is the macrovariable value at the time of this particular title is exeuted - note that we need double quotes */
SAS 1001 Tips - Tip 5 - How can I store aggregate value as a macro parameter value
select max(AvgIncome)
into :maxIncome
from sql.statedata
where state = ’New Jersey’;
reset print;
title "The Highest Average Income in among counties in NJ: &maxIncome";
select County, AvgHigh format 4.1
from sql.AvgIncome
where state = ’New Jersey’;
/* improvisation from SAS.com notes */
SAS 1001 Tips - Tip 4 - How do i know the current macro variable names and their stored values?
use
%put _all_; or
%put _user_;
_all_ option will print system macro values also.
check out the file vmacro in the sashelp directory.
Why is this called VIEW but not a dataset which is usually called TABLE?
Identify one interesting VIEW and one interesting TABLE in SASHELP directory
SAS 1001 Tips - Tip 3 - reading number of observations from a dataset and storing into a macro variable
proc sql noprint;
create table sqla as
select *
from a;
quit;
%put Number of observatrions=&sqlobs;
/* The number of observations is stored in a macro variable initiated by proc sql *
/*********************************************/;
Question:
1. what are the other built in macro values that can be extracted using proc SQL?
SAS 1001 Tips - Tip 2 - Finding Number of days between two dates
where
sdate=start date
edate=end date.
eg:
data a;
sdate='21jan2008'd;
edate='31dec2010'd;
num_day=datdif(sdate,edate,'act/act');
put _all_;
run;
sdate=17552 edate=18627 num_day=1075 _ERROR_=0 _N_=1
NOTE: The data set WORK.A has 1 observations and 3 variables.
SAS 1001 Tips - Tip 1 - Covert Numeric to Character & character to numeric
char2num=input(var,3.);
Saturday, December 12, 2009
Basic Practical Statistics With SAS - Basic Course
1. What are the statistical problems (45 actual problem mentioning them, in story telling and creating intrigue) - remaining 55 will be used for exercise sessions and labs (1)
2. Why statistical methods are essential and unavoidable to solve the problems mentioned in (1/ )
4. What are the natural questions in a programming context?
5. Reading and writing simple tables (3 hours/ )
- direct input
- csv/txt
- checking data (through GUI directory), by printing (PROC PRINT)
- meta data (PROC CONTENTS)
- ordering of variables and searching for information
6. sending SAS datasets to others and receiving and using them (PROC IMPORT/EXPORT files)
7. Complex data structures (one record per line, multiple records per line, multiple lines per record, multiple records across multiple files, nth record selected, randomly selected record, )
8. various systems (UNIX/IBM/PC) - how the data differs in reading/writing/storing/outputting - this may be moved for special sessions or coaching (1 hr / ...)
9. Data steps/sort/merge/Importance and Usage of First.xxxx and Last.xxxx observations (1 hour/ ), output statement
10. Data _NULL_, list, put statements (1 hour/ )
11. PROC SQL (3 hour/ hour)
12. Proc univariate (1 hour/ hour)
13. PROC means (1 hour/ hour)
14. PROC FREQ (1 hour/ hour)
15. PROC SUMMARY/TABULATE (1/ hour)
16. ODS for outputting
17. Data components of PROCs (standard outputs of procs, how to understand object names of components of PROCs, viewing, recombining for your special template output needs) - 1 hour/ hour
18. LAB 1/ hour
19. LAB 2/ hour
20. LAB 3/ hour
21. Evaluation and Certification (30 Min/ hour)
Wednesday, September 2, 2009
Scheduling your SAS program for Automated Execution and Read and Post Files in the Web Directly
Say, the pgm is daily number of visitors to a store or a hospital, as a time series, analyzed with all the bells and whistles of graphical output.
The pgm name is daily_update_4AM.sas.
The time to run is morning 4:00AM.
The input file is "ftp://www.CRMportals.com/outbound"
The output file has to be "ftp://www.CRMportals.com/todate/num_visitors.html"
The OS is WINDOWS.
Create a batch file (a file with .bat extension using a notepad editor)
Monday, December 1, 2008
SAS TIPS - TIP 8 - How to Create Weighted Calculations in Proc Tabulate
PROC TABULATE DATA=A.Datain_with_wgts MISSING;
VAR flag ;WEIGHT R_wgt;
Title "People who postponed their purchases";
CLASS segments postpone quarter_year;
WHERE segements=Seg1;
TABLE (quater_Year ALL) , (Postpone ALL)*flag*(sum*F=6.0 ROWPCTSUM*F=6.2) /rts=10;
FORMAT segments segments. ;
RUN;
Created in association with Rathy. C
Sunday, July 27, 2008
SAS Tips 1001 - Tip 10 - PROC ACECLUS provides better results if variables are standardized
See the discussion in
http://support.sas.com/documentation/cdl/en/statug/59654/HTML/default/statug_stdize_sect020.htm#statug.stdize.stdizesummary
Interestingly, the best standardizations with in 5 observations missclassifications (3% of observations missclassified) are (SPACING (0.14)-25, MAXABS-26, IQR, AGK(0.14)-28, RANGE-32, MIDRANGE-32. All these suggest 7 clusters.
Interestingly, STD and L(2) well known looks like the best because both identify only 5 clusters with missclassifications of 33 obs!
SAS Tips 1001 - Tip 14 - The Full SAS 9.2 User Guide
This specific page takes you to PROC DISTANCE. On the left, all the guide materials for all PROCs are available.
Saturday, July 26, 2008
SAS Tips 1001 - Tip 12 - Calculating Mahalanobis distance - Not so commonly known
This material is posted here from SAS.
http://support.sas.com/kb/30/662.html
This sample shows one way of computing Mahalanobis distance in each of the following scenarios:
from each observation to the mean
from each observation to a specific observation
from each observation to all other observations (all possible pairs) 1) To compute the Mahalanobis distance from each observation to the mean, first run PROC PRINCOMP with the STD option to produce principal component scores in the OUT= data set having an identity covariance matrix. The Mahalanobis distance and Euclidean distances are equivalent for these scores. Then use a DATA step with a statement such as:
mahalanobis_distance_to_mean = sqrt(uss(of prin:));
to complete the required distance.
2) To compute the Mahalanobis distance from each observation to a specific point, compute the principal component score for that point using the original scoring coefficients. Then compute the Euclidean distance from each observation to the reference point. One easy way to do this is to use PROC FASTCLUS treating the reference point as the SEED.
3) To compute Mahalanobis distances between all possible pairs, run PROC DISTANCE on the OUT= data set as created by PRINCOMP in the steps above. PROC DISTANCE will automatically calculate all possible pairs.
Thursday, June 5, 2008
SAS Tips 1001 - Tip 13 - Knowing output dataset names in various PROCS in SAS STAT
1. Use the command outside of proc statements;
(before the specific proc statement) ODS TRACE ON;
(after that proc statement ) ODS TRACE OFF:
2. Run the program and look at the log and find dataset name of output files from any Procs; Check the sas output datasets to clearly understand name of the file and the contents;
3. Now take out those ODS Trace statments; introduce the following before the PROC statment:
4. ODS output Name (file name from ODS trace)=New name (it could have a specific path); /* storing it in a particular path and will not be lost after the SAS run)
5. ODS output close;
6. Now do the proc contents to know the column names and attributes.
7. In the next run use the sas dataset name as a dataset to access, to pull out your elements selectively, and to format the outputs to your needs;
A great introduction with lot of actual sample on this step and also lot more on formating and using styles and templates for output is given in
http://www.ats.ucla.edu/stat/sas/seminars/sas_ods/ODS%20Technology%20for%20Decision%20Makers.pps
This is extremely useful for automation and for large scale data mining problems.
Sunday, May 25, 2008
List of Basic Questions For SAS Beginners
Data Set Options:
- Data set options override OPTIONS statement
- Data set options are usable in data statement or set statement. Be watchful about the effect of data set options when specified in Data statement vs. set statement.
- Top data set options
- Difference between ALTER=, PW=, READ=, WRITE=, ENCRYPT (this is the most restrictive) for managing passwords; if encrypt= option is used and password is lost, the only way to get the data set is to recreate the data set
- The importance of CNTLLEV=LIB/MEM/REC for simultaneous shared access to data sets
- KEEP=, DROP=, RENAME=(old name=new name ....)
- The difference between FIRSTOBS= and OBS=
- GENMAX= vs. GENNUM= usage.
- Usefulness of IDXNAME and IDXWHERE, and INDEX
- SORTEDBY=
- WHERE= and WHEREUP=YES/NO
Exercises (You have a basic SAS data set, say FIRST, with vars, VAR1, VAR2, VAR3, VAR4, VAR5 VAR6, VAR7, VAR8, VAR9, VAR10, VAR11):
- Read observations from 5 to 10 and print them with page options which provides 132 columns width and 64 rows.
- Store the 5 observations you created from FIRST and save it with READ=mypass and with MAXGEN=2. Call this as data set SECOND.
- The SECOND data set should have facility for two people to read and work on it at the same time (Note: otherwise the data set will keep the lock with some one who opened it first and will not allow even the creator to work on it, if the creator wants to modify the data set)
- Keep only VAR7 and VAR8 by renaming them as GENDER and AGE and also at the same time selecting only those who are aged 75 and above); Name this data set as THIRD
- Modify the THIRD data set, created out of exercise 4 with IDXNAME created out of VAR1, VAR2, and VAR6 (FirstName, LastName, ZIP, with _ as the connector)
- Create data set FOURTH, from THIRD, creating index from VAR10 (which is PHONENUM)
- Create data set FIFTH, from FOURTH sorted by VAR9 (which is CUSTOMER_NUM), keeping only those records for which there is VAR11 not missing (SSNUM)
In the next summary, we will see the FORMATS.
Sunday, May 4, 2008
Lift Charts - TP, FP, KS, C - Key Words
- The graph of a traditional lift chart in marketing is between Xaxis=Deciles and Yaxis=Cumulative Probability of True Identification of Consumers (True Positive - TP curve).
- We can also create a lift curve for FP (false positives). This tells us the burden we have to bear (mistakes we commit) because we select the top decile, for example. The cumulative will give FP rate for any given cut of point selected, instead of top 10%.
- In other words, we want to compare really TP and FP trade off and compare lift curves on the basis of area (or any other measure that provides one number which is independent of a cut off point)
- This is done in KS statistics, where we measure maximum difference between TP and TN lift curves.
- Alternatively we can do this (compare KS statistics or compare TP vs. FP curves) using C value from the SAS output.
- Use c to judge comparative effectiness of lift curves
Sunday, April 27, 2008
Two Very Important Documents for SAS Forecasting
The user guide for PROC HPF is provided by SAS support system, and is available here.
http://support.sas.com/documentation/onlinedoc/91pdf/sasdoc_91/hpf_ug_7305.pdf
Comparing PROC HPF and SAS® Enterprise Miner for forecasting is provided in
http://www2.sas.com/proceedings/sugi30/194-30.pdf
Monday, April 21, 2008
A Simple Write Up that Teaches All the Tricks of PROC FORMAT
- http://www2.sas.com/proceedings/sugi31/249-31.pdf
This includes
- Table look up using Proc Format
- Using Proc Format for finding unexpected values
- Generating new variables with Proc Format
- Using 2-Dimensional table look up
More examples of 2-dimensional and 3-dimensional table lookups using PROC FORMAT can be found in SUGI31
paper by Perry Watts, "Using Database Principles to Optimize SAS® Format Construction from Tabular Data".
- Using PROC FORMAT to extract data
- Specifying ranges of values in PROC FORMAT
- Using PROC FORMAT for Data merges
- Creating Multilabel formats
- Saving formats
Today's Workshop Article - SAS MACROS - Beyond Basics
The following article summarizes all the functionalities of SAS MACROS.
http://www2.sas.com/proceedings/sugi24/Handson/p149-24.pdf
The following questions are answered here.
- do you know why %str function is important?
- what really happens when SAS macro language processing is done in SAS
- What would you know if you use %put _all_; and %put _user_; statements, (1) at the beginning of a sas program, inside a macro program, outside a macro program?
- why would you need %NRSTR function in a macro or in a %let statement?
- knowing how to use &&double and &&&triple
- difference between %SYSEVALF and %SYSFUNC
- using %SYMPUT % SYMGET
- %IF and %DO executions
- %RESOLVE and %EXECUTE purposes
See the article for more tutorial type information
www2.sas.com/proceedings/sugi24/Handson/p149-24.pdf
Friday, April 18, 2008
Randomly Selecting a Subset of Records From a File with Out Reading the Whole File
filename out1 "c:\tests\outf1.dat" RECFM=F LRECL=1765;;
data _null_;
infile in1 obs=1000;
file out1;
input @1 _FULLREC $EBCDIC1765.;
if ranuni(0) le 0.1 then
put @1 _FULLREC $EBCDIC1765.;
run;
This code actually reads only 1000 records and with in that 10% of the data is selected for outputing to the SAS data set.
Sunday, April 13, 2008
A first in MSA (Masters of Science in Analytics) evolving
The science of analytics is ready to thrive as a university discipline.
The courses are interesting; it points out the various possibilities and how wide it can be. Interesting also is the close relationship between SAS and this program.
Institute for Advanced Analytics
Institute Main Office:920 Main Campus Drive, Venture 2-101Raleigh, NC 27606
MSA Program Offices and Classroom:900 Main Campus Drive, Venture 3-165CRaleigh, NC 27606
T: 919-513-3940F: 919-424-4401
Saturday, April 5, 2008
Working with Latent Models - How to Docs
(latent variable, manifest variable) - Analysis to be used
(continuous, continuous) - Factor analysis - read more about it here.
(continuous, discrete) - Latent profile analysis (LPA)/mixture model - read more about it here.
(discrete, continuous) - Latent trait analysis (LTA)/item response models/Rasch - read more about it here.
(discrete, discrete) - Latent class analysis (LCA)/multinomial model - read more about this here.
Some important reading material about using SAS in this category of models.
(1) www2.sas.com/proceedings/forum2007/192-2007.pdf - LCA related
Some more references to come ...
Saturday, March 29, 2008
A Complex Do Loop and One Record per Household
run;
/* creating one record per HH using all DPID related data */;
data pin_data.hh_dpid_onerec (keep=hhid sid pid dem_wt dmt1-dmt18 dmd1-dmd18 dmq1-dmq18 dst1-dst18 dsd1-dsd18 dsq1-dsq18) ;
set pin_data.hh_dpid_mean_weighted_enh;
by hhid;
where dpid not in (.);
retain i dmt1-dmt18 dmd1-dmd18 dmq1-dmq18 dst1-dst18 dsd1-dsd18 dsq1-dsq18;
array dmt{*} dmt1-dmt18;
array dmd{*} dmd1-dmd18;
array dmq{*} dmq1-dmq18;
array dst{*} dst1-dst18;
array dsd{*} dsd1-dsd18;
array dsq{*} dsq1-dsq18;
if first.hhid
then do j=1 to 18;
dmt{j}=0;
dmd{j}=0;
dmq{j}=0;
dst{j}=0;
dsd{j}=0;
dsq{j}=0;
end;
if first.hhid then i=0;
i+1;
dmt{i}=mt;
dmd{i}=md;
dmq{i}=mq;
dst{i}=st;
dsd{i}=sd;
dsq{i}=sq;
if last.hhid then output;
run;
/* creating one record per HH using all CHID related data */;
data pin_data.hh_chid_onerec (keep=hhid cmt1-cmt9 cmd1-cmd9 cmq1-cmq9 cst1-cst9 csd1-csd9 csq1-csq9) ;
set pin_data.hh_chid_mean_weighted_enh;
by hhid;
where chid not in ( . );
retain i cmt1-cmt9 cmd1-cmd9 cmq1-cmq9 cst1-cst9 csd1-csd9 csq1-csq9;
array cmt{*} cmt1-cmt9;
array cmd{*} cmd1-cmd9;
array cmq{*} cmq1-cmq9;
array cst{*} cst1-cst9;
array csd{*} csd1-csd9;
array csq{*} csq1-csq9;
if first.hhid then do j=1 to 9;
cmt{j}=0;
cmd{j}=0;
cmq{j}=0;
cst{j}=0;
csd{j}=0;
csq{j}=0;
end;
if first.hhid then i=0;
i+1;
cmt{i}=mt;
cmd{i}=md;
cmq{i}=mq;
cst{i}=st;
csd{i}=sd;
csq{i}=sq;
if last.hhid then output;
run;
/* creating one record per HHID using UCID based codes - UCID=0 and UCID=. not inlcuded */;
data pin_data.hh_ucid_onerec (keep=hhid umt1-umt362 umd1-umd362 umq1-umq362 ust1-ust362 usd1-usd362 usq1-usq362) ;
set pin_data.hh_ucid_mean_weighted_enh;
by hhid;
where ucid not in (.);
retain i umt1-umt362 umd1-umd362 umq1-umq362 ust1-ust362 usd1-usd362 usq1-usq362;
array umt{*} umt1-umt362;
array umd{*} umd1-umd362;
array umq{*} umq1-umq362;
array ust{*} ust1-ust362;
array usd{*} usd1-usd362;
array usq{*} usq1-usq362;
if first.hhid then
do j=1 to 362;
umt{j}=0;
umd{j}=0;
umq{j}=0;
ust{j}=0;
usd{j}=0;
usq{j}=0;
end;
if first.hhid then i=0;
i+1;
umt{i}=mt;
umd{i}=md;
umq{i}=mq;
ust{i}=st;
usd{i}=sd;
usq{i}=sq;
if last.hhid then output; run;
/* creating one record per HHID merging all DPID, CHID, UCID variablesNote that the first variable in each refers to -1, that is unknown code */;
proc sort data=pin_data.hh_dpid_onerec;
by hhid;
run;
proc sort data=pin_data.hh_chid_onerec;
by hhid;
run;
proc sort data=pin_data.hh_ucid_onerec;
by hhid;
run;
data pin_data.hhid_panel_onerec;merge pin_data.hh_dpid_onerec (in=dpid) pin_data.hh_chid_onerec(in=chid) pin_data.hh_ucid_onerec(in=ucid);
by hhid;
if dpid;
run;
Friday, March 28, 2008
ROC - Receiver Operating Curve
The common misunderstanding I find is the following
- ROC is not same as traditional lift curve that marketing analysts use, though both look the same
- (1-Specificity) vs. Sensitivity is the plot of ROC where as Decile vs. Sensitivity is the lift curve
- The area under ROC is not same as area under lift curve
- Hosmer-Lemeshow test may not be the one marketing analysts will use to see the goodness of fit of the model; they may use area under ROC curve and KS statistic.
- The statistic "c" in SAS output is area under ROC curve
- KS just tells you what maximum differences occur in the ROC between cumulative response (sensitivity) vs. cumulative non-response (Specificity).
Tuesday, March 11, 2008
A List of SAS Questions Every Statistician has to Know Their Answers - This List Will be Evolving
- What is a PDV (program data vector)
- Why we need a RETAIN statement; give an example; relate it to PDV
- Want to post the total number of observations to all the records; what are the codes (if we do not manually input anything)
- How do we read multiple records to create a single individual
- How do we output multiple individual records from a single record (assuming multiple individuals' data are available in a single record)
- What is the purpose of first.account and last.account when we set data sets by account? Think of an example where this is useful. Why is this not available outside of the data step, automatically (HINT: relate it to PDV basics)
- How do we find the macro variable names and their values? (HINT: Use PROC SQL)
- What is metrics "C" in logistic regression output? how is it related to the area under the ROC curve
- What is sensitivity and specificity? Which is more important than the other in a marketing modeling context
- What goes on in the PDV when we use the statement "if _n_=1 then set a; set b;"
- How to subset observations at input level?
- What is the difference between obs=10 in options statement and obs=10 in the infile statement
- What is the purpose of missover, flowover, DSD, DLM options in infile statement?
- Identify the list of SAS generated important variables (during SET statement) which are not available outside the data step.
- ...
Thursday, February 28, 2008
A Better Way to Map A Market: An Example from Harvard Business Review
Look at this link for the graph , breaking down the market by end users, not by purchasers (A tradational way of looking at its market, by Thomson Financial).
You may have to look deep in to the article for the picture i am referring to.
Wednesday, February 27, 2008
SAS ODS - SAS output objects - Trace on and Trace off to know the output objects
http://www.asu.edu/sas/sasdoc/sashtml/ods/z0471334.htm
To know the names of the objects for ODS, use the following:
ods trace on;
proc related statements;
run;
ods trace off;
Executing this step gives the table names to use.
Now when the proc is run, use the following statement before the run; in the proc related statements.
ods output newFileName=tableName /* from the ods trace*/
The newFileName is in the work directory and one can manipulate this file to specifically pick output elements, arrange them any order one wants and combine with other outputs.
- SAS becomes an object oriented with this! (well close to what it can deliver now)
This is a very powerful technique.
Wednesday, February 13, 2008
Sensitivity (True Positive) Vs. 1- Specificity (True Negative) in Marketing Models
Specificity: The likelihood of heart attack, given that the patient's test is negative; FALSE NEGATIVE
(1-specificity) is the likelihood of no heart attack given that the test is negative (again posterior probability) - TRUE NEGATIVE
Let us say "AGE" is the covariate. That is, we have a posterior parametric model with AGE as the independent variable.
A test based on AGE dominates another test (means better than the other), if the TRUE POSITIVES VS. TRUE NEGATIVES (discrimination curve - Receiver Operating Curve - ROC) is above the already established discrimination curve.
In traditional marketing context, we use sample deciles and TRUE positives as the lift curve. So we are likely to build great models which may have bad rates of "FALSE POSITIVES" and "FALSE NEGATIVES".
A great introduction about True positivies, True Negatives, False Positives, and False Negatives is http://www.cs.cornell.edu/courses/cs678/2006sp/performance_measures.4up.pdf
Netflix Challenge - Reading all the data into SAS
===========================================================================
libname Net "C:\Documents and Settings\All Users\Documents\NetFlixChallenge\";
libname Netdown "C:\Documents and Settings\All Users\Documents\NetFlixChallenge\download\";
libname Nettrain "C:\Documents and Settings\All Users\Documents\NetFlixChallenge\training_set\";
data netdown.movies1(compress=yes);
infile "C:\Documents and Settings\All Users\Documents\NetFlixChallenge\download\movie_titles.txt" dsd dlm=","
missover truncover /*obs=100 */ ;
input m_num $ rel_yr $ title $80.
;
run;
%let lengthi=6;
data netdown.movies(drop=prefix1 ); set netdown.movies1;
prefix1=repeat('0',&lengthi-length(m_num));
m_expand='mv_'||trim(prefix1)||trim(m_num);
run;
proc print data=netdown.movies(obs=10); run;
proc sql noprint;
select m_expand into: MID1 -: MID17770
from netdown.movies;
quit;
data netdown.superdat;
length mid $7. cid 8. rank 8. rank_dt $10.;
run;
%macro netdata(tfiles=j);
%do i=1 %to &tfiles;
data netdown.mm (keep=mid);
length mid $7.;
mid1="&&mid&i";
mid=substr(mid1,4,7);
run;
data netdown.add1(keep=cid rank rank_dt);
infile "c:\Documents and Settings\All Users\Documents\NetFlixChallenge\training_set\&&MID&i...txt" dsd dlm=","
missover truncover firstobs=2 ;
input cid rank rank_dt $10.;
keep cid rank rank_dt;
run;
data netdown.add ; if _n_=1 then set netdown.mm; set netdown.add1; run;
proc append base=netdown.superdat data=netdown.add ; run;
%end;
%mend;
%netdata(tfiles=17770);
proc freq data=netdown.superdat;
table mid; run;
data netdown.superdat1(compress=yes); set netdown.superdat;
if mid ne .;
run;
/* work with first 10 movies as a sample */
data netdown.sample1_10; set netdown.superdat1 (obs=10);
run;
proc sql noprint;
select cid into: c1 -: c10
from netdown.sample1_10;
quit;
PROC SQL NOPRINT;
CREATE TABLE netdown.macro_vars AS
SELECT name, value
FROM sashelp.vmacro
WHERE scope='GLOBAL'
ORDER BY 1;
QUIT;
*********** SAMPLE WORK ON FIRST MOVIE ***********************;
%let s1_numc=10; /*number of customers for sample1_10 */
proc sort data=netdown.superdat1 out=netdown.superC; by cid; run;
proc sort data=netdown.sample1_10; by cid; run;
data netdown.sample1_10_merged; merge netdown.superc (in=c) netdown.sample1_10 (in=s); by cid;
if s; run;
proc sql noprint;
create table netdown.superc_sql as
select a.*
from netdown.superc as a left join
netdown.sample1_10 as b
on a.cid=b.cid
where a.rank_dt ne "";
quit;
data netdown.s1_scores;
length mid $7. cid 8. rank 8. rank_dt $10.;
run;
%macro collect_scores(movie_num=mn,num_c=s1_numc);
title "Movie evaluated is &movie_num";
%do i=1 %to &num_c;
data netdown.s1_scores_add;
set netdown.superc;
where cid=&&c&i; /* since i know the customers ids are stored in the macro seq. c(i) = cn */
run;
proc append base=netdown.s1_scores data=netdown.s1_score_add; run;
%end;
%mend;
%collect_scores(movie_num='0000001',num_c=10);
quit;
Tuesday, February 12, 2008
When "Where" Matters - "If" Matters? - SAS - Differences between 'WHERE" and "IF" statements
http://www2.sas.com/proceedings/sugi31/238-31.pdf.
In one of my application recently i found that I had to use the End of File flag (in set statement - EOF=LAST option; LAST=1 or 0 based on whether the file reached the last record or not);
I had to subset the observations and work on the smaller data set and make use of LAST for decision to execute a bloc of statements.
If we use "IF" the execution will not be correct.
Have to use "WHERE" statement to subset.
Using PROC SQL to Rename Vars - this time stripping characters
* THE LIBRARY AND Dataset NAMES IN*
* MACRO CALL MUST BE IN UPPERCASE *
***********************************;
*options macrogen mprint mlogic symbolgen;
%macro rename(lib,dsn);
options pageno=1 nodate;
proc contents data=&lib..&dsn;
title "Before Renaming All Variables";
run;
*********************************
* Keep original variable names *
*********************************;
proc sql noprint;
select nvar into :num_vars
from dictionary.tables
where libname="&LIB" and
memname="&DSN";
select distinct(name) into
:var1-:var%TRIM(%LEFT(&num_vars))
from dictionary.columns
where libname="&LIB" and
memname="&DSN";
quit;
run;
******************************************
* Remove suffix _base from variable names*
******************************************;
proc sql noprint;
select nvar into :num_vars
from dictionary.tables
where libname="&LIB" and
memname="&DSN";
select distinct(tranwrd(name,'_base','')) into
:varnew1-:varnew%TRIM(%LEFT(&num_vars))
from dictionary.columns
where libname="&LIB" and
memname="&DSN";
quit;
run;
********************
* Rename variables *
********************;
proc datasets library=&LIB;
modify &DSN;
%do i=1 %to &num_vars;
rename &&var&i = &&varnew&i;
%end;
quit;
run;
/**** CHECKING THE RESULTS ****/;
options pageno=1 nodate;
proc contents data=&lib..&dsn;
title "After Renaming Variables without SUFFIX _base";
run;
%mend rename;
%rename(WORK,PANEL5);
run;
Thursday, January 31, 2008
The short cut of SPSS C5 - fast in execution and so is high false positives
I also found one problem (may be others too, and i will keep reporting when I discover) is that it is more likely to allocate more of the observations to the highest incidence group and stop sooner; this creates naive assignments. That is almost 100% correct assignment for the highest incidence class but almost 100% error for other classes.
One possibility is use to the misclassification cost matrix; You can look at the confusion matrix of training and testing to re-adjust the cost matrix; but it is not easy; also it seems C5 is not using some of the basic principles of modeling using the correlation and information content of covariates.
Wednesday, January 23, 2008
Subsetting IBM Dataset based on outside file info with out affecting the integrity of the data layout
RETAINED FOR FURTHER PROCESSING; THE TRICKY PART IS THE SUBSETTING IS BASED ON ANOTHER
FILE WITH A COMPLEX SUBSETING PROCESS
In this example, the susbeting IDs are in YS.active_high(N=1429) read into work.subset;
The larger ibm dataset is ibm.dat (N=55,000)
The output data set is ibm_subset(N=1429)
******/
/************=================================================*********/;
libname ys "c:\SAS\PROJECT1";
%macro fsubset(outname=);
proc sort data=ys.&&outname out=subset; by ind_id1; /* N=1429 */run;
filename ibm "c:\SAS\PROJECT1\ib.dat" RECFM=F LRECL=1305;
data ibm; infile ibm dsd missover;
input @1 _full $EBCDIC1305.
@1156 id1 $EBCDIC8.;
run;
proc sort data=ibm; by id1; run;
data ibm_subset; merge ibm(in=c) subset(in=s); by ind_id1;
if s; /* subset set of ibm data N=1429 */;
run;
filename ibm_s "c:\SAS\PROJECT1\&outname..dat" RECFM=F LRECL=1305;
data _null_;
file ibm_s RECFM=F LRECL=1305;
set ibm_subset;
put @1 _full $EBCDIC1305.;
run;
/**********=====================================================******/;
%mend;
options mlogic mprint;
%fsubset(outname=A_HIGH);
%fsubset(outname=active_med);
%fsubset(outname=active_mhi);
%fsubset(outname=active_low);
Thursday, January 10, 2008
creating a txt/csv/delimited file from a sas data set
data _null_;
file out1 dlm="|"; /* dlm could be any character */
set one;
put fname lname source id; /* any number of variable from the sas dataset one */
run;