Friday, April 8, 2016

Data Analysis and Modeling using R

Most used SAS procedures
1) proc sql
2) proc format
3) proc cluster
4) proc logistic
5) proc reg
6) proc lifetest
7) proc phreg
8) proc factor
9) proc PCA
10) proc corr
11) proc freq
12) proc import
13) proc export
14) proc append
15) proc glm
My industry is predictive/statistical modelling.


Data Analysis and Modeling using R


1)      Read the data set
Data can be read from Excel, HTML, text files, web URLs, database files, XML, etc.
2)      Load the data set
load("full path to the .rda data set")
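A minimal sketch of steps 1 and 2, assuming a small CSV file. The file contents and the names csv_path, rda_path and demo_data are made up for illustration; only base R is used.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,10.5", "2,NA", "3,7.25"), csv_path)

# Step 1: read a delimited text file; the string "NA" becomes a missing value.
demo_data <- read.csv(csv_path, stringsAsFactors = FALSE)

# Step 2: save the data frame as .rda, remove it, then load() it back.
rda_path <- tempfile(fileext = ".rda")
save(demo_data, file = rda_path)
rm(demo_data)
load(rda_path)  # recreates the object named demo_data

str(demo_data)
unlink(c(csv_path, rda_path))  # clean up the temporary files
```

Other sources such as Excel, XML or databases generally need add-on packages, so they are left out of this sketch.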
3)      Summarize the data
summary(data_set)
4)      Count the missing values
sapply(data_set, function(x) sum(is.na(x)))  # NA count per column
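As a small illustration of steps 3 and 4, here is a sketch on an invented data frame. The name survey and all of its values are hypothetical.

```r
# Hypothetical survey data; the values are invented for illustration.
survey <- data.frame(
  age    = c(23, NA, 35, 41),
  income = c(NA, NA, 52000, 61000),
  region = c("north", "south", NA, "east")
)

summary(survey)  # per-column summary, including NA counts for numeric columns

# Number of missing values in each column.
na_counts <- sapply(survey, function(x) sum(is.na(x)))
na_counts

# Share of missing values in each column, often easier to compare.
na_share <- colMeans(is.na(survey))
na_share
```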
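5)      Computing a new variable:
data_set$new_variable <- with(data_set, expression)
Here expression is any calculation or logic on the columns of the data set. Inside with(), vectorized functions such as ifelse() are usually more convenient than if, while or for loops.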
6)      Adding an observation number:
data_set$ObsNumber <- seq_len(nrow(data_set))  # 1, 2, ..., number of rows
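Steps 5 and 6 can be sketched on the built-in mtcars data. The new column names (power_to_weight, efficiency) and the cut-off of 20 mpg are assumptions chosen only for the example.

```r
cars <- mtcars  # work on a copy of the built-in data set

# A plain calculation on existing columns.
cars$power_to_weight <- with(cars, hp / wt)

# Conditional logic, written with the vectorized ifelse().
cars$efficiency <- with(cars, ifelse(mpg > 20, "high", "low"))

# Observation number that adapts to the number of rows.
cars$ObsNumber <- seq_len(nrow(cars))

head(cars[, c("mpg", "power_to_weight", "efficiency", "ObsNumber")])
```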
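7)      Standardize a variable:
.Z <- scale(data_set[, c("variable_1", "variable_2")])  # centre and scale the chosen columns
data_set$Z.variable_1 <- .Z[,1]
To standardize more than one variable, repeat the last command and change the column index from 1 to 2, 3, etc.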
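A short sketch of the standardization step on mtcars, assuming the standardized values come from base R's scale(). The choice of mpg and hp is arbitrary.

```r
cars <- mtcars

# scale() subtracts the column mean and divides by the column standard deviation.
.Z <- scale(cars[, c("mpg", "hp")])

cars$Z.mpg <- .Z[, 1]  # first standardized column
cars$Z.hp  <- .Z[, 2]  # second standardized column
remove(.Z)

# Standardized variables should have mean 0 and standard deviation 1.
round(c(mean = mean(cars$Z.mpg), sd = sd(cars$Z.mpg)), 6)
```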
8)      Converting a continuous variable to bins
data_set$binned_variable <- bin.var(data_set$continuous_variable, bins=number_of_bins, method='intervals', labels=FALSE)
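bin.var() is not part of base R (I believe it ships with the R Commander packages), so the sketch below swaps in base R's cut() to get a similar result: equal-width bins returned as integer codes. The number of bins (4) and the column hp are assumptions.

```r
cars <- mtcars

# breaks = 4 asks cut() for four equal-width intervals over the range of hp;
# labels = FALSE returns the bin number (1 to 4) instead of interval labels.
cars$hp_bin <- cut(cars$hp, breaks = 4, labels = FALSE)

table(cars$hp_bin)                      # observations per bin
tapply(cars$hp, cars$hp_bin, range)     # hp range covered by each bin
```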
9)      Dropping a variable:
data_set$variable_name <- NULL
10)  Renaming a variable
names(data_set)[c(column_number)] <- c("new_name")
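11)  Decile analysis
numSummary(data_set[,"variable"], statistics=c("mean", "sd", "quantiles"),
  quantiles=c(0,.25,.5,.75,1))
The quantiles shown here are quartiles; use quantiles=seq(0, 1, 0.1) for deciles. The cut points can be changed according to the needs of the business and the purpose of the analysis.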
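numSummary() comes from the R Commander helper package RcmdrMisc. As a rough base-R counterpart, the sketch below computes decile cut points with quantile() and groups observations by decile. The variable (mpg from mtcars) is an arbitrary choice.

```r
mpg <- mtcars$mpg

# Decile cut points: 0%, 10%, ..., 100%.
deciles <- quantile(mpg, probs = seq(0, 1, by = 0.1))
deciles

# Assign each observation to a decile group (1 = lowest values).
# unique() guards against repeated cut points when the data have many ties.
decile_group <- cut(mpg, breaks = unique(deciles),
                    include.lowest = TRUE, labels = FALSE)

# Mean of the variable within each decile group.
tapply(mpg, decile_group, mean)
```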
12)  Frequency analysis for categorical variables
.Table <- table(data_set$variable_name)
.Table  # counts for variable_name
100*.Table/sum(.Table)  # percentages for variable_name
remove(.Table)
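The same frequency analysis, sketched on the cyl column of mtcars. prop.table() is used here as an equivalent way of getting the shares.

```r
tab <- table(mtcars$cyl)  # counts per number of cylinders
tab

pct <- 100 * prop.table(tab)  # same as 100 * tab / sum(tab)
round(pct, 1)

# Counts and percentages side by side.
cbind(count = tab, percent = round(pct, 1))
```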

13)  Correlation matrix
cor(data_set[,c("var_1","var_2","var_3",…,"var_n")], use="complete.obs")
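A quick sketch of the correlation matrix on three mtcars columns, picked arbitrarily for the example.

```r
vars <- c("mpg", "hp", "wt")

# use = "complete.obs" drops rows with any missing value before computing.
corr <- cor(mtcars[, vars], use = "complete.obs")
round(corr, 2)

# Variable pairs with strong absolute correlation (0.7 is an arbitrary threshold).
strong <- which(abs(corr) > 0.7 & upper.tri(corr), arr.ind = TRUE)
data.frame(var_a = vars[strong[, "row"]], var_b = vars[strong[, "col"]])
```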
14)  Principal Component Analysis with a scree plot
.PC <- princomp(~Var_1+Var_2+…+Var_n, cor=TRUE, data=data_set)
unclass(loadings(.PC))  # component loadings
.PC$sdev^2  # component variances
screeplot(.PC)
data_set$PC1 <- .PC$scores[,1]
data_set$PC2 <- .PC$scores[,2]
data_set$PC3 <- .PC$scores[,3]
remove(.PC)
Here only three principal components are retained. The number of components to retain depends on the data size and the analysis requirements.
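To make the retention decision more concrete, here is a sketch on the built-in USArrests data. Keeping components with variance above 1 is one common rule of thumb when cor=TRUE, not the only option, and it is an assumption of this example.

```r
pc <- princomp(~ Murder + Assault + UrbanPop + Rape,
               data = USArrests, cor = TRUE)

variances <- pc$sdev^2                    # component variances (eigenvalues)
prop_var  <- variances / sum(variances)   # share of variance per component
round(cumsum(prop_var), 3)                # cumulative share explained

# Rule of thumb (an assumption here): keep components with variance > 1.
n_keep <- sum(variances > 1)
n_keep

# Store the first component score back in a copy of the data.
arrests <- USArrests
arrests$PC1 <- pc$scores[, 1]
head(arrests)
```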
15)  Factor Analysis
.FA <- factanal(~Var_1+Var_2+…+Var_n, factors=number_of_factors, rotation="varimax", scores="none", data=data_set)
.FA
remove(.FA)
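A small factor analysis sketch using the built-in ability.cov data. Note that it passes a covariance matrix through covmat instead of a formula and data set, and the choice of two factors is an assumption for the example.

```r
# ability.cov holds the covariance matrix of six ability tests.
fa <- factanal(factors = 2, covmat = ability.cov, rotation = "varimax")

# Show only loadings above 0.3 in absolute value to make the pattern readable.
print(fa$loadings, cutoff = 0.3)

# Uniqueness: share of each variable's variance not explained by the factors.
round(fa$uniquenesses, 3)
```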
16)  Cluster Analysis
A) K-Means Cluster
.cluster <- KMeans(model.matrix(~-1 + Var_1 + Var_2+…+ Var_n, data_set), centers = number_of_clusters, iter.max = 10, num.seeds = 10)
.cluster$size # Cluster Sizes
.cluster$centers # Cluster Centroids
.cluster$withinss # Within Cluster Sum of Squares
.cluster$tot.withinss # Total Within Sum of Squares
.cluster$betweenss # Between Cluster Sum of Squares
biplot(princomp(model.matrix(~-1 + Var_1 + Var_2+…+ Var_n, data_set)), xlabs =
  as.character(.cluster$cluster))
data_set$KMeans <- assignCluster(model.matrix(~-1 + Var_1 + Var_2+…+ Var_n, data_set),
  data_set, .cluster$cluster)
remove(.cluster)
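KMeans() (capital K) and assignCluster() are R Commander helpers from RcmdrMisc. The sketch below uses base R's kmeans() instead, with nstart playing a role similar to num.seeds. The iris data, the scaling step and three clusters are assumptions for the example.

```r
set.seed(42)  # k-means starts from random centres, so fix the seed

x <- scale(iris[, 1:4])  # standardize so no variable dominates the distance
km <- kmeans(x, centers = 3, iter.max = 10, nstart = 10)

km$size          # cluster sizes
km$centers       # cluster centroids (in standardized units)
km$withinss      # within-cluster sum of squares, per cluster
km$tot.withinss  # total within-cluster sum of squares
km$betweenss     # between-cluster sum of squares

# Attach the cluster label to a copy of the data.
flowers <- iris
flowers$KMeans <- factor(km$cluster)
table(flowers$KMeans, flowers$Species)
```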

 B) Hierarchical Cluster:
cluster_solution <- hclust(dist(model.matrix(~-1 + Var_1 + Var_2+…+ Var_n, data_set)), method="ward.D")
plot(cluster_solution, main="Cluster Dendrogram for Solution cluster_solution", xlab="Observation Number in Data Set data_set",
  sub="Method=ward.D; Distance=euclidean")
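A sketch of the hierarchical solution on USArrests, with cutree() added to turn the dendrogram into cluster labels. The scaling step and the choice of four clusters are assumptions.

```r
x <- scale(USArrests)  # standardize before computing Euclidean distances

# "ward.D" is the current name of the method older R versions called "ward".
hc <- hclust(dist(x), method = "ward.D")

# Cut the tree into four groups and look at their sizes.
groups <- cutree(hc, k = 4)
table(groups)

# Group means on the original scale help to interpret the clusters.
aggregate(USArrests, by = list(cluster = groups), FUN = mean)
```

Calling plot(hc) draws the dendrogram for this solution.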

Note: Selecting the optimal number of clusters, the clustering method, etc. takes a lot of experimentation, both to address the actual need of the business and to justify the clustering.
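One simple way to run such an experiment is an "elbow" check: fit k-means for several values of k and watch how the total within-cluster sum of squares falls. This is only one of several possible criteria; the data (iris) and the range 1 to 6 are assumptions for the sketch.

```r
set.seed(123)
x <- scale(iris[, 1:4])

# Total within-cluster sum of squares for k = 1, ..., 6.
wss <- sapply(1:6, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)
round(wss, 1)

# Percentage drop when going from k to k + 1; a sharp fall-off in these
# drops suggests that extra clusters add little.
drop_pct <- 100 * -diff(wss) / wss[-length(wss)]
round(drop_pct, 1)
```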
Logistic Regression Model
1)Fit the model:
model_name <- glm(dependent_variable ~ Var_1+Var_2+…+Var_n, family=binomial(logit), data=data_set)
summary(model_name)
2)Confidence intervals for the model coefficients
Confint(model_name, level=.95, type="LR")
The level can be adjusted based on the sample size and the model requirements.
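A sketch of the fit and its intervals on mtcars. Confint() with a capital C comes from the car package; to stay in base R this example uses confint.default(), which gives Wald intervals, not the likelihood-ratio intervals requested by type="LR". The response (am) and predictors (wt, hp) are arbitrary choices.

```r
# Model whether a car has a manual transmission (am = 1) from weight and horsepower.
model <- glm(am ~ wt + hp, family = binomial(link = "logit"), data = mtcars)
summary(model)

# Wald confidence intervals for the coefficients (log-odds scale).
ci <- confint.default(model, level = 0.95)
ci

# Exponentiate to read coefficients and limits as odds ratios.
exp(cbind(odds_ratio = coef(model), ci))
```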
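3)Model Accuracy Test:
a)      AIC Value:
AIC(model_name)  # a smaller value indicates a better fit
b)      BIC Value:
BIC(model_name)  # a smaller value indicates a better fit
Both are relative measures, so they are most useful for comparing candidate models fitted to the same data.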
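Since AIC and BIC only mean something in comparison, here is a sketch that compares two candidate logistic models on mtcars. The two model formulas are assumptions picked for the example.

```r
m1 <- glm(am ~ wt,      family = binomial, data = mtcars)
m2 <- glm(am ~ wt + hp, family = binomial, data = mtcars)

# Passing several models returns a table with one row per model.
AIC(m1, m2)
BIC(m1, m2)

# AIC by hand for m2: 2 * (number of parameters) - 2 * log-likelihood.
k <- length(coef(m2))
aic_by_hand <- 2 * k - 2 * as.numeric(logLik(m2))
aic_by_hand
```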
4)Visualization of Model:
a)Basic diagnostic plots:
oldpar <- par(oma=c(0,0,3,0), mfrow=c(2,2))
plot(model_name)
par(oldpar)

b)Component + Residual plot:
crPlots(model_name, ask=FALSE)
c)Influence plot:
influencePlot(model_name)
d)Effect plot:
trellis.device(theme="col.whitebg")
plot(allEffects(model_name), ask=FALSE)
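The component + residual and influence plots come from the car package and the effect plot from the effects package. As a base-R sketch, the code below draws the four standard diagnostic plots and tabulates the quantities an influence plot is built from. The model (am ~ wt + hp on mtcars) is an assumption, and pdf(NULL) is used only so the example runs without opening a window.

```r
model <- glm(am ~ wt + hp, family = binomial, data = mtcars)

pdf(NULL)  # draw to a null device; drop this line to see the plots on screen
oldpar <- par(oma = c(0, 0, 3, 0), mfrow = c(2, 2))
plot(model)  # residuals, Q-Q, scale-location and leverage plots
par(oldpar)
dev.off()

# Numbers behind an influence plot: leverage, studentized residual, Cook's distance.
infl <- data.frame(
  hat     = hatvalues(model),
  student = rstudent(model),
  cook    = cooks.distance(model)
)

# The three most influential observations by Cook's distance.
head(infl[order(-infl$cook), ], 3)
```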

