Most used SAS procedures
1) proc sql
2) proc format
3) proc cluster
4) proc logistic
5) proc reg
6) proc lifetest
7) proc phreg
8) proc factor
9) proc princomp (PCA)
10) proc corr
11) proc freq
12) proc import
13) proc export
14) proc append
15) proc glm
My industry is predictive/statistical modelling.
Data Analysis and Modeling using R
1) Read the data set
Data can be read from Excel files, HTML, text files, a web URL, database files, XML, etc.
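As a minimal sketch of the text-file case, the snippet below writes a small CSV to a temporary file and reads it back with base R's read.csv(). The object and column names (scores, id, score) are made up for illustration; Excel, database and XML sources need additional packages and are not shown here.

```r
# Hypothetical example data written to a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, score = c(10.5, NA, 7.2)),
          csv_path, row.names = FALSE)

# Read the text file back into a data frame
scores <- read.csv(csv_path)
str(scores)
```

read.csv() also accepts a URL in place of a local path.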
2) Load the data set
load("Full path to rda data set")
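A small round trip can make the behaviour of load() clearer. This sketch saves a made-up data frame (example_data, a hypothetical name) to a temporary .rda file, removes it from the workspace, and restores it.

```r
# Save a hypothetical data frame to a temporary .rda file
rda_path <- tempfile(fileext = ".rda")
example_data <- data.frame(x = 1:5, y = letters[1:5])
save(example_data, file = rda_path)
rm(example_data)

# load() restores the object under its original name and
# returns the names of everything it restored
loaded_names <- load(rda_path)
print(loaded_names)
```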
3) Summarize the data
summary(DataSetName)
4) Count the missing values
sapply(DataSetName, function(x) sum(is.na(x))) # NA counts per variable
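The summary and missing-value steps can be tried on airquality, a data set that ships with R and contains some NA values. It is used here only as a stand-in for your own data.

```r
# Built-in data set with missing values, used as a stand-in
summary(airquality)

# Number of NA values in each column
na_counts <- sapply(airquality, function(x) sum(is.na(x)))
print(na_counts)
```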
5) Computing a new variable:
DataSetName$NewVariableName <- with(DataSetName, calculation/logic)
The calculation can be any expression on the existing variables (if/ifelse, while and for loops, etc.).
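A short sketch of this pattern on the built-in mtcars data. The derived variable names (power_to_weight, efficiency) and the cut-off of 20 mpg are arbitrary choices for illustration.

```r
cars <- mtcars

# Arithmetic on existing columns
cars$power_to_weight <- with(cars, hp / wt)

# Conditional logic, vectorised with ifelse() instead of an explicit loop
cars$efficiency <- with(cars, ifelse(mpg > 20, "high", "low"))

head(cars[, c("mpg", "hp", "wt", "power_to_weight", "efficiency")])
```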
6) Adding observation number:
DataSetName$ObsNumber <- seq_len(nrow(DataSetName)) # same as 1:100 for a data set with 100 rows
7) Standardize variable:
.Z <- scale(DataSetName[, c("Var_1", "Var_2")]) # centre and scale the chosen variables
DataSetName$Z.Var_1 <- .Z[,1]
To standardize more than one variable, use the same command and change the column index from 1 to 2, 3, etc.
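For a concrete version, the sketch below standardizes two columns of the built-in mtcars data with scale() and stores them under Z.-prefixed names. The choice of mpg and hp is arbitrary.

```r
cars <- mtcars

# scale() subtracts the column mean and divides by the column sd
z <- scale(cars[, c("mpg", "hp")])
cars$Z.mpg <- z[, 1]
cars$Z.hp  <- z[, 2]

# Standardized variables should have mean 0 and sd 1
round(c(mean(cars$Z.mpg), sd(cars$Z.mpg)), 6)
```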
8) Converting a continuous variable to bins
DataSetName$BinVariableName <- bin.var(DataSetName$ContinuousVariableName, bins=NumberOfBins, method='intervals', labels=FALSE)
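If the Rcmdr helper is not available, equal-width binning can be approximated with base R's cut(). This is a stand-in under that assumption, not the bin.var() implementation itself; the data set (mtcars) and the number of bins (4) are arbitrary.

```r
# Equal-width bins; labels = FALSE returns integer bin codes
hp_bin <- cut(mtcars$hp, breaks = 4, labels = FALSE)

# How many observations fall into each bin
table(hp_bin)
```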
9) Dropping a variable:
DataSetName$VariableName <- NULL
10) Renaming a variable
names(DataSetName)[c(ColNumber)] <- c("New name")
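11) Decile analysis
numSummary(DataSetName[,"variable"], statistics=c("mean", "sd", "quantiles"), quantiles=c(0,.25,.5,.75,1))
The quantiles shown above are quartiles; for deciles use quantiles=seq(0, 1, 0.1). This can be changed according to the needs of the business and the purpose of the analysis.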
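The same summary can be sketched with base R only, using quantile() with decile cut points. The variable (mtcars$mpg) is just an example.

```r
x <- mtcars$mpg

# Decile cut points: 0%, 10%, ..., 100%
deciles <- quantile(x, probs = seq(0, 1, by = 0.1))
print(deciles)

# Mean and standard deviation alongside the quantiles
c(mean = mean(x), sd = sd(x))
```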
12) Frequency analysis for categorical variables
.Table <- table(DataSetName$VariableName)
.Table # counts for VariableName
100*.Table/sum(.Table) # percentages for VariableName
remove(.Table)
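A runnable version of the frequency table, using the number of cylinders in the built-in mtcars data as the categorical variable. prop.table() is used here as a shorthand for dividing by the total.

```r
# Counts per category
cyl_counts <- table(mtcars$cyl)
print(cyl_counts)

# Percentages per category
cyl_pct <- 100 * prop.table(cyl_counts)
print(round(cyl_pct, 1))
```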
13) Correlation matrix
cor(DataSetName[,c("var_1","var_2","var_3",…,"var_n")], use="complete.obs")
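To see what use="complete.obs" does, the sketch below computes a correlation matrix on three columns of the built-in airquality data, which contain missing values. Rows with any NA in the selected columns are dropped before the correlations are computed.

```r
vars <- c("Ozone", "Solar.R", "Wind")

# Rows with missing values in any selected column are excluded
cor_matrix <- cor(airquality[, vars], use = "complete.obs")
print(round(cor_matrix, 2))
```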
14) Principal Component Analysis using a scree plot
.PC <- princomp(~Var_1+Var_2+…+Var_n, cor=TRUE, data=DataSetName)
unclass(loadings(.PC)) # component loadings
.PC$sdev^2 # component variances
screeplot(.PC)
DataSetName$PC1 <- .PC$scores[,1]
DataSetName$PC2 <- .PC$scores[,2]
DataSetName$PC3 <- .PC$scores[,3]
remove(.PC)
Here only three principal components are retained. The number of components to retain depends on the data size and the analysis requirement.
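The PCA steps above can be run end to end on the built-in USArrests data. The data set and the decision to keep two components are illustrative assumptions only.

```r
arrests <- USArrests

# PCA on the correlation matrix (variables are standardized)
pc <- princomp(~ Murder + Assault + UrbanPop + Rape,
               cor = TRUE, data = arrests)

unclass(loadings(pc))   # component loadings
pc$sdev^2               # component variances (eigenvalues)

# Attach the first two component scores to the data
arrests$PC1 <- pc$scores[, 1]
arrests$PC2 <- pc$scores[, 2]
```

With cor=TRUE the component variances add up to the number of variables, so a component with variance below 1 explains less than one original variable would.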
15) Factor Analysis
.FA <- factanal(~Var_1+Var_2+…+Var_n, factors=NumberOfFactors, rotation="varimax", scores="none", data=DataSetName)
.FA
remove(.FA)
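As a small runnable illustration, the sketch below fits a two-factor model to ability.cov, a covariance matrix of six ability tests that ships with R. It passes the covariance matrix through covmat rather than a formula and data frame, so it is a variant of the call above, and the choice of two factors is an assumption for the example.

```r
# Two-factor model with varimax rotation, fitted from a covariance matrix
fa <- factanal(factors = 2, covmat = ability.cov, rotation = "varimax")

print(fa$loadings)       # factor loadings
print(fa$uniquenesses)   # variance not explained by the factors
```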
16) Cluster Analysis
A) K-Means Cluster
.cluster <- KMeans(model.matrix(~-1 + Var_1 + Var_2 + … + Var_n, DataSetName), centers = NumberOfClusters, iter.max = 10, num.seeds = 10)
.cluster$size # Cluster Sizes
.cluster$centers # Cluster Centroids
.cluster$withinss # Within Cluster Sum of Squares
.cluster$tot.withinss # Total Within Sum of Squares
.cluster$betweenss # Between Cluster Sum of Squares
biplot(princomp(model.matrix(~-1 + Var_1 + Var_2 + … + Var_n, DataSetName)), xlabs = as.character(.cluster$cluster))
DataSetName$KMeans <- assignCluster(model.matrix(~-1 + Var_1 + Var_2 + … + Var_n, DataSetName), DataSetName, .cluster$cluster)
remove(.cluster)
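The same workflow can be sketched with base R's kmeans(), which does not need the Rcmdr helpers. Here nstart = 10 asks for ten random starts, which is assumed to play a role similar to num.seeds above. The data (the four numeric columns of iris, standardized) and the choice of three clusters are for illustration only.

```r
set.seed(42)  # k-means uses random starting centroids

features <- scale(iris[, 1:4])
km <- kmeans(features, centers = 3, iter.max = 10, nstart = 10)

km$size          # cluster sizes
km$centers       # cluster centroids
km$tot.withinss  # total within-cluster sum of squares
km$betweenss     # between-cluster sum of squares

# Attach the cluster membership to the data
iris_clustered <- iris
iris_clustered$KMeans <- factor(km$cluster)
```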
B) Hierarchical Cluster:
ClusterSolutionName <- hclust(dist(model.matrix(~-1 + Var_1 + Var_2 + … + Var_n, DataSetName)), method= "ward") # called "ward.D" in R 3.1.0 and later
plot(ClusterSolutionName, main= "Cluster Dendrogram for Solution ClusterSolutionName", xlab= "Observation Number in Data Set DataSetName", sub="Method=ward; Distance=euclidean")
Note: A lot of experimentation is needed to select the optimal number of clusters, the method, etc., so that the clustering addresses the actual business need and can be justified.
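A runnable sketch of hierarchical clustering on the built-in USArrests data follows. It uses method = "ward.D2" (Ward's criterion on Euclidean distances in current versions of R) and cuts the tree into four groups; both choices are assumptions made for the example, not recommendations.

```r
# Euclidean distances on standardized variables
d <- dist(scale(USArrests))

hc <- hclust(d, method = "ward.D2")
plot(hc, main = "Cluster Dendrogram for USArrests",
     xlab = "Observation", sub = "Method=ward.D2; Distance=euclidean")

# Cut the dendrogram into a chosen number of clusters
groups <- cutree(hc, k = 4)
table(groups)
```

Trying several values of k and comparing the resulting group sizes is one simple way to start the experimentation mentioned in the note.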
Logistic Regression Model
1) ModelName <- glm(DependentVariable ~ Var_1+Var_2+…+Var_n, family=binomial(logit), data=DataSetName)
summary(ModelName)
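For a concrete model, the sketch below predicts transmission type (am, coded 0/1) from horsepower and weight in the built-in mtcars data. The variables are chosen only because they are readily available.

```r
# Binary outcome am (0 = automatic, 1 = manual)
model <- glm(am ~ hp + wt, family = binomial(link = "logit"),
             data = mtcars)
summary(model)

# Predicted probabilities for the training data
pred_prob <- predict(model, type = "response")
head(round(pred_prob, 3))
```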
2) Adjusting the confidence interval for the model
Confint(ModelName, level=.95, type="LR")
The level can be adjusted based on sample size and model requirement.
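3) Model accuracy test:
a) AIC value:
AIC(ModelName) : a smaller value indicates a better fit when comparing models fitted to the same data
b) BIC value:
BIC(ModelName) : a smaller value indicates a better fit when comparing models fitted to the same data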
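AIC and BIC are most useful for comparing candidate models on the same data. The sketch below fits two logistic models to the built-in mtcars data (the variable choices are illustrative assumptions) and puts their criteria side by side.

```r
small_model <- glm(am ~ wt, family = binomial, data = mtcars)
large_model <- glm(am ~ hp + wt, family = binomial, data = mtcars)

# Lower values are preferred; BIC penalises extra parameters more heavily
comparison <- data.frame(
  model = c("am ~ wt", "am ~ hp + wt"),
  AIC = c(AIC(small_model), AIC(large_model)),
  BIC = c(BIC(small_model), BIC(large_model))
)
print(comparison)
```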
4) Visualization of the model:
1) Basic diagnostic plots:
oldpar <- par(oma=c(0,0,3,0), mfrow=c(2,2))
plot(ModelName)
par(oldpar)
2) Component + residual plot:
crPlots(ModelName, ask=FALSE) # cr.plots() in older versions of the car package
3) Influence plot:
influencePlot(ModelName)
4) Effect plot:
trellis.device(theme="col.whitebg")
plot(allEffects(ModelName), ask=FALSE)