Assignment 1
CHAPTER 2
Use the Churn data set for the following:
33) Explore whether there are any missing values for any of the variables.
Load the churn data set into your workspace.
Note: Missing values appear as NA, i.e., Not Available.
> is.na(data1) # search the data frame for missing values
[ reached getOption("max.print") -- omitted 3286 rows ]
# R truncates console output at the "max.print" option, since printing a large data set to the console rapidly becomes unwieldy. We can raise the limit with the following command:
> options(max.print = 999999)
# Now no rows are omitted when printing.
> which(is.na(data1))
integer(0)
# integer(0) means no positions matched, i.e., there are no missing values in the data set.
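For readers outside R, the same missing-value scan can be sketched in Python; the table and column names below are made-up stand-ins for the churn file, with None playing the role of NA.

```python
# Minimal sketch of a missing-value scan over a small in-memory table;
# None stands in for R's NA. Column names and values are hypothetical.
data = {
    "State": ["KS", "OH", None, "NJ"],
    "CustServ.Calls": [1, 0, 2, None],
}

# Collect (column, row index) pairs for every missing entry.
missing = [
    (col, i)
    for col, values in data.items()
    for i, v in enumerate(values)
    if v is None
]
# An empty list here plays the role of R's integer(0): no missing values.
```

Running the same scan on the real churn data, which has no NAs, would return an empty list, matching the integer(0) result above.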
34) Compare the area code and state fields. Discuss any apparent abnormalities.
      Area Code
state 408 415 510
AK     14  24  14
AL     25  40  15
AR     13  27  15
AZ     15  36  13
CA      7  17  10
CO     25  29  12
CT     22  39  13
DC     14  27  13
DE     13  31  17
FL     12  31  20
GA     15  21  18
ME     15  25  22
MI     12  39  22
MN     20  40  24
MO     15  37  11
MS     15  31  19
MT     17  34  17
NC     25  28  15
ND     19  28  15
NE     13  34  14
NH     25  19  12
NJ     15  34  19
NM     16  35  11
NV     14  34  18
NY     19  47  17
OH     22  40  16
OK     17  27  17
OR     14  44  20
PA     14  19  12
RI     12  35  18
SC     13  30  17
SD     16  28  16
TN     11  30  12
TX     20  37  15
UT     12  37  23
VA     25  35  17
VT     17  36  20
WA     23  26  17
WI     22  35  21
WV     20  52  34
WY     17  41  19
We can see that the same three area codes (408, 415, 510) repeat across every state, which is an abnormality: these are all California area codes, so they should not appear in states such as AK or NY.
35) Use a graph to visually determine whether there are any outliers among the number of calls made to customer service.
Since we are working with data sets, an outlier is an observation point that is distant from the other observations. An outlier may be due to variability in the measurement, or it may indicate experimental error; the latter are sometimes excluded from the data set.
We can use a histogram to identify the outliers; other graphs can give an even clearer visual indication of any outliers present in the data set.
> hist(data1$CustServ.Calls, breaks=20, main="Histogram of Calls Made to Customer Service", xlab="Calls made", ylab="Counts")
> box(which="plot", lty="solid", col="black")
From the above graph we can clearly see the outliers: observations with 5 or more customer service calls lie far from the rest of the distribution.
> invsqrt.cust <- 1/sqrt(data1$CustServ.Calls)
> qqnorm(invsqrt.cust, datax=TRUE, col="red", ylim=c(0.01,0.5), main="Normal Q-Q Plot of Inverse Square Root")
> qqline(invsqrt.cust, col="blue", datax=TRUE)
36) Identify the range of customer service calls that should be considered outliers, using:
> m <- mean(data1$CustServ.Calls) # mean of customer service calls
> m
[1] 1.562856
> s <- sd(data1$CustServ.Calls) # standard deviation
> s
[1] 1.315491
> z.custcall <- (data1$CustServ.Calls - m)/s # Z-score transformation
> hist(z.custcall, breaks=20, xlim=c(-2,6), main="Histogram of Z-scores of Customer Service Calls", xlab="Z-score of Customer Service Calls", ylab="Counts")
> box(which="plot", lty="solid", col="black")
From the histogram of Z-scores, there are data values that cross the outlier boundary of 3; all observations with a Z-score above 3 are identified as outliers by this method.
The range of outlier Z-scores under this method is:
[3 to 6]
Transforming these Z-scores back to the original scale (x = m + z*s), the outlier range is approximately [5.5, 9.45] calls.
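The back-transformation is just x = m + z*s. As a quick check of that range, using the mean and standard deviation printed above:

```python
# Back-transform the Z-score outlier boundaries to the original scale.
# m and s are the mean and sd of CustServ.Calls printed by R above.
m = 1.562856
s = 1.315491

lower = m + 3 * s   # Z = 3: the outlier boundary, about 5.51 calls
upper = m + 6 * s   # Z = 6: the largest observed Z-score, about 9.46 calls
```

This matches the quoted [5.5, 9.45] range up to rounding.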
IQR = Q3 - Q1. In other words, the IQR is the 1st quartile subtracted from the 3rd quartile.
A sample graphical representation of the IQR method is shown below:
The first quartile (Q1) is the 25^{th} percentile.
The second quartile (Q2) is the median, i.e., the 50^{th} percentile.
The third quartile (Q3) is the 75^{th} percentile.
A robust measure of outlier detection is therefore defined as follows.
A data value is an outlier if
> it is located 1.5(IQR) or more below Q1, or
> it is located 1.5(IQR) or more above Q3.
In essence, data values outside the interval [Q1 - 1.5(IQR), Q3 + 1.5(IQR)] are considered outliers.
> q1 <- quantile(data1$CustServ.Calls, 0.25)
> m1 <- median(data1$CustServ.Calls)
> m1
[1] 1
> q2 = m1
> q3 = quantile(data1$CustServ.Calls, 0.75)
> q3
75%
  2
> q3 - q1
75%
  1
> iqr1 = IQR(data1$CustServ.Calls)
> iqr1
[1] 1
> q1 - 1.5*iqr1
 25%
-0.5
> q3 + 1.5*iqr1
75%
3.5
Anything outside of [-0.5, 3.5] is an outlier; since the number of calls cannot be negative, in practice any value above 3.5 is an outlier.
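The IQR fences can be sketched in a few lines; the quartiles are the ones R computed above, while the sample call counts are illustrative only.

```python
# IQR outlier fences for customer service calls, using Q1 = 1 and Q3 = 2
# as computed in R above.
q1, q3 = 1.0, 2.0
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr   # -0.5
upper_fence = q3 + 1.5 * iqr   #  3.5

calls = [0, 1, 1, 2, 3, 4, 5, 9]   # illustrative values, not the real data
outliers = [x for x in calls if x < lower_fence or x > upper_fence]
```

Since call counts cannot be negative, only the upper fence (3.5) matters in practice.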
37) Transform the day minutes attribute using Z-score standardization.
> m <- mean(data1$Day.Mins)
> s <- sd(data1$Day.Mins)
> z.daymin <- (data1$Day.Mins - m)/s
> z.daymin
38) Work with skewness as follows.
Skewness
Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.
We use the following statistic to measure the skewness of a distribution:
Skewness = 3*(mean - median)/standard deviation
Applying this measure of skewness to a given data set, our data could be right-skewed or left-skewed.
For right-skewed data the mean is greater than the median, so the skewness is positive; for left-skewed data the reverse holds and the skewness is negative.
For perfectly symmetrical data, the mean, median, and mode are all equal, so the skewness is zero.
> day <- data1$Day.Mins
> m <- mean(day)
> s <- sd(day)
> med <- median(day)
> skewness = 3*(m - med)/s
> m
[1] 179.7751
> s
[1] 54.46739
> med
[1] 179.4
> skewness
[1] 0.02065993
> z.skewness = 3*(mean(z.daymin) - median(z.daymin))/sd(z.daymin)
> z.skewness
[1] 0.02065993
The Z-score standardization has no effect on the skewness.
Based on the skewness value, one can state that day minutes is nearly perfectly symmetrical, as its mean (179.77) and median (179.4) are almost equal (though the mode, 154, is somewhat lower).
The skewness is also almost equal to 0 (~0.0206).
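As a check of the arithmetic, the statistic can be recomputed directly from the (rounded) summary values printed above:

```python
# Pearson's second skewness coefficient: 3*(mean - median)/sd,
# using the rounded summary statistics for Day.Mins printed by R above.
mean_day = 179.7751
median_day = 179.4
sd_day = 54.46739

skewness = 3 * (mean_day - median_day) / sd_day   # roughly 0.0207
```

Because Z-score standardization subtracts the mean and divides by the standard deviation in both the numerator and the denominator, the ratio is unchanged, which is why z.skewness comes out identical.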
CHAPTER 3
Use the adult data set for the following with the target variable as income.
22) Which variables are categorical and which are continuous?
The variables are as follows
Age : Continuous
Years of education : Continuous
Hours : Continuous
Capital gains : Continuous
Capital losses : Continuous
Training : Continuous
Marital status : Categorical
Work class : Categorical
Gender : Dichotomous Categorical, “Male” or “Female”
Race : Categorical
Income : Categorical
Capital Gains or Losses : Dichotomous Categorical, “True” or “False”
Capnet : Continuous
24) Investigate whether we have any correlated variables.
> corrdata<cbind(data2$Age_mm,data2$Years.of.education_mm,data2$Income,data2$Capital.gains_mm,data2$Training)
> corrpvalues<matrix(rep(0,25),ncol = 5)
> for(i in 1:4){for(j in (i+1):5){corrpvalues[i,j]<corrpvalues[j,i]<round(cor.test(corrdata[,i],corrdata[,j])$p.value,4)}}
> round(cor(corrdata),4)
[,1] [,2] [,3] [,4] [,5]
[1,] 0.0000 0.0001 0.000 0.0000 0.8406
[2,] 0.0001 0.0000 0.000 0.0000 0.3324
[3,] 0.0000 0.0000 0.000 0.0000 0.1550
[4,] 0.0000 0.0000 0.000 0.0000 0.5977
[5,] 0.8406 0.3324 0.155 0.5977 0.0000
This table shows the p-values of the pairwise correlation tests. The near-zero p-values indicate that most of the variables are significantly correlated with one another; the exception is "Training", whose large p-values (e.g., 0.8406) show no significant correlation with the other variables.
The other variables, for example capital gains/losses and absolute gains/losses, are correlated.
25) For each of the categorical variables, construct a bar chart of the variable, with an overlay of the target variable. Normalize if necessary.
Normalizing the bar charts gives a better visual indication of the extent to which the target variable varies across the levels of a particular variable, since each bar is expressed as a percentage of that level.
Married couples hold the majority percentage of those who earn more than 50k in income.
The self-employed segment of the work class variable has a higher percentage of incomes greater than 50k compared to government and private positions.
The White and Asian-Pac-Islander races have a higher proportion of people with incomes greater than 50k compared to the other groups.
The gender chart shows that males dominate among incomes greater than 50k compared to females.
Capital gains or losses is fairly self-explanatory: incomes less than or equal to 50k make up the majority of those with losses, while incomes greater than 50k dominate among those with capital gains.
Race, work class, and gender would merit further investigation in my opinion, as there are clear drops in certain categories.
Male and female disparities play an important role in a working environment, but it is worth asking what drives the income gap across gender, and likewise why the self-employed category has a better chance of earning a high income than other groups in the work class variable.
Race also plays an important role, and it would be interesting to see the range of classifications and target variables we can apply to this group.
29) Report the mean, median, minimum, maximum, and standard deviation for each of the following numerical variables.

Numerical variable   Mean     Median   Minimum   Maximum   Standard deviation
Age                  0.2964   0.274    0         1         0.187
Years of Education   0.6055   0.6      0         1         0.1705
Hours                0.4021   0.398    0         1         0.125
Capital Gains        0.01099  0        0         1         0.0763
Training             0.3752   0.3755   0         0.75      0.2166
Capnet               0.009    0        -1        1         0.12156
30) Construct a histogram of each numerical variable, with an overlay of the target variable income. Normalize if necessary.
We can clearly see that the proportion of incomes >50k increases gradually with years of education; one could even say income is directly proportional to years of education.
There is a middle band in the age variable where people have the highest percentage of earning an income greater than 50k, and the percentage decreases gradually toward the younger ages.
As observed from the graph, people who work fewer hours have a very low chance of earning above the 50k mark, whereas their counterparts with longer work hours have a higher percentage above that mark.
It is surprising to notice that Training shows no relationship with income; the overlay remains the same throughout.
The variables of interest here are age, training, hours worked, and capnet.
31) For each pair of numeric variables, construct a scatterplot of the variables. Discuss your salient results.
> pairs(~ data2$Age_mm + data2$Years.of.education_mm + data2$Hours_mm + data2$Capital.gains_mm + data2$Capital.losses_mm + data2$Abs.CapGains.Losses + data2$Training + data2$Capnet)
> cor(data2[,c(1,2,3,4,5,6,7,14)])
From the scatterplots we observe linear correlations between capital gains, capital losses, capnet, and absolute capital gains/losses.
There is very little correlation among the rest of the variables. One can say that age is not a factor when it comes to years of education, which is surprising to observe.
Using the correlation function, it is easier to pinpoint and confirm the statements made above.
32) Based on your EDA so far, identify interesting subgroups of records within the data set that would be worth further investigation.
From the questions solved above, we can identify interesting subgroups for different combinations of variables:
Age and Years of Education
Training and Hours
Age and Training
Years of Education and Training
33) Apply Binning to one of the numeric variables. Do it in such a way as to maximize the effect of the classes thus created (following the suggestions in the text). Now do it in such a way as to minimize the effect of the classes so that the difference between the classes is diminished. Comment.
> training2 <- table(data2$Income, cut(data2$Training, pretty(data2$Training, 2)), dnn = c("Income","Training"))
> barplot(training2, names.arg = c("<=0.5","<=1"), col = c("skyblue","pink"), xlab = "Training", ylab = "Income", main = "Training vs Income")
> training4 <- table(data2$Income, cut(data2$Training, pretty(data2$Training, 4)), dnn = c("Income","Training"))
> barplot(training4, names.arg = c("<=0.2","<=0.4","<=0.6","<=0.8"), col = c("skyblue","pink"), xlab = "Training", ylab = "Income", main = "Training vs Income")
From the resulting graphs, maximizing the effect of the classes (four bins) yields a much clearer gradual decrease in income with respect to training. Using only two classes on the x-axis shows an abrupt drop in income and is not as informative as the four-class version.
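The effect of the two schemes can be sketched without R; the helper below does equal-width binning by hand (the training values are illustrative, not the real adult data).

```python
# Equal-width binning sketch: cut values in [lo, hi] into n_bins classes.
# bisect_left finds the first bin edge that is >= the value.
from bisect import bisect_left

def bin_counts(values, n_bins, lo=0.0, hi=1.0):
    width = (hi - lo) / n_bins
    edges = [lo + width * (i + 1) for i in range(n_bins)]
    counts = [0] * n_bins
    for v in values:
        counts[min(bisect_left(edges, v), n_bins - 1)] += 1
    return counts

training = [0.05, 0.30, 0.40, 0.55, 0.60, 0.75]   # illustrative values
coarse = bin_counts(training, 2)   # 2 classes: differences diminished
fine = bin_counts(training, 4)     # 4 classes: class effect maximized
```

The coarse scheme lumps everything into two bins, hiding the trend that the four-bin scheme reveals, which mirrors the contrast between the two barplots.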
CHAPTER 4
15) First filter out all the batters with fewer than 100 at-bats. Next standardize all the numerical variables using Z-scores.
at.bats <- baseball$at_bats
k <- which(at.bats < 100)
No batters were found with fewer than 100 at-bats (k is empty), so no records are filtered out.
Next we use the following function to standardize all the numeric variables.
Z-score standardization:
zscore <- function(x){
  (x - mean(x))/sd(x)
}
> baseball$age_z <- zscore(baseball$age)
> baseball$games_z <- zscore(baseball$games)
> baseball$at_bats_z <- zscore(baseball$at_bats)
Similarly, we can convert all the remaining numerical variables to their respective Z-scores.
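The same helper translates directly to Python; this standalone sketch uses the sample (n - 1) standard deviation so it matches R's sd(), with a tiny made-up age vector.

```python
# Z-score standardization, mirroring the R helper above. The sample
# standard deviation uses an (n - 1) denominator, matching R's sd().
def zscore(values):
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / (n - 1)) ** 0.5
    return [(v - mean) / sd for v in values]

ages = [24.0, 30.0, 36.0]   # illustrative values, not the baseball data
ages_z = zscore(ages)       # [-1.0, 0.0, 1.0]
```

After this transformation each variable has mean 0 and standard deviation 1, which is what puts them on a common scale for PCA.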
16) Suppose we are interested in estimating the number of home runs, based on the other numerical variables in the data set. All the other numeric variables will be our predictors. Investigate whether sufficient variability exists among the predictors to perform PCA.
> View(cor(baseball[,c(21:36)]))
> symnum(cor(baseball[,c(21:36)]))
a__ r h_ d g c st__ st_ w R ag_ sl__ o b t hL
at_bats_z 1
runs_z * 1
hits_z B * 1
doubles_z + + * 1
games_z B + * + 1
caught_stealing_z . . . . . 1
stolen_bases_z . . . . . , 1
strikeouts_z , , , , , . . 1
walks_z , , , , , . , 1
RBIs_z + + + + + . , , 1
age_z 1
slugging_pct_z . , . , . . . , 1
on_base_z . , . . . . , , , 1
bat_ave_z . . , . . . . , , 1
triples_z . . . . . . . 1
homerunsLN_Z , , , . , , . , , . . 1
attr(,"legend")
[1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
> analysis2 <- principal(baseball[,c(21:36)], nfactors=16, rotate="none", scores=T)
> analysis2$values
[1] 9.12131191 1.99744439 1.15826231 1.05617972 0.61981074 0.56073474
[7] 0.46793879 0.31148335 0.24830684 0.17579437 0.12359764 0.05174055
[13] 0.04694734 0.03874794 0.01781648 0.00388288
The criterion for this method is based on the eigenvalues calculated for the data set: components with an eigenvalue greater than or equal to 1 are retained, and the rest are dropped.
In this case we retain the first 4 components: PC1, PC2, PC3, and PC4.
> analysis2$loadings
We should check the cumulative proportion values to see what percentage of the variance is retained by the components.
From the output we can see that if we want to retain 80% of the variance we can extract the first 4 components, but if we want to retain 90% of the variance we should extract six components, namely PC1 to PC6.
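Both the eigenvalue criterion and the proportion-of-variance criterion can be verified directly from the eigenvalues printed above:

```python
# Eigenvalues printed by principal() above. With 16 standardized variables
# the eigenvalues sum to 16, so cumulative proportion = cumulative sum / total.
eigenvalues = [
    9.12131191, 1.99744439, 1.15826231, 1.05617972, 0.61981074, 0.56073474,
    0.46793879, 0.31148335, 0.24830684, 0.17579437, 0.12359764, 0.05174055,
    0.04694734, 0.03874794, 0.01781648, 0.00388288,
]

retained = sum(1 for ev in eigenvalues if ev >= 1)   # eigenvalue criterion: 4
cum4 = sum(eigenvalues[:4]) / sum(eigenvalues)       # about 0.83 of the variance
cum6 = sum(eigenvalues[:6]) / sum(eigenvalues)       # about 0.91 of the variance
```

So 4 components capture roughly 83% of the variance and 6 components roughly 91%, consistent with the text.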
> plot(analysis2$sd^2,type="b",main="Scree plot for Baseball")
From the above plot, the scree plot criterion indicates that the maximum number of components we should extract is 4, as the fourth component occurs just before the line first begins to flatten out.
The communality of a variable is calculated as the sum of its squared component weights. The communality of each variable should be more than 0.5, so that the extracted components retain more than 50% of that variable's variance.
In the case of component PC1, the majority of the variables are above the 0.5 mark.
On closer observation we can see that the variable "age_z" does not have more than 50% of its variance accounted for by the first three components; therefore we take component PC4 into consideration as well.
With PC4 included, the communality of age_z is (0.319)^2 + (0.212)^2 + (0.87)^2 = 0.904, which is greater than 0.5.
Hence, through this criterion we also extract 4 components.
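The communality arithmetic for age_z can be checked directly; the three loadings are the ones quoted in the text.

```python
# Communality = sum of squared component weights for a variable.
# The loadings below are the age_z weights quoted in the text above.
loadings = [0.319, 0.212, 0.87]

communality = sum(w ** 2 for w in loadings)   # about 0.904, above the 0.5 cutoff
```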
18) Based on the information from the previous exercise, make a decision about how many components you shall extract.
The scree plot criterion: 4
The communality criterion: 4
The eigenvalue criterion: 4
The proportion of variance explained criterion: 4 for 80%, or 6 for 90%
So by majority we extract 4 components.
19) Apply PCA using varimax rotation, with your chosen number of components. Write up a short profile of the first few components extracted.
> fafa <- principal(baseball_z, nfactors=16, rotate="varimax", scores = T)
> fafa
For the varimax rotation as well, we require a component loading above 0.5 for a variable's information to be sufficiently retained. In this case we extract the components that carry sufficient variability with respect to their particular variables.
So in this case
Component 1 contains sufficient information for the variables at_bats_z, hits_z, doubles_z, games_z, strikeouts_z, walks_z, and RBIs_z.
Similarly, component 8 contains stolen_bases_z.
Note that component 5 is not necessary to include, as the other components carry the necessary information for their respective variables.
Therefore, components RC1-RC4 and RC6-RC8 would be extracted.
A user-defined composite is simply a linear combination of the variables, which combines several variables into a single composite measure.
User-defined composites take the form W = a1*Z1 + a2*Z2 + ... + ak*Zk, where the Zi are the standardized variables and the weights ai are chosen by the analyst.
From the previous questions we observe that in component PC1 there are 14 variables highly correlated with each other; we can apply the user-defined composite form to PC1.
A user-defined composite uses analyst-chosen weights, giving more priority to certain variables to sharpen its usefulness.
In this way it differs from PCA, which assigns its own calculated weights that cannot be changed.
The above is just an example of how we can use a user-defined composite.
If the user has good knowledge of a certain data set, he can assign better weights to particular predictor variables to increase the predictive power of the composite.
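A user-defined composite is easy to sketch; the weights and z-scores below are hypothetical, chosen only to show how the analyst's weights (unlike PCA's) are set by hand.

```python
# A user-defined composite: an analyst-chosen weighted sum of standardized
# variables. Weights and z-score values below are hypothetical.
def composite(weights, z_values):
    # Convention assumed here: the weights sum to 1.
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * z for w, z in zip(weights, z_values))

weights = [0.5, 0.3, 0.2]    # e.g. favor hits_z over runs_z and walks_z
z_values = [1.2, 0.8, -0.5]  # one player's (hypothetical) z-scores
score = composite(weights, z_values)
```

Changing the weights changes the composite directly, which is exactly the control PCA does not offer.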