The Churn Data Set

Assignment 1

CHAPTER 2

Use the Churn data set for the following:

33)Explore whether there are any missing values for any of the variables.

First, load the churn data set into your workspace.
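For reference, one typical way to load it (a sketch; the file name, path and read options are assumptions about your local setup):

> data1 <- read.csv("churn.txt", stringsAsFactors = TRUE)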


Note: Missing values appear as NA, i.e., Not Available.

>is.na(data1) # we use this command to search for missing values.

  reached getOption("max.print") -- omitted 3286 rows

# the output hits the "max.print" limit, which caps how much data is printed to the R console; we can raise this limit with the following command

>options(max.print = 999999)

# with this setting no rows are omitted; R omits rows by default because printing a large data set to the console quickly becomes unwieldy.

>which(is.na(data1))

  integer(0)

# this says that there aren’t any missing values in the data set.
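An equivalent quick check is to count the NA values directly; a result of zero confirms there are no missing values, and colSums gives the count per column:

> sum(is.na(data1))

> colSums(is.na(data1))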

34)Compare the area code and state fields. Discuss any apparent abnormalities.
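A cross-tabulation of the two fields shows the counts directly (a sketch; the column names State and Area.Code are assumptions about how the churn file was read in):

> table(data1$State, data1$Area.Code)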

        Area Code
State   408 415 510
   AK    14  24  14
   AL    25  40  15
   AR    13  27  15
   AZ    15  36  13
   CA     7  17  10
   CO    25  29  12
   CT    22  39  13
   DC    14  27  13
   DE    13  31  17
   FL    12  31  20
   GA    15  21  18
   ME    15  25  22
   MI    12  39  22
   MN    20  40  24
   MO    15  37  11
   MS    15  31  19
   MT    17  34  17
   NC    25  28  15
   ND    19  28  15
   NE    13  34  14
   NH    25  19  12
   NJ    15  34  19
   NM    16  35  11
   NV    14  34  18
   NY    19  47  17
   OH    22  40  16
   OK    17  27  17
   OR    14  44  20
   PA    14  19  12
   RI    12  35  18
   SC    13  30  17
   SD    16  28  16
   TN    11  30  12
   TX    20  37  15
   UT    12  37  23
   VA    25  35  17
   VT    17  36  20
   WA    23  26  17
   WI    22  35  21
   WV    20  52  34
   WY    17  41  19

We can see that the same three area codes (408, 415 and 510, all of which are California area codes) repeat across every state, which is an apparent abnormality: we would expect each state to have its own set of area codes.

35)Use a graph to visually determine whether there are any outliers among the number of calls made to customer service.

Since we are working with data sets, an outlier is an observation point that is distant from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set.

We can use a histogram to identify the outliers.

Other graphs can give an even clearer visual indication of outliers in the data set; see the boxplot sketch after the Q-Q plot below.

> hist(data1$CustServ.Calls,breaks=20,main="Histogram of Calls made to Customer Service",xlab="Calls made",ylab="Counts")

> box(which="plot",lty="solid",col="black")

From the above graph we can clearly observe outliers: the observations at around 5 or more calls made to customer service lie far from the bulk of the other observations.

>invsqrt.cust<-1/sqrt(data1$CustServ.Calls)

>qqnorm(invsqrt.cust,datax=TRUE,col="red",ylim=c(0.01,0.5),main="Normal Q-Q plot of inverse")

>qqline(invsqrt.cust,col="blue",datax=TRUE)
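As one of the alternative graphs mentioned above, a boxplot flags points lying more than 1.5 x IQR beyond the quartiles directly; a minimal sketch using the same column:

> boxplot(data1$CustServ.Calls, main="Boxplot of Calls made to Customer Service", ylab="Calls made")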

36)Identify the range of Customer Service Calls that should be considered outliers, using:

  • The Z-score Method;
  • The IQR Method.
  a) The Z-score method for identifying outliers states that a data value is an outlier if it has a Z-score that is either less than -3 or greater than 3.

Solution:

> m<-mean(data1$CustServ.Calls) #mean of Customer service calls

> m

[1] 1.562856

> s<-sd(data1$CustServ.Calls) #standard deviation

> s

[1] 1.315491
> z.custcall<-(data1$CustServ.Calls-m)/s #transformation of Z-score

> hist(z.custcall,breaks=20,xlim=c(-2,6),main="Histogram of Z-score of Customer Service Calls",xlab="Z-score of Customer Service Calls",ylab="Counts")

> box(which="plot",lty="solid",col="black")

From the histogram of Z-scores, there are data values that cross the outlier boundary of 3; all observations with a Z-score greater than 3 are identified as outliers under this method.

The range of outliers under this method corresponds to Z-scores from roughly 3 to 6.

De-transforming these Z-score values back to the original scale, the outliers fall in the range [5.5, 9.45] customer service calls.
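These boundaries can be reproduced directly from the mean and standard deviation computed above (a small sketch):

> m - 3*s   # lower boundary on the original scale, about -2.38 (below 0, so not relevant here)

> m + 3*s   # upper boundary, about 5.51; call counts above this value are outliers

> data1$CustServ.Calls[data1$CustServ.Calls > m + 3*s]   # the outlying call counts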

  b) The interquartile range (IQR) is a widely used robust measure of scale.

 IQR = Q3 - Q1. In other words, the IQR is the 1st quartile subtracted from the 3rd quartile; 

The quartiles used in this method are defined as follows:

The first quartile (Q1) is the 25th percentile.

The second quartile (Q2) is the median, i.e., the 50th percentile.

The third quartile (Q3) is the 75th percentile.


A robust outlier-detection rule is therefore defined as follows. A data value is considered an outlier if:

  • it is located 1.5(IQR) or more below Q1, i.e., lower than Q1 - 1.5(IQR); or

  • it is located 1.5(IQR) or more above Q3, i.e., higher than Q3 + 1.5(IQR).

> q1<-quantile(data1$CustServ.Calls,0.25)

> m1<-median(data1$CustServ.Calls)

> m1

[1] 1

> q2=m1

> q3=quantile(data1$CustServ.Calls,0.75)

> q3

75%

  2

> q3-q1

75%

  1

> iqr1=IQR(data1$CustServ.Calls)

> iqr1

[1] 1

> q1-1.5*iqr1

 25%

-0.5

> q3+1.5*iqr1

75%

3.5

Any value outside of [-0.5, 3.5] is an outlier. Since the number of calls cannot be negative, this means customers with 4 or more customer service calls are outliers under the IQR method.
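The same bounds can be used to pull out the offending records (a sketch reusing the quantities computed above):

> out.idx <- which(data1$CustServ.Calls < q1 - 1.5*iqr1 | data1$CustServ.Calls > q3 + 1.5*iqr1)

> length(out.idx)   # how many records are flagged as outliers

> table(data1$CustServ.Calls[out.idx])   # which call counts they correspond to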

37) Transform the day minutes attribute using Z-score Standardization.

> m<-mean(data1$Day.Mins)

> s<-sd(data1$Day.Mins)

> z.daymin<-(data1$Day.Mins-m)/s

> z.daymin
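Equivalently, base R's scale() function performs the same Z-score standardization:

> z.daymin2 <- as.numeric(scale(data1$Day.Mins))   # same values as z.daymin, up to floating-point precision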

38) Work with skewness as follows

  a) Calculate the skewness of day minutes.
  b) Then calculate the skewness of the Z-score standardized day minutes. Comment.
  c) Based on the skewness value, would you consider day minutes to be skewed or nearly perfectly symmetric?

Skewness

Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.

We use the following statistic to measure the skewness of a distribution:

Skewness = 3*(mean - median)/standard deviation

Applying this measure to a given data set, the data may turn out to be right skewed or left skewed.

For right-skewed data the mean is greater than the median, so the skewness is positive; for left-skewed data the mean is less than the median, so the skewness is negative.

For perfectly symmetric data, the mean, median and mode are all equal, and so the skewness is zero.

  a) Solution:

> day<-data1$Day.Mins

> m<-mean(day)

> s<-sd(day)

> med<-median(day)

> skewness=3*(m-med)/s

> m

[1] 179.7751

> s

[1] 54.46739

> med

[1] 179.4

> skewness

[1] 0.02065993

  b) Note: z.daymin is the Z-score standardized data of day minutes.

> z.skewness=(3*(mean(z.daymin)-median(z.daymin))/sd(z.daymin))

> z.skewness

[1] 0.02065993

The Z-score standardization has no effect on the skewness.

  c) Solution:

Based on the skewness value, one can state that day minutes is nearly perfectly symmetric, as its mean (179.77) and median (179.4) are almost equal (the mode is 154).

The skewness value itself (approximately 0.0206) is very close to 0.

CHAPTER 3


Use the adult data set for the following with the target variable as income.

22) Which variables are categorical and which are continuous?

The variables are as follows

Age                      : Continuous

Years of education       : Continuous

Hours                    : Continuous

Capital gains            : Continuous

Capital losses           : Continuous

Training                 : Continuous

Marital status           : Categorical

Work class               : Categorical

Gender                   : Dichotomous Categorical, “Male” or “Female”

Race                     : Categorical

Income                   : Categorical

Capital Gains or Losses  : Dichotomous Categorical, “True” or “False”

Capnet                   : Continuous

24) Investigate whether we have any correlated variables.

> corrdata<-cbind(data2$Age_mm,data2$Years.of.education_mm,data2$Income,data2$Capital.gains_mm,data2$Training)

> corrpvalues<-matrix(rep(0,25),ncol = 5)

> for(i in 1:4){for(j in (i+1):5){corrpvalues[i,j]<-corrpvalues[j,i]<-round(cor.test(corrdata[,i],corrdata[,j])$p.value,4)}}

> round(corrpvalues,4)   # p-values of the pairwise correlation tests

[,1]   [,2]  [,3]   [,4]   [,5]

[1,] 0.0000 0.0001 0.000 0.0000 0.8406

[2,] 0.0001 0.0000 0.000 0.0000 0.3324

[3,] 0.0000 0.0000 0.000 0.0000 0.1550

[4,] 0.0000 0.0000 0.000 0.0000 0.5977

[5,] 0.8406 0.3324 0.155 0.5977 0.0000

From this table of p-values, the variable "Training" is not significantly correlated with any of the other variables (its p-values are all well above 0.05), whereas the remaining variables are significantly correlated with one another (p-values of approximately 0).

Other variables in the data set, for example capital gains/losses and absolute capital gains/losses, are also correlated with each other.

25) For each of the categorical variables, construct a bar chart of the variable, with an overlay of the target variable. Normalize if necessary.

Normalizing the bar charts gives a clearer visual picture of how strongly the target variable is associated with each category of a given variable, since the overlay is then expressed as a percentage within each category. A sketch of one such chart is given below.
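One way to build such a normalized, overlaid bar chart in base R; the column names Marital.status and Income (and the colours) are illustrative assumptions about this particular adult file:

> marital.tab <- table(data2$Income, data2$Marital.status)

> barplot(prop.table(marital.tab, margin=2), legend.text=TRUE, col=c("skyblue","pink"), xlab="Marital status", ylab="Proportion", main="Marital Status with Income Overlay (normalized)")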

  a) Discuss the relationship, if any, each of these variables has with the target variable.

-Married couples hold the largest percentage of records earning more than 50K, so marital status is clearly related to income.

-Among the work classes, the self-employed have a higher percentage of incomes above 50K than government or private-sector employees.

-The White and Asian-Pac-Islander race categories have a higher proportion of people earning above 50K than the other categories.

-The gender chart shows that males dominate the >50K income group compared to females.

-Capital gains or losses is fairly self-explanatory: records with income at or below 50K make up the majority of the losses, while records with income above 50K dominate the capital gains.

  b) Which variables would you expect to make a significant appearance in any data mining classification model we work with?

Race, work class and gender would make a significant appearance, in my opinion, as the bar charts show clear drops in certain categories.

Gender differences play an important role in a working environment, and it is worth investigating what drives the income gap between males and females, and why the self-employed appear to have a better chance of earning a high income than the other work-class groups.

Race also plays an important role, and it would be interesting to see what range of classifications of the target variable we can build on these groups.

29)Report the mean, median, minimum, maximum, and standard deviation for each of the following numerical variables.

Numerical variable     Mean      Median   Minimum   Maximum   Standard deviation

Age                    0.2964    0.274    0         1         0.187
Years of Education     0.6055    0.6      0         1         0.1705
Hours                  0.4021    0.398    0         1         0.125
Capital Gains          0.01099   0        0         1         0.0763
Training               0.3752    0.3755   0         0.75      0.2166
Capnet                 -0.009    0        -1        1         0.12156
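These figures can be reproduced in one pass (a sketch; the min-max-normalized column names in data2 are assumptions based on the commands used earlier):

> num.vars <- data2[,c("Age_mm","Years.of.education_mm","Hours_mm","Capital.gains_mm","Training","Capnet")]

> round(data.frame(Mean=sapply(num.vars,mean), Median=sapply(num.vars,median), Min=sapply(num.vars,min), Max=sapply(num.vars,max), SD=sapply(num.vars,sd)), 4)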

30)Construct a histogram of each numerical variable, with an overlay of the target variable income. Normalize if necessary.
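A sketch of one way to produce such an overlaid histogram in base R; the income label ">50K" and the column names are assumptions about how this particular file is coded:

> breaks <- seq(0, 1, by=0.05)

> hist(data2$Age_mm, breaks=breaks, col="pink", main="Age with Income Overlay", xlab="Age (min-max normalized)", ylab="Counts")

> hist(data2$Age_mm[data2$Income==">50K"], breaks=breaks, col="skyblue", add=TRUE)   # overlay the >50K group on top

Dividing the >50K count in each bin by the total count in that bin gives the normalized version discussed below.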

  a) Discuss the relationship, if any, each of these variables has with the target variable.

    -We can clearly notice that the proportion of incomes >50K increases steadily with years of education; income rises almost directly with years of education.

    -There is a middle range of ages where the percentage of people earning more than 50K is highest, and the proportion falls off gradually as age decreases.

    -The percentage of people who work fewer hours and earn above the 50K mark is very low, whereas those with higher work hours have a much higher percentage above the 50K mark.

    -It is surprising to notice that Training shows essentially no relationship with income; the proportions remain the same throughout its range.

  

  b) Which variables would you expect to make a significant appearance in any data mining classification model we work with?

Age, training, hours worked and capnet.

31) For each pair of numeric variables, construct a scatterplot of the variables. Discuss your salient results.

> pairs(~data2$Age_mm + data2$Years.of.education_mm + data2$Hours_mm + data2$Capital.gains_mm + data2$Capital.losses_mm + data2$Abs.CapGains.Losses + data2$Training + data2$Capnet)

>cor(data2[,c(1,2,3,4,5,6,7,14)])

From the scatter plot we observe there are linear correlations between capital gains, capital losses, Capnet and absolute capital gains/losses.

There is very little correlation among the remaining variables. Somewhat surprisingly, age appears to have almost no relationship with years of education.

The correlation matrix produced by cor() makes it easier to pinpoint and confirm the observations made above.

32) Based on your EDA so far, identify interesting sub-groups of records within the data set that would be worth further investigation.

From the questions solved above, we can identify interesting sub-groups for different combinations of variables:

-Age and Years of Education

-Training and Hours

   -Age and Training

-Years of Education and Training

33) Apply Binning to one of the numeric variables. Do it in such a way as to maximize the effect of the classes thus created (following the suggestions in the text). Now do it in such a way as to minimize the effect of the classes so that the difference between the classes is diminished. Comment.

> training2<-table(data2$Income,cut(data2$Training,pretty(data2$Training,2)),dnn = c("Income","Training"))

> barplot(training2,names.arg = c("<=0.5","<=1"),col = c("skyblue","pink"),xlab = "Training",ylab = "Income",main = "Training vs Income")

> training4<-table(data2$Income,cut(data2$Training,pretty(data2$Training,4)),dnn = c("Income","Training"))

> barplot(training4,names.arg = c("<=0.2","<=0.4","<=0.6","<=0.8"),col = c("skyblue","pink"),xlab = "Training",ylab = "Income",main = "Training vs Income")

From the resulting graphs: maximizing the effect of the classes (four bins) shows a much clearer, gradual decrease in income as Training increases. Using only 2 classes on the x-axis produces an abrupt drop in income and is not nearly as informative as the 4-class version.

CHAPTER 4

15) First filter out all the batters with fewer than 100 at bats. Next standardize all the numerical variables using z-scores.

> at.bats<-baseball$at_bats

> k<-at.bats>=100   # flag batters with at least 100 at bats

> baseball<-baseball[k,]   # keep only those batters

No batters were found with fewer than 100 at bats.

Next we use this function to standardize all the numeric variables.

Z-score Standardization.

zscore<-function(x){

  (x-mean(x))/sd(x)

  }

> baseball$age_z<-zscore(baseball$age)

> baseball$games_z<-zscore(baseball$games)

> baseball$at_bats_z<-zscore(baseball$at_bats)

and similarly you can convert all the numerical variables to their respective Z-scores.
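Rather than repeating this for every column, the same zscore() function can be applied to all numeric columns at once (a sketch; exactly which columns are numeric depends on the particular baseball file):

> num.cols <- sapply(baseball, is.numeric)   # locate the numeric columns (run before creating the individual *_z columns to avoid duplicates)

> z.block <- as.data.frame(lapply(baseball[,num.cols], zscore))

> names(z.block) <- paste0(names(baseball)[num.cols], "_z")   # e.g. hits_z, runs_z

> baseball <- cbind(baseball, z.block)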

16) Suppose we are interested in estimating the number of home runs, based on the other numerical variables in the data set. All the other numeric variables will be our predictors. Investigate whether sufficient variability exists among the predictors to perform PCA.

> View(cor(baseball[,c(21:36)]))

> symnum(cor(baseball[,c(21:36)]))

                  a__ r h_ d g c st__ st_ w R ag_ sl__ o b t hL

at_bats_z         1                                           

runs_z            *   1                                       

hits_z            B   * 1                                      

doubles_z         +   + *  1                                  

games_z           B   + *  + 1                                

caught_stealing_z .   . .  . . 1                              

stone_bases_z     .   . .  . . , 1                             

strikeouts_z      ,   , ,  , , . .    1                       

walks_z           ,   , ,  , ,   .    ,   1                   

RBIs_z            +   + +  + + .      ,   , 1                 

age_z                                         1                

slugging_pct_z    .   , .  , .        .   . ,     1           

on_base_z         .   , .  . .        .   , ,     ,    1      

bat_ave_z         .   . ,  . .            . .     ,    , 1    

triples_z         .   . .  . . . .                         1  

homerunsLN_Z      ,   , ,  . ,        ,   . ,     ,    . .   1

attr(,"legend")

[1] 0 ‘ ’ 0.3 ‘.’ 0.6 ‘,’ 0.8 ‘+’ 0.9 ‘*’ 0.95 ‘B’ 1

17) How many components should be extracted, according to:

  a) The Eigenvalue Criterion

> library(psych)   # principal() comes from the psych package

> analysis2<-principal(baseball[,c(21:36)],nfactors=16,rotate="none",scores=T)

> analysis2$values

 [1] 9.12131191 1.99744439 1.15826231 1.05617972 0.61981074 0.56073474

 [7] 0.46793879 0.31148335 0.24830684 0.17579437 0.12359764 0.05174055

[13] 0.04694734 0.03874794 0.01781648 0.00388288

The criterion for this method is that components whose eigenvalues are greater than or equal to 1 are retained, and the rest are dropped.
In this case we retain the first 4 components: PC1, PC2, PC3 and PC4.

  b) The proportion of variance explained criterion

> analysis2$loadings

                      

We check the cumulative proportion of variance to see what percentage of the total variance is retained by the components.
From the cumulative proportions in the loadings output: to retain 80% of the variance we can extract the first 4 components, but to retain 90% we should extract six components, PC1 to PC6.

  c) The Scree plot Criterion

> plot(analysis2$sd^2,type="b",main="Scree plot for Baseball")

From the resulting plot, the scree plot criterion indicates that the maximum number of components we should extract is 4, as the fourth component occurs just before the point where the line first begins to flatten out.

  d) The Communality Criterion

The communality of a variable is calculated as the sum of its squared component weights (loadings) over the extracted components. The communality of each variable should be greater than 0.5, so that more than 50% of its variance is accounted for by the extracted components.
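Following this sum-of-squared-loadings definition, the communalities for a 4-component solution can be checked directly (a sketch, reusing the psych package and the same columns as above):

> pca4 <- principal(baseball[,c(21:36)], nfactors=4, rotate="none")

> round(rowSums(pca4$loadings[,1:4]^2), 3)   # values below 0.5 flag poorly captured variables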

For the first components, the majority of the variables are above the 0.5 mark.
On closer observation, however, the variable "age_z" does not share more than 50% of its variance with those components; this is why component PC4 is also taken into consideration.

(-0.319)^2 + (-0.212)^2 + (0.87)^2 ≈ 0.90, which is greater than 0.5.

Hence through this criterion we extract 4 components.

18. Based on the information from the previous exercise, make a decision about how many components you shall extract.

The Scree plot Criterion - 4

The Communality Criterion- 4

The Eigen value Criterion- 4

The proportion of variance explained criterion-4 for 80% and 6 for 90%

So through a majority we extract 4 components.

19. Apply PCA using varimax rotation, with your chosen number of components. Write up a short profile of the first few components extracted.

> fafa<-principal(baseball_z,nfactors=16,rotate="varimax",scores=T)

> fafa

For the varimax rotation as well, we require a component loading above 0.5 for a variable to be considered sufficiently represented by that component, and we extract the components that carry sufficient variability for their particular variables.

So in this case:

Component 1 contains sufficient information for the variables at_bats_z, hits_z, doubles_z, games_z, strikeouts_z, walks_z and RBIs_z.

Similarly, component 8 contains stone_bases_z.

Note that component 5 is not necessary to include, as the other components already carry the necessary information for their respective variables.

Therefore, components RC1-RC4 and RC6-RC8 would be extracted.

Construct a useful user-defined composite using the predictors. Describe situations where the composite would be more appropriate or useful than the principal components, and vice versa.

A user-defined composite is simply a linear combination of the variables, which combines several variables together into a single composite measure.

User-defined composites take the form W = a1*Z1 + a2*Z2 + ... + ak*Zk, where Z1, ..., Zk are the standardized predictor variables and a1, ..., ak are weights chosen by the analyst.

From the previous questions we observed that in component PC1 there are 14 variables that are highly correlated with each other. A user-defined composite can combine several of these into a single measure, as in the sketch below.
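A minimal sketch of such a composite; the particular variables and the equal weights are purely illustrative choices (not taken from the original solution), built from the standardized columns created earlier:

> w <- c(0.25, 0.25, 0.25, 0.25)   # hypothetical analyst-chosen weights summing to 1

> baseball$offense_comp <- w[1]*baseball$at_bats_z + w[2]*baseball$hits_z + w[3]*baseball$runs_z + w[4]*baseball$RBIs_z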

A user-defined composite uses analyst-chosen weights, so more priority can be given to certain variables.

In this respect it differs from PCA, which derives its own weights that cannot be changed.

The sketch above is just one example of how a user-defined composite can be built.

If the user has good domain knowledge of a particular data set, he can assign better weights to the relevant predictor variables and thereby improve the predictive usefulness of the composite.
