Question 1: (9 Total Marks)
Provide a Short Answer (3 marks each)
Question 2: (18 Total Marks)
The file baby.dat contains data on 247 premature births. For each, the birthweight (in grams), the gestational age (in weeks), the one and five minute Apgar scores (on a scale of 1 to 9) and the pH level of the venous blood are recorded. In addition, an indicator of whether the baby survived or not is in the first column, a value of 1 indicating survival and 0 indicating death.
Question 3: (18 Total Marks)
The file crckt.dat contains data for ball-by-ball outcomes of 217 men’s international 50-over matches, 151 men’s international 20-over matches, 72 women’s international 50-over matches and 70 women’s international 20-over matches played during 2015 and 2016. Each row of the data corresponds to a single ball of a match. For each ball, the information recorded is:
a) There is debate about whether scoring rates for 20-over matches are the same as those in the final 20-overs of a 50-over match (i.e., balls 120 down to 1). Use an appropriate technique to model the relationship between expected runs scored and balls remaining which allows comparison across match types and appropriately adjusts for any other important factors. Further, produce a plot (or small collection of plots) to illustrate the similarity or dissimilarity in the expected runs scored versus balls remaining relationship across the two match types. [9 marks]
b) When a wicket falls, a new batsman must start their innings. It is often claimed this is the most difficult time for a batter to score. Investigate whether there is a difference in the runs scored at any given stage of a match depending on whether a wicket has fallen on the previous delivery or not. [9 marks]
ACSC71-326 Advanced Regression
Final Exam – SAMPLE - SOLUTIONS
Question 1:
Question 2:
a) This is a binomial outcome model. So, in general, we could apply any one of:
However, as we are asked about the change in survival chances, we need a model which will give probability estimates as primary output, so discrimination models are contra-indicated here.
If we choose GLMs, we would try various links and go through variable selection procedures (and perform diagnostics along the way, such as Areas under ROC curves or confusion tables). One key issue, though, is whether there are interactions between the gestational age variable and the others, since if there are that would mean the survival rate change for a change in gestational age from 30 to 31 weeks would depend on the actual values of the other predictors, and we are only told to assume they wouldn’t change, but not their actual values. We might just check a bunch of interactions manually by directly including them in our GLM, or instead fit a tree model and see what, if any, interactions it can pick up automatically. Now, if we fit a tree model, use cross-validation and prune it back, we get the following:
> bby <- read.table("bby.day",header=T)
> library(tree)
> bby.tree <- tree(Survival ~ .,data=bby)
> order(cv.tree(bby.tree)$dev)
[1] 8
> plot(prune.tree(bby.tree,best=8)
So, it appears there is some potential interactions, since pH is only an important factor for larger birthweight babies, while age and Apgar1 are the important factors for low birthweight babies. Fortunately, (via deviance testing or other model selection techniques), only the pH/Weight interaction appears significant. So, the quantity in question is determined just by the coefficient of the gestational age predictor. How we turn this into a confidence interval for the change in survival chances depends on our link structure. As an example, suppose we fit a logistic link model and the coefficient in question is ??^?????? with standard deviation of ????(??^??????). Then, ??^?????? represents the change in the log-odds of survival for a 1 week change in gestational age:
ln{??^31/(1 − ??^31)} − ln{??^30/(1 − ??^30)} = ??^??????
which rearranges to:
??^31(1 − ??^30) ??^??????
-------------------------------= ??
??^30(1 − ??^31)
So, a confidence interval for the odds would be ????^??????±1.96????(??^??????). Alternatively, for a CI for some other form, we could use the delta method to determine an appropriate standard deviation.
b) Apgar scores are valued on an 11-point scale. Initially, we might think of using continuous response models (and fitting them would certainly have gotten some marks, if done properly), but a better approach here is to use a model for ordinal responses.
> library(MASS)
> bby$Apgar5 <- factor(bby$Apgar5,
+ levels=c("0","1","2","3","4","5","6","7","8","9","10"),
+ ordered=T)
> bby.pol1 <- polr(Apgar5 ~ .,method="logistic",data=bby)
> summary(bby.pol1)
Fitting a proportional odds logistic regression for the distribution of five minute Apgar score shows that Age and Weight are not significant once we have included the 1 minute Apgar score in the model. Further, there are clearly some issues with categories 0 and 10. A quick table shows that there are only 4 values of 0 and a single value of 10 in the entire dataset. So, we should probably combine these end categories with the one next to them, as the analysis will be unstable if it has to fit categories with such low numbers of observations. If we re-fit the model without Age and Weight and with the updated Apgar score variables, we can see the fitted CDFs of Apgar5 for each value of Apgar1:
> A5 <- ifelse(bby$Apgar5==0,1,ifelse(bby$Apgar5==10,9,bby$Apgar5))
> A1 <- ifelse(bby$Apgar1==0,1,bby$Apgar1)
> apgar.tbl <- as.vector(table(A5,A1))
> a5 <- factor(rep(1:9,9),levels=1:9,ordered=T)
> a1 <- factor(as.character(rep(1:9,rep(9,9))))
> bby.pol2 <- polr(a5 ~ a1,weights=apgar.tbl)
> fitp <- bby.pol2$fitted.values[seq(1,81,9),]
> fitF <- apply(cbind(0,fitp),1,cumsum)
> matplot(0:9,fitF,type="b",lty=1,pch=16,col=rainbow(9),xlab="Apgar5",
+ ylab="CDF")
> legend(0,1,legend=as.character(1:9),title="Apgar1",lty=1,pch=16,
+ col=rainbow(9))
Alternatively, we could ignore the ordinal nature and consider Apgar scores as just categorical and then we could use classification or discrimination techniques to predict Apgar scores. (NOTE: It is even tempting to consider Apgar scores as continuous given they have so many categories, but this is not appropriate.)
Question 3: (18 Total Marks)
a) Runs per over is a count variable, so Poisson regression is a reasonable way to go here (although continuous response models would also work as long as we were careful about the fact that we cannot have negative values for our response variable, so an appropriate link structure such as the natural logarithm would be needed); however, we are ultimately asked for some graphical depictions of our relationship, so some spline or local regression methods will also be useful. To do both, we can use a GAM model. (NOTE: A bit of investigation shows the runs distribution is skewed, but also has lots of zeroes, so we can’t use a log-link, and thus a square-root link seems the best approach).
> ckt <- read.table("crkt.dat",header=T,sep=" ")
> ckt120 <- ckt[ckt$BallsRem<=120,]
> ckt120$Gender <- as.factor(ckt120$Gender)
> ckt120$GameType <- as.factor(ckt120$GameType)
> ckt120$Year <- as.factor(ckt120$Year)
> ckt120$WLastBall <- as.factor(ckt120$WLastBall)
> ckt120$Wckts <- as.factor(ckt120$Wckts)
> par(mfrow=c(2,3))
> ckt120.gam <- gam(Runs ~ lo(BallsRem,span=0.5) + Gender + GameType +
+ + Wckts + WLastBall + Year,
+ family=poisson(link=sqrt),data=ckt120)) > plot(ckt120.gam)
So, it does appear that there is a significant difference between the two game types. However, the practical implications of this statistical result (which is significant in part due to the huge sample size) seems very small (it is on the order of 0.05 of a run!).
b) We could just use the above analysis, and it again shows that a wicket on the last ball creates a statistically significant drop in average runs scored on the next ball, but the size of the drop is again not that practically important (on the order of 0.15 runs per ball). Alternatively, we could use a Poisson GLM to investigate whether there are any interactions.
To export a reference to this article please select a referencing stye below.
Assignment Hippo (2022) . Retrive from https://assignmenthippo.com/sample-assignment/acsc71-326-advanced-regression
"." Assignment Hippo ,2022, https://assignmenthippo.com/sample-assignment/acsc71-326-advanced-regression
Assignment Hippo (2022) . Available from: https://assignmenthippo.com/sample-assignment/acsc71-326-advanced-regression
[Accessed 15/08/2022].
Assignment Hippo . ''(Assignment Hippo,2022) https://assignmenthippo.com/sample-assignment/acsc71-326-advanced-regression accessed 15/08/2022.
Want to order fresh copy of the Sample Template Answers? online or do you need the old solutions for Sample Template, contact our customer support or talk to us to get the answers of it.
Our motto is deliver assignment on Time. Our Expert writers deliver quality assignments to the students.
Get reliable and unique assignments by using our 100% plagiarism-free.
Get connected 24*7 with our Live Chat support executives to receive instant solutions for your assignment.
Get Help with all the subjects like: Programming, Accounting, Finance, Engineering, Law and Marketing.
Get premium service at a pocket-friendly rate at AssignmentHippo
I was struggling so hard to complete my marketing assignment on brand development when I decided to finally reach to the experts of this portal. They certainly deliver perfect consistency and the desired format. The content prepared by the experts of this platform was simply amazing. I definitely owe my grades to them.
Get instant assignment help