Solution file for additional exercise 2.12
------------------------------------------

Data on length and height of English cathedrals (or parts thereof),
of either Roman or Gothic architectural style. We will explore models
for the logarithmic length as a function of the logarithmic height,

  y_ij = (natural log of) length in feet
  x_ij = (natural log of) height in feet

of the j'th cathedral (part) in architectural group i, i=R,G; j=1,...,n_i.

A plot of length against height shows the cathedrals in Bath and Ripon
(rows 11 and 19, respectively) to be somewhat outside the pattern of the
others, and we omit them from the analysis.

Question 1)
-----------

Analysis of covariance model (parallel regression lines), using x as a
covariate:

  y_ij = mu_i + beta * x_ij + eps_ij

where the eps_ij's are assumed i.i.d. and N(0,sigma^2).

MTB > WOpen "H:\VHM\VHM802\Data_csv\hs02_12.csv";
SUBC>   FType;
SUBC>     CSV;
SUBC>   DecSep;
SUBC>     Period;
SUBC>   Field;
SUBC>     Comma;
SUBC>   TDelimiter;
SUBC>     DoubleQuote.
Retrieving worksheet from file: ‘H:\VHM\VHM802\Data_csv\hs02_12.csv’
Worksheet was saved on 20/02/2014

MTB > Name C5 'lnh'
MTB > Let 'lnh' = ln('h')
MTB > Name C6 'lnl'
MTB > Let 'lnl' = ln('l')
MTB > Plot 'lnl'*'lnh';
SUBC>   Symbol.

Scatterplot of lnl vs lnh

MTB > Copy 'lnl';
SUBC>   Newws "c7";
SUBC>   Varnames.
MTB > Copy 'lnl' c7;
SUBC>   Varnames.
MTB > let c7(11)='*'
MTB > let c7(19)='*'
MTB > Name C8 "SRES".
MTB > GLM;
SUBC>   Response 'lnl_1';
SUBC>   Nodefault;
SUBC>   Continuous 'lnh';
SUBC>   Categorical 'style';
SUBC>   Unstandardized;
SUBC>   Terms lnh style lnh*style;
SUBC>   TExpand;
SUBC>   TMethod;
SUBC>   TAnova;
SUBC>   TSummary;
SUBC>   TCoefficients;
SUBC>   TEquation;
SUBC>   TFactor;
SUBC>   TDiagnostics 0;
SUBC>   Rtype 2;
SUBC>   GFOURPACK;
SUBC>   SResiduals 'SRES'.

General Linear Model: lnl_1 versus lnh, style

Method

Factor coding  (-1, 0, +1)
Rows unused    2

Factor Information

Factor  Type   Levels  Values
style   Fixed       2  G, R

Analysis of Variance

Source         DF   Seq SS  Contribution    Adj SS    Adj MS  F-Value  P-Value
lnh             1  1.11505        68.75%  0.159343  0.159343     7.31    0.014
style           1  0.08367         5.16%  0.010455  0.010455     0.48    0.497
lnh*style       1  0.00913         0.56%  0.009135  0.009135     0.42    0.525
Error          19  0.41405        25.53%  0.414047  0.021792
  Lack-of-Fit  17  0.39325        24.25%  0.393246  0.023132     2.22    0.355
  Pure Error    2  0.02080         1.28%  0.020801  0.010400
Total          22  1.62190       100.00%

       S    R-sq  R-sq(adj)     PRESS  R-sq(pred)
0.147621  74.47%     70.44%  0.779432      51.94%

Coefficients

Term           Coef  SE Coef       95% CI       T-Value  P-Value      VIF
Constant       2.40     1.37  ( -0.46,  5.26)      1.75    0.096
lnh           0.858    0.317  ( 0.194, 1.522)      2.70    0.014     4.55
style G       -0.95     1.37  ( -3.81,  1.91)     -0.69    0.497  1877.91
lnh*style G   0.205    0.317  (-0.459, 0.870)      0.65    0.525  1867.88

Regression Equation

style
G      lnl_1 = 1.449 + 1.064 lnh
R      lnl_1 = 3.34 + 0.653 lnh

Fits and Diagnostics for Unusual Observations

Obs   lnl_1     Fit  SE Fit       95% CI          Resid  Std Resid  Del Resid        HI  Cook’s D
  4  5.8406  6.0565  0.1036  (5.8397, 6.2733)  -0.2159      -2.05      -2.26  0.492354      1.02
 22  5.2040  5.4979  0.0817  (5.3269, 5.6689)  -0.2939      -2.39      -2.78  0.306179      0.63

Obs     DFITS
  4  -2.23040  R
 22  -1.84806  R

R  Large residual

Residual Plots for lnl_1

MTB > PPlot 'SRES';
SUBC>   Normal;
SUBC>   Symbol;
SUBC>   FitD;
SUBC>   Grid 2;
SUBC>   Grid 1;
SUBC>   MGrid 1.

Probability Plot of SRES

The P-value of the Anderson-Darling test is 0.219.

Comments:
---------

In this model, there is no significant interaction between style and lnh,
that is, the lines for Roman and Gothic cathedrals are roughly parallel.
As the sequential sum of squares (SS) for style is substantially larger
than the partial (adjusted) SS, we should refit the model without the
interaction term to get better estimates for the parallel lines model.

The largest residual in the model (deletion residual = -2.78) is for
observation 22.
It is the cathedral with the smallest length and height. The
Bonferroni-corrected P-value is 0.28, so it is far from significant. It
has, however, a large value of Cook's statistic (0.63), so it may be
expected to be rather influential. The normal probability plot does not
look very straight, but the A-D normality test does not indicate a
significant deviation from normality.

We refit the model without the interaction.

MTB > Name C8 "SRES".
MTB > GLM;
SUBC>   Response 'lnl_1';
SUBC>   Nodefault;
SUBC>   Continuous 'lnh';
SUBC>   Categorical 'style';
SUBC>   Unstandardized;
SUBC>   Terms lnh style;
SUBC>   TExpand;
SUBC>   TMethod;
SUBC>   TAnova;
SUBC>   TSummary;
SUBC>   TCoefficients;
SUBC>   TEquation;
SUBC>   TFactor;
SUBC>   TDiagnostics 0;
SUBC>   Rtype 2.

General Linear Model: lnl_1 versus lnh, style

...

Analysis of Variance

Source         DF   Seq SS  Contribution   Adj SS   Adj MS  F-Value  P-Value
lnh             1  1.11505        68.75%  1.05729  1.05729    49.97    0.000
style           1  0.08367         5.16%  0.08367  0.08367     3.95    0.061
Error          20  0.42318        26.09%  0.42318  0.02116
  Lack-of-Fit  18  0.40238        24.81%  0.40238  0.02235     2.15    0.365
  Pure Error    2  0.02080         1.28%  0.02080  0.01040
Total          22  1.62190       100.00%

       S    R-sq  R-sq(adj)     PRESS  R-sq(pred)
0.145462  73.91%     71.30%  0.612406      62.24%

Coefficients

Term         Coef  SE Coef        95% CI        T-Value  P-Value   VIF
Constant    1.614    0.632  ( 0.297,  2.931)       2.56    0.019
lnh         1.040    0.147  ( 0.733,  1.346)       7.07    0.000  1.01
style G   -0.0620   0.0312  (-0.1271, 0.0030)     -1.99    0.061  1.01

Regression Equation

style
G      lnl_1 = 1.552 + 1.040 lnh
R      lnl_1 = 1.676 + 1.040 lnh

Fits and Diagnostics for Unusual Observations

Obs   lnl_1     Fit  SE Fit       95% CI          Resid  Std Resid  Del Resid        HI  Cook’s D
 22  5.2040  5.5091  0.0787  (5.3451, 5.6732)  -0.3051      -2.49      -2.93  0.292392      0.86

Obs     DFITS
 22  -1.88209  R

R  Large residual

Comments:
---------

In the additive model, there is a nearly significant difference in lnl
between styles (P=0.061), with Roman cathedrals being 2*0.062=0.124 units
longer on the log scale at a given height.
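
The figures in the comment above can be reproduced from the printed coefficients. The following is a minimal sketch in Python (not part of the Minitab session; all variable names are hypothetical). It assumes the (-1, 0, +1) factor coding reported in the output, so that with two levels the effect for style R is minus the effect for style G, and it also back-transforms the difference to the original length scale.

```python
import math

# Coefficients copied from the additive-model output above (rounded as printed).
constant = 1.614
style_g = -0.0620   # effect for style G under (-1, 0, +1) coding

# Assumption: with two levels and (-1, 0, +1) coding, the effect for R
# is minus the effect for G.
intercept_g = constant + style_g
intercept_r = constant - style_g

# Difference in ln(length) between Roman and Gothic at a given height.
diff_log = intercept_r - intercept_g      # equals 2 * 0.0620

# On the original scale this difference is a multiplicative factor for length.
ratio = math.exp(diff_log)

print(f"G intercept: {intercept_g:.3f}")
print(f"R intercept: {intercept_r:.3f}")
print(f"difference on log scale: {diff_log:.3f}")
print(f"length ratio R/G: {ratio:.3f}")
```

The two intercepts agree with the regression equations printed by Minitab, and the ratio of about 1.13 means that, under this model, Roman cathedrals are estimated to be roughly 13% longer than Gothic ones of the same height.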

Question 2)
-----------

For this question we add a quadratic term in ln(h) to the model:

  y_ij = mu_i + beta_1 * x_ij + beta_2 * (x_ij)^2 + eps_ij

MTB > GLM;
SUBC>   Response 'lnl_1';
SUBC>   Nodefault;
SUBC>   Continuous 'lnh';
SUBC>   Categorical 'style';
SUBC>   Unstandardized;
SUBC>   Terms lnh lnh*lnh style;
SUBC>   TExpand;
SUBC>   TMethod;
SUBC>   TAnova;
SUBC>   TSummary;
SUBC>   TCoefficients;
SUBC>   TEquation;
SUBC>   TFactor;
SUBC>   TDiagnostics 0;
SUBC>   Rtype 2;
SUBC>   SResiduals 'SRES_1'.

General Linear Model: lnl_1 versus lnh, style

...

Analysis of Variance

Source         DF   Seq SS  Contribution    Adj SS    Adj MS  F-Value  P-Value
lnh             1  1.11505        68.75%  0.149249  0.149249     9.61    0.006
lnh*lnh         1  0.20677        12.75%  0.128253  0.128253     8.26    0.010
style           1  0.00515         0.32%  0.005149  0.005149     0.33    0.571
Error          19  0.29493        18.18%  0.294929  0.015523
  Lack-of-Fit  17  0.27413        16.90%  0.274129  0.016125     1.55    0.463
  Pure Error    2  0.02080         1.28%  0.020801  0.010400
Total          22  1.62190       100.00%

       S    R-sq  R-sq(adj)     PRESS  R-sq(pred)
0.124590  81.82%     78.94%  0.463819      71.40%

Coefficients

Term         Coef  SE Coef        95% CI         T-Value  P-Value      VIF
Constant   -26.31     9.73  ( -46.67,   -5.95)     -2.70    0.014
lnh         14.17     4.57  (   4.61,   23.74)      3.10    0.006  1325.58
lnh*lnh    -1.541    0.536  ( -2.662,  -0.419)     -2.87    0.010  1322.41
style G   -0.0178   0.0308  (-0.0823,  0.0468)     -0.58    0.571     1.34

Regression Equation

style
G      lnl_1 = -26.33 + 14.17 lnh - 1.541 lnh*lnh
R      lnl_1 = -26.29 + 14.17 lnh - 1.541 lnh*lnh

Fits and Diagnostics for Unusual Observations

Obs   lnl_1     Fit  SE Fit       95% CI          Resid  Std Resid  Del Resid        HI  Cook’s D
  5  6.0088  6.2444  0.0447  (6.1508, 6.3380)  -0.2356      -2.03      -2.23  0.128828      0.15
 22  5.2040  5.2918  0.1013  (5.0798, 5.5037)  -0.0878      -1.21      -1.23  0.660802      0.71

Obs     DFITS
  5  -0.85652  R
 22  -1.71011     X

R  Large residual
X  Unusual X

Comments:
---------

This model adds one parameter to the parallel regression lines model,
and it fits significantly better (the partial F-statistic for the
quadratic term is 8.26; equivalently, the t-statistic is -2.87).
In this model, however, there is virtually no difference between the two
architectural styles (P=0.57). Observation 22 is fitted much better by
the model, but still has a high leverage and a high Cook's statistic. As
the cathedral is the smallest one, it is not surprising to see this upon
adding a quadratic term in height.

MTB > Plot 'SRES_1'*'lnh';
SUBC>   Symbol.

Scatterplot of SRES_1 vs lnh

MTB > Plot 'SRES_1'*'lnh';
SUBC>   Symbol 'style'.

Scatterplot of SRES_1 vs lnh

Question 3)
-----------

Plotting standardized residuals against ln(h) with different symbols for
the architectural styles shows that the heights are in a very narrow
range for the Roman cathedrals, and that these residuals have an inverted
U-shape. This leads us to suspect that very different curves would apply
for the two architectural styles if fitted separately. We can achieve
this either by separating the observations into two columns, or by
fitting a model with all interactions with style:

MTB > Name C10 "FITS" C11 "SRES_2".
MTB > GLM;
SUBC>   Response 'lnl_1';
SUBC>   Nodefault;
SUBC>   Continuous 'lnh';
SUBC>   Categorical 'style';
SUBC>   Unstandardized;
SUBC>   Terms lnh lnh*lnh style lnh*style lnh*lnh*style;
SUBC>   TExpand;
SUBC>   TMethod;
SUBC>   TAnova;
SUBC>   TSummary;
SUBC>   TCoefficients;
SUBC>   TEquation;
SUBC>   TFactor;
SUBC>   TDiagnostics 0;
SUBC>   Rtype 2;
SUBC>   Fits 'FITS';
SUBC>   SResiduals 'SRES_2';
SUBC>   Hi 'HI';
SUBC>   CookD 'COOK';
SUBC>   DFits 'DFIT'.

General Linear Model: lnl_1 versus lnh, style

...

Analysis of Variance

Source           DF   Seq SS  Contribution   Adj SS    Adj MS  F-Value  P-Value
lnh               1  1.11505        68.75%  0.17516  0.175161    18.90    0.000
lnh*lnh           1  0.20677        12.75%  0.17223  0.172230    18.59    0.000
style             1  0.00515         0.32%  0.13095  0.130955    14.13    0.002
lnh*style         1  0.00539         0.33%  0.13151  0.131514    14.19    0.002
lnh*lnh*style     1  0.13202         8.14%  0.13202  0.132021    14.25    0.002
Error            17  0.15751         9.71%  0.15751  0.009265
  Lack-of-Fit    15  0.13671         8.43%  0.13671  0.009114     0.88    0.654
  Pure Error      2  0.02080         1.28%  0.02080  0.010400
Total            22  1.62190       100.00%

        S    R-sq  R-sq(adj)     PRESS  R-sq(pred)
0.0962574  90.29%     87.43%  0.321084      80.20%

Coefficients

Term                Coef  SE Coef       95% CI        T-Value  P-Value          VIF
Constant          -203.7     47.8  (-304.6, -102.7)     -4.26    0.001
lnh                 97.0     22.3  (  49.9,  144.1)      4.35    0.000     52923.45
lnh*lnh           -11.21     2.60  (-16.69,  -5.72)     -4.31    0.000     52119.32
style G            179.9     47.8  (  78.9,  280.8)      3.76    0.002   5414820.60
lnh*style G        -84.1     22.3  (-131.1,  -37.0)     -3.77    0.002  21708589.45
lnh*lnh*style G     9.81     2.60  (  4.33,  15.30)      3.77    0.002   5483331.43

Regression Equation

style
G      lnl_1 = -23.78 + 12.95 lnh - 1.395 lnh*lnh
R      lnl_1 = -383.6 + 181.1 lnh - 21.02 lnh*lnh

Fits and Diagnostics for Unusual Observations

                                                         Std    Del
Obs   lnl_1     Fit  SE Fit       95% CI         Resid  Resid  Resid        HI  Cook’s D     DFITS
  4  5.8406  5.8300  0.0876  (5.6451, 6.0149)  0.0106   0.27   0.26  0.828892      0.06  0.569477  X

X  Unusual X

MTB > Plot 'SRES_2'*'lnh';
SUBC>   Symbol 'style'.

Scatterplot of SRES_2 vs lnh

MTB > Plot 'FITS'*'lnh';
SUBC>   Symbol 'style';
SUBC>   Connect 'style'.

Scatterplot of FITS vs lnh

Comments:
---------

The model again shows a much improved fit to the data (compare the MSEs,
or refer to the strongly significant F-statistics for all terms in the
model). The leverage and Cook's D for obs. 22 are even higher now
(despite the observation not being listed among the "unusual"
observations), and obs. 4 also has high leverage, but the model fit seems
very good (as assessed from the residual and normal plots).
The estimated quadratic regressions listed above are seen to be very
different, as is also apparent from the plots of fitted values against
lnh for the two styles separately. The first two models assumed the same
relation between height and length in the two groups, which, however,
does not seem to be supported by the data at all.

In a situation like this, with two very different fitted equations, one
may wonder whether it is reasonable to assume equal error standard
deviations for the two styles. The simplest way to find out is to analyze
the data for each style on its own. It turns out that the residual
standard deviations are not too different after all (results not shown),
so our analysis seems fine.
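
The comparisons between the three models made in the comments can be checked from the error sums of squares printed in the ANOVA tables. The sketch below is in Python rather than Minitab, and its function and variable names are hypothetical. It recomputes the partial F-statistic for the quadratic term and, as an illustration not printed in the session, a 2-df F-test of the separate-curves model against the common quadratic. The P-value uses the closed form of the F upper tail that holds when the numerator has 2 degrees of freedom.

```python
# Error sums of squares and degrees of freedom copied from the three ANOVA tables.
sse_parallel, df_parallel = 0.42318, 20     # terms: lnh, style
sse_quadratic, df_quadratic = 0.29493, 19   # terms: lnh, lnh*lnh, style
sse_separate, df_separate = 0.15751, 17     # all interactions with style


def nested_f(sse_reduced, df_reduced, sse_full, df_full):
    """F-statistic for a reduced model against a full model that contains it."""
    numerator_df = df_reduced - df_full
    mean_square_drop = (sse_reduced - sse_full) / numerator_df
    return mean_square_drop / (sse_full / df_full), numerator_df


# Quadratic term added to the parallel-lines model (1 df).
f_quad, df1_quad = nested_f(sse_parallel, df_parallel, sse_quadratic, df_quadratic)

# Style-specific linear and quadratic terms added to the common quadratic (2 df).
f_sep, df1_sep = nested_f(sse_quadratic, df_quadratic, sse_separate, df_separate)

# For 2 numerator df the upper tail of the F distribution has a closed form:
# P(F > f) = (1 + 2 f / df2) ** (-df2 / 2).
p_sep = (1 + 2 * f_sep / df_separate) ** (-df_separate / 2)

print(f"quadratic term: F({df1_quad}, {df_quadratic}) = {f_quad:.2f}")
print(f"separate curves: F({df1_sep}, {df_separate}) = {f_sep:.2f}, P = {p_sep:.4f}")
```

The first statistic reproduces the value 8.26 from the Question 2 output. The second is about 7.4 on (2, 17) degrees of freedom, with a P-value of roughly 0.005, which is in line with the conclusion that a common curve for the two styles is not supported by the data.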