Supplementary Exercise 2.57 IPS7e
---------------------------------

The data are particulate pollution readings from a city and from a nearby
rural location, situated so that prevailing winds blow from the rural
location to the city. Interest is in predicting city pollution levels from
rural levels. Readings are available every 6 days over a 6-month period,
although there are many missing values (stated to be usually due to
equipment failure).

The data are clearly sampled as two responses, and we will assume that the
missing values occur randomly, without any relation to the values that
would have been obtained. This is an important assumption when the data
have so many missing values; for example, it would be a serious problem if
missing values occurred predominantly at high pollution levels. Although
both variables are responses, the stated interest is in predicting city
pollution levels from the rural levels, so City is treated as the response
and Rural as the explanatory variable.


(a) Minitab commands and output:

MTB > WOpen "R:\Chapter 2\ex02_057.mtw".
Retrieving worksheet from file: ‘R:\Chapter 2\ex02_057.mtw’
Worksheet was saved on 15/11/2014

MTB > Fitline 'City' 'Rural';
SUBC>   Confidence 95.0.
Regression Analysis: City versus Rural 

The regression equation is
City = - 2.580 + 1.094 Rural

S = 4.47919   R-Sq = 95.1%   R-Sq(adj) = 94.9%

Analysis of Variance

Source      DF       SS       MS       F      P
Regression   1  9371.10  9371.10  467.08  0.000
Error       24   481.52    20.06
Total       25  9852.62
 
Fitted Line: City versus Rural 

Answers to questions:
---------------------
The fitted line plot shows a strong positive and apparently quite linear
relation between the two pollution levels. There is one point somewhat
off the fitted line. Nevertheless, the regression seems to be useful 
for prediction across the range of rural pollution levels. The largest
rural pollution value (108 for Day 15) is substantially larger than all
other values, so we should be cautious with predictions for high rural
pollution levels, say those exceeding 90 (this cutoff was chosen
visually and is somewhat arbitrary). The estimated regression line is
stated above:
         City = - 2.580 + 1.094 Rural
The listing also gives R^2=95.1%, a very high value. The linear
regression accounts for 95% of the variation in city pollution values.
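
The R^2 value can be reproduced from the analysis of variance table as the
regression sum of squares divided by the total sum of squares. The lines
below are a minimal sketch of that arithmetic in Python (Python is not part
of the original Minitab analysis, and the variable names are chosen here
for illustration only); they use nothing but the figures printed in the
listing.

```python
import math

# Figures copied from the Minitab analysis of variance table above.
ss_regression = 9371.10
ss_error = 481.52
ss_total = 9852.62
df_error = 24

# R-squared: share of the variation in City accounted for by the line.
r_squared = ss_regression / ss_total

# S is the square root of the error mean square.
ms_error = ss_error / df_error
s = math.sqrt(ms_error)

# F statistic: regression mean square (1 df) over error mean square.
f_value = (ss_regression / 1) / ms_error

print(f"R-Sq = {100 * r_squared:.1f}%")
print(f"S    = {s:.3f}")
print(f"F    = {f_value:.1f}")
```

The same table therefore also yields S and the F statistic shown in the
listing, up to rounding of the printed sums of squares.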


Minitab commands and output for (b)-(e):

MTB > Name C3 "SRES" C4 "HI" C5 "COOK".
MTB > Regress;
SUBC>   Response 'City';
SUBC>   Nodefault;
SUBC>   Continuous 'Rural';
SUBC>   Terms Rural;
SUBC>   Constant;
SUBC>   Unstandardized;
SUBC>   Rtype 2;
SUBC>   GFOURPACK;
SUBC>   Gvariable 'Rural';
SUBC>   Tmethod;
SUBC>   Tanova;
SUBC>   Tsummary;
SUBC>   Tcoefficients;
SUBC>   Tequation;
SUBC>   TDiagnostics 0;
SUBC>   Sresiduals 'SRES';
SUBC>   Hi 'HI';
SUBC>   Cookd 'COOK'.
Regression Analysis: City versus Rural 

Method
Rows unused  10

Analysis of Variance
Source         DF  Adj SS   Adj MS  F-Value  P-Value
Regression      1  9371.1  9371.10   467.08    0.000
  Rural         1  9371.1  9371.10   467.08    0.000
Error          24   481.5    20.06
  Lack-of-Fit  19   312.5    16.45     0.49    0.884
  Pure Error    5   169.0    33.80
Total          25  9852.6

Model Summary
      S    R-sq  R-sq(adj)  R-sq(pred)
4.47919  95.11%     94.91%      93.25%

Coefficients
Term        Coef  SE Coef  T-Value  P-Value   VIF
Constant   -2.58     2.73    -0.95    0.354
Rural     1.0935   0.0506    21.61    0.000  1.00

Regression Equation
City = -2.58 + 1.0935 Rural

Fits and Diagnostics for Unusual Observations
                              Std
Obs    City     Fit  Resid  Resid
 15  123.00  115.52   7.48   2.26  R  X
 26   69.00   53.19  15.81   3.60  R

R  Large residual
X  Unusual X
 
Residual Plots for City 
Residuals from City vs Rural 

MTB > Predict 'City';
SUBC>   Nodefault;
SUBC>   KPredictors 88;
SUBC>   TEquation;
SUBC>   TPrediction.
Prediction for City 

Variable  Setting
Rural          88
    Fit   SE Fit        95% CI              95% PI
93.6485  2.06618  (89.3841, 97.9128)  (83.4677, 103.829)


Answers to questions:
---------------------
(b)  The above analysis using Minitab's Regression menu included plots of
the standardized residuals (preferable to the raw residuals) against
the fitted values, the observed rural pollution values and the order of
observations. 

The first two of these plots show that most residuals lie close to or
below zero, with a few scattered points further away from zero. The
two largest standardized residuals are listed in the table of unusual
observations: 2.26 for Day 15, and 3.60 for Day 26. Overall there is no
striking relationship between the residuals and rural pollution. The plot
of residuals against observation order suggests that the variation may
increase over time. That would violate the model assumptions, because the
model assumes the same variance (homogeneity) for all observations.


(c)  It is not straightforward to visually assess the most influential
observation. The two obvious candidates are the points already discussed
above with extreme residuals, although a very influential point does not
necessarily have an extreme residual. The table of unusual observations
flags Day 15 with an X, indicating that it is potentially very
influential; this is because its x-value is substantially larger than
that of any other observation. Further diagnostics (beyond the scope of
the course) are needed to determine exactly how influential each of these
observations is.

The Minitab analysis stored values of leverage and Cook's D. Leverage
measures the potential influence of an observation, and the X flag is
based on leverage. So we know the leverage is largest for Day 15. The
Cook's D statistic measures actual influence, and the value of this
statistic is by far the largest for Day 15. We have therefore established
that the most extreme residual does not belong to the most influential
observation. The explanation is that the x-value for Day 26 is central
among the x-values, so that the poor fit for this observation does not
strongly affect the regression line.
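
The stored leverage and Cook's D columns are not reproduced in the listing,
but rough values can be recovered from the table of unusual observations.
The sketch below is in Python rather than Minitab and rests on two
assumptions: that the standardized residual equals e / (S * sqrt(1 - h)),
and that Cook's D equals (r^2 / p) * h / (1 - h) with p = 2 estimated
coefficients (the usual textbook definitions). The function name is
hypothetical, and because the inputs are rounded the results are
approximate only.

```python
S = 4.47919   # residual standard deviation from the Minitab listing
P = 2         # estimated coefficients: intercept and slope


def leverage_and_cook(resid, std_resid, s=S, p=P):
    """Back out leverage h and Cook's D from a raw and a standardized residual."""
    # std_resid = resid / (s * sqrt(1 - h))
    #   =>  1 - h = (resid / (s * std_resid)) ** 2
    one_minus_h = (resid / (s * std_resid)) ** 2
    h = 1.0 - one_minus_h
    cook_d = (std_resid ** 2 / p) * h / one_minus_h
    return h, cook_d


# Raw and standardized residuals copied from the table of unusual observations.
h15, d15 = leverage_and_cook(7.48, 2.26)
h26, d26 = leverage_and_cook(15.81, 3.60)

print(f"Day 15: leverage about {h15:.2f}, Cook's D about {d15:.1f}")
print(f"Day 26: leverage about {h26:.2f}, Cook's D about {d26:.1f}")
```

With these rounded inputs, Day 15 comes out with a leverage of roughly 0.45
and a Cook's D of roughly 2, while Day 26 has a leverage of roughly 0.04
(close to 1/26, the smallest value possible with 26 observations) and a
Cook's D well below 1. This agrees with the conclusion above.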


(d)  The predicted value for x=88 is yhat=93.65. This can either be seen
in the Minitab listing or be computed manually from the estimated 
regression line. The Minitab listing gives a 95% prediction interval of 
(83.5, 103.8).
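
The manual computation can be sketched as follows, in Python rather than
Minitab. The coefficients, S and SE Fit are copied from the listing; the
critical value 2.0639 (the 97.5th percentile of the t distribution with
24 degrees of freedom) is taken from a t table and is not part of the
Minitab output. The variable names are illustrative only.

```python
import math

# Values copied from the Minitab listing.
b0, b1 = -2.58, 1.0935     # estimated intercept and slope
s = 4.47919                # residual standard deviation S
se_fit = 2.06618           # SE Fit at Rural = 88

# Assumption: critical value looked up in a t table (24 error df).
t_crit = 2.0639

x_new = 88
fit = b0 + b1 * x_new

# A new observation also varies around the line, so the prediction
# standard error combines S with the standard error of the fitted value.
se_pred = math.sqrt(s ** 2 + se_fit ** 2)
lower = fit - t_crit * se_pred
upper = fit + t_crit * se_pred

print(f"yhat   = {fit:.2f}")
print(f"95% PI = ({lower:.1f}, {upper:.1f})")
```

Replacing se_pred by se_fit in the last step gives the narrower 95%
confidence interval for the mean city level at Rural = 88.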


(e)  The residual plots included a normal plot and a histogram for the
standardized residuals. The normal plot is far from a straight line, and
the deviations from the line appear to be associated with more than the
two points discussed above. The histogram shows that the distribution of
the standardized residuals is right-skewed. Overall, the residual plots
show that the model assumptions are not really satisfied for this
analysis. This is mostly due to the influence of the two extreme points
discussed. They affect the model fit quite strongly because all other
points are much closer to the line.