* do-file for linear regression exercise # 1 (VER 14.1)

version 14 /* works also with version 13 */
set more off
cd "h:\vhm\vhm802\data_stata"

* open the btb_episodes data set
use btb_episodes.dta, clear

* Q1a: distribution of outcome -intvl-
codebook intvl
histogram intvl

* Q1b: natural log transformed outcome
generate intvl_ln=ln(intvl)
histogram intvl_ln
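* optional check (a sketch, not part of the original exercise):
* ln() returns missing for intvl<=0, so count any observations lost
* to the transformation, then compare the transformed outcome with
* a normal distribution using a quantile plot
count if missing(intvl_ln) & !missing(intvl)
qnorm intvl_ln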

* Q2: simple linear regressions
regress intvl_ln p_rct
regress intvl_ln p_year
regress intvl_ln hdsize

* Q3: multiple linear regression
regress intvl_ln p_rct p_year hdsize
* added calculation of correlation between p_rct and hdsize
pwcorr p_rct hdsize
* comparing coefficients from 2 models
regress intvl_ln hdsize
estimates store smpl
regress intvl_ln p_rct p_year hdsize
estimates store mltpl
estimates table smpl mltpl
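* optional sketch (not part of the original exercise): quantify how much
* the hdsize coefficient changes once p_rct and p_year are added;
* the scalar names b_smpl and b_mltpl are illustrative
estimates restore smpl
scalar b_smpl=_b[hdsize]
estimates restore mltpl
scalar b_mltpl=_b[hdsize]
display "relative change in hdsize coefficient: " (b_mltpl-b_smpl)/b_smpl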

* Q4: confidence intervals for the mean and prediction intervals for individuals (based only on hdsize)
regress intvl_ln hdsize
* doing the calculations yourself
predict pv, xb
predict pv_mean_se, stdp
scalar tstar=invttail(e(df_r),.025) /* using DFE stored by regress (2985 here) */
generate pv_mean_u=pv + tstar*pv_mean_se
generate pv_mean_l=pv - tstar*pv_mean_se
predict pv_ind_se, stdf
generate pv_ind_u=pv + tstar*pv_ind_se
generate pv_ind_l=pv - tstar*pv_ind_se
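* sort each line plot on hdsize so the lines are not drawn in data order
twoway (scatter intvl_ln hdsize, msize(vsmall)) (line pv hdsize, sort) ///
  (line pv_mean_u hdsize, sort) (line pv_mean_l hdsize, sort) ///
  (line pv_ind_u hdsize, sort) (line pv_ind_l hdsize, sort)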
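* optional sketch (not part of the original exercise): back-transform the
* predictions and the individual-level limits to the original scale of intvl;
* the *_orig variable names are illustrative. Assuming normal errors on the
* log scale, exp(pv) estimates the median of intvl, not its arithmetic mean.
generate pv_orig=exp(pv)
generate pv_ind_l_orig=exp(pv_ind_l)
generate pv_ind_u_orig=exp(pv_ind_u)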
* using the built in graphics capabilities (only simple linear regression)
twoway (scatter intvl_ln hdsize, sort msize(vsmall)) ///
  (lfitci intvl_ln hdsize, ciplot(rline)) ///
  (lfitci intvl_ln hdsize, stdf ciplot(rline))

* Q5: R2 (compare the models without and with p_rct)
regress intvl_ln p_year hdsize
regress intvl_ln p_rct p_year hdsize
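* optional sketch (not part of the original exercise): show R2 and adjusted R2
* of the two models side by side; the stored names r2_small and r2_full
* are illustrative
quietly regress intvl_ln p_year hdsize
estimates store r2_small
quietly regress intvl_ln p_rct p_year hdsize
estimates store r2_full
estimates table r2_small r2_full, stats(N r2 r2_a)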
