* VHM 812/802 - Winter 2016
* L3a-Model Building II

* change working directory and open a log file
cd "c:\vhm812-data"
capture log close
log using l3a_model_build_II.txt, text replace
set more off

* open the DAISY Red dataset
use daisy2red.dta, clear
gen month=month(calv_dt)
gen aut_calv=(month>=2 & month<=7) 
gen hs_ct=herd_size-251
gen hs_sq=herd_size^2
gen parity1=parity-1
gen milk120k=milk120/1000
gen wpc_sqrt=sqrt(wpc)

* specifying maximum model
* no analyses

* causal model
* no analyses

* Reducing the Number of Predictors
* descriptive statistics
codebook cf vag_disch
sum  milk120 parity herd_size dyst rp vag_disch, detail

* correlation
corr  milk120 parity herd_size
pwcorr milk120 parity herd_size, obs star(0.05)

* indices
* no analyses

* unconditional associations
reg cf parity 
reg cf vag_disch

* principle components / factor analysis / correspondence analysis
* not covered

* Functional Form of predictors
* residual plots
reg cf milk120
predict stdres, rsta

* lowess smoother
twoway (scatter stdres milk120) (lowess stdres milk120)

* Detecting and Correcting for non-linearity (transformation of X)
* categorization of predictor
reg cf parity
egen parity_c6=cut(parity), at(0 1 2 3 4 5 6 15) icodes
tab parity parity_c6
reg cf i.parity_c6

* quadratic function of X
gen ln_cf=ln(cf)
reg ln_cf c.milk120k##c.milk120k
estat vif
vce, corr

* redoing the analysis with milk120 centred
summ milk120k
gen m120k_ct=(milk120k-r(mean)) /*r(mean) - memory variable created by summ command that store the mean of the variable*/
reg ln_cf c.m120k_ct##c.m120k_ct
estat vif
vce, corr

capture drop stdres
predict stdres, rsta 
twoway (scatter stdres m120k_ct) (lowess stdres m120k_ct)	

* fractional polynomials
fp <milk120k>,  scale center replace: reg ln_cf <milk120k> 
fp plot, r(none) ytitle(Predicted Ln(cf))

*or you can predict fitted values and standardized residuals and plot them
capture drop fit1 
capture drop stdres
predict fit1 
predict stdres, rstand
twoway scatter ln_cf milk120k || line fit1 milk120k, sort 
twoway (scatter stdres milk120k) (lowess stdres milk120k)

* interactions
* 2-way
reg  wpc_sqrt hs_ct hs_sq aut_calv twin dyst##vag_disch

* Selection criteria
* non-statistical
* no analyses

* statistical - nested models
* Wald test
reg  wpc_sqrt c.hs_ct##c.hs_ct aut_calv twin i.dyst##vag_disch
testparm i.dyst#vag_disch

* AIC and BIC
* statistical non-nested - this in logistic regression

*Selection strategy - Coleman dataset
* best subset - install command vselect
use coleman.dta, clear
rename x1 x1_staff_sal
rename x2 x2_father_job
rename x3 x3_ses
rename x4 x4_test_teach
rename x5 x5_edu_mother
rename y y_test_scr

vselect y_test_scr x1_staff_sal x2_father_job x3_ses ///
	x4_test_teach x5_edu_mother, best

* forward selection
stepwise, pe(0.1): reg y_test_scr x1_staff_sal x2_father_job x3_ses x4_test_teach x5_edu_mother
* backward elimination
stepwise, pr(0.1): reg y_test_scr x1_staff_sal x2_father_job x3_ses x4_test_teach x5_edu_mother
* stepwise selection
stepwise, pe(0.1) pr(0.11): reg y_test_scr x1_staff_sal x2_father_job x3_ses x4_test_teach x5_edu_mother

*selection strategy - daisy2red
use daisy2red, clear
gen hs_ct=herd_size-251
gen hs_sq=herd_size^2
gen parity1=parity-1
gen month=month(calv_dt)
gen aut_clv=(month>=2 & month<=7) if  !missing(calv_dt)
gen sqrt_wpc=sqrt(wpc)
gen twdy=twin*dyst
gen rpvd=rp*vag_disch

reg sqrt_wpc  parity1 aut_clv  hs_ct hs_sq twin dyst twdy  rp vag_disch rpvd   

*stepwise backward
stepwise, pe(0.05) pr(0.051) lockterm1: reg sqrt_wpc  parity1 aut_clv (hs_ct hs_sq)   ///
	(dyst twin twdy) (rp vag_disch rpvd)		 
estimates store sw_1
	
stepwise, pe(0.05) pr(0.051) lockterm1: reg sqrt_wpc  parity1 aut_clv (hs_ct hs_sq)   ///
	dyst twin twdy rp vag_disch rpvd
estimates store sw_2
estimates table sw_1 sw_2
	
* reliability
* split sample analysis
use daisy2red, clear
gen wpc_sqrt=sqrt(wpc)
gen hs_ct=herd_size-251
gen parity1=parity-1
gen month=month(calv_dt)
gen aut_calv=(month>=2 & month<=7) 

gen rand=uniform()
reg  wpc_sqrt c.hs_ct##c.hs_ct parity1 aut_calv twin i.dyst##vag_disch if rand<0.6
scalar r2_1 = e(r2)
capture drop pv
predict pv
*cross validation correlation
corr pv wpc_sqrt if rand>=0.6
*shrinkage on cross-validation
scalar cv_sq = r(rho)^2   /* this computes r2 - r(rho) is a "saved result" from -corr-  */
di "R2 first model= " r2_1 "  and CV2= " cv_sq
di "shrinkage on cross-validation = " r2_1-cv_sq

*leave-one-out (advanced commands)
* computin "n" regression models without obs "i" using residual and leverage values
* and then calculating PRESS (predicted residual sum of squares)

reg  wpc_sqrt c.hs_ct##c.hs_ct parity1 aut_calv twin dyst##vag_disch 
capture drop res lev eq1
predict res, res
predict lev ,lev
gen eq1=(res/(1-lev))^2
summ eq1
di "PRESS = " r(sum) //error sum square from predicted points
di "ESS full= "e(rss) //residuals sum square from full model

di "R2(pred) = "1-(r(sum)/ (`e(mss)' + `e(rss)'))
di "R2 full model = " e(r2)

* presenting results
* standardize coefficients
reg  wpc_sqrt c.hs_ct##c.hs_ct parity1 aut_calv twin dyst##vag_disch
* rescale coefficient by SD of the predictor
matrix define coeff = e(b)'
capture drop coeff*
svmat coeff
summ hs_ct parity1 aut_calv twin dyst vag_disch

* IQR
reg  wpc_sqrt c.hs_ct##c.hs_ct parity1 aut_calv twin dyst##vag_disch 
summ parity1,d  /* note IQR = 3 - 0 = 3  */
			/* coefficient for parity = 0.062 */			
