* VHM 812/802 - Winter 2016 * L3a-Model Building II * change working directory and open a log file cd "c:\vhm812-data" capture log close log using l3a_model_build_II.txt, text replace set more off * open the DAISY Red dataset use daisy2red.dta, clear gen month=month(calv_dt) gen aut_calv=(month>=2 & month<=7) gen hs_ct=herd_size-251 gen hs_sq=herd_size^2 gen parity1=parity-1 gen milk120k=milk120/1000 gen wpc_sqrt=sqrt(wpc) * specifying maximum model * no analyses * causal model * no analyses * Reducing the Number of Predictors * descriptive statistics codebook cf vag_disch sum milk120 parity herd_size dyst rp vag_disch, detail * correlation corr milk120 parity herd_size pwcorr milk120 parity herd_size, obs star(0.05) * indices * no analyses * unconditional associations reg cf parity reg cf vag_disch * principle components / factor analysis / correspondence analysis * not covered * Functional Form of predictors * residual plots reg cf milk120 predict stdres, rsta * lowess smoother twoway (scatter stdres milk120) (lowess stdres milk120) * Detecting and Correcting for non-linearity (transformation of X) * categorization of predictor reg cf parity egen parity_c6=cut(parity), at(0 1 2 3 4 5 6 15) icodes tab parity parity_c6 reg cf i.parity_c6 * quadratic function of X gen ln_cf=ln(cf) reg ln_cf c.milk120k##c.milk120k estat vif vce, corr * redoing the analysis with milk120 centred summ milk120k gen m120k_ct=(milk120k-r(mean)) /*r(mean) - memory variable created by summ command that store the mean of the variable*/ reg ln_cf c.m120k_ct##c.m120k_ct estat vif vce, corr capture drop stdres predict stdres, rsta twoway (scatter stdres m120k_ct) (lowess stdres m120k_ct) * fractional polynomials fp , scale center replace: reg ln_cf fp plot, r(none) ytitle(Predicted Ln(cf)) *or you can predict fitted values and standardized residuals and plot them capture drop fit1 capture drop stdres predict fit1 predict stdres, rstand twoway scatter ln_cf milk120k || line fit1 milk120k, sort twoway (scatter stdres milk120k) (lowess stdres milk120k) * interactions * 2-way reg wpc_sqrt hs_ct hs_sq aut_calv twin dyst##vag_disch * Selection criteria * non-statistical * no analyses * statistical - nested models * Wald test reg wpc_sqrt c.hs_ct##c.hs_ct aut_calv twin i.dyst##vag_disch testparm i.dyst#vag_disch * AIC and BIC * statistical non-nested - this in logistic regression *Selection strategy - Coleman dataset * best subset - install command vselect use coleman.dta, clear rename x1 x1_staff_sal rename x2 x2_father_job rename x3 x3_ses rename x4 x4_test_teach rename x5 x5_edu_mother rename y y_test_scr vselect y_test_scr x1_staff_sal x2_father_job x3_ses /// x4_test_teach x5_edu_mother, best * forward selection stepwise, pe(0.1): reg y_test_scr x1_staff_sal x2_father_job x3_ses x4_test_teach x5_edu_mother * backward elimination stepwise, pr(0.1): reg y_test_scr x1_staff_sal x2_father_job x3_ses x4_test_teach x5_edu_mother * stepwise selection stepwise, pe(0.1) pr(0.11): reg y_test_scr x1_staff_sal x2_father_job x3_ses x4_test_teach x5_edu_mother *selection strategy - daisy2red use daisy2red, clear gen hs_ct=herd_size-251 gen hs_sq=herd_size^2 gen parity1=parity-1 gen month=month(calv_dt) gen aut_clv=(month>=2 & month<=7) if !missing(calv_dt) gen sqrt_wpc=sqrt(wpc) gen twdy=twin*dyst gen rpvd=rp*vag_disch reg sqrt_wpc parity1 aut_clv hs_ct hs_sq twin dyst twdy rp vag_disch rpvd *stepwise backward stepwise, pe(0.05) pr(0.051) lockterm1: reg sqrt_wpc parity1 aut_clv (hs_ct hs_sq) /// (dyst twin twdy) (rp vag_disch rpvd) estimates store sw_1 stepwise, pe(0.05) pr(0.051) lockterm1: reg sqrt_wpc parity1 aut_clv (hs_ct hs_sq) /// dyst twin twdy rp vag_disch rpvd estimates store sw_2 estimates table sw_1 sw_2 * reliability * split sample analysis use daisy2red, clear gen wpc_sqrt=sqrt(wpc) gen hs_ct=herd_size-251 gen parity1=parity-1 gen month=month(calv_dt) gen aut_calv=(month>=2 & month<=7) gen rand=uniform() reg wpc_sqrt c.hs_ct##c.hs_ct parity1 aut_calv twin i.dyst##vag_disch if rand<0.6 scalar r2_1 = e(r2) capture drop pv predict pv *cross validation correlation corr pv wpc_sqrt if rand>=0.6 *shrinkage on cross-validation scalar cv_sq = r(rho)^2 /* this computes r2 - r(rho) is a "saved result" from -corr- */ di "R2 first model= " r2_1 " and CV2= " cv_sq di "shrinkage on cross-validation = " r2_1-cv_sq *leave-one-out (advanced commands) * computin "n" regression models without obs "i" using residual and leverage values * and then calculating PRESS (predicted residual sum of squares) reg wpc_sqrt c.hs_ct##c.hs_ct parity1 aut_calv twin dyst##vag_disch capture drop res lev eq1 predict res, res predict lev ,lev gen eq1=(res/(1-lev))^2 summ eq1 di "PRESS = " r(sum) //error sum square from predicted points di "ESS full= "e(rss) //residuals sum square from full model di "R2(pred) = "1-(r(sum)/ (`e(mss)' + `e(rss)')) di "R2 full model = " e(r2) * presenting results * standardize coefficients reg wpc_sqrt c.hs_ct##c.hs_ct parity1 aut_calv twin dyst##vag_disch * rescale coefficient by SD of the predictor matrix define coeff = e(b)' capture drop coeff* svmat coeff summ hs_ct parity1 aut_calv twin dyst vag_disch * IQR reg wpc_sqrt c.hs_ct##c.hs_ct parity1 aut_calv twin dyst##vag_disch summ parity1,d /* note IQR = 3 - 0 = 3 */ /* coefficient for parity = 0.062 */