#### Transcript Data Mining Tutorial_old

Data Mining Tutorial

D. A. Dickey

Data Mining - What is it?

• Large datasets
• Fast methods
• Not significance testing
• Topics
  – Trees (recursive splitting)
  – Regression & Logistic Regression
  – Neural Networks
  – Association Analysis
  – Nearest Neighbor
  – Clustering
  – Etc.

If the Life Line is long and deep, then this represents a long life full of vitality and health. A short line, if strong and deep, also shows great vitality in your life and the ability to overcome health problems. However, if the line is short and shallow, then your life may have the tendency to be controlled by others.
http://www.ofesite.com/spirit/palm/lines/linelife.htm

Wilson & Mather, JAMA 229 (1974): X = life line length, Y = age at death

    proc sgplot; scatter Y=age X=line; reg Y=age X=line; run;

Result: Predicted Age at Death = 79.24 – 1.367(lifeline)
(Is this “real”??? Is this repeatable???)

We use LEAST SQUARES. Squared residuals sum to 9609.

Distribution of t Under H0

Estimated slopes vary in repeated samples.
Standard deviation (estimated) of sample slopes = “Standard error”.
Compute t = (estimate – hypothesized)/standard error.
The p-value is the probability of a larger |t| when the hypothesis is correct (e.g. 0 slope).
The p-value is the sum of two tail areas.
Traditionally p<0.05 implies the hypothesized value is wrong; p>0.05 is inconclusive.

    proc reg data=life; model age=line; run;

Parameter Estimates

| Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
|---|---|---|---|---|---|
| Intercept | 1 | 79.23341 | 14.83229 | 5.34 | <.0001 |
| Line | 1 | -1.36697 | 1.59782 | -0.86 | 0.3965 |

[Figure: t distribution under H0: slope=0, with tail areas of 0.19825 below -0.86 and 0.19825 above 0.86, summing to 0.39650.]

Conclusion: insufficient evidence against the hypothesis of no linear relationship.

H0: Innocence
H1: Guilt
Beyond reasonable doubt: P<0.05

H0: True slope is 0 (no association)
H1: True slope is not 0
P=0.3965

Need an estimate of variability around the true line. True variance is σ². Estimate uses sums of squared residuals (SS).
Sum of squared residuals from the mean is “SS(total)” = 9755.
Sum of squared residuals around the line is “SS(error)” = 9609.

(1) SS(total)-SS(error) is SS(model) = 146
(2) Variance estimate is SS(error)/(degrees of freedom) = 200
(3) SS(model)/SS(total) is R², i.e. the proportion of variability “explained” by the model.

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
|---|---|---|---|---|---|
| Model | 1 | 146.51753 | 146.51753 | 0.73 | 0.3965 |
| Error | 48 | 9608.70247 | 200.18130 | | |
| Corrected Total | 49 | 9755.22000 | | | |

Root MSE 14.14854, R-Square 0.0150

Trees

• A “divisive” method (splits)
• Start with “root node” – all in one group
• Get splitting rules
• Response often binary
• Result is a “tree”
• Example: Loan Defaults
• Example: Framingham Heart Study
• Example: Automobile fatalities

Recursive Splitting

[Figure: the plane of X1 = Debt To Income Ratio and X2 = Age split into rectangles with Pr{default} = 0.007, 0.012, 0.006, 0.0001 and 0.003; points marked as No default or Default.]

Some Actual Data

• Framingham Heart Study
• First Stage Coronary Heart Disease
  – P{CHD} = Function of:
    • Age - no drug yet!
    • Cholesterol
    • Systolic BP

Example of a “tree”

Pruning options: N=4, Gini for splits, Assessment = Avg. Sq. Err.

How to make splits?

• Which variable to use?
• Where to split?
  – Cholesterol > ____
  – Systolic BP > _____
• Goal: Pure “leaves” or “terminal nodes”
• Ideal split: Everyone with BP>x has problems, nobody with BP<x has problems

How to make splits? Contingency tables

[Figure: BP axis with candidate cutoffs 180? and 240? dividing Low BP from High BP.]

DEPENDENT (effect)

| | Heart Disease: No | Heart Disease: Yes | Total |
|---|---|---|---|
| Low BP | 95 | 5 | 100 |
| High BP | 55 | 45 | 100 |
| Total | 150 | 50 | 200 |

INDEPENDENT (no effect)

| | Heart Disease: No | Heart Disease: Yes | Total |
|---|---|---|---|
| Low BP | 75 | 25 | 100 |
| High BP | 75 | 25 | 100 |
| Total | 150 | 50 | 200 |

χ² Test Statistic

• Expect 100(150/200)=75 in upper left if independent (etc. e.g. 100(50/200)=25)

Observed counts, with expected counts in parentheses:

| | Heart Disease: No | Heart Disease: Yes | Total |
|---|---|---|---|
| Low BP | 95 (75) | 5 (25) | 100 |
| High BP | 55 (75) | 45 (25) | 100 |
| Total | 150 | 50 | 200 |

$\chi^2 = \sum_{\text{all cells}} \dfrac{(\text{observed} - \text{expected})^2}{\text{expected}}$

WHERE IS HIGH BP CUTOFF???

2(400/75) + 2(400/25) = 42.67. Compare to tables: significant!
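The expected counts and the χ² value for the blood-pressure tables above can be checked in a few lines. This is a minimal sketch in Python rather than the SAS used in the slides; the function name `chi_square` is illustrative and does not come from the tutorial.

```python
def chi_square(table):
    """Return (statistic, expected counts) for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Under independence: expected count = row total * column total / n.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    return stat, expected


# Rows: low BP, high BP. Columns: no heart disease, heart disease.
dependent = [[95, 5], [55, 45]]
independent = [[75, 25], [75, 25]]

stat_dep, expected_dep = chi_square(dependent)
stat_ind, _ = chi_square(independent)

print("expected counts:", expected_dep)
print("chi-square (dependent table):", round(stat_dep, 2))
print("chi-square (independent table):", stat_ind)
```

The independent table gives a statistic of zero because every observed count equals its expected count, which is the sense in which that table shows "no effect".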
Measuring “Worth” of a Split

• The p-value is the probability of a chi-square as great as that observed if independence is true. (Pr{χ² > 42.67} is 6.4E-11)
• P-values all too small.
• Logworth = -log10(p-value) = 10.19
• Best chi-square ⇔ max logworth.

Logworth for Age Splits

[Figure: logworth plotted against candidate age splits. Age 47 maximizes logworth.]

How to make splits?

• Which variable to use?
• Where to split?
  – Cholesterol > ____
  – Systolic BP > _____
• Idea – Pick BP cutoff to minimize p-value for χ²
• What does “significance” mean now?

Multiple testing

• 50 different BPs in data, 49 ways to split
• Sunday football highlights always look good!
• If he shoots enough times, even a 95% free throw shooter will miss.
• Tried 49 splits, each has a 5% chance of declaring significance even if there’s no relationship.

Multiple testing

α = Pr{falsely reject hypothesis 1}
α = Pr{falsely reject hypothesis 2}
Pr{falsely reject one or the other} < 2α
Desired: 0.05 probability or less
Solution: use α = 0.05/2
Or – compare 2(p-value) to 0.05

Multiple testing

• 50 different BPs in data, m=49 ways to split
• Multiply p-value by 49
• Bonferroni – original idea
• Kass – apply to data mining (trees)
• Stop splitting if minimum p-value is large.
• For m splits, logworth becomes -log10(m*p-value) !!!

Validation

• Traditional stats – small dataset, need all observations to estimate parameters of interest.
• Data mining – loads of data, can afford “holdout sample”
• Variation: n-fold cross validation
  – Randomly divide data into n sets
  – Estimate on n-1, validate on 1
  – Repeat n times, using each set as holdout.

Pruning

• Grow bushy tree on the “fit data”
• Classify validation (holdout) data
• Likely farthest out branches do not improve, possibly hurt fit on validation data
• Prune non-helpful branches.
• What is “helpful”?
• What is good discriminator criterion?
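Returning to the split-selection rule, the logworth and its Bonferroni (Kass) adjustment for m candidate splits can be sketched as follows. This is Python rather than SAS, the function names are illustrative, and it relies on the fact that for 1 degree of freedom Pr{χ² > x} = erfc(√(x/2)).

```python
import math


def chi2_pvalue_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def logworth(p_value, m=1):
    """-log10 of the p-value after multiplying it by m, the number of splits tried."""
    return -math.log10(min(1.0, m * p_value))


p = chi2_pvalue_1df(42.67)       # chi-square from the blood-pressure split
raw = logworth(p)                # unadjusted logworth
adjusted = logworth(p, m=49)     # 49 possible cutoffs among 50 distinct BP values

print(f"p-value = {p:.2e}")
print(f"logworth = {raw:.2f}, Bonferroni-adjusted logworth = {adjusted:.2f}")
```

The adjustment lowers the logworth by log10(m), so a variable offering many candidate cutoffs has to clear a higher bar than one offering only a few.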
Goals

• Split if diversity in parent “node” > summed diversities in child nodes
• Prune to optimize
  – Estimates
  – Decisions
  – Ranking
• in validation data

Accounting for Costs

• Pardon me (sir, ma’am) can you spare some change?
• Say “sir” to male: +$2.00
• Say “ma’am” to female: +$5.00
• Say “sir” to female: -$1.00 (balm for slapped face)
• Say “ma’am” to male: -$10.00 (nose splint)

Including Probabilities

Leaf has Pr(M)=.7, Pr(F)=.3.

| True Gender | You say: Sir | You say: Ma’am |
|---|---|---|
| M | 0.7 (2) | 0.7 (-10) |
| F | 0.3 (-1) | 0.3 (5) |
| Expected profit | +$1.10 | -$5.50 |

Expected profit is 2(0.7)-1(0.3) = $1.10 if I say “Sir”.
Expected profit is -7+1.5 = -$5.50 (a loss) if I say “Ma’am”.
Weight leaf profits by leaf size (# obsns.) and sum.
Prune (and split) to maximize profits.

Support Vector Machines

Find a point X0 that “optimally” separates red from blue, i.e. optimally separates events from non-events. Maximize the “margin” and take its midpoint.

Support Vector Machines

Let Z=1 for events, Z=-1 for non-events.
Minimize slope of line subject to YZ>=1 everywhere (Y>=1 where Z=1, Y<=-1 where Z=-1).
Y = -16.38 + 32.73 X, so X0 = 16.38/32.73.

What about higher dimensions?

Separator is a line (not point). Which line maximizes margin?
Find plane with minimum slope to get separating line, subject to YZ-1 >= 0.
Example: X2 = expenditures, X1 = income, Event = carry credit charge.
Plane is Y = 0 – 10 X1 + 10 X2, so the dividing line is X2 = X1.

Credit card payments versus debt to income ratio

[Figure: points along X = debt to income ratio labeled Pay off card, Pay interest only, and Default.]

Idea: plot Z against X and X². Move to a “higher dimension”; distances between points change.

Reality: Events and non-events are typically mingled. Need to lighten up on the ZY-1 >= 0 requirement! This plus the move to higher dimension is full blown support vector technology.

Additional Ideas

• Forests – Draw samples with replacement (bootstrap) and grow multiple trees.
• Random Forests – Randomly sample the “features” (predictors) and build multiple trees.
• Classify new point in each tree then average the probabilities, or take a plurality vote from the trees.

Cumulative Lift Chart

- Go from leaf of most to least predicted response.
- Lift = (proportion responding in first p%) / (overall population response rate)

[Figure: cumulative lift chart with lift values 3.3 and 1 marked on the vertical axis.]

Regression Trees

• Continuous response Y
• Predicted response Pi constant in regions i=1, …, 5

[Figure: the (X1, X2) plane divided into five regions with predictions 80, 50, 130, 100 and 20.]

• Predict Pi in cell i.
• Yij is the jth response in cell i.
• Split to minimize Σi Σj (Yij - Pi)²

Real data example: Traffic accidents in Portugal*

Y = injury induced “cost to society”

Help - I ran into a “tree”

* Tree developed by Guilhermina Torrao (used with permission), NCSU Institute for Transportation Research & Education

An alternative method: Multiple Regression

Issues:

(1) Testing joint importance versus individual significance
  Two engine plane can still fly if engine #1 fails.
  Two engine plane can still fly if engine #2 fails.
  Neither is critical individually.
  Jointly critical (can’t omit both!!)
(2) Prediction versus modeling individual effects
(3) Collinearity (correlation among inputs)

Example: Hypothetical company’s sales Y depend on TV advertising X1 and Radio Advertising X2.

Y = b0 + b1X1 + b2X2 + e

    Data Sales; length sval $8; length cval $8;
    input store TV radio sales;
    (more code)
    cards;
    1 869 868 9089
    2 836 820 8290
    (more data)
    40 969 961 10130

    proc g3d data=sales; scatter radio*TV=sales/shape=sval color=cval zmin=8000; run;

[Figure: 3-D scatter plot of Sales against Radio and TV.]

Conclusion: Can predict well with just TV, just radio, or both!
SAS code:

    proc reg data=next; model sales = TV radio;

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
|---|---|---|---|---|---|
| Model | 2 | 32660996 | 16330498 | 358.84 | <.0001 (Can’t omit both) |
| Error | 37 | 1683844 | 45509 | | |
| Corrected Total | 39 | 34344840 | | | |

Root MSE 213.32908, R-Square 0.9510 (explaining 95% of variation in sales)

Parameter Estimates

| Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
|---|---|---|---|---|---|
| Intercept | 1 | 531.11390 | 359.90429 | 1.48 | 0.1485 |
| TV | 1 | 5.00435 | 5.01845 | 1.00 | 0.3251 (can omit TV) |
| radio | 1 | 4.66752 | 4.94312 | 0.94 | 0.3512 (can omit radio) |

Estimated Sales = 531 + 5.0 TV + 4.7 radio with error variance 45509 (standard deviation 213).

TV is approximately equal to radio so, approximately,
Estimated Sales = 531 + 9.7 TV or
Estimated Sales = 531 + 9.7 radio

Summary: Good predictions given by
Sales = 531 + 5.0 x TV + 4.7 x Radio or
Sales = 479 + 9.7 x TV or
Sales = 612 + 9.6 x Radio or
(lots of others)

Why the confusion? The evil Multicollinearity!! (correlated X’s)

Multicollinearity can be diagnosed by looking at principal components (axes of variation).
Variance along PC axes: “eigenvalues” of correlation matrix.
Direction axes point: “eigenvectors” of correlation matrix.

    Proc Corr; Var TV radio sales;

Pearson Correlation Coefficients, N = 40 (Prob > |r| under H0: Rho=0 in parentheses)

| | TV | radio | sales |
|---|---|---|---|
| TV | 1.00000 | 0.99737 (<.0001) | 0.97457 (<.0001) |
| radio | 0.99737 (<.0001) | 1.00000 | 0.97450 (<.0001) |
| sales | 0.97457 (<.0001) | 0.97450 (<.0001) | 1.00000 |

[Figure: Radio $ plotted against TV $, showing Principal Component Axis 1 and Principal Component Axis 2.]

Grades vs.
IQ and Study Time

    Data tests; input IQ Study_Time Grade; IQ_S = IQ*Study_Time;
    cards;
    105 10 75
    110 12 79
    120  6 68
    116 13 85
    122 16 91
    130  8 79
    114 20 98
    102 15 76
    ;

    Proc reg data=tests; model Grade = IQ;

| Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
|---|---|---|---|---|---|
| Intercept | 1 | 62.57113 | 48.24164 | 1.30 | 0.2423 |
| IQ | 1 | 0.16369 | 0.41877 | 0.39 | 0.7094 |

    Proc reg data=tests; model Grade = IQ Study_Time;

| Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
|---|---|---|---|---|---|
| Intercept | 1 | 0.73655 | 16.26280 | 0.05 | 0.9656 |
| IQ | 1 | 0.47308 | 0.12998 | 3.64 | 0.0149 |
| Study_Time | 1 | 2.10344 | 0.26418 | 7.96 | 0.0005 |

Contrast: TV advertising loses significance when radio is added. IQ gains significance when study time is added.

Model for Grades: Predicted Grade = 0.74 + 0.47 x IQ + 2.10 x Study Time

Question: Does an extra hour of study really deliver 2.10 points for everyone regardless of IQ? The current model only allows this.

    proc reg; model Grade = IQ Study_Time IQ_S;

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
|---|---|---|---|---|---|
| Model | 3 | 610.81033 | 203.60344 | 26.22 | 0.0043 |
| Error | 4 | 31.06467 | 7.76617 | | |
| Corrected Total | 7 | 641.87500 | | | |

Root MSE 2.78678, R-Square 0.9516

| Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
|---|---|---|---|---|---|
| Intercept | 1 | 72.20608 | 54.07278 | 1.34 | 0.2527 |
| IQ | 1 | -0.13117 | 0.45530 | -0.29 | 0.7876 |
| Study_Time | 1 | -4.11107 | 4.52430 | -0.91 | 0.4149 |
| IQ_S | 1 | 0.05307 | 0.03858 | 1.38 | 0.2410 |

“Interaction” model:

Predicted Grade = 72.21 - 0.13 x IQ - 4.11 x Study Time + 0.053 x IQ x Study Time
= (72.21 - 0.13 x IQ) + (-4.11 + 0.053 x IQ) x Study Time

IQ = 102 predicts Grade = (72.21-13.26) + (5.41-4.11) x Study Time = 58.95 + 1.30 x Study Time
IQ = 122 predicts Grade = (72.21-15.86) + (6.47-4.11) x Study Time = 56.35 + 2.36 x Study Time

[Figure: the two fitted lines, with Slope = 2.36 for IQ = 122 and Slope = 1.30 for IQ = 102.]

(1) Adding interaction makes everything insignificant (individually)!
(2) Do we need to omit insignificant terms until only significant ones remain?
(3) Has an acquitted defendant proved his innocence?
(4) Common sense trumps statistics!

Logistic Regression

• “Trees” seem to be main tool.
• Logistic – another classifier
• Older – “tried & true” method
• Predict probability of response from input variables (“Features”)
• Linear regression gives infinite range of predictions
• 0 < probability < 1 so not linear regression.

Example: Seat Fabric Ignition

• Flame exposure time = X
• Ignited Y=1, did not ignite Y=0
  – Y=0, X = 3, 5, 9, 10, 13, 16
  – Y=1, X = 7, 11, 12, 14, 15, 17, 25, 30
• Q = (1-p1)(1-p2)p3(1-p4)(1-p5)p6p7(1-p8)p9p10(1-p11)p12p13p14
• p’s all different: pi = exp(a+bXi)/(1+exp(a+bXi))
• Find a, b to maximize Q(a,b)

• Logistic idea: Map p in (0,1) to L in whole real line
• Use L = ln(p/(1-p))
• Model L as linear in temperature, e.g.
• Predicted L = a + b(temperature)
• Given temperature X, compute L(x) = a+bX, then p = e^L/(1+e^L)
• p(i) = e^(a+bXi)/(1+e^(a+bXi))
• Write p(i) if ignition, 1-p(i) if not
• Multiply all n of these together, find a, b to maximize

Generate Q for array of (a,b) values

    DATA LIKELIHOOD;
    ARRAY Y(14) Y1-Y14; ARRAY X(14) X1-X14;
    DO I=1 TO 14; INPUT X(I) Y(I) @@; END;
    DO A = -3 TO -2 BY .025;
      DO B = 0.2 TO 0.3 BY .0025;
        Q=1;
        DO I=1 TO 14;
          L=A+B*X(I); P=EXP(L)/(1+EXP(L));
          IF Y(I)=1 THEN Q=Q*P; ELSE Q=Q*(1-P);
        END;
        IF Q<0.0006 THEN Q=0.0006; OUTPUT;
      END;
    END;
    CARDS;
    3 0 5 0 7 1 9 0 10 0 11 1 12 1 13 0 14 1 15 1 16 0 17 1 25 1 30 1
    ;

[Figure: likelihood function (Q) surface over (a, b), peaking near a = -2.6, b = 0.23.]

[Figure: illustration of a concordant pair and a discordant pair.]

IGNITION DATA

The LOGISTIC Procedure

Analysis of Maximum Likelihood Estimates

| Parameter | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
|---|---|---|---|---|---|
| Intercept | 1 | -2.5879 | 1.8469 | 1.9633 | 0.1612 |
| TIME | 1 | 0.2346 | 0.1502 | 2.4388 | 0.1184 |

Association of Predicted Probabilities and Observed Responses

| | | | |
|---|---|---|---|
| Percent Concordant | 79.2 | Somers' D | 0.583 |
| Percent Discordant | 20.8 | Gamma | 0.583 |
| Percent Tied | 0.0 | Tau-a | 0.308 |
| Pairs | 48 | c | 0.792 |

Example: Shuttle Missions

• O-rings failed in Challenger disaster
• Low temperature
• Prior flights “erosion” and “blowby” in O-rings
• Feature: Temperature at liftoff
• Target: problem (1) - erosion or blowby vs.
no problem (0)

Example: Framingham

• X=age
• Y=1 if heart trouble, 0 otherwise

Framingham

The LOGISTIC Procedure

Analysis of Maximum Likelihood Estimates

| Parameter | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
|---|---|---|---|---|---|
| Intercept | 1 | -5.4639 | 0.5563 | 96.4711 | <.0001 |
| age | 1 | 0.0630 | 0.0110 | 32.6152 | <.0001 |

Neural Networks

[Figure: network diagram with inputs X1, X2, X3, X4 feeding hidden units H1 and H2, which feed the output. Output = Pr{1}, in (0,1): a logistic function of logistic functions** of the data.]

• Very flexible functions
• “Hidden Layers”
• “Multilayer Perceptron”

** (note: Hyperbolic tangent functions are just reparameterized logistic functions)

Example: Y = a + b1 H1 + b2 H2 + b3 H3
Y = 0 + 9 H1 + 3 H2 + 5 H3

Here a is the “bias” and b1, b2, b3 are the “weights”; each H ranges from -1 to 1.

[Figure: X feeds H1, H2 and H3, which combine with weights b1, b2 and b3 to give Y.]

Arrows on right represent linear combinations of “basis functions,” e.g. hyperbolic tangents (reparameterized logistic curves).

[Figure: A Complex Neural Network Surface. Network from inputs X1 and X2 to output P; the weights and “biases” (in parentheses) shown on the diagram are (-10), 3, -0.4, 0.8, (-13), 0, (-1), -1, 0.25, -0.9, 0.01, (20), 2.5.]

• Should always use holdout sample
• Perturb coefficients to optimize fit (fit data)
  – Nonlinear search algorithms
• Eliminate unnecessary complexity using holdout data.
• Other basis sets
  – Radial Basis Functions
  – Just normal densities (bell shaped) with adjustable means and variances.

A Combined Example

Cell Phone Texting Locations

Black circle: Phone moved > 50 feet in first two minutes of texting.
Green dot: Phone moved < 50 feet.
Three Models: Tree, Neural Net, Logistic Regression

[Figure: Training Data Lift Charts]

[Figure: Validation Data Lift Charts]

[Figure: Resulting Surfaces]

Unsupervised Learning

• We have the “features” (predictors)
• We do NOT have the response even on a training data set (UNsupervised)
• Clustering
  – Agglomerative
    • Start with each point separated
  – Divisive
    • Start with all points in one cluster then split
  – Direct
    • State # clusters beforehand

EM PROC FASTCLUS

• Step 1 – find (50) “seeds” as separated as possible
• Step 2 – cluster points to nearest seed
  – Drift: As points are added, change seed (centroid) to average of each coordinate
  – Alternatively: Make full pass then recompute seed and iterate.
• Step 3 – aggregate clusters using Ward’s method

[Figure: Clusters as Created]

[Figure: As Clustered – PROC FASTCLUS]

Statistics to Data Mining Dictionary

| Statistics (nerdy) | Data Mining (cool) |
|---|---|
| Independent variables | Features |
| Dependent variable | Target |
| Estimation | Training, Supervised Learning |
| Clustering | Unsupervised Learning |
| Prediction | Scoring |
| Slopes, Betas | Weights (Neural nets) |
| Intercept | Bias (Neural nets) |
| Composition of Hyperbolic Tangent Functions | Neural Network |
| Normal Density | Radial Basis Function |
| and my personal favorite… Type I and Type II Errors | Confusion Matrix |

Association Analysis

• Market basket analysis
  – What they’re doing when they scan your “VIP” card at the grocery
  – People who buy diapers tend to also buy _________ (beer?)
  – Just a matter of accounting but with new terminology (of course)

Association Analysis is just elementary probability with new names

A: Purchase Milk
B: Purchase Cereal

[Figure: Venn diagram. A only = 0.3, A and B = 0.2, B only = 0.1, neither = 0.4.]

Pr{A} = 0.5, Pr{B} = 0.3, Pr{A and B} = 0.2
0.3+0.2+0.1+0.4 = 1.0

Cereal => Milk

Rule B => A: “people who buy B will buy A”

Support: Support = Pr{A and B} = 0.2

Independence means that Pr{A|B} = Pr{A} = 0.5.
Pr{A} = 0.5 = Expected confidence if there is no relation to B.

Confidence: Confidence = Pr{A|B} = Pr{A and B}/Pr{B} = 2/3

Is the confidence in B=>A the same as the confidence in A=>B?
(yes, no)

Lift: Lift = confidence / E{confidence} = (2/3) / (1/2) = 1.33

Gain = 33%

Marketing A to the 30% of people who buy B will result in 33% better sales than marketing to a random 30% of the people.

TEXT MINING

Hypothetical collection of news releases (“corpus”):

release 1: Did the NCAA investigate the basketball scores and vote for sanctions?
release 2: Republicans voted for and Democrats voted against it for the win.
(etc.)

Compute word counts:

| | NCAA | basketball | score | vote | Republican | Democrat | win |
|---|---|---|---|---|---|---|---|
| Release 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| Release 2 | 0 | 0 | 0 | 2 | 1 | 1 | 1 |

Text Mining Mini-Example: Word counts in 16 e-mails

Counts by word (rows) and document number (columns), as listed on the slide:

| Word | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Election | 20 | 5 | 0 | 8 | 0 | 10 | 2 | 4 | 26 | 19 | 2 | 16 | 14 | 1 |
| President | 8 | 6 | 2 | 9 | 0 | 6 | 3 | 1 | 13 | 22 | 0 | 19 | 17 | 0 |
| Republican | 10 | 9 | 0 | 7 | 4 | 9 | 1 | 4 | 9 | 10 | 0 | 21 | 12 | 4 |
| Basketball | 12 | 5 | 14 | 0 | 16 | 5 | 13 | 16 | 2 | 11 | 14 | 0 | 0 | 21 |
| Democrat | 6 | 4 | 0 | 12 | 0 | 5 | 0 | 2 | 16 | 9 | 1 | 13 | 20 | 3 |
| Voters | 0 | 2 | 2 | 14 | 0 | 19 | 1 | 4 | 20 | 12 | 3 | 9 | 19 | 6 |
| NCAA | 1 | 0 | 12 | 2 | 15 | 5 | 12 | 9 | 6 | 0 | 12 | 0 | 0 | 9 |
| Liar | 5 | 9 | 0 | 12 | 2 | 20 | 13 | 0 | 24 | 14 | 0 | 16 | 12 | 3 |
| Tournament | 3 | 0 | 16 | 3 | 17 | 0 | 20 | 12 | 4 | 10 | 16 | 4 | 5 | 8 |
| Speech | 8 | 12 | 4 | 15 | 3 | 18 | 0 | 9 | 30 | 22 | 12 | 12 | 9 | 0 |
| Wins | 18 | 12 | 24 | 22 | 9 | 13 | 0 | 3 | 9 | 3 | 17 | 0 | 6 | 3 |
| Score_V | 15 | 9 | 19 | 8 | 0 | 9 | 1 | 0 | 10 | 1 | 23 | 0 | 1 | 10 |
| Score_N | 21 | 0 | 30 | 2 | 1 | 14 | 6 | 0 | 14 | 0 | 8 | 2 | 4 | 20 |

Eigenvalues of the Correlation Matrix

| | Eigenvalue | Difference | Proportion | Cumulative |
|---|---|---|---|---|
| 1 | 7.10954264 | 4.80499109 | 0.5469 | 0.5469 |
| 2 | 2.30455155 | 1.30162837 | 0.1773 | 0.7242 |
| 3 | 1.00292318 | 0.23404351 | 0.0771 | 0.8013 |
| 4 | 0.76887967 | 0.21070080 | 0.0591 | 0.8605 |
| 5 | 0.55817886 | 0.10084923 | 0.0429 | 0.9034 |
| 6 | 0.45732963 | 0.15563511 | 0.0352 | 0.9386 |
| 7 | 0.30169451 | 0.13396581 | 0.0232 | 0.9618 |
| 8 | 0.16772870 | 0.00501411 | 0.0129 | 0.9747 |
| 9 | 0.16271459 | 0.04345658 | 0.0125 | 0.9872 |
| 10 | 0.1192580 | 0.08890707 | 0.0092 | 0.9964 |
| 11 | 0.0303509 | 0.01437903 | 0.0023 | 0.9987 |
| 12 | 0.0159719 | 0.01509610 | 0.0012 | 0.9999 |
| 13 | 0.0008758 | | 0.0001 | 1.0000 |

[Figure: scatter plot with axes Prin 1 and Prin 2.]

55% of the variation in these 13-dimensional vectors occurs
in one dimension.

| Variable | Prin1 |
|---|---|
| Basketball | -.320074 |
| NCAA | -.314093 |
| Tournament | -.277484 |
| Score_V | -.134625 |
| Score_N | -.120083 |
| Wins | -.080110 |
| Speech | 0.273525 |
| Voters | 0.294129 |
| Liar | 0.309145 |
| Election | 0.315647 |
| Republican | 0.318973 |
| President | 0.333439 |
| Democrat | 0.336873 |

[Figure: two-word illustration with axes Prin 1 and Prin 2.] Prin1 coordinate = .707(word1) – .707(word2)

PROC CLUSTER (single linkage) agrees!

[Figure: documents grouped into Cluster 1 and Cluster 2.]

Summary

• Data mining – a set of fast stat methods for large data sets
• Some new ideas, many old or extensions of old
• Some methods:
  – Trees (recursive splitting)
  – Logistic Regression
  – Neural Networks
  – Association Analysis
  – Nearest Neighbor
  – Clustering
  – Etc.

Classification Variables (dummy variables, indicator variables)

Predicted Accidents = 1181 + 2579 X11

X11 is 1 in November, 0 elsewhere.

Interpretation:
In November, predict 1181 + 2579(1) = 3760.
In any other month predict 1181 + 2579(0) = 1181.
1181 is the average of the other months. 2579 is the added November effect (vs.
average of others)

Model for NC Crashes involving Deer:

    Proc reg data=deer; model deer = X11;

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
|---|---|---|---|---|---|
| Model | 1 | 30473250 | 30473250 | 90.45 | <.0001 |
| Error | 58 | 19539666 | 336891 | | |
| Corrected Total | 59 | 50012916 | | | |

Root MSE 580.42294, R-Square 0.6093

| Variable | Label | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
|---|---|---|---|---|---|---|
| Intercept | Intercept | 1 | 1181.09091 | 78.26421 | 15.09 | <.0001 |
| X11 | | 1 | 2578.50909 | 271.11519 | 9.51 | <.0001 |

Looks like December and October need dummies too!

    Proc reg data=deer; model deer = X10 X11 X12;

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
|---|---|---|---|---|---|
| Model | 3 | 46152434 | 15384145 | 223.16 | <.0001 |
| Error | 56 | 3860482 | 68937 | | |
| Corrected Total | 59 | 50012916 | | | |

Root MSE 262.55890, R-Square 0.9228

| Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
|---|---|---|---|---|---|
| Intercept | 1 | 929.40000 | 39.13997 | 23.75 | <.0001 |
| X10 | 1 | 1391.20000 | 123.77145 | 11.24 | <.0001 |
| X11 | 1 | 2830.20000 | 123.77145 | 22.87 | <.0001 |
| X12 | 1 | 1377.40000 | 123.77145 | 11.13 | <.0001 |

Average of Jan through Sept. is 929 crashes per month. Add 1391 in October, 2830 in November, 1377 in December.
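The indicator coding and the resulting predictions can be sketched in a few lines. This is Python rather than SAS; the helper names `dummies` and `predict` and the month-label parsing are assumptions made for illustration, while the coefficients are the rounded estimates from the PROC REG output above.

```python
MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

# Estimates from the three-dummy model: intercept, then X10, X11, X12.
INTERCEPT = 929.4
EFFECTS = (1391.2, 2830.2, 1377.4)


def dummies(date_label):
    """Map a label such as 'NOV03' to the indicator values (X10, X11, X12)."""
    month = MONTHS.index(date_label[:3].upper()) + 1
    return tuple(int(month == m) for m in (10, 11, 12))


def predict(date_label):
    """Predicted deer crashes for the month named in date_label."""
    x = dummies(date_label)
    return INTERCEPT + sum(b * xi for b, xi in zip(EFFECTS, x))


for label in ("JAN03", "OCT03", "NOV04", "DEC04"):
    print(label, dummies(label), round(predict(label), 1))
```

Every month from January through September gets the same prediction (the intercept), because all three indicators are zero there.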
| date | x10 | x11 | x12 |
|---|---|---|---|
| JAN03 | 0 | 0 | 0 |
| FEB03 | 0 | 0 | 0 |
| MAR03 | 0 | 0 | 0 |
| APR03 | 0 | 0 | 0 |
| MAY03 | 0 | 0 | 0 |
| JUN03 | 0 | 0 | 0 |
| JUL03 | 0 | 0 | 0 |
| AUG03 | 0 | 0 | 0 |
| SEP03 | 0 | 0 | 0 |
| OCT03 | 1 | 0 | 0 |
| NOV03 | 0 | 1 | 0 |
| DEC03 | 0 | 0 | 1 |
| JAN04 | 0 | 0 | 0 |
| FEB04 | 0 | 0 | 0 |
| MAR04 | 0 | 0 | 0 |
| APR04 | 0 | 0 | 0 |
| MAY04 | 0 | 0 | 0 |
| JUN04 | 0 | 0 | 0 |
| JUL04 | 0 | 0 | 0 |
| AUG04 | 0 | 0 | 0 |
| SEP04 | 0 | 0 | 0 |
| OCT04 | 1 | 0 | 0 |
| NOV04 | 0 | 1 | 0 |
| DEC04 | 0 | 0 | 1 |

What the heck – let’s do all but one (need “average of rest” so must leave out at least one)

    Proc reg data=deer; model deer = X1 X2 … X10 X11;

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
|---|---|---|---|---|---|
| Model | 11 | 48421690 | 4401972 | 132.79 | <.0001 |
| Error | 48 | 1591226 | 33151 | | |
| Corrected Total | 59 | 50012916 | | | |

Root MSE 182.07290, R-Square 0.9682

Parameter Estimates

| Variable | Label | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
|---|---|---|---|---|---|---|
| Intercept | Intercept | 1 | 2306.80000 | 81.42548 | 28.33 | <.0001 |
| X1 | | 1 | -885.80000 | 115.15301 | -7.69 | <.0001 |
| X2 | | 1 | -1181.40000 | 115.15301 | -10.26 | <.0001 |
| X3 | | 1 | -1220.20000 | 115.15301 | -10.60 | <.0001 |
| X4 | | 1 | -1486.80000 | 115.15301 | -12.91 | <.0001 |
| X5 | | 1 | -1526.80000 | 115.15301 | -13.26 | <.0001 |
| X6 | | 1 | -1433.00000 | 115.15301 | -12.44 | <.0001 |
| X7 | | 1 | -1559.20000 | 115.15301 | -13.54 | <.0001 |
| X8 | | 1 | -1646.20000 | 115.15301 | -14.30 | <.0001 |
| X9 | | 1 | -1457.20000 | 115.15301 | -12.65 | <.0001 |
| X10 | | 1 | 13.80000 | 115.15301 | 0.12 | 0.9051 |
| X11 | | 1 | 1452.80000 | 115.15301 | 12.62 | <.0001 |

Average of rest is just the December mean, 2307. Subtract 886 in January, add 1453 in November. October (X10) is not significantly different from December.
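Changing which months are left out changes the meaning of the intercept but not the fitted monthly means. The sketch below (Python, not SAS) rebuilds the monthly means from the eleven-dummy estimates above and recovers the intercept and effects of the earlier three-dummy model. It assumes every calendar month appears equally often in the data, so that a plain average over January through September is appropriate; the variable names are illustrative.

```python
# Estimates from the model with December as the omitted (reference) month.
DECEMBER_MEAN = 2306.8
OFFSETS = {
    1: -885.8, 2: -1181.4, 3: -1220.2, 4: -1486.8, 5: -1526.8, 6: -1433.0,
    7: -1559.2, 8: -1646.2, 9: -1457.2, 10: 13.8, 11: 1452.8, 12: 0.0,
}

# Fitted mean for each month = reference mean + that month's coefficient.
monthly_mean = {m: DECEMBER_MEAN + off for m, off in OFFSETS.items()}

# Reference level of the three-dummy model: average of January..September.
jan_to_sep = sum(monthly_mean[m] for m in range(1, 10)) / 9

# Effects of October, November and December relative to that average.
effects = {m: monthly_mean[m] - jan_to_sep for m in (10, 11, 12)}

print("Jan-Sep average:", round(jan_to_sep, 1))
for m, eff in effects.items():
    print("month", m, "effect:", round(eff, 1))
```

The recovered values agree with the intercept and the X10, X11, X12 estimates reported for the three-dummy model, which illustrates that the two codings describe the same seasonal pattern from different baselines.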
Add date (days since Jan 1 1960 in SAS) to capture trend

    Proc reg data=deer; model deer = date X1 X2 … X10 X11;

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
|---|---|---|---|---|---|
| Model | 12 | 49220571 | 4101714 | 243.30 | <.0001 |
| Error | 47 | 792345 | 16858 | | |
| Corrected Total | 59 | 50012916 | | | |

Root MSE 129.83992, R-Square 0.9842

Parameter Estimates

| Variable | Label | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
|---|---|---|---|---|---|---|
| Intercept | Intercept | 1 | -1439.94000 | 547.36656 | -2.63 | 0.0115 |
| X1 | | 1 | -811.13686 | 82.83115 | -9.79 | <.0001 |
| X2 | | 1 | -1113.66253 | 82.70543 | -13.47 | <.0001 |
| X3 | | 1 | -1158.76265 | 82.60154 | -14.03 | <.0001 |
| X4 | | 1 | -1432.28832 | 82.49890 | -17.36 | <.0001 |
| X5 | | 1 | -1478.99057 | 82.41114 | -17.95 | <.0001 |
| X6 | | 1 | -1392.11624 | 82.33246 | -16.91 | <.0001 |
| X7 | | 1 | -1525.01849 | 82.26796 | -18.54 | <.0001 |
| X8 | | 1 | -1618.94416 | 82.21337 | -19.69 | <.0001 |
| X9 | | 1 | -1436.86982 | 82.17106 | -17.49 | <.0001 |
| X10 | | 1 | 27.42792 | 82.14183 | 0.33 | 0.7399 |
| X11 | | 1 | 1459.50226 | 82.12374 | 17.77 | <.0001 |
| date | | 1 | 0.22341 | 0.03245 | 6.88 | <.0001 |

Trend is 0.22 more accidents per day (1 per 5 days) and is significantly different from 0.

Receiver Operating Characteristic Curve

[Figure sequence: distributions of the logits of 1s (red) and the logits of 0s (black), shown with the cut point at 1, 2, 3, 3.5, 4, 5 and 6.]
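Sliding the cut point across the logits traces out the ROC curve, and the area under that curve equals the c statistic that PROC LOGISTIC reports. As a sketch under stated assumptions, the Python code below (not SAS) reuses the seat-fabric ignition data and the fitted logit -2.5879 + 0.2346 X from the logistic regression example; the cut points in the loop are arbitrary illustrations and are not the ones shown on the slides.

```python
# Seat-fabric ignition data from the logistic regression example.
X0 = [3, 5, 9, 10, 13, 16]             # exposure times, did not ignite (Y=0)
X1 = [7, 11, 12, 14, 15, 17, 25, 30]   # exposure times, ignited (Y=1)
A, B = -2.5879, 0.2346                 # fitted intercept and slope


def logit(x):
    """Fitted logit for exposure time x."""
    return A + B * x


def roc_point(cut):
    """(false positive rate, true positive rate) when logit > cut predicts a 1."""
    tpr = sum(logit(x) > cut for x in X1) / len(X1)
    fpr = sum(logit(x) > cut for x in X0) / len(X0)
    return fpr, tpr


def c_statistic():
    """Share of (1, 0) pairs in which the 1 has the larger logit; ties count half."""
    score = 0.0
    for x1 in X1:
        for x0 in X0:
            if logit(x1) > logit(x0):
                score += 1.0
            elif logit(x1) == logit(x0):
                score += 0.5
    return score / (len(X1) * len(X0))


for cut in (-2, -1, 0, 1, 2):          # illustrative cut points on the logit scale
    print("cut", cut, "-> (FPR, TPR) =", roc_point(cut))

c = c_statistic()
print("pairs:", len(X1) * len(X0))
print("c =", round(c, 3), " Somers' D =", round(2 * c - 1, 3))
```

With no tied pairs, Somers' D is 2c - 1, and both values can be compared against the association table printed for the ignition data.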