Correlation and Regression
1) Correlation between variables
Correlation is a statistical technique used to describe the direction and strength of a relationship between two variables. A correlation always has two parts. The first is the direction, which is either positive or negative; the second is the magnitude of the correlation coefficient, which ranges from 0 to 1. The closer the magnitude is to 1, the stronger the correlation.
2) Coefficient of determination and its interpretation
The coefficient of determination (R^{2}) is a main output of regression analysis. It expresses the proportion of the variance in the dependent variable that is predictable from the independent variable. Its value satisfies 0 ≤ R^{2} ≤ 1 and indicates how closely the regression line fits the x and y values.
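As a sketch of what "proportion of variance" means, the coefficient of determination for a least-squares line with an intercept can be written as below. The symbols are assumptions introduced here for illustration: y_i for the observed values, \hat{y}_i for the fitted values and \bar{y} for the mean of the observed values.

```latex
R^{2} = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^{2}}{\sum_{i} (y_i - \bar{y})^{2}}
```

For a simple linear regression with a single independent variable, this quantity equals the square of the correlation coefficient r.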
3) Slope and intercept of regression model
The intercept of a regression model is the value of y when x = 0, while the slope is the rise over run, that is, the change in y per unit change in x.
4) Security Market Line
The security market line (SML) is the line that plots the expected return of an investment against its risk. Risk is measured by beta, and the line begins at zero risk and slopes upward to the right: an increase in investment risk leads to an increase in expected return. An investor with a low risk profile should choose an investment near the beginning of the security market line. Since the SML reflects return on investment relative to risk, its slope represents the risk premium of investments, while the y-intercept is equal to the risk-free interest rate.
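The relationship described above is commonly written in the form below. The symbols are assumptions introduced here for illustration: E[R_i] for the expected return on investment i, R_f for the risk-free rate, \beta_i for the investment's beta, and E[R_m] for the expected return on the market.

```latex
E[R_i] = R_f + \beta_i \left( E[R_m] - R_f \right)
```

Read as a line in \beta_i, the y-intercept is R_f and the slope is the market risk premium E[R_m] - R_f.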
5) Saturated Model
A saturated model is a model that contains as many estimated parameters as there are data points. This leads to a perfect fit, but statistically it is of little use, since no data are left for estimating the variance. A saturated model fits the data exactly, with zero error terms, and provides a way of evaluating the observed data. For instance, in a hospital setting, suppose dummy variables are coded as t_{1} = 0 for therapy 1 and t_{1} = 1 for therapy 2, and t_{2} = 0 when incurable and t_{2} = 1 when curable; a regression model can then be written in terms of t_{1} and t_{2}.
For the model to be saturated, it must have as many parameters as data values, leaving zero degrees of freedom.
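For the two dummy variables above, one possible saturated form can be sketched as follows. It assumes a response y, coefficients \beta_0 to \beta_3, and one mean response per combination of therapy and curability, none of which are named in the text.

```latex
y = \beta_0 + \beta_1 t_{1} + \beta_2 t_{2} + \beta_3 t_{1} t_{2}
```

With four combinations of (t_{1}, t_{2}) and four coefficients, every cell mean is reproduced exactly and no degrees of freedom remain.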
6) Regression to the mean
Regression to the mean is a statistical phenomenon in which natural variation in repeated data looks like real change; it occurs when unusually small or large measurements are followed by measurements that are closer to the mean. The effect of regression to the mean within a sample becomes more conspicuous as measurement error increases and when follow-up measurements are examined only for a subsample selected on baseline value. These effects can be alleviated through improved study design and suitable statistical methods.
Question 2 – Computational exercise
a) The correlation coefficient, r, is expressed using the formula below
r = \frac{n\sum xy - \sum x \sum y}{\sqrt{[n\sum x^{2} - (\sum x)^{2}][n\sum y^{2} - (\sum y)^{2}]}}
From the table below:
         Risk (x)   Stock Return (y)   x^{2}   y^{2}    xy
         0.6        0.6                0.36    0.36     0.36
         0.8        0.52               0.64    0.2704   0.416
         1          0.64               1       0.4096   0.64
         1.2        0.56               1.44    0.3136   0.672
         1.4        0.68               1.96    0.4624   0.952
Total    5          3                  5.4     1.816    3.04
Mean     1          0.6
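Therefore, the regression of y on x can be expressed as y = Mx + C, where M is the slope and C is the y-intercept.
M is calculated as
M = \frac{n\sum xy - \sum x \sum y}{n\sum x^{2} - (\sum x)^{2}} = \frac{5(3.04) - (5)(3)}{5(5.4) - 5^{2}} = \frac{0.2}{2} = 0.1
Therefore the slope is 0.1.
The y-intercept, C, is calculated as
C = \bar{y} - M\bar{x} = 0.6 - 0.1(1) = 0.5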
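The quantities in this exercise can be checked numerically from the tabulated risk and return values. The sketch below uses the standard sum-based formulas for the correlation coefficient, slope and intercept; the function names `pearson_r` and `slope_intercept` are illustrative choices, not part of the exercise.

```python
import math

# Risk (x) and stock return (y) as listed in the Question 2 table.
x = [0.6, 0.8, 1.0, 1.2, 1.4]
y = [0.60, 0.52, 0.64, 0.56, 0.68]


def _sums(x, y):
    """Return n and the sums of x, y, x^2, y^2 and xy."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    sxy = sum(a * b for a, b in zip(x, y))
    return n, sx, sy, sxx, syy, sxy


def pearson_r(x, y):
    """Correlation coefficient from the computational (sum-based) formula."""
    n, sx, sy, sxx, syy, sxy = _sums(x, y)
    numerator = n * sxy - sx * sy
    denominator = math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return numerator / denominator


def slope_intercept(x, y):
    """Least-squares slope M and intercept C of the line y = M*x + C."""
    n, sx, sy, sxx, _, sxy = _sums(x, y)
    m = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    c = sy / n - m * sx / n  # C = mean(y) - M * mean(x)
    return m, c


r = pearson_r(x, y)
m, c = slope_intercept(x, y)
print(f"r = {r:.3f}, M = {m:.3f}, C = {c:.3f}")
# prints: r = 0.500, M = 0.100, C = 0.500
```

The slope agrees with the value of 0.1 stated above.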
Question 3 – Computer exercise: Pythagorean Theorem
a & b) Expected and actual winning percentage
Team            Wins  Losses  Runs Scored  Runs Allowed  Expected Win %  Actual Win %
Arizona         64    98      615          742           41%             40%
Atlanta         79    83      573          597           48%             49%
Baltimore       96    66      705          593           59%             59%
Boston          71    91      634          715           44%             44%
Chi Cubs        73    89      614          707           43%             45%
Chi White Sox   73    89      660          758           43%             45%
Cincinnati      76    86      595          612           49%             47%
Cleveland       85    77      669          653           51%             52%
Colorado        66    96      755          818           46%             41%
Detroit         90    72      757          705           54%             56%
Houston         70    92      629          723           43%             43%
Kansas City     89    73      651          624           52%             55%
LA Angels       98    64      773          630           60%             60%
LA Dodgers      94    68      718          617           58%             58%
Miami           77    85      645          674           48%             48%
Milwaukee       82    80      650          657           49%             51%
Minnesota       70    92      715          777           46%             43%
NY Mets         79    83      629          618           51%             49%
NY Yankees      84    78      633          664           48%             52%
Oakland         88    74      729          572           62%             54%
Philadelphia    73    89      619          687           45%             45%
Pittsburgh      88    74      682          631           54%             54%
San Diego       77    85      535          577           46%             48%
San Francisco   88    74      665          614           54%             54%
Seattle         87    75      634          554           57%             54%
St. Louis       90    72      619          603           51%             56%
Tampa Bay       77    85      612          625           49%             48%
Texas           67    95      637          773           40%             41%
Toronto         83    79      723          686           53%             51%
Washington      96    66      686          555           60%             59%
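The expected and actual percentages in the table can be reproduced with a short calculation. The sketch below assumes the usual baseball Pythagorean expectation with exponent 2, that is, runs scored squared divided by the sum of runs scored squared and runs allowed squared; the text does not state the formula, but the tabulated values are consistent with it. The function names and the three sample rows are illustrative choices.

```python
def pythagorean_expectation(runs_scored, runs_allowed, exponent=2):
    """Expected winning fraction from runs scored and runs allowed."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


def actual_win_fraction(wins, losses):
    """Observed winning fraction over all games played."""
    return wins / (wins + losses)


# A few rows from the table: (wins, losses, runs scored, runs allowed).
teams = {
    "Arizona": (64, 98, 615, 742),
    "Oakland": (88, 74, 729, 572),
    "Washington": (96, 66, 686, 555),
}

for name, (wins, losses, scored, allowed) in teams.items():
    expected = pythagorean_expectation(scored, allowed)
    actual = actual_win_fraction(wins, losses)
    print(f"{name}: expected {expected:.0%}, actual {actual:.0%}")
```

Extending the `teams` dictionary to all thirty rows gives the two percentage columns used for the scatterplot.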
c) Scatterplot showing the relationship between expected and observed winning percentage.
Figure 1: The Regression of Actual Winning Percentage to Expected Winning Percentage
From Figure 1, the line of best fit has a coefficient of determination of 0.8338. This is close to the ideal value of 1, which implies that the actual winning percentage and the expected winning percentage are strongly correlated. From the graph, the linear equation is y = 0.9054x + 0.0466, where x is the expected winning percentage and y is the actual winning percentage.
From the equation,
The y-intercept is the value of y when x is zero. Hence, the y-intercept is 0.0466 = 4.66%.
The slope is 0.9054.
The correlation coefficient, r, = \sqrt{R^{2}} = \sqrt{0.8338} ≈ 0.913
Question 4: Lying Statistics
In this question, one year of production data from a renowned manufacturing company is presented in the table below.
Month     Expected Production (Tonnes)   Actual Production (Tonnes)
Jan-14    250                            259
Feb-14    247                            251
Mar-14    244                            250
Apr-14    250                            258
May-14    247                            250
Jun-14    259                            242
Jul-14    248                            246
Aug-14    256                            251
Sep-14    253                            250
Oct-14    251                            256
Nov-14    252                            250
Dec-14    247                            255
The scatter plot of actual production attained vs. expected production in tonnes is presented in Figure 2 below.
Figure 2: Regression of actual production vs. expected production
From the graph, the coefficient of determination is 0.1019. This implies that the company's actual production is only weakly correlated with its expected production. Also, the slope from the linear equation (Y = -0.3683x + 343.7) is a negative number. Presenting the data in this way might not give accurate details regarding production. For instance, production tends to vary with conditions such as demand and the efficiency of the production system. Therefore, a statistical representation of such data may ignore the explanation for the variation between the predicted value and the actual value obtained.
In conclusion, caution should be taken to ensure that omitted variable bias in linear models is eliminated. It should be noted that some informal methods in statistics can be expressed as formal statistical models, where their weaknesses become immediately apparent. This helps in assessing the effectiveness of quantitative analysis and avoiding future mistakes.
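A line of this kind can be fitted directly to the monthly figures. The sketch below applies ordinary least squares to the tabulated expected and actual production; the function name `fit_line` is an illustrative choice. The numbers it returns depend on the exact values plotted, so they may differ somewhat from those read off Figure 2.

```python
# Monthly expected and actual production (tonnes), Jan to Dec, from the table.
expected = [250, 247, 244, 250, 247, 259, 248, 256, 253, 251, 252, 247]
actual = [259, 251, 250, 258, 250, 242, 246, 251, 250, 256, 250, 255]


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept, with R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)  # equals r^2 for a single predictor
    return slope, intercept, r_squared


slope, intercept, r_squared = fit_line(expected, actual)
print(f"slope = {slope:.4f}, intercept = {intercept:.1f}, R^2 = {r_squared:.4f}")
```

A negative slope together with a small R^2 is the pattern discussed above: higher expected production does not go with higher actual production in this data.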