Correlation and Regression

1) Correlation between variables

Correlation is a statistical technique used to describe the direction and strength of the relationship between two variables. A correlation always has two parts. The first is the sign, which shows whether the direction is positive or negative. The second is the magnitude of the correlation coefficient, which ranges from 0 to 1, so the signed coefficient r lies between -1 and 1. The strength of the correlation increases as the magnitude tends to 1.

2) Coefficient of determination and its interpretation

The coefficient of determination (R²) is a main output of regression analysis. It expresses the proportion of the variance in the dependent variable that is predictable from the independent variable. Its value satisfies 0 ≤ R² ≤ 1 and indicates how closely the fitted line captures the association between the x and y values: the closer R² is to 1, the better the fit.
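As a minimal sketch of how this proportion could be computed, the Python function below fits a least-squares line and returns R² as 1 minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the sample values are illustrative assumptions, not part of the exercise.

```python
def r_squared(xs, ys):
    """Coefficient of determination for a simple least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Least-squares slope and intercept
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual (unexplained) and total sums of squares
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical data: y is exactly linear in x, so all variance is explained
print(r_squared([1, 2, 3, 4], [3, 5, 7, 9]))
```

When the points lie exactly on a line, the residual sum of squares is zero and the function returns 1. Scattered data give a smaller value.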

3) Slope and intercept of regression model

The intercept of a regression model is the value of y when x = 0, while the slope is the rise over run, that is, the change in y for each unit change in x.

4) Security Market Line

The security market line (SML) is the line that plots the expected return of an investment against its risk. The measure of risk is beta, and the line begins at zero risk and moves upwards to the right: an increase in investment risk leads to an increase in expected return. An investor with a low risk profile should choose an investment near the beginning of the security market line. Since the SML reflects return on investment relative to risk, its slope represents the market risk premium, while the y-intercept is equal to the risk-free interest rate.
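The relationship described above can be sketched in Python as a straight line with the risk-free rate as intercept and the market risk premium as slope. The function name and the 3% and 8% rates are hypothetical values chosen only for illustration.

```python
def sml_expected_return(risk_free_rate, beta, market_return):
    """Expected return on the security market line: Rf + beta * (Rm - Rf)."""
    market_risk_premium = market_return - risk_free_rate  # slope of the SML
    return risk_free_rate + beta * market_risk_premium


# Hypothetical inputs: 3% risk-free rate, 8% expected market return
for beta in (0.0, 0.5, 1.0, 1.5):
    print(beta, round(sml_expected_return(0.03, beta, 0.08), 4))
```

At beta = 0 the expected return equals the risk-free rate (the y-intercept), and each additional unit of beta adds one market risk premium.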

5) Saturated Model

A saturated model is a model that contains as many estimated parameters as there are data points. This leads to a perfect fit, but statistically it is of little use because no data are left for estimating the variance. A saturated model fits the data exactly, with zero error terms, and provides a reference for evaluating the observed data. For instance, in a hospital setting, suppose the dummy variables are coded as t1 = 0 for therapy 1 and t1 = 1 for therapy 2, and t2 = 0 when the condition is incurable and t2 = 1 when it is curable. With one observed value per combination, the saturated regression model includes both main effects and their interaction:

$$y = \beta_0 + \beta_1 t_1 + \beta_2 t_2 + \beta_3 t_1 t_2$$

In other words, a saturated model has as many parameters as data values, leaving zero degrees of freedom.
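A small Python sketch can make the zero-degrees-of-freedom point concrete. It assumes the model has an intercept, two dummy main effects and their interaction, y = b0 + b1*t1 + b2*t2 + b3*t1*t2, and that there is one observed value per (t1, t2) combination. The outcome values and function names are hypothetical.

```python
def fit_saturated_2x2(y):
    """Solve y = b0 + b1*t1 + b2*t2 + b3*t1*t2 exactly for four cells.

    `y` maps each (t1, t2) dummy combination to its observed value.
    """
    b0 = y[(0, 0)]
    b1 = y[(1, 0)] - b0
    b2 = y[(0, 1)] - b0
    b3 = y[(1, 1)] - b0 - b1 - b2
    return b0, b1, b2, b3


def predict(coeffs, t1, t2):
    """Fitted value for one (t1, t2) cell."""
    b0, b1, b2, b3 = coeffs
    return b0 + b1 * t1 + b2 * t2 + b3 * t1 * t2


# Hypothetical observed outcomes, one per (therapy, curability) cell
observed = {(0, 0): 12.0, (1, 0): 15.0, (0, 1): 20.0, (1, 1): 30.0}
coeffs = fit_saturated_2x2(observed)
residuals = {cell: value - predict(coeffs, *cell) for cell, value in observed.items()}
print(coeffs)
print(residuals)  # four parameters reproduce four data points, so nothing is left over
```

Because the four parameters are determined exactly by the four observations, every residual is zero and no information remains to estimate the error variance.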

6) Regression to the mean

Regression to the mean is a statistical phenomenon that makes natural variation in repeated data look like real change, particularly when unusually small or large measurements are followed by measurements that are closer to the mean. The effect of regression to the mean within a sample becomes more conspicuous as measurement error increases and when follow-up measurements are examined only for a sub-sample selected on its baseline values. These effects can be alleviated using an improved study design and suitable statistical methods.

Question 2 – Computational exercise

a) The correlation coefficient, r, is expressed using the formula below:

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$

From the table below:

        Risk (x)   Stock Return (y)   x²     y²       xy
        0.6        0.6                0.36   0.36     0.36
        0.8        0.52               0.64   0.2704   0.416
        1          0.64               1      0.4096   0.64
        1.2        0.56               1.44   0.3136   0.672
        1.4        0.68               1.96   0.4624   0.952
Total   5          3                  5.4    1.816    3.04
Mean    1          0.6

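Therefore, the regression of y on x can be expressed as

$$y = Mx + C$$

where M is the slope and C is the y-intercept.

M is calculated as

$$M = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} = \frac{5(3.04) - (5)(3)}{5(5.4) - 5^2} = \frac{0.2}{2} = 0.1$$

Therefore, the slope is 0.1.

The y-intercept, C, is calculated as

$$C = \bar{y} - M\bar{x} = 0.6 - 0.1(1) = 0.5$$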
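The calculations for this question can be checked with a short Python sketch that rebuilds the column totals from the risk and stock return values in the table and applies the standard computational formulas for r, the slope and the intercept. The variable names are illustrative.

```python
from math import sqrt

risk = [0.6, 0.8, 1.0, 1.2, 1.4]              # x
stock_return = [0.6, 0.52, 0.64, 0.56, 0.68]  # y

n = len(risk)
sum_x = sum(risk)
sum_y = sum(stock_return)
sum_x2 = sum(x * x for x in risk)
sum_y2 = sum(y * y for y in stock_return)
sum_xy = sum(x * y for x, y in zip(risk, stock_return))

# Pearson correlation coefficient (computational formula)
r = (n * sum_xy - sum_x * sum_y) / sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
# Least-squares slope and intercept of y on x
slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
intercept = sum_y / n - slope * sum_x / n

print(round(r, 4), round(slope, 4), round(intercept, 4))  # 0.5 0.1 0.5
```

The slope agrees with the value of 0.1 obtained above, and the same totals give r = 0.5 and an intercept of 0.5.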

Question 3 – Computer exercise: Pythagorean Theorem

a & b) Expected and actual winning percentages

Team            Wins   Losses   Runs Scored   Runs Allowed   Expected Win %   Actual Win %
Arizona         64     98       615           742            41%              40%
Atlanta         79     83       573           597            48%              49%
Baltimore       96     66       705           593            59%              59%
Boston          71     91       634           715            44%              44%
Chi Cubs        73     89       614           707            43%              45%
Chi White Sox   73     89       660           758            43%              45%
Cincinnati      76     86       595           612            49%              47%
Cleveland       85     77       669           653            51%              52%
Colorado        66     96       755           818            46%              41%
Detroit         90     72       757           705            54%              56%
Houston         70     92       629           723            43%              43%
Kansas City     89     73       651           624            52%              55%
LA Angels       98     64       773           630            60%              60%
LA Dodgers      94     68       718           617            58%              58%
Miami           77     85       645           674            48%              48%
Milwaukee       82     80       650           657            49%              51%
Minnesota       70     92       715           777            46%              43%
NY Mets         79     83       629           618            51%              49%
NY Yankees      84     78       633           664            48%              52%
Oakland         88     74       729           572            62%              54%
Philadelphia    73     89       619           687            45%              45%
Pittsburgh      88     74       682           631            54%              54%
San Diego       77     85       535           577            46%              48%
San Francisco   88     74       665           614            54%              54%
Seattle         87     75       634           554            57%              54%
St. Louis       90     72       619           603            51%              56%
Tampa Bay       77     85       612           625            49%              48%
Texas           67     95       637           773            40%              41%
Toronto         83     79       723           686            53%              51%
Washington      96     66       686           555            60%              59%
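The expected percentages in the table are consistent with the Pythagorean expectation, in which the expected winning fraction is RS² / (RS² + RA²) for runs scored RS and runs allowed RA. The Python sketch below assumes the classic exponent of 2, which the exercise does not state explicitly but which reproduces the three rows checked here. The function names are illustrative.

```python
def pythagorean_expectation(runs_scored, runs_allowed, exponent=2):
    """Expected winning fraction: RS^k / (RS^k + RA^k), with k = 2 by default."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


def actual_win_pct(wins, losses):
    """Observed winning fraction: wins divided by games played."""
    return wins / (wins + losses)


# Three rows from the table: (team, wins, losses, runs scored, runs allowed)
teams = [
    ("Arizona", 64, 98, 615, 742),
    ("Baltimore", 96, 66, 705, 593),
    ("Oakland", 88, 74, 729, 572),
]
for name, wins, losses, rs, ra in teams:
    expected = pythagorean_expectation(rs, ra)
    actual = actual_win_pct(wins, losses)
    print(f"{name}: expected {expected:.0%}, actual {actual:.0%}")
```

Oakland is the largest gap among these rows: its run differential predicts about 62% wins, while it actually won about 54% of its games.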

c) Scatter plot showing the relationship between expected and observed winning percentage.

Figure 1: The Regression of Actual Winning Percentage on Expected Winning Percentage

From Figure 1, the line of best fit has a coefficient of determination of 0.8338. This is close to the ideal value of 1, which implies that the actual winning percentage and the expected winning percentage are strongly correlated. From the graph, the linear equation is

$$y = 0.9054x + 0.0466$$

From the equation:

The y-intercept is the value of y when x is zero. Hence, the y-intercept is 0.0466 = 4.66%.

The slope is 0.9054.

The correlation coefficient is $r = \sqrt{R^2} = \sqrt{0.8338} \approx 0.913$.
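For a simple linear regression, the correlation coefficient has the magnitude of the square root of R² and takes the sign of the slope. A minimal Python sketch of that step, using the values read from Figure 1 and an illustrative function name, is:

```python
from math import copysign, sqrt


def r_from_r_squared(r_squared, slope):
    """Correlation for a simple linear fit: magnitude sqrt(R^2), sign of the slope."""
    return copysign(sqrt(r_squared), slope)


print(round(r_from_r_squared(0.8338, 0.9054), 4))  # 0.9131
```

A fit with a negative slope would give a negative r of the same magnitude.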

Question 4: Lying Statistics

In this question, one year of production data from a renowned manufacturing company is presented in the table below.

Duration (Months)   Expected Production (Tonnes)   Actual Production (Tonnes)
Jan-14              250                            259
Feb-14              247                            251
Mar-14              244                            250
Apr-14              250                            258
May-14              247                            250
Jun-14              259                            242
Jul-14              248                            246
Aug-14              256                            251
Sep-14              253                            250
Oct-14              251                            256
Nov-14              252                            250
Dec-14              247                            255

The scatter plot of actual production attained vs. expected production, in tonnes, is presented in Figure 2 below.

Figure 2: Regression of actual production vs. expected production

From the graph, the coefficient of determination is 0.1019. This implies that expected production explains only about 10% of the variation in actual production, so the company cannot reliably relate its expected production to its actual production. Also, the slope from the linear equation (y = -0.3683x + 343.7) is negative. Presenting the data in this way might not give accurate details regarding production. For instance, production tends to vary with conditions such as demand and the efficiency of the production system. Therefore, a statistical representation of such data may leave the variation between the predicted value and the actual value unexplained.

In conclusion, caution should be taken to ensure that omitted-variable bias in linear models is eliminated. It should be noted that some informal methods in statistics can be expressed as formal statistical models, where their weaknesses become immediately apparent. This helps in assessing the effectiveness of quantitative analysis and in avoiding future mistakes.
