# Lab 6: Covariance and longer and wider tables

In [3]:
library(tidyverse)
install.packages("dslabs")
library(dslabs)


## 1. Covariance and Correlation

$$Cov(X,Y) = E(XY) - E(X)E(Y)$$
<br>
$$Correlation(X,Y) = \rho_{X,Y} = \frac{Cov(X,Y)}{sd(X)sd(Y)}$$

A couple of rules:
* $Cov(X, X) = Var(X)$
* $Cov(X, aY+c) = aCov(X,Y)$, where $X$ and $Y$ are random variables and $a$ and $c$ are constants
* If $X$ and $Y$ are independent, then $Cov(X,Y) = 0$
* $Var(aX+c) = a^2 Var(X)$
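
These rules can be checked numerically. The sketch below uses simulated draws; the seed, sample size, and the constants `a` and `k` are arbitrary choices for illustration (`k` stands in for $c$, since `c` is already a base R function). The linear rules hold for sample covariances up to floating-point error, while the independence rule holds only approximately in a finite sample.

```r
set.seed(1)        # arbitrary seed, for reproducibility only
n <- 10000         # arbitrary sample size
x <- rnorm(n)
y <- rnorm(n)      # generated independently of x
a <- 3             # arbitrary constants for illustration
k <- 7

# Cov(X, X) = Var(X)
all.equal(cov(x, x), var(x))

# Cov(X, aY + c) = a Cov(X, Y)
all.equal(cov(x, a * y + k), a * cov(x, y))

# Var(aX + c) = a^2 Var(X)
all.equal(var(a * x + k), a^2 * var(x))

# Independence implies Cov(X, Y) = 0; the sample value is only close to 0
cov(x, y)
```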

Example: Let $X$ and $Z$ be independent random variables, both drawn from $N(0,1)$, and let $Y = 2X+Z+5$. What is $\rho_{X, Y}$?

#### Theoretical solution:
$$Cov(X, Y) = Cov(X, 2X+Z+5) = Cov(X, 2X)+Cov(X,Z)+Cov(X,5) = Cov(X,2X) = 2Var(X) = 2$$
<br>
$$Var(Y) = Var(2X+Z+5) = 4Var(X) + Var(Z) = 5$$
<br>
$$\rho_{X, Y} = \frac{Cov(X, Y)}{sd(X)sd(Y)} = \frac{2Var(X)}{sd(X)sd(Y)} = \frac{2}{\sqrt{5}} \approx 0.89 $$
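
The variance step relies on the general rule for the variance of a sum. The cross term vanishes because $X$ and $Z$ are independent, and the constant $5$ contributes nothing:

$$Var(2X+Z+5) = Var(2X) + Var(Z) + 2Cov(2X, Z) = 4Var(X) + Var(Z) + 4Cov(X,Z) = 4 + 1 + 0 = 5$$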

#### Empirical solution:

In [4]:
set.seed(108)

n = 100000
X = rnorm(n)
Z = rnorm(n)
Y = 2*X + Z + 5

cor(X,Y)
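
To see where the empirical correlation comes from, the same simulation can be broken into the pieces used in the theoretical solution. This is a minimal sketch that repeats the setup above so it runs on its own; the comments give the theoretical targets, not the printed values.

```r
set.seed(108)
n <- 100000
X <- rnorm(n)
Z <- rnorm(n)
Y <- 2 * X + Z + 5

cov(X, Y)                      # theory: 2
var(Y)                         # theory: 5
cov(X, Y) / (sd(X) * sd(Y))    # same value as cor(X, Y)
2 / sqrt(5)                    # theoretical correlation
```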

#### In real data, we can usually only find the empirical solution; most of the time we do not know the underlying distribution.

In [6]:
gapminder %>%
  filter(year == 2011) %>%
  group_by(continent) %>%
  summarise(rho = cor(infant_mortality, life_expectancy))

continent,rho
<fct>,<dbl>
Africa,-0.6300899
Americas,NA
Asia,NA
Europe,-0.6746311
Oceania,NA


In [None]:
?cor

#### Exercise 1: Why are there `NA`s for some continents? How can you correct that?
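
As a hint, the help page for `cor` documents a `use` argument that controls how missing values are handled. The toy illustration below uses made-up vectors, not the gapminder data.

```r
x <- c(1, 2, 3, 4, NA)   # made-up data with one missing value
y <- c(2, 4, 5, 9, 10)

# Default use = "everything": a missing value makes the result NA
cor(x, y)

# use = "complete.obs": pairs containing an NA are dropped first
cor(x, y, use = "complete.obs")
```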

## 2. Longer and Wider tables

In [8]:
grades_wide = tribble(
  ~name, ~Sex, ~`2015`, ~`2016`, ~`2017`,
     'Wu',  'M', 83,      89,      93,
  'Alice',  'F', 92,      90,      93,
 'Jordan',   NA, 80,      87,      99,
 'Gilberto','M', 67,      90,      92)
grades_wide

name,Sex,2015,2016,2017
<chr>,<chr>,<dbl>,<dbl>,<dbl>
Wu,M,83,89,93
Alice,F,92,90,93
Jordan,,80,87,99
Gilberto,M,67,90,92


In [9]:
grades_long = grades_wide %>%
  pivot_longer(-c(name, Sex), names_to = "year", values_to = "grades")
grades_long

name,Sex,year,grades
<chr>,<chr>,<chr>,<dbl>
Wu,M,2015,83
Wu,M,2016,89
Wu,M,2017,93
Alice,F,2015,92
Alice,F,2016,90
Alice,F,2017,93
Jordan,,2015,80
Jordan,,2016,87
Jordan,,2017,99
Gilberto,M,2015,67
Gilberto,M,2016,90
Gilberto,M,2017,92


In [10]:
grades_long %>%  pivot_wider(names_from = year, values_from = grades)

name,Sex,2015,2016,2017
<chr>,<chr>,<dbl>,<dbl>,<dbl>
Wu,M,83,89,93
Alice,F,92,90,93
Jordan,,80,87,99
Gilberto,M,67,90,92
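
Note that `year` comes out of `pivot_longer` as a character column, because it is built from column names. The sketch below shows one way to get integers instead, assuming a tidyr version that supports the `names_transform` argument (1.1.0 or later), and then checks that pivoting back recovers the original table. The names `grades_long_int` and `round_trip` are introduced here for illustration only.

```r
library(tidyverse)

grades_wide <- tribble(
  ~name,      ~Sex, ~`2015`, ~`2016`, ~`2017`,
  "Wu",       "M",  83,      89,      93,
  "Alice",    "F",  92,      90,      93,
  "Jordan",   NA,   80,      87,      99,
  "Gilberto", "M",  67,      90,      92
)

# names_transform converts the former column names while pivoting,
# so `year` is stored as <int> rather than <chr>
grades_long_int <- grades_wide %>%
  pivot_longer(-c(name, Sex),
               names_to = "year",
               values_to = "grades",
               names_transform = list(year = as.integer))

# Pivoting back should recover the original wide table
round_trip <- grades_long_int %>%
  pivot_wider(names_from = year, values_from = grades)

all.equal(as.data.frame(round_trip), as.data.frame(grades_wide))
```

Storing `year` as a number matters later when it is used on a plot axis or in a comparison such as `year > 2015`.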


## MLB dataset

In [11]:
mlb = read_csv('https://raw.githubusercontent.com/enesdilber/stats306_labs/master/lab5/mlb.csv')
mlb %>% head


── Column specification ────────────────────────────────────────────────────────
cols(
  year = col_double(),
  name = col_character(),
  team = col_character(),
  division = col_character(),
  PA = col_double(),
  HR = col_double(),
  BBrate = col_character(),
  BB_K = col_character(),
  AVG = col_double(),
  FB = col_double(),
  playerid = col_double()
)




year,name,team,division,PA,HR,BBrate,BB_K,AVG,FB,playerid
<dbl>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
2016,Abraham Almonte,Indians,Central,194,1,4.1 %,8/42,0.264,43,5486
2017,Abraham Almonte,Indians,Central,195,3,10.3 %,20/46,0.233,34,5486
2015,Adam Moore,Indians,Central,4,0,0.0 %,0/2,0.25,1,9362
2016,Adam Moore,Indians,Central,5,0,0.0 %,0/4,0.0,1,9362
2018,Adam Plutko,Indians,Central,2,0,0.0 %,0/0,0.0,0,15846
2018,Adam Rosales,Indians,Central,21,1,4.8 %,1/5,0.211,7,9682


#### Exercise 2: Calculate the total `Home Run to Fly Ball rate (HR/FB)` for each team and year, that is, $HR\_FB = \frac{\sum HR_i}{\sum FB_i}$. Make sure `division` is kept in the final dataset, so the result has the columns `division`, `team`, `year`, and `HR_FB`.
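
The formula is a ratio of sums, which is not the same as averaging each player's own ratio. Some players in the data have `FB` equal to 0 (for example, the 2018 Adam Plutko row shown above), so a per-player ratio would not even be defined for them. The sketch below illustrates the difference on a made-up table; the `toy` data, its `player` column, and the name `rates` are hypothetical and not taken from the MLB file.

```r
library(tidyverse)

# Made-up numbers, not taken from the MLB data
toy <- tribble(
  ~team, ~player, ~HR, ~FB,
  "A",   "p1",    10,  50,
  "A",   "p2",     0,  10,
  "B",   "p3",     5,  20,
  "B",   "p4",     1,   2
)

rates <- toy %>%
  group_by(team) %>%
  summarise(
    ratio_of_sums  = sum(HR) / sum(FB),  # pooled rate, as in the formula
    mean_of_ratios = mean(HR / FB)       # per-player average, a different quantity
  )
rates
```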

#### Exercise 3: Convert this to a wide dataset whose variables are `division`, `team`, and one column per year from `2015` to `2018`, with the `HR/FB` rate as the values. Again, make sure `division` is still in the dataset.

#### Exercise 4: Create a variable called `increased` that indicates whether the `HR/FB` rate for each team was higher in 2018 than in 2015.

#### Exercise 5: Calculate the correlation between each year and the following year, that is, $\rho_{2015, 2016}$, $\rho_{2016, 2017}$, and $\rho_{2017, 2018}$.

#### Exercise 6: Turn `df_wide` back into a "long" dataset

#### Exercise 7: Using `df_long`, create a faceted line plot of the `HR/FB` rate against `year`. Color it by `team`, facet it by `division`, and set the linetype according to the `increased` variable.