Lab: Relationship between Z-test and Chi-square test¶

Good morning/afternoon. Today, we're exploring the relationship between these two tests.

Main Point¶

For a 2x2 table, the $\chi^2$ statistic (without continuity correction) is the two-sample z-test statistic squared.
The p-values you get for testing the "same" hypotheses using the two different methods (two-sided z-test vs. $\chi^2$ test) are the same.

I can mic drop right here. You can leave the room now.

# LIBRARIES
library(readr)
library(dplyr)

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Our Dataset¶

Flashback to the 60's.

Background¶

The data were collected to assess the effects of behavior type on coronary heart disease (CHD).
3524 men aged 39-59 were enrolled from corporations in California.
Each individual's behavior type was assessed during an interview.
Full data is available for 3142 participants.
Of these, 257 (8.2%) had a CHD event.

What the variables mean¶

chd69=1 implies that a CHD event occurred vs. chd69=0 codes no CHD event.
dibpat0=1 codes participants with a "Type A" personality and dibpat0=0 codes participants with a "Type B" personality.
Here, CHD is the response variable and personality type is the explanatory variable.

# READ IN DATA
dat <- read_csv("data/lab_11.csv")

Parsed with column specification:
cols(
  id = col_integer(),
  age0 = col_integer(),
  height0 = col_integer(),
  weight0 = col_integer(),
  sbp0 = col_integer(),
  dbp0 = col_integer(),
  chol0 = col_integer(),
  behpat0 = col_integer(),
  ncigs0 = col_integer(),
  dibpat0 = col_integer(),
  chd69 = col_integer(),
  arcus0 = col_integer(),
  cigs = col_integer()
)

# PREVIEW DATA
head(dat)

Hypotheses¶

Recognize that we're testing for independence.
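Because we're working with a 2x2 contingency table (as you'll see soon), independence of CHD and personality type is equivalent to equal CHD proportions in the two groups, so the z-test and the chi-square test address the same hypotheses.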

$H_0$: P(CHD=1|Type A) = P(CHD=1|Type B)
$H_1$: P(CHD=1|Type A) $\neq$ P(CHD=1|Type B)

Two Sample Z-Test¶

By Hand Calculation¶

# THE OVERALL PROPORTION
# OFTEN CALLED "POOLED"
# THE se USES n = 100 PER GROUP, MATCHING THE GROUP SIZES USED BELOW
overall_p <- dat %>% 
             summarize(overall_p = mean(chd69),
                       se = sqrt(overall_p*(1-overall_p)*(1/100 + 1/100)))
overall_p

# CALCULATE THE SAMPLE STATS FOR EACH PERSONALITY TYPE
summary_stats <- dat %>% 
                 group_by(dibpat0) %>%
                 summarize(n = n(), propCHD = mean(chd69))

# BASE OUR TEST OFF OF THIS
summary_stats

Using the above values, we calculate our z-statistic with the "Proportion of Two Populations" formula from the bCourses Statistical Inference Reference Sheet.
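For reference, a sketch of the pooled two-proportion z statistic in its standard form. The notation here is ours and may differ from the reference sheet: $x_i$ is the number of CHD events and $n_i$ the group size.

$$
\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}, \qquad
z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\,(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}
$$

With the counts used in this lab (16 of 100 and 3 of 100), $\hat{p} = 19/200 = 0.095$, the standard error is $\sqrt{0.095 \times 0.905 \times 0.02} \approx 0.04147$, and $z \approx 0.13 / 0.04147 \approx 3.135$.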

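# Z-TEST STATISTIC
# (DIFFERENCE IN propCHD FROM summary_stats) / (POOLED se FROM overall_p)
z_stat <- (0.16 - 0.03) / 0.04146685
z_stat

# TWO-SIDED
# abs() KEEPS THE P-VALUE CORRECT EVEN IF z_stat IS NEGATIVE
p_value <- pnorm(q = abs(z_stat), lower.tail = FALSE)*2
p_value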
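The hard-coded numbers above can also be computed directly from the counts. This is a minimal sketch, assuming the group counts used in this lab (16 of 100 and 3 of 100 with a CHD event). The variable names are ours and not part of the original lab.

```r
# CHD events and group sizes (assumed from the proportions 0.16 and 0.03 with n = 100 each)
x <- c(16, 3)
n <- c(100, 100)

p_hat  <- x / n                                   # group proportions
p_pool <- sum(x) / sum(n)                         # pooled proportion under H0
se     <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))  # pooled standard error

z_calc <- (p_hat[1] - p_hat[2]) / se
p_calc <- 2 * pnorm(abs(z_calc), lower.tail = FALSE)  # two-sided p-value
c(z = z_calc, p_value = p_calc)
```

Computing the standard error from `n` rather than typing in 100 keeps the calculation valid when the groups have different sizes.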

Using R¶

Now that we've seen the machinery... take a shortcut.

# WOW, I PREFER THIS
prop.test(x = c(3, 16), n = c(100, 100), correct = FALSE)

	2-sample test for equality of proportions without continuity
	correction

data:  c(3, 16) out of c(100, 100)
X-squared = 9.8284, df = 1, p-value = 0.001718
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.2092514 -0.0507486
sample estimates:
prop 1 prop 2 
  0.03   0.16
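Note that prop.test() reports an X-squared statistic rather than z. A quick check, using only the counts above, suggests that its square root matches the by-hand z statistic in absolute value. The object names below are ours.

```r
pt <- prop.test(x = c(3, 16), n = c(100, 100), correct = FALSE)

# The square root of X-squared is the absolute value of the z statistic
z_from_prop_test <- sqrt(unname(pt$statistic))
z_from_prop_test

# The p-value is stored in the returned object as well
pt$p.value
```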

Chi-square Test for Independence¶

By Hand Calculation¶

Consult bCourses Files > Ch21_Inference-catergoical-var-greater-than-2-levels.pdf for the test statistic.

two_way <- matrix(c(3, 97, 16, 84), byrow=TRUE, nrow=2)

two_way

row.names(two_way) <- c("type a", "type b")
colnames(two_way) <- c("chd=1", "chd=0")

two_way

We need to calculate marginals.

totals_1 <- c(3+97, 16+84)
totals_2 <- c(3+16, 97+84, 3+97+16+84)

two_way <- rbind(cbind(two_way, totals_1), totals_2)
two_way
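As an alternative to binding the totals by hand, base R's addmargins() can append them. This is a sketch under the assumption that the counts are stored as a table. The object names `tab` and `with_margins` are ours.

```r
# Same 2x2 counts as above, stored as a table with named dimensions
tab <- as.table(matrix(c(3, 97, 16, 84), byrow = TRUE, nrow = 2,
                       dimnames = list(type = c("type a", "type b"),
                                       chd  = c("chd=1", "chd=0"))))

# addmargins() appends row and column sums, labelled "Sum"
with_margins <- addmargins(tab)
with_margins
```

Keeping the margins in a separate object leaves the original 2x2 counts intact for the test itself.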

Get the "$O_i - E_i$"'s (observed minus expected counts).

ei_rows <- c(19*100/200, 181*100/200)

Note that the rows of expected counts won't always be identical. They are here only because the two groups have the same number of observations.

expected_counts <- rbind(ei_rows, ei_rows)
expected_counts
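For tables where the group sizes differ, the expected counts can be computed for every cell at once as (row total x column total) / grand total. A minimal sketch with the same counts, where the names `observed` and `expected` are ours:

```r
observed <- matrix(c(3, 97, 16, 84), byrow = TRUE, nrow = 2)

# Expected count for each cell = (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
expected
```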

two_way[1:2,1:2] - expected_counts

We will construct the statistic as we see in the reference page.

# CHI-SQ TEST STATISTIC
sum((two_way[1:2,1:2] - expected_counts)^2 / expected_counts)
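The by-hand calculation stops at the statistic. A sketch of the remaining step, the p-value, assuming the usual degrees of freedom for an r x c table, (r - 1)(c - 1). The object names are ours.

```r
observed <- matrix(c(3, 97, 16, 84), byrow = TRUE, nrow = 2)
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

chi_sq   <- sum((observed - expected)^2 / expected)
deg_free <- (nrow(observed) - 1) * (ncol(observed) - 1)  # 1 for a 2x2 table

# Upper-tail probability of the chi-squared distribution
p_chi <- pchisq(chi_sq, df = deg_free, lower.tail = FALSE)
c(chi_sq = chi_sq, df = deg_free, p_value = p_chi)
```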

Using R¶

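We can use the R function now. Pass only the 2x2 counts, not the marginal totals: including the totals row and column leaves the statistic unchanged but inflates the degrees of freedom to 4 and gives the wrong p-value.

chisq.test(two_way[1:2, 1:2], correct=FALSE)

	Pearson's Chi-squared test

data:  two_way[1:2, 1:2]
X-squared = 9.8284, df = 1, p-value = 0.001718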

Relating the two distributions¶

The relationship between the statistics:

z_stat <- 3.13503437082875
x_stat <- 9.8284

# z^2 SHOULD MATCH THE CHI-SQ STATISTIC (UP TO ROUNDING)
z_stat^2
x_stat
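The main point also says the p-values agree. This holds because the square of a standard normal variable follows a chi-squared distribution with 1 degree of freedom. A short check with the z statistic from this lab, where the names `p_z` and `p_chi` are ours:

```r
z_stat <- 3.13503437082875

p_z   <- 2 * pnorm(abs(z_stat), lower.tail = FALSE)    # two-sided z-test
p_chi <- pchisq(z_stat^2, df = 1, lower.tail = FALSE)  # chi-squared test, 1 df

c(p_z = p_z, p_chi = p_chi)
all.equal(p_z, p_chi)
```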

All done!

Output of head(dat):

id	age0	height0	weight0	sbp0	dbp0	chol0	behpat0	ncigs0	dibpat0	arcus0	cigs
6092	45	70	168	118	84	275	3	14	0	0	1
3579	40	75	163	116	72	199	2	0	1	0	0
12671	48	70	173	138	88	197	1	0	1	0	0
13074	39	72	170	110	76	259	1	40	1	0	3
10366	49	69	182	122	82	238	3	0	0	1	0
3496	40	66	145	126	70	195	4	0	0	0	0

Output of expected_counts:

ei_rows	9.5	90.5
ei_rows	9.5	90.5

Output of two_way (after naming rows and columns):

	chd=1	chd=0
type a	3	97
type b	16	84

Output of two_way (with marginals):

	chd=1	chd=0	totals_1
type a	3	97	100
type b	16	84	100
totals_2	19	181	200

Output of two_way[1:2,1:2] - expected_counts:

	chd=1	chd=0
type a	-6.5	6.5
type b	6.5	-6.5