Lab: Relationship between Z-test and Chi-square test

Good morning/afternoon. Today, we're exploring the relationship between these two tests.

Main Point

  • For a 2x2 table, the $\chi^2$ statistic (without continuity correction) is the z-test statistic squared.
  • The p-values you get for testing the "same" hypotheses using the two different methods are the same.

I can mic drop right here. You can leave the room now.

In [1]:
# LIBRARIES
library(readr)
library(dplyr)
Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Our Dataset

Flashback to the 60's.

Background

  • Collected to assess the effects of behavior type on coronary heart disease (CHD)
  • 3524 men were enrolled, aged 39-59 from corporations in California
  • Each individual's behavior type was assessed during an interview
  • Full data is available for 3142 participants.
  • Of these, 257 (8.2%) had a CHD event.

What the variables mean

  • chd69=1 implies that a CHD event occurred vs. chd69=0 codes no CHD event.
  • dibpat0=1 codes participants with a "Type A" personality and dibpat0=0 codes participants with a "Type B" personality.
  • Here, CHD is the response variable and personality type is the explanatory variable.
In [2]:
# READ IN DATA
dat <- read_csv("data/lab_11.csv")
Parsed with column specification:
cols(
  id = col_integer(),
  age0 = col_integer(),
  height0 = col_integer(),
  weight0 = col_integer(),
  sbp0 = col_integer(),
  dbp0 = col_integer(),
  chol0 = col_integer(),
  behpat0 = col_integer(),
  ncigs0 = col_integer(),
  dibpat0 = col_integer(),
  chd69 = col_integer(),
  arcus0 = col_integer(),
  cigs = col_integer()
)
In [3]:
# PREVIEW DATA
head(dat)
id     age0  height0  weight0  sbp0  dbp0  chol0  behpat0  ncigs0  dibpat0  chd69  arcus0  cigs
6092   45    70       168      118   84    275    3        14      0        0      0       1
3579   40    75       163      116   72    199    2        0       1        0      0       0
12671  48    70       173      138   88    197    1        0       1        0      0       0
13074  39    72       170      110   76    259    1        40      1        0      0       3
10366  49    69       182      122   82    238    3        0       0        0      1       0
3496   40    66       145      126   70    195    4        0       0        0      0       0

Hypotheses

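Recognize that we're testing for independence between personality type and CHD.
Because we're working with a 2x2 contingency table (as you'll see soon), independence is equivalent to the two groups having the same probability of CHD, so the chi-square and z-test hypotheses reduce to the same thing.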

$H_0$: P(CHD=1|Type A) = P(CHD=1|Type B)
$H_1$: P(CHD=1|Type A) $\neq$ P(CHD=1|Type B)

Two Sample Z-Test

By Hand Calculation

In [4]:
# THE OVERALL PROPORTION
# OFTEN CALLED "POOLED"
overall_p <- dat %>% 
             summarize(overall_p = mean(chd69),
                       se = sqrt(overall_p*(1-overall_p)*(1/100 + 1/100)))
overall_p
overall_p  se
0.095      0.04146685
In [5]:
# CALCULATE THE SAMPLE STATS FOR EACH GROUP
summary_stats <- dat %>% 
                 group_by(dibpat0) %>%
                 summarize(n = n(), propCHD = mean(chd69))
In [6]:
# BASE OUR TEST OFF OF THIS
summary_stats
dibpat0  n    propCHD
0        100  0.03
1        100  0.16

Using the values above, we compute the z-statistic with the "Proportion of Two Populations" formula from the bCourses Statistical Inference Reference Sheet.
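
For reference, a standard way to write the pooled two-proportion z statistic is shown below. The notation is an assumption on our part and may differ from the reference sheet: $\hat{p}_1, \hat{p}_2$ are the group proportions, $x_1, x_2$ the numbers of CHD events, $n_1, n_2$ the group sizes, and $\hat{p}$ the pooled proportion.

$$
z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\,(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}},
\qquad
\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}
$$

With our numbers, $\hat{p} = 19/200 = 0.095$ and the denominator is $\sqrt{0.095 \times 0.905 \times 0.02} \approx 0.0415$.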

In [7]:
# Z-TEST STATISTIC
z_stat <- (0.16 - 0.03) / 0.04146685
z_stat
3.13503437082875
In [8]:
# TWO-SIDED: USE abs() SO THIS ALSO WORKS WHEN z_stat IS NEGATIVE
p_value <- 2 * pnorm(q = abs(z_stat), lower.tail = FALSE)
p_value
0.00171833977337647
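
The hand calculation above hard-codes the proportions and the standard error. As a rough sketch, the same steps can be wrapped in a small helper that works directly from counts. The function name `two_prop_z` and its arguments are hypothetical and not part of the lab materials.

```r
# Hypothetical helper (not from the lab): pooled two-proportion z-test from counts.
# x = number of events in each group, n = size of each group.
two_prop_z <- function(x, n) {
  p_hat  <- x / n                                  # group proportions
  pooled <- sum(x) / sum(n)                        # pooled proportion under H0
  se     <- sqrt(pooled * (1 - pooled) * sum(1 / n))
  z      <- (p_hat[2] - p_hat[1]) / se             # group 2 minus group 1
  p      <- 2 * pnorm(abs(z), lower.tail = FALSE)  # two-sided p-value
  list(z = z, p_value = p)
}

res <- two_prop_z(x = c(3, 16), n = c(100, 100))
res$z
res$p_value
```

Because nothing is rounded along the way, the z value may differ from the hand calculation in the last few decimal places.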

Using R

Now that we've seen the machinery... take a shortcut.

In [9]:
# WOW, I PREFER THIS
prop.test(x = c(3, 16), n = c(100, 100), correct = FALSE)
	2-sample test for equality of proportions without continuity
	correction

data:  c(3, 16) out of c(100, 100)
X-squared = 9.8284, df = 1, p-value = 0.001718
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.2092514 -0.0507486
sample estimates:
prop 1 prop 2 
  0.03   0.16 
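
Notice that `prop.test` reports an X-squared value rather than a z value. A small sketch of how to pull the pieces out of the returned object and recover the magnitude of z (the variable name `pt` is ours):

```r
# Store the test result instead of only printing it
pt <- prop.test(x = c(3, 16), n = c(100, 100), correct = FALSE)

x_squared <- unname(pt$statistic)  # the reported X-squared
abs_z     <- sqrt(x_squared)       # magnitude of the z statistic

x_squared
abs_z
pt$p.value
```

The square root only gives $|z|$. The sign depends on which group is listed first, which is also why the confidence interval above (prop 1 minus prop 2) is negative.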

Chi-square Test of Independence

By Hand Calculation

Consult bCourses Files > Ch21_Inference-catergoical-var-greater-than-2-levels.pdf for the test statistic.

In [10]:
two_way <- matrix(c(3, 97, 16, 84), byrow=TRUE, nrow=2)

two_way
3   97
16  84
In [11]:
# ROW 1 IS dibpat0=0 (TYPE B), ROW 2 IS dibpat0=1 (TYPE A)
row.names(two_way) <- c("type b", "type a")
colnames(two_way) <- c("chd=1", "chd=0")
In [12]:
two_way
        chd=1  chd=0
type b      3     97
type a     16     84

We need to calculate the marginal totals.

In [13]:
totals_1 <- c(3+97, 16+84)              # row totals
totals_2 <- c(3+16, 97+84, 3+97+16+84)  # column totals and grand total
In [14]:
two_way <- rbind(cbind(two_way, totals_1), totals_2)
two_way
          chd=1  chd=0  totals_1
type b        3     97       100
type a       16     84       100
totals_2     19    181       200

Get the expected counts $E_i$ and the differences $O_i - E_i$.

In [15]:
# EXPECTED COUNT = ROW TOTAL * COLUMN TOTAL / GRAND TOTAL
ei_rows <- c(19*100/200, 181*100/200)

Note that the rows of expected counts won't always be identical. They match here only because both groups have the same number of participants.

In [18]:
expected_counts <- rbind(ei_rows, ei_rows)
expected_counts
ei_rows  9.5  90.5
ei_rows  9.5  90.5
In [17]:
two_way[1:2,1:2] - expected_counts
        chd=1  chd=0
type b   -6.5    6.5
type a    6.5   -6.5

We now construct the statistic as given on the reference page: the sum of $(O_i - E_i)^2 / E_i$ over the four cells.

In [19]:
# CHI-SQ TEST STATISTIC
sum((two_way[1:2,1:2] - expected_counts)^2 / expected_counts)
9.82843849956383
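
The expected counts above were typed in by hand. As a sketch of a more general route, they can be computed from the row and column totals of any table of counts, and the p-value can be read off the chi-square distribution. The names `obs`, `expected`, `chi_sq`, and `df` are ours.

```r
# Observed 2x2 counts, without marginal totals (rows: groups, cols: chd=1, chd=0)
obs <- matrix(c(3, 97, 16, 84), byrow = TRUE, nrow = 2)

# Expected count in each cell = row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson chi-square statistic and its degrees of freedom
chi_sq <- sum((obs - expected)^2 / expected)
df     <- (nrow(obs) - 1) * (ncol(obs) - 1)

# Upper-tail p-value from the chi-square distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

expected
chi_sq
df
p_value
```

Using `outer()` avoids assuming equal group sizes, so the rows of `expected` are free to differ when the groups are unbalanced.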

Using R

We can use the R function now.

In [20]:
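# PASS ONLY THE 2x2 COUNTS: INCLUDING THE MARGINAL TOTALS WOULD GIVE THE WRONG DF
chisq.test(two_way[1:2, 1:2], correct=FALSE)
	Pearson's Chi-squared test

data:  two_way[1:2, 1:2]
X-squared = 9.8284, df = 1, p-value = 0.001718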

Relating the two distributions

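The relationship between the statistics: squaring the z statistic recovers the chi-square statistic (9.8284, up to rounding in the hand calculation).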

In [21]:
z_stat <- 3.13503437082875
x_stat <- 9.8284
In [22]:
z_stat^2
9.82844050627762
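
The p-values match for the same reason: if $Z$ is standard normal, then $Z^2$ follows a chi-square distribution with 1 degree of freedom, so $P(|Z| \geq |z|) = P(Z^2 \geq z^2)$. A quick check of this, using the z statistic computed earlier (the names `p_z` and `p_chisq` are ours):

```r
z <- 3.13503437082875

# Two-sided p-value from the standard normal distribution
p_z <- 2 * pnorm(abs(z), lower.tail = FALSE)

# Upper-tail p-value from the chi-square distribution with 1 df
p_chisq <- pchisq(z^2, df = 1, lower.tail = FALSE)

p_z
p_chisq
all.equal(p_z, p_chisq)
```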

All done!