Statistical Inference (aka "Learning")
Fundamentals
- Using data to infer the distribution that generated it
- Frequentist: data are random variables and parameters are fixed quantities
- Bayesian: data are fixed quantities and parameters are random variables
Point Estimation (“Best Guess”)
- Examples:
- parameter of a parametric model,
- a CDF $F$,
- a PDF $f$,
- a regression function $r$, or
- a prediction of a future value $Y$ of some random variable
- It is a function of the data: $\hat{\theta} = g(X_1, X_2, \ldots, X_n)$
- The distribution of $\hat{\theta}$ induced by the data-generating process is called the sampling distribution. Expectations of $\hat{\theta}$ are taken with respect to the distribution that generated the data, $\prod_{i=1}^{n} f(x_i \mid \theta).$
- Metrics:
- Bias: $\text{bias}(\hat{\theta}) = E(\hat{\theta}) - \theta$
- Standard error: $\text{se}(\hat{\theta}) = \sqrt{\text{Var}(\hat{\theta})}$
- Mean squared error: $\text{MSE} = E\big[(\hat{\theta} - \theta)^2\big]$
- Bias-variance decomposition: $\text{MSE} = \text{bias}^2(\hat{\theta}) + \text{Var}(\hat{\theta})$
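As a quick illustration of these metrics, here is a minimal simulation sketch (assuming numpy; the Normal-variance MLE and the numbers are placeholder choices, not from the notes) that approximates the sampling distribution of an estimator and checks the bias-variance decomposition.

```python
# Sketch: estimate bias, se, and MSE of an estimator by simulation.
# Example estimator: the MLE of the Normal variance, (1/n) * sum((x - xbar)^2),
# which is biased downward.
import numpy as np

rng = np.random.default_rng(0)
theta = 4.0          # true variance
n, n_sims = 25, 100_000

# Sampling distribution of theta_hat: repeat the experiment many times.
samples = rng.normal(loc=0.0, scale=np.sqrt(theta), size=(n_sims, n))
theta_hat = samples.var(axis=1, ddof=0)   # MLE of the variance

bias = theta_hat.mean() - theta
se = theta_hat.std(ddof=1)
mse = np.mean((theta_hat - theta) ** 2)

print(f"bias ≈ {bias:.4f}   (theory: -theta/n = {-theta/n:.4f})")
print(f"se   ≈ {se:.4f}")
print(f"MSE  ≈ {mse:.4f}  vs  bias^2 + var ≈ {bias**2 + se**2:.4f}")
```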
Confidence Sets
- The $1 - \alpha$ confidence interval for parameter $\theta$ is an interval $C_n = (a, b)$ where $a,b$ are functions of the data such that
- $C_n = (a,b)$ “traps” the parameter $\theta$ with probability at least $1-\alpha$, i.e. $P_\theta(\theta \in C_n) \geq 1 - \alpha$ for all $\theta$. Note that $C_n$ is the random component here, and $\theta$ is assumed to be fixed.
- The (frequentist) idea is that if we repeated the experiment and statistical procedure many times, and each time we computed a new $C_n$, we would trap $\theta$ in $1-\alpha$ of the repetitions.
- When an estimator is approximately Normal (e.g. by the CLT), we can construct a 95% confidence interval as $\hat\theta \pm 1.96\, \hat{\text{se}}$.
- It can be hard to compute confidence intervals analytically. With modern computers we can often use the bootstrap to estimate them instead.
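For example, here is a minimal bootstrap sketch (assuming numpy; the data and the choice of the median as the estimand are placeholders) that builds a Normal-interval 95% confidence interval from the bootstrap standard error.

```python
# Sketch: a bootstrap confidence interval for the median.
# Uses the Normal interval theta_hat +/- z * se_boot; percentile intervals
# are another common choice.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=200)      # placeholder data
theta_hat = np.median(x)

B = 5000
boot = np.array([
    np.median(rng.choice(x, size=x.size, replace=True)) for _ in range(B)
])
se_boot = boot.std(ddof=1)

z = 1.96                                      # N(0,1) quantile for 95%
ci = (theta_hat - z * se_boot, theta_hat + z * se_boot)
print(f"median = {theta_hat:.3f}, 95% CI ≈ ({ci[0]:.3f}, {ci[1]:.3f})")
```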
Hypothesis Tests
- Define a test
- Let $X$ be a random variable with support $\mathcal{X}$. We test the hypothesis by finding a subset of outcomes $R \subset \mathcal{X}$ called the rejection region. We reject $H_0$ if $X \in R$; otherwise we do not reject it.
- Generally $R = \{ x : T(x) > c \}$ where $T$ is a test statistic and $c$ is a critical value.
- The power function is given by $\beta(\theta) = P_\theta(X \in R)$, the probability of rejecting when the true parameter is $\theta$
- The size is defined as $ \alpha = \sup_{\theta \in \Theta_0} \beta(\theta) $
- A test has level $\alpha$ if its size is $\leq \alpha$
- The p-value is the smallest level $\alpha$ at which the test rejects
- It is the probability under $H_0$ of observing a value of the test statistic that is the same as or more extreme than what was actually observed.
- It measures the probability mass in the tail(s) of the distribution under $H_0$.
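To make the recipe concrete, here is a small sketch (assuming numpy/scipy; the data are placeholders) of a two-sided test of $H_0: \mu = 0$ using $T(x) = \lvert \bar{x} \rvert / (s/\sqrt{n})$, with the critical value and p-value computed from the $N(0,1)$ reference distribution. This is essentially a preview of the Wald test below.

```python
# Sketch: the generic recipe R = {x : T(x) > c} for a two-sided test of
# H0: mu = 0 on Normal-ish data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.3, scale=1.0, size=50)             # placeholder data

alpha = 0.05
T = abs(x.mean()) / (x.std(ddof=1) / np.sqrt(x.size))   # test statistic
c = stats.norm.ppf(1 - alpha / 2)                       # critical value

reject = T > c
p_value = 2 * (1 - stats.norm.cdf(T))   # mass in both tails under H0
print(f"T = {T:.3f}, c = {c:.3f}, reject = {reject}, p-value = {p_value:.4f}")
```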
Wald Test
Consider
\[H_0: \theta = \theta_0 \quad H_1: \theta \neq \theta_0.\]Assume that $\hat\theta$ is asymptotically Normal
\[\frac{\hat\theta - \theta_0}{\hat{\text{se}}} \to N(0, 1)\]as $n \to \infty$. The size $\alpha$ Wald test is to reject $H_0$ when $\lvert W \rvert = \left\lvert \frac{\hat\theta - \theta_0}{\hat{\text{se}}} \right\rvert > z_{\alpha/2}$. Note that the Wald test is equivalent to checking whether the null value is in the confidence interval. The Wald test rejects when
\[\theta_0 \notin (\hat\theta - \hat{\text{se}}\, z_{\alpha/2},\ \hat\theta + \hat{\text{se}}\, z_{\alpha/2})\]
Pearson’s $\chi^2$ for Multinomial Data
Given multinomial counts $X = (X_1, \ldots, X_k)$ from $n$ trials with null probabilities $p_0 = (p_{01}, \ldots, p_{0k})$, the test statistic is
\[T = \sum_{j=1}^{k} \frac{(X_j - n p_{0j})^2}{n p_{0j}},\]which has an approximate $\chi^2_{k-1}$ distribution under $H_0$ for large $n$; we reject for large values of $T$.
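A hedged sketch of the test in code (assuming scipy; the die-roll counts are made up): compute $T$ directly and compare with `scipy.stats.chisquare`.

```python
# Sketch: Pearson's chi-squared test for multinomial counts.
# Example: test whether a six-sided die is fair from observed face counts.
import numpy as np
from scipy import stats

counts = np.array([18, 24, 16, 21, 25, 16])      # placeholder observed counts
p0 = np.full(6, 1 / 6)                           # null probabilities
expected = counts.sum() * p0

T = np.sum((counts - expected) ** 2 / expected)  # Pearson statistic
p_value = stats.chi2.sf(T, df=len(counts) - 1)   # chi^2_{k-1} tail probability

# scipy's built-in version gives the same answer:
T2, p2 = stats.chisquare(counts, f_exp=expected)
print(f"T = {T:.3f}, p-value = {p_value:.4f}  (scipy: {T2:.3f}, {p2:.4f})")
```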
Permutation Test
- It can be difficult or impossible to compute p-values analytically in complex cases, but with modern computers we can use a permutation test. This test repeatedly permutes the variables of interest (explicitly breaking any association) and recomputes the test statistic after each permutation.
- The p-value is taken as the proportion of permuted statistics that are greater than or equal to the observed value (the tail of the sampling distribution of the test statistic under the null hypothesis).
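A minimal sketch (assuming numpy; the two groups and the difference-in-means statistic are placeholder choices) of a permutation test:

```python
# Sketch: a permutation test for a difference in means between two groups.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.5, 1.0, size=40)     # group 1 (placeholder data)
y = rng.normal(0.0, 1.0, size=35)     # group 2

def statistic(a, b):
    return abs(a.mean() - b.mean())

observed = statistic(x, y)
pooled = np.concatenate([x, y])

B = 10_000
perm_stats = np.empty(B)
for b in range(B):
    perm = rng.permutation(pooled)            # break any association
    perm_stats[b] = statistic(perm[:x.size], perm[x.size:])

# Proportion of permuted statistics at least as extreme as the observed one.
p_value = np.mean(perm_stats >= observed)
print(f"observed = {observed:.3f}, permutation p-value = {p_value:.4f}")
```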
Likelihood Ratio Test
This test generalizes the Wald test to vector-valued parameters. Consider
\[H_0 : \theta \in \Theta_0 \quad H_1 : \theta \notin \Theta_0.\]The test statistic is
\[\lambda = 2 \log \frac{\sup_{\theta \in \Theta} \mathcal{L}(\theta)}{\sup_{\theta \in \Theta_0} \mathcal{L}(\theta)} = 2 \log \frac{\mathcal{L}(\hat \theta)}{\mathcal{L}(\hat \theta_0)},\]where $\hat \theta$ is the MLE and $\hat \theta_0$ is the MLE when $\theta$ is restricted to lie in $\Theta_0$.
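A small sketch of the test (assuming numpy/scipy, with placeholder Poisson data). It uses the standard result, not stated above, that under $H_0$ the statistic $\lambda$ is approximately $\chi^2$ with degrees of freedom equal to the difference in dimension between $\Theta$ and $\Theta_0$.

```python
# Sketch: likelihood ratio test for a Poisson rate, H0: rate = 1 vs H1: rate != 1.
# Here dim(Theta) - dim(Theta_0) = 1, so lambda ~ chi^2_1 under H0.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.poisson(lam=1.3, size=100)    # placeholder data

def loglik(rate):
    # log L(rate) up to a constant that cancels in the ratio
    return np.sum(x * np.log(rate) - rate)

rate_hat = x.mean()                   # unrestricted MLE
rate_0 = 1.0                          # the only value allowed under H0

lam = 2 * (loglik(rate_hat) - loglik(rate_0))
p_value = stats.chi2.sf(lam, df=1)
print(f"lambda = {lam:.3f}, p-value = {p_value:.4f}")
```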
Parametric Estimation
Maximum Likelihood
Define the likelihood function as the joint density of the data
\[\mathcal{L}(\theta) = \prod_{i=1}^{n} f(X_i \mid \theta),\]then the maximum likelihood estimator (MLE) is given by the solution to
\[\hat{\theta} = \text{argmax}_{\theta} \mathcal{L}(\theta).\]Note that the likelihood $\mathcal{L}$ is not a probability density and does not integrate to $1$ (over $\theta$). We treat the likelihood as a function of $\theta$.
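In practice the maximization is often done numerically on the log-likelihood. A minimal sketch (assuming numpy/scipy; the Gamma model and the data are placeholder choices):

```python
# Sketch: computing an MLE by numerically maximizing the log-likelihood.
# Example: Gamma(shape, scale) fit to positive data.
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=3.0, size=500)   # placeholder data

def neg_loglik(params):
    shape, scale = params
    if shape <= 0 or scale <= 0:
        return np.inf
    return -np.sum(stats.gamma.logpdf(x, a=shape, scale=scale))

# Maximize L(theta) by minimizing the negative log-likelihood.
result = optimize.minimize(neg_loglik, x0=np.array([1.0, 1.0]),
                           method="Nelder-Mead")
print("MLE (shape, scale) ≈", result.x)
```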
Properties of the MLE
Under regularity conditions (smoothness) on $f(x \mid \theta)$, the MLE has the following properties:
- Consistency
- Asymptotically Normal (can construct asymptotic confidence intervals; see the simulation sketch after this list)
- Asymptotically optimal (smallest asymptotic variance among well-behaved estimators)
- Equivariant (if $\hat\theta$ is the MLE of $\theta$, then $g(\hat\theta)$ is the MLE of $g(\theta)$)
- Approximately Bayes estimator
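A quick simulation check of the asymptotic Normality property (assuming numpy; the Exponential model and sample sizes are illustrative, and the asymptotic standard error $\lambda/\sqrt{n}$ comes from the Fisher information, which is not derived in these notes):

```python
# Sketch: the sampling distribution of the MLE of an Exponential rate is
# approximately Normal for large n. The MLE is rate_hat = 1 / xbar and its
# asymptotic standard error is rate / sqrt(n) (from the Fisher information).
import numpy as np

rng = np.random.default_rng(0)
rate, n, n_sims = 2.0, 200, 20_000

samples = rng.exponential(scale=1 / rate, size=(n_sims, n))
rate_hat = 1 / samples.mean(axis=1)            # MLE for each simulated dataset

z = (rate_hat - rate) / (rate / np.sqrt(n))    # standardized MLE
print(f"mean, sd of z (should be ~0, ~1): {z.mean():.3f}, {z.std():.3f}")
print(f"97.5% quantile of z (should be ~1.96): {np.quantile(z, 0.975):.3f}")
```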
Expectation Maximization
Fill in the missing data/latent variables with some initial values (e.g. zero). Then repeat until convergence:
- M-step: use “filled-in” data to construct estimates $\hat{\theta}_t$ for the parameters of interest in the standard way (e.g. MLE)
- E-step: update the missing data/latent variables with conditional expectations based on the non-missing data, assuming $\theta = \hat{\theta}_t$
In this way, the process alternates between filling in the missing data/latent variables and re-estimating the parameters of interest. The likelihood is non-decreasing at every iteration, and under mild conditions the algorithm converges to a local maximum of the likelihood (for many exponential-family models this is the MLE).
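A minimal sketch of EM for a two-component Gaussian mixture (assuming numpy/scipy; the data, initial values, and iteration count are placeholders). Note that this sketch initializes the parameters rather than the latent labels, so each iteration runs the E-step first; the cycle is otherwise the same as described above.

```python
# Sketch: EM for a two-component Gaussian mixture model.
# Latent variable: each point's component label. The E-step fills it in with
# its conditional expectation (the "responsibility"); the M-step re-estimates
# the parameters from the filled-in data via weighted MLEs.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder data: 300 points from N(-2, 1) and 200 points from N(3, 1.5).
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1.5, 200)])

# Initial guesses for the mixing weight, means, and standard deviations.
pi, mu, sd = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(200):
    # E-step: r_i = P(point i belongs to component 1 | x_i, theta_t).
    p1 = pi * stats.norm.pdf(x, mu[0], sd[0])
    p2 = (1 - pi) * stats.norm.pdf(x, mu[1], sd[1])
    r = p1 / (p1 + p2)

    # M-step: weighted maximum likelihood estimates given the responsibilities.
    pi = r.mean()
    mu = np.array([np.average(x, weights=r), np.average(x, weights=1 - r)])
    sd = np.sqrt(np.array([np.average((x - mu[0]) ** 2, weights=r),
                           np.average((x - mu[1]) ** 2, weights=1 - r)]))

print(f"pi ≈ {pi:.3f}, means ≈ {np.round(mu, 3)}, sds ≈ {np.round(sd, 3)}")
```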
Bayesian Inference
Posterior distribution:
\[P(\theta \mid D) \propto P(D \mid \theta) P(\theta)\]
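A minimal conjugate example (assuming scipy; the Beta prior, its hyperparameters, and the counts are placeholder choices): with a $\text{Beta}(a, b)$ prior on $\theta$ and Binomial data, the posterior is $\text{Beta}(a + \text{successes},\ b + \text{failures})$.

```python
# Sketch: conjugate Beta-Binomial update, P(theta | D) ∝ P(D | theta) P(theta).
# With a Beta(a, b) prior and Binomial data, the posterior is
# Beta(a + successes, b + failures).
from scipy import stats

a, b = 2.0, 2.0                     # prior hyperparameters (placeholder choice)
successes, failures = 27, 13        # placeholder data

posterior = stats.beta(a + successes, b + failures)
lo, hi = posterior.interval(0.95)   # central 95% credible interval
print(f"posterior mean = {posterior.mean():.3f}")
print(f"95% credible interval ≈ ({lo:.3f}, {hi:.3f})")
```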