Introduction
Welcome to our tutorial on advanced data analysis with PyMC3 and Stan! In this tutorial, we will explore the power of these two probabilistic programming languages for advanced data analysis.
First, let’s start with a brief overview of PyMC3 and Stan. PyMC3 is a Python package for Bayesian statistical modeling and Probabilistic Machine Learning focusing on advanced Markov chain Monte Carlo and variational fitting algorithms. Stan is a platform for statistical modeling and high-performance statistical computation. It provides a flexible and powerful language for specifying complex statistical models, and it uses Hamiltonian Monte Carlo (HMC), a Markov chain Monte Carlo (MCMC) method, to sample from the posterior distribution of the model.
Now, why should you care about advanced data analysis techniques? In today’s world, data is everywhere, and the ability to analyze and make sense of large and complex datasets is becoming increasingly important. Advanced data analysis techniques, such as those provided by PyMC3 and Stan, allow you to go beyond basic statistical methods and perform more sophisticated analyses. These techniques can help you uncover hidden patterns, make more accurate predictions, and make better decisions based on data.
Finally, why use PyMC3 and Stan for advanced data analysis? Both PyMC3 and Stan offer several benefits for advanced data analysis. First, they allow you to specify complex statistical models using a high-level language, making it easier to implement and understand these models. Second, they use advanced MCMC and variational inference algorithms to efficiently sample from the posterior distribution of the model, even for high-dimensional and complex models. Third, they provide a wide range of tools for model checking and criticism, allowing you to evaluate the fit of your models and diagnose any issues.
Advanced Data Analysis with PyMC3
Model Specification in PyMC3
A statistical model is a mathematical representation of the relationships between variables in a dataset. In PyMC3, we can specify a model using the Model()
function. This function creates a PyMC3 model object, which we can use to define the model’s structure and specify the prior distributions for the model’s parameters.
Once we have created a PyMC3 model object, we can start defining the prior distributions for the model’s parameters. A prior distribution represents our prior knowledge or assumptions about the value of a parameter before seeing the data. In PyMC3, we can specify prior distributions using the pm.Normal
or pm.HalfNormal
functions for continuous variables, or the pm.Dirichlet
function for categorical variables.
After specifying the prior distributions, we need to define the model’s variables. In PyMC3, we can define two types of variables: deterministic and stochastic. Deterministic variables are functions of other variables in the model, and their values are determined by the values of those other variables. Stochastic variables, on the other hand, are variables that have a probability distribution, and their values are sampled from that distribution.
Here’s an example of how to specify a simple linear regression model in PyMC3:
with pm.Model() as model:
# Priors
beta_0 = pm.Normal('beta_0', mu=0, sigma=10)
beta_1 = pm.Normal('beta_1', mu=0, sigma=10)
sigma = pm.HalfNormal('sigma', sigma=10)
# Deterministic variable
mu = pm.math.exp(beta_0 + beta_1 * X)
# Stochastic variable
y_obs = pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y)
Code language: Python (python)
In this example, we define a linear regression model with one predictor variable X
and one response variable y
. We specify normal prior distributions for the intercept beta_0
and slope beta_1
parameters, and a half-normal prior distribution for the residual standard deviation sigma
. We then define a deterministic variable mu
that represents the expected value of y
given X
, and a stochastic variable y_obs
that represents the observed values of y
.
Model Fitting in PyMC3
Once we have specified our model in PyMC3, the next step is to fit the model to our data. In PyMC3, we can fit the model using MCMC (Markov Chain Monte Carlo) methods. MCMC methods are a class of algorithms used to sample from the posterior distribution of a model’s parameters.
To fit the model using MCMC methods, we can use the sample()
function in PyMC3. The sample()
function takes several arguments, including the number of samples to draw from the posterior distribution, the thinning interval (i.e., the number of samples to skip between each drawn sample), and the number of chains to run.
Here’s an example of how to use the sample()
function to fit a PyMC3 model:
with model:
trace = pm.sample(1000, tune=1000, chains=4)
Code language: Python (python)
In this example, we are drawing 1000 samples from the posterior distribution, with a tuning period of 1000 iterations, and running 4 chains in parallel.
When fitting a PyMC3 model, it’s important to monitor convergence and diagnose any convergence issues. Convergence refers to the process of ensuring that the MCMC chains have reached a stationary distribution that approximates the true posterior distribution of the model’s parameters.
To monitor convergence, we can use several diagnostic tools, such as trace plots, autocorrelation plots, and the Gelman-Rubin statistic. Trace plots show the evolution of the MCMC chains over time, and can help us identify any trends or patterns in the chains. Autocorrelation plots show the correlation between samples drawn from the same chain, and can help us identify any residual autocorrelation in the chains. The Gelman-Rubin statistic compares the variance between chains to the variance within chains, and can help us identify any differences in the chains that may indicate non-convergence.
Here’s an example of how to generate trace plots and autocorrelation plots for a PyMC3 model:
pm.plot_trace(trace, var_names=['beta_0', 'beta_1', 'sigma'])
pm.plot_autocorr(trace, var_names=['beta_0', 'beta_1', 'sigma'])
Code language: Python (python)
In this example, we are generating trace plots and autocorrelation plots for the beta_0
, beta_1
, and sigma
parameters in the model.
If we identify any convergence issues, we may need to adjust the MCMC settings or the model specification. For example, we may need to increase the number of tuning iterations, increase the number of samples, or adjust the prior distributions.
In addition to MCMC methods, PyMC3 also supports other inference methods, such as Variational Inference (VI) and No-U-Turn Sampling (NUTS). NUTS is a variant of HMC that automatically adjusts the step size and the number of leapfrog steps, making it more efficient than traditional HMC. VI is a deterministic method that approximates the posterior distribution using a simpler distribution, such as a Gaussian distribution.
To use NUTS in PyMC3, we can simply set the step
argument in the sample()
function to pm.NUTS()
. For example:
with model:
trace = pm.sample(1000, tune=1000, chains=4, step=pm.NUTS())
Code language: Python (python)
Similarly, to use VI in PyMC3, we can use the fit()
function with the ADVI()
method. For example:
with model:
approx = pm.fit(n=10000, method='advi')
Code language: Python (python)
In this example, we are using ADVI to approximate the posterior distribution using a Gaussian distribution with 10000 samples.
Model Checking and Criticism in PyMC3
Once we have fit our PyMC3 model to the data, it’s important to check and criticize the fit of the model. Model checking and criticism involves evaluating the adequacy of the model and identifying any potential issues or limitations.
One way to check the fit of a PyMC3 model is to use posterior predictive checks. A posterior predictive check compares the observed data to simulated data generated from the posterior distribution of the model’s parameters. By comparing the observed data to the simulated data, we can evaluate the goodness-of-fit of the model and identify any discrepancies between the two.
To perform a posterior predictive check in PyMC3, we can use the predict()
function to generate simulated data from the posterior distribution, and then compare the simulated data to the observed data using a visualization tool such as a histogram or a Q-Q plot.
Here’s an example of how to perform a posterior predictive check in PyMC3:
with model:
# Generate simulated data from the posterior distribution
ppc = pm.sample_posterior_predictive(trace)
# Plot the observed and simulated data
pm.plot_posterior_predictive(trace, samples=1000, var_names=['y'])
Code language: Python (python)
In this example, we are generating simulated data from the posterior distribution of the y
variable, and then plotting the observed data and simulated data using the plot_posterior_predictive()
function.
Another way to check the fit of a PyMC3 model is to check for residual autocorrelation. Residual autocorrelation refers to the correlation between the residuals (i.e., the difference between the observed and predicted values) at different time points. If the residuals are autocorrelated, it may indicate that the model is misspecified or that there is residual structure in the data.
To check for residual autocorrelation in PyMC3, we can use the autocorr()
function to calculate the autocorrelation of the residuals at different lags.
Here’s an example of how to check for residual autocorrelation in PyMC3:
with model:
# Calculate the residuals
resid = y - pm.math.exp(trace['beta_0'] + trace['beta_1'] * X)
# Calculate the autocorrelation of the residuals
acf = pm.autocorr(resid, lags=10)
# Plot the autocorrelation
plt.plot(acf)
plt.xlabel('Lag')
plt.ylabel('Autocorrelation')
plt.show()
Code language: Python (python)
In this example, we are calculating the residuals by subtracting the predicted values from the observed values, and then calculating the autocorrelation of the residuals at lags 1 to 10 using the autocorr()
function.
Finally, to compare the fit of different PyMC3 models, we can use leave-one-out cross-validation (LOO-CV) or widely applicable information criterion (WAIC). LOO-CV and WAIC are Bayesian model selection criteria that estimate the out-of-sample predictive accuracy of a model. By comparing the LOO-CV or WAIC values of different models, we can identify the model that best fits the data.
To calculate the LOO-CV or WAIC values in PyMC3, we can use the az.loo()
or az.waic()
functions from the arviz
library.
Here’s an example of how to calculate the LOO-CV value in PyMC3:
import arviz as az
# Fit the model
with model:
trace = pm.sample(1000, tune=1000, chains=4)
# Calculate the LOO-CV value
loo = az.loo(trace)
# Print the LOO-CV value
print(loo.loo)
Code language: Python (python)
In this example, we are fitting the model using MCMC methods, and then calculating the LOO-CV value using the az.loo()
function from the arviz
library.
Advanced PyMC3 Techniques
In this section, we will explore some advanced techniques for using PyMC3, including Hamiltonian Monte Carlo (HMC), the No U-Turn Sampler (NUTS), and Variational Inference.
Implementing Hamiltonian Monte Carlo (HMC)
HMC is a MCMC algorithm that uses the Hamiltonian dynamics to explore the posterior distribution of a model. HMC uses a combination of gradient information and a proposal distribution to generate samples from the posterior distribution. This allows HMC to explore the posterior distribution more efficiently than traditional MCMC methods, such as the Metropolis-Hastings algorithm.
To implement HMC in PyMC3, we can use the sample()
function with the hmc
argument set to True
. For example:
with model:
trace = pm.sample(1000, tune=1000, chains=4, step=pm.HamiltonianMC())
Code language: Python (python)
In this example, we are drawing 1000 samples from the posterior distribution using HMC, with a tuning period of 1000 iterations, and running 4 chains in parallel.
Using the No U-Turn Sampler (NUTS)
NUTS is a variant of HMC that automatically adjusts the step size and the number of leapfrog steps based on the geometry of the posterior distribution. This allows NUTS to explore the posterior distribution more efficiently than traditional HMC.
To use NUTS in PyMC3, we can simply set the step
argument in the sample()
function to pm.NUTS()
. For example:
with model:
trace = pm.sample(1000, tune=1000, chains=4, step=pm.NUTS())
Code language: Python (python)
In this example, we are drawing 1000 samples from the posterior distribution using NUTS, with a tuning period of 1000 iterations, and running 4 chains in parallel.
Implementing Variational Inference
Variational Inference (VI) is a deterministic method that approximates the posterior distribution using a simpler distribution, such as a Gaussian distribution. VI is typically faster than MCMC methods, but may not capture the full complexity of the posterior distribution.
To implement VI in PyMC3, we can use the fit()
function with the method
argument set to 'advi'
. For example:
with model:
approx = pm.fit(n=10000, method='advi')
Code language: Python (python)
In this example, we are approximating the posterior distribution using ADVI with 10000 samples.
Advanced Data Analysis with Stan
Model Specification in Stan
In this section, we will explore how to specify a statistical model using the Stan language. Stan is a probabilistic programming language that allows us to define complex statistical models and perform advanced data analysis.
Defining the Model using the Stan Language
The Stan language is a high-level language for specifying statistical models. It is similar to other programming languages, such as Python or R, but has some unique features for defining statistical models.
To define a model in Stan, we need to specify the likelihood function, the prior distributions for the model’s parameters, and any transformed parameters. The likelihood function represents the probability of the data given the model’s parameters, while the prior distributions represent our prior knowledge or assumptions about the values of the parameters.
Here’s an example of how to define a simple linear regression model in Stan:
data {
int<lower=0> N; // number of data points
vector[N] x; // predictor variable
vector[N] y; // response variable
}
parameters {
real beta_0; // intercept
real beta_1; // slope
real<lower=0> sigma; // residual standard deviation
}
model {
y ~ normal(beta_0 + beta_1 * x, sigma);
}
Code language: Python (python)
In this example, we define a linear regression model with one predictor variable x
and one response variable y
. We specify the likelihood function using the y ~ normal()
syntax, where y
is assumed to follow a normal distribution with mean beta_0 + beta_1 * x
and residual standard deviation sigma
.
Specifying Priors for Model Parameters
In Stan, we can specify prior distributions for the model’s parameters using the parameters
block. A prior distribution represents our prior knowledge or assumptions about the value of a parameter before seeing the data.
In the linear regression example, we can specify prior distributions for the intercept beta_0
, slope beta_1
, and residual standard deviation sigma
using the real
and real<lower=0>
syntax. For example:
parameters {
real beta_0 ~ normal(0, 10); // intercept
real beta_1 ~ normal(0, 10); // slope
real<lower=0> sigma ~ exponential(1); // residual standard deviation
}
Code language: Python (python)
In this example, we are specifying normal prior distributions for the intercept and slope, and an exponential prior distribution for the residual standard deviation.
Defining Transformed Parameters
In Stan, we can define transformed parameters using the transformed parameters
block. Transformed parameters are functions of the model’s parameters, and can be used to simplify the model or improve convergence.
In the linear regression example, we can define a transformed parameter for the expected value of y
using the real
syntax. For example:
transformed parameters {
real mu;
mu = beta_0 + beta_1 * x;
}
model {
y ~ normal(mu, sigma);
}
Code language: Python (python)
In this example, we are defining a transformed parameter mu
that represents the expected value of y
given x
, beta_0
, and beta_1
. We can then use mu
in the likelihood function instead of beta_0 + beta_1 * x
.
Model Fitting in Stan
Once we have specified our statistical model in Stan, the next step is to fit the model to the data. In Stan, we can fit the model using MCMC methods, such as HMC or the No-U-Turn Sampler (NUTS).
Using the stan()
Function to Fit the Model
To fit a Stan model to the data, we can use the stan()
function from the rstan
package in R or the sample()
function from the cmdstanpy
package in Python.
Here’s an example of how to fit a Stan model using the stan()
function in R:
library(rstan)
# Load the data
data <- list(N = n, x = x, y = y)
# Compile the Stan model
model_code <- "
data {
int<lower=0> N;
vector[N] x;
vector[N] y;
}
parameters {
real beta_0;
real beta_1;
real<lower=0> sigma;
}
model {
y ~ normal(beta_0 + beta_1 * x, sigma);
}"
model <- stan_model(model_code = model_code)
# Fit the model
fit <- stan(model_data = data, model_model = model)
Code language: Python (python)
Specifying the Inference Method
In Stan, we can specify the inference method using the method
argument in the stan()
or sample()
function. The available inference methods in Stan include MCMC, HMC, and the No-U-Turn Sampler (NUTS).
Here’s an example of how to fit a Stan model using HMC in R:
library(rstan)
# Load the data
data <- list(N = n, x = x, y = y)
# Compile the Stan model
model_code <- "
data {
int<lower=0> N;
vector[N] x;
vector[N] y;
}
parameters {
real beta_0;
real beta_1;
real<lower=0> sigma;
}
model {
y ~ normal(beta_0 + beta_1 * x, sigma);
}
"
model <- stan_model(model_code = model_code)
# Fit the model using HMC
fit <- stan(model_data = data, model_model = model, control = list(adapt_delta = 0.95))
Code language: Python (python)
In this example, we are fitting the model using HMC with a target acceptance probability of 0.95.
Monitoring Convergence and Diagnosing Convergence Issues
When fitting a Stan model, it’s important to monitor convergence and diagnose any convergence issues. Convergence refers to the process of ensuring that the MCMC chains have reached a stationary distribution that approximates the true posterior distribution of the model’s parameters.
To monitor convergence in Stan, we can use the n_eff
and Rhat
diagnostics. The n_eff
diagnostic measures the effective sample size of the MCMC chains, while the Rhat
diagnostic measures the potential scale reduction factor. If the n_eff
diagnostic is large and the Rhat
diagnostic is close to 1, it indicates that the MCMC chains have converged.
Here’s an example of how to monitor convergence in R:
# Extract the posterior samples
samples <- extract(fit)
# Check the `n_eff` and `Rhat` diagnostics
print(summary(samples))
Code language: Python (python)
In this example, we are extracting the posterior samples from the fit
object, and then checking the n_eff
and Rhat
diagnostics using the summary()
function.
If we identify any convergence issues, we may need to adjust the MCMC settings or the model specification. For example, we may need to increase the number of iterations, adjust the step size or the number of leapfrog steps, or adjust the prior distributions.
Model Checking and Criticism in Stan
Once we have fit our Stan model to the data, it’s important to check and criticize the fit of the model. Model checking and criticism involves evaluating the adequacy of the model and identifying any potential issues or limitations.
Using Posterior Predictive Checks
A posterior predictive check compares the observed data to simulated data generated from the posterior distribution of the model’s parameters. By comparing the observed data to the simulated data, we can evaluate the goodness-of-fit of the model and identify any discrepancies between the two.
To perform a posterior predictive check in Stan, we can use the posterior_predict()
function to generate simulated data from the posterior distribution, and then compare the simulated data to the observed data using a visualization tool such as a histogram or a Q-Q plot.
Here’s an example of how to perform a posterior predictive check in R:
library(rstan)
# Load the data
data <- list(N = n, x = x, y = y)
# Compile the Stan model
model_code <- "
data {
int<lower=0> N;
vector[N] x;
vector[N] y;
}
parameters {
real beta_0;
real beta_1;
real<lower=0> sigma;
}
model {
y ~ normal(beta_0 + beta_1 * x, sigma);
}
generated quantities {
vector[N] y_rep;
y_rep = normal_rng(beta_0 + beta_1 * x, sigma);
}"
model <- stan_model(model_code = model_code)
# Fit the model
fit <- stan(model_data = data, model_model = model)
# Generate simulated data from the posterior distribution
y_rep <- posterior_predict(fit)
# Plot the observed and simulated data
par(mfrow = c(1, 2))
hist(y, main = "Observed Data")
hist(y_rep[1, ], main = "Simulated Data")
Code language: Python (python)
In this example, we are generating simulated data from the posterior distribution of the model’s parameters using the posterior_predict()
function, and then plotting the observed data and simulated data using the hist()
function.
Checking for Residual Autocorrelation
Residual autocorrelation refers to the correlation between the residuals (i.e., the difference between the observed and predicted values) at different time points. If the residuals are autocorrelated, it may indicate that the model is misspecified or that there is residual structure in the data.
To check for residual autocorrelation in Stan, we can use the residuals()
function to calculate the residuals, and then use a visualization tool such as an autocorrelation plot to check for autocorrelation.
Here’s an example of how to check for residual autocorrelation in R:
library(rstan)
# Load the data
data <- list(N = n, x = x, y = y)
# Compile the Stan model
model_code <- "
data {
int<lower=0> N;
vector[N] x;
vector[N] y;
}
parameters {
real beta_0;
real beta_1;
real<lower=0> sigma;
}
model {
y ~ normal(beta_0 + beta_1 * x, sigma);
}
"
model <- stan_model(model_code = model_code)
# Fit the model
fit <- stan(model_data = data, model_model = model)
# Calculate the residuals
resid <- residuals(fit)
# Plot the residuals
par(mfrow = c(1, 1))
acf(resid, main = "Residuals")
Code language: Python (python)
In this example, we are calculating the residuals using the residuals()
function, and then plotting the residuals using the acf()
function.
Comparing Model Fit using LOO-CV and WAIC
To compare the fit of different Stan models, we can use leave-one-out cross-validation (LOO-CV) or widely applicable information criterion (WAIC). LOO-CV and WAIC are Bayesian model selection criteria that estimate the out-of-sample predictive accuracy of a model. By comparing the LOO-CV or WAIC values of different models, we can identify the model that best fits the data.
To calculate the LOO-CV or WAIC values in Stan, we can use the loo()
or waic()
functions from the rstan
package in R or the cmdstanpy
package in Python.
Here’s an example of how to calculate the LOO-CV value in R:
library(rstan)
# Load the data
data <- list(N = n, x = x, y = y)
# Compile the Stan model
model_code <- "
data {
int<lower=0> N;
vector[N] x;
vector[N] y;
}
parameters {
real beta_0;
real beta_1;
real<lower=0> sigma;
}
model {
y ~ normal(beta_0 + beta_1 * x, sigma);
}
"
model <- stan_model(model_code = model_code)
# Fit the model
fit <- stan(model_data = data, model_model = model)
# Calculate the LOO-CV value
loo <- loo(fit)
print(loo)
Code language: Python (python)
In this example, we are fitting the model using the stan()
function, and then calculating the LOO-CV value using the loo()
function.
Advanced Stan Techniques
In this section, we will explore some advanced techniques for using Stan, including Adaptive Hamiltonian Monte Carlo (HMC), the Stan Math Library, and Variational Inference.
Implementing Adaptive Hamiltonian Monte Carlo (HMC)
Hamiltonian Monte Carlo (HMC) is a MCMC algorithm that uses the Hamiltonian dynamics to explore the posterior distribution of a model. HMC uses a combination of gradient information and a proposal distribution to generate samples from the posterior distribution. This allows HMC to explore the posterior distribution more efficiently than traditional MCMC methods, such as the Metropolis-Hastings algorithm.
In Stan, we can implement adaptive HMC using the adapt_engaged
argument in the stan()
or sample()
function. The adapt_engaged
argument controls the length of the adaptation phase, which is the initial phase of the MCMC algorithm where the step size and the number of leapfrog steps are adjusted.
Here’s an example of how to implement adaptive HMC in R:
library(rstan)
# Load the data
data <- list(N = n, x = x, y = y)
# Compile the Stan model
model_code <- "
data {
int<lower=0> N;
vector[N] x;
vector[N] y;
}
parameters {
real beta_0;
real beta_1;
real<lower=0> sigma;
}
model {
y ~ normal(beta_0 + beta_1 * x, sigma);
}
"
model <- stan_model(model_code = model_code)
# Fit the model using adaptive HMC
fit <- stan(model_data = data, model_model = model, control = list(adapt_engaged = TRUE))
Code language: Python (python)
In this example, we are fitting the model using adaptive HMC with an engagement period of 1000 iterations.
Using the Stan Math Library
The Stan Math Library is a library of mathematical functions that can be used in Stan models. The Stan Math Library includes functions for linear algebra, random number generation, and probability distributions.
To use the Stan Math Library in a Stan model, we can include the using namespace stan::math;
statement at the beginning of the model code. This will allow us to use the functions from the Stan Math Library in the model.
Here’s an example of how to use the Stan Math Library in a Stan model:
using namespace stan::math;
data {
int<lower=0> N;
vector[N] x;
vector[N] y;
}
parameters {
real beta_0;
real beta_1;
real<lower=0> sigma;
}
model {
y ~ normal(beta_0 + beta_1 * x, sigma);
// Use the log1m() function from the Stan Math Library
real log_likelihood = log(pdf_normal(y, beta_0 + beta_1 * x, sigma));
}
Code language: Python (python)
In this example, we are using the log1m()
function from the Stan Math Library to calculate the log-likelihood of the data.
Implementing Variational Inference
Variational Inference (VI) is a deterministic method that approximates the posterior distribution using a simpler distribution, such as a Gaussian distribution. VI is typically faster than MCMC methods, but may not capture the full complexity of the posterior distribution.
To implement VI in Stan, we can use the variational()
function from the rstan
package in R or the cmdstanpy
package in Python.
Here’s an example of how to implement VI in R:
library(rstan)
# Load the data
data <- list(N = n, x = x, y = y)
# Compile the Stan model
model_code <- "
data {
int<lower=0> N;
vector[N] x;
vector[N] y;
}
parameters {
real beta_0;
real beta_1;
real<lower=0> sigma;
}
model {
y ~ normal(beta_0 + beta_1 * x, sigma);
}
"
model <- stan_model(model_code = model_code)
# Define the variational distribution
distribution <- variational_distribution(
prior = normal(0, 10),
transform = function(x) x
)
# Fit the model using VI
fit <- stan_vi(model_data = data, model_model = model, variational_distribution = distribution)
Code language: Python (python)
In this example, we are fitting the model using VI with a normal prior distribution and the identity transform.
Practical Examples
Bayesian Hierarchical Modeling using PyMC3
In this section, we will explore a practical example of using PyMC3 for Bayesian hierarchical modeling. Hierarchical modeling is a statistical technique for modeling grouped data, where the parameters of the model are allowed to vary across groups. This allows us to estimate group-level and population-level effects, and to make inferences about the relationships between groups.
Modeling Grouped Data
Suppose we have a dataset of test scores for students in multiple schools. We want to model the test scores as a function of the school and student-level variables. We can use a hierarchical model to estimate the school-level and student-level effects, and to make inferences about the relationships between schools.
Here’s an example of how to model grouped data in PyMC3:
import pymc3 as pm
import numpy as np
# Load the data
data <- ...
# Define the model
with pm.Model() as model:
# Priors for the population-level effects
mu_beta_0 = pm.Normal('mu_beta_0', mu=0, sigma=10)
mu_beta_1 = pm.Normal('mu_beta_1', mu=0, sigma=10)
tau_sigma = pm.HalfNormal('tau_sigma', sigma=10)
# Priors for the group-level effects
beta_0 = pm.Normal('beta_0', mu=mu_beta_0, sigma=tau_sigma, shape=n_schools)
beta_1 = pm.Normal('beta_1', mu=mu_beta_1, sigma=tau_sigma, shape=n_schools)
sigma = pm.HalfNormal('sigma', sigma=10)
# Model the data
mu = pm.math.exp(beta_0[school_id] + beta_1[school_id] * student_variable)
y = pm.Normal('y', mu=mu, sigma=sigma, observed=test_score)
# Fit the model
trace <- pm.sample(1000)
Code language: Python (python)
In this example, we are defining a hierarchical model with population-level and group-level effects. The population-level effects are the mu_beta_0
and mu_beta_1
parameters, which represent the mean intercept and slope for the student-level variable. The group-level effects are the beta_0
and beta_1
parameters, which represent the school-level deviations from the population-level effects.
Specifying Group-Level and Population-Level Effects
To specify the group-level and population-level effects in the model, we can use the shape
argument in the pm.Normal()
function. The shape
argument specifies the number of group-level effects, and the mu
argument specifies the population-level effects.
Here’s an example of how to specify the group-level and population-level effects in PyMC3:
with pm.Model() as model:
# Population-level effects
mu_beta_0 = pm.Normal('mu_beta_0', mu=0, sigma=10)
mu_beta_1 = pm.Normal('mu_beta_1', mu=0, sigma=10)
tau_sigma = pm.HalfNormal('tau_sigma', sigma=10)
# Group-level effects
beta_0 = pm.Normal('beta_0', mu=mu_beta_0, sigma=tau_sigma, shape=n_schools)
beta_1 = pm.Normal('beta_1', mu=mu_beta_1, sigma=tau_sigma, shape=n_schools)
# Model the data
mu = pm.math.exp(beta_0[school_id] + beta_1[school_id] * student_variable)
y = pm.Normal('y', mu=mu, sigma=sigma, observed=test_score)
Code language: Python (python)
In this example, we are specifying the population-level effects as the mu_beta_0
and mu_beta_1
parameters, and the group-level effects as the beta_0
and beta_1
parameters.
Comparing Model Fit using LOO-CV and WAIC
To compare the fit of different hierarchical models, we can use leave-one-out cross-validation (LOO-CV) or widely applicable information criterion (WAIC). LOO-CV and WAIC are Bayesian model selection criteria that estimate the out-of-sample predictive accuracy of a model. By comparing the LOO-CV or WAIC values of different models, we can identify the model that best fits the data.
To calculate the LOO-CV or WAIC values in PyMC3, we can use the loo()
or waic()
functions from the pymc3.stats
module.
Here’s an example of how to calculate the LOO-CV value in PyMC3:
from pymc3.stats import loo
# Fit the model
trace <- pm.sample(1000)
# Calculate the LOO-CV value
loo_values <- loo(trace, pointwise_log_prob)
print(loo_values)
Code language: Python (python)
In this example, we are fitting the model using the pm.sample()
function, and then calculating the LOO-CV value using the loo()
function.
Non-linear Mixed Effects Modeling using Stan
In this section, we will explore a practical example of using Stan for non-linear mixed effects modeling. Non-linear mixed effects modeling is a statistical technique for modeling non-linear relationships between variables, where the parameters of the model are allowed to vary across groups. This allows us to estimate group-level and population-level effects, and to make inferences about the relationships between groups.
Modeling Non-Linear Relationships
Suppose we have a dataset of growth curves for children in multiple schools. We want to model the growth curves as a non-linear function of age, where the parameters of the model are allowed to vary across schools. We can use a non-linear mixed effects model to estimate the school-level and population-level effects, and to make inferences about the relationships between schools.
Here’s an example of how to model non-linear relationships in Stan:
functions {
real log_growth_curve(real age, real beta_0, real beta_1, real beta_2, real sigma) {
return log(pdf_normal(age, mu=beta_0 + beta_1 * exp(-beta_2 * age), sigma));
}
}
data {
int<lower=0> N;
real age[N];
real growth_curve[N];
int<lower=0> n_schools;
int<lower=1> school_id[N];
}
parameters {
real beta_0;
real beta_1;
real beta_2;
real<lower=0> sigma;
real beta_0_school[n_schools];
real beta_1_school[n_schools];
real beta_2_school[n_schools];
}
model {
// Population-level effects
beta_0 ~ normal(0, 10);
beta_1 ~ normal(0, 10);
beta_2 ~ normal(0, 10);
sigma ~ exponential(1);
// Group-level effects
beta_0_school ~ normal(beta_0, sigma);
beta_1_school ~ normal(beta_1, sigma);
beta_2_school ~ normal(beta_2, sigma);
// Model the data
for (i in 1:N) {
target += log_growth_curve(age[i], beta_0_school[school_id[i]], beta_1_school[school_id[i]], beta_2_school[school_id[i]], sigma);
}
}
Code language: Python (python)
In this example, we are defining a non-linear mixed effects model with population-level and group-level effects. The population-level effects are the beta_0
, beta_1
, and beta_2
parameters, which represent the mean growth curve for the population. The group-level effects are the beta_0_school
, beta_1_school
, and beta_2_school
parameters, which represent the school-level deviations from the population-level effects.
Specifying Random Effects
To specify the random effects in the model, we can use the shape
argument in the pm.Normal()
function. The shape
argument specifies the number of group-level effects, and the mu
argument specifies the population-level effects.
Here’s an example of how to specify the random effects in Stan:
functions {
real log_growth_curve(real age, real beta_0, real beta_1, real beta_2, real sigma) {
return log(pdf_normal(age, mu=beta_0 + beta_1 * exp(-beta_2 * age), sigma));
}
}
data {
int<lower=0> N;
real age[N];
real growth_curve[N];
int<lower=1> n_schools;
int<lower=1> school_id[N];
}
parameters {
real beta_0;
real beta_1;
real beta_2;
real<lower=0> sigma;
real beta_0_school[n_schools];
real beta_1_school[n_schools];
real beta_2_school[n_schools];
}
model {
// Population-level effects
beta_0 ~ normal(0, 10);
beta_1 ~ normal(0, 10);
beta_2 ~ normal(0, 10);
sigma ~ exponential(1);
// Group-level effects
beta_0_school ~ normal(beta_0, sigma, shape=n_schools);
beta_1_school ~ normal(beta_1, sigma, shape=n_schools);
beta_2_school ~ normal(beta_2, sigma, shape=n_schools);
// Model the data
for (i in 1:N) {
target += log_growth_curve(age[i], beta_0_school[school_id[i]], beta_1_school[school_id[i]], beta_2_school[school_id[i]], sigma);
}
}
Code language: Python (python)
In this example, we are specifying the random effects as the beta_0_school
, beta_1_school
, and beta_2_school
parameters, which represent the school-level deviations from the population-level effects.
Comparing Model Fit using LOO-CV and WAIC
To compare the fit of different non-linear mixed effects models, we can use leave-one-out cross-validation (LOO-CV) or widely applicable information criterion (WAIC). LOO-CV and WAIC are Bayesian model selection criteria that estimate the out-of-sample predictive accuracy of a model. By comparing the LOO-CV or WAIC values of different models, we can identify the model that best fits the data.
To calculate the LOO-CV or WAIC values in Stan, we can use the loo()
or waic()
functions from the rstan
package in R or the cmdstanpy
package in Python.
Here’s an example of how to calculate the LOO-CV value in R:
library(rstan)
# Fit the model
fit <- stan(model_code = model_code, data = data)
# Calculate the LOO-CV value
loo_values <- loo(fit)
print(loo_values)
Code language: Python (python)
In this example, we are fitting the model using the stan()
function, and then calculating the LOO-CV value using the loo()
function.
By using these techniques, we can perform non-linear mixed effects modeling using Stan. This will allow us to make inferences about group-level and population-level effects, and to compare the fit of different non-linear mixed effects models.
Time Series Analysis using PyMC3
In this section, we will explore a practical example of using PyMC3 for time series analysis. Time series analysis is a statistical technique for modeling time-dependent data, where the parameters of the model are allowed to vary over time. This allows us to estimate time-varying effects, and to make predictions about future values of the time series.
Modeling Time Series Data
Suppose we have a dataset of monthly sales for a retail store. We want to model the sales as a function of time, where the parameters of the model are allowed to vary over time. We can use a time series model to estimate time-varying effects, and to make predictions about future values of the sales.
Here’s an example of how to model time series data in PyMC3:
import pymc3 as pm
import numpy as np
# Load the data
data <- ...
# Define the model
with pm.Model() as model:
# Priors for the population-level effects
mu_beta_0 = pm.Normal('mu_beta_0', mu=0, sigma=10)
mu_beta_1 = pm.Normal('mu_beta_1', mu=0, sigma=10)
tau_sigma = pm.HalfNormal('tau_sigma', sigma=10)
# Priors for the time-varying effects
beta_0 = pm.GaussianRandomWalk('beta_0', tau=tau_sigma, shape=n_months)
beta_1 = pm.GaussianRandomWalk('beta_1', tau=tau_sigma, shape=n_months)
sigma = pm.HalfNormal('sigma', sigma=10)
# Model the data
mu = pm.math.exp(beta_0[month_id] + beta_1[month_id] * time_variable)
y = pm.Normal('y', mu=mu, sigma=sigma, observed=sales)
# Fit the model
trace <- pm.sample(1000)
Code language: Python (python)
In this example, we are defining a time series model with population-level and time-varying effects. The population-level effects are the mu_beta_0
and mu_beta_1
parameters, which represent the mean intercept and slope for the time series. The time-varying effects are the beta_0
and beta_1
parameters, which represent the time-dependent deviations from the population-level effects.
Implementing State-Space Models
To implement state-space models in PyMC3, we can use the pm.StateSpace
class. The pm.StateSpace
class allows us to define a state-space model, which is a type of time series model where the parameters of the model are allowed to vary over time.
Here’s an example of how to implement a state-space model in PyMC3:
with pm.Model() as model:
# Population-level effects
mu_beta_0 = pm.Normal('mu_beta_0', mu=0, sigma=10)
mu_beta_1 = pm.Normal('mu_beta_1', mu=0, sigma=10)
tau_sigma = pm.HalfNormal('tau_sigma', sigma=10)
# State-space model
state_space = pm.StateSpace('state_space', target=y,
mu=pm.math.exp(mu_beta_0 + mu_beta_1 * time_variable),
var=sigma**2,
sigma=sigma,
shape=n_months)
# Model the data
state_space.observe(time_variable, sales)
# Fit the model
trace <- pm.sample(1000)
Code language: PHP (php)
In this example, we are defining a state-space model with population-level effects and a state-space model for the time series. The state-space model allows us to estimate time-varying effects, and to make predictions about future values of the time series.
Comparing Model Fit using LOO-CV and WAIC
To compare the fit of different time series models, we can use leave-one-out cross-validation (LOO-CV) or widely applicable information criterion (WAIC). LOO-CV and WAIC are Bayesian model selection criteria that estimate the out-of-sample predictive accuracy of a model. By comparing the LOO-CV or WAIC values of different models, we can identify the model that best fits the data.
To calculate the LOO-CV or WAIC values in PyMC3, we can use the loo()
or waic()
functions from the pymc3.stats
module.
Here’s an example of how to calculate the LOO-CV value in PyMC3:
from pymc3.stats import loo
# Fit the model
trace <- pm.sample(1000)
# Calculate the LOO-CV value
loo_values <- loo(trace, pointwise_log_prob)
print(loo_values)
Code language: PHP (php)
In this example, we are fitting the model using the pm.sample()
function, and then calculating the LOO-CV value using the loo()
function.
This tutorial provided a comprehensive overview of advanced data analysis with PyMC3 and Stan, including Bayesian inference, model specification and fitting, model checking and criticism, advanced techniques, and practical examples. Key takeaways include the importance of model checking and criticism, the benefits of advanced techniques such as Hamiltonian Monte Carlo and Variational Inference, and the wide range of data analysis tasks that can be accomplished using PyMC3 and Stan. Future directions for advanced data analysis with PyMC3 and Stan include implementing advanced statistical models, developing custom probability distributions and inference methods, and applying these tools to real-world problems.