Introduction

Imagine you have a dataset of $n$ observations, where each observation consists of a response $y_i$ and a collection of $d$ predictors $x_i = (x_{i1},\ldots,x_{id})^\top$, and your goal is to understand the relationship between the two.

A classical regression approach would have you select a type of model based on whatever prior knowledge you have and/or tools are available to you, and then perform a number of tests assessing the fit and validity of the resulting model. Alternatively, you could select two classes of potential models, say a linear model and a nonparametric model, and perform some form of hypothesis test to determine which model to use. However, instead of requiring prior knowledge and pigeon-holing yourself into predetermined model types, it would be desirable to uniformly select the correct regression model from a broader and more general set of model classes. This is where Zhang and Wu's "Time-Varying Nonlinear Regression Models: Nonparametric Estimation and Model Selection" comes into play.

Setting

Zhang and Wu's paper considers 4 classes of models:

Model I: $$y_i = m_i(x_i) + e_i,\quad i=1,\ldots,n$$

where $e_i$ are the errors and $m_i(\cdot) = m(\cdot,i/n)$ is the time-varying nonlinear regression function $m:\mathbb{R}^d\times[0,1]\to\mathbb{R}$. This is the most general of the models and is sensitive to almost any underlying relationship between the response and the predictors, but is also the most computationally expensive to estimate and is the most complex.

Model II: $$y_i = \mu(x_i) + e_i,\quad i = 1,\ldots,n$$

This is the classical nonparametric model, and it is nested within model I if we restrict $m(\cdot\,,\cdot)$ to the case where $m(\cdot,i/n)=\mu(\cdot)$ for all $i=1,\ldots,n$, so that the regression function is constant over time.

Model III: $$y_i = x_i^\top\beta_i + e_i,\quad i=1,\ldots,n$$

where $^\top$ is the transpose operator and $\beta_i=\beta(i/n)$ for some smooth function $\beta:[0,1]\to\mathbb{R}^d$. This model is also nested within model I, this time restricting $m(\cdot\,,\cdot)$ to be linear in space while still allowing it to vary in time, as opposed to the time-constant model II.

Model IV: $$y_i = x_i^\top\theta + e_i,\quad i=1,\ldots,n$$

This is the classical linear model, and it is nested within all three of the other models, obtained by restricting the regression function to be both linear in space and constant in time.

Estimation

Understanding the framework that we are working in is obviously important, but it only means so much unless we actually have the means to estimate the four classes of regression models. For models II-IV, we can use classical methods that have already been developed in the literature, but one of Zhang and Wu's major contributions is proposing, and then proving the desired asymptotic properties of, a consistent nonparametric estimator for model I.

Model I

In order to understand how Zhang and Wu's estimator for model I works, we first need a brief discussion of nonparametric regression as a whole. When estimating a nonlinear regression relationship, there is no finite set of parameters that characterizes the underlying relationship, so we instead use the "local" behavior of the data near any given point to inform our estimate of what the function is doing there. To achieve this, we use kernel functions, which, in 1 dimension, are functions $K:\mathbb{R}\to\mathbb{R}$ that

  • integrate to 1, i.e. $\int_{\mathbb{R}}K(u)\,du=1$
  • are symmetric, i.e. $K(-u)=K(u)$ for all $u\in\mathbb{R}$

with the go-to function being the Epanechnikov kernel, given by

$$K(u) = \begin{cases} \frac{3}{4}(1-u^2) & \text{for } |u|\leq1 \\ 0 & \text{otherwise} \end{cases}$$

(In higher dimensions, a common approach is to simply apply a 1-dimensional kernel function to the norm of the vector.) These functions allow us to capture the local behavior of the data by assigning weights to the observations $x_i$ around a point $u$ using the kernel $K$ and a bandwidth $h_n$, which depends on $n$ but not on the data itself, in the form $\frac{1}{h_n}K([u-x_i]/h_n)$.
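The code later in this post calls an epk helper from my package, which isn't shown here; a minimal sketch of it, assuming the Epanechnikov form above with the $1/h_n$ scaling, would be:

epk = function(u, h){
  # Epanechnikov kernel evaluated at u/h and scaled by 1/h
  v = u/h
  ifelse(abs(v) <= 1, 0.75*(1 - v^2)/h, 0)
}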

Now that we have a bit of understanding about kernel functions, we can discuss the estimator itself. First, we let $K_S(\cdot)$ be a $d$-dimensional spatial kernel and let $K_{S,h_n}(\cdot) = K_S(\cdot/h_n)/h_n^d$, and we let $K_T(\cdot)$ be a 1-dimensional temporal kernel and let $K_{T,b_n}(\cdot) = K_T(\cdot/b_n)/b_n$, where $b_n$ is a temporal bandwidth. Then we can define

$$\begin{array}{rcl} \hat{f}_n(u,t) & = & \displaystyle \sum_{i=1}^n K_{S,h_n}(u-x_i)w_{b_n,i}(t) \\ \hat{T}_n(u,t) & = & \displaystyle \sum_{i=1}^n y_iK_{S,h_n}(u-x_i)w_{b_n,i}(t) \end{array}$$

where $$\begin{array}{rcl} w_{b_n,i}(t) & = & \displaystyle \left[\frac{S_2(t)-(t-i/n)S_1(t)}{S_2(t)S_0(t)-S_1^2(t)}\right]K_{T,b_n}(t-i/n) \\ S_r(t) & = & \displaystyle \sum_{i=1}^n(t-i/n)^rK_{T,b_n}(t-i/n) \end{array}$$

At this point, it's worth pointing out (as this will help simplify the implementation) that each of these can be viewed as a local linear estimate in time if we momentarily treat $K_{S,h_n}(u-x_i)$ and $y_iK_{S,h_n}(u-x_i)$ as the "response." In other words, these functions use the temporal kernel to fit a weighted line through the points closest to the given time $t$, with more weight given to the time points closest to $t$.
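To make this concrete, the temporal weights $w_{b_n,i}(t)$ can be computed directly from the $S_r(t)$ sums above; here is a quick sketch using the epk sketch from earlier (the name ll_time_weights is just for illustration):

ll_time_weights = function(t, n, tband){
  idx = seq(1,n)/n
  # K_{T,b_n}(t - i/n)
  k = epk(t - idx, tband)
  S0 = sum(k)
  S1 = sum((t - idx)*k)
  S2 = sum((t - idx)^2*k)
  # Local linear weights in time, matching the formula above
  ((S2 - (t - idx)*S1) / (S2*S0 - S1^2)) * k
}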

From here, we can then use the definitions from above to define the full time-varying kernel regression estimator for $m(\cdot\,,\cdot)$ as

$$\hat{m}_n(u,t) = \frac{\hat{T}_n(u,t)}{\hat{f}_n(u,t)}$$

This estimator is similar in form to the traditional Nadaraya-Watson estimator used in kernel regression, which is itself a local constant estimator: like the local linear estimate, it takes a weighted average of the points "near" the given $u$ based on the spatial kernel.

Thus, we see that the estimator $\hat{m}_n(\cdot\,,\cdot)$ combines a local linear estimate in time with a Nadaraya-Watson-style local constant estimate in space, which allows us to compute the estimate for model I using the npregress function from the ibr package for the local linear fits, as below:

model_1 = function(x, y, sband, tband, skern = 'e', tkern = 'e'){
  # Kernel choice setup (only the Epanechnikov spatial kernel is handled here)
  if(skern=='e'){
    skern_fun = epk
  }
  
  # Create estimated and index vectors
  n = length(y)
  y_hat = rep(0,n)
  idx = seq(1,n)/n
  x = as.matrix(x)
  
  # Fit values
  for(i in 1:n){
    # Generate modified response vectors for time-LLRs
    z = skern_fun(apply(x,1,function(w) norm(as.matrix(w-x[i,]),'F')), sband)
    z_prime = y*z

    # Generate time-LLRs
    f_hat_xi = npregress(idx, z, 
                         bandwidth = tband, 
                         kernel = tkern, 
                         control.par = list(degree = 1))
    t_hat_xi = npregress(idx, z_prime, 
                         bandwidth = tband, 
                         kernel = tkern, 
                         control.par = list(degree = 1))
    
    # Peel off fitted value at x_i
    y_hat[i] = t_hat_xi$fitted[i]/f_hat_xi$fitted[i]
  }
  return(y_hat)
}

Model II

To estimate model II, we thankfully have well-established methods in the literature, so we don't need to develop anything new. As mentioned before, the classical approach is to use the Nadaraya-Watson estimator to fit a local constant estimate using

$$\begin{array}{rcl} \hat{\mu}_n(u) & = & \displaystyle \frac{\tilde{T}_n(u)}{\tilde{f}_n(u)} \\ \tilde{T}_n(u) & = & \displaystyle \sum_{i=1}^n y_iK_{S,h_n}(u-x_i) \\ \tilde{f}_n(u) & = & \displaystyle \sum_{i=1}^n K_{S,h_n}(u-x_i) \end{array}$$

This is the approach used in Zhang and Wu's paper, as it allows for simpler consistency proofs under their data generation framework. In general, however, we would opt for a local linear estimate, which has less bias than the local constant estimate and is design-adaptive (i.e. the accuracy of the fit does not depend on the form of the density $f(x)$); it is given by

$$\hat{\mu}_n(u) = \sum_{i=1}^n y_i\ell_i(u)$$ where $$\begin{array}{rcl} \ell_i(u) & = & \displaystyle \left[\frac{S_2(u)-(u-x_i)^\top S_1(u)} {S_2(u)S_0(u)-(S_1(u))^\top S_1(u)}\right]K_{S,h_n}(u-x_i) \\ S_0(u) & = & \displaystyle \sum_{i=1}^n K_{S,h_n}(u-x_i) \\ S_1(u) & = & \displaystyle \sum_{i=1}^n(u-x_i) K_{S,h_n}(u-x_i) \\ S_2(u) & = & \displaystyle \sum_{i=1}^n(u-x_i)^\top(u-x_i) K_{S,h_n}(u-x_i) \end{array}$$

Because of these desirable properties, in practice we'll use the local linear estimator, which, as seen above, is already implemented by the npregress function and thus only requires a simple wrapper to standardize the inputs and outputs of our modeling functions:

model_2 = function(x, y, band, kern = 'e'){
  np_model = npregress(x, y, 
                       bandwidth = band, 
                       kernel = kern, 
                       control.par = list(degree = 1))
  y_hat = np_model$fitted
  return(y_hat)
}
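For completeness, the Nadaraya-Watson (local constant) estimator analyzed in the paper could also be coded directly; here is a sketch, again relying on the epk helper sketched earlier (model_2_nw is just an illustrative name):

model_2_nw = function(x, y, band){
  x = as.matrix(x)
  n = length(y)
  y_hat = rep(0,n)
  for(i in 1:n){
    # Spatial kernel weights based on the distance from each x_j to x_i
    w = epk(apply(x, 1, function(v) norm(as.matrix(v - x[i,]), 'F')), band)
    # Local constant (weighted average) fit at x_i
    y_hat[i] = sum(w*y)/sum(w)
  }
  return(y_hat)
}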

Model III

As with model II, the estimation for model III is already established in the literature, and thus we can use the kernel estimator of Priestley and Chao (1972) to estimate $\beta(t)$ via $$\hat{\beta}_n(t) = \left(\frac{1}{n}\sum_{i=1}^nx_ix_i^\top K_{T,b_n}(i/n-t)\right)^{-1} \left(\frac{1}{n}\sum_{i=1}^nx_iy_iK_{T,b_n}(i/n-t)\right)$$

Here we can recognize that, at a given time $t$, this has the form of a weighted least-squares estimate with weights $w_i = K_{T,b_n}(i/n-t)$, so we can use the lm function to implement the estimate as follows:

model_3 = function(x, y, band, kern = 'e'){
  # Kernel choice setup
  if(kern == 'e'){
    kern_fun = epk
  }

  n = length(y)
  y_hat = rep(0,n)
  idx = seq(1,n)/n
  
  for(i in 1:n){
    # Temporal kernel weights centered at time i/n
    w = kern_fun(idx - i/n, band)
    # Weighted least squares at time i/n (note that lm adds an intercept)
    time_mod = lm(y~x, weights = w)
    y_hat[i] = time_mod$fitted.values[i]
  }
  return(y_hat)
}
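For reference, the Priestley-Chao formula can also be evaluated directly at a single time point, without the intercept that lm includes; a minimal sketch (beta_hat_pc is just an illustrative name, and epk is the helper sketched earlier):

beta_hat_pc = function(x, y, t, tband){
  x = as.matrix(x)
  n = length(y)
  idx = seq(1,n)/n
  # K_{T,b_n}(i/n - t)
  k = epk(idx - t, tband)
  # (1/n) sum_i x_i x_i^T K  and  (1/n) sum_i x_i y_i K
  A = crossprod(x*k, x)/n
  b = colSums(x*k*y)/n
  solve(A, b)
}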

Model IV

And finally, we get to the classical linear model, which simply calls for the ordinary least-squares estimate using the lm function:

model_4 = function(x, y){
  lin_model = lm(y~x)
  y_hat = lin_model$fitted.values
  return(y_hat)
}

Model Selection

Now that we've established the techniques for estimating each of the models, we need to understand Zhang and Wu's technique for selecting the desired model. In order to uniformly select from all four candidate models at once, we will use an information criterion that quantifies both the accuracy and the complexity of each candidate model, choosing the model with the smallest value of its information criterion.

First, in order to calculate the information criterion, we need to define the restricted residual sum of squares. This is similar to the classical RSS, which is the sum of the squared residuals, or errors, between the fitted and actual values. In this case, however, we restrict our measurement in time to a compact set $\mathscr{T}\subset(0,1)$, which defines the index set $\mathscr{I}_n=\{i=1,\ldots,n \mid i/n\in\mathscr{T}\}$, and in space to a compact set $\mathscr{X}\subseteq\mathbb{R}^d$ on which the probability density $f(u,t)$ is bounded away from 0 for $u\in\mathscr{X}$. This gives Zhang and Wu's definition of the restricted residual sum of squares for model I as

$$\text{RSS}_n(\mathscr{X},\mathscr{T},\text{I}) = \sum_{i\in\mathscr{I}_n}[y_i - \hat{m}_n(x_i,i/n)]^2\mathbb{1}_{\{x_i\in\mathscr{X}\}}$$

and we can define it for models II-IV using their respective estimates. This is fairly easy to implement, only requiring a simple check for invalid fitted values that can sometimes arise from the npregress function:

rss_n = function(x, index, y, y_hat, X, I){
  # Drop any invalid fitted values (NaNs can arise from npregress)
  valid = !is.nan(y_hat)
  rss_vec = ((y-y_hat)^2)[valid]
  # Indicators for x_i in the spatial set X and i/n in the time set I
  svec = x_in_X(x,X)[valid]
  tvec = i_in_I(index,I)[valid]
  rss = sum(rss_vec*svec*tvec)
  return(rss)
}

x_in_X = function(x,X){
  # X holds the lower (first row) and upper (second row) bounds for each predictor
  x = as.matrix(x)
  X = as.matrix(X)
  svec = apply(x,1,function(w) all((w-X[1,])>=0 & (X[2,]-w)>=0))
  return(svec)
}
               
i_in_I = function(index,I){
  tvec = (index>=I[1])&(index<=I[2])
  return(tvec)
}

From here, we then need a measure of complexity, which is typically handled by the degrees of freedom, a measure of the number of parameters estimated. For example, the degrees of freedom for the classical linear model, model IV, is simply $d$, as the vector $\theta$ has $d$ entries. This works for model IV, but is not as clear for models I-III. To solve this issue, Zhang and Wu use the framework of Hurvich, Simonoff, and Tsai (1998) to define

  • $\text{DF}(\text{I}) = (b_nh_n^d)^{-1}\prod_{k=1}^d(2\text{IQR}_k)$
  • $\text{DF}(\text{II}) = h_n^{-d}\prod_{k=1}^d(2\text{IQR}_k)$
  • $\text{DF}(\text{III}) = b_n^{-1}d$
  • $\text{DF}(\text{IV}) = d$

where $\text{IQR}_k$ is the componentwise interquartile range across $i=1,\ldots,n$ of $x_{ik}$.

With the restricted residual sum of squares (accuracy) and the degrees of freedom (complexity) defined, Zhang and Wu define, for a candidate model $\varrho\in\{\text{I, II, III, IV}\}$, the Generalized Information Criterion as

$$\text{GIC}_{\mathscr{X},\mathscr{T}}(\varrho) = \log\{\text{RSS}_n(\mathscr{X},\mathscr{T},\varrho)/n\} + \tau_n\text{DF}(\varrho)$$

where $\tau_n$ is a tuning parameter determining the amount of penalization on the model complexity. Using the function for the restricted residual sum of squares and the definitions of the degrees of freedom, this is also straightforward to implement (note: $d$ is increased by 1 for models III and IV due to the inclusion of the intercept in the estimation, assuming the intercept is not given as a column of the data):

GIC = function(x, index, y, y_hat, X, I, model_num, tau, sband, tband, verb = FALSE){
  rss = rss_n(x, index, y, y_hat, X, I)
  df = 0
  x = as.matrix(x)
  n = dim(x)[1]
  d = dim(x)[2]
  
  if(model_num == 1){
    # DF(I) = (b_n h_n^d)^{-1} prod(2 IQR_k)
    df = prod(2*apply(x,2,IQR)) / (tband*sband^d)
  } else if(model_num == 2){
    # DF(II) = h_n^{-d} prod(2 IQR_k)
    df = prod(2*apply(x,2,IQR)) / (sband^d)
  } else if(model_num == 3){
    # DF(III) = b_n^{-1} d (d+1 here because lm includes an intercept)
    df = (d+1) / tband
  } else if(model_num == 4){
    # DF(IV) = d (again +1 for the intercept)
    df = d + 1
  }
  gic = log(rss/n)+tau*df
  if(verb){
    return(list(gic = gic, rss = rss, df = df))
  } else {
    return(gic)
  }
}

An Example

Okay, that was certainly a lot of math, which of course was important for understanding how this model selection methodology works, but I'm sure you're wondering, "How does this actually work in practice?" Rest assured, I plan to show you. First, we have to establish a couple of default values for some of the given parameters, like the bandwidths $h_n$ and $b_n$ and the tuning parameter $\tau_n$. I won't go into the theory behind the convergence rates of the estimators here, but in order to get the desirable estimation and selection properties, we need

  • $h_n(\text{I}) = c_sn^{-1/(d+5)}$
  • $b_n(\text{I}) = c_tn^{-1/(d+5)}$
  • $h_n(\text{II}) = c_sn^{-1/(d+4)}$
  • $b_n(\text{III}) = c_tn^{-1/5}$
  • $\tau_n = cn^{-(d+3)/(d+4)}\log n$

As rule-of-thumb values, Zhang and Wu use $c_s = \prod_{k=1}^d\text{IQR}_k$ and $c_t = 0.5$, and they note that for $\tau_n$ we simply need $c>0$ for the asymptotic results, so we can use $c=1$. This then allows us to create a function that will calculate and return the fitted values for all four candidate models:

fit_models_default = function(x, y){
  # Setup and default values
  x = as.matrix(x)
  n = dim(x)[1]
  d = dim(x)[2]
  cs = prod(apply(x,2,IQR))
  ct = 0.5
  idx = seq(1,n)/n
  
  # Fit Model 1
  sband_1 = cs*n^(-1/(d+5))
  tband_1 = ct*n^(-1/(d+5))
  y_hat1 = model_1(x, y, sband_1, tband_1)
  
  # Fit Model 2
  sband_2 = cs*n^(-1/(d+4))
  y_hat2 = model_2(x, y, sband_2)
  
  # Fit Model 3
  tband_3 = ct*n^(-0.2)
  y_hat3 = model_3(x, y, tband_3)
  
  # Fit Model 4
  y_hat4 = model_4(x, y)
    
  # Return fitted values
  return(cbind(y_hat1, y_hat2, y_hat3, y_hat4))
}

And now we can look at an example. Let's generate data that is nonlinear but time-constant, so we can see the efficacy of models I and II over the others while also making the underlying relationship easier to see in the $x$-$y$ plane. For this, we'll generate the $x_i$ values from a 1-dimensional Gaussian process as defined in Zhang and Wu's work, and we'll let the regression and error processes be defined as

$$m(u,t) = u + e^u\sin(\pi u)\quad \sigma(u,t) = te^{u/3}$$

where $e_i = \sigma(x_i,i/n)\eta_i$ and the $\eta_i$ are independent and identically distributed $\text{N}(0,1)$ random variables. Letting $n = 500$, we see the data graphed below:

n = 500
x = gen_x(n)
t = seq(1,n)/n

m = x + exp(x)*sinpi(x)
e = rnorm(n, 0, t*exp(x/3))
y = m + e

true_x = seq(min(x),max(x),length.out = 10000)
true_y = true_x + exp(true_x)*sinpi(true_x)

plot(x, y)
lines(true_x,true_y, col = colors[1], lwd = 2)
legend("topleft", legend = c("True function"), col = c('red'), pch = 15)
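(A quick note: the gen_x helper and the colors palette used above come from my accompanying package and aren't shown in this post. As a stand-in for experimentation, which is not Zhang and Wu's exact design, one could generate the predictors from a simple Gaussian AR(1) process and define a basic palette:)

# Hypothetical stand-in for gen_x: a stationary Gaussian AR(1) process
gen_x_ar1 = function(n, phi = 0.5){
  x = numeric(n)
  x[1] = rnorm(1, 0, 1/sqrt(1 - phi^2))
  for(i in 2:n){
    x[i] = phi*x[i-1] + rnorm(1)
  }
  return(x)
}

# Simple palette in place of the colors vector used in the plots
colors = c('red', 'blue', 'green', 'orange', 'purple')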

Then we can fit the four candidate models to the data and see their resulting fits compared to the true regression function:

fitted = fit_models_default(x,y)
fitted = fitted[order(x),]

plot(x,y)
lines(true_x,true_y, col = colors[1], lwd = 2)
lines(sort(x),fitted[,1], col = colors[2], lwd = 2)
lines(sort(x),fitted[,2], col = colors[3], lwd = 2)
lines(sort(x),fitted[,3], col = colors[4], lwd = 2)
lines(sort(x),fitted[,4], col = colors[5], lwd = 2)
legend("topleft", 
       legend = c("True function", "Model 1", "Model 2", "Model 3", "Model 4"),
       col = colors,
       pch = 15)

We can see here that, due to their temporal dynamics, which are not present in the underlying model, models I and III introduce a lot of noise into their estimates that models II and IV do not. We also see that, as expected, models I and II generally adhere to the underlying regression function, and thus produce fairly accurate estimates, unlike the linear estimates of models III and IV, which are inappropriate for the highly nonlinear regression function used.

Now that we've seen the estimates, we can also test whether the GIC selects the true underlying model, which in this case is model II. To do this, we first implement the model selection procedure, similar to the model-fitting function from before:

select_model_default = function(x, y, X, I, verb = FALSE){
  # Setup and default values
  x = as.matrix(x)
  n = dim(x)[1]
  d = dim(x)[2]
  cs = prod(apply(x,2,IQR))
  ct = 0.5
  idx = seq(1,n)/n
  tau = log(n)*n^(-(d+3)/(d+4))
  gic = rep(Inf,4)
  if(verb){
    rss = rep(0,4)
    df = rep(0,4)
  }
  
  # Get GIC val for Model 1
  sband_1 = cs*n^(-1/(d+5))
  tband_1 = ct*n^(-1/(d+5))
  y_hat = model_1(x, y, sband_1, tband_1)
  if(verb){
    vals = GIC(x, idx, y, y_hat, X, I, 1, tau, sband_1, tband_1, TRUE)
    gic[1] = vals$gic
    rss[1] = vals$rss
    df[1] = vals$df
  } else {
    gic[1] = GIC(x, idx, y, y_hat, X, I, 1, tau, sband_1, tband_1)
  }
  
  # Get GIC val for Model 2
  sband_2 = cs*n^(-1/(d+4))
  y_hat = model_2(x, y, sband_2)
  if(verb){
    vals = GIC(x, idx, y, y_hat, X, I, 2, tau, sband_2, 0, TRUE)
    gic[2] = vals$gic
    rss[2] = vals$rss
    df[2] = vals$df
  } else {
    gic[2] = GIC(x, idx, y, y_hat, X, I, 2, tau, sband_2, 0)
  }
  
  # Get GIC val for Model 3
  tband_3 = ct*n^(-0.2)
  y_hat = model_3(x, y, tband_3)
  if(verb){
    vals = GIC(x, idx, y, y_hat, X, I, 3, tau, 0, tband_3, TRUE)
    gic[3] = vals$gic
    rss[3] = vals$rss
    df[3] = vals$df
  } else {
    gic[3] = GIC(x, idx, y, y_hat, X, I, 3, tau, 0, tband_3)
  }
  
  # Get GIC val for Model 4
  y_hat = model_4(x, y)
  if(verb){
    vals = GIC(x, idx, y, y_hat, X, I, 4, tau, 0, 0, TRUE)
    gic[4] = vals$gic
    rss[4] = vals$rss
    df[4] = vals$df
  } else {
    gic[4] = GIC(x, idx, y, y_hat, X, I, 4, tau, 0, 0)
  }
  
  selected = which.min(gic)
  if(verb){
    return(list(selected = selected, gic = gic, rss = rss, df = df))
  } else {
    return(selected)
  }
}

And then we can use $\mathscr{T} = [0.2,0.8]$, which was Zhang and Wu's go-to option and bounds the data used away from any potentially undesirable behavior at the beginning or end of its generation/collection, and $\mathscr{X} = [-1,2.5]$, as the data is generated densely enough in that region to be confident that $f(u,t)$ is bounded away from 0 there, to perform the model selection using the GIC:

sel_mod = select_model_default(x,y,c(-1,2.5),c(0.2,0.8),TRUE)
output = rbind(sel_mod$gic, log(sel_mod$rss/n), sel_mod$df)
rownames(output) = c("GIC", "log(RSS/n)", "DF")
colnames(output) = c("Model I", "Model II", "Model III", "Model IV")
output
              Model I   Model II  Model III  Model IV
GIC         0.7459894 -0.6796292   2.264940  1.752864
log(RSS/n) -0.6215959 -0.9782099   1.667778  1.666712
DF         31.7480210  6.9314484  13.862897  2.000000

And we see that, as expected, model II has the minimum GIC and thus is selected, matching the true underlying model.

Concluding Remarks

I would be remiss if I did not reinforce that the credit for the theoretical and methodological work here belongs entirely to Ting Zhang and Wei Biao Wu, as this work originates from their paper; my original contribution here is only the implementation of their methods. I also must give a second shout-out to Professor Ting Zhang for advising my Honors Senior Thesis at Boston University, where I completed this work. If you are interested in my full thesis, you can read it here. My full thesis also has all of the proper citations for the papers mentioned here, in addition to a few others. If you are interested in the full package of functions that I created to make this work possible, you can view it on my github. Thank you for giving this a read, and I look forward to making more posts like this sharing more of the work that I do!