Overview

For some stocks, a goal might be to create multiple coastwide indices, where these indices are defined by latitudinal breaks. This vignette highlights the ability to fit these models using the indexwc package, and how to deal with sparse or missing observations

Yellowtail rockfish example

As a multi-area index example, we’ll focus on yellowtail rockfish

First we’ll load in the configuration file. This is bringing in specific settings by survey and species that are used to subset the data and define the parameterization of the index estimation. This file is included as both an rda file that is loaded with the package but is also available for users who download the source code in the data-raw folder (download the package to see this).

# Load from the rda file
configuration_ytk <- configuration |>
  dplyr::filter(species == "yellowtail rockfish")

The configuration file contains settings you may wish to change, including attributes of filters (years, latitudes, depths), and model components (anisotropy, knots, spatiotemporal effects, etc).

The configuration includes multiple entries for a single species for separate surveys and error distributions (e.g., delta-lognormal, delta-gamma, etc.).

In this example, we are going to only run a single-index for the Northwest Fisheries Science Center (NWFSC) West Coast Groundfish Bottom Trawl (WCGBT) survey (i.e., also referred to as the NWFSC.Combo survey in the data NWFSC data warehouse).

configuration_ytk_wcgbt <- configuration_ytk |>
  dplyr::filter(source == "NWFSC.Combo" & family == "sdmTMB::delta_gamma()")

Statistical model

For most indices developed on the west coast of the USA, the main formula is usually something similar to the following, where fyear represents a fixed year effect, and pass_scaled is factor describing whether the haul is part of pass 1 (early season) or pass 2 (late season). There are two unique index formula’s available for yellowtail rockfish:

configuration_ytk_wcgbt[, "formula"]
#> # A tibble: 2 × 1
#>   formula                                               
#>   <chr>                                                 
#> 1 catch_weight ~ 0 + fyear*split_mendocino + pass_scaled
#> 2 catch_weight ~ 0 + fyear + pass_scaled

we are going to use the formula with an area based term. Please see the vignette on single area indices for an example run with the other formulation available for yellowtail rockfish.

There are many ways to build on this general framework to allow for multi-area indices. For yellowtail rockfish, we’re going to generate 2 indices, north and south of Cape Mendocino. Including the interaction fyear*split_mendocino allows each region to have a separate trend, and because no interaction is included with pass_scaled, a global effect is assumed.

configuration_to_run <- configuration_ytk_wcgbt |>
  dplyr::filter(
    formula == "catch_weight ~ 0 + fyear*split_mendocino + pass_scaled"
  )

As a quick aside, the indexwc package allows for other spatial splits, and in implementing them, the choice of names must be consistent with those used in the package (and specifically in the prediction grid). Variables currently included are: * split_mendocino representing a split N/S of 40.167 degrees * split_conception representing a split N/S of 34.45 degrees * split_monterey representing a split N/S of 36 degrees * split_state representing three splits for California, Oregon, Washington

Generating new variables representing new splits will allow you to fit models with sdmTMB, but the predictions (index) will fail – for these cases, a better solution is to request additional splits be added to the package.

Preparing the data

This block applies filters (based on latitude, depth, and year) to each observation, and creates the new variable split_mendocino that already exists in the prediction dataframe.

data <- configuration_to_run |>
  # Row by row ... do stuff then ungroup
  dplyr::rowwise() |>
  # Pull the data based on the function found in fxn column
  dplyr::mutate(
    data_raw = list(format_data(eval(parse(text = fxn)))),
    data_filtered = list(data_raw |>
      dplyr::filter(
        depth <= min_depth, depth >= max_depth,
        latitude >= min_latitude, latitude <= max_latitude,
        year >= min_year, year <= max_year
      ) |>
      dplyr::mutate(split_mendocino = ifelse(latitude > 40.1666667, "N", "S")))
  ) |>
  dplyr::ungroup()

Yellowtail rockfish occur more north of Cape Mendocino, and as a check we can see if there’s any years with no positive catches. This filtering shows that no positive catches exist for 2007.

dplyr::filter(data$data_filtered[[1]]) |>
  dplyr::group_by(split_mendocino, year) |>
  dplyr::summarise(sum_catch_weight = sum(catch_weight)) |>
  dplyr::filter(sum_catch_weight == 0)
#> # A tibble: 2 × 3
#> # Groups:   split_mendocino [1]
#>   split_mendocino  year sum_catch_weight
#>   <chr>           <int>            <dbl>
#> 1 S                2007                0
#> 2 S                2024                0

This is a problem because if we try to estimate the fyear:split_mendocino interaction, the coefficient will not be identifiable for 2007. This is not a unique problem for sdmTMB but a general issue for any regression / GLM / GAM / GLMM model. The workaround that we’ll implement is to “map off” or not estimate the fyear:split_mendocino in 2007 – but estimate it in all other years.

As a first step, we’ll summarize the names of all main effect coefficients in our model. We use lm() to do this for simplicity, but other approaches could also be used.

lm <- lm(
  formula = as.formula(configuration_to_run$formula),
  data = data$data_filtered[[1]]
)

coef_names <- names(coef(lm))

print(coef_names)
#>  [1] "fyear2003"                  "fyear2004"                 
#>  [3] "fyear2005"                  "fyear2006"                 
#>  [5] "fyear2007"                  "fyear2008"                 
#>  [7] "fyear2009"                  "fyear2010"                 
#>  [9] "fyear2011"                  "fyear2012"                 
#> [11] "fyear2013"                  "fyear2014"                 
#> [13] "fyear2015"                  "fyear2016"                 
#> [15] "fyear2017"                  "fyear2018"                 
#> [17] "fyear2019"                  "fyear2021"                 
#> [19] "fyear2022"                  "fyear2023"                 
#> [21] "fyear2024"                  "split_mendocinoS"          
#> [23] "pass_scaled"                "fyear2004:split_mendocinoS"
#> [25] "fyear2005:split_mendocinoS" "fyear2006:split_mendocinoS"
#> [27] "fyear2007:split_mendocinoS" "fyear2008:split_mendocinoS"
#> [29] "fyear2009:split_mendocinoS" "fyear2010:split_mendocinoS"
#> [31] "fyear2011:split_mendocinoS" "fyear2012:split_mendocinoS"
#> [33] "fyear2013:split_mendocinoS" "fyear2014:split_mendocinoS"
#> [35] "fyear2015:split_mendocinoS" "fyear2016:split_mendocinoS"
#> [37] "fyear2017:split_mendocinoS" "fyear2018:split_mendocinoS"
#> [39] "fyear2019:split_mendocinoS" "fyear2021:split_mendocinoS"
#> [41] "fyear2022:split_mendocinoS" "fyear2023:split_mendocinoS"
#> [43] "fyear2024:split_mendocinoS"

All of the coefficients from the presence - absence model look to be identifiable

pres_not_identifiable <- names(which(is.na(coef(lm))))
print(pres_not_identifiable)
#> character(0)

To identify the labels of the coefficients from the positive model that are not estimable, we can re-fit the regression, using only positive observations.

lm_pos <- lm(
  formula = as.formula(configuration_to_run$formula),
  data = dplyr::filter(data$data_filtered[[1]], catch_weight > 0)
)

pos_not_identifiable <- names(which(is.na(coef(lm_pos))))
print(pos_not_identifiable)
#> [1] "fyear2007:split_mendocinoS" "fyear2024:split_mendocinoS"

Finally, we can create some mapping variables to be passed to sdmTMB. We’ll do this separately for the presence-absence and positive models.

map_pres <- coef_names
map_pres[coef_names %in% pres_not_identifiable] <- NA
map_pres <- factor(map_pres)

map_pos <- coef_names
map_pos[coef_names %in% pos_not_identifiable] <- NA
map_pos <- factor(map_pos)

Along with these mapping vectors, we’ll create vectors of starting values (arbitrarily assigning -20 to values that are not estimable)

start_pres <- rep(0, length(coef_names))
start_pres[coef_names %in% pres_not_identifiable] <- -20

start_pos <- rep(0, length(coef_names))
start_pos[coef_names %in% pos_not_identifiable] <- -20

Fitting the model with indexwc and sdmTMB

Now we can use the indexwc package to fit the model. The indexwc package acts as a wrapper for sdmTMB here, combining the estimation process (sdmTMB()) with the index generation (get_index()).

This code is optimized to fit multiple models (across stocks / species, or model configurations). In our case, the configurations dataframe has 1 row – but this could include multiple rows. A key point to highlight is that in this code, we use sdmTMBcontrol() to pass in a list that contains the variables above we use to map off and initialize the parameters. For datasets without missing values, you may not want to include this sdmTMBcontrol() line, but it may be useful to adjust the newton_loops, etc.

Note: this code will fit a single model for each row.

best <- data |>
  dplyr::mutate(
    # Evaluate the call in family
    family = purrr::map(family, .f = ~ eval(parse(text = .x))),
    # Run the model on each row in data
    results = purrr::pmap(
      .l = list(
        data = data_filtered,
        formula = formula,
        family = family,
        anisotropy = anisotropy,
        n_knots = knots,
        share_range = share_range,
        spatiotemporal = purrr::map2(spatiotemporal1, spatiotemporal2, list),
        sdmtmb_control = list(
          sdmTMB::sdmTMBcontrol(
            map = list(b_j = map_pos, b_j2 = map_pos),
            start = list(b_j = start_pos, b_j2 = start_pos),
            newton_loops = 1 # increase for real models
          )
        )
      ),
      .f = indexwc::run_sdmtmb
    )
  )

Examining the index and diagnostics

Output and diagnostics for all models and predictions are stored in a folder that indexwc creates; it’s location is “yellowtail_rockfish/wcgbts/delta_gamma..”. The “index” folder contains diagnostics and predictions for 10 sets of indices. These 10 indices represent (1) N/S of Cape Mendocino, (2) N/S of Monterrey, (3) N/S of Pt Conception, a coastwide index, and state-specific indices. For multi-region models, like the yellowtail rockfish used here, it wouldn’t make sense to interpret indices other than the split used in the model (here, N/S of Cape Mendocino) and the coastwide index.