a2_multiple_area_indices.Rmd
For some stocks, a goal might be to create multiple coastwide
indices, where these indices are defined by latitudinal breaks. This
vignette highlights the ability to fit these models using the
indexwc
package, and how to deal with sparse or missing
observations
As a multi-area index example, we’ll focus on yellowtail rockfish
First we’ll load in the configuration file. This is bringing in
specific settings by survey and species that are used to subset the data
and define the parameterization of the index estimation. This file is
included as both an rda
file that is loaded with the
package but is also available for users who download the source code in
the data-raw
folder (download the package to see this).
# Load from the rda file
configuration_ytk <- configuration |>
dplyr::filter(species == "yellowtail rockfish")
The configuration file contains settings you may wish to change, including attributes of filters (years, latitudes, depths), and model components (anisotropy, knots, spatiotemporal effects, etc).
The configuration includes multiple entries for a single species for separate surveys and error distributions (e.g., delta-lognormal, delta-gamma, etc.).
In this example, we are going to only run a single-index for the Northwest Fisheries Science Center (NWFSC) West Coast Groundfish Bottom Trawl (WCGBT) survey (i.e., also referred to as the NWFSC.Combo survey in the data NWFSC data warehouse).
configuration_ytk_wcgbt <- configuration_ytk |>
dplyr::filter(source == "NWFSC.Combo" & family == "sdmTMB::delta_gamma()")
For most indices developed on the west coast of the USA, the main
formula is usually something similar to the following, where
fyear
represents a fixed year effect, and
pass_scaled
is factor describing whether the haul is part
of pass 1 (early season) or pass 2 (late season). There are two unique
index formula’s available for yellowtail rockfish:
configuration_ytk_wcgbt[, "formula"]
#> # A tibble: 2 × 1
#> formula
#> <chr>
#> 1 catch_weight ~ 0 + fyear*split_mendocino + pass_scaled
#> 2 catch_weight ~ 0 + fyear + pass_scaled
we are going to use the formula with an area based term. Please see the vignette on single area indices for an example run with the other formulation available for yellowtail rockfish.
There are many ways to build on this general framework to allow for
multi-area indices. For yellowtail rockfish, we’re going to generate 2
indices, north and south of Cape Mendocino. Including the interaction
fyear*split_mendocino
allows each region to have a separate
trend, and because no interaction is included with
pass_scaled
, a global effect is assumed.
configuration_to_run <- configuration_ytk_wcgbt |>
dplyr::filter(
formula == "catch_weight ~ 0 + fyear*split_mendocino + pass_scaled"
)
As a quick aside, the indexwc
package allows for other
spatial splits, and in implementing them, the choice of names must be
consistent with those used in the package (and specifically in the
prediction grid). Variables currently included are: *
split_mendocino
representing a split N/S of 40.167 degrees
* split_conception
representing a split N/S of 34.45
degrees * split_monterey
representing a split N/S of 36
degrees * split_state
representing three splits for
California, Oregon, Washington
Generating new variables representing new splits will allow you to
fit models with sdmTMB
, but the predictions (index) will
fail – for these cases, a better solution is to request additional
splits be added to the package.
This block applies filters (based on latitude, depth, and year) to
each observation, and creates the new variable
split_mendocino
that already exists in the prediction
dataframe.
data <- configuration_to_run |>
# Row by row ... do stuff then ungroup
dplyr::rowwise() |>
# Pull the data based on the function found in fxn column
dplyr::mutate(
data_raw = list(format_data(eval(parse(text = fxn)))),
data_filtered = list(data_raw |>
dplyr::filter(
depth <= min_depth, depth >= max_depth,
latitude >= min_latitude, latitude <= max_latitude,
year >= min_year, year <= max_year
) |>
dplyr::mutate(split_mendocino = ifelse(latitude > 40.1666667, "N", "S")))
) |>
dplyr::ungroup()
Yellowtail rockfish occur more north of Cape Mendocino, and as a check we can see if there’s any years with no positive catches. This filtering shows that no positive catches exist for 2007.
dplyr::filter(data$data_filtered[[1]]) |>
dplyr::group_by(split_mendocino, year) |>
dplyr::summarise(sum_catch_weight = sum(catch_weight)) |>
dplyr::filter(sum_catch_weight == 0)
#> # A tibble: 2 × 3
#> # Groups: split_mendocino [1]
#> split_mendocino year sum_catch_weight
#> <chr> <int> <dbl>
#> 1 S 2007 0
#> 2 S 2024 0
This is a problem because if we try to estimate the
fyear:split_mendocino
interaction, the coefficient will not
be identifiable for 2007. This is not a unique problem for
sdmTMB
but a general issue for any regression / GLM / GAM /
GLMM model. The workaround that we’ll implement is to “map off” or not
estimate the fyear:split_mendocino
in 2007 – but estimate
it in all other years.
As a first step, we’ll summarize the names of all main effect
coefficients in our model. We use lm()
to do this for
simplicity, but other approaches could also be used.
lm <- lm(
formula = as.formula(configuration_to_run$formula),
data = data$data_filtered[[1]]
)
coef_names <- names(coef(lm))
print(coef_names)
#> [1] "fyear2003" "fyear2004"
#> [3] "fyear2005" "fyear2006"
#> [5] "fyear2007" "fyear2008"
#> [7] "fyear2009" "fyear2010"
#> [9] "fyear2011" "fyear2012"
#> [11] "fyear2013" "fyear2014"
#> [13] "fyear2015" "fyear2016"
#> [15] "fyear2017" "fyear2018"
#> [17] "fyear2019" "fyear2021"
#> [19] "fyear2022" "fyear2023"
#> [21] "fyear2024" "split_mendocinoS"
#> [23] "pass_scaled" "fyear2004:split_mendocinoS"
#> [25] "fyear2005:split_mendocinoS" "fyear2006:split_mendocinoS"
#> [27] "fyear2007:split_mendocinoS" "fyear2008:split_mendocinoS"
#> [29] "fyear2009:split_mendocinoS" "fyear2010:split_mendocinoS"
#> [31] "fyear2011:split_mendocinoS" "fyear2012:split_mendocinoS"
#> [33] "fyear2013:split_mendocinoS" "fyear2014:split_mendocinoS"
#> [35] "fyear2015:split_mendocinoS" "fyear2016:split_mendocinoS"
#> [37] "fyear2017:split_mendocinoS" "fyear2018:split_mendocinoS"
#> [39] "fyear2019:split_mendocinoS" "fyear2021:split_mendocinoS"
#> [41] "fyear2022:split_mendocinoS" "fyear2023:split_mendocinoS"
#> [43] "fyear2024:split_mendocinoS"
All of the coefficients from the presence - absence model look to be identifiable
To identify the labels of the coefficients from the positive model that are not estimable, we can re-fit the regression, using only positive observations.
lm_pos <- lm(
formula = as.formula(configuration_to_run$formula),
data = dplyr::filter(data$data_filtered[[1]], catch_weight > 0)
)
pos_not_identifiable <- names(which(is.na(coef(lm_pos))))
print(pos_not_identifiable)
#> [1] "fyear2007:split_mendocinoS" "fyear2024:split_mendocinoS"
Finally, we can create some mapping variables to be passed to
sdmTMB
. We’ll do this separately for the presence-absence
and positive models.
map_pres <- coef_names
map_pres[coef_names %in% pres_not_identifiable] <- NA
map_pres <- factor(map_pres)
map_pos <- coef_names
map_pos[coef_names %in% pos_not_identifiable] <- NA
map_pos <- factor(map_pos)
Along with these mapping vectors, we’ll create vectors of starting values (arbitrarily assigning -20 to values that are not estimable)
Now we can use the indexwc
package to fit the model. The
indexwc
package acts as a wrapper for sdmTMB
here, combining the estimation process (sdmTMB()
) with the
index generation (get_index()
).
This code is optimized to fit multiple models (across stocks /
species, or model configurations). In our case, the configurations
dataframe has 1 row – but this could include multiple rows. A key point
to highlight is that in this code, we use sdmTMBcontrol()
to pass in a list that contains the variables above we use to map off
and initialize the parameters. For datasets without missing values, you
may not want to include this sdmTMBcontrol()
line, but it
may be useful to adjust the newton_loops
, etc.
Note: this code will fit a single model for each row.
best <- data |>
dplyr::mutate(
# Evaluate the call in family
family = purrr::map(family, .f = ~ eval(parse(text = .x))),
# Run the model on each row in data
results = purrr::pmap(
.l = list(
data = data_filtered,
formula = formula,
family = family,
anisotropy = anisotropy,
n_knots = knots,
share_range = share_range,
spatiotemporal = purrr::map2(spatiotemporal1, spatiotemporal2, list),
sdmtmb_control = list(
sdmTMB::sdmTMBcontrol(
map = list(b_j = map_pos, b_j2 = map_pos),
start = list(b_j = start_pos, b_j2 = start_pos),
newton_loops = 1 # increase for real models
)
)
),
.f = indexwc::run_sdmtmb
)
)
Output and diagnostics for all models and predictions are stored in a
folder that indexwc
creates; it’s location is
“yellowtail_rockfish/wcgbts/delta_gamma..”. The “index” folder contains
diagnostics and predictions for 10 sets of indices. These 10 indices
represent (1) N/S of Cape Mendocino, (2) N/S of Monterrey, (3) N/S of Pt
Conception, a coastwide index, and state-specific indices. For
multi-region models, like the yellowtail rockfish used here, it wouldn’t
make sense to interpret indices other than the split used in the model
(here, N/S of Cape Mendocino) and the coastwide index.