Benchmarking gigs against other software packages • gigs

Rationale

This short article describes the performance of gigs relative to a non-exhaustive group of R and non-R packages designed to implement child growth standards.

Package name	Language	On CRAN?
anthro	R	Yes
childsds	R	Yes
gigs	R	No
ki-tools/growthstandards	R	No
gigs	Stata	No
zanthro	Stata	No

Thus far, there is no comprehensive benchmark comparing these different packages. This short article will compare the speed of each package from 1 to 100000 inputs, checking how fast each package can convert weight values to z-scores in the WHO Child Growth standards.

We have performed these benchmarks on a Windows 10 system running a Ryzen 7 3700X processor and 16GB of DDR4 RAM, using R version 4.3.2. The Stata benchmarks have been run on the same system in Stata 18.0 (revision 15 May 2023), using the benchmark package for Stata.

Set up benchmark dataset

The benchmarking and package comparisons will all use the same 100,000-row dataset comprised of data which can be used to convert between z-scores and centiles in the WHO weight-for-age standard. The z-scores, x variable and sexes for each row are generated randomly with a pre-specified seed. As gigs is validated against the published growth curve data from the WHO, we use it to generate weight values in kg for each observation. This dataset can then be used to compare speed and accuracy in conversion between values and z-scores/centiles for the packages mentioned above.

who_gs_dataset <- function(n, acronym = "wfa", seed = 154237890) {
  withr::with_seed(
    seed,
    code = {
      # Random z-scores around 0
      z <- rnorm(n = n)
      # X variables are non-discrete but within bounds of the desired acronym
      xvars <- gigs::who_gs_coeffs[[acronym]][[1]][, 1]
      x <-  sample(xvars, size = n, replace = TRUE)
      x_jitter <- c(runif(n = 5, min = -1, max = 1), 0)
      x <- x + sample(x_jitter, size = n, replace = TRUE)
      x <- pmax(pmin(x, max(xvars)), min(xvars))
      # Sexes randomly sampled from male and female
      sex <- sample(c("M", "F"), size = n, replace = TRUE)
    }
  )
  acronym <- rep_len(acronym, n)
  out <- data.frame(z = z, x = x, sex = sex, acronym = acronym) |>
    dplyr::mutate(
      y = gigs::who_gs_zscore2value(z = z, x = x, sex = sex, acronym = acronym)
    )
  purrr::set_names(out, c("z", "x", "sex", "acronym", "y"))
}

# Generate 100,000-row dataset
bench_dataset <- who_gs_dataset(n = 100000)

The first 10 rows of this dataset look like this:

bench_dataset[1:10, ]

z	x	sex	acronym	y
0.1783916	1480.29104	F	wfa	16.593835
1.0806829	758.04165	F	wfa	13.377723
-0.7026929	1076.37115	F	wfa	12.558484
0.0950173	1425.04165	M	wfa	16.348601
-0.4482927	368.91555	F	wfa	8.497343
-0.3574745	792.04165	M	wfa	12.043674
1.1661293	1358.10063	F	wfa	18.188497
-0.0874651	54.29104	M	wfa	5.300828
-0.4768519	1176.10063	F	wfa	13.485699
0.8001542	784.29104	M	wfa	13.710762

Benchmark code

The mbench_pkg() function is used to benchmark each package over a range of input sizes. Each call to it produces a tabular output containing the median time required for pkg_expr to operate on the data, ranging from 1 to 100,000 inputs.

mbench_pkg <- function(pkg_expr, pkg_name) {
  n_inputs <- c(1, 10, 100, 500, 1000, 5000, 10000, 25000, 50000, 75000, 100000)
  purrr::map_dfr(.x = n_inputs,
                 .f = \(no_of_inputs) {
                   dataset <- bench_dataset[seq_len(no_of_inputs), ]
                   mbench <- microbenchmark::microbenchmark(
                     dplyr::mutate(dataset, test = eval(pkg_expr)),
                     times = 25
                   )
                   c(n_inputs = no_of_inputs,
                     median_time = summary(mbench)$median,
                     time_units = attr(summary(mbench), which = "unit"))
                 }) |>
    dplyr::mutate(n_inputs = as.integer(n_inputs),
                  median_time = as.numeric(median_time),
                  package = pkg_name)
}

`anthro`

anthro_timings <- mbench_pkg(
  pkg_expr = quote(anthro::anthro_zscores(sex = sex,
                                          age = x,
                                          weight = y)$zwei),
  pkg_name = paste("R: anthro", packageVersion(pkg = "anthro"))
)

`childsds`

childsds_timings <- mbench_pkg(
  pkg_expr = quote(childsds::sds(value = y,
                                 age = x / 365.25,
                                 sex = sex,
                                 male = "M",
                                 female = "F",
                                 item = "weight",
                                 ref = childsds::who.ref)),
  pkg_name = paste("R: childsds", packageVersion(pkg = "childsds"))
)

`growthstandards`

growthstandards_timings <- mbench_pkg(
  pkg_expr = quote(
    growthstandards::who_value2zscore(y = y, x = x,
                                      sex = ifelse(sex == "M", "Male",
                                                   "Female"),
                                      y_var = "wtkg")
  ),
  pkg_name = paste("R: growthstandards",
                   packageVersion(pkg = "growthstandards"))
)

`gigs`

gigs_timings <- mbench_pkg(
  pkg_expr = quote(gigs::who_gs_value2zscore(y = y, x = x, sex = sex,
                                             acronym = acronym)),
  pkg_name = paste("R: gigs", packageVersion(pkg = "gigs"))
)

Stata

# Save .dta file equivalent of benchmarking table. This can be used to benchmark
# Stata packages.
haven::write_dta(data = bench_dataset,
                 path = file.path("exclude", "statabench", "bench_dataset.dta"))

In Stata, the commands are run inside a do-file which utilises the benchmark package for Stata. This script essentially does the same as mbench_pkg(), but for the Stata gigs package and the zanthro package.

// This is Stata code
foreach i in 1 10 100 500 1000 5000 10000 25000 50000 75000 100000 {
    use "benchmarking/bench_dataset.dta", clear
    qui drop if _n > `i'
    di "Number of inputs: `i'"
    bench, reps(25) restore last: ///
        qui egen double z_gigs = who_gs(y, "wfa", "v2z"), ///
        xvar(x) sex(sex) sexcode(m=M, f=F)
}

foreach i in 1 10 100 500 1000 5000 10000 25000 50000 75000 100000 {
    use "benchmarking/bench_dataset.dta", clear
    qui drop if _n > `i'
    di "Number of inputs: `i'"
    bench, reps(25) restore last: ///
        qui egen z_anthro = zanthro(y, wa, WHO), xvar(x) gender(sex) ///
                gencode(male=M, female=F) ageunit(day)
}

The outputs from this script give a table of timings that look like this:

stata_timings

##    n_inputs median_time time_units              package
## 1         1       0.009    seconds    Stata: gigs 0.4.0
## 2        10       0.008    seconds    Stata: gigs 0.4.0
## 3       100       0.009    seconds    Stata: gigs 0.4.0
## 4       500       0.010    seconds    Stata: gigs 0.4.0
## 5      1000       0.012    seconds    Stata: gigs 0.4.0
## 6      5000       0.028    seconds    Stata: gigs 0.4.0
## 7     10000       0.047    seconds    Stata: gigs 0.4.0
## 8     25000       0.104    seconds    Stata: gigs 0.4.0
## 9     50000       0.198    seconds    Stata: gigs 0.4.0
## 10    75000       0.296    seconds    Stata: gigs 0.4.0
## 11   100000       0.405    seconds    Stata: gigs 0.4.0
## 12        1       0.007    seconds Stata: zanthro 1.0.2
## 13       10       0.008    seconds Stata: zanthro 1.0.2
## 14      100       0.009    seconds Stata: zanthro 1.0.2
## 15      500       0.017    seconds Stata: zanthro 1.0.2
## 16     1000       0.027    seconds Stata: zanthro 1.0.2
## 17     5000       0.105    seconds Stata: zanthro 1.0.2
## 18    10000       0.203    seconds Stata: zanthro 1.0.2
## 19    25000       0.479    seconds Stata: zanthro 1.0.2
## 20    50000       0.951    seconds Stata: zanthro 1.0.2
## 21    75000       1.445    seconds Stata: zanthro 1.0.2
## 22   100000       2.046    seconds Stata: zanthro 1.0.2

Plotting

Only by plotting the various timings tables can we see the trends for each package:

On the whole, anthro is by far the slowest R package, taking around 2.24 seconds to process 100,000 inputs. This is in part because anthro computes results in every WHO Child Growth standard each time anthro::anthro_zscores() is called, but also due to a slower implementation of the WHO LMS process conversion than the other packages.

Next slowest is the Stata package zanthro, which takes around 2.05 seconds to compute results in a single WHO standard. About 4 times faster than zanthro is gigs for Stata 0.4.0, which scales more efficiently than zanthro and takes around 0.41 seconds to convert 100,000 measurements to z-scores.

Leading the pack are three R implementations: growthstandards, gigs, and childsds. The growthstandards package was the fastest at ~125 ms for 100,000 inputs, followed by gigs (~128 ms) and childsds (~126 ms).

Package output similarity

The packages also differ slightly in how they convert values to centiles/z-scores, which can affect the z-scores computed by each package even when inputs are the same. In our testing of the WHO standards, we found that the tested packages mostly agreed with each other, but that childsds did not correctly perform z-scoring in the WHO standards.

This is because the WHO Child Growth standards constrain z-scores in the outer tails (i.e. past 3 z-scores), as there were fewer data available at these extreme values. More information on this constraining procedure can be found in the reports referenced in the gigs::who_gs_value2zscore() documentation. As childsds does not perform this constraining procedure, extreme values are computed incorrectly:

discrepancies <- data.frame(z = c(-3.03, -2.97, 2.97, 3.03),
                            age_days = 0,
                            sex = "M") |>
  dplyr::mutate(
    weight_kg = gigs::who_gs_wfa_zscore2value(z, age_days, sex),
    # GIGS z-score
    z_gigs = gigs::who_gs_wfa_value2zscore(weight_kg, age_days, sex),
    # growthstandards z-score
    z_growthstandards = growthstandards::who_wtkg2zscore(
      age_days, weight_kg, "Male"
    ),
    # childsds z-score
    z_childsds = childsds::sds(
      value = weight_kg, age = age_days / 365.25,
      sex = sex, male = "M", female = "F",
      item = "weight", ref = childsds::who.ref
    )
  )

When we look at these z-scores, you can see that both growthstandards and gigs correctly apply the constraining procedure; childsds does not. From looking at the anthro source code, they also apply the constraining procedure.

z	sex	weight_kg	z_gigs	z_growthstandards	z_childsds
-3.03	M	2.068938	-3.03	-3.03	-3.031770
-2.97	M	2.091082	-2.97	-2.97	-2.970000
2.97	M	5.011546	2.97	2.97	2.970000
3.03	M	5.048978	3.03	3.03	3.028744

Session information

sessionInfo()

## R version 4.3.2 (2023-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 8 x64 (build 9200)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252 
## [2] LC_CTYPE=English_United Kingdom.1252   
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.1252    
## 
## time zone: Europe/London
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] withr_2.5.2       units_0.8-5       testthat_3.2.1    rmarkdown_2.25   
## [5] knitr_1.45        vctrs_0.6.5       checkmate_2.3.1   gamlss.dist_6.1-1
## 
## loaded via a namespace (and not attached):
##  [1] sandwich_3.1-0        utf8_1.2.4            generics_0.1.3       
##  [4] tidyr_1.3.0           lattice_0.21-9        digest_0.6.34        
##  [7] magrittr_2.0.3        evaluate_0.23         grid_4.3.2           
## [10] mvtnorm_1.2-4         fastmap_1.1.1         Matrix_1.6-5         
## [13] backports_1.4.1       brio_1.1.4            DBI_1.2.1            
## [16] survival_3.5-7        anthro_1.0.1          multcomp_1.4-25      
## [19] gigs_0.4.1            purrr_1.0.2           fansi_1.0.6          
## [22] scales_1.3.0          TH.data_1.1-2         codetools_0.2-19     
## [25] microbenchmark_1.4.10 cli_3.6.2             mitools_2.4          
## [28] rlang_1.1.3           childsds_0.8.0        munsell_0.5.0        
## [31] splines_4.3.2         tools_4.3.2           growthstandards_0.1.6
## [34] dplyr_1.1.4           colorspace_2.1-0      ggplot2_3.4.4        
## [37] R6_2.5.1              zoo_1.8-12            lifecycle_1.0.4      
## [40] MASS_7.3-60           pkgconfig_2.0.3       pillar_1.9.0         
## [43] gtable_0.3.4          glue_1.7.0            Rcpp_1.0.12          
## [46] xfun_0.41             tibble_3.2.1          tidyselect_1.2.0     
## [49] survey_4.2-1          htmltools_0.5.7       compiler_4.3.2       
## [52] prettyunits_1.2.0