Data wrangling and working with ggoutbreak

ggoutbreak assumes a consistent naming scheme for key columns: notably time and count, and additionally class, denom and population. Data must be supplied using these column names, and it undergoes rigorous checks to make sure it is formatted correctly. One area where this can cause problems is grouping: each group must be a single time series, with unique values in the time column and, at a minimum, a count column.
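The grouping requirement can be illustrated with a small base R sketch. The data frame and the check below are hypothetical and are not ggoutbreak's own validation; they only show why a missing grouping makes the time column non-unique.

```r
# Hypothetical example data using the column names described above.
df <- data.frame(
  class = rep(c("variant1", "variant2"), each = 3),
  time  = rep(0:2, times = 2),
  count = c(5L, 8L, 13L, 2L, 3L, 1L)
)

# Grouped by class: does any group contain a repeated time value?
dup_by_class <- tapply(df$time, df$class, function(t) anyDuplicated(t) > 0)

# Ungrouped: the whole table is treated as one series, so the
# times 0, 1 and 2 each appear twice.
dup_ungrouped <- anyDuplicated(df$time) > 0

any(dup_by_class)  # FALSE: each class is a valid single time series
dup_ungrouped      # TRUE: without grouping the times are not unique
```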

Line lists vs. time series

Infectious disease data usually either comes as a set of observations of an individual infection with a time stamp (i.e. a line list) or as a count of events (e.g. positive tests, hospitalisations, deaths) happening within a specific period (day, week, month etc.) as a time series.

For count data a denominator may also be known. For test results this could be the number of tests performed; for hospitalisations, the number of patients at risk.

For both these data types there may also be a class associated with each observation, defining a subgroup of infections of interest. This could be the variant of a virus, or the age group, for example. It may make sense to compare these different subgroups against each other. In this case the denominator may be the total of counts among all groups per unit time. Additionally there may be information about the size of the population for each subgroup.
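As a minimal sketch of the case where the denominator is the total count across all classes per unit time, the following base R code uses made-up counts for two hypothetical variants. It illustrates the idea only and does not use ggoutbreak.

```r
# Hypothetical counts for two classes over three time points.
counts <- data.frame(
  time  = rep(0:2, times = 2),
  class = rep(c("variant1", "variant2"), each = 3),
  count = c(5, 8, 13, 15, 12, 7)
)

# ave() applies sum() within each time value and returns a vector
# aligned with the original rows, i.e. the per-time total over classes.
counts$denom <- ave(counts$count, counts$time, FUN = sum)

# The proportion of each class at each time point.
counts$proportion <- counts$count / counts$denom
counts
```

Within each time point the proportions sum to one, which is what makes the classes directly comparable.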

For the most part, ggoutbreak assumes that the input data is a set of time series of counts, each of which has a unique set of times that is usually complete. To create datasets like this from line lists, ggoutbreak provides some infrastructure for dealing with time series.

Time periods

A weekly case rate represents a time slice of seven days with a start and finish date. Dates are a continuous quantity, and cut_date() can be used to classify them into periods of equal duration, each identified by its start date:

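library(ggoutbreak)

# 50 random dates falling within the next 21 days
random_dates = Sys.Date()+sample.int(21,50,replace = TRUE)
# classify each date into a weekly period; names are the period labels,
# values are the period start dates
cut_date( random_dates, unit = "1 week", anchor = "start", dfmt = "%d %b")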
#> 22 Nov — 28 Nov 29 Nov — 05 Dec 15 Nov — 21 Nov 29 Nov — 05 Dec 15 Nov — 21 Nov 
#>    "2024-11-22"    "2024-11-29"    "2024-11-15"    "2024-11-29"    "2024-11-15" 
#> 29 Nov — 05 Dec 29 Nov — 05 Dec 22 Nov — 28 Nov 15 Nov — 21 Nov 29 Nov — 05 Dec 
#>    "2024-11-29"    "2024-11-29"    "2024-11-22"    "2024-11-15"    "2024-11-29" 
#> 29 Nov — 05 Dec 15 Nov — 21 Nov 15 Nov — 21 Nov 22 Nov — 28 Nov 22 Nov — 28 Nov 
#>    "2024-11-29"    "2024-11-15"    "2024-11-15"    "2024-11-22"    "2024-11-22" 
#> 22 Nov — 28 Nov 29 Nov — 05 Dec 22 Nov — 28 Nov 15 Nov — 21 Nov 29 Nov — 05 Dec 
#>    "2024-11-22"    "2024-11-29"    "2024-11-22"    "2024-11-15"    "2024-11-29" 
#> 15 Nov — 21 Nov 15 Nov — 21 Nov 22 Nov — 28 Nov 22 Nov — 28 Nov 15 Nov — 21 Nov 
#>    "2024-11-15"    "2024-11-15"    "2024-11-22"    "2024-11-22"    "2024-11-15" 
#> 15 Nov — 21 Nov 29 Nov — 05 Dec 15 Nov — 21 Nov 22 Nov — 28 Nov 29 Nov — 05 Dec 
#>    "2024-11-15"    "2024-11-29"    "2024-11-15"    "2024-11-22"    "2024-11-29" 
#> 22 Nov — 28 Nov 22 Nov — 28 Nov 22 Nov — 28 Nov 22 Nov — 28 Nov 15 Nov — 21 Nov 
#>    "2024-11-22"    "2024-11-22"    "2024-11-22"    "2024-11-22"    "2024-11-15" 
#> 15 Nov — 21 Nov 22 Nov — 28 Nov 29 Nov — 05 Dec 29 Nov — 05 Dec 29 Nov — 05 Dec 
#>    "2024-11-15"    "2024-11-22"    "2024-11-29"    "2024-11-29"    "2024-11-29" 
#> 15 Nov — 21 Nov 15 Nov — 21 Nov 15 Nov — 21 Nov 15 Nov — 21 Nov 15 Nov — 21 Nov 
#>    "2024-11-15"    "2024-11-15"    "2024-11-15"    "2024-11-15"    "2024-11-15" 
#> 15 Nov — 21 Nov 29 Nov — 05 Dec 29 Nov — 05 Dec 22 Nov — 28 Nov 29 Nov — 05 Dec 
#>    "2024-11-15"    "2024-11-29"    "2024-11-29"    "2024-11-22"    "2024-11-29"
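To make the classification concrete, here is a rough base R sketch of the same idea: weekly periods anchored at the earliest date. It is an illustration under that assumption, with hand-picked dates, and is not how cut_date() is implemented.

```r
# Hand-picked dates spanning three weeks.
dates  <- as.Date(c("2024-11-15", "2024-11-21", "2024-11-22", "2024-12-05"))
origin <- min(dates)  # anchor the periods at the earliest date

# Number of whole weeks elapsed since the origin.
week_index <- as.numeric(dates - origin) %/% 7

# Each date is assigned the start of the seven-day period containing it.
period_start <- origin + 7 * week_index
period_end   <- period_start + 6

# Human-readable labels (month abbreviations depend on the locale).
period_label <- paste(format(period_start, "%d %b"), "to", format(period_end, "%d %b"))

data.frame(date = dates, week_index, period_start, period_label)
```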

Performing calculations on interval-censored dates is awkward. It is more convenient to work with a numeric representation of dates that keeps track of both the start date of the time series and its intrinsic duration (the unit of time). This is the purpose of the time_period class:

dates = seq(as.Date("2020-01-01"),by=7,length.out = 5)
tmp = as.time_period(dates)
#> No `start_date` (or `anchor`) specified. Using default: 2019-12-29
#> No unit given. Guessing a sensible value from the dates gives: 7d 0H 0M 0S
tmp
#> time unit: week, origin: 2019-12-29 (a Sunday)
#> [1] 0.4285714 1.4285714 2.4285714 3.4285714 4.4285714

By default, time_period uses a date at the beginning of the COVID-19 pandemic as its origin, and calculates a duration unit from the data (in this case weekly).
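The printed values can be reproduced by hand. The base R sketch below assumes that a time_period value is the number of days since the origin divided by the unit length in days; this matches the output above but is not a description of ggoutbreak's internals.

```r
dates     <- seq(as.Date("2020-01-01"), by = 7, length.out = 5)
origin    <- as.Date("2019-12-29")  # the default origin reported above
unit_days <- 7                      # a weekly unit

# Numeric time: days since the origin, expressed in units of one week.
times <- as.numeric(dates - origin) / unit_days
times

# Converting back: origin plus the elapsed days
# (rounded to guard against floating point error).
recovered <- origin + round(times * unit_days)
recovered
```

The first date is three days after the origin, so its value is 3/7, and each later value is exactly one unit further on.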

The usual set of S3 methods is available, covering formatting, printing, labelling, and casting time_periods to and from Date and POSIXct classes:

suppressWarnings(labels(tmp))
#> 01/Jan — 07/Jan
#> 08/Jan — 14/Jan
#> 15/Jan — 21/Jan
#> 22/Jan — 28/Jan
#> 29/Jan — 04/Feb

A weekly time series can be recast to a different frequency, or start date:


tmp2 = as.time_period(tmp, unit = "2 days", start_date = "2020-01-01")
tmp2
#> time unit: 2 days, origin: 2020-01-01 (a Wednesday)
#> [1]  0.0  3.5  7.0 10.5 14.0

and the original dates should be recoverable:

as.Date(tmp2)
#> [1] "2020-01-01" "2020-01-08" "2020-01-15" "2020-01-22" "2020-01-29"

date_seq() can be used to make sure a set of periodic times is complete:


tmp3 = as.time_period(Sys.Date()+c(0:2,4:5)*7,anchor = "start")
#> No unit given. Guessing a sensible value from the dates gives: 7d 0H 0M 0S
as.Date(date_seq(tmp3))
#> [1] "2024-11-14" "2024-11-21" "2024-11-28" "2024-12-05" "2024-12-12"
#> [6] "2024-12-19"
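The same gap-filling idea can be sketched in base R, assuming a regular seven-day period and using a fixed date in place of Sys.Date() so the result is reproducible. This illustrates the concept only and is not date_seq() itself.

```r
# Weekly dates with one week missing (the offset of 3 weeks is skipped).
observed <- as.Date("2024-11-14") + c(0:2, 4:5) * 7

# The complete weekly sequence from the first to the last observed date.
complete <- seq(min(observed), max(observed), by = 7)

# The periods that were absent from the observed data.
missing_dates <- complete[!complete %in% observed]

complete
missing_dates
```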

time_periods can also be used with monthly or yearly data, but such periods are not regular, because months and years vary in length. Irregular periods are handled and are generally fine to use with ggoutbreak, but some functions, such as date_seq(), may not work as anticipated with them.

Two time series can be aligned to make them comparable:


orig_dates = Sys.Date()+1:10*7

# a 2 daily time series based on weekly dates
t1 = as.time_period(orig_dates, unit = "2 days", start_date = "2021-01-01")
t1
#> time unit: 2 days, origin: 2021-01-01 (a Friday)
#>  [1] 710.0 713.5 717.0 720.5 724.0 727.5 731.0 734.5 738.0 741.5

# a weekly time series with a different start date
t2 = as.time_period(orig_dates, unit = "1 week", start_date = "2022-01-01")
t2
#> time unit: week, origin: 2022-01-01 (a Saturday)
#>  [1] 150.7143 151.7143 152.7143 153.7143 154.7143 155.7143 156.7143 157.7143
#>  [9] 158.7143 159.7143

# rebase t1 into the same format as t2.
# As t1 and t2 are based on the same original dates, converting t1 onto the
# same periodicity and origin as t2 results in a set of times identical to t2
t3 = as.time_period(t1,t2)
t3
#> time unit: week, origin: 2022-01-01 (a Saturday)
#>  [1] 150.7143 151.7143 152.7143 153.7143 154.7143 155.7143 156.7143 157.7143
#>  [9] 158.7143 159.7143
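The arithmetic behind such a rebasing can be sketched in base R. The rebase() helper below is hypothetical (it is not part of ggoutbreak) and assumes a numeric time is the number of days since the origin divided by the unit length in days. A fixed date stands in for Sys.Date().

```r
# Hypothetical helper: convert numeric times from one (unit, origin) pair
# to another. Units are given in days.
rebase <- function(times, unit_days, origin, new_unit_days, new_origin) {
  days_since_old_origin <- times * unit_days
  # Days from the new origin to the old origin (negative if the new one is later).
  shift <- as.numeric(as.Date(origin) - as.Date(new_origin))
  (days_since_old_origin + shift) / new_unit_days
}

orig_dates <- as.Date("2024-11-14") + 1:10 * 7

# Two-day units from 2021-01-01, and weekly units from 2022-01-01.
t1 <- as.numeric(orig_dates - as.Date("2021-01-01")) / 2
t2 <- as.numeric(orig_dates - as.Date("2022-01-01")) / 7

# Rebasing t1 onto the unit and origin of t2 reproduces t2.
t3 <- rebase(t1, 2, "2021-01-01", 7, "2022-01-01")
all.equal(t3, t2)
```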

Times in ggoutbreak and conversion of line-lists

ggoutbreak makes extensive internal use of the time_period class. Casting dates to and from time_periods is generally all that needs to be done before using ggoutbreak. Most ggoutbreak functions operate on time series data, and expect a unique (and usually complete) set of observations at periodic times.

The time_summarise() function helps convert line-list data into time series. A minimal line-list has a date column and nothing else.


random_dates = Sys.Date()+sample.int(21,50,replace = TRUE)
linelist = tibble::tibble(date = random_dates)
linelist %>% time_summarise(unit="1 week") %>% dplyr::glimpse()
#> Rows: 3
#> Columns: 2
#> $ time  <time_prd> 0, 1, 2
#> $ count <int> 15, 17, 18
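For intuition, the conversion from a line-list to weekly counts can be approximated in base R. This sketch assumes weeks are counted from the earliest date in a hand-made line-list; time_summarise() may choose its origin differently.

```r
# A hand-made line-list: one row per case, identified only by a date.
linelist <- data.frame(
  date = as.Date("2024-11-15") + c(0, 1, 1, 6, 7, 9, 13, 14, 20, 20)
)

# Index each case by the number of whole weeks since the first date.
origin <- min(linelist$date)
linelist$time <- as.numeric(linelist$date - origin) %/% 7

# Tabulate over every week in the range, so that a week with no cases
# would still appear with a count of zero.
weeks  <- seq(0, max(linelist$time))
counts <- as.data.frame(
  table(time = factor(linelist$time, levels = weeks)),
  responseName = "count"
)
counts
```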

If the line-list contains a class column it is interpreted as a complete record of all possible options, from which we can calculate a denominator. In this example the classes are the positive and negative results of a test:


random_dates = Sys.Date()+sample.int(21,200,replace = TRUE)
linelist2 = tibble::tibble(
  date = random_dates,
  class = stats::rbinom(200, 1, 0.04) %>% ifelse("positive","negative")
)
linelist2 %>% time_summarise(unit="1 week") %>% dplyr::glimpse()
#> Rows: 6
#> Columns: 4
#> Groups: class [2]
#> $ class <chr> "negative", "negative", "negative", "positive", "positive", "pos…
#> $ time  <time_prd> 0, 1, 2, 0, 1, 2
#> $ count <int> 56, 65, 68, 4, 4, 3
#> $ denom <int> 60, 69, 71, 60, 69, 71
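The way a denominator follows from a complete set of classes can be sketched in base R. The data below are made up, with the time index already computed, and the code only illustrates the idea rather than reproducing time_summarise().

```r
# Hypothetical line-list with a weekly time index and a test result.
linelist <- data.frame(
  time  = c(0, 0, 0, 0, 1, 1, 1, 2, 2, 2),
  class = c("negative", "negative", "negative", "positive",
            "negative", "negative", "positive",
            "negative", "negative", "negative")
)

# Count cases for every class and time combination, including zeros.
counts <- as.data.frame(
  table(class = linelist$class, time = linelist$time),
  responseName = "count"
)

# The denominator is the total over all classes at each time point.
counts$denom <- ave(counts$count, counts$time, FUN = sum)
counts
```

Note that the positive class has a count of zero in the last week but still shares that week's denominator.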

In this specific example, subsequent analysis with ggoutbreak may focus on the positive subgroup only, as the comparison between positive and negative test results is trivial. In other cases class need not be a test result; it could be any other major subdivision, e.g. the variant of a disease, where the comparison between groups may be much more relevant. The use of class as the major subgroup is a convenience. Grouping by columns other than class is also possible for multi-faceted comparisons. Such grouping is preserved, but it is not automatically included in the denominator, which may need to be calculated manually:


random_dates = Sys.Date()+sample.int(21,200,replace = TRUE)
variant = apply(stats::rmultinom(200, 1, c(0.1,0.3,0.6)), MARGIN = 2, function(x) which(x==1))

linelist3 = tibble::tibble(
  date = random_dates,
  class = c("variant1","variant2","variant3")[variant],
  gender = ifelse(stats::rbinom(200,1,0.5),"male","female")
)
  
count_by_gender = linelist3 %>% 
  dplyr::group_by(gender) %>% 
  time_summarise(unit="1 week") %>% 
  dplyr::arrange(time, gender, class) %>%
  dplyr::glimpse()
#> Rows: 18
#> Columns: 5
#> Groups: gender, class [6]
#> $ gender <chr> "female", "female", "female", "male", "male", "male", "female",…
#> $ class  <chr> "variant1", "variant2", "variant3", "variant1", "variant2", "va…
#> $ time   <time_prd> 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2
#> $ count  <int> 8, 8, 27, 3, 9, 21, 1, 9, 21, 3, 8, 25, 2, 3, 21, 4, 7, 20
#> $ denom  <int> 43, 43, 43, 33, 33, 33, 31, 31, 31, 36, 36, 36, 26, 26, 26…

Aggregating time series datasets

When a time series has additional grouping, time_aggregate() makes it easier to remove a level of grouping whilst retaining time. In this case we wish to sum count and denom across gender, retaining the class grouping.


count_by_gender %>% 
  dplyr::group_by(class,gender) %>% 
  time_aggregate() %>%
  dplyr::glimpse()
#> Rows: 9
#> Columns: 4
#> Groups: class [3]
#> $ class <chr> "variant1", "variant1", "variant1", "variant2", "variant2", "var…
#> $ time  <time_prd> 0, 1, 2, 0, 1, 2, 0, 1, 2
#> $ count <int> 11, 4, 6, 17, 17, 10, 48, 46, 41
#> $ denom <int> 76, 67, 57, 76, 67, 57, 76, 67, 57

By default time_aggregate() will sum any of the count, denom and population columns, but any other behaviour can be specified by passing dplyr::summarise style directives to the function.
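The default summing behaviour can be mimicked in base R, with aggregate() standing in for time_aggregate(). The data are made up, and the sketch assumes the gender-specific denominators are disjoint, so that adding them is meaningful.

```r
# Hypothetical time series grouped by gender and class.
ts <- data.frame(
  gender = rep(c("female", "male"), each = 4),
  class  = rep(rep(c("variant1", "variant2"), each = 2), times = 2),
  time   = rep(0:1, times = 4),
  count  = c(8, 1, 8, 9, 3, 3, 9, 8),
  denom  = c(16, 10, 16, 10, 12, 11, 12, 11)
)

# Sum count and denom over gender, keeping class and time.
agg <- aggregate(cbind(count, denom) ~ class + time, data = ts, FUN = sum)
agg
```

After aggregation the denominator at each time point is again the total count over all classes, as it was within each gender beforehand.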