| Title: | Datasets from the UK COVID-19 outbreak |
|---|---|
| Description: | Provides simple access to a small selection of pre-wrangled data sets relevant to the COVID-19 outbreak in the UK for teaching and demo purposes. |
| Authors: | Robert Challen [aut, cre] (ORCID: <https://orcid.org/0000-0002-5504-7768>) |
| Maintainer: | Robert Challen <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.0.3 |
| Built: | 2026-05-08 07:35:32 UTC |
| Source: | https://github.com/ai4ci/ukc19 |
Viral load from nasal swabs of a subset of positive participants in a COVID-19 human challenge study, as detected by quantitative PCR. Values were mined from the vector files of the figures. The Y-axis values are approximate because they had to be read manually from the scale.
data("covid_challenge")
An object of class tbl_df (inherits from tbl, data.frame) with 629 rows and 3 columns.
Data extracted from Figure 2 ("Viral shedding after a short incubation period peaks rapidly after human SARS-CoV-2 challenge"), panel A (middle left subpanel).
Killingley, B., Mann, A.J., Kalinova, M. et al. Safety, tolerability and viral kinetics during SARS-CoV-2 human challenge in young adults. Nat Med 28, 1031–1041 (2022). https://doi.org/10.1038/s41591-022-01780-9
For datasets compiled from existing literature, Scientific Data’s policy is that compilers (creators of the secondary compilation dataset and authors of the associated Data Descriptor) are not required by the journal to ask permission from the original authors to extract small amounts of numerical information or other fields. Expected practice is to attribute the original work via citation.
id (chr) a unique ID for the participant
log10_viral_load (dbl) log 10 viral load in copies per millilitre detected
time (dbl) time of the sample in days from exposure.
https://www.nature.com/articles/s41591-022-01780-9/figures/2
dplyr::glimpse(covid_challenge)
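A common first step with per-participant viral load curves is to find each participant's peak. The following is a minimal base R sketch, assuming only the three documented columns (`id`, `log10_viral_load`, `time`). The helper name `peak_viral_load` and the toy values are illustrative and not part of the package.

```r
# Toy data in the documented shape; the values are made up for illustration.
toy_challenge <- data.frame(
  id = c("a", "a", "a", "b", "b", "b"),
  log10_viral_load = c(2.0, 6.5, 4.0, 3.0, 5.0, 7.5),
  time = c(1, 4, 8, 2, 5, 6),
  stringsAsFactors = FALSE
)

# For each participant keep the row with the highest log10 viral load.
# (Hypothetical helper, not a function exported by the package.)
peak_viral_load <- function(df) {
  by_id <- split(df, df$id)
  peaks <- lapply(by_id, function(d) d[which.max(d$log10_viral_load), ])
  out <- do.call(rbind, peaks)
  rownames(out) <- NULL
  out
}

peaks <- peak_viral_load(toy_challenge)
print(peaks)
```

If the real data matches the description above, `peak_viral_load(covid_challenge)` should work in the same way. Because the values were read manually from a figure, peak heights are approximate.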
Weekly counts of identified variants for the whole of England.
data("covid_variants")
An object of class grouped_df (inherits from tbl_df, tbl, data.frame) with 479 rows and 5 columns.
From late March 2023 onwards, due to the low number of sequenced samples, the UK SARS-CoV-2 sequencing surveillance data is not updated on the Wellcome Sanger Institute COVID-19 Genomic surveillance dashboard. Due to changes since the end of mass COVID-19 testing in the UK since April 2022 - the Wellcome Sanger Institute COVID-19 Genomic surveillance dashboard only includes a subset of UK SARS-CoV-2 sequencing surveillance data and should not be used to estimate frequency of SARS-CoV-2 variants circulating. Not all samples sequenced and deposited in public databases are presented here. This data is not de-duplicated on a patient level - and may include targeted sequencing that may introduce biases.
covid_variants dataframe with 479 rows and 5 columns
date (date) The date
class (fct) The variant description as a name and pango lineage
who_class (fct) The WHO short name
count (dbl) The number of sequences of this variant identified on this date
denom (dbl) The total number of sequences of all variants identified on this date
https://covid19.sanger.ac.uk/lineages/raw
Contains Ordnance Survey data © Crown copyright and database right 2019. Contains UK Health Security Agency data © Crown copyright and database right 2020. Office for National Statistics licensed under the Open Government Licence v.3.0.
dplyr::glimpse(covid_variants)
Weekly counts of identified variants by Lower tier local authority (2019 names)
This dataset has implicit zeros. The full range of areas can be obtained from the
geography data set with: geography %>% dplyr::filter(codeType == "LAD19")
data("covid_variants_ltla")
An object of class tbl_df (inherits from tbl, data.frame) with 55785 rows and 8 columns.
From late March 2023 onwards, due to the low number of sequenced samples, the UK SARS-CoV-2 sequencing surveillance data is not updated on the Wellcome Sanger Institute COVID-19 Genomic surveillance dashboard. Due to changes since the end of mass COVID-19 testing in the UK since April 2022 - the Wellcome Sanger Institute COVID-19 Genomic surveillance dashboard only includes a subset of UK SARS-CoV-2 sequencing surveillance data and should not be used to estimate frequency of SARS-CoV-2 variants circulating. Not all samples sequenced and deposited in public databases are presented here. This data is not de-duplicated on a patient level - and may include targeted sequencing that may introduce biases.
covid_variants_ltla dataframe with 55785 rows and 8 columns
date (date) The date
code (chr) The ONS geographical region code
codeType (chr) The type of ONS geographical code
name (chr) The ONS geographical region name
who_class (fct) The WHO short name
count (dbl) The number of sequences of this variant identified on this date
denom (dbl) The total number of sequences of all variants identified on this date
https://covid19.sanger.ac.uk/lineages/raw
Contains Ordnance Survey data © Crown copyright and database right 2019. Contains UK Health Security Agency data © Crown copyright and database right 2020. Office for National Statistics licensed under the Open Government Licence v.3.0.
dplyr::glimpse(covid_variants_ltla)
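Because the zeros are implicit, any area, date and variant combination without a row has to be filled in before computing totals or proportions. One possible way to do this in base R is sketched below, assuming the documented `date`, `code`, `who_class` and `count` columns. The helper `complete_counts` and the toy rows are illustrative only.

```r
# Toy data in the documented shape; the rows are made up for illustration.
toy_variants <- data.frame(
  date = as.Date(c("2021-01-02", "2021-01-02", "2021-01-09")),
  code = c("E06000001", "E06000002", "E06000001"),
  who_class = c("Alpha", "Alpha", "Delta"),
  count = c(3, 1, 2),
  stringsAsFactors = FALSE
)

# Build every date x area x variant combination and fill missing counts with 0.
# (Hypothetical helper, not a function exported by the package.)
complete_counts <- function(df, codes = unique(df$code)) {
  obs <- data.frame(
    date = df$date,
    code = as.character(df$code),
    who_class = as.character(df$who_class),
    count = df$count,
    stringsAsFactors = FALSE
  )
  grid <- expand.grid(
    date = unique(obs$date),
    code = as.character(codes),
    who_class = unique(obs$who_class),
    stringsAsFactors = FALSE
  )
  out <- merge(grid, obs, by = c("date", "code", "who_class"), all.x = TRUE)
  out$count[is.na(out$count)] <- 0
  out[order(out$date, out$code, out$who_class), ]
}

full <- complete_counts(toy_variants)
print(full)
```

For the real data, the full set of areas could be supplied as `codes = geography$code[geography$codeType == "LAD19"]`, which mirrors the filter given above. The `denom` column is deliberately left out of this sketch, since its value for an area and date with no sequences at all is not stated in the description.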
Data from Z. Du, X. Xu, Y. Wu, L. Wang, B. J. Cowling, and L. A. Meyers, ‘Serial Interval of COVID-19 among Publicly Reported Confirmed Cases’, Emerg Infect Dis, vol. 26, no. 6, pp. 1341–1343, Jun. 2020, doi: 10.3201/eid2606.200357.
data("du_serial_interval")
An object of class tbl_df (inherits from tbl, data.frame) with 752 rows and 3 columns.
"This is a publication of the U.S. Government. This publication is in the public domain and is therefore without copyright. All text from this work may be reprinted freely. Use of these materials should be properly cited."
du_serial_interval dataframe with 752 rows and 3 columns
id (dbl) Unique case id
symptom_onset (dbl) Time of symptom onset as an integer
infector_id (dbl) Case id of infector where known
https://github.com/MeyersLabUTexas/COVID-19
dplyr::glimpse(du_serial_interval)
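The serial interval is the time between symptom onset in an infector and symptom onset in the person they infected, so it can be derived by looking up each case's infector in the same table. The sketch below assumes that `symptom_onset` is measured in days on a common scale (the description only says it is an integer) and that `infector_id` refers to values of `id`. The helper `serial_intervals` and the toy rows are illustrative.

```r
# Toy data in the documented shape; the values are made up for illustration.
toy_si <- data.frame(
  id = c(1, 2, 3, 4),
  symptom_onset = c(10, 14, 13, 20),
  infector_id = c(NA, 1, 1, 2)
)

# Match each infectee to its infector and take the difference in onset times.
# (Hypothetical helper, not a function exported by the package.)
serial_intervals <- function(df) {
  infectee <- df[!is.na(df$infector_id), ]
  idx <- match(infectee$infector_id, df$id)
  data.frame(
    infector_id = infectee$infector_id,
    infectee_id = infectee$id,
    serial_interval = infectee$symptom_onset - df$symptom_onset[idx]
  )
}

si <- serial_intervals(toy_si)
print(si)
```

Negative values are possible when an infectee develops symptoms before their infector, and rows whose infector does not appear in the table yield `NA`.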
Mined from the commit history of the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University, this dataset has early outbreak trajectories (21 January 2020 to 8 March 2020) for a wide range of geographies, covering confirmed cases, deaths and recovered cases. The trajectories are indexed by reported date but were occasionally revised after publication. How often this happened varies from region to region, and possibly between statistics, and the revisions show up as infrequent changes in the published estimates over time.
data("early_global_combined")
An object of class tbl_df (inherits from tbl, data.frame) with 104036 rows and 9 columns.
This data set is originally licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) by the Johns Hopkins University on behalf of its Center for Systems Science in Engineering. Copyright Johns Hopkins University 2020.
country (chr) The country
province (chr) subnational division
lat (dbl) Latitude
long (dbl) Longitude
reported_date (date) Date of the observation based on reports of cases on this date.
total_cases (dbl) Cumulative cases as
published_date (date) Date the observation was published on the JHU github.
total_deaths (dbl) Cumulative deaths
total_recovered (dbl) Cumulative recovered
https://github.com/CSSEGISandData/COVID-19
dplyr::glimpse(early_global_combined)
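Since the same `reported_date` can appear under several `published_date` values, one way to recover the most recently revised trajectory is to keep the latest publication for each geography and reported date. The following base R sketch assumes the documented column names. The helper `latest_revision` and the toy rows are illustrative.

```r
# Toy data in the documented shape; the values are made up for illustration.
toy_jhu <- data.frame(
  country = "X",
  province = NA_character_,
  reported_date = as.Date(c("2020-02-01", "2020-02-01", "2020-02-02")),
  published_date = as.Date(c("2020-02-02", "2020-02-05", "2020-02-03")),
  total_cases = c(10, 12, 15),
  stringsAsFactors = FALSE
)

# Keep, for each geography and reported date, the row published last.
# (Hypothetical helper, not a function exported by the package.)
latest_revision <- function(df) {
  key <- paste(df$country, df$province, df$reported_date, sep = "|")
  ord <- order(key, df$published_date, decreasing = TRUE)
  df <- df[ord, ]
  out <- df[!duplicated(key[ord]), ]
  out[order(out$country, out$province, out$reported_date), ]
}

latest <- latest_revision(toy_jhu)
print(latest)
```

Filtering on a fixed `published_date` instead would give the trajectory as it was known on that day, which can be useful for teaching about real-time reporting.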
A dataset of the daily count of COVID-19 cases by age group in England
downloaded from the UKHSA coronavirus API, and formatted for
use in ggoutbreak. A denominator is calculated which is the overall
positive count for all age groups. This data set can be used to calculate
group-wise incidence and absolute growth rates and group-wise proportions and
relative growth rates by age group.
data("england_cases_by_5yr_age")
An object of class tbl_df (inherits from tbl, data.frame) with 26790 rows and 8 columns.
You may want england_covid_positivity instead which includes the
test denominator. The denominator here is the total number of positive
tests across all age groups and not the number of tests taken or population
size.
england_cases_by_5yr_age dataframe with 26790 rows and 8 columns
name (chr) The region name
code (chr) The region code
codeType (chr) The ONS geographical region code type (including year)
date (date) The date
class (chr) the age group in 5 year age bands
count (dbl) the test positives for each age group
denom (dbl) the test positives across all age groups
population (dbl) the population size for this age group
https://ukhsa-dashboard.data.gov.uk/covid-19-archive-data-download
Originally licensed under the Open Government Licence v3.0
dplyr::glimpse(england_cases_by_5yr_age)
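The two quantities this dataset supports can be illustrated with simple arithmetic: incidence per head of population uses `population`, while the share of all positives uses `denom`. The sketch below uses made-up numbers and invented age band labels, and the derived column names `per_100k` and `share` are not part of the package.

```r
# Toy data in the documented shape; values and age band labels are made up.
toy_age <- data.frame(
  date = as.Date("2021-01-01"),
  class = c("band_1", "band_2"),
  count = c(20, 30),
  denom = c(50, 50),
  population = c(100000, 150000),
  stringsAsFactors = FALSE
)

# Group-wise incidence: positives per 100,000 people in the age group.
toy_age$per_100k <- 1e5 * toy_age$count / toy_age$population

# Group-wise proportion: this age group's share of all positives that day.
toy_age$share <- toy_age$count / toy_age$denom

print(toy_age)
```

In this toy example both age groups have the same incidence (20 per 100,000) but different shares of positives (0.4 and 0.6), which is why the two denominators answer different questions.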
The daily count of new COVID-19 PCR positive cases in England. The denominator is the overall number of PCR tests conducted. This gives a proportion of positive tests, which can be used to correct for testing effort.
data("england_covid_positivity")
An object of class tbl_df (inherits from tbl, data.frame) with 1413 rows and 6 columns.
england_covid_positivity dataframe with 2048 rows and 6 columns
name (chr) The region name
code (chr) The region code
codeType (chr) The ONS geographical region code type (including year)
date (date) The date
count (dbl) the count of PCR test positives
denom (dbl) the total count of PCR tests conducted on that day
https://ukhsa-dashboard.data.gov.uk/covid-19-archive-data-download
Originally licensed under the Open Government Licence v3.0
dplyr::glimpse(england_covid_positivity)
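Test positivity is simply `count / denom`. As an illustration, the sketch below also attaches a Wilson score interval to each proportion. The choice of the Wilson interval is an assumption made for this example rather than something the package prescribes, and the helper `wilson_interval` and the toy numbers are illustrative.

```r
# Wilson score interval for a binomial proportion.
# (Hypothetical helper, not a function exported by the package.)
wilson_interval <- function(count, denom, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  p <- count / denom
  centre <- (p + z^2 / (2 * denom)) / (1 + z^2 / denom)
  half <- z * sqrt(p * (1 - p) / denom + z^2 / (4 * denom^2)) /
    (1 + z^2 / denom)
  data.frame(positivity = p, lower = centre - half, upper = centre + half)
}

# Toy counts of positives and tests; the values are made up for illustration.
toy_tests <- data.frame(count = c(50, 5), denom = c(1000, 100))
positivity <- cbind(
  toy_tests,
  wilson_interval(toy_tests$count, toy_tests$denom)
)
print(positivity)
```

Both toy rows have the same positivity of 0.05, but the interval is wider for the row with fewer tests, which shows how the denominator carries information about testing effort.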
Ganyani T, Kremer C, Chen D, Torneri A, Faes C, Wallinga J, Hens N. Estimating the generation interval for coronavirus disease (COVID-19) based on symptom onset data, March 2020. Euro Surveill. 2020 Apr;25(17):2000257. doi: 10.2807/1560-7917.ES.2020.25.17.2000257. PMID: 32372755; PMCID: PMC7201952.
data("ganyani_clusters")
An object of class tbl_df (inherits from tbl, data.frame) with 196 rows and 6 columns.
Original article licensed under Creative Commons 4.0. Data was cleansed and formatted for R.
ganyani_clusters dataframe with 196 rows and 6 columns
id (dbl) a unique id for a person (unique within the source)
contacts (list dbl) list of known contacts in the cluster
cluster_id (dbl) id of a cluster (unique within the source)
symptom_onset (date) symptom onset date
known_primary_case (lgl) flag if this person is known to be the primary case in the cluster
source (chr) geographical source of the data
https://github.com/cecilekremer/COVID19
dplyr::glimpse(ganyani_clusters)
Geographic codes and names from the ONS for administrative regions of the UK relevant to the UK's COVID-19 response. There are multiple entries for lower tier local authority codes as these changed during the course of the pandemic.
data("geography")
An object of class tbl_df (inherits from tbl, data.frame) with 1512 rows and 3 columns.
geography dataframe with 1512 rows and 3 columns
name (chr) The region name
code (chr) The region code
codeType (chr) The ONS geographical region code type (including year)
https://geoportal.statistics.gov.uk/
Originally licensed under the Open Government Licence v3.0
dplyr::glimpse(geography)
A dataset of the daily count of COVID-19 cases by Lower tier local authority
in the UK downloaded from the UKHSA coronavirus API, and formatted for
use in ggoutbreak.
data("ltla_cases")
An object of class tbl_df (inherits from tbl, data.frame) with 512050 rows and 6 columns.
ltla_cases dataframe with 512050 rows and 6 columns
name (chr) The region name
code (chr) The region code
codeType (chr) The ONS geographical region code type (including year)
date (date) The date
count (dbl) the test positives for each LTLA
population (dbl) the population size for this geography
https://ukhsa-dashboard.data.gov.uk/covid-19-archive-data-download
Originally licensed under the Open Government Licence v3.0
dplyr::glimpse(ltla_cases)
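Daily counts at local authority level are noisy and depend on population size, so a typical first transformation is a population-scaled, smoothed incidence. The sketch below computes cases per 100,000 and a trailing 7-day mean within each area using base R. The helper `rolling_incidence`, the area code and the toy counts are illustrative, and the sketch assumes one row per area per day with at least `window` days for every area.

```r
# Toy data in the documented shape; the values are made up for illustration.
toy_ltla <- data.frame(
  code = "E06000001",
  date = as.Date("2021-01-01") + 0:7,
  count = c(7, 7, 7, 7, 7, 7, 7, 14),
  population = 100000,
  stringsAsFactors = FALSE
)

# Cases per 100,000 and a trailing rolling mean computed separately per area.
# (Hypothetical helper, not a function exported by the package.)
rolling_incidence <- function(df, window = 7) {
  df <- df[order(df$code, df$date), ]
  df$per_100k <- 1e5 * df$count / df$population
  df$rolling_per_100k <- ave(df$per_100k, df$code, FUN = function(x) {
    # sides = 1 uses only the current and previous days (a trailing window)
    as.numeric(stats::filter(x, rep(1 / window, window), sides = 1))
  })
  df
}

smoothed <- rolling_incidence(toy_ltla)
print(smoothed)
```

The first `window - 1` days of each area have no complete window and are returned as `NA`.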
Summary data collected as part of the NHS digital contact tracing app monitoring. This describes the number of alerts issued, and venue "check-ins".
data("nhs_app")
An object of class tbl_df (inherits from tbl, data.frame) with 137 rows and 3 columns.
date (date) The date
alerts (int) Number of alerts
visits (int) Number of check-ins
https://www.gov.uk/government/publications/nhs-covid-19-app-statistics
Originally licensed under the Open Government Licence v3.0
dplyr::glimpse(nhs_app)
The COVID-19 ONS infection survey took a random sample of the population and provides an estimate of the prevalence of COVID-19 that is theoretically free from ascertainment bias.
data("ons_infection_survey")
An object of class grouped_df (inherits from tbl_df, tbl, data.frame) with 9820 rows and 8 columns.
code (chr) The ONS geographical region code
codeType (chr) The type of ONS geographical code
name (chr) The ONS geographical region name
date (date) A date
prevalence.0.5 (dbl) the median proportion of people in the region testing positive for COVID-19
prevalence.0.025 (dbl) the lower CI of the proportion of people in the region testing positive for COVID-19
prevalence.0.975 (dbl) the upper CI of the proportion of people in the region testing positive for COVID-19
denom (int) the sample size on which this estimate was made (daily rate inferred from weekly sample sizes.)
Originally licensed under the Open Government Licence v3.0
dplyr::glimpse(ons_infection_survey)
Rachelle N Binny, Patricia Priest, Nigel P French, Matthew Parry, Audrey Lustig, Shaun C Hendy, Oliver J Maclaren, Kannan M Ridings, Nicholas Steyn, Giorgia Vattiato, Michael J Plank, Sensitivity of Reverse Transcription Polymerase Chain Reaction Tests for Severe Acute Respiratory Syndrome Coronavirus 2 Through Time, The Journal of Infectious Diseases, Volume 227, Issue 1, 1 January 2023, Pages 9–17, https://doi.org/10.1093/infdis/jiac317
data("pcr_test_sensitivity")
An object of class list of length 2.
pcr_test_sensitivity named list with 2 items
modelled (df modelled*) Original data from supplementary
resampled (df resampled*) resampled description
df modelled dataframe with 501 rows and 4 columns
Model output
days_since_infection (dbl) days since infection
median (dbl) median sensitivity
lower_95 (dbl) lower 95% CI of sensitivity
upper_95 (dbl) upper 95% CI of sensitivity
df resampled dataframe with 5100 rows and 3 columns
tau (dbl) days since infection
probability (dbl) the sensitivity as a probability of detection
boot (int) a bootstrap identifier
https://pmc.ncbi.nlm.nih.gov/articles/instance/9796165/bin/jiac317_supplementary_data.zip
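The `resampled` table holds one sensitivity curve per bootstrap replicate, so it can be collapsed back to a median and interval per day. The sketch below shows one way to do that in base R, assuming the documented `tau`, `probability` and `boot` columns. The helper `summarise_boot` and the toy values are illustrative, and the output is not claimed to reproduce the `modelled` table.

```r
# Toy data in the documented shape; the values are made up for illustration.
toy_resampled <- data.frame(
  tau = rep(c(0, 1), each = 5),
  probability = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0),
  boot = rep(1:5, times = 2)
)

# Median and 2.5% / 97.5% quantiles of the bootstrap replicates at each tau.
# (Hypothetical helper, not a function exported by the package.)
summarise_boot <- function(df) {
  q <- aggregate(
    probability ~ tau,
    data = df,
    FUN = function(p) quantile(p, c(0.025, 0.5, 0.975))
  )
  # aggregate() returns the three quantiles as a matrix column; unpack it
  out <- data.frame(tau = q$tau, q$probability)
  names(out) <- c("tau", "lower_95", "median", "upper_95")
  out
}

sensitivity_summary <- summarise_boot(toy_resampled)
print(sensitivity_summary)
```

With the real data this could be called as `summarise_boot(pcr_test_sensitivity$resampled)`, provided the columns match the description above.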
A set of consensus estimates for the reproduction number and growth rate of the COVID-19 epidemic in England
data("spim_consensus")
An object of class tbl_df (inherits from tbl, data.frame) with 113 rows and 5 columns.
spim_consensus_rt dataframe with 113 rows and 5 columns
date (date) the date
rt.low (dbl) the lower estimate of the reproduction number
rt.high (dbl) the upper estimate of the reproduction number
growth.low (dbl) the lower estimate of the exponential growth rate
growth.high (dbl) the higher estimate of the exponential growth rate
https://www.gov.uk/guidance/the-r-value-and-growth-rate
Originally licensed under the Open Government Licence v3.0
dplyr::glimpse(spim_consensus)
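An exponential growth rate $r$ corresponds to a doubling time of $\ln(2)/r$, and a negative rate gives a negative value whose magnitude is the halving time. The sketch below applies this to columns named like `growth.low` and `growth.high`. It assumes the rates are expressed per day as a fraction (0.01 meaning 1% per day), which the description above does not state, so the units should be checked before use. The helper `doubling_time` and the toy values are illustrative.

```r
# Doubling time implied by an exponential growth rate r (per day, as a fraction).
# Negative results are halving times. (Hypothetical helper.)
doubling_time <- function(r) log(2) / r

# Toy growth rate ranges; the values are made up for illustration.
toy_spim <- data.frame(
  growth.low = c(0.01, -0.03),
  growth.high = c(0.03, -0.01)
)

# The faster-growing bound gives the shorter doubling time.
toy_spim$doubling.at.low <- doubling_time(toy_spim$growth.low)
toy_spim$doubling.at.high <- doubling_time(toy_spim$growth.high)

print(toy_spim)
```

When a range spans zero, one bound is a doubling time and the other a halving time, so the two values do not form a conventional interval.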
Major events in the UK COVID-19 pandemic, limited to lockdowns, vaccination rollout and first identification of major variants.
data("timeline")
An object of class tbl_df (inherits from tbl, data.frame) with 19 rows and 3 columns.
label (chr) The event
start (date) The start date
end (date) The end date if a period
https://en.wikipedia.org/wiki/Timeline_of_the_COVID-19_pandemic_in_the_United_Kingdom
dplyr::glimpse(timeline)
ONS National and subnational mid-year population estimates for the UK and its constituent countries by administrative area, age and sex (including components of population change, median age and population density).
data("uk_population_2019")
An object of class tbl_df (inherits from tbl, data.frame) with 398 rows and 4 columns.
Mid-2019: April 2019 local authority district codes edition of this dataset. This is UK wide and covers country, regions and LTLA (2019 boundaries)
uk_population_2019 dataframe with 398 rows and 4 columns
name (chr) The region name
code (chr) The region code
codeType (chr) The ONS geographical region code type (including year)
population (dbl) the count of the population in that region
https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates
Originally licensed under the Open Government Licence v3.0
dplyr::glimpse(uk_population_2019)
ONS National and subnational mid-year population estimates for the UK and its constituent countries by administrative area, age and sex (including components of population change, median age and population density).
data("uk_population_2019_by_10yr_age")
An object of class grouped_df (inherits from tbl_df, tbl, data.frame) with 3980 rows and 6 columns.
Mid-2019: April 2019 local authority district codes edition of this dataset. This is UK wide and covers country, regions and LTLA (2019 boundaries)
Stratified by 10 year age groups
uk_population_2019_by_10yr_age dataframe with 3980 rows and 6 columns
name (chr) The region name
code (chr) The region code
codeType (chr) The ONS geographical region code type (including year)
class (chr) The age group in 10 year age bands
population (dbl) the count of the population in that age group
baseline_proportion (dbl) the proportion of the total regional population that is in an age group
https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates
Originally licensed under the Open Government Licence v3.0
dplyr::glimpse(uk_population_2019_by_10yr_age)
ONS National and subnational mid-year population estimates for the UK and its constituent countries by administrative area, age and sex (including components of population change, median age and population density).
data("uk_population_2019_by_5yr_age")
An object of class grouped_df (inherits from tbl_df, tbl, data.frame) with 7562 rows and 6 columns.
Mid-2019: April 2019 local authority district codes edition of this dataset. This is UK wide and covers country, regions and LTLA (2019 boundaries)
Stratified by 5 year age groups
uk_population_2019_by_5yr_age dataframe with 7562 rows and 6 columns
name (chr) The region name
code (chr) The region code
codeType (chr) The ONS geographical region code type (including year)
class (chr) The age group in 5 year age bands
population (dbl) the count of the population in that age group
baseline_proportion (dbl) the proportion of the total regional population that is in an age group
https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates
Originally licensed under the Open Government Licence v3.0
dplyr::glimpse(uk_population_2019_by_5yr_age)
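The `baseline_proportion` column is described as each age group's share of the total regional population. The sketch below shows how such a share can be computed from `population` and `code` alone, which may be useful as a consistency check or when re-aggregating age bands. The toy region codes, age band labels and values are made up, and the derived column name `proportion` is not part of the package.

```r
# Toy data in the documented shape; codes, labels and values are made up.
toy_pop <- data.frame(
  code = c("A", "A", "B", "B"),
  class = c("band_1", "band_2", "band_1", "band_2"),
  population = c(10, 30, 50, 50),
  stringsAsFactors = FALSE
)

# Share of each region's total population that falls in each age band.
region_total <- ave(toy_pop$population, toy_pop$code, FUN = sum)
toy_pop$proportion <- toy_pop$population / region_total

print(toy_pop)
```

By construction the shares within each region sum to one.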
van Kampen, J.J.A., van de Vijver, D.A.M.C., Fraaij, P.L.A. et al. Duration and key determinants of infectious virus shedding in hospitalized patients with coronavirus disease-2019 (COVID-19). Nat Commun 12, 267 (2021). https://doi.org/10.1038/s41467-020-20568-4
data("viral_shedding")
An object of class list of length 2.
viral_shedding named list with 2 items
original (df original*) original description
resampled (df resampled*) resampled description
df original dataframe with 690 rows and 4 columns
duration of symptoms in days (dbl) duration of symptoms in days
RNA copies per mL (chr) RNA copies per mL
PRNT titer (chr) PRNT titer
virus culture result (chr) virus culture result
df resampled dataframe with 2600 rows and 3 columns
tau (int) time from symptom onset to measurement
probability (dbl) probability of detected viral excretion
boot (int) a bootstrap identifier