library(thekidsbiostats) # install with install.packages("thekidsbiostats", repos = "the-kids-biostats.r-universe.dev")Overview
Datasets rarely come with clean and consistent formatting in their column names. Therefore, it is common practice when performing analysis to rename the original column names to coding-friendly names (i.e., unique, lower-case, no spaces or special characters, consistent prefixes/suffixes).
Alongside these coding-friendly columns, it is then also useful to store (or link) label attributes, which provide clean, more descriptive or verbose labels for each coding-friendly variable name. These can be called upon when we are producing output for presentation, such as complete analytical reports, or exported tables and figures.
While a variety of functions for renaming columns (e.g., dplyr::rename()) and applying labels (e.g., using the labelled or Hmisc packages) exist, a reasonable amount of leg work is needed to use such functions across an entire dataset.
This post will demonstrate how two new functions within thekidsbiostats package can be used to streamline and expedite the process of standardising and labelling columns. Specifically, let’s introduce the following functions:
make_column_dict()— create a copy-pasteable column name dictionary template.update_columns()— renames and labels columns using a column name dictionary.
Motivating Example
As an example, let’s explore the data_patient dataset, an example dataset shipped with the package:
data(data_patient, package = "thekidsbiostats")Previewing this dataset:
Patient ID# | DOB [YYYY-MM-DD] | Sex (0=Male, 1=Female) | WHAT IS YOUR HEIGHT? (cm) | WHAT IS YOUR CURRENT WEIGHT? (kg) | Do you currently smoke any form of tobacco products, including cigarettes, cigars, or pipes, on a regular basis? | bp (mmHg) | Cholesterol / mmolL |
|---|---|---|---|---|---|---|---|
101 | 1980‑05‑12 | 0 | 175 | 70 | true | 120/80 | 5.2 |
102 | 1992‑08‑03 | 1 | 160 | 55 | false | 110/70 | 4.8 |
103 | 1975‑12‑25 | 1 | 180 | 80 | false | 130/85 | 6.1 |
104 | 2000‑01‑01 | 0 | 165 | 60 | true | 115/75 | 5.0 |
105 | 1985‑07‑07 | 1 | 170 | 75 | false | 125/82 | 5.5 |
Immediately, we see that the column headers are not formatted in a very nice way, and especially not for coding purposes! The column headers contain (any combination of):
- Categorical variable coding,
- Full questions,
- Spacing,
- Inconsistent letter capitalisation,
- Punctuation (parentheses, equals signs, questions marks, dashes, slashes).
The spaces and punctuation in these column names require us to use backticks (``) to avoid R interpreting these symbols literally.
Beyond visually looking messy within final output, this set up can also be messy within your code, unnecessarily making simple operations more difficult to follow. For example, when forced to use the original column names, let’s see what implementing two trivial manipulations/changes to the data looks like in the code:
data_patient$bmi <- data_patient$`WHAT IS YOUR CURRENT WEIGHT? (kg)` / (data_patient$`WHAT IS YOUR HEIGHT? (cm)` / 100.0)^2Recode the Sex column by replacing 0/1 with Male/Female, as defined in the existing column header.
data_patient$`Sex (0=Male, 1=Female)` = case_when(
data_patient$`Sex (0=Male, 1=Female)` == 0 ~ "Male",
data_patient$`Sex (0=Male, 1=Female)` == 1 ~ "Female",
TRUE ~ NA_character_
)Clearly, coding each of these changes is more cumbersome and time intensive than it needs to be, and the readability of our code is impacted.
Functions such as janitor::clean_names() can be very useful for standardising such column names into coding-friendly names. However, this approach is still imperfect, and the resulting names may remain just as problematic (as seen below).
colnames(janitor::clean_names(data_patient[6]))[1] "do_you_currently_smoke_any_form_of_tobacco_products_including_cigarettes_cigars_or_pipes_on_a_regular_basis"
The “cleaned” version of this column name simply omits spaces and punctuation, however, it is still far too long.
Our Solution
Ideally, we want to simultaneously
- rename the columns, and
- retain human-readable labels for presentation/reporting.
Therefore, our solution is to set up a dictionary that maps the old column names to newly-formatted column names as well as apply label attributes.
Step 1: Create Data Dictionary
The make_column_dict() function facilitates this process by giving you a ready-to-fill template based on the original column names in the data.
For example:
dict <- make_column_dict(data = data_patient, auto_clean = TRUE, quiet = FALSE)dict_patient <- tribble(
~old, ~new, ~label,
'Patient ID#', 'patient_id_number', '',
'DOB [YYYY-MM-DD]', 'dob_yyyy_mm_dd', '',
'Sex (0=Male, 1=Female)', 'sex_0_male_1_female', '',
'WHAT IS YOUR HEIGHT? (cm)', 'what_is_your_height_cm', '',
'WHAT IS YOUR CURRENT WEIGHT? (kg)', 'what_is_your_current_weight_kg', '',
'Do you currently smoke any form of tobacco products, including cigarettes, cigars, or pipes, on a regular basis?', 'do_you_currently_smoke_any_form_of_tobacco_products_including_cigarettes_cigars_or_pipes_on_a_regular_basis', '',
'bp (mmHg)', 'bp_mm_hg', '',
'Cholesterol / mmolL', 'cholesterol_mmol_l', '',
'bmi', 'bmi', '',
)
The output of make_column_dict is a tribble — a row-wise data-frame object (or tibble) which allows one to more easily visualise the format of the table.
We can either directly add to, or modify, the dict object assigned above or copy-paste the output into our code and add/edit/update column names and labels as needed (see below):
dict_patient <- tribble(
~old, ~new, ~label,
'Patient ID#', 'id', 'Patient ID',
'DOB [YYYY-MM-DD]', 'dob', 'Date of Birth',
'Sex (0=Male, 1=Female)', 'sex', 'Sex',
'WHAT IS YOUR HEIGHT? (cm)', 'height_cm', 'Height (cm)',
'WHAT IS YOUR CURRENT WEIGHT? (kg)', 'weight_kg', 'Weight (kg)',
'Do you currently smoke any form of tobacco products, including cigarettes, cigars, or pipes, on a regular basis?', 'current_smoking', 'Currently consumes tobacco',
'bp (mmHg)', 'bp_mmhg', 'BP (mmHg)',
'Cholesterol / mmolL', 'cholesterol_mmol_l', 'Cholesterol (mmolL)',
'bmi', 'bmi', "BMI (kg/m^2)"
)This tribble has been manually modified with “clean” column names and variable labels. Additionally, a row for the derived bmi variable has been included.
As the number of columns in your dataset increases, editing/maintaining this within your code may become increasingly unwieldy. The option is also available to manage this via an external file (e.g., .csv, .rds, .xlsx), by specifying file:
make_column_dict(data = data_patient,
auto_clean = TRUE,
file = "colnames.csv")If need be, we could read the dictionary back into R after having modified (in an external program) the column names and labels as necessary:
dict_patient <- readr::read_csv(file = "colnames.csv")old | new | label |
|---|---|---|
Patient ID# | id | Patient ID |
DOB [YYYY‑MM‑DD] | dob | Date of Birth |
Sex (0=Male, 1=Female) | sex | Sex |
WHAT IS YOUR HEIGHT? (cm) | height_cm | Height (cm) |
WHAT IS YOUR CURRENT WEIGHT? (kg) | weight_kg | Weight (kg) |
Do you currently smoke any form of tobacco products, including cigarettes, cigars, or pipes, on a regular basis? | current_smoking | Currently consumes tobacco |
bp (mmHg) | bp_mmhg | BP (mmHg) |
Cholesterol / mmolL | cholesterol_mmol_l | Cholesterol (mmolL) |
bmi | bmi | BMI (kg/m^2) |
Step 2: Applying the column dictionary
Now that we have defined the column dictionary above, we can simply use update_columns() to apply these changes in a single line.
data_clean <- update_columns(
data_patient,
dict = dict_patient
)The update_columns() function includes arguments old, new, label and these must correspond to the column names of the data dictionary object. By default, the names assigned by make_column_dict() are assumed. It is unlikely you will have to modify these arguments.
Previewing Output
Previewing this “cleaned” dataset:
id | dob | sex | height_cm | weight_kg | current_smoking | bp_mmhg | cholesterol_mmol_l | bmi |
|---|---|---|---|---|---|---|---|---|
101 | 1980‑05‑12 | Male | 175 | 70 | true | 120/80 | 5.2 | 22.85714 |
102 | 1992‑08‑03 | Female | 160 | 55 | false | 110/70 | 4.8 | 21.48437 |
103 | 1975‑12‑25 | Female | 180 | 80 | false | 130/85 | 6.1 | 24.69136 |
104 | 2000‑01‑01 | Male | 165 | 60 | true | 115/75 | 5.0 | 22.03857 |
105 | 1985‑07‑07 | Female | 170 | 75 | false | 125/82 | 5.5 | 25.95156 |
Because the labels we specified are stored as attributes, they are automatically recognised by many summary packages that respect label metadata (e.g., gtsummary::tbl_summary() tables).
So after the steps above, we no longer have this…
data_patient %>%
tbl_summary(include = -`Patient ID#`,
type = list(`WHAT IS YOUR HEIGHT? (cm)` ~ "continuous",
`WHAT IS YOUR CURRENT WEIGHT? (kg)` ~ "continuous",
`Cholesterol / mmolL` ~ "continuous",
`bmi` ~ "continuous")) %>%
thekids_table(padding=2) |>
modify_labels(hline=TRUE, bold=TRUE)Characteristic | N = 51 |
|---|---|
DOB [YYYY-MM-DD] | |
1975-12-25 | 1 (20%) |
1980-05-12 | 1 (20%) |
1985-07-07 | 1 (20%) |
1992-08-03 | 1 (20%) |
2000-01-01 | 1 (20%) |
Sex (0=Male, 1=Female) | |
Female | 3 (60%) |
Male | 2 (40%) |
WHAT IS YOUR HEIGHT? (cm) | 170.0 (165.0, 175.0) |
WHAT IS YOUR CURRENT WEIGHT? (kg) | 70 (60, 75) |
Do you currently smoke any form of tobacco products, including cigarettes, cigars, or pipes, on a regular basis? | 2 (40%) |
bp (mmHg) | |
110/70 | 1 (20%) |
115/75 | 1 (20%) |
120/80 | 1 (20%) |
125/82 | 1 (20%) |
130/85 | 1 (20%) |
Cholesterol / mmolL | 5.20 (5.00, 5.50) |
bmi | 22.86 (22.04, 24.69) |
1n (%); Median (Q1, Q3) | |
We have this!
data_clean %>%
tbl_summary(include = -id,
type = list(height_cm ~ "continuous",
weight_kg ~ "continuous",
cholesterol_mmol_l ~ "continuous",
bmi ~ "continuous")) %>%
thekids_table(padding=2) |>
modify_labels(hline=TRUE, bold=TRUE)Characteristic | N = 51 |
|---|---|
Date of Birth | |
1975-12-25 | 1 (20%) |
1980-05-12 | 1 (20%) |
1985-07-07 | 1 (20%) |
1992-08-03 | 1 (20%) |
2000-01-01 | 1 (20%) |
Sex | |
Female | 3 (60%) |
Male | 2 (40%) |
Height (cm) | 170.0 (165.0, 175.0) |
Weight (kg) | 70 (60, 75) |
Currently consumes tobacco | 2 (40%) |
BP (mmHg) | |
110/70 | 1 (20%) |
115/75 | 1 (20%) |
120/80 | 1 (20%) |
125/82 | 1 (20%) |
130/85 | 1 (20%) |
Cholesterol (mmolL) | 5.20 (5.00, 5.50) |
BMI (kg/m^2) | 22.86 (22.04, 24.69) |
1n (%); Median (Q1, Q3) | |
Closing comments
Overall, the make_column_dict() and update_columns() functions offer an efficient solution for transforming and labelling datasets with problematic column headers. More concise column names are easier to work with and enhance the readability/tractability of code, while user-defined labels allow users to easily produce more polished and publication-ready outputs.
Acknowledgements
Thanks to Dr Matt Cooper, Wesley Billingham, Dr Haileab Wolde, and Dr Elizabeth McKinnon for providing feedback on and reviewing this post.
Reproducibility Information
To access the .qmd (Quarto markdown) files as well as any R scripts or data that was used in this post, please visit our GitHub:
https://github.com/The-Kids-Biostats/The-Kids-Biostats.github.io/tree/main/posts/
The session information can also be seen below.
Code
sessionInfo()R version 4.5.2 (2025-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Tahoe 26.2
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Australia/Perth
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] thekidsbiostats_1.8.0 extrafont_0.20 flextable_0.9.10
[4] gtsummary_2.5.0 lubridate_1.9.4 forcats_1.0.1
[7] stringr_1.6.0 dplyr_1.2.0 purrr_1.2.1
[10] readr_2.1.6 tidyr_1.3.2 tibble_3.3.1
[13] ggplot2_4.0.2 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] gtable_0.3.6 xfun_0.56 htmlwidgets_1.6.4
[4] tzdb_0.5.0 vctrs_0.7.1 tools_4.5.2
[7] generics_0.1.4 parallel_4.5.2 pkgconfig_2.0.3
[10] data.table_1.18.2.1 RColorBrewer_1.1-3 S7_0.2.1
[13] uuid_1.2-2 lifecycle_1.0.5 compiler_4.5.2
[16] farver_2.1.2 textshaping_1.0.4 janitor_2.2.1
[19] snakecase_0.11.1 httpuv_1.6.16 fontquiver_0.2.1
[22] fontLiberation_0.1.0 htmltools_0.5.9 yaml_2.3.12
[25] Rttf2pt1_1.3.14 crayon_1.5.3 extrafontdb_1.1
[28] pillar_1.11.1 later_1.4.5 openssl_2.3.4
[31] mime_0.13 fontBitstreamVera_0.1.1 tidyselect_1.2.1
[34] zip_2.3.3 digest_0.6.39 stringi_1.8.7
[37] labelled_2.16.0 fastmap_1.2.0 grid_4.5.2
[40] cli_3.6.5 magrittr_2.0.4 cards_0.7.1
[43] patchwork_1.3.2 withr_3.0.2 gdtools_0.4.4
[46] scales_1.4.0 promises_1.5.0 bit64_4.6.0-1
[49] timechange_0.4.0 rmarkdown_2.30 officer_0.7.3
[52] bit_4.6.0 otel_0.2.0 hms_1.1.4
[55] askpass_1.2.1 ragg_1.5.0 shiny_1.12.1
[58] evaluate_1.0.5 haven_2.5.5 knitr_1.51
[61] rlang_1.1.7 Rcpp_1.1.1 xtable_1.8-4
[64] glue_1.8.0 xml2_1.5.2 vroom_1.7.0
[67] rstudioapi_0.18.0 jsonlite_2.0.0 R6_2.6.1
[70] systemfonts_1.3.1 fs_1.6.6 shinyFiles_0.9.3