library(thekidsbiostats) # install with install.packages("thekidsbiostats", repos = "the-kids-biostats.r-universe.dev")

Overview

Datasets rarely come with clean and consistent formatting in their column names. Therefore, it is common practice when performing analysis to rename the original column names to coding-friendly names (i.e., unique, lower-case, no spaces or special characters, consistent prefixes/suffixes).

Alongside these coding-friendly columns, it is then also useful to store (or link) label attributes, which provide clean, more descriptive or verbose labels for each coding-friendly variable name. These can be called upon when we are producing output for presentation, such as complete analytical reports, or exported tables and figures.

While a variety of functions for renaming columns (e.g., dplyr::rename()) and applying labels (e.g., using the labelled or Hmisc packages) exist, a reasonable amount of leg work is needed to use such functions across an entire dataset.

This post will demonstrate how two new functions within thekidsbiostats package can be used to streamline and expedite the process of standardising and labelling columns. Specifically, let’s introduce the following functions:

make_column_dict() — create a copy-pasteable column name dictionary template.
update_columns() — renames and labels columns using a column name dictionary.

Motivating Example

As an example, let’s explore the data_patient dataset, an example dataset shipped with the package:

data(data_patient, package = "thekidsbiostats")

Previewing this dataset:

Patient ID#	DOB [YYYY-MM-DD]	Sex (0=Male, 1=Female)	WHAT IS YOUR HEIGHT? (cm)	WHAT IS YOUR CURRENT WEIGHT? (kg)	Do you currently smoke any form of tobacco products, including cigarettes, cigars, or pipes, on a regular basis?	bp (mmHg)	Cholesterol / mmolL
101	1980‑05‑12	0	175	70	true	120/80	5.2
102	1992‑08‑03	1	160	55	false	110/70	4.8
103	1975‑12‑25	1	180	80	false	130/85	6.1
104	2000‑01‑01	0	165	60	true	115/75	5.0
105	1985‑07‑07	1	170	75	false	125/82	5.5

Immediately, we see that the column headers are not formatted in a very nice way, and especially not for coding purposes! The column headers contain (any combination of):

Categorical variable coding,
Full questions,
Spacing,
Inconsistent letter capitalisation,
Punctuation (parentheses, equals signs, questions marks, dashes, slashes).

The spaces and punctuation in these column names require us to use backticks (``) to avoid R interpreting these symbols literally.

Beyond visually looking messy within final output, this set up can also be messy within your code, unnecessarily making simple operations more difficult to follow. For example, when forced to use the original column names, let’s see what implementing two trivial manipulations/changes to the data looks like in the code:

i) Calculate BMI
ii) Re-code Sex variable

data_patient$bmi <- data_patient$`WHAT IS YOUR CURRENT WEIGHT? (kg)` / (data_patient$`WHAT IS YOUR HEIGHT? (cm)` / 100.0)^2

Recode the Sex column by replacing 0/1 with Male/Female, as defined in the existing column header.

data_patient$`Sex (0=Male, 1=Female)` = case_when(
  data_patient$`Sex (0=Male, 1=Female)` == 0 ~ "Male",
  data_patient$`Sex (0=Male, 1=Female)` == 1 ~ "Female",
  TRUE ~ NA_character_
)

Clearly, coding each of these changes is more cumbersome and time intensive than it needs to be, and the readability of our code is impacted.

Functions such as janitor::clean_names() can be very useful for standardising such column names into coding-friendly names. However, this approach is still imperfect, and the resulting names may remain just as problematic (as seen below).

colnames(janitor::clean_names(data_patient[6]))

[1] "do_you_currently_smoke_any_form_of_tobacco_products_including_cigarettes_cigars_or_pipes_on_a_regular_basis"

The “cleaned” version of this column name simply omits spaces and punctuation, however, it is still far too long.

Our Solution

Ideally, we want to simultaneously

rename the columns, and
retain human-readable labels for presentation/reporting.

Therefore, our solution is to set up a dictionary that maps the old column names to newly-formatted column names as well as apply label attributes.

Step 1: Create Data Dictionary

The make_column_dict() function facilitates this process by giving you a ready-to-fill template based on the original column names in the data.

For example:

dict <- make_column_dict(data = data_patient, auto_clean = TRUE, quiet = FALSE)

dict_patient <- tribble(
  ~old, ~new, ~label,
  'Patient ID#', 'patient_id_number', '',
  'DOB [YYYY-MM-DD]', 'dob_yyyy_mm_dd', '',
  'Sex (0=Male, 1=Female)', 'sex_0_male_1_female', '',
  'WHAT IS YOUR HEIGHT? (cm)', 'what_is_your_height_cm', '',
  'WHAT IS YOUR CURRENT WEIGHT? (kg)', 'what_is_your_current_weight_kg', '',
  'Do you currently smoke any form of tobacco products, including cigarettes, cigars, or pipes, on a regular basis?', 'do_you_currently_smoke_any_form_of_tobacco_products_including_cigarettes_cigars_or_pipes_on_a_regular_basis', '',
  'bp (mmHg)', 'bp_mm_hg', '',
  'Cholesterol / mmolL', 'cholesterol_mmol_l', '',
  'bmi', 'bmi', '',
)

The output of make_column_dict is a tribble — a row-wise data-frame object (or tibble) which allows one to more easily visualise the format of the table.

We can either directly add to, or modify, the dict object assigned above or copy-paste the output into our code and add/edit/update column names and labels as needed (see below):

dict_patient <- tribble(
  ~old, ~new, ~label,
  'Patient ID#', 'id', 'Patient ID',
  'DOB [YYYY-MM-DD]', 'dob', 'Date of Birth',
  'Sex (0=Male, 1=Female)', 'sex', 'Sex',
  'WHAT IS YOUR HEIGHT? (cm)', 'height_cm', 'Height (cm)',
  'WHAT IS YOUR CURRENT WEIGHT? (kg)', 'weight_kg', 'Weight (kg)',
  'Do you currently smoke any form of tobacco products, including cigarettes, cigars, or pipes, on a regular basis?', 'current_smoking', 'Currently consumes tobacco',
  'bp (mmHg)', 'bp_mmhg', 'BP (mmHg)',
  'Cholesterol / mmolL', 'cholesterol_mmol_l', 'Cholesterol (mmolL)',
  'bmi', 'bmi', "BMI (kg/m^2)"
)

This tribble has been manually modified with “clean” column names and variable labels. Additionally, a row for the derived bmi variable has been included.

Important

Naively renaming columns for easy manipulation in code risks dropping potentially useful information contained in the original column name (e.g., units and symbols). It is important to have a data dictionary to retain this information about variables, which is distinct from the column name dictionary we’re creating here. For example, neither our new or label value for the sex variable above captures the coding.

As the number of columns in your dataset increases, editing/maintaining this within your code may become increasingly unwieldy. The option is also available to manage this via an external file (e.g., .csv, .rds, .xlsx), by specifying file:

make_column_dict(data = data_patient,
                 auto_clean = TRUE,
                 file = "colnames.csv")

If need be, we could read the dictionary back into R after having modified (in an external program) the column names and labels as necessary:

dict_patient <- readr::read_csv(file = "colnames.csv")

old	new	label
Patient ID#	id	Patient ID
DOB [YYYY‑MM‑DD]	dob	Date of Birth
Sex (0=Male, 1=Female)	sex	Sex
WHAT IS YOUR HEIGHT? (cm)	height_cm	Height (cm)
WHAT IS YOUR CURRENT WEIGHT? (kg)	weight_kg	Weight (kg)
Do you currently smoke any form of tobacco products, including cigarettes, cigars, or pipes, on a regular basis?	current_smoking	Currently consumes tobacco
bp (mmHg)	bp_mmhg	BP (mmHg)
Cholesterol / mmolL	cholesterol_mmol_l	Cholesterol (mmolL)
bmi	bmi	BMI (kg/m^2)

Step 2: Applying the column dictionary

Now that we have defined the column dictionary above, we can simply use update_columns() to apply these changes in a single line.

data_clean <- update_columns(
  data_patient, 
  dict = dict_patient
)

The update_columns() function includes arguments old, new, label and these must correspond to the column names of the data dictionary object. By default, the names assigned by make_column_dict() are assumed. It is unlikely you will have to modify these arguments.

Previewing Output

Previewing this “cleaned” dataset:

id	dob	sex	height_cm	weight_kg	current_smoking	bp_mmhg	cholesterol_mmol_l	bmi
101	1980‑05‑12	Male	175	70	true	120/80	5.2	22.85714
102	1992‑08‑03	Female	160	55	false	110/70	4.8	21.48437
103	1975‑12‑25	Female	180	80	false	130/85	6.1	24.69136
104	2000‑01‑01	Male	165	60	true	115/75	5.0	22.03857
105	1985‑07‑07	Female	170	75	false	125/82	5.5	25.95156

Because the labels we specified are stored as attributes, they are automatically recognised by many summary packages that respect label metadata (e.g., gtsummary::tbl_summary() tables).

So after the steps above, we no longer have this…

data_patient %>%
  tbl_summary(include = -`Patient ID#`,
              type = list(`WHAT IS YOUR HEIGHT? (cm)` ~ "continuous",
                          `WHAT IS YOUR CURRENT WEIGHT? (kg)` ~ "continuous",
                          `Cholesterol / mmolL` ~ "continuous",
                          `bmi` ~ "continuous")) %>%
  thekids_table(padding=2) |>
  modify_labels(hline=TRUE, bold=TRUE)

Characteristic	N = 51
DOB [YYYY-MM-DD]
1975-12-25	1 (20%)
1980-05-12	1 (20%)
1985-07-07	1 (20%)
1992-08-03	1 (20%)
2000-01-01	1 (20%)
Sex (0=Male, 1=Female)
Female	3 (60%)
Male	2 (40%)
WHAT IS YOUR HEIGHT? (cm)	170.0 (165.0, 175.0)
WHAT IS YOUR CURRENT WEIGHT? (kg)	70 (60, 75)
Do you currently smoke any form of tobacco products, including cigarettes, cigars, or pipes, on a regular basis?	2 (40%)
bp (mmHg)
110/70	1 (20%)
115/75	1 (20%)
120/80	1 (20%)
125/82	1 (20%)
130/85	1 (20%)
Cholesterol / mmolL	5.20 (5.00, 5.50)
bmi	22.86 (22.04, 24.69)
1n (%); Median (Q1, Q3)

We have this!

data_clean %>%
  tbl_summary(include = -id,
              type = list(height_cm ~ "continuous",
                          weight_kg ~ "continuous",
                          cholesterol_mmol_l ~ "continuous",
                          bmi ~ "continuous")) %>%
  thekids_table(padding=2) |> 
  modify_labels(hline=TRUE, bold=TRUE)

Characteristic	N = 51
Date of Birth
1975-12-25	1 (20%)
1980-05-12	1 (20%)
1985-07-07	1 (20%)
1992-08-03	1 (20%)
2000-01-01	1 (20%)
Sex
Female	3 (60%)
Male	2 (40%)
Height (cm)	170.0 (165.0, 175.0)
Weight (kg)	70 (60, 75)
Currently consumes tobacco	2 (40%)
BP (mmHg)
110/70	1 (20%)
115/75	1 (20%)
120/80	1 (20%)
125/82	1 (20%)
130/85	1 (20%)
Cholesterol (mmolL)	5.20 (5.00, 5.50)
BMI (kg/m^2)	22.86 (22.04, 24.69)
1n (%); Median (Q1, Q3)

Closing comments

Overall, the make_column_dict() and update_columns() functions offer an efficient solution for transforming and labelling datasets with problematic column headers. More concise column names are easier to work with and enhance the readability/tractability of code, while user-defined labels allow users to easily produce more polished and publication-ready outputs.

Acknowledgements

Thanks to Dr Matt Cooper, Wesley Billingham, Dr Haileab Wolde, and Dr Elizabeth McKinnon for providing feedback on and reviewing this post.

Reproducibility Information

To access the .qmd (Quarto markdown) files as well as any R scripts or data that was used in this post, please visit our GitHub:

https://github.com/The-Kids-Biostats/The-Kids-Biostats.github.io/tree/main/posts/

The session information can also be seen below.

Code

sessionInfo()

R version 4.5.2 (2025-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Tahoe 26.2

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Australia/Perth
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] thekidsbiostats_1.8.0 extrafont_0.20        flextable_0.9.10     
 [4] gtsummary_2.5.0       lubridate_1.9.4       forcats_1.0.1        
 [7] stringr_1.6.0         dplyr_1.2.0           purrr_1.2.1          
[10] readr_2.1.6           tidyr_1.3.2           tibble_3.3.1         
[13] ggplot2_4.0.2         tidyverse_2.0.0      

loaded via a namespace (and not attached):
 [1] gtable_0.3.6            xfun_0.56               htmlwidgets_1.6.4      
 [4] tzdb_0.5.0              vctrs_0.7.1             tools_4.5.2            
 [7] generics_0.1.4          parallel_4.5.2          pkgconfig_2.0.3        
[10] data.table_1.18.2.1     RColorBrewer_1.1-3      S7_0.2.1               
[13] uuid_1.2-2              lifecycle_1.0.5         compiler_4.5.2         
[16] farver_2.1.2            textshaping_1.0.4       janitor_2.2.1          
[19] snakecase_0.11.1        httpuv_1.6.16           fontquiver_0.2.1       
[22] fontLiberation_0.1.0    htmltools_0.5.9         yaml_2.3.12            
[25] Rttf2pt1_1.3.14         crayon_1.5.3            extrafontdb_1.1        
[28] pillar_1.11.1           later_1.4.5             openssl_2.3.4          
[31] mime_0.13               fontBitstreamVera_0.1.1 tidyselect_1.2.1       
[34] zip_2.3.3               digest_0.6.39           stringi_1.8.7          
[37] labelled_2.16.0         fastmap_1.2.0           grid_4.5.2             
[40] cli_3.6.5               magrittr_2.0.4          cards_0.7.1            
[43] patchwork_1.3.2         withr_3.0.2             gdtools_0.4.4          
[46] scales_1.4.0            promises_1.5.0          bit64_4.6.0-1          
[49] timechange_0.4.0        rmarkdown_2.30          officer_0.7.3          
[52] bit_4.6.0               otel_0.2.0              hms_1.1.4              
[55] askpass_1.2.1           ragg_1.5.0              shiny_1.12.1           
[58] evaluate_1.0.5          haven_2.5.5             knitr_1.51             
[61] rlang_1.1.7             Rcpp_1.1.1              xtable_1.8-4           
[64] glue_1.8.0              xml2_1.5.2              vroom_1.7.0            
[67] rstudioapi_0.18.0       jsonlite_2.0.0          R6_2.6.1               
[70] systemfonts_1.3.1       fs_1.6.6                shinyFiles_0.9.3