Efficiently Cleaning and Labelling Data Columns with thekidsbiostats

R
Data cleaning
Package
Author

Zac Dempsey, Dr Robin van de Meeberg

Published

February 19, 2026

library(thekidsbiostats) # install with install.packages("thekidsbiostats", repos = "the-kids-biostats.r-universe.dev")

Overview

Datasets rarely come with clean and consistent formatting in their column names. Therefore, it is common practice when performing analysis to rename the original column names to coding-friendly names (i.e., unique, lower-case, no spaces or special characters, consistent prefixes/suffixes).

Alongside these coding-friendly columns, it is then also useful to store (or link) label attributes, which provide clean, more descriptive or verbose labels for each coding-friendly variable name. These can be called upon when we are producing output for presentation, such as complete analytical reports, or exported tables and figures.

While a variety of functions for renaming columns (e.g., dplyr::rename()) and applying labels (e.g., using the labelled or Hmisc packages) exist, a reasonable amount of leg work is needed to use such functions across an entire dataset.

This post will demonstrate how two new functions within thekidsbiostats package can be used to streamline and expedite the process of standardising and labelling columns. Specifically, let’s introduce the following functions:

Motivating Example

As an example, let’s explore the data_patient dataset, an example dataset shipped with the package:

data(data_patient, package = "thekidsbiostats")

Previewing this dataset:

Patient ID#

DOB [YYYY-MM-DD]

Sex (0=Male, 1=Female)

WHAT IS YOUR HEIGHT? (cm)

WHAT IS YOUR CURRENT WEIGHT? (kg)

Do you currently smoke any form of tobacco products, including cigarettes, cigars, or pipes, on a regular basis?

bp (mmHg)

Cholesterol / mmolL

101

1980‑05‑12

0

175

70

true

120/80

5.2

102

1992‑08‑03

1

160

55

false

110/70

4.8

103

1975‑12‑25

1

180

80

false

130/85

6.1

104

2000‑01‑01

0

165

60

true

115/75

5.0

105

1985‑07‑07

1

170

75

false

125/82

5.5

Immediately, we see that the column headers are not formatted in a very nice way, and especially not for coding purposes! The column headers contain (any combination of):

  • Categorical variable coding,
  • Full questions,
  • Spacing,
  • Inconsistent letter capitalisation,
  • Punctuation (parentheses, equals signs, questions marks, dashes, slashes).

The spaces and punctuation in these column names require us to use backticks (``) to avoid R interpreting these symbols literally.

Beyond visually looking messy within final output, this set up can also be messy within your code, unnecessarily making simple operations more difficult to follow. For example, when forced to use the original column names, let’s see what implementing two trivial manipulations/changes to the data looks like in the code:

data_patient$bmi <- data_patient$`WHAT IS YOUR CURRENT WEIGHT? (kg)` / (data_patient$`WHAT IS YOUR HEIGHT? (cm)` / 100.0)^2

Recode the Sex column by replacing 0/1 with Male/Female, as defined in the existing column header.

data_patient$`Sex (0=Male, 1=Female)` = case_when(
  data_patient$`Sex (0=Male, 1=Female)` == 0 ~ "Male",
  data_patient$`Sex (0=Male, 1=Female)` == 1 ~ "Female",
  TRUE ~ NA_character_
)

Clearly, coding each of these changes is more cumbersome and time intensive than it needs to be, and the readability of our code is impacted.


Functions such as janitor::clean_names() can be very useful for standardising such column names into coding-friendly names. However, this approach is still imperfect, and the resulting names may remain just as problematic (as seen below).

colnames(janitor::clean_names(data_patient[6]))
[1] "do_you_currently_smoke_any_form_of_tobacco_products_including_cigarettes_cigars_or_pipes_on_a_regular_basis"

The “cleaned” version of this column name simply omits spaces and punctuation, however, it is still far too long.

Our Solution

Ideally, we want to simultaneously

  1. rename the columns, and
  2. retain human-readable labels for presentation/reporting.

Therefore, our solution is to set up a dictionary that maps the old column names to newly-formatted column names as well as apply label attributes.

Step 1: Create Data Dictionary

The make_column_dict() function facilitates this process by giving you a ready-to-fill template based on the original column names in the data.

For example:

dict <- make_column_dict(data = data_patient, auto_clean = TRUE, quiet = FALSE)
dict_patient <- tribble(
  ~old, ~new, ~label,
  'Patient ID#', 'patient_id_number', '',
  'DOB [YYYY-MM-DD]', 'dob_yyyy_mm_dd', '',
  'Sex (0=Male, 1=Female)', 'sex_0_male_1_female', '',
  'WHAT IS YOUR HEIGHT? (cm)', 'what_is_your_height_cm', '',
  'WHAT IS YOUR CURRENT WEIGHT? (kg)', 'what_is_your_current_weight_kg', '',
  'Do you currently smoke any form of tobacco products, including cigarettes, cigars, or pipes, on a regular basis?', 'do_you_currently_smoke_any_form_of_tobacco_products_including_cigarettes_cigars_or_pipes_on_a_regular_basis', '',
  'bp (mmHg)', 'bp_mm_hg', '',
  'Cholesterol / mmolL', 'cholesterol_mmol_l', '',
  'bmi', 'bmi', '',
) 

The output of make_column_dict is a tribble — a row-wise data-frame object (or tibble) which allows one to more easily visualise the format of the table.

We can either directly add to, or modify, the dict object assigned above or copy-paste the output into our code and add/edit/update column names and labels as needed (see below):

dict_patient <- tribble(
  ~old, ~new, ~label,
  'Patient ID#', 'id', 'Patient ID',
  'DOB [YYYY-MM-DD]', 'dob', 'Date of Birth',
  'Sex (0=Male, 1=Female)', 'sex', 'Sex',
  'WHAT IS YOUR HEIGHT? (cm)', 'height_cm', 'Height (cm)',
  'WHAT IS YOUR CURRENT WEIGHT? (kg)', 'weight_kg', 'Weight (kg)',
  'Do you currently smoke any form of tobacco products, including cigarettes, cigars, or pipes, on a regular basis?', 'current_smoking', 'Currently consumes tobacco',
  'bp (mmHg)', 'bp_mmhg', 'BP (mmHg)',
  'Cholesterol / mmolL', 'cholesterol_mmol_l', 'Cholesterol (mmolL)',
  'bmi', 'bmi', "BMI (kg/m^2)"
)

This tribble has been manually modified with “clean” column names and variable labels. Additionally, a row for the derived bmi variable has been included.

Important

Naively renaming columns for easy manipulation in code risks dropping potentially useful information contained in the original column name (e.g., units and symbols). It is important to have a data dictionary to retain this information about variables, which is distinct from the column name dictionary we’re creating here. For example, neither our new or label value for the sex variable above captures the coding.

As the number of columns in your dataset increases, editing/maintaining this within your code may become increasingly unwieldy. The option is also available to manage this via an external file (e.g., .csv, .rds, .xlsx), by specifying file:

make_column_dict(data = data_patient,
                 auto_clean = TRUE,
                 file = "colnames.csv")

If need be, we could read the dictionary back into R after having modified (in an external program) the column names and labels as necessary:

dict_patient <- readr::read_csv(file = "colnames.csv")

old

new

label

Patient ID#

id

Patient ID

DOB [YYYY‑MM‑DD]

dob

Date of Birth

Sex (0=Male, 1=Female)

sex

Sex

WHAT IS YOUR HEIGHT? (cm)

height_cm

Height (cm)

WHAT IS YOUR CURRENT WEIGHT? (kg)

weight_kg

Weight (kg)

Do you currently smoke any form of tobacco products, including cigarettes, cigars, or pipes, on a regular basis?

current_smoking

Currently consumes tobacco

bp (mmHg)

bp_mmhg

BP (mmHg)

Cholesterol / mmolL

cholesterol_mmol_l

Cholesterol (mmolL)

bmi

bmi

BMI (kg/m^2)

Step 2: Applying the column dictionary

Now that we have defined the column dictionary above, we can simply use update_columns() to apply these changes in a single line.

data_clean <- update_columns(
  data_patient, 
  dict = dict_patient
)

The update_columns() function includes arguments old, new, label and these must correspond to the column names of the data dictionary object. By default, the names assigned by make_column_dict() are assumed. It is unlikely you will have to modify these arguments.

Previewing Output

Previewing this “cleaned” dataset:

id

dob

sex

height_cm

weight_kg

current_smoking

bp_mmhg

cholesterol_mmol_l

bmi

101

1980‑05‑12

Male

175

70

true

120/80

5.2

22.85714

102

1992‑08‑03

Female

160

55

false

110/70

4.8

21.48437

103

1975‑12‑25

Female

180

80

false

130/85

6.1

24.69136

104

2000‑01‑01

Male

165

60

true

115/75

5.0

22.03857

105

1985‑07‑07

Female

170

75

false

125/82

5.5

25.95156

Because the labels we specified are stored as attributes, they are automatically recognised by many summary packages that respect label metadata (e.g., gtsummary::tbl_summary() tables).


So after the steps above, we no longer have this…

data_patient %>%
  tbl_summary(include = -`Patient ID#`,
              type = list(`WHAT IS YOUR HEIGHT? (cm)` ~ "continuous",
                          `WHAT IS YOUR CURRENT WEIGHT? (kg)` ~ "continuous",
                          `Cholesterol / mmolL` ~ "continuous",
                          `bmi` ~ "continuous")) %>%
  thekids_table(padding=2) |>
  modify_labels(hline=TRUE, bold=TRUE)

Characteristic

N = 51

DOB [YYYY-MM-DD]

1975-12-25

1 (20%)

1980-05-12

1 (20%)

1985-07-07

1 (20%)

1992-08-03

1 (20%)

2000-01-01

1 (20%)

Sex (0=Male, 1=Female)

Female

3 (60%)

Male

2 (40%)

WHAT IS YOUR HEIGHT? (cm)

170.0 (165.0, 175.0)

WHAT IS YOUR CURRENT WEIGHT? (kg)

70 (60, 75)

Do you currently smoke any form of tobacco products, including cigarettes, cigars, or pipes, on a regular basis?

2 (40%)

bp (mmHg)

110/70

1 (20%)

115/75

1 (20%)

120/80

1 (20%)

125/82

1 (20%)

130/85

1 (20%)

Cholesterol / mmolL

5.20 (5.00, 5.50)

bmi

22.86 (22.04, 24.69)

1n (%); Median (Q1, Q3)

We have this!

data_clean %>%
  tbl_summary(include = -id,
              type = list(height_cm ~ "continuous",
                          weight_kg ~ "continuous",
                          cholesterol_mmol_l ~ "continuous",
                          bmi ~ "continuous")) %>%
  thekids_table(padding=2) |> 
  modify_labels(hline=TRUE, bold=TRUE)

Characteristic

N = 51

Date of Birth

1975-12-25

1 (20%)

1980-05-12

1 (20%)

1985-07-07

1 (20%)

1992-08-03

1 (20%)

2000-01-01

1 (20%)

Sex

Female

3 (60%)

Male

2 (40%)

Height (cm)

170.0 (165.0, 175.0)

Weight (kg)

70 (60, 75)

Currently consumes tobacco

2 (40%)

BP (mmHg)

110/70

1 (20%)

115/75

1 (20%)

120/80

1 (20%)

125/82

1 (20%)

130/85

1 (20%)

Cholesterol (mmolL)

5.20 (5.00, 5.50)

BMI (kg/m^2)

22.86 (22.04, 24.69)

1n (%); Median (Q1, Q3)

Closing comments

Overall, the make_column_dict() and update_columns() functions offer an efficient solution for transforming and labelling datasets with problematic column headers. More concise column names are easier to work with and enhance the readability/tractability of code, while user-defined labels allow users to easily produce more polished and publication-ready outputs.

Acknowledgements

Thanks to Dr Matt Cooper, Wesley Billingham, Dr Haileab Wolde, and Dr Elizabeth McKinnon for providing feedback on and reviewing this post.

Reproducibility Information

To access the .qmd (Quarto markdown) files as well as any R scripts or data that was used in this post, please visit our GitHub:

https://github.com/The-Kids-Biostats/The-Kids-Biostats.github.io/tree/main/posts/

The session information can also be seen below.

Code
sessionInfo()
R version 4.5.2 (2025-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Tahoe 26.2

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Australia/Perth
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] thekidsbiostats_1.8.0 extrafont_0.20        flextable_0.9.10     
 [4] gtsummary_2.5.0       lubridate_1.9.4       forcats_1.0.1        
 [7] stringr_1.6.0         dplyr_1.2.0           purrr_1.2.1          
[10] readr_2.1.6           tidyr_1.3.2           tibble_3.3.1         
[13] ggplot2_4.0.2         tidyverse_2.0.0      

loaded via a namespace (and not attached):
 [1] gtable_0.3.6            xfun_0.56               htmlwidgets_1.6.4      
 [4] tzdb_0.5.0              vctrs_0.7.1             tools_4.5.2            
 [7] generics_0.1.4          parallel_4.5.2          pkgconfig_2.0.3        
[10] data.table_1.18.2.1     RColorBrewer_1.1-3      S7_0.2.1               
[13] uuid_1.2-2              lifecycle_1.0.5         compiler_4.5.2         
[16] farver_2.1.2            textshaping_1.0.4       janitor_2.2.1          
[19] snakecase_0.11.1        httpuv_1.6.16           fontquiver_0.2.1       
[22] fontLiberation_0.1.0    htmltools_0.5.9         yaml_2.3.12            
[25] Rttf2pt1_1.3.14         crayon_1.5.3            extrafontdb_1.1        
[28] pillar_1.11.1           later_1.4.5             openssl_2.3.4          
[31] mime_0.13               fontBitstreamVera_0.1.1 tidyselect_1.2.1       
[34] zip_2.3.3               digest_0.6.39           stringi_1.8.7          
[37] labelled_2.16.0         fastmap_1.2.0           grid_4.5.2             
[40] cli_3.6.5               magrittr_2.0.4          cards_0.7.1            
[43] patchwork_1.3.2         withr_3.0.2             gdtools_0.4.4          
[46] scales_1.4.0            promises_1.5.0          bit64_4.6.0-1          
[49] timechange_0.4.0        rmarkdown_2.30          officer_0.7.3          
[52] bit_4.6.0               otel_0.2.0              hms_1.1.4              
[55] askpass_1.2.1           ragg_1.5.0              shiny_1.12.1           
[58] evaluate_1.0.5          haven_2.5.5             knitr_1.51             
[61] rlang_1.1.7             Rcpp_1.1.1              xtable_1.8-4           
[64] glue_1.8.0              xml2_1.5.2              vroom_1.7.0            
[67] rstudioapi_0.18.0       jsonlite_2.0.0          R6_2.6.1               
[70] systemfonts_1.3.1       fs_1.6.6                shinyFiles_0.9.3