Overview
Datasets rarely come with cleaned, tidy formatting in their column
names and hence it is common practice to rename the original column
names to those that are more code-friendly (i.e. lower-case, no spaces
or special characters and unique). Alongside the code-friendly renamed
columns, it’s also very practical to store label
attributes, which act as a human-readable version of each column that
can be used for displaying in tables and figures. Whilst there already
exists functions for renaming columns
(e.g. dplyr::rename()) and applying labels
(e.g. labelled or Hmisc), there is much leg
work that needs to be applied in setting up the inputs for these
functions.
This vignette demonstrates how to use column
dictionaries to expedite the process of standardising and
labelling columns in a dataset. We will use two main functions from
thekidsbiostats:
-
make_column_dict()– create a copy-pasteable column dictionary template. -
update_columns()– rename and label columns using a column dictionary.
1. Create a data dictionary template:
make_column_dict()
Suppose we use thekidsbiostats::data_patient as an
example:
data("data_patient")
data_patient |> as_flextable()Patient ID# |
DOB [YYYY-MM-DD] |
Sex (0=Male, 1=Female) |
WHAT IS YOUR HEIGHT? (cm) |
WHAT IS YOUR CURRENT WEIGHT? (kg) |
Do you currently smoke any form of tobacco products, including cigarettes, cigars, or pipes, on a regular basis? |
bp (mmHg) |
Cholesterol / mmolL |
|---|---|---|---|---|---|---|---|
integer |
Date |
character |
numeric |
numeric |
logical |
character |
numeric |
101 |
1980-05-12 |
0 |
175 |
70 |
true |
120/80 |
5.2 |
102 |
1992-08-03 |
1 |
160 |
55 |
false |
110/70 |
4.8 |
103 |
1975-12-25 |
1 |
180 |
80 |
false |
130/85 |
6.1 |
104 |
2000-01-01 |
0 |
165 |
60 |
true |
115/75 |
5.0 |
105 |
1985-07-07 |
1 |
170 |
75 |
false |
125/82 |
5.5 |
n: 5 | |||||||
Beyond this dataset visually looking messy, it can also be tricky to
reference these column names when attempting to use them in operations.
For example, let’s try to calculate the BMI of individuals and recode
the Sex column by replacing 0/1 with
Male/Female.
data_patient$`Sex (0=Male, 1=Female)` = case_when(
data_patient$`Sex (0=Male, 1=Female)` == 0 ~ "Male",
data_patient$`Sex (0=Male, 1=Female)` == 1 ~ "Female",
T ~ NA_character_
)
bmi = data_patient$`WEIGHT (kg)` / (data_patient$`HEIGHT (cm)` / 100.0)^2
print(bmi)
#> numeric(0)The spaces, equal signs and parentheses in
the column names requires us to use backticks (` `) to avoid R
interpreting these symbols literally. Functions such as
janitor::clean_names() can be very useful for standardising
such column names into syntactic names. However, for situations like our
column
"Do you currently smoke any form of tobacco products, including cigarettes, cigars, or pipes, on a regular basis?",
the “cleaned” version is arguably just as unwieldy.
Nonetheless, the consequence of renaming columns for easy manipulation in code is likely dropping any potentially useful formatting of the columns (e.g. units and symbols). Ideally, we want to both rename the columns and simultaneously retain their human-readable labels for presentation.
The solution is therefore to manually set up a dictionary that maps
the old column names to newly formatted column names as well as apply
$label attributes to each column to be used for all visual
presentation.
The make_column_dict() function makes
this process slightly easier by creating a ready-to-fill template based
on the column names that exist with the data. For example:
dict <- make_column_dict(data_patient, quiet=FALSE) # note: setting auto_clean=FALSE will stop the janitor::clean_names() from being performed.
#> dict_patient <- tribble(
#> ~old, ~new, ~label,
#> 'Patient ID#', 'patient_id_number', '',
#> 'DOB [YYYY-MM-DD]', 'dob_yyyy_mm_dd', '',
#> 'Sex (0=Male, 1=Female)', 'sex_0_male_1_female', '',
#> 'WHAT IS YOUR HEIGHT? (cm)', 'what_is_your_height_cm', '',
#> 'WHAT IS YOUR CURRENT WEIGHT? (kg)', 'what_is_your_current_weight_kg', '',
#> 'Do you currently smoke any form of tobacco products, including cigarettes, cigars, or pipes, on a regular basis?', 'do_you_currently_smoke_any_form_of_tobacco_products_including_cigarettes_cigars_or_pipes_on_a_regular_basis', '',
#> 'bp (mmHg)', 'bp_mm_hg', '',
#> 'Cholesterol / mmolL', 'cholesterol_mmol_l', '',
#> )We can either directly add to the dict object assigned
above or copy-paste the printed output into our code and modify as
needed:
dict_patient <- tribble(
~old, ~new, ~label,
'Patient ID#', 'patient_id_number', '',
'DOB [YYYY-MM-DD]', 'dob_yyyy_mm_dd', '',
'Sex (0=Male, 1=Female)', 'sex_0_male_1_female', '',
'WHAT IS YOUR HEIGHT? (cm)', 'what_is_your_height_cm', '',
'WHAT IS YOUR CURRENT WEIGHT? (kg)', 'what_is_your_current_weight_kg', '',
'Do you currently smoke any form of tobacco products, including cigarettes, cigars, or pipes, on a regular basis?', 'do_you_currently_smoke_any_form_of_tobacco_products_including_cigarettes_cigars_or_pipes_on_a_regular_basis', '',
'bp (mmHg)', 'bp_mm_hg', '',
'Cholesterol / mmolL', 'cholesterol_mmol_l', '',
)2. Apply column dictionary: update_columns()
Now after having defined the column dictionary above, we can simply
use the update_columns() function to apply
these changes in a single line.
data_clean <- update_columns(data_patient, dict = dict_patient)
data_clean |> flextable()101 |
1980-05-12 |
Male |
175 |
70 |
true |
120/80 |
5.2 |
102 |
1992-08-03 |
Female |
160 |
55 |
false |
110/70 |
4.8 |
103 |
1975-12-25 |
Female |
180 |
80 |
false |
130/85 |
6.1 |
104 |
2000-01-01 |
Male |
165 |
60 |
true |
115/75 |
5.0 |
105 |
1985-07-07 |
Female |
170 |
75 |
false |
125/82 |
5.5 |
Writing/Loading column dictionaries from file
Both of these helper functions can accept file paths as input or output for the column dictionary. This is handled slightly differently for each:
-
make_column_dict(): thefileargument accepts a character string defining the file path to write the dictionary to. The extensions ‘.csv’, ‘.txt’, ‘.rds’ and ‘.xlsx’ are accepted. -
update_columns(): providing a file path in the form of a character string to thedictargument will attempt to load the dictionary directly from the file. In addition, theold,newandlabelparameters can be used to explicitly specify the names of the relevant columns if they are named differently.
