Example Usage
The data used in the following examples are simulated using OpenAI’s ChatGPT. Identifiable fields (full names, addresses, etc.) are also simulated.
Using Extracts (> clean_REDCap
)
First, we can use a (.csv) extract of the data and data dictionary from REDCap:
dat_raw <- read.csv(file = "materials/miscellaneous/DemoRSyntheticData_DATA_2025-03-18_0728.csv") # Read in the data
dict <- read.csv(file = "materials/miscellaneous/DemoRSyntheticData_DataDictionary_2025-03-18.csv", # Read in the data dictionary
check.names = FALSE ## So spaces, punctuation etc. are preserved in tibble
)
Visualising the first few rows of our data:
head(dat_raw, n = 10) %>%
thekids_table(line.spacing = 0.8,
padding = 1.5)
record_id |
form_1_complete |
demo_title |
demo_fname |
demo_dob |
demo_address |
demo_postcode |
anthro_sex |
anthro_weight |
anthro_height |
assess_iq |
demo_edu |
demo_income |
demo_num_dep |
med_hist_cvd |
med_hist_diabetes |
life_pol |
life_house_date |
life_super |
synthetic_1_complete |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 |
0 |
Mr. |
William Moore |
1990‑01‑24 |
99 St Georges Terrace, Scarborough, WA |
6006 |
Male |
97.7 |
1.56 |
58 |
3 |
0 |
2 |
0 |
1 |
Liberal |
1999‑04‑05 |
4458703 |
0 |
2 |
0 |
Mr. |
William Brown |
1998‑10‑24 |
6011 |
Other |
62.3 |
1.53 |
54 |
1 |
0 |
1 |
0 |
1 |
Central |
1985‑05‑31 |
14636346 |
0 |
|
3 |
0 |
Mr. |
Elizabeth Williams |
1983‑03‑01 |
87 Hay Street, Joondalup, WA |
6011 |
Male |
146.0 |
2.02 |
119 |
3 |
4 |
6 |
1 |
0 |
Liberal |
5334516 |
0 |
|
4 |
0 |
Ms. |
John Miller |
1985‑02‑13 |
85 Hay Street, Perth CBD, WA |
6006 |
Female |
66.4 |
1.53 |
54 |
2 |
0 |
5 |
0 |
1 |
Labour |
1989‑11‑13 |
6399928 |
0 |
5 |
0 |
Mr. |
John Brown |
1973‑12‑10 |
33 Beaufort Street, Mount Lawley, WA |
6005 |
Other |
114.9 |
1.55 |
57 |
3 |
0 |
6 |
1 |
1 |
Labour |
1985‑12‑02 |
294741 |
0 |
6 |
0 |
Ms. |
Michael Williams |
1997‑01‑29 |
24 St Georges Terrace, Northbridge, WA |
6011 |
Female |
69.2 |
1.64 |
69 |
3 |
1 |
2 |
0 |
0 |
Labour |
1992‑04‑19 |
10847340 |
0 |
7 |
0 |
Ms. |
Michael Davis |
1990‑10‑10 |
84 Hay Street, Northbridge, WA |
6007 |
Male |
85.6 |
2.03 |
121 |
3 |
4 |
5 |
1 |
0 |
Labour |
1991‑04‑28 |
6320541 |
0 |
8 |
0 |
Mx. |
Mary Brown |
1994‑08‑20 |
85 Murray Street, Mount Lawley, WA |
6008 |
Male |
76.7 |
1.70 |
77 |
3 |
1 |
1 |
0 |
0 |
Central |
2009‑03‑12 |
1726473 |
0 |
9 |
0 |
Ms. |
Michael Taylor |
1978‑05‑13 |
29 King Street, Fremantle, WA |
6008 |
Other |
109.8 |
66 |
1 |
0 |
0 |
0 |
1 |
Liberal |
2004‑03‑05 |
14534640 |
0 |
|
10 |
0 |
Mx. |
Patricia Moore |
1990‑06‑06 |
76 Adelaide Terrace, Cottesloe, WA |
6012 |
Other |
139.8 |
1.93 |
107 |
1 |
3 |
7 |
0 |
0 |
Liberal |
1992‑12‑20 |
11706012 |
0 |
Similarly, we can visualise our standardised data dictionary extract from REDCap:
Clearly:
- Some categorical fields (
demo_income
,demo_num_dep
,med_hist_cvd
,med_hist_diabetes
) are numerically coded and will need resolution before proceeding. - Column headers are variable names, not labels.
If we were to tabulate this data (stratifying by, say,
med_hist_cvd
), we must manually apply labels to categorical
variable levels (which is indeed made easier using
thekidsbiostats::fct_case_when
) and variable names (passed
directly to tbl_summary
):
dat_raw %>%
select(demo_postcode, anthro_sex:life_pol, life_super) %>%
mutate(demo_edu = fct_case_when(demo_edu == 0 ~ "Completed High School",
demo_edu == 1 ~ "TAFE or Trade",
demo_edu == 2 ~ "Higher Degree",
demo_edu == 3 ~ "Bachelor Degree"),
demo_income = fct_case_when(demo_income == 0 ~ "<80000",
demo_income == 1 ~ "80000-120000",
demo_income == 2 ~ "120000-160000",
demo_income == 3 ~ "160000-200000",
demo_income == 4 ~ ">200000"),
across(c(med_hist_cvd, med_hist_diabetes), ~fct_case_when(. == 0 ~ "No",
. == 1 ~ "Yes")),
across(c(anthro_sex, life_pol), ~case_when(. == "" ~ NA,
TRUE ~ .))) %>%
tbl_summary(by = med_hist_cvd,
type = list(demo_postcode ~ "categorical"),
digits = list(all_categorical() ~ c(0, 1)),
label = list(demo_postcode ~ "Postcode",
anthro_sex ~ "Sex",
anthro_weight ~ "Weight",
anthro_height ~ "Height",
assess_iq ~ "IQ",
demo_edu ~ "Highest level of education",
demo_income ~ "Household income",
demo_num_dep ~ "Number of dependents",
med_hist_diabetes ~ "History of diabetes",
life_pol ~ "Political affiliation",
life_super ~ "Total superannuation balance")
) %>%
modify_spanning_header(all_stat_cols() ~ "**History of CVD**") %>%
thekids_table()
#> 100 missing rows in the "med_hist_cvd" column have been
#> removed.
|
History of CVD |
|
---|---|---|
Characteristic |
No |
Yes |
Postcode |
||
6000 |
72 (7.9%) |
69 (7.7%) |
6003 |
59 (6.5%) |
79 (8.8%) |
6004 |
85 (9.3%) |
81 (9.0%) |
6005 |
76 (8.3%) |
82 (9.2%) |
6006 |
89 (9.7%) |
90 (10.0%) |
6007 |
85 (9.3%) |
88 (9.8%) |
6008 |
85 (9.3%) |
88 (9.8%) |
6009 |
79 (8.6%) |
81 (9.0%) |
6010 |
82 (9.0%) |
78 (8.7%) |
6011 |
101 (11.1%) |
77 (8.6%) |
6012 |
101 (11.1%) |
83 (9.3%) |
Unknown |
46 |
44 |
Sex |
||
Female |
308 (33.7%) |
272 (30.6%) |
Male |
285 (31.2%) |
314 (35.3%) |
Other |
320 (35.0%) |
304 (34.2%) |
Unknown |
47 |
50 |
Weight |
107 (82, 129) |
106 (85, 129) |
Unknown |
46 |
47 |
Height |
1.79 (1.65, 1.96) |
1.81 (1.66, 1.96) |
Unknown |
44 |
52 |
IQ |
89 (69, 111) |
90 (70, 110) |
Unknown |
50 |
47 |
Highest level of education |
||
Completed High School |
216 (23.6%) |
215 (24.2%) |
TAFE or Trade |
234 (25.5%) |
219 (24.6%) |
Higher Degree |
218 (23.8%) |
230 (25.8%) |
Bachelor Degree |
249 (27.2%) |
226 (25.4%) |
Unknown |
43 |
50 |
Household income |
||
<80000 |
199 (21.7%) |
183 (20.6%) |
80000-120000 |
176 (19.2%) |
169 (19.0%) |
120000-160000 |
186 (20.3%) |
189 (21.3%) |
160000-200000 |
182 (19.9%) |
162 (18.2%) |
>200000 |
173 (18.9%) |
186 (20.9%) |
Unknown |
44 |
51 |
Number of dependents |
3 (2, 5) |
3 (1, 5) |
Unknown |
51 |
45 |
History of diabetes |
471 (51.3%) |
467 (52.6%) |
Unknown |
41 |
53 |
Political affiliation |
||
Central |
301 (33.3%) |
309 (34.3%) |
Labour |
270 (29.9%) |
325 (36.0%) |
Liberal |
332 (36.8%) |
268 (29.7%) |
Unknown |
57 |
38 |
Total superannuation balance |
7,456,367 (3,930,725, 11,271,067) |
7,679,442 (3,868,717, 11,056,757) |
Unknown |
42 |
53 |
1n (%); Median (Q1, Q3) |
The above code chunk is clearly quite sizeable and required a significant degree of user input.
Instead, using clean_REDCap
:
dat_mod <- clean_REDCap(d = dat_raw,
dict = dict,
yesno_to_bool = T)
dat_mod %>%
select(demo_postcode, anthro_sex:life_pol, life_super) %>%
tbl_summary(by = med_hist_cvd,
type = list(demo_postcode ~ "categorical"),
digits = list(all_categorical() ~ c(0, 1))
) %>%
modify_spanning_header(all_stat_cols() ~ "**History of CVD**") %>%
thekids_table()
#> 100 missing rows in the "med_hist_cvd" column have been
#> removed.
|
History of CVD |
|
---|---|---|
Characteristic |
No |
Yes |
Postcode |
||
6000 |
72 (7.9%) |
69 (7.7%) |
6003 |
59 (6.5%) |
79 (8.8%) |
6004 |
85 (9.3%) |
81 (9.0%) |
6005 |
76 (8.3%) |
82 (9.2%) |
6006 |
89 (9.7%) |
90 (10.0%) |
6007 |
85 (9.3%) |
88 (9.8%) |
6008 |
85 (9.3%) |
88 (9.8%) |
6009 |
79 (8.6%) |
81 (9.0%) |
6010 |
82 (9.0%) |
78 (8.7%) |
6011 |
101 (11.1%) |
77 (8.6%) |
6012 |
101 (11.1%) |
83 (9.3%) |
Unknown |
46 |
44 |
Sex |
||
47 (4.9%) |
50 (5.3%) |
|
Female |
308 (32.1%) |
272 (28.9%) |
Male |
285 (29.7%) |
314 (33.4%) |
Other |
320 (33.3%) |
304 (32.3%) |
Weight |
107 (82, 129) |
106 (85, 129) |
Unknown |
46 |
47 |
Height |
1.79 (1.65, 1.96) |
1.81 (1.66, 1.96) |
Unknown |
44 |
52 |
IQ |
89 (69, 111) |
90 (70, 110) |
Unknown |
50 |
47 |
Highest Level of Education |
||
Completed High School |
216 (23.6%) |
215 (24.2%) |
TAFE or Trade |
234 (25.5%) |
219 (24.6%) |
Higher Degree |
218 (23.8%) |
230 (25.8%) |
Bachelor Degree |
249 (27.2%) |
226 (25.4%) |
Unknown |
43 |
50 |
Household Income |
||
<80000 |
199 (21.7%) |
183 (20.6%) |
80000-120000 |
176 (19.2%) |
169 (19.0%) |
120000-160000 |
186 (20.3%) |
189 (21.3%) |
160000-200000 |
182 (19.9%) |
162 (18.2%) |
>200000 |
173 (18.9%) |
186 (20.9%) |
Unknown |
44 |
51 |
Number of Dependents |
3 (2, 5) |
3 (1, 5) |
Unknown |
51 |
45 |
History of Diabetes |
471 (51.3%) |
467 (52.6%) |
Unknown |
41 |
53 |
Political Affiliation |
||
57 (5.9%) |
38 (4.0%) |
|
Central |
301 (31.4%) |
309 (32.9%) |
Labour |
270 (28.1%) |
325 (34.6%) |
Liberal |
332 (34.6%) |
268 (28.5%) |
Total Super Balance |
7,456,367 (3,930,725, 11,271,067) |
7,679,442 (3,868,717, 11,056,757) |
Unknown |
42 |
53 |
1n (%); Median (Q1, Q3) |
The variable levels and labels are automatically applied.