Example Usage

The data used in the following examples are simulated using OpenAI’s ChatGPT. Identifiable fields (full names, addresses, etc.) are also simulated.

Using Extracts (> clean_REDCap)

First, we can use a (.csv) extract of the data and data dictionary from REDCap:

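# Packages assumed to be attached for the examples below
library(dplyr)           # %>%, select(), mutate(), across()
library(gtsummary)       # tbl_summary(), modify_spanning_header()
library(thekidsbiostats) # thekids_table(), fct_case_when(), clean_REDCap()

dat_raw <- read.csv(file = "materials/miscellaneous/DemoRSyntheticData_DATA_2025-03-18_0728.csv")      # Read in the data
dict    <- read.csv(file = "materials/miscellaneous/DemoRSyntheticData_DataDictionary_2025-03-18.csv", # Read in the data dictionary
                    check.names = FALSE) # Keep spaces, punctuation etc. in the column names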
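
To see what check.names = FALSE changes, the minimal sketch below reads a two-column stand-in for a data dictionary from an inline string. The header text and the single row are illustrative assumptions, not the contents of the actual extract.

```r
# Illustrative stand-in for a dictionary extract; the header names are assumed.
csv_text <- "Variable / Field Name,Field Label\ndemo_edu,Highest Level of Education"

# Default behaviour: headers are passed through make.names(), so spaces and
# punctuation are replaced with dots.
mangled <- read.csv(text = csv_text)

# With check.names = FALSE the headers are kept exactly as written.
kept <- read.csv(text = csv_text, check.names = FALSE)

names(mangled)
names(kept)
```

Keeping the original headers matters whenever later code looks columns up by their exact names.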

Visualising the first few rows of our data:

head(dat_raw, n = 10) %>%
  thekids_table(line.spacing = 0.8,
                padding = 1.5)

| record_id | form_1_complete | demo_title | demo_fname | demo_dob | demo_address | demo_postcode | anthro_sex | anthro_weight | anthro_height | assess_iq | demo_edu | demo_income | demo_num_dep | med_hist_cvd | med_hist_diabetes | life_pol | life_house_date | life_super | synthetic_1_complete |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | Mr. | William Moore | 1990-01-24 | 99 St Georges Terrace, Scarborough, WA | 6006 | Male | 97.7 | 1.56 | 58 | 3 | 0 | 2 | 0 | 1 | Liberal | 1999-04-05 | 4458703 | 0 |
| 2 | 0 | Mr. | William Brown | 1998-10-24 | | 6011 | Other | 62.3 | 1.53 | 54 | 1 | 0 | 1 | 0 | 1 | Central | 1985-05-31 | 14636346 | 0 |
| 3 | 0 | Mr. | Elizabeth Williams | 1983-03-01 | 87 Hay Street, Joondalup, WA | 6011 | Male | 146.0 | 2.02 | 119 | 3 | 4 | 6 | 1 | 0 | Liberal | | 5334516 | 0 |
| 4 | 0 | Ms. | John Miller | 1985-02-13 | 85 Hay Street, Perth CBD, WA | 6006 | Female | 66.4 | 1.53 | 54 | 2 | 0 | 5 | 0 | 1 | Labour | 1989-11-13 | 6399928 | 0 |
| 5 | 0 | Mr. | John Brown | 1973-12-10 | 33 Beaufort Street, Mount Lawley, WA | 6005 | Other | 114.9 | 1.55 | 57 | 3 | 0 | 6 | 1 | 1 | Labour | 1985-12-02 | 294741 | 0 |
| 6 | 0 | Ms. | Michael Williams | 1997-01-29 | 24 St Georges Terrace, Northbridge, WA | 6011 | Female | 69.2 | 1.64 | 69 | 3 | 1 | 2 | 0 | 0 | Labour | 1992-04-19 | 10847340 | 0 |
| 7 | 0 | Ms. | Michael Davis | 1990-10-10 | 84 Hay Street, Northbridge, WA | 6007 | Male | 85.6 | 2.03 | 121 | 3 | 4 | 5 | 1 | 0 | Labour | 1991-04-28 | 6320541 | 0 |
| 8 | 0 | Mx. | Mary Brown | 1994-08-20 | 85 Murray Street, Mount Lawley, WA | 6008 | Male | 76.7 | 1.70 | 77 | 3 | 1 | 1 | 0 | 0 | Central | 2009-03-12 | 1726473 | 0 |
| 9 | 0 | Ms. | Michael Taylor | 1978-05-13 | 29 King Street, Fremantle, WA | 6008 | Other | 109.8 | | 66 | 1 | 0 | 0 | 0 | 1 | Liberal | 2004-03-05 | 14534640 | 0 |
| 10 | 0 | Mx. | Patricia Moore | 1990-06-06 | 76 Adelaide Terrace, Cottesloe, WA | 6012 | Other | 139.8 | 1.93 | 107 | 1 | 3 | 7 | 0 | 0 | Liberal | 1992-12-20 | 11706012 | 0 |

Similarly, we can visualise our standardised data dictionary extract from REDCap:

Figure 1: Data dictionary extract.

Clearly:

  • Some categorical fields (demo_edu, demo_income, med_hist_cvd, med_hist_diabetes) are numerically coded and must be decoded before proceeding.
  • Column headers are variable names, not labels.
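
As a minimal base-R sketch of what decoding a numerically coded field involves, the snippet below maps the codes 0 to 3 onto the education labels used in this section. The toy vector edu_codes is made up for illustration, and base R's factor() is only one of several ways to do this.

```r
# Toy vector of numeric codes (made up for illustration), including a missing value.
edu_codes <- c(3, 1, 3, 2, 0, NA)

# Map each code to its label; the order of `levels` fixes the order of the
# resulting factor levels, and therefore the row order in any summary table.
edu <- factor(edu_codes,
              levels = c(0, 1, 2, 3),
              labels = c("Completed High School", "TAFE or Trade",
                         "Higher Degree", "Bachelor Degree"))

table(edu, useNA = "ifany")
```

Every coded field needs its own mapping like this, which is where the manual effort comes from.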

To tabulate these data (stratified by, say, med_hist_cvd), we would need to manually label the levels of each categorical variable (made easier by thekidsbiostats::fct_case_when) and supply the variable labels (passed directly to tbl_summary):

dat_raw %>%
  select(demo_postcode, anthro_sex:life_pol, life_super) %>%
  mutate(demo_edu = fct_case_when(demo_edu == 0 ~ "Completed High School",
                                  demo_edu == 1 ~ "TAFE or Trade",
                                  demo_edu == 2 ~ "Higher Degree",
                                  demo_edu == 3 ~ "Bachelor Degree"),
         demo_income = fct_case_when(demo_income == 0 ~ "<80000",
                                     demo_income == 1 ~ "80000-120000",
                                     demo_income == 2 ~ "120000-160000",
                                     demo_income == 3 ~ "160000-200000",
                                     demo_income == 4 ~ ">200000"),
         across(c(med_hist_cvd, med_hist_diabetes), ~fct_case_when(. == 0 ~ "No",
                                                                   . == 1 ~ "Yes")),
         # Treat empty strings as missing so they are reported as "Unknown"
         across(c(anthro_sex, life_pol), ~na_if(., ""))) %>%
  
  tbl_summary(by = med_hist_cvd,
              type = list(demo_postcode ~ "categorical"),
              digits = list(all_categorical() ~ c(0, 1)),
              label = list(demo_postcode     ~ "Postcode",
                           anthro_sex        ~ "Sex",
                           anthro_weight     ~ "Weight",
                           anthro_height     ~ "Height",
                           assess_iq         ~ "IQ",
                           demo_edu          ~ "Highest level of education",
                           demo_income       ~ "Household income",
                           demo_num_dep      ~ "Number of dependents",
                           med_hist_diabetes ~ "History of diabetes",
                           life_pol          ~ "Political affiliation",
                           life_super        ~ "Total superannuation balance")
              ) %>%
  modify_spanning_header(all_stat_cols() ~ "**History of CVD**") %>%
  thekids_table()
#> 100 missing rows in the "med_hist_cvd" column have been
#> removed.

| Characteristic                 | History of CVD: No, N = 960¹      | History of CVD: Yes, N = 940¹     |
|--------------------------------|-----------------------------------|-----------------------------------|
| Postcode                       |                                   |                                   |
|   6000                         | 72 (7.9%)                         | 69 (7.7%)                         |
|   6003                         | 59 (6.5%)                         | 79 (8.8%)                         |
|   6004                         | 85 (9.3%)                         | 81 (9.0%)                         |
|   6005                         | 76 (8.3%)                         | 82 (9.2%)                         |
|   6006                         | 89 (9.7%)                         | 90 (10.0%)                        |
|   6007                         | 85 (9.3%)                         | 88 (9.8%)                         |
|   6008                         | 85 (9.3%)                         | 88 (9.8%)                         |
|   6009                         | 79 (8.6%)                         | 81 (9.0%)                         |
|   6010                         | 82 (9.0%)                         | 78 (8.7%)                         |
|   6011                         | 101 (11.1%)                       | 77 (8.6%)                         |
|   6012                         | 101 (11.1%)                       | 83 (9.3%)                         |
|   Unknown                      | 46                                | 44                                |
| Sex                            |                                   |                                   |
|   Female                       | 308 (33.7%)                       | 272 (30.6%)                       |
|   Male                         | 285 (31.2%)                       | 314 (35.3%)                       |
|   Other                        | 320 (35.0%)                       | 304 (34.2%)                       |
|   Unknown                      | 47                                | 50                                |
| Weight                         | 107 (82, 129)                     | 106 (85, 129)                     |
|   Unknown                      | 46                                | 47                                |
| Height                         | 1.79 (1.65, 1.96)                 | 1.81 (1.66, 1.96)                 |
|   Unknown                      | 44                                | 52                                |
| IQ                             | 89 (69, 111)                      | 90 (70, 110)                      |
|   Unknown                      | 50                                | 47                                |
| Highest level of education     |                                   |                                   |
|   Completed High School        | 216 (23.6%)                       | 215 (24.2%)                       |
|   TAFE or Trade                | 234 (25.5%)                       | 219 (24.6%)                       |
|   Higher Degree                | 218 (23.8%)                       | 230 (25.8%)                       |
|   Bachelor Degree              | 249 (27.2%)                       | 226 (25.4%)                       |
|   Unknown                      | 43                                | 50                                |
| Household income               |                                   |                                   |
|   <80000                       | 199 (21.7%)                       | 183 (20.6%)                       |
|   80000-120000                 | 176 (19.2%)                       | 169 (19.0%)                       |
|   120000-160000                | 186 (20.3%)                       | 189 (21.3%)                       |
|   160000-200000                | 182 (19.9%)                       | 162 (18.2%)                       |
|   >200000                      | 173 (18.9%)                       | 186 (20.9%)                       |
|   Unknown                      | 44                                | 51                                |
| Number of dependents           | 3 (2, 5)                          | 3 (1, 5)                          |
|   Unknown                      | 51                                | 45                                |
| History of diabetes            | 471 (51.3%)                       | 467 (52.6%)                       |
|   Unknown                      | 41                                | 53                                |
| Political affiliation          |                                   |                                   |
|   Central                      | 301 (33.3%)                       | 309 (34.3%)                       |
|   Labour                       | 270 (29.9%)                       | 325 (36.0%)                       |
|   Liberal                      | 332 (36.8%)                       | 268 (29.7%)                       |
|   Unknown                      | 57                                | 38                                |
| Total superannuation balance   | 7,456,367 (3,930,725, 11,271,067) | 7,679,442 (3,868,717, 11,056,757) |
|   Unknown                      | 42                                | 53                                |

¹ n (%); Median (Q1, Q3)

The code chunk above is lengthy and requires considerable manual input: every factor level and every variable label has to be typed out by hand.

Instead, using clean_REDCap:

dat_mod <- clean_REDCap(d = dat_raw,
                        dict = dict,
                        yesno_to_bool = TRUE)

dat_mod %>%
  select(demo_postcode, anthro_sex:life_pol, life_super) %>%
  
  tbl_summary(by = med_hist_cvd,
              type = list(demo_postcode ~ "categorical"),
              digits = list(all_categorical() ~ c(0, 1))
              ) %>%
  modify_spanning_header(all_stat_cols() ~ "**History of CVD**") %>%
  thekids_table()
#> 100 missing rows in the "med_hist_cvd" column have been
#> removed.

| Characteristic                 | History of CVD: No, N = 960¹      | History of CVD: Yes, N = 940¹     |
|--------------------------------|-----------------------------------|-----------------------------------|
| Postcode                       |                                   |                                   |
|   6000                         | 72 (7.9%)                         | 69 (7.7%)                         |
|   6003                         | 59 (6.5%)                         | 79 (8.8%)                         |
|   6004                         | 85 (9.3%)                         | 81 (9.0%)                         |
|   6005                         | 76 (8.3%)                         | 82 (9.2%)                         |
|   6006                         | 89 (9.7%)                         | 90 (10.0%)                        |
|   6007                         | 85 (9.3%)                         | 88 (9.8%)                         |
|   6008                         | 85 (9.3%)                         | 88 (9.8%)                         |
|   6009                         | 79 (8.6%)                         | 81 (9.0%)                         |
|   6010                         | 82 (9.0%)                         | 78 (8.7%)                         |
|   6011                         | 101 (11.1%)                       | 77 (8.6%)                         |
|   6012                         | 101 (11.1%)                       | 83 (9.3%)                         |
|   Unknown                      | 46                                | 44                                |
| Sex                            |                                   |                                   |
|                                | 47 (4.9%)                         | 50 (5.3%)                         |
|   Female                       | 308 (32.1%)                       | 272 (28.9%)                       |
|   Male                         | 285 (29.7%)                       | 314 (33.4%)                       |
|   Other                        | 320 (33.3%)                       | 304 (32.3%)                       |
| Weight                         | 107 (82, 129)                     | 106 (85, 129)                     |
|   Unknown                      | 46                                | 47                                |
| Height                         | 1.79 (1.65, 1.96)                 | 1.81 (1.66, 1.96)                 |
|   Unknown                      | 44                                | 52                                |
| IQ                             | 89 (69, 111)                      | 90 (70, 110)                      |
|   Unknown                      | 50                                | 47                                |
| Highest Level of Education     |                                   |                                   |
|   Completed High School        | 216 (23.6%)                       | 215 (24.2%)                       |
|   TAFE or Trade                | 234 (25.5%)                       | 219 (24.6%)                       |
|   Higher Degree                | 218 (23.8%)                       | 230 (25.8%)                       |
|   Bachelor Degree              | 249 (27.2%)                       | 226 (25.4%)                       |
|   Unknown                      | 43                                | 50                                |
| Household Income               |                                   |                                   |
|   <80000                       | 199 (21.7%)                       | 183 (20.6%)                       |
|   80000-120000                 | 176 (19.2%)                       | 169 (19.0%)                       |
|   120000-160000                | 186 (20.3%)                       | 189 (21.3%)                       |
|   160000-200000                | 182 (19.9%)                       | 162 (18.2%)                       |
|   >200000                      | 173 (18.9%)                       | 186 (20.9%)                       |
|   Unknown                      | 44                                | 51                                |
| Number of Dependents           | 3 (2, 5)                          | 3 (1, 5)                          |
|   Unknown                      | 51                                | 45                                |
| History of Diabetes            | 471 (51.3%)                       | 467 (52.6%)                       |
|   Unknown                      | 41                                | 53                                |
| Political Affiliation          |                                   |                                   |
|                                | 57 (5.9%)                         | 38 (4.0%)                         |
|   Central                      | 301 (31.4%)                       | 309 (32.9%)                       |
|   Labour                       | 270 (28.1%)                       | 325 (34.6%)                       |
|   Liberal                      | 332 (34.6%)                       | 268 (28.5%)                       |
| Total Super Balance            | 7,456,367 (3,930,725, 11,271,067) | 7,679,442 (3,868,717, 11,056,757) |
|   Unknown                      | 42                                | 53                                |

¹ n (%); Median (Q1, Q3)

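The variable levels and labels are applied automatically, with no manual recoding. Note that the empty strings in anthro_sex and life_pol were not converted to NA in this chunk, so in the table they appear as an unlabelled level rather than under Unknown.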
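
For intuition only, here is a rough sketch of how level labels could be recovered from a data dictionary. It assumes the usual REDCap convention of storing a field's choices as one string of "code, label" pairs separated by "|". The helper parse_choices and the example choices string are hypothetical, and this is not necessarily how clean_REDCap is implemented.

```r
# Hypothetical helper: split a REDCap-style choices string into codes and labels.
parse_choices <- function(choices) {
  pairs <- trimws(strsplit(choices, "|", fixed = TRUE)[[1]])
  # Split each pair at the first comma only, so labels may themselves contain commas.
  codes  <- trimws(sub(",.*$", "", pairs))
  labels <- trimws(sub("^[^,]*,", "", pairs))
  data.frame(code = codes, label = labels, stringsAsFactors = FALSE)
}

# Assumed choices string, modelled on the income categories shown above.
income_choices <- "0, <80000 | 1, 80000-120000 | 2, 120000-160000 | 3, 160000-200000 | 4, >200000"
lookup <- parse_choices(income_choices)

# Apply the lookup to a toy vector of raw codes (made up for illustration).
raw_income <- c(0, 4, 2, NA)
income <- factor(raw_income, levels = lookup$code, labels = lookup$label)
income
```

Because the mapping lives in the dictionary rather than in the analysis script, one lookup of this kind can in principle be reused for every coded field.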

Using APIs