Using categorical variables with anticlustering

library(anticlust)

In this vignette I explore two ways to incorporate categorical variables into anticlustering. The main function of anticlust is anticlustering(), which has an argument categories. Using it is straightforward: we pass the numeric variables as the first argument (x) and our categorical variable(s) to categories. I will use the penguin data set from the palmerpenguins package to illustrate the usage:

library(palmerpenguins)
# First exclude cases with missing values
df <- na.omit(penguins)
head(df)
#>   species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> 1  Adelie Torgersen           39.1          18.7               181        3750
#> 2  Adelie Torgersen           39.5          17.4               186        3800
#> 3  Adelie Torgersen           40.3          18.0               195        3250
#> 5  Adelie Torgersen           36.7          19.3               193        3450
#> 6  Adelie Torgersen           39.3          20.6               190        3650
#> 7  Adelie Torgersen           38.9          17.8               181        3625
#>      sex year
#> 1   male 2007
#> 2 female 2007
#> 3 female 2007
#> 5 female 2007
#> 6   male 2007
#> 7 female 2007
nrow(df)
#> [1] 333

In the data set, each row represents a penguin, and the data set has four numeric variables (bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) and several categorical variables (species, island, sex) as descriptions of the penguins.

Let’s call anticlustering() to divide the 333 penguins into 3 groups. We use the four numeric variables as first argument (i.e., the anticlustering objective is computed on the basis of the numeric variables), and the penguins’ sex as categorical variable:

numeric_vars <- df[, c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")]
groups <- anticlustering(
  numeric_vars, 
  K = 3,
  categories = df$sex
)

Let’s check how well our categorical variable is balanced:

table(groups, df$sex)
#>       
#> groups female male
#>      1     55   56
#>      2     55   56
#>      3     55   56

A perfect split! Similarly, we could use the species as categorical variable:

groups <- anticlustering(
  numeric_vars, 
  K = 3,
  categories = df$species
)

table(groups, df$species)
#>       
#> groups Adelie Chinstrap Gentoo
#>      1     49        22     40
#>      2     49        23     39
#>      3     48        23     40

As good as it could be! Now, let’s use both categorical variables at the same time:

groups <- anticlustering(
  numeric_vars, 
  K = 3,
  categories = df[, c("species", "sex")]
)

table(groups, df$sex)
#>       
#> groups female male
#>      1     54   57
#>      2     56   55
#>      3     55   56
table(groups, df$species) 
#>       
#> groups Adelie Chinstrap Gentoo
#>      1     49        22     40
#>      2     49        23     39
#>      3     48        23     40

The results for the sex variable are worse than before, when we considered only one variable at a time. This is because when multiple variables are passed to the categories argument, all columns are “merged” into a single column, and each combination of sex / species is treated as a separate category. Some information on the original variables is lost, and the results may become less optimal, although they are still pretty okay here. Alas, using only the categories argument, we cannot improve this balancing even if a better split with regard to both categorical variables would be possible.
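As a rough illustration of what “merging” means, the following base R sketch combines two categorical columns into a single factor with interaction(). The toy data are made up, and this is only meant to show the idea; how anticlust merges the columns internally may differ.

```r
# Hypothetical toy data (not the penguin data)
toy <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo", "Adelie", "Gentoo"),
  sex     = c("female", "male", "female", "male", "female", "male")
)

# Merge both columns into one factor: each observed combination of
# species and sex becomes its own category
merged <- interaction(toy$species, toy$sex, drop = TRUE)

levels(merged)  # one level per species / sex combination
table(merged)   # how often each combination occurs
```

Balancing the merged factor balances the combinations, but the marginal counts of sex or species alone are no longer a direct target.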

Categorical variables as numeric variables

A second possibility to incorporate categorical variables is to treat them as numeric variables and use them as part of the first argument x, which is used to compute the anticlustering objective (e.g., the diversity or variance). This approach can lead to better results when multiple categorical variables are available, and / or if the group sizes are unequal. I discuss the approach using the example of k-means anticlustering, but using the diversity objective is also possible (in principle, any reasonable way to transform categorical variables to pairwise dissimilarities would work).

To use categorical variables as part of the anticlustering objective, we first generate a matrix of the categorical variables in binary representation using the anticlust convenience function categories_to_binary().[1] Because k-means anticlustering optimizes similarity with regard to means, k-means anticlustering applied to this binary matrix will even out the proportion of each category in each group (this is because the mean of a binary variable is the proportion of 1s in that variable).
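The following base R sketch shows the idea behind the binary representation on made-up toy data. It uses model.matrix() directly; the exact call made inside categories_to_binary() is an assumption here, so treat this as an illustration rather than the package's implementation.

```r
# Hypothetical toy data (not the penguin data)
toy <- data.frame(
  species = factor(c("Adelie", "Chinstrap", "Gentoo", "Adelie", "Gentoo", "Adelie")),
  sex     = factor(c("male", "female", "female", "female", "male", "male"))
)

# Dummy coding via model.matrix(); the intercept column is dropped
binary <- model.matrix(~ species + sex, data = toy)[, -1, drop = FALSE]
colnames(binary)

# The mean of a 0/1 column is the proportion of 1s in that column
colMeans(binary)
prop_male <- mean(toy$sex == "male")
```

Because each column mean is a proportion, making the column means similar between groups makes the category proportions similar between groups.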

binary_categories <- categories_to_binary(df[, c("species", "sex")])
# see ?categories_to_binary
head(binary_categories)
#>   speciesChinstrap speciesGentoo sexmale
#> 1                0             0       1
#> 2                0             0       0
#> 3                0             0       0
#> 4                0             0       0
#> 5                0             0       1
#> 6                0             0       0
groups <- anticlustering(
  binary_categories,
  K = 3,
  method = "local-maximum", 
  objective = "variance",
  repetitions = 10,
  standardize = TRUE
)
table(groups, df$sex)
#>       
#> groups female male
#>      1     55   56
#>      2     55   56
#>      3     55   56
table(groups, df$species)
#>       
#> groups Adelie Chinstrap Gentoo
#>      1     49        23     39
#>      2     49        22     40
#>      3     48        23     40

The results are quite convincing. In particular, the penguins’ sex is better balanced than previously when we used the argument categories. If we have multiple categorical variables and / or unequal-sized groups, it may be useful to try out the k-means optimization version of including categorical variables, instead of (only) using the categories argument. If we also wish to ensure that the categorical variables in their combination are balanced between groups (i.e., the proportion of the penguins’ sex is roughly the same for each species in each group), we could set the optional argument use_combinations of categories_to_binary() to TRUE:

binary_categories <- categories_to_binary(df[, c("species", "sex")], use_combinations = TRUE)
groups <- anticlustering(
  binary_categories,
  K = 3,
  method = "local-maximum", 
  objective = "variance",
  repetitions = 10,
  standardize = TRUE
)
table(groups, df$sex)
#>       
#> groups female male
#>      1     55   56
#>      2     55   56
#>      3     55   56
table(groups, df$species) 
#>       
#> groups Adelie Chinstrap Gentoo
#>      1     49        23     39
#>      2     49        22     40
#>      3     48        23     40
table(groups, df$sex, df$species)
#> , ,  = Adelie
#> 
#>       
#> groups female male
#>      1     24   25
#>      2     25   24
#>      3     24   24
#> 
#> , ,  = Chinstrap
#> 
#>       
#> groups female male
#>      1     12   11
#>      2     11   11
#>      3     11   12
#> 
#> , ,  = Gentoo
#> 
#>       
#> groups female male
#>      1     19   20
#>      2     19   21
#>      3     20   20

Note that we only evenly distributed the categorical variables between groups and did not consider any numeric variables. Fortunately, it is also possible to consider the numeric variables, and we can accomplish that in two different ways:

  1. we first optimize similarity with regard to the categorical variable(s) via k-means anticlustering, and then insert the resulting group assignment as a “hard constraint” into anticlustering()
  2. we simultaneously optimize similarity with regard to numeric and categorical variables

We discuss both approaches in the following.

a. Sequential optimization

We use the output vector groups of the previous call to anticlustering(), which convincingly balanced our categorical variables, as input to the K argument in an additional call to anticlustering(). The groups vector is used as the initial group assignment before the anticlustering optimization starts. In this group assignment, the categories are already well balanced. We additionally pass the two categorical variables to categories, thus ensuring that the balancing of the categorical variables is never changed throughout the optimization process:[2]

final_groups <- anticlustering(
  numeric_vars,
  K = groups,
  standardize = TRUE,
  method = "local-maximum",
  categories = df[, c("species", "sex")]
)

table(groups = final_groups, df$sex)
#>       
#> groups female male
#>      1     55   56
#>      2     55   56
#>      3     55   56
table(groups = final_groups, df$species)
#>       
#> groups Adelie Chinstrap Gentoo
#>      1     49        23     39
#>      2     49        22     40
#>      3     48        23     40
mean_sd_tab(numeric_vars, final_groups)
#>   bill_length_mm bill_depth_mm  flipper_length_mm body_mass_g       
#> 1 "44.00 (5.46)" "17.17 (1.97)" "200.92 (14.05)"  "4203.38 (810.54)"
#> 2 "44.00 (5.47)" "17.16 (1.96)" "201.05 (14.07)"  "4209.23 (812.49)"
#> 3 "43.98 (5.52)" "17.17 (1.99)" "200.94 (14.05)"  "4208.56 (799.86)"

The results are convincing, both with regard to the numeric variables and the categorical variables.
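Why does restricting exchanges to elements of the same category preserve the balance? A minimal base R sketch with made-up data: swapping the group labels of two items that share a category leaves the cross table of groups and categories unchanged. This only illustrates the principle; it is not the exchange algorithm used by anticlustering().

```r
# Hypothetical toy example: 6 items, 2 groups, one categorical variable
category   <- c("a", "a", "a", "b", "b", "b")
toy_groups <- c(1, 2, 1, 2, 1, 2)

before <- table(toy_groups, category)

# Swap the group labels of items 1 and 2, which both have category "a"
swapped <- toy_groups
swapped[c(1, 2)] <- toy_groups[c(2, 1)]

after <- table(swapped, category)

# The assignment changed, but the counts per group and category did not
all(before == after)
```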

b. Simultaneous optimization

We can simultaneously consider the numeric and categorical variables in the optimization process. Note that this approach only works with the k-means and k-plus objectives, because only k-means adequately deals with the categorical variables (at least when using the approach described here). Using the simultaneous approach, we just pass all variables (representing binary categories and numeric variables) as a single matrix to the first argument of anticlustering(). Do not use the categories argument here!

final_groups <- anticlustering(
  cbind(numeric_vars, binary_categories),
  K = 3,
  standardize = TRUE,
  method = "local-maximum", 
  objective = "variance",
  repetitions = 10
)

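# `groups` is the assignment from the categorical-only call above;
# use table(final_groups, ...) to inspect the simultaneous solution
table(groups, df$sex)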
#>       
#> groups female male
#>      1     55   56
#>      2     55   56
#>      3     55   56
table(groups, df$species)
#>       
#> groups Adelie Chinstrap Gentoo
#>      1     49        23     39
#>      2     49        22     40
#>      3     48        23     40
mean_sd_tab(numeric_vars, final_groups)
#>   bill_length_mm bill_depth_mm  flipper_length_mm body_mass_g       
#> 1 "43.99 (5.54)" "17.17 (2.05)" "200.97 (14.04)"  "4207.21 (815.03)"
#> 2 "44.00 (5.64)" "17.16 (2.06)" "200.96 (13.36)"  "4207.21 (769.92)"
#> 3 "43.99 (5.28)" "17.16 (1.81)" "200.96 (14.73)"  "4206.76 (836.58)"
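To see why a single matrix of numeric and binary columns can be handled by one criterion, it may help to compute the k-means criterion by hand: the sum of squared Euclidean distances between each row and the centroid of its group. The function name kmeans_criterion() and the toy data below are hypothetical; this is a sketch of the criterion, not anticlust's own code.

```r
# Sum of squared distances between rows and their group centroid
# (hypothetical helper, written for illustration)
kmeans_criterion <- function(x, groups) {
  x <- as.matrix(x)
  total <- 0
  for (g in unique(groups)) {
    members  <- x[groups == g, , drop = FALSE]
    centroid <- colMeans(members)
    total    <- total + sum(sweep(members, 2, centroid)^2)
  }
  total
}

# Toy matrix: one numeric column and one binary column
toy_x <- cbind(mass = c(1, 2, 3, 4), male = c(0, 1, 0, 1))
toy_x <- scale(toy_x)  # put both columns on a comparable scale

similar_means   <- c(1, 2, 2, 1)  # equal group means in both columns
different_means <- c(1, 2, 1, 2)  # group means differ in both columns

kmeans_criterion(toy_x, similar_means)
kmeans_criterion(toy_x, different_means)
```

Anticlustering maximizes this criterion: the assignment with equal group means in both columns obtains the larger value. Standardizing first keeps a column with large raw values (such as body mass) from dominating the binary columns.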

The following code extends the simultaneous optimization approach towards k-plus anticlustering, which ensures that standard deviations as well as means are similar between groups (and not only the means, which is achieved via standard k-means anticlustering):

final_groups <- anticlustering(
  cbind(kplus_moment_variables(numeric_vars, T = 2), binary_categories),
  K = 3,
  method = "local-maximum", 
  objective = "variance", 
  repetitions = 10
)

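# `groups` is the assignment from the categorical-only call above;
# use table(final_groups, ...) to inspect the k-plus solution
table(groups, df$sex)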
#>       
#> groups female male
#>      1     55   56
#>      2     55   56
#>      3     55   56
table(groups, df$species)
#>       
#> groups Adelie Chinstrap Gentoo
#>      1     49        23     39
#>      2     49        22     40
#>      3     48        23     40
mean_sd_tab(numeric_vars, final_groups)
#>   bill_length_mm bill_depth_mm  flipper_length_mm body_mass_g       
#> 1 "43.99 (5.49)" "17.17 (1.98)" "200.95 (14.06)"  "4207.21 (807.83)"
#> 2 "43.99 (5.48)" "17.17 (1.97)" "200.98 (14.04)"  "4206.76 (808.06)"
#> 3 "44.00 (5.48)" "17.16 (1.97)" "200.96 (14.08)"  "4207.21 (807.06)"

While we use objective = "variance" (indicating that the k-means objective is used), this code actually performs k-plus anticlustering because the first argument takes as input the augmented k-plus variable matrix.[3] We see that the standard deviations are now also quite evenly matched between groups (unlike with standard k-means anticlustering).
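A small base R sketch of the idea behind the augmented matrix, on a made-up vector. I assume here that the additional variable for the second moment is the squared deviation from the overall mean; the exact transformation and standardization applied by kplus_moment_variables() may differ, so see its documentation for details.

```r
# Hypothetical toy vector
x <- c(2, 4, 4, 6, 8, 12)

# Assumed k-plus variable for the second moment:
# squared deviation of each value from the overall mean
x_sq_dev <- (x - mean(x))^2
kplus_input <- cbind(x = x, x_sq_dev = x_sq_dev)

toy_groups <- c(1, 2, 2, 1, 1, 2)  # arbitrary assignment for illustration

# Group means of both columns (rows: groups, columns: variables)
group_means <- apply(kplus_input, 2, function(col) tapply(col, toy_groups, mean))
group_means
```

Similar group means in the second column indicate a similar spread around the overall mean, which is why applying the k-means criterion to the augmented matrix also evens out the standard deviations.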

In the end, you should try out the different approaches for dealing with categorical variables and see which one works best for you!


  1. Internally, categories_to_binary() is just a thin wrapper around the base R function model.matrix().

  2. Only elements that have the same value in categories are exchanged between clusters throughout the optimization algorithm, so the initial balancing of the categories is never changed when the algorithm runs.

  3. This is how k-plus anticlustering actually works: It reuses the k-means criterion but uses additional “k-plus” variables as input. More information on the k-plus approach is given in the documentation: ?kplus_moment_variables and ?kplus_anticlustering.