`clustord` Structure Summary
Louise McMillan
2025-05-28
clustordStructureSummary.Rmd
For this package, we assume that you have a dataset of ordinal data. The most common form of this is survey data, such as you might get by asking participants to answer a series of questions with Likert-scale answers (for example, ranging from 1 = “Strongly Disagree” to 5 = “Strongly Agree”).
There are three broad types of clustering: row clustering, column clustering and biclustering. Within each of these, there are multiple possible clustering structures.
Mathematically, row and column clustering are equivalent, so you can orient your data matrix either way round, and just choose the appropriate clustering direction.
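As a small illustration of that equivalence, the sketch below transposes a toy response matrix in base R. The data values and the names `responses` and `flipped` are made up for this example and are not part of the package.

```r
# Toy ordinal data: 4 respondents (rows) by 3 questions (columns),
# with responses on a 1-5 Likert scale. Values are invented.
responses <- matrix(c(1, 2, 5,
                      2, 2, 4,
                      5, 4, 1,
                      4, 5, 2),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(paste0("resp", 1:4), paste0("q", 1:3)))

# Transposing swaps the roles of rows and columns, so clustering the
# columns of `responses` targets the same units (the questions) as
# clustering the rows of `flipped`.
flipped <- t(responses)

dim(responses)
dim(flipped)
```

Whichever orientation you choose, the formula then needs the matching clustering direction.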
Row clustering
The `clustord` package can cluster the rows of this data matrix, which often correspond to the subjects or the observations.
Row clustering only
If there are no additional patterns or data we want to incorporate, then this is the simplest form of row clustering.
The formula will be:
Y ~ ROWCLUST
Parameters: `rowc`
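A minimal sketch of fitting this model might look like the following. The toy data are simulated, and the `clustord()` arguments shown (`model = "OSM"`, `nclus.row`, `long.df`) are written from memory of the package interface and should be treated as assumptions to check against the package help pages. The fitting step only runs if `clustord` is installed.

```r
# Simulated 1-5 responses: 10 respondents who tend to answer low and
# 10 who tend to answer high, across 5 questions. Purely illustrative.
set.seed(42)
low  <- matrix(sample(1:3, 50, replace = TRUE), nrow = 10)
high <- matrix(sample(3:5, 50, replace = TRUE), nrow = 10)
responses <- rbind(low, high)

if (requireNamespace("clustord", quietly = TRUE)) {
  # Convert the matrix to the long data frame used for clustering
  long.df <- clustord::mat2df(responses)
  # Assumed call: row clustering only, with 2 row clusters
  fit <- clustord::clustord(Y ~ ROWCLUST, model = "OSM",
                            nclus.row = 2, long.df = long.df)
} else {
  message("clustord is not installed; skipping the model fit")
}
```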
Row clustering with individual column effects
If you think each individual column (e.g. survey question) has slightly different patterns, then we can incorporate those into the clustering structure.
The formula without interaction:
Y ~ ROWCLUST + COL
or the formula with interaction:
Y ~ ROWCLUST*COL
If you have, for example, a gradient in the effects of the individual columns, then interaction between that and the row cluster would allow one cluster to have the gradient going in one direction and the other cluster to have the gradient going in the other direction.
Parameters: `rowc`, `col` and possibly the interaction `rowc_col`.
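The difference between the two formulae can be seen by expanding them with base R's `terms()`. This only illustrates the standard R formula convention, in which `A*B` is shorthand for `A + B + A:B`. It assumes `clustord` reads the operators in the same way and does not call the package itself.

```r
# Term labels for the formula without interaction
additive <- attr(terms(Y ~ ROWCLUST + COL), "term.labels")

# Term labels for the formula with interaction
crossed <- attr(terms(Y ~ ROWCLUST*COL), "term.labels")

print(additive)  # "ROWCLUST" "COL"
print(crossed)   # "ROWCLUST" "COL" "ROWCLUST:COL"

# The only extra term is the interaction between row cluster and column
extra <- setdiff(crossed, additive)
print(extra)     # "ROWCLUST:COL"
```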
Row clustering with row covariates
If you have additional information about the rows, such as demographic information about each survey respondent, that you think will influence how they answered the survey, then you can incorporate that into the clustering structure.
There are many possible formulae for this, but they all follow a similar style to formulae in `lm` or `glm`.
Example formulae without interaction between row clustering and row covariates:
Y ~ ROWCLUST + age
Y ~ ROWCLUST + age + nationality*occupation
Example formulae with interaction between row clustering and some of the row covariates:
Y ~ ROWCLUST*age + nationality*occupation
You also need to supply these row covariates as `xr.df` to the function `mat2df()` when creating the long data frame for clustering (see the `clustord` Tutorial vignette for more details).
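As a rough sketch of that step, the code below builds a covariate data frame and passes it to `mat2df()`. The response values and ages are invented, and it assumes that `xr.df` should contain one row per row of the data matrix, in the same order. The conversion only runs if `clustord` is installed.

```r
# Toy responses: 4 respondents by 3 questions (invented values)
responses <- matrix(c(1, 2, 5,
                      2, 2, 4,
                      5, 4, 1,
                      4, 5, 2),
                    nrow = 4, byrow = TRUE)

# Row covariates: one row per respondent (assumed requirement),
# with a column name matching the covariate used in the formula
xr.df <- data.frame(age = c(23, 35, 51, 44))
stopifnot(nrow(xr.df) == nrow(responses))

if (requireNamespace("clustord", quietly = TRUE)) {
  # The long data frame should now carry `age` alongside the responses,
  # ready for a formula such as Y ~ ROWCLUST + age
  long.df <- clustord::mat2df(responses, xr.df = xr.df)
  print(head(long.df))
} else {
  message("clustord is not installed; skipping mat2df()")
}
```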
You can also include individual column effects as well as covariates:
Y ~ ROWCLUST*age + COL
Parameters: `rowc`, `cov` and possibly the interaction `rowc_cov`, and possibly the column effects `col`.
Row clustering with column covariates
If you have additional information about the columns, such as the probability that a particular question will not be answered honestly, that you think will influence the response data, then you can incorporate that into the clustering structure.
The formula works the same as for other covariates:
Y ~ ROWCLUST + honesty
but you need to supply these column covariates as `xc.df` to the function `mat2df()` when creating the long-form data frame.
Column clustering
The `clustord` package can cluster the columns of this data matrix, which often correspond to the survey questions.
Column clustering only
If there are no additional patterns or data we want to incorporate, then this is the simplest form of column clustering.
The formula will be:
Y ~ COLCLUST
Parameters: `colc`
Column clustering with individual row effects
If you think each individual row (e.g. survey respondent) has slightly different patterns, then we can incorporate those into the clustering structure.
The formula without interaction:
Y ~ COLCLUST + ROW
or the formula with interaction:
Y ~ COLCLUST*ROW
If you have, for example, a gradient in the effects of the individual respondents, then interaction between that and the column cluster would allow one cluster of columns to have the gradient of respondents going in one direction and the other column cluster to have the gradient of respondents going in the other direction.
IMPORTANT: Many datasets have far more rows than columns. If this is the case (e.g. if there are more than 100 rows), then this model may not be suitable, because it requires one parameter for every row and it is difficult to estimate that many parameters accurately.
Parameters: `colc`, `row` and possibly the interaction `colc_row`.
Column clustering with column covariates
If you have additional information about the columns, such as the probability that a particular question will not be answered honestly, that you think will influence the response data, then you can incorporate that into the clustering structure.
The formula works the same as for other covariates. Column clustering with covariates without interaction:
Y ~ COLCLUST + honesty
Column clustering with covariates and interaction:
Y ~ COLCLUST*honesty
Supply column covariates as `xc.df` to `mat2df()`.
Parameters: `colc`, `cov` and possibly the interaction `colc_cov`.
Column clustering with row covariates
As before, if there is additional information about the rows (e.g. survey respondents) that you think will affect the response values, you can incorporate that when clustering the columns.
The formula works the same as for other covariates. Column clustering with row covariates without interaction:
Y ~ COLCLUST + age
Column clustering with row covariates and interaction:
Y ~ COLCLUST*age
Supply row covariates as `xr.df` to `mat2df()`.
Parameters: `colc`, `cov` and possibly the interaction `colc_cov`.
Biclustering
The package can also cluster both rows and columns simultaneously, which we call biclustering. This finds the combinations of subjects and questions that exhibit similar response patterns.
The model can include just the main row and column clustering effects:
Y ~ ROWCLUST + COLCLUST
or it can also include the interaction between them:
Y ~ ROWCLUST*COLCLUST
Parameters: `rowc`, `colc` and possibly the interaction `rowc_colc`.
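A biclustering fit needs a cluster count for each direction. The sketch below simulates data with a block structure and fits the interaction model. As with any call written from memory, the `clustord()` arguments (`model = "OSM"`, `nclus.row`, `nclus.column`, `long.df`) are assumptions to verify against the package documentation, and `block()` is a hypothetical helper defined only for this example.

```r
set.seed(7)

# Hypothetical helper: an nr x nc matrix of responses drawn from `values`
block <- function(values, nr, nc) {
  matrix(sample(values, nr * nc, replace = TRUE), nrow = nr)
}

# Two groups of respondents and two groups of questions: the first group
# answers low on questions 1-3 and high on 4-6, the second the reverse.
responses <- rbind(
  cbind(block(1:2, 10, 3), block(4:5, 10, 3)),
  cbind(block(4:5, 10, 3), block(1:3, 10, 3))
)

if (requireNamespace("clustord", quietly = TRUE)) {
  long.df <- clustord::mat2df(responses)
  # Assumed call: 2 row clusters, 2 column clusters, with interaction
  fit <- clustord::clustord(Y ~ ROWCLUST*COLCLUST, model = "OSM",
                            nclus.row = 2, nclus.column = 2,
                            long.df = long.df)
} else {
  message("clustord is not installed; skipping the model fit")
}
```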
Biclustering with covariates
You can include covariates when performing biclustering.
Biclustering with row and column covariates without any interaction:
Y ~ ROWCLUST + COLCLUST + age + honesty
Biclustering with row and column cluster interactions but no covariate interaction:
Y ~ ROWCLUST*COLCLUST + age + honesty
Biclustering without interaction between row and column clusters, but with interaction between row clusters and row covariates:
Y ~ ROWCLUST*age + COLCLUST + honesty
Biclustering with row and column cluster interactions and interaction between row clusters and row covariates:
Y ~ ROWCLUST*age + COLCLUST + ROWCLUST:COLCLUST + honesty
Biclustering without interaction between row and column clusters, but with interaction between column clusters and column covariates:
Y ~ ROWCLUST + age + COLCLUST*honesty
Biclustering with row and column cluster interactions and interaction between column clusters and column covariates:
Y ~ ROWCLUST + COLCLUST*honesty + ROWCLUST:COLCLUST + age
You cannot include three-way interactions with the row and column cluster interaction: formula terms such as `ROWCLUST:COLCLUST:age` are disallowed.
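One way to spot such terms before fitting is to expand the formula with base R's `terms()` and look for any term that combines both cluster terms with something else. The helper `disallowed_terms()` below is hypothetical and is not the package's own check. It also shows why writing `ROWCLUST*COLCLUST*age` is a problem: the `*` operator expands to include the three-way term.

```r
# Hypothetical helper: return the labels of any terms that involve both
# ROWCLUST and COLCLUST together with at least one other variable.
disallowed_terms <- function(f) {
  labels <- attr(terms(f), "term.labels")
  parts <- strsplit(labels, ":", fixed = TRUE)
  bad <- vapply(parts, function(p) {
    length(p) >= 3 && all(c("ROWCLUST", "COLCLUST") %in% p)
  }, logical(1))
  labels[bad]
}

# One of the example formulae above: no three-way terms
ok <- disallowed_terms(Y ~ ROWCLUST*age + COLCLUST + ROWCLUST:COLCLUST + honesty)
print(ok)       # character(0)

# Crossing all three expands to include the disallowed term
flagged <- disallowed_terms(Y ~ ROWCLUST*COLCLUST*age)
print(flagged)  # "ROWCLUST:COLCLUST:age"
```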
Parameters: `rowc` and `colc`, covariate effects `cov`, possibly the cluster interaction `rowc_colc`, and possibly the cluster-covariate interactions `rowc_cov` or `colc_cov`.