Tutorial for densify

Anna Graff, Marc Lischka, Taras Zakharko, Reinhard Furrer, Balthasar Bickel

1 Introduction

The densify package provides a procedure to generate denser subsets of input data frames according to varying user-defined parameters. The densification process is split into two main functions, densify() and prune(). This tutorial guides users through all functions and parameters to demonstrate their effects on densification.

The data used in this tutorial are the default data provided in the package: the language-feature matrix WALS obtained from WALS, the World Atlas of Language Structures (Dryer and Haspelmath 2013), and the language taxonomy glottolog_languoids provided by Glottolog, Version 5.0 (Hammarström et al. 2024).

2 Matrix densification

2.1 Preparing input

The input for densification is a data frame with rows representing observations or taxa (e.g., languages) and columns representing features (in linguistic typological databases, these are also often referred to as “characters” or “parameters”). The data frame requires taxa to be expressed in rows, with taxon names specified in a column (e.g., a column called “Glottocodes”). Columns can denote meta-data (such as taxon names) or features; feature names must be set as the column names. Any cells with empty or unwanted (e.g., “?”) entries must be coded as NA.

If the densification process is to be sensitive to the taxonomic structure of the observations, a taxonomy must be provided in which all observations in the data frame are attested either as tips or as nodes. The input taxonomy can have two forms: it can be a phylo object (e.g., read via ape::read.nexus) or an adjacency table (i.e., a data frame containing columns id and parent_id, with each row encoding one parent-child relationship). To generate a language taxonomy, the data from Glottolog, accessible as data(glottolog_languoids), can be used directly.
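
For illustration, here is a minimal hand-built adjacency table for a toy taxonomy (the column names id and parent_id follow the format described above; we assume here that the parent_id of a root node is coded as NA):

# a toy taxonomy: one family with two branches and three languages
toy_taxonomy <- data.frame(
  id        = c("fam1", "branchA", "branchB", "lang1", "lang2", "lang3"),
  parent_id = c(NA, "fam1", "fam1", "branchA", "branchA", "branchB")
)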

The taxonomic input is parsed internally by the function densify() to generate a flattened taxonomic representation, which is then used for densification if so specified. As such, users do not need to convert the input taxonomy into a flattened format themselves, but we expose the function as_flat_taxonomy_matrix() here, since it can be useful for other applications.

# load packages
library(densify)
library(ggplot2)
library(dplyr)
# exclude languages from WALS that are not attributable to any entry in the glottolog taxonomy
WALS <- WALS[which(WALS$Glottocode %in% glottolog_languoids$id), ]

# ensure all missing and non-applicable information is coded as NA
WALS[WALS==""] <- NA
WALS[WALS=="?"] <- NA
WALS[WALS=="NA"] <- NA

# this is the flat glottolog taxonomy generated within the densify function
# with the input glottolog_languoids
flat_taxonomy_matrix <- as_flat_taxonomy_matrix(glottolog_languoids)

The initial (full) dataset provided by WALS encompasses data on 192 linguistic features, coded for 2496 languages (representing 318 families) that are attributable to Glottolog.

languages <- WALS$Glottocode
data <- WALS %>% select(-Glottocode)
# number of languages
length(languages) 
# [1] 2496
# number of families
flat_taxonomy_matrix %>% filter(id %in% languages) %>% select(level1) %>% unique() %>% nrow()
# [1] 318
# number of features
ncol(data)
# [1] 192

In this matrix each feature is coded for a distinct set of languages, such that the coding density, which we define as the proportion of non-empty cells, is patchy for both languages and features (see Figure 2.1).


Figure 2.1: Coding density for the WALS data frame (left languages, right features)

The overall coding density of the input data frame is about 0.16, i.e., only around 16% of all cells are coded.

mean(!is.na(data))
# [1] 0.1568217

2.2 Iterative matrix densification using densify()

A straightforward method to find a denser sub-matrix within a sparse super-matrix is to remove all rows and columns with coding densities below a given threshold. However, removing a row subtly changes all column densities and vice versa, so a more cautious method is to iteratively remove rows and columns and to update, after each removal, the coding densities of all remaining rows and columns. The function densify() accomplishes this: It takes the original matrix, iteratively removes rows and columns according to specifiable parameters (described below), and logs the characteristics of the matrix after each iteration. The output of the function is a densify_result object, which is parsed by prune(), rank_results() or visualize() (see below), to determine the optimal number of iterations according to user-specified criteria post-hoc.

2.2.1 High-level description of densify()

Based on the sparsely encoded data frame provided by the user, densify() defines an encoding matrix of the same dimensions, containing zeroes for NAs and ones elsewhere. Using this encoding matrix, it computes an importance score for each row and each column according to user-specifiable parameters, identifies the row and/or column with the lowest importance score (randomly sampling one row or column if there are ties) and removes it. After the first row or column is removed, the function re-computes the importance scores of the resulting sub-matrix, identifies which row and/or column scores worst in the new matrix, and in turn removes it. This process is repeated until the matrix density reaches 1 or until user-specified limit conditions are met. Key characteristics (e.g., number of data points, number of rows, number of columns, overall coding density, identity of removed rows and/or columns, etc.) of all sub-matrices produced in the pruning process are documented in a densify_result object, the output of the function.
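
To make this concrete, the following simplified sketch re-implements the basic loop on a plain 0/1 encoding matrix, using plain arithmetic row and column means as stand-in importance scores (this is an illustration of the principle only, not the package's internal algorithm):

# simplified sketch of the iterative principle (not densify() internals)
set.seed(1)
M <- matrix(rbinom(60, 1, 0.3), nrow = 10)  # toy encoding matrix: 1 = coded, 0 = NA
drop_one <- function(idx) idx[sample.int(length(idx), 1)]  # random tie-breaking

while (mean(M) < 1 && all(dim(M) > 1)) {
  row_scores <- rowMeans(M)  # stand-in row importance
  col_scores <- colMeans(M)  # stand-in column importance
  if (min(row_scores) <= min(col_scores)) {
    M <- M[-drop_one(which(row_scores == min(row_scores))), , drop = FALSE]
  } else {
    M <- M[, -drop_one(which(col_scores == min(col_scores))), drop = FALSE]
  }
}
mean(M)  # coding density of the remaining sub-matrix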

There are several ways in which importance scores for rows and columns can be computed. Our approach distinguishes between what will be referred to as absolute and weighted coding densities. Absolute coding densities denote the (raw) proportion of coded cells in a row or column. If a row is coded for 10% of the columns in the matrix, its absolute coding density is thus 0.1. Several rows may be coded for the same absolute proportion of columns, but the identity of the columns for which such rows are coded may well be distinct – some rows may be coded for columns which are well-coded, while others may be coded for columns coded only for few rows. Such differences among rows with the same absolute coding density matter when establishing the importance scores for rows and columns in the iterative pruning approach adopted here: among those sparsely coded rows (or columns, respectively) coded for the same proportion of columns (or rows, respectively), the ones coded for less well-coded columns are removed first. This is captured by the weighted coding density.
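
The distinction is easy to illustrate on a toy encoding matrix. Below, we take the weighted row coding density to be the mean coding density of the columns a row is coded for; this is one plausible formalization of the idea sketched above, and the package's exact computation may differ:

enc <- matrix(c(1, 0, 1, 0,
                1, 1, 0, 0,
                0, 0, 1, 1), nrow = 3, byrow = TRUE)
abs_row_density <- rowMeans(enc)  # all three rows: 0.5
col_density <- colMeans(enc)      # 0.67, 0.33, 0.67, 0.33
# weighted: mean density of the columns each row is coded for
wtd_row_density <- as.vector(enc %*% col_density) / rowSums(enc)
cbind(abs_row_density, wtd_row_density)
# row 1 is coded for the two best-coded columns and scores highest (0.67 vs. 0.5),
# although all three rows have the same absolute coding density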

Thus, to determine the importance score of a column (feature), densify() computes the weighted column coding density. To determine the importance score of a row (language), one of three possible types of mean (described in detail below) is used, taking into account (1) the absolute row coding densities, (2) the weighted row coding densities and – if specified – (3) the contribution of a specific row to the taxonomic diversity of the sample.

Row and column importance scores are treated differently for two reasons. First, only rows can have taxonomic structure. Second, in most linguistic typological research applications, densification focuses chiefly on pruning away rows (languages/taxa/observations) rather than columns because the former are usually substantially more numerous, and including absolute coding densities in the row importance score primarily has the effect of penalizing scarcely-coded rows in the matrix more strongly.

Removing rows can have effects not only on coding densities of columns, but also on their informativeness: It is possible for the iterative removal of rows to slowly deplete a feature state, such that a feature (column) no longer displays sufficient or any variability (i.e., a column may still be coded for a substantial number of rows but all remaining coded rows might have the value FALSE, because all rows with the value TRUE have been pruned away). Thus, after each row removal, densify() assesses whether any features have become uninformative according to a user-defined threshold denoting how many observations must be present in the second-largest feature state, and removes any such column(s) alongside the row in the same iteration.

Conversely, it is possible that the removal of a specific column results in a row no longer exhibiting any data points (because it was only coded for the column that was removed). Thus, after each column removal the algorithm assesses whether any rows have become empty and removes any such row(s) alongside the column in the same iteration.
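
Both checks are straightforward to state in code. The helper names below are ours, for illustration only; they mirror the behavior just described:

# does a feature retain enough variability? (count of the second-largest state)
is_informative <- function(x, min_variability = 1) {
  counts <- sort(table(x), decreasing = TRUE)  # table() drops NA values
  length(counts) >= 2 && counts[2] >= min_variability
}
# has a row become empty (no data points left)?
is_empty_row <- function(row) all(is.na(row))

is_informative(c("a", "a", "b", NA))  # TRUE: second-largest state has 1 observation
is_informative(c("a", "a", "a", NA))  # FALSE: only one state left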

2.2.2 The parameters for densify()

The function densify() requires the user to set a number of parameters, most of which modulate the way the row importance scores are computed at each iteration. This section describes and discusses all parameters available to fine-tune the densification process.

densify() takes the following arguments:

  • data: The data frame to be pruned, in the format described above.

  • cols: A specification of which feature columns to densify. The default is to include all columns for densification.

  • taxonomy: A taxonomic tree encompassing all taxa represented in data as tips or nodes. It can be a phylo object or an adjacency table (a data frame containing columns id and parent_id, with each row encoding one parent-child relationship).

  • taxon_id: The name of the column identifying taxa. If this column is not specified, densify() will attempt to guess which column denotes taxa based on column contents.

  • density_mean: This parameter specifies how the measure denoting row importance is computed. It can be one of the following three mean types: "arithmetic", "geometric", "log_odds". See below for more details. The default value is "log_odds", which is based on the mean of logit-transformed proportions.

  • min_variability: An integer specifying how many observations/taxa the second-most-frequent feature state of any feature must contain for the feature to be maintained in the matrix. The default is set to 1, guaranteeing that a feature must display at least minimal variability: at least one observation must be different from all others. Value NA disables pruning according to this principle.

  • limits: A named list specifying the limit conditions for the densification process to end. Available limit conditions are min_coding_density (a number between 0 and 1 specifying the target coding density), min_prop_rows (a number between 0 and 1 specifying the minimal proportion of rows that have to be retained in the data), and min_prop_cols (a number between 0 and 1 specifying the minimal proportion of feature columns that have to be retained in the data). More than one condition can be specified simultaneously; densify() will stop as soon as any condition is reached. The default limit is min_coding_density = 1, i.e., densification ends once the matrix is fully dense and no NAs are left.

  • density_mean_weights: A named list specifying weighting factors applied during importance score computation. Available parameters are coding and taxonomy, applied to the coding density and taxonomic diversity values respectively, thus tweaking the relative importance of coding density and taxonomic diversity in the pruning process. Both values must be numbers \(\in [0,1]\). Setting a parameter to 0 (or NA, or NULL) disables the contribution of the respective values to the computed row importance score and hence to the pruning process. The default value is density_mean_weights = list(coding = 1, taxonomy = 1).

While it is possible to run densify() with several parameter settings on the same input data frame, we strongly recommend choosing parameter settings based on the desired characteristics of the output data frame (e.g., how strongly should taxonomic diversity be weighted in relation to the coding density of rows?).
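
For instance, a run that stops densification early, once the matrix is half dense or once only a quarter of the original rows remain, could be specified as follows (a hypothetical setting for illustration; the argument names are those documented above):

dr_limited <- 
  densify(data = WALS, 
          taxonomy = glottolog_languoids,
          taxon_id = "Glottocode",
          density_mean = "log_odds", 
          min_variability = 1,
          limits = list(min_coding_density = 0.5, min_prop_rows = 0.25))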

2.2.2.1 Parameters cols and taxon_id

Users can specify the names of the columns which represent features that should undergo densification using the cols parameter and the name of the column denoting taxon names using the taxon_id parameter.

The cols argument is useful if certain columns should not be densified, e.g. because they encode language meta-data. If no cols are specified, densify() will assume that all but the taxon_id column should be included.

Note that if no taxon_id is specified, densify() will try to make an educated guess as to which column represents taxon names based on the data.

2.2.2.2 Parameter taxonomy

A taxonomy must be provided under the parameter taxonomy if taxonomic diversity should be considered for densification or optimum sub-matrix identification. It can be provided as a phylo object or as an adjacency table (i.e., a data frame containing columns id and parent_id, with each row encoding one parent-child relationship, like glottolog_languoids).

2.2.2.3 Parameters density_mean and density_mean_weights

densify() offers three methods of computing the mean of row-wise absolute and weighted coding densities as well as taxonomic diversity of the sample where applicable: possible values are "arithmetic", "geometric", or "log_odds". The arithmetic and geometric means operate as conventionally understood: the arithmetic mean utilizes the base::mean() function, while the geometric mean involves the n-th root of the product. The “log odds mean” (henceforth: LOM) is computed by taking the arithmetic mean of the logit transform of all inputs, and then transforming back. In the computation of the LOM, a clamping step prior to logit-value computation ensures that all values remain finite. For all scoring settings, the respective means of (1) the absolute coding densities, (2) the weighted coding densities and – if applicable – (3) the contribution to taxonomic diversity of the sample are computed for each row.
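
To make the three mean types concrete, here are direct implementations (the clamping constant eps is our choice for illustration; the package may use a different value):

arith_mean <- function(x) mean(x)
geom_mean <- function(x) exp(mean(log(x)))  # n-th root of the product
lom_mean <- function(x, eps = 1e-6) {
  x <- pmin(pmax(x, eps), 1 - eps)  # clamp so that all logit values remain finite
  plogis(mean(qlogis(x)))           # mean on the logit scale, transformed back
}

# an extremely high component dominates the LOM but not the geometric mean
x <- c(0.5, 0.99)
c(arithmetic = arith_mean(x), geometric = geom_mean(x), log_odds = lom_mean(x))
# arithmetic ~0.745, geometric ~0.704, log_odds ~0.909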

To illustrate the different effects of the three scoring types available, the top panel in Figure 2.2 traces the mean of a constant 0.5 and a coding density x between 0 and 1, for the mean types "arithmetic", "geometric" and "log_odds". Note how the log odds mean strongly skews towards 0 and 1 as x approaches the extremes: once one of the densities is particularly low or high, it will dominate the mean. The effect is that if a row's coding density or contribution to taxonomic diversity is extremely high (close or equal to 1), the row is unlikely to be removed even if its other components are below average. Conversely, if one of these values is extremely low (approaching 0), the row is very likely to be removed (even if its other components are above average). By contrast, the geometric mean shows this skewing only when a density approaches 0, emphasizing only the negative effect of extremely low coding densities or contributions to taxonomic diversity. If users intend to weight extremely low or extremely high values of either contribution to taxonomic diversity or codedness strongly, they should thus favor the LOM for densify(). The geometric mean should be chosen if extremely high densities or contributions to taxonomic diversity should not outweigh extremely low densities. In other words, the geometric mean is useful if the focus lies more on removing rows with extremely low densities than on retaining rows with extremely high densities.

densify() also provides users with the option of modulating the weights of the components from which the "arithmetic", "geometric", or "log_odds" mean score is computed. This is achieved by a separate parameter density_mean_weights, where weights can be specified for coding and taxonomy. These are multipliers \(\in [0,1]\), defaulting to 1. The coding weight specifies the value by which the absolute and the weighted coding densities are multiplied prior to importance score computation, and the taxonomy weight specifies the value by which the taxonomic diversity score is multiplied. If set to different values, these weight factors thus allow users to bias the weight of taxonomic diversity relative to the weight of coding density in the computation of the row scores and ultimately in the iterative pruning process. To show the effect of this, the middle and bottom panels in Figure 2.2 illustrate how different weight factors (here: 0.5, 0.75, 0.9, 0.95, 0.99 and 0.999) modulate the behavior of the geometric mean and the LOM, respectively, of a constant 0.5 and all possible values of x.

Setting the taxonomy weight to 0, NA or NULL makes the pruning process ignore taxonomic structure. If a taxonomy is provided, a taxonomic diversity score (the Shannon entropy of the highest taxonomic level in the taxonomy) will nonetheless be computed and logged for each sub-matrix in the densify_result output produced by densify(). This way, ignoring taxonomic diversity in the pruning process does not preclude taxonomic diversity from being considered when identifying the optimal number of iterations using rank_results(), prune() or visualize() later (see below). (In principle, the coding weight can also be disabled. However, this undermines the purpose of densification.)
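
The weighting mechanism itself is simple: components are multiplied by their weights before the mean is taken. Building on the lom_mean() helper sketched above, a weighted row score could look like this (a sketch of the principle; the exact composition of the score inside densify() may differ):

# weight components before taking the mean (using lom_mean() from above)
# note: in densify(), a weight of 0 disables (drops) a component entirely
weighted_score <- function(abs_d, wtd_d, tax_div, coding = 1, taxonomy = 1) {
  lom_mean(c(coding * abs_d, coding * wtd_d, taxonomy * tax_div))
}

weighted_score(0.4, 0.6, 0.9)                  # default weights
weighted_score(0.4, 0.6, 0.9, taxonomy = 0.5)  # taxonomic diversity down-weighted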


Figure 2.2: Illustrating different mean types (top), different coding weights for the geometric mean (middle), and different coding weights for the LOM (bottom).
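
As mentioned above, the taxonomic diversity score logged for each sub-matrix is the Shannon entropy of top-level family membership. A direct computation looks like this (a sketch based on that description):

# Shannon entropy of family membership for a vector of family labels
shannon_entropy <- function(families) {
  p <- prop.table(table(families))
  -sum(p * log(p))
}
shannon_entropy(c("fam1", "fam1", "fam2", "fam3"))  # more even samples score higher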

In summary, the parameters density_mean and density_mean_weights modulate the pruning process and can (and should) be adjusted to meet the needs of the user. It may be necessary to experiment with different values to find a good balance of parameters.

2.2.2.4 Parameter min_variability

While it is conceivable that one might want to retain features that display no variability whatsoever among the coded taxa/observations (min_variability = NA), such features are uninteresting for most statistical purposes.

The parameter min_variability specifies how many taxa the second-most-frequent feature state of any feature must contain for the feature to be maintained in the matrix. The default is 1, i.e. there must be at least one language that has a state different from all others. This will remove features with only one state – whether they are present in the input matrix or whether they result from removing rows. For certain comparative analyses, it may be reasonable to set the threshold to an integer \(n > 1\). This ensures that each feature contains at least two values and that the second-most-frequent value is present in at least \(n\) taxa/observations.

Consider the following example:

example <- data.frame(feature.A=c("present","absent","absent","absent","absent","absent","absent"), 
                      feature.B=c("present","present","absent","absent","absent","absent","absent"), 
                      feature.C=c("type1","type1","type1","type1","type1","type2","type3"),
                      feature.D=c("type1","type1","type1","type1","type2","type2","type3"),
                      feature.E=c("type1","type1","type1","type1","type2","type3","type4"),
                      feature.F=c("type1","type1","type1","type1","type1","type1","type1"))

table(example$feature.A)
# 
#  absent present 
#       6       1
table(example$feature.B)
# 
#  absent present 
#       5       2
table(example$feature.C)
# 
# type1 type2 type3 
#     5     1     1
table(example$feature.D)
# 
# type1 type2 type3 
#     4     2     1
table(example$feature.E)
# 
# type1 type2 type3 type4 
#     4     1     1     1
table(example$feature.F)
# 
# type1 
#     7

In this data frame, features A and B each have two states, absent and present, but feature B exhibits more variation than feature A: The rarer state present counts two observations, whereas it counts only one observation in feature A. Features C and D each have three states, type1, type2 and type3, but feature D exhibits more variation than feature C: The second-most-frequent state type2 counts two observations instead of just one (feature C).

Different values for min_variability on this data have the following effect:

  • min_variability = NA: All features are retained, including feature F, which displays no variability among observations.

  • min_variability = 1: Features A, B, C, D and E are retained, because each of them contains at least one observation in the second-most-frequent state. Feature F is removed, because it only includes one state.

  • min_variability = 2: Features B and D are retained, because both of them contain at least two observations in the second-most-frequent state. Features A, C, E and F are removed because they either have only one observation in their second-most-frequent state or because they are invariant.

2.2.3 Preparing examples with varying parameters

The following 16 implementations of densify() take WALS as the input data frame, set min_variability = 3, and vary the parameters density_mean and density_mean_weights discussed above. Their outputs, as well as their effects on the densification process, are illustrated below.

# arithmetic mean, including and excluding taxonomic diversity
set.seed(2023)

dr_arith_taxtrue <- 
  densify(data = WALS, 
          taxonomy = glottolog_languoids,
          taxon_id = "Glottocode",
          density_mean = "arithmetic", 
          min_variability = 3,
          density_mean_weights = list(coding = 1, taxonomy = 1))

dr_arith_taxfalse <- 
    densify(data = WALS, 
          taxonomy = glottolog_languoids,
          taxon_id = "Glottocode",
          density_mean = "arithmetic", 
          min_variability = 3,
          density_mean_weights = list(coding = 1, taxonomy = 0))

# geometric mean, including and excluding taxonomic diversity
dr_geom_taxtrue <- 
  densify(data = WALS, 
          taxonomy = glottolog_languoids,
          taxon_id = "Glottocode",
          density_mean = "geometric", 
          min_variability = 3,
          density_mean_weights = list(coding = 1, taxonomy = 1))

dr_geom_taxfalse <- 
    densify(data = WALS, 
            taxonomy = glottolog_languoids,
            taxon_id = "Glottocode",
            density_mean = "geometric", 
            min_variability = 3,
            density_mean_weights = list(coding = 1, taxonomy = 0))

# log_odds mean, including and excluding taxonomic diversity
# varying weights when taxonomic diversity included
dr_lom_taxtrue_11 <-
  densify(data = WALS, 
          taxonomy = glottolog_languoids,
          taxon_id = "Glottocode",
          density_mean = "log_odds", 
          min_variability = 3,
          density_mean_weights = list(coding = 1, taxonomy = 1))

dr_lom_taxtrue_099099 <-
  densify(data = WALS, 
          taxonomy = glottolog_languoids,
          taxon_id = "Glottocode",
          density_mean = "log_odds", 
          min_variability = 3,
          density_mean_weights = list(coding = 0.99, taxonomy = 0.99))

dr_lom_taxtrue_095095 <-
  densify(data = WALS, 
          taxonomy = glottolog_languoids,
          taxon_id = "Glottocode",
          density_mean = "log_odds", 
          min_variability = 3,
          density_mean_weights = list(coding = 0.95, taxonomy = 0.95))

dr_lom_taxtrue_090090 <-
    densify(data = WALS,
          taxonomy = glottolog_languoids,
          taxon_id = "Glottocode",
          density_mean = "log_odds",
          min_variability = 3,
          density_mean_weights = list(coding = 0.90, taxonomy = 0.90))

dr_lom_taxtrue_050050 <-
    densify(data = WALS,
          taxonomy = glottolog_languoids,
          taxon_id = "Glottocode",
          density_mean = "log_odds",
          min_variability = 3,
          density_mean_weights = list(coding = 0.50, taxonomy = 0.50))

dr_lom_taxtrue_1095 <-
    densify(data = WALS,
          taxonomy = glottolog_languoids,
          taxon_id = "Glottocode",
          density_mean = "log_odds",
          min_variability = 3,
          density_mean_weights = list(coding = 1, taxonomy = 0.95))

dr_lom_taxtrue_0951 <-
    densify(data = WALS,
          taxonomy = glottolog_languoids,
          taxon_id = "Glottocode",
          density_mean = "log_odds",
          min_variability = 3,
          density_mean_weights = list(coding = 0.95, taxonomy = 1))

dr_lom_taxtrue_1090 <-
    densify(data = WALS,
          taxonomy = glottolog_languoids,
          taxon_id = "Glottocode",
          density_mean = "log_odds",
          min_variability = 3,
          density_mean_weights = list(coding = 1, taxonomy = 0.90))

dr_lom_taxtrue_0901 <-
    densify(data = WALS,
          taxonomy = glottolog_languoids,
          taxon_id = "Glottocode",
          density_mean = "log_odds",
          min_variability = 3,
          density_mean_weights = list(coding = 0.90, taxonomy = 1))

dr_lom_taxfalse_1 <-
  densify(data = WALS,
          taxonomy = glottolog_languoids,
          taxon_id = "Glottocode",
          density_mean = "log_odds",
          min_variability = 3,
          density_mean_weights = list(coding = 1, taxonomy = 0))

dr_lom_taxfalse_095 <-
  densify(data = WALS,
          taxonomy = glottolog_languoids,
          taxon_id = "Glottocode",
          density_mean = "log_odds",
          min_variability = 3,
          density_mean_weights = list(coding = 0.95, taxonomy = 0))

dr_lom_taxfalse_075 <-
  densify(data = WALS,
          taxonomy = glottolog_languoids,
          taxon_id = "Glottocode",
          density_mean = "log_odds",
          min_variability = 3,
          density_mean_weights = list(coding = 0.75, taxonomy = 0))

2.2.4 Objects of type densify_result, the outputs of densify()

Outputs of the densify() function are objects of type densify_result, which log characteristics of all sub-matrices generated throughout densification. For each sub-matrix (row), the columns of the tibble log:

  • .step: the iteration step in the densification
  • n_data_points: the number of total non-missing data points
  • n_rows: the number of taxa
  • n_cols: the number of columns (excluding extra columns, such as taxon_id or meta-data)
  • coding_density: the overall proportion of non-missing data points
  • row_coding_density_min: the coding density of the row with the lowest coding density
  • row_coding_density_median: the median row coding density
  • row_coding_density_max: the coding density of the row with the highest coding density
  • col_coding_density_min: the coding density of the column with the lowest coding density
  • col_coding_density_median: the median column coding density
  • col_coding_density_max: the coding density of the column with the highest coding density
  • taxonomic_index: the Shannon-Wiener diversity index for the highest taxonomic level (i.e., family (or isolate) membership) of languages
  • data: a promise object that lazily evaluates to the pruned data frame of this step. To retrieve the corresponding object, run as.data.frame(x), e.g. as.data.frame(dr_arith_taxtrue$data[4]) for the fourth sub-matrix.
  • changes: a tibble listing all pruned taxa and/or features in the iteration, including the reason for pruning.
# overview of densify_result object
head(dr_arith_taxtrue)
# # Densify result with 6 pruning steps
# # Highest achieved coding density is 17%
#   .step n_data_points n_rows n_cols coding_density row_coding_density_min
#   <int>         <dbl>  <int>  <int>          <dbl>                  <dbl>
# 1     1         75154   2496    192          0.157                0.00521
# 2     2         75154   2496    192          0.157                0.00521
# 3     3         75075   2494    186          0.162                0.00538
# 4     4         75037   2457    185          0.165                0.00541
# 5     5         75036   2456    185          0.165                0.00541
# 6     6         75035   2455    185          0.165                0.00541
# # ℹ 8 more variables: row_coding_density_median <dbl>,
# #   row_coding_density_max <dbl>, col_coding_density_min <dbl>,
# #   col_coding_density_median <dbl>, col_coding_density_max <dbl>,
# #   taxonomic_index <dbl>, data <list>, changes <list>
# #
# # Use `plot()` or `visualize()` to visualize density statistics and pick the
# best result
# # Use `rank_results()` or `prune()` to pick the best densify result based on
# your criteria
# # Use `eval()` to manually retrieve any data frame from column `data`

# retrieving a specific sub-matrix
submatrix <- as.data.frame(dr_arith_taxtrue$data[4])

2.3 Determining the optimal number of iterations using rank_results(), prune() or visualize()

The output from densify() allows users to retrieve all sub-matrices produced in the iterative pruning process and to compare them based on the summary statistics available. Depending on which criteria matter for a specific densification, different summary statistics are used for comparison and hence for optimum identification.

Naturally, the sub-matrices generated after more iterations are denser than those generated after only a few pruning steps. However, they also contain less data. There is thus a trade-off between the proportion of coded data and the number of available data points, which users can exploit to determine the optimal number of iterations. If taxonomic diversity is relevant to users, it provides an additional criterion that can modulate which number of iterations is optimal: If the input is taxonomically highly uneven and taxonomic structure is considered for pruning, the taxonomic diversity of the taxa provided in rows will tend to increase throughout the initial pruning phase until it levels off and subsequently decreases. Other users may want to ensure their sub-matrix does not contain (any) rows or columns with very low coding densities.

The functions rank_results(), prune() and visualize() thus take the output from densify() as an input and compare the sub-matrices resulting from each iteration by computing a matrix quality score via an intuitively worded formula consisting of one or more summary statistics from the densify_result object. The formula is specified under the parameter scoring_function (as a standard R expression) and expresses how these trade-offs should play out in the sub-matrix selection. The default formula is scoring_function = n_data_points * coding_density, which maximizes the product of available non-missing data points and the overall matrix coding density.

Using the function rank_results(), users can rank all sub-matrices with respect to the scoring function they specify (returning a vector indicating the rank of each sub-matrix), while prune() identifies and retrieves the sub-matrix for which the scoring function is maximized (selecting the earlier iteration in the unlikely case of ties). The function visualize() plots the quality score for all sub-matrices and also indicates the optimum.

# rank results
head(rank_results(dr_arith_taxtrue, scoring_function = n_data_points * coding_density))
# [1] 2383 2383 2375 2369 2367 2366

# prune matrix
pruned <- prune(dr_arith_taxtrue, scoring_function = n_data_points * coding_density)
dim(pruned)
# [1] 1081  155

# visualize quality score
visualize(dr_arith_taxtrue, scoring_function = n_data_points * coding_density)
# ℹ use `tibble::as_tibble(dr_arith_taxtrue$data[[1407]])` to obtain the pruned data frame
Figure: Quality score plot for scoring_function = n_data_points * coding_density on the densify result object obtained from an implementation of densify() using the values density_mean = "arithmetic", considering taxonomic diversity.

Users can run rank_results(), prune() or visualize() multiple times on the same densification output using different scoring functions to fine-tune the settings according to their needs. We recommend choosing the scoring function based on the characteristics of the input data frame as well as the desired characteristics of the output data frame.

For instance, if coding sparsity of the input matrix mainly manifests itself in rows but not columns, and an output matrix with low row coding density is undesirable, this should be expressed by including the summary statistic row_coding_density_min (e.g., scoring_function = n_data_points * coding_density * row_coding_density_min). If this concern is strong, the importance of row_coding_density_min can be increased by an exponent (e.g., scoring_function = n_data_points * coding_density * row_coding_density_min^2).

Other users may have an extremely sparse matrix and merely seek to strongly densify it, irrespective of taxonomic structure in rows. They would seek to include both row_coding_density_min and col_coding_density_min, possibly even increasing coding_density relative to n_data_points (e.g., scoring_function = n_data_points * coding_density^2 * row_coding_density_min * col_coding_density_min).

In another example, coding densities may be high and roughly even across rows and columns, and the motivation for pruning may reside mainly in selecting a taxonomically diverse sub-sample for an analysis. In such a case, the summary statistic taxonomic_index would be included in the score, and the user may even consider down-weighting n_data_points relative to coding_density (e.g., scoring_function = n_data_points^0.5 * coding_density * taxonomic_index).

To appreciate the effect of the scoring_function argument, compare the following quality score plots (see Figure 2.3) using different scoring functions on the same iteration log (obtained from an implementation of densify() using density_mean = "arithmetic" and considering taxonomic diversity). The quality score per iteration varies dramatically depending on the scoring function; accordingly, so do the optima.

visualize(dr_arith_taxtrue, scoring_function = n_data_points * coding_density)
# ℹ use `tibble::as_tibble(dr_arith_taxtrue$data[[1407]])` to obtain the pruned data frame

visualize(dr_arith_taxtrue, scoring_function = n_data_points * coding_density 
          * row_coding_density_min * col_coding_density_min * taxonomic_index)
# ℹ use `tibble::as_tibble(dr_arith_taxtrue$data[[2407]])` to obtain the pruned data frame

visualize(dr_arith_taxtrue, scoring_function = n_data_points * coding_density 
          * taxonomic_index)
# ℹ use `tibble::as_tibble(dr_arith_taxtrue$data[[1626]])` to obtain the pruned data frame

visualize(dr_arith_taxtrue, scoring_function = n_data_points * coding_density 
          * row_coding_density_min * taxonomic_index)
# ℹ use `tibble::as_tibble(dr_arith_taxtrue$data[[2388]])` to obtain the pruned data frame

Figure 2.3: Quality score plots for four different scoring functions on the iteration log obtained from an implementation of densify() using density_mean = "arithmetic", considering taxonomic diversity. Topmost: scoring_function = n_data_points * coding_density. Second from top: scoring_function = n_data_points * coding_density * row_coding_density_min * col_coding_density_min * taxonomic_index. Second from bottom: scoring_function = n_data_points * coding_density * taxonomic_index. Bottommost: scoring_function = n_data_points * coding_density * row_coding_density_min * taxonomic_index.

Optima also change by weighting particular concerns using exponents (see Figure 2.4):

visualize(dr_arith_taxtrue, scoring_function = n_data_points * coding_density^2)
# ℹ use `tibble::as_tibble(dr_arith_taxtrue$data[[2311]])` to obtain the pruned data frame

visualize(dr_arith_taxtrue, scoring_function = n_data_points * coding_density^3)
# ℹ use `tibble::as_tibble(dr_arith_taxtrue$data[[2388]])` to obtain the pruned data frame

visualize(dr_arith_taxtrue, scoring_function = n_data_points^2 * coding_density)
# ℹ use `tibble::as_tibble(dr_arith_taxtrue$data[[976]])` to obtain the pruned data frame

visualize(dr_arith_taxtrue, scoring_function = n_data_points^3 * coding_density)
# ℹ use `tibble::as_tibble(dr_arith_taxtrue$data[[812]])` to obtain the pruned data frame

Figure 2.4: Quality score plots varying the basic trade-off of available data points and coding density in an implementation of densify() using the values density_mean = "arithmetic", considering taxonomic diversity. Topmost: scoring_function = n_data_points * coding_density^2. Second from top: scoring_function = n_data_points * coding_density^3. Second from bottom: scoring_function = n_data_points^2 * coding_density. Bottommost: scoring_function = n_data_points^3 * coding_density.

In what follows we compare some of the densified sub-matrices.

pruned_1 <- prune(dr_arith_taxtrue, scoring_function = n_data_points * coding_density 
                  * taxonomic_index)
pruned_2 <- prune(dr_arith_taxtrue, scoring_function = n_data_points * coding_density 
                  * row_coding_density_min * col_coding_density_min * taxonomic_index)

The densified matrix resulting from scoring_function = n_data_points * coding_density * taxonomic_index encompasses 862 languages in 314 families, coded for 154 features at an overall coding density of about 0.38.

languages_1 <- pruned_1$Glottocode
data_1 <- pruned_1 %>% select(-Glottocode)
# number of languages, families and features, overall coding density for pruned_1
length(languages_1) 
# [1] 862
flat_taxonomy_matrix %>% filter(id %in% languages_1) %>% 
  select(level1) %>% unique() %>% nrow()
# [1] 314
ncol(data_1)
# [1] 154
sum(!is.na(data_1))/(ncol(data_1)*nrow(data_1))
# [1] 0.3815575

Both language and feature coding densities have become substantially higher (see Figure 2.5) compared to the input matrix (see Figure 2.1). Note how the inclusion of taxonomic diversity in both densify() and the scoring_function argument of prune() results in the retention of a number of languages that are not densely coded but contribute strongly to taxonomic diversity.


Figure 2.5: Coding density for the pruned WALS data frame, using the values density_mean = "arithmetic", considering taxonomic diversity, using scoring_function = n_data_points * coding_density * taxonomic_index (left languages, right features)

The densified sub-matrix resulting from scoring_function = n_data_points * coding_density * row_coding_density_min * col_coding_density_min * taxonomic_index encompasses 105 languages in 83 families, coded for 128 features at an overall coding density of about 0.88.

languages_2 <- pruned_2$Glottocode
data_2 <- pruned_2 %>% select(-Glottocode)
# number of languages, families and features, overall coding density for pruned_2
length(languages_2) 
# [1] 105
flat_taxonomy_matrix %>% filter(id %in% languages_2) %>% 
  select(level1) %>% unique() %>% nrow()
# [1] 83
ncol(data_2)
# [1] 128
sum(!is.na(data_2))/(ncol(data_2)*nrow(data_2))
# [1] 0.8825893

Additionally including row_coding_density_min and col_coding_density_min in the scoring_function argument thus results in a much denser matrix with extremely well-coded rows and columns (see Figure 2.6).


Figure 2.6: Coding density for the pruned WALS data frame, using the values density_mean = "arithmetic", considering taxonomic diversity and with scoring_function = n_data_points * coding_density * row_coding_density_min * col_coding_density_min * taxonomic_index (left languages, right features)

3 Comparing outputs using varying parameters

To illustrate in more detail the effects of varying the parameters of densify() and the scoring_function parameter of rank_results(), prune() and visualize(), we apply 10 different scoring_function definitions in prune() to each of the 16 parameter settings specified above for densify(), and log for each of the resulting matrices the overall coding density as well as the numbers of languages, families and features maintained. Each of these values is compared to the corresponding value of the original data frame. The full log file is stored as a separate file (varying_parameters.RData) and can guide users in exploring the effects of varying parameters on this dataset.
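
A log like this can be assembled by looping over the densify_result objects and scoring functions. The following sketch shows the principle for a single combination (the column names mirror the structure of the log file shown below; the actual script that produced varying_parameters.RData is not part of this tutorial):

# one row of the comparison log for one result object and one scoring function
pruned_tmp <- prune(dr_arith_taxtrue, scoring_function = n_data_points * coding_density)
langs_tmp <- pruned_tmp$Glottocode
data_tmp <- pruned_tmp %>% select(-Glottocode)

log_row <- data.frame(
  density_mean = "arithmetic",
  scoring_function = "n_data_points * coding_density",
  full_coding_proportion = mean(!is.na(data_tmp)),
  number_languages = length(langs_tmp),
  number_families = flat_taxonomy_matrix %>% filter(id %in% langs_tmp) %>%
    distinct(level1) %>% nrow(),
  number_features = ncol(data_tmp)
)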

str(logfile)
# 'data.frame': 161 obs. of  15 variables:
#  $ density_mean                     : chr  "original" "arithmetic" "arithmetic" "arithmetic" ...
#  $ min_variability                  : int  NA 3 3 3 3 3 3 3 3 3 ...
#  $ seed                             : int  NA 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
#  $ taxonomy_weight                  : num  NA 0 0 0 0 0 0 0 0 0 ...
#  $ coding_weight                    : num  NA 1 1 1 1 1 1 1 1 1 ...
#  $ iteration_log_id                 : chr  NA "dr_arith_taxfalse" "dr_arith_taxfalse" "dr_arith_taxfalse" ...
#  $ scoring_function                 : chr  NA "n_data_points * coding_density" "n_data_points * coding_density * row_coding_density_min * col_coding_density_min * taxonomic_index" "n_data_points * coding_density * taxonomic_index" ...
#  $ full_coding_proprotion           : num  0.157 0.411 0.937 0.425 0.887 0.887 0.819 0.836 0.836 0.425 ...
#  $ number_languages                 : int  2496 974 159 907 174 174 208 202 202 907 ...
#  $ number_families                  : int  318 207 93 202 98 98 111 110 110 202 ...
#  $ number_features                  : int  192 134 86 134 107 107 119 115 115 134 ...
#  $ increase_factor_coding_proportion: num  1 2.62 5.97 2.71 5.65 ...
#  $ proportion_languages_kept        : num  1 0.39 0.064 0.363 0.07 0.07 0.083 0.081 0.081 0.363 ...
#  $ proportion_families_kept         : num  1 0.651 0.292 0.635 0.308 0.308 0.349 0.346 0.346 0.635 ...
#  $ proportion_features_kept         : num  1 0.698 0.448 0.698 0.557 0.557 0.62 0.599 0.599 0.698 ...

References

Dryer, Matthew S., and Martin Haspelmath, eds. 2013. WALS Online (V2020.3). Data set. Zenodo. https://doi.org/10.5281/zenodo.7385533.
Hammarström, Harald, Robert Forkel, Martin Haspelmath, and Sebastian Bank. 2024. “Glottolog/Glottolog: Glottolog Database 5.0.” Zenodo. https://doi.org/10.5281/zenodo.10804357.