The densify package provides a procedure for generating denser subsets of
input data frames according to varying user-defined parameters. The
densification process is split across two main functions, densify()
and prune(). This tutorial guides users through all functions and
parameters to demonstrate their effects on densification.
The data used in this tutorial are the default data provided in the
package: the language-feature matrix WALS obtained from WALS, the
World Atlas of Language Structures (Dryer and Haspelmath 2013), and the language taxonomy
glottolog_languoids provided by Glottolog, Version 5.0 (Hammarström et al. 2024).
The input for densification is a data frame with rows representing
observations or taxa (e.g., languages) and columns representing
variables (in linguistic typological databases, these are often
referred to as "features", "characters" or "parameters"). The data
frame requires taxa to be expressed in rows, with taxon names specified
in a column (e.g., a column called "Glottocodes"). Columns can denote
meta-data (such as taxon names) or features; feature names must be set
as the column names. Any cells with empty or unwanted (e.g. ?) entries
must be coded as NA.
If the densification process should be sensitive to the taxonomic
structure of the observations, a taxonomy must be provided, in which all
observations in the data frame are attested either as tips or nodes. The
input taxonomy can have two forms: it can be a phylo object (e.g. via
ape::read.nexus) or an adjacency table (i.e., a data frame containing
columns id and parent_id, with each row encoding one parent-child
relationship). To generate a language taxonomy, the data from Glottolog,
accessible as data(glottolog_languoids), can be used directly.
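For readers unfamiliar with the adjacency-table format, the following toy example (with hypothetical ids; the NA parent for the root is illustrative) shows how parent-child relationships are encoded:

# a minimal toy adjacency table: each row encodes one parent-child link;
# "family1" is the root, and "lang_c" attaches via an intermediate subgroup
toy_taxonomy <- data.frame(
  id        = c("family1", "subgroup1", "lang_a",  "lang_b",  "lang_c"),
  parent_id = c(NA,        "family1",   "family1", "family1", "subgroup1")
)
toy_taxonomy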
The taxonomic input is parsed internally by the function densify() to
generate a flattened taxonomic representation, which is then used for
densification if specified. As such, users do not need to convert the
input taxonomy into a flattened format themselves, but we expose the
function as_flat_taxonomy_matrix() here, since it can be useful for
other applications.
# load packages
library(densify)
library(ggplot2)
library(dplyr)

# exclude languages from WALS that are not attributable to any entry in the glottolog taxonomy
WALS <- WALS[which(WALS$Glottocode %in% glottolog_languoids$id), ]
# ensure all missing and non-applicable information is coded as NA
WALS[WALS==""] <- NA
WALS[WALS=="?"] <- NA
WALS[WALS=="NA"] <- NA
# this is the flat glottolog taxonomy generated within the densify function
# with the input glottolog_languoids
flat_taxonomy_matrix <- as_flat_taxonomy_matrix(glottolog_languoids)

The initial (full) dataset provided by WALS encompasses data on 192 linguistic features, coded for 2496 languages (representing 318 families) that are attributable to Glottolog.
languages <- WALS$Glottocode
data <- WALS %>% select(-Glottocode)
# number of languages
length(languages)
# [1] 2496

# number of families
flat_taxonomy_matrix %>% filter(id %in% languages) %>% select(level1) %>% unique() %>% nrow()
# [1] 318

# number of features
ncol(data)
# [1] 192

In this matrix each feature is coded for a distinct set of languages, such that the coding density, which we define as the proportion of non-empty cells, is patchy for both languages and features (see Figure 2.1).
Figure 2.1: Coding density for the WALS data frame (left: languages, right: features)
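The per-language and per-feature coding densities summarized in Figure 2.1 can be recomputed directly from the data; a minimal sketch in base R:

# proportion of coded (non-NA) cells per language (row) and per feature (column)
language_density <- rowMeans(!is.na(data))
feature_density  <- colMeans(!is.na(data))
summary(language_density)
summary(feature_density)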
The overall coding density of the input data frame is 0.157 (roughly 16%).
mean(!is.na(data))
# [1] 0.1568217

densify()

A straightforward method to find a denser sub-matrix within a sparse
super-matrix is to remove all rows and columns with coding densities
below a given threshold. However, removing a row subtly changes all
column densities and vice versa, so a more cautious method is to
iteratively remove rows and columns and to update, after each removal,
the coding densities of all remaining rows and columns. The function
densify() accomplishes this: It takes the original matrix,
iteratively removes rows and columns according to specifiable parameters
(described below), and logs the characteristics of the matrix after each
iteration. The output of the function is a densify_result object,
which is parsed by prune(), rank_results() or visualize() (see
below), to determine the optimal number of iterations according to
user-specified criteria post-hoc.
densify

Based on the sparsely encoded data frame provided by the user,
densify() defines an encoding matrix of the same dimensions,
containing zeroes for NAs and ones elsewhere. Using this encoding
matrix, it computes an importance score for each row and each column
according to user-specifiable parameters, identifies the row and/or column
with the lowest importance score (randomly sampling one row or column if
there are ties) and removes it. After the first row or column is
removed, the function re-computes the importance scores of the resulting
sub-matrix, identifies which row and/or column scores worst in the new
matrix, and in turn removes it. This process is repeated until the
matrix density reaches 1 or until a user-specified limit condition is met. Key
characteristics (e.g., number of data points, number of rows, number of
columns, overall coding density, identity of removed rows and/or
columns, etc.) of all sub-matrices produced in the pruning process are
documented in a densify_result object, the output of the function.
There are several ways in which importance scores for rows and columns can be computed. Our approach distinguishes between what will be referred to as absolute and weighted coding densities. Absolute coding densities denote the (raw) proportion of coded cells in a row or column: if a row is coded for 10% of the columns in the matrix, its absolute coding density is 0.1. Several rows may be coded for the same absolute proportion of columns, but the identity of the columns for which such rows are coded may well be distinct: some rows may be coded for columns which are well-coded, while others may be coded for columns coded only for few rows. Such differences among rows with the same absolute coding density matter when establishing importance scores in the iterative pruning approach adopted here: among sparsely coded rows (or columns, respectively) coded for the same proportion of columns (or rows, respectively), the ones coded for less well-coded columns are removed first (weighted coding density).
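The distinction can be made concrete with a small sketch. Here we assume, for illustration only, that a row's weighted coding density is the mean coding density of the columns for which it is coded; the package's internal computation may differ in detail:

# toy encoding matrix: 1 = coded, 0 = NA; rows 1 and 2 share the same
# absolute density (0.5), but row 2 is coded only for sparse columns
enc <- rbind(
  c(1, 1, 0, 0),
  c(0, 0, 1, 1),
  c(1, 1, 0, 0),
  c(1, 1, 1, 0)
)
abs_row_density <- rowMeans(enc)
col_density     <- colMeans(enc)
# illustrative weighted row density: mean column density over a row's coded cells
weighted_row_density <- apply(enc, 1, function(r) mean(col_density[r == 1]))
abs_row_density      # rows 1 and 2 tie at 0.5
weighted_row_density # row 2 scores lower, so it would be removed first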
Thus, to determine the importance score of a column (feature),
densify() computes the weighted column coding density. To determine
the importance score of a row (language), one of three possible types
of mean (described in detail below) is used, taking into account
(1) the absolute row coding densities, (2) the weighted row coding
densities and – if specified – (3) the contribution of a specific row
to the taxonomic diversity of the sample.
Row and column importance scores are treated differently for two reasons. First, only rows can have taxonomic structure. Second, in most linguistic typological research applications, the focus of densification lies chiefly on pruning away rows (languages/taxa/observations) rather than columns, because the former are usually substantially more numerous; including absolute coding densities in the row importance score primarily has the effect of penalizing scarcely coded rows in the matrix more strongly.
Removing rows can have effects not only on coding densities of columns,
but also on their informativeness: It is possible for the iterative
removal of rows to slowly deplete a feature state, such that a feature
(column) no longer displays sufficient or any variability (i.e., a
column may still be coded for a substantial number of rows but all
remaining coded rows might have the value FALSE, because all rows with
the value TRUE have been pruned away). Thus, after each row removal,
densify() assesses whether any features have become uninformative
according to a user-defined threshold denoting how many observations
must be present in the second-largest feature state, and removes any
such column(s) alongside the row in the same iteration.
Conversely, it is possible that the removal of a specific column results in a row no longer exhibiting any data points (because it was only coded for the column that was removed). Thus, after each column removal the algorithm assesses whether any rows have become empty and removes any such row(s) alongside the column in the same iteration.
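The two post-removal checks can be sketched with hypothetical helper functions (these are illustrations, not the package internals):

feature_is_informative <- function(x, min_variability = 1) {
  counts <- sort(table(x), decreasing = TRUE)  # table() drops NAs
  length(counts) >= 2 && counts[2] >= min_variability
}
row_is_empty <- function(r) all(is.na(r))

feature_is_informative(c("a", "a", "a", NA))                        # FALSE: only one state
feature_is_informative(c("a", "a", "b"), min_variability = 2)       # FALSE
feature_is_informative(c("a", "a", "b", "b"), min_variability = 2)  # TRUE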
Parameters of densify()

The function densify() requires the user to set a number of
parameters, most of which modulate the way the row importance scores are
computed at each iteration. This section describes and discusses all
parameters available to fine-tune the densification process.
densify() takes the following arguments:
data: The data frame to be pruned, in the format described above.
cols: A specification of which feature columns to densify. The
default is to include all columns for densification.
taxonomy: A taxonomic tree encompassing all taxa represented in
data as tips or nodes. It can be a phylo object or an adjacency
table (a data frame containing columns id and parent_id, with
each row encoding one parent-child relationship).
taxon_id: The name of the column identifying taxa. If this column
is not specified, densify() will attempt to guess which column
denotes taxa based on column contents.
density_mean: This parameter specifies how the measure denoting row
importance is computed. It can be one of the following three
mean-types: "arithmetic", "geometric", "log_odds". See below
for more details. The default value is "log_odds", which is based
on the mean of logit-transformed proportions.
min_variability: An integer specifying how many observations/taxa
the second-most-frequent feature state of any feature must contain
for the feature to be maintained in the matrix. The default is set to
1, guaranteeing that a feature must display at least minimal
variability: at least one observation must be different from all
others. Value NA disables pruning according to this principle.
limits: A named list specifying the limit conditions for the
densification process to end. Available limit conditions are
min_coding_density (a number between 0 and 1 specifying the target
coding density), min_prop_rows (a number between 0 and 1
specifying the minimal proportion of rows that have to be retained
in the data), and min_prop_cols (a number between 0 and 1
specifying the minimal proportion of feature columns that have to
be retained in the data). More than one condition can be specified
simultaneously. densify() will stop if any condition is reached.
The default limit is min_coding_density = 1: densification
ends once the matrix is fully dense, i.e. there are no NAs left.
density_mean_weights: A named list specifying weighting factors applied
during importance score computation. Available parameters are
coding and taxonomy, applied to coding density and taxonomic
diversity values respectively and thus tweaking the relative
importance between coding density and taxonomic diversity in the
pruning process. Both values must be numbers \(\in [0,1]\). Setting a
parameter to 0 (or NA, or NULL) disables the contribution of
the respective values to the computed row importance score and hence
the pruning process. The default value is
density_mean_weights = list(coding = 1, taxonomy = 1).
While it is possible to run densify() with several parameter settings
on the same input data frame, we strongly recommend motivating parameter
settings with the desired characteristics of the output data frame
(e.g., how strongly should taxonomic diversity be weighted in relation
to coding density of rows?) in mind.
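None of the example runs below make use of limits. As an illustration, the following sketch (using the limit names documented above) would stop densification once half of the cells are coded or once fewer than a quarter of the rows remain, whichever happens first:

dr_limited <-
  densify(data = WALS,
          taxonomy = glottolog_languoids,
          taxon_id = "Glottocode",
          density_mean = "log_odds",
          min_variability = 3,
          limits = list(min_coding_density = 0.5, min_prop_rows = 0.25))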
cols and taxon_id

Users can specify the names of the columns which represent features that
should undergo densification using the cols parameter and the name of
the column denoting taxon names using the taxon_id parameter.
The cols argument is useful if certain columns should not be
densified, e.g. because they encode language meta-data. If no cols are
specified, densify() will assume that all but the taxon_id column
should be included.
Note that if no taxon_id is specified, densify() will try to make an
educated guess as to which column represents taxon names based on the
data.
taxonomy

A taxonomy must be provided under the parameter taxonomy if taxonomic
diversity should be considered for densification or optimum sub-matrix
identification. It can be provided as a phylo object or as an
adjacency table (i.e., a data frame containing columns id and
parent_id, with each row encoding one parent-child relationship, like
glottolog_languoids).
density_mean and density_mean_weights

densify() offers three methods of computing the mean of row-wise
absolute and weighted coding densities as well as taxonomic diversity of
the sample where applicable: Possible values are "arithmetic",
"geometric", or "log_odds". The arithmetic and geometric means
operate as conventionally understood: the arithmetic mean utilizes the
base::mean() function, while the geometric mean involves the n-th root
of the product. The “log odds mean” (henceforth: LOM) is computed by
taking the arithmetic mean of the logit transform of all inputs, and
then transforming back. In the computation of the LOM, a clamping step
prior to logit-value computation ensures that all values remain finite.
For all scoring settings, the respective means of (1) the absolute, (2)
the weighted coding densities and - if applicable - (3) the contribution
to taxonomic diversity of the sample are computed for each row.
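The three mean types can be sketched in a few lines of base R; the clamping bound eps used here is illustrative, and the package's internal value may differ:

arith_mean <- function(p) mean(p)
geom_mean  <- function(p) prod(p)^(1 / length(p))
lom_mean   <- function(p, eps = 1e-6) {
  p <- pmin(pmax(p, eps), 1 - eps)  # clamp so logit values stay finite
  plogis(mean(qlogis(p)))           # mean of logits, transformed back
}

p <- c(0.5, 0.05)  # one moderate and one extremely low density
c(arithmetic = arith_mean(p), geometric = geom_mean(p), log_odds = lom_mean(p))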
To illustrate the different effects of the three scoring types
available, the top panel of Figure 2.2
traces the mean of a constant 0.5 and a coding density
x between 0 and 1, with mean types "arithmetic", "geometric" and
"log_odds". Note how the log odds mean strongly skews towards 0 and 1
once x approaches the extremes: once one of the densities is
particularly low or high, this will dominate the mean. The effect of
this is that if coding weights or contribution to taxonomic diversity
are extremely high (close or equal to 1), a row is unlikely to be
removed even if its other density values are below average.
Conversely, if coding weights or contribution to taxonomic diversity are
extremely low (approaching 0), a row is very likely to be removed (even
if its other values are above average). By contrast, the geometric mean shows
this skewing only when the density approaches 0, emphasizing the
negative effect of extremely low coding densities or contributions to
taxonomic diversity only. If users intend to strongly weight extremely
low or extremely high densities regarding either contribution to
taxonomic diversity or codedness, they should thus favor the LOM for
densify(). The geometric mean should be chosen if extremely high
density or contribution to taxonomic diversity should not outweigh
extremely low densities. In other words, the geometric mean is useful if
the focus lies more on removing rows with extremely low densities than
in maintaining rows with extremely high densities.
densify() also provides users with the option of modulating the
weights of the components of which the "arithmetic", "geometric", or
"log_odds" mean score is computed. This is achieved by a separate
parameter density_mean_weights, where weights can be specified for coding
and taxonomy. These are multipliers \(\in [0,1]\), defaulting to 1.
The coding weight specifies the value by which the absolute and the
weighted coding densities are multiplied prior to importance score
computation, and the taxonomy weight specifies the value by which the
taxonomic diversity score is multiplied. If set to different values,
these weight factors thus allow users to bias the weight of taxonomic
diversity relative to the weight of coding density in the computation of
the row scores and ultimately in the iterative pruning process. To show
the effect of this, the middle and bottom panels of Figure
2.2 illustrate how different weight factors (here:
0.5, 0.75, 0.9, 0.95, 0.99 and 0.999) modulate the behavior of the
geometric mean and the LOM, respectively, of a constant 0.5 and all
possible values of x.
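Using the lom_mean() sketch from above, the effect of a weight below 1 can be imitated by multiplying a density before the mean is taken (a simplified illustration of the behavior traced in Figure 2.2):

lom_mean(c(0.5, 0.99))        # the extreme density 0.99 dominates the mean
lom_mean(c(0.5, 0.99 * 0.9))  # a weight of 0.9 damps the extreme value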
By setting the taxonomy weight to 0, NA or NULL, the pruning
process ignores taxonomic structure. If a taxonomy is provided in the
data, a taxonomic diversity score (the Shannon entropy of the highest
taxonomic level in the taxonomy) will nonetheless be computed and logged
for each sub-matrix in the densify_result output produced by
densify(). This way, ignoring taxonomic diversity in the pruning
process does not preclude taxonomic diversity from being considered for
identifying the optimal number of iterations using rank_results(),
prune() or visualize() later (see below). (In principle, the
coding weight can also be disabled. However, this undermines the
purpose of densification.)
Figure 2.2: Illustrating different mean types (top), different coding weights for the geometric mean (middle), and different coding weights for the LOM (bottom).
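The logged taxonomic_index is described above as the Shannon entropy of the highest taxonomic level. For the full WALS sample it can be sketched from the flat taxonomy built earlier (the package's exact computation may differ in detail):

# Shannon entropy of family (level1) membership across sampled languages
fams <- flat_taxonomy_matrix %>% filter(id %in% languages) %>% pull(level1)
p_fam <- table(fams) / length(fams)
-sum(p_fam * log(p_fam))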
In summary, the parameters density_mean and density_mean_weights modulate the
pruning process and can (and should) be adjusted to meet the needs of
the user. It may be necessary to experiment with different values to
find a good balance of parameters.
min_variability

While it is conceivable that one might want to include features in a
dataset displaying no variability whatsoever among the coded
taxa/observations (min_variability = NA), such features are
uninteresting for most statistical purposes.
The parameter min_variability specifies how many taxa the
second-most-frequent feature state of any feature must contain for the
feature to be maintained in the matrix. The default is 1, i.e. there
must be at least one language that has a state different from all
others. This will remove features with only one state – whether they
are present in the input matrix or whether they result from
removing rows. For certain comparative analyses, it may be reasonable to
set the threshold to an integer \(n > 1\). This ensures that each
feature contains at least two values and that the second-most-frequent
value is present in at least \(n\) taxa/observations.
Consider the following example:
example <- data.frame(feature.A=c("present","absent","absent","absent","absent","absent","absent"),
feature.B=c("present","present","absent","absent","absent","absent","absent"),
feature.C=c("type1","type1","type1","type1","type1","type2","type3"),
feature.D=c("type1","type1","type1","type1","type2","type2","type3"),
feature.E=c("type1","type1","type1","type1","type2","type3","type4"),
feature.F=c("type1","type1","type1","type1","type1","type1","type1"))
table(example$feature.A)
#
# absent present
# 6 1

table(example$feature.B)
#
# absent present
# 5 2

table(example$feature.C)
#
# type1 type2 type3
# 5 1 1

table(example$feature.D)
#
# type1 type2 type3
# 4 2 1

table(example$feature.E)
#
# type1 type2 type3 type4
# 4 1 1 1

table(example$feature.F)
#
# type1
# 7

In this data frame, features A and B each have two states, absent
and present, but feature B exhibits more variation than feature
A: The rarer state present counts two observations, whereas it
counts only one observation in feature A. Features C and D each
have three states, type1, type2 and type3, but feature D
exhibits more variation than feature C: The second-most-frequent
state type2 counts two observations instead of just one (feature
C).
Different values for min_variability on this data have the following
effect:
min_variability = NA: All features are retained, including
feature F, which displays no variability among observations.
min_variability = 1: Features A, B, C, D and E are
retained, because each of them contains at least one observation in
the second-most-frequent state. Feature F is removed, because it
only includes one state.
min_variability = 2: Features B and D are retained, because
both of them contain at least two observations in the
second-most-frequent state. Features A, C, E and F are
removed because they either have only one observation in their
second-most-frequent state or because they are invariant.
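Applying the hypothetical feature_is_informative() helper sketched earlier reproduces these outcomes:

sapply(example, feature_is_informative, min_variability = 2)
# feature.A feature.B feature.C feature.D feature.E feature.F
#     FALSE      TRUE     FALSE      TRUE     FALSE     FALSE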
The following 16 implementations of densify() take WALS as an input
data frame with min_variability = 3 and vary the following parameters
discussed above: density_mean and density_mean_weights. Their outputs as well
as their effect on the densification process will be illustrated below.
# arithmetic mean, including and excluding taxonomic diversity
set.seed(2023)
dr_arith_taxtrue <-
densify(data = WALS,
taxonomy = glottolog_languoids,
taxon_id = "Glottocode",
density_mean = "arithmetic",
min_variability = 3,
density_mean_weights = list(coding = 1, taxonomy = 1))
dr_arith_taxfalse <-
densify(data = WALS,
taxonomy = glottolog_languoids,
taxon_id = "Glottocode",
density_mean = "arithmetic",
min_variability = 3,
density_mean_weights = list(coding = 1, taxonomy = 0))
# geometric mean, including and excluding taxonomic diversity
dr_geom_taxtrue <-
densify(data = WALS,
taxonomy = glottolog_languoids,
taxon_id = "Glottocode",
density_mean = "geometric",
min_variability = 3,
density_mean_weights = list(coding = 1, taxonomy = 1))
dr_geom_taxfalse <-
densify(data = WALS,
taxonomy = glottolog_languoids,
taxon_id = "Glottocode",
density_mean = "geometric",
min_variability = 3,
density_mean_weights = list(coding = 1, taxonomy = 0))
# log_odds mean, including and excluding taxonomic diversity
# varying weights when taxonomic diversity included
dr_lom_taxtrue_11 <-
densify(data = WALS,
taxonomy = glottolog_languoids,
taxon_id = "Glottocode",
density_mean = "log_odds",
min_variability = 3,
density_mean_weights = list(coding = 1, taxonomy = 1))
dr_lom_taxtrue_099099 <-
densify(data = WALS,
taxonomy = glottolog_languoids,
taxon_id = "Glottocode",
density_mean = "log_odds",
min_variability = 3,
density_mean_weights = list(coding = 0.99, taxonomy = 0.99))
dr_lom_taxtrue_095095 <-
densify(data = WALS,
taxonomy = glottolog_languoids,
taxon_id = "Glottocode",
density_mean = "log_odds",
min_variability = 3,
density_mean_weights = list(coding = 0.95, taxonomy = 0.95))
dr_lom_taxtrue_090090 <-
densify(data = WALS,
taxonomy = glottolog_languoids,
taxon_id = "Glottocode",
density_mean = "log_odds",
min_variability = 3,
density_mean_weights = list(coding = 0.90, taxonomy = 0.90))
dr_lom_taxtrue_050050 <-
densify(data = WALS,
taxonomy = glottolog_languoids,
taxon_id = "Glottocode",
density_mean = "log_odds",
min_variability = 3,
density_mean_weights = list(coding = 0.50, taxonomy = 0.50))
dr_lom_taxtrue_1095 <-
densify(data = WALS,
taxonomy = glottolog_languoids,
taxon_id = "Glottocode",
density_mean = "log_odds",
min_variability = 3,
density_mean_weights = list(coding = 1, taxonomy = 0.95))
dr_lom_taxtrue_0951 <-
densify(data = WALS,
taxonomy = glottolog_languoids,
taxon_id = "Glottocode",
density_mean = "log_odds",
min_variability = 3,
density_mean_weights = list(coding = 0.95, taxonomy = 1))
dr_lom_taxtrue_1090 <-
densify(data = WALS,
taxonomy = glottolog_languoids,
taxon_id = "Glottocode",
density_mean = "log_odds",
min_variability = 3,
density_mean_weights = list(coding = 1, taxonomy = 0.90))
dr_lom_taxtrue_0901 <-
densify(data = WALS,
taxonomy = glottolog_languoids,
taxon_id = "Glottocode",
density_mean = "log_odds",
min_variability = 3,
density_mean_weights = list(coding = 0.90, taxonomy = 1))
dr_lom_taxfalse_1 <-
densify(data = WALS,
taxonomy = glottolog_languoids,
taxon_id = "Glottocode",
density_mean = "log_odds",
min_variability = 3,
density_mean_weights = list(coding = 1, taxonomy = 0))
dr_lom_taxfalse_095 <-
densify(data = WALS,
taxonomy = glottolog_languoids,
taxon_id = "Glottocode",
density_mean = "log_odds",
min_variability = 3,
density_mean_weights = list(coding = 0.95, taxonomy = 0))
dr_lom_taxfalse_075 <-
densify(data = WALS,
taxonomy = glottolog_languoids,
taxon_id = "Glottocode",
density_mean = "log_odds",
min_variability = 3,
density_mean_weights = list(coding = 0.75, taxonomy = 0))

densify_result, the outputs of densify()

Outputs of the densify() function are objects of type
densify_result, which log characteristics of all sub-matrices
generated throughout densification. The columns of the tibble log the
following for each sub-matrix (row):
.step: the iteration step in the densification

n_data_points: the total number of non-missing data points

n_rows: the number of taxa

n_cols: the number of columns (excluding extra columns, such as taxon_id or meta-data)

coding_density: the overall proportion of non-missing data points

row_coding_density_min: the coding density of the row with the lowest coding density

row_coding_density_median: the median row coding density

row_coding_density_max: the coding density of the row with the highest coding density

col_coding_density_min: the coding density of the column with the lowest coding density

col_coding_density_median: the median column coding density

col_coding_density_max: the coding density of the column with the highest coding density

taxonomic_index: the Shannon-Wiener diversity index for the highest taxonomic level (i.e., family (or isolate) membership) of languages

data: a promise object evaluating to the pruned dataset. To retrieve the corresponding object, run as.data.frame(x), e.g. as.data.frame(dr_arith_taxtrue$data[4]) for the fourth sub-matrix.

changes: a tibble listing all pruned taxa and/or features in the iteration, including the reason for pruning.

# overview of densify_result object
head(dr_arith_taxtrue)
# # Densify result with 6 pruning steps
# # Highest achieved coding density is 17%
# .step n_data_points n_rows n_cols coding_density row_coding_density_min
# <int> <dbl> <int> <int> <dbl> <dbl>
# 1 1 75154 2496 192 0.157 0.00521
# 2 2 75154 2496 192 0.157 0.00521
# 3 3 75075 2494 186 0.162 0.00538
# 4 4 75037 2457 185 0.165 0.00541
# 5 5 75036 2456 185 0.165 0.00541
# 6 6 75035 2455 185 0.165 0.00541
# # ℹ 8 more variables: row_coding_density_median <dbl>,
# # row_coding_density_max <dbl>, col_coding_density_min <dbl>,
# # col_coding_density_median <dbl>, col_coding_density_max <dbl>,
# # taxonomic_index <dbl>, data <list>, changes <list>
# #
# # Use `plot()` or `visualize()` to visualize density statistics and pick the
# best result
# # Use `rank_results()` or `prune()` to pick the best densify result based on
# your criteria
# # Use `eval()` to manually retrieve any data frame from column `data`
# retrieving a specific sub-matrix
submatrix <- as.data.frame(dr_arith_taxtrue$data[4])

rank_results(), prune() or visualize()

The output from densify() allows users to retrieve all
sub-matrices produced in the iterative pruning process and to compare
them based on the summary statistics available. Depending on what
criteria are relevant to a specific densification, different summary
statistics are relevant for comparison and hence optimum identification.
Naturally, the sub-matrices generated after more iterations are denser than those after only few pruning steps. However, they also contain less data. There is thus a trade-off between the proportion of coded data and the number of available data points that the users can exploit to determine the optimal number of iterations. If taxonomic diversity is relevant to users, it provides an additional criterion that can modulate which number of iterations is optimal: If the input is highly unevenly distributed taxonomically and taxonomic structure is considered for pruning, the taxonomic diversity of the taxa provided in rows will tend to increase throughout the initial pruning phase until it levels off and subsequently decreases. Other users may want to ensure their sub-matrix does not contain (any) rows or columns with very low coding densities.
The functions rank_results(), prune() and visualize() thus take
the output from densify() as an input and compare the sub-matrices
resulting from each iteration by computing a matrix quality score
via an intuitively worded formula consisting of one or more summary statistics
from the densify_result object. The formula is specified under the parameter
scoring_function (as a standard R expression) and biases the sub-matrix selection in the way the trade-offs
are meant to play out. The default formula used is
scoring_function = n_data_points * coding_density, which maximizes the product of
available non-missing data points and the overall matrix coding density.
Using the function rank_results(), users can rank all sub-matrices
with respect to the scoring function they specify (returning a vector indicating
the rank of each sub-matrix), while prune() identifies and retrieves
the sub-matrix for which the scoring function is maximized (selecting the earlier
iteration in the unlikely case of ties). The function visualize()
plots the quality score for all sub-matrices and also indicates the
optimum.
# rank results
head(rank_results(dr_arith_taxtrue, scoring_function = n_data_points * coding_density))
# [1] 2383 2383 2375 2369 2367 2366
# prune matrix
pruned <- prune(dr_arith_taxtrue, scoring_function = n_data_points * coding_density)
dim(pruned)
# [1] 1081 155
# visualize quality score
visualize(dr_arith_taxtrue, scoring_function = n_data_points * coding_density)
# ℹ use `tibble::as_tibble(dr_arith_taxtrue$data[[1407]])` to obtain the pruned data frame
Figure: Quality score plot for scoring_function = n_data_points * coding_density on the densify result object obtained from an implementation of densify() using the values density_mean = "arithmetic", considering taxonomic diversity.
Users can run rank_results(), prune() or visualize() multiple
times on the same densification output using different scoring functions to fine-tune the
settings according to their needs. We recommend choosing the scoring
function with the characteristics of the input data frame as well as
the desired characteristics of the output data frame in mind.
For instance, if coding sparsity of the input matrix mainly manifests
itself in rows but not columns, and an output matrix with low row coding
density is undesirable, this should be expressed by including the
summary statistic row_coding_density_min (e.g.,
scoring_function = n_data_points * coding_density * row_coding_density_min). If
this concern is strong, the importance of row_coding_density_min can
be increased by an exponent (e.g.,
scoring_function = n_data_points * coding_density * row_coding_density_min^2).
Other users may have an extremely sparse matrix and merely seek to
strongly densify it, irrespective of taxonomic structure in rows. They
would seek to include both row_coding_density_min and
col_coding_density_min, possibly even up-weighting coding_density
relative to n_data_points (e.g.,
scoring_function = n_data_points * coding_density^2 * row_coding_density_min * col_coding_density_min).
In another example, coding densities may be high and roughly even across
rows and columns, and the motivation for pruning resides mainly in
selecting a taxonomically diverse sub-sample for an analysis. In such a
case, the summary statistic taxonomic_index would be included in the
score, and the user may even consider down-weighting the parameter
n_data_points relative to coding_density (e.g.,
scoring_function = n_data_points^0.5 * coding_density * taxonomic_index).
To appreciate the effect of the scoring_function argument, compare the
following quality score plots (see Figure 2.3) using different scoring
functions on the same iteration log (obtained from an implementation of
densify() using density_mean = "arithmetic" and
density_mean_weights = list(coding = 1, taxonomy = 1)). The quality
score per iteration varies dramatically depending on the scoring
function, and accordingly so do the optima.
visualize(dr_arith_taxtrue, scoring_function = n_data_points * coding_density)
# ℹ use `tibble::as_tibble(dr_arith_taxtrue$data[[1407]])` to obtain the pruned data frame
visualize(dr_arith_taxtrue, scoring_function = n_data_points * coding_density
* row_coding_density_min * col_coding_density_min * taxonomic_index)
# ℹ use `tibble::as_tibble(dr_arith_taxtrue$data[[2407]])` to obtain the pruned data frame
visualize(dr_arith_taxtrue, scoring_function = n_data_points * coding_density
* taxonomic_index)
# ℹ use `tibble::as_tibble(dr_arith_taxtrue$data[[1626]])` to obtain the pruned data frame
visualize(dr_arith_taxtrue, scoring_function = n_data_points * coding_density
* row_coding_density_min * taxonomic_index)
# ℹ use `tibble::as_tibble(dr_arith_taxtrue$data[[2388]])` to obtain the pruned data frame
Figure 2.3: Quality score plots for four different scoring functions on the iteration log obtained from an implementation of densify() using density_mean = "arithmetic" and considering taxonomic diversity. Topmost: scoring_function = n_data_points * coding_density. Second from top: scoring_function = n_data_points * coding_density * row_coding_density_min * col_coding_density_min * taxonomic_index. Second from bottom: scoring_function = n_data_points * coding_density * taxonomic_index. Bottommost: scoring_function = n_data_points * coding_density * row_coding_density_min * taxonomic_index.
Optima also change by weighting particular concerns using exponents (see Figure 2.4):
visualize(dr_arith_taxtrue, scoring_function = n_data_points * coding_density^2)
# ℹ use `tibble::as_tibble(dr_arith_taxtrue$data[[2311]])` to obtain the pruned data frame
visualize(dr_arith_taxtrue, scoring_function = n_data_points * coding_density^3)
# ℹ use `tibble::as_tibble(dr_arith_taxtrue$data[[2388]])` to obtain the pruned data frame
visualize(dr_arith_taxtrue, scoring_function = n_data_points^2 * coding_density)
# ℹ use `tibble::as_tibble(dr_arith_taxtrue$data[[976]])` to obtain the pruned data frame
visualize(dr_arith_taxtrue, scoring_function = n_data_points^3 * coding_density)
# ℹ use `tibble::as_tibble(dr_arith_taxtrue$data[[812]])` to obtain the pruned data frame
Figure 2.4: Quality score plots varying the basic trade-off of available data points and coding density in an implementation of densify() using the values density_mean = "arithmetic", considering taxonomic diversity. Topmost: scoring_function = n_data_points * coding_density^2. Second from top: scoring_function = n_data_points * coding_density^3. Second from bottom: scoring_function = n_data_points^2 * coding_density. Bottommost: scoring_function = n_data_points^3 * coding_density.
In what follows we compare some of the densified sub-matrices.
pruned_1 <- prune(dr_arith_taxtrue, scoring_function = n_data_points * coding_density
* taxonomic_index)
pruned_2 <- prune(dr_arith_taxtrue, scoring_function = n_data_points * coding_density
* row_coding_density_min * col_coding_density_min * taxonomic_index)

The densified matrix resulting from defining
scoring_function = n_data_points * coding_density * taxonomic_index
encompasses 862 languages in 314 families coded for 154 features at an
overall coding density of 0.38.
languages_1 <- pruned_1$Glottocode
data_1 <- pruned_1 %>% select(-Glottocode)
# number of languages, families and features, overall coding density for pruned_1
length(languages_1)
# [1] 862

flat_taxonomy_matrix %>% filter(id %in% languages_1) %>%
  select(level1) %>% unique() %>% nrow()
# [1] 314

ncol(data_1)
# [1] 154

sum(!is.na(data_1))/(ncol(data_1)*nrow(data_1))
# [1] 0.3815575

Note that both language and feature coding densities have become
substantially higher (see Figure 2.5) compared to the input matrix (see
Figure 2.1), and that the inclusion of taxonomic diversity in both
densify() and the scoring_function argument of prune() results in
retaining a number of languages that are not densely coded but
contribute strongly to taxonomic diversity.
Figure 2.5: Coding density for the pruned WALS data frame, using the values density_mean = "arithmetic", considering taxonomic diversity, using scoring_function = n_data_points * coding_density * taxonomic_index (left: languages, right: features)
The densified sub-matrix resulting from defining
scoring_function = n_data_points * coding_density * row_coding_density_min * col_coding_density_min * taxonomic_index
encompasses 105 languages in 83 families coded for 128 features at an
overall coding density of 0.88.
languages_2 <- pruned_2$Glottocode
data_2 <- pruned_2 %>% select(-Glottocode)
# number of languages, families and features, overall coding density for pruned_2
length(languages_2)
# [1] 105

flat_taxonomy_matrix %>% filter(id %in% languages_2) %>%
  select(level1) %>% unique() %>% nrow()
# [1] 83

ncol(data_2)
# [1] 128

sum(!is.na(data_2))/(ncol(data_2)*nrow(data_2))
# [1] 0.8825893

Additionally including row_coding_density_min and
col_coding_density_min in the scoring_function argument thus results in a much
denser matrix with extremely well-coded rows and columns (see Figure
2.6).
Figure 2.6: Coding density for the pruned WALS data frame, using the values density_mean = "arithmetic", considering taxonomic diversity and with scoring_function = n_data_points * coding_density * row_coding_density_min * col_coding_density_min * taxonomic_index (left: languages, right: features)
To illustrate in more detail the effects of varying parameters in
densify() and the scoring_function parameter in rank_results(), prune() and
visualize(), we set 10 different scoring_function definitions in prune() for
each of the 16 parameter settings specified above for densify() and
log for each of the resulting matrices the overall coding density of the
matrix as well as the number of languages, the number of families and
the number of features maintained. Each of these values is compared to
the values of the original data frame. The full log-file is stored as a
separate file (varying_parameters.RData) and can guide users to explore
the effects of varying parameters on this dataset.
str(logfile)
# 'data.frame': 161 obs. of 15 variables:
# $ density_mean : chr "original" "arithmetic" "arithmetic" "arithmetic" ...
# $ min_variability : int NA 3 3 3 3 3 3 3 3 3 ...
# $ seed : int NA 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
# $ taxonomy_weight : num NA 0 0 0 0 0 0 0 0 0 ...
# $ coding_weight : num NA 1 1 1 1 1 1 1 1 1 ...
# $ iteration_log_id : chr NA "dr_arith_taxfalse" "dr_arith_taxfalse" "dr_arith_taxfalse" ...
# $ scoring_function : chr NA "n_data_points * coding_density" "n_data_points * coding_density * row_coding_density_min * col_coding_density_min * taxonomic_index" "n_data_points * coding_density * taxonomic_index" ...
# $ full_coding_proprotion : num 0.157 0.411 0.937 0.425 0.887 0.887 0.819 0.836 0.836 0.425 ...
# $ number_languages : int 2496 974 159 907 174 174 208 202 202 907 ...
# $ number_families : int 318 207 93 202 98 98 111 110 110 202 ...
# $ number_features : int 192 134 86 134 107 107 119 115 115 134 ...
# $ increase_factor_coding_proportion: num 1 2.62 5.97 2.71 5.65 ...
# $ proportion_languages_kept : num 1 0.39 0.064 0.363 0.07 0.07 0.083 0.081 0.081 0.363 ...
# $ proportion_families_kept : num 1 0.651 0.292 0.635 0.308 0.308 0.349 0.346 0.346 0.635 ...
# $ proportion_features_kept : num 1 0.698 0.448 0.698 0.557 0.557 0.62 0.599 0.599 0.698 ...
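Once loaded, the log can be queried like any data frame. For example, the following sketch (using the columns shown above) lists the settings that at least triple the coding proportion while retaining 60% or more of the families:

# which settings triple the coding proportion and keep >= 60% of families?
logfile %>%
  filter(increase_factor_coding_proportion >= 3,
         proportion_families_kept >= 0.6) %>%
  select(density_mean, scoring_function, number_languages, number_families)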