Latte: Cross-framework Python Package for Evaluation of Latent-Based Generative Models

Latte (for LATent Tensor Evaluation) is a Python library for the evaluation of latent-based generative models in the fields of disentanglement learning and controllable generation. Latte is compatible with both PyTorch and TensorFlow/Keras, and provides both functional and modular APIs that can be easily extended to support other deep learning frameworks. Using a NumPy-based, framework-agnostic implementation, Latte ensures reproducible, consistent, and deterministic metric calculations regardless of the deep learning framework of choice.


Introduction
Disentanglement learning and controllable generation are fast-growing fields within deep learning research, powered by advances in deep generative networks such as variational autoencoders (VAEs) [1] and generative adversarial networks (GANs) [2]. Disentanglement learning is often used with encoder-decoder architectures to produce latent representations, in the form of latent vectors or tensors in the bottleneck layer, such that each latent dimension has an approximately exclusive mapping to a semantic attribute of interest. These disentangled latent representations are particularly useful in generative models that aim to produce samples with specific and controllable semantic attributes [3,4].
With the growth of these fields comes the need for a reliable and consistent method of evaluation that allows for the comparison of different systems across a variety of metrics. We therefore introduce Latte (for LATent Tensor Evaluation), a cross-framework Python package for the evaluation of latent-based generative models. Since successful latent-based controllable generation requires more than disentanglement [5], Latte also covers interpolatability metrics in addition to disentanglement metrics.
Framework-agnostic evaluation tools are known to greatly facilitate and accelerate research development in a field. For example, in audio source separation, bss_eval [6] and its successor museval [7] have greatly benefited the field and provided a standard benchmarking tool. Many studies on disentanglement learning have formally or informally relied on the disentanglement_lib library for their evaluation. However, the library has not had a new release since 2019. Moreover, disentanglement_lib was mainly created as a code base for reproducing the studies in [8-13], so the metric implementations were written to fit the development code rather than to cater to a wider range of applications, and are only available in TensorFlow. As a result, researchers working with PyTorch or other incompatible models often have to rely on their own re-implementations of the metrics, an error-prone and inefficient approach that in the best case produces additional work, and in the worst case leads to inconsistencies in evaluation metric implementations. In addition, to the best of our knowledge, no comprehensive library for the evaluation of generative interpolatability currently exists.
The introduction of Latte aims to address these shortcomings. By design, Latte performs all metric calculations with NumPy-based computation to ensure cross-framework consistency. The modular design used in Latte also ensures easy extensibility for supporting other frameworks beyond PyTorch and TensorFlow in the future.

Software Description
Latte is a cross-framework Python package for the evaluation of disentanglement and controllability in latent-based generative models. Latte supports on-the-fly metric calculation for disentanglement learning and controllable generation using both a standalone functional API and modular APIs for the two major deep learning frameworks: TensorFlow/Keras [14] and PyTorch [15].
In order to maximize cross-framework compatibility and reproducibility, the core functionalities of Latte are developed with NumPy [16] and Scikit-learn [17], without any deep learning dependencies. These NumPy-based functionalities also serve as a standalone functional API, allowing the use of Latte in post-hoc analyses without the need for specific deep learning dependencies like TensorFlow or PyTorch.
For use with deep learning frameworks, we implemented a modular API for the metrics to allow easy usage within the respective framework. For TensorFlow/Keras, we implemented wrappers based on the Keras Metric API to convert the core NumPy-based functionalities into TensorFlow-compatible operations. For PyTorch, we implemented similar wrappers based on the TorchMetrics API [18], which allows easy integration with both PyTorch and the popular PyTorch Lightning [19] frameworks. Latte modular metrics can be used in distributed training via the respective built-in multi-node support in Keras and TorchMetrics. An example of using Latte in modular mode with PyTorch is shown in Figure 1. In this example, the data contains three continuous semantic attributes, each regularized by the latent dimension specified via the reg_dim argument. The discrete=False option specifies that the semantic attributes are continuous-valued.

Deterministic metric calculation
A number of metrics used in disentanglement learning and controllable generation are based on randomly initialized regressors and classifiers [20]. Moreover, practical calculation of probabilistic measures via Scikit-learn, such as mutual information and entropy, also requires random number generation. Since disentanglement_lib does not explicitly set a seed before metric calculation, identical inputs may result in different metric values, a detail that end-users unfamiliar with the implementation may easily miss. In Latte, a random seed of 42 is set by default, but seeding can be switched off by calling latte.seed(None). This gives end-users deterministic metric calculation by default, without requiring knowledge of the implementation details of Latte and its dependencies.
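The effect of seeding can be seen directly in Scikit-learn, whose nearest-neighbour mutual information estimator adds small random noise to the data. The following minimal sketch (using Scikit-learn directly, not Latte's internals) shows how pinning the random state makes the estimate reproducible:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
z = rng.normal(size=(1000, 1))               # a single latent dimension
a = z[:, 0] + 0.1 * rng.normal(size=1000)    # attribute correlated with z

# The k-NN MI estimator perturbs the inputs with random noise, so the
# result depends on the RNG state unless random_state is pinned.
mi_a = mutual_info_regression(z, a, random_state=42)
mi_b = mutual_info_regression(z, a, random_state=42)
assert np.allclose(mi_a, mi_b)  # identical inputs + fixed seed -> identical MI
```

Without `random_state`, repeated calls on the same inputs may return slightly different values, which is precisely the behaviour Latte's default seeding guards against.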

Metric bundles
In addition to individual metric functions and modules, Latte provides metric bundles: special modules containing multiple metrics commonly used together, similar to MetricCollection in TorchMetrics. All metric submodules of a bundle are initialized together with a common set of settings, ensuring consistency and compatibility between the metrics within the bundle. Inputs to the update calls of a metric bundle are also automatically passed to the respective submodules, reducing the amount of code needed for metric calculation. Custom bundles can also be created via the MetricBundle class in Latte. An example of using a Latte metric bundle with TensorFlow/Keras is shown in Figure 2.

Testing and Deployment
Automated testing of Latte is performed via pytest. Continuous integration and deployment (CI/CD) is handled via CircleCI. Code coverage and code quality are monitored via CodeCov and CodeFactor, respectively. Latte releases are available on the Python Package Index (PyPI) and can be installed via pip install latte-metrics.

Disentanglement Metrics
Mutual Information Gap (MIG) evaluates the degree of disentanglement of a latent vector z ∈ R^D with respect to a particular semantic attribute, a_i ∈ R, by considering the gap between the mutual information of the attribute with its most informative latent dimension and that with its second-most informative latent dimension [21]. Mathematically, MIG is given by

    MIG(a_i) = [I(a_i, z_j) − I(a_i, z_k)] / H(a_i),

where j = argmax_d I(a_i, z_d), k = argmax_{d≠j} I(a_i, z_d), I(·, ·) denotes mutual information, and H(·) denotes entropy.

Separate Attribute Predictability (SAP) is similar in nature to MIG but, instead of mutual information, uses the coefficient of determination for continuous attributes and classification accuracy for discrete attributes to measure the extent of the relationship between a latent dimension and an attribute [20]. SAP is given by

    SAP(a_i) = S(a_i, z_j) − S(a_i, z_k),

where j = argmax_d S(a_i, z_d), k = argmax_{d≠j} S(a_i, z_d), and S(·, ·) is either the coefficient of determination or classification accuracy.
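As an illustration of the MIG formula, the following sketch computes MIG for a single discrete attribute using NumPy, SciPy, and Scikit-learn. The helper mig_discrete is hypothetical and written for clarity; it is not Latte's implementation.

```python
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score

def mig_discrete(z_binned, a):
    """Illustrative MIG for one discrete attribute `a` (hypothetical helper).

    `z_binned` is an (n_samples, n_dims) array of discretized latent codes.
    Computes I(a, z_d) for each dimension, then the gap between the top two,
    normalized by the empirical entropy H(a).
    """
    mi = np.array([mutual_info_score(a, z_binned[:, d])
                   for d in range(z_binned.shape[1])])
    order = np.argsort(mi)
    j, k = order[-1], order[-2]                   # best and second-best dims
    h = entropy(np.bincount(a) / len(a))          # empirical H(a), natural log
    return (mi[j] - mi[k]) / h

rng = np.random.default_rng(0)
a = rng.integers(0, 4, size=5000)
z0 = a                                   # perfectly informative dimension
z1 = rng.integers(0, 4, size=5000)       # independent noise dimension
score = mig_discrete(np.stack([z0, z1], axis=1), a)
# score is close to 1: z0 carries nearly all information about a
```

A well-disentangled latent pushes the score toward 1, while a latent whose top two dimensions are equally informative about the attribute yields a score near 0.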
Modularity is a latent-centric measure of disentanglement [22] based on mutual information. Modularity measures the degree to which a latent dimension contains information about only one attribute, and is given by

    Modularity(z_d) = 1 − [Σ_{i≠j} I(a_i, z_d)²] / [(N − 1) · I(a_j, z_d)²],

where j = argmax_i I(a_i, z_d) and N is the number of attributes.
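Given a matrix of mutual information values, the modularity computation can be sketched as follows (a hypothetical helper following the template-deviation form of the metric, not Latte's code):

```python
import numpy as np

def modularity_from_mi(mi):
    """Illustrative per-dimension modularity (hypothetical helper).

    `mi[i, d]` holds I(a_i, z_d). For each latent dimension, the squared
    mutual information with every non-dominant attribute is penalized
    relative to the dominant one.
    """
    n_attr = mi.shape[0]
    best = mi.max(axis=0)                      # I(a_j, z_d), j = argmax_i
    off = (mi ** 2).sum(axis=0) - best ** 2    # sum of squares over i != j
    deviation = off / (best ** 2 * (n_attr - 1))
    return 1.0 - deviation

# Dimension 0 is informative about exactly one attribute (perfectly modular);
# dimension 1 is equally informative about all three (maximally mixed).
mi = np.array([[1.0, 0.5],
               [0.0, 0.5],
               [0.0, 0.5]])
scores = modularity_from_mi(mi)   # dim 0 -> 1.0, dim 1 -> 0.0
```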
Dependency-aware Mutual Information Gap (DMIG) is a dependency-aware version of MIG that accounts for the attribute interdependence observed in real-world data [23]. Mathematically, DMIG is given by

    DMIG(a_i) = [I(a_i, z_j) − I(a_i, z_k)] / H(a_i | a_l),

where j = argmax_d I(a_i, z_d), k = argmax_{d≠j} I(a_i, z_d), H(· | ·) is conditional entropy, and a_l is the attribute regularized by z_k. If z_k is not regularizing any attribute, DMIG reduces to the usual MIG. DMIG compensates for the reduced maximum possible value of the numerator due to attribute interdependence.
Dependency-blind Mutual Information Gap (XMIG) is a metric complementary to MIG and DMIG that measures the gap in mutual information with the subtrahend restricted to dimensions which do not regularize any attribute [24]. XMIG is given by

    XMIG(a_i) = [I(a_i, z_j) − I(a_i, z_k)] / H(a_i),

where j = argmax_d I(a_i, z_d), k = argmax_{d∈D̄} I(a_i, z_d), and D̄ is the set of latent indices which do not regularize any attribute. XMIG allows monitoring of latent disentanglement exclusively against attribute-unregularized latent dimensions.
Dependency-aware Latent Information Gap (DLIG) is a latent-centric counterpart to DMIG [24]. DLIG evaluates the disentanglement of a set of semantic attributes {a_i} with respect to a latent dimension z_d, and is given by

    DLIG(z_d) = [I(a_j, z_d) − I(a_k, z_d)] / H(a_j | a_k),

where j = argmax_i I(a_i, z_d) and k = argmax_{i≠j} I(a_i, z_d).

Interpolatability Metrics
The two interpolatability metrics currently supported are based on the concept of a pseudo-derivative called the latent-induced attribute difference (LIAD), defined as

    D a_i(z) = [A_i(z + δ e_d) − A_i(z)] / δ,

where A_i(·) is the measurement of attribute a_i from a sample generated from its latent vector argument, d is the latent dimension regularizing a_i, δ > 0 is the latent step size, and e_d is the dth elementary vector [24]. The second-order LIAD is similarly defined by

    D^(2) a_i(z) = [D^(1) a_i(z + δ e_d) − D^(1) a_i(z)] / δ,

where D^(1) ≡ D.
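The LIADs are simple forward differences, and can be sketched as follows, where attr_fn is a hypothetical stand-in for the composition of the generator and an attribute measurement A_i:

```python
import numpy as np

def liad1(attr_fn, z, d, delta):
    """First-order LIAD: forward difference of the attribute measurement
    along latent dimension d with step size delta."""
    e_d = np.eye(len(z))[d]                  # d-th elementary vector
    return (attr_fn(z + delta * e_d) - attr_fn(z)) / delta

def liad2(attr_fn, z, d, delta):
    """Second-order LIAD: forward difference of the first-order LIAD."""
    e_d = np.eye(len(z))[d]
    return (liad1(attr_fn, z + delta * e_d, d, delta)
            - liad1(attr_fn, z, d, delta)) / delta

# For a linear attribute A(z) = 3 * z_0, the first-order LIAD is the
# constant 3 and the second-order LIAD vanishes: a perfectly smooth,
# perfectly monotonic attribute response.
attr = lambda z: 3.0 * z[0]
z = np.zeros(4)
```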
Smoothness is a measure of how smoothly an attribute changes with respect to a change in the regularizing latent dimension [24]. Smoothness of a latent vector z is based on the concept of the second-order derivative, and is given by

    Smoothness(a_i) = 1 − C_{k∈K}[|D^(2) a_i(z_k)|] / R_{k∈K}[D^(1) a_i(z_k)],

where C_{k∈K}[·] is the contraharmonic mean of its arguments over values of k ∈ K, R_{k∈K}[·] is the range of its arguments over values of k ∈ K, and K is the set of interpolating points used during evaluation.
Monotonicity is a measure of how monotonically an attribute changes with respect to a change in the regularizing dimension [24]. Monotonicity of a latent vector z is given by

    Monotonicity(a_i) = |Σ_{k∈K} sgn(D a_i(z_k)) · [|D a_i(z_k)| > ε]| / Σ_{k∈K} [|D a_i(z_k)| > ε],

where [·] is the Iverson bracket operator, and ε > 0 is a noise threshold for ignoring near-zero attribute changes.
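One plausible reading of this definition treats monotonicity as the absolute net sign of the significant first-order LIADs over the number of significant steps. A sketch under that assumption (an illustrative interpretation, not Latte's implementation):

```python
import numpy as np

def monotonicity_sketch(liads, eps=1e-3):
    """Illustrative monotonicity from a sequence of first-order LIADs
    (one plausible reading of the definition, not Latte's code).

    Near-zero changes (|LIAD| <= eps) are excluded by the Iverson
    bracket; a fully one-directional attribute response scores 1.
    """
    signif = np.abs(liads) > eps          # Iverson bracket [|D_k| > eps]
    if not signif.any():
        return 0.0
    return abs(np.sign(liads[signif]).sum()) / signif.sum()

steps = np.array([0.4, 0.5, 0.45, 0.5])       # attribute always increases
mixed = np.array([0.4, -0.5, 0.45, -0.5, 0.3])  # direction keeps flipping
# monotonicity_sketch(steps) -> 1.0; monotonicity_sketch(mixed) -> 0.2
```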

Software Impact and Future Work
Latte is released under the MIT License and welcomes community contributions to the package. The authors hope that the introduction of Latte will reduce the amount of time spent on re-implementing evaluation metrics due to framework incompatibility, and provide a standardized and uniform framework for the evaluation of controllable generative systems regardless of the deep learning framework of choice.

Conclusion
We introduce Latte, a cross-framework Python package for the evaluation of latent-based generative models. Latte supports on-the-fly metric calculation for disentanglement learning and controllable generation using both a standalone functional API based on NumPy and modular APIs for TensorFlow/Keras and PyTorch. Latte eliminates the need for application-specific re-implementation of common metrics, allowing consistent and reproducible model evaluation regardless of the deep learning framework of choice.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figure 1: Example of using Latte in modular mode with PyTorch via the TorchMetrics API. In this example, the data contains three continuous semantic attributes, each regularized by the latent dimension specified via the reg_dim argument. The discrete=False option specifies that the semantic attributes are continuous-valued.

Figure 2: Example of using a Latte metric bundle in modular mode with TensorFlow via the Keras Metric API. The call signature is very similar to that of single-metric modules; the main difference is that a metric bundle returns a dictionary of arrays instead of a single array. DependencyAwareMutualInformationBundle contains MIG, DMIG, XMIG, and DLIG. All individual metric submodules are automatically initialized with the same reg_dim and discrete options.