seqme.metrics.KLDivergence

class seqme.metrics.KLDivergence(reference, predictor, *, n_draws=10000, kde_bandwidth='silverman', seed=0, name='KL-divergence')

KL-divergence between samples and reference for a single property.

This metric measures how much the empirical distribution of a property \(f(x)\) in the generated samples deviates from the corresponding reference distribution.

The KL-divergence is defined as:

\[\mathrm{KL}\big(p_{f(\mathrm{ref})} \,\|\, p_{f(\mathrm{gen})}\big) = \int p_{f(\mathrm{ref})}(y) \log \frac{p_{f(\mathrm{ref})}(y)}{p_{f(\mathrm{gen})}(y)} \, dy,\]

where \(p_{f(\mathrm{ref})}\) denotes the reference distribution and \(p_{f(\mathrm{gen})}\) denotes the generated distribution.

The integral is approximated by Monte Carlo sampling: n_draws values are drawn from the reference distribution, with densities estimated by Gaussian KDE.
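To make the Monte Carlo approximation concrete, here is a minimal sketch of one way such an estimate could be computed from 1D property values. It is not seqme's implementation: the function name `kl_divergence_mc` is hypothetical, and it assumes that both densities are fitted with `scipy.stats.gaussian_kde` and that the standard error is taken over the sampled log-density ratios.

```python
import numpy as np
from scipy.stats import gaussian_kde


def kl_divergence_mc(ref_values, gen_values, n_draws=10_000, bandwidth="silverman", seed=0):
    """Monte Carlo estimate of KL(p_ref || p_gen) for 1D property values.

    Illustrative sketch only (hypothetical helper, not part of seqme).
    """
    # Fit a Gaussian KDE to each set of property values.
    p_ref = gaussian_kde(ref_values, bw_method=bandwidth)
    p_gen = gaussian_kde(gen_values, bw_method=bandwidth)

    # Draw y_i ~ p_ref; for 1D data the result has shape (1, n_draws).
    draws = p_ref.resample(n_draws, seed=seed)

    # The sample mean of log p_ref(y_i) - log p_gen(y_i) approximates the integral.
    log_ratio = p_ref.logpdf(draws) - p_gen.logpdf(draws)
    estimate = log_ratio.mean()
    std_error = log_ratio.std(ddof=1) / np.sqrt(n_draws)
    return float(estimate), float(std_error)


# Synthetic property values: one set matching the reference, one shifted away.
rng = np.random.default_rng(0)
ref = rng.normal(0.0, 1.0, size=500)
same = rng.normal(0.0, 1.0, size=500)
shifted = rng.normal(2.0, 1.0, size=500)

kl_same, se_same = kl_divergence_mc(ref, same)
kl_shifted, se_shifted = kl_divergence_mc(ref, shifted)
print(f"matching: {kl_same:.3f} +/- {se_same:.3f}")
print(f"shifted:  {kl_shifted:.3f} +/- {se_shifted:.3f}")
```

In this sketch the shifted set yields a clearly larger divergence than the matching set, and fixing the seed makes repeated calls return the same estimate.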

__init__(reference, predictor, *, n_draws=10000, kde_bandwidth='silverman', seed=0, name='KL-divergence')

Initialize the metric.

Parameters:
  • reference (list[str]) – Reference sequences assumed to represent the target distribution.

  • predictor (Callable[[list[str]], ndarray]) – Function that maps a list of sequences to a 1D NumPy array containing one value per sequence.

  • n_draws (int) – Number of Monte Carlo samples to draw from the reference distribution.

  • kde_bandwidth (Union[float, Literal['scott', 'silverman']]) – Bandwidth parameter for the Gaussian KDE.

  • seed (int) – Random seed for the Monte Carlo sampling.

  • name (str) – Metric name.
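As an illustration of the predictor contract described above, the following sketch defines a function that takes a list of sequences and returns a 1D NumPy array with one value per sequence. The function name `hydrophobic_fraction`, the residue set, and the assumption that the sequences are amino-acid strings are all hypothetical choices made for this example.

```python
import numpy as np

# Illustrative residue set (an assumption, not defined by seqme).
HYDROPHOBIC = set("AVILMFWC")


def hydrophobic_fraction(sequences: list[str]) -> np.ndarray:
    """Return the fraction of hydrophobic residues, one value per sequence."""
    return np.array(
        [sum(ch in HYDROPHOBIC for ch in seq) / max(len(seq), 1) for seq in sequences],
        dtype=float,
    )


values = hydrophobic_fraction(["AVIL", "KKDE", "AKAK"])
print(values.shape, values)
```

Following the constructor signature above, a function like this could be passed as the `predictor` argument together with a `reference` list, for example `KLDivergence(reference=reference, predictor=hydrophobic_fraction)`.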

__call__(sequences)

Compute the KL-divergence between the reference and the given sequences for the predicted property.

Parameters:

sequences (list[str]) – Sequences to evaluate.

Returns:

The KL-divergence estimate and its standard error.

Return type:

MetricResult

Methods

__init__(reference, predictor, *[, n_draws, ...])

Initialize the metric.

__call__(sequences)

Compute the KL-divergence between the reference and the given sequences for the predicted property.

Attributes

name

Name of the metric.

objective

Whether lower or higher scores indicate better performance.