Metrics
seqme provides a unified framework for evaluating sequences in three metric spaces (sequence, embedding, and property), along with a few general-purpose utilities.
Sequence-based Metrics
Metrics that operate directly on the raw sequences.
Measures the diversity of synthetic sequences using normalized pairwise Levenshtein distance.
Fraction of unique sequences within a provided list of sequences.
Fraction of sequences not in the reference set.
Average Jaccard similarity between each generated sequence and a reference corpus, based on n-grams of a given size.
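The sequence-based metrics described above can be sketched in plain Python. This is an illustrative sketch only, not seqme's API: the function names (`diversity`, `uniqueness`, `novelty`, `ngram_jaccard`) are hypothetical, and two details are assumptions, namely that each pairwise distance is normalized by the longer sequence's length and that the reference n-grams are pooled into a single set.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def diversity(seqs):
    """Mean pairwise Levenshtein distance, normalized by the longer length (assumption)."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        return 0.0
    total = sum(levenshtein(a, b) / max(len(a), len(b), 1) for a, b in pairs)
    return total / len(pairs)


def uniqueness(seqs):
    """Fraction of distinct sequences in the list."""
    return len(set(seqs)) / len(seqs)


def novelty(seqs, reference):
    """Fraction of sequences that do not appear in the reference set."""
    ref = set(reference)
    return sum(s not in ref for s in seqs) / len(seqs)


def ngrams(seq, n):
    """Set of all contiguous substrings of length n."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def ngram_jaccard(seqs, reference, n):
    """Average Jaccard similarity against the pooled reference n-grams (assumption)."""
    ref_grams = set().union(*(ngrams(r, n) for r in reference))
    scores = []
    for s in seqs:
        grams = ngrams(s, n)
        union = grams | ref_grams
        scores.append(len(grams & ref_grams) / len(union) if union else 0.0)
    return sum(scores) / len(scores)


generated = ["ACGT", "ACGA", "ACGT"]
reference = ["ACGT", "TTTT"]
print(diversity(generated), uniqueness(generated), novelty(generated, reference))
```

The pairwise loop is quadratic in the number of sequences, so on large sets a sketch like this would normally be run on a subsample.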
Embedding-based Metrics
Metrics that compare or assess distributions in an embedding (vector) space.
Fréchet Biological Distance (FBD) between a set of generated sequences and a reference dataset based on their embeddings.
Maximum Mean Discrepancy (MMD) metric using a Gaussian kernel.
Kernel Inception Distance (KID).
Evaluates how realistic synthetic samples are compared to reference data.
Evaluates how well the reference data is covered by the generated sequences.
Evaluates how realistic synthetic samples are compared to reference data.
Evaluates how well the reference data is covered by the synthetic samples.
Proportion of authentic generated samples.
Fourier-based Kernel Entropy Approximation (FKEA) approximates the VENDI-score and RKE-score using random Fourier features.
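As an illustration of how an embedding-space comparison works, the following is a minimal sketch of squared MMD with a Gaussian kernel, written in plain Python on lists of embedding vectors. It is not seqme's implementation: the function names, the `sigma` bandwidth parameter, and the choice of the biased (V-statistic) estimator are all assumptions made for this example.

```python
import math


def gaussian_kernel(x, y, sigma):
    """k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def mean_kernel(xs, ys, sigma):
    """Average kernel value over all pairs (x, y), including x == y pairs."""
    total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate: E[k(x, x')] + E[k(y, y')] - 2 * E[k(x, y)]."""
    return (
        mean_kernel(xs, xs, sigma)
        + mean_kernel(ys, ys, sigma)
        - 2 * mean_kernel(xs, ys, sigma)
    )


# Toy 2-D "embeddings": one set close to the reference, one far from it.
reference = [[0.0, 0.0], [1.0, 0.0]]
close = [[0.1, 0.0], [0.9, 0.1]]
far = [[10.0, 10.0], [11.0, 10.0]]
print(mmd_squared(close, reference), mmd_squared(far, reference))
```

A value near zero suggests the two sets are hard to tell apart under the chosen kernel; the result depends strongly on `sigma`, so the bandwidth should be reported alongside the score.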
Property-based Metrics
Metrics computed on derived physicochemical or predicted properties.
Applies a user-provided predictor to a list of sequences and returns the mean and standard error of the predictor's outputs.
Fraction of sequences whose property value lies within a user-defined range [min, max].
Fraction of sequences that satisfy a user-defined condition.
Computes the Hypervolume metric for multi-objective optimization.
Distributional conformity score.
KL-divergence between samples and reference for a single property.
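The first three property-based metrics above share one pattern: apply a user-supplied function to every sequence, then summarize. The sketch below shows that pattern in plain Python. The names `summarize`, `fraction_in_range`, `hit_rate`, and the toy `gc_content` predictor are hypothetical and not part of seqme; the standard error is assumed to be the sample standard deviation divided by the square root of the sample size.

```python
import math
from statistics import mean, stdev


def gc_content(seq):
    """Toy predictor (hypothetical): fraction of G/C characters in a DNA string."""
    return sum(ch in "GC" for ch in seq) / len(seq)


def summarize(seqs, predictor):
    """Mean and standard error of the predictor's outputs."""
    values = [predictor(s) for s in seqs]
    se = stdev(values) / math.sqrt(len(values)) if len(values) > 1 else 0.0
    return mean(values), se


def fraction_in_range(seqs, predictor, lo, hi):
    """Fraction of sequences whose predicted property lies in [lo, hi]."""
    return sum(lo <= predictor(s) <= hi for s in seqs) / len(seqs)


def hit_rate(seqs, condition):
    """Fraction of sequences for which a user-defined condition holds."""
    return sum(bool(condition(s)) for s in seqs) / len(seqs)


seqs = ["GGCC", "ATAT", "GCAT"]
print(summarize(seqs, gc_content))
print(fraction_in_range(seqs, gc_content, 0.4, 1.0))
print(hit_rate(seqs, lambda s: s.startswith("G")))
```

Passing the predictor as a plain callable keeps the metric independent of how the property is computed, whether by a lookup table, a physicochemical formula, or a trained model.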
Miscellaneous
General or utility metrics that don’t fit into the main categories.
A wrapper for any metric, which splits the sequences into non-overlapping subsets, computes the metric on each split and aggregates the results.
A wrapper to approximate expensive metrics by evaluating a subset of the sequences in a group.
Number of sequences.
Average sequence length.
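The splitting wrapper described above can be sketched as follows. This is a hedged illustration, not seqme's implementation: the name `fold`, the `n_splits` parameter, the contiguous chunking, and the choice to aggregate with the mean and sample standard deviation are all assumptions, since the source only says that results are aggregated.

```python
from statistics import mean, stdev


def average_length(seqs):
    """Average sequence length, used here as the wrapped metric."""
    return sum(len(s) for s in seqs) / len(seqs)


def fold(metric, seqs, n_splits):
    """Evaluate `metric` on non-overlapping chunks and aggregate the scores.

    Chunks are contiguous and near-equal in size; the first chunks absorb
    any remainder. Returns (mean, sample standard deviation) of the scores.
    """
    if not 1 <= n_splits <= len(seqs):
        raise ValueError("n_splits must be between 1 and len(seqs)")
    size, extra = divmod(len(seqs), n_splits)
    scores, start = [], 0
    for i in range(n_splits):
        end = start + size + (1 if i < extra else 0)
        scores.append(metric(seqs[start:end]))
        start = end
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread


seqs = ["A", "AA", "AAA", "AAAA"]
print(fold(average_length, seqs, n_splits=2))
```

Because the chunks here are contiguous, any ordering in the input carries over into the splits; shuffling the sequences first is a reasonable precaution when the order is not random.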
Supported sequence types
At-a-glance matrix of all metrics and supported sequence types.
— supported,
— not supported
| Metrics | Protein | Peptide | RNA | DNA | Small Molecule |
|---|---|---|---|---|---|