Getting started#

In this notebook, we show how to use seqme to evaluate biological sequences and add your own metric. We introduce seqme’s functionality through toy examples. Other tutorials show real-world use cases.

First the imports.

from pprint import pprint
from typing import Literal

import numpy as np

import seqme as sm

Basics#

Let’s define the sequences to use and the metrics to compute. Some sequences will be loaded from fasta files, others are hard-coded.

First let’s write a fasta file with random sequences to disk.

RANDOM_SEQUENCES_PATH = "./data/random.fasta"

random_sequences = ["KKKKK", "PLQ", "RKSPL"]

sm.to_fasta(random_sequences, RANDOM_SEQUENCES_PATH)

UniProt contains a lot of sequences. To speed-up the metric computations, we will only use a (random) subset of the UniProt sequences.

sequences = {
    "Random": sm.read_fasta(RANDOM_SEQUENCES_PATH),
    "UniProt": sm.utils.subsample(["KKWQ", "RKSPL", "RASD"], n_samples=2, seed=42),
    "HydrAMP": ["MMRK", "RKSPL", "RRLSK", "RRLSK"],
    "hyformer": ["MKQW", "RKSPL"],
}

Metrics may precompute some values upon metric object creation to speed up the overall process.

def hit_rate_condition_fn(sequences: list[str]) -> np.ndarray:
    no_lysine = ~np.array(["K" in seq for seq in sequences])
    return no_lysine


def my_embedder(sequences: list[str]) -> np.ndarray:
    lengths = [len(sequence) for sequence in sequences]
    counts = [sequence.count("K") for sequence in sequences]
    return np.array([lengths, counts], dtype=float).T


metrics = [
    sm.metrics.Uniqueness(),
    sm.metrics.Novelty(reference=sequences["UniProt"], name="Novelty (UniProt)"),
    sm.metrics.Diversity(),
    sm.metrics.FBD(reference=sequences["Random"], embedder=my_embedder),
    sm.metrics.ID(sm.models.Gravy(), name="Gravy", objective="minimize"),
    sm.metrics.HitRate(condition_fn=hit_rate_condition_fn),
]

Let’s compute the metrics.

df = sm.evaluate(sequences, metrics)

 38%|███▊      | 9/24 [00:00<00:00, 289.30it/s, data=UniProt, metric=FBD]              /Users/rasmus.larsen/work/hackathon-2025/seqme/src/seqme/metrics/fbd.py:91: LinAlgWarning: Matrix is singular. The result might be inaccurate or the array might not have a square root.
  covmean = scipy.linalg.sqrtm(sigma1.dot(sigma2))
100%|██████████| 24/24 [00:00<00:00, 614.03it/s, data=hyformer, metric=Hit-rate]         

Let’s look at the results.

df

	Uniqueness		Novelty (UniProt)		Diversity		FBD		Gravy		Hit-rate
	value	deviation	value	deviation	value	deviation	value	deviation	value	deviation	value	deviation
Random	1.00	NaN	1.0	NaN	0.866667	NaN	0.000000	NaN	-1.911111	1.032855	0.333333	0.272166
UniProt	1.00	NaN	0.0	NaN	1.000000	NaN	3.961130	NaN	-2.400000	0.650000	0.500000	0.353553
HydrAMP	0.75	NaN	1.0	NaN	0.700000	NaN	8.602244	NaN	-1.627500	0.209816	0.000000	0.000000
hyformer	1.00	NaN	1.0	NaN	0.800000	NaN	8.228118	NaN	-1.500000	0.100000	0.000000	0.000000

… let’s make it nicer to look at.

sm.show(df, n_decimals=2)

	Uniqueness↑	Novelty (UniProt)↑	Diversity↑	FBD↓	Gravy↓	Hit-rate↑
Random	1.00	1.00	0.87	0.00	-1.91±1.04	0.33±0.28
UniProt	1.00	0.00	1.00	3.96	-2.40±0.65	0.50±0.36
HydrAMP	0.75	1.00	0.70	8.60	-1.63±0.21	0.00±0.00
hyformer	1.00	1.00	0.80	8.23	-1.50±0.10	0.00±0.00

Let’s also do a parallel coordinates plot.

sm.plot_parallel(df, arrow_size=12, xticks_fontsize=10, xticks_rotation=35, figsize=(6, 2.5), show_arrow=False)

../_images/c4577881e5ab8b72942a3755448e105a6e23c8da026bfb5b069dad82d8790e78.png

We are particularly interested in the following two metrics.

sm.plot_scatter(df, metrics=["FBD", "Gravy"], show_deviation=False)

../_images/dfb6efea96d4020f3ef9bfe73c90a6407f258176150f2078115efcc3355df744.png

Let’s zoom in on one of the metrics.

sm.plot_bar(df, "Gravy")

../_images/a6492cadec36133fec9de5e1142e1414eabd8db4ee6117423e69ea1fe267596a.png

Let’s extract the two rows with lowest Gravy.

sm.show(sm.top_k(df, "Gravy", k=2))

	Uniqueness↑	Novelty (UniProt)↑	Diversity↑	FBD↓	Gravy↓	Hit-rate↑
Random	1.00	1.00	0.87	0.00	-1.91±1.04	0.33±0.28
UniProt	1.00	0.00	1.00	3.96	-2.40±0.65	0.50±0.36

Let’s perform a (non-dominated) ranking of the entries, which corresponds to computing slices of the Pareto front from the selected metrics, followed by removing the entries in the current slice. Each slice corresponds to a rank.

df = sm.rank(df, metrics=["Hit-rate", "Diversity", "Gravy"])

Let’s plot the ranking.

sm.plot_bar(df, "Rank")

../_images/46ee929bf390985467524d4e077821f00a5225294b8324329557ec55875c3f03.png

To write a metric dataframe to disk, encode the dataframe as a pickle object.

PATH = "./metrics_dataframe.pkl"

sm.to_pickle(df, PATH)
sm.show(sm.read_pickle(PATH), color_style="gradient")

	Uniqueness↑	Novelty (UniProt)↑	Diversity↑	FBD↓	Gravy↓	Hit-rate↑	Rank↓
Random	1.00	1.00	0.87	0.00	-1.91±1.04	0.33±0.28	2.00
UniProt	1.00	0.00	1.00	3.96	-2.40±0.65	0.50±0.36	1.00
HydrAMP	0.75	1.00	0.70	8.60	-1.63±0.21	0.00±0.00	3.00
hyformer	1.00	1.00	0.80	8.23	-1.50±0.10	0.00±0.00	3.00

It is also possible to export the dataframe to a LaTeX table or a PNG as shown below.

To save the styled metric dataframe to a png, we use dataframe_image.

# !pip install dataframe_image

SAVE_TABLE = False

if SAVE_TABLE:
    # save to LaTeX
    sm.to_latex(df, "table.tex", n_decimals=3, color=None, caption="My first table")

    # save to PNG
    import dataframe_image as dfi

    styler = sm.show(df, n_decimals=3, color_style="bar")
    await dfi.export_async(styler, "table.png", dpi=300)

That it all for the basic usage of seqme. Check out the documentation for the metrics and models implemented in seqme.

Let’s now look at some more advanced features!

Advanced#

In the advanced section, we show how to

Combine multiple metric dataframes into one.
Improve performance through caching.
Create your own metric.

Combining metric dataframes#

seqme’s combine can combine metric dataframes with different sequences and/or metrics. This can be useful if you want to evaluate the sequences on an additional metric or add an additional set of sequences for evaluation - without needing to recompute the results.

Let’s create three metric dataframes.

df1 = sm.evaluate(
    sequences={
        "s1": ["KKKKK", "PLUQ", "RKSPL"],
        "s2": sm.utils.subsample(["KKWQ", "RKSPL", "RASD"], n_samples=2, seed=42),
        "s3": ["MMRK", "RKSPL", "RRLSK"],
        "s4": ["MKQW", "RKSPL"],
    },
    metrics=[
        sm.metrics.Novelty(reference=["KKW", "RKSPL"]),
        sm.metrics.Novelty(reference=["RASD", "KKKQ", "LPTUY"], name="Novelty (Random)"),
    ],
    verbose=False,
)

df2 = sm.evaluate(
    sequences={"s4": ["MKQW", "RKSPL"], "s5": ["MKQW"]},
    metrics=[sm.metrics.ID(sm.models.Gravy(), name="Gravy", objective="minimize")],
    verbose=False,
)

df3 = sm.evaluate(
    sequences={"s6": ["MKQW", "RKSPL"]},
    metrics=[
        sm.metrics.ID(sm.models.Gravy(), name="Gravy", objective="minimize"),
        sm.metrics.Novelty(reference=["KKW", "RKSPL"]),
    ],
    verbose=False,
)

Let’s combine the dataframes.

df = sm.combine([df1, df2, df3])

df

	Novelty		Novelty (Random)		Gravy
	value	deviation	value	deviation	value	deviation
s1	0.666667	NaN	1.0	NaN	NaN	NaN
s2	1.000000	NaN	0.5	NaN	NaN	NaN
s3	0.666667	NaN	1.0	NaN	NaN	NaN
s4	0.500000	NaN	1.0	NaN	-1.5	0.1
s5	NaN	NaN	NaN	NaN	-1.6	NaN
s6	0.500000	NaN	NaN	NaN	-1.5	0.1

… let’s make it nicer to look at.

sm.show(df, n_decimals=[3, 2, 1], color="#d66868")

	Novelty↑	Novelty (Random)↑	Gravy↓
s1	0.667	1.00	-
s2	1.000	0.50	-
s3	0.667	1.00	-
s4	0.500	1.00	-1.5±0.1
s5	-	-	-1.6
s6	0.500	-	-1.5±0.1

sm.plot_bar(df, "Novelty", color="#d66868", ylim=(0.0, 1.1), xticks_rotation=0.0)

../_images/010877ab54e90f3aa179a1248d67c2f367cd15030a219b7925bf72e20084d3cf.png

Notice missing values (e.g., “Novelty” for “s5”) are not shown in the barplot.

Caching#

Discriminators and embedders can be computational expensive to execute. To reduce the number of function evaluations for these models, we can cache the results of discriminators and embedders.

The cache works like this: If a (sequence, embedding)-pair is already stored in the cache then retrieve the embedding. Otherwise compute the embedding, store the pair in the cache and retrieve the embedding.

Let’s create two toy models and the cache.

def my_embedder(sequences: list[str]) -> np.ndarray:
    lengths = [len(sequence) for sequence in sequences]
    counts = [sequence.count("K") for sequence in sequences]
    return np.array([lengths, counts], dtype=float).T


def my_discriminator(sequences: list[str]) -> np.ndarray:
    return np.ones(len(sequences)) * 0.85


cache = sm.Cache(models={"embedder": my_embedder, "discriminator": my_discriminator})

Optional: We can manually add model outputs to the cache. This is useful when e.g., (sequence, embedding)-pairs are already pre-computed and stored in a file.

seqs = ["MKQW", "AA", "BB"]
reps = my_embedder(seqs)

pairs = dict(zip(seqs, reps, strict=True))

cache.add(model_name="embedder", element=pairs)

Let’s compute the metrics. Since the metrics Precision, Recall, Clipped coverage, Clipped density use the same embedding model (embedder), the embedding is only computed once per sequence in sequences (instead of four times)!

Notice, we now use cache.model() instead of the embedder reference (my_embedder) directly.

df = sm.evaluate(
    sequences={
        "hyformer": ["RMKQW", "RKSPL", "RRRASD"],
        "HydrAMP": ["MKQW", "RKSPLP"],
        "Random": ["AAA", "PPPPP"],
    },
    metrics=[
        sm.metrics.Precision(
            reference=["AKWR", "AKKR", "KKKK"],
            embedder=cache.model("embedder"),
            n_neighbors=1,
            strict=False,
        ),
        sm.metrics.Recall(
            reference=["AKWR", "AKKR", "KKKK"],
            embedder=cache.model("embedder"),
            n_neighbors=1,
            strict=False,
        ),
        sm.metrics.ClippedDensity(
            reference=["AKWR", "AKKR", "KKKK"],
            embedder=cache.model("embedder"),
            n_neighbors=1,
            strict=False,
        ),
        sm.metrics.ClippedCoverage(
            reference=["AKWR", "AKKR", "KKKK"],
            embedder=cache.model("embedder"),
            n_neighbors=1,
            strict=False,
        ),
    ],
)

sm.show(df, color="#69bce8")

100%|██████████| 12/12 [00:00<00:00, 526.66it/s, data=Random, metric=Clipped coverage]  

	Precision↑	Recall↑	Clipped density↑	Clipped coverage↑
hyformer	0.67	0.00	1.00	0.67
HydrAMP	0.50	0.67	0.75	1.00
Random	0.00	0.33	0.00	0.00

Let’s plot a scatterplot of the metrics.

metrics = ["Clipped density", "Clipped coverage"]
df = sm.rank(df, metrics)

markers = np.where(df[("Rank", "value")] == 1, "x", "o")
sm.plot_scatter(df, metrics=metrics, marker=markers)

../_images/b39ffe6c1f8a30e095a8ef27c0027c9ada554a37e33a435f23cd7067af33a180.png

Let’s write the cached (sequence, embedding)-pairs for each model to a file and reinitialize the cache.

PATH = "precomputed.pkl"

sm.to_pickle(cache.get(), PATH)

precomputed = sm.read_pickle(PATH)

cache = sm.Cache(init_cache=precomputed)
pprint(cache.get())

{'discriminator': {},
 'embedder': {'AA': array([2., 0.]),
              'AAA': array([3., 0.]),
              'AKKR': array([4., 2.]),
              'AKWR': array([4., 1.]),
              'BB': array([2., 0.]),
              'KKKK': array([4., 4.]),
              'MKQW': array([4., 1.]),
              'PPPPP': array([5., 0.]),
              'RKSPL': array([5., 1.]),
              'RKSPLP': array([6., 1.]),
              'RMKQW': array([5., 1.]),
              'RRRASD': array([6., 0.])}}

Tip: The cache can become very useful if you use an external tool to predict sequence properties, e.g., a command-line tool or a tool implemented in another programming language. Now you want to use these predictions in seqme to a compute metric. To do so, 1. read the predictions stored in a file and 2. pre-init the Cache with these (sequence, predictions)-pairs.

Creating a new metric#

You can also integrate your own metrics into seqme through the Metric interface as shown below.

class MyMetric(sm.Metric):
    def __init__(self, amino_acid: str, minimize: bool = True):
        self.amino_acid = amino_acid
        self.minimize = minimize

    def __call__(self, sequences: list[str]) -> sm.MetricResult:
        aa_count = sum([seq.count(self.amino_acid) for seq in sequences])
        aa_total = sum([len(sequence) for sequence in sequences])
        return sm.MetricResult(aa_count / aa_total)

    @property
    def name(self) -> str:
        return f"{self.amino_acid}-frequency"

    @property
    def objective(self) -> Literal["minimize", "maximize"]:
        return "minimize" if self.minimize else "maximize"

Let’s use our new metric.

df = sm.evaluate(
    sequences={
        "hyformer": ["MKQW", "RKSPL", "RASD"],
        "HydrAMP": ["MKQW", "RKSPL"],
        "Random": ["AAAA", "PPPPP"],
    },
    metrics=[
        MyMetric(amino_acid="K", minimize=False),
        MyMetric(amino_acid="R", minimize=True),
    ],
)

sm.show(df, color="#e07fe1", color_style="bar", caption="Table 1. Our first metric 🎉")

100%|██████████| 6/6 [00:00<00:00, 2613.54it/s, data=Random, metric=R-frequency]  

Table 1. Our first metric 🎉
	K-frequency↑	R-frequency↓
hyformer	0.15	0.15
HydrAMP	0.22	0.11
Random	0.00	0.00

We have now covered the main functionality of seqme! 🎉

seqme also contains functionality to add third-party models, evaluate embedding models alignment with the property/properties of interest and a few extra things.

Getting started

Contents