Evaluating iterative algorithms

Many generative methods (e.g., genetic algorithms) explore sequence space iteratively to improve sequence fitness. In this notebook, we use toy sequences to show how to evaluate sequences optimized over multiple rounds and how to visualize the results with seqme.

import seqme as sm

Single run

seqme allows an entry to be named with a tuple. Here each entry is named using the format (model name, iteration).

sequences = {
    ("model 1", 1): ["QLF", "FFQLL", "RQLL"],
    ("model 1", 2): ["RQLF", "PRFQRP", "RQLL"],
    ("model 1", 3): ["RQLRR", "RQLRRR", "RQLRRR"],
    ("model 2", 1): ["QLF", "QLF", "RQLL"],
    ("model 2", 2): ["QLF", "FFQRP", "RQLL"],
    ("model 2", 3): ["PLFR", "RFQRP", "RQLR"],
}
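
To make the naming scheme concrete, the plain-Python sketch below groups tuple keys by their first element, which mirrors how the tables in this notebook nest iterations under each model. It does not use seqme, and `by_model` is an illustrative name rather than part of any API.

```python
from collections import defaultdict

# A subset of the toy entries, keyed by (model name, iteration).
sequences = {
    ("model 1", 1): ["QLF", "FFQLL", "RQLL"],
    ("model 1", 2): ["RQLF", "PRFQRP", "RQLL"],
    ("model 2", 1): ["QLF", "QLF", "RQLL"],
}

# Split each key into its two levels: model name first, iteration second.
by_model = defaultdict(dict)
for (model, iteration), seqs in sequences.items():
    by_model[model][iteration] = seqs

for model, rounds in by_model.items():
    print(model, sorted(rounds))
```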

Let’s define the metrics to compute.

metrics = [
    sm.metrics.ID(predictor=sm.models.Charge(), name="Charge", objective="maximize"),
    sm.metrics.ID(predictor=sm.models.Hydrophobicity(), name="Hydrophobicity", objective="maximize"),
    sm.metrics.Uniqueness(),
]
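
As a rough intuition for the last metric, uniqueness is commonly defined as the fraction of distinct sequences in an entry. The sketch below assumes that definition; it is not taken from seqme's implementation, and the helper name `uniqueness` is illustrative.

```python
def uniqueness(seqs):
    """Fraction of distinct sequences in a list (assumed definition)."""
    return len(set(seqs)) / len(seqs)

# ("model 2", 1) contains "QLF" twice, so only 2 of its 3 sequences are distinct.
print(round(uniqueness(["QLF", "QLF", "RQLL"]), 2))
# ("model 1", 1) has no duplicates.
print(round(uniqueness(["QLF", "FFQLL", "RQLL"]), 2))
```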

Let’s compute the metrics.

df = sm.evaluate(sequences, metrics)

100%|██████████| 18/18 [00:00<00:00, 793.34it/s, data=('model 2', 3), metric=Uniqueness]    

sm.show(df, color_style="bar", caption="Table 1.1 Iterative algorithms")

Table 1.1 Iterative algorithms
		Charge↑	Hydrophobicity↑	Uniqueness↑
model 1	1	0.33±0.34	0.32±0.32	1.00
	2	1.33±0.34	-0.43±0.16	1.00
	3	3.66±0.34	-1.57±0.06	0.67
model 2	1	0.33±0.34	0.23±0.26	0.67
	2	0.66±0.34	0.01±0.24	1.00
	3	1.66±0.34	-0.70±0.36	1.00

Let’s apply the highlighting within each model instead of across the whole table.

sm.show(df, color_style="bar", caption="Table 1.2 Iterative algorithms", level=1)

Table 1.2 Iterative algorithms
		Charge↑	Hydrophobicity↑	Uniqueness↑
model 1	1	0.33±0.34	0.32±0.32	1.00
	2	1.33±0.34	-0.43±0.16	1.00
	3	3.66±0.34	-1.57±0.06	0.67
model 2	1	0.33±0.34	0.23±0.26	0.67
	2	0.66±0.34	0.01±0.24	1.00
	3	1.66±0.34	-0.70±0.36	1.00

Notice that in the table above we set level=1. This means that each sub-dataframe (model 1 and model 2) is colored, underlined, and bolded independently.

Let’s visualize the sequences’ performance at each iteration.

sm.plot_line(df, metric="Charge")

../_images/bca3d54ddf5a32025f2455d617a0dee462e2d1be0d65521993ea7bc4b20ed311.png

Let’s look at two metrics.

sm.plot_scatter(df, metrics=["Charge", "Hydrophobicity"])

../_images/d7b7128f216dcd85d6238d42d85f33985fc444e6b13b404579752feb7f57eb2c.png

sm.plot_scatter(df[["Charge", "Uniqueness"]])

../_images/a5d2fb9a4cadd65963aa0311d67cc6b0c26e0fc170e2d5e8cd785320c7157954.png

Let’s sort the entries by their charge.

df2 = sm.sort(df, "Charge", level=0)
sm.show(df2, caption="Table 2.1. Iterative algorithms (sorted)", color_style="bar", hline_level=0)

Table 2.1. Iterative algorithms (sorted)
		Charge↑	Hydrophobicity↑	Uniqueness↑
model 1	3	3.66±0.34	-1.57±0.06	0.67
model 2	3	1.66±0.34	-0.70±0.36	1.00
model 1	2	1.33±0.34	-0.43±0.16	1.00
model 2	2	0.66±0.34	0.01±0.24	1.00
model 1	1	0.33±0.34	0.32±0.32	1.00
model 2	1	0.33±0.34	0.23±0.26	0.67

Let’s swap the index levels so that the iteration comes first, then sort the entries by uniqueness within each iteration.

df3 = df.reorder_levels([1, 0])
df3 = sm.sort(df3, "Uniqueness", level=1)

sm.show(
    df3, color="#d668c9", caption="Table 2.2. Iterative algorithms (sorted within iteration)", color_style="gradient"
)

Table 2.2. Iterative algorithms (sorted within iteration)
		Charge↑	Hydrophobicity↑	Uniqueness↑
1	model 1	0.33±0.34	0.32±0.32	1.00
1	model 2	0.33±0.34	0.23±0.26	0.67
2	model 1	1.33±0.34	-0.43±0.16	1.00
2	model 2	0.66±0.34	0.01±0.24	1.00
3	model 2	1.66±0.34	-0.70±0.36	1.00
3	model 1	3.66±0.34	-1.57±0.06	0.67
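
The same reordering can be sketched in plain Python, without seqme or pandas. The Uniqueness values are copied from Table 1.1, and the names `swapped` and `ordered` are illustrative. How seqme itself breaks ties is not stated here; this sketch relies on the stability of Python's `sorted`.

```python
# Uniqueness values copied from Table 1.1, keyed by (model name, iteration).
uniqueness = {
    ("model 1", 1): 1.00, ("model 1", 2): 1.00, ("model 1", 3): 0.67,
    ("model 2", 1): 0.67, ("model 2", 2): 1.00, ("model 2", 3): 1.00,
}

# Swap the two key levels so that the iteration comes first.
swapped = {(iteration, model): v for (model, iteration), v in uniqueness.items()}

# Sort by iteration, then by descending uniqueness within each iteration.
# sorted() is stable, so tied entries keep their original relative order.
ordered = sorted(swapped, key=lambda key: (key[0], -swapped[key]))

for key in ordered:
    print(key, swapped[key])
```

With these inputs the order matches Table 2.2, including iteration 3, where model 2 moves ahead of model 1.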

Let’s display the entry with the largest charge in each iteration. Both models are listed for iteration 1 because they tie at 0.33.

df4 = df.reorder_levels([1, 0]).sort_index()
df4 = sm.top_k(df4, "Charge", k=1, level=1)
sm.show(df4, color_style="bar", hline_level=0, caption="Table 2.3. Best per iteration")

Table 2.3. Best per iteration
		Charge↑	Hydrophobicity↑	Uniqueness↑
1	model 1	0.33±0.34	0.32±0.32	1.00
1	model 2	0.33±0.34	0.23±0.26	0.67
2	model 1	1.33±0.34	-0.43±0.16	1.00
3	model 1	3.66±0.34	-1.57±0.06	0.67
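
A per-group top-k selection that keeps ties can be sketched as follows. This is a plain-Python illustration under the assumption, suggested by Table 2.3, that tied entries are all retained; it is not seqme's implementation, and `top_k_per_iteration` is a hypothetical helper.

```python
from collections import defaultdict

# Charge values copied from Table 1.1, keyed by (model name, iteration).
charge = {
    ("model 1", 1): 0.33, ("model 1", 2): 1.33, ("model 1", 3): 3.66,
    ("model 2", 1): 0.33, ("model 2", 2): 0.66, ("model 2", 3): 1.66,
}

def top_k_per_iteration(values, k=1):
    """Keep, for each iteration, every entry at least as good as the k-th best."""
    groups = defaultdict(list)
    for (model, iteration), v in values.items():
        groups[iteration].append((model, v))

    kept = []
    for iteration in sorted(groups):
        ranked = sorted((v for _, v in groups[iteration]), reverse=True)
        cutoff = ranked[min(k, len(ranked)) - 1]  # value of the k-th best entry
        kept += [(iteration, model) for model, v in groups[iteration] if v >= cutoff]
    return kept

best = top_k_per_iteration(charge, k=1)
print(best)
```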

Multiple runs

Let’s assume we ran two generative models several times, each time with a different seed. The sequences from each run are shown below. We now want to compute the deviation in performance across runs.

sequences_run1 = {
    ("model 1", 1): ["QLF", "FFQLL", "RQLL"],
    ("model 1", 2): ["RQLF", "PRFQRP", "RQLL"],
    ("model 1", 3): ["RQLRR", "RQLRRR", "RQLRRR"],
    ("model 2", 1): ["QLF", "FFQRP", "RQLL"],
    ("model 2", 2): ["QLF", "FFQRP", "RQLL"],
    ("model 2", 3): ["PLFR", "RFQRP", "RQLR"],
}

sequences_run2 = {
    ("model 1", 1): ["QLF", "FFQLL", "RQLL"],
    ("model 1", 2): ["RQLF", "PRFQRP", "RQLL"],
    ("model 1", 3): ["RQLRR", "RQLRRR", "RQLRRR"],
    ("model 2", 1): ["QLF", "FFQRP", "RQLL"],
    ("model 2", 2): ["QLF", "FFQRP", "RQLL"],
    ("model 2", 3): ["PLFR", "RFQRP", "RQLR"],
}

sequences_run3 = {
    ("model 1", 1): ["RQLF", "PRFQRP", "RQLL"],
    ("model 1", 2): ["QLF", "FFQLL", "RQLL"],
    ("model 1", 3): ["RQLRR", "RQLRRR", "RQLRRR"],
    ("model 2", 1): ["QLF", "FFQRP", "RQLL"],
    ("model 2", 2): ["QLF", "FFQRP", "RQLL"],
    ("model 2", 3): ["PLFR", "RFQRP", "RQLR"],
}

sequences_per_run = [sequences_run1, sequences_run2, sequences_run3]

Let’s define the metrics to compute.

metrics = [
    sm.metrics.ID(predictor=sm.models.Charge(), name="Charge", objective="maximize"),
    sm.metrics.Uniqueness(),
]

df_per_run = [sm.evaluate(sequences, metrics) for sequences in sequences_per_run]

100%|██████████| 12/12 [00:00<00:00, 2110.78it/s, data=('model 2', 3), metric=Uniqueness]
100%|██████████| 12/12 [00:00<00:00, 3515.03it/s, data=('model 2', 3), metric=Uniqueness]
100%|██████████| 12/12 [00:00<00:00, 3198.50it/s, data=('model 2', 3), metric=Uniqueness]

Let’s combine the metric dataframes of the individual runs into a single metric dataframe. Each value is the mean across runs, and the deviation is the standard error ("se").

df_combined = sm.combine(df_per_run, value="mean", deviation="se")
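
To see what a mean with a standard error looks like for one entry, the sketch below aggregates the per-run Charge of ("model 1", 1) by hand. The per-run values are the rounded numbers from Table 1.1 (runs 1 and 2 use the same sequences, run 3 uses the sequences that scored 1.33), and the standard error is assumed to be the sample standard deviation divided by the square root of the number of runs. `mean_and_se` is an illustrative helper, not a seqme function.

```python
from math import sqrt
from statistics import mean, stdev

def mean_and_se(values):
    """Return the mean and the standard error of the mean (sample std / sqrt(n))."""
    return mean(values), stdev(values) / sqrt(len(values))

# Rounded per-run Charge of ("model 1", 1) across the three runs.
charge_per_run = [0.33, 0.33, 1.33]

value, deviation = mean_and_se(charge_per_run)
print(f"{value:.2f} ± {deviation:.2f}")
```

Because the inputs here are already rounded, the result is only expected to be close to the value seqme reports for this entry, not identical in the last digit.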

Let’s rank the entries using all the metrics.

df_combined = sm.rank(df_combined, tiebreak="mean-rank")

sm.show(df_combined, color="#6892d6", color_style="bar", caption="Table 3.1. Multiple runs", n_decimals=[2, 2, 0])

Table 3.1. Multiple runs
		Charge↑	Uniqueness↑	Rank↓
model 1	1	0.66±0.34	1.00±0.00	4
	2	1.00±0.34	1.00±0.00	3
	3	3.66±0.00	0.67±0.00	2
model 2	1	0.66±0.00	1.00±0.00	4
	2	0.66±0.00	1.00±0.00	4
	3	1.66±0.00	1.00±0.00	1
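
One scheme that happens to reproduce the Rank column of Table 3.1 is non-dominated (Pareto) sorting over the two metrics, with ties inside a front broken by the mean of the per-metric ranks. The sketch below illustrates that scheme on the rounded values from the table. It is a guess at the behaviour, not a description of how sm.rank is implemented, and all helper names are hypothetical.

```python
from statistics import mean

# Rounded (Charge, Uniqueness) means copied from Table 3.1; both are maximized.
entries = {
    ("model 1", 1): (0.66, 1.00), ("model 1", 2): (1.00, 1.00), ("model 1", 3): (3.66, 0.67),
    ("model 2", 1): (0.66, 1.00), ("model 2", 2): (0.66, 1.00), ("model 2", 3): (1.66, 1.00),
}

def dominates(a, b):
    """True if a is at least as good as b on every metric and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_fronts(points):
    """Peel off successive non-dominated fronts; returns a list of key lists."""
    remaining, fronts = dict(points), []
    while remaining:
        front = [k for k, v in remaining.items()
                 if not any(dominates(w, v) for w in remaining.values())]
        fronts.append(front)
        for k in front:
            del remaining[k]
    return fronts

def average_ranks(values):
    """Descending ranks (1 = best); tied values share their average rank."""
    ordered = sorted(values, reverse=True)
    return [mean([i + 1 for i, v in enumerate(ordered) if v == x]) for x in values]

keys = list(entries)
per_metric = [average_ranks([entries[k][m] for k in keys]) for m in range(2)]
mean_rank = {k: mean([r[i] for r in per_metric]) for i, k in enumerate(keys)}

# Walk the fronts in order; inside a front, a lower mean rank is better.
ranks, next_rank = {}, 1
for front in pareto_fronts(entries):
    for score in sorted({mean_rank[k] for k in front}):
        for k in front:
            if mean_rank[k] == score:
                ranks[k] = next_rank  # entries still tied share one rank
        next_rank += 1

for k in keys:
    print(k, ranks[k])
```

Here ("model 1", 3) and ("model 2", 3) form the first front, since neither is better on both metrics, and the mean-rank tiebreak places ("model 2", 3) first.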

Let’s extract the top two ranked entries.

sm.show(
    sm.top_k(df_combined, "Rank", 2),
    color="#6892d6",
    color_style="bar",
    caption="Table 3.2. Multiple runs - Best",
    n_decimals=[2, 3, 0],
)

Table 3.2. Multiple runs - Best
		Charge↑	Uniqueness↑	Rank↓
model 1	3	3.66±0.00	0.667±0.000	2
model 2	3	1.66±0.00	1.000±0.000	1

Let’s visualize the sequences’ performance at each iteration.

sm.plot_line(df_combined, "Charge", color=["#ff4949ff", "#29a1c6ff"], linestyle=["--", "-"], marker=None)

../_images/4502a97a875f85a9c2badc0d3bfac0fbcf023028e6d14d25a22430773e532ecd.png
