feat: Add the ESM2 protein embedding model #600

nleroy917 · 2026-01-20T17:03:35Z

Summary

Added a new bio module that introduces embedding models for biological sequence data (proteins, DNA, etc.). The plan is to eventually support more advanced models such as Tahoe-x1, but this PR starts simple.

Added

Add ProteinEmbedding class for protein sequence embeddings using ESM-2 models
Use HuggingFace tokenizers library for tokenization (consistent with other models)
Add comprehensive test suite for protein embeddings
Add protein embedding example to README
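
For readers unfamiliar with what tokenizing a protein sequence involves, here is a minimal sketch. It is not the code from this PR, which uses the HuggingFace tokenizers library. The vocabulary, the special-token names, their IDs, and the `encode_batch` helper are all hypothetical stand-ins; the real values come from the model's tokenizer files.

```python
# Hypothetical toy vocabulary. A real ESM-2 tokenizer defines its own
# tokens and IDs in the files shipped with the model.
SPECIALS = ["<cls>", "<pad>", "<eos>", "<unk>"]
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
VOCAB = {tok: i for i, tok in enumerate(SPECIALS + list(AMINO_ACIDS))}


def encode_batch(sequences):
    """Encode sequences one residue per token and pad to the longest one.

    Returns (input_ids, attention_mask), where the mask is 1 for real
    tokens and 0 for padding.
    """
    if not sequences:
        return [], []

    encoded = []
    for seq in sequences:
        ids = [VOCAB["<cls>"]]
        # Residues outside the toy vocabulary map to <unk>.
        ids += [VOCAB.get(res, VOCAB["<unk>"]) for res in seq.upper()]
        ids.append(VOCAB["<eos>"])
        encoded.append(ids)

    max_len = max(len(ids) for ids in encoded)
    input_ids, attention_mask = [], []
    for ids in encoded:
        pad = max_len - len(ids)
        input_ids.append(ids + [VOCAB["<pad>"]] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
    return input_ids, attention_mask


ids, mask = encode_batch(["MKT", "ac"])
print(ids)
print(mask)
```

The attention mask produced here is what a pooling step would later use to ignore padded positions.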

coderabbitai · 2026-01-20T17:05:33Z

Caution

Review failed

The pull request is closed.

Walkthrough

This PR introduces protein embedding functionality to the fastembed library. A new ProteinEmbedding class computes embeddings for amino acid sequences using ONNX-based models. The implementation covers tokenizer loading from model files (with fallback support), ONNX model integration, mean pooling with attention masking as a post-processing step, batching, and lazy loading. The feature is exposed through the public API via module exports and is accompanied by comprehensive test coverage and documentation examples.
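
As an illustration of the mean-pooling step mentioned in the walkthrough, the following is a minimal pure-Python sketch, not the PR's implementation. The `mean_pool` name and the nested-list input layout (batch, tokens, hidden size) are assumptions made for this example; a real implementation would most likely operate on arrays, and whether special tokens are included in the average is not stated in the source.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, skipping masked positions.

    token_embeddings: list of sequences, each a list of per-token vectors.
    attention_mask:   list of sequences of 1 (real token) or 0 (padding).
    """
    pooled = []
    for tokens, mask in zip(token_embeddings, attention_mask):
        dim = len(tokens[0])
        total = [0.0] * dim
        count = 0
        for vec, keep in zip(tokens, mask):
            if keep:
                count += 1
                for j, value in enumerate(vec):
                    total[j] += value
        # Guard against division by zero for an all-padding sequence.
        count = max(count, 1)
        pooled.append([t / count for t in total])
    return pooled


# Two real tokens and one padded position: the padded vector is ignored.
embeddings = [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]]
mask = [[1, 1, 0]]
print(mean_pool(embeddings, mask))  # [[2.0, 3.0]]
```

Without the mask, the padded position would pull the average toward whatever values the model emits for padding, so sequences of different lengths in the same batch would get inconsistent embeddings.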

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes


nleroy917 added 2 commits January 20, 2026 10:50

add protein embedding model

32a15bb

add esm2 fully

23dc2ed

nleroy917 closed this Jan 20, 2026
