@nleroy917
Member

Summary

Adds a new bio module that introduces embedding models for biological sequence data (proteins, DNA, etc.). The plan is to eventually support more advanced models such as Tahoe-x1, but this PR starts simple.

Added

  • Add ProteinEmbedding class for protein sequence embeddings using ESM-2 models
  • Use HuggingFace tokenizers library for tokenization (consistent with other models)
  • Add comprehensive test suite for protein embeddings
  • Add protein embedding example to README

@nleroy917 nleroy917 closed this Jan 20, 2026
@coderabbitai

coderabbitai bot commented Jan 20, 2026

Caution

Review failed

The pull request is closed.

📝 Walkthrough

This PR adds protein embedding support to the fastembed library. A new ProteinEmbedding class computes embeddings for amino acid sequences using ONNX-based models. The implementation covers tokenizer loading from model files (with fallback support), ONNX model integration, mean-pooling post-processing with attention masking, batching, and lazy loading. The feature is exposed through the public API via module exports and is accompanied by tests and documentation examples.
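
For readers unfamiliar with the post-processing step, here is a minimal sketch of mean pooling with attention masking. It is not the PR's code: the function name `masked_mean_pool` and the nested-list inputs are assumptions made for illustration, and the actual implementation presumably works on the NumPy arrays returned by the ONNX session. The idea is to average only the token vectors whose mask entry is 1, so padding tokens do not dilute the sequence embedding.

```python
from typing import List, Sequence


def masked_mean_pool(
    token_embeddings: Sequence[Sequence[Sequence[float]]],
    attention_mask: Sequence[Sequence[int]],
) -> List[List[float]]:
    """Average token vectors per sequence, ignoring padded positions.

    token_embeddings: shape (batch, seq_len, hidden_dim)
    attention_mask:   shape (batch, seq_len), 1 = real token, 0 = padding
    Hypothetical helper for illustration; not part of fastembed.
    """
    pooled: List[List[float]] = []
    for vectors, mask in zip(token_embeddings, attention_mask):
        hidden_dim = len(vectors[0])
        total = [0.0] * hidden_dim
        count = 0
        for vector, keep in zip(vectors, mask):
            if keep:
                # Accumulate only unmasked (real) tokens.
                for i, value in enumerate(vector):
                    total[i] += value
                count += 1
        # Guard against an all-padding row to avoid division by zero.
        denominator = max(count, 1)
        pooled.append([value / denominator for value in total])
    return pooled


# Two sequences of length 2; the second one ends with a padding token
# whose (arbitrary) vector must not affect the result.
example_embeddings = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [100.0, 100.0]],
]
example_mask = [[1, 1], [1, 0]]
example_pooled = masked_mean_pool(example_embeddings, example_mask)
```

In the example, the second sequence's embedding equals its single real token vector, since the padded position is excluded from both the sum and the count.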

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes
