A Python library for converting images and PDFs to Markdown or generating rich image descriptions using state-of-the-art multimodal LLMs.
- Multiple Provider Support: OpenAI, Anthropic, Google Gemini, Mistral, and OpenRouter
- Dual Mode Operation: Convert to Markdown or generate detailed descriptions
- Advanced Figure Extraction: Automatically detect, extract, and process figures from PDFs
- Robust Retry Logic: Intelligent retry with fallback models and failure feedback
- Async Support: Concurrent processing for improved performance
- Clean Architecture: Type-safe, well-documented, and thoroughly tested
- Easy Integration: Simple API with comprehensive configuration options
pip install markthat

git clone https://github.com/Flopsky/markthat.git
cd markthat
pip install -e .
pre-commit install

from markthat import MarkThat
# Initialize with your preferred model
converter = MarkThat(
    model="gemini-2.0-flash-001",
    provider="gemini",
    api_key="YOUR_API_KEY"
)
# Convert image to markdown
result = converter.convert("path/to/image.jpg")
print(result[0])
# Generate image description
description = converter.convert(
    "path/to/image.jpg",
    description_mode=True
)
print(description[0])

from markthat import MarkThat
from dotenv import load_dotenv
import os
import asyncio
load_dotenv()

def test_markthat_with_figure_extraction():
    """Test MarkThat with advanced figure extraction capabilities."""
    try:
        client = MarkThat(
            provider="gemini",
            model="gemini-2.0-flash-001",
            api_key=os.getenv("GEMINI_API_KEY"),
            api_key_figure_detector=os.getenv("GEMINI_API_KEY"),
            api_key_figure_extractor=os.getenv("GEMINI_API_KEY"),
            api_key_figure_parser=os.getenv("GEMINI_API_KEY"),
        )
        result = asyncio.run(
            client.async_convert(
                "path/to/document.pdf",
                extract_figure=True,
                coordinate_model="gemini-2.0-flash-001",
                parsing_model="gemini-2.5-flash-lite",
            )
        )
        return result
    except Exception as e:
        print("Figure extraction failed:", e)
        return None

def test_markthat_without_figure_extraction():
    """Test standard MarkThat conversion without figure extraction."""
    try:
        client = MarkThat(
            provider="gemini",
            model="gemini-2.0-flash-001",
            api_key=os.getenv("GEMINI_API_KEY"),
        )
        result = asyncio.run(
            client.async_convert(
                "path/to/document.pdf",
                extract_figure=False,
            )
        )
        return result
    except Exception as e:
        print("Standard conversion failed:", e)
        return None

if __name__ == "__main__":
    # Test both approaches
    with_figures = test_markthat_with_figure_extraction()
    without_figures = test_markthat_without_figure_extraction()
    print("With figure extraction:", with_figures)
    print("Without figure extraction:", without_figures)

Quickly try MarkThat in your browser.
pip install -r requirements.txt  # ensures gradio is installed
python gradio_ui.py

Then open http://localhost:7861 in your browser.
- Supports multiple providers with per-step model overrides
- Lets you pass provider-specific API keys (auto-fills from env when available)
- Exports results as Markdown or JSON with detected figure paths
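
The export step in the last bullet can be pictured roughly as follows. This is a minimal sketch, assuming the converter returns one Markdown string per page and that extracted figures appear as standard Markdown image links. The helper name export_pages and the JSON layout are hypothetical and are not the demo app's actual format.

```python
import json
import re
from typing import Dict, List

# Matches standard Markdown image syntax: ![alt](path)
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def export_pages(pages: List[str]) -> Dict[str, str]:
    """Build Markdown and JSON exports from per-page Markdown (hypothetical helper)."""
    records = [
        {"page": number, "markdown": text, "figures": IMAGE_LINK.findall(text)}
        for number, text in enumerate(pages, start=1)
    ]
    return {
        # One combined Markdown document, pages separated by a blank line
        "markdown": "\n\n".join(pages),
        # One JSON document listing each page with the figure paths found in it
        "json": json.dumps({"pages": records}, indent=2),
    }


# Stand-in pages; real input would come from convert() or async_convert()
pages = ["# Title\n\n![Figure 1](figures/page1_fig1.png)", "Plain text page"]
exports = export_pages(pages)
print(exports["json"])
```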
from markthat import MarkThat, RetryPolicy
# Custom retry policy
retry_policy = RetryPolicy(
    max_attempts=5,
    timeout_seconds=30,
    backoff_factor=1.5
)
# Multi-provider setup with fallbacks
converter = MarkThat(
    model="gpt-4o",
    provider="openai",
    fallback_models=["claude-3-5-sonnet-20241022", "gemini-2.0-flash-001"],
    retry_policy=retry_policy,
    api_key="YOUR_OPENAI_KEY"
)

# Access 300+ models through OpenRouter
converter = MarkThat(
    model="anthropic/claude-3.5-sonnet",
    provider="openrouter",
    api_key="YOUR_OPENROUTER_KEY"
)
# Or use model path auto-detection
converter = MarkThat(
    model="openai/gpt-4o",  # Automatically uses OpenRouter
    api_key="YOUR_OPENROUTER_KEY"
)

MarkThat includes a sophisticated figure extraction system for PDFs:
converter = MarkThat(
    model="gemini-2.0-flash-001",
    api_key_figure_detector="DETECTOR_KEY",
    api_key_figure_extractor="EXTRACTOR_KEY",
    api_key_figure_parser="PARSER_KEY"
)
# `await` must run inside an async function (or wrap the call in asyncio.run)
results = await converter.async_convert(
    "research_paper.pdf",
    extract_figure=True,
    figure_detector_model="gemini-2.0-flash",
    coordinate_model="gemini-2.0-flash-001",
    parsing_model="gemini-2.5-flash-lite"
)

- Detection: Analyzes document content to identify pages with figures
- Coordinate Mapping: Overlays coordinate grids and identifies figure boundaries
- Extraction: Crops figures using precise coordinate mapping
- Integration: Embeds figure paths into the final markdown output
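
The coordinate-mapping and extraction steps above can be illustrated with a small sketch. It assumes the model reports a figure's boundaries as fractions of the page width and height (0.0 to 1.0). The coordinate format MarkThat actually uses is not documented here, and NormalizedBox and to_pixel_box are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class NormalizedBox:
    """Figure boundaries as fractions of the page size (assumed format)."""
    left: float
    top: float
    right: float
    bottom: float


def to_pixel_box(box: NormalizedBox, width: int, height: int) -> Tuple[int, int, int, int]:
    """Convert a normalized box to a (left, upper, right, lower) pixel box."""

    def clamp(value: float) -> float:
        # Keep out-of-range model output inside the page
        return min(1.0, max(0.0, value))

    # Sorting guards against swapped edges in the reported coordinates
    left, right = sorted((clamp(box.left), clamp(box.right)))
    top, bottom = sorted((clamp(box.top), clamp(box.bottom)))
    return (
        round(left * width),
        round(top * height),
        round(right * width),
        round(bottom * height),
    )


crop = to_pixel_box(NormalizedBox(0.1, 0.25, 0.9, 0.75), width=1000, height=2000)
print(crop)
```

A tuple in this order matches the box argument of Pillow's Image.crop, so a result like this could be passed to it directly.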
For optimal performance with multi-page documents:
import asyncio
from markthat import MarkThat

async def process_document():
    converter = MarkThat(model="gemini-2.0-flash-001")
    # Process pages concurrently
    results = await converter.async_convert("large_document.pdf")
    for i, page_content in enumerate(results):
        print(f"Page {i+1}: {len(page_content)} characters")

asyncio.run(process_document())

# Primary providers (used automatically if constructor api_key is not provided)
export OPENAI_API_KEY="your_openai_key"
export ANTHROPIC_API_KEY="your_anthropic_key"
export GEMINI_API_KEY="your_google_key"
export MISTRAL_API_KEY="your_mistral_key"
# Unified access via OpenRouter
export OPENROUTER_API_KEY="your_openrouter_key"

Note: For figure extraction you can pass separate keys via the constructor
parameters api_key_figure_detector, api_key_figure_extractor, and
api_key_figure_parser. If omitted, they default to the main api_key.
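
The key fallback described above could be modeled like this. It is a minimal sketch only: resolve_keys and ENV_VARS are hypothetical names, the mapping simply mirrors the environment variables listed above, and the library's real lookup order may differ.

```python
import os
from typing import Dict, Mapping, Optional

# Provider name -> environment variable, mirroring the exports listed above
ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "mistral": "MISTRAL_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}


def resolve_keys(
    provider: str,
    api_key: Optional[str] = None,
    api_key_figure_detector: Optional[str] = None,
    api_key_figure_extractor: Optional[str] = None,
    api_key_figure_parser: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, Optional[str]]:
    """Pick the main key, then let each figure key fall back to it."""
    env = os.environ if env is None else env
    env_name = ENV_VARS.get(provider)
    # An explicit constructor key wins; otherwise try the provider's env var
    main = api_key or (env.get(env_name) if env_name else None)
    return {
        "main": main,
        "figure_detector": api_key_figure_detector or main,
        "figure_extractor": api_key_figure_extractor or main,
        "figure_parser": api_key_figure_parser or main,
    }


# The env mapping is passed explicitly so the example does not depend on the shell
keys = resolve_keys(
    "gemini",
    api_key_figure_parser="PARSER_KEY",
    env={"GEMINI_API_KEY": "key-from-env"},
)
print(keys)
```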
# Run the test suite
pytest
# Run with coverage
pytest --cov=markthat
# Run a specific test file
pytest tests/test_validation.py

markthat/
├── markthat/
│   ├── __init__.py           # Public API
│   ├── client.py             # Main MarkThat class
│   ├── providers.py          # LLM provider abstractions
│   ├── file_processor.py     # PDF/image loading
│   ├── image_processing.py   # Image manipulation
│   ├── figure_extraction.py  # Figure detection & extraction
│   ├── prompts/              # Prompt templates & utilities
│   ├── utils/                # Validation & helpers
│   ├── exceptions.py         # Custom exceptions
│   └── logging_config.py     # Logging setup
├── gradio_ui.py              # Visual demo app
├── tests/                    # Test suite
├── examples/                 # Usage examples
├── pyproject.toml            # Project metadata
└── README.md                 # This file
This project uses modern Python development practices:
- Type Hints: Full type annotations with mypy validation
- Code Formatting: Black for consistent code style
- Linting: Ruff for fast, comprehensive linting
- Import Sorting: isort for organized imports
- Pre-commit Hooks: Automated quality checks
- Fork the repository
- Create a feature branch: git checkout -b feature-name
- Make your changes with proper tests
- Run quality checks: pre-commit run --all-files
- Submit a pull request
# Install development dependencies
pip install -e ".[dev]"  # quoted so shells such as zsh do not expand the brackets
# Set up pre-commit hooks
pre-commit install
# Run quality checks
black .
ruff check .
isort .
mypy markthat

class MarkThat:
    def __init__(
        self,
        model: str,
        *,
        provider: Optional[str] = None,
        fallback_models: Optional[Sequence[str]] = None,
        retry_policy: Optional[RetryPolicy] = None,
        api_key: Optional[str] = None,
        api_key_figure_detector: Optional[str] = None,
        api_key_figure_extractor: Optional[str] = None,
        api_key_figure_parser: Optional[str] = None,
        max_retry: int = 3,
    ) -> None: ...

    def convert(
        self,
        file_path: str,
        *,
        format_options: Optional[Dict[str, Any]] = None,
        additional_instructions: Optional[str] = None,
        description_mode: bool = False,
        extract_figure: bool = False,
        figure_detector_model: str = "gemini-2.0-flash",
        coordinate_model: str = "gemini-2.0-flash",
        parsing_model: str = "gemini-2.5-flash-lite",
        max_retry: Optional[int] = None,
        clean_output: bool = True,
    ) -> List[str]: ...

    async def async_convert(
        self,
        file_path: str,
        *,
        format_options: Optional[Dict[str, Any]] = None,
        additional_instructions: Optional[str] = None,
        description_mode: bool = False,
        extract_figure: bool = False,
        figure_detector_model: str = "gemini-2.0-flash",
        coordinate_model: str = "gemini-2.0-flash",
        parsing_model: str = "gemini-2.5-flash-lite",
        max_retry: Optional[int] = None,
        clean_output: bool = True,
    ) -> List[str]: ...

@dataclass
class RetryPolicy:
    max_attempts: int = 3
    timeout_seconds: int = 30
    backoff_factor: float = 1.0

- OpenAI: gpt-4o, gpt-4-turbo, gpt-4o-mini
- Anthropic: claude-3-5-sonnet-20241022, claude-3-opus, claude-3-haiku
- Google: gemini-2.0-flash-001, gemini-1.5-pro, gemini-1.5-flash
- Mistral: mistral-large-latest, mistral-medium, mistral-small
- Meta: meta-llama/llama-3.2-90b-vision
- Qwen: qwen/qwen-2-vl-72b-instruct
- Many more: Access the full catalog at OpenRouter
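
How a provider might be inferred from a model name can be sketched as follows. This is an illustration under stated assumptions, not MarkThat's actual routing: infer_provider and the prefix table are hypothetical, built only from the model names listed above and from the note that slash-separated model paths go through OpenRouter.

```python
from typing import Optional

# Hypothetical prefix table derived from the model names listed above
PREFIXES = {
    "gpt-": "openai",
    "claude-": "anthropic",
    "gemini-": "gemini",
    "mistral-": "mistral",
}


def infer_provider(model: str, provider: Optional[str] = None) -> str:
    """Guess a provider from a model name (illustrative heuristic only)."""
    if provider:
        # An explicitly passed provider always wins
        return provider
    if "/" in model:
        # Paths such as "openai/gpt-4o" are treated as OpenRouter model IDs
        return "openrouter"
    for prefix, name in PREFIXES.items():
        if model.startswith(prefix):
            return name
    raise ValueError(f"Cannot infer a provider for model {model!r}")


for name in ("gpt-4o", "claude-3-5-sonnet-20241022", "meta-llama/llama-3.2-90b-vision"):
    print(name, "->", infer_provider(name))
```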
MarkThat provides comprehensive error handling:
from markthat import MarkThat
from markthat.exceptions import ProviderInitializationError, ConversionError
try:
    converter = MarkThat(model="invalid-model")
except ProviderInitializationError as e:
    print(f"Provider setup failed: {e}")
else:
    # Convert only if the converter was created; otherwise `converter` is undefined
    try:
        result = converter.convert("image.jpg")
    except ConversionError as e:
        print(f"Conversion failed: {e}")

- Use Async for Multiple Pages: async_convert() processes pages concurrently
- Configure Appropriate Timeouts: Balance speed vs. reliability
- Choose the Right Model: Faster models for simple tasks, powerful models for complex content
- Leverage Fallbacks: Set up model hierarchies for reliability
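
The timeout and fallback tips can be made concrete with a small retry loop. This is a sketch under stated assumptions: it borrows the field names max_attempts and backoff_factor from RetryPolicy, but call_with_fallbacks, base_delay, the delay formula, and the stand-in flaky_call function are hypothetical and do not reproduce the library's internal logic.

```python
import time
from typing import Callable, List, Sequence


def call_with_fallbacks(
    call: Callable[[str], str],
    models: Sequence[str],
    max_attempts: int = 3,
    backoff_factor: float = 1.5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each model in order, retrying with a growing delay before moving on."""
    errors: List[str] = []
    for model in models:
        for attempt in range(max_attempts):
            try:
                return call(model)
            except Exception as exc:
                errors.append(f"{model} attempt {attempt + 1}: {exc}")
                if attempt < max_attempts - 1:
                    # Assumed formula: delay grows geometrically per attempt
                    sleep(base_delay * backoff_factor ** attempt)
    raise RuntimeError("All models failed:\n" + "\n".join(errors))


def flaky_call(model: str) -> str:
    """Stand-in for a provider call; the primary model always times out."""
    if model == "primary-model":
        raise TimeoutError("simulated timeout")
    return f"converted by {model}"


result = call_with_fallbacks(
    flaky_call,
    ["primary-model", "fallback-model"],
    max_attempts=2,
    base_delay=0.0,  # no waiting in this demo
)
print(result)
```

Ordering the model list from cheapest or fastest to most capable keeps the common case quick while still giving difficult pages a second chance.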
- ✅ Multi-provider LLM support
- ✅ PDF processing with figure extraction
- ✅ Async processing capabilities
- ✅ Comprehensive retry logic
- ✅ Type-safe, clean architecture
- 🔄 Additional file format support (TIFF, WEBP)
- 🔄 Cost tracking and optimization
- 🔄 Batch processing API
- 🔄 Custom prompt template system
This project is licensed under the MIT License - see the LICENSE file for details.
- Built with modern Python best practices
- Leverages state-of-the-art multimodal LLMs
- Inspired by the need for robust document processing tools
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: See docs/ for Sphinx sources

MarkThat - Transform visual content into structured text with the power of AI