feat: Add CRW web scraping tools #7468

Open
us wants to merge 2 commits into microsoft:main from us:feat/add-crw-tool

Conversation

us commented Mar 26, 2026

Summary

Adds CRW web scraping tools to autogen-ext. CRW is an open-source web scraper for AI agents — a single Rust binary with a built-in MCP server and Firecrawl-compatible REST API.

New tools:

  • CrwScrapeTool — scrape a single URL to markdown/HTML/plaintext (POST /v1/scrape)
  • CrwCrawlTool — crawl a website across multiple pages with depth/page limits (POST /v1/crawl)
  • CrwMapTool — discover all links on a site via crawling + sitemap (POST /v1/map)

All tools follow the existing BaseTool pattern, include type hints and docstrings, and are installable via pip install "autogen-ext[crw]". Includes a sample script demonstrating usage with an AssistantAgent.

Why CRW?

  • Zero-dependency single binary (Rust), easy to self-host
  • Drop-in Firecrawl-compatible API — agents that already work with Firecrawl can switch to CRW
  • Built-in chunking with BM25/cosine ranking for RAG pipelines
  • Open source (MIT)

Add CrwScrapeTool, CrwCrawlTool, and CrwMapTool as new tool extensions
for web scraping via CRW's Firecrawl-compatible REST API.
Copilot AI review requested due to automatic review settings March 26, 2026 14:07
Contributor

Copilot AI left a comment

Pull request overview

Adds CRW (Firecrawl-compatible) web scraping tools to autogen-ext, plus a new sample demonstrating how to use them from autogen-agentchat agents.

Changes:

  • Introduces CrwScrapeTool, CrwCrawlTool, and CrwMapTool wrappers around CRW’s /v1/scrape, /v1/crawl, and /v1/map endpoints.
  • Adds a runnable sample (app.py) and README explaining prerequisites and usage.
  • Adds a new crw optional dependency extra to autogen-ext.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

| File | Description |
|------|-------------|
| `python/samples/agentchat_crw_web_scraping/app.py` | New sample script showing agent + tool usage for scrape/map/crawl. |
| `python/samples/agentchat_crw_web_scraping/README.md` | Sample documentation and tool-to-endpoint mapping table. |
| `python/packages/autogen-ext/src/autogen_ext/tools/crw/_crw_tools.py` | New CRW tool implementations and request/response models. |
| `python/packages/autogen-ext/src/autogen_ext/tools/crw/__init__.py` | Exports CRW tool classes and result models. |
| `python/packages/autogen-ext/pyproject.toml` | Adds `crw` extra with `httpx` dependency. |

Comment on lines +60 to +66
class CrwScrapeTool(BaseTool[ScrapeArgs, ScrapeResult]):
"""Scrape a single URL and return its content as markdown, HTML, or plain text.

Uses the CRW web scraper's ``POST /v1/scrape`` endpoint. CRW is an open-source,
high-performance web scraper built in Rust with a Firecrawl-compatible REST API.

.. note::

Copilot AI Mar 26, 2026


There are no unit tests added for the new CRW tools. autogen-ext already has a tests/tools suite (e.g., for HttpTool), so it would be good to add tests that mock CRW responses (via httpx.MockTransport/monkeypatching) to cover: success paths, success:false/missing fields, and crawl polling termination/error cases.


from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.messages import TextMessage
from autogen_agentchat.ui import Console

Copilot AI Mar 26, 2026


Console is imported but never used in this sample. With the repo’s strict pyright settings (and samples included in pyright), this is likely to be flagged as an unused import; please remove it or use it for output rendering.

Suggested change
from autogen_agentchat.ui import Console

Comment on lines +9 to +13
| Tool | Function | CRW Endpoint |
|------|----------|-------------|
| `CrwScrapeTool` | Scrape a single URL | `POST /v1/scrape` |
| `CrwCrawlTool` | Crawl a website (multi-page) | `POST /v1/crawl` + `GET /v1/crawl/{id}` |
| `CrwMapTool` | Discover all links on a site | `POST /v1/map` |

Copilot AI Mar 26, 2026


The markdown table has an extra leading pipe (|| ...) on the header and separator rows, which renders as an empty first column in most markdown parsers. Please format the table with a single leading | per row.

"""

import asyncio
from typing import Any, Literal, Optional

Copilot AI Mar 26, 2026


Literal is imported but not used in this module. With strict pyright settings, this can surface as an unused import; consider removing it to keep the module clean.

Suggested change
from typing import Any, Literal, Optional
from typing import Any, Optional

ScrapeArgs,
ScrapeResult,
"scrape_url",
"Scrape a URL and return its content as markdown. Powered by CRW web scraper.",

Copilot AI Mar 26, 2026


The tool description passed to BaseTool says it returns “content as markdown”, but CrwScrapeTool can also return HTML/plain text/links depending on formats. This description is used in tool schemas/prompts, so it should accurately reflect the possible outputs.

Suggested change
"Scrape a URL and return its content as markdown. Powered by CRW web scraper.",
"Scrape a URL and return its content (markdown, HTML, plain text, and links). Powered by the CRW web scraper.",

Comment on lines +282 to +283

job_id = body.get("id")

Copilot AI Mar 26, 2026


CrwCrawlTool doesn’t validate the initial POST /v1/crawl response before using body.get("id"). If the server returns success: false or omits id, the tool will return success=True (when poll=False) or start polling /v1/crawl/None. Please check the success flag and ensure job_id is present (otherwise return an error/result indicating failure).

Suggested change
job_id = body.get("id")
success_flag = body.get("success", True)
job_id = body.get("id")
if not success_flag or not job_id:
    return CrawlResult(
        success=False,
        job_id=job_id,
        status=body.get("status"),
        error=body.get("error") or "Failed to start crawl job.",
    )

Comment on lines +287 to +292
# Poll until completion
while True:
    await asyncio.sleep(args.poll_interval)
    async with httpx.AsyncClient(timeout=self._timeout) as client:
        status_resp = await client.get(
            f"{self._base_url}/v1/crawl/{job_id}",

Copilot AI Mar 26, 2026


The polling loop has no termination condition besides status in ("completed", "failed") and doesn’t check cancellation_token. If CRW returns a different/unknown status (or gets stuck), this will loop forever. Please add cancellation checks and a bounded wait (e.g., max elapsed time / max polls), and consider reusing a single httpx.AsyncClient for the whole poll loop instead of creating a new client each iteration.
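A bounded polling loop of the kind the reviewer asks for might look like the sketch below. This is an illustration only, not code from the PR: the status-fetching callable is a stand-in for the reused `httpx.AsyncClient` call, and the parameter names (`poll_interval`, `max_polls`) are hypothetical.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def poll_crawl_status(
    fetch_status: Callable[[], Awaitable[dict[str, Any]]],
    poll_interval: float = 0.01,
    max_polls: int = 60,
) -> dict[str, Any]:
    """Poll a crawl job until it reaches a terminal status or the budget runs out.

    In the real tool, fetch_status would be a GET against /v1/crawl/{job_id}
    using one shared AsyncClient, and each iteration would also check the
    cancellation_token before sleeping.
    """
    for _ in range(max_polls):
        await asyncio.sleep(poll_interval)
        body = await fetch_status()
        if body.get("status") in ("completed", "failed"):
            return body
    # Unknown/stuck statuses terminate here instead of looping forever.
    return {"status": "timeout", "error": f"no terminal status after {max_polls} polls"}
```

Bounding by poll count (or elapsed time) guarantees termination even if CRW returns a status string the tool does not recognize.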

Author

us commented Mar 26, 2026

@microsoft-github-policy-service agree

- Remove unused Console import from sample app
- Fix markdown table formatting in README
- Remove unused Literal import from _crw_tools.py
- Update CrwScrapeTool description to reflect all output formats
- Validate initial POST response in CrwCrawlTool before polling
- Add max poll limit, cancellation token check, and reuse httpx client
- Add unit tests for CrwScrapeTool, CrwCrawlTool, and CrwMapTool