feat: Add CRW web scraping tools #7468

Open
us wants to merge 2 commits into microsoft:main from us:feat/add-crw-tool

Conversation

us commented Mar 26, 2026

Summary

Adds CRW web scraping tools to autogen-ext. CRW is an open-source web scraper for AI agents — a single Rust binary with a built-in MCP server and Firecrawl-compatible REST API.

New tools:

  • CrwScrapeTool — scrape a single URL to markdown/HTML/plaintext (POST /v1/scrape)
  • CrwCrawlTool — crawl a website across multiple pages with depth/page limits (POST /v1/crawl)
  • CrwMapTool — discover all links on a site via crawling + sitemap (POST /v1/map)

All tools follow the existing BaseTool pattern, include type hints and docstrings, and are installable via pip install "autogen-ext[crw]". Includes a sample script demonstrating usage with an AssistantAgent.

Why CRW?

  • Zero-dependency single binary (Rust), easy to self-host
  • Drop-in Firecrawl-compatible API — agents that already work with Firecrawl can switch to CRW
  • Built-in chunking with BM25/cosine ranking for RAG pipelines
  • Open source (MIT)

Add CrwScrapeTool, CrwCrawlTool, and CrwMapTool as new tool extensions
for web scraping via CRW's Firecrawl-compatible REST API.
Copilot AI review requested due to automatic review settings March 26, 2026 14:07
Contributor

Copilot AI left a comment

Pull request overview

Adds CRW (Firecrawl-compatible) web scraping tools to autogen-ext, plus a new sample demonstrating how to use them from autogen-agentchat agents.

Changes:

  • Introduces CrwScrapeTool, CrwCrawlTool, and CrwMapTool wrappers around CRW’s /v1/scrape, /v1/crawl, and /v1/map endpoints.
  • Adds a runnable sample (app.py) and README explaining prerequisites and usage.
  • Adds a new crw optional dependency extra to autogen-ext.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

| File | Description |
|------|-------------|
| `python/samples/agentchat_crw_web_scraping/app.py` | New sample script showing agent + tool usage for scrape/map/crawl. |
| `python/samples/agentchat_crw_web_scraping/README.md` | Sample documentation and tool-to-endpoint mapping table. |
| `python/packages/autogen-ext/src/autogen_ext/tools/crw/_crw_tools.py` | New CRW tool implementations and request/response models. |
| `python/packages/autogen-ext/src/autogen_ext/tools/crw/__init__.py` | Exports CRW tool classes and result models. |
| `python/packages/autogen-ext/pyproject.toml` | Adds `crw` extra with `httpx` dependency. |

Comment on lines +60 to +66
class CrwScrapeTool(BaseTool[ScrapeArgs, ScrapeResult]):
"""Scrape a single URL and return its content as markdown, HTML, or plain text.

Uses the CRW web scraper's ``POST /v1/scrape`` endpoint. CRW is an open-source,
high-performance web scraper built in Rust with a Firecrawl-compatible REST API.

.. note::

Copilot AI Mar 26, 2026


There are no unit tests added for the new CRW tools. autogen-ext already has a tests/tools suite (e.g., for HttpTool), so it would be good to add tests that mock CRW responses (via httpx.MockTransport/monkeypatching) to cover: success paths, success:false/missing fields, and crawl polling termination/error cases.


from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.messages import TextMessage
from autogen_agentchat.ui import Console

Copilot AI Mar 26, 2026


Console is imported but never used in this sample. With the repo’s strict pyright settings (and samples included in pyright), this is likely to be flagged as an unused import; please remove it or use it for output rendering.

Suggested change
from autogen_agentchat.ui import Console

Comment on lines +9 to +13
| Tool | Function | CRW Endpoint |
|------|----------|-------------|
| `CrwScrapeTool` | Scrape a single URL | `POST /v1/scrape` |
| `CrwCrawlTool` | Crawl a website (multi-page) | `POST /v1/crawl` + `GET /v1/crawl/{id}` |
| `CrwMapTool` | Discover all links on a site | `POST /v1/map` |

Copilot AI Mar 26, 2026


The markdown table has an extra leading pipe (|| ...) on the header and separator rows, which renders as an empty first column in most markdown parsers. Please format the table with a single leading | per row.

"""

import asyncio
from typing import Any, Literal, Optional

Copilot AI Mar 26, 2026


Literal is imported but not used in this module. With strict pyright settings, this can surface as an unused import; consider removing it to keep the module clean.

Suggested change
from typing import Any, Literal, Optional
from typing import Any, Optional

ScrapeArgs,
ScrapeResult,
"scrape_url",
"Scrape a URL and return its content as markdown. Powered by CRW web scraper.",

Copilot AI Mar 26, 2026


The tool description passed to BaseTool says it returns “content as markdown”, but CrwScrapeTool can also return HTML/plain text/links depending on formats. This description is used in tool schemas/prompts, so it should accurately reflect the possible outputs.

Suggested change
"Scrape a URL and return its content as markdown. Powered by CRW web scraper.",
"Scrape a URL and return its content (markdown, HTML, plain text, and links). Powered by the CRW web scraper.",

Comment on lines +282 to +283

job_id = body.get("id")

Copilot AI Mar 26, 2026


CrwCrawlTool doesn’t validate the initial POST /v1/crawl response before using body.get("id"). If the server returns success: false or omits id, the tool will return success=True (when poll=False) or start polling /v1/crawl/None. Please check the success flag and ensure job_id is present (otherwise return an error/result indicating failure).

Suggested change
job_id = body.get("id")
success_flag = body.get("success", True)
job_id = body.get("id")
if not success_flag or not job_id:
    return CrawlResult(
        success=False,
        job_id=job_id,
        status=body.get("status"),
        error=body.get("error") or "Failed to start crawl job.",
    )

Comment on lines +287 to +292
# Poll until completion
while True:
    await asyncio.sleep(args.poll_interval)
    async with httpx.AsyncClient(timeout=self._timeout) as client:
        status_resp = await client.get(
            f"{self._base_url}/v1/crawl/{job_id}",

Copilot AI Mar 26, 2026


The polling loop has no termination condition besides status in ("completed", "failed") and doesn’t check cancellation_token. If CRW returns a different/unknown status (or gets stuck), this will loop forever. Please add cancellation checks and a bounded wait (e.g., max elapsed time / max polls), and consider reusing a single httpx.AsyncClient for the whole poll loop instead of creating a new client each iteration.
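A bounded polling loop of the kind the reviewer asks for might look like the sketch below. This is an illustration only, not code from the PR: the status-fetching callable is a stand-in for the reused `httpx.AsyncClient` call, and the parameter names (`poll_interval`, `max_polls`) are hypothetical.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def poll_crawl_status(
    fetch_status: Callable[[], Awaitable[dict[str, Any]]],
    poll_interval: float = 0.01,
    max_polls: int = 60,
) -> dict[str, Any]:
    """Poll a crawl job until it reaches a terminal status or the budget runs out.

    In the real tool, fetch_status would be a GET against /v1/crawl/{job_id}
    using one shared AsyncClient, and each iteration would also check the
    cancellation_token before sleeping.
    """
    for _ in range(max_polls):
        await asyncio.sleep(poll_interval)
        body = await fetch_status()
        if body.get("status") in ("completed", "failed"):
            return body
    # Unknown/stuck statuses terminate here instead of looping forever.
    return {"status": "timeout", "error": f"no terminal status after {max_polls} polls"}
```

Bounding by poll count (or elapsed time) guarantees termination even if CRW returns a status string the tool does not recognize.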

Author

us commented Mar 26, 2026

@microsoft-github-policy-service agree

- Remove unused Console import from sample app
- Fix markdown table formatting in README
- Remove unused Literal import from _crw_tools.py
- Update CrwScrapeTool description to reflect all output formats
- Validate initial POST response in CrwCrawlTool before polling
- Add max poll limit, cancellation token check, and reuse httpx client
- Add unit tests for CrwScrapeTool, CrwCrawlTool, and CrwMapTool