Conversation
Add CrwScrapeTool, CrwCrawlTool, and CrwMapTool as new tool extensions for web scraping via CRW's Firecrawl-compatible REST API.
Pull request overview
Adds CRW (Firecrawl-compatible) web scraping tools to autogen-ext, plus a new sample demonstrating how to use them from autogen-agentchat agents.
Changes:
- Introduces `CrwScrapeTool`, `CrwCrawlTool`, and `CrwMapTool` wrappers around CRW's `/v1/scrape`, `/v1/crawl`, and `/v1/map` endpoints.
- Adds a runnable sample (`app.py`) and README explaining prerequisites and usage.
- Adds a new `crw` optional dependency extra to `autogen-ext`.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| python/samples/agentchat_crw_web_scraping/app.py | New sample script showing agent + tool usage for scrape/map/crawl. |
| python/samples/agentchat_crw_web_scraping/README.md | Sample documentation and tool-to-endpoint mapping table. |
| python/packages/autogen-ext/src/autogen_ext/tools/crw/_crw_tools.py | New CRW tool implementations and request/response models. |
| python/packages/autogen-ext/src/autogen_ext/tools/crw/__init__.py | Exports CRW tool classes and result models. |
| python/packages/autogen-ext/pyproject.toml | Adds crw extra with httpx dependency. |
```python
class CrwScrapeTool(BaseTool[ScrapeArgs, ScrapeResult]):
    """Scrape a single URL and return its content as markdown, HTML, or plain text.

    Uses the CRW web scraper's ``POST /v1/scrape`` endpoint. CRW is an open-source,
    high-performance web scraper built in Rust with a Firecrawl-compatible REST API.

    .. note::
```
There are no unit tests added for the new CRW tools. autogen-ext already has a tests/tools suite (e.g., for HttpTool), so it would be good to add tests that mock CRW responses (via httpx.MockTransport/monkeypatching) to cover: success paths, success:false/missing fields, and crawl polling termination/error cases.
```python
from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.messages import TextMessage
from autogen_agentchat.ui import Console
```
Console is imported but never used in this sample. With the repo’s strict pyright settings (and samples included in pyright), this is likely to be flagged as an unused import; please remove it or use it for output rendering.
Suggested change:
```diff
-from autogen_agentchat.ui import Console
```
```markdown
|| Tool | Function | CRW Endpoint |
||------|----------|-------------|
| `CrwScrapeTool` | Scrape a single URL | `POST /v1/scrape` |
| `CrwCrawlTool` | Crawl a website (multi-page) | `POST /v1/crawl` + `GET /v1/crawl/{id}` |
| `CrwMapTool` | Discover all links on a site | `POST /v1/map` |
```
The markdown table has an extra leading pipe (|| ...) on the header and separator rows, which renders as an empty first column in most markdown parsers. Please format the table with a single leading | per row.
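For reference, the corrected table with a single leading pipe per row might look like this (content taken from the diff context above):

```markdown
| Tool | Function | CRW Endpoint |
|------|----------|--------------|
| `CrwScrapeTool` | Scrape a single URL | `POST /v1/scrape` |
| `CrwCrawlTool` | Crawl a website (multi-page) | `POST /v1/crawl` + `GET /v1/crawl/{id}` |
| `CrwMapTool` | Discover all links on a site | `POST /v1/map` |
```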
```python
"""

import asyncio
from typing import Any, Literal, Optional
```
Literal is imported but not used in this module. With strict pyright settings, this can surface as an unused import; consider removing it to keep the module clean.
Suggested change:
```diff
-from typing import Any, Literal, Optional
+from typing import Any, Optional
```
```python
ScrapeArgs,
ScrapeResult,
"scrape_url",
"Scrape a URL and return its content as markdown. Powered by CRW web scraper.",
```
The tool description passed to BaseTool says it returns “content as markdown”, but CrwScrapeTool can also return HTML/plain text/links depending on formats. This description is used in tool schemas/prompts, so it should accurately reflect the possible outputs.
Suggested change:
```diff
-"Scrape a URL and return its content as markdown. Powered by CRW web scraper.",
+"Scrape a URL and return its content (markdown, HTML, plain text, and links). Powered by the CRW web scraper.",
```
```python
job_id = body.get("id")
```
CrwCrawlTool doesn’t validate the initial POST /v1/crawl response before using body.get("id"). If the server returns success: false or omits id, the tool will return success=True (when poll=False) or start polling /v1/crawl/None. Please check the success flag and ensure job_id is present (otherwise return an error/result indicating failure).
Suggested change:
```diff
-job_id = body.get("id")
+success_flag = body.get("success", True)
+job_id = body.get("id")
+if not success_flag or not job_id:
+    return CrawlResult(
+        success=False,
+        job_id=job_id,
+        status=body.get("status"),
+        error=body.get("error") or "Failed to start crawl job.",
+    )
```
```python
# Poll until completion
while True:
    await asyncio.sleep(args.poll_interval)
    async with httpx.AsyncClient(timeout=self._timeout) as client:
        status_resp = await client.get(
            f"{self._base_url}/v1/crawl/{job_id}",
```
The polling loop has no termination condition besides status in ("completed", "failed") and doesn’t check cancellation_token. If CRW returns a different/unknown status (or gets stuck), this will loop forever. Please add cancellation checks and a bounded wait (e.g., max elapsed time / max polls), and consider reusing a single httpx.AsyncClient for the whole poll loop instead of creating a new client each iteration.
|
@microsoft-github-policy-service agree |
- Remove unused Console import from sample app
- Fix markdown table formatting in README
- Remove unused Literal import from _crw_tools.py
- Update CrwScrapeTool description to reflect all output formats
- Validate initial POST response in CrwCrawlTool before polling
- Add max poll limit, cancellation token check, and reuse httpx client
- Add unit tests for CrwScrapeTool, CrwCrawlTool, and CrwMapTool
Summary
Adds CRW web scraping tools to `autogen-ext`. CRW is an open-source web scraper for AI agents: a single Rust binary with a built-in MCP server and a Firecrawl-compatible REST API.

New tools:
- `CrwScrapeTool`: scrape a single URL to markdown/HTML/plain text (`POST /v1/scrape`)
- `CrwCrawlTool`: crawl a website across multiple pages with depth/page limits (`POST /v1/crawl`)
- `CrwMapTool`: discover all links on a site via crawling + sitemap (`POST /v1/map`)

All tools follow the existing `BaseTool` pattern, include type hints and docstrings, and are installable via `pip install "autogen-ext[crw]"`. A sample script demonstrates usage with an `AssistantAgent`.

Why CRW?