feat: add timestamp for scraper content#84
Conversation
|
Warning Review limit reached
Your plan currently allows 1 review/hour. Refill in 47 minutes and 49 seconds. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more review capacity refills, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than trial, open-source, and free plans. In all cases, review capacity refills continuously over time. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThis PR adds end-to-end timestamp propagation across the FastFetchBot metadata and scraper ecosystem. It introduces a ChangesEnd-to-end timestamp propagation
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes The PR introduces a common timestamp field across 25+ files with platform-specific extraction logic for 15+ social/content platforms. While individual parsers follow similar patterns (validate input → normalize to Unix seconds → store), the heterogeneity of timestamp formats (ISO strings, RFC dates, milliseconds, Unix epochs) and multiple scraper implementations across independent platforms require careful review of each platform's parsing rules, edge cases (null/invalid inputs), and propagation through the metadata pipeline. The storage layer transformation (timestamp ↔ published_timestamp) adds conceptual complexity despite simple code. Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 9
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/shared/fastfetchbot_shared/database/mongodb/cache.py`:
- Line 77: Replace the unsafe bypass of Pydantic validation by changing the call
to Metadata.model_construct(**document_data) to use the normal constructor
Metadata(**document_data) in the cache logic; update the surrounding code in the
function that builds persisted metadata (where doc is created) to import and
catch pydantic.ValidationError (or ValueError) so you can log/handle invalid
documents instead of allowing malformed payloads to be persisted.
In `@packages/shared/fastfetchbot_shared/models/metadata_item.py`:
- Line 122: The serializer is currently hardcoding "telegraph_url": "" which
drops stored telegraph links; update the serialization in the MetadataItem
(likely the to_dict/serialize method in metadata_item.py) to include the actual
attribute (e.g. self.telegraph_url or getattr(self, "telegraph_url", ""))
instead of the empty string so existing telegraph links are preserved in output
payloads and downstream storage/response.
In `@packages/shared/fastfetchbot_shared/services/scrapers/bluesky/scraper.py`:
- Around line 197-206: The helper _parse_bluesky_created_at currently swallows
parse errors and returns None without any trace; update it to log the failure
before returning so exceptions aren't silent: catch the (TypeError, ValueError)
in _parse_bluesky_created_at and call logger.exception (or logger.debug with the
exception) including the input created_at and context (e.g., "failed to parse
created_at") before returning None, ensuring the module has an
imported/initialized logger variable referenced by name in the change.
In `@packages/shared/fastfetchbot_shared/services/scrapers/common.py`:
- Line 93: The current line sets result["timestamp"] using
result.pop("published_timestamp", None), which overwrites legacy records'
existing "timestamp" with None; change the logic in the block that manipulates
the result dict so that if "published_timestamp" exists you move it into
"timestamp" (remove the old key) but if it does not exist you leave any existing
result["timestamp"] intact (or leave it absent) — i.e., prefer
result["published_timestamp"] when present, otherwise keep result["timestamp"];
adjust the code around the result dict (the use of
result.pop("published_timestamp", None) and assignment to result["timestamp"])
accordingly.
In `@packages/shared/fastfetchbot_shared/services/scrapers/general/zyte.py`:
- Around line 90-96: The ValueError in _parse_zyte_date_published is being
swallowed; modify the except block to log the failure before returning None—use
the module logger (e.g., logger.exception or logger.error with exception info)
and include the raw value being parsed so upstream bad timestamps are visible;
if a logger isn't present, import or create one (e.g., getLogger(__name__));
keep the current behavior of returning None after logging.
In `@packages/shared/fastfetchbot_shared/services/scrapers/instagram/__init__.py`:
- Around line 284-295: The function _parse_instagram_timestamp currently
swallows parse errors and invalid values; update it to use a module-level logger
(e.g., logger = logging.getLogger(__name__)) and log failures instead of
silently returning None: call logger.exception or logger.warning inside the
except (TypeError, ValueError) block including the raw value, and also emit a
logger.warning when timestamp <= 0 (include the parsed timestamp and original
value); keep returning None for invalid cases but ensure all failure paths are
logged with clear context identifying _parse_instagram_timestamp and the
offending value.
In `@packages/shared/fastfetchbot_shared/services/scrapers/twitter/__init__.py`:
- Around line 336-345: The helper _parse_twitter_created_at silently swallows
parsing exceptions and returns None; update it to log the failure and document
the fallback: catch the (TypeError, ValueError, IndexError) from
parsedate_to_datetime, call the module logger (e.g., logger.exception or
logger.error with exc_info=True) with a clear message including the raw
created_at string, and then return None (or re-raise if you prefer explicit
failure), and add/update the _parse_twitter_created_at docstring to state that
parsing errors are logged and None is returned as the documented fallback. Use
the function name _parse_twitter_created_at and the parsedate_to_datetime call
to locate where to add logging.
In `@packages/shared/fastfetchbot_shared/services/scrapers/weibo/scraper.py`:
- Around line 534-543: The _parse_weibo_created_at helper currently swallows
parsing exceptions and returns None silently; update it to log parsing failures
and document the fallback behavior: import or use the module logger (e.g.,
logger) inside _parse_weibo_created_at, catch the (TypeError, ValueError,
IndexError) exception and call logger.exception(...) or logger.warning(...) with
the created_at value and error details, and add a short comment stating that
returning None is the documented fallback when parsing fails; ensure the
function still returns None after logging to preserve current behavior.
In `@tests/unit/scrapers/test_wechat.py`:
- Line 60: The test currently contradicts the PR intent by asserting the
injected fixture `ct` yields no timestamp; update the assertion in
tests/unit/scrapers/test_wechat.py (the failing line that reads assert
"timestamp" not in result) to instead assert that result contains "timestamp"
and that its value equals the normalized timestamp derived from the fixture `ct`
(use the same normalization routine the Wechat scraper uses — e.g., the
scraper's timestamp parsing/normalization helper such as
normalize_timestamp/parse_ct or the Wechat scraper method that produces
ISO8601/unix output) so the test verifies presence and correct normalized value
rather than absence.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 715934f5-381b-4d78-aebb-13cb30345c43
📒 Files selected for processing (27)
apps/api/src/services/inoreader/__init__.pypackages/shared/fastfetchbot_shared/database/mongodb/cache.pypackages/shared/fastfetchbot_shared/database/mongodb/models/metadata.pypackages/shared/fastfetchbot_shared/models/metadata_item.pypackages/shared/fastfetchbot_shared/services/file_export/video_download.pypackages/shared/fastfetchbot_shared/services/scrapers/bluesky/__init__.pypackages/shared/fastfetchbot_shared/services/scrapers/bluesky/scraper.pypackages/shared/fastfetchbot_shared/services/scrapers/common.pypackages/shared/fastfetchbot_shared/services/scrapers/douban/__init__.pypackages/shared/fastfetchbot_shared/services/scrapers/general/__init__.pypackages/shared/fastfetchbot_shared/services/scrapers/general/base.pypackages/shared/fastfetchbot_shared/services/scrapers/general/firecrawl.pypackages/shared/fastfetchbot_shared/services/scrapers/general/zyte.pypackages/shared/fastfetchbot_shared/services/scrapers/instagram/__init__.pypackages/shared/fastfetchbot_shared/services/scrapers/reddit/__init__.pypackages/shared/fastfetchbot_shared/services/scrapers/threads/__init__.pypackages/shared/fastfetchbot_shared/services/scrapers/twitter/__init__.pypackages/shared/fastfetchbot_shared/services/scrapers/wechat/__init__.pypackages/shared/fastfetchbot_shared/services/scrapers/weibo/__init__.pypackages/shared/fastfetchbot_shared/services/scrapers/weibo/scraper.pypackages/shared/fastfetchbot_shared/services/scrapers/xiaohongshu/__init__.pypackages/shared/fastfetchbot_shared/services/scrapers/zhihu/__init__.pypackages/shared/fastfetchbot_shared/utils/parse.pytests/unit/database/mongodb/test_cache.pytests/unit/scrapers/test_common_cache.pytests/unit/scrapers/test_wechat.pytests/unit/test_published_timestamp.py
| "telegraph_url": "", "content": from_str(self.content), | ||
| "text": from_str(self.text), | ||
| "url": from_str(getattr(self, "url", "")), | ||
| "telegraph_url": "", |
There was a problem hiding this comment.
Preserve telegraph_url during serialization.
Line 122 hardcodes telegraph_url to an empty string, which drops existing telegraph links and causes data loss in downstream storage/response payloads.
💡 Proposed fix
- "telegraph_url": "",
+ "telegraph_url": from_str(getattr(self, "telegraph_url", "")),🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/shared/fastfetchbot_shared/models/metadata_item.py` at line 122, The
serializer is currently hardcoding "telegraph_url": "" which drops stored
telegraph links; update the serialization in the MetadataItem (likely the
to_dict/serialize method in metadata_item.py) to include the actual attribute
(e.g. self.telegraph_url or getattr(self, "telegraph_url", "")) instead of the
empty string so existing telegraph links are preserved in output payloads and
downstream storage/response.
| def _parse_bluesky_created_at(created_at: str | None) -> int | None: | ||
| if not created_at: | ||
| return None | ||
| try: | ||
| parsed = datetime.datetime.fromisoformat(created_at.replace("Z", "+00:00")) | ||
| except (TypeError, ValueError): | ||
| return None | ||
| if parsed.tzinfo is None: | ||
| parsed = parsed.replace(tzinfo=datetime.timezone.utc) | ||
| return int(parsed.timestamp()) |
There was a problem hiding this comment.
Avoid silent parse failures in timestamp helper.
This helper swallows parse exceptions and returns None without logging or documented handling. Add at least debug/exception logging (or explicitly document this fallback behavior).
As per coding guidelines, "Never silently swallow exceptions—always either re-raise, log with logger.exception(), or explicitly handle and document the behavior. Do not return None or empty data on failure without logging".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/shared/fastfetchbot_shared/services/scrapers/bluesky/scraper.py`
around lines 197 - 206, The helper _parse_bluesky_created_at currently swallows
parse errors and returns None without any trace; update it to log the failure
before returning so exceptions aren't silent: catch the (TypeError, ValueError)
in _parse_bluesky_created_at and call logger.exception (or logger.debug with the
exception) including the input created_at and context (e.g., "failed to parse
created_at") before returning None, ensuring the module has an
imported/initialized logger variable referenced by name in the change.
| if cached is not None: | ||
| logger.info("Cache hit, returning cached metadata") | ||
| result = cached.model_dump(mode="json", exclude={"id"}) | ||
| result["timestamp"] = result.pop("published_timestamp", None) |
There was a problem hiding this comment.
Keep legacy cache timestamps when published_timestamp is missing.
Line 93 sets timestamp to None for older cached records that only contain timestamp, which drops usable data.
💡 Proposed fix
- result["timestamp"] = result.pop("published_timestamp", None)
+ result["timestamp"] = result.pop(
+ "published_timestamp", result.get("timestamp")
+ )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/shared/fastfetchbot_shared/services/scrapers/common.py` at line 93,
The current line sets result["timestamp"] using
result.pop("published_timestamp", None), which overwrites legacy records'
existing "timestamp" with None; change the logic in the block that manipulates
the result dict so that if "published_timestamp" exists you move it into
"timestamp" (remove the old key) but if it does not exist you leave any existing
result["timestamp"] intact (or leave it absent) — i.e., prefer
result["published_timestamp"] when present, otherwise keep result["timestamp"];
adjust the code around the result dict (the use of
result.pop("published_timestamp", None) and assignment to result["timestamp"])
accordingly.
| def _parse_instagram_timestamp(value: Any) -> int | None: | ||
| if value in (None, ""): | ||
| return None | ||
| try: | ||
| timestamp = int(value) | ||
| except (TypeError, ValueError): | ||
| return None | ||
| if timestamp <= 0: | ||
| return None | ||
| if timestamp > 10_000_000_000: | ||
| timestamp //= 1000 | ||
| return timestamp |
There was a problem hiding this comment.
Make invalid timestamp handling observable.
_parse_instagram_timestamp returns None on parse failures without logging or explicit documented behavior. Please add logging for failed conversions (or document silent fallback intent).
As per coding guidelines, "Never silently swallow exceptions—always either re-raise, log with logger.exception(), or explicitly handle and document the behavior. Do not return None or empty data on failure without logging".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/shared/fastfetchbot_shared/services/scrapers/instagram/__init__.py`
around lines 284 - 295, The function _parse_instagram_timestamp currently
swallows parse errors and invalid values; update it to use a module-level logger
(e.g., logger = logging.getLogger(__name__)) and log failures instead of
silently returning None: call logger.exception or logger.warning inside the
except (TypeError, ValueError) block including the raw value, and also emit a
logger.warning when timestamp <= 0 (include the parsed timestamp and original
value); keep returning None for invalid cases but ensure all failure paths are
logged with clear context identifying _parse_instagram_timestamp and the
offending value.
| def _parse_twitter_created_at(created_at: str | None) -> int | None: | ||
| if not created_at: | ||
| return None | ||
| try: | ||
| parsed = parsedate_to_datetime(created_at) | ||
| except (TypeError, ValueError, IndexError): | ||
| return None | ||
| if parsed.tzinfo is None: | ||
| parsed = parsed.replace(tzinfo=datetime.timezone.utc) | ||
| return int(parsed.timestamp()) |
There was a problem hiding this comment.
Don’t silently suppress Twitter date parse errors.
Line 341-342 returns None on parse failure without logging or explicit documented fallback behavior. Please log the failure path (or document expected silent fallback).
As per coding guidelines, "Never silently swallow exceptions—always either re-raise, log with logger.exception(), or explicitly handle and document the behavior. Do not return None or empty data on failure without logging".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/shared/fastfetchbot_shared/services/scrapers/twitter/__init__.py`
around lines 336 - 345, The helper _parse_twitter_created_at silently swallows
parsing exceptions and returns None; update it to log the failure and document
the fallback: catch the (TypeError, ValueError, IndexError) from
parsedate_to_datetime, call the module logger (e.g., logger.exception or
logger.error with exc_info=True) with a clear message including the raw
created_at string, and then return None (or re-raise if you prefer explicit
failure), and add/update the _parse_twitter_created_at docstring to state that
parsing errors are logged and None is returned as the documented fallback. Use
the function name _parse_twitter_created_at and the parsedate_to_datetime call
to locate where to add logging.
| def _parse_weibo_created_at(created_at: str | None) -> int | None: | ||
| if not created_at: | ||
| return None | ||
| try: | ||
| parsed = parsedate_to_datetime(created_at) | ||
| except (TypeError, ValueError, IndexError): | ||
| return None | ||
| if parsed.tzinfo is None: | ||
| parsed = parsed.replace(tzinfo=datetime.timezone.utc) | ||
| return int(parsed.timestamp()) |
There was a problem hiding this comment.
Add logging or documented fallback for Weibo timestamp parse failures.
The helper currently suppresses parse exceptions and returns None without logging, which makes failures opaque.
As per coding guidelines, "Never silently swallow exceptions—always either re-raise, log with logger.exception(), or explicitly handle and document the behavior. Do not return None or empty data on failure without logging".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/shared/fastfetchbot_shared/services/scrapers/weibo/scraper.py`
around lines 534 - 543, The _parse_weibo_created_at helper currently swallows
parsing exceptions and returns None silently; update it to log parsing failures
and document the fallback behavior: import or use the module logger (e.g.,
logger) inside _parse_weibo_created_at, catch the (TypeError, ValueError,
IndexError) exception and call logger.exception(...) or logger.warning(...) with
the created_at value and error details, and add a short comment stating that
returning None is the documented fallback when parsing fails; ensure the
function still returns None after logging to preserve current behavior.
| assert result["title"] == "Test Title" | ||
| assert result["author"] == "Test Author" | ||
| assert "Test content paragraph" in result["content"] | ||
| assert "timestamp" not in result |
There was a problem hiding this comment.
Assertion contradicts the timestamp propagation goal for Wechat.
This test now injects ct in the fixture but asserts timestamp must be absent, which enforces the opposite of the PR objective for Wechat timestamp support. Update the expectation to assert normalized timestamp output instead.
Suggested test fix
- assert "timestamp" not in result
+ assert result["timestamp"] == 1704067200📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| assert "timestamp" not in result | |
| assert result["timestamp"] == 1704067200 |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unit/scrapers/test_wechat.py` at line 60, The test currently
contradicts the PR intent by asserting the injected fixture `ct` yields no
timestamp; update the assertion in tests/unit/scrapers/test_wechat.py (the
failing line that reads assert "timestamp" not in result) to instead assert that
result contains "timestamp" and that its value equals the normalized timestamp
derived from the fixture `ct` (use the same normalization routine the Wechat
scraper uses — e.g., the scraper's timestamp parsing/normalization helper such
as normalize_timestamp/parse_ct or the Wechat scraper method that produces
ISO8601/unix output) so the test verifies presence and correct normalized value
rather than absence.
Summary by CodeRabbit
New Features
Tests