Skip to content

feat: add timestamp for scraper content#84

Merged
aturret merged 3 commits into
mainfrom
timestamp-update
May 24, 2026
Merged

feat: add timestamp for scraper content#84
aturret merged 3 commits into
mainfrom
timestamp-update

Conversation

@aturret
Copy link
Copy Markdown
Owner

@aturret aturret commented May 24, 2026

Summary by CodeRabbit

  • New Features

    • Added publication timestamp tracking for content from multiple platforms (Instagram, Twitter, Reddit, Weibo, Bluesky, Threads, Xiaohongshu, Zhihu, and general web sources).
    • Improved date metadata consistency across all scrapers.
  • Tests

    • Added test coverage for timestamp parsing and serialization.

Review Change Stack

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 24, 2026

Warning

Review limit reached

@aturret, we couldn't start this review because you've used your available PR reviews for now.

Your plan currently allows 1 review/hour. Refill in 47 minutes and 49 seconds.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more review capacity refills, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than trial, open-source, and free plans. In all cases, review capacity refills continuously over time.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 84d76715-6747-4e90-a0c4-b6d88132088b

📥 Commits

Reviewing files that changed from the base of the PR and between 6ea5363 and ff68610.

📒 Files selected for processing (2)
  • packages/shared/fastfetchbot_shared/database/mongodb/cache.py
  • tests/unit/database/mongodb/test_cache.py
📝 Walkthrough

Walkthrough

This PR adds end-to-end timestamp propagation across the FastFetchBot metadata and scraper ecosystem. It introduces a timestamp field to the core MetadataItem model, extends MongoDB persistence to store timestamps as published_timestamp, and updates 15+ platform scrapers (Twitter, Weibo, Instagram, Reddit, Threads, Bluesky, Zhihu, Xiaohongshu, Douban, Wechat, Inoreader, YouTube, Bilibili) to extract and normalize platform-specific timestamp formats.

Changes

End-to-end timestamp propagation

Layer / File(s) Summary
Core timestamp model and utilities
packages/shared/fastfetchbot_shared/models/metadata_item.py, packages/shared/fastfetchbot_shared/utils/parse.py
Adds MetadataItem.timestamp field and from_optional_int parser to validate positive integer timestamps; refactors unix_timestamp_to_utc to use a reusable BEIJING_TZ timezone constant for consistent timestamp formatting.
Metadata storage and cache transformation
packages/shared/fastfetchbot_shared/database/mongodb/models/metadata.py, packages/shared/fastfetchbot_shared/database/mongodb/cache.py, packages/shared/fastfetchbot_shared/services/scrapers/common.py
Adds published_timestamp field to MongoDB Metadata model; save_metadata copies and transforms input dicts to rename timestamppublished_timestamp; cache hits normalize the response by mapping published_timestamp back to timestamp.
Scraper framework integration
packages/shared/fastfetchbot_shared/services/scrapers/general/base.py, packages/shared/fastfetchbot_shared/services/scrapers/general/__init__.py, packages/shared/fastfetchbot_shared/services/scrapers/general/firecrawl.py, packages/shared/fastfetchbot_shared/services/scrapers/general/zyte.py
BaseGeneralDataProcessor._build_item_data accepts optional timestamp parameter with validation (int > 0); GeneralItem.from_dict propagates MetadataItem.timestamp; Zyte scraper extracts publication date via ISO parser helper.
Video and social media scrapers
packages/shared/fastfetchbot_shared/services/file_export/video_download.py, packages/shared/fastfetchbot_shared/services/scrapers/twitter/__init__.py, packages/shared/fastfetchbot_shared/services/scrapers/weibo/scraper.py, packages/shared/fastfetchbot_shared/services/scrapers/instagram/__init__.py, packages/shared/fastfetchbot_shared/services/scrapers/reddit/__init__.py, packages/shared/fastfetchbot_shared/services/scrapers/threads/__init__.py, packages/shared/fastfetchbot_shared/services/scrapers/bluesky/__init__.py, packages/shared/fastfetchbot_shared/services/scrapers/bluesky/scraper.py
Each scraper initializes self.timestamp, extracts platform timestamps (Twitter via parsedate_to_datetime; Weibo/Bluesky via ISO parser with UTC default; Reddit via created_utc Unix epoch; Instagram via taken_at with milliseconds downscaling; Threads via published_on; video scrapers via YYYYMMDD or int validation), and stores as instance attribute.
Chinese content platform scrapers
packages/shared/fastfetchbot_shared/services/scrapers/zhihu/__init__.py, packages/shared/fastfetchbot_shared/services/scrapers/xiaohongshu/__init__.py
Zhihu extracts from created field across API and JSON scraping paths; Xiaohongshu parses time/last_update_time with milliseconds-vs-seconds detection and uses the parsed values for created/updated computation via unix_timestamp_to_utc.
Miscellaneous platform scrapers
packages/shared/fastfetchbot_shared/services/scrapers/douban/__init__.py, packages/shared/fastfetchbot_shared/services/scrapers/wechat/__init__.py, apps/api/src/services/inoreader/__init__.py
Douban, Wechat, and Inoreader initialize self.timestamp = None and skip processing null metadata values (Wechat) or parse from constructor data (Inoreader).
Test coverage
tests/unit/database/mongodb/test_cache.py, tests/unit/scrapers/test_common_cache.py, tests/unit/scrapers/test_wechat.py, tests/unit/test_published_timestamp.py
Verifies save_metadata maps timestamppublished_timestamp correctly; cache hits return timestamp from stored published_timestamp; MetadataItem.to_dict() serializes numeric timestamps but rejects string datetime values; confirms timestamp absence in Wechat output when not provided.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

The PR introduces a common timestamp field across 25+ files with platform-specific extraction logic for 15+ social/content platforms. While individual parsers follow similar patterns (validate input → normalize to Unix seconds → store), the heterogeneity of timestamp formats (ISO strings, RFC dates, milliseconds, Unix epochs) and multiple scraper implementations across independent platforms require careful review of each platform's parsing rules, edge cases (null/invalid inputs), and propagation through the metadata pipeline. The storage layer transformation (timestamp ↔ published_timestamp) adds conceptual complexity despite simple code.

Possibly related PRs

  • aturret/FastFetchBot#73: Extends MongoDB Metadata model and save_metadata function—this PR builds on that foundation to map timestamps during persistence.
  • aturret/FastFetchBot#49: Refactors BaseGeneralDataProcessor._build_item_data in the general scraper stack—this PR adds the timestamp parameter and conditional field validation to that signature.

Poem

🐰 Through platforms vast, from tweets to posts so bright,
We gather timestamps, normalize to light—
Zhihu, Instagram, Twitter's racing beat,
Unix seconds flowing, all aligned and neat!
From storage vaults to scrapers ever keen,
Time flows through the data—pristine and serene. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 7.04% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat: add timestamp for scraper content' accurately summarizes the main change—adding timestamp extraction and propagation across multiple scraper modules throughout the codebase.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch timestamp-update

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 24, 2026

Codecov Report

❌ Patch coverage is 80.88235% with 13 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...hbot_shared/services/file_export/video_download.py 76.19% 5 Missing ⚠️
...tchbot_shared/services/scrapers/bluesky/scraper.py 63.63% 4 Missing ⚠️
...fetchbot_shared/services/scrapers/weibo/scraper.py 69.23% 4 Missing ⚠️

Impacted file tree graph

@@           Coverage Diff           @@
##             main      #84   +/-   ##
=======================================
  Coverage        ?   80.96%           
=======================================
  Files           ?       66           
  Lines           ?     3636           
  Branches        ?        0           
=======================================
  Hits            ?     2944           
  Misses          ?      692           
  Partials        ?        0           
Files with missing lines Coverage Δ
...ared/fastfetchbot_shared/database/mongodb/cache.py 100.00% <100.00%> (ø)
...etchbot_shared/database/mongodb/models/metadata.py 97.77% <100.00%> (ø)
...shared/fastfetchbot_shared/models/metadata_item.py 97.67% <100.00%> (ø)
...ed/fastfetchbot_shared/services/scrapers/common.py 95.31% <100.00%> (ø)
...tfetchbot_shared/services/scrapers/general/base.py 100.00% <100.00%> (ø)
...hbot_shared/services/scrapers/general/firecrawl.py 100.00% <100.00%> (ø)
packages/shared/fastfetchbot_shared/utils/parse.py 46.30% <100.00%> (ø)
...tchbot_shared/services/scrapers/bluesky/scraper.py 97.33% <63.63%> (ø)
...fetchbot_shared/services/scrapers/weibo/scraper.py 98.73% <69.23%> (ø)
...hbot_shared/services/file_export/video_download.py 83.54% <76.19%> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/shared/fastfetchbot_shared/database/mongodb/cache.py`:
- Line 77: Replace the unsafe bypass of Pydantic validation by changing the call
to Metadata.model_construct(**document_data) to use the normal constructor
Metadata(**document_data) in the cache logic; update the surrounding code in the
function that builds persisted metadata (where doc is created) to import and
catch pydantic.ValidationError (or ValueError) so you can log/handle invalid
documents instead of allowing malformed payloads to be persisted.

In `@packages/shared/fastfetchbot_shared/models/metadata_item.py`:
- Line 122: The serializer is currently hardcoding "telegraph_url": "" which
drops stored telegraph links; update the serialization in the MetadataItem
(likely the to_dict/serialize method in metadata_item.py) to include the actual
attribute (e.g. self.telegraph_url or getattr(self, "telegraph_url", ""))
instead of the empty string so existing telegraph links are preserved in output
payloads and downstream storage/response.

In `@packages/shared/fastfetchbot_shared/services/scrapers/bluesky/scraper.py`:
- Around line 197-206: The helper _parse_bluesky_created_at currently swallows
parse errors and returns None without any trace; update it to log the failure
before returning so exceptions aren't silent: catch the (TypeError, ValueError)
in _parse_bluesky_created_at and call logger.exception (or logger.debug with the
exception) including the input created_at and context (e.g., "failed to parse
created_at") before returning None, ensuring the module has an
imported/initialized logger variable referenced by name in the change.

In `@packages/shared/fastfetchbot_shared/services/scrapers/common.py`:
- Line 93: The current line sets result["timestamp"] using
result.pop("published_timestamp", None), which overwrites legacy records'
existing "timestamp" with None; change the logic in the block that manipulates
the result dict so that if "published_timestamp" exists you move it into
"timestamp" (remove the old key) but if it does not exist you leave any existing
result["timestamp"] intact (or leave it absent) — i.e., prefer
result["published_timestamp"] when present, otherwise keep result["timestamp"];
adjust the code around the result dict (the use of
result.pop("published_timestamp", None) and assignment to result["timestamp"])
accordingly.

In `@packages/shared/fastfetchbot_shared/services/scrapers/general/zyte.py`:
- Around line 90-96: The ValueError in _parse_zyte_date_published is being
swallowed; modify the except block to log the failure before returning None—use
the module logger (e.g., logger.exception or logger.error with exception info)
and include the raw value being parsed so upstream bad timestamps are visible;
if a logger isn't present, import or create one (e.g., getLogger(__name__));
keep the current behavior of returning None after logging.

In `@packages/shared/fastfetchbot_shared/services/scrapers/instagram/__init__.py`:
- Around line 284-295: The function _parse_instagram_timestamp currently
swallows parse errors and invalid values; update it to use a module-level logger
(e.g., logger = logging.getLogger(__name__)) and log failures instead of
silently returning None: call logger.exception or logger.warning inside the
except (TypeError, ValueError) block including the raw value, and also emit a
logger.warning when timestamp <= 0 (include the parsed timestamp and original
value); keep returning None for invalid cases but ensure all failure paths are
logged with clear context identifying _parse_instagram_timestamp and the
offending value.

In `@packages/shared/fastfetchbot_shared/services/scrapers/twitter/__init__.py`:
- Around line 336-345: The helper _parse_twitter_created_at silently swallows
parsing exceptions and returns None; update it to log the failure and document
the fallback: catch the (TypeError, ValueError, IndexError) from
parsedate_to_datetime, call the module logger (e.g., logger.exception or
logger.error with exc_info=True) with a clear message including the raw
created_at string, and then return None (or re-raise if you prefer explicit
failure), and add/update the _parse_twitter_created_at docstring to state that
parsing errors are logged and None is returned as the documented fallback. Use
the function name _parse_twitter_created_at and the parsedate_to_datetime call
to locate where to add logging.

In `@packages/shared/fastfetchbot_shared/services/scrapers/weibo/scraper.py`:
- Around line 534-543: The _parse_weibo_created_at helper currently swallows
parsing exceptions and returns None silently; update it to log parsing failures
and document the fallback behavior: import or use the module logger (e.g.,
logger) inside _parse_weibo_created_at, catch the (TypeError, ValueError,
IndexError) exception and call logger.exception(...) or logger.warning(...) with
the created_at value and error details, and add a short comment stating that
returning None is the documented fallback when parsing fails; ensure the
function still returns None after logging to preserve current behavior.

In `@tests/unit/scrapers/test_wechat.py`:
- Line 60: The test currently contradicts the PR intent by asserting the
injected fixture `ct` yields no timestamp; update the assertion in
tests/unit/scrapers/test_wechat.py (the failing line that reads assert
"timestamp" not in result) to instead assert that result contains "timestamp"
and that its value equals the normalized timestamp derived from the fixture `ct`
(use the same normalization routine the Wechat scraper uses — e.g., the
scraper's timestamp parsing/normalization helper such as
normalize_timestamp/parse_ct or the Wechat scraper method that produces
ISO8601/unix output) so the test verifies presence and correct normalized value
rather than absence.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 715934f5-381b-4d78-aebb-13cb30345c43

📥 Commits

Reviewing files that changed from the base of the PR and between 381b7ff and 6ea5363.

📒 Files selected for processing (27)
  • apps/api/src/services/inoreader/__init__.py
  • packages/shared/fastfetchbot_shared/database/mongodb/cache.py
  • packages/shared/fastfetchbot_shared/database/mongodb/models/metadata.py
  • packages/shared/fastfetchbot_shared/models/metadata_item.py
  • packages/shared/fastfetchbot_shared/services/file_export/video_download.py
  • packages/shared/fastfetchbot_shared/services/scrapers/bluesky/__init__.py
  • packages/shared/fastfetchbot_shared/services/scrapers/bluesky/scraper.py
  • packages/shared/fastfetchbot_shared/services/scrapers/common.py
  • packages/shared/fastfetchbot_shared/services/scrapers/douban/__init__.py
  • packages/shared/fastfetchbot_shared/services/scrapers/general/__init__.py
  • packages/shared/fastfetchbot_shared/services/scrapers/general/base.py
  • packages/shared/fastfetchbot_shared/services/scrapers/general/firecrawl.py
  • packages/shared/fastfetchbot_shared/services/scrapers/general/zyte.py
  • packages/shared/fastfetchbot_shared/services/scrapers/instagram/__init__.py
  • packages/shared/fastfetchbot_shared/services/scrapers/reddit/__init__.py
  • packages/shared/fastfetchbot_shared/services/scrapers/threads/__init__.py
  • packages/shared/fastfetchbot_shared/services/scrapers/twitter/__init__.py
  • packages/shared/fastfetchbot_shared/services/scrapers/wechat/__init__.py
  • packages/shared/fastfetchbot_shared/services/scrapers/weibo/__init__.py
  • packages/shared/fastfetchbot_shared/services/scrapers/weibo/scraper.py
  • packages/shared/fastfetchbot_shared/services/scrapers/xiaohongshu/__init__.py
  • packages/shared/fastfetchbot_shared/services/scrapers/zhihu/__init__.py
  • packages/shared/fastfetchbot_shared/utils/parse.py
  • tests/unit/database/mongodb/test_cache.py
  • tests/unit/scrapers/test_common_cache.py
  • tests/unit/scrapers/test_wechat.py
  • tests/unit/test_published_timestamp.py

Comment thread packages/shared/fastfetchbot_shared/database/mongodb/cache.py Outdated
"telegraph_url": "", "content": from_str(self.content),
"text": from_str(self.text),
"url": from_str(getattr(self, "url", "")),
"telegraph_url": "",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve telegraph_url during serialization.

Line 122 hardcodes telegraph_url to an empty string, which drops existing telegraph links and causes data loss in downstream storage/response payloads.

💡 Proposed fix
-            "telegraph_url": "",
+            "telegraph_url": from_str(getattr(self, "telegraph_url", "")),
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/shared/fastfetchbot_shared/models/metadata_item.py` at line 122, The
serializer is currently hardcoding "telegraph_url": "" which drops stored
telegraph links; update the serialization in the MetadataItem (likely the
to_dict/serialize method in metadata_item.py) to include the actual attribute
(e.g. self.telegraph_url or getattr(self, "telegraph_url", "")) instead of the
empty string so existing telegraph links are preserved in output payloads and
downstream storage/response.

Comment on lines +197 to +206
def _parse_bluesky_created_at(created_at: str | None) -> int | None:
if not created_at:
return None
try:
parsed = datetime.datetime.fromisoformat(created_at.replace("Z", "+00:00"))
except (TypeError, ValueError):
return None
if parsed.tzinfo is None:
parsed = parsed.replace(tzinfo=datetime.timezone.utc)
return int(parsed.timestamp())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Avoid silent parse failures in timestamp helper.

This helper swallows parse exceptions and returns None without logging or documented handling. Add at least debug/exception logging (or explicitly document this fallback behavior).

As per coding guidelines, "Never silently swallow exceptions—always either re-raise, log with logger.exception(), or explicitly handle and document the behavior. Do not return None or empty data on failure without logging".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/shared/fastfetchbot_shared/services/scrapers/bluesky/scraper.py`
around lines 197 - 206, The helper _parse_bluesky_created_at currently swallows
parse errors and returns None without any trace; update it to log the failure
before returning so exceptions aren't silent: catch the (TypeError, ValueError)
in _parse_bluesky_created_at and call logger.exception (or logger.debug with the
exception) including the input created_at and context (e.g., "failed to parse
created_at") before returning None, ensuring the module has an
imported/initialized logger variable referenced by name in the change.

if cached is not None:
logger.info("Cache hit, returning cached metadata")
result = cached.model_dump(mode="json", exclude={"id"})
result["timestamp"] = result.pop("published_timestamp", None)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Keep legacy cache timestamps when published_timestamp is missing.

Line 93 sets timestamp to None for older cached records that only contain timestamp, which drops usable data.

💡 Proposed fix
-                    result["timestamp"] = result.pop("published_timestamp", None)
+                    result["timestamp"] = result.pop(
+                        "published_timestamp", result.get("timestamp")
+                    )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/shared/fastfetchbot_shared/services/scrapers/common.py` at line 93,
The current line sets result["timestamp"] using
result.pop("published_timestamp", None), which overwrites legacy records'
existing "timestamp" with None; change the logic in the block that manipulates
the result dict so that if "published_timestamp" exists you move it into
"timestamp" (remove the old key) but if it does not exist you leave any existing
result["timestamp"] intact (or leave it absent) — i.e., prefer
result["published_timestamp"] when present, otherwise keep result["timestamp"];
adjust the code around the result dict (the use of
result.pop("published_timestamp", None) and assignment to result["timestamp"])
accordingly.

Comment thread packages/shared/fastfetchbot_shared/services/scrapers/general/zyte.py Outdated
Comment on lines +284 to +295
def _parse_instagram_timestamp(value: Any) -> int | None:
if value in (None, ""):
return None
try:
timestamp = int(value)
except (TypeError, ValueError):
return None
if timestamp <= 0:
return None
if timestamp > 10_000_000_000:
timestamp //= 1000
return timestamp
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make invalid timestamp handling observable.

_parse_instagram_timestamp returns None on parse failures without logging or explicit documented behavior. Please add logging for failed conversions (or document silent fallback intent).

As per coding guidelines, "Never silently swallow exceptions—always either re-raise, log with logger.exception(), or explicitly handle and document the behavior. Do not return None or empty data on failure without logging".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/shared/fastfetchbot_shared/services/scrapers/instagram/__init__.py`
around lines 284 - 295, The function _parse_instagram_timestamp currently
swallows parse errors and invalid values; update it to use a module-level logger
(e.g., logger = logging.getLogger(__name__)) and log failures instead of
silently returning None: call logger.exception or logger.warning inside the
except (TypeError, ValueError) block including the raw value, and also emit a
logger.warning when timestamp <= 0 (include the parsed timestamp and original
value); keep returning None for invalid cases but ensure all failure paths are
logged with clear context identifying _parse_instagram_timestamp and the
offending value.

Comment on lines +336 to +345
def _parse_twitter_created_at(created_at: str | None) -> int | None:
if not created_at:
return None
try:
parsed = parsedate_to_datetime(created_at)
except (TypeError, ValueError, IndexError):
return None
if parsed.tzinfo is None:
parsed = parsed.replace(tzinfo=datetime.timezone.utc)
return int(parsed.timestamp())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Don’t silently suppress Twitter date parse errors.

Line 341-342 returns None on parse failure without logging or explicit documented fallback behavior. Please log the failure path (or document expected silent fallback).

As per coding guidelines, "Never silently swallow exceptions—always either re-raise, log with logger.exception(), or explicitly handle and document the behavior. Do not return None or empty data on failure without logging".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/shared/fastfetchbot_shared/services/scrapers/twitter/__init__.py`
around lines 336 - 345, The helper _parse_twitter_created_at silently swallows
parsing exceptions and returns None; update it to log the failure and document
the fallback: catch the (TypeError, ValueError, IndexError) from
parsedate_to_datetime, call the module logger (e.g., logger.exception or
logger.error with exc_info=True) with a clear message including the raw
created_at string, and then return None (or re-raise if you prefer explicit
failure), and add/update the _parse_twitter_created_at docstring to state that
parsing errors are logged and None is returned as the documented fallback. Use
the function name _parse_twitter_created_at and the parsedate_to_datetime call
to locate where to add logging.

Comment on lines +534 to +543
def _parse_weibo_created_at(created_at: str | None) -> int | None:
if not created_at:
return None
try:
parsed = parsedate_to_datetime(created_at)
except (TypeError, ValueError, IndexError):
return None
if parsed.tzinfo is None:
parsed = parsed.replace(tzinfo=datetime.timezone.utc)
return int(parsed.timestamp())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add logging or documented fallback for Weibo timestamp parse failures.

The helper currently suppresses parse exceptions and returns None without logging, which makes failures opaque.

As per coding guidelines, "Never silently swallow exceptions—always either re-raise, log with logger.exception(), or explicitly handle and document the behavior. Do not return None or empty data on failure without logging".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/shared/fastfetchbot_shared/services/scrapers/weibo/scraper.py`
around lines 534 - 543, The _parse_weibo_created_at helper currently swallows
parsing exceptions and returns None silently; update it to log parsing failures
and document the fallback behavior: import or use the module logger (e.g.,
logger) inside _parse_weibo_created_at, catch the (TypeError, ValueError,
IndexError) exception and call logger.exception(...) or logger.warning(...) with
the created_at value and error details, and add a short comment stating that
returning None is the documented fallback when parsing fails; ensure the
function still returns None after logging to preserve current behavior.

assert result["title"] == "Test Title"
assert result["author"] == "Test Author"
assert "Test content paragraph" in result["content"]
assert "timestamp" not in result
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Assertion contradicts the timestamp propagation goal for Wechat.

This test now injects ct in the fixture but asserts timestamp must be absent, which enforces the opposite of the PR objective for Wechat timestamp support. Update the expectation to assert normalized timestamp output instead.

Suggested test fix
-        assert "timestamp" not in result
+        assert result["timestamp"] == 1704067200
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
assert "timestamp" not in result
assert result["timestamp"] == 1704067200
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/scrapers/test_wechat.py` at line 60, The test currently
contradicts the PR intent by asserting the injected fixture `ct` yields no
timestamp; update the assertion in tests/unit/scrapers/test_wechat.py (the
failing line that reads assert "timestamp" not in result) to instead assert that
result contains "timestamp" and that its value equals the normalized timestamp
derived from the fixture `ct` (use the same normalization routine the Wechat
scraper uses — e.g., the scraper's timestamp parsing/normalization helper such
as normalize_timestamp/parse_ct or the Wechat scraper method that produces
ISO8601/unix output) so the test verifies presence and correct normalized value
rather than absence.

@aturret aturret merged commit 6a31aa2 into main May 24, 2026
5 checks passed
@aturret aturret deleted the timestamp-update branch May 24, 2026 13:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant