diff --git a/docs/howto/pdf_manipulation.md b/docs/howto/pdf_manipulation.md
index 6f7c2f7..ba26d45 100644
--- a/docs/howto/pdf_manipulation.md
+++ b/docs/howto/pdf_manipulation.md
@@ -102,7 +102,7 @@ parxy pdf:merge file1.pdf file2.pdf -o /output/dir/merged.pdf
## Splitting PDFs
-The `pdf:split` command divides a single PDF into individual pages, with each page becoming a separate PDF file.
+The `pdf:split` command divides a single PDF into individual pages, with each page becoming a separate PDF file. You can optionally limit which pages are extracted and combine them into a single output PDF.
### Basic Splitting
@@ -139,6 +139,51 @@ Creates files named:
- `chapter_page_2.pdf`
- etc.
+### Extracting a Page Range
+
+Use `--pages` to limit which pages are extracted (1-based indexing):
+
+**Single page:**
+```bash
+parxy pdf:split document.pdf --pages 3
+```
+
+**Page range:**
+```bash
+parxy pdf:split document.pdf --pages 2:5
+```
+
+**From start to page N:**
+```bash
+parxy pdf:split document.pdf --pages :5
+```
+
+**From page N to end:**
+```bash
+parxy pdf:split document.pdf --pages 3:
+```
+
+### Combining Pages into a Single PDF
+
+Use `--combine` to extract a page range into a single output PDF instead of one file per page:
+
+```bash
+# Extract pages 2–5 as a single PDF (auto-named)
+parxy pdf:split document.pdf --pages 2:5 --combine
+# Output: document_pages_2-5.pdf (next to the input file)
+
+# Specify a custom output path
+parxy pdf:split document.pdf --pages 2:5 --combine -o extracted.pdf
+
+# Extract a single page as a PDF
+parxy pdf:split document.pdf --pages 3 --combine -o page3.pdf
+
+# Combine all pages (equivalent to a copy)
+parxy pdf:split document.pdf --combine -o copy.pdf
+```
+
+> **Tip:** `--combine` pairs well with `--pages` to replace the `pdf:merge file.pdf[2:5]` pattern when working with a single source file.
+
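The auto-naming rule shown above (`document_pages_2-5.pdf`, placed next to the input file) can be sketched in Python. `default_combined_path` is a hypothetical helper for illustration, not part of Parxy. The guide does not say how the file is named when `--pages` is a single page or is omitted, so this sketch covers only an explicit range.

```python
from pathlib import Path


def default_combined_path(input_path: Path, first: int, last: int) -> Path:
    """Hypothetical helper: build '{filename}_pages_{from}-{to}.pdf' next to the input."""
    # with_name() swaps only the final path component, so the parent folder is kept.
    return input_path.with_name(f"{input_path.stem}_pages_{first}-{last}.pdf")


print(default_combined_path(Path("reports/document.pdf"), 2, 5))
```

Deriving the name from `stem` keeps the output in the input's directory, which matches the "next to the input file" behaviour described for `--combine` without `-o`.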
### Complete Examples
**Split with custom output directory:**
@@ -161,14 +206,25 @@ Creates:
parxy pdf:split document.pdf -o ./individual_pages -p page
```
+**Extract pages 10–20 as individual files:**
+```bash
+parxy pdf:split document.pdf --pages 10:20 -o ./extracted_pages
+```
+
## Combining Merge and Split
You can chain operations together using the CLI:
**Example: Extract specific pages and split them:**
```bash
-# First, extract pages 10-20
-parxy pdf:merge document.pdf[10:20] -o extracted.pdf
+# Extract pages 10-20 as individual files
+parxy pdf:split document.pdf --pages 10:20 -o ./individual_pages
+```
+
+**Example: Extract a range into a single PDF, then split:**
+```bash
+# First, extract pages 10-20 into one PDF
+parxy pdf:split document.pdf --pages 10:20 --combine -o extracted.pdf
# Then split into individual pages
parxy pdf:split extracted.pdf -o ./individual_pages
@@ -232,17 +288,21 @@ parxy pdf:split INPUT_FILE [OPTIONS]
```
**Arguments:**
-- `INPUT_FILE`: PDF file to split into individual pages
+- `INPUT_FILE`: PDF file to split
**Options:**
-- `--output, -o`: Output directory (default: `{filename}_split/`)
-- `--prefix, -p`: Output filename prefix (default: input filename)
+- `--output, -o`: Without `--combine`: output directory (default: `{filename}_split/`). With `--combine`: output file path (default: `{filename}_pages_{from}-{to}.pdf` next to the input).
+- `--prefix, -p`: Output filename prefix for individual split files (default: input filename)
+- `--pages`: Page range to extract, 1-based. Formats: `3` (single page), `2:5` (range), `:5` (up to page 5), `3:` (from page 3 to end)
+- `--combine`: Combine extracted pages into a single PDF instead of one file per page
**Examples:**
```bash
parxy pdf:split document.pdf
parxy pdf:split document.pdf -o ./pages
parxy pdf:split document.pdf -o ./pages -p page
+parxy pdf:split document.pdf --pages 2:5
+parxy pdf:split document.pdf --pages 2:5 --combine -o extracted.pdf
```
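The four `--pages` formats listed above (`3`, `2:5`, `:5`, `3:`) are 1-based, while the Python API uses 0-based `from_page` / `to_page`. The sketch below shows one way such a spec could be translated. `parse_pages` is a hypothetical function, and treating `None` as "open-ended" is an assumption rather than the CLI's actual implementation.

```python
from typing import Optional, Tuple


def parse_pages(spec: str) -> Tuple[Optional[int], Optional[int]]:
    """Hypothetical parser: 1-based spec -> 0-based inclusive (from_page, to_page).

    None on either side means "open-ended" (start or end of the document).
    """
    spec = spec.strip()
    if ":" in spec:
        start, _, end = spec.partition(":")
        first = int(start) if start else None
        last = int(end) if end else None
    else:
        first = last = int(spec)

    for value in (first, last):
        if value is not None and value < 1:
            raise ValueError(f"page numbers are 1-based, got {value}")
    if first is not None and last is not None and first > last:
        raise ValueError(f"start page {first} is after end page {last}")

    def to_zero_based(page: Optional[int]) -> Optional[int]:
        return None if page is None else page - 1

    return to_zero_based(first), to_zero_based(last)
```

Under these assumptions `parse_pages("2:5")` gives `(1, 4)`, the same indices the Python tutorial uses for pages 2 to 5.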
## Getting Help
diff --git a/docs/tutorials/pdf_manipulation.md b/docs/tutorials/pdf_manipulation.md
index d101746..afa8604 100644
--- a/docs/tutorials/pdf_manipulation.md
+++ b/docs/tutorials/pdf_manipulation.md
@@ -81,6 +81,44 @@ for page_path in pages:
# ...
```
+You can limit splitting to a page range using 0-based `from_page` / `to_page` indices:
+
+```python
+# Split only pages 2–5 (0-based: indices 1–4)
+pages = Parxy.pdf.split(
+    input_path=Path("document.pdf"),
+    output_dir=Path("./pages"),
+    prefix="doc",
+    from_page=1,
+    to_page=4,
+)
+# Creates: doc_page_2.pdf, doc_page_3.pdf, doc_page_4.pdf, doc_page_5.pdf
+```
+
+### Extracting Pages into a Single PDF
+
+Use `extract_pages` to pull a page range from a PDF into a new single-file PDF without splitting each page individually:
+
+```python
+from pathlib import Path
+from parxy_core.services.pdf_service import PdfService
+
+# Extract pages 3–7 (0-based: indices 2–6)
+PdfService.extract_pages(
+    input_path=Path("report.pdf"),
+    output_path=Path("summary.pdf"),
+    from_page=2,
+    to_page=6,
+)
+```
+
+Omit `from_page` / `to_page` to copy all pages:
+
+```python
+# Equivalent to a copy
+PdfService.extract_pages(Path("original.pdf"), Path("copy.pdf"))
+```
+
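The 0-based, inclusive `from_page` / `to_page` convention used by `split` and `extract_pages` can be sketched as follows. `resolve_range` is a hypothetical helper. The exact conditions under which Parxy raises `ValueError` are not shown in this tutorial, so the checks here are assumptions.

```python
from typing import Optional


def resolve_range(
    total_pages: int,
    from_page: Optional[int] = None,
    to_page: Optional[int] = None,
) -> range:
    """Hypothetical check: return the 0-based indices selected by an inclusive range."""
    start = 0 if from_page is None else from_page
    end = total_pages - 1 if to_page is None else to_page
    if start < 0 or end >= total_pages or start > end:
        raise ValueError(
            f"invalid page range {start}-{end} for a {total_pages}-page document"
        )
    return range(start, end + 1)


# Pages 2-5 of a 10-page document: indices 1..4, named with 1-based page numbers.
print([f"doc_page_{i + 1}.pdf" for i in resolve_range(10, from_page=1, to_page=4)])
```

Omitting both bounds selects every page, which corresponds to the "equivalent to a copy" case above.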
### Optimizing PDFs
Reduce PDF file size using compression techniques:
@@ -302,6 +340,12 @@ try:
except FileNotFoundError as e:
print(f"File not found: {e}")
+# ValueError for invalid page ranges
+try:
+    Parxy.pdf.split(Path("doc.pdf"), Path("./out"), "doc", from_page=100)
+except ValueError as e:
+    print(f"Invalid page range: {e}")
+
# ValueError for invalid parameters
try:
Parxy.pdf.optimize(
@@ -332,7 +376,8 @@ except RuntimeError as e:
In this tutorial you learned:
- **`Parxy.pdf.merge()`** - Combine multiple PDFs with optional page ranges
-- **`Parxy.pdf.split()`** - Split a PDF into individual page files
+- **`Parxy.pdf.split()`** - Split a PDF into individual page files, with optional page range
+- **`PdfService.extract_pages()`** - Extract a page range into a single output PDF
- **`Parxy.pdf.optimize()`** - Reduce file size with compression options
- **`PdfService` context manager** - Work with attachments (add, list, extract, remove)
@@ -344,6 +389,7 @@ In this tutorial you learned:
| Splitting into pages | Extracting attachment content |
| Optimizing file size | Multiple operations on one file |
| One-shot operations | Need fine-grained control |
+| Splitting a page range | Extracting a page range into one PDF (`extract_pages`) |
## Next Steps
diff --git a/docs/tutorials/using_cli.md b/docs/tutorials/using_cli.md
index a350ed1..523c28e 100644
--- a/docs/tutorials/using_cli.md
+++ b/docs/tutorials/using_cli.md
@@ -14,7 +14,7 @@ The Parxy CLI lets you:
| `parxy preview` | Interactive document viewer with metadata, table of contents, and scrollable content preview |
| `parxy markdown` | Convert documents to Markdown files, with support for multiple drivers and folder processing |
| `parxy pdf:merge`| Merge multiple PDF files into one, with support for page ranges |
-| `parxy pdf:split`| Split a PDF file into individual pages |
+| `parxy pdf:split`| Split a PDF into individual pages, with optional page range and single-file extraction |
| `parxy drivers` | List available document processing drivers |
| `parxy env` | Generate a default `.env` configuration file |
| `parxy docker` | Create a Docker Compose configuration for running Parxy-related services |
@@ -218,6 +218,42 @@ parxy markdown document.pdf -d pymupdf -d llamaparse
This produces `pymupdf-document.md` and `llamaparse-document.md`.
+### Converting Pre-parsed JSON Results
+
+If you have a JSON file produced by `parxy parse -m json`, you can convert it to Markdown directly without re-parsing:
+
+```bash
+parxy markdown result.json
+```
+
+This loads the `Document` model from the JSON and converts it immediately — no driver or API call required. You can mix JSON files and PDF files in the same invocation:
+
+```bash
+parxy markdown result.json document.pdf -d pymupdf -o output/
+```
+
+### Page Separator Comments
+
+Use `--page-separators` to insert HTML comments before each page's content:
+
+```bash
+parxy markdown document.pdf --page-separators
+```
+
+Output will contain markers like:
+
+```markdown
+
+
+First page content...
+
+
+
+Second page content...
+```
+
+This is useful for post-processing scripts that need to identify page boundaries.
+
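A post-processing script of the kind mentioned above might split the Markdown on the page markers. The sketch below is illustrative only. The marker pattern is an assumption (an HTML comment carrying a page number), so check the comment that `--page-separators` actually emits and adjust `PAGE_MARKER` to match. `split_by_page` is a hypothetical helper.

```python
import re
from typing import Dict

# ASSUMPTION: illustrative marker format; adapt to the real comment text.
PAGE_MARKER = re.compile(r"<!--\s*page:?\s*(\d+)\s*-->")


def split_by_page(markdown: str, marker: re.Pattern = PAGE_MARKER) -> Dict[int, str]:
    """Map page number -> page content, using comment markers as boundaries."""
    pieces = marker.split(markdown)
    # With one capture group, re.split yields [preamble, num, body, num, body, ...].
    return {
        int(number): body.strip()
        for number, body in zip(pieces[1::2], pieces[2::2])
    }
```

Text before the first marker is ignored, and input without markers yields an empty mapping.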
### Inline Output
Use `--inline` with a single file to print markdown directly to stdout with a YAML frontmatter header — useful for shell pipelines:
@@ -276,7 +312,7 @@ parxy pdf:merge cover.pdf /chapters doc.pdf[10:20] appendix.pdf -o book.pdf
### Splitting PDFs
-The `pdf:split` command divides a PDF file into individual pages, with each page becoming a separate PDF file.
+The `pdf:split` command divides a PDF file into individual pages, with optional page range extraction and single-file output.
**Split into individual pages:**
```bash
@@ -290,7 +326,21 @@ This creates a `document_split/` folder containing `document_page_1.pdf`, `docum
parxy pdf:split report.pdf -o ./pages -p page
```
-Creates `page_1.pdf`, `page_2.pdf`, etc. in the `./pages` directory.
+**Extract a page range as individual files:**
+```bash
+parxy pdf:split document.pdf --pages 2:5 -o ./pages
+```
+
+**Combine a page range into a single PDF:**
+```bash
+# Auto-named output next to the input file
+parxy pdf:split document.pdf --pages 2:5 --combine
+
+# Custom output path
+parxy pdf:split document.pdf --pages 2:5 --combine -o extracted.pdf
+```
+
+Page range formats (1-based): `3` · `2:5` · `:5` · `3:`
For more detailed examples and use cases, see the [PDF Manipulation How-to Guide](../howto/pdf_manipulation.md).
@@ -358,9 +408,9 @@ With the CLI, you can use Parxy as a **standalone document parsing tool** — id
|------------------|--------------------------------------------------------------|
| `parxy parse` | Extract text from documents with multiple formats & drivers |
| `parxy preview` | Interactive document viewer with metadata and TOC |
-| `parxy markdown` | Generate Markdown files with driver prefix naming |
+| `parxy markdown` | Generate Markdown files; accepts JSON results and supports `--page-separators` |
| `parxy pdf:merge`| Merge multiple PDF files with page range support |
-| `parxy pdf:split`| Split PDF files into individual pages |
+| `parxy pdf:split`| Split PDF into individual pages; supports `--pages` and `--combine` |
| `parxy drivers` | List supported drivers |
| `parxy env` | Create default configuration file |
| `parxy docker` | Generate Docker Compose setup |
diff --git a/src/parxy_cli/commands/markdown.py b/src/parxy_cli/commands/markdown.py
index dfba9f6..2ccd009 100644
--- a/src/parxy_cli/commands/markdown.py
+++ b/src/parxy_cli/commands/markdown.py
@@ -2,11 +2,13 @@
from datetime import timedelta
from pathlib import Path
-from typing import Optional, List, Annotated
+from typing import Optional, List, Annotated, Tuple
import typer
+from pydantic import ValidationError
from parxy_core.facade import Parxy
+from parxy_core.models import Document
from parxy_cli.models import Level
from parxy_cli.console.console import Console
@@ -91,14 +93,27 @@ def markdown(
min=1,
),
] = None,
+ page_separators: Annotated[
+ bool,
+ typer.Option(
+ '--page-separators',
+ help="Insert HTML comments before each page's content.",
+ ),
+ ] = False,
):
"""Parse documents to Markdown.
+ Accepts PDF files (parsed on-the-fly) or pre-parsed JSON result files
+ (loaded directly from the Document model without re-parsing).
+
Examples:
# Parse a single file
parxy markdown document.pdf
+ # Convert a pre-parsed JSON result directly to markdown
+ parxy markdown result.json
+
# Parse with a specific driver and output to a folder
parxy markdown document.pdf -d pymupdf -o output/
@@ -110,6 +125,9 @@ def markdown(
# Output to stdout as YAML-frontmattered markdown (single file only)
parxy markdown document.pdf --inline
+
+ # Include page separator comments in the output
+ parxy markdown document.pdf --page-separators
"""
console.action('Markdown export', space_after=False)
@@ -120,85 +138,118 @@ def markdown(
console.warning('No suitable files found to process.', panel=True)
raise typer.Exit(1)
- if inline and len(files) > 1:
+ # Partition into pre-parsed JSON files and files to parse
+ json_files = [f for f in files if f.suffix.lower() == '.json']
+ parse_files = [f for f in files if f.suffix.lower() != '.json']
+
+ if inline and len(json_files) + len(parse_files) > 1:
console.error('--inline can only be used with a single file')
raise typer.Exit(1)
- # Use default driver if none specified
+ # Use default driver if none specified (only needed for parse_files)
if not drivers:
drivers = [Parxy.default_driver()]
output_path = Path(output_dir) if output_dir else None
- total_tasks = len(files) * len(drivers)
+ total_tasks = len(json_files) + len(parse_files) * len(drivers)
error_count = 0
+ elapsed_time = '0 sec'
+
+    def _write_markdown(
+        doc: Document, file_path: Path, driver_label: str | None
+    ) -> None:
+        """Write markdown content to file or stdout."""
+        content = doc.markdown(page_separators=page_separators)
+        if inline:
+            frontmatter = f'---\nfile: "{file_path}"\npages: {len(doc.pages)}\n---\n\n'
+            console.print(frontmatter + content)
+        else:
+            if output_path:
+                output_path.mkdir(parents=True, exist_ok=True)
+                save_dir = output_path
+            else:
+                save_dir = file_path.parent
+
+            base_name = file_path.stem
+            if driver_label:
+                base_name = f'{driver_label}-{base_name}'
+
+            out_file = save_dir / f'{base_name}.md'
+            out_file.write_text(content, encoding='utf-8')
+
+            via = f'via {driver_label} ' if driver_label else ''
+            console.print(
+                f'[faint]⎿ [/faint] {file_path.name} {via}to [success]{out_file}[/success] [faint]({len(doc.pages)} pages)[/faint]'
+            )
try:
with console.shimmer(
- f'Processing {len(files)} file{"s" if len(files) > 1 else ""} with {len(drivers)} driver{"s" if len(drivers) > 1 else ""}...'
+ f'Processing {len(files)} file{"s" if len(files) > 1 else ""}...'
):
with console.progress('Processing documents') as progress:
task = progress.add_task('', total=total_tasks)
- batch_tasks = [str(f) for f in files]
-
- for result in Parxy.batch_iter(
- tasks=batch_tasks,
- drivers=drivers,
- level=level.value,
- workers=workers,
- ):
- file_name = (
- Path(result.file).name
- if isinstance(result.file, str)
- else 'document'
- )
-
- if result.success:
- doc = result.document
- file_path = (
- Path(result.file)
- if isinstance(result.file, str)
- else Path('document')
+ # Process pre-parsed JSON files directly
+ for json_file in json_files:
+ try:
+ doc = Document.model_validate_json(
+ json_file.read_text(encoding='utf-8')
)
-
- content = doc.markdown()
-
- if inline:
- frontmatter = f'---\nfile: "{result.file}"\npages: {len(doc.pages)}\n---\n\n'
- console.print(frontmatter + content)
- else:
- if output_path:
- output_path.mkdir(parents=True, exist_ok=True)
- save_dir = output_path
- else:
- save_dir = file_path.parent
-
- base_name = file_path.stem
- if result.driver:
- base_name = f'{result.driver}-{base_name}'
-
- out_file = save_dir / f'{base_name}.md'
- out_file.write_text(content, encoding='utf-8')
-
- console.print(
- f'[faint]⎿ [/faint] {file_name} via {result.driver} to [success]{out_file}[/success] [faint]({len(doc.pages)} pages)[/faint]'
- )
- else:
+ _write_markdown(
+ doc, json_file.with_suffix(''), driver_label=None
+ )
+ except (ValidationError, ValueError) as e:
console.print(
- f'[faint]⎿ [/faint] {file_name} via {result.driver} error. [error]{result.error}[/error]'
+ f'[faint]⎿ [/faint] {json_file.name} error. [error]{e}[/error]'
)
error_count += 1
-
if stop_on_failure:
console.newline()
console.info(
'Stopping due to error (--stop-on-failure flag is set)'
)
raise typer.Exit(1)
-
progress.update(task, advance=1)
+ # Process files that need parsing
+ if parse_files:
+ for result in Parxy.batch_iter(
+ tasks=[str(f) for f in parse_files],
+ drivers=drivers,
+ level=level.value,
+ workers=workers,
+ ):
+ file_name = (
+ Path(result.file).name
+ if isinstance(result.file, str)
+ else 'document'
+ )
+
+ if result.success:
+ file_path = (
+ Path(result.file)
+ if isinstance(result.file, str)
+ else Path('document')
+ )
+ _write_markdown(
+ result.document, file_path, driver_label=result.driver
+ )
+ else:
+ console.print(
+ f'[faint]⎿ [/faint] {file_name} via {result.driver} error. [error]{result.error}[/error]'
+ )
+ error_count += 1
+
+ if stop_on_failure:
+ console.newline()
+ console.info(
+ 'Stopping due to error (--stop-on-failure flag is set)'
+ )
+ raise typer.Exit(1)
+
+ progress.update(task, advance=1)
+
elapsed_time = format_timedelta(
timedelta(seconds=max(0, progress.tasks[0].elapsed))
)
@@ -210,13 +261,13 @@ def markdown(
if not inline:
console.newline()
- if error_count == len(files) * len(drivers):
+ if error_count == total_tasks:
console.error('All files were not processed due to errors')
return
if error_count > 0:
console.warning(
- f'Processed {len(files)} file{"s" if len(files) > 1 else ""} with warnings using {len(drivers)} driver{"s" if len(drivers) > 1 else ""}'
+ f'Processed {len(files)} file{"s" if len(files) > 1 else ""} with warnings'
)
console.print(
f'[faint]⎿ [/faint] [highlight]{error_count} files errored[/highlight]'
@@ -225,5 +276,5 @@ def markdown(
if not inline:
console.success(
- f'Processed {len(files)} file{"s" if len(files) > 1 else ""} using {len(drivers)} driver{"s" if len(drivers) > 1 else ""} (took {elapsed_time})'
+ f'Processed {len(files)} file{"s" if len(files) > 1 else ""} (took {elapsed_time})'
)
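The partitioning and task-count logic introduced in this file can be isolated into a small standalone sketch. `partition_inputs` and `count_tasks` are hypothetical names used only for illustration. The command itself computes these values inline.

```python
from pathlib import Path
from typing import List, Tuple


def partition_inputs(files: List[Path]) -> Tuple[List[Path], List[Path]]:
    """Split inputs into pre-parsed JSON results and files that still need parsing."""
    json_files = [f for f in files if f.suffix.lower() == ".json"]
    parse_files = [f for f in files if f.suffix.lower() != ".json"]
    return json_files, parse_files


def count_tasks(files: List[Path], drivers: List[str]) -> int:
    """JSON files count once; every other file counts once per driver."""
    json_files, parse_files = partition_inputs(files)
    return len(json_files) + len(parse_files) * len(drivers)
```

This counting is why the progress total and the "all files failed" check now use `total_tasks` rather than `len(files) * len(drivers)`: JSON inputs never go through a driver.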
diff --git a/src/parxy_core/models/models.py b/src/parxy_core/models/models.py
index 258b847..b965c56 100644
--- a/src/parxy_core/models/models.py
+++ b/src/parxy_core/models/models.py
@@ -155,7 +155,52 @@ def text(self, page_separator: str = '---') -> str:
return '\n'.join(texts)
- def markdown(self) -> str:
+ def contentmd(
+ self,
+ title: Optional[str] = None,
+ description: Optional[str] = None,
+ date: Optional[str] = None,
+ license: Optional[str] = None,
+ author: Optional[str] = None,
+ page_separators: bool = False,
+ ) -> str:
+ """Get the document content formatted as content-md.
+
+ Delegates to :class:`~parxy_core.services.ContentMdService`.
+
+ Parameters
+ ----------
+ title : str, optional
+ Document title. Falls back to metadata.title, a heading inferred
+ from the first page, then filename. Raises ValueError if none resolves.
+ description : str, optional
+ Short summary (~200 characters). Falls back to a doc-abstract block,
+ then the first five body TextBlocks across the first two pages.
+ date : str, optional
+ Creation/publication date in ISO 8601. Falls back to metadata dates.
+ license : str, optional
+ License name or SPDX identifier.
+ author : str, optional
+ Author name. Falls back to metadata.author.
+
+ Returns
+ -------
+ str
+ The document content formatted as content-md.
+ """
+ from parxy_core.services.contentmd_service import ContentMdService
+
+ return ContentMdService.render(
+ self,
+ title=title,
+ description=description,
+ date=date,
+ license=license,
+ author=author,
+ page_separators=page_separators,
+ )
+
+ def markdown(self, page_separators: bool = False) -> str:
"""Get the document content formatted as Markdown.
The method attempts to preserve the document structure by:
@@ -163,6 +208,12 @@ def markdown(self) -> str:
2. Preserving line breaks where meaningful
3. Adding section headers based on block levels
+ Parameters
+ ----------
+ page_separators : bool, optional
+ When True, inserts an HTML comment ```` before
+ each page's content, by default False
+
Returns
-------
str
@@ -174,48 +225,50 @@ def markdown(self) -> str:
markdown_parts = []
for page in self.pages:
- if not page.blocks:
- if page.text.strip():
- markdown_parts.append(page.text.strip())
- continue
-
page_parts = []
- for block in page.blocks:
- if isinstance(block, TextBlock):
- # Handle different block categories
- if block.category and block.category.lower() in [
- 'heading',
- 'title',
- 'header',
- ]:
- # Determine heading level (h1-h6) based on block level or default to h2
- level = min(block.level or 2, 6)
- page_parts.append(f'{"#" * level} {block.text.strip()}')
- elif block.category and block.category.lower() == 'list':
- # Convert to bullet points
- for line in block.text.splitlines():
- if line.strip():
- page_parts.append(f'- {line.strip()}')
- else:
- # Regular paragraph
+ if page_separators:
+ page_parts.append(f'')
+
+ if not page.blocks:
+ if page.text.strip():
+ page_parts.append(page.text.strip())
+ else:
+ for block in page.blocks:
+ if isinstance(block, TextBlock):
+ # Handle different block categories
+ if block.category and block.category.lower() in [
+ 'heading',
+ 'title',
+ 'header',
+ ]:
+ # Determine heading level (h1-h6) based on block level or default to h2
+ level = min(block.level or 2, 6)
+ page_parts.append(f'{"#" * level} {block.text.strip()}')
+ elif block.category and block.category.lower() == 'list':
+ # Convert to bullet points
+ for line in block.text.splitlines():
+ if line.strip():
+ page_parts.append(f'- {line.strip()}')
+ else:
+ # Regular paragraph
+ if block.text.strip():
+ page_parts.append(block.text.strip())
+
+ elif isinstance(block, ImageBlock):
+ ext = (
+ block.name.rsplit('.', 1)[-1]
+ if block.name and '.' in block.name
+ else ''
+ )
+ lang = f'image:{ext}' if ext else 'image'
+ alt = block.alt_text or ''
+ page_parts.append(f'```{lang}\n{alt}\n```')
+
+ elif isinstance(block, TableBlock):
if block.text.strip():
page_parts.append(block.text.strip())
- elif isinstance(block, ImageBlock):
- ext = (
- block.name.rsplit('.', 1)[-1]
- if block.name and '.' in block.name
- else ''
- )
- lang = f'image:{ext}' if ext else 'image'
- alt = block.alt_text or ''
- page_parts.append(f'```{lang}\n{alt}\n```')
-
- elif isinstance(block, TableBlock):
- if block.text.strip():
- page_parts.append(block.text.strip())
-
if page_parts:
markdown_parts.append('\n\n'.join(page_parts))
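The per-block rules that `markdown()` applies to text blocks (heading, list, paragraph) can be exercised in isolation with a stand-in type. `Block` and `render_block` below are hypothetical and model only the fields used here. They are not the real `TextBlock` model or a Parxy API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    """Stand-in for TextBlock with only the fields this sketch needs."""
    text: str
    category: Optional[str] = None
    level: Optional[int] = None


def render_block(block: Block) -> List[str]:
    """Return the Markdown fragments for one text block."""
    category = (block.category or "").lower()
    if category in ("heading", "title", "header"):
        level = min(block.level or 2, 6)  # default to h2, cap at h6
        return [f'{"#" * level} {block.text.strip()}']
    if category == "list":
        # One bullet per non-empty line
        return [f"- {line.strip()}" for line in block.text.splitlines() if line.strip()]
    # Regular paragraph; empty text produces nothing
    return [block.text.strip()] if block.text.strip() else []
```

Returning a list of fragments mirrors how the method appends to `page_parts` before joining them with blank lines.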
diff --git a/src/parxy_core/services/__init__.py b/src/parxy_core/services/__init__.py
index 5071d08..5342a63 100644
--- a/src/parxy_core/services/__init__.py
+++ b/src/parxy_core/services/__init__.py
@@ -1,5 +1,6 @@
"""Services module for parxy_core."""
+from parxy_core.services.contentmd_service import ContentMdService
from parxy_core.services.pdf_service import PdfService
-__all__ = ['PdfService']
+__all__ = ['ContentMdService', 'PdfService']
diff --git a/src/parxy_core/services/contentmd_service.py b/src/parxy_core/services/contentmd_service.py
new file mode 100644
index 0000000..039ab38
--- /dev/null
+++ b/src/parxy_core/services/contentmd_service.py
@@ -0,0 +1,273 @@
+"""Service for rendering documents as content-md."""
+
+from __future__ import annotations
+
+from typing import TYPE_CHECKING, Optional
+
+if TYPE_CHECKING:
+ from parxy_core.models.models import Document
+
+
+class ContentMdService:
+ """Render a :class:`Document` as a content-md string.
+
+ content-md is an open specification for optimised content exchange: a YAML
+ frontmatter section followed by CommonMark / GitHub-flavoured Markdown.
+ All methods are static; the class acts as a namespace.
+ """
+
+ # ------------------------------------------------------------------
+ # Private helpers
+ # ------------------------------------------------------------------
+
+ # Roles that provide structure or navigation rather than readable body text
+ _STRUCTURAL_ROLES: frozenset[str] = frozenset(
+ {
+ 'heading',
+ 'doc-title',
+ 'doc-subtitle',
+ 'doc-abstract',
+ 'doc-toc',
+ 'doc-pageheader',
+ 'doc-pagefooter',
+ 'caption',
+ }
+ )
+
+ @staticmethod
+ def _normalize(text: str) -> str:
+ """Collapse any run of whitespace to a single space and strip."""
+ return ' '.join(text.split())
+
+ @staticmethod
+ def _yaml_str(value: str) -> str:
+ """Wrap *value* in double quotes and escape internal quotes/backslashes."""
+ return '"' + value.replace('\\', '\\\\').replace('"', '\\"') + '"'
+
+ @staticmethod
+ def _guess_title(document: Document) -> Optional[str]:
+ """Infer a title from the first page blocks.
+
+ Prefers an explicit ``doc-title`` role; falls back to the
+ highest-ranking (lowest level number) ``heading`` block.
+ """
+ from parxy_core.models.models import TextBlock
+
+ if not document.pages:
+ return None
+ first_page = document.pages[0]
+ if not first_page.blocks:
+ return None
+
+ doc_title = next(
+ (
+ b
+ for b in first_page.blocks
+ if isinstance(b, TextBlock) and b.role == 'doc-title' and b.text.strip()
+ ),
+ None,
+ )
+ if doc_title:
+ return ContentMdService._normalize(doc_title.text)
+
+ headings = [
+ b
+ for b in first_page.blocks
+ if isinstance(b, TextBlock) and b.role == 'heading' and b.text.strip()
+ ]
+ if not headings:
+ return None
+ return ContentMdService._normalize(
+ min(headings, key=lambda b: b.level or 1).text
+ )
+
+ @staticmethod
+ def _infer_description(document: Document) -> Optional[str]:
+ """Infer a description from document content.
+
+ Uses the ``doc-abstract`` block when present. Otherwise concatenates
+ the first five body :class:`TextBlock` objects (non-structural, across
+ the first two pages), normalises whitespace, and returns at most 200
+ characters.
+ """
+ from parxy_core.models.models import TextBlock
+
+ blocks = [
+ b
+ for page in document.pages[:2]
+ if page.blocks
+ for b in page.blocks
+ if isinstance(b, TextBlock) and b.text.strip()
+ ]
+
+ abstract = next((b for b in blocks if b.role == 'doc-abstract'), None)
+ if abstract:
+ return ContentMdService._normalize(abstract.text)
+
+ body_blocks = [
+ b
+ for b in blocks
+ if (b.role or 'generic') not in ContentMdService._STRUCTURAL_ROLES
+ ]
+ if not body_blocks:
+ return None
+
+ combined = ' '.join(b.text for b in body_blocks[:5])
+ return ContentMdService._normalize(combined)[:200]
+
+ @staticmethod
+ def _build_frontmatter(
+ title: str,
+ description: Optional[str],
+ date: Optional[str],
+ license: Optional[str],
+ author: Optional[str],
+ ) -> str:
+ ys = ContentMdService._yaml_str
+ lines = ['---', f'title: {ys(title)}']
+ if description:
+ lines.append(f'description: {ys(description)}')
+ if date:
+ lines.append(f'date: {ys(date)}')
+ if license:
+ lines.append(f'license: {ys(license)}')
+ if author:
+ lines.append(f'author: {ys(author)}')
+ lines.append('---')
+ return '\n'.join(lines)
+
+ @staticmethod
+ def _build_body(
+ document: Document, title: str, page_separators: bool = False
+ ) -> str:
+ from parxy_core.models.models import ImageBlock, TableBlock, TextBlock
+
+ normalize = ContentMdService._normalize
+ parts = [f'# {title}']
+
+ for page in document.pages:
+ if page_separators:
+ parts.append(f'')
+
+ if not page.blocks:
+ if page.text.strip():
+ parts.append(normalize(page.text))
+ continue
+
+ for block in page.blocks:
+ role = (block.role or 'generic').lower()
+
+ if isinstance(block, TextBlock):
+ if role == 'doc-title':
+ # Already the top-level h1 — skip to avoid duplication
+ pass
+ elif role == 'heading':
+ # Shift levels +1: h1 content → h2, per content-md spec
+ shifted = min((block.level or 1) + 1, 6)
+ parts.append(f'{"#" * shifted} {normalize(block.text)}')
+ elif role in ('list', 'listitem'):
+ for line in block.text.splitlines():
+ if line.strip():
+ parts.append(f'- {normalize(line)}')
+ elif role == 'doc-abstract':
+ lang_attr = (
+ f' lang="{document.language}"' if document.language else ''
+ )
+ parts.append(
+ f'\n{normalize(block.text)}\n'
+ )
+ else:
+ normalized = normalize(block.text)
+ if normalized:
+ parts.append(normalized)
+
+ elif isinstance(block, ImageBlock):
+ parts.append(f'\n{block.alt_text or ""}\n')
+
+ elif isinstance(block, TableBlock):
+ # Preserve table whitespace (column alignment, padding)
+ if block.text.strip():
+ parts.append(block.text.strip())
+
+ return '\n\n'.join(parts)
+
+ # ------------------------------------------------------------------
+ # Public API
+ # ------------------------------------------------------------------
+
+ @staticmethod
+ def render(
+ document: Document,
+ title: Optional[str] = None,
+ description: Optional[str] = None,
+ date: Optional[str] = None,
+ license: Optional[str] = None,
+ author: Optional[str] = None,
+ page_separators: bool = False,
+ ) -> str:
+ """Render *document* as a content-md string.
+
+ Parameters
+ ----------
+ document:
+ The document to render.
+ title:
+ Document title. Falls back to ``metadata.title``, a heading
+ inferred from the first page, then ``filename``. Raises
+ ``ValueError`` if no title can be resolved.
+ description:
+ Short summary (~200 characters). Falls back to a ``doc-abstract``
+ block, then the first five body blocks in the first two pages.
+ date:
+ Creation/publication date in ISO 8601. Falls back to
+ ``metadata.created_at`` / ``metadata.updated_at``.
+ license:
+ License name or SPDX identifier.
+ author:
+ Author name. Falls back to ``metadata.author``.
+ page_separators:
+ When True, inserts ```` before each page's
+ content in the body.
+
+ Returns
+ -------
+ str
+ The document formatted as content-md.
+ """
+ resolved_title = (
+ title
+ or (document.metadata.title if document.metadata else None)
+ or ContentMdService._guess_title(document)
+ or document.filename
+ )
+ if not resolved_title:
+ raise ValueError(
+ 'Cannot render content-md: no title could be resolved. '
+ 'Provide a title via metadata, a doc-title/heading block, '
+ 'a filename, or pass title= explicitly.'
+ )
+ resolved_description = description or ContentMdService._infer_description(
+ document
+ )
+ resolved_date = date or (
+ (document.metadata.created_at or document.metadata.updated_at)
+ if document.metadata
+ else None
+ )
+ resolved_author = author or (
+ document.metadata.author if document.metadata else None
+ )
+
+ frontmatter = ContentMdService._build_frontmatter(
+ title=resolved_title,
+ description=resolved_description,
+ date=resolved_date,
+ license=license,
+ author=resolved_author,
+ )
+
+ if not document.pages:
+ return f'{frontmatter}\n\n# {resolved_title}\n'
+
+ body = ContentMdService._build_body(document, resolved_title, page_separators)
+ return f'{frontmatter}\n\n{body}\n'
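The small helpers in `ContentMdService` are easy to check standalone. The sketch below restates the whitespace normalisation, YAML string escaping, title fallback chain and heading shift as free functions. The function names are hypothetical, and the logic is copied from the diff above rather than imported from Parxy.

```python
from typing import Optional


def normalize(text: str) -> str:
    """Collapse any run of whitespace to a single space and strip."""
    return " ".join(text.split())


def yaml_str(value: str) -> str:
    """Double-quote a value, escaping backslashes first, then quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def resolve_title(
    explicit: Optional[str],
    metadata_title: Optional[str],
    guessed: Optional[str],
    filename: Optional[str],
) -> str:
    """First non-empty candidate wins; no candidate at all is an error."""
    title = explicit or metadata_title or guessed or filename
    if not title:
        raise ValueError("no title could be resolved")
    return title


def shifted_level(level: Optional[int]) -> int:
    """Shift heading levels down by one (h1 -> h2), capped at h6."""
    return min((level or 1) + 1, 6)
```

Escaping backslashes before quotes matters: doing it the other way round would double the backslashes that the quote escaping had just inserted.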
diff --git a/tests/commands/test_markdown.py b/tests/commands/test_markdown.py
index 88b4d74..b4e772f 100644
--- a/tests/commands/test_markdown.py
+++ b/tests/commands/test_markdown.py
@@ -278,3 +278,136 @@ def test_markdown_command_no_files_found(runner, tmp_path):
result = runner.invoke(app, [str(empty_dir)])
assert result.exit_code == 1
+
+
+def test_markdown_command_json_input_converts_directly(runner, mock_document, tmp_path):
+ """Test that a valid JSON parse result is loaded directly without re-parsing."""
+
+ json_file = tmp_path / 'result.json'
+ json_file.write_text(mock_document.model_dump_json(), encoding='utf-8')
+
+ with patch('parxy_cli.commands.markdown.Parxy') as mock_parxy:
+ result = runner.invoke(app, [str(json_file)])
+
+ assert result.exit_code == 0
+ # batch_iter should NOT be called — no PDF to parse
+ mock_parxy.batch_iter.assert_not_called()
+
+ # Output file should be saved next to the JSON file, without driver prefix
+ expected_output = tmp_path / 'result.md'
+ assert expected_output.exists()
+ assert '# Test heading' in expected_output.read_text()
+
+
+def test_markdown_command_json_input_with_output_dir(runner, mock_document, tmp_path):
+ """Test that JSON input respects the --output directory."""
+
+ json_file = tmp_path / 'result.json'
+ json_file.write_text(mock_document.model_dump_json(), encoding='utf-8')
+ output_dir = tmp_path / 'out'
+
+ with patch('parxy_cli.commands.markdown.Parxy'):
+ result = runner.invoke(app, [str(json_file), '--output', str(output_dir)])
+
+ assert result.exit_code == 0
+ assert (output_dir / 'result.md').exists()
+
+
+def test_markdown_command_json_input_inline(runner, mock_document, tmp_path):
+ """Test that JSON input with --inline prints to stdout."""
+
+ json_file = tmp_path / 'result.json'
+ json_file.write_text(mock_document.model_dump_json(), encoding='utf-8')
+
+ with patch('parxy_cli.commands.markdown.Parxy'):
+ result = runner.invoke(app, [str(json_file), '--inline'])
+
+ assert result.exit_code == 0
+ cleaned = strip_ansi(result.stdout)
+ assert '---' in cleaned
+ assert 'pages:' in cleaned
+ assert '# Test heading' in cleaned
+ assert not (tmp_path / 'result.md').exists()
+
+
+def test_markdown_command_invalid_json_reports_error(runner, tmp_path):
+ """Test that a JSON file with invalid Document content reports an error."""
+
+ json_file = tmp_path / 'bad.json'
+ json_file.write_text('{"not": "a document"}', encoding='utf-8')
+
+ with patch('parxy_cli.commands.markdown.Parxy'):
+ result = runner.invoke(app, [str(json_file)])
+
+ cleaned = strip_ansi(result.stdout)
+ assert 'error' in cleaned.lower()
+
+
+def test_markdown_command_page_separators(runner, mock_document, pdf_file):
+ """Test that --page-separators injects HTML page comments into output."""
+
+ with patch('parxy_cli.commands.markdown.Parxy') as mock_parxy:
+ mock_parxy.default_driver.return_value = 'pymupdf'
+ mock_parxy.batch_iter.return_value = iter(
+ [
+ BatchResult(
+ file=str(pdf_file),
+ driver='pymupdf',
+ document=mock_document,
+ error=None,
+ )
+ ]
+ )
+
+ result = runner.invoke(app, [str(pdf_file), '--page-separators'])
+
+ assert result.exit_code == 0
+ expected_output = pdf_file.parent / 'pymupdf-test.md'
+ assert expected_output.exists()
+ assert '<!--' in expected_output.read_text()
+
+
+def test_markdown_command_mixed_json_and_pdf(runner, mock_document, tmp_path):
+ """Test that JSON files and PDF files can be processed together."""
+
+ json_file = tmp_path / 'result.json'
+ json_file.write_text(mock_document.model_dump_json(), encoding='utf-8')
+
+ pdf_file = tmp_path / 'doc.pdf'
+ pdf_file.write_bytes(b'%PDF fake')
+
+ with patch('parxy_cli.commands.markdown.Parxy') as mock_parxy:
+ mock_parxy.default_driver.return_value = 'pymupdf'
+ mock_parxy.batch_iter.return_value = iter(
+ [
+ BatchResult(
+ file=str(pdf_file),
+ driver='pymupdf',
+ document=mock_document,
+ error=None,
+ )
+ ]
+ )
+
+ result = runner.invoke(app, [str(json_file), str(pdf_file)])
+
+ assert result.exit_code == 0
+ # JSON converted directly
+ assert (tmp_path / 'result.md').exists()
+ # PDF parsed via driver
+ assert (tmp_path / 'pymupdf-doc.md').exists()
diff --git a/tests/services/test_contentmd_service.py b/tests/services/test_contentmd_service.py
new file mode 100644
index 0000000..d0bb1a9
--- /dev/null
+++ b/tests/services/test_contentmd_service.py
@@ -0,0 +1,571 @@
+"""Test suite for ContentMdService."""
+
+import pytest
+
+from parxy_core.models.models import (
+ Document,
+ ImageBlock,
+ Metadata,
+ Page,
+ TableBlock,
+ TextBlock,
+)
+from parxy_core.services.contentmd_service import ContentMdService
+
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+
+def make_page(
+ number: int = 1,
+ text: str = '',
+ blocks: list | None = None,
+) -> Page:
+ return Page(number=number, text=text, blocks=blocks)
+
+
+def make_text_block(
+ text: str,
+ role: str = 'generic',
+ level: int | None = None,
+) -> TextBlock:
+ return TextBlock(type='text', text=text, role=role, level=level)
+
+
+def make_image_block(
+ alt_text: str | None = None, name: str | None = None
+) -> ImageBlock:
+ return ImageBlock(type='image', alt_text=alt_text, name=name)
+
+
+def make_table_block(text: str) -> TableBlock:
+ return TableBlock(type='table', text=text)
+
+
+def make_doc(
+ pages: list[Page],
+ metadata: Metadata | None = None,
+ filename: str | None = None,
+ language: str | None = None,
+) -> Document:
+ return Document(
+ pages=pages,
+ metadata=metadata,
+ filename=filename,
+ language=language,
+ )
+
+
+# ---------------------------------------------------------------------------
+# Fixtures
+# ---------------------------------------------------------------------------
+
+
+@pytest.fixture
+def minimal_doc():
+ """Document with a single page, no blocks, no metadata."""
+ return make_doc(pages=[make_page(text='Hello world')])
+
+
+@pytest.fixture
+def metadata_doc():
+ """Document with full metadata and one plain paragraph block."""
+ meta = Metadata(
+ title='Metadata Title',
+ author='Jane Doe',
+ created_at='2025-01-15',
+ )
+ page = make_page(
+ text='Paragraph text.',
+ blocks=[make_text_block('Paragraph text.')],
+ )
+ return make_doc(pages=[page], metadata=meta, filename='report.pdf')
+
+
+@pytest.fixture
+def all_blocks_doc():
+ """Document whose first page contains every supported block type."""
+ blocks = [
+ make_text_block('My Document', role='doc-title'),
+ make_text_block('Introduction', role='heading', level=1),
+ make_text_block('Background', role='heading', level=2),
+ make_text_block('First item\nSecond item', role='list'),
+ make_text_block('A plain paragraph.', role='paragraph'),
+ make_text_block('A brief overview.', role='doc-abstract'),
+ make_image_block(alt_text='A sunset over mountains', name='sunset.jpg'),
+ make_table_block('| Col A | Col B |\n| ----- | ----- |\n| 1 | 2 |'),
+ ]
+ page = make_page(text='My Document', blocks=blocks)
+ return make_doc(pages=[page], language='en')
+
+
+# ---------------------------------------------------------------------------
+# Frontmatter
+# ---------------------------------------------------------------------------
+
+
+class TestFrontmatter:
+ def test_frontmatter_delimiters_present(self, minimal_doc):
+ result = ContentMdService.render(minimal_doc, title='T', description='D')
+ lines = result.splitlines()
+ assert lines[0] == '---'
+ closing = lines.index('---', 1)
+ assert closing > 0
+
+ def test_explicit_title_in_frontmatter(self, minimal_doc):
+ result = ContentMdService.render(minimal_doc, title='Explicit Title')
+ assert 'title: "Explicit Title"' in result
+
+ def test_title_from_metadata(self, metadata_doc):
+ result = ContentMdService.render(metadata_doc)
+ assert 'title: "Metadata Title"' in result
+
+ def test_title_from_doc_title_role_preferred_over_heading(self):
+ blocks = [
+ make_text_block('Real Title', role='doc-title'),
+ make_text_block('Section One', role='heading', level=1),
+ ]
+ doc = make_doc(pages=[make_page(text='', blocks=blocks)])
+ result = ContentMdService.render(doc)
+ assert 'title: "Real Title"' in result
+
+ def test_title_from_heading_when_no_doc_title(self):
+ blocks = [
+ make_text_block('Section One', role='heading', level=2),
+ make_text_block('Section Two', role='heading', level=1),
+ ]
+ doc = make_doc(pages=[make_page(text='', blocks=blocks)])
+ result = ContentMdService.render(doc)
+ # Level 1 heading wins (lowest level = highest rank)
+ assert 'title: "Section Two"' in result
+
+ def test_title_from_filename_when_no_headings(self):
+ doc = make_doc(
+ pages=[make_page(text='body text')],
+ filename='my-report.pdf',
+ )
+ result = ContentMdService.render(doc)
+ assert 'title: "my-report.pdf"' in result
+
+ def test_title_raises_when_unresolvable(self):
+ doc = make_doc(pages=[make_page(text='body text')])
+ with pytest.raises(ValueError, match='no title could be resolved'):
+ ContentMdService.render(doc)
+
+ def test_description_from_explicit_param(self, minimal_doc):
+ result = ContentMdService.render(
+ minimal_doc, title='T', description='My summary.'
+ )
+ assert 'description: "My summary."' in result
+
+ def test_description_from_doc_abstract_block(self):
+ blocks = [
+ make_text_block('Abstract content here.', role='doc-abstract'),
+ make_text_block(
+ 'A much longer paragraph that should not be picked.', role='paragraph'
+ ),
+ ]
+ doc = make_doc(pages=[make_page(text='', blocks=blocks)])
+ result = ContentMdService.render(doc, title='T')
+ assert 'description: "Abstract content here."' in result
+
+ def test_description_from_first_five_body_blocks(self):
+ blocks = [make_text_block(f'Sentence {i}.', role='paragraph') for i in range(7)]
+ doc = make_doc(pages=[make_page(text='', blocks=blocks)])
+ result = ContentMdService.render(doc, title='T')
+ # Only the first five contribute; the sixth and seventh are ignored
+ assert 'Sentence 5' not in result.split('---\n')[1]
+ assert 'Sentence 0' in result
+
+ def test_description_excludes_structural_roles(self):
+ blocks = [
+ make_text_block('Table of contents text.', role='doc-toc'),
+ make_text_block('Page header text.', role='doc-pageheader'),
+ make_text_block('A heading block.', role='heading'),
+ make_text_block('Body content.', role='paragraph'),
+ ]
+ doc = make_doc(pages=[make_page(text='', blocks=blocks)])
+ result = ContentMdService.render(doc)
+ assert 'description: "Body content."' in result
+
+ def test_description_truncated_to_200_chars(self):
+ long_text = 'word ' * 60 # well over 200 chars
+ blocks = [make_text_block(long_text, role='paragraph')]
+ doc = make_doc(pages=[make_page(text='', blocks=blocks)])
+ result = ContentMdService.render(doc, title='T')
+ fm_end = result.index('---\n', 4)
+ frontmatter = result[:fm_end]
+ desc_line = next(
+ line for line in frontmatter.splitlines() if line.startswith('description:')
+ )
+ # Strip the YAML quoting to measure the actual value length
+ value = desc_line[len('description: "') : -1]
+ assert len(value) <= 200
+
+ def test_description_contains_no_newlines(self):
+ blocks = [
+ make_text_block('Line one.\nLine two.\nLine three.', role='paragraph')
+ ]
+ doc = make_doc(pages=[make_page(text='', blocks=blocks)])
+ result = ContentMdService.render(doc, title='T')
+ fm_end = result.index('---\n', 4)
+ frontmatter = result[:fm_end]
+ desc_line = next(
+ line for line in frontmatter.splitlines() if line.startswith('description:')
+ )
+ assert '\n' not in desc_line
+
+ def test_description_searches_first_two_pages(self):
+ page1 = make_page(number=1, text='', blocks=[make_text_block('Page 1 text.')])
+ page2 = make_page(
+ number=2,
+ text='',
+ blocks=[make_text_block('Page 2 has a longer text block.')],
+ )
+ page3 = make_page(
+ number=3,
+ text='',
+ blocks=[make_text_block('Page 3 has the longest block of all by far.')],
+ )
+ doc = make_doc(pages=[page1, page2, page3])
+ result = ContentMdService.render(doc, title='T')
+ # Page 3 is out of the two-page window
+ assert 'Page 3' not in result.split('---')[1] # not in frontmatter
+
+ def test_date_from_metadata_created_at(self, metadata_doc):
+ result = ContentMdService.render(metadata_doc)
+ assert 'date: "2025-01-15"' in result
+
+ def test_date_from_metadata_updated_at_when_no_created_at(self):
+ meta = Metadata(updated_at='2025-06-01')
+ doc = make_doc(pages=[make_page(text='')], metadata=meta)
+ result = ContentMdService.render(doc, title='T')
+ assert 'date: "2025-06-01"' in result
+
+ def test_explicit_date_overrides_metadata(self, metadata_doc):
+ result = ContentMdService.render(metadata_doc, date='2026-01-01')
+ assert 'date: "2026-01-01"' in result
+ assert '2025-01-15' not in result
+
+ def test_author_from_metadata(self, metadata_doc):
+ result = ContentMdService.render(metadata_doc)
+ assert 'author: "Jane Doe"' in result
+
+ def test_optional_fields_omitted_when_absent(self, minimal_doc):
+ result = ContentMdService.render(minimal_doc, title='T')
+ assert 'description:' not in result
+ assert 'date:' not in result
+ assert 'license:' not in result
+ assert 'author:' not in result
+
+ def test_license_included_when_provided(self, minimal_doc):
+ result = ContentMdService.render(minimal_doc, title='T', license='CC-BY-4.0')
+ assert 'license: "CC-BY-4.0"' in result
+
+ def test_yaml_values_escaped(self, minimal_doc):
+ result = ContentMdService.render(
+ minimal_doc,
+ title='Title with "quotes"',
+ description='Back\\slash',
+ )
+ assert r'title: "Title with \"quotes\""' in result
+ assert r'description: "Back\\slash"' in result
+
+
+# ---------------------------------------------------------------------------
+# Body – block rendering
+# ---------------------------------------------------------------------------
+
+
+class TestBodyBlocks:
+ def test_body_starts_with_h1_title(self, metadata_doc):
+ result = ContentMdService.render(metadata_doc)
+ body = result.split('---\n', 2)[-1]
+ assert body.lstrip().startswith('# Metadata Title')
+
+ def test_doc_title_block_skipped_in_body(self, all_blocks_doc):
+ result = ContentMdService.render(all_blocks_doc)
+ body = result.split('---\n', 2)[-1]
+ # Should appear exactly once (as the h1), not twice
+ assert body.count('My Document') == 1
+
+ def test_heading_level_shifted_by_one(self):
+ blocks = [make_text_block('Section', role='heading', level=1)]
+ doc = make_doc(pages=[make_page(text='', blocks=blocks)])
+ result = ContentMdService.render(doc, title='T')
+ assert '## Section' in result
+
+ def test_heading_level_2_becomes_3(self):
+ blocks = [make_text_block('Subsection', role='heading', level=2)]
+ doc = make_doc(pages=[make_page(text='', blocks=blocks)])
+ result = ContentMdService.render(doc, title='T')
+ assert '### Subsection' in result
+
+ def test_heading_without_level_defaults_to_h2(self):
+ blocks = [make_text_block('Heading', role='heading')]
+ doc = make_doc(pages=[make_page(text='', blocks=blocks)])
+ result = ContentMdService.render(doc, title='T')
+ assert '## Heading' in result
+
+ def test_heading_level_capped_at_6(self):
+ blocks = [make_text_block('Deep', role='heading', level=6)]
+ doc = make_doc(pages=[make_page(text='', blocks=blocks)])
+ result = ContentMdService.render(doc, title='T')
+ assert '###### Deep' in result
+ assert '####### Deep' not in result
+
+ def test_list_role_rendered_as_bullets(self):
+ blocks = [make_text_block('Alpha\nBeta\nGamma', role='list')]
+ doc = make_doc(pages=[make_page(text='', blocks=blocks)])
+ result = ContentMdService.render(doc, title='T')
+ assert '- Alpha' in result
+ assert '- Beta' in result
+ assert '- Gamma' in result
+
+ def test_listitem_role_rendered_as_bullet(self):
+ blocks = [make_text_block('Single item', role='listitem')]
+ doc = make_doc(pages=[make_page(text='', blocks=blocks)])
+ result = ContentMdService.render(doc, title='T')
+ assert '- Single item' in result
+
+ def test_doc_abstract_rendered_as_abstract_tag(self, all_blocks_doc):
+ result = ContentMdService.render(all_blocks_doc)
+ assert '<abstract' in result
+ assert 'A brief overview.' in result
+ assert '</abstract>' in result
+
+ def test_doc_abstract_without_language_omits_lang_attr(self):
+ blocks = [make_text_block('Summary.', role='doc-abstract')]
+ doc = make_doc(pages=[make_page(text='', blocks=blocks)])
+ result = ContentMdService.render(doc, title='T')
+ assert '<abstract>\nSummary.\n</abstract>' in result
+
+ def test_generic_textblock_rendered_as_paragraph(self):
+ blocks = [make_text_block('Plain paragraph text.', role='generic')]
+ doc = make_doc(pages=[make_page(text='', blocks=blocks)])
+ result = ContentMdService.render(doc, title='T')
+ assert 'Plain paragraph text.' in result
+
+ def test_empty_textblock_not_rendered(self):
+ blocks = [make_text_block(' ', role='paragraph')]
+ doc = make_doc(pages=[make_page(text='', blocks=blocks)])
+ result = ContentMdService.render(doc, title='T')
+ # Body should only contain the h1 line
+ body = result.split('---\n', 2)[-1].strip()
+ assert body == '# T'
+
+ def test_image_block_rendered_as_figure(self):
+ blocks = [make_image_block(alt_text='A sunset over mountains')]
+ doc = make_doc(pages=[make_page(text='', blocks=blocks)])
+ result = ContentMdService.render(doc, title='T')
+ assert '<figure>\nA sunset over mountains\n</figure>' in result
+
+ def test_image_block_without_alt_text(self):
+ blocks = [make_image_block()]
+ doc = make_doc(pages=[make_page(text='', blocks=blocks)])
+ result = ContentMdService.render(doc, title='T')
+ assert '<figure>\n\n</figure>' in result
+
+ def test_table_block_rendered_as_is(self):
+ table_text = '| Col A | Col B |\n| ----- | ----- |\n| 1 | 2 |'
+ blocks = [make_table_block(table_text)]
+ doc = make_doc(pages=[make_page(text='', blocks=blocks)])
+ result = ContentMdService.render(doc, title='T')
+ assert table_text in result
+
+ def test_page_without_blocks_uses_page_text(self):
+ page = make_page(text='Fallback page text', blocks=None)
+ doc = make_doc(pages=[page])
+ result = ContentMdService.render(doc, title='T')
+ assert 'Fallback page text' in result
+
+ def test_empty_page_text_not_rendered(self):
+ page = make_page(text=' ', blocks=None)
+ doc = make_doc(pages=[page])
+ result = ContentMdService.render(doc, title='T')
+ body = result.split('---\n', 2)[-1].strip()
+ assert body == '# T'
+
+
+# ---------------------------------------------------------------------------
+# Whitespace normalisation
+# ---------------------------------------------------------------------------
+
+
+class TestWhitespaceNormalisation:
+ def test_multiple_spaces_in_paragraph_collapsed(self):
+ blocks = [make_text_block('Word1   Word2    Word3')]
+ doc = make_doc(pages=[make_page(text='', blocks=blocks)])
+ result = ContentMdService.render(doc, title='T')
+ assert 'Word1 Word2 Word3' in result
+
+ def test_tabs_in_paragraph_collapsed(self):
+ blocks = [make_text_block('Word1\t\tWord2')]
+ doc = make_doc(pages=[make_page(text='', blocks=blocks)])
+ result = ContentMdService.render(doc, title='T')
+ assert 'Word1 Word2' in result
+
+ def test_whitespace_in_heading_collapsed(self):
+ blocks = [make_text_block('My   Section', role='heading', level=1)]
+ doc = make_doc(pages=[make_page(text='', blocks=blocks)])
+ result = ContentMdService.render(doc, title='T')
+ assert '## My Section' in result
+
+ def test_whitespace_in_title_collapsed(self):
+ blocks = [make_text_block('  My   Title  ', role='doc-title')]
+ doc = make_doc(pages=[make_page(text='', blocks=blocks)])
+ result = ContentMdService.render(doc)
+ assert 'title: "My Title"' in result
+
+ def test_whitespace_in_description_collapsed(self):
+ blocks = [make_text_block('Summary   with   gaps.', role='doc-abstract')]
+ doc = make_doc(pages=[make_page(text='', blocks=blocks)])
+ result = ContentMdService.render(doc, title='T')
+ assert 'description: "Summary with gaps."' in result
+
+ def test_table_whitespace_preserved(self):
+ table_text = '| Col A   | Col B   |\n| ------- | ------- |'
+ blocks = [make_table_block(table_text)]
+ doc = make_doc(pages=[make_page(text='', blocks=blocks)])
+ result = ContentMdService.render(doc, title='T')
+ assert '| Col A   | Col B   |' in result
+
+
+# ---------------------------------------------------------------------------
+# Output structure
+# ---------------------------------------------------------------------------
+
+
+class TestOutputStructure:
+ def test_result_ends_with_newline(self, minimal_doc):
+ result = ContentMdService.render(minimal_doc, title='T')
+ assert result.endswith('\n')
+
+ def test_empty_pages_list_returns_frontmatter_and_title(self):
+ doc = Document(pages=[])
+ result = ContentMdService.render(doc, title='Empty')
+ assert 'title: "Empty"' in result
+ assert '# Empty' in result
+
+ def test_blocks_separated_by_blank_line(self):
+ blocks = [
+ make_text_block('First paragraph.'),
+ make_text_block('Second paragraph.'),
+ ]
+ doc = make_doc(pages=[make_page(text='', blocks=blocks)])
+ result = ContentMdService.render(doc, title='T')
+ assert 'First paragraph.\n\nSecond paragraph.' in result
+
+ def test_multipage_document_renders_all_pages(self):
+ page1 = make_page(
+ number=1,
+ text='',
+ blocks=[make_text_block('Page one content.')],
+ )
+ page2 = make_page(
+ number=2,
+ text='',
+ blocks=[make_text_block('Page two content.')],
+ )
+ doc = make_doc(pages=[page1, page2])
+ result = ContentMdService.render(doc, title='T')
+ assert 'Page one content.' in result
+ assert 'Page two content.' in result
+
+ def test_render_delegates_from_document_method(self, metadata_doc):
+ via_service = ContentMdService.render(metadata_doc)
+ via_method = metadata_doc.contentmd()
+ assert via_service == via_method
+
+ def test_empty_document_without_args_raises(self):
+ """A document with no metadata, no blocks, no filename, and no user
+ arguments cannot satisfy the required title constraint."""
+ doc = Document(pages=[])
+ with pytest.raises(ValueError, match='no title could be resolved'):
+ ContentMdService.render(doc)
+
+ def test_empty_document_with_title_arg_returns_contentmd(self):
+ """Passing title= explicitly must succeed even when the document is
+ completely empty."""
+ doc = Document(pages=[])
+ result = ContentMdService.render(doc, title='Provided Title')
+ assert 'title: "Provided Title"' in result
+ assert '# Provided Title' in result
+
+ def test_empty_document_with_title_and_description_returns_contentmd(self):
+ """Both title= and description= passed explicitly on an empty document."""
+ doc = Document(pages=[])
+ result = ContentMdService.render(
+ doc, title='My Title', description='My description.'
+ )
+ assert 'title: "My Title"' in result
+ assert 'description: "My description."' in result
+ assert result.endswith('\n')
+
+
+class TestPageSeparators:
+ """Tests for page_separators support in ContentMdService and Document.markdown."""
+
+ def test_contentmd_page_separators_off_by_default(self):
+ page = make_page(number=1, text='', blocks=[make_text_block('Content.')])
+ doc = make_doc(pages=[page])
+ result = ContentMdService.render(doc, title='T')
+ assert '<!--' not in result
+
+ def test_contentmd_page_separators_multipage(self):
+ page1 = make_page(number=1, text='', blocks=[make_text_block('Page one.')])
+ page2 = make_page(number=2, text='', blocks=[make_text_block('Page two.')])
+ doc = make_doc(pages=[page1, page2])
+ result = ContentMdService.render(doc, title='T', page_separators=True)
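+ import re  # local import; only the HTML comment delimiters are assumed here
+ separators = re.findall(r'<!--.*?-->', result)
+ # Separators appear in correct order relative to each other
+ assert len(separators) == 2 and '1' in separators[0] and '2' in separators[1]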
+
+ def test_contentmd_page_separators_via_document_method(self):
+ page = make_page(number=3, text='', blocks=[make_text_block('Content.')])
+ doc = make_doc(pages=[page])
+ result = doc.contentmd(title='T', page_separators=True)
+ assert '3' in result[result.index('<!--') : result.index('-->')]
+
+ def test_markdown_page_separators_off_by_default(self):
+ doc = Document(pages=[Page(number=1, text='Hello world')])
+ result = doc.markdown()
+ assert '<!--' not in result
+
+ def test_markdown_page_separators_multipage(self):
+ doc = Document(
+ pages=[
+ Page(number=1, text='First page'),
+ Page(number=2, text='Second page'),
+ ]
+ )
+ result = doc.markdown(page_separators=True)
+ first = result.index('<!--')
+ second = result.index('<!--', first + 1)
+ assert first < result.index('First page')
+ assert second < result.index('Second page')
+
+ def test_markdown_page_separators_empty_page_still_emits_comment(self):
+ doc = Document(
+ pages=[
+ Page(number=1, text='Content'),
+ Page(number=2, text=''), # empty page
+ ]
+ )
+ result = doc.markdown(page_separators=True)
+ # Both pages, including the empty one, contribute an HTML comment
+ assert result.count('<!--') >= 2