diff --git a/docs/howto/pdf_manipulation.md b/docs/howto/pdf_manipulation.md index 6f7c2f7..ba26d45 100644 --- a/docs/howto/pdf_manipulation.md +++ b/docs/howto/pdf_manipulation.md @@ -102,7 +102,7 @@ parxy pdf:merge file1.pdf file2.pdf -o /output/dir/merged.pdf ## Splitting PDFs -The `pdf:split` command divides a single PDF into individual pages, with each page becoming a separate PDF file. +The `pdf:split` command divides a single PDF into individual pages, with each page becoming a separate PDF file. You can optionally limit which pages are extracted and combine them into a single output PDF. ### Basic Splitting @@ -139,6 +139,51 @@ Creates files named: - `chapter_page_2.pdf` - etc. +### Extracting a Page Range + +Use `--pages` to limit which pages are extracted (1-based indexing): + +**Single page:** +```bash +parxy pdf:split document.pdf --pages 3 +``` + +**Page range:** +```bash +parxy pdf:split document.pdf --pages 2:5 +``` + +**From start to page N:** +```bash +parxy pdf:split document.pdf --pages :5 +``` + +**From page N to end:** +```bash +parxy pdf:split document.pdf --pages 3: +``` + +### Combining Pages into a Single PDF + +Use `--combine` to extract a page range into a single output PDF instead of one file per page: + +```bash +# Extract pages 2–5 as a single PDF (auto-named) +parxy pdf:split document.pdf --pages 2:5 --combine +# Output: document_pages_2-5.pdf (next to the input file) + +# Specify a custom output path +parxy pdf:split document.pdf --pages 2:5 --combine -o extracted.pdf + +# Extract a single page as a PDF +parxy pdf:split document.pdf --pages 3 --combine -o page3.pdf + +# Combine all pages (equivalent to a copy) +parxy pdf:split document.pdf --combine -o copy.pdf +``` + +> **Tip:** `--combine` pairs well with `--pages` to replace the `pdf:merge file.pdf[2:5]` pattern when working with a single source file. 
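The `--pages` formats above are 1-based, while the Python API shown later in this change takes 0-based `from_page` / `to_page` indices. As a rough illustration of how the two conventions relate, the following is a minimal sketch, not Parxy's actual implementation. The helper name `parse_page_range` and its validation rules are assumptions made for this example only.

```python
from typing import Optional, Tuple


def parse_page_range(spec: str) -> Tuple[Optional[int], Optional[int]]:
    """Convert a 1-based page spec into 0-based inclusive (from_page, to_page).

    Hypothetical helper for illustration only; it is not part of Parxy.
    Accepted forms: "3", "2:5", ":5", "3:". None means an open-ended bound.
    """
    spec = spec.strip()

    # Single page, e.g. "3": both bounds point at the same page.
    if ':' not in spec:
        page = int(spec)  # raises ValueError on non-numeric input
        if page < 1:
            raise ValueError(f'Page numbers are 1-based, got {page}')
        return page - 1, page - 1

    # Range, e.g. "2:5", ":5" or "3:"; an empty side stays open-ended.
    start, _, end = spec.partition(':')
    from_page = int(start) - 1 if start.strip() else None
    to_page = int(end) - 1 if end.strip() else None

    if from_page is not None and from_page < 0:
        raise ValueError(f'Page numbers are 1-based, got {start}')
    if to_page is not None and to_page < 0:
        raise ValueError(f'Page numbers are 1-based, got {end}')
    if from_page is not None and to_page is not None and from_page > to_page:
        raise ValueError(f'Range start is after range end: {spec}')

    return from_page, to_page


print(parse_page_range('2:5'))  # (1, 4)
print(parse_page_range(':5'))   # (None, 4)
print(parse_page_range('3:'))   # (2, None)
```

Under these assumptions, `--pages 2:5` corresponds to `from_page=1, to_page=4`, which matches the tutorial example that writes `doc_page_2.pdf` through `doc_page_5.pdf`.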
+ ### Complete Examples **Split with custom output directory:** @@ -161,14 +206,25 @@ Creates: parxy pdf:split document.pdf -o ./individual_pages -p page ``` +**Extract pages 10–20 as individual files:** +```bash +parxy pdf:split document.pdf --pages 10:20 -o ./extracted_pages +``` + ## Combining Merge and Split You can chain operations together using the CLI: **Example: Extract specific pages and split them:** ```bash -# First, extract pages 10-20 -parxy pdf:merge document.pdf[10:20] -o extracted.pdf +# Extract pages 10-20 as individual files +parxy pdf:split document.pdf --pages 10:20 -o ./individual_pages +``` + +**Example: Extract a range into a single PDF, then split:** +```bash +# First, extract pages 10-20 into one PDF +parxy pdf:split document.pdf --pages 10:20 --combine -o extracted.pdf # Then split into individual pages parxy pdf:split extracted.pdf -o ./individual_pages @@ -232,17 +288,21 @@ parxy pdf:split INPUT_FILE [OPTIONS] ``` **Arguments:** -- `INPUT_FILE`: PDF file to split into individual pages +- `INPUT_FILE`: PDF file to split **Options:** -- `--output, -o`: Output directory (default: `{filename}_split/`) -- `--prefix, -p`: Output filename prefix (default: input filename) +- `--output, -o`: Without `--combine`: output directory (default: `{filename}_split/`). With `--combine`: output file path (default: `{filename}_pages_{from}-{to}.pdf` next to the input). +- `--prefix, -p`: Output filename prefix for individual split files (default: input filename) +- `--pages`: Page range to extract, 1-based. 
Formats: `3` (single page), `2:5` (range), `:5` (up to page 5), `3:` (from page 3 to end) +- `--combine`: Combine extracted pages into a single PDF instead of one file per page **Examples:** ```bash parxy pdf:split document.pdf parxy pdf:split document.pdf -o ./pages parxy pdf:split document.pdf -o ./pages -p page +parxy pdf:split document.pdf --pages 2:5 +parxy pdf:split document.pdf --pages 2:5 --combine -o extracted.pdf ``` ## Getting Help diff --git a/docs/tutorials/pdf_manipulation.md b/docs/tutorials/pdf_manipulation.md index d101746..afa8604 100644 --- a/docs/tutorials/pdf_manipulation.md +++ b/docs/tutorials/pdf_manipulation.md @@ -81,6 +81,44 @@ for page_path in pages: # ... ``` +You can limit splitting to a page range using 0-based `from_page` / `to_page` indices: + +```python +# Split only pages 2–5 (0-based: indices 1–4) +pages = Parxy.pdf.split( + input_path=Path("document.pdf"), + output_dir=Path("./pages"), + prefix="doc", + from_page=1, + to_page=4, +) +# Creates: doc_page_2.pdf, doc_page_3.pdf, doc_page_4.pdf, doc_page_5.pdf +``` + +### Extracting Pages into a Single PDF + +Use `extract_pages` to pull a page range from a PDF into a new single-file PDF without splitting each page individually: + +```python +from pathlib import Path +from parxy_core.services.pdf_service import PdfService + +# Extract pages 3–7 (0-based: indices 2–6) +PdfService.extract_pages( + input_path=Path("report.pdf"), + output_path=Path("summary.pdf"), + from_page=2, + to_page=6, +) +``` + +Omit `from_page` / `to_page` to copy all pages: + +```python +# Equivalent to a copy +PdfService.extract_pages(Path("original.pdf"), Path("copy.pdf")) +``` + ### Optimizing PDFs Reduce PDF file size using compression techniques: @@ -302,6 +340,12 @@ try: except FileNotFoundError as e: print(f"File not found: {e}") +# ValueError for invalid page ranges +try: + Parxy.pdf.split(Path("doc.pdf"), Path("./out"), "doc", from_page=100) +except ValueError as e: + print(f"Invalid page range: {e}") + 
# ValueError for invalid parameters try: Parxy.pdf.optimize( @@ -332,7 +376,8 @@ except RuntimeError as e: In this tutorial you learned: - **`Parxy.pdf.merge()`** - Combine multiple PDFs with optional page ranges -- **`Parxy.pdf.split()`** - Split a PDF into individual page files +- **`Parxy.pdf.split()`** - Split a PDF into individual page files, with optional page range +- **`PdfService.extract_pages()`** - Extract a page range into a single output PDF - **`Parxy.pdf.optimize()`** - Reduce file size with compression options - **`PdfService` context manager** - Work with attachments (add, list, extract, remove) @@ -344,6 +389,7 @@ In this tutorial you learned: | Splitting into pages | Extracting attachment content | | Optimizing file size | Multiple operations on one file | | One-shot operations | Need fine-grained control | +| Splitting a page range | Extracting a page range into one PDF (`extract_pages`) | ## Next Steps diff --git a/docs/tutorials/using_cli.md b/docs/tutorials/using_cli.md index a350ed1..523c28e 100644 --- a/docs/tutorials/using_cli.md +++ b/docs/tutorials/using_cli.md @@ -14,7 +14,7 @@ The Parxy CLI lets you: | `parxy preview` | Interactive document viewer with metadata, table of contents, and scrollable content preview | | `parxy markdown` | Convert documents to Markdown files, with support for multiple drivers and folder processing | | `parxy pdf:merge`| Merge multiple PDF files into one, with support for page ranges | -| `parxy pdf:split`| Split a PDF file into individual pages | +| `parxy pdf:split`| Split a PDF into individual pages, with optional page range and single-file extraction | | `parxy drivers` | List available document processing drivers | | `parxy env` | Generate a default `.env` configuration file | | `parxy docker` | Create a Docker Compose configuration for running Parxy-related services | @@ -218,6 +218,42 @@ parxy markdown document.pdf -d pymupdf -d llamaparse This produces `pymupdf-document.md` and 
`llamaparse-document.md`. +### Converting Pre-parsed JSON Results + +If you have a JSON file produced by `parxy parse -m json`, you can convert it to Markdown directly without re-parsing: + +```bash +parxy markdown result.json +``` + +This loads the `Document` model from the JSON and converts it immediately — no driver or API call required. You can mix JSON files and PDF files in the same invocation: + +```bash +parxy markdown result.json document.pdf -d pymupdf -o output/ +``` + +### Page Separator Comments + +Use `--page-separators` to insert HTML comments before each page's content: + +```bash +parxy markdown document.pdf --page-separators +``` + +Output will contain markers like: + +```markdown + + +First page content... + + + +Second page content... +``` + +This is useful for post-processing scripts that need to identify page boundaries. + ### Inline Output Use `--inline` with a single file to print markdown directly to stdout with a YAML frontmatter header — useful for shell pipelines: @@ -276,7 +312,7 @@ parxy pdf:merge cover.pdf /chapters doc.pdf[10:20] appendix.pdf -o book.pdf ### Splitting PDFs -The `pdf:split` command divides a PDF file into individual pages, with each page becoming a separate PDF file. +The `pdf:split` command divides a PDF file into individual pages, with optional page range extraction and single-file output. **Split into individual pages:** ```bash @@ -290,7 +326,21 @@ This creates a `document_split/` folder containing `document_page_1.pdf`, `docum parxy pdf:split report.pdf -o ./pages -p page ``` -Creates `page_1.pdf`, `page_2.pdf`, etc. in the `./pages` directory. 
+**Extract a page range as individual files:** +```bash +parxy pdf:split document.pdf --pages 2:5 -o ./pages +``` + +**Combine a page range into a single PDF:** +```bash +# Auto-named output next to the input file +parxy pdf:split document.pdf --pages 2:5 --combine + +# Custom output path +parxy pdf:split document.pdf --pages 2:5 --combine -o extracted.pdf +``` + +Page range formats (1-based): `3` · `2:5` · `:5` · `3:` For more detailed examples and use cases, see the [PDF Manipulation How-to Guide](../howto/pdf_manipulation.md). @@ -358,9 +408,9 @@ With the CLI, you can use Parxy as a **standalone document parsing tool** — id |------------------|--------------------------------------------------------------| | `parxy parse` | Extract text from documents with multiple formats & drivers | | `parxy preview` | Interactive document viewer with metadata and TOC | -| `parxy markdown` | Generate Markdown files with driver prefix naming | +| `parxy markdown` | Generate Markdown files; accepts JSON results and supports `--page-separators` | | `parxy pdf:merge`| Merge multiple PDF files with page range support | -| `parxy pdf:split`| Split PDF files into individual pages | +| `parxy pdf:split`| Split PDF into individual pages; supports `--pages` and `--combine` | | `parxy drivers` | List supported drivers | | `parxy env` | Create default configuration file | | `parxy docker` | Generate Docker Compose setup | diff --git a/src/parxy_cli/commands/markdown.py b/src/parxy_cli/commands/markdown.py index dfba9f6..2ccd009 100644 --- a/src/parxy_cli/commands/markdown.py +++ b/src/parxy_cli/commands/markdown.py @@ -2,11 +2,13 @@ from datetime import timedelta from pathlib import Path -from typing import Optional, List, Annotated +from typing import Optional, List, Annotated, Tuple import typer +from pydantic import ValidationError from parxy_core.facade import Parxy +from parxy_core.models import Document from parxy_cli.models import Level from parxy_cli.console.console import Console 
@@ -91,14 +93,27 @@ def markdown( min=1, ), ] = None, + page_separators: Annotated[ + bool, + typer.Option( + '--page-separators', + help="Insert HTML comments before each page's content.", + ), + ] = False, ): """Parse documents to Markdown. + Accepts PDF files (parsed on-the-fly) or pre-parsed JSON result files + (loaded directly from the Document model without re-parsing). + Examples: # Parse a single file parxy markdown document.pdf + # Convert a pre-parsed JSON result directly to markdown + parxy markdown result.json + # Parse with a specific driver and output to a folder parxy markdown document.pdf -d pymupdf -o output/ @@ -110,6 +125,9 @@ def markdown( # Output to stdout as YAML-frontmattered markdown (single file only) parxy markdown document.pdf --inline + + # Include page separator comments in the output + parxy markdown document.pdf --page-separators """ console.action('Markdown export', space_after=False) @@ -120,85 +138,118 @@ def markdown( console.warning('No suitable files found to process.', panel=True) raise typer.Exit(1) - if inline and len(files) > 1: + # Partition into pre-parsed JSON files and files to parse + json_files = [f for f in files if f.suffix.lower() == '.json'] + parse_files = [f for f in files if f.suffix.lower() != '.json'] + + if inline and len(json_files) + len(parse_files) > 1: console.error('--inline can only be used with a single file') raise typer.Exit(1) - # Use default driver if none specified + # Use default driver if none specified (only needed for parse_files) if not drivers: drivers = [Parxy.default_driver()] output_path = Path(output_dir) if output_dir else None - total_tasks = len(files) * len(drivers) + total_tasks = len(json_files) + len(parse_files) * len(drivers) error_count = 0 + elapsed_time = '0 sec' + + def _write_markdown( + doc: Document, file_path: Path, driver_label: str | None + ) -> None: + """Write markdown content to file or stdout.""" + content = doc.markdown(page_separators=page_separators) + if 
inline: + frontmatter = f'---\nfile: "{file_path}"\npages: {len(doc.pages)}\n---\n\n' + console.print(frontmatter + content) + else: + if output_path: + output_path.mkdir(parents=True, exist_ok=True) + save_dir = output_path + else: + save_dir = file_path.parent + + base_name = file_path.stem + if driver_label: + base_name = f'{driver_label}-{base_name}' + + out_file = save_dir / f'{base_name}.md' + out_file.write_text(content, encoding='utf-8') + + via = f'via {driver_label} ' if driver_label else '' + console.print( + f'[faint]⎿ [/faint] {file_path.name} {via}to [success]{out_file}[/success] [faint]({len(doc.pages)} pages)[/faint]' + ) try: with console.shimmer( - f'Processing {len(files)} file{"s" if len(files) > 1 else ""} with {len(drivers)} driver{"s" if len(drivers) > 1 else ""}...' + f'Processing {len(files)} file{"s" if len(files) > 1 else ""}...' ): with console.progress('Processing documents') as progress: task = progress.add_task('', total=total_tasks) - batch_tasks = [str(f) for f in files] - - for result in Parxy.batch_iter( - tasks=batch_tasks, - drivers=drivers, - level=level.value, - workers=workers, - ): - file_name = ( - Path(result.file).name - if isinstance(result.file, str) - else 'document' - ) - - if result.success: - doc = result.document - file_path = ( - Path(result.file) - if isinstance(result.file, str) - else Path('document') + # Process pre-parsed JSON files directly + for json_file in json_files: + try: + doc = Document.model_validate_json( + json_file.read_text(encoding='utf-8') ) - - content = doc.markdown() - - if inline: - frontmatter = f'---\nfile: "{result.file}"\npages: {len(doc.pages)}\n---\n\n' - console.print(frontmatter + content) - else: - if output_path: - output_path.mkdir(parents=True, exist_ok=True) - save_dir = output_path - else: - save_dir = file_path.parent - - base_name = file_path.stem - if result.driver: - base_name = f'{result.driver}-{base_name}' - - out_file = save_dir / f'{base_name}.md' - 
out_file.write_text(content, encoding='utf-8') - - console.print( - f'[faint]⎿ [/faint] {file_name} via {result.driver} to [success]{out_file}[/success] [faint]({len(doc.pages)} pages)[/faint]' - ) - else: + _write_markdown( + doc, json_file.with_suffix(''), driver_label=None + ) + except (ValidationError, ValueError) as e: console.print( - f'[faint]⎿ [/faint] {file_name} via {result.driver} error. [error]{result.error}[/error]' + f'[faint]⎿ [/faint] {json_file.name} error. [error]{e}[/error]' ) error_count += 1 - if stop_on_failure: console.newline() console.info( 'Stopping due to error (--stop-on-failure flag is set)' ) raise typer.Exit(1) - progress.update(task, advance=1) + # Process files that need parsing + if parse_files: + for result in Parxy.batch_iter( + tasks=[str(f) for f in parse_files], + drivers=drivers, + level=level.value, + workers=workers, + ): + file_name = ( + Path(result.file).name + if isinstance(result.file, str) + else 'document' + ) + + if result.success: + file_path = ( + Path(result.file) + if isinstance(result.file, str) + else Path('document') + ) + _write_markdown( + result.document, file_path, driver_label=result.driver + ) + else: + console.print( + f'[faint]⎿ [/faint] {file_name} via {result.driver} error. 
[error]{result.error}[/error]' + ) + error_count += 1 + + if stop_on_failure: + console.newline() + console.info( + 'Stopping due to error (--stop-on-failure flag is set)' + ) + raise typer.Exit(1) + + progress.update(task, advance=1) + elapsed_time = format_timedelta( timedelta(seconds=max(0, progress.tasks[0].elapsed)) ) @@ -210,13 +261,13 @@ def markdown( if not inline: console.newline() - if error_count == len(files) * len(drivers): + if error_count == total_tasks: console.error('All files were not processed due to errors') return if error_count > 0: console.warning( - f'Processed {len(files)} file{"s" if len(files) > 1 else ""} with warnings using {len(drivers)} driver{"s" if len(drivers) > 1 else ""}' + f'Processed {len(files)} file{"s" if len(files) > 1 else ""} with warnings' ) console.print( f'[faint]⎿ [/faint] [highlight]{error_count} files errored[/highlight]' @@ -225,5 +276,5 @@ def markdown( if not inline: console.success( - f'Processed {len(files)} file{"s" if len(files) > 1 else ""} using {len(drivers)} driver{"s" if len(drivers) > 1 else ""} (took {elapsed_time})' + f'Processed {len(files)} file{"s" if len(files) > 1 else ""} (took {elapsed_time})' ) diff --git a/src/parxy_core/models/models.py b/src/parxy_core/models/models.py index 258b847..b965c56 100644 --- a/src/parxy_core/models/models.py +++ b/src/parxy_core/models/models.py @@ -155,7 +155,52 @@ def text(self, page_separator: str = '---') -> str: return '\n'.join(texts) - def markdown(self) -> str: + def contentmd( + self, + title: Optional[str] = None, + description: Optional[str] = None, + date: Optional[str] = None, + license: Optional[str] = None, + author: Optional[str] = None, + page_separators: bool = False, + ) -> str: + """Get the document content formatted as content-md. + + Delegates to :class:`~parxy_core.services.ContentMdService`. + + Parameters + ---------- + title : str, optional + Document title. 
Falls back to metadata.title, a heading inferred + from the first page, then filename; raises ValueError if none resolves. + description : str, optional + Short summary (~200 characters). Falls back to a doc-abstract block, + then the first five body TextBlocks across the first two pages. + date : str, optional + Creation/publication date in ISO 8601. Falls back to metadata dates. + license : str, optional + License name or SPDX identifier. + author : str, optional + Author name. Falls back to metadata.author. + + Returns + ------- + str + The document content formatted as content-md. + """ + from parxy_core.services.contentmd_service import ContentMdService + + return ContentMdService.render( + self, + title=title, + description=description, + date=date, + license=license, + author=author, + page_separators=page_separators, + ) + + def markdown(self, page_separators: bool = False) -> str: """Get the document content formatted as Markdown. The method attempts to preserve the document structure by: @@ -163,6 +208,12 @@ def markdown(self) -> str: 2. Preserving line breaks where meaningful 3. 
Adding section headers based on block levels + Parameters + ---------- + page_separators : bool, optional + When True, inserts an HTML comment ```` before + each page's content, by default False + Returns ------- str @@ -174,48 +225,50 @@ def markdown(self) -> str: markdown_parts = [] for page in self.pages: - if not page.blocks: - if page.text.strip(): - markdown_parts.append(page.text.strip()) - continue - page_parts = [] - for block in page.blocks: - if isinstance(block, TextBlock): - # Handle different block categories - if block.category and block.category.lower() in [ - 'heading', - 'title', - 'header', - ]: - # Determine heading level (h1-h6) based on block level or default to h2 - level = min(block.level or 2, 6) - page_parts.append(f'{"#" * level} {block.text.strip()}') - elif block.category and block.category.lower() == 'list': - # Convert to bullet points - for line in block.text.splitlines(): - if line.strip(): - page_parts.append(f'- {line.strip()}') - else: - # Regular paragraph + if page_separators: + page_parts.append(f'') + + if not page.blocks: + if page.text.strip(): + page_parts.append(page.text.strip()) + else: + for block in page.blocks: + if isinstance(block, TextBlock): + # Handle different block categories + if block.category and block.category.lower() in [ + 'heading', + 'title', + 'header', + ]: + # Determine heading level (h1-h6) based on block level or default to h2 + level = min(block.level or 2, 6) + page_parts.append(f'{"#" * level} {block.text.strip()}') + elif block.category and block.category.lower() == 'list': + # Convert to bullet points + for line in block.text.splitlines(): + if line.strip(): + page_parts.append(f'- {line.strip()}') + else: + # Regular paragraph + if block.text.strip(): + page_parts.append(block.text.strip()) + + elif isinstance(block, ImageBlock): + ext = ( + block.name.rsplit('.', 1)[-1] + if block.name and '.' 
in block.name + else '' + ) + lang = f'image:{ext}' if ext else 'image' + alt = block.alt_text or '' + page_parts.append(f'```{lang}\n{alt}\n```') + + elif isinstance(block, TableBlock): if block.text.strip(): page_parts.append(block.text.strip()) - elif isinstance(block, ImageBlock): - ext = ( - block.name.rsplit('.', 1)[-1] - if block.name and '.' in block.name - else '' - ) - lang = f'image:{ext}' if ext else 'image' - alt = block.alt_text or '' - page_parts.append(f'```{lang}\n{alt}\n```') - - elif isinstance(block, TableBlock): - if block.text.strip(): - page_parts.append(block.text.strip()) - if page_parts: markdown_parts.append('\n\n'.join(page_parts)) diff --git a/src/parxy_core/services/__init__.py b/src/parxy_core/services/__init__.py index 5071d08..5342a63 100644 --- a/src/parxy_core/services/__init__.py +++ b/src/parxy_core/services/__init__.py @@ -1,5 +1,6 @@ """Services module for parxy_core.""" +from parxy_core.services.contentmd_service import ContentMdService from parxy_core.services.pdf_service import PdfService -__all__ = ['PdfService'] +__all__ = ['ContentMdService', 'PdfService'] diff --git a/src/parxy_core/services/contentmd_service.py b/src/parxy_core/services/contentmd_service.py new file mode 100644 index 0000000..039ab38 --- /dev/null +++ b/src/parxy_core/services/contentmd_service.py @@ -0,0 +1,273 @@ +"""Service for rendering documents as content-md.""" + +from __future__ import annotations + +from typing import TYPE_CHECKING, Optional + +if TYPE_CHECKING: + from parxy_core.models.models import Document + + +class ContentMdService: + """Render a :class:`Document` as a content-md string. + + content-md is an open specification for optimised content exchange: a YAML + frontmatter section followed by CommonMark / GitHub-flavoured Markdown. + All methods are static; the class acts as a namespace. 
+ """ + + # ------------------------------------------------------------------ + # Private helpers + # ------------------------------------------------------------------ + + # Roles that provide structure or navigation rather than readable body text + _STRUCTURAL_ROLES: frozenset[str] = frozenset( + { + 'heading', + 'doc-title', + 'doc-subtitle', + 'doc-abstract', + 'doc-toc', + 'doc-pageheader', + 'doc-pagefooter', + 'caption', + } + ) + + @staticmethod + def _normalize(text: str) -> str: + """Collapse any run of whitespace to a single space and strip.""" + return ' '.join(text.split()) + + @staticmethod + def _yaml_str(value: str) -> str: + """Wrap *value* in double quotes and escape internal quotes/backslashes.""" + return '"' + value.replace('\\', '\\\\').replace('"', '\\"') + '"' + + @staticmethod + def _guess_title(document: Document) -> Optional[str]: + """Infer a title from the first page blocks. + + Prefers an explicit ``doc-title`` role; falls back to the + highest-ranking (lowest level number) ``heading`` block. + """ + from parxy_core.models.models import TextBlock + + if not document.pages: + return None + first_page = document.pages[0] + if not first_page.blocks: + return None + + doc_title = next( + ( + b + for b in first_page.blocks + if isinstance(b, TextBlock) and b.role == 'doc-title' and b.text.strip() + ), + None, + ) + if doc_title: + return ContentMdService._normalize(doc_title.text) + + headings = [ + b + for b in first_page.blocks + if isinstance(b, TextBlock) and b.role == 'heading' and b.text.strip() + ] + if not headings: + return None + return ContentMdService._normalize( + min(headings, key=lambda b: b.level or 1).text + ) + + @staticmethod + def _infer_description(document: Document) -> Optional[str]: + """Infer a description from document content. + + Uses the ``doc-abstract`` block when present. 
Otherwise concatenates + the first five body :class:`TextBlock` objects (non-structural, across + the first two pages), normalises whitespace, and returns at most 200 + characters. + """ + from parxy_core.models.models import TextBlock + + blocks = [ + b + for page in document.pages[:2] + if page.blocks + for b in page.blocks + if isinstance(b, TextBlock) and b.text.strip() + ] + + abstract = next((b for b in blocks if b.role == 'doc-abstract'), None) + if abstract: + return ContentMdService._normalize(abstract.text) + + body_blocks = [ + b + for b in blocks + if (b.role or 'generic') not in ContentMdService._STRUCTURAL_ROLES + ] + if not body_blocks: + return None + + combined = ' '.join(b.text for b in body_blocks[:5]) + return ContentMdService._normalize(combined)[:200] + + @staticmethod + def _build_frontmatter( + title: str, + description: Optional[str], + date: Optional[str], + license: Optional[str], + author: Optional[str], + ) -> str: + ys = ContentMdService._yaml_str + lines = ['---', f'title: {ys(title)}'] + if description: + lines.append(f'description: {ys(description)}') + if date: + lines.append(f'date: {ys(date)}') + if license: + lines.append(f'license: {ys(license)}') + if author: + lines.append(f'author: {ys(author)}') + lines.append('---') + return '\n'.join(lines) + + @staticmethod + def _build_body( + document: Document, title: str, page_separators: bool = False + ) -> str: + from parxy_core.models.models import ImageBlock, TableBlock, TextBlock + + normalize = ContentMdService._normalize + parts = [f'# {title}'] + + for page in document.pages: + if page_separators: + parts.append(f'') + + if not page.blocks: + if page.text.strip(): + parts.append(normalize(page.text)) + continue + + for block in page.blocks: + role = (block.role or 'generic').lower() + + if isinstance(block, TextBlock): + if role == 'doc-title': + # Already the top-level h1 — skip to avoid duplication + pass + elif role == 'heading': + # Shift levels +1: h1 content → h2, per 
content-md spec + shifted = min((block.level or 1) + 1, 6) + parts.append(f'{"#" * shifted} {normalize(block.text)}') + elif role in ('list', 'listitem'): + for line in block.text.splitlines(): + if line.strip(): + parts.append(f'- {normalize(line)}') + elif role == 'doc-abstract': + lang_attr = ( + f' lang="{document.language}"' if document.language else '' + ) + parts.append( + f'\n{normalize(block.text)}\n' + ) + else: + normalized = normalize(block.text) + if normalized: + parts.append(normalized) + + elif isinstance(block, ImageBlock): + parts.append(f'
\n{block.alt_text or ""}\n
') + + elif isinstance(block, TableBlock): + # Preserve table whitespace (column alignment, padding) + if block.text.strip(): + parts.append(block.text.strip()) + + return '\n\n'.join(parts) + + # ------------------------------------------------------------------ + # Public API + # ------------------------------------------------------------------ + + @staticmethod + def render( + document: Document, + title: Optional[str] = None, + description: Optional[str] = None, + date: Optional[str] = None, + license: Optional[str] = None, + author: Optional[str] = None, + page_separators: bool = False, + ) -> str: + """Render *document* as a content-md string. + + Parameters + ---------- + document: + The document to render. + title: + Document title. Falls back to ``metadata.title``, a heading + inferred from the first page, then ``filename``. Raises + ``ValueError`` if no title can be resolved. + description: + Short summary (~200 characters). Falls back to a ``doc-abstract`` + block, then the first five body blocks in the first two pages. + date: + Creation/publication date in ISO 8601. Falls back to + ``metadata.created_at`` / ``metadata.updated_at``. + license: + License name or SPDX identifier. + author: + Author name. Falls back to ``metadata.author``. + page_separators: + When True, inserts ```` before each page's + content in the body. + + Returns + ------- + str + The document formatted as content-md. + """ + resolved_title = ( + title + or (document.metadata.title if document.metadata else None) + or ContentMdService._guess_title(document) + or document.filename + ) + if not resolved_title: + raise ValueError( + 'Cannot render content-md: no title could be resolved. ' + 'Provide a title via metadata, a doc-title/heading block, ' + 'a filename, or pass title= explicitly.' 
+ ) + resolved_description = description or ContentMdService._infer_description( + document + ) + resolved_date = date or ( + (document.metadata.created_at or document.metadata.updated_at) + if document.metadata + else None + ) + resolved_author = author or ( + document.metadata.author if document.metadata else None + ) + + frontmatter = ContentMdService._build_frontmatter( + title=resolved_title, + description=resolved_description, + date=resolved_date, + license=license, + author=resolved_author, + ) + + if not document.pages: + return f'{frontmatter}\n\n# {resolved_title}\n' + + body = ContentMdService._build_body(document, resolved_title, page_separators) + return f'{frontmatter}\n\n{body}\n' diff --git a/tests/commands/test_markdown.py b/tests/commands/test_markdown.py index 88b4d74..b4e772f 100644 --- a/tests/commands/test_markdown.py +++ b/tests/commands/test_markdown.py @@ -278,3 +278,136 @@ def test_markdown_command_no_files_found(runner, tmp_path): result = runner.invoke(app, [str(empty_dir)]) assert result.exit_code == 1 + + +def test_markdown_command_json_input_converts_directly(runner, mock_document, tmp_path): + """Test that a valid JSON parse result is loaded directly without re-parsing.""" + + json_file = tmp_path / 'result.json' + json_file.write_text(mock_document.model_dump_json(), encoding='utf-8') + + with patch('parxy_cli.commands.markdown.Parxy') as mock_parxy: + result = runner.invoke(app, [str(json_file)]) + + assert result.exit_code == 0 + # batch_iter should NOT be called — no PDF to parse + mock_parxy.batch_iter.assert_not_called() + + # Output file should be saved next to the JSON file, without driver prefix + expected_output = tmp_path / 'result.md' + assert expected_output.exists() + assert '# Test heading' in expected_output.read_text() + + +def test_markdown_command_json_input_with_output_dir(runner, mock_document, tmp_path): + """Test that JSON input respects the --output directory.""" + + json_file = tmp_path / 'result.json' + 
json_file.write_text(mock_document.model_dump_json(), encoding='utf-8') + output_dir = tmp_path / 'out' + + with patch('parxy_cli.commands.markdown.Parxy'): + result = runner.invoke(app, [str(json_file), '--output', str(output_dir)]) + + assert result.exit_code == 0 + assert (output_dir / 'result.md').exists() + + +def test_markdown_command_json_input_inline(runner, mock_document, tmp_path): + """Test that JSON input with --inline prints to stdout.""" + + json_file = tmp_path / 'result.json' + json_file.write_text(mock_document.model_dump_json(), encoding='utf-8') + + with patch('parxy_cli.commands.markdown.Parxy'): + result = runner.invoke(app, [str(json_file), '--inline']) + + assert result.exit_code == 0 + cleaned = strip_ansi(result.stdout) + assert '---' in cleaned + assert 'pages:' in cleaned + assert '# Test heading' in cleaned + assert not (tmp_path / 'result.md').exists() + + +def test_markdown_command_invalid_json_reports_error(runner, tmp_path): + """Test that a JSON file with invalid Document content reports an error.""" + + json_file = tmp_path / 'bad.json' + json_file.write_text('{"not": "a document"}', encoding='utf-8') + + with patch('parxy_cli.commands.markdown.Parxy'): + result = runner.invoke(app, [str(json_file)]) + + cleaned = strip_ansi(result.stdout) + assert 'error' in cleaned.lower() + + +def test_markdown_command_page_separators(runner, mock_document, pdf_file): + """Test that --page-separators injects HTML page comments into output.""" + + with patch('parxy_cli.commands.markdown.Parxy') as mock_parxy: + mock_parxy.default_driver.return_value = 'pymupdf' + mock_parxy.batch_iter.return_value = iter( + [ + BatchResult( + file=str(pdf_file), + driver='pymupdf', + document=mock_document, + error=None, + ) + ] + ) + + result = runner.invoke(app, [str(pdf_file), '--page-separators']) + + assert result.exit_code == 0 + expected_output = pdf_file.parent / 'pymupdf-test.md' + assert expected_output.exists() + assert '' in output + + +def 
test_markdown_command_mixed_json_and_pdf(runner, mock_document, tmp_path): + """Test that JSON files and PDF files can be processed together.""" + + json_file = tmp_path / 'result.json' + json_file.write_text(mock_document.model_dump_json(), encoding='utf-8') + + pdf_file = tmp_path / 'doc.pdf' + pdf_file.write_bytes(b'%PDF fake') + + with patch('parxy_cli.commands.markdown.Parxy') as mock_parxy: + mock_parxy.default_driver.return_value = 'pymupdf' + mock_parxy.batch_iter.return_value = iter( + [ + BatchResult( + file=str(pdf_file), + driver='pymupdf', + document=mock_document, + error=None, + ) + ] + ) + + result = runner.invoke(app, [str(json_file), str(pdf_file)]) + + assert result.exit_code == 0 + # JSON converted directly + assert (tmp_path / 'result.md').exists() + # PDF parsed via driver + assert (tmp_path / 'pymupdf-doc.md').exists() diff --git a/tests/services/test_contentmd_service.py b/tests/services/test_contentmd_service.py new file mode 100644 index 0000000..d0bb1a9 --- /dev/null +++ b/tests/services/test_contentmd_service.py @@ -0,0 +1,571 @@ +"""Test suite for ContentMdService.""" + +import pytest + +from parxy_core.models.models import ( + Document, + ImageBlock, + Metadata, + Page, + TableBlock, + TextBlock, +) +from parxy_core.services.contentmd_service import ContentMdService + + +# --------------------------------------------------------------------------- +# Helpers +# --------------------------------------------------------------------------- + + +def make_page( + number: int = 1, + text: str = '', + blocks: list | None = None, +) -> Page: + return Page(number=number, text=text, blocks=blocks) + + +def make_text_block( + text: str, + role: str = 'generic', + level: int | None = None, +) -> TextBlock: + return TextBlock(type='text', text=text, role=role, level=level) + + +def make_image_block( + alt_text: str | None = None, name: str | None = None +) -> ImageBlock: + return ImageBlock(type='image', alt_text=alt_text, name=name) + + +def 
make_table_block(text: str) -> TableBlock: + return TableBlock(type='table', text=text) + + +def make_doc( + pages: list[Page], + metadata: Metadata | None = None, + filename: str | None = None, + language: str | None = None, +) -> Document: + return Document( + pages=pages, + metadata=metadata, + filename=filename, + language=language, + ) + + +# --------------------------------------------------------------------------- +# Fixtures +# --------------------------------------------------------------------------- + + +@pytest.fixture +def minimal_doc(): + """Document with a single page, no blocks, no metadata.""" + return make_doc(pages=[make_page(text='Hello world')]) + + +@pytest.fixture +def metadata_doc(): + """Document with full metadata and one plain paragraph block.""" + meta = Metadata( + title='Metadata Title', + author='Jane Doe', + created_at='2025-01-15', + ) + page = make_page( + text='Paragraph text.', + blocks=[make_text_block('Paragraph text.')], + ) + return make_doc(pages=[page], metadata=meta, filename='report.pdf') + + +@pytest.fixture +def all_blocks_doc(): + """Document whose first page contains every supported block type.""" + blocks = [ + make_text_block('My Document', role='doc-title'), + make_text_block('Introduction', role='heading', level=1), + make_text_block('Background', role='heading', level=2), + make_text_block('First item\nSecond item', role='list'), + make_text_block('A plain paragraph.', role='paragraph'), + make_text_block('A brief overview.', role='doc-abstract'), + make_image_block(alt_text='A sunset over mountains', name='sunset.jpg'), + make_table_block('| Col A | Col B |\n| ----- | ----- |\n| 1 | 2 |'), + ] + page = make_page(text='My Document', blocks=blocks) + return make_doc(pages=[page], language='en') + + +# --------------------------------------------------------------------------- +# Frontmatter +# --------------------------------------------------------------------------- + + +class TestFrontmatter: + def 
test_frontmatter_delimiters_present(self, minimal_doc): + result = ContentMdService.render(minimal_doc, title='T', description='D') + lines = result.splitlines() + assert lines[0] == '---' + closing = lines.index('---', 1) + assert closing > 0 + + def test_explicit_title_in_frontmatter(self, minimal_doc): + result = ContentMdService.render(minimal_doc, title='Explicit Title') + assert 'title: "Explicit Title"' in result + + def test_title_from_metadata(self, metadata_doc): + result = ContentMdService.render(metadata_doc) + assert 'title: "Metadata Title"' in result + + def test_title_from_doc_title_role_preferred_over_heading(self): + blocks = [ + make_text_block('Real Title', role='doc-title'), + make_text_block('Section One', role='heading', level=1), + ] + doc = make_doc(pages=[make_page(text='', blocks=blocks)]) + result = ContentMdService.render(doc) + assert 'title: "Real Title"' in result + + def test_title_from_heading_when_no_doc_title(self): + blocks = [ + make_text_block('Section One', role='heading', level=2), + make_text_block('Section Two', role='heading', level=1), + ] + doc = make_doc(pages=[make_page(text='', blocks=blocks)]) + result = ContentMdService.render(doc) + # Level 1 heading wins (lowest level = highest rank) + assert 'title: "Section Two"' in result + + def test_title_from_filename_when_no_headings(self): + doc = make_doc( + pages=[make_page(text='body text')], + filename='my-report.pdf', + ) + result = ContentMdService.render(doc) + assert 'title: "my-report.pdf"' in result + + def test_title_raises_when_unresolvable(self): + doc = make_doc(pages=[make_page(text='body text')]) + with pytest.raises(ValueError, match='no title could be resolved'): + ContentMdService.render(doc) + + def test_description_from_explicit_param(self, minimal_doc): + result = ContentMdService.render( + minimal_doc, title='T', description='My summary.' 
+ ) + assert 'description: "My summary."' in result + + def test_description_from_doc_abstract_block(self): + blocks = [ + make_text_block('Abstract content here.', role='doc-abstract'), + make_text_block( + 'A much longer paragraph that should not be picked.', role='paragraph' + ), + ] + doc = make_doc(pages=[make_page(text='', blocks=blocks)]) + result = ContentMdService.render(doc, title='T') + assert 'description: "Abstract content here."' in result + + def test_description_from_first_five_body_blocks(self): + blocks = [make_text_block(f'Sentence {i}.', role='paragraph') for i in range(7)] + doc = make_doc(pages=[make_page(text='', blocks=blocks)]) + result = ContentMdService.render(doc, title='T') + # Only the first five contribute; the sixth and seventh are ignored + assert 'Sentence 5' not in result.split('---\n')[1].split('\n')[0] + assert 'Sentence 0' in result + + def test_description_excludes_structural_roles(self): + blocks = [ + make_text_block('Table of contents text.', role='doc-toc'), + make_text_block('Page header text.', role='doc-pageheader'), + make_text_block('A heading block.', role='heading'), + make_text_block('Body content.', role='paragraph'), + ] + doc = make_doc(pages=[make_page(text='', blocks=blocks)]) + result = ContentMdService.render(doc) + assert 'description: "Body content."' in result + + def test_description_truncated_to_200_chars(self): + long_text = 'word ' * 60 # well over 200 chars + blocks = [make_text_block(long_text, role='paragraph')] + doc = make_doc(pages=[make_page(text='', blocks=blocks)]) + result = ContentMdService.render(doc, title='T') + fm_end = result.index('---\n', 4) + frontmatter = result[:fm_end] + desc_line = next( + l for l in frontmatter.splitlines() if l.startswith('description:') + ) + # Strip the YAML quoting to measure the actual value length + value = desc_line[len('description: "') : -1] + assert len(value) <= 200 + + def test_description_contains_no_newlines(self): + blocks = [ + 
make_text_block('Line one.\nLine two.\nLine three.', role='paragraph') + ] + doc = make_doc(pages=[make_page(text='', blocks=blocks)]) + result = ContentMdService.render(doc, title='T') + fm_end = result.index('---\n', 4) + frontmatter = result[:fm_end] + desc_line = next( + l for l in frontmatter.splitlines() if l.startswith('description:') + ) + assert '\n' not in desc_line + + def test_description_searches_first_two_pages(self): + page1 = make_page(number=1, text='', blocks=[make_text_block('Page 1 text.')]) + page2 = make_page( + number=2, + text='', + blocks=[make_text_block('Page 2 has a longer text block.')], + ) + page3 = make_page( + number=3, + text='', + blocks=[make_text_block('Page 3 has the longest block of all by far.')], + ) + doc = make_doc(pages=[page1, page2, page3]) + result = ContentMdService.render(doc, title='T') + # Page 3 is out of the two-page window + assert 'Page 3' not in result.split('---')[1] # not in frontmatter + + def test_date_from_metadata_created_at(self, metadata_doc): + result = ContentMdService.render(metadata_doc) + assert 'date: "2025-01-15"' in result + + def test_date_from_metadata_updated_at_when_no_created_at(self): + meta = Metadata(updated_at='2025-06-01') + doc = make_doc(pages=[make_page(text='')], metadata=meta) + result = ContentMdService.render(doc, title='T') + assert 'date: "2025-06-01"' in result + + def test_explicit_date_overrides_metadata(self, metadata_doc): + result = ContentMdService.render(metadata_doc, date='2026-01-01') + assert 'date: "2026-01-01"' in result + assert '2025-01-15' not in result + + def test_author_from_metadata(self, metadata_doc): + result = ContentMdService.render(metadata_doc) + assert 'author: "Jane Doe"' in result + + def test_optional_fields_omitted_when_absent(self, minimal_doc): + result = ContentMdService.render(minimal_doc, title='T') + assert 'description:' not in result + assert 'date:' not in result + assert 'license:' not in result + assert 'author:' not in result + + 
def test_license_included_when_provided(self, minimal_doc): + result = ContentMdService.render(minimal_doc, title='T', license='CC-BY-4.0') + assert 'license: "CC-BY-4.0"' in result + + def test_yaml_values_escaped(self, minimal_doc): + result = ContentMdService.render( + minimal_doc, + title='Title with "quotes"', + description='Back\\slash', + ) + assert r'title: "Title with \"quotes\""' in result + assert r'description: "Back\\slash"' in result + + +# --------------------------------------------------------------------------- +# Body – block rendering +# --------------------------------------------------------------------------- + + +class TestBodyBlocks: + def test_body_starts_with_h1_title(self, metadata_doc): + result = ContentMdService.render(metadata_doc) + body = result.split('---\n', 2)[-1] + assert body.lstrip().startswith('# Metadata Title') + + def test_doc_title_block_skipped_in_body(self, all_blocks_doc): + result = ContentMdService.render(all_blocks_doc) + body = result.split('---\n', 2)[-1] + # Should appear exactly once (as the h1), not twice + assert body.count('My Document') == 1 + + def test_heading_level_shifted_by_one(self): + blocks = [make_text_block('Section', role='heading', level=1)] + doc = make_doc(pages=[make_page(text='', blocks=blocks)]) + result = ContentMdService.render(doc, title='T') + assert '## Section' in result + + def test_heading_level_2_becomes_3(self): + blocks = [make_text_block('Subsection', role='heading', level=2)] + doc = make_doc(pages=[make_page(text='', blocks=blocks)]) + result = ContentMdService.render(doc, title='T') + assert '### Subsection' in result + + def test_heading_without_level_defaults_to_h2(self): + blocks = [make_text_block('Heading', role='heading')] + doc = make_doc(pages=[make_page(text='', blocks=blocks)]) + result = ContentMdService.render(doc, title='T') + assert '## Heading' in result + + def test_heading_level_capped_at_6(self): + blocks = [make_text_block('Deep', role='heading', level=6)] 
+ doc = make_doc(pages=[make_page(text='', blocks=blocks)]) + result = ContentMdService.render(doc, title='T') + assert '###### Deep' in result + assert '####### Deep' not in result + + def test_list_role_rendered_as_bullets(self): + blocks = [make_text_block('Alpha\nBeta\nGamma', role='list')] + doc = make_doc(pages=[make_page(text='', blocks=blocks)]) + result = ContentMdService.render(doc, title='T') + assert '- Alpha' in result + assert '- Beta' in result + assert '- Gamma' in result + + def test_listitem_role_rendered_as_bullet(self): + blocks = [make_text_block('Single item', role='listitem')] + doc = make_doc(pages=[make_page(text='', blocks=blocks)]) + result = ContentMdService.render(doc, title='T') + assert '- Single item' in result + + def test_doc_abstract_rendered_as_abstract_tag(self, all_blocks_doc): + result = ContentMdService.render(all_blocks_doc) + assert '' in result + assert 'A brief overview.' in result + assert '' in result + + def test_doc_abstract_without_language_omits_lang_attr(self): + blocks = [make_text_block('Summary.', role='doc-abstract')] + doc = make_doc(pages=[make_page(text='', blocks=blocks)]) + result = ContentMdService.render(doc, title='T') + assert '\nSummary.\n' in result + + def test_generic_textblock_rendered_as_paragraph(self): + blocks = [make_text_block('Plain paragraph text.', role='generic')] + doc = make_doc(pages=[make_page(text='', blocks=blocks)]) + result = ContentMdService.render(doc, title='T') + assert 'Plain paragraph text.' 
in result + + def test_empty_textblock_not_rendered(self): + blocks = [make_text_block(' ', role='paragraph')] + doc = make_doc(pages=[make_page(text='', blocks=blocks)]) + result = ContentMdService.render(doc, title='T') + # Body should only contain the h1 line + body = result.split('---\n', 2)[-1].strip() + assert body == '# T' + + def test_image_block_rendered_as_figure(self): + blocks = [make_image_block(alt_text='A sunset over mountains')] + doc = make_doc(pages=[make_page(text='', blocks=blocks)]) + result = ContentMdService.render(doc, title='T') + assert '
<figure>\nA sunset over mountains\n</figure>
' in result + + def test_image_block_without_alt_text(self): + blocks = [make_image_block()] + doc = make_doc(pages=[make_page(text='', blocks=blocks)]) + result = ContentMdService.render(doc, title='T') + assert '
<figure>\n\n</figure>
' in result + + def test_table_block_rendered_as_is(self): + table_text = '| Col A | Col B |\n| ----- | ----- |\n| 1 | 2 |' + blocks = [make_table_block(table_text)] + doc = make_doc(pages=[make_page(text='', blocks=blocks)]) + result = ContentMdService.render(doc, title='T') + assert table_text in result + + def test_page_without_blocks_uses_page_text(self): + page = make_page(text='Fallback page text', blocks=None) + doc = make_doc(pages=[page]) + result = ContentMdService.render(doc, title='T') + assert 'Fallback page text' in result + + def test_empty_page_text_not_rendered(self): + page = make_page(text=' ', blocks=None) + doc = make_doc(pages=[page]) + result = ContentMdService.render(doc, title='T') + body = result.split('---\n', 2)[-1].strip() + assert body == '# T' + + +# --------------------------------------------------------------------------- +# Whitespace normalisation +# --------------------------------------------------------------------------- + + +class TestWhitespaceNormalisation: + def test_multiple_spaces_in_paragraph_collapsed(self): + blocks = [make_text_block('Word1 Word2 Word3')] + doc = make_doc(pages=[make_page(text='', blocks=blocks)]) + result = ContentMdService.render(doc, title='T') + assert 'Word1 Word2 Word3' in result + + def test_tabs_in_paragraph_collapsed(self): + blocks = [make_text_block('Word1\t\tWord2')] + doc = make_doc(pages=[make_page(text='', blocks=blocks)]) + result = ContentMdService.render(doc, title='T') + assert 'Word1 Word2' in result + + def test_whitespace_in_heading_collapsed(self): + blocks = [make_text_block('My Section', role='heading', level=1)] + doc = make_doc(pages=[make_page(text='', blocks=blocks)]) + result = ContentMdService.render(doc, title='T') + assert '## My Section' in result + + def test_whitespace_in_title_collapsed(self): + blocks = [make_text_block(' My Title ', role='doc-title')] + doc = make_doc(pages=[make_page(text='', blocks=blocks)]) + result = ContentMdService.render(doc) + assert 
'title: "My Title"' in result + + def test_whitespace_in_description_collapsed(self): + blocks = [make_text_block('Summary with gaps.', role='doc-abstract')] + doc = make_doc(pages=[make_page(text='', blocks=blocks)]) + result = ContentMdService.render(doc, title='T') + assert 'description: "Summary with gaps."' in result + + def test_table_whitespace_preserved(self): + table_text = '| Col A | Col B |\n| ----- | ----- |' + blocks = [make_table_block(table_text)] + doc = make_doc(pages=[make_page(text='', blocks=blocks)]) + result = ContentMdService.render(doc, title='T') + assert '| Col A | Col B |' in result + + +# --------------------------------------------------------------------------- +# Output structure +# --------------------------------------------------------------------------- + + +class TestOutputStructure: + def test_result_ends_with_newline(self, minimal_doc): + result = ContentMdService.render(minimal_doc, title='T') + assert result.endswith('\n') + + def test_empty_pages_list_returns_frontmatter_and_title(self): + doc = Document(pages=[]) + result = ContentMdService.render(doc, title='Empty') + assert 'title: "Empty"' in result + assert '# Empty' in result + + def test_blocks_separated_by_blank_line(self): + blocks = [ + make_text_block('First paragraph.'), + make_text_block('Second paragraph.'), + ] + doc = make_doc(pages=[make_page(text='', blocks=blocks)]) + result = ContentMdService.render(doc, title='T') + assert 'First paragraph.\n\nSecond paragraph.' in result + + def test_multipage_document_renders_all_pages(self): + page1 = make_page( + number=1, + text='', + blocks=[make_text_block('Page one content.')], + ) + page2 = make_page( + number=2, + text='', + blocks=[make_text_block('Page two content.')], + ) + doc = make_doc(pages=[page1, page2]) + result = ContentMdService.render(doc, title='T') + assert 'Page one content.' in result + assert 'Page two content.' 
in result + + def test_render_delegates_from_document_method(self, metadata_doc): + via_service = ContentMdService.render(metadata_doc) + via_method = metadata_doc.contentmd() + assert via_service == via_method + + def test_empty_document_without_args_raises(self): + """A document with no metadata, no blocks, no filename, and no user + arguments cannot satisfy the required title constraint.""" + doc = Document(pages=[]) + with pytest.raises(ValueError, match='no title could be resolved'): + ContentMdService.render(doc) + + def test_empty_document_with_title_arg_returns_contentmd(self): + """Passing title= explicitly must succeed even when the document is + completely empty.""" + doc = Document(pages=[]) + result = ContentMdService.render(doc, title='Provided Title') + assert 'title: "Provided Title"' in result + assert '# Provided Title' in result + + def test_empty_document_with_title_and_description_returns_contentmd(self): + """Both title= and description= passed explicitly on an empty document.""" + doc = Document(pages=[]) + result = ContentMdService.render( + doc, title='My Title', description='My description.' 
+ ) + assert 'title: "My Title"' in result + assert 'description: "My description."' in result + assert result.endswith('\n') + + +class TestPageSeparators: + """Tests for page_separators support in ContentMdService and Document.markdown.""" + + def test_contentmd_page_separators_off_by_default(self): + page = make_page(number=1, text='', blocks=[make_text_block('Content.')]) + doc = make_doc(pages=[page]) + result = ContentMdService.render(doc, title='T') + assert '' in result + + def test_contentmd_page_separators_multipage(self): + page1 = make_page(number=1, text='', blocks=[make_text_block('Page one.')]) + page2 = make_page(number=2, text='', blocks=[make_text_block('Page two.')]) + doc = make_doc(pages=[page1, page2]) + result = ContentMdService.render(doc, title='T', page_separators=True) + assert '' in result + assert '' in result + # Separators appear in correct order relative to each other + assert result.index('') < result.index('') + + def test_contentmd_page_separators_via_document_method(self): + page = make_page(number=3, text='', blocks=[make_text_block('Content.')]) + doc = make_doc(pages=[page]) + result = doc.contentmd(title='T', page_separators=True) + assert '' in result + + def test_markdown_page_separators_off_by_default(self): + doc = Document(pages=[Page(number=1, text='Hello world')]) + result = doc.markdown() + assert '' in result + + def test_markdown_page_separators_multipage(self): + doc = Document( + pages=[ + Page(number=1, text='First page'), + Page(number=2, text='Second page'), + ] + ) + result = doc.markdown(page_separators=True) + assert '' in result + assert '' in result + assert result.index('') < result.index('First page') + assert result.index('') < result.index('Second page') + + def test_markdown_page_separators_empty_page_still_emits_comment(self): + doc = Document( + pages=[ + Page(number=1, text='Content'), + Page(number=2, text=''), # empty page + ] + ) + result = doc.markdown(page_separators=True) + assert '' in 
result + assert '' in result