72 changes: 66 additions & 6 deletions docs/howto/pdf_manipulation.md
@@ -102,7 +102,7 @@ parxy pdf:merge file1.pdf file2.pdf -o /output/dir/merged.pdf

## Splitting PDFs

The `pdf:split` command divides a single PDF into individual pages, with each page becoming a separate PDF file.
The `pdf:split` command divides a single PDF into individual pages, with each page becoming a separate PDF file. You can optionally limit which pages are extracted and combine them into a single output PDF.

### Basic Splitting

@@ -139,6 +139,51 @@ Creates files named:
- `chapter_page_2.pdf`
- etc.

### Extracting a Page Range

Use `--pages` to limit which pages are extracted (1-based indexing):

**Single page:**
```bash
parxy pdf:split document.pdf --pages 3
```

**Page range:**
```bash
parxy pdf:split document.pdf --pages 2:5
```

**From start to page N:**
```bash
parxy pdf:split document.pdf --pages :5
```

**From page N to end:**
```bash
parxy pdf:split document.pdf --pages 3:
```
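
The four `--pages` formats above can be read as an inclusive 1-based range with an optional start and end. A minimal sketch of that interpretation follows; `parse_pages` is a hypothetical helper written for illustration, not Parxy's own parser, and its validation rules are assumptions.

```python
from typing import Optional, Tuple


def parse_pages(spec: str) -> Tuple[Optional[int], Optional[int]]:
    """Parse a 1-based page spec into (first, last); None means open-ended.

    Hypothetical helper for illustration only, not Parxy's own parser.
    """
    spec = spec.strip()
    if ":" not in spec:
        page = int(spec)  # single page, e.g. "3"
        first, last = page, page
    else:
        left, _, right = spec.partition(":")
        first = int(left) if left else None   # ":5" starts at the first page
        last = int(right) if right else None  # "3:" runs to the last page
    for value in (first, last):
        if value is not None and value < 1:
            raise ValueError(f"pages are 1-based, got {value}")
    if first is not None and last is not None and first > last:
        raise ValueError(f"range start {first} is after end {last}")
    return first, last


print(parse_pages("3"))    # (3, 3)
print(parse_pages("2:5"))  # (2, 5)
print(parse_pages(":5"))   # (None, 5)
print(parse_pages("3:"))   # (3, None)
```

An open end is kept as `None` here because resolving it needs the document's page count, which only the tool knows at run time.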

### Combining Pages into a Single PDF

Use `--combine` to extract a page range into a single output PDF instead of one file per page:

```bash
# Extract pages 2–5 as a single PDF (auto-named)
parxy pdf:split document.pdf --pages 2:5 --combine
# Output: document_pages_2-5.pdf (next to the input file)

# Specify a custom output path
parxy pdf:split document.pdf --pages 2:5 --combine -o extracted.pdf

# Extract a single page as a PDF
parxy pdf:split document.pdf --pages 3 --combine -o page3.pdf

# Combine all pages (equivalent to a copy)
parxy pdf:split document.pdf --combine -o copy.pdf
```

> **Tip:** `--combine` pairs well with `--pages` to replace the `pdf:merge file.pdf[2:5]` pattern when working with a single source file.
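
For scripts that need to predict the auto-named output, the documented pattern `{filename}_pages_{from}-{to}.pdf` can be sketched as below. `default_combined_path` is a hypothetical helper, not part of Parxy; it covers only the explicit `from:to` case shown above, since the naming for single pages or open-ended ranges is not stated here.

```python
from pathlib import Path


def default_combined_path(input_path: Path, first: int, last: int) -> Path:
    """Predict the auto-named --combine output for an explicit page range.

    Hypothetical helper: mirrors the documented pattern
    {filename}_pages_{from}-{to}.pdf, placed next to the input file.
    """
    return input_path.with_name(f"{input_path.stem}_pages_{first}-{last}.pdf")


predicted = default_combined_path(Path("reports") / "document.pdf", 2, 5)
print(predicted.name)  # document_pages_2-5.pdf
```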

### Complete Examples

**Split with custom output directory:**
@@ -161,14 +206,25 @@ Creates:
parxy pdf:split document.pdf -o ./individual_pages -p page
```

**Extract pages 10–20 as individual files:**
```bash
parxy pdf:split document.pdf --pages 10:20 -o ./extracted_pages
```

## Combining Merge and Split

You can chain operations together using the CLI:

**Example: Extract specific pages and split them:**
```bash
# First, extract pages 10-20
parxy pdf:merge document.pdf[10:20] -o extracted.pdf
# Extract pages 10-20 as individual files
parxy pdf:split document.pdf --pages 10:20 -o ./individual_pages
```

**Example: Extract a range into a single PDF, then split:**
```bash
# First, extract pages 10-20 into one PDF
parxy pdf:split document.pdf --pages 10:20 --combine -o extracted.pdf

# Then split into individual pages
parxy pdf:split extracted.pdf -o ./individual_pages
@@ -232,17 +288,21 @@ parxy pdf:split INPUT_FILE [OPTIONS]
```

**Arguments:**
- `INPUT_FILE`: PDF file to split into individual pages
- `INPUT_FILE`: PDF file to split

**Options:**
- `--output, -o`: Output directory (default: `{filename}_split/`)
- `--prefix, -p`: Output filename prefix (default: input filename)
- `--output, -o`: Without `--combine`: output directory (default: `{filename}_split/`). With `--combine`: output file path (default: `{filename}_pages_{from}-{to}.pdf` next to the input).
- `--prefix, -p`: Output filename prefix for individual split files (default: input filename)
- `--pages`: Page range to extract, 1-based. Formats: `3` (single page), `2:5` (range), `:5` (up to page 5), `3:` (from page 3 to end)
- `--combine`: Combine extracted pages into a single PDF instead of one file per page

**Examples:**
```bash
parxy pdf:split document.pdf
parxy pdf:split document.pdf -o ./pages
parxy pdf:split document.pdf -o ./pages -p page
parxy pdf:split document.pdf --pages 2:5
parxy pdf:split document.pdf --pages 2:5 --combine -o extracted.pdf
```

## Getting Help
48 changes: 47 additions & 1 deletion docs/tutorials/pdf_manipulation.md
@@ -81,6 +81,44 @@ for page_path in pages:
# ...
```

You can limit splitting to a page range using 0-based `from_page` / `to_page` indices:

```python
# Split only pages 2–5 (0-based: indices 1–4)
pages = Parxy.pdf.split(
input_path=Path("document.pdf"),
output_dir=Path("./pages"),
prefix="doc",
from_page=1,
to_page=4,
)
# Creates: doc_page_2.pdf, doc_page_3.pdf, doc_page_4.pdf, doc_page_5.pdf
```
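
Because the CLI's `--pages` option is 1-based while `from_page` / `to_page` are 0-based, an off-by-one slip is easy to make. The sketch below shows the conversion and the file names implied by the example above. Both helpers are hypothetical and written only for illustration; they assume `to_page` is inclusive, as the example's output suggests.

```python
from typing import List, Tuple


def to_zero_based(first_page: int, last_page: int) -> Tuple[int, int]:
    """Convert a 1-based inclusive range to 0-based inclusive indices."""
    if first_page < 1 or last_page < first_page:
        raise ValueError(f"invalid 1-based range: {first_page}-{last_page}")
    return first_page - 1, last_page - 1


def expected_names(prefix: str, from_page: int, to_page: int) -> List[str]:
    """File names implied by the example: numbering in names stays 1-based."""
    return [f"{prefix}_page_{i + 1}.pdf" for i in range(from_page, to_page + 1)]


from_page, to_page = to_zero_based(2, 5)
print(from_page, to_page)  # 1 4
print(expected_names("doc", from_page, to_page))
```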

### Extracting Pages into a Single PDF

Use `extract_pages` to pull a page range from a PDF into a new single-file PDF without splitting each page individually:

```python
from pathlib import Path
from parxy_core.services.pdf_service import PdfService

# Extract pages 3–7 (0-based: indices 2–6)
PdfService.extract_pages(
input_path=Path("report.pdf"),
output_path=Path("summary.pdf"),
from_page=2,
to_page=6,
)
```

Omit `from_page` / `to_page` to copy all pages:

```python
# Equivalent to a copy
PdfService.extract_pages(Path("original.pdf"), Path("copy.pdf"))
```

### Optimizing PDFs

Reduce PDF file size using compression techniques:
@@ -302,6 +340,12 @@ try:
except FileNotFoundError as e:
print(f"File not found: {e}")

# ValueError for invalid page ranges
try:
Parxy.pdf.split(Path("doc.pdf"), Path("./out"), "doc", from_page=100)
except ValueError as e:
print(f"Invalid page range: {e}")

# ValueError for invalid parameters
try:
Parxy.pdf.optimize(
@@ -332,7 +376,8 @@ except RuntimeError as e:
In this tutorial you learned:

- **`Parxy.pdf.merge()`** - Combine multiple PDFs with optional page ranges
- **`Parxy.pdf.split()`** - Split a PDF into individual page files
- **`Parxy.pdf.split()`** - Split a PDF into individual page files, with an optional page range
- **`PdfService.extract_pages()`** - Extract a page range into a single output PDF
- **`Parxy.pdf.optimize()`** - Reduce file size with compression options
- **`PdfService` context manager** - Work with attachments (add, list, extract, remove)

@@ -344,6 +389,7 @@ In this tutorial you learned:
| Splitting into pages | Extracting attachment content |
| Optimizing file size | Multiple operations on one file |
| One-shot operations | Need fine-grained control |
| Splitting a page range | Extracting a page range into one PDF (`extract_pages`) |

## Next Steps

60 changes: 55 additions & 5 deletions docs/tutorials/using_cli.md
@@ -14,7 +14,7 @@ The Parxy CLI lets you:
| `parxy preview` | Interactive document viewer with metadata, table of contents, and scrollable content preview |
| `parxy markdown` | Convert documents to Markdown files, with support for multiple drivers and folder processing |
| `parxy pdf:merge`| Merge multiple PDF files into one, with support for page ranges |
| `parxy pdf:split`| Split a PDF file into individual pages |
| `parxy pdf:split`| Split a PDF into individual pages, with optional page range and single-file extraction |
| `parxy drivers` | List available document processing drivers |
| `parxy env` | Generate a default `.env` configuration file |
| `parxy docker` | Create a Docker Compose configuration for running Parxy-related services |
@@ -218,6 +218,42 @@ parxy markdown document.pdf -d pymupdf -d llamaparse

This produces `pymupdf-document.md` and `llamaparse-document.md`.

### Converting Pre-parsed JSON Results

If you have a JSON file produced by `parxy parse -m json`, you can convert it to Markdown directly without re-parsing:

```bash
parxy markdown result.json
```

This loads the `Document` model from the JSON and converts it immediately — no driver or API call required. You can mix JSON files and PDF files in the same invocation:

```bash
parxy markdown result.json document.pdf -d pymupdf -o output/
```

### Page Separator Comments

Use `--page-separators` to insert HTML comments before each page's content:

```bash
parxy markdown document.pdf --page-separators
```

Output will contain markers like:

```markdown
<!-- page: 1 -->

First page content...

<!-- page: 2 -->

Second page content...
```

This is useful for post-processing scripts that need to identify page boundaries.
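
As an illustration of such post-processing, here is a minimal sketch that splits the Markdown on the `<!-- page: N -->` markers shown above. `split_pages` is a hypothetical helper, not part of Parxy, and it assumes each marker sits on its own line in exactly that form.

```python
import re
from typing import Dict

# Matches a marker that occupies a whole line, e.g. "<!-- page: 2 -->".
PAGE_MARKER = re.compile(r"^<!-- page: (\d+) -->[ \t]*$", re.MULTILINE)


def split_pages(markdown: str) -> Dict[int, str]:
    """Map each page number to the content that follows its marker."""
    pages: Dict[int, str] = {}
    matches = list(PAGE_MARKER.finditer(markdown))
    for i, match in enumerate(matches):
        # A page's content runs up to the next marker, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        pages[int(match.group(1))] = markdown[match.end():end].strip()
    return pages


sample = """<!-- page: 1 -->

First page content...

<!-- page: 2 -->

Second page content...
"""

for number, text in split_pages(sample).items():
    print(number, text)
```

Any text before the first marker is ignored by this sketch.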

### Inline Output

Use `--inline` with a single file to print markdown directly to stdout with a YAML frontmatter header — useful for shell pipelines:
@@ -276,7 +312,7 @@ parxy pdf:merge cover.pdf /chapters doc.pdf[10:20] appendix.pdf -o book.pdf

### Splitting PDFs

The `pdf:split` command divides a PDF file into individual pages, with each page becoming a separate PDF file.
The `pdf:split` command divides a PDF file into individual pages, with optional page range extraction and single-file output.

**Split into individual pages:**
```bash
@@ -290,7 +326,21 @@ This creates a `document_split/` folder containing `document_page_1.pdf`, `docum
parxy pdf:split report.pdf -o ./pages -p page
```

Creates `page_1.pdf`, `page_2.pdf`, etc. in the `./pages` directory.
**Extract a page range as individual files:**
```bash
parxy pdf:split document.pdf --pages 2:5 -o ./pages
```

**Combine a page range into a single PDF:**
```bash
# Auto-named output next to the input file
parxy pdf:split document.pdf --pages 2:5 --combine

# Custom output path
parxy pdf:split document.pdf --pages 2:5 --combine -o extracted.pdf
```

Page range formats (1-based): `3` · `2:5` · `:5` · `3:`

For more detailed examples and use cases, see the [PDF Manipulation How-to Guide](../howto/pdf_manipulation.md).

@@ -358,9 +408,9 @@ With the CLI, you can use Parxy as a **standalone document parsing tool** — id
|------------------|--------------------------------------------------------------|
| `parxy parse` | Extract text from documents with multiple formats & drivers |
| `parxy preview` | Interactive document viewer with metadata and TOC |
| `parxy markdown` | Generate Markdown files with driver prefix naming |
| `parxy markdown` | Generate Markdown files; accepts JSON results and supports `--page-separators` |
| `parxy pdf:merge`| Merge multiple PDF files with page range support |
| `parxy pdf:split`| Split PDF files into individual pages |
| `parxy pdf:split`| Split a PDF into individual pages; supports `--pages` and `--combine` |
| `parxy drivers` | List supported drivers |
| `parxy env` | Create default configuration file |
| `parxy docker` | Generate Docker Compose setup |