Self-hosted XLSX upload indexes ZIP/OpenXML bytes instead of spreadsheet cell text #1152

Description

@Konan69

Summary

Self-hosted supermemory-server v0.0.3 accepts an .xlsx upload and eventually marks the document done, but the stored chunks are raw ZIP/OpenXML bytes (PK\u0003\u0004...xl/worksheets/sheet1.xml...) rather than extracted spreadsheet cell text. Search then returns binary/OpenXML fragments instead of useful spreadsheet content.

Environment

  • Release: server-v0.0.3
  • Binary: supermemory-server-linux-x64
  • OS: Linux/WSL
  • Upload endpoint: POST /v3/documents/file
  • File MIME: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

Reproduction

Upload an Excel workbook:

curl -sS -X POST http://127.0.0.1:6767/v3/documents/file \
  -H 'Authorization: Bearer <api-key>' \
  -F 'file=@MAG Agent Data.xlsx;type=application/vnd.openxmlformats-officedocument.spreadsheetml.sheet' \
  -F 'containerTag=debug:xlsx' \
  -F 'customId=debug-xlsx-1'

The document is accepted and later reports done:

{
  "id": "E63cFP2EuiVgMefznAxADD",
  "status": "done",
  "title": "MAG Agent Data.xlsx",
  "type": "text",
  "metadata": {
    "mimeType": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
  }
}

But GET /v3/documents/{id}/chunks returns chunks like this:

{
  "position": 0,
  "content": "PK\u0003\u0004...xl/worksheets/sheet1.xml..."
}

and later chunks include more OpenXML/ZIP internals instead of cell values:

xl/sharedStrings.xml
xl/styles.xml
xl/workbook.xml
[Content_Types].xml

Search also returns those binary/OpenXML chunks instead of spreadsheet rows/cells.

Suspected cause

The self-hosted content detection appears to classify Office MIME types as generic text:

application/vnd.openxmlformats-officedocument.spreadsheetml.sheet -> text

Then the text extractor decodes the raw .xlsx ZIP bytes as text and embeds that, rather than using an XLSX/OpenXML extractor to unpack worksheets/shared strings and serialize cells.
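For illustration, here is a minimal sketch of what an XLSX extractor could look like, using only the Python standard library. This is not the server's code; `extract_xlsx_text` and the `sheetN: A1=value | B1=value` line format are hypothetical names and choices made for this example. It unpacks the ZIP container, resolves shared strings, and serializes each worksheet row with its cell references.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetML main namespace used by worksheet and sharedStrings parts.
MAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
NS = {"m": MAIN}
T_TAG = "{%s}t" % MAIN
ROW_TAG = "{%s}row" % MAIN


def _shared_strings(zf: zipfile.ZipFile) -> list[str]:
    """Return the shared-string table, or an empty list if the part is absent."""
    if "xl/sharedStrings.xml" not in zf.namelist():
        return []
    root = ET.fromstring(zf.read("xl/sharedStrings.xml"))
    # An <si> entry may hold several <t> runs (rich text); join them.
    return ["".join(t.text or "" for t in si.iter(T_TAG))
            for si in root.findall("m:si", NS)]


def extract_xlsx_text(data: bytes) -> list[str]:
    """Serialize each non-empty worksheet row as one line of cell text."""
    lines = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        strings = _shared_strings(zf)
        sheets = sorted(n for n in zf.namelist()
                        if n.startswith("xl/worksheets/") and n.endswith(".xml"))
        for name in sheets:
            # Label by part name (e.g. "sheet1"); mapping to the display name
            # in xl/workbook.xml is left out of this sketch.
            label = name.rsplit("/", 1)[-1][:-len(".xml")]
            root = ET.fromstring(zf.read(name))
            for row in root.iter(ROW_TAG):
                cells = []
                for c in row.findall("m:c", NS):
                    ctype = c.get("t")
                    if ctype == "inlineStr":
                        value = "".join(t.text or "" for t in c.iter(T_TAG))
                    else:
                        v = c.find("m:v", NS)
                        if v is None or v.text is None:
                            continue  # empty or formula-only cell
                        # t="s" means <v> is an index into the shared strings.
                        value = strings[int(v.text)] if ctype == "s" else v.text
                    cells.append(f"{c.get('r')}={value}")
                if cells:
                    lines.append(f"{label}: " + " | ".join(cells))
    return lines
```

Keeping the cell reference (`A1`, `B1`) next to each value is one way to preserve sheet/row context in the chunks; dates, number formats, and formulas would need extra handling that this sketch skips.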

Docs ambiguity

The docs are inconsistent:

  • Supported Content Types says Microsoft Office Excel .xlsx is supported with content type xlsx.
  • Upload Files / file upload docs list spreadsheets as CSV / Google Sheets, but not XLSX.

Either way, marking an uploaded .xlsx as done while indexing ZIP bytes is misleading. If XLSX is unsupported for file upload/self-hosted, the document should fail with a clear unsupported-content error instead of indexing binary content.

Expected behavior

One of:

  1. .xlsx uploads extract worksheet cell text into searchable chunks, preserving useful sheet/row context; or
  2. .xlsx uploads are rejected/marked failed with an explicit unsupported file type error.
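Either outcome depends on the upload being recognized as a workbook before any text decoding happens. A rough sketch of that classification step follows, assuming detection by ZIP magic bytes plus the presence of `xl/workbook.xml`; `classify_upload` and `UnsupportedContentError` are hypothetical names for this example, not identifiers from the server.

```python
import io
import zipfile

XLSX_MIME = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"


class UnsupportedContentError(ValueError):
    """Raised instead of silently indexing binary bytes as text."""


def classify_upload(data: bytes, mime: str, filename: str) -> str:
    """Return 'xlsx' or 'text'; raise for binary content with no extractor."""
    if data[:4] == b"PK\x03\x04":  # ZIP local file header signature
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as zf:
                names = set(zf.namelist())
        except zipfile.BadZipFile:
            names = set()
        if "xl/workbook.xml" in names:
            return "xlsx"
        # A ZIP that is not a workbook must not fall through to text.
        raise UnsupportedContentError(f"unsupported ZIP-based file: {filename}")
    if mime == XLSX_MIME or filename.lower().endswith(".xlsx"):
        # Declared as a workbook, but the bytes are not a ZIP container.
        raise UnsupportedContentError(f"{filename} is declared as XLSX but is not a ZIP")
    return "text"
```

With a check like this, an `.xlsx` upload is routed to an extractor (option 1) or fails with an explicit error (option 2), and never reaches the generic text path.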

Suggested fix

Add an XLSX/OpenXML extractor for uploaded files, or treat Excel MIME/extension as unsupported rather than text. A regression test could upload a minimal workbook with a unique cell value and assert that GET /v3/documents/{id}/chunks or document search contains that cell value, not PK ZIP bytes.
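A possible shape for that regression test's fixture and assertion is sketched below. The helpers `make_minimal_xlsx` and `chunk_problems` are hypothetical, the workbook is intended to be a minimal valid file but has not been checked against every spreadsheet reader, and the upload itself (the curl call from the reproduction) is left out.

```python
import io
import zipfile
from xml.sax.saxutils import escape

MAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
DOC_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"
CT = "http://schemas.openxmlformats.org/package/2006/content-types"
SS = "application/vnd.openxmlformats-officedocument.spreadsheetml"


def make_minimal_xlsx(marker: str) -> bytes:
    """Build a one-sheet workbook whose cell A1 holds `marker` as an inline string."""
    parts = {
        "[Content_Types].xml":
            f'<Types xmlns="{CT}">'
            '<Default Extension="rels" ContentType='
            '"application/vnd.openxmlformats-package.relationships+xml"/>'
            '<Default Extension="xml" ContentType="application/xml"/>'
            f'<Override PartName="/xl/workbook.xml" ContentType="{SS}.sheet.main+xml"/>'
            f'<Override PartName="/xl/worksheets/sheet1.xml" ContentType="{SS}.worksheet+xml"/>'
            '</Types>',
        "_rels/.rels":
            f'<Relationships xmlns="{PKG_REL}"><Relationship Id="rId1" '
            f'Type="{DOC_REL}/officeDocument" Target="xl/workbook.xml"/></Relationships>',
        "xl/workbook.xml":
            f'<workbook xmlns="{MAIN}" xmlns:r="{DOC_REL}"><sheets>'
            '<sheet name="Sheet1" sheetId="1" r:id="rId1"/></sheets></workbook>',
        "xl/_rels/workbook.xml.rels":
            f'<Relationships xmlns="{PKG_REL}"><Relationship Id="rId1" '
            f'Type="{DOC_REL}/worksheet" Target="worksheets/sheet1.xml"/></Relationships>',
        "xl/worksheets/sheet1.xml":
            f'<worksheet xmlns="{MAIN}"><sheetData><row r="1">'
            f'<c r="A1" t="inlineStr"><is><t>{escape(marker)}</t></is></c>'
            '</row></sheetData></worksheet>',
    }
    buf = io.BytesIO()
    # Deflate the parts, as real workbooks do; an uncompressed (stored) ZIP
    # would contain the marker as plain bytes and weaken the test.
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, xml in parts.items():
            zf.writestr(name, xml)
    return buf.getvalue()


def chunk_problems(chunks: list[dict], marker: str) -> list[str]:
    """Return reasons the chunks look wrong; an empty list means they pass."""
    contents = [c.get("content", "") for c in chunks]
    problems = []
    if any(t.startswith("PK\u0003\u0004") or "[Content_Types].xml" in t for t in contents):
        problems.append("chunks contain ZIP/OpenXML internals")
    if not any(marker in t for t in contents):
        problems.append("marker cell value missing from chunks")
    return problems
```

The test would upload `make_minimal_xlsx("some-unique-marker")`, wait for `done`, fetch the chunks, and assert that `chunk_problems(...)` returns an empty list. If the rejection path is chosen instead, the assertion would be on the failed status and error message.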
