Self-hosted XLSX upload indexes ZIP/OpenXML bytes instead of spreadsheet cell text #1152

## Summary

Self-hosted `supermemory-server` v0.0.3 accepts an `.xlsx` upload and eventually marks the document `done`, but the stored chunks are raw ZIP/OpenXML bytes (`PK\u0003\u0004...xl/worksheets/sheet1.xml...`) rather than extracted spreadsheet cell text. Search then returns binary/OpenXML fragments instead of useful spreadsheet content.

## Environment

- Release: `server-v0.0.3`
- Binary: `supermemory-server-linux-x64`
- OS: Linux/WSL
- Upload endpoint: `POST /v3/documents/file`
- File MIME: `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet`

## Reproduction

Upload an Excel workbook:

```bash
curl -sS -X POST http://127.0.0.1:6767/v3/documents/file \
  -H 'Authorization: Bearer <api-key>' \
  -F 'file=@MAG Agent Data.xlsx;type=application/vnd.openxmlformats-officedocument.spreadsheetml.sheet' \
  -F 'containerTag=debug:xlsx' \
  -F 'customId=debug-xlsx-1'
```

The document is accepted and later reports `done`:

```json
{
  "id": "E63cFP2EuiVgMefznAxADD",
  "status": "done",
  "title": "MAG Agent Data.xlsx",
  "type": "text",
  "metadata": {
    "mimeType": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
  }
}
```

But `GET /v3/documents/{id}/chunks` returns chunks like this:

```json
{
  "position": 0,
  "content": "PK\u0003\u0004...xl/worksheets/sheet1.xml..."
}
```

and later chunks include more OpenXML/ZIP internals instead of cell values:

```txt
xl/sharedStrings.xml
xl/styles.xml
xl/workbook.xml
[Content_Types].xml
```

Search also returns those binary/OpenXML chunks instead of spreadsheet rows/cells.
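To confirm the symptom without reading chunks by eye, something like the following could be run over the parsed chunk list. It is only a sketch: `find_zip_residue` is a hypothetical helper, the chunk shape (`position`, `content`) is taken from the response above, and the second sample chunk is made-up data included only for contrast.

```python
import json

# "PK\x03\x04" is the ZIP local file header signature; JSON renders it as PK\u0003\u0004.
ZIP_SIGNATURE = "PK\x03\x04"

# Part names seen in the affected chunks above.
OPENXML_PARTS = (
    "xl/worksheets/",
    "xl/sharedStrings.xml",
    "xl/styles.xml",
    "xl/workbook.xml",
    "[Content_Types].xml",
)


def find_zip_residue(chunks):
    """Return the positions of chunks that look like raw ZIP/OpenXML bytes."""
    flagged = []
    for chunk in chunks:
        content = chunk.get("content", "")
        if content.startswith(ZIP_SIGNATURE) or any(part in content for part in OPENXML_PARTS):
            flagged.append(chunk.get("position"))
    return flagged


# First chunk mirrors the response above; the second is hypothetical clean cell text.
sample_chunks = json.loads(
    r'[{"position": 0, "content": "PK\u0003\u0004...xl/worksheets/sheet1.xml..."},'
    r' {"position": 1, "content": "Region,Agent,Total"}]'
)

print(find_zip_residue(sample_chunks))  # [0]
```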

## Suspected cause

The self-hosted content detection appears to classify Office MIME types as generic `text`:

```txt
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet -> text
```

Then the text extractor decodes the raw `.xlsx` ZIP bytes as text and embeds that, rather than using an XLSX/OpenXML extractor to unpack worksheets/shared strings and serialize cells.
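The sketch below illustrates the suspected failure mode and one possible guard. It is not the server's actual code: `classify_upload` and `build_zip` are hypothetical names, and the rule "ZIP signature plus an `xl/workbook.xml` entry means XLSX" is an assumption about how detection could work. It shows that decoding an archive as UTF-8 text reproduces the `PK\u0003\u0004` prefix and leaves entry names readable, which matches the chunks above.

```python
import io
import zipfile

XLSX_MIME = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
ZIP_MAGIC = b"PK\x03\x04"  # local file header signature of a non-empty ZIP archive


def classify_upload(data: bytes, mime_type: str) -> str:
    """Hypothetical classifier: returns 'xlsx', 'binary', or 'text'."""
    if data.startswith(ZIP_MAGIC):
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                names = set(archive.namelist())
        except zipfile.BadZipFile:
            return "binary"
        # An OpenXML workbook carries its workbook part at this path.
        return "xlsx" if "xl/workbook.xml" in names else "binary"
    if mime_type == XLSX_MIME:
        # Declared as XLSX but not a ZIP archive: never fall back to text.
        return "binary"
    return "text"


def build_zip(entries: dict) -> bytes:
    """Build an in-memory ZIP from {entry name: body} for demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, body in entries.items():
            archive.writestr(name, body)
    return buffer.getvalue()


fake_xlsx = build_zip({"xl/workbook.xml": "<workbook/>"})

# What a naive text extractor would see: the ZIP header and entry names.
decoded = fake_xlsx.decode("utf-8", errors="replace")
print(repr(decoded[:4]))
print(classify_upload(fake_xlsx, XLSX_MIME))
```

Sniffing the bytes rather than trusting the MIME type alone would also catch workbooks uploaded with a generic or missing content type.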

## Docs ambiguity

The docs are inconsistent:

- `Supported Content Types` says Microsoft Office Excel `.xlsx` is supported with content type `xlsx`.
- `Upload Files` / file upload docs list spreadsheets as CSV / Google Sheets, but not XLSX.

Either way, marking an uploaded `.xlsx` as `done` while indexing ZIP bytes is misleading. If XLSX is unsupported for file upload/self-hosted, the document should fail with a clear unsupported-content error instead of indexing binary content.

## Expected behavior

One of:

1. `.xlsx` uploads extract worksheet cell text into searchable chunks, preserving useful sheet/row context; or
2. `.xlsx` uploads are rejected/marked failed with an explicit unsupported file type error.

## Suggested fix

Add an XLSX/OpenXML extractor for uploaded files, or treat Excel MIME/extension as unsupported rather than text. A regression test could upload a minimal workbook with a unique cell value and assert that `GET /v3/documents/{id}/chunks` or document search contains that cell value, not `PK` ZIP bytes.
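As a rough illustration of both halves of this suggestion, here is a minimal extractor and a fixture builder using only the Python standard library. Everything named here (`extract_xlsx_text`, `build_minimal_xlsx`, the `part!cell: value` line format) is hypothetical. The sketch makes simplifying assumptions: it labels cells by worksheet part path instead of resolving sheet names through `xl/workbook.xml`, it ignores dates, formulas and styles, and the fixture contains only the parts this extractor reads, so it is not a workbook Excel would open. A real implementation handling untrusted uploads would also want a hardened XML parser.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

MAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
NS = "{" + MAIN + "}"


def extract_xlsx_text(data: bytes) -> list:
    """Return one 'part!cell: value' line per cell that has a value."""
    lines = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = archive.namelist()

        # Shared strings are referenced from cells by zero-based index.
        shared = []
        if "xl/sharedStrings.xml" in names:
            root = ET.fromstring(archive.read("xl/sharedStrings.xml"))
            for item in root.iter(NS + "si"):
                shared.append("".join(t.text or "" for t in item.iter(NS + "t")))

        sheets = sorted(
            n for n in names
            if n.startswith("xl/worksheets/") and n.endswith(".xml")
        )
        for name in sheets:
            sheet = ET.fromstring(archive.read(name))
            for cell in sheet.iter(NS + "c"):
                kind = cell.get("t")
                value = cell.find(NS + "v")
                if kind == "s" and value is not None and value.text is not None:
                    text = shared[int(value.text)]  # shared string lookup
                elif kind == "inlineStr":
                    text = "".join(t.text or "" for t in cell.iter(NS + "t"))
                elif value is not None and value.text is not None:
                    text = value.text  # numbers and other literal values
                else:
                    continue
                lines.append(f"{name}!{cell.get('r')}: {text}")
    return lines


def build_minimal_xlsx(cell_value: str) -> bytes:
    """Test fixture: one sheet with a shared string in A1 and 42 in B1."""
    shared = f'<sst xmlns="{MAIN}"><si><t>{escape(cell_value)}</t></si></sst>'
    sheet = (
        f'<worksheet xmlns="{MAIN}"><sheetData><row r="1">'
        '<c r="A1" t="s"><v>0</v></c><c r="B1"><v>42</v></c>'
        "</row></sheetData></worksheet>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("xl/sharedStrings.xml", shared)
        archive.writestr("xl/worksheets/sheet1.xml", sheet)
    return buffer.getvalue()


for line in extract_xlsx_text(build_minimal_xlsx("unique-cell-7f3a")):
    print(line)
```

A regression test along these lines would assert that the unique value appears in the extracted text and that no extracted line starts with `PK`.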

