6 changes: 4 additions & 2 deletions docs/weaviate/manage-collections/inverted-index.mdx
@@ -201,15 +201,17 @@ Tokenization determines how text content is broken down into individual terms th

**`word`** - The default tokenization that splits text on whitespace and punctuation, converting to lowercase. Best for general text search where you want to match individual words.

- **`lowercase`** - Converts the entire property value to lowercase but treats it as a single token. Useful for exact matching of short strings like categories or tags while being case-insensitive.
+ **`lowercase`** - Splits text on whitespace only, then lowercases each token. Preserves symbols (like `&`, `@`, `_`) that `word` tokenization would strip. Good for case-insensitive matching where punctuation is meaningful, e.g. code snippets or email addresses.

**`whitespace`** - Splits text only on whitespace characters, preserving punctuation and case. Good when punctuation is meaningful for search.

**`field`** - Treats the entire property value as a single token without any processing. Use for exact matching of complete field values like IDs, email addresses, or URLs.

**`trigram`** - Breaks text into overlapping 3-character sequences. Enables fuzzy matching and is useful for handling typos or partial matches.

- **`gse`** - Google Search Engine tokenization, optimized for Chinese, Japanese, and Korean text. Provides language-aware tokenization for CJK languages.
+ **`gse`** - Language-aware tokenization for Chinese and Japanese text. Disabled by default. Enable with the `ENABLE_TOKENIZER_GSE` environment variable. For Korean text, see the `kagome_kr` option.

For the full list of supported tokenizers — including `kagome_ja`, `kagome_kr`, and the per-property text-analyzer options — see the [tokenization reference](../config-refs/collections.mdx#tokenization).
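
The behavioral differences between these options can be approximated with a short sketch. This is illustrative only — a rough Python model of the splitting rules described above, not Weaviate's actual tokenizer implementation:

```python
import re

def tokenize(text: str, method: str) -> list[str]:
    """Rough approximation of Weaviate tokenization options (illustrative only)."""
    if method == "word":
        # split on any non-alphanumeric character, lowercase each token
        return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]
    if method == "lowercase":
        # split on whitespace only, then lowercase; symbols like _ and @ survive
        return [t.lower() for t in text.split()]
    if method == "whitespace":
        # split on whitespace only, preserving case and punctuation
        return text.split()
    if method == "field":
        # the entire property value is a single, untouched token
        return [text]
    if method == "trigram":
        # overlapping 3-character windows over the lowercased, de-spaced text
        compact = re.sub(r"\s+", "", text.lower())
        return [compact[i:i + 3] for i in range(len(compact) - 2)]
    raise ValueError(f"unknown tokenization: {method}")

s = "Foo_Bar baz@example.com"
print(tokenize(s, "word"))        # ['foo', 'bar', 'baz', 'example', 'com']
print(tokenize(s, "lowercase"))   # ['foo_bar', 'baz@example.com']
print(tokenize(s, "whitespace"))  # ['Foo_Bar', 'baz@example.com']
print(tokenize(s, "field"))       # ['Foo_Bar baz@example.com']
```

Note how only `lowercase` keeps `baz@example.com` intact while still folding case — the property that makes it suitable for case-insensitive matching of symbol-bearing strings.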

</details>
