diff --git a/docs/weaviate/manage-collections/inverted-index.mdx b/docs/weaviate/manage-collections/inverted-index.mdx
index b230d7b8..665c695c 100644
--- a/docs/weaviate/manage-collections/inverted-index.mdx
+++ b/docs/weaviate/manage-collections/inverted-index.mdx
@@ -201,7 +201,7 @@ Tokenization determines how text content is broken down into individual terms th
 
 **`word`** - The default tokenization that splits text on whitespace and punctuation, converting to lowercase. Best for general text search where you want to match individual words.
 
-**`lowercase`** - Converts the entire property value to lowercase but treats it as a single token. Useful for exact matching of short strings like categories or tags while being case-insensitive.
+**`lowercase`** - Splits text on whitespace only, then lowercases each token. Preserves symbols (like `&`, `@`, `_`) that `word` tokenization would strip. Good for case-insensitive matching where punctuation is meaningful — e.g. code snippets or email addresses.
 
 **`whitespace`** - Splits text only on whitespace characters, preserving punctuation and case. Good when punctuation is meaningful for search.
 
@@ -209,7 +209,9 @@ Tokenization determines how text content is broken down into individual terms th
 
 **`trigram`** - Breaks text into overlapping 3-character sequences. Enables fuzzy matching and is useful for handling typos or partial matches.
 
-**`gse`** - Google Search Engine tokenization, optimized for Chinese, Japanese, and Korean text. Provides language-aware tokenization for CJK languages.
+**`gse`** - Language-aware tokenization for Chinese and Japanese text. Disabled by default. Enable with the `ENABLE_TOKENIZER_GSE` environment variable. For Korean text, see the `kagome_kr` option.
+
+For the full list of supported tokenizers — including `kagome_ja`, `kagome_kr`, and the per-property text-analyzer options — see the [tokenization reference](../config-refs/collections.mdx#tokenization).
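
The splitting rules this diff documents can be approximated in plain Python for intuition. This is a sketch of the described behavior only, not Weaviate's actual tokenizer implementation (edge cases such as Unicode handling and trigram whitespace treatment may differ):

```python
import re

def tokenize_word(text: str) -> list[str]:
    # `word`: split on whitespace AND punctuation, lowercase each token.
    return re.findall(r"[a-z0-9]+", text.lower())

def tokenize_lowercase(text: str) -> list[str]:
    # `lowercase`: split on whitespace only, then lowercase;
    # symbols like `@` and `.` survive inside tokens.
    return text.lower().split()

def tokenize_whitespace(text: str) -> list[str]:
    # `whitespace`: split on whitespace only, preserving case and symbols.
    return text.split()

def tokenize_trigram(text: str) -> list[str]:
    # `trigram`: overlapping 3-character windows (lowercased here as a
    # simplification; exact whitespace handling is an assumption).
    t = text.lower()
    return [t[i:i + 3] for i in range(len(t) - 2)]

sample = "Contact: Jane.Doe@Example.com"
print(tokenize_word(sample))        # ['contact', 'jane', 'doe', 'example', 'com']
print(tokenize_lowercase(sample))   # ['contact:', 'jane.doe@example.com']
print(tokenize_whitespace(sample))  # ['Contact:', 'Jane.Doe@Example.com']
```

The middle line shows why the rewritten `lowercase` description matters: the email address stays intact as a single searchable token, whereas `word` tokenization shreds it into five fragments.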