Fix byte/character offset confusion in formatter for multi-byte UTF-8 #105

Copilot · 2026-02-02T15:28:58Z

Formatting proto files with multi-byte UTF-8 characters (Cyrillic, etc.) was non-idempotent, adding empty // comment lines on each format operation.

Root Cause

offset_to_position in src/formatter/clang.rs converted clang-format's byte offsets to LSP positions using byte arithmetic:

let character = offset - last_newline;  // treats byte offset as character offset

This fails for multi-byte UTF-8 text, where one character occupies more than one byte. Example: byte offset 134 in Cyrillic text yields character 119 under byte arithmetic, but the correct LSP position is 77 UTF-16 code units.
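The figures above can be reproduced with a small standalone check. This is only a sketch, not the code in src/formatter/clang.rs: the helper `column_by_bytes_and_utf16` and the constant `SAMPLE` are hypothetical names, and the sample assumes the proto file from the linked issue with two-space indentation and LF line endings.

```rust
/// Proto source from the issue report (assumed: two-space indent, LF newlines).
const SAMPLE: &str = "message Test {\n  // Обратная совместимость: если true, применяет фильтры enabled_not_false и removed_not_true.\n  int32 x = 1;\n}\n";

/// Hypothetical helper: returns the column of `offset` computed two ways,
/// as (byte arithmetic, UTF-16 code units).
/// Panics if `offset` is out of range or not on a character boundary.
fn column_by_bytes_and_utf16(text: &str, offset: usize) -> (usize, usize) {
    let up_to_offset = &text[..offset];
    // Byte index of the first character on the line containing `offset`.
    let line_start = up_to_offset.rfind('\n').map_or(0, |i| i + 1);
    // Old behaviour: the byte distance is used directly as the column.
    let by_bytes = offset - line_start;
    // LSP-style column: UTF-16 code units since the start of the line.
    let by_utf16 = up_to_offset[line_start..].encode_utf16().count();
    (by_bytes, by_utf16)
}
```

Under those assumptions, `column_by_bytes_and_utf16(SAMPLE, 134)` returns `(119, 77)`. Each Cyrillic letter takes two bytes in UTF-8 but one UTF-16 code unit, so the byte-based column overshoots by the number of Cyrillic letters before the offset.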

Changes

Fixed offset calculation: count UTF-16 code units from the last newline instead of using byte arithmetic

let text_after_newline = &up_to_offset[last_newline..];
let character = text_after_newline.encode_utf16().count();
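For context, here is a minimal sketch of how the two lines above might sit inside a complete conversion function. The name `offset_to_position_sketch`, its signature, and the `(line, character)` tuple return type are assumptions for illustration; the real `offset_to_position` in src/formatter/clang.rs may differ in all three.

```rust
/// Hypothetical stand-in for the fixed conversion: maps a byte offset into
/// `text` to a zero-based (line, character) pair, where `character` is
/// measured in UTF-16 code units as LSP does by default.
/// Returns None if `offset` is out of range or splits a UTF-8 sequence.
fn offset_to_position_sketch(text: &str, offset: usize) -> Option<(u32, u32)> {
    // `get` refuses offsets that are not on a character boundary.
    let up_to_offset = text.get(..offset)?;
    // Line number: how many newlines precede the offset.
    let line = up_to_offset.bytes().filter(|&b| b == b'\n').count();
    // Byte index just past the last newline (0 when on the first line).
    let last_newline = up_to_offset.rfind('\n').map_or(0, |i| i + 1);
    // Column: count UTF-16 code units, not bytes.
    let text_after_newline = &up_to_offset[last_newline..];
    let character = text_after_newline.encode_utf16().count();
    Some((line as u32, character as u32))
}
```

Counting with `encode_utf16()` rather than `chars()` matters for characters outside the Basic Multilingual Plane: an emoji such as U+1F600 is four UTF-8 bytes and one `char`, but two UTF-16 code units. Returning `None` for an offset inside a multi-byte sequence is a design choice of this sketch; a caller could instead round to the nearest boundary.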

Added tests: test_offset_to_position_cyrillic (unit) and test_textedit_from_clang_output_cyrillic (integration) with multi-byte UTF-8 input

Original prompt

This section details the original issue you should resolve

<issue_title>Formatting inserts a new empty // comment line every time (non-idempotent formatting)</issue_title>
<issue_description>Hi! I just happened to stumble upon this tricky formatting bug.

What happened

I have this example:
message Test {
  // Обратная совместимость: если true, применяет фильтры enabled_not_false и removed_not_true.
  int32 x = 1;
}
When formatting is applied, it seems to try to split the comment across two lines, but in the end it just adds a new line with an empty comment:
message Test {
  // Обратная совместимость: если true, применяет фильтры enabled_not_false и removed_not_true.
  // 
  int32 x = 1;
}
Each further formatting pass just adds another empty comment line:
message Test {
  // Обратная совместимость: если true, применяет фильтры enabled_not_false и removed_not_true.
  // 
  // 
  // 
  int32 x = 1;
}
Environment

OS: Linux fedora 42

Neovim: NVIM v0.11.1

protols: 0.13.2

clang-format: clang-format version 20.1.8 (Fedora 20.1.8-4.fc42)

Formatting trigger: Neovim LSP (vim.lsp.buf.format()), also happens on :w (format-on-save enabled)

Video Example

https://github.com/user-attachments/assets/f33be03e-78e6-45db-8c83-89ae15a31d0b
</issue_description>

Comments on the Issue (you are @copilot in this section)

Fixes Formatting inserts a new empty // comment line every time (non-idempotent formatting) #104

Co-authored-by: asharkhan3101 <[email protected]>

Initial plan

d5fe0bf

Copilot AI assigned Copilot and asharkhan3101 Feb 2, 2026

Copilot started work on behalf of asharkhan3101 February 2, 2026 15:29

Copilot AI and others added 3 commits February 2, 2026 15:34

Fix non-idempotent formatting with multi-byte UTF-8 characters

f30b8b3

Co-authored-by: asharkhan3101 <[email protected]>

Add integration test for Cyrillic character formatting

1e1662e

Co-authored-by: asharkhan3101 <[email protected]>

Clarify test comment about byte vs UTF-16 positions

d85f377

Co-authored-by: asharkhan3101 <[email protected]>

Copilot AI changed the title from "[WIP] Fix formatting bug creating empty comment lines" to "Fix byte/character offset confusion in formatter for multi-byte UTF-8" Feb 2, 2026

Copilot AI requested a review from asharkhan3101 February 2, 2026 15:41

Copilot finished work on behalf of asharkhan3101 February 2, 2026 15:41

asharkhan3101 marked this pull request as ready for review February 2, 2026 15:49

asharkhan3101 approved these changes Feb 2, 2026

coder3101 merged commit 8bc1877 into main Feb 2, 2026
1 check passed

coder3101 deleted the copilot/fix-empty-comment-line-issue branch February 2, 2026 16:34
