Skip to content

Remove key atomization in SchemaUtils.to_typemap/2#3534

Draft
ruslandoga wants to merge 4 commits into
Logflare:mainfrom
ruslandoga:conductor/schemautils-preserve-metadata-string-keys
Draft

Remove key atomization in SchemaUtils.to_typemap/2#3534
ruslandoga wants to merge 4 commits into
Logflare:mainfrom
ruslandoga:conductor/schemautils-preserve-metadata-string-keys

Conversation

@ruslandoga
Copy link
Copy Markdown
Contributor

@ruslandoga ruslandoga commented May 28, 2026

Sort of related: https://erlef.org/blog/eef/atom-exhaustion -- but I'll also check if it helps perf.

@ruslandoga ruslandoga changed the title remove key atomization in schemautils to_timemap remove key atomization in SchemaUtils.to_timemap/1 May 28, 2026
[normalized_key] = Map.keys(typemap)
assert is_atom(normalized_key)
assert String.valid?(Atom.to_string(normalized_key))
assert SchemaUtils.to_typemap(%{raw => "x"}) == %{"yþy" => %{t: :string}}
Copy link
Copy Markdown
Contributor Author

@ruslandoga ruslandoga May 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Off-topic, but is this actually the desired behaviour? The produced string is very different.

iex(1)> "yþy" <> <<0>>
#==> <<121, 195, 190, 121, 0>>
iex(2)> <<0xFF, 0xFE, 0xFD>>
#==> <<255, 254, 253>>

@ruslandoga ruslandoga changed the title remove key atomization in SchemaUtils.to_timemap/1 Remove key atomization in SchemaUtils.to_timemap/1 May 28, 2026
@djwhitt
Copy link
Copy Markdown
Contributor

djwhitt commented May 28, 2026

I think it's worth cleaning up the dead to_typemap/1 metadata path as part of this PR.

  • The metadata variant (lines 75-118) and decode_until_valid!/2 are unreachable from production code. The only caller that fed user-supplied keys into to_typemap/1 was BigQuerySchemaChange.validate, and perf: rewrite BigQuerySchemaChange.validate as single-pass walk #3489 rewrote that to walk the body directly with to_string/1 — no intermediate typemap, no atomization, no canonical form needed for keys. Invalid-UTF-8 keys in log payloads now just miss the schema_flat_map lookup and pass through as unknown fields, which is the same effective outcome the old code produced.
  • A git grep to_typemap across lib/ and test/ confirms zero non-test callers of the metadata variant; the only consumer is schema_utils_test.exs itself.
  • The new "yþy" assertion in this PR is documenting behavior of dead code.

Brief history for context on why the path is there in the first place:

  • 85dfb6972 (Jan 2020) introduced to_typemap/1 with String.to_existing_atom, assuming field-name atoms were already minted earlier in the pipeline.
  • 9d7470435 (Sep 2020, "fix the schema utils to not existing atom bug") swapped that for String.to_atom after the assumption broke on novel payload fields — trading the crash for unbounded atom growth.
  • e08caec35 (Dec 2020, "fix the utf8 encoding bug") added decode_until_valid!/2 after customers sent Latin1-encoded field names (the fixture in that commit is the Portuguese "Código de Rastreio" mojibake). String.to_atom requires valid UTF-8, so the decoder fallback + Unicode.unaccent existed solely to produce a string the atomizer would accept.
  • 62aaa6ef0 (perf: rewrite BigQuerySchemaChange.validate as single-pass walk #3489, May 2026) rewrote the validator to skip the intermediate typemap entirely. That severed the only production call site that fed user-supplied keys into to_typemap/1, which means the whole "make an atom out of arbitrary bytes" chain no longer has a job to do.

Suggested steps to do this:

  • Delete the to_typemap/1 metadata clauses in schema_utils.ex (lines 75-118 on this branch).
  • Delete decode_until_valid!/2 (lines 120-136).
  • Delete not_empty_container?/1 (lines 138-148) if there are no other callers — looks like there aren't.
  • In schema_utils_test.exs, drop the describe "to_typemap/1" block's metadata-input cases. Keep the BigQuery-schema test ("BigQuery schema field names are preserved as strings") since it covers the live variant.

@ruslandoga ruslandoga changed the title Remove key atomization in SchemaUtils.to_timemap/1 Remove key atomization in SchemaUtils.to_timemap/1,2 May 28, 2026
@ruslandoga ruslandoga changed the title Remove key atomization in SchemaUtils.to_timemap/1,2 Remove key atomization in SchemaUtils.to_timemap/2 May 28, 2026
@ruslandoga
Copy link
Copy Markdown
Contributor Author

ruslandoga commented May 28, 2026

I've opened a separate PR to remove to_typemap/1: #3539

This PR will focus on removing atomization from to_typemap/2.

@ruslandoga ruslandoga changed the title Remove key atomization in SchemaUtils.to_timemap/2 Remove key atomization in SchemaUtils.to_typemap/2 May 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants