paleolimbot commented Dec 5, 2025

Some of the column types in the Overture Maps Foundation Parquet files are deeply nested in ways we hadn't yet encountered when converting to R.

The motivating reprex here was:

library(sedonadb)

sd_read_parquet("/Volumes/data/overture/data/theme=divisions/type=division_area/") |>
  head(100000) |> 
  sd_collect() |> 
  tibble::as_tibble()
#> Can't convert `common` <map<key_value: struct<key: string, value: string>>> to R vector of type vctrs_list_of
#> or
#> Something about expected length mixing up rows and columns of a data frame

After this PR:

library(sedonadb)
#> Warning: package 'sedonadb' was built under R version 4.5.2

sd_read_parquet("/Volumes/data/overture/data/theme=divisions/type=division_area/") |>
  head(100000) |> 
  sd_collect() |> 
  tibble::as_tibble()
#> # A tibble: 100,000 × 13
#>    id     geometry bbox$xmin country version sources subtype class names$primary
#>    <chr>  <grrw_v>     <dbl> <chr>     <int> <list<> <chr>   <chr> <chr>        
#>  1 194bd… <POLYGO…     -16.6 ES            1 [1 × 7] microh… land  "Plaza Paco …
#>  2 f7334… <MULTIP…     -16.6 ES            3 [1 × 7] locali… land  "Los Realejo…
#>  3 a273d… <POLYGO…     -16.6 ES            1 [1 × 7] microh… land  "Plaza de la…
#>  4 46d32… <POLYGO…     -16.6 ES            1 [1 × 7] microh… land  "Plaza de la…
#>  5 897c5… <POLYGO…     -16.6 ES            1 [1 × 7] microh… land  "Plaza Domín…
#>  6 92c7c… <POLYGO…     -16.6 ES            1 [1 × 7] microh… land  "Plazoleta M…
#>  7 afdbf… <POLYGO…     -16.6 ES            1 [1 × 7] microh… land  "Plaza Poeta…
#>  8 8ed3a… <POLYGO…     -16.6 ES            1 [1 × 7] microh… land  "Plaza de Pe…
#>  9 319f4… <POLYGO…     -16.7 ES            2 [1 × 7] locali… land  "La Guancha" 
#> 10 36457… <MULTIP…     -16.7 ES            3 [1 × 7] locali… land  "San Juan de…
#> # ℹ 99,990 more rows
#> # ℹ 9 more variables: bbox$xmax <dbl>, $ymin <dbl>, $ymax <dbl>,
#> #   names$common <list<df[,2]>>, $rules <list<df[,6]>>, is_land <lgl>,
#> #   is_territorial <lgl>, region <chr>, division_id <chr>
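
For reference, the `names$common <list<df[,2]>>` column in the output above suggests that each map value arrives in R as a two-column data frame of keys and values, collected in a list. A minimal base R sketch of that shape follows. The entries and the `lookup_common()` helper are made up for illustration and are not part of sedonadb or nanoarrow.

```r
# Hypothetical stand-in for a converted map<string, string> column:
# one data frame of key/value pairs per row (the values here are invented).
common <- list(
  data.frame(key = c("en", "es"), value = c("Tenerife", "Tenerife")),
  data.frame(key = character(), value = character()),
  data.frame(key = "fr", value = "Tenerife")
)

# Hypothetical helper: look up one key in every row's map, NA when absent.
lookup_common <- function(maps, key) {
  vapply(
    maps,
    function(kv) {
      hit <- kv$value[kv$key == key]
      if (length(hit) == 0) NA_character_ else hit[[1]]
    },
    character(1)
  )
}

lookup_common(common, "en")
```

Rows whose map is empty or lacks the requested key come back as `NA`, so the result lines up with the rows of the collected data frame.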

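Interestingly, nanoarrow seems to be much faster than arrow at converting this deeply nested structure: in the benchmark below its median time is roughly 8x lower (5.81ms vs 44.95ms) and it allocates roughly 10x less memory (2.06MB vs 21.04MB). I have no idea why.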

complex_map_file <- system.file("test-data/complex-map.arrows", package = "nanoarrow")

bench::mark(
  arrow = as.data.frame(arrow::read_ipc_stream(complex_map_file)),
  nanoarrow = as.data.frame(nanoarrow::read_nanoarrow(complex_map_file)),
  check = FALSE
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 arrow       44.95ms  44.95ms      22.2   21.04MB    178. 
#> 2 nanoarrow    3.97ms   5.81ms     178.     2.06MB     23.8

Created on 2025-12-05 with reprex v2.1.1

@paleolimbot paleolimbot marked this pull request as ready for review December 6, 2025 04:56