feat(python/sedonadb): add DataFrame __getitem__#846
Conversation
Pandas-style bracket indexing dispatches to the existing select / filter / col paths shipped in earlier PRs: - `df["x"]` returns an `Expr` referencing column `x` (a Series wrapper will land in a later phase). - `df[["x", "y"]]` returns a new `DataFrame` projecting the listed columns. - `df[bool_expr]` returns a new `DataFrame` filtered by the boolean expression. Row-position indexing (ints, slices, `.loc`, `.iloc`) is intentionally not supported and raises `TypeError` with a message pointing at the three supported forms. Pure Python sugar — no new Rust code. Composes cleanly with the operator overloads from apache#823, so the natural pandas idiom `df[(df.x + df.y) > 30]` works as-is.
There was a problem hiding this comment.
Pull request overview
Adds pandas-style bracket indexing (DataFrame.__getitem__) to SedonaDB’s Python DataFrame, providing ergonomic sugar that routes to the already-landed expression (Expr), projection (select), and filtering (filter) APIs.
Changes:
- Implement
DataFrame.__getitem__supportingdf["col"] -> Expr,df[["c1","c2"]] -> select, anddf[expr] -> filter, with non-supported row-based indexing raisingTypeError. - Add a focused test module covering supported forms, composition/chaining, and key error cases.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| python/sedonadb/python/sedonadb/dataframe.py | Adds DataFrame.__getitem__ dispatching to col(), select(), and filter() with clear TypeErrors for unsupported indexers. |
| python/sedonadb/tests/expr/test_dataframe_getitem.py | Adds unit tests validating the new bracket-indexing behavior and error paths. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
paleolimbot
left a comment
There was a problem hiding this comment.
@zhangfengcdt Please let the original committer do the merging (if a PR is not from a committer, give them a reasonable window to update if they want to do another round of self-review).
@jiayuasu Can we turn off the round robin reviews? We have a huge variety in our code base and should split up review based on familiarity, not randomly.
| from sedonadb.expr import Expr | ||
| from sedonadb.expr import col as _col |
There was a problem hiding this comment.
These should be module-level imports (and they should be combined). Lazy imports in this module are only for optional dependencies (pyarrow is the usual culprit)
| if isinstance(key, str): | ||
| return _col(key) | ||
| if isinstance(key, Expr): | ||
| return self.filter(key) | ||
| if isinstance(key, list): | ||
| for k in key: | ||
| if not isinstance(k, str): | ||
| raise TypeError( | ||
| f"DataFrame[list] expects a list of column names, " | ||
| f"got {type(k).__name__}" | ||
| ) | ||
| return self.select(*key) | ||
| raise TypeError( | ||
| f"DataFrame indexing is not supported for {type(key).__name__}. " | ||
| f"Use df['x'] for a column expression, df[['x', 'y']] to project " | ||
| f"columns, or df[bool_expr] to filter rows." | ||
| ) |
There was a problem hiding this comment.
I think we should restrict this to only the first case. The ability to put a filter or multi-column selection here is pandas-y but prevents type hinting from working well on the output (i.e., IDEs and LLMs that use the language server or can use type hints can't auto complete tab["col"].<tab>).
This should also support integer indexing (e.g., tab[0] gives you the first column).
FWIW Ibis used to accept the second two cases but deprecated them (FutureWarning: Selecting/filtering arbitrary expressions in Table.getitemis deprecated and will be removed in version 10.0. Please useTable.selectorTable.filter instead.).
| """ | ||
| return self.limit(n) | ||
|
|
||
| def __getitem__(self, key): |
There was a problem hiding this comment.
| def __getitem__(self, key): | |
| def __getitem__(self, key: Union[str, int]) -> Expr: |
Eventually we can give Expr a subclass for columns to improve the type hinting
| if isinstance(key, str): | ||
| return _col(key) |
There was a problem hiding this comment.
This should reach into the underlying DFSchema and pull out the actual qualified column expression (displaying a nice error for columns that don't exist). You'll need this for join expressions.
Pandas-style bracket indexing on
DataFrame, dispatching to the existingselect/filter/colpaths landed in earlier PRs (#807, #823, #832, #835).This is a small follow-on PR continuing Phase P1/P2 of #791. Pure Python sugar — no new Rust code.
What's new
Row-position indexing (ints, slices,
.loc,.iloc) is intentionally not supported and raisesTypeErrorwith a message pointing at the three supported forms. SedonaDB has no row-ordering or index concept in this scope; the non-goal is documented in #791.Note on
df["x"]return typeFor now,
df["x"]returns anExpr(a column reference). ASerieswrapper is planned for the next phase — when that lands,df["x"]will start returningSeriesinstead andExprwill fall back to an internal escape hatch. The compose-with-operators idiom continues to work either way becauseSerieswill share the same operator surface asExpr.Test plan
10 tests in
tests/expr/test_dataframe_getitem.py:repr()); list projection (two-col, single-col, reorder); bool-Expr filter; arithmetic compose; filter-then-project chain.TypeError; int →TypeError; slice →TypeError.All 10 pass locally. No regressions in existing tests; doctests +
ruff format+ruff checkall clean.