Skip to content

feat(python/sedonadb): add DataFrame __getitem__#846

Merged
zhangfengcdt merged 1 commit into
apache:mainfrom
jiayuasu:feature/df-getitem
May 15, 2026
Merged

feat(python/sedonadb): add DataFrame __getitem__#846
zhangfengcdt merged 1 commit into
apache:mainfrom
jiayuasu:feature/df-getitem

Conversation

@jiayuasu
Copy link
Copy Markdown
Member

Pandas-style bracket indexing on DataFrame, dispatching to the existing select / filter / col paths landed in earlier PRs (#807, #823, #832, #835).

This is a small follow-on PR continuing Phase P1/P2 of #791. Pure Python sugar — no new Rust code.

What's new

df["x"]                       # → Expr(col("x"))
df[["x", "y"]]                # → DataFrame.select("x", "y")
df[df["x"] > 0]               # → DataFrame.filter(col("x") > 0)
df[(df["x"] + df["y"]) > 30]  # composes cleanly with operator overloads
df[df["x"] > 1][["y"]]        # chain: filter → project

Row-position indexing (ints, slices, .loc, .iloc) is intentionally not supported and raises TypeError with a message pointing at the three supported forms. SedonaDB has no row-ordering or index concept in this scope; the non-goal is documented in #791.

Note on df["x"] return type

For now, df["x"] returns an Expr (a column reference). A Series wrapper is planned for the next phase — when that lands, df["x"] will start returning Series instead and Expr will fall back to an internal escape hatch. The compose-with-operators idiom continues to work either way because Series will share the same operator surface as Expr.

Test plan

10 tests in tests/expr/test_dataframe_getitem.py:

  • Positive: string lookup → Expr (exact repr()); list projection (two-col, single-col, reorder); bool-Expr filter; arithmetic compose; filter-then-project chain.
  • Errors: list with non-string element → TypeError; int → TypeError; slice → TypeError.

All 10 pass locally. No regressions in existing tests; doctests + ruff format + ruff check all clean.

Pandas-style bracket indexing dispatches to the existing select /
filter / col paths shipped in earlier PRs:

- `df["x"]` returns an `Expr` referencing column `x` (a Series
  wrapper will land in a later phase).
- `df[["x", "y"]]` returns a new `DataFrame` projecting the listed
  columns.
- `df[bool_expr]` returns a new `DataFrame` filtered by the boolean
  expression.

Row-position indexing (ints, slices, `.loc`, `.iloc`) is intentionally
not supported and raises `TypeError` with a message pointing at the
three supported forms.

Pure Python sugar — no new Rust code. Composes cleanly with the
operator overloads from apache#823, so the natural pandas idiom
`df[(df.x + df.y) > 30]` works as-is.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds pandas-style bracket indexing (DataFrame.__getitem__) to SedonaDB’s Python DataFrame, providing ergonomic sugar that routes to the already-landed expression (Expr), projection (select), and filtering (filter) APIs.

Changes:

  • Implement DataFrame.__getitem__ supporting df["col"] -> Expr, df[["c1","c2"]] -> select, and df[expr] -> filter, with non-supported row-based indexing raising TypeError.
  • Add a focused test module covering supported forms, composition/chaining, and key error cases.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
python/sedonadb/python/sedonadb/dataframe.py Adds DataFrame.__getitem__ dispatching to col(), select(), and filter() with clear TypeErrors for unsupported indexers.
python/sedonadb/tests/expr/test_dataframe_getitem.py Adds unit tests validating the new bracket-indexing behavior and error paths.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@jiayuasu jiayuasu marked this pull request as ready for review May 15, 2026 07:45
Copy link
Copy Markdown
Member

@zhangfengcdt zhangfengcdt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zhangfengcdt zhangfengcdt merged commit 5650acb into apache:main May 15, 2026
6 checks passed
Copy link
Copy Markdown
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhangfengcdt Please let the original committer do the merging (if a PR is not from a committer, give them a reasonable window to update if they want to do another round of self-review).

@jiayuasu Can we turn off the round robin reviews? We have a huge variety in our code base and should split up review based on familiarity, not randomly.

Comment on lines +121 to +122
from sedonadb.expr import Expr
from sedonadb.expr import col as _col
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be module-level imports (and they should be combined). Lazy imports in this module are only for optional dependencies (pyarrow is the usual culprit)

Comment on lines +124 to +140
if isinstance(key, str):
return _col(key)
if isinstance(key, Expr):
return self.filter(key)
if isinstance(key, list):
for k in key:
if not isinstance(k, str):
raise TypeError(
f"DataFrame[list] expects a list of column names, "
f"got {type(k).__name__}"
)
return self.select(*key)
raise TypeError(
f"DataFrame indexing is not supported for {type(key).__name__}. "
f"Use df['x'] for a column expression, df[['x', 'y']] to project "
f"columns, or df[bool_expr] to filter rows."
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should restrict this to only the first case. The ability to put a filter or multi-column selection here is pandas-y but prevents type hinting from working well on the output (i.e., IDEs and LLMs that use the language server or can use type hints can't auto complete tab["col"].<tab>).

This should also support integer indexing (e.g., tab[0] gives you the first column).

FWIW Ibis used to accept the second two cases but deprecated them (FutureWarning: Selecting/filtering arbitrary expressions in Table.getitemis deprecated and will be removed in version 10.0. Please useTable.selectorTable.filter instead.).

"""
return self.limit(n)

def __getitem__(self, key):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def __getitem__(self, key):
def __getitem__(self, key: Union[str, int]) -> Expr:

Eventually we can give Expr a subclass for columns to improve the type hinting

Comment on lines +124 to +125
if isinstance(key, str):
return _col(key)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should reach into the underlying DFSchema and pull out the actual qualified column expression (displaying a nice error for columns that don't exist). You'll need this for join expressions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants