feat: add data_purposes to Dataset, DatasetCollection, DatasetField #39

Merged
galvana merged 4 commits into main from feat/add-data-purposes-to-dataset-models
Mar 25, 2026
@galvana galvana commented Mar 16, 2026

Description Of Changes

Add data_purposes: Optional[List[FidesKey]] to three classes in the dataset model hierarchy, mirroring the existing data_categories pattern:

  • DatasetFieldBase — applies to individual fields (and sub-fields via DatasetField inheritance)
  • DatasetCollection — applies to all fields in the collection
  • Dataset — applies to all collections in the dataset

This enables purpose-based access control (PBAC): each level of the dataset hierarchy can declare which data purposes are allowed, and the PBAC engine evaluates consumer access against these purpose restrictions.

The field is Optional with a None default, so this is fully backward compatible — existing datasets without data_purposes are unaffected.
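As a rough illustration of how purposes declared at each level could combine, here is a minimal standalone sketch. It uses stdlib dataclasses as stand-ins for the real pydantic models, and every name in it (`FieldSketch`, `CollectionSketch`, `DatasetSketch`, `effective_purposes`) is hypothetical. It also assumes that purposes accumulate down the hierarchy as a union; this PR does not specify how the PBAC engine actually resolves them.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

FidesKey = str  # stand-in for fideslang's FidesKey type (assumption)


@dataclass
class FieldSketch:
    name: str
    data_purposes: Optional[List[FidesKey]] = None  # None when omitted
    fields: List["FieldSketch"] = field(default_factory=list)  # nested sub-fields


@dataclass
class CollectionSketch:
    name: str
    fields: List[FieldSketch]
    data_purposes: Optional[List[FidesKey]] = None


@dataclass
class DatasetSketch:
    fides_key: FidesKey
    collections: List[CollectionSketch]
    data_purposes: Optional[List[FidesKey]] = None


def effective_purposes(
    dataset: DatasetSketch,
    collection: CollectionSketch,
    field_path: List[str],
) -> Set[FidesKey]:
    """Collect purposes from the dataset down to the field at field_path.

    ASSUMPTION: purposes are combined as a union, so None and [] both
    contribute nothing. The real PBAC engine may resolve them differently.
    """
    purposes = set(dataset.data_purposes or [])
    purposes |= set(collection.data_purposes or [])
    candidates = collection.fields
    for name in field_path:
        # Raises StopIteration if the path names a field that does not exist.
        current = next(f for f in candidates if f.name == name)
        purposes |= set(current.data_purposes or [])
        candidates = current.fields
    return purposes


# Same shape as the dataset in Steps to Confirm below.
ds = DatasetSketch(
    fides_key="test",
    data_purposes=["marketing"],
    collections=[
        CollectionSketch(
            name="c1",
            data_purposes=["analytics"],
            fields=[
                FieldSketch(name="f1", data_purposes=["fraud"]),
                FieldSketch(
                    name="nested",
                    fields=[FieldSketch(name="sub1", data_purposes=["compliance"])],
                ),
            ],
        )
    ],
)

sub1_purposes = effective_purposes(ds, ds.collections[0], ["nested", "sub1"])
assert sub1_purposes == {"marketing", "analytics", "compliance"}
```

Under a different resolution rule (for example, the most specific level overriding its parents), only the body of `effective_purposes` would change; the model shape stays the same.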

Code Changes

  • src/fideslang/models.py - Add data_purposes field to DatasetFieldBase, DatasetCollection, and Dataset

Steps to Confirm

from fideslang.models import Dataset, DatasetCollection, DatasetField

# All levels including recursive sub-fields
d = Dataset(
    fides_key="test",
    data_purposes=["marketing"],
    collections=[
        DatasetCollection(
            name="c1",
            data_purposes=["analytics"],
            fields=[
                DatasetField(name="f1", data_purposes=["fraud"]),
                DatasetField(
                    name="nested",
                    fides_meta={"data_type": "object"},
                    fields=[DatasetField(name="sub1", data_purposes=["compliance"])],
                ),
            ],
        )
    ],
)

j = d.model_dump(mode="json")
assert j["data_purposes"] == ["marketing"]
assert j["collections"][0]["data_purposes"] == ["analytics"]
assert j["collections"][0]["fields"][0]["data_purposes"] == ["fraud"]
assert j["collections"][0]["fields"][1]["fields"][0]["data_purposes"] == ["compliance"]

# Backward compat: None when omitted
d2 = Dataset(fides_key="old", collections=[DatasetCollection(name="c", fields=[DatasetField(name="f")])])
assert d2.data_purposes is None

Pre-Merge Checklist

  • All CI Pipelines Succeeded
  • Documentation Updated
  • Issue Requirements are Met
  • Relevant Follow-Up Issues Created
  • Update CHANGELOG.md

🤖 Generated with Claude Code

…setField

Add data_purposes as an optional field at all levels of the dataset
hierarchy, mirroring the existing data_categories pattern. This enables
purpose-based access control (PBAC) by declaring which data purposes
are allowed for each dataset, collection, field, and sub-field.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
galvana pushed a commit to ethyca/fides that referenced this pull request Mar 16, 2026
Update fideslang dependency to use the feat/add-data-purposes-to-dataset-models
branch which adds data_purposes at dataset, collection, field, and sub-field
levels.

Dependency: ethyca/fideslang#39

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adrian Galvan and others added 3 commits March 16, 2026 16:50
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Suppress mypy errors caused by newer pydantic versions resolving in CI:
- Add type: ignore[misc] for ValidationInfo explicit Any warnings
- Remove stale type: ignore[assignment] comments no longer needed
- Add type: ignore[arg-type] for Optional list default_factory

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove unused `_: ValidationInfo` param from `validate_object_fields`
  (model_validator mode="after" doesn't need it)
- Remove `Optional` from Taxonomy list fields that default to `[]`
  (type now matches the actual default value)
- Remove `disallow_any_explicit` mypy setting that conflicts with
  pydantic's own types (ValidationInfo, Dict[str, Any])
- Clean up all `# type: ignore` comments that are no longer needed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@galvana galvana marked this pull request as ready for review March 25, 2026 17:07
@galvana galvana requested a review from adamsachs March 25, 2026 17:07
@adamsachs adamsachs left a comment

OK, this all looks fine, just left an inline note about None vs [] but i can see this was a deliberate choice for backward compatibility reasons which makes sense.

it does strike me as a bit odd that we have this data_purposes pointer without actually defining a DataPurpose model here in fideslang (unless i've missed it...?) do we have other examples of doing that? it just seems like it leaves our model here a bit incomplete and not self-contained, and something doesn't feel 'right' about that.

i won't consider that a blocker, but i do want to hear your thoughts on that.

Comment on lines +368 to +371
    data_purposes: Optional[List[FidesKey]] = Field(
        default=None,
        description="Array of Data Purpose resources, identified by `fides_key`, that apply to this field.",
    )

is this deliberately an Optional[List] that defaults to None rather than a List[] that defaults to []?

i realize this follows the data_categories convention above, but i just want to make sure we're not perpetuating a bad design choice. do we imagine that None will signify something different than an empty list ([]) in this case? perhaps the flexibility is good for a model as foundational as this one, because there may be many different applications, and it can be hard to anticipate whether some application may want to distinguish None from [] in the future, even if we don't have that use case now.

i can conceive of different meanings for None vs [] (e.g. None = 'hasn't yet been reviewed/annotated for data_purposes'; [] = 'reviewed and determined no data_purposes apply') so i think this choice is justifiable. but just wanted to raise this for thought/discussion/confirmation!

OK and now i see the PR note about backward compatibility. that seems like enough justification 👍
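To make the None vs [] distinction raised in this thread concrete, here is a small sketch of how a downstream consumer could tell the two apart. The function name `annotation_status` and the three status labels are hypothetical, and the meanings attached to None and [] are the reviewer's suggested interpretation, not semantics that fideslang defines.

```python
from typing import List, Optional


def annotation_status(data_purposes: Optional[List[str]]) -> str:
    """Classify a data_purposes value (hypothetical helper).

    ASSUMED meanings, taken from the review discussion:
      None -> not yet reviewed/annotated for data purposes
      []   -> reviewed, and no data purposes apply
    """
    if data_purposes is None:
        return "unreviewed"
    if not data_purposes:
        return "reviewed_no_purposes"
    return "annotated"


assert annotation_status(None) == "unreviewed"
assert annotation_status([]) == "reviewed_no_purposes"
assert annotation_status(["marketing"]) == "annotated"
```

A field typed as a plain list defaulting to [] could not express the first case, which is the flexibility the Optional default preserves.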

@galvana galvana merged commit c5b7861 into main Mar 25, 2026
37 checks passed
2 participants