@aireenmei
Collaborator

@aireenmei aireenmei commented Jan 30, 2026

Description

Restructure the input pipeline folder under src/maxtext as follows:
input_pipeline

  • packing
    • prefill_packing.py
    • sequence_packing.py
  • tokenizer.py
  • multihost_dataloading.py
  • distillation_data_processing.py (prev _distillation_data_processing.py)
  • grain_data_processing.py (prev _grain_data_processing.py)
  • grain_tokenizer.py (prev _grain_tokenizer.py)
  • hf_data_processing.py (prev _hf_data_processing.py)
  • input_pipeline_utils.py (prev _input_pipeline_utils.py)
  • tfds_data_processing.py (prev _tfds_data_processing.py)
  • tfds_data_processing_c4_mlperf.py (prev _tfds_data_processing_c4_mlperf.py)
  • input_pipeline_interface.py
  • synthetic_data_processing.py

Makes the corresponding changes to imports.
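The rename pattern in the list above (drop the leading underscore, keep the rest of the file name) and the matching import update can be sketched roughly as follows. The module names come from the list; `OLD_PACKAGE`, `NEW_PACKAGE`, and `rewrite_import` are hypothetical names introduced for illustration, and the old and new dotted package paths are assumptions, since the description does not spell them out.

```python
import re

# Old (underscore-prefixed) module name -> new module name, from the list above.
RENAMED_MODULES = {
    "_distillation_data_processing": "distillation_data_processing",
    "_grain_data_processing": "grain_data_processing",
    "_grain_tokenizer": "grain_tokenizer",
    "_hf_data_processing": "hf_data_processing",
    "_input_pipeline_utils": "input_pipeline_utils",
    "_tfds_data_processing": "tfds_data_processing",
    "_tfds_data_processing_c4_mlperf": "tfds_data_processing_c4_mlperf",
}

# ASSUMPTION: placeholder dotted paths; the PR description does not state them.
OLD_PACKAGE = "MaxText.input_pipeline"
NEW_PACKAGE = "maxtext.input_pipeline"


def rewrite_import(line: str) -> str:
    """Rewrite fully qualified references to renamed modules in one line."""
    for old, new in RENAMED_MODULES.items():
        # \b keeps "_tfds_data_processing" from matching inside
        # "_tfds_data_processing_c4_mlperf".
        pattern = rf"\b{re.escape(OLD_PACKAGE)}\.{re.escape(old)}\b"
        line = re.sub(pattern, f"{NEW_PACKAGE}.{new}", line)
    return line


print(rewrite_import("from MaxText.input_pipeline._hf_data_processing import x"))
```

This sketch only handles fully qualified references such as `from <package>.<module> import ...`; a form like `from <package> import _module` would need separate handling.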

Tests

CI test

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@aireenmei aireenmei force-pushed the aireen/input_restructure branch from 8905df5 to 99f69ca Compare January 30, 2026 00:46
@codecov

codecov bot commented Jan 30, 2026

Codecov Report

❌ Patch coverage is 77.77778% with 12 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...rc/maxtext/input_pipeline/grain_data_processing.py | 62.50% | 6 Missing ⚠️ |
| src/maxtext/input_pipeline/hf_data_processing.py | 61.53% | 5 Missing ⚠️ |
| src/MaxText/rl/train_rl.py | 0.00% | 1 Missing ⚠️ |
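As a rough cross-check of the numbers in this report, the count of changed lines can be inferred from the missing-line count and the patch percentage. The sketch below assumes patch coverage means covered changed lines divided by total changed lines; `changed_lines` is a hypothetical helper, not part of Codecov.

```python
def changed_lines(missing: int, coverage_pct: float) -> int:
    """Infer total changed lines from missing lines and patch coverage (%)."""
    uncovered_fraction = 1.0 - coverage_pct / 100.0
    return round(missing / uncovered_fraction)


# Overall patch: 12 missing lines at 77.77778% coverage.
total = changed_lines(12, 77.77778)
covered = total - 12

# Per-file figures from the report above.
per_file = {
    "grain_data_processing.py": changed_lines(6, 62.50),
    "hf_data_processing.py": changed_lines(5, 61.53),
    "train_rl.py": changed_lines(1, 0.00),
}

print(total, covered, per_file)
```

Under that assumption the patch touches about 54 measured lines, 42 of them covered. The three listed files account for roughly 30 of those lines, which would leave the rest in files with no missing coverage, if the report lists every file with misses.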

@aireenmei aireenmei force-pushed the aireen/input_restructure branch 2 times, most recently from 8cf1adb to 0faa654 Compare January 30, 2026 01:06
@aireenmei aireenmei marked this pull request as ready for review January 30, 2026 01:06
@aireenmei aireenmei force-pushed the aireen/input_restructure branch from 0faa654 to 9b4fdb2 Compare January 30, 2026 01:20
@aireenmei aireenmei changed the title from "restructure the input pipeline folder" to "Restructure the input pipeline folder and migrate to src/maxtext" Jan 30, 2026
@aireenmei aireenmei changed the title from "Restructure the input pipeline folder and migrate to src/maxtext" to "Restructure input_pipeline and migrate to src/maxtext" Jan 30, 2026
@aireenmei aireenmei force-pushed the aireen/input_restructure branch 4 times, most recently from 4b3ea13 to 68e1ad6 Compare January 30, 2026 17:32
@github-actions

🤖 Hi @hengtaoguo, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

Collaborator

@hengtaoguo hengtaoguo left a comment

Browsed through all the files and looks like a solid restructure and relocation. Are there any edge cases which could not be covered by the CI? Thanks!

@github-actions github-actions bot left a comment

📋 Review Summary

This pull request effectively restructures the input pipeline by migrating it to src/maxtext and organizing it into submodules. The changes are consistent and well-executed across the entire codebase, including examples, tests, and tools.

🔍 General Feedback

  • The refactoring significantly improves the project structure and modularity of the input pipeline.
  • All import paths have been updated correctly to reflect the new structure.
  • Renaming files by removing the leading underscore is a good cleanup practice that enhances clarity.

@aireenmei
Collaborator Author

> Browsed through all the files and looks like a solid restructure and relocation. Are there any edge cases which could not be covered by the CI? Thanks!

Since this is mainly a restructure and migration with nothing new added, the existing CI tests should be sufficient to catch any import errors.
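For illustration, the kind of check that catches import errors after a move like this can be sketched as a small smoke test. This is not the project's actual CI; `find_import_errors` is a hypothetical helper, and the demo uses standard-library names so it runs anywhere. In practice the list would hold the relocated modules' dotted paths (for example something like `maxtext.input_pipeline.grain_data_processing`, which is an assumption based on the directory layout).

```python
import importlib


def find_import_errors(module_names):
    """Try importing each module; return {name: error text} for failures."""
    failures = {}
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError as exc:  # also covers ModuleNotFoundError
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


# Demo: one module that exists and one that does not.
demo = find_import_errors(["json", "no_such_module_for_demo"])
for name, error in demo.items():
    print(f"{name}: {error}")
```

A check like this only proves that modules load; it does not exercise behavior, which is consistent with the point that nothing new is added in this PR.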

@aireenmei aireenmei force-pushed the aireen/input_restructure branch from 68e1ad6 to 9ae7e45 Compare January 31, 2026 00:13