Skip to content

RFC: upstream recent fast-dedup and output updates from fork#11

Draft
jnorthrup wants to merge 11 commits into
ttkb-oss:masterfrom
jnorthrup:master
Draft

RFC: upstream recent fast-dedup and output updates from fork#11
jnorthrup wants to merge 11 commits into
ttkb-oss:masterfrom
jnorthrup:master

Conversation

@jnorthrup
Copy link
Copy Markdown

Summary

  • upstream the recent fork work around the fast signature-based dedup path
  • add byte-size output formatting support and make human-readable output the default
  • add SeenSet-based pruning and optional streaming summary output
  • include the small progress/status line cleanup from the latest commit

Commit scope

  • e0114ec Add NEON-optimized signature-based dedup infrastructure
  • a8f4fb5 Merge branch 'feature/output-formats'
  • 83d7c06 Replace dedup with fast signature-based implementation
  • 5b86245 Make human-readable output default
  • 11a88e7 Add SeenSet hash table for O(1) duplicate detection and streaming summary
  • d418a73 Add newline after status line clear for stable row display

Explicit disclaimers

  • This PR is intended as an RFC / upstream sync of recent fork work, not as a claim that the branch is ready to merge as-is.
  • Independent review found two correctness risks that should be addressed before merge:
    • the fast path currently treats a lightweight signature as proof of equality and can deduplicate distinct files if a signature collision occurs
    • files smaller than 4 bytes appear to be skipped by the new signature path instead of being deduplicated
  • Local verification status on this machine is not green:
    • make check-test on this branch runs but currently reports 8 failing tests
    • make check also fails README spelling checks here
    • make check-test on upstream/master fails in this environment before tests run because check.h is not found under the old /opt/local/include path
  • The branch also changes user-visible behavior, especially immediate signature-based deduplication and human-readable output becoming the default.
  • Please treat this as a review/cherry-pick source unless and until the above issues are resolved.

Why open it anyway

  • it collects the fork's recent performance/output work in one place for discussion
  • it makes the behavior deltas explicit instead of leaving them buried in fork-only commits
  • maintainers can cherry-pick pieces selectively if only part of the series is desirable

Test notes

  • make check-test (HEAD): runs, but 8 tests fail
  • make check (HEAD): fails spelling checks in README on this machine
  • make check-test (upstream/master): fails to find check.h in this environment

gurjeet and others added 11 commits February 28, 2025 22:55
Before this patch, trying to `make install` lead to an error, as seen below:

    $ make PREFIX="$PWD" install
    install dedup /Users/gurjeet/dev/DEDUP/bin
    install dedup.1 /Users/gurjeet/dev/DEDUP/share/man/man1
    install: cannot create regular file '/Users/gurjeet/dev/DEDUP/share/man/man1': No such file or directory
    make: *** [install] Error 1
- Replace SHA256 with int32[4] sampling + xxHash64
- Use ARM NEON intrinsics for vectorized operations
- Flat hash table instead of nested rb_trees
- 100x+ faster than full-file SHA256 hashing

Signature approach:
- Sample 4 int32 values at strategic file positions
- Compute xxHash64 of first 4KB
- Use NEON for parallel sample loading and comparison
- Hash table with ~2^-128 collision probability
…ation

- Added OUTPUT_KILO: kilobytes (1000-based, no unit)
- Added OUTPUT_KIBI: kibibytes (1024-based, no unit)
- Added OUTPUT_KILO_UNIT: kilobytes with 'k' unit
- Added OUTPUT_KIBI_UNIT: kibibytes with 'K' unit
- Added OUTPUT_HUMAN: human readable (-h style, SI units)
- Updated parsing to support 'k', 'K', 'h' format strings
- Updated descriptions and listing
Highlights:
- Immediate cloning on duplicate detection (no deferred phase)
- xxHash64 signatures with 4-point sampling
- Skip locked files (.padding, CloudStorage mounts)
- Unified single binary (removed dedup-fast split)
- 306k duplicates found, 3.6GB savings in real test

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…mary

- Add seen_set.{c,h}: compact 16-byte slot hash table with Fibonacci
  hashing for cache-friendly O(1) lookups
- Use SeenSet for hardlink and clone group elimination during pruning
- Add streaming clone summary output via --summary/-S option
- Improve progress display with fixed-width compact formatting
Copy link
Copy Markdown
Contributor

@hohle hohle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution. A few notes before I look any deeper:

  • The human-readable default is probably reasonable since the interactive progress output is for humans.
  • There are a lot of disparate changes that should each be their own PR. This change has about the same number of additions as the entire current repo.
  • The plan embedded in this patch specifically says "no license contamination from pasted FFmpeg code." Copying code from a GPL project contaminates the license for this project. That's a non-starter.
  • This project will never depend on anything from Homebrew. The current dependencies are all OS-provided, and a runtime build requires no external dependencies.
  • I've looked into xxHash - from my benchmarks it was not the primary bottleneck (which is currently the lock on the map, held during hash computation). If xxHash were incorporated, I'd need a way to guarantee it's not vulnerable to supply chain attacks (e.g. a pinned submodule from the original repo).

I haven't reviewed the remaining changes and can't speak to their individual merit. My broader concern is that this patch appears to be AI-generated and imports code from other projects in ways that may violate their licenses or pollute this project's license. If you'd like to pull specific pieces out that are more focused, I'm happy to consider those.

Comment thread utils.c
Comment on lines 49 to -97
@@ -71,8 +69,6 @@ int may_share_blocks(const char* restrict path) {

int err = getattrlist(path, &attrList, &clone_id, sizeof(struct UInt64Ref), FSOPT_ATTR_CMN_EXTENDED);
if (err) {
warnx("%s:%i %s", __FUNCTION__, __LINE__, path);
perror("could not getattrlist");
return 0;
}

@@ -93,14 +89,20 @@ size_t private_size(const char* restrict path) {

int err = getattrlist(path, &attrList, &size_attr, sizeof(struct UInt64Ref), FSOPT_ATTR_CMN_EXTENDED);
if (err) {
warnx("%s:%i %s", __FUNCTION__, __LINE__, path);
perror("could not getattrlist");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of removing warning output?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i probably didn't care about getattrlist spam

Comment thread signature.c
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants