RFC: upstream recent fast-dedup and output updates from fork#11
Draft
jnorthrup wants to merge 11 commits into
Draft
RFC: upstream recent fast-dedup and output updates from fork#11jnorthrup wants to merge 11 commits into
jnorthrup wants to merge 11 commits into
Conversation
Before this patch, trying to `make install` lead to an error, as seen below:
$ make PREFIX="$PWD" install
install dedup /Users/gurjeet/dev/DEDUP/bin
install dedup.1 /Users/gurjeet/dev/DEDUP/share/man/man1
install: cannot create regular file '/Users/gurjeet/dev/DEDUP/share/man/man1': No such file or directory
make: *** [install] Error 1
- Replace SHA256 with int32[4] sampling + xxHash64 - Use ARM NEON intrinsics for vectorized operations - Flat hash table instead of nested rb_trees - 100x+ faster than full-file SHA256 hashing Signature approach: - Sample 4 int32 values at strategic file positions - Compute xxHash64 of first 4KB - Use NEON for parallel sample loading and comparison - Hash table with ~2^-128 collision probability
…ation - Added OUTPUT_KILO: kilobytes (1000-based, no unit) - Added OUTPUT_KIBI: kibibytes (1024-based, no unit) - Added OUTPUT_KILO_UNIT: kilobytes with 'k' unit - Added OUTPUT_KIBI_UNIT: kibibytes with 'K' unit - Added OUTPUT_HUMAN: human readable (-h style, SI units) - Updated parsing to support 'k', 'K', 'h' format strings - Updated descriptions and listing
Highlights: - Immediate cloning on duplicate detection (no deferred phase) - xxHash64 signatures with 4-point sampling - Skip locked files (.padding, CloudStorage mounts) - Unified single binary (removed dedup-fast split) - 306k duplicates found, 3.6GB savings in real test 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
…mary
- Add seen_set.{c,h}: compact 16-byte slot hash table with Fibonacci
hashing for cache-friendly O(1) lookups
- Use SeenSet for hardlink and clone group elimination during pruning
- Add streaming clone summary output via --summary/-S option
- Improve progress display with fixed-width compact formatting
hohle
reviewed
Apr 23, 2026
Contributor
hohle
left a comment
There was a problem hiding this comment.
Thanks for the contribution. A few notes before I look any deeper:
- The human-readable default is probably reasonable since the interactive progress output is for humans.
- There are a lot of disparate changes that should each be their own PR. This change has about the same number of additions as the entire current repo.
- The plan embedded in this patch specifically says "no license contamination from pasted FFmpeg code." Copying code from a GPL project contaminates the license for this project. That's a non-starter.
- This project will never depend on anything from Homebrew. The current dependencies are all OS-provided, and a runtime build requires no external dependencies.
- I've looked into xxHash - from my benchmarks it was not the primary bottleneck (which is currently the lock on the map, held during hash computation). If xxHash were incorporated, I'd need a way to guarantee it's not vulnerable to supply chain attacks (e.g. a pinned submodule from the original repo).
I haven't reviewed the remaining changes and can't speak to their individual merit. My broader concern is that this patch appears to be AI-generated and imports code from other projects in ways that may violate their licenses or pollute this project's license. If you'd like to pull specific pieces out that are more focused, I'm happy to consider those.
Comment on lines
49
to
-97
| @@ -71,8 +69,6 @@ int may_share_blocks(const char* restrict path) { | |||
|
|
|||
| int err = getattrlist(path, &attrList, &clone_id, sizeof(struct UInt64Ref), FSOPT_ATTR_CMN_EXTENDED); | |||
| if (err) { | |||
| warnx("%s:%i %s", __FUNCTION__, __LINE__, path); | |||
| perror("could not getattrlist"); | |||
| return 0; | |||
| } | |||
|
|
|||
| @@ -93,14 +89,20 @@ size_t private_size(const char* restrict path) { | |||
|
|
|||
| int err = getattrlist(path, &attrList, &size_attr, sizeof(struct UInt64Ref), FSOPT_ATTR_CMN_EXTENDED); | |||
| if (err) { | |||
| warnx("%s:%i %s", __FUNCTION__, __LINE__, path); | |||
| perror("could not getattrlist"); | |||
Contributor
There was a problem hiding this comment.
What is the purpose of removing warning output?
Author
There was a problem hiding this comment.
i probably didn't care about getattrlist spam
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Commit scope
Explicit disclaimers
make check-teston this branch runs but currently reports 8 failing testsmake checkalso fails README spelling checks heremake check-testonupstream/masterfails in this environment before tests run becausecheck.his not found under the old/opt/local/includepathWhy open it anyway
Test notes
make check-test(HEAD): runs, but 8 tests failmake check(HEAD): fails spelling checks in README on this machinemake check-test(upstream/master): fails to findcheck.hin this environment