Replicheck is a Python tool designed to detect technical dept within a specified directory. Its primary goal is to help developers maintain code quality and reduce redundancy by technical dept across a codebase.
- Duplicate Detection: Finds duplicate code blocks within user-specified directories.
- Configurable Thresholds: Allows users to set the minimum similarity threshold and minimum code block size for comparison.
- Language Support: Supports Python, JavaScript/JSX, Typescript/TSX, and C# code natively.
- Directory and File Filtering: Ignores virtual environments and user-specified directories. Users can also specify which file extensions to analyze.
- Reporting: Generates detailed reports in either text or JSON format, with the option to output to a file or the console.
- Robust Testing: The project boasts 90%+ test coverage and robust error handling.
- Developer Experience: Includes pre-commit hooks, code formatting and testing (Black, pytest), and a clear development workflow.
- CLI Entrypoint:
main.pyprovides a command-line interface for users to run Replicheck with various options. - Core Modules:
replicheck/config.py: Handles configuration and validation.replicheck/parser.py: Parses code files and extracts code blocks (Python, JavaScript/JSX, and C# fully supported).replicheck/detector.py: Contains the logic for detecting duplicate code blocks using a Jaccard similarity metric.replicheck/reporter.py: Handles report generation in text and JSON formats.replicheck/utils.py: Provides utility functions for file discovery and similarity calculation.replicheck/tree_sitter_loader.py: Loads and initializes Tree-sitter grammars (JavaScript, JSX, C#).
- Testing: Comprehensive tests for all modules are located in the
tests/directory.
Basic usage:
python main.py --path /path/to/codeKey options:
--path: Directory to analyze (required)--min-sim: Minimum similarity threshold (default: 0.8)--min-size: Minimum code block size in tokens (default: 50)--extensions: File extensions to analyze (default: .py,.js,.ts,.jsx,.tsx)--output-format: Output format (text or json, default: text)--output-file: Path to save the report (optional)--ignore-dirs: Directories to ignore (default: .git .venv venv env ENV build dist node_modules)
- Code Parsing: Uses the AST module for Python and Tree-sitter for JavaScript/JSX, Typescript/TSX and C# to extract functions and classes for comparison and analysis.
- Duplicate Detection: Compares code blocks using Jaccard similarity, with configurable thresholds.
- Reporting: Outputs results in both human-readable and machine-readable formats.
- Comprehensive Testing: All core modules have dedicated tests, including edge cases and error handling.
- Pre-commit and Formatting: Supports Black and pytest for code style and testing, and pre-commit hooks for consistency.
- Robust CLI: User-friendly command-line interface with helpful error messages and flexible options.
- Current State: Python, JavaScript/JSX, Typescript/TSX, and C# are fully supported.
- Improvement: Add support for other languages (e.g., Java, C++, Go) by implementing or integrating language-specific parsers.
- Current State: Uses Jaccard similarity on token sets.
- Improvement: Explore more advanced similarity metrics (e.g., token sequence alignment, AST-based comparison, or machine learning approaches) for more accurate detection.
- Current State: Compares all code blocks pairwise, which may be slow for large codebases.
- Improvement: Implement indexing, hashing, or clustering to reduce the number of comparisons.
- Current State: Manual execution.
- Improvement: Provide GitHub Actions or other CI/CD integration examples to automate duplicate detection in pull requests.
- Current State: Text, Markdown, and JSON reports.
- Improvement: Add HTML reports with links and code highlighting, or integrate with code review tools.
- Current State: All options are CLI arguments.
- Improvement: Allow configuration via a YAML or TOML file for easier reuse and sharing of settings.
- Current State: Only detects duplicates, high cyclomatic complexity functions, large files and large classes by token count.
- Improvement: Suggest possible refactorings or code extraction opportunities based on reports.
- Similarity metric is basic and may not catch all forms of duplication.
- Performance may degrade on very large codebases due to pairwise comparison.
- Tests cover configuration, parsing, detection, reporting, and utility functions.
- Edge cases (empty files, syntax errors, unsupported extensions) are tested.
- Error handling in report generation and configuration is tested.
- Full support for C# and TS/TSX code metrics:
- Cyclomatic complexity (via Tree-sitter)
- Large file/class detection
- Duplicate code blocks
- C# and TS/TSX added to core parsing, reporting, and CLI workflows
MIT License
This document provides a high-level summary and actionable roadmap for Replicheck. For more details, see the README.md and the codebase itself.