Official implementation of the IJSEKE publication: Martinez-Gil, J. (2025). Augmenting the Interpretability of GraphCodeBERT for Code Similarity Tasks. International Journal of Software Engineering and Knowledge Engineering, 35(05), 657–678.
Code similarity models are increasingly used in software engineering workflows, yet their decisions are often difficult to interpret. This repository accompanies our IJSEKE paper and presents a practical interpretability framework around GraphCodeBERT for code similarity tasks. We combine complementary lenses—token-level cosine similarity, PCA, t-SNE, UMAP, and gradient-based saliency maps—to expose how semantic and structural relationships emerge across classical sorting algorithms. Results consistently show that algorithms with closer computational behavior (e.g., Bubble Sort and Insertion Sort) occupy nearby embedding regions and exhibit aligned token saliency patterns, improving trust and explainability in model outputs.
- Key Contributions
- Methodology Overview
- Results
- Repository Structure
- Quick Start
- Usage Examples
- Visualizations Gallery
- How to Cite
- Related Work / See Also
- Reproducibility
- Contributing
- Acknowledgements
- License
- Introduces a multi-lens interpretability workflow tailored to GraphCodeBERT code-similarity analysis.
- Provides token-level and projection-based visual diagnostics for pairwise algorithm comparisons.
- Bridges semantic, structural, and textual baselines through AST and TF-IDF companion analyses.
- Releases reproducible scripts, figures, and a Colab-ready notebook to accelerate follow-up research.
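To make the textual baseline mentioned above concrete, here is a minimal TF-IDF sketch in plain Python. It is an illustration only and does not reproduce tf-s.py: the tokenizer, the smoothed IDF formula, and the two toy snippets are assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(code):
    """Crude lexer: identifiers, keywords, and integer literals only."""
    return re.findall(r"[A-Za-z_]\w*|\d+", code)


def tfidf_vectors(documents):
    """Return one sparse {term: weight} vector per document (smoothed IDF)."""
    token_lists = [tokenize(doc) for doc in documents]
    n_docs = len(token_lists)
    # Document frequency: in how many snippets each term appears.
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy snippets; the repository's inputs may differ.
bubble = "for i in range(n): for j in range(n - i - 1): if a[j] > a[j + 1]: swap(a, j, j + 1)"
insertion = "for i in range(1, n): key = a[i]; j = i - 1; while j >= 0 and a[j] > key: a[j + 1] = a[j]; j -= 1"

vec_bubble, vec_insertion = tfidf_vectors([bubble, insertion])
print(round(cosine(vec_bubble, vec_insertion), 3))
```

A purely lexical score like this one only rewards shared identifiers and keywords, which is why it serves as a baseline next to the embedding-based analyses.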
This repository operationalizes five complementary interpretability lenses:
- Token-level cosine similarity (comparison.py): computes fine-grained pairwise token alignment and generates HTML highlights.
- PCA (pca.py): projects token embeddings into 2D spaces that preserve dominant variance directions.
- t-SNE (tsne.py): captures local neighborhood structure in non-linear manifolds of token embeddings.
- UMAP (pumap.py): provides topology-aware low-dimensional projections with strong cluster separation.
- Saliency Maps (saliency_maps.py): uses input-gradient magnitudes to estimate token importance for model activations.
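The token-level cosine similarity lens can be sketched as follows. This is a simplified illustration, not the code in comparison.py: the three-dimensional embeddings are hand-written stand-ins for GraphCodeBERT token vectors, and the aggregation in `overall_score` (mean of row-wise maxima) is an assumption chosen for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def token_similarity_matrix(emb_a, emb_b):
    """Entry [i][j] compares token i of snippet A with token j of snippet B."""
    return [[cosine(a, b) for b in emb_b] for a in emb_a]


def best_matches(tokens_a, tokens_b, matrix):
    """For each token in A, return its most similar token in B and the score."""
    matches = []
    for i, row in enumerate(matrix):
        j = max(range(len(row)), key=row.__getitem__)
        matches.append((tokens_a[i], tokens_b[j], row[j]))
    return matches


def overall_score(matrix):
    """Mean of row-wise maxima: one simple way to aggregate (assumed)."""
    return sum(max(row) for row in matrix) / len(matrix)


# Toy embeddings standing in for model outputs (hypothetical values).
tokens_a = ["for", "i", "swap"]
emb_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]
tokens_b = ["while", "j", "shift"]
emb_b = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.1], [0.5, 0.5, 0.7]]

matrix = token_similarity_matrix(emb_a, emb_b)
for token_a, token_b, value in best_matches(tokens_a, tokens_b, matrix):
    print(f"{token_a:>5} -> {token_b:<5} {value:.3f}")
print(f"overall: {overall_score(matrix):.3f}")
```

The per-token best matches are the kind of information an HTML highlight view can color, while the aggregate gives a single similarity score for the pair.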
Across projection methods and saliency analysis, structurally related algorithms tend to group together. In particular, Bubble Sort and Insertion Sort embeddings cluster closely under all projections, reflecting their procedural similarity. The heatmap output further supports this trend with strong pairwise similarity among quadratic-time in-place sorting strategies.
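The saliency analysis mentioned above rests on one idea: a token matters more when the model output changes faster with respect to that token's embedding. The sketch below illustrates this on a toy scalar function, approximating the gradient with central finite differences. Both the `score` function and the numbers are assumptions for illustration; saliency_maps.py operates on GraphCodeBERT, where gradients would normally come from automatic differentiation.

```python
import math


def score(embeddings, weights):
    """Toy scalar 'activation': sum over tokens of tanh(w . e_t)."""
    return sum(
        math.tanh(sum(w * x for w, x in zip(weights, token)))
        for token in embeddings
    )


def saliency(embeddings, weights, eps=1e-5):
    """L2 norm of d(score)/d(e_t) per token, via central finite differences."""
    result = []
    for t, token in enumerate(embeddings):
        squared = 0.0
        for d in range(len(token)):
            plus = [list(tok) for tok in embeddings]
            minus = [list(tok) for tok in embeddings]
            plus[t][d] += eps
            minus[t][d] -= eps
            grad = (score(plus, weights) - score(minus, weights)) / (2 * eps)
            squared += grad * grad
        result.append(math.sqrt(squared))
    return result


# Hypothetical tokens and two-dimensional embeddings.
tokens = ["for", "i", "swap"]
embeddings = [[0.1, 0.2], [2.0, 1.5], [-0.3, 0.4]]
weights = [1.0, -0.5]

for token, value in zip(tokens, saliency(embeddings, weights)):
    print(f"{token:>5}  {value:.3f}")
```

In this toy setup a token whose projection onto `weights` is large saturates the tanh and therefore receives a lower saliency, which shows how gradient magnitude differs from raw activation size.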
Representative outputs are available in:
- Bubble_Sort_vs_Insertion_Sort_tokens_2d_pca.png
- sorting_algorithms_similarity.png
| File | Description |
|---|---|
| comparison.py | Token-level cosine similarity and HTML highlight generation |
| heatmap.py | Similarity heatmap across sorting algorithm pairs |
| pca.py | PCA 2D/3D projection of token embeddings |
| tsne.py | t-SNE projection of token embeddings |
| pumap.py | UMAP projection of token embeddings |
| saliency_maps.py | Gradient-based saliency maps per token |
| ablation.py | Ablation study scripts |
| ast-s.py | AST-based structural similarity |
| tf-s.py | TF-IDF textual similarity baseline |
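As a rough illustration of what an AST-based structural comparison measures, the sketch below compares two sorting functions by the frequency of their AST node types, using only the standard library. It is deliberately simpler than a tree kernel and is not the method implemented in ast-s.py. The two snippets and the helper names are assumptions for this example.

```python
import ast
import math
from collections import Counter

# Hypothetical reference implementations used only for this example.
BUBBLE = '''
def bubble_sort(a):
    n = len(a)
    for i in range(n):
        for j in range(n - i - 1):
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
    return a
'''

INSERTION = '''
def insertion_sort(a):
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key
    return a
'''


def node_type_counts(source):
    """Count how often each AST node type (For, While, Compare, ...) occurs."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def cosine_counts(c1, c2):
    """Cosine similarity between two node-type histograms."""
    dot = sum(count * c2.get(name, 0) for name, count in c1.items())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


bubble_counts = node_type_counts(BUBBLE)
insertion_counts = node_type_counts(INSERTION)
print(f"structural similarity: {cosine_counts(bubble_counts, insertion_counts):.3f}")
```

A histogram ignores how nodes are nested, so two programs with the same constructs in a different arrangement score as identical. Tree kernels address this by comparing subtrees instead of isolated node types.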
git clone https://github.com/jorge-martinez-gil/graphcodebert-interpretability.git
cd graphcodebert-interpretability
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txt
Run each script from the repository root:
python comparison.py # Produces code_similarity.html with token-level highlights and a final similarity score
python heatmap.py # Produces sorting_algorithms_similarity.png (global pairwise similarity heatmap)
python pca.py # Produces PNG files in pca_pairwise_comparisons/
python tsne.py # Produces PNG files in tsne_pairwise_comparisons/
python pumap.py # Produces PNG files in umap_pairwise_comparisons/
python saliency_maps.py # Produces PNG files in saliency_maps/
python ablation.py # Produces PNG files in ablation_study_results/
python ast-s.py # Produces PNG files in ast_tree_kernel_visualizations/
python tf-s.py # Produces PNG files in cosine_similarity_visualizations/
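For convenience, the commands above can be wrapped in a small driver that runs every script in order and reports whether the expected output exists afterwards. This helper is not part of the repository. It is a sketch whose script names and output paths are taken from the list above, and it defaults to a dry run so that nothing executes by accident.

```python
import subprocess
import sys
from pathlib import Path

# Script -> expected output file or directory, as listed in the commands above.
EXPECTED_OUTPUTS = {
    "comparison.py": "code_similarity.html",
    "heatmap.py": "sorting_algorithms_similarity.png",
    "pca.py": "pca_pairwise_comparisons",
    "tsne.py": "tsne_pairwise_comparisons",
    "pumap.py": "umap_pairwise_comparisons",
    "saliency_maps.py": "saliency_maps",
    "ablation.py": "ablation_study_results",
    "ast-s.py": "ast_tree_kernel_visualizations",
    "tf-s.py": "cosine_similarity_visualizations",
}


def run_all(repo_root=".", dry_run=True):
    """Run each script from repo_root; return {script: expected output exists}."""
    root = Path(repo_root)
    report = {}
    for script, output in EXPECTED_OUTPUTS.items():
        command = [sys.executable, script]
        if dry_run:
            # Only show what would be executed.
            print(" ".join(command))
        else:
            # check=True stops at the first failing script.
            subprocess.run(command, cwd=root, check=True)
        report[script] = (root / output).exists()
    return report
```

Calling `run_all(dry_run=False)` from the repository root executes the scripts with the interpreter of the active virtual environment, and the returned dictionary shows which outputs are still missing.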
Figure 1. PCA token-level projection showing close embedding neighborhoods for Bubble Sort and Insertion Sort.
Figure 2. Pairwise GraphCodeBERT similarity heatmap across classical sorting implementations.
If you use this work, please cite:
@article{martinezgil2025augmenting,
author = {Martinez-Gil, Jorge},
title = {Augmenting the Interpretability of GraphCodeBERT for Code Similarity Tasks},
journal = {International Journal of Software Engineering and Knowledge Engineering},
volume = {35},
number = {05},
pages = {657-678},
year = {2025},
doi = {10.1142/S0218194025500160},
URL = {https://doi.org/10.1142/S0218194025500160}
}
A machine-readable citation is also provided in CITATION.cff.
- Feng, Z., et al. (2020). CodeBERT: A Pre-Trained Model for Programming and Natural Languages (the predecessor of GraphCodeBERT)
- Vig, J. (2019). A Multiscale Visualization of Attention in the Transformer Model
- Jain, S., & Wallace, B. C. (2019). Attention is not Explanation
See the complete guide in docs/REPRODUCIBILITY.md, including runtime estimates and deterministic configuration notes.
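The authoritative settings are those in docs/REPRODUCIBILITY.md. As a general illustration of what a deterministic configuration usually involves, the sketch below seeds the random number generators that are available. The function name `seed_everything` and the default seed are assumptions, and NumPy and PyTorch are treated as optional so the snippet runs without them.

```python
import os
import random


def seed_everything(seed=42):
    """Seed the RNGs that are present; numpy and torch are optional here."""
    # Hash randomization is fixed at interpreter start, so this only affects
    # Python processes launched after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass


seed_everything(42)
```

Stochastic projections need their own seed as well: scikit-learn's `TSNE` and umap-learn's `UMAP` both accept a `random_state` argument, and without it repeated runs can place the same tokens in different positions.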
Contributions are welcome. Please read CONTRIBUTING.md before opening an issue or submitting improvements.
This repository accompanies the IJSEKE publication and was prepared to support transparent and reproducible code-similarity interpretability research.
This project is licensed under the MIT License. See LICENSE.