Augmenting the Interpretability of GraphCodeBERT for Code Similarity Tasks

Official implementation of the IJSEKE publication: Martinez-Gil, J. (2025). Augmenting the Interpretability of GraphCodeBERT for Code Similarity Tasks. International Journal of Software Engineering and Knowledge Engineering, 35(05), 657–678.

Abstract

Code similarity models are increasingly used in software engineering workflows, yet their decisions are often difficult to interpret. This repository accompanies our IJSEKE paper and presents a practical interpretability framework around GraphCodeBERT for code similarity tasks. We combine five complementary lenses (token-level cosine similarity, PCA, t-SNE, UMAP, and gradient-based saliency maps) to expose how semantic and structural relationships emerge across classical sorting algorithms. The results show that algorithms with closer computational behavior (e.g., Bubble Sort and Insertion Sort) occupy nearby embedding regions and exhibit aligned token saliency patterns, which makes the model's similarity judgments easier to inspect, explain, and trust.

Key Contributions

Introduces a multi-lens interpretability workflow tailored to GraphCodeBERT code-similarity analysis.
Provides token-level and projection-based visual diagnostics for pairwise algorithm comparisons.
Bridges semantic, structural, and textual baselines through AST and TF-IDF companion analyses.
Releases reproducible scripts, figures, and a Colab-ready notebook to accelerate follow-up research.

Methodology Overview

This repository operationalizes five complementary interpretability lenses:

Token-level cosine similarity (comparison.py): computes fine-grained pairwise token alignment and generates HTML highlights.
PCA (pca.py): projects token embeddings into 2D spaces that preserve dominant variance directions.
t-SNE (tsne.py): captures local neighborhood structure in non-linear manifolds of token embeddings.
UMAP (pumap.py): provides topology-aware low-dimensional projections with strong cluster separation.
Saliency Maps (saliency_maps.py): uses input-gradient magnitudes to estimate token importance for model activations.
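
As an illustration of the first lens, the following is a minimal sketch of token-level cosine similarity. It is not the code in comparison.py: the random arrays stand in for per-token GraphCodeBERT hidden states, and the function names `token_cosine_matrix` and `best_matches` are hypothetical.

```python
import numpy as np


def token_cosine_matrix(emb_a: np.ndarray, emb_b: np.ndarray) -> np.ndarray:
    """Return an (n_a, n_b) matrix of cosine similarities between token vectors."""
    # L2-normalise each row; the small epsilon guards against zero vectors.
    a = emb_a / (np.linalg.norm(emb_a, axis=1, keepdims=True) + 1e-12)
    b = emb_b / (np.linalg.norm(emb_b, axis=1, keepdims=True) + 1e-12)
    return a @ b.T


def best_matches(sim: np.ndarray) -> list:
    """For each token of snippet A, return (index_a, closest index_b, similarity)."""
    cols = sim.argmax(axis=1)
    return [(i, int(j), float(sim[i, j])) for i, j in enumerate(cols)]


# Toy stand-ins for token embeddings (assumption: real inputs would be
# the model's per-token hidden states for two code snippets).
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(5, 8))
# Row 0 of B duplicates token 2 of A, so that pair should align perfectly.
emb_b = np.vstack([emb_a[2], rng.normal(size=(3, 8))])

sim = token_cosine_matrix(emb_a, emb_b)
matches = best_matches(sim)
```

Each row of `sim` can then drive a highlight intensity for the corresponding token, which is the general idea behind an HTML highlight view.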

Results

Across projection methods and saliency analysis, structurally related algorithms tend to group together. In particular, Bubble Sort and Insertion Sort embeddings cluster closely under all projections, reflecting their procedural similarity. The heatmap output further supports this trend with strong pairwise similarity among quadratic-time in-place sorting strategies.

Representative outputs are available in:

Bubble_Sort_vs_Insertion_Sort_tokens_2d_pca.png
sorting_algorithms_similarity.png
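
The pairwise projections discussed above can be approximated with a few lines of NumPy. This is a hedged sketch of a PCA projection computed through an SVD, not the implementation in pca.py; the random arrays are placeholders for token embeddings, and `pca_project` is a hypothetical helper name.

```python
import numpy as np


def pca_project(x: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Project the rows of x onto their top principal components."""
    centered = x - x.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Placeholder embeddings for two snippets (assumption: real inputs would be
# the token-level hidden states of, say, Bubble Sort and Insertion Sort).
rng = np.random.default_rng(0)
tokens_a = rng.normal(size=(6, 16))
tokens_b = rng.normal(size=(7, 16))

# Fit one projection on both snippets so they share the same 2D axes.
stacked = np.vstack([tokens_a, tokens_b])
coords = pca_project(stacked)
coords_a, coords_b = coords[: len(tokens_a)], coords[len(tokens_a):]
```

Fitting a single projection on the stacked tokens, rather than one per snippet, is what makes distances between the two point clouds comparable in the resulting plot.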

Repository Structure

File	Description
`comparison.py`	Token-level cosine similarity and HTML highlight generation
`heatmap.py`	Similarity heatmap across sorting algorithm pairs
`pca.py`	PCA 2D/3D projection of token embeddings
`tsne.py`	t-SNE projection of token embeddings
`pumap.py`	UMAP projection of token embeddings
`saliency_maps.py`	Gradient-based saliency maps per token
`ablation.py`	Ablation study scripts
`ast-s.py`	AST-based structural similarity
`tf-s.py`	TF-IDF textual similarity baseline
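
To give a feel for what a structural baseline measures, here is a deliberately simple sketch that compares two snippets by the AST node types they contain. It is an assumption-laden stand-in: the output folder name suggests that ast-s.py uses a tree kernel, whereas this example only computes the cosine similarity of node-type counts, and the helper names are hypothetical.

```python
import ast
import math
from collections import Counter


def node_type_counts(source: str) -> Counter:
    """Count how often each AST node type occurs in a Python snippet."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def cosine(c1: Counter, c2: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[k] * c2[k] for k in c1.keys() & c2.keys())
    norm = math.sqrt(sum(v * v for v in c1.values())) * math.sqrt(
        sum(v * v for v in c2.values())
    )
    return dot / norm if norm else 0.0


BUBBLE = '''
def bubble_sort(a):
    n = len(a)
    for i in range(n):
        for j in range(n - i - 1):
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
    return a
'''

INSERTION = '''
def insertion_sort(a):
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key
    return a
'''

# Structural similarity of the two snippets, ignoring identifiers and layout.
score = cosine(node_type_counts(BUBBLE), node_type_counts(INSERTION))
```

Because only node types are counted, renaming variables leaves the score unchanged, while swapping a `for` loop for a `while` loop shifts it. A tree kernel additionally accounts for how nodes are nested.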

Quick Start

git clone https://github.com/jorge-martinez-gil/graphcodebert-interpretability.git
cd graphcodebert-interpretability
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txt

Usage Examples

Run each script from the repository root:

python comparison.py      # Produces code_similarity.html with token-level highlights and a final similarity score
python heatmap.py         # Produces sorting_algorithms_similarity.png (global pairwise similarity heatmap)
python pca.py             # Produces PNG files in pca_pairwise_comparisons/
python tsne.py            # Produces PNG files in tsne_pairwise_comparisons/
python pumap.py           # Produces PNG files in umap_pairwise_comparisons/
python saliency_maps.py   # Produces PNG files in saliency_maps/
python ablation.py        # Produces PNG files in ablation_study_results/
python ast-s.py           # Produces PNG files in ast_tree_kernel_visualizations/
python tf-s.py            # Produces PNG files in cosine_similarity_visualizations/
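
After running the scripts, a small helper can confirm that each one produced its output. The sketch below is not part of the repository: the script-to-output mapping is copied from the comments above, while the name `missing_outputs` is hypothetical and the check only tests that each file or folder exists.

```python
from pathlib import Path

# Script name -> file or folder it is documented to produce.
EXPECTED_OUTPUTS = {
    "comparison.py": "code_similarity.html",
    "heatmap.py": "sorting_algorithms_similarity.png",
    "pca.py": "pca_pairwise_comparisons",
    "tsne.py": "tsne_pairwise_comparisons",
    "pumap.py": "umap_pairwise_comparisons",
    "saliency_maps.py": "saliency_maps",
    "ablation.py": "ablation_study_results",
    "ast-s.py": "ast_tree_kernel_visualizations",
    "tf-s.py": "cosine_similarity_visualizations",
}


def missing_outputs(root: str = ".") -> list:
    """Return 'script -> output' entries whose output does not exist under root."""
    base = Path(root)
    return [
        f"{script} -> {target}"
        for script, target in EXPECTED_OUTPUTS.items()
        if not (base / target).exists()
    ]


if __name__ == "__main__":
    # Run from the repository root to list outputs that are still missing.
    for entry in missing_outputs():
        print("missing:", entry)
```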

Visualizations Gallery

Figure 1. PCA token-level projection showing close embedding neighborhoods for Bubble Sort and Insertion Sort.

Figure 2. Pairwise GraphCodeBERT similarity heatmap across classical sorting implementations.

How to Cite

If you use this work, please cite:

@article{martinezgil2025augmenting,
      author = {Martinez-Gil, Jorge},
      title = {Augmenting the Interpretability of GraphCodeBERT for Code Similarity Tasks},
      journal = {International Journal of Software Engineering and Knowledge Engineering},
      volume = {35},
      number = {05},
      pages = {657-678},
      year = {2025},
      doi = {10.1142/S0218194025500160},
      URL = {https://doi.org/10.1142/S0218194025500160}
}

A machine-readable citation is also provided in CITATION.cff.

Related Work / See Also

Feng, Z., et al. (2020). CodeBERT: A Pre-Trained Model for Programming and Natural Languages (the predecessor of GraphCodeBERT)
Vig, J. (2019). A Multiscale Visualization of Attention in the Transformer Model
Jain, S., & Wallace, B. C. (2019). Attention is not Explanation

Reproducibility

See the complete guide in docs/REPRODUCIBILITY.md, including runtime estimates and deterministic configuration notes.

Contributing

Contributions are welcome. Please read CONTRIBUTING.md before opening an issue or submitting improvements.

Acknowledgements

This repository accompanies the IJSEKE publication and was prepared to support transparent and reproducible code-similarity interpretability research.

License

This project is licensed under the MIT License. See LICENSE.
