A CLI tool that converts a Google Scholar profile into static HTML for embedding in a personal website. It works in two steps, with an optional enrichment step in between:
- Fetch: scrape or load a Google Scholar profile page and extract structured publication data into JSON
- Fetch PDFs (optional): enrich the JSON with PDF/paper links by visiting each publication's citation page
- Render: turn that JSON into HTML using a Jinja2 template
The JSON intermediate format decouples fetching from rendering. You can re-render with different templates without re-fetching, hand-edit the data to fix author names or add missing venues, or check the JSON into version control to track how your citation counts change over time.
I had Claude Code build this as an easy way to sync my Google Scholar publications to my website from time to time. I render with the leifme.html.j2 template and paste the resulting HTML fragment into the Ghost editor.
Requires Python 3.10+.
pip install -r requirements.txt
For development (running tests):
pip install -r requirements-dev.txt
Dependencies: requests, beautifulsoup4, lxml, jinja2.
# Parse a saved Scholar HTML file into JSON
python -m scholar_html fetch --html saved_page.html -o data.json
# (Optional) Enrich with PDF links from citation pages
python -m scholar_html fetch-pdfs data.json
# Render the JSON to a full HTML document
python -m scholar_html render data.json -o publications.html
The fetch command reads a Google Scholar profile and outputs structured JSON. There are three mutually exclusive ways to specify the input:
# From a saved HTML file (recommended — avoids rate-limiting)
python -m scholar_html fetch --html saved_page.html -o data.json
# From a Scholar user ID (fetches over the network)
python -m scholar_html fetch --user 22Scgp0AAAAJ -o data.json
# From a full URL (fetches over the network)
python -m scholar_html fetch --url "https://scholar.google.com/citations?user=22Scgp0AAAAJ" -o data.json
Omit -o to write to stdout instead of a file.
Recommended workflow: Google Scholar aggressively blocks automated requests. The most reliable approach is to open your Scholar profile in a browser, save the page as HTML ("Save As" → "Webpage, Complete" or "Webpage, HTML Only"), and then use --html to parse the saved file.
The fetch-pdfs command enriches an existing JSON file by visiting each publication's Google Scholar citation detail page to find PDF/paper links. It's designed to be interruptible and resumable — Google Scholar aggressively rate-limits, so you may need to run it several times.
# Enrich data.json in place (default 5s delay between requests)
python -m scholar_html fetch-pdfs data.json
# Custom delay and separate output file
python -m scholar_html fetch-pdfs data.json -o enriched.json --delay 10
Each publication's pdf field tracks its state:
- "": not yet attempted
- "UNAVAILABLE": attempted, but no PDF link was found on the citation page
- A URL: the paper/PDF link
Progress is saved after each publication, so if the process is interrupted (rate-limited, Ctrl-C, etc.), re-running picks up where it left off. On rate-limiting, the command prints a message to stderr and exits with code 1.
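The three pdf states and the save-after-each-publication behavior can be illustrated with a short sketch. This is not the tool's actual code: the names enrich and find_pdf are hypothetical, find_pdf stands in for the network lookup of a citation page, and the delay between requests is left out.

```python
import json
from pathlib import Path
from typing import Callable, Optional

UNAVAILABLE = "UNAVAILABLE"


def enrich(path: Path, find_pdf: Callable[[dict], Optional[str]]) -> int:
    """Fill in missing pdf fields, saving the file after every publication.

    find_pdf is a hypothetical lookup that returns a URL, or None when the
    citation page has no link. Returns the number of lookups performed.
    """
    data = json.loads(path.read_text(encoding="utf-8"))
    attempted = 0
    for pub in data["publications"]:
        # A URL or "UNAVAILABLE" means this entry was handled on an earlier run.
        if pub.get("pdf", "") != "":
            continue
        url = find_pdf(pub)  # may raise, e.g. when rate-limited
        pub["pdf"] = url if url else UNAVAILABLE
        # Persist immediately so an interruption loses at most one lookup.
        path.write_text(
            json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8"
        )
        attempted += 1
    return attempted
```

Because only entries with an empty pdf field are visited, calling the function again after an interruption continues with the first publication that has not been attempted yet.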
The render command reads JSON (produced by fetch) and generates HTML:
# Full HTML document with the default template
python -m scholar_html render data.json -o publications.html
# HTML fragment (just the <section>, no <!DOCTYPE>/html/body wrapper)
python -m scholar_html render data.json --fragment -o snippet.html
# Limit to the first 10 publications
python -m scholar_html render data.json --limit 10 -o publications.html
# Use a custom Jinja2 template
python -m scholar_html render data.json --template my_template.html.j2 -o publications.html
Omit -o to write to stdout.
--fragment outputs only the <section class="scholar-profile">...</section> block, without the surrounding <!DOCTYPE html>, <html>, <head>, or <body> tags. This is useful when you want to paste or include the output into an existing page.
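If you would rather script the "include into an existing page" step than paste by hand, one possible approach is to keep a pair of marker comments in the page and replace whatever sits between them. The marker strings and the function name splice_fragment below are made up for this sketch; they are not part of the tool.

```python
import re

# Hypothetical marker comments placed by hand in the target page.
START = "<!-- publications:start -->"
END = "<!-- publications:end -->"


def splice_fragment(page: str, fragment: str) -> str:
    """Replace everything between the two markers with the rendered fragment."""
    pattern = re.compile(
        re.escape(START) + r".*?" + re.escape(END), re.DOTALL
    )
    if not pattern.search(page):
        raise ValueError("markers not found in page")
    replacement = f"{START}\n{fragment.strip()}\n{END}"
    # A function replacement keeps any backslashes in the fragment literal.
    return pattern.sub(lambda _match: replacement, page, count=1)
```

The markers are kept in the output, so the same page can be updated again after the next render.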
The intermediate JSON looks like this:
{
"meta": {
"scholar_id": "22Scgp0AAAAJ",
"fetched_at": "2025-02-01T21:33:00+00:00",
"source": "file:saved_page.html"
},
"profile": {
"name": "Leif Singer",
"affiliation": "University of Victoria",
"interests": ["Software Engineering", "Developer Tools"],
"stats": {
"citations_all": 1234,
"citations_recent": 456,
"h_index_all": 12,
"h_index_recent": 8,
"i10_index_all": 15,
"i10_index_recent": 10
}
},
"publications": [
{
"title": "How software developers use GitHub",
"authors": "L Singer, F Figueira Filho, N Bettenburg, M Storey",
"venue": "IEEE Software 31 (2), 58-65",
"year": "2014",
"citations": 312,
"url": "/citations?view_op=view_citation&citation_for_view=...",
"pdf": "https://ieeexplore.ieee.org/abstract/document/6773718"
}
]
}
You can edit this file freely (fix author names, remove publications, reorder entries) and then re-render.
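Edits like these can also be scripted, which helps when the same fixes have to be reapplied after every fetch. The following is a minimal sketch that relies only on the field names shown in the JSON above; the function clean_publications and its parameters are hypothetical.

```python
import json
from pathlib import Path


def clean_publications(
    data: dict, renames: dict[str, str], min_citations: int = 0
) -> dict:
    """Return a copy with author names fixed and low-cited entries dropped."""
    pubs = []
    for pub in data["publications"]:
        if pub["citations"] < min_citations:
            continue  # remove publications below the threshold
        authors = pub["authors"]
        for old, new in renames.items():
            authors = authors.replace(old, new)
        pubs.append({**pub, "authors": authors})
    # Most-cited first; the sort is stable, so ties keep their original order.
    pubs.sort(key=lambda p: p["citations"], reverse=True)
    return {**data, "publications": pubs}


def clean_file(path: str, renames: dict[str, str], min_citations: int = 0) -> None:
    """Apply clean_publications to a JSON file in place."""
    target = Path(path)
    data = json.loads(target.read_text(encoding="utf-8"))
    cleaned = clean_publications(data, renames, min_citations)
    target.write_text(
        json.dumps(cleaned, indent=2, ensure_ascii=False), encoding="utf-8"
    )
```

Running clean_file on data.json before the render step keeps the corrections out of the fetch stage, so a fresh fetch never overwrites them silently.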
Three templates are included in the templates/ directory:
default.html.j2 produces semantic HTML5 output with:
- All CSS classes prefixed with scholar- (e.g. scholar-profile, scholar-pub, scholar-pub-title) so they won't collide with your site's styles
- data-year and data-citations attributes on each <li> for optional client-side filtering or sorting with JavaScript
- <ol reversed> for the publication list
- Citation stats displayed as a <dl>
- No CSS framework dependency; bring your own styles
minimal.html.j2 produces a plain <ul> with one <li> per publication in "Authors. Title. Venue, Year (N citations)." format. No classes, no data attributes.
leifme.html.j2 is a styled template designed to match leif.me. It includes self-contained CSS within a <style> block so it works as a --fragment without external stylesheets. It uses the Inter font stack, the site's #fcb615 accent color for PDF links, and a clean publication layout with bold titles, grey authors, italic venues, and a light year/citation meta line.
python -m scholar_html render data.json --template templates/leifme.html.j2 --fragment -o publications.html
Pass any Jinja2 template with --template. The template receives these variables:
| Variable | Type | Description |
|---|---|---|
| profile | object | .name, .affiliation, .interests (list of strings), .stats |
| profile.stats | object | .citations_all, .citations_recent, .h_index_all, .h_index_recent, .i10_index_all, .i10_index_recent |
| publications | list | Each has .title, .authors, .venue, .year, .citations (int), .url, .pdf |
| meta | object | .scholar_id, .fetched_at, .source |
| fragment | bool | Whether --fragment was passed |
Your template should check {% if not fragment %} to conditionally wrap output in a full HTML document.
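As a starting point, a custom template might look like the sketch below. It is not one of the bundled templates: the markup is invented for illustration and uses only the variables listed above. The template is kept in a Python string so that a small script can write it to disk; write_template is a hypothetical helper.

```python
from pathlib import Path

# Hypothetical minimal template; variable names follow the table of
# template variables (profile, publications, fragment).
TEMPLATE = """\
{% if not fragment %}<!DOCTYPE html>
<html lang="en">
<head><meta charset="utf-8"><title>{{ profile.name }}</title></head>
<body>
{% endif %}
<section class="scholar-profile">
  <h2>{{ profile.name }}</h2>
  <ul>
  {% for pub in publications %}
    <li>
      {{ pub.authors }}. {{ pub.title }}. {{ pub.venue }}, {{ pub.year }}
      ({{ pub.citations }} citations).
      {# pdf may be empty or "UNAVAILABLE"; only a real URL becomes a link #}
      {% if pub.pdf and pub.pdf != "UNAVAILABLE" %}<a href="{{ pub.pdf }}">[PDF]</a>{% endif %}
    </li>
  {% endfor %}
  </ul>
</section>
{% if not fragment %}</body>
</html>
{% endif %}
"""


def write_template(path: str = "my_template.html.j2") -> Path:
    """Write the template text to disk and return the path."""
    target = Path(path)
    target.write_text(TEMPLATE, encoding="utf-8")
    return target
```

After writing the file, passing it to render with --template my_template.html.j2 should produce either a full document or, with --fragment, only the <section> block. The extra check on pub.pdf reflects the three states of that field: an empty string and "UNAVAILABLE" should not be turned into links.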
| Class | Element | Content |
|---|---|---|
| scholar-profile | <section> | Wrapper for everything |
| scholar-name | <h2> | Author name |
| scholar-affiliation | <p> | Affiliation |
| scholar-interests | <ul> | Research interest tags |
| scholar-interest | <li> | Individual interest |
| scholar-stats | <dl> | Citation statistics |
| scholar-stat | <div> | Individual stat (wraps <dt> + <dd>) |
| scholar-publications | <ol> | Publication list |
| scholar-pub | <li> | Single publication |
| scholar-pub-title | <span> | Title (contains <a> if URL present) |
| scholar-pub-authors | <span> | Author list |
| scholar-pub-venue | <span> | Journal/conference name |
| scholar-pub-pdf | <a> | [PDF] link (only rendered when a URL is available) |
| scholar-pub-meta | <span> | Year and citation count |
| scholar-pub-year | <span> | Publication year |
| scholar-pub-citations | <span> | Citation count |
scholar_html/
    __init__.py
    __main__.py          # python -m scholar_html entry point
    cli.py               # argparse CLI with fetch/render/fetch-pdfs subcommands
    fetch.py             # HTML parsing and network fetching
    fetch_pdfs.py        # PDF link discovery from citation pages
    render.py            # Jinja2 template rendering
    schema.py            # Dataclasses and JSON serialization
    selectors.py         # CSS selectors for Scholar's DOM (isolated for maintainability)
templates/
    default.html.j2      # Semantic HTML5 with scholar- prefixed classes
    minimal.html.j2      # Bare <ul> list
    leifme.html.j2       # Styled for leif.me, self-contained CSS
tests/
    conftest.py          # Shared fixtures
    fixtures/
        sample_profile.html
        sample_citation.html
        sample_citation_no_link.html
    test_cli.py          # End-to-end CLI tests
    test_fetch.py        # Parser tests against saved HTML
    test_fetch_pdfs.py   # PDF discovery orchestration tests
    test_render.py       # Template rendering tests
    test_schema.py       # JSON round-trip tests
    test_selectors.py    # Validates selectors find elements in fixture
pytest tests/ -v
All tests run against a saved HTML fixture (tests/fixtures/sample_profile.html). No tests hit the network.
The CSS selectors used to parse Scholar profiles are isolated in scholar_html/selectors.py. If Google changes their page structure, the selector tests (test_selectors.py) will fail first, pointing you to exactly what broke. Update the selectors and fixture, and the rest of the code stays the same.