
lsinger/scholar-html


scholar-html

A CLI tool that converts a Google Scholar profile into static HTML for embedding in a personal website. It runs in two main steps (fetch and render), with an optional PDF-enrichment step in between:

  1. Fetch: scrape or load a Google Scholar profile page and extract structured publication data into JSON
  2. Fetch PDFs (optional): enrich the JSON with PDF/paper links by visiting each publication's citation page
  3. Render: turn that JSON into HTML using a Jinja2 template

The JSON intermediate format decouples fetching from rendering. You can re-render with different templates without re-fetching, hand-edit the data to fix author names or add missing venues, or check the JSON into version control to track how your citation counts change over time.
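
As an illustration of the version-control use case, two saved snapshots can be compared to see which citation counts moved. This is a minimal sketch that assumes the layout described under JSON Format; the function name `citation_changes` and the sample counts are hypothetical and not part of scholar_html.

```python
def citation_changes(old: dict, new: dict) -> dict:
    """Return {title: delta} for publications whose citation count changed."""
    # Index the older snapshot by title; a hand-edited title would look like a new entry.
    before = {pub["title"]: pub["citations"] for pub in old["publications"]}
    changes = {}
    for pub in new["publications"]:
        delta = pub["citations"] - before.get(pub["title"], 0)
        if delta != 0:
            changes[pub["title"]] = delta
    return changes


# Made-up counts, for illustration only.
old = {"publications": [{"title": "How software developers use GitHub", "citations": 300}]}
new = {"publications": [{"title": "How software developers use GitHub", "citations": 312}]}
print(citation_changes(old, new))
```

In practice the two dicts would come from `json.load` on two checked-in versions of data.json.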

I had Claude Code build this as an easy way to sync my Google Scholar profile's publications to my website from time to time. I use the leifme.html.j2 template and then generate the HTML as a fragment to paste into the Ghost editor as HTML.

Installation

Requires Python 3.10+.

pip install -r requirements.txt

For development (running tests):

pip install -r requirements-dev.txt

Dependencies: requests, beautifulsoup4, lxml, jinja2.

Quick Start

# Parse a saved Scholar HTML file into JSON
python -m scholar_html fetch --html saved_page.html -o data.json

# (Optional) Enrich with PDF links from citation pages
python -m scholar_html fetch-pdfs data.json

# Render the JSON to a full HTML document
python -m scholar_html render data.json -o publications.html

Usage

Step 1: fetch — Extract publication data

The fetch command reads a Google Scholar profile and outputs structured JSON. There are three mutually exclusive ways to specify the input:

# From a saved HTML file (recommended — avoids rate-limiting)
python -m scholar_html fetch --html saved_page.html -o data.json

# From a Scholar user ID (fetches over the network)
python -m scholar_html fetch --user 22Scgp0AAAAJ -o data.json

# From a full URL (fetches over the network)
python -m scholar_html fetch --url "https://scholar.google.com/citations?user=22Scgp0AAAAJ" -o data.json

Omit -o to write to stdout instead of a file.
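
The --user and --url forms point at the same profile: the example URL carries the user ID in its `user` query parameter. The sketch below shows that relationship using only the standard library. It assumes the URL shape shown in the example above, and the helper names are hypothetical; it is not the tool's own parsing code.

```python
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse


def scholar_id_from_url(url: str) -> Optional[str]:
    """Pull the `user` query parameter out of a Scholar profile URL."""
    values = parse_qs(urlparse(url).query).get("user")
    return values[0] if values else None


def profile_url(user_id: str) -> str:
    """Build a profile URL in the same shape as the example above."""
    return "https://scholar.google.com/citations?" + urlencode({"user": user_id})


url = "https://scholar.google.com/citations?user=22Scgp0AAAAJ"
print(scholar_id_from_url(url))
print(profile_url("22Scgp0AAAAJ") == url)
```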

Recommended workflow: Google Scholar aggressively blocks automated requests. The most reliable approach is to open your Scholar profile in a browser, save the page as HTML ("Save As" → "Webpage, Complete" or "Webpage, HTML Only"), and then use --html to parse the saved file.

Step 1.5 (optional): fetch-pdfs — Discover PDF links

The fetch-pdfs command enriches an existing JSON file by visiting each publication's Google Scholar citation detail page to find PDF/paper links. It's designed to be interruptible and resumable — Google Scholar aggressively rate-limits, so you may need to run it several times.

# Enrich data.json in place (default 5s delay between requests)
python -m scholar_html fetch-pdfs data.json

# Custom delay and separate output file
python -m scholar_html fetch-pdfs data.json -o enriched.json --delay 10

Each publication's pdf field tracks its state:

  • "" — not yet attempted
  • "UNAVAILABLE" — attempted, but no PDF link was found on the citation page
  • A URL — the paper/PDF link

Progress is saved after each publication, so if the process is interrupted (rate-limited, Ctrl-C, etc.), re-running picks up where it left off. On rate-limiting, the command prints a message to stderr and exits with code 1.
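
The resumable behaviour described above can be sketched as a loop that skips any publication whose pdf field is already set and saves after each lookup. This is an illustration under assumptions, not the code in fetch_pdfs.py: `enrich`, `lookup`, and `save` are hypothetical names, and the real command also handles delays and rate-limit detection.

```python
from typing import Callable, Optional

UNAVAILABLE = "UNAVAILABLE"


def enrich(
    data: dict,
    lookup: Callable[[dict], Optional[str]],
    save: Callable[[dict], None],
) -> int:
    """Fill in empty pdf fields and return how many publications were attempted."""
    attempted = 0
    for pub in data["publications"]:
        if pub.get("pdf"):
            # Already a URL or "UNAVAILABLE": handled in an earlier run.
            continue
        # lookup is a stand-in for visiting the citation page; it returns a URL or None.
        pub["pdf"] = lookup(pub) or UNAVAILABLE
        # Persist right away so an interruption loses at most the lookup in progress.
        save(data)
        attempted += 1
    return attempted


data = {"publications": [{"title": "A", "pdf": ""}, {"title": "B", "pdf": UNAVAILABLE}]}
print(enrich(data, lookup=lambda pub: None, save=lambda d: None))
```

Because state lives in the pdf field itself, running the loop a second time over the same data attempts nothing.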

Step 2: render — Generate HTML

The render command reads the JSON produced by fetch (optionally enriched by fetch-pdfs) and generates HTML:

# Full HTML document with the default template
python -m scholar_html render data.json -o publications.html

# HTML fragment (just the <section>, no <!DOCTYPE>/html/body wrapper)
python -m scholar_html render data.json --fragment -o snippet.html

# Limit to the first 10 publications
python -m scholar_html render data.json --limit 10 -o publications.html

# Use a custom Jinja2 template
python -m scholar_html render data.json --template my_template.html.j2 -o publications.html

Omit -o to write to stdout.

Fragment mode

--fragment outputs only the <section class="scholar-profile">...</section> block, without the surrounding <!DOCTYPE html>, <html>, <head>, or <body> tags. This is useful when you want to paste or include the output into an existing page.

JSON Format

The intermediate JSON looks like this:

{
  "meta": {
    "scholar_id": "22Scgp0AAAAJ",
    "fetched_at": "2025-02-01T21:33:00+00:00",
    "source": "file:saved_page.html"
  },
  "profile": {
    "name": "Leif Singer",
    "affiliation": "University of Victoria",
    "interests": ["Software Engineering", "Developer Tools"],
    "stats": {
      "citations_all": 1234,
      "citations_recent": 456,
      "h_index_all": 12,
      "h_index_recent": 8,
      "i10_index_all": 15,
      "i10_index_recent": 10
    }
  },
  "publications": [
    {
      "title": "How software developers use GitHub",
      "authors": "L Singer, F Figueira Filho, N Bettenburg, M Storey",
      "venue": "IEEE Software 31 (2), 58-65",
      "year": "2014",
      "citations": 312,
      "url": "/citations?view_op=view_citation&citation_for_view=...",
      "pdf": "https://ieeexplore.ieee.org/abstract/document/6773718"
    }
  ]
}

You can edit this file freely — fix author names, remove publications, reorder entries — and then re-render.
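
Edits like these can also be scripted. The following is a minimal sketch, assuming the JSON layout shown above; `tidy` and its parameters are hypothetical names, not part of scholar_html.

```python
import copy


def tidy(data: dict, renames: dict, drop_titles: set) -> dict:
    """Return an edited copy: drop entries, fix author strings, sort by citations."""
    result = copy.deepcopy(data)  # leave the loaded data untouched
    kept = [p for p in result["publications"] if p["title"] not in drop_titles]
    for pub in kept:
        # authors is a single comma-separated string, so plain replacement is enough here.
        for wrong, right in renames.items():
            pub["authors"] = pub["authors"].replace(wrong, right)
    kept.sort(key=lambda p: p["citations"], reverse=True)
    result["publications"] = kept
    return result


sample = {
    "publications": [
        {"title": "Draft note", "authors": "L Singer", "citations": 0},
        {"title": "How software developers use GitHub", "authors": "L Singer, M Storey", "citations": 312},
    ]
}
print(tidy(sample, renames={"M Storey": "MA Storey"}, drop_titles={"Draft note"}))
```

Writing the result back with `json.dump` and re-running render picks up the changes without another fetch.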

Templates

Three templates are included in the templates/ directory:

default.html.j2

Semantic HTML5 output with:

  • All CSS classes prefixed with scholar- (e.g. scholar-profile, scholar-pub, scholar-pub-title) so they won't collide with your site's styles
  • data-year and data-citations attributes on each <li> for optional client-side filtering or sorting with JavaScript
  • <ol reversed> for the publication list
  • Citation stats displayed as a <dl>
  • No CSS framework dependency — bring your own styles

minimal.html.j2

A plain <ul> with one <li> per publication in "Authors. Title. Venue, Year (N citations)." format. No classes, no data attributes.
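
The line format can be pictured with a small Python equivalent. This is only a sketch of the "Authors. Title. Venue, Year (N citations)." pattern; the function name is hypothetical, and the actual template may treat missing fields differently.

```python
def format_entry(pub: dict) -> str:
    """Format one publication as 'Authors. Title. Venue, Year (N citations).'"""
    return (
        f'{pub["authors"]}. {pub["title"]}. '
        f'{pub["venue"]}, {pub["year"]} ({pub["citations"]} citations).'
    )


pub = {
    "title": "How software developers use GitHub",
    "authors": "L Singer, F Figueira Filho, N Bettenburg, M Storey",
    "venue": "IEEE Software 31 (2), 58-65",
    "year": "2014",
    "citations": 312,
}
print(format_entry(pub))
```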

leifme.html.j2

Styled template designed to match leif.me. Includes self-contained CSS within a <style> block so it works as a --fragment without external stylesheets. Uses the Inter font stack, the site's #fcb615 accent color for PDF links, and a clean publication layout with bold titles, grey authors, italic venues, and a light year/citation meta line.

python -m scholar_html render data.json --template templates/leifme.html.j2 --fragment -o publications.html

Custom templates

Pass any Jinja2 template with --template. The template receives these variables:

| Variable | Type | Description |
|---|---|---|
| profile | object | .name, .affiliation, .interests (list of strings), .stats |
| profile.stats | object | .citations_all, .citations_recent, .h_index_all, .h_index_recent, .i10_index_all, .i10_index_recent |
| publications | list | Each has .title, .authors, .venue, .year, .citations (int), .url, .pdf |
| meta | object | .scholar_id, .fetched_at, .source |
| fragment | bool | Whether --fragment was passed |

Your template should check {% if not fragment %} to conditionally wrap output in a full HTML document.
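
A minimal sketch of that pattern follows. It renders a tiny template directly with the jinja2 library rather than through the CLI, so that both modes can be compared side by side. The template body and the `render_list` helper are illustrative assumptions; only the `publications` and `fragment` variable names come from the table above.

```python
from jinja2 import Template

# The document wrapper is emitted only when fragment is false.
SOURCE = """\
{% if not fragment %}<!DOCTYPE html>
<html><body>
{% endif %}<ul>
{% for pub in publications %}  <li>{{ pub.title }} ({{ pub.year }})</li>
{% endfor %}</ul>
{% if not fragment %}</body></html>
{% endif %}"""


def render_list(publications: list, fragment: bool) -> str:
    # Template() leaves autoescaping off by default; a real template may want it on.
    return Template(SOURCE).render(publications=publications, fragment=fragment)


pubs = [{"title": "How software developers use GitHub", "year": "2014"}]
print(render_list(pubs, fragment=True))
print(render_list(pubs, fragment=False))
```

The first call prints only the list; the second wraps the same list in a full document.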

CSS Classes (default template)

| Class | Element | Content |
|---|---|---|
| scholar-profile | <section> | Wrapper for everything |
| scholar-name | <h2> | Author name |
| scholar-affiliation | <p> | Affiliation |
| scholar-interests | <ul> | Research interest tags |
| scholar-interest | <li> | Individual interest |
| scholar-stats | <dl> | Citation statistics |
| scholar-stat | <div> | Individual stat (wraps <dt> + <dd>) |
| scholar-publications | <ol> | Publication list |
| scholar-pub | <li> | Single publication |
| scholar-pub-title | <span> | Title (contains <a> if URL present) |
| scholar-pub-authors | <span> | Author list |
| scholar-pub-venue | <span> | Journal/conference name |
| scholar-pub-pdf | <a> | [PDF] link (only rendered when a URL is available) |
| scholar-pub-meta | <span> | Year and citation count |
| scholar-pub-year | <span> | Publication year |
| scholar-pub-citations | <span> | Citation count |

Project Structure

scholar_html/
  __init__.py
  __main__.py          # python -m scholar_html entry point
  cli.py               # argparse CLI with fetch/render/fetch-pdfs subcommands
  fetch.py             # HTML parsing and network fetching
  fetch_pdfs.py        # PDF link discovery from citation pages
  render.py            # Jinja2 template rendering
  schema.py            # Dataclasses and JSON serialization
  selectors.py         # CSS selectors for Scholar's DOM (isolated for maintainability)

templates/
  default.html.j2      # Semantic HTML5 with scholar- prefixed classes
  minimal.html.j2      # Bare <ul> list
  leifme.html.j2       # Styled for leif.me, self-contained CSS

tests/
  conftest.py          # Shared fixtures
  fixtures/
    sample_profile.html
    sample_citation.html
    sample_citation_no_link.html
  test_cli.py          # End-to-end CLI tests
  test_fetch.py        # Parser tests against saved HTML
  test_fetch_pdfs.py   # PDF discovery orchestration tests
  test_render.py       # Template rendering tests
  test_schema.py       # JSON round-trip tests
  test_selectors.py    # Validates selectors find elements in fixture

Testing

pytest tests/ -v

All tests run against saved HTML fixtures in tests/fixtures/ (for example sample_profile.html). No tests hit the network.

When Google Changes Their HTML

The CSS selectors used to parse Scholar profiles are isolated in scholar_html/selectors.py. If Google changes their page structure, the selector tests (test_selectors.py) will fail first, pointing you to exactly what broke. Update the selectors and fixture, and the rest of the code stays the same.
