CharlesPikachu/paperdl


📚 Documents: paperdl.readthedocs.io

🎉 What's New

  • 2026-05-31: Paperdl has received a major upgrade. All code has been rewritten to be asynchronous, and paper search and download are now supported across seven platforms: arXiv, OpenReview, ACL Anthology, bioRxiv, medRxiv, PMLR, and PMC OA. The documentation has also been thoroughly revised.

🧠 Introduction

A simple and extensible toolkit for searching, organizing, and downloading academic papers from specific websites.

If this project helps your research workflow, please consider giving it a star ⭐. Your support helps more people discover the project and motivates future improvements.

🛡️ Project Disclaimer

This repository is intended for lawful, educational, academic, and research-related purposes only, such as learning Python, exploring academic paper search workflows, and assisting non-profit research or study.

Users are solely responsible for ensuring that their use of this project complies with applicable laws, website terms of service, copyright rules, publisher policies, institutional requirements, and third-party rights. This project must not be used for illegal purposes, copyright infringement, unauthorized access, abusive downloading, or any activity that may harm authors, publishers, platforms, or institutions.

This project is released under the Apache License 2.0. The authors and contributors provide no warranty, commercial authorization, indemnity, or liability commitment beyond the license terms, and are not responsible for any misuse or consequences arising from the use, modification, redistribution, or commercial application of this project.

📚 Supported Paper Clients

| Client | Description | 🔎 Search | ⬇️ Download | Code Snippet |
| --- | --- | --- | --- | --- |
| ArxivPaperClient | arXiv preprint search and PDF download. | ✓ | ✓ | arxiv_paper_client.py |
| OpenReviewPaperClient | OpenReview paper search and PDF download, especially for conference submissions and reviews. | ✓ | ✓ | openreview_paper_client.py |
| ACLAnthologyPaperClient | ACL Anthology paper search and PDF download for NLP and computational linguistics papers. | ✓ | ✓ | acl_anthology_paper_client.py |
| BioRxivPaperClient | bioRxiv preprint search and PDF download for biology-related papers. | ✓ | ✓ | biorxiv_paper_client.py |
| MedRxivPaperClient | medRxiv preprint search and PDF download for medical and health science papers. | ✓ | ✓ | biorxiv_paper_client.py |
| PMLRPaperClient | PMLR paper search and PDF download for machine learning proceedings. | ✓ | ✓ | pmlr_paper_client.py |
| PMCOAPaperClient | PubMed Central Open Access paper search and PDF download. | ✓ | ✓ | pmc_oa_paper_client.py |

⚙️ Installation

Paperdl requires Python 3.10+. Using a virtual environment is recommended to avoid dependency conflicts.

Install from PyPI:

python -m pip install -U paperdl

Or install the latest version from GitHub:

python -m pip install -U git+https://github.com/CharlesPikachu/paperdl.git@main

For local development:

git clone https://github.com/CharlesPikachu/paperdl.git
cd paperdl
python -m pip install -e .

Most paper clients work without browser dependencies. However, some bioRxiv / medRxiv PDF downloads may require the optional Playwright-based browser fallback.

Install with browser support:

python -m pip install -U "paperdl[browser]"
python -m playwright install chromium

For local development with browser support:

python -m pip install -e ".[browser]"
python -m playwright install chromium

On some Linux servers, Playwright may also require system dependencies:

python -m playwright install-deps chromium

🚀 Quick Start

Paperdl is a unified asynchronous toolkit for scholarly paper search and PDF download. It can be used in two main ways:

  • Command line: powered by PaperClientCMD, suitable for quick searches, saved results, and batch downloads.
  • Python package: powered by PaperClient, suitable for scripts, scheduled jobs, and research workflows.

Built-in client names: arxiv, openreview, acl_anthology, biorxiv, medrxiv, pmlr, and pmc_oa. The default client is arxiv.

Command Line Usage

The examples below use the paperdl command. If your development environment has not registered the console script, replace paperdl with:

python -m paperdl.paperdl

(1) List Available Clients

paperdl clients

(2) Search Papers

Search the default arXiv source:

paperdl search "diffusion model" -n 10

Search multiple sources:

paperdl search "large language model" -c arxiv,pmlr,acl_anthology -n 5

Search all registered sources:

paperdl search "retrieval augmented generation" -c all -n 3 \
  --client-search-param openreview.venue_id=ICLR.cc/2024/Conference \
  --client-search-param biorxiv.max_scan_results=500 \
  --client-search-param medrxiv.max_scan_results=500 \
  --client-search-param pmlr.max_volumes=120

When using -c all, client-specific search parameters may be required. For example, OpenReview needs a search scope such as venue_id, while clients such as bioRxiv, medRxiv, and PMLR may need scan limits to keep the search fast.

Print JSON or JSONL:

paperdl search "transformer" -c arxiv -n 5 --format json
paperdl search "transformer" -c arxiv -n 5 --format jsonl

Save search results for later download:

paperdl search "graph neural network" -c arxiv,pmlr -n 10 --output-json outputs/search_results.json

Show only the first few rows in the terminal while saving all results:

paperdl search "multimodal large language model" -c arxiv -n 50 --limit 10 --output-json outputs/mllm.json

Pass common search parameters to every selected client:

paperdl search "large language model" -c arxiv -n 20 --search-param sort_by=relevance --search-param page_size=20

Pass per-client search parameters. For macOS/Linux/Git Bash:

paperdl search "diffusion" -c arxiv,pmlr -n 3 \
  --client-search-param 'arxiv.categories=["cs.CV","cs.LG"]' \
  --client-search-param pmlr.max_volumes=30

For Windows cmd:

paperdl search "diffusion" -c arxiv,pmlr -n 3 ^
  --client-search-param "arxiv.categories=[\"cs.CV\",\"cs.LG\"]" ^
  --client-search-param pmlr.max_volumes=30
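The `client.key=value` form used above can be pictured as a small parser that builds one dictionary per client. The sketch below is only an illustration of that idea, not paperdl's actual parsing code: the function name `parse_client_params` is hypothetical, and decoding each value as JSON with a plain-string fallback is an assumption based on the examples in this section.

```python
import json
from typing import Any


def parse_client_params(items: list[str]) -> dict[str, dict[str, Any]]:
    """Turn ["client.key=value", ...] into {"client": {"key": value}}."""
    result: dict[str, dict[str, Any]] = {}
    for item in items:
        # Split on the first "=" so that values may contain "=" or "." freely.
        target, sep, raw = item.partition("=")
        client, dot, key = target.partition(".")
        if not sep or not dot or not client or not key:
            raise ValueError(f"expected client.key=value, got {item!r}")
        try:
            value = json.loads(raw)  # numbers, lists, true/false, quoted strings
        except json.JSONDecodeError:
            value = raw  # assumption: anything else is kept as a plain string
        result.setdefault(client, {})[key] = value
    return result


params = parse_client_params([
    'arxiv.categories=["cs.CV","cs.LG"]',
    "pmlr.max_volumes=30",
])
print(params)
```

Under these assumptions, the result has the same shape as the JSON object accepted by --client-search-kwargs, which is why the two forms can be used interchangeably.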

You can also pass per-client parameters as JSON. For macOS/Linux/Git Bash:

paperdl search "diffusion" -c arxiv,pmlr -n 3 \
  --client-search-kwargs '{"arxiv":{"categories":["cs.CV","cs.LG"]},"pmlr":{"max_volumes":30}}'

For Windows cmd:

paperdl search "diffusion" -c arxiv,pmlr -n 3 ^
  --client-search-kwargs "{\"arxiv\":{\"categories\":[\"cs.CV\",\"cs.LG\"]},\"pmlr\":{\"max_volumes\":30}}"

On Windows cmd, do not use single quotes around JSON-like values. Use double quotes around the whole argument and escape inner double quotes with \".
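One way to sidestep shell quoting differences altogether is to build the argument list in Python and pass it to `subprocess` without a shell. The sketch below only assembles the arguments; the helper name `build_search_argv` is hypothetical, and the flags are the ones shown in the examples above.

```python
import json


def build_search_argv(query, clients, total_results, client_search_kwargs):
    """Build an argument list for the paperdl CLI without any shell quoting."""
    return [
        "paperdl", "search", query,
        "-c", ",".join(clients),
        "-n", str(total_results),
        # json.dumps produces the JSON text; no escaping is needed in a list.
        "--client-search-kwargs",
        json.dumps(client_search_kwargs, separators=(",", ":")),
    ]


argv = build_search_argv(
    "diffusion",
    ["arxiv", "pmlr"],
    3,
    {"arxiv": {"categories": ["cs.CV", "cs.LG"]}, "pmlr": {"max_volumes": 30}},
)
print(argv)

# To actually run the command (requires paperdl to be installed):
# import subprocess
# subprocess.run(argv, check=True)
```

Because the list is handed to the process directly, the same script should behave the same way on Windows, macOS, and Linux.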

(3) Download Papers

Search and download all returned papers:

paperdl download "diffusion model" -c arxiv -n 5 -o papers

Download only the first three results:

paperdl download "diffusion model" -c arxiv -n 20 --select top3 -o papers

Download selected result indices shown in the preview table:

paperdl download "diffusion model" -c arxiv -n 20 --select 1,3-5 -o papers
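The selection expressions accepted by --select (`all`, `topN`, and comma-separated indices or ranges) can be read as a mapping from a string to a list of row numbers. The following is a minimal sketch of that reading, not paperdl's own implementation: `parse_selection` is a hypothetical name, indices are assumed to be 1-based as in the preview table, and out-of-range indices are assumed to be ignored.

```python
def parse_selection(spec: str, total: int) -> list[int]:
    """Return sorted 1-based indices for specs like "all", "top3", or "1,3-5"."""
    spec = spec.strip().lower()
    if spec == "all":
        return list(range(1, total + 1))
    if spec.startswith("top"):
        # "top10" on a 4-row table selects only the 4 rows that exist.
        return list(range(1, min(int(spec[3:]), total) + 1))
    chosen: set[int] = set()
    for part in spec.split(","):
        start, dash, end = part.strip().partition("-")
        low = int(start)
        high = int(end) if dash else low
        chosen.update(range(low, high + 1))
    # Assumption: indices outside the table are dropped rather than rejected.
    return sorted(i for i in chosen if 1 <= i <= total)


print(parse_selection("1,3-5", total=20))
print(parse_selection("top3", total=20))
```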

Download from a saved search result file:

paperdl download --input-json outputs/search_results.json --select top10 -o papers

Overwrite existing PDF files:

paperdl download "attention is all you need" -c arxiv -n 1 -o papers --overwrite

Run in quiet mode:

paperdl download "diffusion" -c arxiv,pmlr -n 5 --quiet -o papers

Stop immediately when any selected client fails:

paperdl download "diffusion" -c arxiv,pmlr -n 5 --raise-on-error

(4) Common CLI Options

| Option | Purpose |
| --- | --- |
| -c, --clients | Comma-separated client names, or all. Default: arxiv. |
| -n, --total-results | Default number of results per client. |
| --output-json | Save search results to a JSON file. |
| --input-json | Load paper records from JSON when running download. |
| --format | Output format: table, json, or jsonl. |
| --select | Download selection, such as all, top10, or 1,3-5. |
| -o, --output-dir | Output directory for PDFs. |
| --overwrite | Overwrite existing PDF files. |
| --no-dedupe | Disable cross-client deduplication. |
| --quiet | Disable verbose logs and progress output where possible. |
| --search-concurrency | Number of clients searched concurrently. |
| --init-param / --search-param | Constructor or search parameter applied to all clients. |
| --client-init-param / --client-search-param | Constructor or search parameter applied to one client. |
| --init-kwargs / --search-kwargs | JSON object applied to all clients. |
| --client-init-kwargs / --client-search-kwargs | JSON object keyed by client name. |
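Cross-client deduplication, which --no-dedupe switches off, matters because the same paper can be returned by more than one source. How paperdl decides that two records are the same paper is not described here, so the sketch below shows one plausible strategy only: prefer a DOI, then an arXiv ID, then a normalized title. The helper names and the made-up records are hypothetical; the attribute names follow the PaperInfo fields listed in this README.

```python
import re
from types import SimpleNamespace


def dedupe_key(paper):
    """Prefer stable identifiers; fall back to a normalized title."""
    if getattr(paper, "doi", None):
        return ("doi", paper.doi.lower())
    if getattr(paper, "arxiv_id", None):
        return ("arxiv", paper.arxiv_id.lower())
    title = re.sub(r"[^a-z0-9]+", " ", paper.title.lower()).strip()
    return ("title", title)


def dedupe(papers):
    """Keep the first record seen for each key, preserving input order."""
    seen = set()
    unique = []
    for paper in papers:
        key = dedupe_key(paper)
        if key not in seen:
            seen.add(key)
            unique.append(paper)
    return unique


# Made-up records standing in for PaperInfo objects.
papers = [
    SimpleNamespace(title="Example Paper A", doi=None, arxiv_id="0000.00001", source="arxiv"),
    SimpleNamespace(title="Example Paper A", doi=None, arxiv_id="0000.00001", source="pmlr"),
    SimpleNamespace(title="Example Paper C", doi=None, arxiv_id=None, source="openreview"),
    SimpleNamespace(title="example paper c.", doi=None, arxiv_id=None, source="acl_anthology"),
]
unique = dedupe(papers)
print([p.source for p in unique])
```

A known weakness of this strategy is that a record carrying only a DOI and another carrying only an arXiv ID for the same paper produce different keys and are both kept.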

Python Package Usage

(1) Minimal Search Example

import asyncio
from paperdl import PaperClient

async def main():
    async with PaperClient(["arxiv"], default_init_kwargs={"verbose": False}) as client:
        papers = await client.search("diffusion model", total_results=5)
        for paper in papers:
            print(paper.title, paper.article_url, paper.download_url)

asyncio.run(main())

client.search(...) returns a list of PaperInfo objects. Common fields include title, abstract, authors, article_url, download_url, doi, arxiv_id, venue, published_at, and source.
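Because the results are plain objects with named fields, they are easy to export. As an illustration, the sketch below writes a few of the fields listed above to CSV text. It runs against a stand-in object rather than a real PaperInfo; the function name `papers_to_csv` is hypothetical, and treating authors as a list of strings is an assumption.

```python
import csv
import io
from types import SimpleNamespace

FIELDS = ["title", "authors", "published_at", "source", "article_url", "download_url"]


def papers_to_csv(papers) -> str:
    """Serialize selected paper fields to CSV text, one row per paper."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(FIELDS)
    for paper in papers:
        row = []
        for name in FIELDS:
            value = getattr(paper, name, None)
            if isinstance(value, (list, tuple)):
                # Assumption: list-valued fields such as authors hold strings.
                value = "; ".join(str(v) for v in value)
            row.append("" if value is None else str(value))
        writer.writerow(row)
    return buffer.getvalue()


# Made-up record standing in for a PaperInfo object.
example = SimpleNamespace(
    title="Example Paper",
    authors=["Ada Example", "Bob Sample"],
    published_at="2024-01-01",
    source="arxiv",
    article_url="https://example.org/abs/1",
    download_url=None,
)
print(papers_to_csv([example]))
```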

(2) Search Multiple Sources

import asyncio
from paperdl import PaperClient

async def main():
    async with PaperClient(["arxiv", "pmlr", "acl_anthology"]) as client:
        papers = await client.search("large language model", total_results=5)
        print(f"found {len(papers)} papers")

asyncio.run(main())

Return results grouped by client:

results = await client.search(
    "large language model",
    total_results=5,
    return_by_client=True,
)
print(results["arxiv"])
print(results["pmlr"])

(3) Search and Download

import asyncio
from paperdl import PaperClient

async def main():
    async with PaperClient(["arxiv"]) as client:
        papers = await client.search("attention is all you need", total_results=1)
        paths = await client.download(papers, output_dir="papers")
        print(paths)

asyncio.run(main())

Run search and download in one call:

papers, paths = await client.searchanddownload(
    "diffusion model",
    clients=["arxiv"],
    total_results=5,
    output_dir="papers",
)

(4) Save and Load Search Results

from paperdl import PaperClient

PaperClient.saveresults(papers, "outputs/search_results.json")
loaded_papers = PaperClient.loadresults("outputs/search_results.json")

Loaded results can be downloaded later:

async with PaperClient(["arxiv", "pmlr", "acl_anthology"]) as client:
    loaded_papers = PaperClient.loadresults("outputs/search_results.json")
    paths = await client.download(loaded_papers, output_dir="papers")

(5) Configure Different Clients Differently

import asyncio
from paperdl import PaperClient

async def main():
    async with PaperClient(
        ["arxiv", "pmlr"],
        default_init_kwargs={"verbose": False, "show_progress": False},
        client_search_kwargs={
            "arxiv": {"categories": ["cs.CL", "cs.AI"], "sort_by": "submittedDate"},
            "pmlr": {"max_volumes": 30, "enrich_abstracts": True},
        },
        search_concurrency=2,
    ) as client:
        papers = await client.search("large language model", total_results=10)
        await client.download(papers[:5], output_dir="papers")

asyncio.run(main())
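The search_concurrency setting caps how many clients are searched at the same time. A common way to get that behavior in asyncio is a semaphore; the sketch below shows the general technique with fake search functions and makes no claim about how paperdl implements it internally. All names in it are hypothetical.

```python
import asyncio


async def search_all(search_fns, query, concurrency=2):
    """Run one search coroutine per client, at most `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)

    async def run_one(name, fn):
        async with semaphore:  # waits here while `concurrency` searches are active
            return name, await fn(query)

    pairs = await asyncio.gather(*(run_one(n, f) for n, f in search_fns.items()))
    return dict(pairs)


# Fake searches that record how many of them are running at once.
running = 0
peak = 0


async def fake_search(query):
    global running, peak
    running += 1
    peak = max(peak, running)
    await asyncio.sleep(0.01)  # stands in for network latency
    running -= 1
    return [f"{query} result"]


results = asyncio.run(
    search_all(
        {"arxiv": fake_search, "pmlr": fake_search, "acl_anthology": fake_search},
        "diffusion",
    )
)
print(sorted(results), "peak concurrency:", peak)
```

A higher limit shortens total search time when several sources are selected, at the cost of more simultaneous requests.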

Override search parameters for a single call:

papers = await client.search(
    "diffusion",
    total_results=10,
    client_search_kwargs={
        "arxiv": {"categories": ["cs.CV"]},
        "pmlr": {"max_volumes": 20},
    },
)

(6) Error Handling

By default, a failed client does not stop other clients. Search errors are stored in client.last_errors.

papers = await client.search("diffusion", total_results=5)
if client.last_errors:
    for name, err in client.last_errors.items():
        print(name, err)

Raise immediately on failure:

papers = await client.search("diffusion", total_results=5, raise_on_error=True)
paths = await client.download(papers, output_dir="papers", raise_on_error=True)

Keep download exceptions in the returned list:

results = await client.download(
    papers,
    output_dir="papers",
    return_exceptions=True,
)
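With return_exceptions=True, the caller has to tell successful downloads apart from failures. The sketch below assumes the returned list has one entry per input paper, in the same order, with exception objects in place of failed downloads (the convention used by `asyncio.gather`); that ordering is an assumption, not something this README states. The helper name and sample values are hypothetical.

```python
def split_download_results(papers, results):
    """Pair each paper with its outcome and separate successes from failures."""
    succeeded, failed = [], []
    for paper, outcome in zip(papers, results):
        if isinstance(outcome, BaseException):
            failed.append((paper, outcome))
        else:
            succeeded.append((paper, outcome))
    return succeeded, failed


# Stand-in values: strings instead of PaperInfo objects and real paths.
papers = ["paper-a", "paper-b", "paper-c"]
results = ["papers/a.pdf", RuntimeError("timeout"), "papers/c.pdf"]

succeeded, failed = split_download_results(papers, results)
for paper, error in failed:
    print(f"failed: {paper}: {error!r}")
```

The failed list can then be passed back to a later download call for a retry.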

(7) Use PaperClientCMD from Python

PaperClientCMD is the Python wrapper behind the command line interface. It is useful when you want to reuse CLI behavior inside another script:

from paperdl import PaperClientCMD

PaperClientCMD(["clients"]).run()
PaperClientCMD(["search", "diffusion model", "-c", "arxiv", "-n", "5"]).run()
PaperClientCMD(["download", "diffusion model", "-c", "arxiv", "-n", "3", "-o", "papers"]).run()

Next Steps

  • For quick usage, start with the CLI and PaperClient examples in this file.
  • For source-specific search options, see Clients.md.
  • To add a new paper source, subclass BasePaperClient, implement search and downloaditem, and register it in the client registry.
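The last step, adding a new source, might look roughly like the sketch below. It does not import paperdl: `BasePaperClientSketch`, `CLIENT_REGISTRY`, `register`, and the method signatures are hypothetical stand-ins written only to show the overall shape (subclass, implement search and downloaditem, register). The real BasePaperClient interface and registry should be taken from the paperdl source and documentation.

```python
import asyncio
from abc import ABC, abstractmethod

# Hypothetical stand-ins: paperdl's real base class and registry may differ.
CLIENT_REGISTRY: dict[str, type] = {}


class BasePaperClientSketch(ABC):
    @abstractmethod
    async def search(self, query: str, total_results: int = 10) -> list[dict]: ...

    @abstractmethod
    async def downloaditem(self, paper: dict, output_dir: str) -> str: ...


def register(name):
    """Class decorator that records a client class under a short name."""
    def decorator(cls):
        CLIENT_REGISTRY[name] = cls
        return cls
    return decorator


@register("example")
class ExamplePaperClient(BasePaperClientSketch):
    """Toy client that serves an in-memory catalogue instead of a website."""

    CATALOGUE = [
        {"title": "Example Diffusion Paper", "download_url": "https://example.org/1.pdf"},
        {"title": "Example Graph Paper", "download_url": "https://example.org/2.pdf"},
    ]

    async def search(self, query, total_results=10):
        hits = [p for p in self.CATALOGUE if query.lower() in p["title"].lower()]
        return hits[:total_results]

    async def downloaditem(self, paper, output_dir):
        # A real client would fetch paper["download_url"]; this only builds the path.
        filename = paper["title"].replace(" ", "_") + ".pdf"
        return f"{output_dir}/{filename}"


async def demo():
    client = CLIENT_REGISTRY["example"]()
    hits = await client.search("diffusion")
    return [await client.downloaditem(p, "papers") for p in hits]


paths = asyncio.run(demo())
print(paths)
```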

⭐ Recommended Projects

| Project | Description |
| --- | --- |
| 🎵 Musicdl | Lightweight lossless music downloader |
| 🎬 Videodl | Lightweight HD, watermark-free video downloader |
| 🖼️ Imagedl | Lightweight bulk image search and downloader |
| 🖼️ Paperdl | Lightweight academic paper search and downloader |
| 🌐 FreeProxy | Collector of high-quality free proxies from around the world |
| 🌐 MusicSquare | Simple web page for music search, download, and playback |
| 🌐 FreeGPTHub | A truly free unified GPT interface |

📚 Citation

If you use this project in your research, please cite the repository.

@misc{paperdl2022,
    author = {Zhenchao Jin},
    title = {Paperdl: A Unified Asynchronous Framework for Scholarly Paper Search and Download},
    year = {2022},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/CharlesPikachu/paperdl}},
}

📢 WeChat Official Account:

Charles的皮卡丘 (Charles_pikachu)

About

Paperdl: A Unified Asynchronous Framework for Scholarly Paper Search and Download. (A lightweight paper downloader supporting Arxiv, Scihub, OpenReview, ACL Anthology, bioRxiv, medRxiv, PMLR, PMC, and other platforms.)
