A comprehensive resume parsing system that extracts structured data from PDF resumes using LangChain and OpenAI, and stores the data in a SQLite database categorized by user.
- PDF Text Extraction: Extract text from PDF resume files using PyPDF2
- AI-Powered Parsing: Use LangChain with OpenAI to intelligently parse resume content
- Structured Data Storage: Store parsed data in a SQLite database with user categorization
- User Management: Automatically categorize resumes by user name (inferred or specified)
- Comprehensive Data Extraction: Extract education, experience, skills, projects, achievements, and more
- Clone or set up the project
- Install required dependencies:

pip install -r requirements.txt

- Set up your API credentials in a .env file (the variables below target OpenRouter, which is accessed through an OpenAI-compatible endpoint):

OPENROUTER_API_KEY=your_api_key_here
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
YOUR_SITE_URL=your_site_url
YOUR_SITE_NAME=your_site_name

Place your PDF resume files in the resume/ folder:

resume/
├── arpit_solanki_resume.pdf
├── john_doe_cv.pdf
└── jane_smith_resume.pdf

Run the parser:

python main.py

This will process all PDF files in the resume/ folder and automatically infer user names from the file content or filename.
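
The README does not show how the user name is inferred. As a rough illustration of the filename half of that step, here is a minimal sketch. The function name `infer_user_name_from_filename`, the filler-word list, and the "Unknown User" default are all assumptions made for this example, not the project's actual code.

```python
import re
from pathlib import Path

# Hypothetical filler words that often appear in resume filenames.
_FILLER_WORDS = {"resume", "cv", "final", "updated"}


def infer_user_name_from_filename(file_path: str) -> str:
    """Guess a display name from a resume filename (illustrative heuristic)."""
    stem = Path(file_path).stem  # e.g. "arpit_solanki_resume"
    # Split on underscores, hyphens, dots, and whitespace.
    tokens = [t for t in re.split(r"[_\-.\s]+", stem) if t]
    # Drop filler words and purely numeric tokens such as version numbers.
    name_tokens = [
        t for t in tokens
        if t.lower() not in _FILLER_WORDS and not t.isdigit()
    ]
    if not name_tokens:
        return "Unknown User"  # assumed default when nothing usable remains
    return " ".join(t.capitalize() for t in name_tokens)


print(infer_user_name_from_filename("resume/arpit_solanki_resume.pdf"))
```

A heuristic like this can only recover names that are spelled out in the filename, so inference from the file content would presumably take priority when both are available.
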

python main.py --user "Arpit Solanki"
python main.py --file "resume.pdf" --user "John Doe"
python main.py --list

Command-line options:
- --user, -u: Specify user name to associate with resume(s)
- --file, -f: Process a specific PDF file
- --list, -l: List all users and their resumes in the database

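
For readers who want to see how such options are typically wired up, the following is a minimal sketch using the standard library's `argparse`. Only the flag names come from the list above. The function name `build_arg_parser`, the help strings, and the choice of `argparse` itself are assumptions, since the README does not show how main.py parses its arguments.

```python
import argparse


def build_arg_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the three documented options (sketch only)."""
    parser = argparse.ArgumentParser(
        description="Parse PDF resumes and store the results in SQLite."
    )
    parser.add_argument(
        "--user", "-u",
        help="User name to associate with the resume(s)",
    )
    parser.add_argument(
        "--file", "-f",
        help="Process a specific PDF file",
    )
    parser.add_argument(
        "--list", "-l",
        action="store_true",
        help="List all users and their resumes in the database",
    )
    return parser


# Parse an explicit argument list so the example runs without a real CLI call.
args = build_arg_parser().parse_args(["--file", "resume.pdf", "--user", "John Doe"])
print(args.file, args.user, args.list)
```

Declaring `--list` with `action="store_true"` makes it a plain switch that takes no value, which matches how it is invoked above.
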
langchain/
├── resume/              # Folder for PDF resume files
├── main.py              # Main script to run the resume parser
├── pdf_parser.py        # PDF text extraction module
├── resume_parser.py     # LangChain-based resume parsing module
├── database.py          # Database models and management
├── model.py             # Original LangChain model configuration
├── requirements.txt     # Python dependencies
├── .env                 # Environment variables (API keys)
└── README.md            # This file
Users table:
- id: Primary key
- name: User name
- created_at: Timestamp

Resume records table:
- id: Primary key
- user_id: Foreign key to users table
- file_name: Original PDF filename
- full_name: Extracted full name
- email: Email address
- phone: Phone number
- location: Current location
- linkedin: LinkedIn profile URL
- github: GitHub profile URL
- summary: Professional summary
- education: JSON string of education details
- experience: JSON string of work experience
- technical_skills: JSON string of technical skills
- soft_skills: JSON string of soft skills
- projects: JSON string of projects
- achievements: JSON string of achievements
- certifications: JSON string of certifications
- raw_text: Original extracted text
- created_at: Timestamp
- updated_at: Last update timestamp
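
To make the storage layout concrete, here is a minimal sketch of the two tables using the standard library's `sqlite3`. Several things are assumptions made for illustration: the table name `resumes`, the `UNIQUE` constraint on the user name, the helper `save_resume`, and the reduced column set (only a few of the columns listed above are included to keep the example short). The project's database.py may be organized quite differently.

```python
import json
import sqlite3

# Assumed DDL: a subset of the documented columns, list fields stored as JSON text.
SCHEMA = """
CREATE TABLE IF NOT EXISTS users (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS resumes (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id),
    file_name TEXT,
    full_name TEXT,
    email TEXT,
    education TEXT,
    technical_skills TEXT,
    raw_text TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP
);
"""


def save_resume(conn: sqlite3.Connection, user_name: str,
                file_name: str, parsed: dict) -> int:
    """Insert a resume row for user_name, creating the user if needed."""
    conn.execute("INSERT OR IGNORE INTO users (name) VALUES (?)", (user_name,))
    user_id = conn.execute(
        "SELECT id FROM users WHERE name = ?", (user_name,)
    ).fetchone()[0]
    cur = conn.execute(
        "INSERT INTO resumes "
        "(user_id, file_name, full_name, email, education, technical_skills) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (
            user_id,
            file_name,
            parsed.get("full_name"),
            parsed.get("email"),
            json.dumps(parsed.get("education", [])),         # list -> JSON string
            json.dumps(parsed.get("technical_skills", [])),  # list -> JSON string
        ),
    )
    conn.commit()
    return cur.lastrowid


conn = sqlite3.connect(":memory:")  # in-memory database for the example
conn.executescript(SCHEMA)
resume_id = save_resume(
    conn, "Jane Smith", "jane_smith_resume.pdf",
    {"full_name": "Jane Smith", "technical_skills": ["Python", "SQL"]},
)
```

Storing list-valued fields as JSON strings keeps the schema flat. The trade-off is that values must be decoded with `json.loads` after reading and cannot easily be filtered in SQL.
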
The system extracts the following structured data from resumes:
- Personal information
  - Full name
  - Email address
  - Phone number
  - Current location
  - LinkedIn profile
  - GitHub profile
- Summary
  - Brief professional summary or objective
- Education
  - Institution name
  - Degree/qualification
  - Field of study
  - Graduation year
  - GPA (if mentioned)
  - Location
- Work experience
  - Company name
  - Position/job title
  - Duration of employment
  - Location
  - Key responsibilities and achievements
- Skills
  - Technical skills (programming languages, frameworks, tools)
  - Soft skills
- Projects
  - Project name
  - Description
  - Technologies used
  - Duration
  - Project links
- Achievements
  - Achievement title
  - Description
  - Year achieved
  - Awarding organization
- Certifications
  - Certification name
  - Issuing organization
  - Issue date
  - Expiry date
  - Credential ID
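
One way to picture the shape of this data is as nested records. The sketch below models part of it with standard-library dataclasses. The class names and attribute names are hypothetical (chosen to mirror the fields listed above), and projects, achievements, and certifications are omitted for brevity. The actual parser may well return plain dictionaries instead.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Education:
    institution: str
    degree: Optional[str] = None
    field_of_study: Optional[str] = None
    graduation_year: Optional[str] = None
    gpa: Optional[str] = None  # only present if the resume mentions it
    location: Optional[str] = None


@dataclass
class Experience:
    company: str
    position: Optional[str] = None
    duration: Optional[str] = None
    location: Optional[str] = None
    responsibilities: List[str] = field(default_factory=list)


@dataclass
class ParsedResume:
    full_name: Optional[str] = None
    email: Optional[str] = None
    phone: Optional[str] = None
    summary: Optional[str] = None
    education: List[Education] = field(default_factory=list)
    experience: List[Experience] = field(default_factory=list)
    technical_skills: List[str] = field(default_factory=list)
    soft_skills: List[str] = field(default_factory=list)


resume = ParsedResume(
    full_name="Jane Smith",
    education=[Education(institution="Example University", degree="B.Sc.")],
    technical_skills=["Python"],
)
# asdict() recursively converts nested dataclasses, so list fields can be
# serialized to the JSON strings that the database columns hold.
education_json = json.dumps(asdict(resume)["education"])
```

Fields the parser cannot find default to `None` or an empty list, which keeps downstream code from having to special-case missing keys.
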
from pdf_parser import PDFParser
from resume_parser import ResumeParser
from database import DatabaseManager
# Initialize components
pdf_parser = PDFParser()
resume_parser = ResumeParser()
db_manager = DatabaseManager()
# Extract text from a PDF
text = pdf_parser.extract_text_from_pdf("resume/arpit_resume.pdf")
# Parse the resume
parsed_data = resume_parser.parse_resume(text)
# Save to database
resume_record = db_manager.save_resume_data("Arpit Solanki", "arpit_resume.pdf", parsed_data)

The system includes comprehensive error handling:
- Failed PDF text extraction
- AI parsing failures (falls back to regex-based extraction)
- Database connection issues
- File not found errors
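
The regex-based fallback mentioned above could look roughly like the following sketch. The patterns, the function names `regex_fallback_extract` and `parse_with_fallback`, and the set of fields recovered are assumptions for illustration. The README does not specify which fields the real fallback extracts or how it is triggered.

```python
import re
from typing import Callable, Dict, Optional

# Deliberately simple patterns; real resumes will need more robust ones.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
LINKEDIN_RE = re.compile(
    r"(?:https?://)?(?:www\.)?linkedin\.com/in/[A-Za-z0-9_-]+", re.IGNORECASE
)
GITHUB_RE = re.compile(
    r"(?:https?://)?(?:www\.)?github\.com/[A-Za-z0-9_-]+", re.IGNORECASE
)


def _first_match(pattern: "re.Pattern[str]", text: str) -> Optional[str]:
    """Return the first match of pattern in text, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None


def regex_fallback_extract(text: str) -> Dict[str, Optional[str]]:
    """Recover basic contact fields without calling the language model."""
    return {
        "email": _first_match(EMAIL_RE, text),
        "phone": _first_match(PHONE_RE, text),
        "linkedin": _first_match(LINKEDIN_RE, text),
        "github": _first_match(GITHUB_RE, text),
    }


def parse_with_fallback(text: str, ai_parse: Callable[[str], dict]) -> dict:
    """Try the AI parser first and fall back to regex extraction on any error."""
    try:
        return ai_parse(text)
    except Exception:
        return regex_fallback_extract(text)
```

A fallback of this kind recovers only fields with a recognizable surface pattern. Structured sections such as education or experience would still need the AI parser.
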
Starting Resume Processing Pipeline
=====================================
Initializing components...
Found 1 PDF file(s): ['arpit_resume.pdf']
============================================================
Processing: arpit_resume.pdf
============================================================
✅ Successfully extracted text (2847 characters)
Parsing resume with AI...
User identified as: Arpit Solanki
Saving to database...
✅ Successfully saved resume data for Arpit Solanki
Resume ID: 1
Extracted Data Summary:
• Full Name: Arpit Solanki
• Email: arpitsolanki6825@gmail.com
• Phone: +91-8279824227
• Location: N/A
• Education entries: 1
• Work experience entries: 1
• Projects: 2
• Technical skills: 8
Processing Complete!
========================================
✅ Successfully processed: 1
❌ Failed to process: 0
Total files: 1
Data saved to: resume_database.db
You can query the database to retrieve user resume data
Feel free to contribute by:
- Adding support for more file formats (DOC, DOCX)
- Improving the AI parsing prompts
- Adding more structured data fields
- Enhancing error handling
- Adding a web interface
This project is open source. Feel free to use and modify as needed.