ScrapeBackend API is a Node.js-based API service designed to scrape and store data from websites. It provides an easy-to-use interface to interact with scraped data and enables both real-time scraping and data retrieval from a MongoDB database.
- Web Scraping: Leverage Puppeteer to scrape data from websites.
- Real-Time Data Fetching: Scrape and serve data directly from websites.
- Database Integration: Store scraped data in a MongoDB database for persistent access.
- Flexible API: Expose endpoints for retrieving and managing scraped data.
- Modular Design: Easily extend and integrate with other systems.
Follow the steps below to get the ScrapeBackend API up and running:
Make sure you have the following installed:
- Node.js (v14+)
- MongoDB (local or cloud, such as MongoDB Atlas)
Clone the repository:

```bash
git clone https://github.com/YuvrajKarna/ScrapeBackend-API.git
cd ScrapeBackend-API
```

Install dependencies:

```bash
npm install
```

Create a `.env` file in the root directory and set the following environment variables:

```env
MONGO_URI=your_mongodb_connection_string
PORT=5000
```

Replace `your_mongodb_connection_string` with your MongoDB connection URI.

Start the development server:

```bash
npm run devStart
```

This will start the server on the specified port (5000 by default) with hot reloading enabled.
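Once the variables are set, the server reads them through `process.env`. A minimal sketch of that startup logic (illustrative only — the actual handling lives in the repo's entry point, which may use the `dotenv` package to load the `.env` file automatically):

```javascript
// Read configuration from the environment, mirroring the .env example above.
// (Sketch only -- see the repo's entry point for the real implementation.)
const PORT = process.env.PORT || 5000; // fall back to the documented default
const MONGO_URI = process.env.MONGO_URI;

if (!MONGO_URI) {
  // Warn early: every database-backed endpoint needs a connection string.
  console.warn("MONGO_URI is not set; database features will not work.");
}
```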
Once the server is running, you can interact with the API via HTTP requests. The API supports both real-time scraping and retrieving stored data.
To get a list of all scraped books from the database:
```
GET http://localhost:5000/api/scrape/books
```

- Description: Fetch all scraped book details from the database.
- Response: Returns an array of book objects.
```json
[
  {
    "title": "A Light in the Attic",
    "price": "£51.77",
    "stock": "In Stock",
    "rating": "Three",
    "link": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "stockInfo": "In stock (22 available)",
    "imageLink": "https://books.toscrape.com/media/cache/fe/72/fe72f0532301ec28892ae79a629a293c.jpg"
  },
  ...
]
```

To scrape data from a URL in real time:

- Description: Scrapes data from a provided URL and returns the result.
- Request Body:

```json
{ "url": "https://example.com" }
```

- Response: Returns the scraped data from the provided URL.
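As a quick client-side sketch, both endpoints can be called from Node.js (v18+) with the built-in `fetch`. The `POST /api/scrape` path shown here is an assumption — confirm the exact route against the repo's route definitions:

```javascript
const BASE_URL = "http://localhost:5000";

// Fetch all scraped books stored in MongoDB.
async function getBooks() {
  const res = await fetch(`${BASE_URL}/api/scrape/books`);
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  return res.json(); // resolves to an array of book objects
}

// Trigger a real-time scrape of an arbitrary URL.
// NOTE: the "/api/scrape" path is assumed, not confirmed by the README.
async function scrapeUrl(url) {
  const res = await fetch(`${BASE_URL}/api/scrape`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url }),
  });
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  return res.json();
}
```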
We welcome contributions to ScrapeBackend API! If you'd like to help improve this project, follow these steps:
- Fork the repository.
- Create a new branch (`git checkout -b feature/your-feature`).
- Make your changes and commit them (`git commit -am 'Add new feature'`).
- Push to your branch (`git push origin feature/your-feature`).
- Create a pull request.
Please ensure your code is well-tested and follows the project's coding conventions.
This project is licensed under the MIT License - see the LICENSE file for details.
- Scalability: The project is designed to scale easily. You can extend it to support more complex scraping logic, additional endpoints, or other databases.
- Error Handling: Ensure you add proper error handling in your own code when interacting with the API.
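For example, a defensive wrapper around the books endpoint might look like this (a sketch — the fallback behavior on failure is up to your application):

```javascript
// Fetch the book list without letting a network or server error crash the caller.
async function getBooksSafe() {
  try {
    const res = await fetch("http://localhost:5000/api/scrape/books");
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    return await res.json();
  } catch (err) {
    // Server down, network failure, or bad response: log and degrade gracefully.
    console.error("Failed to fetch books:", err.message);
    return []; // fall back to an empty list
  }
}
```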
- Puppeteer: Scraping performance can depend on the complexity of the target website. Puppeteer is used here for headless browser-based scraping, but you may need to tweak it for certain sites.
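As a starting point for such tweaks, a scrape function might look like the sketch below. It assumes the `puppeteer` package is installed; the launch arguments and wait strategy are common adjustments, not the repo's actual configuration:

```javascript
// Sketch of a tweakable Puppeteer scrape. Options shown are illustrative;
// adjust them per target site.
async function scrapeTitle(url) {
  const puppeteer = require("puppeteer"); // loaded lazily inside the function
  const browser = await puppeteer.launch({
    headless: true,
    args: ["--no-sandbox"], // often required in Docker/CI environments
  });
  try {
    const page = await browser.newPage();
    // "networkidle2" waits for the network to settle -- useful on JS-heavy
    // sites where "load" fires before content is rendered.
    await page.goto(url, { waitUntil: "networkidle2", timeout: 30000 });
    return await page.evaluate(() => document.title);
  } finally {
    await browser.close(); // always release the browser process
  }
}
```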