GitHub - US-Artificial-Intelligence/ScrapeServ: A self-hosted API that takes a URL and returns a file with browser screenshots.
ScrapeServ: Simple URL to screenshots server
You run the API on your machine, you send it a URL, and you get back the website data as a file plus screenshots of the site. Simple as.
This project was made to support Abbey, an AI platform. Its author is Gordon Kamer.
Some highlights:
- Scrolls through the page and takes screenshots of different sections
- Runs in a docker container
- Browser-based (will run websites' JavaScript)
- Gives you the HTTP status code and headers from the first request
- Automatically handles 302 redirects
- Handles download links properly
- Tasks are processed in a queue with configurable memory allocation
- Blocking API
- Zero state or other complexity
This web scraper is resource intensive but higher quality than many alternatives. Websites are scraped using Playwright, which launches a Firefox browser context for each job.
Setup
You should have Docker and docker compose installed.
- Clone this repo
- Run docker compose up (a docker-compose.yml file is provided for your use)
...and the service will be available at http://localhost:5006. See the Usage section below for details on how to interact with it.
API Keys
You may set an API key using a .env file inside the /scraper folder (same level as app.py).
You can set as many API keys as you'd like; allowed API keys are those that start with SCRAPER_API_KEY. For example, here is a .env file that has three available keys:
SCRAPER_API_KEY=should-be-secret
SCRAPER_API_KEY_OTHER=can-also-be-used
SCRAPER_API_KEY_3=works-too
API keys are sent to the service using the Authorization Bearer scheme.
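A minimal sketch of what the key rules above imply, assuming keys are read from environment variables loaded from the .env file. The function names (allowed_api_keys, bearer_header, is_authorized) are hypothetical and are not taken from the repo:

```python
import os


def allowed_api_keys(environ=os.environ):
    """Collect the values of all variables whose names start with SCRAPER_API_KEY."""
    return {
        value
        for name, value in environ.items()
        if name.startswith("SCRAPER_API_KEY") and value
    }


def bearer_header(api_key):
    """Client side: build the Authorization header for a request."""
    return {"Authorization": f"Bearer {api_key}"}


def is_authorized(header_value, keys):
    """Server side (assumed logic): accept 'Bearer <token>' if the token is an allowed key."""
    scheme, _, token = (header_value or "").partition(" ")
    return scheme.lower() == "bearer" and token.strip() in keys
```

With the three-key .env file above, any of the three values would be accepted under this reading.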
Usage
Look in client for a full reference implementation in Python.
The root path / returns status 200 if online, plus some Gilbert and Sullivan lyrics (you can go there in your browser to see if it's working).
The only other path is /scrape, to which you send a JSON-formatted POST request and (if all goes well) receive a multipart/mixed response. You can request a screenshot image format by sending its MIME type in the Accept header. If no Accept header is provided (or if the Accept header is */* or image/*), the screenshots are returned in JPEG format by default. The following values are supported:
- image/webp
- image/png
- image/jpeg
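The Accept rule above can be sketched as a small function. This is an illustration only: pick_image_format is a hypothetical name, q-values are ignored, and the README does not say what the server does with an unsupported type, so the sketch simply returns None in that case:

```python
SUPPORTED_FORMATS = ("image/webp", "image/png", "image/jpeg")
DEFAULT_FORMAT = "image/jpeg"


def pick_image_format(accept):
    """Return the screenshot MIME type implied by an Accept header value."""
    if not accept:
        return DEFAULT_FORMAT  # no Accept header: default to JPEG
    for item in accept.split(","):
        # Drop parameters such as ";q=0.9" and normalize case.
        mime = item.split(";")[0].strip().lower()
        if mime in SUPPORTED_FORMATS:
            return mime
        if mime in ("*/*", "image/*"):
            return DEFAULT_FORMAT  # wildcards also fall back to JPEG
    return None  # unsupported type; server behavior here is not documented
```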
Every response from the API will be either:
- Status 200: multipart/mixed response where: the first part is of type application/json with information about the request (includes status, headers, and metadata); the second part is the website data (usually text/html); and the remaining parts are up to 5 screenshots. Each part contains a Content-Type header with its MIME type.
- Not status 200: application/json response with an error message under the "error" key
Here's a sample cURL request, which will return some long response if everything's working properly:
curl -X POST "http://localhost:5006/scrape" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://us.ai"}'
Refer to the client for a full reference implementation, which shows you how to call the API and save the files it sends back.
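For orientation, here is a standard-library sketch of calling /scrape and unpacking the parts in the order described above. It is not the repo's client: the function names are hypothetical, and the parser assumes the parts are separated by CRLF line endings.

```python
import json
import urllib.request
from email.message import Message


def split_multipart(body, content_type):
    """Split a multipart body into (part_content_type, payload_bytes) pairs."""
    header = Message()
    header["Content-Type"] = content_type
    boundary = header.get_boundary()
    if boundary is None:
        raise ValueError("no boundary parameter in Content-Type")
    delimiter = b"--" + boundary.encode("ascii")
    parts = []
    # Element 0 is whatever precedes the first delimiter; skip it.
    for chunk in body.split(delimiter)[1:]:
        if chunk.startswith(b"--"):  # closing delimiter "--boundary--"
            break
        raw_headers, _, payload = chunk.partition(b"\r\n\r\n")
        part_type = "application/octet-stream"
        for line in raw_headers.decode("latin-1").splitlines():
            name, _, value = line.partition(":")
            if name.strip().lower() == "content-type":
                part_type = value.strip()
        # Remove only the single CRLF that precedes the next delimiter.
        if payload.endswith(b"\r\n"):
            payload = payload[:-2]
        parts.append((part_type, payload))
    return parts


def unpack_scrape_response(body, content_type):
    """Return (info_dict, (page_type, page_bytes), [(image_type, image_bytes), ...])."""
    parts = split_multipart(body, content_type)
    if len(parts) < 2:
        raise ValueError("expected a JSON part followed by the website data")
    info = json.loads(parts[0][1])
    return info, parts[1], parts[2:]


def scrape(url, api_key=None, base="http://localhost:5006", image_type="image/jpeg"):
    """POST a URL to /scrape; urlopen raises HTTPError on a non-200 status."""
    headers = {"Content-Type": "application/json", "Accept": image_type}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    data = json.dumps({"url": url}).encode("utf-8")
    request = urllib.request.Request(base + "/scrape", data=data, headers=headers, method="POST")
    with urllib.request.urlopen(request) as response:
        return unpack_scrape_response(response.read(), response.headers["Content-Type"])
```

Calling scrape("https://us.ai") against a running instance would return the request information, the page data, and the screenshots, which can then be written to disk.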
Security Considerations
Navigating to untrusted websites is a serious security issue. Risks are somewhat mitigated in the following ways:
- Runs as an isolated container (container isolation)
- Each website is scraped in a new browser context (process isolation)
- Strict memory limits and timeouts for each task
- Checks the URL to make sure that it's not too weird (loopback, local, non http, etc.)
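To give a feel for the kind of URL check listed last, here is a rough sketch. It is an assumption about what such a check might look like, not the repo's actual code; looks_safe is a hypothetical name, and hostnames are not resolved through DNS, so a name that points at an internal address would still pass:

```python
import ipaddress
from urllib.parse import urlsplit


def looks_safe(url):
    """Reject non-HTTP schemes and obviously local or loopback hosts."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # malformed URL, e.g. a broken IPv6 literal
    if parts.scheme not in ("http", "https"):
        return False
    host = parts.hostname
    if not host:
        return False
    if host == "localhost" or host.endswith((".localhost", ".local")):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return True  # a regular hostname; DNS is not checked in this sketch
    return not (ip.is_loopback or ip.is_private or ip.is_link_local or ip.is_unspecified)
```

A check like this is only one layer, which is why the additional precautions that follow still matter.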
You may take additional precautions depending on your needs, like:
- Only giving the API trusted URLs (or otherwise screening URLs)
- Running this API on isolated VMs (hardware isolation)
- Using one API instance per user
- Not making any secret files or keys available inside the container (besides the API key for the scraper itself)
If you'd like to make sure that this API is up to your security standards, please examine the code and open issues! It's not a big repo.
Other Configuration
You can control memory limits and other variables at the top of scraper/worker.py. Here are the defaults:
MEM_LIMIT_MB = 4_000 # 4 GB memory threshold for child scraping process
MAX_SCREENSHOTS = 5
SCREENSHOT_QUALITY = 85
BROWSER_HEIGHT = 2000
BROWSER_WIDTH = 1280
USER_AGENT = "Mozilla/5.0 (compatible; Abbey/1.0; +https://github.com/US-Artificial-Intelligence/scraper)"
Source: GitHub
