
SmartScrape

Self-hosted, AI-driven web scraping with structured extraction, change detection, and notifications.

2026 · live · typescript · react · postgres · playwright · ai agents

Problem

Most "AI scraping" tools either ship as opaque SaaS with your data on someone else's servers, or require you to hand-author selectors for every site you want to track. Neither fits a freelance workflow where the same client might ask to monitor a competitor's price list one week and the stock status of an inventory page with no RSS feed the next.

The brief was a tool that takes a URL and a plain-English description, produces a working scrape configuration, runs it on a schedule, detects what actually changed between runs, and sends notifications, all self-hosted and using the operator's own API keys.

Approach

A single Node/TypeScript service with a React SPA, deployed via Docker Compose alongside Postgres. The core loop is deliberately small:

  • Configure. A wizard sends the target page and the user's brief to an LLM (OpenRouter, OpenAI, or Anthropic, at the user's choice), which proposes a JSON extraction schema, a comparison key, and notification rules. The user accepts the proposal, edits it, or starts over manually.
  • Fetch. Cheerio for static HTML where it works; Playwright when it doesn't. The choice is auto-detected per job.
  • Diff. SHA-256 per extracted row, matched across runs by the user-chosen comparison key. The output is an added / removed / changed delta, not a full document diff.
  • Notify. Rule types include any-change, new-items, removed-items, field threshold (price < 500), and field-value transitions. Delivery to email or Telegram.
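
The Diff step can be sketched roughly as follows. This is a minimal illustration, not SmartScrape's actual code: the `Row` and `Delta` types, the function names, and the key-sorted JSON encoding used before hashing are all assumptions made for the example.

```typescript
import { createHash } from "node:crypto";

// A scraped row: field name -> extracted value (shape assumed for illustration).
type Row = Record<string, string | number | boolean | null>;

interface Delta {
  added: Row[];
  removed: Row[];
  changed: { before: Row; after: Row }[];
}

// SHA-256 over a key-sorted JSON encoding, so the order in which the
// extractor emits fields does not affect the digest.
function hashRow(row: Row): string {
  const canonical = JSON.stringify(
    Object.keys(row).sort().map((field) => [field, row[field]]),
  );
  return createHash("sha256").update(canonical).digest("hex");
}

// Index rows by the comparison key; a later duplicate key overwrites an earlier one.
function indexByKey(rows: Row[], key: string): Map<string, Row> {
  const index = new Map<string, Row>();
  for (const row of rows) index.set(String(row[key]), row);
  return index;
}

// Compare two runs and report added / removed / changed rows.
function diffRuns(previous: Row[], current: Row[], key: string): Delta {
  const before = indexByKey(previous, key);
  const after = indexByKey(current, key);
  const delta: Delta = { added: [], removed: [], changed: [] };

  after.forEach((row, id) => {
    const old = before.get(id);
    if (old === undefined) {
      delta.added.push(row);
    } else if (hashRow(old) !== hashRow(row)) {
      delta.changed.push({ before: old, after: row });
    }
  });
  before.forEach((row, id) => {
    if (!after.has(id)) delta.removed.push(row);
  });
  return delta;
}

// Usage: "sku" plays the role of the user-chosen comparison key.
const delta = diffRuns(
  [{ sku: "A1", price: 520 }, { sku: "B2", price: 90 }],
  [{ sku: "A1", price: 480 }, { sku: "C3", price: 30 }],
  "sku",
);
// delta.added   -> [{ sku: "C3", price: 30 }]
// delta.removed -> [{ sku: "B2", price: 90 }]
// delta.changed -> [{ before: { sku: "A1", price: 520 }, after: { sku: "A1", price: 480 } }]
```

A threshold rule such as price < 500 would then only need to inspect the `added` and `changed` entries of the delta rather than the whole page.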

Security work that mattered more than the AI surface: API keys and OAuth tokens encrypted with AES-256-GCM at rest, bcrypt-hashed passwords, JWT with rotating refresh tokens, an SSRF guard on every user-supplied URL, prompt-injection hardening on every extraction call, CORS locked to the configured frontend, Helmet headers, and zero outbound analytics.
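
As one illustration of the items above, encrypting a stored API key with AES-256-GCM might look like the sketch below. It uses Node's built-in `crypto` module; the function names, the `iv || tag || ciphertext` storage layout, and the base64 encoding are assumptions for the example, not a description of how SmartScrape stores secrets.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

const IV_BYTES = 12; // 96-bit nonce, the size generally recommended for GCM
const TAG_BYTES = 16; // default GCM authentication tag length in Node

// Encrypts a secret and returns base64(iv || tag || ciphertext).
function encryptSecret(plaintext: string, key: Buffer): string {
  if (key.length !== 32) throw new Error("AES-256 needs a 32-byte key");
  const iv = randomBytes(IV_BYTES); // fresh random nonce for every encryption
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return Buffer.concat([iv, cipher.getAuthTag(), ciphertext]).toString("base64");
}

// Reverses encryptSecret; throws if the key is wrong or the data was modified.
function decryptSecret(encoded: string, key: Buffer): string {
  const raw = Buffer.from(encoded, "base64");
  const iv = raw.subarray(0, IV_BYTES);
  const tag = raw.subarray(IV_BYTES, IV_BYTES + TAG_BYTES);
  const ciphertext = raw.subarray(IV_BYTES + TAG_BYTES);
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}

// Usage: a real deployment would load the key from configuration instead of
// generating it at startup (hypothetical setup, for illustration only).
const key = randomBytes(32);
const stored = encryptSecret("sk-example", key);
const recovered = decryptSecret(stored, key);
```

Because GCM is authenticated, a modified ciphertext fails at `decipher.final()` instead of decrypting to garbage, which is the property that matters for keys sitting in a database.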

Outcome

A working, deployable tool with a mobile-responsive UI, OAuth-backed Google Sheets export, and a CI pipeline that gates merges. Useful to me as a freelance tool; useful as a portfolio piece because the operationally interesting decisions (comparison keys, SSRF, key encryption, scheduling) are where most demos hand-wave.