🕷️ DiscovAI Crawl API (🚧 Work in Progress 🚧): A powerful web scraping solution for AI tools and vector databases. Extract clean HTML, generate LLM-friendly content, and create embeddings from any URL.

DiscovAI/DiscovAI-crawl

DiscovAI Crawl API 🕷️🔍

One API to scrape everything you need from URLs for your AI tool and vector database.

🚧 Work in Progress 🚧

🌟 Features

For a given URL, the API extracts and generates the following:

  • 🧼 Clean HTML (JavaScript and CSS removed)
  • 📝 LLM-friendly Markdown conversion
  • 🚫 Ad-free, cookie banner-free, and dialog-free content
  • 📸 Website screenshots (auto-saved to AWS S3 or Cloudflare R2)
  • 🤖 LLM-generated SEO-friendly content
  • 🔑 LLM-extracted key information (summary, features, FAQs, etc.)
  • 🧠 Ready-to-use embeddings for vector database integration (auto-saved to the database)
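
To make the "Clean HTML" item concrete, here is a minimal sketch of what removing JavaScript and CSS from a page could look like. It is an illustration only: the function name `stripScriptsAndStyles` is hypothetical, the regex approach is a simplification chosen for brevity, and this README does not say how the project itself performs the cleanup.

```typescript
// Illustrative sketch, NOT the project's implementation.
// Drops <script> elements, <style> elements, and stylesheet <link> tags.
// Regex-based HTML handling is fragile on malformed markup; a DOM parser
// would be the more robust choice in real code.
function stripScriptsAndStyles(html: string): string {
  return html
    // Remove <script ...>...</script>, including inline and external scripts.
    .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, "")
    // Remove <style ...>...</style> blocks.
    .replace(/<style\b[^>]*>[\s\S]*?<\/style\s*>/gi, "")
    // Remove <link rel="stylesheet" ...> tags.
    .replace(/<link\b[^>]*\brel\s*=\s*["']?stylesheet["']?[^>]*>/gi, "");
}

const page =
  '<html><head><style>p{color:red}</style>' +
  '<link rel="stylesheet" href="a.css"><script src="x.js"></script></head>' +
  "<body><p>Hello</p><script>alert(1)</script></body></html>";

console.log(stripScriptsAndStyles(page));
// prints: <html><head></head><body><p>Hello</p></body></html>
```

Inline `style` attributes and `on*` event handlers are left untouched in this sketch; removing them safely needs real attribute parsing.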

🔧 Installation

pnpm i
cd apps/api && pnpm exec playwright install

🚀 Usage

pnpm dev
open http://localhost:3000

📦 API Response Structure

{
  "clean_html": "...",
  "LLM_friendly_markdown": "...",
  "clean_text": "...",
  "screenshot_url": "...",
  "llm_extracts_key_info": {
    "what": "...",
    "summary": "...",
    "features": ["...", "..."],
    "faqs": [{"q": "...", "a": "..."}]
  },
  "llm_summarized_detail": "...",
  "embeddings": [...]
}
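
For TypeScript consumers, the structure above could be described with types along these lines. This is a sketch under assumptions: the field names are copied from the example, but the type names (`CrawlResponse`, `CrawlKeyInfo`, `CrawlFaq`), the helper `isCrawlResponse`, and the value types are inferred from the placeholder values and are not a published schema. The element type of `embeddings` is elided in the example, so it is left as `unknown[]`.

```typescript
// Hypothetical types inferred from the example response; not an official schema.
interface CrawlFaq {
  q: string;
  a: string;
}

interface CrawlKeyInfo {
  what: string;
  summary: string;
  features: string[];
  faqs: CrawlFaq[];
}

interface CrawlResponse {
  clean_html: string;
  LLM_friendly_markdown: string;
  clean_text: string;
  screenshot_url: string;
  llm_extracts_key_info: CrawlKeyInfo;
  llm_summarized_detail: string;
  embeddings: unknown[]; // element shape is not specified in the example
}

// Top-level fields that the example shows as strings.
const STRING_FIELDS = [
  "clean_html",
  "LLM_friendly_markdown",
  "clean_text",
  "screenshot_url",
  "llm_summarized_detail",
] as const;

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Runtime check that parsed JSON has the shape shown in the example.
function isCrawlResponse(value: unknown): value is CrawlResponse {
  if (!isRecord(value)) return false;
  for (const field of STRING_FIELDS) {
    if (typeof value[field] !== "string") return false;
  }
  const info = value["llm_extracts_key_info"];
  if (!isRecord(info)) return false;
  if (typeof info["what"] !== "string" || typeof info["summary"] !== "string") {
    return false;
  }
  const features = info["features"];
  if (!Array.isArray(features) || !features.every((f) => typeof f === "string")) {
    return false;
  }
  const faqs = info["faqs"];
  if (
    !Array.isArray(faqs) ||
    !faqs.every((f) => isRecord(f) && typeof f["q"] === "string" && typeof f["a"] === "string")
  ) {
    return false;
  }
  return Array.isArray(value["embeddings"]);
}
```

A caller would typically run `isCrawlResponse(JSON.parse(body))` before reading fields, so that a change in the still-evolving response format fails loudly instead of producing `undefined` values downstream.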

📚 Documentation

TODO

🀝 Contributing

TODO
