Ingestion - Ragrails

Ingestion is stage one: it pulls raw content and normalizes it into clean documents the rest of the pipeline can use. Three sources:

Method	Source	Needs
`scrape()`	Websites (one URL or a full crawl)	`ragrails[url]` + `setup_url()`
`parse()`	Files: local paths, file URLs, or raw bytes (PDF, DOCX, XLSX, HTML, MD, CSV…)	none
`fetch()`	REST API responses	none

All three return document dicts with id, text, source, and metadata.
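For orientation, a document dict might look roughly like the sketch below. Only the four key names come from the sentence above. The value types, the example values, and the `is_document` helper are assumptions made for illustration and are not part of the Ragrails API.

```python
# Hypothetical illustration of the document shape described above.
# Only the four key names are documented; values and types are assumed.
REQUIRED_KEYS = {"id", "text", "source", "metadata"}

def is_document(doc):
    """Return True if doc is a dict containing all four documented keys."""
    return isinstance(doc, dict) and REQUIRED_KEYS.issubset(doc.keys())

example = {
    "id": "doc-001",                     # assumed: some unique identifier
    "text": "Normalized plain text...",  # the cleaned content
    "source": "files/docs/guide.pdf",    # assumed: where the content came from
    "metadata": {"title": "Guide"},      # assumed: free-form extra fields
}

print(is_document(example))      # True
print(is_document({"id": "x"}))  # False
```

A check like this can be handy before handing documents to later pipeline stages, since all three ingestion methods are described as returning the same shape.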

parse() takes more than file paths. Pass a local path, a file URL (downloaded automatically), or raw bytes, which is ideal for web uploads and object storage where the file never touches disk. See input forms.

SDK

from ragrails import RagRails
rag = RagRails()

docs = rag.parse(folder="files/docs/")
api  = rag.fetch(url="https://api.example.com/products", title="Products")

rag.setup_url()  # one-time browser install
site = rag.scrape("https://example.com", mode="full", max_pages=50)

CLI

ragrails parse --folder files/docs/ --output-dir files/output/docs/
ragrails fetch https://api.example.com/products --title Products --output-dir files/output/api/

ragrails setup-url
ragrails scrape https://example.com --mode full --max-pages 50 --output-dir files/output/web/

REST API

curl -X POST http://127.0.0.1:8000/v1/ingest/docs -H "Content-Type: application/json" \
  -d '{"folder": "files/docs/"}'
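If curl is not available, the same REST request can be built with Python's standard library. This is a sketch, not an official client: the endpoint and payload are copied from the curl call above, the `send` helper is hypothetical, and the response format is not documented here, so it is returned as raw text.

```python
import json
import urllib.request

# Endpoint and payload copied from the curl example above.
URL = "http://127.0.0.1:8000/v1/ingest/docs"
payload = {"folder": "files/docs/"}

request = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

def send(req):
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

# Uncomment once a server is listening on 127.0.0.1:8000:
# print(send(request))
```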

Crawling websites

mode="each" - scrape only the exact URLs you pass.
mode="full" - crawl the whole site from a starting URL.

Always cap max_pages on a full crawl. Sites can be huge, and every page costs a fetch.
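To make the difference between the two modes and the effect of the cap concrete, here is a toy breadth-first crawler over an in-memory link graph. It is not how Ragrails crawls internally. The `LINKS` data and the `crawl` function are hypothetical, and only the `mode` and `max_pages` names mirror the parameters described above.

```python
from collections import deque

# Toy link graph standing in for a website (hypothetical data, no network).
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a1", "/a2"],
    "/b": ["/b1"],
    "/a1": [], "/a2": [], "/b1": [],
}

def crawl(start_urls, mode="each", max_pages=None):
    """Return the pages visited, in order.

    mode="each": visit only the URLs given.
    mode="full": follow links breadth-first from the given URLs.
    max_pages caps the total number of pages fetched.
    """
    start = list(dict.fromkeys(start_urls))  # drop duplicates, keep order
    queue = deque(start)
    seen = set(start)
    visited = []
    while queue:
        if max_pages is not None and len(visited) >= max_pages:
            break  # cap reached: stop before fetching anything else
        url = queue.popleft()
        visited.append(url)  # one "fetch" per visited page
        if mode == "full":
            for link in LINKS.get(url, []):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
    return visited

print(crawl(["/", "/a"], mode="each"))         # ['/', '/a']
print(crawl(["/"], mode="full", max_pages=3))  # ['/', '/a', '/b']
print(len(crawl(["/"], mode="full")))          # 6
```

Without the cap, the full crawl visits every reachable page, so on a real site the number of fetches is bounded only by the size of the site.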

Dead-letter queue: in a large crawl, some pages fail (timeouts, rate limits). Pass dlq=DLQ("file.json") to record failures, then retry only those with scrape(dlq="file.json"). See DLQ.
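The dead-letter pattern itself can be sketched without Ragrails: record each failed URL in a JSON file, then re-run only those URLs. Everything in the block below (`flaky_fetch`, `scrape_with_dlq`, `retry_from_dlq`, and the file layout) is a hypothetical stand-in and does not describe the library's `DLQ` class or its file format.

```python
import json
import os
import tempfile

def flaky_fetch(url, failing):
    """Stand-in for a page fetch: raises for URLs listed in `failing`."""
    if url in failing:
        raise TimeoutError(f"timed out: {url}")
    return f"content of {url}"

def scrape_with_dlq(urls, dlq_path, failing=()):
    """Fetch each URL and write the ones that fail to a JSON file."""
    results, failures = {}, []
    for url in urls:
        try:
            results[url] = flaky_fetch(url, failing)
        except TimeoutError as exc:
            failures.append({"url": url, "error": str(exc)})
    with open(dlq_path, "w", encoding="utf-8") as fh:
        json.dump(failures, fh)
    return results

def retry_from_dlq(dlq_path, failing=()):
    """Re-fetch only the URLs recorded in the dead-letter file."""
    with open(dlq_path, encoding="utf-8") as fh:
        urls = [entry["url"] for entry in json.load(fh)]
    return scrape_with_dlq(urls, dlq_path, failing)

dlq_path = os.path.join(tempfile.mkdtemp(), "dlq.json")
urls = [
    "https://example.com/1",
    "https://example.com/2",
    "https://example.com/3",
]

# First pass: page 2 times out and lands in the dead-letter file.
first = scrape_with_dlq(urls, dlq_path, failing={"https://example.com/2"})
# Second pass: the transient failure has cleared, so only page 2 is fetched.
second = retry_from_dlq(dlq_path)

print(sorted(first))
print(sorted(second))
```

The retry touches only the recorded failures, which is what keeps a large crawl from being repeated in full after a handful of timeouts.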

Reference

Full parameters: SDK ingestion · CLI · REST.
