Ingestion is stage one of the pipeline: it pulls raw content and normalizes it into clean documents that the later stages can use. There are three sources, each with its own method:
Method   | Source                             | Needs
scrape() | Websites (one URL or a full crawl) | ragrails[url] + setup_url()
parse()  | Files: local paths, file URLs, or raw bytes (PDF, DOCX, XLSX, HTML, MD, CSV…) | none
fetch()  | REST API responses                 | none
All three return document dicts with id, text, source, and metadata.
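As a rough illustration of that shape, the sketch below models a document as a typed dict with the four keys named above. The field types, the `Document` name, and the `is_document` helper are assumptions made for this example; they are not taken from the ragrails API.

```python
from typing import Any, TypedDict


class Document(TypedDict):
    """Assumed shape of one ingested document (the field types are guesses)."""
    id: str
    text: str
    source: str
    metadata: dict[str, Any]


def is_document(obj: object) -> bool:
    """Return True if obj is a dict that carries the four documented keys."""
    required = {"id", "text", "source", "metadata"}
    return isinstance(obj, dict) and required <= obj.keys()


# Hypothetical example values, for illustration only.
example: Document = {
    "id": "doc-1",
    "text": "Hello, world.",
    "source": "files/docs/hello.md",
    "metadata": {"title": "Hello"},
}
```

A check like `is_document` can be handy at the boundary between ingestion and the next stage, where a missing key is cheaper to catch than a failure further down the pipeline.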
parse() takes more than file paths. Pass a local path, a file URL (downloaded automatically), or raw bytes, which is ideal for web uploads and object storage where the file never touches disk. See input forms.
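To make the three input forms concrete, here is a minimal sketch of how a caller might tell them apart before handing a value over. `classify_input` is a hypothetical helper written for this example; it does not describe how parse() is implemented, and treating only http/https strings as URLs is an assumption.

```python
from pathlib import Path
from urllib.parse import urlparse


def classify_input(value):
    """Label a value as 'bytes', 'url', or 'path' (illustrative only)."""
    if isinstance(value, (bytes, bytearray)):
        return "bytes"  # e.g. an upload read into memory, never written to disk
    if isinstance(value, Path):
        return "path"
    if isinstance(value, str):
        # Assumption: only http(s) strings count as file URLs here.
        if urlparse(value).scheme in {"http", "https"}:
            return "url"
        return "path"
    raise TypeError(f"unsupported input type: {type(value).__name__}")
```

For example, `classify_input(b"%PDF-1.7")` gives `"bytes"`, while `classify_input("files/docs/report.pdf")` gives `"path"`.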
from ragrails import RagRails
rag = RagRails()

docs = rag.parse(folder="files/docs/")
api  = rag.fetch(url="https://api.example.com/products", title="Products")

rag.setup_url()  # one-time browser install
site = rag.scrape("https://example.com", mode="full", max_pages=50)

Crawling websites

  • mode="each" - scrape only the exact URLs you pass.
  • mode="full" - crawl the whole site from a starting URL.
Always cap max_pages on a full crawl. Sites can be huge, and every page costs a fetch.
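The effect of a page cap can be sketched with a small breadth-first crawl over an in-memory link graph. This is not the ragrails crawler: `crawl`, `links_of`, and the `SITE` graph are hypothetical stand-ins, and a real crawl would fetch pages over the network instead of reading a dict.

```python
from collections import deque


def crawl(start, get_links, max_pages):
    """Breadth-first crawl from start, visiting at most max_pages pages."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)  # stands in for one page fetch
        for link in get_links(url):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited


# Toy site: each key is a page, each value the links found on it.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/c", "/"],
    "/b": ["/c", "/d"],
    "/c": [],
    "/d": ["/e"],
    "/e": [],
}


def links_of(url):
    """Return the outgoing links of a page in the toy site."""
    return SITE.get(url, [])


print(crawl("/", links_of, max_pages=3))  # ['/', '/a', '/b']
```

With the cap at 3 the crawl stops after three fetches even though six pages are reachable, which is the budget behavior the advice above is about.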
Dead-letter queue: in a large crawl, some pages fail (timeouts, rate limits). Pass dlq=DLQ("file.json") to record failures, then retry only those with scrape(dlq="file.json"). See DLQ.
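The record-then-retry pattern can be sketched independently of the library. The code below is an illustration under stated assumptions: `run_with_dlq` and `retry_from_dlq` are hypothetical helpers, and the JSON layout (a list of `{"url", "error"}` entries) is invented for this example and may not match the file format ragrails writes.

```python
import json
from pathlib import Path


def run_with_dlq(urls, fetch, dlq_path):
    """Fetch each URL; write the failures to a JSON file, return the successes."""
    results, failures = {}, []
    for url in urls:
        try:
            results[url] = fetch(url)
        except Exception as exc:  # timeouts, rate limits, and similar
            failures.append({"url": url, "error": str(exc)})
    # Overwrite the file so it always reflects the latest run's failures.
    Path(dlq_path).write_text(json.dumps(failures, indent=2))
    return results


def retry_from_dlq(fetch, dlq_path):
    """Re-run only the URLs recorded in the dead-letter file."""
    failed = json.loads(Path(dlq_path).read_text())
    return run_with_dlq([entry["url"] for entry in failed], fetch, dlq_path)
```

Because the retry pass rewrites the same file, pages that fail again remain queued for another attempt, and the file ends up as an empty list once everything has succeeded.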

Reference

Full parameters: SDK ingestion · CLI · REST.