SDK Ingestion - Ragrails

All ingestors return normalized document dicts with `id`, `text`, `source`, and `metadata` fields.
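A quick sanity check of that shape before indexing can be sketched as follows. The helper name `check_document` is hypothetical and not part of Ragrails, and treating `metadata` as a dict is an assumption; only the four field names come from the description above.

```python
REQUIRED_FIELDS = ("id", "text", "source", "metadata")


def check_document(doc: dict) -> list:
    """Return a list of problems found in one document dict (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in doc]
    # Assumption: metadata is a dict of extra attributes.
    if "metadata" in doc and not isinstance(doc["metadata"], dict):
        problems.append("metadata is not a dict")
    return problems


sample = {"id": "doc-1", "text": "Hello", "source": "https://example.com/docs", "metadata": {}}
print(check_document(sample))           # []
print(check_document({"id": "doc-2"}))  # lists the three missing fields
```

Running a check like this over `result.outputs` is one way to catch malformed documents early, before they reach an index.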

URL ingestion: `scrape()`

pip install "ragrails[url]"

from ragrails import RagRails

rag = RagRails()
rag.setup_url()  # run once per environment

# Single URL
result = rag.scrape("https://example.com/docs")

# Full site crawl
result = rag.scrape("https://example.com", mode="full", max_depth=2, max_pages=50)

# Multiple URLs with per-URL config
result = rag.scrape([
    "https://example.com/docs",
    {"url": "https://example.com/blog", "mode": "full", "max_depth": 1},
])

result.pages    # pages scraped
result.outputs  # list of document dicts
result.errors   # list of error dicts
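As a mental model of the three input shapes shown above (a single URL, a list of URLs, and dicts carrying per-URL config), the sketch below expands them into one config dict per URL. `normalize_targets` is a hypothetical helper, and the rule that per-URL keys override call-level values is an assumption; Ragrails' own handling may differ.

```python
def normalize_targets(url, **defaults):
    """Expand a URL, list of URLs, or list of dicts into one config dict per URL."""
    items = url if isinstance(url, list) else [url]
    targets = []
    for item in items:
        if isinstance(item, str):
            targets.append({**defaults, "url": item})
        elif isinstance(item, dict) and "url" in item:
            # Assumption: per-URL keys win over call-level defaults.
            targets.append({**defaults, **item})
        else:
            raise ValueError(f"unsupported scrape target: {item!r}")
    return targets


targets = normalize_targets(
    [
        "https://example.com/docs",
        {"url": "https://example.com/blog", "mode": "full", "max_depth": 1},
    ],
    mode="each",
)
for target in targets:
    print(target)
```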

Dead-letter queue (DLQ)

Capture failed pages and retry them later:

from ragrails import DLQ

# Collect failures in memory
result = rag.scrape("https://example.com", mode="full", dlq=DLQ())
result.dlq.items  # list of retry input dicts

# Collect and save to file
result = rag.scrape("https://example.com", mode="full", dlq=DLQ("files/dlq/web.json"))

# Retry from a previous result
result = rag.scrape(dlq=result.dlq)

# Retry from a saved file
result = rag.scrape(dlq="files/dlq/web.json")

# Filter before retrying
result.dlq.items = [i for i in result.dlq.items if "docs" in i["url"]]
result = rag.scrape(dlq=result.dlq)
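The substring test above also matches URLs that merely contain "docs" anywhere, such as a blog slug. A stricter filter can compare the URL path instead. This is a minimal sketch: `keep_under_path` is a hypothetical helper, and it assumes only that each DLQ item carries a `url` key, as the example above does.

```python
from urllib.parse import urlparse


def keep_under_path(items, prefix):
    """Keep DLQ items whose URL path starts with the given prefix."""
    return [item for item in items if urlparse(item["url"]).path.startswith(prefix)]


failed = [
    {"url": "https://example.com/docs/install"},
    {"url": "https://example.com/blog/docs-are-hard"},
]
print(keep_under_path(failed, "/docs/"))
# [{'url': 'https://example.com/docs/install'}]
```

The filtered list can then be assigned back to `result.dlq.items` before retrying, as shown above.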

Saving to disk

result = rag.scrape(
    "https://example.com/docs",
    output_format="json",
    output_dest="file",
    output_dir="files/output/web/",
)
result.outputs[0]["output_path"]  # path of saved file

Parameters

Parameter	Default	Description
`url`	required	URL, list of URLs, or list of dicts with `url` key. Omitted in the retry examples above, where `dlq` supplies the inputs
`mode`	`"each"`	`"each"` scrapes exact URLs; `"full"` crawls the entire site
`max_depth`	`3`	Max crawl depth (used with `mode="full"`)
`max_pages`	`200`	Max pages per URL
`verbose`	`False`	Enable crawler logging
`frontmatter`	`False`	Prepend YAML frontmatter to markdown output
`output_format`	`"markdown"`	`"markdown"` or `"json"`
`output_dest`	`"response"`	`"response"` (in-memory) or `"file"` (save to disk)
`output_dir`	`None`	Required when `output_dest="file"`
`dlq`	`None`	`DLQ()`, `DLQ("path")`, `result.dlq`, or a file path string
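The output options interact: `output_dir` only matters, and is required, when `output_dest="file"`. A pre-flight check along these lines can catch a bad combination before a long crawl starts. `check_output_options` is a hypothetical helper, not a Ragrails function; the allowed values are taken from the table above.

```python
def check_output_options(output_format="markdown", output_dest="response", output_dir=None):
    """Raise ValueError if the output options contradict the parameter table."""
    if output_format not in ("markdown", "json"):
        raise ValueError(f"output_format must be 'markdown' or 'json', got {output_format!r}")
    if output_dest not in ("response", "file"):
        raise ValueError(f"output_dest must be 'response' or 'file', got {output_dest!r}")
    if output_dest == "file" and not output_dir:
        raise ValueError("output_dir is required when output_dest='file'")


# Valid: JSON files written to a directory.
check_output_options(output_format="json", output_dest="file", output_dir="files/output/web/")

# Invalid: file destination without a directory.
try:
    check_output_options(output_dest="file")
except ValueError as exc:
    print(exc)  # output_dir is required when output_dest='file'
```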

Document ingestion: `parse()`

Supports PDF, DOCX, PPTX, XLSX, HTML, Markdown, TXT, CSV, and more.

Input forms

`files` accepts four forms. Mix them freely in one list:

Form	Example	Use when
Local path	`"docs/report.pdf"`	The file is on disk
File URL	`"https://example.com/report.pdf"`	The file is remote; Ragrails downloads it
Path dict	`{"path": "report.pdf", "title": "Report"}`	You want to set `title`/`description`
Bytes dict	`{"content": b"...", "filename": "report.pdf"}`	You already hold the file in memory (uploads, S3)

# Local paths
result = rag.parse(files=["files/guide.pdf", "files/pricing.csv"])

# A folder (all supported files inside)
result = rag.parse(folder="files/docs/")

# A direct file URL, downloaded automatically
result = rag.parse(files="https://example.com/whitepaper.pdf")

# Raw bytes, no disk write needed
with open("files/guide.pdf", "rb") as f:
    data = f.read()
result = rag.parse(files=[{"content": data, "filename": "guide.pdf", "title": "Guide"}])

# Mixed in one call
result = rag.parse(files=[
    "files/local.pdf",
    "https://example.com/remote.pdf",
    {"content": data, "filename": "upload.pdf"},
])

result.documents  # documents parsed
result.outputs    # list of document dicts
result.errors     # list of error dicts

Bytes ingestion is the path for web uploads or object storage: pass the file content directly instead of writing a temp file first. `filename` is required so Ragrails can detect the file type.
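For an upload handler, the bytes dict can be built by a small wrapper that rejects names without an extension up front. This is a sketch only: `to_bytes_dict` is a hypothetical helper, and the extension check is a local guard rather than Ragrails' own type detection.

```python
import os
from typing import Optional


def to_bytes_dict(content: bytes, filename: str, title: Optional[str] = None) -> dict:
    """Build a bytes dict for `files`, rejecting names without an extension."""
    if not os.path.splitext(filename)[1]:
        raise ValueError(f"filename needs an extension for type detection: {filename!r}")
    entry = {"content": content, "filename": filename}
    if title is not None:
        entry["title"] = title
    return entry


entry = to_bytes_dict(b"%PDF-1.7 ...", "upload.pdf", title="Uploaded report")
print(sorted(entry))  # ['content', 'filename', 'title']
# The entry can then be passed as rag.parse(files=[entry]).
```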

File URLs must end in a supported extension (e.g. `.pdf`). Ragrails reads the type from the URL path before downloading.
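To see which URLs would pass that rule, the extension can be read from the URL path with the standard library, ignoring any query string. The `SUPPORTED` set below is an assumption: a partial list derived from the formats named in this section, with `.md` assumed for Markdown. Both helper names are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumption: partial list based on the formats named in this section.
SUPPORTED = {".pdf", ".docx", ".pptx", ".xlsx", ".html", ".md", ".txt", ".csv"}


def url_extension(url: str) -> str:
    """Return the lowercased extension of the URL path, ignoring query and fragment."""
    return PurePosixPath(urlparse(url).path).suffix.lower()


def looks_parseable(url: str) -> bool:
    """True if the URL path ends in one of the assumed supported extensions."""
    return url_extension(url) in SUPPORTED


print(looks_parseable("https://example.com/report.PDF?download=1"))  # True
print(looks_parseable("https://example.com/download?id=42"))         # False
```

A URL like the second one serves a file but has no extension in its path. In that case, downloading the content yourself and passing it as a bytes dict with an explicit `filename` avoids the problem.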

Parameters

Parameter	Default	Description
`files`	`None`	A path, file URL, path dict, or bytes dict, or a list of them. Mutually exclusive with `folder`
`folder`	`None`	Directory to parse. Mutually exclusive with `files`
`frontmatter`	`False`	Prepend YAML frontmatter
`output_format`	`"markdown"`	`"markdown"` or `"json"`
`output_dest`	`"response"`	`"response"` or `"file"`
`output_dir`	`None`	Required when `output_dest="file"`

Dict fields

Field	For	Description
`path`	Path dict	Local path or file URL
`content`	Bytes dict	Raw file bytes
`filename`	Bytes dict	File name with extension (required for type detection)
`title`	Both	Optional document title
`description`	Both	Optional description metadata

API ingestion: `fetch()`

# Single endpoint
result = rag.fetch(
    url="https://api.example.com/posts",
    title="Blog posts",
    headers={"Authorization": "Bearer token"},
    pagination={"type": "page", "param": "page", "size_param": "per_page", "size": 100},
    max_pages=10,
)

# Multiple endpoints
result = rag.fetch(apis=[
    {"url": "https://api.example.com/posts", "title": "Posts"},
    {"url": "https://api.example.com/comments", "title": "Comments"},
])

result.documents  # documents fetched
result.outputs    # list of document dicts

Pagination

`fetch()` walks paginated APIs automatically. Omit `pagination` for a single request; otherwise set `type` to one of three strategies:

Strategy	When to use	Example
`page`	API uses page numbers (`?page=2`)	`{"type": "page", "param": "page", "size_param": "per_page", "size": 100}`
`offset`	API uses row offsets (`?offset=100`)	`{"type": "offset", "param": "offset", "size_param": "limit", "size": 100}`
`cursor`	API returns a next-page token	`{"type": "cursor", "param": "cursor", "cursor_path": "meta.next_cursor"}`

Key	Description
`type`	`"page"`, `"offset"`, or `"cursor"`. Omit for a single fetch
`param`	Query-param name carrying the page number, offset, or cursor
`size_param`	Query-param name for page size (optional)
`size`	Page size value (optional)
`cursor_path`	Dot-path into the response JSON to the next cursor (cursor type only)
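A dot-path such as `meta.next_cursor` names a sequence of keys to follow through the response JSON. The sketch below shows one way such a lookup can work. `resolve_dot_path` is a hypothetical helper for illustration; Ragrails' own resolver may behave differently, for example around list indices.

```python
def resolve_dot_path(payload, path):
    """Follow a dot-separated path through nested dicts; return None if a key is missing."""
    current = payload
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


page = {"data": [{"id": 1}], "meta": {"next_cursor": "abc123"}}
print(resolve_dot_path(page, "meta.next_cursor"))          # abc123
print(resolve_dot_path({"data": []}, "meta.next_cursor"))  # None
```

Checking a sample response this way before configuring `cursor_path` confirms that the path points at the token and not at its parent object.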

Always set `max_pages` as a safety cap. It stops pagination even if the API keeps returning a next page.
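The effect of the cap can be illustrated with a small page-number loop run against a stand-in API that never runs out of pages. Everything here is illustrative: `walk_pages` and `endless_api` are hypothetical, and starting at page 1 and stopping on an empty page are assumptions about typical behaviour, not a description of Ragrails internals.

```python
def walk_pages(fetch_page, pagination, max_pages=100):
    """Call fetch_page(params) with page-number params until an empty page or the cap."""
    results = []
    for page_number in range(1, max_pages + 1):  # assumption: pages start at 1
        params = {pagination["param"]: page_number}
        if "size_param" in pagination and "size" in pagination:
            params[pagination["size_param"]] = pagination["size"]
        items = fetch_page(params)
        if not items:  # assumption: an empty page means the end
            break
        results.extend(items)
    return results


def endless_api(params):
    # Stand-in for a misbehaving API that always returns another page.
    return [f"item-{params['page']}"]


config = {"type": "page", "param": "page", "size_param": "per_page", "size": 100}
print(len(walk_pages(endless_api, config, max_pages=10)))  # 10
```

Without the cap, the loop against `endless_api` would never terminate, which is the failure mode `max_pages` guards against.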

Parameters

Parameter	Default	Description
`url`	`None`	Single endpoint. Mutually exclusive with `apis`
`title`	`None`	Document title
`description`	`""`	Description added to the document
`method`	`"GET"`	HTTP method
`headers`	`None`	Request headers dict
`params`	`None`	Query parameters dict
`body`	`None`	Request body dict
`pagination`	`None`	Pagination config dict, e.g. `{"type": "page", "param": "page", "size_param": "per_page", "size": 100}`. Omit for a single request
`max_pages`	`100`	Max paginated requests
`timeout`	`None`	Request timeout in seconds
`apis`	`None`	List of endpoint configs. Mutually exclusive with `url`
`output_format`	`"markdown"`	`"markdown"` or `"json"`
`output_dest`	`"response"`	`"response"` or `"file"`
`output_dir`	`None`	Required when `output_dest="file"`
