All ingestors return normalized document dicts with id, text, source, and metadata fields.
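As a rough picture of that shape, the sketch below models a document dict with a `TypedDict`. The four keys come from the sentence above; the value types, the helper `missing_keys`, and the sample values are assumptions, not the documented Ragrails schema.

```python
from typing import Any, TypedDict


class Document(TypedDict):
    """Assumed shape of a normalized document dict (value types are a guess)."""
    id: str
    text: str
    source: str
    metadata: dict[str, Any]


REQUIRED_KEYS = {"id", "text", "source", "metadata"}


def missing_keys(doc: dict) -> set[str]:
    """Return the normalized keys that a dict is missing."""
    return REQUIRED_KEYS - set(doc)


# Hypothetical values, for illustration only.
example: Document = {
    "id": "doc-1",
    "text": "Hello world",
    "source": "https://example.com/docs",
    "metadata": {},
}
```

A check like `missing_keys(doc)` can be handy in tests that consume `result.outputs` from any of the three ingestors.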

URL ingestion: scrape()

pip install "ragrails[url]"

from ragrails import RagRails

rag = RagRails()
rag.setup_url()  # run once per environment

# Single URL
result = rag.scrape("https://example.com/docs")

# Full site crawl
result = rag.scrape("https://example.com", mode="full", max_depth=2, max_pages=50)

# Multiple URLs with per-URL config
result = rag.scrape([
    "https://example.com/docs",
    {"url": "https://example.com/blog", "mode": "full", "max_depth": 1},
])

result.pages    # pages scraped
result.outputs  # list of document dicts
result.errors   # list of error dicts
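
The mixed list above combines bare URL strings with per-URL dicts. One way to reason about it is to expand every entry into a full config dict before crawling. The helper below is a hypothetical sketch of that expansion, not Ragrails internals: `normalize_targets` is an invented name, and the precedence (table defaults, then call-level arguments, then per-URL dict) is an assumption.

```python
from typing import Union

# Defaults taken from the scrape() parameter table.
DEFAULTS = {"mode": "each", "max_depth": 3, "max_pages": 200}


def normalize_targets(url: Union[str, dict, list], **call_level) -> list[dict]:
    """Expand a URL, a list of URLs, or a list of dicts into config dicts."""
    entries = url if isinstance(url, list) else [url]
    configs = []
    for entry in entries:
        if isinstance(entry, str):
            entry = {"url": entry}
        elif not isinstance(entry, dict) or "url" not in entry:
            raise ValueError(f"expected a URL string or a dict with a 'url' key: {entry!r}")
        # Assumed precedence: table defaults < call-level kwargs < per-URL dict.
        configs.append({**DEFAULTS, **call_level, **entry})
    return configs


configs = normalize_targets([
    "https://example.com/docs",
    {"url": "https://example.com/blog", "mode": "full", "max_depth": 1},
])
```

Under these assumptions the first entry keeps mode="each", while the second crawls with mode="full" and max_depth=1.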

Dead-letter queue (DLQ)

Capture failed pages and retry them later:
from ragrails import DLQ

# Collect failures in memory
result = rag.scrape("https://example.com", mode="full", dlq=DLQ())
result.dlq.items  # list of retry input dicts

# Collect and save to file
result = rag.scrape("https://example.com", mode="full", dlq=DLQ("files/dlq/web.json"))

# Retry from a previous result
result = rag.scrape(dlq=result.dlq)

# Retry from a saved file
result = rag.scrape(dlq="files/dlq/web.json")

# Filter before retrying
result.dlq.items = [i for i in result.dlq.items if "docs" in i["url"]]
result = rag.scrape(dlq=result.dlq)
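
The substring test in the last snippet also matches "docs" in a hostname or query string. If that matters, a stricter filter can compare the URL path instead. This sketch works on plain dicts and assumes only what the snippet above shows, namely that each retry item carries a `url` key; `under_path` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def under_path(items: list[dict], prefix: str) -> list[dict]:
    """Keep retry items whose URL path equals the prefix or sits below it."""
    prefix = "/" + prefix.strip("/")
    if prefix == "/":
        return list(items)  # the root prefix matches every URL
    kept = []
    for item in items:
        path = urlparse(item["url"]).path.rstrip("/")
        if path == prefix or path.startswith(prefix + "/"):
            kept.append(item)
    return kept


# Hypothetical retry items, for illustration only.
items = [
    {"url": "https://example.com/docs/intro"},
    {"url": "https://docs.example.com/blog/post"},
    {"url": "https://example.com/blog?ref=docs"},
]
docs_only = under_path(items, "/docs")  # keeps only the first item
```

The filtered list could then be assigned back to `result.dlq.items` before retrying, as in the snippet above.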

Saving to disk

result = rag.scrape(
    "https://example.com/docs",
    output_format="json",
    output_dest="file",
    output_dir="files/output/web/",
)
result.outputs[0]["output_path"]  # path of saved file

Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| url | required | URL, list of URLs, or list of dicts with url key |
| mode | "each" | "each" scrapes exact URLs; "full" crawls the entire site |
| max_depth | 3 | Max crawl depth (used with mode="full") |
| max_pages | 200 | Max pages per URL |
| verbose | False | Enable crawler logging |
| frontmatter | False | Prepend YAML frontmatter to markdown output |
| output_format | "markdown" | "markdown" or "json" |
| output_dest | "response" | "response" (in-memory) or "file" (save to disk) |
| output_dir | None | Required when output_dest="file" |
| dlq | None | DLQ(), DLQ("path"), result.dlq, or a file path string |
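
To make max_depth and max_pages concrete, here is a sketch of a breadth-first crawl over an in-memory link graph with both limits applied. It is not Ragrails' crawler: the `crawl` function, the `links` mapping, and the convention that the start URL sits at depth 0 are assumptions chosen for illustration.

```python
from collections import deque


def crawl(start: str, links: dict[str, list[str]],
          max_depth: int = 3, max_pages: int = 200) -> list[str]:
    """Visit pages breadth-first, stopping at max_depth hops or max_pages pages."""
    seen = {start}
    queue = deque([(start, 0)])  # (url, depth); the start URL is depth 0
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited


# Hypothetical site: / links to /a and /b, /a links to /a/1, /a/1 links to /a/1/x.
site = {"/": ["/a", "/b"], "/a": ["/a/1"], "/a/1": ["/a/1/x"]}

shallow = crawl("/", site, max_depth=1)   # ["/", "/a", "/b"]
capped = crawl("/", site, max_pages=2)    # ["/", "/a"]
```

Whichever limit is reached first ends the crawl, which is why a low max_pages can cut a crawl short even when max_depth would allow more.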

Document ingestion: parse()

Supports PDF, DOCX, PPTX, XLSX, HTML, Markdown, TXT, CSV, and more.

Input forms

files accepts four forms. Mix them freely in one list:

| Form | Example | Use when |
| --- | --- | --- |
| Local path | "docs/report.pdf" | The file is on disk |
| File URL | "https://example.com/report.pdf" | The file is remote; Ragrails downloads it |
| Path dict | {"path": "report.pdf", "title": "Report"} | You want to set title/description |
| Bytes dict | {"content": b"...", "filename": "report.pdf"} | You already hold the file in memory (uploads, S3) |

# Local paths
result = rag.parse(files=["files/guide.pdf", "files/pricing.csv"])

# A folder (all supported files inside)
result = rag.parse(folder="files/docs/")

# A direct file URL, downloaded automatically
result = rag.parse(files="https://example.com/whitepaper.pdf")

# Raw bytes, no disk write needed
with open("files/guide.pdf", "rb") as f:
    data = f.read()
result = rag.parse(files=[{"content": data, "filename": "guide.pdf", "title": "Guide"}])

# Mixed in one call
result = rag.parse(files=[
    "files/local.pdf",
    "https://example.com/remote.pdf",
    {"content": data, "filename": "upload.pdf"},
])

result.documents  # documents parsed
result.outputs    # list of document dicts
result.errors     # list of error dicts

Use bytes ingestion for web uploads or object storage: pass the file content directly instead of writing a temp file first. filename is required so Ragrails can detect the file type.

File URLs must end in a supported extension (e.g. .pdf). Ragrails reads the type from the URL path before downloading.
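
Both notes come down to reading a file extension from a name. Below is a minimal sketch of that check, assuming detection is purely extension-based as the text suggests. `detect_extension`, `is_supported`, and the `SUPPORTED` set are illustrative and not part of Ragrails; the set lists only the formats named above, with their conventional extensions.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed extensions for the formats named in this section (not exhaustive).
SUPPORTED = {".pdf", ".docx", ".pptx", ".xlsx", ".html", ".md", ".txt", ".csv"}


def detect_extension(name_or_url: str) -> str:
    """Return the lowercase extension of a filename or of a URL's path."""
    parsed = urlparse(name_or_url)
    # For URLs, ignore the query string and fragment and look at the path only.
    path = parsed.path if parsed.scheme in ("http", "https") else name_or_url
    return PurePosixPath(path).suffix.lower()


def is_supported(name_or_url: str) -> bool:
    """True if the extension is in the assumed SUPPORTED set."""
    return detect_extension(name_or_url) in SUPPORTED
```

Under this assumption, a URL such as https://example.com/download?id=42 has no extension in its path, so it would need to be downloaded first and passed as a bytes dict with an explicit filename.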

Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| files | None | A path, file URL, path dict, or bytes dict, or a list of them. Mutually exclusive with folder |
| folder | None | Directory to parse. Mutually exclusive with files |
| frontmatter | False | Prepend YAML frontmatter |
| output_format | "markdown" | "markdown" or "json" |
| output_dest | "response" | "response" or "file" |
| output_dir | None | Required when output_dest="file" |

Dict fields

| Field | For | Description |
| --- | --- | --- |
| path | Path dict | Local path or file URL |
| content | Bytes dict | Raw file bytes |
| filename | Bytes dict | File name with extension (required for type detection) |
| title | Both | Optional document title |
| description | Both | Optional description metadata |
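
The four input forms can be told apart by type and keys alone. The sketch below classifies one entry and checks the required fields from the table. `classify_input` is a hypothetical helper, and the validation Ragrails actually performs may differ.

```python
def classify_input(entry) -> str:
    """Name the input form of one `files` entry, or raise if it fits none."""
    if isinstance(entry, str):
        is_url = entry.startswith(("http://", "https://"))
        return "file_url" if is_url else "local_path"
    if isinstance(entry, dict):
        if "content" in entry:
            if not isinstance(entry["content"], (bytes, bytearray)):
                raise TypeError("'content' must be raw bytes")
            if not entry.get("filename"):
                raise ValueError("bytes dict needs 'filename' for type detection")
            return "bytes_dict"
        if "path" in entry:
            return "path_dict"
    raise ValueError(f"unrecognized input form: {entry!r}")


forms = [classify_input(e) for e in [
    "docs/report.pdf",
    "https://example.com/report.pdf",
    {"path": "report.pdf", "title": "Report"},
    {"content": b"...", "filename": "report.pdf"},
]]
```

Running a check like this before calling parse() surfaces a missing filename early, rather than as an entry in result.errors.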

API ingestion: fetch()

# Single endpoint
result = rag.fetch(
    url="https://api.example.com/posts",
    title="Blog posts",
    headers={"Authorization": "Bearer token"},
    pagination={"type": "page", "param": "page", "size_param": "per_page", "size": 100},
    max_pages=10,
)

# Multiple endpoints
result = rag.fetch(apis=[
    {"url": "https://api.example.com/posts", "title": "Posts"},
    {"url": "https://api.example.com/comments", "title": "Comments"},
])

result.documents  # documents fetched
result.outputs    # list of document dicts

Pagination

fetch() walks paginated APIs automatically. Omit pagination for a single request. Otherwise set type to one of three strategies:

| Strategy | When to use | Example |
| --- | --- | --- |
| page | API uses page numbers (?page=2) | {"type": "page", "param": "page", "size_param": "per_page", "size": 100} |
| offset | API uses row offsets (?offset=100) | {"type": "offset", "param": "offset", "size_param": "limit", "size": 100} |
| cursor | API returns a next-page token | {"type": "cursor", "param": "cursor", "cursor_path": "meta.next_cursor"} |

| Key | Description |
| --- | --- |
| type | "page", "offset", or "cursor". Omit for a single fetch |
| param | Query-param name carrying the page number, offset, or cursor |
| size_param | Query-param name for page size (optional) |
| size | Page size value (optional) |
| cursor_path | Dot-path into the response JSON to the next cursor (cursor type only) |

Always set max_pages as a safety cap. It stops pagination even if the API keeps returning a next page.
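
The three strategies and the max_pages cap can be sketched as one loop. This is an illustration, not Ragrails' implementation: `paginate`, `dig`, and `fake_fetch` are hypothetical names, and the loop assumes that page numbers start at 1, that page and offset responses hold their rows under an "items" key, and that an empty page or a missing cursor ends the walk.

```python
from typing import Callable, Optional


def dig(data, dot_path: str):
    """Follow a dot-path such as "meta.next_cursor"; return None if absent."""
    for key in dot_path.split("."):
        if not isinstance(data, dict) or key not in data:
            return None
        data = data[key]
    return data


def paginate(fetch_page: Callable[[dict], dict],
             pagination: Optional[dict], max_pages: int = 100) -> list[dict]:
    """Call fetch_page(query_params) until the data runs out or max_pages is hit."""
    if not pagination:
        return [fetch_page({})]  # no pagination config: a single request
    kind, param = pagination["type"], pagination["param"]
    size = pagination.get("size")
    # Assumed starting values: page 1, offset 0, no cursor on the first call.
    value = 1 if kind == "page" else 0 if kind == "offset" else None
    pages = []
    for _ in range(max_pages):  # safety cap
        query = {}
        if value is not None:
            query[param] = value
        if size is not None and "size_param" in pagination:
            query[pagination["size_param"]] = size
        response = fetch_page(query)
        if kind == "cursor":
            pages.append(response)
            value = dig(response, pagination["cursor_path"])
            if not value:
                break  # no next cursor: this was the last page
        else:
            items = response.get("items", [])  # assumed response key
            if not items:
                break  # empty page: nothing left to fetch
            pages.append(response)
            value += 1 if kind == "page" else (size or len(items))
    return pages


# Hypothetical cursor API with three pages, for illustration only.
FAKE = {
    None: {"items": [1, 2], "meta": {"next_cursor": "b"}},
    "b": {"items": [3, 4], "meta": {"next_cursor": "c"}},
    "c": {"items": [5], "meta": {"next_cursor": None}},
}


def fake_fetch(query: dict) -> dict:
    return FAKE[query.get("cursor")]


cursor_cfg = {"type": "cursor", "param": "cursor", "cursor_path": "meta.next_cursor"}
all_pages = paginate(fake_fetch, cursor_cfg, max_pages=10)  # three pages
capped = paginate(fake_fetch, cursor_cfg, max_pages=2)      # stops at the cap
```

The capped call returns two pages even though the fake API still advertises a next cursor, which is the situation the max_pages advice guards against.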

Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| url | None | Single endpoint. Mutually exclusive with apis |
| title | None | Document title |
| description | "" | Description added to the document |
| method | "GET" | HTTP method |
| headers | None | Request headers dict |
| params | None | Query parameters dict |
| body | None | Request body dict |
| pagination | None | Pagination config dict, e.g. {"type": "page", "param": "page", "size_param": "per_page", "size": 100} |
| max_pages | 100 | Max paginated requests |
| timeout | None | Request timeout in seconds |
| apis | None | List of endpoint configs. Mutually exclusive with url |
| output_format | "markdown" | "markdown" or "json" |
| output_dest | "response" | "response" or "file" |
| output_dir | None | Required when output_dest="file" |