Resilient Ingestion

Large ingestion jobs rarely succeed completely on the first try. These features make them recoverable.

Retries with a dead-letter queue

Crawl hundreds of pages and a handful will fail from timeouts or rate limits. A dead-letter queue (DLQ) records just the failures so you retry those, not the whole crawl.

from ragrails import RagRails, DLQ

rag = RagRails()

# Collect failures to a file
result = rag.scrape("https://example.com", mode="full", dlq=DLQ("files/dlq/web.json"))

# Retry just the failures
result = rag.scrape(dlq="files/dlq/web.json")

# Or filter before retrying
result.dlq.items = [i for i in result.dlq.items if "docs" in i["url"]]
result = rag.scrape(dlq=result.dlq)

`dlq` value	Behaviour
`DLQ()`	Collect retryable failures in memory
`DLQ("path.json")`	Collect and save to a file
`result.dlq`	Retry from a previous result
`"path.json"`	Retry from a saved file

Only retryable failures (timeouts, transient errors) go to the DLQ. Permanent errors like 404s don’t.
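To illustrate the idea behind a dead-letter queue, here is a minimal sketch in plain Python. It does not use or describe the ragrails internals. The names `TransientError`, `PermanentError`, `crawl`, and `fake_fetch` are hypothetical, and the fetcher is simulated so the example runs offline. It shows retryable failures being queued, permanent ones being dropped, and a second pass that touches only the queued URLs.

```python
class TransientError(Exception):
    """Hypothetical retryable failure, e.g. a timeout or rate limit."""


class PermanentError(Exception):
    """Hypothetical non-retryable failure, e.g. a 404."""


def crawl(urls, fetch_page, dead_letters):
    """Fetch each URL and append retryable failures to dead_letters."""
    pages = {}
    for url in urls:
        try:
            pages[url] = fetch_page(url)
        except TransientError as exc:
            # Retryable: remember it so a later pass can try again.
            dead_letters.append({"url": url, "error": str(exc)})
        except PermanentError:
            # Permanent: retrying would not help, so it is not queued.
            pass
    return pages


# Simulated fetcher (an assumption for this sketch):
# /b times out on its first attempt only, /c is always missing.
attempts = {}


def fake_fetch(url):
    attempts[url] = attempts.get(url, 0) + 1
    if url.endswith("/c"):
        raise PermanentError("404")
    if url.endswith("/b") and attempts[url] == 1:
        raise TransientError("timeout")
    return f"content of {url}"


urls = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",
]

# First pass: crawl everything and collect retryable failures.
dlq = []
pages = crawl(urls, fake_fetch, dlq)

# Second pass: retry only what landed in the queue.
retry_dlq = []
pages.update(crawl([item["url"] for item in dlq], fake_fetch, retry_dlq))
```

After the second pass, `/a` has been fetched exactly once, `/b` has been recovered, and `/c` was never queued because retrying a 404 would not change the outcome.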

API pagination

`fetch()` walks paginated APIs automatically. Pick the strategy your API uses:

Strategy	When	Example
`page`	Page numbers (`?page=2`)	`{"type": "page", "param": "page", "size_param": "per_page", "size": 100}`
`offset`	Row offsets (`?offset=100`)	`{"type": "offset", "param": "offset", "size_param": "limit", "size": 100}`
`cursor`	Next-page token	`{"type": "cursor", "param": "cursor", "cursor_path": "meta.next_cursor"}`

result = rag.fetch(
    url="https://api.example.com/products",
    pagination={"type": "page", "param": "page", "size_param": "per_page", "size": 100},
    max_pages=20,   # always cap as a safety stop
)
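For intuition about the `cursor` strategy and the `max_pages` cap, the following is a rough sketch of a cursor-following loop in plain Python. It is not the library's implementation. The helpers `get_path` and `fetch_all`, the simulated `fake_api`, and the `"data"` key holding each page's records are all assumptions made for illustration. Only the dotted `meta.next_cursor` path comes from the table above.

```python
def get_path(data, dotted_path):
    """Follow a dotted path such as "meta.next_cursor" through nested dicts."""
    for key in dotted_path.split("."):
        if not isinstance(data, dict):
            return None
        data = data.get(key)
    return data


def fetch_all(get_page, cursor_path, max_pages):
    """Collect records page by page until the cursor runs out or the cap hits."""
    items = []
    cursor = None  # no cursor on the first request
    for _ in range(max_pages):
        response = get_page(cursor)
        items.extend(response["data"])  # "data" is an assumed response key
        cursor = get_path(response, cursor_path)
        if not cursor:
            break  # the API reported no further page
    return items


# Simulated three-page API, keyed by the cursor that requests each page.
FAKE_PAGES = {
    None: {"data": [1, 2], "meta": {"next_cursor": "p2"}},
    "p2": {"data": [3, 4], "meta": {"next_cursor": "p3"}},
    "p3": {"data": [5], "meta": {"next_cursor": None}},
}


def fake_api(cursor):
    return FAKE_PAGES[cursor]


all_items = fetch_all(fake_api, "meta.next_cursor", max_pages=20)
capped_items = fetch_all(fake_api, "meta.next_cursor", max_pages=2)
```

With a generous cap the loop stops when the cursor comes back empty. With `max_pages=2` it stops after two pages even though a third exists, which is the safety stop the cap is meant to provide if an API keeps returning a cursor indefinitely.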

Reference: SDK ingestion.
