Large ingestion jobs rarely succeed completely on the first try. These features make them recoverable.
Retries with a dead-letter queue
Crawl hundreds of pages and a handful will fail from timeouts or rate limits. A dead-letter queue (DLQ) records just the failures so you retry those, not the whole crawl.
from ragrails import RagRails, DLQ
rag = RagRails()
# Collect failures to a file
result = rag.scrape("https://example.com", mode="full", dlq=DLQ("files/dlq/web.json"))
# Retry just the failures
result = rag.scrape(dlq="files/dlq/web.json")
# Or filter before retrying
result.dlq.items = [i for i in result.dlq.items if "docs" in i["url"]]
result = rag.scrape(dlq=result.dlq)
| dlq value | Behaviour |
|---|---|
| DLQ() | Collect retryable failures in memory |
| DLQ("path.json") | Collect and save to a file |
| result.dlq | Retry from a previous result |
| "path.json" | Retry from a saved file |
Only retryable failures (timeouts, transient errors) go to the DLQ. Permanent errors like 404s don’t.
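To make the retryable/permanent split concrete, here is a minimal, self-contained sketch of the idea. It does not use RagRails and is not its implementation: the status-code set, the `crawl` helper, and the simulated responses are all hypothetical, chosen only to illustrate the behaviour described above.

```python
# Illustrative sketch only: not RagRails internals. All names here are hypothetical.
RETRYABLE_STATUSES = {408, 429, 500, 502, 503, 504}  # assumed transient statuses


def is_retryable(status):
    """Return True for failures worth retrying (timeouts, rate limits, server errors)."""
    return status in RETRYABLE_STATUSES


def crawl(urls, fetch_status):
    """Fetch every URL; return (succeeded, dlq_items, permanent_failures)."""
    succeeded, dlq_items, permanent = [], [], []
    for url in urls:
        status = fetch_status(url)
        if status == 200:
            succeeded.append(url)
        elif is_retryable(status):
            # Transient failure: record it so a later run can retry just this URL.
            dlq_items.append({"url": url, "status": status})
        else:
            # Permanent failure (for example a 404): retrying would not help.
            permanent.append({"url": url, "status": status})
    return succeeded, dlq_items, permanent


# Simulated responses for a first pass and for the retry pass.
first_pass = {"/a": 200, "/docs/b": 429, "/c": 404, "/docs/d": 504}
second_pass = {"/docs/b": 200, "/docs/d": 200}

ok, dlq_items, permanent = crawl(list(first_pass), first_pass.get)
# The retry touches only the URLs recorded in the DLQ, not the whole crawl.
retried_ok, still_failing, _ = crawl([i["url"] for i in dlq_items], second_pass.get)
```

In this sketch the second pass requests two URLs instead of four, and the 404 is never retried.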
Pagination
fetch() walks paginated APIs automatically. Pick the strategy your API uses:
| Strategy | When | Example |
|---|---|---|
| page | Page numbers (?page=2) | {"type": "page", "param": "page", "size_param": "per_page", "size": 100} |
| offset | Row offsets (?offset=100) | {"type": "offset", "param": "offset", "size_param": "limit", "size": 100} |
| cursor | Next-page token | {"type": "cursor", "param": "cursor", "cursor_path": "meta.next_cursor"} |
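As a rough illustration of what each strategy means on the wire, the sketch below turns one of the pagination configs above into query parameters for a given request. The helpers `resolve_path` and `request_params` are hypothetical and are not part of RagRails. The sketch also assumes that page numbers start at 1 and that the first cursor request carries no cursor, which may not match every API.

```python
# Illustrative sketch: hypothetical helpers, not part of RagRails.
def resolve_path(data, dotted_path):
    """Follow a dotted path such as "meta.next_cursor" through nested dicts."""
    current = data
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def request_params(config, index, last_response=None):
    """Build query params for the index-th request (0-based), or None to stop."""
    kind = config["type"]
    if kind == "page":
        # Assumes 1-based page numbers: index 0 -> page 1.
        return {config["param"]: index + 1, config["size_param"]: config["size"]}
    if kind == "offset":
        # The offset advances by one page size per request.
        return {config["param"]: index * config["size"], config["size_param"]: config["size"]}
    if kind == "cursor":
        if index == 0:
            return {}  # assumed: the first request carries no cursor
        cursor = resolve_path(last_response or {}, config["cursor_path"])
        # A missing or empty cursor means there is no next page.
        return {config["param"]: cursor} if cursor else None
    raise ValueError(f"unknown pagination type: {kind!r}")


page_cfg = {"type": "page", "param": "page", "size_param": "per_page", "size": 100}
offset_cfg = {"type": "offset", "param": "offset", "size_param": "limit", "size": 100}
cursor_cfg = {"type": "cursor", "param": "cursor", "cursor_path": "meta.next_cursor"}

second_page = request_params(page_cfg, 1)
second_offset = request_params(offset_cfg, 1)
next_cursor = request_params(cursor_cfg, 1, {"meta": {"next_cursor": "abc"}})
```

The cursor case is the only one that depends on the previous response, which is why it needs a cursor_path to locate the token.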
result = rag.fetch(
    url="https://api.example.com/products",
    pagination={"type": "page", "param": "page", "size_param": "per_page", "size": 100},
    max_pages=20,  # always cap as a safety stop
)
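Why cap max_pages? A paginated API that never signals the end (or a cursor that loops) would otherwise keep the job running indefinitely. The following is a minimal sketch of a capped pagination loop against simulated APIs. The `fetch_all` function and both fake APIs are hypothetical, and this is not how RagRails implements fetch().

```python
# Illustrative sketch: a hypothetical pagination loop, not RagRails's fetch().
def fetch_all(get_page, max_pages):
    """Call get_page(1), get_page(2), ... until an empty page or the cap is hit."""
    items = []
    for page in range(1, max_pages + 1):
        batch = get_page(page)
        if not batch:  # an empty page signals the end of the data
            break
        items.extend(batch)
    return items


def endless_api(page):
    """Simulated misbehaving API that never returns an empty page."""
    return [f"item-{page}"]


def finite_api(page):
    """Simulated API with three pages of two items each."""
    return [f"p{page}-{n}" for n in range(2)] if page <= 3 else []


capped = fetch_all(endless_api, max_pages=20)    # stops at the cap
complete = fetch_all(finite_api, max_pages=20)   # stops at the first empty page
```

With the well-behaved API the cap is never reached. With the endless one, the cap is the only thing that ends the loop.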
Reference: SDK ingestion.