SEO Crawl
Install
See the Install guide for the full setup, including Windows PowerShell.
curl -fsSL https://install.skippr.io/install.sh | sh
Installing Skippr means accepting the Skippr EULA.
On-site SEO/AEO observability via bounded BFS crawl: robots/sitemap discovery, static HTML parsing, content-block extraction, optional OpenAI analysis, and link-graph edges. No Playwright — use Site Quality for JS rendering.
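
To make "static HTML parsing" and "link-graph edges" concrete, here is a minimal sketch using only the Python standard library. It is not Skippr's implementation: the names `LinkCollector` and `link_edges` and the `(source, target, is_internal)` edge shape are assumptions for illustration. It shows why JS-rendered links are invisible to this kind of crawl, since only anchors present in the raw HTML are seen.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags in static HTML (no JS execution)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def link_edges(page_url, html):
    """Return hypothetical (source, target, is_internal) edges for one page."""
    parser = LinkCollector()
    parser.feed(html)
    origin = urlsplit(page_url).netloc
    edges = []
    for href in parser.hrefs:
        # Resolve relative links and drop #fragments so targets deduplicate.
        target, _fragment = urldefrag(urljoin(page_url, href))
        if urlsplit(target).scheme not in ("http", "https"):
            continue  # skip mailto:, tel:, javascript:, and similar
        edges.append((page_url, target, urlsplit(target).netloc == origin))
    return edges


sample = (
    '<a href="/pricing#plans">Pricing</a> '
    '<a href="https://other.example/">Ext</a> '
    '<a href="mailto:hi@example.com">Mail</a>'
)
edges = link_edges("https://example.com/docs/", sample)
```

Here `edges` holds one internal edge to `https://example.com/pricing` and one external edge to `https://other.example/`; the `mailto:` link is dropped.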
How it works
- Normalizes the configured site origin and fetches robots.txt.
- Discovers sitemaps (robots directives + common paths) and seeds a BFS queue.
- Crawls up to max_urls pages; emits seven bronze namespaces partitioned on crawl_date.
- Skips OpenAI when content_hash matches the per-URL checkpoint (skip_unchanged_content).
- Discover mode: ~10 pages, no checkpoints, no OpenAI.
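
The bounded BFS described in the steps above can be sketched roughly as follows. This is an illustrative model, not the connector's code: `bounded_bfs` and `get_links` are hypothetical names, and the in-memory `graph` stands in for real HTTP fetches. The `max_urls` and `max_depth` parameters mirror the configuration fields of the same names.

```python
from collections import deque


def bounded_bfs(seeds, get_links, max_urls, max_depth):
    """Breadth-first crawl order, capped by page count and link depth.

    get_links is a hypothetical callable: url -> iterable of outgoing URLs.
    """
    queue = deque((url, 0) for url in dict.fromkeys(seeds))
    seen = {url for url, _ in queue}
    visited = []
    while queue and len(visited) < max_urls:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # do not expand links beyond the depth cap
        for target in get_links(url):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


# Toy link graph standing in for a site; seeds would come from sitemaps.
graph = {
    "/": ["/a", "/b"],
    "/a": ["/c", "/"],
    "/b": ["/c", "/d"],
    "/c": ["/e"],
    "/d": [],
    "/e": [],
}
order = bounded_bfs(["/"], lambda u: graph.get(u, []), max_urls=4, max_depth=2)
```

With `max_urls=4` the crawl stops after `/`, `/a`, `/b`, and `/c`. Even with a larger page budget, `/e` would never be reached at `max_depth=2`, because it sits three links from the seed.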
Configuration
yaml
data_sources:
  example_site:
    SeoCrawl:
      site: "https://example.com"
      max_urls: 500
      max_depth: 8
      openai_enabled: true
      openai_model: "gpt-4.1-mini"
      skip_unchanged_content: true
pipelines:
  seo_daily:
    data_source: data_sources.example_site
    data_sink: data_sinks.athena
    transform:
      batch_time_fields: [crawl_date]
Public skippr.yaml:
yaml
source:
  kind: seo_crawl
  site: "https://example.com"
  max_urls: 500
  openai_enabled: true
| Field | Default | Description |
|---|---|---|
| site | (required) | Site origin / TLD |
| max_urls | 5000 | Max pages per run |
| max_depth | 8 | BFS depth cap |
| respect_robots | true | Honor robots.txt |
| openai_enabled | true | Call OpenAI when OPENAI_API_KEY is set |
| openai_analyze_blocks | true | Per-block AEO scoring |
| skip_unchanged_content | true | Skip OpenAI when the page hash is unchanged |
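
As a rough illustration of what `skip_unchanged_content` implies, the sketch below compares a per-URL content hash against a checkpoint from the previous run. The hash algorithm (SHA-256), the dict-based checkpoint, and the function names are all assumptions; the documentation only states that analysis is skipped when the page hash is unchanged.

```python
import hashlib


def content_hash(html):
    """Stable fingerprint of page content (SHA-256 is an assumption)."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def pages_needing_analysis(pages, checkpoint, skip_unchanged_content=True):
    """Return URLs whose content changed, updating the checkpoint in place.

    pages: dict of url -> html; checkpoint: dict of url -> last seen hash.
    """
    changed = []
    for url, html in pages.items():
        digest = content_hash(html)
        if skip_unchanged_content and checkpoint.get(url) == digest:
            continue  # same content as the previous run: skip the analysis call
        changed.append(url)
        checkpoint[url] = digest
    return changed


checkpoint = {}
first = pages_needing_analysis({"/": "<h1>Hi</h1>", "/a": "<p>A</p>"}, checkpoint)
second = pages_needing_analysis({"/": "<h1>Hi</h1>", "/a": "<p>A v2</p>"}, checkpoint)
```

On the first run both pages are new, so both are returned. On the second run only `/a` is returned, because its content changed while `/` still matches its stored hash.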
Namespaces (seo_crawl.*)
site_run_daily, page_daily, link_edge, robots_txt, sitemap_url, issue, content_block — all replace_partition on crawl_date.
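
One plausible reading of `replace_partition` on `crawl_date` is that each run overwrites that date's partition rather than appending to it, so rerunning a crawl on the same day does not duplicate rows. The sketch below models that reading with a plain dict; the function and the table layout are hypothetical and say nothing about how the sink actually stores data.

```python
from collections import defaultdict


def replace_partition(table, rows, partition_field="crawl_date"):
    """Overwrite every partition present in rows; leave other partitions alone."""
    incoming = defaultdict(list)
    for row in rows:
        incoming[row[partition_field]].append(row)
    for key, partition_rows in incoming.items():
        table[key] = partition_rows  # replace, never append
    return table


# Illustrative data only: one existing partition and one new daily batch.
table = {"2024-01-01": [{"crawl_date": "2024-01-01", "url": "/old"}]}
batch = [
    {"crawl_date": "2024-01-02", "url": "/"},
    {"crawl_date": "2024-01-02", "url": "/a"},
]
replace_partition(table, batch)
replace_partition(table, batch)  # rerun of the same day does not duplicate rows
```

After both calls the `2024-01-02` partition still holds exactly two rows, and the `2024-01-01` partition is untouched.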
CLI
bash
skippr connect source seo-crawl \
  --site "https://example.com" \
  --max-urls 500 \
  --openai-enabled true
Authentication
- OPENAI_API_KEY: optional; required for live block analysis when enabled
- SKIPPR_SEO_CRAWL_FIXTURE_DIR: offline HTML/robots/sitemap fixtures
Recommended destination
Athena with transform.batch_time_fields: [crawl_date].
