SEO Crawl

Install

See the Install guide for the full setup, including Windows PowerShell.

curl -fsSL https://install.skippr.io/install.sh | sh

Installing Skippr means accepting the Skippr EULA.

On-site SEO/AEO observability via a bounded BFS crawl: robots/sitemap discovery, static HTML parsing, content-block extraction, optional OpenAI analysis, and link-graph edges. The crawler does not use Playwright and parses static HTML only; use Site Quality for pages that need JS rendering.
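As a rough illustration of robots/sitemap discovery, the sketch below parses a robots.txt body with Python's standard `urllib.robotparser` (Python 3.8+ for `site_maps()`). It is not Skippr's implementation: the robots.txt content and the `read_robots` helper are hypothetical, and nothing is fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a real crawl would fetch it from the site origin.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/blog/sitemap.xml
"""


def read_robots(robots_txt: str) -> RobotFileParser:
    """Parse a robots.txt body without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = read_robots(ROBOTS_TXT)
# site_maps() returns None when the file has no Sitemap lines.
sitemaps = parser.site_maps() or []
allowed = parser.can_fetch("*", "https://example.com/blog/post")
blocked = parser.can_fetch("*", "https://example.com/admin/login")
```

Sitemap directives found this way would seed the crawl queue, while `can_fetch` corresponds loosely to what `respect_robots` is described as enforcing.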

How it works

  1. Normalizes the configured site origin and fetches robots.txt.
  2. Discovers sitemaps (robots directives + common paths) and seeds a BFS queue.
  3. Crawls up to max_urls pages; emits seven bronze namespaces partitioned on crawl_date.
  4. Skips OpenAI when content_hash matches the per-URL checkpoint (skip_unchanged_content).
  5. In discover mode, crawls about 10 pages and uses neither checkpoints nor OpenAI.
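The crawl loop described in the steps above can be sketched as a breadth-first traversal capped by `max_urls` and `max_depth`. This is a minimal sketch under stated assumptions, not the connector's code: `LINKS` is a hypothetical in-memory link graph standing in for fetched pages, `normalize_origin` and `bounded_bfs` are illustrative names, and recording edges to off-site targets is an assumption.

```python
from collections import deque
from urllib.parse import urlsplit

# Hypothetical link graph: page URL -> links found in its static HTML.
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/c", "https://example.com/"],
    "https://example.com/b": ["https://example.com/c", "https://other.example.org/x"],
    "https://example.com/c": ["https://example.com/d"],
    "https://example.com/d": [],
}


def normalize_origin(site: str) -> str:
    """Reduce a configured site value to a lowercased scheme://host."""
    parts = urlsplit(site if "://" in site else "https://" + site)
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}"


def bounded_bfs(seeds, max_urls, max_depth, origin):
    """Visit pages breadth-first, stopping at max_urls pages or max_depth hops."""
    queue = deque((url, 0) for url in seeds)
    seen = set(seeds)
    crawled, edges = [], []
    while queue and len(crawled) < max_urls:
        url, depth = queue.popleft()
        crawled.append(url)
        for target in LINKS.get(url, []):
            # Assumption: every link becomes an edge, even if it is not followed.
            edges.append((url, target))
            same_site = normalize_origin(target) == origin
            if same_site and target not in seen and depth + 1 <= max_depth:
                seen.add(target)
                queue.append((target, depth + 1))
    return crawled, edges


origin = normalize_origin("https://Example.com/some/path")
pages, edges = bounded_bfs([origin + "/"], max_urls=3, max_depth=8, origin=origin)
```

With `max_urls=3` the traversal stops after the home page and its two direct children, even though deeper pages are still queued. Lowering `max_depth` instead prevents those deeper pages from being queued at all.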

Configuration

yaml
data_sources:
  example_site:
    SeoCrawl:
      site: "https://example.com"
      max_urls: 500
      max_depth: 8
      openai_enabled: true
      openai_model: "gpt-4.1-mini"
      skip_unchanged_content: true

pipelines:
  seo_daily:
    data_source: data_sources.example_site
    data_sink: data_sinks.athena
    transform:
      batch_time_fields: [crawl_date]

Public skippr.yaml:

yaml
source:
  kind: seo_crawl
  site: "https://example.com"
  max_urls: 500
  openai_enabled: true

| Field | Default | Description |
| --- | --- | --- |
| site | (required) | Site origin / TLD |
| max_urls | 5000 | Max pages per run |
| max_depth | 8 | BFS depth cap |
| respect_robots | true | Honor robots.txt |
| openai_enabled | true | Call OpenAI when OPENAI_API_KEY set |
| openai_analyze_blocks | true | Per-block AEO scoring |
| skip_unchanged_content | true | Skip OpenAI when page hash unchanged |
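How `openai_enabled`, `OPENAI_API_KEY`, and `skip_unchanged_content` might interact can be sketched as a single gate in front of the analysis call. This is an assumption-laden illustration rather than the connector's logic: SHA-256 is a stand-in for whatever hash the crawler actually uses, the checkpoint is a plain dict, and `content_hash` and `should_call_openai` are hypothetical names.

```python
import hashlib


def content_hash(html: str) -> str:
    """Stable digest of page content (SHA-256 is an assumption)."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def should_call_openai(url, html, checkpoints, env,
                       openai_enabled=True, skip_unchanged_content=True):
    """Return True when analysis should run, updating the per-URL checkpoint."""
    if not openai_enabled or not env.get("OPENAI_API_KEY"):
        return False
    digest = content_hash(html)
    unchanged = checkpoints.get(url) == digest
    checkpoints[url] = digest
    return not (skip_unchanged_content and unchanged)


checkpoints = {}
env = {"OPENAI_API_KEY": "sk-placeholder"}  # placeholder, not a real key
url = "https://example.com/"
first = should_call_openai(url, "<h1>Hi</h1>", checkpoints, env)      # new page
second = should_call_openai(url, "<h1>Hi</h1>", checkpoints, env)     # same hash
third = should_call_openai(url, "<h1>Hello</h1>", checkpoints, env)   # content changed
no_key = should_call_openai(url, "<h1>New</h1>", checkpoints, {})     # key missing
```

In this sketch only the first and third calls would reach OpenAI; the unchanged page and the run without a key are both skipped.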

Namespaces (seo_crawl.*)

site_run_daily, page_daily, link_edge, robots_txt, sitemap_url, issue, content_block — all replace_partition on crawl_date.
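One plausible reading of `replace_partition` on `crawl_date` is that a run overwrites all rows for the dates it produced and leaves other dates alone, so re-running a day does not duplicate rows. The sketch below models that reading with a dict keyed by partition value; the function, row shapes, and dates are illustrative assumptions, not Skippr's sink behavior.

```python
from collections import defaultdict


def replace_partition(table, rows, partition_field="crawl_date"):
    """Overwrite every partition present in rows; leave other partitions untouched."""
    incoming = defaultdict(list)
    for row in rows:
        incoming[row[partition_field]].append(row)
    for key, partition_rows in incoming.items():
        table[key] = partition_rows  # replaces, never appends
    return table


table = {}
replace_partition(table, [
    {"crawl_date": "2024-05-01", "url": "/a"},
    {"crawl_date": "2024-05-01", "url": "/b"},
])
replace_partition(table, [{"crawl_date": "2024-05-02", "url": "/a"}])
# Re-running the first date swaps its rows instead of adding to them.
replace_partition(table, [{"crawl_date": "2024-05-01", "url": "/a"}])
```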

CLI

bash
skippr connect source seo-crawl \
  --site "https://example.com" \
  --max-urls 500 \
  --openai-enabled true

Authentication

  • OPENAI_API_KEY — optional; required for live block analysis when enabled
  • SKIPPR_SEO_CRAWL_FIXTURE_DIR — offline HTML/robots/sitemap fixtures

See Environment variables.

For Athena sinks, set transform.batch_time_fields: [crawl_date] in the pipeline, as in the Configuration example.