If your CMS-generated sitemap is bloated, missing pages, or includes noindex URLs, Screaming Frog can produce a clean replacement in about 20 minutes. For sites that need precision, it does the job better than any sitemap plugin.
Who this is for
SEOs and developers maintaining mid-to-large sites where the CMS sitemap is unreliable (incomplete, includes parameters, doesn't update). If your sitemap and GSC's indexed URL count diverge significantly, this is the fix.
What you'll need
Step 1
Run the standard crawl. Make sure indexability extraction is on so SF knows which URLs to include in the sitemap.
Sitemap generation requires a completed crawl. Use the first-crawl setup guide if you haven't run one yet.
Confirm Configuration → Spider → Crawl includes 'Crawl Linked XML Sitemaps' (catches URLs from existing sitemap) and 'Crawl Canonical' (catches canonicalized URLs).
Confirm Configuration → Spider → Extraction includes 'Indexability.' This is what tells SF whether each URL is meant to be in the sitemap.
Don't generate the sitemap from a partial crawl. Wait for completion and validate URL count against expected before proceeding.
Step 2
Top menu → Sitemaps → XML Sitemap. This opens the sitemap configuration dialog.
Top menu → Sitemaps → XML Sitemap. The dialog has multiple tabs: Pages, Last Modified, Priority, Change Frequency, Images, Hreflang.
By default, Screaming Frog only includes Indexable URLs returning 200 status. This is what you want — never include noindex, 4xx, 5xx, or canonicalized URLs in a sitemap.
Each tab is a filter or value-assignment step. The Pages tab is the most important — it controls which URLs are in the sitemap at all.
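The default inclusion rule described in this step amounts to a simple predicate. A minimal sketch, assuming crawl rows are available as dictionaries with the hypothetical keys `url`, `status_code`, and `indexability` (illustrative names, not Screaming Frog's export schema):

```python
from typing import Dict, Iterable, List

def sitemap_candidates(rows: Iterable[Dict]) -> List[str]:
    """Keep only URLs that returned 200 and are marked Indexable."""
    return [
        row["url"]
        for row in rows
        if row["status_code"] == 200 and row["indexability"] == "Indexable"
    ]

# Hypothetical crawl rows; the key names are assumptions for this sketch.
sample_rows = [
    {"url": "https://example.com/", "status_code": 200, "indexability": "Indexable"},
    {"url": "https://example.com/old", "status_code": 404, "indexability": "Non-Indexable"},
    {"url": "https://example.com/private", "status_code": 200, "indexability": "Non-Indexable"},
]
print(sitemap_candidates(sample_rows))
```

Only the first row passes: the 404 fails the status check and the noindexed page fails the indexability check.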
Step 3
Check "Include Indexable Pages." Uncheck "Include Non-Indexable Pages," "Paginated URLs," and "PDF" unless you specifically need them.
Pages tab → check 'Include Indexable Pages.' This pulls in URLs that returned 200 and are not noindexed or canonicalized.
Uncheck 'Include Non-Indexable Pages.' Including noindex URLs in a sitemap actively confuses Google ('you told me to index this in the sitemap but said noindex in the head') and can suppress indexation.
Uncheck 'Paginated URLs' unless your pagination is canonical-to-self (most isn't — most pagination canonicalizes to page 1).
Uncheck 'PDF' unless your PDFs are SEO-critical (legal docs, whitepapers you actively want indexed). Most sites should not have PDFs in the sitemap.
Result: a clean URL list of indexable HTML pages. Cross-check the count against your expected page count — should be within 10%.
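The 10% cross-check can be made explicit so it is not left to eyeballing. A small sketch; the function name and the example counts are hypothetical:

```python
def within_tolerance(actual: int, expected: int, tolerance: float = 0.10) -> bool:
    """True if the actual URL count is within `tolerance` of the expected count."""
    if expected <= 0:
        raise ValueError("expected must be a positive page count")
    return abs(actual - expected) / expected <= tolerance

# Example: 3,700 URLs in the sitemap against roughly 4,000 expected pages.
print(within_tolerance(3700, 4000))
print(within_tolerance(3000, 4000))
```

A deviation of 300 out of 4,000 (7.5%) passes; a deviation of 1,000 (25%) does not and is worth investigating before uploading.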
Step 4
Last Modified: use crawl data. Priority + Change Frequency: leave at default OR set per-template values that match reality.
Last Modified tab: choose 'Use Server Response.' SF will pull the Last-Modified header from each URL's response, giving accurate timestamps. Falls back to crawl date if the header isn't present.
Priority tab: Google has publicly stated they largely ignore the priority attribute. Default of 0.5 for all URLs is fine. Don't waste cycles tuning this.
Change Frequency tab: also largely ignored by Google, but worth setting honestly. 'Daily' for blog index pages, 'Weekly' for product pages, 'Yearly' for legal/about pages. Don't lie ('hourly' on a page that updates monthly is noise).
Images tab: 'Include Images in Sitemap' if you want image sitemap data merged. Useful for ecom sites with strong image SEO. Skip for blogs that aren't image-heavy.
Hreflang tab: enable if your site is multi-locale. SF will include hreflang entries per URL in the sitemap, matching what's in the page <head>.
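To illustrate the Last Modified behaviour described above (prefer the response header, fall back to the crawl date), here is a rough sketch of how a `<lastmod>` value could be derived. It is not Screaming Frog's implementation; the function name and inputs are assumptions:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

def lastmod_value(last_modified_header: Optional[str], crawl_time: datetime) -> str:
    """Return a W3C datetime string for <lastmod>, preferring the HTTP header."""
    moment = crawl_time
    if last_modified_header:
        try:
            moment = parsedate_to_datetime(last_modified_header)
        except (TypeError, ValueError):
            moment = crawl_time  # unparseable header: fall back to crawl time
    if moment.tzinfo is None:
        moment = moment.replace(tzinfo=timezone.utc)  # treat naive values as UTC
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S+00:00")

crawled_at = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
print(lastmod_value("Wed, 21 Oct 2015 07:28:00 GMT", crawled_at))
print(lastmod_value(None, crawled_at))
```

The first call uses the header's timestamp; the second has no header and therefore reports the crawl time.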
Step 5
Hit Generate → choose output path. SF creates sitemap.xml (or sitemap_index.xml + multiple files for sites over 50K URLs).
Click 'Next' through the tabs, then 'Generate' on the final screen. Choose where to save the file locally.
Sites under 50K URLs: SF generates a single sitemap.xml.
Sites over 50K URLs: SF auto-splits into multiple sitemaps (sitemap1.xml, sitemap2.xml, etc.) and generates a sitemap_index.xml that references all of them. This is the standard for large sites.
Open the file in a text editor and spot-check 5 URLs to confirm they look right (proper https://, correct domain, no trailing query strings or fragments).
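The manual spot-check can also be run across the whole file. A minimal sketch using only the standard library; `expected_host` and the sample XML are hypothetical inputs:

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple
from urllib.parse import urlsplit

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def suspicious_urls(sitemap_xml: str, expected_host: str) -> List[Tuple[str, str]]:
    """Return (url, reason) pairs for entries that fail the spot-check rules."""
    problems = []
    root = ET.fromstring(sitemap_xml)
    for loc in root.findall("sm:url/sm:loc", NS):
        url = (loc.text or "").strip()
        parts = urlsplit(url)
        if parts.scheme != "https":
            problems.append((url, "not https"))
        elif parts.netloc != expected_host:
            problems.append((url, "wrong host"))
        elif parts.query or parts.fragment:
            problems.append((url, "query string or fragment"))
    return problems

sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>http://example.com/about</loc></url>
  <url><loc>https://example.com/shop?color=red</loc></url>
</urlset>"""
for url, reason in suspicious_urls(sample, "example.com"):
    print(url, "->", reason)
```

An empty result means every entry uses https, sits on the expected host, and carries no query string or fragment.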
Step 6
FTP/SFTP/CMS upload to /sitemap.xml. Add the URL to robots.txt. Submit in GSC → Sitemaps → Add a new sitemap.
Upload sitemap.xml (and any sub-sitemaps) to your site root: https://example.com/sitemap.xml.
If you have a sitemap_index.xml, that's the file you submit to GSC — Google will follow it to discover all sub-sitemaps.
Add `Sitemap: https://example.com/sitemap.xml` to your robots.txt. This is a discoverability signal — Google checks robots.txt for sitemap references.
Open GSC → Sitemaps → 'Add a new sitemap.' Enter the sitemap URL and submit. GSC will fetch and process it within 24-48 hours.
Monitor GSC → Sitemaps for the status: 'Success' with submitted vs. indexed URL counts. The indexed count should approach the submitted count within 4-8 weeks for healthy sites.
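After uploading, it is worth confirming that robots.txt really advertises the sitemap. A small sketch that extracts `Sitemap:` directives from robots.txt text; fetching the file is left out, and the sample content is hypothetical:

```python
from typing import List

def sitemap_directives(robots_txt: str) -> List[str]:
    """Return the URLs listed in 'Sitemap:' lines, ignoring comments."""
    urls = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls

robots = """User-agent: *
Disallow: /cart
# Sitemap: https://example.com/old-sitemap.xml
Sitemap: https://example.com/sitemap.xml
"""
print(sitemap_directives(robots))
```

The commented-out line is skipped, so only the live sitemap URL is returned.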
Step 7
Configure Screaming Frog CLI + cron / Task Scheduler to regenerate the sitemap weekly. Or use a CMS plugin that mirrors SF's filtering.
Manual sitemap regeneration is fine for one-off audits but breaks within weeks as URLs change. Automate it.
Screaming Frog CLI (paid tier): write a script that runs `ScreamingFrogSEOSpiderCli.exe --crawl https://example.com --headless --output-folder C:\sitemaps --create-sitemap` on Windows. On Linux the launcher is `screamingfrogseospider` and takes the same flags with a path such as `/sitemaps`. Schedule it via cron (Mac/Linux) or Task Scheduler (Windows) to run nightly or weekly.
CMS plugin alternative: WordPress Yoast or RankMath generate sitemaps automatically. Configure them to match SF's filtering (no noindex, no parameters, no PDFs). The CMS plugin handles updates automatically.
For dynamic sites with frequent URL changes (ecom, news), framework-level sitemap generation is best: Next.js has built-in sitemap support, and Nuxt and others offer sitemap modules, so the sitemap updates on every deploy.
Whichever path: monitor the sitemap weekly. A broken automation that silently stops producing sitemaps will tank indexation within 30 days.
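The scheduling and monitoring advice in this step can be sketched as two small helpers: one assembles the CLI call using only the flags quoted above, the other flags a sitemap that has stopped being regenerated. The binary name, paths, and the 8-day threshold are assumptions to adapt:

```python
import time
from pathlib import Path
from typing import List, Optional

def build_crawl_command(binary: str, site: str, output_dir: str) -> List[str]:
    """Assemble the CLI call using only the flags quoted in this step."""
    return [binary, "--crawl", site, "--headless",
            "--output-folder", output_dir, "--create-sitemap"]

def is_stale(sitemap_path: Path, max_age_days: float = 8.0,
             now: Optional[float] = None) -> bool:
    """True if the sitemap file is missing or older than max_age_days."""
    if not sitemap_path.exists():
        return True
    current = time.time() if now is None else now
    age_days = (current - sitemap_path.stat().st_mtime) / 86_400
    return age_days > max_age_days

# A scheduled job might call (not executed here):
#   subprocess.run(build_crawl_command(binary, site, output_dir), check=True)
# and a separate weekly check could alert when is_stale(...) returns True.
print(build_crawl_command("screamingfrogseospider", "https://example.com", "/sitemaps"))
```

Keeping the staleness check separate from the crawl job means a silently failing cron entry still gets noticed.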
Common mistakes
Including noindex URLs in the sitemap
What goes wrong: Google sees 'index this' (sitemap) and 'don't index this' (meta tag) on the same URL. It downgrades trust in the sitemap and may deprioritize indexation of your new pages. Lost indexation velocity can cost $5K-30K/month for content-driven businesses.
How to avoid: Always uncheck 'Include Non-Indexable Pages' in the Pages tab. The sitemap should only contain URLs you actively want indexed.
Submitting a sitemap from a crawl that included parameter URLs
What goes wrong: Your sitemap balloons from 4K to 40K URLs because faceted nav got crawled. Google sees a low signal-to-noise ratio and slows crawling of new content. Index coverage actually drops over 60-90 days.
How to avoid: Before generating, add parameter-removal rules in Configuration → URL Rewriting (or exclude the patterns in Configuration → Exclude). Re-crawl with the cleaner scope, then generate the sitemap.
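A quick way to spot this problem before submitting is to measure how many URLs in the list carry a query string. A minimal sketch; the sample URLs are hypothetical:

```python
from typing import List
from urllib.parse import urlsplit, urlunsplit

def strip_query(url: str) -> str:
    """Return the URL without its query string or fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

def parameter_share(urls: List[str]) -> float:
    """Fraction of URLs that carry a query string."""
    if not urls:
        return 0.0
    return sum(1 for url in urls if urlsplit(url).query) / len(urls)

urls = [
    "https://example.com/shoes",
    "https://example.com/shoes?color=red",
    "https://example.com/shoes?color=red&size=9",
    "https://example.com/shoes?sort=price",
]
print(parameter_share(urls))
print(sorted({strip_query(url) for url in urls}))
```

Here three of four URLs are parameter variants of a single page, which is the bloat pattern described above.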
Setting Priority and Change Frequency dishonestly
What goes wrong: You set every page to Priority 1.0 and Change Frequency 'hourly' thinking it'll boost rankings. Google has publicly stated that it ignores both fields, so inflated values cannot help and only make the sitemap look careless. Net effect: nothing positive, slight reputational risk.
How to avoid: Default Priority 0.5, honest Change Frequency per template. Don't try to game these fields — they're advisory at best, ignored at worst.
Not submitting the sitemap to GSC
What goes wrong: You generate and upload the sitemap but never tell Google. Discovery relies on the robots.txt reference, which is slower. New URLs take 2-4 weeks longer to index than they should.
How to avoid: GSC → Sitemaps → Add a new sitemap. One-time action that compounds across every new URL forever.
Manual sitemap generation with no automation
What goes wrong: Your sitemap is current the day you generate it and stale within a week. Three months later, GSC says '850 submitted, 612 indexed'. Part of that 238-URL gap is URLs you've since removed or redirected, and the pages you've added since the last regen aren't in the sitemap at all. Lost indexation: dozens of pages per month.
How to avoid: Automate via SF CLI + cron OR via CMS plugin OR via framework-level sitemap generation. Whichever path, monitor weekly.
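Staleness can be measured by diffing the live sitemap against a fresh crawl. A small sketch, assuming both are already available as plain URL lists (the function name and sample data are hypothetical):

```python
from typing import Dict, Iterable, List

def sitemap_drift(sitemap_urls: Iterable[str], crawl_urls: Iterable[str]) -> Dict[str, List[str]]:
    """Compare sitemap URLs with indexable URLs from a fresh crawl."""
    sitemap_set, crawl_set = set(sitemap_urls), set(crawl_urls)
    return {
        "missing_from_sitemap": sorted(crawl_set - sitemap_set),  # new pages never submitted
        "stale_in_sitemap": sorted(sitemap_set - crawl_set),      # removed or no longer indexable
    }

drift = sitemap_drift(
    ["https://example.com/", "https://example.com/old-post"],
    ["https://example.com/", "https://example.com/new-post"],
)
print(drift)
```

Either list growing week over week is a sign the regeneration job has stopped or is filtering incorrectly.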
Including 50K+ URLs in one sitemap file
What goes wrong: The sitemap protocol caps each file at 50,000 URLs or 50 MB uncompressed. Beyond that, Google may reject the file or read only part of it, so a large share of your URLs silently miss discovery.
How to avoid: Screaming Frog auto-splits at the 50K boundary and generates a sitemap_index.xml. Submit the index file, not individual sitemap files.
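For readers generating sitemaps outside Screaming Frog, the split-and-index pattern looks roughly like this. A sketch only: it enforces the 50,000-URL limit but not the 50 MB size limit, and the `sitemapN.xml` naming simply follows the convention mentioned in Step 5:

```python
from typing import List
from xml.sax.saxutils import escape

MAX_URLS_PER_FILE = 50_000  # per-file URL limit in the sitemap protocol

def chunk_urls(urls: List[str], size: int = MAX_URLS_PER_FILE) -> List[List[str]]:
    """Split the URL list into chunks of at most `size` entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

def build_sitemap_index(base_url: str, file_count: int) -> str:
    """Build a sitemap index that references sitemap1.xml .. sitemapN.xml."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for n in range(1, file_count + 1):
        loc = escape(f"{base_url.rstrip('/')}/sitemap{n}.xml")
        lines.append(f"  <sitemap><loc>{loc}</loc></sitemap>")
    lines.append("</sitemapindex>")
    return "\n".join(lines) + "\n"

urls = [f"https://example.com/p/{i}" for i in range(120_001)]
chunks = chunk_urls(urls)
print([len(chunk) for chunk in chunks])
print(build_sitemap_index("https://example.com", len(chunks)))
```

With 120,001 URLs the list splits into three files, and the index is the single file to submit.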
Hand it off
Sitemap maintenance is a recurring task that compounds value (better indexation velocity) or compounds debt (silent indexation suppression). A vetted technical SEO specialist on EverestX will set up the automation, monitor weekly, and own the GSC reconciliation — typically $300-500/mo at $14-16/hr.
CMS plugins win on automation; Screaming Frog wins on precision. For most WordPress sites, Yoast or RankMath is enough. For complex setups (ecom with parameter issues, headless CMS, custom URL structures), Screaming Frog generates cleaner output. Use SF for the initial audit and migration; use the CMS plugin for ongoing maintenance once configured correctly.
Spec maximum: 50,000 URLs OR 50 MB per file, whichever you hit first. Beyond that, split into multiple sitemaps referenced by a sitemap_index.xml. Screaming Frog handles the split automatically.
Image sitemap: yes if you have meaningful image SEO (ecom, photography, design). Otherwise no. Video sitemap: yes if video is core to your content. Generally these are added as extensions to the main sitemap rather than separate files.
Normal lag is 4-8 weeks for new sites; 1-2 weeks for established sites. Large gaps usually mean: (1) some sitemap URLs are low quality and Google declined to index them; (2) duplicate content suppression; (3) crawl budget exhaustion on parameter URLs you should exclude. Use GSC's Pages report to diagnose.
Yes, especially on large sites — one sitemap for /blog, one for /products, one for /docs. Easier to monitor section-by-section indexation in GSC. Each section gets its own submission. Reference all of them in a sitemap_index.xml.
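Splitting by section, as described above, comes down to grouping URLs by their first path segment. A minimal sketch; the `root` bucket name for the homepage is an arbitrary choice:

```python
from collections import defaultdict
from typing import Dict, Iterable, List
from urllib.parse import urlsplit

def group_by_section(urls: Iterable[str]) -> Dict[str, List[str]]:
    """Group URLs by first path segment, e.g. /blog/... goes to 'blog'."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for url in urls:
        segments = [segment for segment in urlsplit(url).path.split("/") if segment]
        groups[segments[0] if segments else "root"].append(url)
    return dict(groups)

sections = group_by_section([
    "https://example.com/",
    "https://example.com/blog/sitemap-guide",
    "https://example.com/blog/crawl-budget",
    "https://example.com/products/widget",
])
for name, members in sections.items():
    print(name, len(members))
```

Each group can then be written to its own sitemap file and listed in the shared sitemap_index.xml.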