How to find duplicate content with Screaming Frog

Duplicate content cannibalization is the silent killer of mid-sized SEO programs. Screaming Frog's Content analysis tab finds exact and near-duplicates in one pass — but the configuration matters more than the report.

~2 hr · Intermediate · Updated May 26, 2026

Who this is for: SEOs running a site with templates, programmatic pages, or category-heavy navigation who suspect content cannibalization. If multiple pages target the same keyword and none rank well, duplicate-content detection is the diagnostic.

What you'll need

Screaming Frog SEO Spider 21+ (Content analysis tab requires the licence)
A completed crawl with HTML extraction enabled
About 2 hours for the analysis + initial decisions
Familiarity with canonical tags vs. 301 redirects vs. content consolidation as fix paths

Step 1

Enable content analysis before the crawl

Configuration → Content → Duplicates. Set Near Duplicate Threshold to 90% (default). Check "Enable Near Duplicates" before crawling.

Open Configuration → Content → Duplicates. By default, Screaming Frog flags exact-duplicate pages (identical HTML body) but not near-duplicates.

Check 'Enable Near Duplicates.' This activates the minhash similarity algorithm that catches pages with 90%+ identical content even when titles, meta, or sidebars differ.

Set Near Duplicate Threshold to 90% for a tight first pass. Drop to 85% if your initial run returns too few clusters. Anything below 80% tends to catch too many false positives (template chrome being similar).

Check the 'Only check indexable pages' option — duplicates on noindexed pages don't matter for SEO. This narrows the report to actionable findings.
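To make the threshold less abstract, here is a minimal sketch of how a minhash similarity estimate can work in general. It is not Screaming Frog's actual implementation: the 3-word shingle size, the 128 hash functions, and all function names are assumptions chosen for illustration.

```python
import hashlib
import re


def shingles(text: str, k: int = 3) -> set[str]:
    """Split text into overlapping k-word shingles (k=3 is an assumption)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items: set[str], num_hashes: int = 128) -> list[int]:
    """For each seeded hash function, keep the smallest hash over all shingles."""
    if not items:
        return []
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_similarity(a: str, b: str, num_hashes: int = 128) -> float:
    """Fraction of matching signature slots, which approximates Jaccard similarity."""
    sig_a = minhash_signature(shingles(a), num_hashes)
    sig_b = minhash_signature(shingles(b), num_hashes)
    if not sig_a or not sig_b:
        return 0.0
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / num_hashes


page_a = "Free shipping on all orders over fifty dollars with easy returns within thirty days"
page_b = "Free shipping on all orders over forty dollars with easy returns within thirty days"
print(f"Estimated similarity: {estimated_similarity(page_a, page_b):.0%}")
```

A single changed word alters every shingle that contains it, which is why short pages can fall below a 90% threshold after small edits while long pages stay above it.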

Step 2

Configure content area extraction

Configuration → Content → Area. Tell Screaming Frog which HTML element is your "main content" — usually <main>, <article>, or a specific class.

Configuration → Content → Area. Without this set correctly, Screaming Frog compares the full HTML including header, footer, and sidebar. Shared chrome then dominates the comparison, so genuine near-duplicates are hard to separate from template noise.

Use the 'Include' field with a CSS selector that wraps your main content. Common values: 'main', 'article', '.entry-content', '.post-content', '#main-content'.

If you're unsure, view-source on a sample page and find the wrapper element around the body copy. That's your include selector.

Use the 'Exclude' field to strip elements that shouldn't count: '.sidebar', '.comments', '.related-posts'. These are similar across pages by design and shouldn't trigger duplicate flags.

Re-crawl after changing these settings. The Content tab will only populate correctly for the new crawl.
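If you want to sanity-check an include/exclude choice on a sample page before re-crawling, the idea can be approximated outside the tool. The sketch below uses only Python's standard library. The class name, the `main` include element, and the excluded class names are assumptions for illustration, and it assumes reasonably well-formed HTML. It is not how Screaming Frog extracts content internally.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ContentAreaExtractor(HTMLParser):
    """Collect text inside one include element, skipping excluded classes."""

    def __init__(self, include_tag="main",
                 exclude_classes=("sidebar", "comments", "related-posts")):
        super().__init__()
        self.include_tag = include_tag
        self.exclude_classes = set(exclude_classes)
        self.include_depth = 0  # > 0 while inside the include element
        self.exclude_depth = 0  # > 0 while inside an excluded element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        if self.exclude_depth or (self.include_depth and classes & self.exclude_classes):
            self.exclude_depth += 1
        elif self.include_depth or tag == self.include_tag:
            self.include_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.exclude_depth:
            self.exclude_depth -= 1
        elif self.include_depth:
            self.include_depth -= 1

    def handle_data(self, data):
        if self.include_depth and not self.exclude_depth and data.strip():
            self.chunks.append(data.strip())


def extract_main_text(html: str, **kwargs) -> str:
    parser = ContentAreaExtractor(**kwargs)
    parser.feed(html)
    return " ".join(parser.chunks)


sample = (
    "<html><body><header>Site nav</header>"
    "<main><h1>Blue widget</h1><p>Body copy.<br>More copy.</p>"
    '<div class="sidebar">Popular posts</div></main>'
    "<footer>Footer links</footer></body></html>"
)
print(extract_main_text(sample))  # Blue widget Body copy. More copy.
```

If the printed text still contains navigation or related-post labels, the selector is too broad and the near-duplicate report will likely be noisy.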

Step 3

Read the Content tab — Exact Duplicates first

Content tab → Filter: "Exact Duplicates." These are pages with identical body content. Usually unintended.

Open the Content tab. Filter to 'Exact Duplicates.' These are pages where the body content (within your include selector) is identical character-for-character.

Common causes: parameter URLs serving the same page (/product?utm=email vs /product), printer-friendly versions (/page/print), AMP duplicates (/page/amp), session-ID URLs.

Click any URL to see its duplicate cluster in the bottom Duplicates panel. Note the full set — if /product appears 5 times with different params, you have a parameter-canonical problem.

Export → Content → Exact Duplicates. This CSV is your action list. Most exact duplicates are fixed via canonical tags or URL parameter handling.
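Once you have the exported URLs, a quick way to spot parameter-driven duplicates is to group them by URL with the query string and fragment stripped. This is a rough sketch: it assumes you have already loaded the URLs into a list, and the function name and sample URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def group_parameter_duplicates(urls):
    """Group URLs that differ only by query string or fragment."""
    groups = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        base = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
        groups[base].append(url)
    # Only keep bases that appear more than once: those are candidate clusters.
    return {base: members for base, members in groups.items() if len(members) > 1}


exported_urls = [
    "https://example.com/product",
    "https://example.com/product?utm=email",
    "https://example.com/product?sessionid=123",
    "https://example.com/about",
]
for base, members in group_parameter_duplicates(exported_urls).items():
    print(f"{base}: {len(members)} variants")
```

Clusters that collapse to one base URL point to a parameter-canonical problem; clusters that do not (print or AMP paths, for instance) need a separate look.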

Step 4

Triage Near Duplicates — separate cannibalization from intent

Content tab → Filter: "Near Duplicates." Click each cluster, read the pages, decide: merge, canonical, or leave alone.

Filter to 'Near Duplicates.' These are pages whose content similarity meets your threshold (90%+ at the default setting) but that are not byte-identical. The Match column shows the similarity %.

Click each cluster and ask three questions: (1) Do these pages target the same keyword/intent? (2) Do they serve the same audience, rather than having been written for different ones? (3) Are they ranking against each other in Google?

If yes to all three: this is cannibalization. Consolidate. Pick the strongest URL (most backlinks, best position), update it with the best content from the duplicates, then 301 the duplicates to it.

If they're intentionally similar (e.g., 'Buy a Laptop in NYC' / 'Buy a Laptop in Chicago' — programmatic location pages): leave them alone, but verify each has unique value (city-specific reviews, local store info, geo schema).

If two pages serve different intents but happen to share boilerplate (product pages with the same shipping/returns blocks): tighten your content extraction selector to exclude the shared block, then re-crawl.

Step 5

Check duplicate titles and meta descriptions

Page Titles tab → Filter: "Duplicate." Meta Description tab → Filter: "Duplicate." Both should be near-zero on a healthy site.

Open the Page Titles tab. Filter to 'Duplicate.' These are pages with identical <title> tags.

Duplicate titles are often a more reliable cannibalization signal than duplicate body content — if two pages have the same title, Google likely sees them as serving the same query.

Same workflow in the Meta Description tab. Duplicate descriptions are less harmful but still a wasted opportunity for unique SERP snippets.

Export both as CSVs. Cross-reference against your Near Duplicates list — pages with duplicate titles AND duplicate near-content are top-priority consolidations.
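The cross-reference itself is a set intersection on the URL column of the two exports. The sketch below assumes the URL column is named `Address`; check the header row of your own export and adjust. The function names and sample rows are hypothetical.

```python
import csv
import io


def urls_from_export(csv_file, url_column="Address"):
    """Read the URL column of an export into a set (column name is an assumption)."""
    return {
        row[url_column].strip()
        for row in csv.DictReader(csv_file)
        if row.get(url_column)
    }


def top_priority(duplicate_titles_csv, near_duplicates_csv, url_column="Address"):
    """URLs that appear in both the duplicate-title and near-duplicate exports."""
    titles = urls_from_export(duplicate_titles_csv, url_column)
    near = urls_from_export(near_duplicates_csv, url_column)
    return sorted(titles & near)


# In-memory stand-ins for the two exported files.
duplicate_titles = io.StringIO(
    "Address,Title 1\n"
    "https://example.com/laptops,Buy a Laptop\n"
    "https://example.com/notebooks,Buy a Laptop\n"
    "https://example.com/about,About\n"
)
near_duplicates = io.StringIO(
    "Address\n"
    "https://example.com/laptops\n"
    "https://example.com/notebooks\n"
    "https://example.com/shipping\n"
)
print(top_priority(duplicate_titles, near_duplicates))
```

With real files, pass open file handles instead of the `StringIO` objects; opening with `encoding="utf-8-sig"` avoids a byte-order mark ending up in the first column name.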

Step 6

Decide between canonical, 301, and content rewrite

Three fixes: canonical (when both URLs must remain accessible), 301 (when one URL is the winner), rewrite (when both should remain but differentiate).

Canonical tag: use when both URLs need to remain accessible (e.g., /product and /product?utm=campaign). Add `<link rel="canonical" href="https://example.com/product">` to both, using an absolute URL rather than a relative path. Google consolidates signals to the canonical.

301 redirect: use when one URL is the clear winner and the duplicate has no reason to exist (old URLs, deprecated category pages, accidental duplicate creation). Permanent and final.

Content rewrite: use when both pages should remain but be differentiated (programmatic location pages that need unique value). Add unique geo content, local schema, region-specific testimonials.

Avoid noindex as a duplicate fix — it removes the page from the index but doesn't consolidate equity. Use canonical or 301 instead.
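The three fix paths reduce to two yes/no questions, which can be written down as a small decision helper. This is only a sketch of the rule of thumb above, with hypothetical function names; real decisions will also weigh backlinks, rankings, and business constraints.

```python
from html import escape


def choose_fix(both_urls_must_stay: bool, pages_should_differ: bool) -> str:
    """Map the two questions from this step onto a fix path."""
    if not both_urls_must_stay:
        return "301"        # one URL wins, the other has no reason to exist
    if pages_should_differ:
        return "rewrite"    # both stay, but each needs unique value
    return "canonical"      # both stay accessible, signals consolidate to one


def canonical_tag(absolute_url: str) -> str:
    """Build the canonical link element for the winning URL."""
    return f'<link rel="canonical" href="{escape(absolute_url, quote=True)}">'


print(choose_fix(both_urls_must_stay=True, pages_should_differ=False))
print(canonical_tag("https://example.com/product"))
```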

Common mistakes

What goes wrong (and how to avoid it)

Running content analysis without setting the content area selector
What goes wrong: Screaming Frog compares full HTML. Every page on your site has the same header/footer/sidebar, so it flags 8,000 false-positive near-duplicates. You spend a week investigating duplicates that don't exist. Lost productivity: 20-30 hours.
How to avoid: Configuration → Content → Area. Set Include to your main content wrapper and Exclude common chrome before crawling.

Consolidating the wrong winner
What goes wrong: You merge a duplicate cluster into the page with prettier URLs but no rankings. The page you redirected away had 50 ranking keywords, and they all drop within 4-8 weeks. Estimated traffic loss: 10,000-50,000 monthly sessions. Lost revenue depends on conversion economics, but $5K-50K is the typical range for ecom.
How to avoid: Before merging, cross-reference each duplicate URL against GSC Performance data. The winner is the URL with the most ranking keywords AND the strongest backlink profile, not the one with the cleanest URL.

Using noindex instead of canonical to handle duplicates
What goes wrong: Noindex removes the page from the index but doesn't pass equity to a canonical. You lose the value of the backlinks pointing at the noindexed URL. On a high-traffic duplicate, this can cost $10K-30K in lost equity.
How to avoid: Use rel=canonical (or 301 redirect) for duplicates that have backlinks or rankings. Noindex only for accidental pages that have no equity to preserve.

Ignoring duplicate titles as 'just a tag issue'
What goes wrong: Two pages with the same title compete for the same SERP. Google picks one to rank and ignores the other. The ignored page's content effort is wasted, and the number of ignored pages grows over the years.
How to avoid: Make every page title unique. Even on programmatic templates, append city/category/year to differentiate.

Setting the Near Duplicate threshold to 70% or lower
What goes wrong: Returns 30,000 'near duplicate' pairs that are mostly false positives (shared chrome, similar template). You drown in noise and miss the real cannibalization cases.
How to avoid: Start at 90% threshold. Drop to 85% only if you suspect deeper similarity issues. Below 80% is rarely useful.

Not re-crawling after consolidation
What goes wrong: You ship 50 consolidations. Without a re-crawl, you don't know if redirects work, canonicals are honored, or new duplicates were created in the process. Issues compound for months.
How to avoid: After every batch of duplicate fixes, re-crawl. Verify the cluster sizes drop and no new clusters appeared.
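For the "wrong winner" mistake above, the selection rule can be sketched as a small helper that picks the URL leading on ranking keywords and flags the cluster for manual review when a different URL leads on backlinks. The field names, the use of referring domains as the backlink measure, and the sample numbers are all assumptions for illustration; you would fill them from GSC and your backlink tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    url: str
    ranking_keywords: int   # e.g. distinct queries in GSC Performance (assumed metric)
    referring_domains: int  # e.g. from a backlink tool (assumed metric)


def pick_winner(candidates):
    """Return (winner_url, needs_review).

    needs_review is True when no single URL leads on both signals,
    in which case the choice should not be automated.
    """
    by_keywords = max(candidates, key=lambda c: c.ranking_keywords)
    by_links = max(candidates, key=lambda c: c.referring_domains)
    return by_keywords.url, by_keywords.url != by_links.url


cluster = [
    Candidate("https://example.com/laptops", ranking_keywords=50, referring_domains=12),
    Candidate("https://example.com/buy-laptops", ranking_keywords=3, referring_domains=40),
]
winner, needs_review = pick_winner(cluster)
print(winner, "(review manually)" if needs_review else "")
```

URL cleanliness is deliberately not an input here, which mirrors the advice to ignore it when choosing the winner.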

Recap

What to take away

Enable Near Duplicates and set the content area BEFORE crawling — settings only apply to new crawls.
Exact Duplicates first (usually parameter or AMP issues), Near Duplicates second (real cannibalization).
Cross-reference duplicate clusters against GSC Performance data before deciding the winner.
Three fix paths: canonical (both URLs needed), 301 (one URL wins), rewrite (both URLs needed and differentiated).
Avoid noindex as a duplicate fix — it loses equity instead of consolidating it.

Hand it off

Duplicate content audits sound simple — find duplicates, fix them. In practice, picking the right winner and right fix path takes pattern recognition that builds over hundreds of consolidations. A vetted technical SEO specialist on EverestX will own the audit, the decisions, and the ship sequence — typically $500-1,000/mo at $14-16/hr.

Frequently Asked Questions

How is "near duplicate" different from "exact duplicate"?

Exact = byte-identical body content. Near = 90%+ similar based on minhash. Exact duplicates are usually parameter or technical issues; near duplicates are usually editorial/content overlap that causes ranking cannibalization.

Should I always consolidate near duplicates?

No. Programmatic pages targeting different cities or product variants are intentionally similar and shouldn't be consolidated — they should be differentiated with unique value (local content, unique reviews, geo schema). Consolidate only when both pages target the same intent.

What threshold should I use for near duplicates?

Start at 90% (the default). Drop to 85% if your first pass returns too few clusters. Below 80% almost always returns false positives from shared template chrome.

Can I detect duplicate content across domains (cross-site)?

Screaming Frog crawls one site at a time, so cross-domain duplicate detection requires two crawls + manual comparison. For cross-domain content theft detection, Copyscape or Siteliner are purpose-built tools.

How long until consolidation moves rankings?

Canonical tag changes: 2-6 weeks to be honored, 4-8 weeks for ranking effect. 301 redirects: 4-12 weeks for full equity transfer. Plan a 90-day window before evaluating the result of any consolidation effort.

Related tutorials

How to set up Screaming Frog and run your first crawl

How to use Custom Extraction in Screaming Frog

How to run an Ahrefs content gap analysis (and prioritize what to write)

When to hire a technical SEO specialist — the honest checklist
