Duplicate content cannibalization is the silent killer of mid-sized SEO programs. Screaming Frog's Content analysis tab finds exact and near-duplicates in one pass — but the configuration matters more than the report.
Who this is for
SEOs running a site with templates, programmatic pages, or category-heavy navigation who suspect content cannibalization. If multiple pages target the same keyword and none rank well, duplicate-content detection is the diagnostic.
Step 1
Configuration → Content → Duplicates. Set Near Duplicate Threshold to 90% (default). Check "Enable Near Duplicates" before crawling.
Open Configuration → Content → Duplicates. By default, Screaming Frog flags exact-duplicate pages (identical HTML body) but not near-duplicates.
Check 'Enable Near Duplicates.' This activates the minhash similarity algorithm that catches pages with 90%+ identical content even when titles, meta, or sidebars differ.
Set Near Duplicate Threshold to 90% for a tight first pass. Drop to 85% if your initial run returns too few clusters. Below 80%, most matches are false positives caused by similar template chrome.
Check the 'Only check indexable pages' option — duplicates on noindexed pages don't matter for SEO. This narrows the report to actionable findings.
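To build intuition for what the threshold percentage means, here is a minimal sketch of shingle-based Jaccard similarity, the quantity that MinHash approximates. It is not Screaming Frog's implementation; the function names, the three-word shingle size, and the sample sentences are assumptions made for illustration.

```python
from __future__ import annotations


def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping word n-grams (shingles) in the text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: str, b: str, size: int = 3) -> float:
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


def is_near_duplicate(a: str, b: str, threshold: float = 0.90) -> bool:
    """True when similarity meets the threshold (0.90 mirrors the 90% setting)."""
    return jaccard(a, b) >= threshold


# Hypothetical body copy from two pages that differ by one word.
page_a = "buy a laptop online with free shipping and easy returns today"
page_b = "buy a laptop online with free shipping and easy returns now"

print(jaccard(page_a, page_b))                           # 0.8
print(is_near_duplicate(page_a, page_b))                 # False at 90%
print(is_near_duplicate(page_a, page_b, threshold=0.8))  # True at 80%
```

On text this short, a single changed word moves the score a long way. Real pages have hundreds of words, so a 90% match usually means the body copy is the same apart from a few sentences.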
Step 2
Configuration → Content → Area. Tell Screaming Frog which HTML element is your "main content" — usually <main>, <article>, or a specific class.
Open Configuration → Content → Area. Without this set correctly, Screaming Frog compares the full HTML, including header, footer, and sidebar. The shared chrome then dominates the comparison, so genuine near-duplicates are buried among template-driven matches.
Use the 'Include' field with a CSS selector that wraps your main content. Common values: 'main', 'article', '.entry-content', '.post-content', '#main-content'.
If you're unsure, view-source on a sample page and find the wrapper element around the body copy. That's your include selector.
Use the 'Exclude' field to strip elements that shouldn't count: '.sidebar', '.comments', '.related-posts'. These are similar across pages by design and shouldn't trigger duplicate flags.
Re-crawl after changing these settings. The Content tab will only populate correctly for the new crawl.
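The effect of the include and exclude settings can be sketched with Python's standard-library HTML parser. This is only an illustration of the idea, not how Screaming Frog extracts content. The class name, the helper function, and the sample markup are hypothetical, and the depth counting assumes well-formed HTML in which every non-void element is explicitly closed.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ContentAreaExtractor(HTMLParser):
    """Collect text inside one include tag, skipping excluded classes."""

    def __init__(self, include_tag="main",
                 exclude_classes=("sidebar", "comments", "related-posts")):
        super().__init__()
        self.include_tag = include_tag
        self.exclude_classes = set(exclude_classes)
        self.include_depth = 0  # > 0 while inside the include element
        self.exclude_depth = 0  # > 0 while inside an excluded element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.exclude_depth:
            self.exclude_depth += 1
            return
        classes = set((dict(attrs).get("class") or "").split())
        if self.include_depth and classes & self.exclude_classes:
            self.exclude_depth = 1
            return
        if self.include_depth or tag == self.include_tag:
            self.include_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.exclude_depth:
            self.exclude_depth -= 1
        elif self.include_depth:
            self.include_depth -= 1

    def handle_data(self, data):
        if self.include_depth and not self.exclude_depth and data.strip():
            self.chunks.append(data.strip())


def extract_main_text(html: str, include_tag: str = "main") -> str:
    """Return the text that would be compared once chrome is stripped."""
    parser = ContentAreaExtractor(include_tag=include_tag)
    parser.feed(html)
    return " ".join(parser.chunks)


sample_html = (
    "<html><body><header>Site nav</header>"
    "<main><h1>Blue Widget</h1><p>Hand-made in Ohio.<br>Ships free.</p>"
    '<div class="sidebar"><p>Popular posts</p></div></main>'
    "<footer>Copyright</footer></body></html>"
)
print(extract_main_text(sample_html))
```

Only the heading and the product copy survive. The header, footer, and sidebar text never reach the comparison, which is the outcome the include and exclude fields are meant to produce.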
Step 3
Content tab → Filter: "Exact Duplicates." These are pages with identical body content. Usually unintended.
Open the Content tab. Filter to 'Exact Duplicates.' These are pages where the body content (within your include selector) is identical character-for-character.
Common causes: parameter URLs serving the same page (/product?utm=email vs /product), printer-friendly versions (/page/print), AMP duplicates (/page/amp), session-ID URLs.
Click any URL to see its duplicate cluster in the Duplicate Details tab in the lower pane. Note the full set: if /product appears 5 times with different params, you have a parameter-canonical problem.
Bulk Export → Content → Exact Duplicates. This CSV is your action list. Most exact duplicates are fixed via canonical tags or URL parameter handling.
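Once the export is in hand, parameter-driven duplicates can be grouped by stripping tracking parameters and comparing what remains. The sketch below uses only the standard library. The parameter list, function names, and example.com URLs are assumptions; adjust the list to the parameters your site actually uses.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of parameters that never change page content.
TRACKING_PARAMS = {"utm", "utm_source", "utm_medium", "utm_campaign",
                   "utm_term", "utm_content", "gclid", "fbclid", "sessionid"}


def normalize(url: str) -> str:
    """Drop tracking parameters and the fragment, keep everything else."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key.lower() not in TRACKING_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), ""))


def cluster_by_normalized_url(urls):
    """Map each normalized URL to its variants, keeping groups of 2+."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[normalize(url)].append(url)
    return {key: group for key, group in clusters.items() if len(group) > 1}


exported = [
    "https://example.com/product",
    "https://example.com/product?utm=email",
    "https://example.com/product?utm_source=news&color=red",
    "https://example.com/about",
]
for canonical_target, variants in cluster_by_normalized_url(exported).items():
    print(canonical_target, "<-", variants)
```

The third URL keeps its color parameter and therefore stays out of the cluster. Whether such a parameter changes the content is a judgment call this sketch does not make for you.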
Step 4
Content tab → Filter: "Near Duplicates." Click each cluster, read the pages, decide: merge, canonical, or leave alone.
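Filter to 'Near Duplicates.' These are pages whose content is at least 90% similar (or whatever threshold you set) but not byte-identical. The Closest Similarity Match column shows the similarity percentage. If the filter is empty after the crawl finishes, run Crawl Analysis → Start; near-duplicate data is populated at that stage.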
Click each cluster and ask three questions: (1) Do these pages target the same keyword/intent? (2) Were they written separately for different audiences? (3) Are they ranking against each other in Google?
If yes to all three: this is cannibalization. Consolidate. Pick the strongest URL (most backlinks, best position), update it with the best content from the duplicates, then 301 the duplicates to it.
If they're intentionally similar (e.g., 'Buy a Laptop in NYC' / 'Buy a Laptop in Chicago' — programmatic location pages): leave them alone, but verify each has unique value (city-specific reviews, local store info, geo schema).
If two pages serve different intents but happen to share boilerplate (product pages with the same shipping/returns blocks): tighten your content extraction selector to exclude the shared block, then re-crawl.
Step 5
Page Titles tab → Filter: "Duplicate." Meta Description tab → Filter: "Duplicate." Both should be near-zero on a healthy site.
Open the Page Titles tab. Filter to 'Duplicate.' These are pages with identical <title> tags.
Duplicate titles are often a more reliable cannibalization signal than duplicate body content — if two pages have the same title, Google likely sees them as serving the same query.
Same workflow in the Meta Description tab. Duplicate descriptions are less harmful but still a wasted opportunity for unique SERP snippets.
Export both as CSVs. Cross-reference against your Near Duplicates list: pages that have duplicate titles and also appear as near-duplicates are top-priority consolidations.
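The cross-reference is a set intersection on the URL column. The sketch below assumes both exports have a column named Address; check the header row of your own files, since the column names and sample rows here are illustrative. It reads CSV text from strings so that it runs as-is. For real files, pass the contents of each export instead.

```python
import csv
import io


def addresses(csv_text: str, column: str = "Address") -> set:
    """Return the set of URLs found in the given column of a CSV export."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[column].strip() for row in reader if row.get(column)}


def top_priority(duplicate_titles_csv: str, near_duplicates_csv: str) -> list:
    """URLs that have a duplicate title and also sit in a near-duplicate cluster."""
    return sorted(addresses(duplicate_titles_csv) & addresses(near_duplicates_csv))


# Hypothetical export contents.
titles_export = (
    "Address,Title 1\n"
    "https://example.com/a,Buy Laptops\n"
    "https://example.com/b,Buy Laptops\n"
    "https://example.com/c,About\n"
)
near_export = (
    "Address,Closest Similarity Match\n"
    "https://example.com/b,94\n"
    "https://example.com/d,91\n"
)
print(top_priority(titles_export, near_export))
```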
Step 6
Three fixes: canonical (when both URLs must remain accessible), 301 (when one URL is the winner), rewrite (when both should remain but differentiate).
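Canonical tag: use when both URLs need to remain accessible (e.g., /product and /product?utm=campaign). Add `<link rel="canonical" href="https://example.com/product">` to the <head> of both, using the absolute URL (example.com stands in for your domain). Google generally consolidates signals to the canonical, though it treats the tag as a hint rather than a directive.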
301 redirect: use when one URL is the clear winner and the duplicate has no reason to exist (old URLs, deprecated category pages, accidental duplicate creation). Permanent and final.
Content rewrite: use when both pages should remain but be differentiated (programmatic location pages that need unique value). Add unique geo content, local schema, region-specific testimonials.
Avoid noindex as a duplicate fix — it removes the page from the index but doesn't consolidate equity. Use canonical or 301 instead.
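After shipping canonical tags, it helps to confirm that each duplicate actually declares the URL you intended. The following sketch parses a page's HTML with the standard library and resolves the canonical href against the page URL. The class and function names and the example.com markup are hypothetical. Fetching the HTML is left out, so pair it with whatever HTTP client you already use.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values and attributes.get("href"):
            self.canonicals.append(attributes["href"])


def canonical_of(html: str, page_url: str):
    """Return the absolute canonical URL, or None if missing or ambiguous."""
    finder = CanonicalFinder()
    finder.feed(html)
    if len(finder.canonicals) != 1:
        return None  # zero or conflicting canonicals both need a manual look
    return urljoin(page_url, finder.canonicals[0])


def canonical_matches(html: str, page_url: str, expected: str) -> bool:
    """True when the page declares exactly the expected canonical URL."""
    return canonical_of(html, page_url) == expected


duplicate_html = (
    '<html><head><link rel="canonical" '
    'href="https://example.com/product"></head><body></body></html>'
)
print(canonical_matches(duplicate_html,
                        "https://example.com/product?utm=campaign",
                        "https://example.com/product"))
```

Returning None when a page carries more than one canonical is a deliberate choice in this sketch: conflicting tags are better reviewed by hand than resolved silently.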
Common mistakes
Running content analysis without setting the content area selector
What goes wrong: Screaming Frog compares full HTML. Every page on your site has the same header/footer/sidebar, so it flags 8,000 false-positive near-duplicates. You spend a week investigating duplicates that don't exist. Lost productivity: 20-30 hours.
How to avoid: Configuration → Content → Area. Set Include to your main content wrapper and Exclude common chrome before crawling.
Consolidating the wrong winner
What goes wrong: You merge a duplicate cluster into the page with the prettier URL but no rankings. The page you redirected away had 50 ranking keywords, and they all drop within 4-8 weeks. Estimated traffic loss: 10,000-50,000 monthly sessions. Lost revenue depends on conversion economics, but $5K-50K is the typical range for ecom.
How to avoid: Before merging, cross-reference each duplicate URL against GSC Performance data. The winner is the URL with the most ranking keywords AND the strongest backlink profile, not the one with the cleanest URL.
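One way to make the winner choice repeatable is to score each URL in a cluster on ranking keywords and backlink strength together. The sketch below is an assumption-heavy illustration: the Candidate fields, the use of referring domains as a backlink proxy, and the equal weighting are all hypothetical choices, and the numbers would come from your own GSC and backlink exports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    url: str
    ranking_keywords: int    # e.g. distinct queries from GSC Performance
    referring_domains: int   # e.g. from your backlink tool


def pick_winner(cluster):
    """Return the candidate with the best combined, normalized score."""
    if not cluster:
        raise ValueError("cluster must contain at least one candidate")
    # Normalize each metric by the cluster maximum so neither dominates.
    max_keywords = max(c.ranking_keywords for c in cluster) or 1
    max_domains = max(c.referring_domains for c in cluster) or 1

    def score(candidate):
        return (candidate.ranking_keywords / max_keywords
                + candidate.referring_domains / max_domains)

    return max(cluster, key=score)


cluster = [
    Candidate("https://example.com/buy-laptops", ranking_keywords=0,
              referring_domains=1),
    Candidate("https://example.com/category/laptops-2019",
              ranking_keywords=50, referring_domains=12),
]
print(pick_winner(cluster).url)
```

Here the uglier URL wins because it holds the rankings and the links. If you still want the cleaner URL, that is a separate migration decision, not part of the consolidation.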
Using noindex instead of canonical to handle duplicates
What goes wrong: Noindex removes the page from the index but doesn't pass equity to a canonical. You lose the backlinks pointing at the noindexed URL forever. On a high-traffic duplicate, this can cost $10K-30K in lost equity.
How to avoid: Use rel=canonical (or 301 redirect) for duplicates that have backlinks or rankings. Noindex only for accidental pages that have no equity to preserve.
Ignoring duplicate titles as 'just a tag issue'
What goes wrong: Two pages with the same title compete for the same SERP. Google picks one to rank and ignores the other. The effort spent on the ignored page is wasted, and the number of ignored pages grows as the site adds content.
How to avoid: Make every page title unique. Even on programmatic templates, append city/category/year to differentiate.
Setting the Near Duplicate threshold to 70% or lower
What goes wrong: Returns 30,000 'near duplicate' pairs that are mostly false positives (shared chrome, similar template). You drown in noise and miss the real cannibalization cases.
How to avoid: Start at 90% threshold. Drop to 85% only if you suspect deeper similarity issues. Below 80% is rarely useful.
Not re-crawling after consolidation
What goes wrong: You ship 50 consolidations. Without a re-crawl, you don't know if redirects work, canonicals are honored, or new duplicates were created in the process. Issues compound for months.
How to avoid: After every batch of duplicate fixes, re-crawl. Verify the cluster sizes drop and no new clusters appeared.
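The before-and-after check can be reduced to comparing cluster sizes from the two crawls. This sketch assumes you have already summarized each crawl as a mapping from a cluster key (for example the intended winner URL) to the number of URLs in that cluster. The function name, the keys, and the counts are hypothetical.

```python
def compare_clusters(before: dict, after: dict) -> dict:
    """Classify clusters by how they changed between two crawls."""
    report = {"resolved": [], "shrunk": [], "not_improved": [], "new": []}
    for key, old_size in before.items():
        if key not in after:
            report["resolved"].append(key)      # cluster no longer reported
        elif after[key] < old_size:
            report["shrunk"].append(key)        # partially fixed
        else:
            report["not_improved"].append(key)  # same size or larger
    # Clusters that only exist in the new crawl were created by the changes
    # or by content published since the first crawl.
    report["new"] = [key for key in after if key not in before]
    return {name: sorted(keys) for name, keys in report.items()}


before_crawl = {"/product": 5, "/laptops": 3, "/blog/seo": 2}
after_crawl = {"/laptops": 2, "/blog/seo": 2, "/widgets": 4}

for status, keys in compare_clusters(before_crawl, after_crawl).items():
    print(status, keys)
```

Anything under not_improved or new is the follow-up list for the next batch of fixes.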
Hand it off
Duplicate content audits sound simple — find duplicates, fix them. In practice, picking the right winner and right fix path takes pattern recognition that builds over hundreds of consolidations. A vetted technical SEO specialist on EverestX will own the audit, the decisions, and the ship sequence — typically $500-1,000/mo at $14-16/hr.
Exact = byte-identical body content. Near = 90%+ similar based on minhash. Exact duplicates are usually parameter or technical issues; near duplicates are usually editorial/content overlap that causes ranking cannibalization.
Not every near-duplicate cluster should be consolidated. Programmatic pages targeting different cities or product variants are intentionally similar and shouldn't be merged; they should be differentiated with unique value (local content, unique reviews, geo schema). Consolidate only when both pages target the same intent.
Start at 90% (the default). Drop to 85% if your first pass returns too few clusters. Below 80% almost always returns false positives from shared template chrome.
Screaming Frog crawls one site at a time, so cross-domain duplicate detection requires two crawls + manual comparison. For cross-domain content theft detection, Copyscape or Siteliner are purpose-built tools.
Canonical tag changes: 2-6 weeks to be honored, 4-8 weeks for ranking effect. 301 redirects: 4-12 weeks for full equity transfer. Plan a 90-day window before evaluating the result of any consolidation effort.