What Percentage is Considered Duplicate Content to Google?
John Mueller of Google recently answered a question about whether Google uses a percentage threshold to identify and remove duplicate content.
- Duane Forrester (@DuaneForrester) asked on Facebook whether any search engine had published a threshold at which content is considered duplicate. Bill Hartzer (@bhartzer) relayed the question to John Mueller on Twitter and received an almost immediate reply.
- Bill said: “Hey, @johnmu, is there a percentage that represents duplicate content?
For example, should we try to make pages at least 72.6 percent unique compared to other pages on our site?
Does Google even measure it?”
- John answered: “There is no number (also, how do you calculate it anyway?).”
How does Google determine whether content is duplicated?
Google has been remarkably consistent in how it detects duplicate content for many years.
Matt Cutts, then a Google software engineer, explained how Google detects duplicate content in an official Google video published in 2013. He began the video by acknowledging that a great deal of web content is duplicated, and that this is to be expected.
“It’s important to realize that if you look at content on the web, something like 25% or 30% of all the web’s content is duplicate content.
People will quote a paragraph of a blog and then link to the blog, that sort of thing.”
Google won’t punish duplicate content because so much of it is harmless rather than spammy, Cutts said. He warned that penalizing web pages for having any duplicate content would hurt the quality of search results.
When Google encounters duplicate content, it does:
“try to group it all together and treat it as just one piece of content.”
- Matt continued:
“It’s just treated as something that we need to cluster appropriately. And we need to make sure that it ranks properly.”
He said that Google chooses which page to show in the search results to improve the user experience, and then filters out the duplicate pages.
Here is how Google explained its handling of duplicate content in 2020.
In 2020, Google published a Search Off the Record podcast episode that addressed the subject in a remarkably similar fashion. Here is what was said at the 06:44 mark:
- Gary Illyes: “And now we reach the next step, canonicalization and dupe detection.
- Martin Splitt: Isn’t that the same thing, dupe detection and canonicalization?
- Gary Illyes: [00:06:56] Well, it’s not, right? Because first, you have to catch the dupes, essentially cluster them together, declaring that all of these pages are dupes of each other,
and then you have to find a leader page for all of them.
…And that is canonicalization.
So, you have the duplication, which is the whole term, but within that, you have cluster building, like dupe cluster building, and canonicalization.”
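The two steps Gary describes, grouping dupes into clusters and then choosing a leader page, can be sketched in a few lines of Python. This is a simplified illustration, not Google's actual pipeline: the page data is invented, and picking the shortest URL as canonical is a stand-in for the many signals Google actually weighs.

```python
import hashlib
from collections import defaultdict

def checksum(text: str) -> str:
    """Reduce page content to a fixed-length fingerprint."""
    # Normalizing whitespace and case is our assumption; real systems
    # also strip boilerplate (menus, footers) before hashing.
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

def build_dupe_clusters(pages: dict[str, str]) -> dict[str, list[str]]:
    """Group URLs whose content produces the same checksum."""
    clusters = defaultdict(list)
    for url, content in pages.items():
        clusters[checksum(content)].append(url)
    return clusters

def pick_canonical(urls: list[str]) -> str:
    """Pick a 'leader' page for a cluster.  Shortest URL is a toy
    heuristic standing in for Google's real canonicalization signals."""
    return min(urls, key=len)

# Hypothetical pages: two are exact duplicates of each other.
pages = {
    "https://example.com/post": "Hello world article",
    "https://example.com/post?ref=tw": "Hello world article",
    "https://example.com/other": "A different article",
}

for urls in build_dupe_clusters(pages).values():
    print(pick_canonical(urls), "<-", urls)
```

Under these assumptions, the two `/post` URLs land in one cluster and the shorter URL is chosen as its leader.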
In Layman's Terms
Gary goes on to explain, in technical terms, how Google actually does this: it compares checksums rather than percentages. A checksum can be thought of as content reduced to a series of numbers or letters. If two pieces of content are duplicates, their checksums will be similar.
- Gary described it like this:
“So, for dupe detection, what we do is, well, we try to catch dupes.
And how we do that is perhaps how most people at other search engines do it, which is reducing the content into a hash or checksum and then comparing the checksums.”
It’s easier and more accurate, Gary said, for Google to do it that way.
Google uses checksums to detect duplicate content.
There is no specific percentage at which content is considered duplicate. Instead, content is reduced to a checksum, and those checksums are compared.
There is also a difference between all of a page's content being duplicated and only part of it being duplicated.
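The checksum approach can be illustrated with Python's standard library. The use of SHA-256 here is our assumption; Google has not said which hash function it uses. The point is that an exact copy produces an identical fingerprint, while a page that only partially overlaps does not:

```python
import hashlib

def checksum(text: str) -> str:
    """Reduce content to a fixed-length fingerprint (hash)."""
    return hashlib.sha256(text.encode()).hexdigest()

# Invented example content.
original  = "An article about ducks. They swim and quack."
full_copy = "An article about ducks. They swim and quack."
partial   = "An article about ducks. Plus a brand-new paragraph."

print(checksum(original) == checksum(full_copy))  # exact dupe: fingerprints match
print(checksum(original) == checksum(partial))    # partial overlap: fingerprints differ
```

Note that a plain hash only catches exact duplicates; detecting near-duplicates would need a similarity-preserving fingerprint, which is beyond this sketch.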