Google on the percentage that represents duplicate content

Google’s John Mueller recently answered the question of whether there is a Duplicate Content Percentage Threshold that Google uses to identify and filter out duplicate content.

What percentage equals duplicate content?

The conversation actually started on Facebook when Duane Forrester (@DuaneForrester) asked if anyone knows if a search engine has published a content overlap percentage at which content is considered duplicate.

Bill Hartzer (@bhartzer) took to Twitter to ask John Mueller and received an almost immediate response.

Bill tweeted:

“Hey @johnmu, is there a percentage that represents duplicate content?

For example, should we try to ensure that pages are at least 72.6% unique from other pages on our site?

Does Google even measure it?”

Google’s John Mueller replied:

How does Google detect duplicate content?

Google’s methodology for detecting duplicate content has remained remarkably similar for many years.

In 2013, Matt Cutts (@mattcutts), then a software engineer at Google, published an official Google video describing how Google detects duplicate content.

He started the video by stating that a lot of internet content is duplicate and that’s a normal thing.

“It’s important to realize that if you look at content on the web, something like 25% or 30% of all content on the web is duplicate content.

… People quote a paragraph from a blog and then link to the blog, that sort of thing.”

He went on to say that because much duplicate content is innocent and without spammy intent, Google will not penalize this content.

According to him, penalizing web pages for having duplicate content would have a negative effect on the quality of search results.

What Google does when it finds duplicate content:

“…try to bundle everything together and treat it as one piece of content.”

Matt continued:

“It’s just treated as something that we have to bundle appropriately. And we have to make sure it ranks correctly.”

He explained that Google then chooses which page to display in search results and filters out duplicate pages to improve user experience.

How Google Handles Duplicate Content – 2020 Version

Fast forward to 2020, when Google released a Search Off the Record podcast episode in which the same topic is described in remarkably similar language.

Here is the relevant section of the podcast, starting at 06:44 into the episode:

“Gary Illyes: And now we’ve ended up with the next step, which is actually canonicalization and dupe detection.

Martin Splitt: Isn’t it the same thing, detection of dupes and canonicalization, in a way?

Gary Illyes: [00:06:56] Well, it’s not, is it? Because you have to detect the dupes first, group them together roughly, saying that all these pages are dupes of each other,
and then you basically have to find a page leader for each of them.

…And that’s canonicalization.

So you have duplication, which is the whole term, but inside of that you have cluster building, like dupe cluster building, and canonicalization.”

Gary then explains in technical terms exactly how they do it: rather than measuring a percentage of overlap, Google compares checksums.

A checksum can be thought of as a fingerprint of the content, represented as a fixed-length sequence of numbers and letters. If two pages contain the same content, their checksums will match.
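To make the idea concrete, here is a minimal Python sketch of checksum-based dupe detection and cluster building. It is an illustration only: the normalization step, the use of MD5, and the alphabetical "page leader" choice are assumptions for this example, not Google's actual method, which has not been published.

```python
import hashlib

def checksum(text: str) -> str:
    """Reduce page content to a fixed-length checksum (MD5, for illustration)."""
    normalized = " ".join(text.split()).lower()  # crude whitespace/case normalization
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def build_dupe_clusters(pages: dict[str, str]) -> dict[str, list[str]]:
    """Group URLs whose content reduces to the same checksum (a 'dupe cluster')."""
    clusters: dict[str, list[str]] = {}
    for url, content in pages.items():
        clusters.setdefault(checksum(content), []).append(url)
    return clusters

# Hypothetical pages: /a and /b are duplicates after normalization.
pages = {
    "https://example.com/a": "Ducks  like water.",
    "https://example.com/b": "ducks like water.",
    "https://example.com/c": "Cats like naps.",
}

for digest, urls in build_dupe_clusters(pages).items():
    # Toy "page leader" selection; Google uses many signals to pick a canonical.
    canonical = sorted(urls)[0]
    print(f"canonical: {canonical} cluster: {urls}")
```

Note how this matches Gary's two-step description: first the pages are grouped into dupe clusters (identical checksums), then one page per cluster is chosen as the leader.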

Here’s how Gary explained it:

“So for dupe detection, what we do, well, we try to detect dupes.

And the way we do it is maybe the way most people at other search engines do it, which is basically reduce the content to a hash or checksum and then compare checksums.”

Gary said Google does it this way because it’s easier (and evidently accurate).

Google detects duplicate content with checksums

So when talking about duplicate content, it’s probably not a percentage threshold issue, where there’s a number at which the content is said to be duplicated.

Rather, duplicate content is detected with a representation of the content in the form of a checksum, and then those checksums are compared.

Another point to remember is that there seems to be a distinction between a page where only some of the content is duplicate and a page where all of it is duplicate.
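That distinction matters for the checksum approach: a checksum of the whole page only catches exact duplicates. One common way to reason about partial overlap (an assumption for illustration, not a description of Google's internals) is to checksum smaller units, such as individual paragraphs, and compare the sets:

```python
import hashlib

def paragraph_checksums(text: str) -> set[str]:
    """Checksum each non-empty paragraph so partial overlap becomes measurable."""
    return {
        hashlib.sha1(p.strip().lower().encode("utf-8")).hexdigest()
        for p in text.split("\n")
        if p.strip()
    }

# Hypothetical pages that share two of three paragraphs.
page_a = "Ducks like water.\nCats like naps.\nDogs like walks."
page_b = "Cats like naps.\nDogs like walks.\nBirds like seeds."

shared = paragraph_checksums(page_a) & paragraph_checksums(page_b)
overlap = len(shared) / len(paragraph_checksums(page_a))
print(f"{overlap:.0%} of page A's paragraphs also appear on page B")
```

Here the whole-page checksums of the two pages differ, yet two-thirds of the paragraphs are shared, which is the "some content is duplicate" case the whole-page comparison alone cannot see.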

Featured Image by Shutterstock/Ezume Images