02/10/2022
How Google Handles Duplicate Content - [...] Version?
Fast forward to 2020 and Google published a Search Off the Record podcast episode where the same topic is described in remarkably similar language.
Here is the relevant section of that podcast, starting 06:44 into the episode:
“Gary Illyes: And now we ended up with the next step, which is actually canonicalization and dupe detection.
Martin Splitt: Isn’t that the same, dupe detection and canonicalization, kind of?
Gary Illyes: [00:06:56] Well, it’s not, right? Because first you have to detect the dupes, basically cluster them together, saying that all of these pages are dupes of each other, and then you have to basically find a leader page for all of them.
…And that is canonicalization.
So, you have the duplication, which is the whole term, but within that you have cluster building, like dupe cluster building, and canonicalization.”
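The second stage Gary describes, finding a leader page for each cluster, can be illustrated with a minimal Python sketch. It assumes the dupe clusters have already been built. The selection rule (prefer HTTPS, then the shortest URL) and the names `pick_canonical` and `canonicalize` are hypothetical, chosen only for illustration; the podcast excerpt does not say which signals Google uses to choose the leader.

```python
from typing import Dict, List


def pick_canonical(cluster: List[str]) -> str:
    """Pick one 'leader' URL for a cluster of duplicate pages.

    Assumed rule, for illustration only: prefer https URLs, then the
    shortest URL, then alphabetical order as a deterministic tie-break.
    """
    return min(
        cluster,
        key=lambda url: (not url.startswith("https://"), len(url), url),
    )


def canonicalize(clusters: List[List[str]]) -> Dict[str, str]:
    """Map every URL in every cluster to that cluster's leader URL."""
    return {
        url: pick_canonical(cluster)
        for cluster in clusters
        for url in cluster
    }
```

Calling `canonicalize` on a list of clusters returns a lookup from each duplicate URL to its chosen leader, which keeps the two steps from the quote (cluster building first, leader selection second) separate.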
This is how Gary explained dupe detection:
“So, for dupe detection what we do is, well, we try to detect dupes.
And how we do that is perhaps how most people at other search engines do it, which is, basically, reducing the content into a hash or checksum and then comparing the checksums.”
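The idea of reducing content to a checksum and comparing checksums can be sketched in a few lines of Python. This is a simplified illustration, not Google's implementation: the use of SHA-256, the whitespace and case normalization, and the function names `content_checksum` and `build_dupe_clusters` are all assumptions. An exact hash like this only catches pages whose normalized text is identical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def content_checksum(text: str) -> str:
    """Reduce page content to a checksum.

    Assumption: lowercasing and collapsing whitespace is enough
    normalization for this illustration.
    """
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def build_dupe_clusters(pages: Dict[str, str]) -> List[List[str]]:
    """Group URLs whose content checksums match.

    `pages` maps URL -> page text. Only groups with more than one
    URL are returned, since a single page is not a duplicate.
    """
    by_checksum = defaultdict(list)
    for url, text in pages.items():
        by_checksum[content_checksum(text)].append(url)
    return [sorted(urls) for urls in by_checksum.values() if len(urls) > 1]
```

Comparing fixed-length checksums is far cheaper than comparing full page text pairwise, which is the practical appeal of the approach described in the quote.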