If you want to audit a website and look for the best SEO tool for plagiarism, you’re in the right place. Before sharing my favorite anti-plagiarism tools, let’s find out what duplicate content actually is.
Duplicate content is a rather controversial topic in the SEO world: while Google representatives say we shouldn’t worry about it because the search engine can sort out rankings on its own, most SEO experts still panic when an audit turns up pages full of duplicate content.
So let’s find out what duplicate content is, how it affects Google rankings, its causes, how it can be removed, and what preventative measures you can take.
What is duplicate content?
When the same content, whether an article or a product, can be accessed at different URLs, we are dealing with duplicate content. For example, if the content on Page 1 also appears on Page 2, we have two pages with the same content: duplicate content.
We also have duplicate content when someone copies content from your site to another site.
How does duplicate content affect the ranking in Google and other search engines?
Duplicate content is not a negligible aspect because it makes the indexing process of a site very difficult. When we have multiple URLs for the same content, Google gets confused and doesn’t know which pages to index, at which point the following may happen:
- Google does not index any of the pages: When Google’s bots crawl a site, they allocate limited resources to it (the crawl budget). Google does not explore a site indefinitely, especially if it detects many pages with irrelevant content, such as duplicate pages, which limits the number of pages that get indexed.
- Google indexes all pages: For small sites (up to 10k pages), Google may index all the pages it encounters, regardless of whether they are duplicates. Thus, several pages within a single site will compete for the same keywords, a process known as cannibalization.
- Google only indexes the wrong pages: This is the most frustrating outcome. The main page, the one we build backlinks to and intend to rank in Google, may not be indexed at all, with the search engine choosing to index the duplicate pages instead.
- Google only indexes the right page: In the best case, Google figures out on its own which page should be indexed and ignores the rest. Unfortunately, such cases are quite rare.
What causes duplicate content?
We’ve made it clear that it’s not good to have duplicate content, and we’ve learned how Google handles this issue. Now it’s time to find out what causes duplicate content, and how and why it appears.
www vs non-www
A site can operate both in the “www” form (www.example.com) and in the “non-www” form (example.com). In order for Google not to index both options, it is necessary to set a preferred version of the site. For example, we can set www.example.com as the preferred option, and example.com will be redirected to it.
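As a sketch of how such a redirect might be set up (assuming an Apache server with mod_rewrite; the domain and preferred host are placeholders), a single .htaccess rule can 301-redirect the non-www host to the www host:

```apache
# Hypothetical .htaccess rule: 301-redirect non-www requests
# to the preferred www host. Assumes Apache with mod_rewrite.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ https://www.example.com/$1 [R=301,L]
```

On nginx or other servers the equivalent is a server-level redirect, but the principle is the same: one host answers, the other forwards.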
https vs http
If the website uses an SSL certificate (as it should), its URLs will no longer use HTTP but HTTPS, which adds the letter “s” (for “secure”): https://example.com.
For this reason, when installing the SSL certificate, it is necessary to perform several procedures, such as redirecting HTTP URLs to their HTTPS counterparts, updating sitemaps, and setting the preferred URLs in Google Analytics and Google Search Console, to ensure a migration without ranking losses.
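The HTTP-to-HTTPS redirect itself can be sketched the same way (again assuming Apache with mod_rewrite; adjust for your server):

```apache
# Hypothetical .htaccess rule: 301-redirect every HTTP request
# to its HTTPS counterpart. Assumes Apache with mod_rewrite.
RewriteEngine On
RewriteCond %{HTTPS} off
RewriteRule ^(.*)$ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
```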
Tag pages
A few years ago, tag pages were used very heavily, especially on blogs; a single article could have a few dozen tags. Overused tag pages can do a lot of damage to a site, because each tag creates a page listing the articles associated with it.
Thus, if we add tags identical to the site categories, we will have the same content for both the tag page and the category page.
Mobile versions of the site
Some sites with a dedicated version for mobile devices have the letter “m” in the URL structure (m.example.com or example.com/m). Thus, we have two different URLs for each page (mobile and desktop).
To resolve this issue, each mobile URL must contain a canonical tag pointing to its desktop counterpart, and each desktop URL must contain a rel=alternate tag pointing to the corresponding mobile page.
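In practice, the pair of tags looks like this (the m.example.com subdomain and the media query are illustrative):

```html
<!-- On the desktop page (https://example.com/page) -->
<link rel="alternate" media="only screen and (max-width: 640px)"
      href="https://m.example.com/page">

<!-- On the mobile page (https://m.example.com/page) -->
<link rel="canonical" href="https://example.com/page">
```

This tells search engines that the two URLs are the same content in two formats, with the desktop URL as the canonical one.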
Products that belong to several categories
A common problem with online stores is the duplication of product pages. This happens especially when the product pages contain the category/subcategory to which they belong (example.com/category/subcategory/product), and the same product is added in several categories/subcategories.
Thus, if product X is added in category A and category B, we will have two URLs for the same product (example.com/category-a/product-x/ and example.com/category-b/product-x/).
Duplicate category pages by indexing sort filters
In the case of e-commerce sites, when the user sorts the products by price, sales, reviews, etc., parameter strings are automatically appended to the end of the URL. This way, the main category page is duplicated.
External duplicate content
When the same content is found on two or more sites, we talk about external duplicate content. Many online stores run into this issue on product pages because they often prefer to import the manufacturer’s description rather than create original content for each product.
At the same time, unfortunately, plagiarism is widespread online. Sometimes those who plagiarise end up ranking above the very sites they copied the content from.
How to identify duplicate content?
Finding the best SEO tool for plagiarism
Identifying duplicate content is not as easy as it sounds at first glance, and it often takes hours of research to identify different patterns of duplicate URLs. Fortunately, there are tools we can use to automate our work. Each website is unique, and so are the tools offered. There is no best SEO tool for plagiarism, but you can find the right one for you.
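Whatever tool you pick, the core idea behind internal duplicate detection is simple: collect each page’s text, normalize it, and compare fingerprints. Here is a minimal, hypothetical Python sketch of that idea; the URLs and page texts are made up, and a real tool would extract body text from an actual crawl:

```python
import hashlib

def normalize(text: str) -> str:
    # Collapse whitespace and lowercase so trivial formatting
    # differences don't hide duplicates.
    return " ".join(text.lower().split())

def fingerprint(text: str) -> str:
    # Hash the normalized text so pages can be compared cheaply.
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

def find_duplicates(pages: dict[str, str]) -> dict[str, list[str]]:
    """Group URLs whose normalized body text is identical."""
    groups: dict[str, list[str]] = {}
    for url, body in pages.items():
        groups.setdefault(fingerprint(body), []).append(url)
    # Keep only fingerprints shared by more than one URL.
    return {h: urls for h, urls in groups.items() if len(urls) > 1}

# Hypothetical crawl output: URL -> extracted body text.
pages = {
    "https://example.com/category-a/product-x/": "Product X. Great widget.",
    "https://example.com/category-b/product-x/": "Product X.  great widget.",
    "https://example.com/category-a/product-y/": "Product Y. Another widget.",
}
print(find_duplicates(pages))
```

Exact-hash matching only catches identical pages; real auditing tools also use fuzzier similarity measures, but the workflow is the same.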
You can also use Google itself as an SEO tool for plagiarism.
Not all URLs identified by the above tools are indexed in Google. To find duplicate indexed URLs, you can use the search commands “site:example.com inurl:keyword” or “site:example.com intitle:keyword” in Google.
For example, if the analysis with one of the above tools shows that the filtering system generates a “?sort=” parameter on category pages, we can check whether and how many of these URLs have been indexed with “site:example.com inurl:sort=”.
How to fix duplicate content?
After identifying the duplicate content, we need to remove it, and for this, we have the following options:
Canonization of pages
The most convenient way to get rid of duplicate content is to use the canonical attribute. This attribute is intended to inform search engines which page to index.
For example, if instead of “example.com/product?sort=” we want to index the “example.com/product” page, we need to define the following canonical tag in the <head> section of the HTML code:
<link rel="canonical" href="https://example.com/product" />
301 redirects
If duplicate pages are of no interest to users, they can be redirected to the main pages; this method is also the fastest. But pay attention to the number of redirects, because redirect chains can increase the site’s loading time.
Block content via the robots file (robots.txt)
Robots.txt is a file that tells search engines which pages they should not crawl. It is handy when you want to keep crawlers away from URLs that contain certain parameters. Keep in mind that robots.txt controls crawling, not indexing: a blocked URL can still appear in the index if other pages link to it.
For example, to keep crawlers away from all URLs that contain the “?sort=” parameter, we can add a Disallow rule to the robots.txt file.
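A minimal robots.txt rule for this case (assuming the parameter is “?sort=”; the `*` wildcard is supported by Google and most major crawlers) would be:

```
User-agent: *
Disallow: /*?sort=
```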
Noindex meta tag
Another useful meta tag in the fight against duplicate content is the robots meta tag. By setting its content to noindex in a page’s <head> section, we tell search engines not to index that page.
<meta name="robots" content="noindex,follow">
How to prevent duplicate content?
The best way to solve duplicate content is to prevent it from occurring by taking precautions.
Self-referencing canonical
By a self-referencing canonical, I mean that a page’s canonical attribute points to itself. This prevents the indexing of duplicate pages such as those generated by the filtering system, because those dynamically generated pages will automatically be canonicalized to the main page.
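For example, the clean category URL can declare itself as canonical (URL is a placeholder):

```html
<!-- On https://example.com/category/ itself -->
<link rel="canonical" href="https://example.com/category/">
```

Because filtered variants like “example.com/category/?sort=” are rendered from the same template, they carry the same tag and therefore point back to the clean URL on their own, with no extra work per variant.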
Periodic website checks
Duplicate content must be identified in advance so as not to undermine and cannibalize the main pages. For this reason, it is recommended that you perform regular checks through the “site:” search command or through the Search Console.
If you want to avoid duplicate content, the best way is to prevent it from ever happening. It is a lot easier and faster to prevent duplicate content than to fix it.
You can use my list to find the best SEO tool for plagiarism (for your budget), periodically check your content, and prevent things like this from ever happening to your blog or website.
If you want to find out more, check out my other articles about web development.