Have you ever wondered how a search engine robot analyzes a website's data for indexing?
Do you own a WordPress website? Sometimes you want Googlebot to index your website quickly, or to skip a specific page. So what should you do?
I can answer you right away: create a robots.txt file for WordPress! To help you understand this file and how to create it, I have put together the following article.
This article will guide you:
- What a robots.txt file is
- Its basic structure
- What to keep in mind when creating one
- Why your website needs one
- How to create a complete file for your website
Let’s find out!
What is a robots.txt file?
The robots.txt file is a simple .txt text file. It is part of the Robots Exclusion Protocol (REP), a group of web standards that regulate how web robots (search engine robots) crawl the web, access and index content, and serve that content to users.
REP also includes directives such as Meta Robots, along with page-, subdirectory-, and site-wide instructions that tell search engines how to treat links (e.g. Follow or Nofollow).
In practice, creating a WordPress robots.txt file gives webmasters more flexibility and control over whether Google's indexing bots are allowed into certain parts of their site.
Syntax of robots.txt file
Syntax is the language of robots.txt files. There are five common terms you will come across when creating them:
- User-agent: The name of the web crawler being addressed (e.g. Googlebot, Bingbot, …).
- Disallow: Tells the User-agent not to crawl a specific URL. Only one Disallow line can be used per URL.
- Allow (Googlebot only): Tells Googlebot that it may visit a page or subdirectory even though its parent directory is disallowed.
- Crawl-delay: Tells the web crawler how many seconds it must wait before loading and crawling the page's content. Note, however, that Googlebot does not recognize this directive; for Google, you set the crawl rate in Google Search Console instead.
- Sitemap: Provides the location of any XML sitemap associated with this URL. Note that this directive is only supported by Google, Ask, Bing, and Yahoo.
Pattern-Matching
In practice, robots.txt files are quite flexible about restricting or enabling bot access, thanks to pattern matching that can cover a broad variety of URL patterns.
To specify which files or folders should be ignored, SEOs can use either of the two wildcard characters supported by Google and Bing: the asterisk (*) and the dollar sign ($).
- * is a wildcard that matches any sequence of characters; in a User-agent line, it applies the rule to all bots.
- $ matches the end of the URL.
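For example, combining the two wildcards in Disallow rules might look like this (the paths are hypothetical):

```
User-agent: *
# Block any URL that ends in .pdf
Disallow: /*.pdf$
# Block /private/, /private-files/, and any other path starting with /private
Disallow: /private*
```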
The basic format of the robots.txt file
This file has the following basic format:
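A minimal sketch of that basic format, with placeholder paths and domain:

```
User-agent: *
Disallow: /directory-you-want-blocked/
Allow: /directory-you-want-crawled/
Crawl-delay: 10

Sitemap: https://www.example.com/sitemap.xml
```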
You can omit the Crawl-delay and Sitemap lines; the rest is the basic format of a complete WordPress robots.txt file. In reality, though, this file usually contains several User-agent lines and more directives, such as Disallow, Allow, and Crawl-delay.
In the robots.txt file, you can address many different bots, with each directive usually written on its own line.
You can also write multiple directives for the same bot consecutively, without blank lines between them. When the file contains several rules that apply to one type of bot, by default the bot will follow the most specific, most completely written rule.
Standard robots.txt file
To block all web crawlers from collecting any data on the website, including the home page, use the following syntax:
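The standard form of that rule is:

```
User-agent: *
Disallow: /
```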
To allow all crawlers to access all content on the website, including the home page, use the following syntax:
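The standard form, where an empty Disallow means nothing is blocked:

```
User-agent: *
Disallow:
```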
To block Google's crawler (User-agent: Googlebot) from crawling any page whose URL starts with www.example.com/example-subfolder/, use the following syntax:
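That rule would look like:

```
User-agent: Googlebot
Disallow: /example-subfolder/
```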
To block Bing's crawler (User-agent: Bingbot) from crawling the specific page at www.example.com/example-subfolder/blocked-page, use the following syntax:
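That rule would look like the following (note that Bing's crawler identifies itself as Bingbot):

```
User-agent: Bingbot
Disallow: /example-subfolder/blocked-page
```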
Example for standard robots.txt file
Here is an example of this file that works for the site www.example.com:
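A minimal sketch of such a file, allowing all crawlers everywhere except /wp-admin/ and pointing them to the sitemap:

```
User-agent: *
Disallow: /wp-admin/

Sitemap: https://www.example.com/sitemap_index.xml
```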
So what does this file's structure mean? Let me explain. It tells all crawlers where to find the XML sitemap, at www.example.com/sitemap_index.xml, and allows them to crawl and index every page on your website except anything under www.example.com/wp-admin/.
Why do you need to create a robots.txt file?
Creating one for your website lets you control bots' access to certain areas of your site. This can be dangerous if a few accidental mistakes leave Googlebot unable to index your website at all. Handled carefully, though, this file is genuinely useful for several reasons:
- Prevent Duplicate Content from appearing on your website (note that Meta Robots are usually a better choice for this)
- Keep some parts of the page private
- Keep internal search results pages from showing up on the SERP
- Specify the location of the Sitemap
- Prevent search engines from indexing certain files on your site (images, PDFs, …)
- Use the Crawl-delay directive to set a delay, which keeps your server from being overloaded when crawlers load a lot of content at once
If there is nothing on your website that you want to keep web crawlers away from, you do not need a robots.txt file at all.
Limitations of the robots.txt file
1. Some search engines do not support the directives in robots.txt
Not all search engines honor the directives in the robots.txt file, so to keep your data secure, your best bet is to password-protect private files on the server.
2. Each crawler parses the syntax in its own way
Reputable crawlers generally follow the standard robots.txt directives, but each search engine may interpret them differently, and some will not understand certain statements in the file. Web developers therefore need to understand the syntax of each crawling tool they care about.
3. Google can still index a URL blocked by robots.txt
Even if you have blocked a URL on your website, that URL can still appear in search results, because Google can index a URL without crawling it if other sites link to it.
For the highest security, you should delete the URL from your website if its content is not important, because the content at that URL can still surface when someone searches for it on Google.
Some notes when using robots.txt file
- You usually do not need to specify directives for each individual User-agent, because most User-agents from the same search engine follow the same general rules.
- Absolutely do not rely on this file to block private data such as user information; some bots ignore the directives in robots.txt, so its security is not high.
- To secure a website's data, the best way is to password-protect the files or URLs you do not want accessed. That said, do not overuse these directives, because sometimes they will be less effective than expected.
How does the robots.txt file work?
Search engines have 2 main tasks:
- Crawl (analyze) data on web pages to discover content
- Index that content in response to user searches
To crawl a website, search engines follow links from one page to another, ultimately crawling through billions of different web pages. This crawling process is also known as "spidering".
After arriving at a website, before spidering it, a search engine bot will look for the WordPress robots.txt file. If it finds one, it reads that file first before proceeding.
The robots.txt file contains information about how the search engine should crawl your website; it gives the bots more specific guidance for this process.
If the robots.txt file contains no directives for the bot's User-agent, or if you have not created the file at all, the bots will simply proceed to crawl the rest of the site.
Where is the robots.txt file located on a website?
When you create a WordPress website, WordPress automatically generates a robots.txt file in the server's root directory.
For example, if your site sits at the root of the domain sharetool.net, you can access the robots.txt file at sharetool.net/robots.txt, and the initial output will look something like this:
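The exact contents depend on your WordPress setup, but a typical default looks something like this:

```
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
```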
As I said above, User-agent: * means the rule applies to every type of bot, everywhere on the website. In this case, the file tells bots that they are not allowed into the wp-admin and wp-includes directories. Very reasonable, isn't it, since these two folders contain many sensitive files.
Remember that this is a virtual file, which WordPress generates by default on installation and which cannot be edited directly (although it still works). The standard WordPress robots.txt location is the root directory, commonly called public_html or www (or the website's name). To customize this file, you need to create a new physical file in that root directory to replace the virtual one.
In the section below, I will show you several easy ways to create one for WordPress. But first, research the rules you should use in this file.
How to check if the website has a robots.txt file?
If you are wondering whether your website has a robots.txt file, enter your root domain and add /robots.txt to the end of the URL. If no .txt page appears, then your website does not have one. Very simple! You can check whether my website sharetool.net generates the file in the same way:
Type the root domain > append /robots.txt > press Enter. Then wait for the result!
What rules should be added to the WordPress robots.txt file?
So far, every example has dealt with a single ruleset at a time. But what if you need to treat individual robots differently?
For each robot, simply provide its own ruleset under a declaration for its User-agent.
Here’s an illustration of how to make a rule that applies to all bots and another that applies specifically to Bingbot:
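A sketch of those two rulesets side by side:

```
User-agent: *
Disallow: /wp-admin/

User-agent: Bingbot
Disallow: /
```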
Here, all bots will be blocked from accessing /wp-admin/, while Bingbot will be blocked from accessing your entire site.
3 simple methods to create a robots.txt WordPress file
If after checking you find that your website has no robots.txt file, or you simply want to change this file, refer to the three ways to create robots.txt for WordPress below:
1. Use Yoast SEO
You can edit or create a robots.txt file for WordPress right from the WordPress Dashboard with a few simple steps. Log in to your website; once logged in, you will see the Dashboard interface.
On the left side of the screen, click SEO > Tools > File editor.
The file editor feature will not appear if your WordPress installation does not have file editing enabled, so enable it via FTP (File Transfer Protocol) first.
You will now see the robots.txt and .htaccess sections, and this is where you can create the file.
2. Through the All in One SEO Plugin set
The All in One SEO Pack plugin is another option for easily making a robots.txt file in WordPress. It is a basic and straightforward WordPress plugin.
To make a robots.txt file for WordPress, go to the plugin's main interface: navigate to All in One SEO > Features Manager > choose Enabled under robots.txt.
Now, the UI with all its cool options will appear:
The robots.txt section will then appear as a new tab in the main All in One SEO area, where you can create and modify the WordPress robots.txt file.
However, this plugin differs a bit from the Yoast SEO plugin mentioned above.
All in One SEO greys out the robots.txt rules instead of letting you edit the file freely the way Yoast SEO does. This can make editing feel a bit restrictive, but on the positive side, it helps limit damage to your website; in particular, some malware bots could otherwise harm your site without your knowledge.
3. Create and upload robots.txt file via FTP
If you do not want to use a plugin to create your WordPress robots.txt file, I have another way for you: create the file manually.
It only takes a few minutes. Use Notepad or TextEdit to create a WordPress robots.txt file following the rules introduced at the beginning of this article, then upload the file via FTP. The process is simple, does not take much time, and requires no plugin.
Some rules when creating robots.txt file
- To be found by bots, the WordPress robots.txt file must be placed in the top-level directory of the site.
- The filename is case-sensitive: it must be robots.txt, not Robots.txt or robots.TXT.
- Do not put /wp-content/themes/ or /wp-content/plugins/ in the Disallow section; that would prevent the tools from seeing exactly how your blog or website looks.
- Some User-agents choose to ignore your standard robots.txt file. This is quite common with nefarious User-agents such as:
– Malware robots (bots that distribute malicious code)
– Email address scraping processes
- These files are generally public on the web: seeing a site's instructions is as simple as appending /robots.txt to the end of any root domain. This means everyone can see your list of crawlable and non-crawlable pages, so you must not use these files to conceal private user information.
- Each subdomain on a root domain uses its own robots.txt file. This means that blog.example.com and example.com should each have a separate file (blog.example.com/robots.txt and example.com/robots.txt). It is also recommended practice to point search engines to the domain's sitemaps at the end of the robots.txt file.
Some notes when using robots.txt file
Make sure you’re not blocking any content or parts of your site that you want Google to index.
Links on pages blocked by robots.txt will not be followed by bots. Unless the linked pages are also reachable from elsewhere (pages not blocked by robots.txt, Meta Robots, etc.), those linked resources may not be crawled and indexed.
Link juice will not be passed from blocked pages to the pages they link to. So if you want link juice to flow through those pages, use another method instead of blocking them in the WordPress robots.txt.
Do not use the robots.txt file to keep sensitive material (such as private user information) out of search results. Because many other websites may link directly to such a page, bots can still discover, crawl, and index the URL, exposing the private information.
If you want to keep a page out of search results, use another method instead, such as password protection or a Noindex meta directive. Also note that some search engines use many User-agents; for example, Google uses Googlebot for web search and Googlebot-Image for image search.
Most User-agents from the same engine follow the same rules, so you do not need to specify directives for each one; still, doing so can help you fine-tune how your site's content is indexed.
Search engines cache the content of the WordPress robots.txt file, but they usually refresh that cache at least once a day. If you change the file and want the update picked up faster, use the Submit function of Google's robots.txt Tester.
Frequently asked questions about robots.txt
Here are some frequently asked questions, which may be your questions about robots.txt now:
What is the maximum size of robots.txt file?
500 kilobytes (approx.).
Where is the WordPress robots.txt file located on the website?
At the location: domain.com/robots.txt.
How to edit robots.txt WordPress?
You can do it manually or use one of the many WordPress SEO plugins like Yoast which allows you to edit this file from the WordPress backend.
What happens if I Disallow a page that has a Noindex directive in robots.txt?
Google will never see the Noindex directive, because it cannot crawl the page.
I use the same robots.txt file for multiple sites. Can I use a full URL instead of a relative path?
No, the directives in it (except the Sitemap: line) only apply to relative paths.
How can I suspend all of my site’s crawling?
You can suspend all crawling by returning an HTTP 503 status code for every URL, including the robots.txt file itself. You should not edit the robots.txt file to block crawling.
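As a sketch, assuming an Apache server with mod_rewrite enabled, a temporary rule in .htaccess like the following would answer every request, robots.txt included, with a 503 (delete the rule to resume crawling):

```
# Temporary: return 503 Service Unavailable for every request
RewriteEngine On
RewriteRule ^ - [R=503]
```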
Now it's your turn! Do you now know what a robots.txt file is? Have you checked whether your website has one? Create and edit your own WordPress robots.txt file to help search engine bots crawl and index your site quickly.