May 5, 2019 wordpress

Generating robots.txt rules in WordPress with the robots_txt hook (requires URL rewriting support)

Author: 森林

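A robots.txt file matters for any website, especially when you are optimizing the site for search engines. During installation, WordPress asks whether search engine spiders should be discouraged from crawling the site. After installation, the same option is available in the admin dashboard, so you can change that choice at any time.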

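A brief introduction to robots.txt

robots.txt is not specific to WordPress; any website can have one. The file tells crawlers (usually search engine spiders) which paths they may crawl and which they should not. It is a gentlemen's agreement: robots.txt only states what may and may not be crawled, and whether a crawler actually complies is entirely up to the crawler.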
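
To make the "gentlemen's agreement" concrete, here is a minimal sketch of the check a well-behaved crawler might run before fetching a URL. The function name `crawler_may_fetch` is hypothetical, and the logic is deliberately simplified: it only does plain prefix matching on `Disallow` lines in the `User-agent: *` group and ignores `Allow` lines, wildcards and per-bot groups. Nothing on the server enforces this check; a crawler that skips it can still request the page.

```php
<?php
// Hypothetical helper: returns true if $path is not blocked by any
// Disallow prefix listed under "User-agent: *" in $robotsTxt.
// Simplification: plain prefix matching only; Allow lines, wildcards
// and per-bot groups are ignored.
function crawler_may_fetch(string $robotsTxt, string $path): bool
{
    $inWildcardGroup = false;
    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        $line = trim($line);
        if (stripos($line, 'User-agent:') === 0) {
            // Track whether the following rules apply to every crawler.
            $inWildcardGroup = trim(substr($line, 11)) === '*';
        } elseif ($inWildcardGroup && stripos($line, 'Disallow:') === 0) {
            $prefix = trim(substr($line, 9));
            // An empty Disallow value blocks nothing.
            if ($prefix !== '' && strpos($path, $prefix) === 0) {
                return false;
            }
        }
    }
    return true;
}
```

With the rules `User-agent: *` and `Disallow: /wp-admin/`, this sketch would refuse `/wp-admin/options.php` but allow `/hello-world/`.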

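Why is robots.txt accessible when the WordPress root directory contains no such file?

Many WordPress bloggers may not have noticed that the site root contains no robots.txt file, yet the file can still be requested. For example:

Visiting an address like https://www.liuhaolin.com/robots.txt may return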

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

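If you see the same output on your site, it tells you two things:
1. Your site supports pretty permalinks, that is, URL rewrite rules.
2. Your site does not discourage spiders from crawling.

What does the output look like when WordPress discourages spiders?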

User-agent: *
Disallow: /

Why can robots.txt be served at all?

The answer lies in the rewrite rules, so let's take a look at them.

// Debug only: this dumps every rewrite rule on each request.
add_action('init', function () {
    global $wp_rewrite; // the global WP_Rewrite instance
    var_dump($wp_rewrite->rewrite_rules());
});

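// Output (excerpt)
/*
["robots\.txt$"]=>
string(18) "index.php?robots=1" // robots=1 only marks the request as a robots.txt request;
                                // whether spiders are blocked depends on the blog_public option
*/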
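
The flow behind that rule can be sketched roughly as follows. This is a simplified illustration, not WordPress's actual source code: the function name `sketch_virtual_robots` and its parameters are hypothetical, and the two outputs simply mirror the ones shown earlier in this post (other WordPress versions may produce different text). The point is that the `robots` query variable only selects the robots.txt response, while the `blog_public` option decides its content.

```php
<?php
// Simplified illustration, not WordPress's actual source code.
// $queryVars stands in for the parsed request (e.g. ['robots' => '1']);
// $blogPublic stands in for the "blog_public" option
// ('1' = visible to search engines, '0' = discouraged).
function sketch_virtual_robots(array $queryVars, string $blogPublic): ?string
{
    // The rule robots\.txt$ => index.php?robots=1 only sets this flag.
    if (empty($queryVars['robots'])) {
        return null; // not a robots.txt request; normal page handling continues
    }

    $output = "User-agent: *\n";
    if ($blogPublic === '0') {
        $output .= "Disallow: /\n";
    } else {
        $output .= "Disallow: /wp-admin/\n";
        $output .= "Allow: /wp-admin/admin-ajax.php\n";
    }

    // In WordPress, this is roughly where the robots_txt filter is applied.
    return $output;
}
```

Because the response is produced by PHP after the rewrite, no physical robots.txt file is needed. A real file in the site root would normally be served directly by the web server, in which case this code path is never reached.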

The remaining question: generating robots.txt rules with the robots_txt hook


// The robots_txt filter receives the generated text and the value of the
// blog_public option (whether the site is visible to search engines).
add_filter('robots_txt', 'my_robots', 10, 2);
function my_robots($rules, $public) {
    // Note: this replaces the default rules entirely; $rules and $public are ignored.
    $new_rules = <<<EOT
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /*/comment-page-*
Disallow: /*?replytocom=*
Disallow: /category/*/page/
Disallow: /tag/*/page/
Disallow: /*/trackback
Disallow: /feed
Disallow: /*/feed
Disallow: /comments/feed
Disallow: /?s=*
Disallow: /*/?s=*
Disallow: /attachment/
Disallow: /author/
EOT;
    return $new_rules;
}
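
The callback above throws away whatever WordPress generated. If you would rather keep the default rules and only add a few of your own, a variant along these lines might work. It is a sketch under assumptions: the name `my_robots_append` and the three extra rules are only examples, and the `function_exists` guard is there so the function can also be run as plain PHP outside WordPress.

```php
<?php
// Hypothetical callback name. $output is the text WordPress has built so far;
// $public is the blog_public value passed as the second filter argument.
function my_robots_append($output, $public)
{
    // Leave the output alone when the site asks search engines to stay away.
    if (!$public) {
        return $output;
    }

    // Example rules only; adjust them to your own site.
    $extra = [
        'Disallow: /*?replytocom=',
        'Disallow: /?s=',
        'Disallow: /*/trackback',
    ];

    // Keep the default rules and add ours on separate lines.
    return rtrim($output) . "\n" . implode("\n", $extra) . "\n";
}

// Register only when running inside WordPress.
if (function_exists('add_filter')) {
    add_filter('robots_txt', 'my_robots_append', 10, 2);
}
```

Appending keeps the `Allow: /wp-admin/admin-ajax.php` line from the default output, which the full replacement above drops.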
