[DEV] Search Engine Bot Validation for PHPBB

Need some custom code changes to the phpBB core that are simple enough that you feel they don't require an extension? Then post your request here so that community members can provide some assistance.

NOTE: NO OFFICIAL SUPPORT IS PROVIDED IN THIS SUB-FORUM

JLA
Registered User
Posts: 621
Joined: Tue Nov 16, 2004 5:23 pm
Location: USA
Name: JLA FORUMS

[DEV] Search Engine Bot Validation for PHPBB

Post by JLA »

As a way of giving back to the phpBB community for everything we have been thankful for over the past 20 years, we try from time to time to share some helpful bits of code we've developed that other users may find useful.

If you end up using or sharing any of this code, please give us credit in the code comments.

We use this on our phpBB2 board, but I think it can be used the same way on a phpBB3 board as well.

Since our board is pretty complicated, some things are stripped out, but this should give those who are interested a good starting point. Of course, you will want to add whatever DB insert validation you think is necessary for your setup.
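For example, here is a minimal sketch of the kind of check you might run before writing a range to the table - the helper name is our own for illustration, not part of phpBB, so adapt it to your setup:

Code: Select all

// EXAMPLE (sketch): validate a range before inserting it into sebot_ip_ranges
function is_valid_ipv4_range($ip_start, $ip_end)
{
	// Both endpoints must be well-formed IPv4 addresses
	if (filter_var($ip_start, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) === false
		|| filter_var($ip_end, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) === false)
	{
		return false;
	}

	// The start of the range must not come after the end
	return sprintf('%u', ip2long($ip_start)) <= sprintf('%u', ip2long($ip_end));
}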

Another note - even though search engines can in many cases be trusted, keep an eye on what they are doing on your site from the IP blocks they supply to you. In the past there have been incidents of less-than-honest activity from several of the major players, so consistent oversight of the activity on your board from all visitors is very important. If you do not have good oversight of your board's visitors and activity, you would be absolutely shocked by what you would find.

This mod maintains a list in your database of the valid IP blocks published by Google, Apple, and Bing, which you can then use to allow access to your board through another mod such as functions_ip_track by aUsTiN-Inc.

You'll need to run the script (at an interval of your choosing) as a cron or Windows scheduled job using at least PHP 5.3. It might work on earlier versions, but we have no record of testing it on anything earlier than 5.3, and we've stripped it back to be compatible with that version. Also check your php.ini to make sure the extensions the script needs (cURL and JSON) are enabled.
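If you are not sure what that scheduling looks like, here is one example of running it hourly with cron on Linux or a scheduled task on Windows - the file locations and PHP binary paths below are assumptions, so adjust them for your server:

Code: Select all

# Linux crontab entry - run at the top of every hour (example paths only)
0 * * * * /usr/bin/php /path/to/phpbb/googlebotipaddresses.php >> /var/log/sebot_ip_update.log 2>&1

# Windows equivalent using schtasks (example paths only)
schtasks /Create /SC HOURLY /TN "SEBotIPUpdate" /TR "C:\php\php.exe C:\path\to\googlebotipaddresses.php"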

First, you need to create the table that will hold the IP information in your phpBB database:

Code: Select all


--
-- Table structure for table `sebot_ip_ranges`
--

CREATE TABLE IF NOT EXISTS `sebot_ip_ranges` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `ip_start` varchar(15) NOT NULL,
  `ip_end` varchar(15) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `unique_ip_range` (`ip_start`,`ip_end`)
);



Then here is the script you'll need to modify to fit your setup (see the comments inside) and run to update your table:

Code: Select all

<?php
/***************************************************************************
 *                             googlebotipaddresses.php
 *                            -------------------
 *   Author  		: 	JLA FORUMS - [email protected] 
 *   Created 		: 	XXXX
 *   Last Updated	:	Saturday, Mar 08, 2025
 *
 *	 Version		: 	1.0.0 for approved shared release - JLA
 *
 ***************************************************************************/

// This is set up for a phpBB2 board, but you can easily make the necessary changes for it to work on a phpBB3 board as well.

define('IN_PHPBB', true);
$phpbb_root_path = './'; // Change this to fit your board
include($phpbb_root_path . 'extension.inc');
include($phpbb_root_path . 'common.'.$phpEx);


global $db;

// URLs to fetch Googlebot, Special Crawlers, Bingbot, and Applebot IP ranges - these should be updated if any of the services change them (which can happen). These URLs are good as of March 2025
$urls = array(
    'googlebot' => 'https://developers.google.com/static/search/apis/ipranges/googlebot.json',  // These are for Googlebot
    'special_crawlers' => 'https://developers.google.com/static/search/apis/ipranges/special-crawlers.json', // For some special Google crawlers you might want to allow
    'bingbot' => 'https://www.bing.com/toolbox/bingbot.json', // For Bing
    'applebot' => 'https://search.developer.apple.com/applebot.json' // For Applebot
);

// Important: check each service's documentation for what to add to robots.txt to disallow anything to do with AI training or AI access to your site. For example, Apple has something like Applebot-Extended, and they document what to add to robots.txt to disallow it.
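// EXAMPLE (based on the vendors' published docs - verify before relying on it): to opt out of
// Apple's AI/model training while still allowing normal Applebot crawling, Apple documents
// adding something like the following to robots.txt:
//
//   User-agent: Applebot-Extended
//   Disallow: /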

// Array to store all IP ranges
$ip_ranges = array();
$url_errors = array();

// Fetch and process data from each URL
foreach ($urls as $name => $url) {
    // Initialize cURL
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $response = curl_exec($ch);
    $http_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    // Check if the request was successful
    if ($http_code != 200 || empty($response)) {
        $url_errors[] = "Failed to fetch data from $name URL: HTTP code $http_code";
        continue; // Skip this URL and proceed to the next one
    }

    // Decode the JSON response
    $data = json_decode($response, true);
    if (json_last_error() !== JSON_ERROR_NONE || !isset($data['prefixes'])) {
        $url_errors[] = "Invalid JSON data from $name URL";
        continue; // Skip this URL and proceed to the next one
    }

    // Extract IPv4 prefixes and convert them to start-end ranges
    foreach ($data['prefixes'] as $prefix) {
        if (isset($prefix['ipv4Prefix'])) {
            list($network, $mask) = explode('/', $prefix['ipv4Prefix']);
            $mask = (int) $mask;

            // Convert the CIDR prefix to a start/end pair, e.g. 66.249.64.0/27 becomes 66.249.64.0 - 66.249.64.31
            $start_long = ip2long($network) & (-1 << (32 - $mask));
            $ip_start = long2ip($start_long);
            $ip_end = long2ip($start_long + pow(2, 32 - $mask) - 1);
            $ip_ranges[] = array(
                'ip_start' => $ip_start,
                'ip_end' => $ip_end,
            );

            // Echo the processed IP range
            echo "Processed IP range: $ip_start - $ip_end\n";
        }
    }
}

// If any URLs failed, echo the errors and stop further processing
if (!empty($url_errors)) {
    foreach ($url_errors as $error) {
        echo "$error\n";
    }
    echo "No changes were made to the database due to URL errors.\n";
    echo "-->COMPLETED WITH ERRORS!\n\n";
    exit;
}

// Fetch existing IP ranges from the database
$existing_ranges = array();
$sql = 'SELECT ip_start, ip_end FROM sebot_ip_ranges';
$result = $db->sql_query($sql);
while ($row = $db->sql_fetchrow($result)) {
    $existing_ranges[] = $row;
}
$db->sql_freeresult($result);

// Compare fetched ranges with existing ranges
$ranges_to_insert = array();
$ranges_to_keep = array();

foreach ($ip_ranges as $range) {
    if (!in_array($range, $existing_ranges)) {
        $ranges_to_insert[] = $range; // Range is new or changed
    } else {
        $ranges_to_keep[] = $range; // Range is unchanged
    }
}

// If there are new or changed ranges, update the database
if (!empty($ranges_to_insert)) {
    // Clear only the old ranges that are not in the new data
    foreach ($existing_ranges as $existing_range) {
        if (!in_array($existing_range, $ip_ranges)) {
            $sql = "DELETE FROM sebot_ip_ranges 
                    WHERE ip_start = '" . $existing_range['ip_start'] . "' 
                    AND ip_end = '" . $existing_range['ip_end'] . "'";
            $db->sql_query($sql);
        }
    }

    // Insert new or changed ranges
    foreach ($ranges_to_insert as $range) {
        $sql = "INSERT INTO sebot_ip_ranges (ip_start, ip_end) 
                VALUES ('" . $range['ip_start'] . "', '" . $range['ip_end'] . "')";
        $db->sql_query($sql);

        // Echo the IP range being written to the database
        echo "Writing to database: " . $range['ip_start'] . " - " . $range['ip_end'] . "\n";
    }
    echo "IP ranges updated successfully.\n";
} else {
    echo "No changes in IP ranges.\n";
}

echo "-->COMPLETED!\n\n";
sleep(3500); // This is good to keep if, for example, you run the script on an hourly basis and do not have any sort of error notification script or other oversight, so you can see whether you had any problems. You can comment it out if you do not need it.

?>

Now that you have this information in your database, you'll need to use it on your board. There are mods or other phpBB2 files where you could include a variation of the example code below to check each visitor's IP ($ip) before allowing access.
We also include an example of using WinCache, which caches the result in memory so you only make a single DB query each hour for this info. Using WinCache or something similar to reduce DB queries results in HUGE performance improvements for phpBB.

Code: Select all

// EXAMPLE CODE for checking IPs for valid SE BOTS to allow access to your phpBB

//add Googlebot and others to wincache and if not there pull ranges from DB -- JLA

	// WinCache key for storing/retrieving IP ranges
	$gbwincache_key = 'SEBOT_IP_RANGES';

	// Check if the IP ranges are cached in WinCache
	$ip_ranges_cached = false;
	$ip_ranges = wincache_ucache_get($gbwincache_key, $ip_ranges_cached);

	if (!$ip_ranges_cached) 
	{
		// If not cached, fetch IP ranges from the database
		$sql = 'SELECT ip_start, ip_end FROM sebot_ip_ranges';
		$result = $db->sql_query($sql);

		$ip_ranges = array();
		while ($row = $db->sql_fetchrow($result)) {
			$ip_ranges[] = array(
				'ip_start' => $row['ip_start'],
				'ip_end' => $row['ip_end'],
			);
		}
		$db->sql_freeresult($result);

		// Store the IP ranges in WinCache for future use (cache for 1 hour)
		wincache_ucache_set($gbwincache_key, $ip_ranges, 3600);
	}

	//Check SEbot IPs
	// Check if the IP is within any of the allowed ranges
	$ip_allowed = false;
	
	foreach ($ip_ranges as $range) 
	{
		$ip_start = $range['ip_start'];
		$ip_end = $range['ip_end'];

		// Check if the IP is within the range
		if (sprintf('%u', ip2long($ip)) >= sprintf('%u', ip2long($ip_start)) && sprintf('%u', ip2long($ip)) <= sprintf('%u', ip2long($ip_end))) 
		{
			$ip_allowed = true;
			break;
		}
	}
	
	if (!$ip_allowed)
	{
		// Here you can put code to deny the visitor and send them to an error page or something else.
		// Any IP that is not in the list of valid search engine bot ranges will trigger the code you put here.
	}
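
If your server doesn't run WinCache (it is a Windows-only extension), the same pattern works with any user cache. Here is a sketch of the same lookup using APCu instead - this assumes the apcu extension is installed on your server:

Code: Select all

// EXAMPLE (sketch): the same one-hour cache using APCu instead of WinCache

	// APCu key for storing/retrieving IP ranges
	$sebot_cache_key = 'SEBOT_IP_RANGES';

	// Check if the IP ranges are cached in APCu
	$ip_ranges_cached = false;
	$ip_ranges = apcu_fetch($sebot_cache_key, $ip_ranges_cached);

	if (!$ip_ranges_cached)
	{
		// If not cached, fetch IP ranges from the database
		$sql = 'SELECT ip_start, ip_end FROM sebot_ip_ranges';
		$result = $db->sql_query($sql);

		$ip_ranges = array();
		while ($row = $db->sql_fetchrow($result))
		{
			$ip_ranges[] = array(
				'ip_start' => $row['ip_start'],
				'ip_end' => $row['ip_end'],
			);
		}
		$db->sql_freeresult($result);

		// Store the IP ranges in APCu for future use (cache for 1 hour)
		apcu_store($sebot_cache_key, $ip_ranges, 3600);
	}

	// From here the range check is identical to the WinCache example above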


Let us know if you have any questions.
Last edited by thecoalman on Wed Mar 12, 2025 8:24 pm, edited 1 time in total.
Reason: Moved to Custom Coding
thecoalman
Community Team Member
Posts: 6692
Joined: Wed Dec 22, 2004 3:52 am
Location: Pennsylvania, U.S.A.

Re: [DEV] Search Engine Bot Validation for PHPBB

Post by thecoalman »

phpBB3 already has entries for the bots' IPs, but I don't think they are blocked if there is a mismatch between the IP and user agent. Someone would have to test, but I think they just get guest permissions.

It's not enabled here on phpbb.com, but Cloudflare will block spoofed user agents with invalid IPs.

I'm not so sure someone spoofing a bot's user agent is a big concern. The six gazillion random IPs from botnets with browser user agents are the concern today.
“Results! Why, man, I have gotten a lot of results! I have found several thousand things that won’t work.”

Attributed - Thomas Edison
JLA
Registered User
Posts: 621
Joined: Tue Nov 16, 2004 5:23 pm
Location: USA
Name: JLA FORUMS

Re: [DEV] Search Engine Bot Validation for PHPBB

Post by JLA »

thecoalman wrote: Wed Mar 12, 2025 9:15 pm phpBB3 already has entries for the bots' IPs, but I don't think they are blocked if there is a mismatch between the IP and user agent. Someone would have to test, but I think they just get guest permissions.

It's not enabled here on phpbb.com, but Cloudflare will block spoofed user agents with invalid IPs.

I'm not so sure someone spoofing a bot's user agent is a big concern. The six gazillion random IPs from botnets with browser user agents are the concern today.
We do a considerable amount of user validation on our phpBB board, and it has gone a long way toward combating the problem you mention. Since it is important that valid SE bots from these three providers do not have to go through any of that, this tool is helpful at the front end of our validation routines/code. All three change their IP blocks from time to time, so the tool is helpful for keeping track of those ranges automatically, allowing uninterrupted access.

The concepts in this script can also be a useful starting point for those who wish to take a more vigilant approach to user validation.

We do not use CF as we prefer to minimize third party dependencies.
Last edited by thecoalman on Fri Mar 14, 2025 2:14 am, edited 1 time in total.
Reason: Merged double post with quote where user intended to edit.
