[BETA] BartVB SiteMap Generator

Posted: Mon Apr 28, 2008 11:24 pm
by BartVB
I needed a sitemap generator for my site (http://www.bokt.nl) and used this mod as a starting point. I ended up rewriting a large part of it though :)

Things I've changed:

- Permissions are taken into account to decide what forums/topics to put in the sitemap
- Sitemap generation code has been simplified
- It's able to handle large amounts of topics (approx 1M on my board in under a minute)
- Use proper <lastmod> in sitemap index
- Use proper <lastmod> in sitemaps
- Support for a 'sitemap' directory to avoid annoying permission problems
- Config variables all start with 'sitemap_' to avoid clashes with other mods/phpBB updates

These config values are used:

sitemap_seo_mod - (boolean) Not using this seo mod myself, no idea if I broke something.
sitemap_gzip - (boolean)
sitemap_ping - (boolean) ping Google?
sitemap_show_stats - (boolean) probably broken in my version :) Not using the stats
sitemap_bot_user - user_id of a bot user. I use this to calculate what forums are accessible
sitemap_directory - Directory (with trailing /) to use for the sitemaps

TODO (and I'm probably not going to do it :P)
- Fix the admin screen to account for the new config variables (most notably bot_user and directory)
- Take approval of topics into account
- Fix (or remove) the stats, counting the number of <loc> tokens in the files is not the way to go IMO :)
- Automate generating the sitemaps
- Create the config variables during installation time

This is an example of a generated index sitemap:

http://www.bokt.nl/sitemap_forum.xml.gz

Google was able to parse all that without a hitch.
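
For anyone unfamiliar with the format: a sitemap index is a small XML file that lists the individual sitemap files, each with its own <lastmod>. Below is a minimal sketch of writing one. The function name build_sitemap_index, the input array shape and the example.com URL are made up for illustration and are not taken from the mod.

```php
<?php
// Hypothetical helper, not part of the mod: builds a sitemap index from an
// array of sitemap URL => Unix timestamp of the newest entry in that sitemap.
function build_sitemap_index(array $sitemaps)
{
    $xml = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    $xml .= '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
    foreach ($sitemaps as $url => $last_modified)
    {
        $xml .= "<sitemap>\n";
        // URLs must be entity-escaped (& becomes &amp; and so on)
        $xml .= '<loc>' . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . "</loc>\n";
        // W3C datetime, always written in UTC here
        $xml .= '<lastmod>' . gmdate('Y-m-d\TH:i:s+00:00', $last_modified) . "</lastmod>\n";
        $xml .= "</sitemap>\n";
    }
    $xml .= "</sitemapindex>\n";
    return $xml;
}

// Example call with a made-up URL and timestamp
echo build_sitemap_index(array(
    'http://www.example.com/sitemaps/sitemap_topics_1.xml.gz' => 1209417840,
));
```

The <lastmod> of each entry would normally be the time of the most recent post covered by that sitemap file, so a search engine can skip files that haven't changed.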

My version can be found at:

http://www.typo.nl/misc/SiteMap_Generator_BartVB.tgz [edit: Replaced by new version below]

Do with it what you want. I'm not going to give support to end users or add features I don't need/want :) Everything works on my server with my settings; I haven't tested beyond that.

Again; I have no intention of maintaining or supporting this mod! I have modified it for my own needs and I am contributing my modifications back to the community. I'm hoping that someone (Joshua2100 perhaps?) will integrate some or all of my changes in his mod.

Re: [BETA] SiteMap Generator

Posted: Tue Apr 29, 2008 6:53 am
by Dogs and things
Great stuff Bart,

If you ever get to do your TODO please let us know the results.

Thanks so far. :P

Re: [BETA] SiteMap Generator

Posted: Tue Apr 29, 2008 7:24 am
by UncleVIBES
I see only the XML file but no other root files. It's the SEO conversion that I need changes to, though, and you didn't touch that part of the mod.

Re: [BETA] SiteMap Generator

Posted: Tue Apr 29, 2008 8:16 am
by BartVB
Dogs and things wrote: If you ever get to do your TODO please let us know the results.
As I said; it's not a TODO list I plan to finish. It's not even my todo list, I don't need the items on that list. I don't use the admin page, don't use the stats, already have automated generation of the sitemaps. So if you want to see the TODO list finished you either have to do this yourself or wait for someone else to fix that for you :)

UncleVIBES: Here's a .zip version:
http://www.typo.nl/misc/SiteMap_Generator_BartVB.zip [edit: Replaced by new version below]

The only files I changed are /includes/sitemap_functions.php and /includes/acp/acp_sitemap_generator.php (all based on B4 by Joshua2100).

Re: [BETA] SiteMap Generator

Posted: Tue Apr 29, 2008 10:31 am
by Aexo
I get a 500 "Internal Server Error" when I click "run now" with the "SEO mod" option activated. I do have the SEO mod installed.
A few days ago I could run this mod and got an XML file with links in .html format, and I don't want to lose that. I wouldn't like to have an XML file without the SEO mod URLs. Any ideas?

BartVB, your mod doesn't work for me.

Code: Select all

General Error
SQL ERROR [ mysql4 ]

You have an error in your SQL syntax. Check the manual that corresponds to your MySQL server version for the right syntax to use near '' at line 3 [1064]

SQL

SELECT forum_id, auth_option_id, auth_role_id, auth_setting FROM bbb_acl_users WHERE user_id = 

BACKTRACE

FILE: includes/db/mysql.php
LINE: 158
CALL: dbal_mysql->sql_error()

FILE: includes/auth.php
LINE: 773
CALL: dbal_mysql->sql_query()

FILE: includes/auth.php
LINE: 345
CALL: auth->acl_raw_data_single_user()

FILE: includes/auth.php
LINE: 71
CALL: auth->acl_cache()

FILE: includes/sitemap_functions.php
LINE: 40
CALL: auth->acl()

FILE: includes/acp/acp_sitemap_generator.php
LINE: 47
CALL: generate_sitemap()

FILE: includes/functions_module.php
LINE: 471
CALL: acp_sitemap_generator->main()

FILE: adm/index.php
LINE: 75
CALL: p_master->load_active()
Thanks
http://www.clubifone.con

Re: [BETA] SiteMap Generator

Posted: Tue Apr 29, 2008 10:51 am
by Dogs and things
I get the exact same error, I don't have SEO MOD activated.

Btw. This Error is produced using Bart's version, with the original version I didn't get this error.

Re: [BETA] SiteMap Generator

Posted: Tue Apr 29, 2008 11:50 am
by BartVB
My version is not really ready for casual use, it needs some manual nurturing :) I guess 'improved error handling' needs to be on the TODO list too :D

Most important is that you create a 'sitemap_bot_user' value in the config table that contains the ID of the bot user that you want to use the permissions of. So you need to do something like:

insert into phpbb_config (config_name, config_value) values ('sitemap_bot_user', '28508');
insert into phpbb_config (config_name, config_value) values ('sitemap_directory', 'sitemaps/');

where 28508 is the user_id of my 'GoogleBot' user. If you use a sitemaps directory you need to create that directory under your $phpbb_root_path and allow the webserver to write in that directory.
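
A quick way to rule out the directory as the culprit is a small pre-flight check like the sketch below. It is not part of the mod: the function name sitemap_directory_problem is made up, and it only assumes that the sitemaps are written to $phpbb_root_path . $config['sitemap_directory'] as described above.

```php
<?php
// Hypothetical pre-flight check, not part of the mod. Returns an error
// message, or an empty string when the directory looks usable.
function sitemap_directory_problem($phpbb_root_path, $sitemap_directory)
{
    // The config value is expected to carry its own trailing slash
    if ($sitemap_directory !== '' && substr($sitemap_directory, -1) !== '/')
    {
        return 'sitemap_directory needs a trailing slash';
    }
    $path = $phpbb_root_path . $sitemap_directory;
    if (!is_dir($path))
    {
        return "directory $path does not exist";
    }
    if (!is_writable($path))
    {
        return "directory $path is not writable by the webserver";
    }
    return '';
}
```

Run it as the same user the webserver runs as, otherwise the is_writable() result says nothing about what the mod will be able to do.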

BTW I noticed two small (but crucial :D) bugs in my code that result in incorrect URLs in the sitemaps. New version is at:

http://www.typo.nl/misc/SiteMap_Generator_BartVB2.zip

only includes/sitemap_functions.php has changed.

Re: [BETA] SiteMap Generator

Posted: Tue Apr 29, 2008 12:24 pm
by Dogs and things
I'm sorry Bart,

I created the bot user in phpbb_config by running your query as posted and replaced includes/sitemap_functions.php with the new one, but I get the exact same error.

Should I do anything more to make it work?

Re: [BETA] SiteMap Generator

Posted: Tue Apr 29, 2008 12:38 pm
by BartVB
Clear your cache :) phpBB caches most config variables. If you use the default file cache there should be a file like 'global_vars.php' or something like that in /forums/cache/. Delete that file to get rid of the cached config values.

Re: [BETA] SiteMap Generator

Posted: Tue Apr 29, 2008 1:04 pm
by Dogs and things
Makes no difference :(
SQL ERROR [ mysqli ]

You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '' at line 3 [1064]

SQL

SELECT forum_id, auth_option_id, auth_role_id, auth_setting FROM phpbb_acl_users WHERE user_id =

BACKTRACE

FILE: includes/db/mysqli.php
LINE: 143
CALL: dbal->sql_error()

FILE: includes/auth.php
LINE: 773
CALL: dbal_mysqli->sql_query()

FILE: includes/auth.php
LINE: 345
CALL: auth->acl_raw_data_single_user()

FILE: includes/auth.php
LINE: 71
CALL: auth->acl_cache()

FILE: includes/sitemap_functions.php
LINE: 40
CALL: auth->acl()

FILE: includes/acp/acp_sitemap_generator.php
LINE: 47
CALL: generate_sitemap()

FILE: includes/functions_module.php
LINE: 471
CALL: acp_sitemap_generator->main()

FILE: adm/index.php
LINE: 74
CALL: p_master->load_active()

Re: [BETA] SiteMap Generator

Posted: Tue Apr 29, 2008 1:48 pm
by BartVB
This is the query:

Code: Select all

$sql = 'SELECT *
                FROM ' . USERS_TABLE . '
                WHERE user_id = ' . intval($config['sitemap_bot_user']);
So apparently in your case $config['sitemap_bot_user'] is empty... A quick (and really dirty :P) fix is replacing that code with:

Code: Select all

$sql = 'SELECT *
                FROM ' . USERS_TABLE . '
                WHERE user_id = 1234' ;
Where 1234 is the user_id of your bot user.
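
A slightly less dirty variant is to check the config value before building the query and fall back to a known user when it is missing. This is only a sketch: the helper name sitemap_bot_user_id is made up, the table name is written out literally instead of using USERS_TABLE so the snippet runs on its own, and falling back to user_id 1 (phpBB's anonymous/guest account) is an assumption about a sensible default, not what the mod does.

```php
<?php
// Hypothetical helper, not part of the mod: picks the user_id whose
// permissions decide what ends up in the sitemap.
function sitemap_bot_user_id(array $config, $fallback_user_id = 1)
{
    $user_id = isset($config['sitemap_bot_user']) ? (int) $config['sitemap_bot_user'] : 0;
    // A missing or empty config value casts to 0, which matches no user
    return ($user_id > 0) ? $user_id : (int) $fallback_user_id;
}

// Example: the config value is empty, so the fallback user is used
$sql = 'SELECT *
    FROM phpbb_users
    WHERE user_id = ' . sitemap_bot_user_id(array('sitemap_bot_user' => ''));
```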

Re: [BETA] SiteMap Generator

Posted: Tue Apr 29, 2008 8:26 pm
by Dogs and things
Okay,

I got it working nicely now, although I made some changes to your code.

I chose to simply use user_id 1, the anonymous user, because this user has guest permissions, so forums without guest read permission are excluded too.

I cleaned out the bot_auth stuff.

I also cleaned the url-output. For this I changed

Code: Select all

				$f_xml .= '<loc>' . $board_url . "viewtopic.$phpEx?f=" . $row['forum_id'] . '&t=' . $row['topic_id'] . "</loc>\n";
into

Code: Select all

				$f_xml .= '<loc>' . $board_url . "viewtopic.$phpEx?t=" . $row['topic_id'] . "</loc>\n";
As I have a small board and it will be quite some time before I reach the 50,000 URL mark, I also took out the part that generates the sitemap_index.xml.

I removed the part that generates a sitemap_forums.xml as I don't think Google needs it.

Finally I shortened the name of the sitemap_topics_xx.xxx-xx.xxx.xml file and now it is written as simply sitemap.xml.

I'm not sure if I did everything cleanly but the result is a perfect sitemap.xml with a list of all topics that have guest access and, as lastmod date, the date/time of the last post in that topic.

I figure that this is exactly what Google needs, although I don't know if <changefreq> and <priority> are also welcomed by Google. My guess is that it likes those too.

What do you think?

Re: [BETA] SiteMap Generator

Posted: Tue Apr 29, 2008 9:02 pm
by BartVB
It doesn't hurt to have an index file, Google (and the other search engines) know perfectly well how to handle them :) But I can see your point for a small forum although small forums can get big :) Took a quick look and my board had over 50k topics after 16 months and the site wasn't exactly huge at that time...

I wouldn't change the URLs because

viewtopic.php?f=1&t=1234

is what phpBB uses by default. It's what you see on all viewforum.php pages, it's what AdSense sees, it's what people link to. For Google

viewtopic.php?f=1&t=1234
and
viewtopic.php?t=1234

are two completely separate pages...

Using user 1 is a smart idea, would also be a sane default. I'm using the GoogleBot user because that user can do even less than a guest on the board. There are some forums that I don't want to have in the Google index.

What do you mean with 'I cleaned out the bot_auth stuff'?

<priority> and <changefreq> are probably nice but they are superfluous if you do something like:

Code: Select all

print "<priority>0.5</priority>\n";
:) IMO those are only useful if you assign sane and useful values to them. I can imagine that 'priority' is related to topic_views, the hard part would be normalising the range of values for topic_views. <changefreq> can be related to the age (topics older than 2 months can probably be set to 'never') but it wouldn't surprise me if Google does that already (take the last modified date into account), hmm, not sure if phpBB actually sends a 'Last-Modified' header on viewtopic.php
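
To make that a bit more concrete, here is a rough sketch of one way it could be done. Everything in it is an assumption for illustration: the function names, the logarithmic scaling of topic_views and the age thresholds are not from the mod and have not been tried on a real board.

```php
<?php
// Hypothetical: map topic_views onto the 0.1 .. 1.0 range for <priority>.
// A log scale keeps a handful of very popular topics from pushing
// everything else down to the minimum.
function sitemap_priority($topic_views, $max_topic_views)
{
    if ($max_topic_views < 1)
    {
        return '0.5'; // the sitemap protocol's default priority
    }
    $ratio = log(1 + max(0, $topic_views)) / log(1 + $max_topic_views);
    return number_format(0.1 + 0.9 * min(1.0, $ratio), 1, '.', '');
}

// Hypothetical: derive <changefreq> from the age of the last post.
function sitemap_changefreq($last_post_time, $now)
{
    $age_in_days = ($now - $last_post_time) / 86400;
    if ($age_in_days < 1)  { return 'daily'; }
    if ($age_in_days < 7)  { return 'weekly'; }
    if ($age_in_days < 60) { return 'monthly'; }
    return 'never'; // roughly "older than 2 months"
}
```

$max_topic_views would have to come from something like a MAX(topic_views) query over the topics that go into the sitemap. Whether search engines do anything useful with either value is a separate question.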

Re: [BETA] SiteMap Generator

Posted: Tue Apr 29, 2008 9:28 pm
by Dogs and things
In that case you did a good job using a bot_user as your sitemap creator. I'm glad to hear user_id 1 sounds good to you too. ;)

I'll have a look at the URL part, although I guess you are right. Thing is that I'm still running my live board on phpBB2 and am only testing phpBB3 on localhost. For my live phpBB2 I am using URLs as I described, simply http://www.anything.com/phpBB2/viewtopic.php?t=2118 which is good. As I have some mods installed that add other things to the topic URLs, I think it's good to simply point Google to the clean topic URL and let it decide what URL will be shown on the SERPs.

Looking at it I see that phpBB3 writes urls differently, as you indicate, so I'll undo that change.
BartVB wrote: What do you mean with 'I cleaned out the bot_auth stuff'?
For instance:

Code: Select all

	$bot_userdata = $db->sql_fetchrow($result);
	$bot_auth = new Auth();
	$bot_auth->acl($bot_userdata);
and

Code: Select all

	$bot_no_access = array();
	$bot_forum_auth_ary = $bot_auth->acl_getf('!f_read');
	foreach ($bot_forum_auth_ary as $forum_id => $not_allowed)
	{
		if ($not_allowed['f_read'])
		{
			$bot_no_access[] = (int) $forum_id;
		}
	}
BartVB wrote: IMO those are only useful if you assign sane and useful values to them. I can imagine that 'priority' is related to topic_views, the hard part would be normalising the range of values for topic_views. <changefreq> can be related to the age (topics older than 2 months can probably be set to 'never') but it wouldn't surprise me if Google does that already (take the last modified date into account), hmm, not sure if phpBB actually sends a 'Last-Modified' header on viewtopic.php
At the moment I am using a sitemap generator on my phpBB2 that actually does give good output for <changefreq> and <priority>. Here is a copy of that file; have a look at how it does that if you want, maybe you'll get some ideas out of it. I'm by no means a coder so I can't take parts out of one file and implement them in another. :cry:

Code: Select all

<?php

$phpbb_root_path = '';  // must be adjusted if this file sits one level below the forum directory.

define('IN_PHPBB', true);
include($phpbb_root_path . 'extension.inc');
include($phpbb_root_path . 'common.'.$phpEx);

$userdata = session_pagestart($user_ip, PAGE_INDEX);
init_userprefs($userdata);
$script_name = preg_replace('/^\/?(.*?)\/?$/', "\\1", trim($board_config['script_path']));
$server_name = trim($board_config['server_name']);
$server_protocol = ( $board_config['cookie_secure'] ) ? 'https://' : 'http://';
$server_port = ( $board_config['server_port'] <> 80 ) ? ':' . trim($board_config['server_port']) . '/' : '/';
$server_url = $server_protocol . $server_name . $server_port . $script_name;
if(substr($server_url, -1, 1) != "/") {   $server_url .= "/"; }
$server_url = code_utf8($server_url);
$zeit = time();
$pre_timezone = date('O', $zeit);
$time_zone = substr($pre_timezone, 0, 3).":".substr($pre_timezone, 3, 2);
$topics_per_page = $board_config['posts_per_page'];


// read in the forums:
$max_priority = 0;$anz_forums = 0;
$result = mysql_query("SELECT forum_id,forum_posts,auth_read FROM " . FORUMS_TABLE ) ;
while( $row =  mysql_fetch_assoc($result)) {
   if($row['auth_read']==0) { // if the forum is readable by everyone
      $forum_ist_lesbar[$row['forum_id']] = true;
      $max_priority = max($max_priority,$row['forum_posts']);
      $anz_forums++;
      $forum_nr[$anz_forums] = $row['forum_id'];
      $forum_priority[$forum_nr[$anz_forums]] = $row['forum_posts'];
   }
}

$result = mysql_query("SELECT topic_id,post_time  FROM " . POSTS_TABLE ) ;
while( $row =  mysql_fetch_assoc($result)) {
   if($last_time[$row["topic_id"]] < $row["post_time"]) { $last_time[$row["topic_id"]] = $row["post_time"]; }
}

echo '<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.google.com/schemas/sitemap/0.84 http://www.google.com/schemas/sitemap/0.84/sitemap.xsd">
';


$result = mysql_query("SELECT topic_id,forum_id,topic_replies,topic_time FROM " . TOPICS_TABLE . " ORDER BY topic_id DESC LIMIT 50000") ;
while( $row =  mysql_fetch_assoc($result)) {
   if($forum_ist_lesbar[$row["forum_id"]]) {
      $row["topic_replies"]++;
      $topicId = $row["topic_id"] ;
      $alter = $zeit - $last_time[$row["topic_id"]] ;
      $zeitbonus = max(min(round((600000 - $alter ) / 80000),7),0) ;
      $last_time_post = date("Y-m-d\TH:i:s",$last_time[$row["topic_id"]]) . $time_zone ;
      if ($alter < 604800) { $changefreq = 'weekly'; } else { $changefreq = 'monthly'; }
      if ($alter < 84000) { $changefreq = 'daily'; }
      $topicpriority = min(( $row["topic_replies"] + $zeitbonus ),9) ;
      $page = '';
      $start = 0;
      while($row["topic_replies"]>0) {
echo "<url>
<loc>".$server_url."viewtopic.php?t=$topicId$page</loc>
<lastmod>$last_time_post</lastmod>
<changefreq>$changefreq</changefreq>
<priority>0.$topicpriority</priority>
</url>
";
         $row["topic_replies"] = $row["topic_replies"] - $topics_per_page;
         $start = $start + $topics_per_page;
         $page = '&start='.$start;
      }
   }
}


echo '</urlset>';


function code_utf8($text) {
   // characters that must be escaped in XML, and their entities ("&" first)
   $array_1 = array("&","\"","'",">","<");
   $array_2 = array("&amp;","&quot;","&apos;","&gt;","&lt;");
   for($x=0;$x<5;$x++){
      $text = str_replace($array_1[$x],$array_2[$x],$text);
   }
   return $text;
}
?>
Anyhow, I think the most important part of having a sitemap in place is that Google is able to index a board better and most importantly faster, it loses less time looking for new content and can very easily re-index.
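
For what it's worth, on a newer PHP both the escaping and the timestamp formatting in that script can each be done with one built-in call. A small sketch, assuming PHP 5.4 or later (needed for ENT_XML1); the function name sitemap_url_entry and the example URL are made up here and are not part of either script:

```php
<?php
// Hypothetical helper: renders one <url> entry for a sitemap.
function sitemap_url_entry($url, $last_post_time)
{
    // With ENT_QUOTES | ENT_XML1 the characters & < > " ' become XML entities
    $loc = htmlspecialchars($url, ENT_QUOTES | ENT_XML1, 'UTF-8');
    // 'c' gives an ISO 8601 date including the timezone offset
    $lastmod = date('c', $last_post_time);
    return "<url>\n<loc>$loc</loc>\n<lastmod>$lastmod</lastmod>\n</url>\n";
}

// Example call with a made-up URL
echo sitemap_url_entry('http://www.example.com/phpBB2/viewtopic.php?t=2118&start=15', time());
```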

Re: [BETA] BartVB SiteMap Generator

Posted: Tue Apr 29, 2008 9:53 pm
by Highway of Life