[BETA] BartVB SiteMap Generator

A place for MOD Authors to post and receive feedback on MODs still in development. No MODs within this forum should be used within a live environment!
BartVB
Consultant
Posts: 1288
Joined: Thu Aug 02, 2001 1:32 pm
Location: The Netherlands

Post by BartVB »

I needed a sitemap generator for my site (http://www.bokt.nl) and used this mod as a starting point. I ended up rewriting a large part of it though :)

Things I've changed:

- Permissions are taken into account to decide what forums/topics to put in the sitemap
- Sitemap generation code has been simplified
- It can handle a large number of topics (approx. 1M on my board in under a minute)
- Use proper <lastmod> in sitemap index
- Use proper <lastmod> in sitemaps
- Support for a 'sitemap' directory to avoid annoying permission problems
- Config variables all start with 'sitemap_' to avoid clashes with other mods/phpBB updates
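
To make the <lastmod> point concrete, here is a minimal sketch of an index file that carries a <lastmod> per sitemap. It is not the code from the archive: the function name build_sitemap_index() and the shape of the input array (sitemap URL => Unix timestamp of the newest post in that sitemap) are assumptions made for this example.

```php
<?php
// Hypothetical helper, not part of the mod: builds a sitemap index
// from a map of sitemap URL => Unix timestamp of its newest entry.
function build_sitemap_index(array $sitemaps)
{
    $xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    $xml .= '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
    foreach ($sitemaps as $url => $last_modified)
    {
        $xml .= "<sitemap>\n";
        // URLs inside XML must be entity-escaped ('&' becomes '&amp;').
        $xml .= '<loc>' . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . "</loc>\n";
        // W3C datetime format, expressed in UTC.
        $xml .= '<lastmod>' . gmdate('Y-m-d\TH:i:s', $last_modified) . "+00:00</lastmod>\n";
        $xml .= "</sitemap>\n";
    }
    $xml .= "</sitemapindex>\n";
    return $xml;
}
```

The timestamp would typically come from the most recent post covered by each sitemap file, so a crawler can skip files that have not changed.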

These config values are used:

sitemap_seo_mod - (boolean) Not using this seo mod myself, no idea if I broke something.
sitemap_gzip - (boolean)
sitemap_ping - (boolean) ping Google?
sitemap_show_stats - (boolean) probably broken in my version :) Not using the stats
sitemap_bot_user - user_id of a bot user. I use this to calculate what forums are accessible
sitemap_directory - Directory (with trailing /) to use for the sitemaps

TODO (and I'm probably not going to do it :P)
- Fix the admin screen to account for the new config variables (most notably bot_user and directory)
- Take approval of topics into account
- Fix (or remove) the stats, counting the number of <loc> tokens in the files is not the way to go IMO :)
- Automate generating the sitemaps
- Create the config variables during installation time

This is an example of a generated index sitemap:

http://www.bokt.nl/sitemap_forum.xml.gz

Google was able to parse all that without a hitch.

My version can be found at:

http://www.typo.nl/misc/SiteMap_Generator_BartVB.tgz [edit: Replaced by new version below]

Do with it what you want. I'm not going to give support to end users or add features I don't need/want :) Everything works on my server with my settings; I haven't tested beyond that.

Again; I have no intention of maintaining or supporting this mod! I have modified it for my own needs and I am contributing my modifications back to the community. I'm hoping that someone (Joshua2100 perhaps?) will integrate some or all of my changes in his mod.
I Hate oversized sigs and Love Penguins :D
Dogs and things
Registered User
Posts: 2114
Joined: Fri Sep 01, 2006 9:04 am
Location: Spain

Re: [BETA] SiteMap Generator

Post by Dogs and things »

Great stuff Bart,

If you ever get to do your TODO please let us know the results.

Thanks so far. :P
For phpBB2 support visit phpBB2refugees.
UncleVIBES
Registered User
Posts: 64
Joined: Sat Jun 09, 2007 5:08 am
Location: France

Post by UncleVIBES »

I only see the XML file, but none of the other root files. What I need changed is the SEO conversion, and you didn't touch that part of the mod.

Post by BartVB »

Dogs and things wrote: If you ever get to do your TODO please let us know the results.
As I said; it's not a TODO list I plan to finish. It's not even my todo list, I don't need the items on that list. I don't use the admin page, don't use the stats, already have automated generation of the sitemaps. So if you want to see the TODO list finished you either have to do this yourself or wait for someone else to fix that for you :)

UncleVIBES: Here's a .zip version:
http://www.typo.nl/misc/SiteMap_Generator_BartVB.zip [edit: Replaced by new version below]

The only files I changed are /includes/sitemap_functions.php and /includes/acp/acp_sitemap_generator.php (all based on B4 by Joshua2100).
Aexo
Registered User
Posts: 20
Joined: Sun Dec 09, 2007 12:27 pm

Post by Aexo »

I get a 500 "Internal Server Error" when I click "run now" with the "SEO mod" option activated. I do have the SEO mod installed.
A few days ago I could run this mod and got an XML file with links in .html format, and I don't want to lose that. I wouldn't like to have an XML file without the SEO mod URLs. Any ideas?

BartVB, your mod doesn't work for me.

Code: Select all

Error General
SQL ERROR [ mysql4 ]

You have an error in your SQL syntax. Check the manual that corresponds to your MySQL server version for the right syntax to use near '' at line 3 [1064]

SQL

SELECT forum_id, auth_option_id, auth_role_id, auth_setting FROM bbb_acl_users WHERE user_id = 

BACKTRACE

FILE: includes/db/mysql.php
LINE: 158
CALL: dbal_mysql->sql_error()

FILE: includes/auth.php
LINE: 773
CALL: dbal_mysql->sql_query()

FILE: includes/auth.php
LINE: 345
CALL: auth->acl_raw_data_single_user()

FILE: includes/auth.php
LINE: 71
CALL: auth->acl_cache()

FILE: includes/sitemap_functions.php
LINE: 40
CALL: auth->acl()

FILE: includes/acp/acp_sitemap_generator.php
LINE: 47
CALL: generate_sitemap()

FILE: includes/functions_module.php
LINE: 471
CALL: acp_sitemap_generator->main()

FILE: adm/index.php
LINE: 75
CALL: p_master->load_active()
Thanks
http://www.clubifone.con

Post by Dogs and things »

I get the exact same error, I don't have SEO MOD activated.

Btw, this error is produced using Bart's version; with the original version I didn't get this error.

Post by BartVB »

My version is not really ready for casual use, it needs some manual nurturing :) I guess 'improved error handling' needs to be on the TODO list too :D

Most important is that you create a 'sitemap_bot_user' value in the config table that contains the ID of the bot user that you want to use the permissions of. So you need to do something like:

insert into phpbb_config (config_name, config_value) values ('sitemap_bot_user', '28508');
insert into phpbb_config (config_name, config_value) values ('sitemap_directory', 'sitemaps/');

where 28508 is the user_id of my 'GoogleBot' user. If you use a sitemaps directory you need to create that directory under your $phpbb_root_path and allow the webserver to write in that directory.

BTW I noticed two small (but crucial :D) bugs in my code that result in incorrect URLs in the sitemaps. New version is at:

http://www.typo.nl/misc/SiteMap_Generator_BartVB2.zip

only includes/sitemap_functions.php has changed.

Post by Dogs and things »

I'm sorry Bart,

I created the bot user entry in phpbb_config by running your query as posted and replaced includes/sitemap_functions.php with the new one, but I get the exact same error.

Should I do anything more to make it work?

Post by BartVB »

Clear your cache :) phpBB caches most config variables. If you use the default file cache, there should be a file like 'global_vars.php' or something like that in /forums/cache/. Delete that file to get rid of the cached config values.

Post by Dogs and things »

Makes no difference :(
SQL ERROR [ mysqli ]

You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '' at line 3 [1064]

SQL

SELECT forum_id, auth_option_id, auth_role_id, auth_setting FROM phpbb_acl_users WHERE user_id =

BACKTRACE

FILE: includes/db/mysqli.php
LINE: 143
CALL: dbal->sql_error()

FILE: includes/auth.php
LINE: 773
CALL: dbal_mysqli->sql_query()

FILE: includes/auth.php
LINE: 345
CALL: auth->acl_raw_data_single_user()

FILE: includes/auth.php
LINE: 71
CALL: auth->acl_cache()

FILE: includes/sitemap_functions.php
LINE: 40
CALL: auth->acl()

FILE: includes/acp/acp_sitemap_generator.php
LINE: 47
CALL: generate_sitemap()

FILE: includes/functions_module.php
LINE: 471
CALL: acp_sitemap_generator->main()

FILE: adm/index.php
LINE: 74
CALL: p_master->load_active()

Post by BartVB »

This is the query:

Code: Select all

$sql = 'SELECT *
                FROM ' . USERS_TABLE . '
                WHERE user_id = ' . intval($config['sitemap_bot_user']);
So apparently in your case $config['sitemap_bot_user'] is empty... A quick (and really dirty :P) fix is replacing that code with:

Code: Select all

$sql = 'SELECT *
                FROM ' . USERS_TABLE . '
                WHERE user_id = 1234' ;
Where 1234 is the user_id of your bot user.
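
A slightly less dirty variant would be to validate the config value and fall back to a known account when it is missing. The sketch below is only an illustration: sitemap_bot_user_id() is a made-up helper, the table name phpbb_users assumes the default prefix, and the fallback of 1 assumes that the guest account has user_id 1, as in a default phpBB3 install.

```php
<?php
// Hypothetical helper: pick the user_id whose permissions decide what
// ends up in the sitemap. Falls back to the guest account if the
// config value is missing, empty or not a positive integer.
function sitemap_bot_user_id(array $config, $anonymous_id = 1)
{
    $value = isset($config['sitemap_bot_user']) ? (int) $config['sitemap_bot_user'] : 0;
    return ($value > 0) ? $value : (int) $anonymous_id;
}

// Example: the value as it would come out of the config table.
$config = array('sitemap_bot_user' => '28508');

$sql = 'SELECT *
    FROM phpbb_users
    WHERE user_id = ' . sitemap_bot_user_id($config);
```

With an empty or missing value the query then targets the guest account instead of a user_id that presumably does not exist.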

Post by Dogs and things »

Okay,

I got it working nicely now, although I made some changes to your code.

I chose to simply use user_id 1, the anonymous user, because this user has guest permissions and will therefore exclude forums without guest read permissions too.

I cleaned out the bot_auth stuff.

I also cleaned the url-output. For this I changed

Code: Select all

				$f_xml .= '<loc>' . $board_url . "viewtopic.$phpEx?f=" . $row['forum_id'] . '&t=' . $row['topic_id'] . "</loc>\n";
into

Code: Select all

				$f_xml .= '<loc>' . $board_url . "viewtopic.$phpEx?t=" . $row['topic_id'] . "</loc>\n";
As I have a small board and it will be quite some time before I reach the 50,000 URL mark, I also took out the part that generates the sitemap_index.xml.

I removed the part that generates a sitemap_forums.xml as I consider Google doesn't need it.

Finally I reduced the name of the sitemap_topics_xx.xxx-xx.xxx.xml and now it is written as simply sitemap.xml.

I'm not sure if I did everything cleanly but the result is a perfect sitemap.xml with a list of all topics that have guest access and as lastmod date the date/time of the last post in that topic.

I figure that this is exactly what Google needs, although I don't know if <changefreq> and <priority> are also welcomed by Google. My guess is that it likes those too.

What do you think?

Post by BartVB »

It doesn't hurt to have an index file, Google (and the other search engines) know perfectly well how to handle them :) But I can see your point for a small forum although small forums can get big :) Took a quick look and my board had over 50k topics after 16 months and the site wasn't exactly huge at that time...
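
For reference, the sitemap protocol allows at most 50,000 URLs per sitemap file, which is why a bigger board needs several files plus an index. A rough sketch of the splitting step, with a made-up helper name and plain topic IDs as input (on a really large board you would page through the database instead of holding all IDs in memory):

```php
<?php
// Hypothetical helper: split a list of topic IDs into per-file batches.
// 50000 is the maximum number of URLs per sitemap file in the protocol.
function split_into_sitemaps(array $topic_ids, $max_urls = 50000)
{
    return array_chunk($topic_ids, $max_urls);
}

$batches = split_into_sitemaps(range(1, 120000));

// One sitemap file per batch; the index file then lists all of them.
foreach ($batches as $i => $batch)
{
    printf("sitemap %d: %d topics\n", $i + 1, count($batch));
}
```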

I wouldn't change the URLs because

viewtopic.php?f=1&t=1234

is what phpBB uses by default. It's what you see on all viewforum.php pages, it's what AdSense sees, it's what people link to. For Google

viewtopic.php?f=1&t=1234
and
viewtopic.php?t=1234

are two completely separate pages...
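
As a small illustration of keeping the default URL form in the sitemap, here is a sketch with a made-up helper, topic_loc(). It is not the code from sitemap_functions.php. Note that the '&' has to be written as '&amp;' inside the XML, which htmlspecialchars() takes care of.

```php
<?php
// Hypothetical helper: build the <loc> entry in phpBB3's default
// f=...&t=... form, escaped for use inside an XML document.
function topic_loc($board_url, $forum_id, $topic_id, $php_ext = 'php')
{
    $url = $board_url . "viewtopic.$php_ext?f=" . (int) $forum_id . '&t=' . (int) $topic_id;
    return '<loc>' . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . '</loc>';
}

echo topic_loc('http://www.example.com/forum/', 1, 1234), "\n";
```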

Using user 1 is a smart idea, would also be a sane default. I'm using the GoogleBot user because that user can do even less than a guest on the board. There are some forums that I don't want to have in the Google index.

What do you mean with 'I cleaned out the bot_auth stuff'?

<priority> and <changefreq> are probably nice but they are superfluous if you do something like:

Code: Select all

print "<priority>0.5</priority>\n";
:) IMO those are only useful if you assign sane and useful values to them. I can imagine that 'priority' is related to topic_views, the hard part would be normalising the range of values for topic_views. <changefreq> can be related to the age (topics older than 2 months can probably be set to 'never') but it wouldn't surprise me if Google does that already (take the last modified date into account), hmm, not sure if phpBB actually sends a 'Last-Modified' header on viewtopic.php
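
One possible way to get values that actually differ per topic is sketched below. The log scaling, the 0.1 to 1.0 range and the age thresholds are arbitrary choices for illustration, and both function names are made up; nothing like this is in the mod.

```php
<?php
// Map topic_views onto 0.1 .. 1.0 on a log scale, so that a few very
// popular topics do not push everything else down to the minimum.
function topic_priority($views, $max_views)
{
    if ($max_views <= 0)
    {
        return 0.5; // nothing to compare against, use the neutral default
    }
    $ratio = log(1 + max(0, $views)) / log(1 + $max_views);
    return round(0.1 + 0.9 * min(1.0, $ratio), 1);
}

// Derive <changefreq> from the age of the last post in the topic.
function topic_changefreq($last_post_time, $now)
{
    $age_days = ($now - $last_post_time) / 86400;
    if ($age_days < 1)  { return 'daily'; }
    if ($age_days < 7)  { return 'weekly'; }
    if ($age_days < 60) { return 'monthly'; }
    return 'never'; // roughly "older than 2 months"
}
```

$max_views would be the highest topic_views value among the topics going into the sitemap, so it needs one extra MAX() query up front.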

Post by Dogs and things »

In that case you did a good job using a bot_user as your sitemap creator. I'm glad to hear user_id 1 sounds good to you too. ;)

I'll have a look at the URL part, although I guess you are right. The thing is that I'm still running my live board on phpBB2 and am only testing phpBB3 on localhost. On my live phpBB2 board I use URLs as I described, simply http://www.anything.com/phpBB2/viewtopic.php?t=2118, which works well. As I have some mods installed that add other things to the topic URLs, I think it's good to simply point Google to the clean topic URL and let it decide which URL will be shown on the SERPs.

Looking at it I see that phpBB3 writes urls differently, as you indicate, so I'll undo that change.
BartVB wrote:What do you mean with 'I cleaned out the bot_auth stuff'?
For instance:

Code: Select all

	$bot_userdata = $db->sql_fetchrow($result);
	$bot_auth = new Auth();
	$bot_auth->acl($bot_userdata);
and

Code: Select all

	$bot_no_access = array();
	$bot_forum_auth_ary = $bot_auth->acl_getf('!f_read');
	foreach ($bot_forum_auth_ary as $forum_id => $not_allowed)
	{
		if ($not_allowed['f_read'])
		{
			$bot_no_access[] = (int) $forum_id;
		}
	}
BartVB wrote:IMO those are only useful if you assign sane and useful values to them. I can imagine that 'priority' is related to topic_views, the hard part would be normalising the range of values for topic_views. <changefreq> can be related to the age (topics older than 2 months can probably be set to 'never') but it wouldn't surprise me if Google does that already (take the last modified date into account), hmm, not sure if phpBB actually sends a 'Last-Modified' header on viewtopic.php
At the moment I am using a sitemap generator on my phpBB2 board that actually does give good output for <changefreq> and <priority>. Here is a copy of that file; have a look at how it does that if you want, maybe you will get some ideas out of it. I'm by no means a coder, so I can't take parts out of one file and implement them in another. :cry:

Code: Select all

<?php

$phpbb_root_path = '';  // must be adjusted if this script sits one level below the forum folder.

define('IN_PHPBB', true);
include($phpbb_root_path . 'extension.inc');
include($phpbb_root_path . 'common.'.$phpEx);

$userdata = session_pagestart($user_ip, PAGE_INDEX);
init_userprefs($userdata);
$script_name = preg_replace('/^\/?(.*?)\/?$/', "\\1", trim($board_config['script_path']));
$server_name = trim($board_config['server_name']);
$server_protocol = ( $board_config['cookie_secure'] ) ? 'https://' : 'http://';
$server_port = ( $board_config['server_port'] <> 80 ) ? ':' . trim($board_config['server_port']) . '/' : '/';
$server_url = $server_protocol . $server_name . $server_port . $script_name;
if(substr($server_url, -1, 1) != "/") {   $server_url .= "/"; }
$server_url = code_utf8($server_url);
$zeit = time();
$pre_timezone = date('O', $zeit);
$time_zone = substr($pre_timezone, 0, 3).":".substr($pre_timezone, 3, 2);
$topics_per_page = $board_config['posts_per_page'];


// Read the forums:
$max_priority = 0;
$anz_forums = 0;
$result = mysql_query("SELECT forum_id,forum_posts,auth_read FROM " . FORUMS_TABLE ) ;
while( $row =  mysql_fetch_assoc($result)) {
   if($row['auth_read']==0) { // only forums that are readable by everyone
      $forum_ist_lesbar[$row['forum_id']] = true;
      $max_priority = max($max_priority,$row['forum_posts']);
      $anz_forums++;
      $forum_nr[$anz_forums] = $row['forum_id'];
      $forum_priority[$forum_nr[$anz_forums]] = $row['forum_posts'];
   }
}

$result = mysql_query("SELECT topic_id,post_time  FROM " . POSTS_TABLE ) ;
while( $row =  mysql_fetch_assoc($result)) {
   if($last_time[$row["topic_id"]] < $row["post_time"]) { $last_time[$row["topic_id"]] = $row["post_time"]; }
}

echo '<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.google.com/schemas/sitemap/0.84 http://www.google.com/schemas/sitemap/0.84/sitemap.xsd">
';


$result = mysql_query("SELECT topic_id,forum_id,topic_replies,topic_time FROM " . TOPICS_TABLE . " ORDER BY topic_id DESC LIMIT 50000") ;
while( $row =  mysql_fetch_assoc($result)) {
   if($forum_ist_lesbar[$row["forum_id"]]) {
      $row["topic_replies"]++;
      $topicId = $row["topic_id"] ;
      $alter = $zeit - $last_time[$row["topic_id"]] ;
      $zeitbonus = max(min(round((600000 - $alter ) / 80000),7),0) ;
      $last_time_post = date("Y-m-d\TH:i:s",$last_time[$row["topic_id"]]) . $time_zone ;
      if ($alter < 604800) { $changefreq = 'weekly'; } else { $changefreq = 'monthly'; }
      if ($alter < 84000) { $changefreq = 'daily'; }
      $topicpriority = min(( $row["topic_replies"] + $zeitbonus ),9) ;
      $page = '';
      $start = 0;
      while($row["topic_replies"]>0) {
echo "<url>
<loc>".$server_url."viewtopic.php?t=$topicId$page</loc>
<lastmod>$last_time_post</lastmod>
<changefreq>$changefreq</changefreq>
<priority>0.$topicpriority</priority>
</url>
";
         $row["topic_replies"] = $row["topic_replies"] - $topics_per_page;
         $start = $start + $topics_per_page;
         $page = '&amp;start='.$start;  // '&' must be written as '&amp;' inside the XML output
      }
   }
}


echo '</urlset>';


function code_utf8($text) {
   // Escape the five XML special characters; '&' has to be replaced first.
   $array_1 = array("&","\"","'",">","<");
   $array_2 = array("&amp;","&quot;","&apos;","&gt;","&lt;");
   for($x=0;$x<count($array_1);$x++){
      $text = str_replace($array_1[$x],$array_2[$x],$text);
   }
   return $text;
}
?>
Anyhow, I think the most important part of having a sitemap in place is that Google is able to index a board better and most importantly faster, it loses less time looking for new content and can very easily re-index.
Highway of Life
Former Team Member
Posts: 6048
Joined: Wed Feb 02, 2005 5:41 pm
Location: Seattle, WA
Name: David Lewis

Re: [BETA] BartVB SiteMap Generator

Post by Highway of Life »

Locked

Return to “[3.0.x] MODs in Development”