[INFO] How to Google phpBB!

A place for MOD Authors to post and receive feedback on MODs still in development. No MODs within this forum should be used within a live environment! No new topics are allowed in this forum.

On February 1, 2009 this forum will be set to read only as part of retiring of phpBB2.

Postby menyak » Tue Nov 05, 2002 12:06 pm

PS: Oh, maybe you should send a PM to TC about this - might save some folks some confusion...
menyak
Registered User
 
Posts: 53
Joined: Mon Feb 04, 2002 9:55 pm

Googlebot

Postby nova9999 » Wed Nov 06, 2002 7:58 pm

Hello :)

I followed the Google tutorial, and today, while browsing my administration panel, I noticed a visitor from a "familiar" IP range (google.com's IP = 216.239.51.100): it was crawl9.googlebot.com from 216.239.46.226. It was stuck on the profile page for a while (btw, I have also made some modifications so that guests are redirected to the login page when they try to view the member list or profiles), but perhaps I only noticed it towards the end of the crawl. I searched Google, but there are still no results from my forums (although they are fairly new, only 3 weeks old!). Just thought I'd contribute to this thread and see how things are going with "the google quest" :)
nova9999
Registered User
 
Posts: 3
Joined: Tue Oct 22, 2002 7:38 pm

Postby R. U. Serious » Wed Nov 06, 2002 8:46 pm

You cannot find your pages right after the crawl; that is normal. Usually Google crawls at the beginning of the month, and those pages make it into the index at the end of the month.

Another thing:
The number of topics crawled and cached by google is limited, therefore it is helpful to keep the bots out of the uninteresting parts of the forum. That way it will (hopefully) crawl more of the interesting parts (=topics/forums). Simply create or add to an existing robots.txt in the root directory of your site:

Code: Select all

User-agent: *
Disallow: /memberlist.php
Disallow: /groupcp.php
Disallow: /privmsg.php
Disallow: /profile.php
Disallow: /posting.php

(Note: If you do not have phpBB in the root of your domain, you'll have to adjust the path, e.g. Disallow: /phpBB2/memberlist.php )
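For boards installed in a subdirectory, the path adjustment can be scripted. The following is only a minimal sketch in Python: the helper name build_robots_txt and the SCRIPTS list are hypothetical and simply reuse the five scripts listed above. The generated file still belongs in the root directory of the site.

```python
def build_robots_txt(scripts, prefix=""):
    """Return robots.txt text that disallows each script under an optional path prefix."""
    # Normalise the prefix: "" stays empty, "phpBB2" or "/phpBB2/" becomes "/phpBB2".
    cleaned = prefix.strip("/")
    base = f"/{cleaned}" if cleaned else ""
    lines = ["User-agent: *"]
    lines += [f"Disallow: {base}/{script}" for script in scripts]
    return "\n".join(lines) + "\n"


# Hypothetical list: the same scripts as in the robots.txt above.
SCRIPTS = ["memberlist.php", "groupcp.php", "privmsg.php", "profile.php", "posting.php"]

if __name__ == "__main__":
    # Board installed under /phpBB2/ instead of the domain root.
    print(build_robots_txt(SCRIPTS, prefix="phpBB2"))
```

Calling build_robots_txt(SCRIPTS) with no prefix should reproduce the rules for a board installed in the domain root.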

I have not tested the results yet, but I have it on my site for the current crawl and can report results by the end of November.
Last edited by R. U. Serious on Thu Nov 07, 2002 5:03 pm, edited 1 time in total.
R. U. Serious
Registered User
 
Posts: 830
Joined: Mon Feb 11, 2002 2:07 pm

Postby netclectic » Wed Nov 06, 2002 8:50 pm

Yeah, here's a robots.txt I put together. I think it covers just about everything you wouldn't want Google to spider. It probably lists some mods you don't have, and you may have some mods that you should add...

Code: Select all
User-agent: *
Disallow: /admin/
Disallow: /attach_mod/
Disallow: /db/
Disallow: /files/
Disallow: /images/
Disallow: /includes/
Disallow: /language/
Disallow: /mycalendar_mod/
Disallow: /spelling/
Disallow: /templates/
Disallow: /common.php
Disallow: /config.php
Disallow: /glance_config.php
Disallow: /groupcp.php
Disallow: /memberlist.php
Disallow: /mini_cal.php
Disallow: /modcp.php
Disallow: /mycalendar.php
Disallow: /news_insert.php
Disallow: /posting.php
Disallow: /printview.php
Disallow: /privmsg.php
Disallow: /profile.php
Disallow: /ranks.php
Disallow: /search.php
Disallow: /statistics.php
Disallow: /tellafriend.php
Disallow: /viewonline.php
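A quick way to sanity-check rules like these before a crawl is Python's standard urllib.robotparser module. This is only a sketch: the rule subset in ROBOTS_TXT and the helper name is_crawlable are chosen for illustration, and real crawlers may interpret robots.txt slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# A subset of the rules above, enough to illustrate the check.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /memberlist.php
Disallow: /posting.php
Disallow: /profile.php
Disallow: /search.php
"""


def is_crawlable(path, robots_txt=ROBOTS_TXT, agent="Googlebot"):
    """Return True if the given path is allowed for the agent under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for path in ("/viewtopic.php?t=5", "/posting.php?mode=reply&t=5", "/admin/index.php"):
        print(path, is_crawlable(path))
```

Rules are matched as path prefixes, so reply and quote links, which go through posting.php, are covered by the single posting.php line, while viewtopic.php and viewforum.php stay open to the bots.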
netclectic
Former Team Member
 
Posts: 4439
Joined: Wed Mar 13, 2002 3:08 pm
Location: Omnipresent

Feedback

Postby Webby » Thu Nov 07, 2002 11:48 am

Hi folks,
Just wanted to let you guys know that last week the Googlebot spider and 'Slurp' (Inktomi) stuffed their hairy faces silly crawling just about every single page of my forum that isn't registration protected. On Monday there was a total of 60 hits from Google. All my main pages plus about 30 forum pages were spidered. I'm using a combination of the R. U. Serious 'cloak' mod and the HSIM SE-friendly URL mod. On top of this I have Ca5ey's fetch posts mod as an include on my .shtml home page, which makes for an excellent 'feeder' to the search engine bots. I only allowed the bots in last week (they were previously excluded with robots.txt) and they are lapping it up.
I'm like a kid waiting for Christmas :-) I know for a fact a LOT of my forum pages are going to get indexed, as the crawlers have been round.

In time this is going to bring in a lot of new memberships and visitors for sure.

Many thanks to all the programmers who have made it possible.

You can see the SSI include of fetch posts on my homepage here:
http://www.abakus-internet-marketing.de

THE most search engine friendly forum out there here ;-)...
http://www.abakus-internet-marketing.de/foren/index.php

phpBB rules for sure!
Webby
Registered User
 
Posts: 91
Joined: Tue May 21, 2002 9:28 am
Location: Hannover, Germany

Postby conniew » Thu Nov 07, 2002 3:05 pm

Just to let you all know....

I installed this mod and the Google spider just spent THREE DAYS ( :mrgreen: ) crawling my site! I had no idea they would spend that long!

Thanks so much for this!
conniew
Registered User
 
Posts: 16
Joined: Sun Aug 25, 2002 8:40 pm

Postby Ralendil » Thu Nov 07, 2002 3:07 pm

The problem is that Google tries all links!

And all links means reply, quote, etc. :)

That is a problem.
Search for your forum in a week and you will find some links that reply directly to a post ;)
Ralendil
Registered User
 
Posts: 410
Joined: Thu May 30, 2002 9:13 pm
Location: France

Postby netclectic » Thu Nov 07, 2002 3:44 pm

Ralendil wrote:The problem is that Google tries all links!

And all links means reply, quote, etc. :)

That is a problem.
Search for your forum in a week and you will find some links that reply directly to a post ;)


You need a decent robots.txt file like the one I posted on the previous page!
netclectic
Former Team Member
 
Posts: 4439
Joined: Wed Mar 13, 2002 3:08 pm
Location: Omnipresent

Postby Ralendil » Thu Nov 07, 2002 4:15 pm

netclectic wrote:
Ralendil wrote:The problem is that Google tries all links!

And all links means reply, quote, etc. :)

That is a problem.
Search for your forum in a week and you will find some links that reply directly to a post ;)


You need a decent robots.txt file like the one I posted on the previous page!


He he, thx!!!
It seems I miss all the important things you post that could help me :lol: ;) !
I will try this...
Ralendil
Registered User
 
Posts: 410
Joined: Thu May 30, 2002 9:13 pm
Location: France

Postby Dzien Dobry » Thu Nov 07, 2002 5:34 pm

netclectic,

Please pardon my ignorance. I’ve just now added the enhance-google-indexing mod to my forum. I also added a file called robots.txt to the root directory of my site. (I believe I have phpBB in the root of my domain. http://www.everythingimportant.org is my forum index page). I used everything you listed, even though the few mods I have are only slight changes in code to existing files. I assume that my using Disallow: /a file that doesn’t exist/ isn’t going to be a problem? Is that right?
Dzien Dobry
Registered User
 
Posts: 545
Joined: Thu Nov 08, 2001 3:55 pm

Postby netclectic » Thu Nov 07, 2002 5:50 pm

You are correct. That will not cause a problem.
netclectic
Former Team Member
 
Posts: 4439
Joined: Wed Mar 13, 2002 3:08 pm
Location: Omnipresent

Postby Dzien Dobry » Thu Nov 07, 2002 6:38 pm

R. U. Serious wrote:The number of topics crawled and cached by google is limited, therefore it is helpful to keep the bots out of the uninteresting parts of the forum. That way it will (hopefully) crawl more of the interesting parts (=topics/forums). Simply create or add to an existing robots.txt in the root directory of your site:
Code: Select all
User-agent: *
Disallow: /memberlist.php
Disallow: /groupcp.php
Disallow: /privmsg.php
Disallow: /profile.php
Disallow: /posting.php

netclectic wrote:Yeah, here's a robots.txt i put together. I think it covers just about everything you wouldn't want google to spider. There are probably some mods in there which you may not have and maybe some mods you do have which you should maybe add...
Code: Select all
User-agent: *
Disallow: /admin/
Disallow: /attach_mod/
Disallow: /db/
Disallow: /files/
Disallow: /images/
Disallow: /includes/
Disallow: /language/
Disallow: /mycalendar_mod/
Disallow: /spelling/
Disallow: /templates/
Disallow: /common.php
Disallow: /config.php
Disallow: /glance_config.php
Disallow: /groupcp.php
Disallow: /memberlist.php
Disallow: /mini_cal.php
Disallow: /modcp.php
Disallow: /mycalendar.php
Disallow: /news_insert.php
Disallow: /posting.php
Disallow: /printview.php
Disallow: /privmsg.php
Disallow: /profile.php
Disallow: /ranks.php
Disallow: /search.php
Disallow: /statistics.php
Disallow: /tellafriend.php
Disallow: /viewonline.php

These comments should be added to the Tutorial, Google & phpBB but that thread is locked. This is an important feature for everyone’s benefit.

Thank you netclectic and R. U. Serious!
Dzien Dobry
Registered User
 
Posts: 545
Joined: Thu Nov 08, 2001 3:55 pm

Postby TC » Thu Nov 07, 2002 7:19 pm

thank you to RUS & netclectic for the PMs alerting me to these developments. i was waiting for confirmation from RUS that these rules in fact work as advertised before adding them to the tutorial.

but when this is done it will be added. thanks to all who continue to work on this! :mrgreen:
TC
Former Team Member
 
Posts: 3601
Joined: Tue Sep 25, 2001 7:23 pm
Location: Kµlt °ƒ Ø, working on my time machine

Postby R. U. Serious » Thu Nov 07, 2002 8:54 pm

@TC: Judging from my logs (and I think netclectic will confirm this), it works just as it should. Google is skipping the mentioned pages.


However, I now have another thing on my mind. See those little icons next to every post: each of them links to viewtopic with a different post_id. However, all 10 (or 15 or 20 ...) posts from the same page lead to identical pages, just with different URLs. You don't want to serve Google many different URLs with the exact same content.

So this should be handled in the append_sid function. The following is experimental. netclectic, maybe you can test this, or maybe you have a better idea how to handle it. This code still appends the session_id to URLs that contain both 'viewtopic.php' and 'p=', so Google will ignore viewtopic links to posts but will still crawl viewtopic links to topic IDs.
Code: Select all
# ------ EXPERIMENTAL ------
# Prevent duplicate content by post-links v0.1
#
#-----[ OPEN  ]------------------------------------------
includes/sessions.php

#-----[ FIND ]------------------------------------------
   global $SID;

   if ( !empty($SID) && !eregi('sid=', $url) )

#-----[ REPLACE WITH ]------------------------------------------
   global $SID, $HTTP_SERVER_VARS;

   if ( !empty($SID) && !eregi('sid=', $url) && (!strstr($HTTP_SERVER_VARS['HTTP_USER_AGENT'] ,'Googlebot')
         || (strstr($url, 'viewtopic.php') && strstr($url, 'p=')) ) )

#
#-----[ SAVE/CLOSE ALL FILES ]------------------------------------------
#
# EoM
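To make the replaced condition easier to reason about, here is a small model of it in Python. It is only a sketch of the boolean logic, not the PHP that goes into sessions.php: the function name should_append_sid and its parameters are hypothetical stand-ins for $url, $SID and the HTTP_USER_AGENT value.

```python
def should_append_sid(url, sid, user_agent):
    """Model of the experimental condition: True when the session ID would be appended."""
    # No session ID available, or the URL already carries one (case-insensitive, like eregi).
    if not sid or "sid=" in url.lower():
        return False
    is_googlebot = "Googlebot" in user_agent
    # Plain substring checks, mirroring the two strstr() calls in the MOD.
    is_post_link = "viewtopic.php" in url and "p=" in url
    # Normal visitors always get the SID; Googlebot only gets it on post links.
    return (not is_googlebot) or is_post_link


if __name__ == "__main__":
    bot = "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
    print(should_append_sid("viewtopic.php?t=7", "sid=abc", bot))
    print(should_append_sid("viewtopic.php?p=42", "sid=abc", bot))
```

For Googlebot, topic links (t=) stay free of the session ID while post links (p=) keep it, which is the behaviour described above. Note that the 'p=' test is a loose substring match, so any viewtopic URL containing that text anywhere would be treated as a post link.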
R. U. Serious
Registered User
 
Posts: 830
Joined: Mon Feb 11, 2002 2:07 pm

Postby netclectic » Thu Nov 07, 2002 9:38 pm

Good thinking!

I don't see any possible drawbacks of adding the extra bit in sessions.php. I've just tested it on my forums and it certainly works.
netclectic
Former Team Member
 
Posts: 4439
Joined: Wed Mar 13, 2002 3:08 pm
Location: Omnipresent
