Hi there,
I am running several large forums.
As I am very uncomfortable with ChatGPT and AI in general, I want to block them from using my content to "learn" from.
How can I do that? Any ideas?
I think if you hide your sensitive forums and make them only available to members, that should exclude robots - including AI crawlers - which would be treated as "guests". Several of my forums are members-only and they aren't listed unless you are signed in.
I doubt that blocking one site is going to hinder them very much.
You'd need to know the IPs and/or user agent strings of the bots being used so you could block them at either the server level or with .htaccess.
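Something like this in .htaccess would do the IP part (a minimal sketch, assuming Apache 2.4 and that your host allows overrides; the range below is a documentation placeholder, not a real bot network):

    <RequireAll>
        Require all granted
        # Deny a suspected scraper range (placeholder CIDR - substitute real bot IPs)
        Require not ip 203.0.113.0/24
    </RequireAll>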
-:|:- Support Request Template -:|:- "Step up to red alert. Sir, are you absolutely sure? It does mean changing the bulb"
Holger wrote: Thu Jun 06, 2024 7:43 am
that would block the forum's content from being listed by Google
So you think Google is not using what it crawls for AI purposes? It even has its own division, Google AI, founded in 2017. And now you're getting "uncomfortable"?
As long as your content is public - to allow Google to index it for search, as you say - you cannot prevent it from potentially being used by AI / ML programs.
It’s exactly the same thing as disabling right-clicking to stop people from saving images from your website. People can screenshot it, use a third-party capture program, or just take a photograph of the screen with a physical camera.
Normal people… believe that if it ain’t broke, don’t fix it. Engineers believe that if it ain’t broke, it doesn’t have enough features yet. – Scott Adams
Except that's not what those articles say. Or they are at least quick to clarify, "adherence to a robots.txt is entirely voluntary." You're not "blocking" anything.
It's the equivalent of putting your valuable data out on a public sidewalk with a robots.txt Post-It note on it which says "Please do not steal this data."
You'll get people who don't steal it, people who tell you they didn't steal it but they actually did, and people who never even looked at the note while they stole the data.
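For reference, the "note" in question looks something like this - a robots.txt listing the published tokens for OpenAI (GPTBot), Common Crawl (CCBot) and Google's AI training opt-out (Google-Extended); compliance is entirely up to the bot:

    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

Note that Google-Extended only controls AI training use; regular Googlebot indexing for search is unaffected, which is what Holger was worried about.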
Holger wrote: Fri Jun 07, 2024 6:16 am
Would it be better to block those agents in the .htaccess? That is controlled by the server and not voluntary?
Typically, if a bot is identifying itself it will behave according to robots.txt. That said, user agents are often spoofed, so blocking with an .htaccess rule or other means can stop those bots. If you're going to do that, I would still allow access to the robots.txt file.
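A rough mod_rewrite sketch of that idea (the agent list is illustrative - GPTBot, CCBot and ClaudeBot are published crawler names, but check current lists before relying on it):

    RewriteEngine On
    # Always let crawlers fetch robots.txt itself
    RewriteRule ^robots\.txt$ - [L]
    # Refuse self-identifying AI crawlers with a 403
    RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|ClaudeBot) [NC]
    RewriteRule .* - [F,L]

Keep in mind this only catches bots that tell the truth about who they are.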
The larger issue is bots with browser user agents scraping content; they are difficult to stop. Cloudflare has an option for this, but then they have access to data from millions of sites to analyze and identify the IPs behind that type of traffic.
“Results! Why, man, I have gotten a lot of results! I have found several thousand things that won’t work.”
Except that's not what those articles say. Or they are at least quick to clarify, "adherence to a robots.txt is entirely voluntary." You're not "blocking" anything.
Exactly.
Which is what the point further down in my post was making about photos, where you try to prevent people from downloading them. At the end of the day, if someone wants to take the data from your website, they will find a way.
danieltj wrote: Fri Jun 07, 2024 3:58 pm
....if someone wants to take the data from your website, they will find a way.
I'd agree, but with the bots you can mitigate a lot of these issues. It really depends on how much time and effort you want to put into it. I mentioned Cloudflare; they have a lot of tools, both automated and manual. A simple example using one of their tools: I have seen a lot of scraper traffic from OVH in the past, and you can block the entire IP range of a network like OVH with one entry using their ASN number.
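In Cloudflare's rule expression language that is literally one line (a sketch based on their WAF custom rules syntax; AS16276 is OVH's ASN, and the rule action would be set to Block):

    (ip.geoip.asnum eq 16276)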
You'll never get rid of all of them but you can eliminate the bulk of them.
danieltj wrote: Thu Jun 06, 2024 9:10 am
It’s exactly the same thing as preventing right clicking to stop people saving images from your website. People can screenshot it, use a third party capture program or use a physical camera and take a photograph.
The easy way around that is to just open the developer console and grab the image from there.