Forum::Announcements::Scrapers ye be warned!

[BOINCstats] Willy: Forum moderator - Administrator - Developer - Tester - Translator; BAM!ID: 1; Joined: 2006-01-09; Posts: 9442; Credits: 353,172,950; World-rank: 4,872

2006-06-15 19:23:34
last modified: 2015-11-22 09:41:50

After several requests and warnings, it's now time for a serious talk with those who still scrape the pages of BOINCstats.

Scraping is the automated downloading of standard BOINCstats pages, and parsing them for use on other sites or statistics.

Scraping puts too much load on the BOINCstats server. Hundreds, sometimes thousands of pages are requested in sequence, bringing the server to its knees. NO MORE!

From now on, every scraper detected will have his IP address blocked indefinitely. Your stats history will be purged and if the scraper is a member of a team, the team stats histories will be erased as well. For all projects! This cannot be undone.

I'm sorry it has to come to this, but BOINCstats is a source of information and fun for the viewers of THIS site and the pleasure of using BOINCstats is greatly reduced by a slow or unresponsive website.

Please do not PM, IM or email me for support (they will go unread/ignored). Use the forum for support.

UBT - Halifax--lad: BAM!ID: 25; Joined: 2006-02-27; Posts: 366; Credits: 49,272; World-rank: 920,745

2006-06-15 21:15:28

Willy, what's your e-mail? I have misplaced it and need to pass it to a naughty scraper.

Join us in Chat (see the forum) Click the Sig

[BOINCstats] Willy: Forum moderator - Administrator - Developer - Tester - Translator; BAM!ID: 1; Joined: 2006-01-09; Posts: 9442; Credits: 353,172,950; World-rank: 4,872

2006-06-15 21:31:23
last modified: 2006-06-15 21:32:41

Actually, it was a member of UBT that triggered me tonight

Try me at webmaster { at } boincstats [ dot ] com

UBT - Halifax--lad: BAM!ID: 25; Joined: 2006-02-27; Posts: 366; Credits: 49,272; World-rank: 920,745

2006-06-15 21:33:13

Cheers, I will pass the email address along now. And yes, it was one of the naughty people in our team; he is going to email you ASAP.

william: BAM!ID: 633; Joined: 2006-05-22; Posts: 79; Credits: 77,326; World-rank: 764,650

2006-06-15 23:33:37

what is a scraper? I am dumb about this stuff I guess? is it your stats in the forms?

Lee Carre: BAM!ID: 41; Joined: 2006-04-19; Posts: 262; Credits: 299,581; World-rank: 398,565

2006-06-16 02:06:30

what is a scraper? I am dumb about this stuff I guess? is it your stats in the forms?

a scraper is a person who uses a program to automatically download full HTML pages to only extract a small part of them (in this case, stats data)

viewing the pages normally in a web browser is fine
the problem is that these "scrapers" usually request many pages at once (hundreds/thousands in quick succession)

this places a high load on the web server (and any supporting servers, such as a database) which is unfair, because it reduces the speed of the site for other, regular, users

most scrapers use the data they acquire for personal use, or for their team, which is of no benefit to this site (team members could just look here anyway)

the issue is not about people wanting their stats, it's that the methods involved in "scraping" are very inefficient and therefore detrimental. the same data could be retrieved by more efficient means such as XML, which is what i assume willy provides for those who really want to use the data provided on BOINCstats in their own way.
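
To illustrate why a structured export is cheaper than scraping, here is a minimal sketch that reads every user's credit from a single XML document instead of downloading one HTML page per user. The element names, attribute names and sample values are invented for illustration; they are not the real BOINCstats export format.

```python
import xml.etree.ElementTree as ET

# Hypothetical export: the schema and the values are made up for this example.
SAMPLE_EXPORT = """\
<users>
  <user name="alice" credit="1200"/>
  <user name="bob" credit="3400"/>
</users>
"""


def credits_by_user(xml_text: str) -> dict[str, int]:
    """Return {user name: credit} from one export document."""
    root = ET.fromstring(xml_text)
    return {u.get("name"): int(u.get("credit")) for u in root.iter("user")}


print(credits_by_user(SAMPLE_EXPORT))
```

One request for a document like this replaces hundreds of page requests, and the server does not have to render any HTML around the numbers.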

Want to search the BOINC Wiki, BOINCstats, or various BOINC forums from within firefox? Try the BOINC related Firefox Search Plugins

Lee Carre: BAM!ID: 41; Joined: 2006-04-19; Posts: 262; Credits: 299,581; World-rank: 398,565

2006-06-16 02:08:17
last modified: 2006-06-16 02:11:34

i'd just like to say that i agree it's disappointing that the situation has come to this, and requires such drastic measures, but if people don't listen and respect the wishes of others voluntarily, then what do they expect?

I say good on Willy for telling these server-load and bandwidth guzzlers where to go

on a slightly different note, would it be possible to have a definition of "scraping" with regard to the BOINCstats site, so that there's no confusion about what's allowed and what's not
that way, if a user complains about having their IP blocked, and they claim they were unaware that their conduct was unacceptable, they'll have no excuse, because a clear set of guidelines would be available

william: BAM!ID: 633; Joined: 2006-05-22; Posts: 79; Credits: 77,326; World-rank: 764,650

2006-06-16 03:42:04

Thank you for explaining that to me. I now know what it is, and I say good for Willy for doing this.

[BOINCstats] Willy: Forum moderator - Administrator - Developer - Tester - Translator; BAM!ID: 1; Joined: 2006-01-09; Posts: 9442; Credits: 353,172,950; World-rank: 4,872

2006-06-16 07:27:41

I will add a more detailed explanation somewhere on BOINCstats, but there has already been a FAQ about this subject for over a year.

Lee: as your English is clearly better than mine, do you mind if I take some of the sentences of your post for this explanation?

Egofreak: BAM!ID: 274; Joined: 2006-05-12; Posts: 119; Credits: 18,582; World-rank: 1,330,769

2006-06-16 07:30:31

Yo willy,
I have tested the "Bad BOT" and it has banned 13 spiders in the past 2 months.

try this: http://www.spider-trap.de/

I'm sure it will help you! If you have problems understanding German I will help you.
Sorry, I haven't found an English version of this page

Test nBOINC and help me to make this tool better

[BOINCstats] Willy: Forum moderator - Administrator - Developer - Tester - Translator; BAM!ID: 1; Joined: 2006-01-09; Posts: 9442; Credits: 353,172,950; World-rank: 4,872

2006-06-16 07:52:10

I read through the site and I think that this is not what I need.

This program places a link on a site, and if a grabber follows that link, it is added to a blacklist.

Scrapers do not follow links. They have all the URLs they need defined in their program. They will never hit the spider trap.

On the other hand, an innocent visitor might click it.
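
The trap mechanism described above can be sketched roughly as follows. This is only an illustration of the idea, not the code of the spider-trap program; TRAP_PATH, handle_request and the IP addresses are hypothetical.

```python
# Hypothetical names throughout: this is not the spider-trap program's code.
TRAP_PATH = "/trap/do-not-follow"
blacklist: set[str] = set()


def handle_request(ip: str, path: str) -> str:
    """Blacklist any client that requests the trap URL; refuse blacklisted clients."""
    if path == TRAP_PATH:
        blacklist.add(ip)
    if ip in blacklist:
        return "blocked"
    return "page content"


# A crawler that follows every link hits the trap and stays blocked afterwards.
print(handle_request("198.51.100.7", TRAP_PATH))
print(handle_request("198.51.100.7", "/stats/user"))

# A scraper working from a hard-coded URL list never requests the trap.
print(handle_request("203.0.113.9", "/stats/user"))
```

The last call shows the weakness: a client that never requests the trap URL is never added to the blacklist, however many pages it downloads.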

The blocking method used by the spider trap is similar to the one I implemented on BOINCstats. It's very efficient. When blocked, a scraper only gets a few characters on each page of BOINCstats.

Egofreak: BAM!ID: 274; Joined: 2006-05-12; Posts: 119; Credits: 18,582; World-rank: 1,330,769

2006-06-16 08:50:43

that's true... i misunderstood you.

[...]
On the other hand, an innocent visitor might click it.
[...]

Not if you place a 1px image on the site

other thing:
this can help you up to 70% i think

http://www.thesitewizard.com/archive/bandwidththeft.shtml

[BOINCstats] Willy: Forum moderator - Administrator - Developer - Tester - Translator; BAM!ID: 1; Joined: 2006-01-09; Posts: 9442; Credits: 353,172,950; World-rank: 4,872

2006-06-16 10:07:08

I have no problems with other sites using the images of BOINCstats. That's a relatively low load.

Lee Carre: BAM!ID: 41; Joined: 2006-04-19; Posts: 262; Credits: 299,581; World-rank: 398,565

2006-06-16 13:38:46
last modified: 2006-06-16 13:44:55

Lee: as your English is clearly better than mine, do you mind if I take some of the sentences of your post for this explanation?

sure - use it in any way you wish, although that was written "on the fly" and could be improved further.

Lee Carre: BAM!ID: 41; Joined: 2006-04-19; Posts: 262; Credits: 299,581; World-rank: 398,565

2006-06-16 13:42:28
last modified: 2006-06-16 13:43:25

When blocked, a scraper only gets a few characters on each page of BOINCstats.

clever

i guess only the technical people will realise the genius of this, but still, an innovative idea

i wonder though, do they get actual data retrieved from the DB, or is it just a static dummy page which doesn't use the DB? (thus reducing load even further)

[BOINCstats] Willy: Forum moderator - Administrator - Developer - Tester - Translator; BAM!ID: 1; Joined: 2006-01-09; Posts: 9442; Credits: 353,172,950; World-rank: 4,872

2006-06-16 13:51:50

The few characters spell 'Scraping not allowed.' That's it. So, if that's genius, well, then I'm guilty.

No database access is required for it, just a check against a hard coded blocked-IP table.
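
A check of that kind could look roughly like the sketch below. It is an illustration only, not the code running on BOINCstats; the addresses, BLOCKED_IPS, respond and render_stats_page are all hypothetical.

```python
# Hard-coded table of blocked addresses (documentation-range example IPs).
BLOCKED_IPS = frozenset({"192.0.2.10", "198.51.100.23"})

BLOCK_MESSAGE = "Scraping not allowed."


def respond(ip: str, render_page) -> str:
    """Return the short message for blocked IPs; otherwise render the real page."""
    if ip in BLOCKED_IPS:
        return BLOCK_MESSAGE  # the page renderer is never called on this path
    return render_page()


def render_stats_page() -> str:
    # Stand-in for the expensive, database-backed page generation.
    return "<html>...full stats page...</html>"


print(respond("192.0.2.10", render_stats_page))
print(respond("203.0.113.5", render_stats_page))
```

Because the lookup happens before any page is built, a blocked client costs the server one set-membership test and a reply of a few characters.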

Lee Carre: BAM!ID: 41; Joined: 2006-04-19; Posts: 262; Credits: 299,581; World-rank: 398,565

2006-06-16 14:05:02
last modified: 2006-06-16 14:05:38

what is a scraper? I am dumb about this stuff I guess? is it your stats in the forms?

some further thoughts:

when using a web browser in the normal fashion, pages are requested slowly, because it takes time for the user to view/read them, and look at their stats data before they request another page (and usually only one page at a time)

but as stated, a scraper uses an automated program which only downloads the page(s), it doesn't display them, the purpose of scraping (even though it's a very bad practice, I don't like scrapers any more than Willy does) is to just download the data, rather than display it like a regular browser would

what happens with the data (the web pages) that are downloaded is then up to the scraper, but generally the stats data is extracted and put into the scraper's database, or used for the benefit of their team website (which obviously is not the BOINCstats website, and is of no benefit to BOINCstats)

because the data isn't viewed by the scraper (at least not while downloading it) most scrapers don't see the point of waiting between requests, and download lots of pages quickly. this is the very essence of the problem: lots of pages being requested (causing high server-load) and they're not even being viewed, only to be put in someone else's database or on their own website
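
The difference in request pace between a browsing visitor and a scraper is easy to see in code. The sketch below counts requests per IP in a sliding window; it is only an illustration, the thread does not say how scrapers are actually detected on BOINCstats, and WINDOW_SECONDS and MAX_REQUESTS are assumed values.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60.0   # assumed window length
MAX_REQUESTS = 30       # assumed threshold, not a BOINCstats figure

_recent: dict[str, deque] = defaultdict(deque)


def looks_like_scraper(ip: str, now: float) -> bool:
    """Record a request at time `now` and report whether the IP exceeded the limit."""
    times = _recent[ip]
    times.append(now)
    # Drop timestamps that have fallen out of the window.
    while times and now - times[0] > WINDOW_SECONDS:
        times.popleft()
    return len(times) > MAX_REQUESTS


# A visitor reading one page every 20 seconds never trips the limit.
print(any(looks_like_scraper("visitor", t * 20.0) for t in range(50)))

# A program requesting ten pages per second trips it almost immediately.
print(any(looks_like_scraper("scraper", t * 0.1) for t in range(100)))
```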

in short, this significantly increased server-load has a very detrimental impact on the site: either it's slow, or the web server fails under the extremely high load. neither of these is good for the site or its regular visitors
the only alternative is to increase the capacity of the site, but it costs quite a lot to do, it means getting new servers etc. and considering this is only a hobby, I'm sure that's money that Willy would rather spend on something else, rather than feeding scrapers

the site works perfectly well when people don't scrape the pages, yet the scrapers contribute nothing towards the costs of their unacceptable activities

Another point: the scraping issue is not with bandwidth. From what I remember, Willy said he's got lots of bandwidth; the problem is with the extra demand placed on the servers (in terms of CPU & memory resources etc.)

Lots of bandwidth is all very well (it's never a bad thing to have), but if the servers aren't coping with the high load caused by scrapers, then the site doesn't work (that's why pages are slow, or even fail to load)
servers not working or not able to cope = no site

requesting static images (like the signature images) doesn't increase the server-load much at all, it only uses bandwidth (which is fine)

but requesting pages is different, they're complex, and require the server to do a lot more processing (Willy would have to give details about how much more) to display a page than to send a signature image. also it's relatively easy to make the images cacheable, so that they cause even lower server-load and bandwidth usage

Lee Carre: BAM!ID: 41; Joined: 2006-04-19; Posts: 262; Credits: 299,581; World-rank: 398,565

2006-06-16 14:11:22
last modified: 2006-06-16 14:14:46

The few characters spell 'Scraping not allowed.' That's it. So, if that's genius, well, then I'm guilty.

well, the genius is that the wget program (or whatever these people use now) thinks it's got the page, but can't parse it

i'm sure some of them have been developed to be clever enough to be aware of error codes/pages, so by supplying a very simple page you avoid anything that the scraping program might do to get round the "blockage" (like repeatedly trying to get the page until it doesn't get a "HTTP 404" anymore)
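
a rough sketch of that retry behaviour, assuming a scraper that simply repeats the request until it gets a success status. naive_scraper and the two fake servers are hypothetical, and the code talks to nothing over the network.

```python
def naive_scraper(fetch, max_attempts: int = 5):
    """Retry until the server stops answering with an error status."""
    for attempt in range(1, max_attempts + 1):
        status, body = fetch()
        if status == 200:
            return attempt, body  # the scraper believes it has the page
    return max_attempts, None


def blocked_with_404():
    # A block that answers with an error code invites retries.
    return 404, ""


def blocked_with_text():
    # A block that answers "successfully" with a short static message.
    return 200, "Scraping not allowed."


print(naive_scraper(blocked_with_404))   # (5, None): five requests reached the server
print(naive_scraper(blocked_with_text))  # (1, 'Scraping not allowed.'): one request
```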

so yea, by actually supplying some text i think you've avoided a bunch of other problems (you've defused the arms race between hacker and... "site admin" )

and again, by having a static page which doesn't touch the DB it reduces the load further

with thinking like that it's no wonder you've created the best stats site

(which is probably why you've got lots of people scraping it, life is never without irony! cue Alanis Morissette)

Lee Carre: BAM!ID: 41; Joined: 2006-04-19; Posts: 262; Credits: 299,581; World-rank: 398,565

2006-06-16 16:03:01

http://www.thesitewizard.com/archive/bandwidththeft.shtml

relying on the referer being present is not a good idea. many packages block the referer from being sent anyway, which can break the whole process horribly, or a user can mis-configure a referer blocker
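
a small sketch of why that breaks, assuming a strict check that only serves requests whose referer points at the site itself. ALLOWED_HOST and referer_allows are hypothetical names, not taken from the linked article.

```python
from urllib.parse import urlparse

ALLOWED_HOST = "example.org"  # placeholder for the site's own host


def referer_allows(referer) -> bool:
    """Strict hotlink check: allow only requests whose Referer is our own site."""
    if not referer:
        return False  # header stripped by a privacy tool or proxy
    return urlparse(referer).hostname == ALLOWED_HOST


print(referer_allows("https://example.org/forum"))  # genuine visitor, allowed
print(referer_allows(None))                         # genuine visitor, refused
```

the second call is the problem case: a regular visitor whose software removes the header looks exactly like a hotlinker to this check.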

in general it's more hassle than it's worth

as stated, bandwidth is not the problem, and static images (like the sig images) don't cause much server-load

Egofreak: BAM!ID: 274; Joined: 2006-05-12; Posts: 119; Credits: 18,582; World-rank: 1,330,769

2006-06-18 09:32:20

[...]
as stated, bandwidth is not the problem, and static images (like the sig images) don't cause much server-load

Willy said that already

Lee Carre: BAM!ID: 41; Joined: 2006-04-19; Posts: 262; Credits: 299,581; World-rank: 398,565

2006-06-19 14:10:43
last modified: 2006-06-19 14:11:22

[...]
as stated, bandwidth is not the problem, and static images (like the sig images) don't cause much server-load

Willy said that already

which is why i wrote "as stated" at the beginning
i was emphasising the point, which is obvious when you actually read what i posted

Saenger: Tester - Translator; BAM!ID: 5; Joined: 2006-01-10; Posts: 1735; Credits: 228,207,205; World-rank: 6,610

2006-06-19 16:39:36
last modified: 2006-06-19 16:40:24

[...]
as stated, bandwidth is not the problem, and static images (like the sig images) don't cause much server-load

Willy said that already

I may be wrong, as I don't know nothing about scraping and so forth, but could it be, that Willy is asking you this question:

Where do you get the data for project rank and the BOINC combined info?

in this thread because of a certain suspicion regarding the behaviour of your tool?
You haven't answered it yet, at least not openly.

edit for speeling

[BOINCstats] Willy: Forum moderator - Administrator - Developer - Tester - Translator; BAM!ID: 1; Joined: 2006-01-09; Posts: 9442; Credits: 353,172,950; World-rank: 4,872

2006-06-19 19:05:38

I checked the outgoing connections when I used the tool, and I saw no connections to BOINCstats, but I'm curious why the BOINCstats user and team id are needed.

Egofreak: BAM!ID: 274; Joined: 2006-05-12; Posts: 119; Credits: 18,582; World-rank: 1,330,769

2006-06-25 19:44:14

AFAIK i answered...

Lee Carre: BAM!ID: 41; Joined: 2006-04-19; Posts: 262; Credits: 299,581; World-rank: 398,565

2006-07-08 01:12:02

how's the status of the anti-scraping issues?
just after a general update really...
