Batoto becoming registered only?


  • This topic is locked
512 replies to this topic

#201
aviar

    Fingerling Potato

  • Members
  • 64 posts
<snipped>

 

I honestly don't want to go the API route. It's been mentioned a few times now across the 10 pages of comments here. But here's why I don't want it:

  • We'll basically be acknowledging that we are a re-distribution source. That's not who we want to be.
  • This may end up spawning even more aggregators out there, since it would make being an aggregator easier than ever before. Then we'd be back here with the same problem. APIs in general are either paid or there's some separate benefit to providing them that increases the provider's business; this would do neither for us. If we charge, they'll probably just stick with the current, free route.
  • There's also the issue of trust: whether they'd trust our API, and conversely whether they'd actually use the API instead of just crawling. There's no motive for them to switch to the API at all if what they have is already working.
  • APIs aren't magically super efficient. It'd still have to do an extensive search of our database for obscure titles. If I made a per-comic API, it really wouldn't be significantly more efficient than what we have now.

 

The reasoning behind an API, on my part, is that you avoid the hit of others crawling over your content and loading material unnecessarily (which I'm guessing is the problem here) and instead turn that into database queries, which I'd like to think are cheaper computationally. And sure, an API may legitimize/acknowledge/canonize bots/trawlers/scrapers/etc., but I think shifting who determines the computational cost of an activity onto your side of the playing field is a good first step to getting resource usage under control. Pragmatically, automated content crawlers exist and are a real problem today, and an API -may- (only you can determine that) provide a viable alternative to making the site member-only.
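
To make the "query instead of crawl" idea concrete, here's a rough sketch of what a read-only, per-comic metadata endpoint could look like. The framework (Flask), the route, and the field names are all my own assumptions for illustration, not anything Batoto actually exposes:

    # Minimal sketch of a read-only per-comic metadata endpoint.
    # Flask, the route, and the field names are illustrative assumptions.
    from flask import Flask, jsonify, abort

    app = Flask(__name__)

    # Stand-in for an indexed database lookup; a real deployment would
    # query (and cache) a comics table keyed by comic_id.
    FAKE_DB = {
        "some-comic": {
            "title": "Some Comic",
            "latest_chapter": 42,
            "updated_at": "2015-10-21T13:00:00Z",
        }
    }

    @app.route("/api/comic/<comic_id>")
    def comic_meta(comic_id):
        row = FAKE_DB.get(comic_id)
        if row is None:
            abort(404)
        # One small JSON response replaces re-crawling every chapter page
        # just to notice that nothing has changed.
        return jsonify(row)

    if __name__ == "__main__":
        app.run()

The point isn't the endpoint itself, but that a crawler checking for updates would fetch one tiny JSON blob instead of re-rendering full pages.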

 

As to who or what you are, philosophically, I think other people determine that (not looking to be rude here; I just mean that you can't really control what people think).

 

P.S.: I'm not against a site where membership may be a requirement, since, well, I'm already a member.

 

P.P.S.: There are also things like attempting to verify client-side browser capabilities: http://security.stackexchange.com/questions/4759/how-can-i-detect-and-block-bots

 

----------

 

Just finished reading the comments and noted that others have recommended the exact same solution, with the same argument, before me. /facepalm


Edited by aviar, 21 October 2015 - 01:19 PM.

I have come to warn you of the things beyond the wall and the men behind the machines.


#202
LACabeza

    Potato Sprout

  • Donator
  • 8 posts
  • LocationBrazil

Can't we just uglify the HTML so they can't use regex to get the contents?


Edited by LACabeza, 21 October 2015 - 01:00 PM.


#203
aviar

    Fingerling Potato

  • Members
  • 64 posts

Can't we just uglify the HTML so they can't use regex to get the contents?

If a browser's rendering engine can parse it, wouldn't any standards-compliant library parse it as well?
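
To illustrate why, here's a rough sketch with BeautifulSoup; the tag soup below is made up, but any markup a browser can render, a parser can walk, so a scraper just trades regex for a parser:

    # Sketch: a parser pulls content out of deliberately ugly (but renderable)
    # markup just as easily as out of clean markup. The HTML below is made up.
    from bs4 import BeautifulSoup

    ugly_html = """
    <div class='x1'><div><div><span data-z='q'>
    <img src='/reader/page_001.png'   alt=''/></span>
    </div></div></div>
    """

    soup = BeautifulSoup(ugly_html, "html.parser")
    # However deep the nesting or noisy the attributes, the image URL
    # is still one query away.
    print([img["src"] for img in soup.find_all("img")])
    # ['/reader/page_001.png']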


Edited by aviar, 21 October 2015 - 01:02 PM.

I have come to warn you of the things beyond the wall and the men behind the machines.


#204
wolfpup

    Potato Sprout

  • Members
  • 1 posts

What, you want this manga site to be like manga/doujintoshokan? Have the "secret" sister site?

I know of other sites that have a 'secret (18+)' side, and that side is only accessible to registered users. Now, I will admit that I use an offline reader program to get manga to read on my tablet when I don't have wi-fi, so having an API for programs like that could be a good thing. You could make the API available only to registered users, and scanlators could also upload their releases through the same API, which would make things easier for them as well.



#205
EbonyBeast

    Potato Sprout

  • Members
  • 2 posts
If Grumpy thinks it's for the better of the site, then I'm good with it.

#206
ObviousCat

    Potato Spud

  • Members
  • 12 posts

How about delaying the latest chapters by a few days for non-members? One of the main goals of aggregator sites is to get the content out as fast as possible to keep their viewers satisfied.


Edited by ObviousCat, 21 October 2015 - 01:07 PM.


#207
Pixel_J

    Potato Sprout

  • Members
  • 9 posts

How about delaying the latest chapters by a few days for non-members? One of the main goals of aggregator sites is to get the content out as fast as possible to keep their viewers satisfied.

If I understood correctly, it's the opposite: crawlers look at "all" the content on the site, even the old manga nobody knows anything about anymore, and that's what takes a lot of processing. If it were only the new releases, the difference wouldn't be that big over a long period of time, since the number of manga coming out isn't rising that much. (Is it even rising?)



#208
Halo

    Potato

  • Donator
  • 171 posts

I don't think anyone's trying to download half the site in a single day. I think the current biggest hitters are comic page refreshes for new chapters.

I struggle to understand why that would be if these pages are being cached. I mean, you could even cut off the comments and forum sections and make the pages 100% static for guests.
Well, never mind.

Oh, by the way - I crawl too, for scientific reasons. Nothing ridiculous: only 2.3k pages per week (only still-updating comics) with a 2.5-second delay (takes 2 hours).
As a registered user (for the follows counter), from the same IP. Got caught only once, when scrapy derped.
Guess I have to stop, though it was nice to have some neat discoverability and ratings.

Looking at it now, with ~2 million views per week, these pages do need some serious optimization.
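
(For reference, the kind of throttling described above maps onto a handful of standard Scrapy settings; the values below just restate the numbers in this post - a 2.5-second delay, one request at a time - not a recommended configuration. 2,300 pages at 2.5 seconds apart is about 1.6 hours of waiting before fetch time, which lines up with the "takes 2 hours" figure.)

    # settings.py sketch restating the crawl politeness described above.
    DOWNLOAD_DELAY = 2.5                # seconds between requests
    CONCURRENT_REQUESTS_PER_DOMAIN = 1  # never hit the site in parallel
    AUTOTHROTTLE_ENABLED = True         # back off further if responses slow down
    ROBOTSTXT_OBEY = True
    USER_AGENT = "research-crawler (contact: example@example.org)"  # placeholder contact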
 



#209
aviar

    Fingerling Potato

  • Members
  • 64 posts

I struggle to understand why that would be if these pages are being cached. I mean, you could even cut off the comments and forum sections and make the pages 100% static for guests.
Well, never mind.

Oh, by the way - I crawl too, for scientific reasons. Nothing ridiculous: only 2.3k pages per week (only still-updating comics) with a 2.5-second delay (takes 2 hours).
As a registered user (for the follows counter), from the same IP. Got caught only once, when scrapy derped.
Guess I have to stop, though it was nice to have some neat discoverability and ratings.

Looking at it now, with ~2 million views per week, these pages do need some serious optimization.
 

Homepage > research

 

As to the matter at hand, I think people tend to address the problem by making access technologically unfeasible (paywalls, membership walls, great walls), but sometimes it's also worthwhile to explore other aspects, such as:

 

  • Making the retrieved data worthless in a given context. With a membership system, this may mean keying an image to an access code needed to uncover the payload/image (steganography, watermarking); a rough sketch follows below.
  • Addressing the man behind the machine (behaviour) by raising awareness of the situation. I personally came here thanks to one of those Batoto images at the end of a manga.
  • Facilitating the behaviour through channels one can control (as stated before, an API, though any path of less resistance will do).

Not really trying to plan solutions here, just trying to point out that there are other avenues for redressing a behaviour. To be fair, I guess part of the real problem is classifying/distinguishing one group from another in an automated fashion; after all, if one could tell bots from people with 100% certainty within a reasonable time-frame, simply banning them as they crop up wouldn't be out of the question. Since that's not possible, it appears things are moving instead toward a policy that simply covers everybody.
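
To give a rough idea of the first bullet above, per-user watermarking can be as simple as stamping a member ID (or an access code) into each page image as it is served, so any copy that leaks is traceable and awkward to republish cleanly. A minimal sketch with Pillow; the function, its inputs, and the visible-text approach are only illustrative (a real system might embed an invisible payload instead):

    # Sketch: stamp a member-specific mark onto a page image before serving it.
    # The member_id keying and the file paths are illustrative.
    from PIL import Image, ImageDraw

    def watermark_page(src_path, dst_path, member_id):
        img = Image.open(src_path).convert("RGB")
        draw = ImageDraw.Draw(img)
        # A faint visible stamp in the corner; steganographic embedding
        # would hide the same payload invisibly.
        draw.text((10, img.height - 20), "member:" + member_id, fill=(200, 200, 200))
        img.save(dst_path)

    # watermark_page("page_001.png", "page_001_m1234.png", "1234")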


Edited by aviar, 21 October 2015 - 02:03 PM.

I have come to warn you of the things beyond the wall and the men behind the machines.


#210
sneezemonkey

    Potato

  • Members
  • 122 posts
  • Making the retrieved data worthless in a given context. With a membership system, this may mean keying an image to an access code needed to uncover the payload/image (steganography, watermarking).
  • Addressing the man behind the machine (behaviour) by raising awareness of the situation. I personally came here thanks to one of those Batoto images at the end of a manga.

1. This is just gonna give them more traffic, 'cause the casuals are gonna think it's a hassle and the bots are gonna have countermeasures. They're gonna get the clean image anyway.
2. The groups have done a lot to raise awareness over the years. Sadly, not everyone reads the credits and not everyone cares. I can't think of what more the groups can do, short of slapping on a massive watermark telling people to read on Batoto, and that is a bad idea.

 

In any case, the current proposals by Grumpy will cost traffic; I think that has pretty much been established. The real question is whether the cost is worth it.

 

And any route that makes it easier for bots to rip from this site would, I fear, go against the whole point of the site. We might as well just turn it into an update aggregator with actual links to the group pages.


Edited by sneezemonkey, 21 October 2015 - 02:16 PM.

Tired of halved double page spreads? Want to read manga like an actual tankoubon? Just want to load all pages in a chapter at once?

Try Manga OnlineViewer Fluid Mode+ Now!!!!


#211
Halo

    Potato

  • Donator
  • 171 posts

Homepage > research

It doesn't have a chapters counter - that's what I use to detect deleted stuff and count new chapters for a given week.
A comments counter - for all my drama needs.
A votes counter - can't perform a Bayesian estimate without it.
And the follows counter there is just outdated.
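
(For context, the "Bayesian estimate" here is presumably the usual weighted rating that pulls titles with few votes toward the site-wide mean, which is why the raw votes counter matters; a small sketch, with variable names of my own choosing:)

    def bayesian_rating(avg_rating, num_votes, site_mean, min_votes=50):
        """Weighted rating: few votes -> close to the site mean,
        many votes -> close to the title's own average."""
        return (num_votes * avg_rating + min_votes * site_mean) / (num_votes + min_votes)

    # A 9.5-rated title with only 3 votes on a site averaging 7.0:
    # bayesian_rating(9.5, 3, 7.0)  ->  about 7.14, not 9.5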



#212
aviar

    Fingerling Potato

  • Members
  • 64 posts

It doesn't have a chapters counter - that's what I use to detect deleted stuff and count new chapters for a given week.
A comments counter - for all my drama needs.
A votes counter - can't perform a Bayesian estimate without it.
And the follows counter there is just outdated.

No no, what I mean is that your homepage is greater than your research page. That's not to say the research page isn't great - it's really neat, and I find it awesome that you took the time to build such a project with programming and mathematics. Even studying software engineering as I am, I've never really had the consistency and dedication to complete a personal project; I always end up waylaying them due to uni.


Edited by aviar, 21 October 2015 - 02:10 PM.

I have come to warn you of the things beyond the wall and the men behind the machines.


#213
roch

    Potato Sprout

  • Members
  • 7 posts

I don't really know how it works but..

 

Is it possible to add a captcha?

 

Like when a user opens 50+ comic pages in less than a minute, they need to enter a captcha to verify.

I mean, no human would do that in less than a minute. (If they did, then they didn't read even a single title, just randomly opened stuff.)
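
Mechanically, that threshold is just a sliding-window counter per visitor. A rough sketch of the trigger logic; the limits and the require_captcha() hook are made up for illustration:

    # Sketch: flag a visitor for a captcha after 50+ page views in 60 seconds.
    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60
    MAX_PAGES_PER_WINDOW = 50
    recent_hits = defaultdict(deque)  # visitor key (IP or session) -> timestamps

    def should_challenge(visitor, now=None):
        now = time.time() if now is None else now
        hits = recent_hits[visitor]
        hits.append(now)
        while hits and now - hits[0] > WINDOW_SECONDS:
            hits.popleft()  # drop hits that fell out of the window
        return len(hits) > MAX_PAGES_PER_WINDOW

    # On each comic-page request:
    # if should_challenge(request_ip):
    #     require_captcha()  # hypothetical hook that serves the challenge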


Edited by roch, 21 October 2015 - 02:15 PM.


#214
crealque

    Potato Sprout

  • Members
  • 4 posts

I have not read the previous comments, so I'm not sure if this has already been suggested (it should have been, after 11 pages of posts).

 

Introduce reCAPTCHA (http://www.google.com/recaptcha/intro/index.html); it was designed to be the improved (and much less annoying) version of the captcha. This should significantly cut down on unsophisticated crawlers.

 

Force reCAPTCHAs in the following scenarios:

- With guest access

---- Force reCAPTCHA input for every new IP, once a day or more frequently

---- Force reCAPTCHA

 

- With privatized logins (may introduce elevated-privilege logins in the future to bypass this)

---- Upon login

---- Once a day per login session

---- Upon IP change within the same login session, as logged by the server

 

The nice things about reCAPTCHA are that it is:

- Free

- Hosted by Google; the processing is done on Google's side, which should reduce the processing load on Batoto's server nodes

- Available as a point-and-click captcha, so it doesn't require keyboard input
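
For what it's worth, the server-side half of reCAPTCHA is a single POST to Google's siteverify endpoint; a minimal sketch using the requests library (the secret key is a placeholder):

    # Sketch: verify a submitted reCAPTCHA response token on the server.
    import requests

    RECAPTCHA_SECRET = "your-secret-key-here"  # placeholder

    def verify_recaptcha(response_token, remote_ip=None):
        payload = {"secret": RECAPTCHA_SECRET, "response": response_token}
        if remote_ip:
            payload["remoteip"] = remote_ip
        r = requests.post("https://www.google.com/recaptcha/api/siteverify",
                          data=payload, timeout=5)
        return r.json().get("success", False)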


Edited by crealque, 21 October 2015 - 02:18 PM.


#215
Guest_BAnon_von_Kartoffel
  • Guests

Oh, by the way - I crawl too, for scientific reasons

This is amazing; it would be cool if you kept that up, or if Batoto somehow showed this stuff on its own.



#216
Halo

    Potato

  • Donator
  • 171 posts

No no, what I mean is that your homepage is greater than your research page. 

Ah. I thought you were suggesting I use the search list to get the data I needed. /overthinking

But that "homepage" is glorious indeed.



#217
sneezemonkey

    Potato

  • Members
  • 122 posts

I don't really know how it works but..

 

Is it possible to add a captcha?

 

Like when a user opens 50+ comic pages in less than a minute, they need to enter a captcha to verify.

I mean, no human would do that in less than a minute. (If they did, then they didn't read even a single title, just randomly opened stuff.)

 

I have not read the previous comments, so I'm not sure if this has already been suggested (it should have been, after 11 pages of posts).

 

Introduce reCAPTCHA (http://www.google.com/recaptcha/intro/index.html); it was designed to be the improved (and much less annoying) version of the captcha. This should significantly cut down on unsophisticated crawlers.

 

Force reCAPTCHAs in the following scenarios:

- With guest access

---- Force reCAPTCHA input for every new IP, once a day or more frequently

---- Force reCAPTCHA

 

- With privatized logins (may introduce elevated-privilege logins in the future to bypass this)

---- Upon login

---- Once a day per login session

---- Upon IP change within the same login session, as logged by the server

 

The nice things about reCAPTCHA are that it is:

- Free

- Hosted by Google; the processing is done on Google's side, which should reduce the processing load on Batoto's server nodes

- Available as a point-and-click captcha, so it doesn't require keyboard input

Two words: human farm


Tired of halved double page spreads? Want to read manga like an actual tankoubon? Just want to load all pages in a chapter at once?

Try Manga OnlineViewer Fluid Mode+ Now!!!!


#218
crealque

    Potato Sprout

  • Members
  • 4 posts

Two words: human farm

 

Human farms would imply that, at some point, they'd be willing to invest money into such captcha solvers. Not sure how much those would cost, though.



#219
sneezemonkey

    Potato

  • Members
  • 122 posts

Human farms would imply that, at some point, they'd be willing to invest money into such captcha solvers. Not sure how much those would cost, though.

At most, it would cost the same as paying some guy in an outsourced Indian call centre, surely.


Tired of halved double page spreads? Want to read manga like an actual tankoubon? Just want to load all pages in a chapter at once?

Try Manga OnlineViewer Fluid Mode+ Now!!!!


#220
Halo

    Potato

  • Donator
  • 171 posts

At most, it would cost the same as paying some guy in an outsourced Indian call centre, surely.

You wish.

*** is an online service which provides real-time captcha-to-text decodings. This works easy: your software uploads a captcha to our server and receives text from it within seconds.
Cheapest price on the market - starting from 0.7USD per 1000 images, depending on the daily volume