  • File deleted. HowsOnFirst !!N7/2JlXSpJR 04/03/09(Fri)14:45 No.940179  
    Good day, Anon.

    Today, I bring you a tool which I'm sure you'll find useful.

    If you don't know what Danbooru is, I suggest you check it out now: http://danbooru.donmai.us/post/
    Basically, it's a sort of image repository. Like it, there are a few others, all of which seem to use the same code:
    http://nekobooru.net/post/
    http://konachan.com/post/
    http://chan.sankakucomplex.com/post/

    What is it that I brought, you ask?
    Well, a tool to produce an image list from any of those sites, I say.

    http://www.wikifortio.com/792191/dan.7z

    Tools you'll need: wget, g++, mv, rm. Cygwin provides all of these for Windows.

    Compile with:
    g++ dan-pages.cpp -o dan-pages -s -O3 -fexpensive-optimizations
    g++ dan-extract.cpp -o dan-extract -s -O3 -fexpensive-optimizations
    >> HowsOnFirst !!N7/2JlXSpJR 04/03/09(Fri)14:46 No.940180
    How to use:
    First, we need to download all the pages relevant to what we're looking for. Let's suppose we want to get all the pages from Danbooru. We use this command line:
    dan-pages danbooru.donmai.us
    To get the first 10 pages of the top level, we use:
    dan-pages danbooru.donmai.us _null_ 10
    The first argument is the server (other examples: nekobooru.net, konachan.com). The second argument is the tag. _null_ signals the program not to use any tags. The third argument is the number of pages. The program can automatically detect when there are no more pages.
    Examples:
    Download all files tagged with tohno_akiha:
    dan-pages danbooru.donmai.us tohno_akiha
    Download all files tagged with tohno_akiha and highres:
    dan-pages danbooru.donmai.us tohno_akiha+highres

    Once this is done, run dan-extract to extract the URLs from the pages. You can pass it the number of pages to extract from, but it can detect that automatically.
    The program will generate urls.txt, which contains all the URLs that could be found (ideal for use with wget), and urls####.txt files, which contain the same list split into chunks of 1000 URLs (some graphical download managers, e.g. FlashGet, become rather sluggish when too many files are added at once).
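
    For example, once urls.txt exists, you can feed the whole list straight to wget:
    wget -i urls.txt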


    Enjoy.
    >> goodtimesfreegrog 04/03/09(Fri)15:02 No.940191
         File :1238785367.jpg-(181 KB, 518x570, 1223624036972.jpg)
    Not a bad little piece of code if you don't mind me saying.

    I'm a comp sci student and I'm studying C and C++ programming right now, could anyone maybe tell me a little more about the code itself in this thing?
    >> HowsOnFirst !!N7/2JlXSpJR 04/03/09(Fri)15:15 No.940198
    The code is actually very simple.

    All it does is prepare a command line string based on the user's input or the arguments passed, and pass that to wget through system(). I know it's not the best method and that system() has a bad reputation, but anything more sophisticated would have been overkill and probably less portable.
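
    A rough sketch of that idea (not the actual dan-pages source; the /post?tags=...&page=N listing URL and the end-of-pages check below are just placeholders):

    #include <cstdlib>
    #include <iostream>
    #include <sstream>
    #include <string>

    // Build a wget command line for each listing page and hand it to system().
    // The listing URL format and the "stop when wget fails" heuristic are
    // placeholders, not the real end-of-pages detection.
    int fetch_pages(const std::string& server, const std::string& tag, int max_pages)
    {
        for (int page = 1; max_pages <= 0 || page <= max_pages; ++page) {
            std::ostringstream url;
            url << "http://" << server << "/post?page=" << page;
            if (tag != "_null_")                  // _null_ means "no tag filter"
                url << "&tags=" << tag;

            std::ostringstream cmd;
            cmd << "wget -q -O page" << page << ".html \"" << url.str() << "\"";
            if (std::system(cmd.str().c_str()) != 0)
                return page - 1;                  // assume a failed fetch = no more pages
        }
        return max_pages;
    }

    int main(int argc, char* argv[])
    {
        if (argc < 2) {
            std::cerr << "usage: dan-pages <server> [tag|_null_] [pages]" << std::endl;
            return 1;
        }
        std::string server = argv[1];
        std::string tag    = (argc > 2) ? argv[2] : "_null_";
        int pages          = (argc > 3) ? std::atoi(argv[3]) : 0;   // 0 = keep going until done
        fetch_pages(server, tag, pages);
        return 0;
    }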

    Once all the pages are downloaded, the code looks line by line through the resulting HTML for a certain pattern I noticed a while ago, namely "\"file_url\"". Immediately after this string follows a colon and then a quote-enclosed URL, which is the direct link to the file itself (not the thumbnail). After that, it's simply a matter of finding the next quote. This string has the peculiarity of using "\\/" instead of decent slashes. My guess is that it's some limitation of the language the string is in (my knowledge of web development is very limited, but I'm guessing JavaScript; I really don't know). Removing the backslashes from the string is trivial.
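
    The per-line extraction boils down to something like this (simplified, not copied from the source; the sample input in main() is made up, not an actual Danbooru page):

    #include <iostream>
    #include <string>

    // Pull the direct file URL out of one line of HTML, or return "" if the
    // line doesn't contain the "file_url" pattern.
    std::string extract_file_url(const std::string& line)
    {
        std::string::size_type key = line.find("\"file_url\"");
        if (key == std::string::npos)
            return "";
        // The colon comes right after the key; the URL sits in the next pair of quotes.
        std::string::size_type open = line.find('"', key + 10);
        if (open == std::string::npos)
            return "";
        std::string::size_type close = line.find('"', open + 1);
        if (close == std::string::npos)
            return "";
        std::string url = line.substr(open + 1, close - open - 1);

        // The page escapes slashes as "\/"; drop those backslashes.
        std::string clean;
        for (std::string::size_type i = 0; i < url.size(); ++i)
            if (!(url[i] == '\\' && i + 1 < url.size() && url[i + 1] == '/'))
                clean += url[i];
        return clean;
    }

    int main()
    {
        // Made-up example input in the shape described above.
        std::string sample = "{\"file_url\":\"http:\\/\\/example.com\\/data\\/image.jpg\"}";
        std::cout << extract_file_url(sample) << std::endl;   // http://example.com/data/image.jpg
        return 0;
    }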

    That's it.
    >> Anonymous 04/03/09(Fri)15:23 No.940207
    Uh.. Welcome to /c/.

    Thanks for the tool... But I don't think I'll be using anything posted on 4chan.
    I'd just rather write my own in Perl. Which seems more appropriate for the task, anyway.
    >> Anonymous 04/04/09(Sat)09:32 No.941165
    So this can do a complete site rip?
    >> HowsOnFirst !!N7/2JlXSpJR 04/04/09(Sat)21:03 No.941706
         File :1238893400.jpg-(39 KB, 500x600, 1221770786706.jpg)
    Yes, that's right.

    Also, bump.
    >> Mike 04/04/09(Sat)21:32 No.941741
         File :1238895135.jpg-(337 KB, 1678x1046, wr.jpg)
    Aren't there already programs out there that do that? And I thought wget essentially grabbed all content from pages. I made something like this a year ago, but I haven't had time lately to work on it. It works on all chan sites, plus the ones OP mentioned, several wallpaper sites (including 4scrape) and numerous other sites I haven't even tested. It's sadly still in beta, but it has an executable available, which I only released to friends to help test it. (Yes, I have a Miku skin for it.)


