  • File deleted. HowsOnFirst !!N7/2JlXSpJR 04/03/09(Fri)14:45 No.940179  
    Good day, Anon.

    Today, I bring you a tool which I'm sure you'll find useful.

    If you don't know what Danbooru is, I suggest you check it out now: http://danbooru.donmai.us/post/
    Basically, it's a sort of image repository. Like it, there are a few others, all of which seem to use the same code:
    http://nekobooru.net/post/
    http://konachan.com/post/
    http://chan.sankakucomplex.com/post/

    What is it that I brought, you ask?
    Well, a tool to produce an image list from any of those sites, I say.

    http://www.wikifortio.com/792191/dan.7z

    Tools you'll need: wget, g++, mv, rm. Cygwin provides all of these for Windows.

    Compile with:
    g++ dan-pages.cpp -o dan-pages -s -O3 -fexpensive-optimizations
    g++ dan-extract.cpp -o dan-extract -s -O3 -fexpensive-optimizations
    >> HowsOnFirst !!N7/2JlXSpJR 04/03/09(Fri)14:46 No.940180
    How to use:
    First, we need to download all the pages relevant to what we're looking for. Let's suppose we want to get all the pages from Danbooru. We use this command line:
    dan-pages danbooru.donmai.us
    To get the first 10 pages of the top level, we use:
    dan-pages danbooru.donmai.us _null_ 10
    The first argument is the server (other examples: nekobooru.net, konachan.com). The second argument is the tag. _null_ signals the program not to use any tags. The third argument is the number of pages. The program can automatically detect when there are no more pages.
    Examples:
    Download all files tagged with tohno_akiha:
    dan-pages danbooru.donmai.us tohno_akiha
    Download all files tagged with tohno_akiha and highres:
    dan-pages danbooru.donmai.us tohno_akiha+highres

    Once this is done, run dan-extract to extract the URLs from the pages. You can pass it the number of pages to extract from, but it can detect that automatically.
    The program will generate urls.txt, which contains all the URLs that could be found (ideal for use with wget), and urls####.txt files, which contain the same list split into chunks of 1000 URLs (some graphical download managers, e.g. FlashGet, become rather sluggish when too many files are added at once).
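
    For example, once urls.txt exists, you can feed the whole list straight to wget:
    wget -i urls.txt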


    Enjoy.
    >> goodtimesfreegrog 04/03/09(Fri)15:02 No.940191
         File :1238785367.jpg-(181 KB, 518x570, 1223624036972.jpg)
    Not a bad little piece of code if you don't mind me saying.

    I'm a comp sci student and I'm studying C and C++ programming right now, could anyone maybe tell me a little more about the code itself in this thing?
    >> HowsOnFirst !!N7/2JlXSpJR 04/03/09(Fri)15:15 No.940198
    The code is actually very simple.

    All it does is prepare a command line string based on the user's input or the arguments passed, and pass that to wget through system(). I know it's not the best method and that system() has a bad reputation, but anything more sophisticated would have been overkill and probably less portable.
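
    A rough sketch of that idea (not the actual dan-pages source; the /post?tags=...&page=N listing URL and the end-of-pages check below are just placeholders):

    #include <cstdlib>
    #include <iostream>
    #include <sstream>
    #include <string>

    // Build a wget command line for each listing page and hand it to system().
    // The listing URL format and the "stop when wget fails" heuristic are
    // placeholders, not the real end-of-pages detection.
    int fetch_pages(const std::string& server, const std::string& tag, int max_pages)
    {
        for (int page = 1; max_pages <= 0 || page <= max_pages; ++page) {
            std::ostringstream url;
            url << "http://" << server << "/post?page=" << page;
            if (tag != "_null_")                  // _null_ means "no tag filter"
                url << "&tags=" << tag;

            std::ostringstream cmd;
            cmd << "wget -q -O page" << page << ".html \"" << url.str() << "\"";
            if (std::system(cmd.str().c_str()) != 0)
                return page - 1;                  // assume a failed fetch = no more pages
        }
        return max_pages;
    }

    int main(int argc, char* argv[])
    {
        if (argc < 2) {
            std::cerr << "usage: dan-pages <server> [tag|_null_] [pages]" << std::endl;
            return 1;
        }
        std::string server = argv[1];
        std::string tag    = (argc > 2) ? argv[2] : "_null_";
        int pages          = (argc > 3) ? std::atoi(argv[3]) : 0;   // 0 = keep going until done
        fetch_pages(server, tag, pages);
        return 0;
    }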

    Once all the pages are downloaded, the code looks line by line through the resulting HTML for a certain pattern I noticed a while ago, namely "\"file_url\"". Immediately after this string follows a colon and then a quote-enclosed URL, which is the direct link to the file itself (not the thumbnail). After that, it's simply a matter of finding the next quote. This string has the peculiarity of using "\\/" instead of decent slashes. My guess is that it's some limitation of the language the string is in (my knowledge of web development is very limited, but I'm guessing JavaScript; I really don't know). Removing the backslashes from the string is trivial.
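
    The per-line extraction boils down to something like this (simplified, not copied from the source; the sample input in main() is made up, not an actual Danbooru page):

    #include <iostream>
    #include <string>

    // Pull the direct file URL out of one line of HTML, or return "" if the
    // line doesn't contain the "file_url" pattern.
    std::string extract_file_url(const std::string& line)
    {
        std::string::size_type key = line.find("\"file_url\"");
        if (key == std::string::npos)
            return "";
        // The colon comes right after the key; the URL sits in the next pair of quotes.
        std::string::size_type open = line.find('"', key + 10);
        if (open == std::string::npos)
            return "";
        std::string::size_type close = line.find('"', open + 1);
        if (close == std::string::npos)
            return "";
        std::string url = line.substr(open + 1, close - open - 1);

        // The page escapes slashes as "\/"; drop those backslashes.
        std::string clean;
        for (std::string::size_type i = 0; i < url.size(); ++i)
            if (!(url[i] == '\\' && i + 1 < url.size() && url[i + 1] == '/'))
                clean += url[i];
        return clean;
    }

    int main()
    {
        // Made-up example input in the shape described above.
        std::string sample = "{\"file_url\":\"http:\\/\\/example.com\\/data\\/image.jpg\"}";
        std::cout << extract_file_url(sample) << std::endl;   // http://example.com/data/image.jpg
        return 0;
    }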

    That's it.
    >> Anonymous 04/03/09(Fri)15:23 No.940207
    Uh.. Welcome to /c/.

    Thanks for the tool... But I don't think I'll be using anything posted on 4chan.
    I'd just rather write my own in Perl. Which seems more appropriate for the task, anyway.
    >> Anonymous 04/04/09(Sat)09:32 No.941165
    So this can do a complete site rip?
    >> HowsOnFirst !!N7/2JlXSpJR 04/04/09(Sat)21:03 No.941706
         File :1238893400.jpg-(39 KB, 500x600, 1221770786706.jpg)
    Yes, that's right.

    Also, bump.
    >> Mike 04/04/09(Sat)21:32 No.941741
         File :1238895135.jpg-(337 KB, 1678x1046, wr.jpg)
    Aren't there already programs out there that do that? And I thought wget essentially grabbed all content from pages. I made something like this a year ago, but I haven't had time lately to work on it. It works on all chan sites, plus the ones OP mentioned, several wallpaper sites (including 4scrape) and numerous other sites I haven't even tested. It's sadly still in beta, but it has an executable available, which I only released to friends to help test it. (Yes, I have a Miku skin for it.)


