15 October 2005. Updated.

A. writes:

It's not well documented, but the command line option "-e robots=off" will cause wget to ignore any site's robots.txt file.

14 October 2005. Thanks to A.

Robots exclusion file available from any site which uses it:

http://www.anysite.xxx/robots.txt

Samples below.

---------------------------------------------------------------

http://www.google.com/robots.txt

User-agent: *
Allow: /searchhistory/
Disallow: /search
Disallow: /groups
Disallow: /images
Disallow: /catalogs
Disallow: /catalog_list
Disallow: /news
Disallow: /nwshp
Disallow: /?
Disallow: /addurl/image?
Disallow: /pagead/
Disallow: /relpage/
Disallow: /sorry/
Disallow: /imgres
Disallow: /keyword/
Disallow: /u/
Disallow: /univ/
Disallow: /cobrand
Disallow: /custom
Disallow: /advanced_group_search
Disallow: /advanced_search
Disallow: /googlesite
Disallow: /preferences
Disallow: /setprefs
Disallow: /swr
Disallow: /url
Disallow: /wml?
Disallow: /xhtml?
Disallow: /imode?
Disallow: /jsky?
Disallow: /pda?
Disallow: /sprint_xhtml
Disallow: /sprint_wml
Disallow: /pqa
Disallow: /palm
Disallow: /gwt/
Disallow: /purchases
Disallow: /hws
Disallow: /bsd?
Disallow: /linux?
Disallow: /mac?
Disallow: /microsoft?
Disallow: /unclesam?
Disallow: /answers/search?q=
Disallow: /local?
Disallow: /local_url
Disallow: /froogle?
Disallow: /froogle_
Disallow: /print?
Disallow: /scholar?
Disallow: /complete
Disallow: /sponsoredlinks
Disallow: /videosearch?
Disallow: /videopreview?
Disallow: /videoprograminfo?
Disallow: /maps?
Disallow: /translate?
Disallow: /ie?
Disallow: /sms/demo?
Disallow: /katrina?
Disallow: /blogsearch?
Disallow: /reader/

http://www.sun.com/robots.txt

# /robots.txt for www.sun.com
#--------------------------------------------------------------------------
# Mon Feb 2 11:59:27 PST 1998, Fred Elliott
# A NOTE TO THOSE WHO'D BOTHER TO LOOK AT THIS FILE:
#
# Bertrand Meyer's excellent "comp.risks" posting about the potential
# for misusing "robots.txt" files
# (http://www.eiffel.com/private/meyer/robots.html) includes a snapshot
# of the contents of this file here on www.sun.com.
#
# In the article, Bertrand speculates that the directories listed below
# contain proprietary information. Well, they don't. They do, though,
# contain information that we'd prefer people register for before they
# download it.
#
# The purpose of the "robots.txt" file is to keep these directories
# from being indexed so that the average user doesn't stumble across them
# while performing searches, and those that should be accessing these
# directories will do so through the URL that requires them to register.
# Of course, having the contents of this file advertised in "comp.risks"
# diminishes its purpose. Thanks Bertrand. ;-)
#
# If you do actually go to the trouble of figuring out how to download
# the files without registering, what you'll end up with is 1 or 2MB of
# stuff that is meaningless to you unless you have purchased an
# Ultra AX board from Sun. So, please do purchase an Ultra AX board,
# but then you might as well use the URL you'll be given along with it.
#--------------------------------------------------------------------------
#
# Thu Jan 30 16:58:19 PST 1997, Fred Elliott
# o Created this file to prevent indexing of one
# SME directory.
#
User-agent: *
Disallow: /sparc/SPARCengineUltraAX/download/
Disallow: /microelectronics/SPARCengineUltraAX/download/
Disallow: /javachip/SPARCengineUltraAX/download/
Disallow: /javachips/SPARCengineUltraAX/download/
Disallow: /joeroebuck/
Disallow: /*_print.html$
Disallow: /servers/bri/
Disallow: /bri/
# Java Systems files
Disallow: /javastation/remotewindowing/citrix/JICAEng.zip
Disallow: /javastation/remotewindowing/citrix/JavaEnt.tar

http://www.earthlink.net/robots.txt

# robots.txt for http://www.earthlink.net
User-agent: *
Disallow: /1demo950
Disallow: /1demo957
Disallow: /3Dtechnologies
Disallow: /BROADBAND
Disallow: /BUSINESS
Disallow: /DSL
Disallow: /T900
Disallow: /TeamEarthLink
Disallow: /XeroxM760Rebate
Disallow: /XeroxM760printer
Disallow: /_DSLCOMPARISON
Disallow: /__thankyou_OUTofCOMMISSION
Disallow: /_adtest
Disallow: /_cobrandpsp
Disallow: /_demo
Disallow: /_htmlemail
Disallow: /_interface
Disallow: /_monopoly
Disallow: /_new
Disallow: /_pcs
Disallow: /_pointroll
Disallow: /_staging
Disallow: /_testing
Disallow: /_visa
Disallow: /access
Disallow: /ads
Disallow: /advertise
Disallow: /amazon
Disallow: /amex
Disallow: /apple
Disallow: /architecture
Disallow: /assistance
Disallow: /bd
Disallow: /benifits
Disallow: /bizfilings
Disallow: /blackberry
Disallow: /blink_temp
Disallow: /blinksurvey
Disallow: /broadband
Disallow: /business_20000801_Pre_MSPG_BACKUPprint
Disallow: /business_40
Disallow: /business_temp
Disallow: /c4sure
Disallow: /camera
Disallow: /cameraoffer
Disallow: /camerapromo
Disallow: /card
Disallow: /carsdirect
Disallow: /cash
Disallow: /ccupdate
Disallow: /ceiva
Disallow: /checkout
Disallow: /cinfo
Disallow: /clicknbuild
Disallow: /company
Disallow: /connected
Disallow: /consumerinfo
Disallow: /convenience
Disallow: /coolsites
Disallow: /countrywide
Disallow: /data
Disallow: /digcam
Disallow: /digcamoffer
Disallow: /digital
Disallow: /digitalcamera
Disallow: /digitalwork
Disallow: /directory
Disallow: /discover
Disallow: /disney
Disallow: /dslinstallsurvey
Disallow: /dslpromo052201
Disallow: /du
Disallow: /e5
Disallow: /e5_installers_twc
Disallow: /earn
Disallow: /earthlinkmall
Disallow: /earthlinkvsaol
Disallow: /elink030600
Disallow: /elinkoffers
Disallow: /elnadvertise
Disallow: /elnbook
Disallow: /elnk_uunet
Disallow: /elnkmall
Disallow: /email
Disallow: /emailauth
Disallow: /ememories
Disallow: /ememoriesoffer
Disallow: /ememoriesphoto
Disallow: /enspot
Disallow: /epa
Disallow: /esurance
Disallow: /evite
Disallow: /explore
Disallow: /fastlane
Disallow: /fastlanemac
Disallow: /features
Disallow: /forms
Disallow: /freescan
Disallow: /freescanone
Disallow: /frontline
Disallow: /frontpage
Disallow: /get_linked
Disallow: /getlinked
Disallow: /getrich
Disallow: /gh
Disallow: /glconnection
Disallow: /godiva
Disallow: /gra
Disallow: /graphics
Disallow: /hh
Disallow: /hl
Disallow: /home-networking
Disallow: /hosting
Disallow: /i21_old
Disallow: /icw
Disallow: /ie5
Disallow: /ientertain
Disallow: /images
Disallow: /imagestore
Disallow: /inside_earthlink
Disallow: /internet
Disallow: /ipo
Disallow: /js
Disallow: /kidzone
Disallow: /loyalty
Disallow: /m3communications
Disallow: /mall
Disallow: /mallholiday99
Disallow: /mallpage
Disallow: /mallpromo
Disallow: /media
Disallow: /memberbenunsubscribe
Disallow: /merger
Disallow: /migrationinfo
Disallow: /motd
Disallow: /motorola_old
Disallow: /msacct-mgmt
Disallow: /mschat
Disallow: /msdw
Disallow: /msglobal_roam
Disallow: /mshelp
Disallow: /msnetstatus
Disallow: /mspgpcflowers
Disallow: /msservice
Disallow: /mssup
Disallow: /mssupport
Disallow: /mssupportfaqs
Disallow: /mssupportlive
Disallow: /nethelp
Disallow: /nettools
Disallow: /networking
Disallow: /newpsp
Disallow: /nickelnights
Disallow: /nowires
Disallow: /noworries
Disallow: /nwa
Disallow: /nwa-elite
Disallow: /onecore
Disallow: /opendoor
Disallow: /opera
Disallow: /ordernow
Disallow: /palm
Disallow: /partner
Disallow: /patagon
Disallow: /pcflowers
Disallow: /peoplechat
Disallow: /popup
Disallow: /popupdate
Disallow: /portalproduction
Disallow: /printer
Disallow: /printeroffer
Disallow: /printoffer
Disallow: /product
Disallow: /productoffer
Disallow: /productoffers
Disallow: /products
Disallow: /profootball
Disallow: /psp
Disallow: /psp_ads
Disallow: /quitaol
Disallow: /radio
Disallow: /rd
Disallow: /realm
Disallow: /referrals
Disallow: /reflect
Disallow: /ricochet
Disallow: /room
Disallow: /sams
Disallow: /search
Disallow: /secure
Disallow: /sharebuilder
Disallow: /simplicity
Disallow: /sisterhazel
Disallow: /smallworld
Disallow: /smallworldsports
Disallow: /smile
Disallow: /software_club
Disallow: /sos
Disallow: /spacer
Disallow: /spam
Disallow: /spaminator
Disallow: /sprint
Disallow: /sprint1000
Disallow: /sprintpcs
Disallow: /sprintverify
Disallow: /sprintwelcome
Disallow: /standard_html_header
Disallow: /start
Disallow: /start_page
Disallow: /startpage
Disallow: /strategy
Disallow: /superfly
Disallow: /sweeps
Disallow: /switch
Disallow: /switchover
Disallow: /switchoverfiles
Disallow: /t900
Disallow: /talkway
Disallow: /teamearthlink
Disallow: /teamelink
Disallow: /temp
Disallow: /testing
Disallow: /thearena
Disallow: /thecellmovie
Disallow: /theearthlinkmall
Disallow: /themall
Disallow: /titleGraphic
Disallow: /traffix1
Disallow: /traffix2
Disallow: /traffix3
Disallow: /transition
Disallow: /travelscape
Disallow: /unsubscribe
Disallow: /update
Disallow: /version3
Disallow: /video
Disallow: /visacard
Disallow: /wasecure
Disallow: /webchannels
Disallow: /webhosting
Disallow: /webmail_notify
Disallow: /webtour
Disallow: /welcomekit
Disallow: /wgbh
Disallow: /win
Disallow: /wireless1
Disallow: /wishesgranted
Disallow: /zdnetdownloads
Disallow: /ziprealty

http://www.nsa.gov/robots.txt

User-agent: *
Disallow: /about/images/
Disallow: /about/includes/
Disallow: /about/styles/
Disallow: /ads/
Disallow: /african/images/
Disallow: /business/images/
Disallow: /business/includes/
Disallow: /business/styles/
Disallow: /careers/admin/
Disallow: /careers/expo/
Disallow: /careers/Flash/
Disallow: /careers/forms/
Disallow: /careers/images/
Disallow: /careers/includes/
Disallow: /careers/scripts/
Disallow: /careers/styles/
Disallow: /cch/images/
Disallow: /contacts/
Disallow: /coremsgs/images/
Disallow: /cuba/images/
Disallow: /diversity/images/
Disallow: /diversity/includes/
Disallow: /diversity/styles/
Disallow: /error_templates/
Disallow: /errors/
Disallow: /foia/includes/
Disallow: /forms/
Disallow: /gallery/thumbs/
Disallow: /history/images/
Disallow: /history/includes/
Disallow: /history/styles/
Disallow: /honor/images/
Disallow: /ia/images/
Disallow: /ia/includes/
Disallow: /ia/industry/CPQDB/
Disallow: /ia/styles/
Disallow: /images/
Disallow: /includes/
Disallow: /kids/images/
Disallow: /kids/includes/
Disallow: /kids/styles/
Disallow: /korea/images/
Disallow: /memorial/images/
Disallow: /mepp/images/
Disallow: /museum/images/
Disallow: /notices/
Disallow: /programs/
Disallow: /public/images/
Disallow: /public/includes/
Disallow: /public/styles/
Disallow: /publications/images/
Disallow: /research/images/
Disallow: /research/includes/
Disallow: /research/styles/
Disallow: /scripts/
Disallow: /search/
Disallow: /selinux/includes/
Disallow: /sigint/images/
Disallow: /sigint/includes/
Disallow: /sigint/styles/
Disallow: /snac/images/
Disallow: /styles/
Disallow: /techtrans/images/
Disallow: /venona/images/
Disallow: /vigilance/images/
Disallow: /women/images/
Disallow: /Application.cfm
Disallow: /home.swf
Disallow: /home_html_menu.js
Disallow: /intro.swf

http://www.cia.gov/robots.txt

# Disallow any type of crawler.
User-agent: *
Disallow: /_notes
Disallow: /Templates
Disallow: /includes
Disallow: /javascript
Disallow: /scripts
Disallow: /graphics
Disallow: /search

http://senate.gov/robots.txt

# robots.txt file for www.senate.gov
#
User-agent: os-heritrix
Disallow: /

User-agent: NPBot
Disallow: /

User-agent: VoilaBot
Disallow: /

User-agent: Openbot
Disallow: /

User-agent: openbot
Disallow: /

User-agent: Openbot/3.0
Disallow: /

User-agent: dloader(NaverRobot)/1.1
Disallow: /

User-agent: NaverBot
Disallow: /

User-agent: Szukacz
Disallow: /

User-agent: Dumbot
Disallow: /

User-agent: dumbot
Disallow: /

User-agent: Dumbot(version 0.1 beta)
Disallow: /

# www.girafa.com
User-agent: Girafabot
Disallow: /

# www.picsearch.com
User-agent: psbot
Disallow: /

# www.picsearch.com
User-agent: psbot/0.1
Disallow: /

User-agent: EMPAS_ROBOT
Disallow: /

# www.skizzle.com
User-agent: SKIZZLE
Disallow: /

# www.nurelm.com
#User-agent: nuSearch
#Disallow: /

#User-agent: lmspider
#Disallow: /

User-agent: iltrovatore-setaccio
Disallow: /

User-agent: BaiDuSpider
Disallow: /

User-agent: MSIECrawler
Disallow: /

User-agent: blogbot
Disallow: /

User-agent: BlogBot/1.2
Disallow: /

http://house.gov/robots.txt

#
# No robots allowed in the following directories !
#
User-agent: *
Disallow: /htbin
Disallow: /docs/ARCHIVE
Disallow: /docs/apps
Disallow: /docs/moved_sites
Disallow: /docs/temp
Disallow: /docs/test
Disallow: /docs/sites/bin
Disallow: /docs/sites/etc
Disallow: /docs/sites/dev
Disallow: /docs/sites/usr
Disallow: /docs/sites/other/webassistance

http://uscourts.gov/robots.txt

User-Agent:
Disallow:

http://www.usdoj.gov/robots.txt

User-agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT; MS Search 4.0 Robot)
Disallow: /

User-agent: *
Disallow: /Admin/
Disallow: /cgi-bin/
Disallow: /help/
Disallow: /img/
Disallow: /gif/
Disallow: /ins/
Disallow: /gopherdata/
Disallow: /ojp/
Disallow: /wusage/
Disallow: /archive/
Disallow: /opa/pr/support/

User-agent: Netscape-Compass-Robot/Archive
Disallow:

http://www.cisco.com/robots.txt

#--------------------------------
User-agent: cisco-googlebot-enterprise
Disallow: /bug-navigator # Bug Data
# Disallow: /cgi-bin # allow test crawls for TAC support content
Disallow: /pcgi-bin # no programs
Disallow: /univ-src/ccden # will get content through /univercd
Disallow: /cpropub/univercd # obsolete
Disallow: /jobs # temporary entry per performance team
#--------------------------------
#--------------------------------
User-agent: *
Disallow: /bug-navigator # Bug Data
Disallow: /cgi-bin # no programs
Disallow: /pcgi-bin # no programs
Disallow: /univ-src/ccden # will get content through /univercd
Disallow: /cpropub/univercd # obsolete
Disallow: /jobs # temporary entry per performance team
#--------------------------------

Time Magazine

http://www.time.com/robots.txt

# Welcome to Pathfinder's robots.txt
#
# If you have any questions about indexing our site,
# especially regarding more efficient or convenient
# methods, please write to:
#
# request@timeinc.net
#
#--------------------------
User-agent: *
Disallow: /cgi-bin/
Disallow: /event.ng/
Disallow: /money/money101/
Disallow: /offers/cp/
Disallow: /time/magazine/printout/
Disallow: /time/magazine/1997/
Disallow: /time/health/article/0,8599,130962,00.html
Disallow: /time/europe/magazine/2000/0717/ecstasy.html
Disallow: /time/asia/magazine/2000/1113/cover1.html

User-agent: Mozilla
Disallow: /cgi-bin/Money/netc/story.cgi

User-agent: MSIECrawler
Disallow: /

User-agent: Mediapartners-Google*
Disallow:

User-agent: yahoo-newscrawler
Disallow:

User-agent: Yahoo! Slurp
Disallow:

New York Times

http://www.nytimes.com/robots.txt

# robots.txt, www.nytimes.com 3/24/2005
#
User-agent: *
Disallow: /pages/college
Disallow: /college
Disallow: /library
Disallow: /learning
Disallow: /aponline
Disallow: /reuters
Disallow: /cnet
Disallow: /partners
Disallow: /archives
Disallow: /indexes
Disallow: /thestreet
Disallow: /nytimes-partners
Disallow: /financialtimes
Allow: /pages/
Allow: /2003/
Allow: /2004/
Allow: /2005/
Allow: /top/
Allow: /ref/
Allow: /services/xml/

User-agent: Mediapartners-Google*
Disallow:

http://www.foxnews.com/robots.txt

User-agent: *
Disallow: /

User-agent: fusionbot
User-agent: Googlebot
Disallow: /printer_friendly_story

User-agent: Mediapartners-Google*
Disallow: /printer_friendly_story

User-agent: Teoma
Disallow: /printer_friendly_story

User-agent: yahoo-newscrawler
Disallow: /printer_friendly_story

User-agent: Yahoo! Slurp
Disallow: /printer_friendly_story

User-agent: newslookup-bot
Disallow: /printer_friendly_story

User-agent: gsa-crawler
Disallow: /printer_friendly_story

http://abcnews.go.com/robots.txt

# robots.txt for http://abcnews.go.com/
User-agent: *
Disallow: /cgi
Disallow: /xls
Disallow: /imp
Disallow: /kmail
Disallow: /images
Disallow: /map
Disallow: /log
Disallow: /gif
Disallow: /panel
Disallow: /0/
Disallow: /flashHeads.txt
Disallow: /Imodium_Promo.html
Disallow: /promo/
Disallow: /abclinks/
Disallow: /Video/playerIndex
Disallow: /Video/guide
Disallow: /toc.html
Disallow: /houseads/
Disallow: /dispatches/
Disallow: /gallery/
Disallow: /avantgo/
Disallow: /Library/
Disallow: /sections/popoff/
Disallow: /century/slides/
Disallow: /onair/samdonaldson/
Disallow: /sections/politics/elections2004/counties/
Disallow: /sections/politics/electionNov2003/
Disallow: /sections/us/elections98/
Disallow: /sections/us/yahoo/
Disallow: /test/
Disallow: /swen/
Disallow: /PR/
Disallow: /intro/
Disallow: /sections/us/quiz/
Disallow: /sections/politics/quiz/
Disallow: /sections/world/quiz/
Disallow: /sections/business/quiz/
Disallow: /sections/entertainment/quiz/
Disallow: /sections/travel/quiz/
Disallow: /sections/science/quiz/
Disallow: /sections/tech/quiz/
Disallow: /sections/sports/quiz/
Disallow: /sections/living/quiz/
Disallow: /onair/quiz/
Disallow: /go/
Disallow: /news/go/
Disallow: /sections/us/popoff/
Disallow: /sections/politics/popoff/
Disallow: /sections/world/popoff/
Disallow: /sections/business/popoff/
Disallow: /sections/entertainment/popoff/
Disallow: /sections/travel/popoff/
Disallow: /sections/science/popoff/
Disallow: /sections/tech/popoff/
Disallow: /sections/sports/popoff/
Disallow: /sections/living/popoff/
Disallow: /onair/popoff/
Disallow: /sections/us/slides/
Disallow: /sections/politics/slides/
Disallow: /sections/world/slides/
Disallow: /sections/business/slides/
Disallow: /sections/entertainment/slides/
Disallow: /sections/travel/slides/
Disallow: /sections/science/slides/
Disallow: /sections/tech/slides/
Disallow: /sections/sports/slides/
Disallow: /sections/living/slides/
Disallow: /onair/slides/
Disallow: /local/wpvi/
Disallow: /local/kabc/
Disallow: /local/wls/
Disallow: /local/wabc/
Disallow: /local/kfsn/
Disallow: /local/test/

http://www.fema.gov/robots.txt

User-agent: *
Disallow: /stats/
Disallow: /adminRoot/
Disallow: /appeals/
Disallow: /gems/
Disallow: /radio/
Disallow: /cgi-bin/
Disallow: /cgi-shl/
Disallow: /career/
Disallow: /hlt/
Disallow: /staff/
Disallow: /kidsApps/
Disallow: /mitigationss/
Disallow: /maprogress/
Disallow: /compendium/
Disallow: /nfipInsurance/
Disallow: /graphics/
Disallow: /img/
Disallow: /nwz01/
Disallow: /nwz00/
Disallow: /nwz99/
Disallow: /nwz03/
Disallow: /nwz02/
Disallow: /nwz98/
Disallow: /nwz97/
Disallow: /diz97/
Disallow: /diz98/
Disallow: /diz99/
Disallow: /diz00/
Disallow: /diz01/
Disallow: /diz02/

Department of the Interior

http://www.doi.gov/robots.txt

User-agent: *
Disallow: /cgi-bin/
Disallow: /testps/
Disallow: /testdp/
Disallow: /testisc/
Disallow: /logs/
Disallow: /oepc/ra32g6ws/
Disallow: /oepc/guil5de/
Disallow: /backup/
Disallow: /testtad/
Disallow: /noonecansee_bia/
Disallow: /usparkpolice/
Disallow: /budget/POBSecure/
Disallow: /gallery/
Disallow: /hrm/rpl/
Disallow: /intl/itap/reports/
Disallow: /itmr/
Disallow: /newssummary/
Disallow: /nigc/restored_files/employee/
Disallow: /ocio/security/guidance/
Disallow: /ocio/architecture/gap/
Disallow: /ocio/architecture/finance/
Disallow: /ocio/architecture/projects/
Disallow: /ocio/architecture/documents/pw/
Disallow: /ocio/architecture/documents/omb/
Disallow: /ocio/tsd/tcm/
Disallow: /ocio/itmc/
Disallow: /ocio/erm/symantec/sympass/
Disallow: /ocio/erm/pr/
Disallow: /ocio/erm/hardware/contacts/
Disallow: /ocio/erm/microsoft/mspricelist/
Disallow: /ocio/erm/microsoft/mspocs/
Disallow: /ocio/erm/oracle/oraclebpa/
Disallow: /ocio/erm/oracle/oracleprice/
Disallow: /ocio/erm/oracle/oraclepocs/
Disallow: /ocio/architecture_old/gap/
Disallow: /ocio/architecutre_old/finance/
Disallow: /ocio/architecture_old/projects/
Disallow: /pam/hackett/
Disallow: /pam/pppm/blm/cldo/
Disallow: /pam/pppm/blm/mtna/
Disallow: /pam/pppm/blm/utah/
Disallow: /pam/pppm/blm/wymg/
Disallow: /pam/pppm/bia/alsk/
Disallow: /pam/pppm/bia/cntr/
Disallow: /pam/pppm/bia/pcfc/
Disallow: /pam/pppm/bia/rkmt/
Disallow: /pam/pppm/bor/pnro/
Disallow: /pam/pppm/bor/mpro/
Disallow: /pam/pppm/bor/lcro/
Disallow: /pam/pppm/bor/ucro/
Disallow: /pam/pppm/bor/gpro/
Disallow: /pam/pppm/fws/reg1/
Disallow: /pam/pppm/fws/reg4/
Disallow: /pam/pppm/fws/reg9/
Disallow: /pam/pppm/mms/hdqs/
Disallow: /pam/pppm/nbc/mib/
Disallow: /pam/pppm/nbc/dnvr/
Disallow: /pam/pppm/nbc/rest/
Disallow: /pam/pppm/nps/akro/
Disallow: /pam/pppm/nps/ncro/
Disallow: /pam/pppm/nps/mwro/
Disallow: /pam/pppm/nps/nero/
Disallow: /pam/pppm/nps/imro/
Disallow: /pam/pppm/nps/sero/
Disallow: /pam/pppm/nps/pwro/
Disallow: /pam/pppm/nps/dsco/
Disallow: /pam/pppm/nps/hfco/
Disallow: /pam/pppm/oas/
Disallow: /pam/pppm/osm/west/
Disallow: /pam/pppm/osm/appa/
Disallow: /pam/pppm/osm/wadc/
Disallow: /pam/pppm/osm/midc/
Disallow: /pam/pppm/usgs/east/
Disallow: /pam/pppm/usgs/hdqs/
Disallow: /pam/quic/fws/reg1/
Disallow: /pam/quic/fws/reg4/
Disallow: /pam/quic/fws/reg9/
Disallow: /pam/quic/blm/mtna/
Disallow: /pam/quic/blm/utah/
Disallow: /pam/quic/blm/wymg/
Disallow: /pam/quic/blm/cldo/
Disallow: /pam/quic/nbc/albq/
Disallow: /pam/quic/nbc/wadc/
Disallow: /pam/quic/usga/cntr/
Disallow: /pam/wfp/
Disallow: /pfm/audissue/
Disallow: /pfm/intranet/
Disallow: /pfm/migrate/
Disallow: /pfm/fbms1/
Disallow: /pfm/fbms_conops/
Disallow: /pfm/fbms/
Disallow: /pfm/fbms_old/
Disallow: /tma/team/
Disallow: /ocio/architecture/finance/
Disallow: /ocio/architecture/modblu/financial/RPT/
Disallow: /ocio/architecture/modblu/law/RPT/
Disallow: /ocio/architecture/documents/omb/
Disallow: /ocio/architecture/gap/
Disallow: /ocio/architecture/data/
Disallow: /ocio/architecture/blueprint/
Disallow: /ocio/architecture/modblu/recreation/RPT/
Disallow: /ocio/architecture/modblu/fire/RPT/

Small Business Administration

http://www.sba.gov/robots.txt

User-agent: *
Disallow: /test/
Disallow: /private/
Disallow: /cgi-bin/

# don't let search engines get to app1 calendar/buscards
# need to see how this affects our search engine
User-agent: Googlebot/2.1
User-agent: InfoNaviRobot(F107)
User-agent: TV33_Mercator_1-1.0
User-agent: AVSearch-3.0
User-agent: Scooter/2.0
User-agent: Slurp/2.0
User-agent: SearchengineLicenceSheep_v1.0
User-agent: shadow/2.0
User-agent: MultiText/0.1
User-agent: FAST-WebCrawler/2.2.5
User-agent: Atomz/1.0
User-agent: htdig/ (searchit@netmind.com)
User-agent: spider00.logika.net.
Disallow: /app1.sba.gov/buscard/
Disallow: /app1.sba.gov/calendar/
# updated 2002-05-01

http://www.dc.gov/robots.txt

User-agent: *
Disallow: /inc/ # This is an infinite virtual URL space
Disallow: /img/

http://www.archive.org/robots.txt

User-agent: *
Disallow: /cgi-bin/
Disallow: /details/software

http://www.altavista.com/robots.txt

# Tells Scanning Robots Where They Are And Are Not Welcome
#
# User-agent: can also specify by name; "*" is for everyone
# Disallow: disallow if this matches first part of requested path
#
# For now disallow all we can modify this as needed to allow certain crawlers.
#
User-agent: *
Disallow: /search
Disallow: /sidebar
Disallow: /advanced
Disallow: /alchemist
Disallow: /customize
Disallow: /go
Disallow: /go2
Disallow: /cgi-bin
Disallow: /g/
Disallow: /web
Disallow: /r
Disallow: /babelfish
Disallow: /urltrurl
Disallow: /translate
Disallow: /image/results
Disallow: /image/samepage
Disallow: /image/res_detail
Disallow: /audio/results
Disallow: /audio/samepage
Disallow: /audio/res_detail
Disallow: /video/results
Disallow: /video/samepage
Disallow: /video/res_detail
Disallow: /news/more

http://www.microsoft.com/robots.txt

# Robots.txt file for http://www.microsoft.com
#
User-agent: *
Disallow: /australia/careers/library/
Disallow: /backoffice/
Disallow: /canada/Library/mnp/2/aspx/
Disallow: /careers/international/
Disallow: /catalog/
Disallow: /communities/bin.aspx
Disallow: /communities/eventdetails.mspx
Disallow: /communities/blogs/PortalResults.mspx
Disallow: /communities/rss.aspx
Disallow: /downloads/info.aspx
Disallow: /france/formation/centres/planning.asp
Disallow: /france/mnp_utility.mspx
Disallow: /germany/library/images/mnp/
Disallow: /germany/mnp_utility.mspx
Disallow: /hwdev/
Disallow: /hwdq/
Disallow: /ie/ie40/
Disallow: /info/customerror.htm
Disallow: /info/smart404.asp
Disallow: /intl_kb/
Disallow: /intlkb/
Disallow: /isapi/
Disallow: /japan/enable/textview.asp
Disallow: /japan/hwdq/hwtest/
Disallow: /japan/mnp_utility.mspx
Disallow: /japan/products/library/search.asp
Disallow: /japan/showcase/print/default.aspx
Disallow: /japan/terminology/query.asp
Disallow: /library/errorpages/smarterror.aspx
Disallow: /library/toolbar/3.0/
Disallow: /mnp_utility.mspx
Disallow: /netherlands/mnp_utility.mspx
Disallow: /portugal/consumo/
Disallow: /proxy/
Disallow: /resources/casestudies/casestudyimageshow.asp
Disallow: /resources/casestudies/CompanyLogoShow.asp
Disallow: /resources/casestudies/ddi/companylogoshow.asp
Disallow: /resources/casestudies/ddi/showfile.asp
Disallow: /resources/casestudies/FindCaseStudyResults.aspx
Disallow: /resources/casestudies/showfile.asp
Disallow: /servers/
Disallow: /sna/
Disallow: /solutions/
Disallow: /technet/support/ee/
Disallow: /uk/mnp_utility.mspx
Disallow: /windows.netserver/
Disallow: /windows/catalog/
Disallow: /windows/powered/
Disallow: /windowsmobile/catalog/
Disallow: /winhec/

http://www.ibm.com/robots.txt

# $Id: robots.txt,v 1.22 2005/10/04 09:10:17 krusch Exp $
#
# This is a file retrieved by webwalkers a.k.a. spiders that
# conform to a defacto standard.
# See
#
# Comments to the webmaster should be posted at
#
# Format is:
# User-agent:
# Disallow: |
# -----------------------------------------------------------------------------
# User-agent: Fast corporate crawler
User-agent: *
Disallow: //
Disallow: /account/registration
Disallow: /Admin
Disallow: /cgi-
Disallow: /common
Disallow: /data
Disallow: /db2s
Disallow: /fcgi-
Disallow: /fscripts
Disallow: /i/
Disallow: /image/
Disallow: /link
Disallow: /perl
Disallow: /products/finder
Disallow: /products/learn/action
Disallow: /scripts
Disallow: /Scripts
Disallow: /search
Disallow: /Search
Disallow: /tmp
Disallow: /webmaster
Disallow: /zx
Disallow: /zz
#

http://www.morganstanley.com/robots.txt

User-agent: *
Disallow: /bulkemail
Disallow: /institutional/investmentmanagement/10
Disallow: /institutional/investmentmanagement/20
Disallow: /institutional/investmentmanagement/30
Disallow: /institutional/investmentmanagement/40
Disallow: /institutional/investmentmanagement/50
Disallow: /institutional/investmentmanagement/hnavs
Disallow: /institutional/investmentmanagement/products
Disallow: /institutional/investmentmanagement/clbuttons
Disallow: /institutional/investmentmanagement/img
Disallow: /institutional/investmentmanagement/cgi-bin/msdwim/parser.pl
Disallow: /institutional/investmentmanagement/cgi-bin/msdwim/siteSearch.pl
Disallow: /institutional/investmentmanagement/cgi-bin/msdwim/productSearch.pl
Disallow: /institutional/investmentmanagement/70/71
Disallow: /institutional/investmentmanagement/70/72
Disallow: /institutional/investmentmanagement/70/73
Disallow: /institutional/investmentmanagement/70/74
Disallow: /institutional/investmentmanagement/70/75
Disallow: /institutional/investmentmanagement/80/82
Disallow: /institutional/investmentmanagement/80/83
Disallow: /institutional/investmentmanagement/80/84
Disallow: /institutional/investmentmanagement/80/85
Disallow: /im/10
Disallow: /im/20
Disallow: /im/30
Disallow: /im/40
Disallow: /im/50
Disallow: /im/hnavs
Disallow: /im/products
Disallow: /im/clbuttons
Disallow: /im/img
Disallow: /im/cgi-bin/msdwim/parser.pl
Disallow: /im/cgi-bin/msdwim/siteSearch.pl
Disallow: /im/cgi-bin/msdwim/productSearch.pl
Disallow: /im/70/71
Disallow: /im/70/72
Disallow: /im/70/73
Disallow: /im/70/74
Disallow: /im/70/75
Disallow: /im/80/82
Disallow: /im/80/83
Disallow: /im/80/84
Disallow: /im/80/85
Disallow: /im/uk/comm/archive.htm
Disallow: /im/uk/comm/comp_edge.htm
Disallow: /im/uk/comm/euro_equity.htm
Disallow: /im/uk/comm/glo_equity.htm
Disallow: /im/uk/comm/index_linked.htm
Disallow: /im/uk/comm/japan_equity.htm
Disallow: /im/uk/comm/m_class.htm
Disallow: /im/uk/comm/martin.htm
Disallow: /im/uk/comm/pacific_equity.htm
Disallow: /im/uk/comm/uk_corp.htm
Disallow: /im/uk/comm/uk_equity.htm
Disallow: /im/uk/comm/uk_index_linked.htm
Disallow: /im/uk/comm/uk_long.htm
Disallow: /im/uk/comm/us_equity.htm

http://www.aclu.org/robots.txt

User-Agent: msnbot
Crawl-Delay: 20

http://www.rand.org/robots.txt

# http://info.webcrawler.com/mak/projects/robots/exclusion-admin.html
User-agent: *
Disallow: http://www.rand.org/cgi-bin/
Disallow: /test/
Disallow: http://www.rand.org/site_info/robots1.html
Disallow: http://www.rand.org/site_info/robots2.html

http://www.irs.gov/robots.txt

User-agent: *
Disallow: /newsroom/article/0,,id=130650,00.html

Los Alamos National Laboratory

http://www.lanl.gov/robots.txt

User-agent: *
Disallow: /tools/hypermail/
Disallow: /projects/etcap/bib/
Disallow: /projects/sspe/usage/
Disallow: /orgs/cic/cic1/testsite
Disallow: /orgs/im/im1/testsite
Disallow: /projects/asci/statusreports/
Disallow: /projects/asci/ascijobs/
Disallow: /projects/asci/OLD_ARCHIVE/
Disallow: /projects/asci/bluemtn/OLD_BLUE_ARCHIVE/
Disallow: /projects/doe/interlab/
Disallow: /projects/sme/OLD_ARCHIVE/
Disallow: /www-team/
Disallow: /projects/wwwug/OLD_STUFF
Disallow: /projects/asci/DCE/OLD
Disallow: /orgs/citpo
Disallow: /security/badge/images/
Disallow: /security/badge/includes/
Disallow: /security/badge/scripts/
Disallow: /security/badge/styles/
Disallow: /security/badge/Templates/
Disallow: /security/badge/usage/
Disallow: /security/badge/wusage/
Disallow: /security/badge/_notes/
Disallow: /security/clearances/images/
Disallow: /security/clearances/includes/
Disallow: /security/clearances/scripts/
Disallow: /security/clearances/styles/
Disallow: /security/clearances/Templates/
Disallow: /security/clearances/usage/
Disallow: /security/clearances/wusage/
Disallow: /security/clearances/_notes/
Disallow: /orgs/s/
Disallow: /orgs/im/im1/testsite/
Disallow: /orgs/cic/cic1/testsite/
Disallow: /orgs/dvo/laurie_test/
Disallow: /orgs/cr/familyday/
Disallow: /emergency/
Disallow: /cgi-bin/
Disallow: /css-gang/
Disallow: /04calendar
Disallow: /05calendar
Disallow: /viewtopic.php?
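
The samples above all follow the same User-agent / Disallow pattern. As a rough illustration of how a compliant crawler might evaluate such rules, here is a minimal sketch using Python's standard-library urllib.robotparser. The rules are a short excerpt of the lanl.gov sample; the helper name is_allowed and the agent name "ExampleBot" are made up for this example, and real crawlers may treat edge cases (wildcards, Allow precedence, Crawl-Delay) differently.

```python
from urllib.robotparser import RobotFileParser

# Short excerpt of the lanl.gov sample listed above.
SAMPLE_RULES = """\
User-agent: *
Disallow: /tools/hypermail/
Disallow: /emergency/
Disallow: /cgi-bin/
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text.

    Hypothetical helper for illustration; it parses the text locally
    instead of downloading /robots.txt from the site.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # "ExampleBot" is an assumed crawler name; it falls under "User-agent: *".
    for url in (
        "http://www.lanl.gov/cgi-bin/test",
        "http://www.lanl.gov/index.html",
    ):
        print(url, is_allowed(SAMPLE_RULES, "ExampleBot", url))
```

Note that robots.txt is advisory only: a client that skips this check, as wget does with "-e robots=off", can still request the listed paths.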

Lawrence Livermore National Laboratory

http://www.llnl.gov/robots.txt

# robots.txt file for www.llnl.gov
User-agent: * # all web crawlers and searchers
Disallow: /tmp/ # temp files
#Disallow: /www/llnl-bin/ # stay out of binaries
#Disallow: /www/llnl_only # stay out of internal
#Disallow: /www/llnl_only-bin # stay out of internal binaries
#Disallow: /www/review # stay out of unreviewed pages
# This is how Lee thinks this should look
Disallow: cgi-bin/ # stay out of binaries
Disallow: llnl-bin/ # stay out of binaries
Disallow: /llnl-bin/ # stay out of binaries
#Disallow: /llnl_only/ # stay out of internal # disallowed by httpd server
Disallow: /llnl_only-bin/ # stay out of internal binaries
Disallow: /development/ # Stay out of development
Disallow: /development-bin/ # stay out of development-bin dirs
Disallow: /review/ # stay out of unreviewed pages
Disallow: /stats/ # stay out of statistics pages
Disallow: /llnl_only/stats/ # stay out of statistics pages
Disallow: /llnl/lists/historyarc # Stay out of list-of-lists history
Disallow: /historyarc/ # Stay out of list-of-lists history
Disallow: atp/comprehensive2-95.html # jed's killers
Disallow: atp/www-servers.html
Disallow: atp/telecom-media.html
Disallow: /atp/crackdown/ # wrong stuff
Disallow: llnl_only/tid/lof/test # Library of Future test files
Disallow: llnl/lists/ # memory fault problems
Disallow: /www/IPandC/opportunities93 # obsolete pages
Disallow: /www/tid/lof/documents # lof pages, index them manually

Government of Canada

http://www.gc.ca/robots.txt

User-agent: *
Disallow: /canadians-canadiens/
Disallow: /cgi-bin/
Disallow: /datadump/
Disallow: /dev/
Disallow: /canada/
Disallow: /graphics/
Disallow: /images/
Disallow: /incimages/
Disallow: /includes/
Disallow: /include/
Disallow: /infocen/
Disallow: /infocentre/
Disallow: /kentesting/
Disallow: /maildata/
Disallow: /pcoSpotlight/
Disallow: /production/
Disallow: /sujets/
Disallow: /support/
Disallow: /usage/
# This will prevent robots from entering the dev, homepage, script and test areas

http://www.yale.edu/robots.txt

User-agent: *
Disallow: /engineering/
Disallow: /webmaster/stats/
Disallow: /webmaster/logs/
Disallow: /napster/
Disallow: /resnet2008/
Disallow: /search/2003Directory_ORG.pdf
Disallow: /search/directory_pdfs/DirectoryofOrganizations_05.pdf

http://www.rit.edu/robots.txt

User-agent: *
# Sites excluded for mimics of RIT bookstore
# Contact rbisd for more info
Disallow: /~oxa4488/
Disallow: /~wxm9841/
Disallow: /~vds1136/
Disallow: /~lxt0233/
Disallow: /~yxl4725/
Disallow: /~bpm1730/
Disallow: /~sab1867/
Disallow: /~epr3589/
Disallow: /~hxy6190/
# Sites excluded for collecting credit card information
Disallow: /~ppb9491/
# Sites excluded for mimicing the EMBA
Disallow: /~ama8808/
Disallow: /~kfs6089/
# Site with spam posts re open bb
Disallow: /~andpph/cgi-bin
# Site with spam posts
Disallow: /~rwm4604/nick

US Postal Service

http://www.usps.com/robots.txt

# robots.txt for an individual directory
# Commented lines are directories which were excluded before.
# edited by Patrick Richardson 3/01/02
# edited by Hernan Ciudad 9/28/2005

# Provide access to USPS Google Search Appliances.
User-agent: usps-gsa-crawler
# Allow all.
Disallow:

# Provide access to spiders.
#Verity Internal Spider
User-agent: Verity-URL-Gateway/2.4
# Allow all.
Disallow:
# New Employment usability web presence.
Disallow: /newemployment

User-agent: Xenu Link Sleuth 1.2b
# Allow all.
Disallow:
# New Employment usability web presence.
Disallow: /newemployment

# provide access to active directories for Fristgov.
#First Gov Spider
User-agent: Slurp/2.0-MakoCrawl
# Allowing district sites to be indexed.
# IT Purchasing Information
#Disallow: /access
# Bulk Mail Centers Redirect Page
Disallow: /bmc
# Redirect Page
Disallow: /busctr
# Redirect to Purchasing
Disallow: /business
# Page times out??. Redirect http://56.0.78.44/calendar/cfml/login.cfm
Disallow: /calendar
# Do not want links to cgi-bin
Disallow: /cgi-bin
Disallow: /cgi-bin2
# Redirect Page is no longer active.
Disallow: /clr
# Contains rates information - i.e. csv files.
#Disallow: /consumer
# Redirect to lower case cps.
Disallow: /CPS
# Old consumers guide delete when cpim is moved out of BV.
Disallow: /csmrguid
# Celebrate the century, blocked b/c obsolete
Disallow: /ctc
# Directory accidentally launched.
Disallow: /ebillpay/
# Econnectivity program, allowing this spider.
#Disallow: /econnectivity
# Redirect to feedback page.
Disallow: /feedback
# Foia documents excluded.
Disallow: /foia/_pdf
Disallow: /foia/_csv
# Obsolete page should be deleted.
Disallow: /front_static
# Old stamp information, obsolete.
Disallow: /fyi
# Annual report info and history info
# Disallow: /history
Disallow: /history/anrpt98/
# Exclude actual listings.
Disallow: /hrisp/documents
# Stamp information.
#Disallow: /images
# Old showtime ad - delete.
Disallow: /inspectors
# Old ibip site.
Disallow: /ibip
# The itk site.
Disallow: /itk
# Law department site, exlcuding decisions.
Disallow: /lawdept/contract
# Redirect to home page.
Disallow: /letters
# Directory with image map welcome page.
Disallow: /maps
# Files associated with the mns acquisition.
Disallow: /mns
# Old site, redirects to moversguide.
Disallow: /moversnet
# News online, for postal employees.
Disallow: /news/online
# New Employment usability web presence.
Disallow: /newemployment
# Go to the home page link.
Disallow: /pdf
# Redirect to shop.usps.com
# Disallow: /postmark
# Redirect that says out of service then takes you back to home page.
Disallow: /restroom
# Please update your link page.
Disallow: /search97bin
# Go to the home page link.
Disallow: /search97doc
# No longer active redirect.
Disallow: /stampexpo
# Old lost or stolen money orders dir.
Disallow: /stascmo
# No longer active redirect.
Disallow: /swa-intltrack
# Taxes info.
Disallow: /tax
# Server update information obsolete.
Disallow: /timestamp
# Redirect page.
Disallow: /vbc
# Redirect page, obsolete.
Disallow: /vehsales
# Link page to home page.
Disallow: /verity
# Mix of sites, some district.
Disallow: /websites/local
# Redirect to home, obsolete.
Disallow: /year2000

# the disallows apply to all search engines
User-agent: * # All search engines.
# IT Purchasing Information
Disallow: /access
# District Site
Disallow: /atlanta
# Bulk Mail Center Redirect
Disallow: /bmc
# Redirect Page
Disallow: /busctr
# Redirect to Purchasing
Disallow: /business
# Page times out??. Redirect http://56.0.78.44/calendar/cfml/login.cfm
Disallow: /calendar
# District Site
Disallow: /capdistrict
# Do not want links to cgi-bin
Disallow: /cgi-bin
Disallow: /cgi-bin2
# Redirect Page is no longer active.
Disallow: /clr
# Contains rates information - i.e. csv files.
#Disallow: /consumer
# Spanish site.
#Disallow: /correo
# Disallowed b/c cpim contains a lot of pdf files which could clutter external results.
Disallow: /cpim
# Redirect to lower case cps.
Disallow: /CPS
# Old consumers guide delete when cpim is moved out of BV.
Disallow: /csmrguid
# Celebrate the century, blocked b/c obsolete
Disallow: /ctc
# Directory accidentally launched.
Disallow: /ebillpay/
# Ebill pay old dir contains some information.
#Disallow: /ebpp
# Econnectivity program, usps employee only.
Disallow: /econnectivity
# Redirect to feedback page.
Disallow: /feedback
# Foia documents excluded.
Disallow: /foia/_pdf
Disallow: /foia/_csv
# Obsolete page should be deleted.
Disallow: /front_static
# Old stamp information, obsolete.
Disallow: /fyi
# Redirect to home page, old district site.
Disallow: /greatersc
# Annual reports and history information
#Disallow: /history
Disallow: /history/anrpt98/
# Exclude actual listings.
Disallow: /hrisp/documents
# Global delivery services directory.
#Disallow: /ibu
# Old ibip site.
Disallow: /ibip
# Stamp information
#Disallow: /images
# Old showtime ad - delete.
Disallow: /inspectors
# The itk site.
Disallow: /itk
# Law department site, exlcuding decisions.
Disallow: /lawdept/contract
# Redirect to home page.
Disallow: /letters
# Directory with image map welcome page.
Disallow: /maps
# Files associated with the mns acquisition.
Disallow: /mns
# Contracts, passwd protected.
Disallow: /moa
# Old site, redirects to moversguide.
Disallow: /moversnet
# New Employment usability web presence
Disallow: /newemployment
# News online, for postal employees.
Disallow: /news/online
# Northern VA district site.
Disallow: /novadistrict
# Go to the home page link.
Disallow: /pdf
# Redirect for www.postmarkamerica.com
#Disallow: /postmark
# Southeast New England district site.
Disallow: /provdist
# Two purchasing forms here.
#Disallow: /purchase
# Rail page giving email address for access.
Disallow: /rail
# Redirect that says out of service then takes you back to home page.
Disallow: /restroom
# San Diego district site.
Disallow: /sandiego
# Please update your link page.
Disallow: /search97bin
# Go to the home page link.
Disallow: /search97doc
# No longer active redirect.
Disallow: /stampexpo
# Old lost or stolen money orders dir.
Disallow: /stascmo
# No longer active redirect.
Disallow: /swa-intltrack
# Server update information obsolete.
Disallow: /timestamp
# Redirect page, obsolete.
Disallow: /vbc
# Redirect page, obsolete.
Disallow: /vehsales
# Link page to home page.
Disallow: /verity
# Mix of sites, some district.
Disallow: /websites/local
# P. Mason request.
Disallow: /webtools
# Redirect to home, obsolete.
Disallow: /year2000
# Redirect to commercialmailers
Disallow: /commercialmailers
# Voluntary Early Retirement Authority.
Disallow: /vera
# News Links or link online
Disallow: /news/link
# Carrier Pickup eBay Content 12042003
Disallow: /shipping/carrierpickup/ebay

http://www.northrup.com/robots.txt

# Robots.txt file from http://www.searchengineworld.com
#
# Built from text file http://info.webcrawler.com/mak/projects/robots/active/all.txt
#
# This restricts access to only known and registered robots.
#
User-agent: Mozilla/3.0 (compatible;miner;mailto:miner@miner.com.br)
Disallow:
User-agent: WebFerret
Disallow:
User-agent: Due to a deficiency in Java it's not currently possible to set the User-agent.
Disallow:
User-agent: no
Disallow:
User-agent: 'Ahoy! The Homepage Finder'
Disallow:
User-agent: Arachnophilia
Disallow:
User-agent: ArchitextSpider
Disallow:
User-agent: ASpider/0.09
Disallow:
User-agent: AURESYS/1.0
Disallow:
User-agent: BackRub/*.*
Disallow:
User-agent: Big Brother
Disallow:
User-agent: BlackWidow
Disallow:
User-agent: BSpider/1.0 libwww-perl/0.40
Disallow:
User-agent: CACTVS Chemistry Spider
Disallow:
User-agent: Digimarc CGIReader/1.0
Disallow:
User-agent: Checkbot/x.xx LWP/5.x
Disallow:
User-agent: CMC/0.01
Disallow:
User-agent: combine/0.0
Disallow:
User-agent: conceptbot/0.3
Disallow:
User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
Disallow:
User-agent: root/0.1
Disallow:
User-agent: CS-HKUST-IndexServer/1.0
Disallow:
User-agent: CyberSpyder/2.1
Disallow:
User-agent: Deweb/1.01
Disallow:
User-agent: DragonBot/1.0 libwww/5.0
Disallow:
User-agent: EIT-Link-Verifier-Robot/0.2
Disallow:
User-agent: Emacs-w3/v[0-9\.]+
Disallow:
User-agent: EmailSiphon
Disallow:
User-agent: EMC Spider
Disallow:
User-agent: explorersearch
Disallow:
User-agent: Explorer
Disallow:
User-agent: ExtractorPro
Disallow:
User-agent: FelixIDE/1.0
Disallow:
User-agent: Hazel's Ferret Web hopper,
Disallow:
User-agent: ESIRover v1.0
Disallow:
User-agent: fido/0.9 Harvest/1.4.pl2
Disallow:
User-agent: Hämähäkki/0.2
Disallow:
User-agent: KIT-Fireball/2.0 libwww/5.0a
Disallow:
User-agent: Fish-Search-Robot
Disallow:
User-agent: Mozilla/2.0 (compatible fouineur v2.0; fouineur.9bit.qc.ca)
Disallow:
User-agent: Robot du CRIM 1.0a
Disallow:
User-agent: Freecrawl
Disallow:
User-agent: FunnelWeb-1.0
Disallow:
User-agent: gcreep/1.0
Disallow:
User-agent: ???
Disallow:
User-agent: GetURL.rexx v1.05
Disallow:
User-agent: Golem/1.1
Disallow:
User-agent: Gromit/1.0
Disallow:
User-agent: Gulliver/1.1
Disallow:
User-agent: yes
Disallow:
User-agent: AITCSRobot/1.1
Disallow:
User-agent: wired-digital-newsbot/1.5
Disallow:
User-agent: htdig/3.0b3
Disallow:
User-agent: HTMLgobble v2.2
Disallow:
User-agent: no
Disallow:
User-agent: IBM_Planetwide,
Disallow:
User-agent: gestaltIconoclast/1.0 libwww-FM/2.17
Disallow:
User-agent: INGRID/0.1
Disallow:
User-agent: IncyWincy/1.0b1
Disallow:
User-agent: Informant
Disallow:
User-agent: InfoSeek Robot 1.0
Disallow:
User-agent: Infoseek Sidewinder
Disallow:
User-agent: InfoSpiders/0.1
Disallow:
User-agent: inspectorwww/1.0 http://www.greenpac.com/inspectorwww.html
Disallow:
User-agent: 'IAGENT/1.0'
Disallow:
User-agent: IsraeliSearch/1.0
Disallow:
User-agent: JCrawler/0.2
Disallow:
User-agent: Jeeves v0.05alpha (PERL, LWP, lglb@doc.ic.ac.uk)
Disallow:
User-agent: Jobot/0.1alpha libwww-perl/4.0
Disallow:
User-agent: JoeBot,
Disallow:
User-agent: JubiiRobot
Disallow:
User-agent: jumpstation
Disallow:
User-agent: Katipo/1.0
Disallow:
User-agent: KDD-Explorer/0.1
Disallow:
User-agent: KO_Yappo_Robot/1.0.4(http://yappo.com/info/robot.html)
Disallow:
User-agent: LabelGrab/1.1
Disallow:
User-agent: LinkWalker
Disallow:
User-agent: logo.gif crawler
Disallow:
User-agent: Lycos/x.x
Disallow:
User-agent: Lycos_Spider_(T-Rex)
Disallow:
User-agent: Magpie/1.0
Disallow:
User-agent: MediaFox/x.y
Disallow:
User-agent: MerzScope
Disallow:
User-agent: NEC-MeshExplorer
Disallow:
User-agent: MOMspider/1.00 libwww-perl/0.40
Disallow:
User-agent: Monster/vX.X.X -$TYPE ($OSTYPE)
Disallow:
User-agent: Motor/0.2
Disallow:
User-agent: MuscatFerret
Disallow:
User-agent: MwdSearch/0.1
Disallow:
User-agent: NetCarta CyberPilot Pro
Disallow:
User-agent: NetMechanic
Disallow:
User-agent: NetScoop/1.0 libwww/5.0a
Disallow:
User-agent: NHSEWalker/3.0
Disallow:
User-agent: Nomad-V2.x
Disallow:
User-agent: NorthStar
Disallow:
User-agent: Occam/1.0
Disallow:
User-agent: HKU WWW Robot,
Disallow:
User-agent: Orbsearch/1.0
Disallow:
User-agent: PackRat/1.0
Disallow:
User-agent: Patric/0.01a
Disallow:
User-agent: Peregrinator-Mathematics/0.7
Disallow:
User-agent: Duppies
Disallow:
User-agent: Pioneer
Disallow:
User-agent: PGP-KA/1.2
Disallow:
User-agent: Resume Robot
Disallow:
User-agent: Road Runner: ImageScape Robot (lim@cs.leidenuniv.nl)
Disallow:
User-agent: Robbie/0.1
Disallow:
User-agent: ComputingSite Robi/1.0 (robi@computingsite.com)
Disallow:
User-agent: Roverbot
Disallow:
User-agent: SafetyNet Robot 0.1,
Disallow:
User-agent: Scooter/1.0
Disallow:
User-agent: not available
Disallow:
User-agent: Senrigan/xxxxxx
Disallow:
User-agent: SG-Scout
Disallow:
User-agent: Shai'Hulud
Disallow:
User-agent: SimBot/1.0
Disallow:
User-agent: Open Text Site Crawler V1.0
Disallow:
User-agent: SiteTech-Rover
Disallow:
User-agent: Slurp/2.0
Disallow:
User-agent: ESISmartSpider/2.0
Disallow:
User-agent: Snooper/b97_01
Disallow:
User-agent: Solbot/1.0 LWP/5.07
Disallow:
User-agent: Spanner/1.0 (Linux 2.0.27 i586)
Disallow:
User-agent: no
Disallow:
User-agent: Mozilla/3.0 (Black Widow v1.1.0; Linux 2.0.27; Dec 31 1997 12:25:00
Disallow:
User-agent: Tarantula/1.0
Disallow:
User-agent: tarspider
Disallow:
User-agent: dlw3robot/x.y (in TclX by http://hplyot.obspm.fr/~dl/)
Disallow:
User-agent: Templeton/
Disallow:
User-agent: TitIn/0.2
Disallow:
User-agent: TITAN/0.1
Disallow:
User-agent: UCSD-Crawler
Disallow:
User-agent: urlck/1.2.3
Disallow:
User-agent: Valkyrie/1.0 libwww-perl/0.40
Disallow:
User-agent: Victoria/1.0
Disallow:
User-agent: vision-search/3.0'
Disallow:
User-agent: VWbot_K/4.2
Disallow:
User-agent: w3index
Disallow:
User-agent: W3M2/x.xxx
Disallow:
User-agent: WWWWanderer v3.0
Disallow:
User-agent: WebCopy/
Disallow:
User-agent: WebCrawler/3.0 Robot libwww/5.0a
Disallow:
User-agent: WebFetcher/0.8,
Disallow:
User-agent: weblayers/0.0
Disallow:
User-agent: WebLinker/0.0 libwww-perl/0.1
Disallow:
User-agent: no
Disallow:
User-agent: WebMoose/0.0.0000
Disallow:
User-agent: Digimarc WebReader/1.2
Disallow:
User-agent: webs@recruit.co.jp
Disallow:
User-agent: webvac/1.0
Disallow:
User-agent: webwalk
Disallow:
User-agent: WebWalker/1.10
Disallow:
User-agent: WebWatch
Disallow:
User-agent: Wget/1.4.0
Disallow:
User-agent: w3mir
Disallow:
User-agent: no
Disallow:
User-agent: WWWC/0.25 (Win95)
Disallow:
User-agent: none
Disallow:
User-agent: XGET/0.7
Disallow:
User-agent: Nederland.zoek
Disallow:
User-agent: BizBot04 kirk.overleaf.com
Disallow:
User-agent: HappyBot (gserver.kw.net)
Disallow:
User-agent: CaliforniaBrownSpider
Disallow:
User-agent: EI*Net/0.1 libwww/0.1
Disallow:
User-agent: Ibot/1.0 libwww-perl/0.40
Disallow:
User-agent: Merritt/1.0
Disallow:
User-agent: StatFetcher/1.0
Disallow:
User-agent: TeacherSoft/1.0 libwww/2.17
Disallow:
User-agent: WWW Collector
Disallow:
User-agent: processor/0.0ALPHA libwww-perl/0.20
Disallow:
User-agent: wobot/1.0 from 206.214.202.45
Disallow:
User-agent: Libertech-Rover www.libertech.com?
Disallow:
User-agent: WhoWhere Robot
Disallow:
User-agent: ITI Spider
Disallow:
User-agent: w3index
Disallow:
User-agent: MyCNNSpider
Disallow:
User-agent: SummyCrawler
Disallow:
User-agent: OGspider
Disallow:
User-agent: linklooker
Disallow:
User-agent: CyberSpyder (amant@www.cyberspyder.com)
Disallow:
User-agent: SlowBot
Disallow:
User-agent: heraSpider
Disallow:
User-agent: Surfbot
Disallow:
User-agent: Bizbot003
Disallow:
User-agent: WebWalker
Disallow:
User-agent: SandBot
Disallow:
User-agent: EnigmaBot
Disallow:
User-agent: spyder3.microsys.com
Disallow:
User-agent: www.freeloader.com.
Disallow:
User-agent: Googlebot
Disallow:
User-agent: METAGOPHER
Disallow:
User-agent: *
Disallow: /

Centers for Disease Control and Prevention

http://www.cdc.gov/robots.txt

# Ignore FrontPage files
User-agent: *
Disallow: /_borders
Disallow: /_derived
Disallow: /_fpclass
Disallow: /_overlay
Disallow: /_private
Disallow: /_themes
Disallow: /_vti_bin
Disallow: /_vti_cnf
Disallow: /_vti_log
Disallow: /_vti_map
Disallow: /_vti_pvt
Disallow: /_vti_txt

# Rover is a bad dog
User-agent: Roverbot
Disallow: /

# EmailSiphon is a hunter/gatherer which extracts email addresses for spam-mailers to use
User-agent: EmailSiphon
Disallow: /

Military Robots.txt

Austria

http://www.bmlv.gv.at/robots.txt

# robots.txt for http://www.bundesheer.at/
# example:
# Disallow: /verzeichnis/
# Disallow: /datei.html
User-agent: *
Disallow: /php_docs/
Disallow: /toolbar/
Disallow: /suche/
Disallow: /misc/image_popup/
Disallow: /misc/topthema/

Finland

http://www.mil.fi/robots.txt

User-agent: * # directed to all spiders, not just Scooter
Disallow: /virhe
Disallow: /temp
Disallow: /tilastot
Disallow: /webstats
Disallow: /pillar
Disallow: /logs
Disallow: /smarty
Disallow: /asiointi/tiedotus/kokoonpano.dsp

France

http://www.defense.gouv.fr/robots.txt

User-agent: htdig
Disallow: /RepertoireInexistant/

User-agent: *
Crawl-delay: 120
Disallow: /wai_
Disallow: ?_&wai=1
Disallow: ?_&pp=1
Disallow: /site_map
Disallow: /sites/air/site_map
Disallow: /sites/caj/site_map
Disallow: /sites/ced/site_map
Disallow: /sites/cga/site_map
Disallow: /sites/cgarm/site_map
Disallow: /sites/commemorations_du_60e/site_map
Disallow: /sites/commemorations_du_60e_/site_map
Disallow: /sites/csfm/site_map
Disallow: /sites/csrm/site_map
Disallow: /sites/das/site_map
Disallow: /sites/defense/site_map
Disallow: /sites/dga/site_map
Disallow: /sites/dgse/site_map
Disallow: /sites/dird/site_map
Disallow: /sites/dpsd/site_map
Disallow: /sites/ecpad/site_map
Disallow: /sites/ema/site_map
Disallow: /sites/essences/site_map
Disallow: /sites/gendarmerie/site_map
Disallow: /sites/marine/site_map
Disallow: /sites/sante/site_map
Disallow: /sites/sedcac/site_map
Disallow: /sites/sga/site_map
Disallow: /sites/terre/site_map

Norway

http://www.mil.no/robots.txt

## /robots.txt file for http://www.mil.no The Norwegian armed forces Internett site
User-agent: *
Disallow: /template
Disallow: /error
Disallow: /incoming
Disallow: /multimedia
Disallow: /forsvarsnett/start/aktuelt/pressemeldinger

Russia

http://www.mil.ru/robots.txt

User-agent: *
Disallow: /main.shtml #for removing the doubling with /
Disallow: /style.css
Disallow: /cgi-bin/
Disallow: /dyn_images/
Disallow: /images/
Disallow: /flash/
Disallow: /htdig/
Disallow: /js/
Disallow: /pdf/
Disallow: /ppt/

Sweden

http://www.mil.se/robots.txt

User-agent: *
Disallow: /attachments/
Disallow: /images/

User-agent: sitecheck.internetseer.com
Disallow: /

United States

http://www.dod.gov/robots.txt
http://www.dod.mil/robots.txt

User-agent: *
Disallow: /cgi-bin/
Disallow: /readiness/
Disallow: /tmp/
Disallow: /srch/
Disallow: /dodsrch/
Disallow: /americasupportsyou/troops/messages
Disallow: /americasupportsyou/kids/messages
Disallow: /americasupportsyou/america/messages
Disallow: /americasupportsyou/support/messages

http://www.centcom.mil/robots.txt

User-agent: *
Disallow: /_borders
Disallow: /_derived
Disallow: /_fpclass
Disallow: /_overlay
Disallow: /_private
Disallow: /_themes
Disallow: /_vti_bin
Disallow: /_vti_cnf
Disallow: /_vti_log
Disallow: /_vti_map
Disallow: /_vti_pvt
Disallow: /_vti_txt
Disallow: /cgi-bin
Disallow: /HoldingBin
Disallow: /Include
Disallow: /Js
Disallow: /newcentcomstage
Disallow: /personnelstage
Disallow: /demining_stage
Disallow: /demining
Disallow: /ccrf
Disallow: /ccj1
Disallow: /OLD GALLERIES
Disallow: /CentcomNews
Disallow: /wc.asp
Disallow: /casRep.asp
Disallow: /CentcomNews/Investigation Reports/A-10
Disallow: /ReadAhead

--------------------------------------------------------------------------------
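
As a rough illustration of how a crawler library might read the French sample above, the sketch below feeds a short excerpt of it to Python's standard urllib.robotparser module. The agent name ExampleBot and the helper name load are made up for this example, and crawl_delay() needs Python 3.6 or later. Other parsers may treat entries such as Crawl-delay differently, so this shows one library's reading, not a definitive interpretation.

```python
from urllib.robotparser import RobotFileParser

# Excerpt of the www.defense.gouv.fr sample, one directive per line.
FRANCE_EXCERPT = """\
User-agent: htdig
Disallow: /RepertoireInexistant/

User-agent: *
Crawl-delay: 120
Disallow: /wai_
Disallow: /site_map
"""


def load(text):
    """Build a parser from robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = load(FRANCE_EXCERPT)
base = "http://www.defense.gouv.fr"

# htdig has its own record, whose only rule names a placeholder
# directory, so the "*" rules do not apply to it.
htdig_ok = parser.can_fetch("htdig", base + "/site_map")

# ExampleBot (a hypothetical agent) is not named, so it falls under "*".
other_ok = parser.can_fetch("ExampleBot", base + "/site_map")
other_delay = parser.crawl_delay("ExampleBot")

print(htdig_ok, other_ok, other_delay)
```

With this parser, the named htdig record takes the place of the "*" record for that agent rather than adding to it, which appears to be why the file gives htdig a rule that blocks nothing real.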
http://www.cryptome.org/robots.txt

# go away
User-agent: msnbot
disallow: /
User-agent: Zealbot
Disallow: /
User-agent: VoilaBot
Disallow: /
User-agent: YahooFeedSeeker
Disallow: /
User-agent: appie
Disallow: /
User-agent: Teoma
Disallow: /
User-agent: MSIECrawler
Disallow: /
User-agent: sun4u
Disallow: /
User-agent: HTTrack
Disallow: /
User-agent: Nutch
Disallow: /
User-agent: Sisi
Disallow: /
User-agent: FyberSpider
Disallow: /
User-agent: girafabot
Disallow: /
User-agent: lmspider
Disallow: /
User-agent: NP
Disallow: /
User-agent: Robi
Disallow: /
User-agent: Webster Pro
Disallow: /
User-agent: Webster
Disallow: /
User-agent: Zeus
Disallow: /
User-agent: Gigabot
Disallow: /
User-agent: Slurp
Disallow: /
User-agent: Scirus
Disallow: /
User-agent: PicoSearch
Disallow: /
User-agent: WGet
Disallow: /
User-agent: Plucker
Disallow: /
User-agent: DISCo Pump
Disallow: /
User-agent: Gulliver
Disallow: /
User-agent: vspider
Disallow: /
User-agent: EmailSiphon
Disallow: /
User-agent: Teleport Pro
Disallow: /
User-agent: Fetch
Disallow: /
User-agent: pamuk
Disallow: /
User-agent: WebCopier
Disallow: /
User-agent: WebCapture
Disallow: /
User-agent: Mass Downloader
Disallow: /
User-agent: WebCopy
Disallow: /
User-agent: AWV0.8d
Disallow: /
User-agent: Crescent Internet ToolPak
Disallow: /
User-agent: JOC Web Spider
Disallow: /
User-agent: WebStripper
Disallow: /
User-agent: SiteSucker
Disallow: /
User-agent: Webdup
Disallow: /
User-agent: Scooter
Disallow: /
User-agent: Python-urllib
Disallow: /
User-agent: Python
Disallow: /
User-agent: Franklin Locator
Disallow: /
User-agent: CK-SillyDog
Disallow: /
User-agent: PocketHTTP
Disallow: /
User-agent: VoilaBot
Disallow: /
User-agent: Xaldon WebSpider
Disallow: /
User-agent: WebCapture
Disallow: /
User-agent: WebStripper
Disallow: /
User-agent: Java
Disallow: /
User-agent: WebReaper
Disallow: /
User-agent: TeragramWebcrawler
Disallow: /
User-agent: Vagabondo
Disallow: /
User-agent: nogoop-HttpClient
Disallow: /
User-agent: Baiduspider
Disallow: /
User-agent: W3CRobot
Disallow: /
User-agent: MyOperaTB/1.0
Disallow: /
User-agent: MyOperaTB
Disallow: /
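
A block list like the cryptome.org sample only affects clients that choose to honor it. The following minimal sketch, again using Python's standard urllib.robotparser, shows how one such client might evaluate a few records from the sample. The helper name is_allowed, the version strings, and the agent ExampleBot/1.0 are assumptions made for illustration, and matching rules can differ between robots.txt implementations.

```python
from urllib.robotparser import RobotFileParser

# Excerpt of the cryptome.org sample, one directive per line.
CRYPTOME_EXCERPT = """\
# go away
User-agent: msnbot
disallow: /
User-agent: WGet
Disallow: /
User-agent: Python-urllib
Disallow: /
"""


def is_allowed(robots_text, agent, url):
    """Return True if this parser would let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, url)


url = "http://www.cryptome.org/index.html"

# The first three agents match named records; ExampleBot/1.0 is a
# hypothetical agent that the excerpt does not mention.
for agent in ("msnbot", "Wget/1.10", "Python-urllib/3.11", "ExampleBot/1.0"):
    print(agent, is_allowed(CRYPTOME_EXCERPT, agent, url))
```

This parser lowercases field names and compares agent names without regard to case, so the lowercase "disallow" and the "WGet" spelling still take effect. Because the file has no "User-agent: *" record, any agent it does not name is allowed.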