3 April 2003. Add responses.

2 April 2003. Add responses. Link to working paper on usage log data management:

http://cryptome.org/usage-logs.htm

1 April 2003


Cryptome and Cartome attended a session today in New York of the Usage Log Data Management Working Group, on how site operators and ISPs might address problems of web user privacy, law enforcement access to user logs, commercial exploitation of logs, and the creation of tools for managing usage log retention to protect user privacy. This was a private meeting held prior to the Computers, Freedom and Privacy conference beginning tomorrow.

E-mail on the session:

Subject: working group meeting on usage log retention
Date: 	 Fri, 21 Feb 2003 11:49:14 -0800
From: 	 Jeff Ubois <jeff@ubois.com>
To: 	 <jya@pipeline.com>

On April 1 at the Computers, Freedom, and Privacy conference in New 
York, I'm arranging a working group meeting that will develop a model 
policy and the specification for a tool to manage web usage log 
retention. I'm hoping you can attend.

A draft of the announcement is below. If you have comments, suggestions, 
or know people who should attend, please let me know; it's not quite 
ready for public posting.

I would really like to get your ideas about this; is there a time we 
could talk by phone?

Jeff Ubois
510-527-2707


Usage Log Data Management Workshop at the Computers, Freedom & Privacy 
Conference
New York City, April 1

The usage logs generated by web servers contain much data that is useful 
for site owners, but the current default configurations pose a threat to 
the privacy of individuals, and present a serious legal risk to 
organizations.

The IP addresses collected in these logs by web servers are becoming 
increasingly easy to associate with the identity of particular 
individuals. For organizations, this means that they may violate their 
own privacy policies by retaining portions of these logs. At a minimum, 
web site owners are exposed to potential lawsuits and discovery requests.

But today, no standard policy exists to manage the retention and 
eventual destruction of usage log data, and no tools exist to implement 
such policies.

The goal of this all day working group meeting is to develop a policy 
that organizations can use to govern their retention of usage log data, 
and a specification for a utility for the Apache web server that will 
delete usage log data according to this policy.

-----
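
The utility contemplated by the announcement would delete usage log data according to policy. At its simplest, such a tool might expire access log entries older than a fixed retention window, roughly as sketched below. This is an illustration only: the file name, the Common Log Format timestamp and the 30-day window are assumptions, and a real tool would also have to handle rotation, locking and whatever log format the server is actually configured to write.

# Rough sketch: expire access-log entries older than a retention window.
import re
import sys
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)
# Matches the [day/month/year:time zone] timestamp of the Common Log Format.
TIMESTAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")

def expire(path):
    now = datetime.now(timezone.utc)
    kept = []
    with open(path) as f:
        for line in f:
            m = TIMESTAMP.search(line)
            if not m:
                continue  # drop lines without a parseable timestamp
            when = datetime.strptime(m.group(1), "%d/%b/%Y:%H:%M:%S %z")
            if now - when <= RETENTION:
                kept.append(line)
    with open(path, "w") as f:  # rewrite the log with expired entries gone
        f.writelines(kept)

if __name__ == "__main__":
    expire(sys.argv[1] if len(sys.argv) > 1 else "access_log")
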

Date: Thu, 27 Mar 2003 11:21:22 -0800
Subject: Re: [Fwd: working group meeting on usage log retention]
From: Jeff Ubois <jeff@ubois.com>
To: John Young <jya@pipeline.com>

That's terrific.  The meeting will be held in the New Yorker Hotel's Bay
Kipp room, at 481 8th Ave from 9.30 - 5 on April 1.  There will be folks
from the EFF, the Internet Archive, Chilling Effects, the FTC, CMU,
Umich, Gartner and a few other organizations showing up.  I'll be sending
out a longer note tomorrow to everyone attending.

I'm very glad you can make it, I think your experience in this will be
invaluable. 

Jeff 

On 3/27/03 1:40 PM, "John Young" <jya@pipeline.com> wrote:

> Jeff,
> 
> Nope, not disinterest, rather working outside NYC for 6 months on a
> consultant job.
> 
> Now I'm back and would welcome the opportunity to participate. Point
> to the place and time and I'll be there with my partner in cyber-logrolling,
> Deborah Natsios, Cartome operator.

Date: Fri, 28 Mar 2003 16:47:08 -0800
Subject: Usage log working group: pre-conference notes  
From: Jeff Ubois <jeff@ubois.com>
To: <jeff@ubois.com>, <lmazzarella@ftc.gov>, <i@post.harvard.edu>,
<bob.page@accrue.com>, <brewster@archive.org>, <lesk@acm.org>,
<chlaurant@epic.org>, <nancy.kranich@nyu.edu>, <jya@pipeline.com>,
<wild@eff.org>, <fran@truste.org>, <nk@jstor.org>, <ver@umich.edu>,
<jurban@law.berkeley.edu>, <rms@computerbytesman.com>,
<bob.page@accrue.com>, <latanya@cs.cmu.edu>, <jeff@ubois.com>,
<mlecky@sympatico.ca>, <wendy@seltzer.com>, <bsteinhardt@aclu.org>

Hi,

Attached are some notes on the upcoming meeting about web usage log
retention, which will be held April 1 from 9:00 a.m. to 5:00 p.m. in the
Kips Bay Room at the New Yorker Hotel, 481 8th Ave, in New York City.

I want to thank everyone for their willingness to contribute to this effort.
Thanks to the referrals provided by many of you, we now have attendees
and/or input from the Federal Trade Commission; the University of Michigan,
UC Berkeley and Carnegie-Mellon; AT&T; EFF; Accrue; Chilling Effects;
GartnerGroup; TrustE, the American Library Association; JSTOR; and the
Internet Archive.  We also have some prominent researchers from the security
and privacy community.

The agenda of the meeting is somewhat loosely structured, with a general
progression over the course of the day from law, to technology, to
recommendations.  If you want to add something to it or make a short
presentation, please let me know.

A number of people have made excellent suggestions regarding questions for
discussion, pre-conference reading, and possible solutions. A summary of
these is in the attachment following the agenda.

Based on discussions with several of you, I've pulled together some notes on
what a draft recommendation might include. This is very, very far from a
final product, but hopefully it can serve as a useful strawman that we can
critique and improve.

The best way to reach me is via email, but I am also generally
available by cell phone at 415 850 5431.  I would be happy to talk to anyone
in advance of this meeting who has something to suggest, and especially to
anyone who has some ideas about how we can best converge on a set of
recommendations. If there's anything anyone would like broadcast to
participants in advance of the meeting, please let me know.

I look forward to seeing you next week.

Best Regards, 
Jeff Ubois 

Attached to the last message were the meeting purpose, agenda, background, bibliography, topics for discussion and references:

http://cryptome.org/usage-logs.htm

An "interesting paper on deducing identity from IP addresses" paper was also distributed to the group beforehand

http://cryptome.org/trails1.pdf

Cryptome suggested that, while a usage log retention policy is being formulated, there should be an immediate public warning about the privacy threat posed by usage logs -- that logs are being subpoenaed and covertly surveilled by officials, and commercially exploited without publicity -- and that the group should announce that its study was commencing because of the urgency of the threat.

However, the group decided to make no announcement of its initiative at the conference; instead, an announcement of policy proposals and management tools will be made on July 4, 2003.

Cryptome has long warned that no web user privacy policy is reliable and that system administrators, including Cryptome's, should not be trusted, due to compromises made for economic, legal and political purposes. Users alone should determine what data about them, if any, is generated, collected, archived and manipulated, and this should include web usage logs.

Cryptome today invites public participation in formulating and promoting no-logging of web usage instead of retention management cloaked in unverifiable privacy policies --

1. To counter the long-promoted premise that usage logs must be kept for system administration.

2. To counter the present automatic logging of users and retention of user data.

3. To counter the premise that management of usage data retention should be the primary privacy goal rather than no logging at all.

4. To counter the premise that privacy policy is sufficient warning to and protection of users.

5. To advance the premise that users should be the only parties to approve logging of their visits prior to any form of data retention management.

6. To refute the concept that site operators and ISPs, rather than users themselves, should be the parties which protect Internet users by privacy policy and usage log retention management.

7. Describe the benefits of no-logging over privacy policies.

8. Describe the benefits of users' controlling what logs are created, retained and managed.

9. Descriptions of technical means for the user to prevent usage logging rather than rely on site operators' and ISPs' privacy assurances.

10. Means to deceive web logging programs used by sites which will not agree to no-logging.

11. Means for detecting covert logging taking place behind privacy policy.

Send to: jya@pipeline.com


From: "Stef Caunter" <stef@caunter.ca>
To: <jya@pipeline.com>
Subject: formulating and promoting no-logging 
Date: Tue, 1 Apr 2003 21:18:24 -0500

  JYA
  Interesting topic. My thoughts follow your numbering:

  1. To counter the long-promoted premise that usage logs must be kept for
system administration.

  The only necessary log for this is the error_log. It can be set to several
levels of detail about the webserver function, and about "500" category
server errors. All other logs simply record file requests, successful or
not, and client browser headers, and show nothing about the relative health
of the server. To the contrary, they force a data write for every file
requested, increasing the load on the box.

  2. To counter the present automatic logging of users and retention of user
data.

  Apache runs better and faster without automatic logging of user requests.
The default "high-performance" configuration shipped with version 2 provides
for zero logging of client requests. It is not necessary, and can be seen as
counter-productive.
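
  In Apache terms that amounts to keeping an ErrorLog and simply omitting
any access-log directive -- a minimal sketch, with placeholder paths and
LogLevel; the exact file layout varies by distribution:

# Log only server errors, at the chosen level of detail.
ErrorLog  logs/error_log
LogLevel  warn

# Deliberately absent -- with no CustomLog or TransferLog directive,
# Apache writes no per-request access log at all.
# CustomLog logs/access_log combined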

  3. To counter the premise that management of usage data retention should
be the primary privacy goal rather than no logging at all.

  We tend to retain far too much data in this business, just because it is
possible.

  4. To counter the premise that privacy policy is sufficient warning to and
protection of users.

  A view you have been championing, that these privacy policy documents are
irrelevant to law enforcement and data mining operations, is the only
intelligent approach. You can be tied to your IP.

  5. To advance the premise that users should be the only parties to approve
logging of their visits prior to any form of data retention management.

  One of my favourite demonstrations when I lecture on this subject is to
fire up a new install of IE and to dwell on the first popup warning, which
says, "You are about to send information over the internet. It might be
possible for others to view ..." and to make people understand that this is
true and they should not forget about it. The default "secure connection"
warning gives a false sense of security, as it makes no mention of the
ability to track down an IP address to an ISP-assigned customer's activities,
encrypted session or not.

  6. To refute the concept that site operators and ISPs, rather than users
themselves, should be the parties which protect Internet users by privacy
policy and usage log retention management.

  Site operators interested in traffic levels can run counters.  ISPs
notoriously ignore any concept of privacy; their email server logs are
readable by admins, and their web server logs chart advertiser success.

  7. Describe the benefits of no-logging over privacy policies.

  Absolute no-logging as a policy can be more secure. The absence of
information means that a compromise is less meaningful. The ability to read
server logs is one of the pleasures of a system cracker. The only logs
meaningful to system administration are the webserver error log and system
daemon message logging. Everything else is trivia, or navel gazing.

  8. Describe the benefits of users' controlling what logs are created,
retained and managed.

  9. Descriptions of technical means for the user to prevent usage logging
rather than rely on site operators' and ISPs' privacy assurances.

  10. Means to deceive web logging programs used by sites which will not
agree to no-logging.

  These three points force us to confront the request/response nature of the
HTTP protocol. To successfully communicate, you must self-identify. Proxying
itself is subject to the same logging problem. In a heavily ad and traffic
driven industry, the logfiles demonstrate success; this success can be
denoted anonymously in a webserver log, but not without technical expertise.
Logging is seen to be a "value-add" for hosting companies. The sense of this
could theoretically be reversed. Unless no-logging is seen to be a value-add
of its own by both surfers and sites, we will not see it put in practice.
Traffic can be measured by byte transfer and bandwidth usage as effectively;
measuring network usage is done by default on most UNIX systems; counting
and charting the number of webserver processes responding to client requests
is a very effective way of seeing how busy a dedicated webserver is at any
time; transaction based webservers are already counting sales and money, and
that is what matters to their owners. Extra server traffic can show up in
the network transfer data; larger server logs without higher sales mean
nothing, and are of no value anyway unless they are data mined by law
enforcement or demographers.
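
  That kind of measurement needs no request logging at all. For example, on
a Linux box running Apache workers named "httpd", something like the
following reads the interface byte counters and counts server processes
(a sketch only -- the interface name and process name are assumptions):

# Gauge server load without per-request logs: count running Apache
# workers and read interface byte counters from /proc/net/dev (Linux).
import subprocess

def httpd_process_count():
    ps = subprocess.run(["ps", "-C", "httpd", "-o", "pid="],
                        capture_output=True, text=True)
    return len(ps.stdout.split())

def bytes_transferred(interface="eth0"):
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(interface + ":"):
                fields = line.split(":", 1)[1].split()
                return int(fields[0]), int(fields[8])  # rx bytes, tx bytes
    raise ValueError("interface not found: " + interface)

if __name__ == "__main__":
    rx, tx = bytes_transferred()
    print("httpd workers :", httpd_process_count())
    print("bytes in/out  :", rx, tx)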

  11. Means for detecting covert logging taking place behind privacy policy.

  Trusted third party verification would be required, much like CAs. If a
site is running advertising it's fair to say it's logging and data mining.


Date: Tue, 01 Apr 2003 21:53:39 -0600
To: jya@pipeline.com
From: namebase@earthlink.net
Subject: Copy of my email to Jeff Ubois

Dear Jeff:

I have a comment on logging and such. There is a page
I put up that tries to address this issue, at:

http://www.google-watch.org/cgi-bin/urldemo.htm

Basically, I think you should approach Apache and ask them
to change the default configuration for logging so that it does
not include the QUERY_STRING. That's the portion after the
question mark in a CGI request. In the case of search engines,
this string contains the search terms. These get propagated all
over the world because they often end up as REFERER strings in 
other logs. Search terms are rather sensitive items of information.

It is already possible to configure Apache logging to strip out the
QUERY_STRING, but it takes a fair amount of effort and research
to pull it off, so no one does it. What I'm recommending is that
Apache change their default logging so that it's the other way
around -- it ought to require a lot of trouble and effort to make
the logs *include* the QUERY_STRING!

This one change would make a very big difference in the total
privacy picture.

Regards,
Daniel Brandt

---------------------------------------------------------------------
Public Information Research, PO Box 680635, San Antonio TX 78268-0635
Tel:210-509-3160   Fax:210-509-3161   Nonprofit publisher of NameBase
     http://www.namebase.org/              namebase@earthlink.net
---------------------------------------------------------------------
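
The change Brandt describes is already possible with Apache's mod_log_config: the default formats record the full request line (%r), query string included, while a format that records the method, path and protocol separately (%m %U %H) never writes the query string to the access log. A sketch of such a directive (the format nickname and log path are arbitrary):

LogFormat "%h %l %u %t \"%m %U %H\" %>s %b" noquery
CustomLog logs/access_log noquery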


Date: Thu, 03 Apr 2003 14:04:14 +0100
From: Ben Laurie <ben@algroup.co.uk>
To: John Young <jya@pipeline.com>
Cc: cypherpunks@lne.com, cryptography@wasabisystems.com
Subject: Re: Logging of Web Usage

John Young wrote:
> Ben,
> 
> Would you care to comment for publication on web logging 
> described in these two files:
> 
>   http://cryptome.org/no-logs.htm
> 
>   http://cryptome.org/usage-logs.htm
> 
> Cryptome invites comments from others who know the capabilities 
> of servers to log or not, and other means for protecting user privacy 
> by users themselves rather than by reliance upon privacy policies 
> of site operators and government regulation.
> 
> This relates to the data retention debate and current initiatives 
> of law enforcement to subpoena, surveil, steal and manipulate
> log data.

I don't have time right now to comment in detail (I will try to later), 
but it seems to me that, as someone else commented, relying on operators 
to not keep logs is really not the way to go. If you want privacy or 
anonymity, then you have to create it for yourself, not expect others to 
provide it for you.

Of course, it is possible to reduce your exposure to others whilst still 
taking advantage of privacy-enhancing services they offer. Two obvious 
examples of this are the mixmaster anonymous remailer network, and onion 
routing.

It seems to me if you want to make serious inroads into privacy w.r.t. 
logging of traffic, then what you want to put your energy into is onion 
routing. There is _still_ no deployable free software to do it, and that 
is ridiculous[1]. It seems to me that this is the single biggest win we 
can have against all sorts of privacy invasions.

Make log retention useless for any purpose other than statistics and 
maintenance. Don't try to make it only used for those purposes.

Cheers,

Ben.

[1] FWIW, I'd be willing to work on that, but not on my own (unless 
someone wants to keep me in the style to which I am accustomed, that is).

-- 

http://www.apache-ssl.org/ben.html       http://www.thebunker.net/

"There is no limit to what a man can do or how far he can go if he
doesn't mind who gets the credit." - Robert Woodruff


Date: Thu, 3 Apr 2003 12:09:41 -0600
From: Keith Ray <keith@nullify.org>
To: Cypherpunks <cypherpunks@lne.com>
Subject: Re: Logging of Web Usage

Quoting Ben Laurie <ben@algroup.co.uk>:

> It seems to me if you want to make serious inroads into privacy w.r.t.
> logging of traffic, then what you want to put your energy into is onion
> routing. There is _still_ no deployable free software to do it, and that
> is ridiculous[1]. It seems to me that this is the single biggest win we
> can have against all sorts of privacy invasions.

This sounds like an interesting project to work on.  It's hard to believe
that only the DoD has played with this technology.  Onion routing would
seem to have a much larger impact on personal privacy on the Internet than
projects like Freenet ever could.

After browsing through some of the descriptions of the system, it appears
to be a real-time remailer-type system for IP traffic.  A client proxy will
take the IP traffic, break it up into identically sized packets, and then
layer encrypt them starting with the last onion router to the first.  Each
router along the path would decrypt its layer and then forward the packet
to the next router.

The part that I am worried about is the liability of running an exit
router.  I ran a mixmaster remailer for over six months and found out first
hand the reaction of people to receiving anonymous death-threats, racial
slurs, and spam.  The saving grace was the opt-out list for people to
refuse to receive future anonymous messages.

However, with a real-time system that could encapsulate all IP traffic,
this could be used for anonymous hacking.  Even if you limit the exit
remailer's traffic to just port 80 and actual HTTP requests, there are
plenty of exploits and probes that require nothing more.  Thanks to the
PATRIOT act, those of us in the US can look forward to federal prosecution
with possible life sentences if the wrong system is hacked through a
router.  When the FBI comes knocking, I doubt they will be satisfied with
anonymous free speech arguments.

DoD's Onion Routing research project

http://www.onion-router.net/

--
Keith Ray <keith@nullify.org> -- OpenPGP Key: 0x79269A12

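The layering Ray describes can be illustrated in a few lines: the client encrypts for the exit router first and the entry router last, and each hop peels one layer. A conceptual sketch only, using symmetric keys assumed to be pre-shared with three hypothetical routers; a real onion-routing design also handles key exchange, padding to fixed-size cells and circuit setup, none of which is shown here.

# Conceptual sketch of nested "onion" layering with pre-shared keys.
from cryptography.fernet import Fernet

route = [Fernet(Fernet.generate_key()) for _ in range(3)]  # R1, R2, R3

def wrap(payload):
    """Client side: add layers from the exit router inward to the entry."""
    for layer in reversed(route):  # encrypt for R3 first, R1 last
        payload = layer.encrypt(payload)
    return payload

def relay(onion):
    """Each router in turn strips its own layer and forwards the rest."""
    for layer in route:            # R1 peels first, then R2, then R3
        onion = layer.decrypt(onion)
    return onion

if __name__ == "__main__":
    print(relay(wrap(b"GET / HTTP/1.0")))  # b'GET / HTTP/1.0'
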
Date: Wed, 2 Apr 2003 13:19:40 -0800 (PST)
From: Morlock Elloi <morlockelloi@yahoo.com>
Subject: Re: Logging of Web Usage
To: cypherpunks@lne.com

Frankly, it seems that some brains around here are softening. Relying on
httpd operators to protect those who access is plain silly, even if echelon
(funny how that word dropped below radar lately) did not exist.

The proper way is, of course, self-protection. Start with tight control of
outgoing info from the end-user machine (remove or fake all fields that are
not essential, such as referrer, client application, client OS). Use
proxies. If you own a multi-IP subnet randomly switch the originating IP -
this fucks up most automated tracking.

What doesn't exist is mixmaster-grade anon re-httpers. I guess that ones
that would let just text through (no images/scripting etc.) would be
repulsive enough for wide public and therefore useful.

Once you provide your data, it is always retained forever. Learn to live
with it.

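Client-side header minimization of the kind suggested is straightforward to script. The sketch below sends only a generic User-Agent and Accept header, no Referer, and optionally routes the request through a proxy; the proxy address is a placeholder for whatever the user trusts.

# Send a request with minimal identifying headers, optionally via a proxy.
import urllib.request

def minimal_request(url, proxy=None):
    handlers = []
    if proxy:
        handlers.append(urllib.request.ProxyHandler({"http": proxy,
                                                     "https": proxy}))
    opener = urllib.request.build_opener(*handlers)
    # Replace the default Python-urllib User-Agent; send nothing else
    # identifying -- in particular, no Referer header.
    opener.addheaders = [("User-Agent", "Mozilla/5.0"), ("Accept", "*/*")]
    with opener.open(url) as response:
        return response.read()

if __name__ == "__main__":
    print(len(minimal_request("http://example.com/")))
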
Date: Wed, 2 Apr 2003 13:24:58 -0800
To: John Young <jya@pipeline.com>, Ben Laurie <ben@algroup.co.uk>
From: Bill Frantz <frantz@pwpconsult.com>
Subject: Re: Logging of Web Usage
Cc: cypherpunks@lne.com, cryptography@wasabisystems.com

The http://cryptome.org/usage-logs.htm URL says:

>Low resolution data in most cases is intended to be sufficient for
>marketing analyses.  It may take the form of IP addresses that have been
>subjected to a one way hash, to refer URLs that exclude information other
>than the high level domain, or temporary cookies.

Note that since IPv4 addresses are 32 bits, anyone willing to dedicate a
computer for a few hours can reverse a one way hash by exhaustive search.
Truncating IPs seems a much more privacy friendly approach.

This problem would be less acute with IPv6 addresses.

Cheers - Bill

-------------------------------------------------------------------------
Bill Frantz           | Due process for all    | Periwinkle -- Consulting
(408)356-8506         | used to be the         | 16345 Englewood Ave.
frantz@pwpconsult.com | American way.          | Los Gatos, CA 95032, USA

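Frantz's point is easy to demonstrate: a bare hash of an IPv4 address can be reversed by exhaustive search, while truncation discards the host bits for good. A sketch follows; the search is confined to a single /16 so the demonstration finishes quickly, but the full 32-bit space is only about 4.3 billion hashes.

# Reverse a hashed IPv4 address by exhaustive search, versus truncation.
import hashlib
import ipaddress

def hash_ip(ip):
    return hashlib.sha1(ip.encode()).hexdigest()

def reverse_hash(target, prefix="192.168."):
    # Brute-force the last two octets of an assumed 192.168.0.0/16 block.
    for a in range(256):
        for b in range(256):
            candidate = "%s%d.%d" % (prefix, a, b)
            if hash_ip(candidate) == target:
                return candidate
    return None

def truncate_ip(ip):
    # Keep only the /24 network portion -- no longer tied to a single host.
    return str(ipaddress.ip_network(ip + "/24", strict=False).network_address)

if __name__ == "__main__":
    secret = "192.168.42.7"
    print("recovered:", reverse_hash(hash_ip(secret)))  # 192.168.42.7
    print("truncated:", truncate_ip(secret))            # 192.168.42.0
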
Date: Thu, 3 Apr 2003 01:05:27 +0200 (CEST)
From: Thomas Shaddack <shaddack@ns.arachne.cz>
To: Morlock Elloi <morlockelloi@yahoo.com>
cc: <cypherpunks@lne.com>
Subject: Re: Logging of Web Usage

> Relying on httpd operators to protect those who access is plain silly,
> even if echelon (funny how that word dropped below radar lately) did
> not exist.

Echelon could be grouped together with Carnivore and CALEA devices into the
group of Generic Transport-level Eavesdroppers. No need to consider it
separately, at least for technological purposes. (...am I right?)

> What doesn't exist is mixmaster-grade anon re-httpers. I guess that ones
> that would let just text through (no images/scripting etc.) would be
> repulsive enough for wide public and therefore useful.

Could it be constructed as e.g. a FreeNet extension? Piggybacking on an
existing system is easier than rolling out a whole new thing.

> Once you provide your data, it is always retained forever. Learn to
> live with it.

What worries me a LOT is Google (and search engines in general). Very
useful tool, and way too attractive to profile people by their search
queries.

Date: Wed, 2 Apr 2003 18:16:18 -0800
From: Seth David Schoen <schoen@loyalty.org>
To: cypherpunks@lne.com, cryptography@wasabisystems.com
Subject: Re: Logging of Web Usage

Bill Frantz writes:

> The http://cryptome.org/usage-logs.htm URL says:
>
> >Low resolution data in most cases is intended to be sufficient for
> >marketing analyses.  It may take the form of IP addresses that have been
> >subjected to a one way hash, to refer URLs that exclude information other
> >than the high level domain, or temporary cookies.
>
> Note that since IPv4 addresses are 32 bits, anyone willing to dedicate a
> computer for a few hours can reverse a one way hash by exhaustive search.
> Truncating IPs seems a much more privacy friendly approach.
>
> This problem would be less acute with IPv6 addresses.

I'm skeptical that it will even take "a few hours"; on a 1.5 GHz desktop
machine, using "openssl speed", I see about a million hash operations per
second.  (It depends slightly on which hash you choose.)  This is without
compiling OpenSSL with processor-specific optimizations.

That would imply a mean time to reverse the hash of about 2100 seconds,
which we could probably improve with processor-specific optimizations or by
buying a more recent machine.  What's more, we can exclude from our search
parts of the IP address space which haven't been allocated, and optimize
the search by beginning with IP networks which are more likely to be the
source of hits based on prior statistical evidence.  Even without _any_ of
these improvements, it's just about 35 minutes on average.

I used to advocate one-way hashing for logs, but a 35-minute search on an
ordinary desktop PC is not much obstacle.  It might still be helpful if you
used a keyed hash and then threw away the key after a short time period
(perhaps every 6 hours).  Then you can't identify or link visitors across
6-hour periods.  If the key is very long, reversing the hash could become
very hard.

The logging problem will depend on what server operators are trying to
accomplish.  Some people just want to try to count unique visitors;
strangely enough, they might get more privacy-protective (and comparably
precise) results by issuing short-lived cookies.

--
Seth David Schoen <schoen@loyalty.org> | Very frankly, I am opposed to people
     http://www.loyalty.org/~schoen/   | being programmed by others.
     http://vitanuova.loyalty.org/     |     -- Fred Rogers (1928-2003),
                                       |        464 U.S. 417, 445 (1984)

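Schoen's keyed-hash suggestion might look like the following sketch: HMAC the client address under a key that is regenerated, with the old key discarded, every six hours, so that exhaustive search without the key is useless and entries cannot be linked across key periods. The rotation interval and output length here are assumptions for illustration.

# Pseudonymize client IPs with a keyed hash under a periodically
# rotated (and then discarded) key.
import hashlib
import hmac
import os
import time

ROTATION_SECONDS = 6 * 60 * 60
_key = os.urandom(32)
_key_created = time.time()

def pseudonymize(ip):
    """Return a per-period pseudonym for an IP address."""
    global _key, _key_created
    if time.time() - _key_created > ROTATION_SECONDS:
        _key = os.urandom(32)        # the old key is simply dropped
        _key_created = time.time()
    return hmac.new(_key, ip.encode(), hashlib.sha256).hexdigest()[:16]

if __name__ == "__main__":
    print(pseudonymize("10.0.0.1"))  # stable only within a six-hour window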