Working Paper on Usage Log Data Management

Following are some notes on the upcoming meeting about web usage log retention, which will be held April 1 from 9:00 a.m. to 5:00 p.m. in the Kips Bay Room at the New Yorker Hotel, 481 8th Ave, in New York City.

I want to thank everyone for their willingness to contribute to this effort. Thanks to the referrals provided by many of you, we now have attendees and/or input from the Federal Trade Commission; the University of Michigan, UC Berkeley, and Carnegie Mellon; AT&T; EFF; Accrue; Chilling Effects; GartnerGroup; TrustE; the American Library Association; JSTOR; and the Internet Archive. We also have some prominent researchers from the security and privacy community.

The agenda of the meeting is somewhat loosely structured, with a general progression over the course of the day from law, to technology, to recommendations. If you want to add something to it or make a short presentation, please let me know.

A number of people have made excellent suggestions regarding questions for discussion, pre-conference reading, and possible solutions. A summary of these follows the agenda.

Based on discussions with several of you, I’ve pulled together some notes on what a draft recommendation might include. This is very, very far from a final product, but hopefully it can serve as a useful strawman that we can critique and improve.

The best way to reach me is via email, but I am also generally available by cell phone at 415 850 5431. I would be happy to talk in advance of this meeting with anyone who has something to suggest, and especially with anyone who has ideas about how we can best converge on a set of recommendations. If there’s anything anyone would like broadcast to participants in advance of the meeting, please let me know.

I look forward to seeing you next week.

Best Regards,

Jeff Ubois

Agenda

9 – 9.30 Registration opens. Informal pre-meeting discussion.

9.30 – 11 Law & policy background

This discussion will include some basic definitions of logging terms, applicable laws, and prior art, i.e. policies in use now. We will also hear comments from the FTC, Chilling Effects, and Richard Smith.

11 – 12.30 Technology background

How much data do logging tools gather? How often can this data become personally identifiable information?

Comments and possible demonstrations by privacy researchers, including Latanya Sweeney from CMU.

12.30 – 1.30 Lunch

1.30 – 3.00 Scenarios and possible solutions

This discussion will focus on the scenarios related to different types of logging and retention practices, and recommended responses for different types of sites, particularly sites that simply publish or provide information, gather user data for other purposes such as academic registration, or conduct financial transactions.

3.00 – 4.30 Recommendations

Based on the legal and technical discussions, and on the scenarios developed in the previous section, we will attempt to converge on guidelines and recommendations.

4.30 – 5 Areas for future research

In this session we will list the unresolved issues, and identify volunteers who wish to investigate potential solutions.

Questions for discussion, ideas for consideration, and references

(thanks especially to Virginia Rezmierski’s LAMP study and Richard Smith)

Exactly what is being logged?

How much logging is needed to secure and manage systems?

What actions are being taken to expand or make logging more comprehensive?

How many steps must be taken to get from the logs to an individual’s identification?

Which laws are applicable in this context?

Is it practical to use a random one-way hash to hide real IP addresses in log files?

What are some appropriate rules for cleaning URLs to remove personal information?

What are some ways to get log file policies implemented in popular Web server software like Apache, IIS, and SunOne?

How will these kinds of changes affect programs which post process log files?

What are current practices for log file retention?

How common is it today for log files to be used in criminal investigations?

Suggested approaches

From a site owner concerned with this issue:

Before providing the usage path information, we "strip" from the URLs a portion of your IP address and the information that appears after the "?," with limited exceptions for certain types of search terms, leading "search engine" websites, and many e-commerce websites. Before providing the shopping path information to any third party, we strip a portion of your IP address from the data string.

This is similar to Netscape’s method…we added exceptions for certain eCommerce and search sites. It's pretty simple. For all sites:

- Log only http requests, not https, ftp, etc.

- Do not log http requests from domains known to be intranets etc. as identified by the IE "zone" or IP address.

- Truncate the URL at the ? if the URL contains any of the following: user=, userid=, username=, pass=, pw=, passwd=, password=, email=, e-mail=

- Strip logins of the form http://username:password@hostname.domain:portnumber/

- Change the last 3 digits of the IP address to 666.

For all sites except selected search engines and selected e-commerce sites:

- Truncate URLs at the ? if they have one.

- Truncate referrers at the ? if they have one.
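For discussion purposes, the rules above might be sketched roughly as follows. This is only an illustration under stated assumptions, not the site owner's actual code: the function names are hypothetical, "change the last 3 digits" is read here as replacing the final dotted field of an IPv4 address, the `exempt` flag stands in for the "selected search engines and selected e-commerce sites" list, and the intranet/zone rule is left out because it depends on client-side information.

```python
from urllib.parse import urlsplit, urlunsplit

# Parameter names copied from the rule list above.
SENSITIVE_KEYS = ("user=", "userid=", "username=", "pass=", "pw=",
                  "passwd=", "password=", "email=", "e-mail=")


def is_loggable(url: str) -> bool:
    """Log only http requests, not https, ftp, etc."""
    return urlsplit(url).scheme == "http"


def strip_login(url: str) -> str:
    """Drop a username:password@ prefix from the host part of a URL."""
    parts = urlsplit(url)
    if "@" in parts.netloc:
        parts = parts._replace(netloc=parts.netloc.rsplit("@", 1)[1])
    return urlunsplit(parts)


def scrub_url(url: str, exempt: bool = False) -> str:
    """Apply the truncation rules to a requested URL or a referrer.

    exempt=True marks a site on the (hypothetical) exception list; its
    query string is kept unless it contains one of SENSITIVE_KEYS.
    """
    url = strip_login(url)
    if "?" not in url:
        return url
    base, query = url.split("?", 1)
    if not exempt or any(key in query.lower() for key in SENSITIVE_KEYS):
        return base
    return url


def mask_ip(ip: str) -> str:
    """Replace the last field of a dotted IPv4 address with 666 (assumed reading)."""
    head, _, _ = ip.rpartition(".")
    return head + ".666"


print(scrub_url("http://username:password@hostname.domain:8080/a?x=1"))
print(mask_ip("192.168.10.42"))
```

Since 666 is not a valid value for an IPv4 field, a masked address is visibly distinguishable from a real one, which may matter to programs that post-process the logs.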

From Richard Smith:

To: 'Jeff Ubois' jeff@ubois.com
From: Richard M. Smith rms@computerbytesman.com
Date: Mon, 24 Mar 2003 08:37:23 -0500
RE: Usage Log Meeting -- examples of IP addresses and other log data being personally identifiable?

One way that personal data ends up in log files is thru forms that use the GET method. Here's something I wrote awhile back about personal data that DoubleClick was getting about me as I surfed the web:

http://www.computerbytesman.com/privacy/banads.htm

DoubleClick said they solved the problem by deleting query strings in referring URLs before writing referring URLs to a log file.

Alexa had a similar issue, but with regular URLs: http://www.computerbytesman.com/privacy/alexa.htm

As a general rule, the GET method should never be used with forms that contain personal data. Email newsletter sign-ups is primarily where the GET method is misused.

Another experiment that might be interesting to run is to use Google to search for online log files to see how often they contain email addresses, etc. Web sites sometimes leave open their log files and Google is able to find them.

Richard
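As an editorial illustration of the point in the note above (not part of Richard Smith's message), the following sketch shows how a form submitted with the GET method puts its fields into the request line that a server would log, and how a simple scan can find email addresses in log lines. The path, form field, and address are made up for the example, and the regular expression is a rough approximation of an email address rather than a complete one.

```python
import re
from urllib.parse import unquote, urlencode

# Rough pattern for email addresses; an assumption, not a full RFC grammar.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def get_request_line(path: str, form: dict) -> str:
    """Build the request line that a GET form submission would produce."""
    return f"GET {path}?{urlencode(form)} HTTP/1.0"


def emails_in_log_line(line: str) -> list:
    """Return email addresses visible in a log line after URL-decoding."""
    return EMAIL_RE.findall(unquote(line))


# Hypothetical newsletter sign-up form submitted with GET.
line = get_request_line("/newsletter/signup", {"email": "reader@example.com"})
print(line)
print(emails_in_log_line(line))
```

The same form submitted with POST would carry the address in the request body, which standard access logs do not record.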

Draft Recommendations for Managing Web Usage Logs

The log files generated by web servers contain data that is important to site owners, but that may threaten the privacy of individuals and present legal risks to corporations.

This document is intended to provide recommendations regarding the management of IP addresses, referrer logs, and persistent cookie data stored in these usage logs. It also provides some general guidelines regarding default configurations of web servers.

Because the requirements of different web site owners vary widely, the recommendations here take the form of a series of scenarios that reflect these diverse needs for log collection, processing, and retention.

As noted in the NSF's LAMP report, the default configurations of logging tools are used by the vast majority of web site owners. Therefore, changing the default server configurations to reflect this policy will help enable wide adoption. At the same time, there is a need to develop utilities and web server plug-ins that implement aspects of the processing functions, such as hashing and aggregation.

This document applies the terms ‘high resolution’ data and ‘low resolution’ data to distinguish between usage log data that may possibly be processed into Personally Identifiable Information, and data that has been hashed or aggregated such that there is very low risk that it might be processed into Personally Identifiable Information.

High resolution data is important for site owners that want to maintain their security, find the causes of system failures, conduct monetary transactions, or otherwise establish a persistent, unique connection with their users. High resolution data may include static IP addresses, referrer logs that include lengthy URLs with form data, or persistent cookies.

Low resolution data in most cases is intended to be sufficient for marketing analyses. It may take the form of IP addresses that have been subjected to a one-way hash, referrer URLs that exclude information other than the high-level domain, or temporary cookies.
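A minimal sketch of two of these low resolution forms follows, offered as a starting point for critique rather than a recommendation. It assumes a keyed hash (HMAC-SHA256 with a random secret) in place of a plain hash, because the IPv4 address space is small enough that an unkeyed hash could be reversed by hashing every possible address. The function names, the 16-character truncation, and the choice to keep the full host name of the referrer are all assumptions.

```python
import hashlib
import hmac
import secrets
from urllib.parse import urlsplit

# Random secret, generated once per log period (assumption). Discarding it
# afterwards makes the hashes from that period impossible to recompute.
SECRET_KEY = secrets.token_bytes(32)


def hash_ip(ip: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed one-way hash of an IP address, truncated for log use."""
    digest = hmac.new(key, ip.encode("ascii"), hashlib.sha256).hexdigest()
    return digest[:16]


def referrer_domain(referrer: str) -> str:
    """Keep only the host name of a referrer URL; drop path and query."""
    return urlsplit(referrer).hostname or ""


print(hash_ip("203.0.113.7"))
print(referrer_domain("http://www.example.com/search?q=private+terms"))
```

Within one key period the same address always maps to the same token, so visit counts per visitor remain possible; across periods the tokens cannot be linked.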

Many web sites fall into one of two categories: sites that handle financial and other transactions (and thus have a higher need for logging), and sites that primarily exist to publish data. Following are two scenarios that reflect these needs with respect to the collection, processing and retention of high and low resolution data, including IP addresses, referrer logs, and cookies.

Publishers

Data Type	Collection	Processing	Retention
High Resolution	Collection is primarily for the purpose of ensuring security, fixing failures, and understanding how the site is used.	Logged data is generally processed into low resolution data as soon as possible.	High resolution data is generally not retained for more than seven days.
Low Resolution	Low resolution data may result from limiting initial collection, or by processing high resolution data. Unique IPs are not collected, or are hashed immediately. For referrer logs, no data after a question mark is collected. Site owners avoid the use of the GET method. Persistent tracking cookies are not used.	Processing is done primarily to understand marketing, user interface, and audience issues. IP addresses may be aggregated using a 'many to one' approach (e.g. every other address is thrown away), or hashed with a one-way function.	Left entirely to the discretion of the web site owner.
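One possible reading of the 'many to one' approach is to clear the low-order bits of each address so that several addresses collapse into one. The sketch below uses Python's standard ipaddress module; the function names and the default prefix length of 24 are assumptions for illustration. A prefix of 31 merges each pair of adjacent addresses, which is close to the "every other address is thrown away" example.

```python
import ipaddress
from collections import Counter


def aggregate_ip(ip: str, prefix: int = 24) -> str:
    """Map an IPv4 address to the first address of its /prefix network."""
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


def visits_by_network(ips, prefix: int = 24) -> Counter:
    """Count requests per aggregated address instead of per visitor address."""
    return Counter(aggregate_ip(ip, prefix) for ip in ips)


print(aggregate_ip("192.0.2.77"))        # whole /24 collapses to one address
print(aggregate_ip("192.0.2.77", 31))    # pairs of adjacent addresses collapse
print(visits_by_network(["192.0.2.1", "192.0.2.200", "198.51.100.5"]))
```

The shorter the prefix, the more addresses share one value and the lower the resolution of the resulting data.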

Transaction sites

Data Type	Collection	Processing	Retention
High Resolution	Collection is primarily for the purpose of ensuring security, fixing failures, understanding how the site is used, and providing an audit trail for companies that sell impressions.	This data is generally processed into low resolution data as soon as possible. Personally identifiable information may be forwarded to other systems.	Time limitations on retention may be related to transactions (e.g. data is retained until 30 days after a transaction and any associated returns policy has expired).
Low Resolution	Low resolution data may result from limiting initial collection, or by processing high resolution data. Unique IPs are not collected, or are hashed immediately. For referrer logs, no data after a question mark is collected. Site owners avoid the use of the GET method.	Processing is done primarily to understand marketing, user interface, and audience issues. IP addresses may be aggregated using a 'many to one' approach (e.g. every other address is thrown away), or hashed with a one-way function.	Left entirely to the discretion of the web site owner.
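The retention limits in the two scenarios could be expressed as simple date arithmetic, sketched below. The function names are hypothetical, and the transaction rule is read here as "30 days after the returns period ends", which is one interpretation of the wording above and is open to discussion.

```python
from datetime import date, timedelta


def publisher_purge_date(collected: date, max_age_days: int = 7) -> date:
    """Publishers: high resolution data is kept no more than seven days."""
    return collected + timedelta(days=max_age_days)


def transaction_purge_date(transaction: date, returns_period_days: int,
                           grace_days: int = 30) -> date:
    """Transaction sites: keep until grace_days after the returns period ends."""
    return transaction + timedelta(days=returns_period_days + grace_days)


def should_purge(purge_date: date, today: date) -> bool:
    """True once the purge date has passed."""
    return today > purge_date


print(publisher_purge_date(date(2003, 4, 1)))
print(transaction_purge_date(date(2003, 4, 1), returns_period_days=14))
```

A scheduled job could call should_purge for each high resolution record and either delete it or convert it to low resolution form.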

References & recommended reading

The Logging and Monitoring Privacy (LAMP) project. A must read.

http://www.aacrao.com/publications/catalog/NSF-LAMP.pdf

EPIC’s page on data retention:

http://www.epic.org/privacy/intl/data_retention.html

EPIC’s letter to university presidents about logging and monitoring.

http://www.epic.org/privacy/student/p2pletter.html

Search Requests, an example of a weblog built from odd search requests:

http://searchrequests.weblogs.com/

Judge orders Verizon to identify P2P user

http://www.nwfusion.com/newsletters/fileshare/2003/0127p2p1.html

District Court Rules DMCA Subpoenas Available for P2P Infringers

http://www.techlawjournal.com/topstories/2003/20030121.asp

DMCA, Section 512

http://www4.law.cornell.edu/uscode/17/512.html
