UC Berkeley Home IP Web Traces
Source: Zenodo (updated 2020-09-09, indexed 2026-05-25)
Download link:
https://zenodo.org/record/3749535
Resource description:
<strong>Description</strong> This dataset consists of 18 days' worth of HTTP traces gathered from the Home IP service offered by UC Berkeley to its students, faculty, and staff. Home IP provides dial-up PPP/SLIP IP connectivity using 2.4 kb/s, 9.6 kb/s, 14.4 kb/s, or 28.8 kb/s wireline modems, or Metricom Ricochet (approximately 20-30 kb/s) wireless modems. These client traces were gathered unobtrusively by a packet sniffing machine placed at the head-end of the Home IP modem bank; the tracing program was a custom module written on top of the Internet Protocol Scanning Engine (IPSE) created by Ian Goldberg. Only traffic destined for port 80 was traced; all non-HTTP protocols and HTTP connections to other ports were excluded from these traces.
The traces contain a total of <strong>9,244,728</strong> references spanning the period from <strong>Friday, November 1st, 1996 at 15:18:59 PST</strong> through <strong>Tuesday, November 19th, 1996 at 05:52:03 PST</strong>. 8,377 unique clients were seen in the traces.
Each trace entry contains the following information:
- the time at which the client made the request
- the time at which the first byte of the server response was seen
- the time at which the last byte of the server response was seen
- the client IP address (suitably anonymized)
- the client port
- the server IP address (suitably anonymized)
- the server port (always 80 for these traces)
- the presence of the <code>no-cache</code>, <code>keep-alive</code>, <code>cache-control</code>, <code>if-modified-since</code>, and <code>unless</code> client headers
- the presence of the <code>no-cache</code>, <code>cache-control</code>, <code>expires</code>, and <code>last-modified</code> server headers
- the values of the client <code>if-modified-since</code>, the server <code>expires</code>, and the server <code>last-modified</code> headers, if present
- the length of the response HTTP header
- the length of the response data
- the request URL (suitably anonymized)
<strong>Format</strong> For the sake of storage efficiency, the (gzipped) traces are stored in a binary representation. The archive of tools includes the following code to parse and manipulate the traces:
- <strong>showtrace</strong>: this program prints a human-readable ASCII representation of what is in the traces. To use it, type: <code>gzcat <tracefile> | showtrace</code> Take a look at the source file <code>showtrace.c</code> to see how you can use <code>logparse.[ch]</code> to write code that parses and manipulates the traces. All times displayed are as reported by the <code>gettimeofday()</code> system call.
- <strong>anon_clients</strong>: this is the program that we used to anonymize the traces. I include this program under the principle that the anonymization used is strong enough that distributing the anonymization code cannot help anybody break the anonymization.
- <strong>timeconvert</strong>: a program that accepts a calendar time (i.e. time in seconds since the Epoch, as reported by showtrace and as used in the trace filenames) and outputs the local time corresponding to that calendar time.
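For readers without the original tools, the conversion that timeconvert is described as performing can be approximated in a few lines of Python. This is a minimal sketch, not the authors' program: the function name `epoch_to_pst` is hypothetical, and it assumes a fixed UTC-8 offset, which holds for the November 1996 trace period because it falls entirely within Pacific Standard Time.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the whole trace period is in Pacific Standard Time (UTC-8),
# so a fixed offset is used instead of a time zone database lookup.
PST = timezone(timedelta(hours=-8), "PST")


def epoch_to_pst(seconds: int) -> datetime:
    """Convert seconds since the Epoch to a PST wall-clock time."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).astimezone(PST)


# 846890339 is the Epoch value of the documented start of the traces.
start = epoch_to_pst(846890339)
print(start.strftime("%A, %Y-%m-%d %H:%M:%S %Z"))
```

The sample value corresponds to Friday, November 1st, 1996 at 15:18:59 PST, the start of the tracing period given in the description.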
The <strong>showtrace</strong> tool will display lines in the following format: <pre>848278028:829593 848278028:893670 848278028:895350 23.240.8.98:1462 207.36.205.194:80 2 8 4294967295 4294967295 835418853 170 844 37 GET 9168504434183313441..gif HTTP/1.0 </pre>
The fields of this example line are, in order:
- 848278028:829593 is the time at which the client made the request
- 848278028:893670 is the time at which the first byte of the server response was seen
- 848278028:895350 is the time at which the last byte of the server response was seen
- 23.240.8.98:1462 is the anonymized client IP address and the client port number
- 207.36.205.194:80 is the anonymized server IP address and the server port number
- 2 is the decimal representation of the client headers bitfield
- 8 is the decimal representation of the server headers bitfield
- the first 4294967295 is the if-modified-since client header value (note that 4294967295 is 0xFFFFFFFF, which means this header value was not present for this entry)
- the second 4294967295 is the expires server header value (again not present)
- 835418853 is the last-modified server header value
- 170 is the length of the HTTP response header
- 844 is the length of the response data
- 37 is the length of the anonymized request URL
- "GET 9168504434183313441..gif HTTP/1.0" is the anonymized request URL
The interpretation of the client and server header bitfields is as defined in the <strong>logparse.h</strong> header in the tools code.
The tools code has been tested on both Linux and Solaris. The provided Makefile assumes Solaris - you may have to adjust the LIBS definition for other platforms. HPUX is a mess; I didn't even try, but it should be possible to get these tools to work with little effort. If you do, please let me know what you did so that I can make your changes available to the world.
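The field layout above can be read back into a structured record with a short Python sketch. The class and function names (`TraceEntry`, `parse_showtrace_line`) are hypothetical and not part of the distributed tools. The sketch assumes fields are whitespace-separated, that the request URL is everything after the thirteenth field, and that an unknown timestamp carries the 0xFFFFFFFF sentinel in its seconds part. The header bitfields are kept as plain integers because their bit assignments live in <code>logparse.h</code>.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

ABSENT = 0xFFFFFFFF  # printed by showtrace as 4294967295
Timestamp = Tuple[int, int]  # (seconds since the Epoch, microseconds)


@dataclass
class TraceEntry:
    request_time: Optional[Timestamp]
    first_byte_time: Optional[Timestamp]
    last_byte_time: Optional[Timestamp]
    client: str            # anonymized "ip:port"
    server: str            # anonymized "ip:port"
    client_flags: int      # bit meanings are defined in logparse.h
    server_flags: int
    if_modified_since: Optional[int]
    expires: Optional[int]
    last_modified: Optional[int]
    header_len: int
    data_len: int
    url_len: int
    request: str


def parse_time(field: str) -> Optional[Timestamp]:
    seconds, micros = (int(part) for part in field.split(":"))
    # Assumption: the sentinel appears in the seconds part of the field.
    return None if seconds == ABSENT else (seconds, micros)


def parse_optional(field: str) -> Optional[int]:
    value = int(field)
    return None if value == ABSENT else value


def parse_showtrace_line(line: str) -> TraceEntry:
    # Thirteen whitespace-separated fields, then the request URL.
    parts = line.strip().split(None, 13)
    if len(parts) != 14:
        raise ValueError(f"expected 14 fields, got {len(parts)}")
    return TraceEntry(
        request_time=parse_time(parts[0]),
        first_byte_time=parse_time(parts[1]),
        last_byte_time=parse_time(parts[2]),
        client=parts[3],
        server=parts[4],
        client_flags=int(parts[5]),
        server_flags=int(parts[6]),
        if_modified_since=parse_optional(parts[7]),
        expires=parse_optional(parts[8]),
        last_modified=parse_optional(parts[9]),
        header_len=int(parts[10]),
        data_len=int(parts[11]),
        url_len=int(parts[12]),
        request=parts[13],
    )


SAMPLE = (
    "848278028:829593 848278028:893670 848278028:895350 "
    "23.240.8.98:1462 207.36.205.194:80 2 8 4294967295 4294967295 "
    "835418853 170 844 37 GET 9168504434183313441..gif HTTP/1.0"
)
entry = parse_showtrace_line(SAMPLE)
print(entry.last_modified, entry.expires, len(entry.request))
```

For the sample line, the expires value parses as absent and the request string is 37 characters long, which matches the recorded URL length.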
<strong>Measurement</strong> The Home IP population gains IP connectivity using PPP or SLIP across their 2.4 kb/s, 9.6 kb/s, 14.4 kb/s, or 28.8 kb/s wireline modem, or their (approximately) 20-30 kb/s wireless Metricom Ricochet modem. There are a total of roughly 600 modems available via the Home IP bank. All traffic from these modems feeds over a single 10 Mb/s shared Ethernet segment, on which we placed a network monitoring computer (a 200 MHz Pentium Pro running Linux 2.0.27). The monitor was running the IPSE user-level packet scanning engine and a custom-written HTTP module that reconstructed HTTP connections from the gathered IP packets <em>on-the-fly</em> and emitted an unanonymized trace file. Each trace file was then anonymized and transmitted to our research workstations for further postprocessing and analysis.
The trace gathering engine was brought down and restarted approximately every 4 hours (for administrative and address-space-growth reasons). As a result, there are two weaknesses in these traces that you should be aware of:
- any connection active when the engine was brought down will have a possibly incorrect timestamp for the last byte seen from the server, and a possibly incorrect reported size. We estimate that no more than 150 such entries (out of roughly 90000-100000) are misreported for each 4 hour period.
- any connection that was established in the very small time window (about 300 milliseconds) between when the engine was shut down and restarted will not appear in the logs. We estimate that no more than 30 such drops occur for each 4 hour period.
The packet capture tool reported no packet drops. Considering that a 200 MHz Pentium Pro was used to capture the traces on a 10 Mb/s Ethernet segment, it is virtually certain that no trace drops besides those mentioned above occurred. There may be periods of uncharacteristically low activity in the traces - these correspond to network outages from Berkeley's ISP, rather than trace failures.
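As a rough consistency check on the per-period figures quoted above, the following Python sketch divides the total reference count by the number of 4-hour periods in the stated trace window. It assumes restarts were exactly 4 hours apart and ignores outages, so the result is only an order-of-magnitude check, not a figure from the authors.

```python
from datetime import datetime

TOTAL_REFERENCES = 9_244_728
start = datetime(1996, 11, 1, 15, 18, 59)   # first reference (PST)
end = datetime(1996, 11, 19, 5, 52, 3)      # last reference (PST)

# Assumption: one restart every 4 hours, evenly spaced.
periods = (end - start).total_seconds() / (4 * 3600)
per_period = TOTAL_REFERENCES / periods
print(f"{periods:.1f} periods, about {per_period:,.0f} references each")

# Upper bounds on affected entries, relative to a 90,000-entry period.
for label, bound in (("misreported", 150), ("dropped", 30)):
    print(f"{label}: at most {bound / 90_000:.3%} of a period")
```

This gives about 106 periods and roughly 87,500 references per period, close to the quoted 90000-100000. Against a 90,000-entry period, the bounds of 150 and 30 correspond to at most about 0.17% misreported and 0.03% dropped entries.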
The traces do contain entries for requests that were issued by the client but not completed (because, for instance, the user pressed the STOP button and the TCP connection was shut down before the request completed). Unknown timestamps in the traces contain the value 0xFFFFFFFF (reported by showtrace as 4294967295), and incomplete requests contain header and data length values that report as much of the header and data as was seen. The trace data is sorted by completion time (i.e. the time at which the last byte of the server response was seen, or the time at which the connection was dropped). However, because of inaccuracies and apparent time travel in the Linux system clock, some trace entries appear slightly out of order. All timestamps within the traces are as reported by the gettimeofday() system call, so these timestamps ostensibly have microsecond resolution.
<strong>Privacy</strong> To maintain the privacy of each individual Home IP user, we have stripped identity information out of the traces through a post-processing phase. Because it is trivial to identify a user based solely on the pages that the user has visited, we were forced to anonymize the URL and destination IP address of each web request as well as the source IP address. All anonymization was done using a keyed MD5 hash of the data (32 bits for client and server IP addresses, 64 bits for URLs). <strong>We ourselves do not know the key used to salt the MD5 hash</strong>, so don't bother asking us for it. Similarly, don't bother asking us for unanonymized traces.
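To make the idea of a truncated keyed hash concrete, here is a minimal Python sketch. It is not the <strong>anon_clients</strong> program: the text above does not say how the key is combined with the data or how the result is serialized, so prepending the key, rendering the 32-bit result as a dotted quad, and rendering the 64-bit result as hex are all assumptions made for illustration. The function names are hypothetical.

```python
import hashlib
import os


def keyed_md5(key: bytes, data: bytes, nbytes: int) -> bytes:
    """Return the first nbytes of MD5(key || data).

    Assumption: prepending the key is a guess at the construction.
    """
    return hashlib.md5(key + data).digest()[:nbytes]


def anonymize_ip(key: bytes, ip: str) -> str:
    """Map a dotted-quad address to another dotted quad via a 32-bit hash."""
    packed = bytes(int(octet) for octet in ip.split("."))
    return ".".join(str(b) for b in keyed_md5(key, packed, 4))


def anonymize_url(key: bytes, url: str) -> str:
    """Map a URL to a 64-bit hash, shown here as 16 hex digits."""
    return keyed_md5(key, url.encode("latin-1"), 8).hex()


# A throwaway random key that is never stored, mirroring the statement
# that the key used for the published traces is not known to anyone.
key = os.urandom(16)
print(anonymize_ip(key, "192.0.2.1"))
print(anonymize_url(key, "/image.gif"))
```

Because the same key is used throughout a run, equal inputs map to equal outputs, so request patterns per client and per URL remain analyzable while the original values are not recoverable without the key.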
In order to preserve <em>some</em> information about the URLs, the post-processed URLs have the following format: <strong><code>COMMAND URLHASH.[flags][.suffix] [HTTPVERS]</code></strong> where:
- <strong><code>COMMAND</code></strong> is one of <code>GET</code>, <code>HEAD</code>, <code>POST</code>, or <code>PUT</code>,
- <strong><code>URLHASH</code></strong> is the string representation of the 64-bit MD5 hash of the URL,
- <strong><code>flags</code></strong> contains the character <strong>q</strong> to indicate that a question mark was seen in the URL, and the character <strong>c</strong> to indicate that the string <strong>CGI</strong> or <strong>cgi</strong> was seen in the URL,
- <strong><code>suffix</code></strong> is the filename suffix, if present, and
- <strong><code>HTTPVERS</code></strong> is the HTTP version field of the HTTP command issued by the client, and is one of HTTP/1.0, HTTP/1.1, or the NULL string (indicating HTTP/0.9). To our knowledge, however, no HTTP/1.1 requests were observed during the tracing period.
Here are some examples of URLs contained in the traces:
- <strong><code>GET 8252631242092696791.q.map HTTP/1.0</code></strong> - the client issued a GET request, the URL contained a question mark, the URL ended in the suffix .map, and HTTP/1.0 was used by the client. An example of a request that may generate this anonymized URL is <code>GET /foo.map?BAR=BAZ HTTP/1.0</code>.
- <strong><code>POST 36782605103285618862.c HTTP/1.0</code></strong> - the client issued a POST, the URL contained the substring CGI or cgi, the URL did not end with a dotted suffix, and HTTP/1.0 was used by the client. An example of a request that may generate this anonymized URL is <code>POST /cgi-bin/foo HTTP/1.0</code>.
- <strong><code>GET 103551731373256697..gif HTTP/1.0</code></strong> - the client issued a GET request, the URL contained neither the substring [CGI|cgi] nor a question mark, the filename ended with the .gif suffix, and HTTP/1.0 was used. An example of a request that may generate this anonymized URL is <code>GET /image.gif HTTP/1.0</code>.
- <strong><code>GET 41438582632480924518. HTTP/1.0</code></strong> - the client issued a GET request, the URL contained neither the substring [CGI|cgi] nor a question mark, the filename didn't end with a dotted suffix, and HTTP/1.0 was used. An example of a request that may generate this anonymized URL is <code>GET /foo HTTP/1.0</code>.
Privacy was the foremost concern during this trace gathering experiment - UC Berkeley and the CS department consider the privacy of the student body to be paramount, and whenever we had the choice of putting more information in these published logs at the cost of sacrificing the privacy of the traced users, we have invariably chosen to maintain the users' privacy at the cost of losing this information. It is our hope that someday the web protocols and servers will become secure enough to make a tracing effort of the kind we have done impossible.
<strong>Acknowledgements</strong> Steven D. Gribble contributed the traces to the ITA. He also maintains the official UC Berkeley page dedicated to this tracing effort. For inquiries, contact Steve Gribble at <em>gribble [at] gmail [dot] com</em>. <em>These traces, documentation, and associated trace tools were created by Steve Gribble with the assistance of Armando Fox, Ian Goldberg, Eric Brewer, and Cliff Frost.</em>
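The anonymized URL format <code>COMMAND URLHASH.[flags][.suffix] [HTTPVERS]</code> described in the Privacy section can be split into its parts with a regular expression. The following Python sketch is one possible reading of that format, not code from the distributed tools; the names `AnonUrl` and `parse_anon_url` are hypothetical. It assumes the flags are any combination of <code>q</code> and <code>c</code>, and it reports a missing version field as HTTP/0.9. Two of the sample hashes are larger than 2^64 - 1 when read as a single decimal number, so the sketch keeps URLHASH as an opaque digit string rather than converting it to an integer.

```python
import re
from typing import NamedTuple, Optional


class AnonUrl(NamedTuple):
    command: str
    url_hash: str          # opaque digit string, not converted to int
    has_query: bool        # flag "q": a question mark was seen
    has_cgi: bool          # flag "c": "CGI" or "cgi" was seen
    suffix: Optional[str]  # filename suffix, or None if absent
    http_version: str


# COMMAND URLHASH.[flags][.suffix] [HTTPVERS]
_PATTERN = re.compile(
    r"(GET|HEAD|POST|PUT) (\d+)\.([qc]*)(?:\.(\S+))?(?: (HTTP/\d\.\d))?\s*"
)


def parse_anon_url(text: str) -> AnonUrl:
    match = _PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognized anonymized URL: {text!r}")
    command, url_hash, flags, suffix, version = match.groups()
    return AnonUrl(
        command=command,
        url_hash=url_hash,
        has_query="q" in flags,
        has_cgi="c" in flags,
        suffix=suffix,
        # An empty version field indicates HTTP/0.9 per the description.
        http_version=version or "HTTP/0.9",
    )


for sample in (
    "GET 8252631242092696791.q.map HTTP/1.0",
    "POST 36782605103285618862.c HTTP/1.0",
    "GET 103551731373256697..gif HTTP/1.0",
    "GET 41438582632480924518. HTTP/1.0",
):
    print(parse_anon_url(sample))
```

The four samples are the example URLs from the description; lines that do not fit the pattern raise a `ValueError` so that unexpected variants surface instead of being silently misparsed.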
Provider:
Zenodo
Created:
2020-09-04



