Knowing about the visitor to a web site

Studying the audience



Sheizaf Rafaeli, Winter 1998

Outline:

  1. Why have an interest in hits, visitors, visits and sessions?
  2. Sources of Information
  3. Types of Logs
  4. What does the log count/record?
  5. Log analysis
  6. What are the drawbacks of access stats?
  7. Ethics
  8. Sources for more information


Why have an interest in hits, visitors, visits and sessions?



Information sources about the audience "clickstream":


Tracking and profiling users



Different types of LOGS:


Logs and their attendant analysis products can tell managers who is accessing their site, when, and what is being accessed. Different servers create a variety of logs:



What do log files count (esp. access_log files)?

Remember that web sessions are "connectionless" or "stateless"!
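
Because each request stands alone, a "session" is never recorded as such and has to be inferred. One rough sketch of such an inference groups requests from the same host and starts a new session after a long pause. The 30-minute cutoff, the function name count_sessions, and the example host names are all assumptions made for illustration; none of them comes from any server or standard.

```python
from datetime import datetime, timedelta

# Assumed cutoff: a gap longer than this starts a new session.
SESSION_TIMEOUT = timedelta(minutes=30)

def count_sessions(requests, timeout=SESSION_TIMEOUT):
    """Count inferred sessions per host from (host, time) pairs."""
    last_seen = {}
    sessions = {}
    # Process requests in chronological order.
    for host, when in sorted(requests, key=lambda r: r[1]):
        previous = last_seen.get(host)
        if previous is None or when - previous > timeout:
            sessions[host] = sessions.get(host, 0) + 1
        last_seen[host] = when
    return sessions

# Hypothetical requests: host-a returns after a long pause, host-b does not.
requests = [
    ("host-a.example.com", datetime(1998, 1, 4, 21, 53)),
    ("host-a.example.com", datetime(1998, 1, 4, 21, 55)),
    ("host-b.example.com", datetime(1998, 1, 4, 21, 56)),
    ("host-a.example.com", datetime(1998, 1, 4, 23, 10)),
]
print(count_sessions(requests))
```

Note that a host name is only a stand-in for a person: several people behind one proxy would be merged into one "visitor" by this approach.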

The common access log file contains data entries for (read from left to right):


  1. Remote host name or IP number
  2. User_logname - often not implemented and replaced by "-"
  3. Authenticated_user - replaced by "-" if not an authenticated request
  4. Date and time
  5. Request from client (name of document, file or command)
  6. HTTP status code returned to client (200 is success)
  7. Number of bytes sent
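
The seven fields listed above can be pulled out of a log line with a short script. The following is a minimal sketch, assuming lines in exactly the layout shown in this section; the pattern and the function name parse_line are illustrative, not part of any log tool.

```python
import re
from datetime import datetime

# Pattern for the seven fields of the common access log format.
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_line(line):
    """Return a dict of the seven fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Convert the timestamp, e.g. "04/Jan/1998:21:53:36 +0200".
    # English month abbreviations are assumed (the default C locale).
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A "-" in the byte field is treated here as zero bytes sent.
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    return entry

sample = ('shelley.gre.ac.uk - - [04/Jan/1998:22:01:25 +0200] '
          '"POST /cgi-jcmc/srch.vol1no3.cgi HTTP/1.0" 200 1802')
entry = parse_line(sample)
print(entry["host"], entry["request"], entry["status"], entry["bytes"])
```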

Following is a sample of a few lines from an access_log of one of the servers of the Journal of Computer Mediated Communication (JCMC): (http://jcmc.huji.ac.il and http://www.ascusc.org/jcmc)

annex246-38.bmi.net - - [04/Jan/1998:21:53:36 +0200] "GET /vol1/issue4/giff2sm.gif HTTP/1.0" 200 38381
annex246-38.bmi.net - - [04/Jan/1998:21:53:37 +0200] "GET /vol1/issue4/giff1sm.gif HTTP/1.0" 200 35289
annex246-38.bmi.net - - [04/Jan/1998:21:53:44 +0200] "GET /vol1/issue4/watz2sm.gif HTTP/1.0" 200 40960
annex246-38.bmi.net - - [04/Jan/1998:21:54:45 +0200] "GET /vol1/issue4/mclaugh.html HTTP/1.0" 200 77702
annex246-38.bmi.net - - [04/Jan/1998:21:55:48 +0200] "GET /vol1/issue4/watz1sm.gif HTTP/1.0" 200 40960
lib21.lib.edinboro.edu - - [04/Jan/1998:21:55:56 +0200] "GET /bluebar.jpg HTTP/1.0" 200 1367
lib21.lib.edinboro.edu - - [04/Jan/1998:21:55:56 +0200] "GET /address2.gif HTTP/1.0" 200 1049
lib21.lib.edinboro.edu - - [04/Jan/1998:21:55:57 +0200] "GET /names.gif HTTP/1.0" 200 1393
lib21.lib.edinboro.edu - - [04/Jan/1998:21:55:57 +0200] "GET /Platsite.jpg HTTP/1.0" 200 15716
lib21.lib.edinboro.edu - - [04/Jan/1998:21:55:57 +0200] "GET /SCOUTSEL.GIF HTTP/1.0" 200 1531
lib21.lib.edinboro.edu - - [04/Jan/1998:21:56:15 +0200] "GET / HTTP/1.0" 200 5469
port327.cjnetworks.com - - [04/Jan/1998:21:58:02 +0200] "GET / HTTP/1.0" 200 5469
ww-tp06.proxy.aol.com - - [04/Jan/1998:21:58:49 +0200] "GET /vol1/issue3/crede.html HTTP/1.0" 200 82310
port327.cjnetworks.com - - [04/Jan/1998:21:59:28 +0200] "GET /journal.html HTTP/1.0" 200 5469
ww-tp06.proxy.aol.com - - [04/Jan/1998:21:59:50 +0200] "GET /vol1/issue3/vol1no3.html HTTP/1.0" 200 5508
ww-tp06.proxy.aol.com - - [04/Jan/1998:22:00:12 +0200] "GET /vol3/issue2/ HTTP/1.0" 200 11668
ww-tp06.proxy.aol.com - - [04/Jan/1998:22:00:20 +0200] "GET /vol3/issue2/head.jpg HTTP/1.0" 200 7203
cumin.mcc.ac.uk - - [04/Jan/1998:22:01:07 +0200] "GET /vol1/issue3/search.vol1no3.html HTTP/1.0" 200 923
shelley.gre.ac.uk - - [04/Jan/1998:22:01:25 +0200] "POST /cgi-jcmc/srch.vol1no3.cgi HTTP/1.0" 200 1802
cumin.mcc.ac.uk - - [04/Jan/1998:22:01:31 +0200] "GET /vol1/ HTTP/1.0" 200 764
saffron.lut.ac.uk - - [04/Jan/1998:22:01:53 +0200] "GET /icons/back.gif HTTP/1.0" 200 216
sorrel.mcc.ac.uk - - [04/Jan/1998:22:01:53 +0200] "GET /icons/blank.gif HTTP/1.0" 200 148
sorrel.mcc.ac.uk - - [04/Jan/1998:22:01:53 +0200] "GET /icons/folder.gif HTTP/1.0" 200 225
saffron.lut.ac.uk - - [04/Jan/1998:22:01:57 +0200] "GET / HTTP/1.0" 200 5470

Note that this tiny sample captures eight simultaneous "hosts", "visiting" over a randomly chosen period of just under ten minutes in "an evening in the life of an online journal". Even in this small sample one can sense that some of the "sessions" are "directed" and some "meandering". Some are passive, just reading textual HTML files or requesting graphic "gif" files. Other "sessions" are more interactive: one visitor generated a "POST" record, which results in running a CGI program. Note that the visitors come from different countries (two in this case). Several visitors originate from educational/academic institutions, while others use commercial and network accounts/connections. If we follow the log long enough, we can begin to discern a pattern of interests for individual visitors and, in the aggregate, for types of visitors.

So there are a lot of "hooks" upon which we can hang the analysis of data.
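
A few of these "hooks" can be tallied with very little code. The sketch below uses seven lines copied from the sample above and counts hits per host, per top-level domain, and per request method, and it collects each host's sequence of requested paths. The variable names are illustrative. The last label of the host name is only a rough hint of origin: "uk" points to a country, while "com" and "edu" do not.

```python
from collections import Counter, defaultdict

# Seven lines copied from the access_log sample above.
LOG_LINES = [
    'lib21.lib.edinboro.edu - - [04/Jan/1998:21:56:15 +0200] "GET / HTTP/1.0" 200 5469',
    'port327.cjnetworks.com - - [04/Jan/1998:21:58:02 +0200] "GET / HTTP/1.0" 200 5469',
    'ww-tp06.proxy.aol.com - - [04/Jan/1998:21:58:49 +0200] "GET /vol1/issue3/crede.html HTTP/1.0" 200 82310',
    'port327.cjnetworks.com - - [04/Jan/1998:21:59:28 +0200] "GET /journal.html HTTP/1.0" 200 5469',
    'ww-tp06.proxy.aol.com - - [04/Jan/1998:21:59:50 +0200] "GET /vol1/issue3/vol1no3.html HTTP/1.0" 200 5508',
    'shelley.gre.ac.uk - - [04/Jan/1998:22:01:25 +0200] "POST /cgi-jcmc/srch.vol1no3.cgi HTTP/1.0" 200 1802',
    'cumin.mcc.ac.uk - - [04/Jan/1998:22:01:31 +0200] "GET /vol1/ HTTP/1.0" 200 764',
]

hits_per_host = Counter()
hits_per_domain = Counter()
hits_per_method = Counter()
paths_by_host = defaultdict(list)

for line in LOG_LINES:
    host = line.split()[0]              # first field: remote host
    request = line.split('"')[1]        # quoted field: the client request
    method, path = request.split()[:2]  # e.g. "GET" and "/journal.html"
    hits_per_host[host] += 1
    hits_per_domain[host.rsplit(".", 1)[-1]] += 1  # last label, e.g. "uk"
    hits_per_method[method] += 1
    paths_by_host[host].append(path)    # the host's path sequence

print(hits_per_host.most_common())
print(hits_per_domain.most_common())
print(hits_per_method.most_common())
```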


Distinguish between counting:


Analysis

Logs can be watched in real time (using tools like "tail"), or "eyeballed" informally. More systematic approaches would require some tallying and crosstabulation, at least.
Log analyses can be done manually, for example by using spreadsheet tools, Unix counting tools like 'wc' and 'grep', or by writing simple scripts.
But they can also be done using any one of a variety of log analysis tools.
Here are some simple examples of output from statistical log analyses:







The above examples, taken from different servers, were created using Accesswatch, Analog, http-analyze, 3Dstats 3.1, and wwwstats. Please see Yahoo's list of log analysis tools, cited below.
More sophisticated analysis tools, and certainly the commercially available ones, also conform to some external standard of reporting, which makes audits easier and allows the aggregation of data from different sites. Additionally, click-through analyses can be combined with the analysis of "cookie" data and data from other sources.
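
For comparison with such tools, the "simple script" route mentioned at the start of this section might look like the sketch below. It counts total hits (much as 'wc -l' would) and crosstabulates hour of day against a crude resource type. The classification rules in resource_kind are assumptions made for this example, and the input is seven lines copied from the JCMC sample.

```python
from collections import Counter

# Seven lines copied from the access_log sample earlier in this document.
LOG_LINES = [
    'annex246-38.bmi.net - - [04/Jan/1998:21:53:36 +0200] "GET /vol1/issue4/giff2sm.gif HTTP/1.0" 200 38381',
    'annex246-38.bmi.net - - [04/Jan/1998:21:54:45 +0200] "GET /vol1/issue4/mclaugh.html HTTP/1.0" 200 77702',
    'lib21.lib.edinboro.edu - - [04/Jan/1998:21:55:57 +0200] "GET /SCOUTSEL.GIF HTTP/1.0" 200 1531',
    'lib21.lib.edinboro.edu - - [04/Jan/1998:21:56:15 +0200] "GET / HTTP/1.0" 200 5469',
    'ww-tp06.proxy.aol.com - - [04/Jan/1998:22:00:12 +0200] "GET /vol3/issue2/ HTTP/1.0" 200 11668',
    'ww-tp06.proxy.aol.com - - [04/Jan/1998:22:00:20 +0200] "GET /vol3/issue2/head.jpg HTTP/1.0" 200 7203',
    'shelley.gre.ac.uk - - [04/Jan/1998:22:01:25 +0200] "POST /cgi-jcmc/srch.vol1no3.cgi HTTP/1.0" 200 1802',
]

def resource_kind(path):
    """Crude classification by path; the rules are assumptions for this sketch."""
    lowered = path.lower()
    if lowered.endswith((".gif", ".jpg")):
        return "image"
    if "/cgi" in lowered:
        return "script"
    return "page"

crosstab = Counter()
for line in LOG_LINES:
    hour = line.split("[")[1].split(":")[1]   # "21" from "[04/Jan/1998:21:53:36"
    path = line.split('"')[1].split()[1]      # second word of the request
    crosstab[(hour, resource_kind(path))] += 1

total_hits = len(LOG_LINES)
page_hits = sum(n for (hour, kind), n in crosstab.items() if kind == "page")

print("hits:", total_hits, "pages:", page_hits)
for (hour, kind), n in sorted(crosstab.items()):
    print(hour, kind, n)
```

The gap between total hits and page hits shows why raw hit counts overstate readership: every embedded image adds a line to the log.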

Problems and drawbacks with logs:

(biases and distortions)



Ethics issues!


Sources:
