Knowing about the visitor to a web site
Studying the audience
Sheizaf Rafaeli, Winter 1998
Outline:
- Why have an interest in hits, visitors, visits and sessions?
- Sources of Information
- Types of Logs
- What does the log count/record?
- Log analysis
- What are the drawbacks of access stats?
- Ethics
- Sources for more information
Why have an interest in hits, visitors, visits and sessions?
- Bragging rights
- Assessing advertising rates
- Pricing and billing
- Site design
- Feedback and navigational help for users
- Quality assurance
- Security concerns
- Copyright enforcement
- Marketing research
- Editorial and marketing quality improvements
Information sources about the audience
"clickstream":
Tracking and profiling users
- Speculate (data-less research)
- Surveys (ask 'em)
- Experiments (watch 'em in the lab)
- Counters (visible and hidden; internal and external)
- Cookies
- Javascript background processes
- Active-X background processes
- Java background processes
- CGI environment variables
- CGI events (info sent with POST commands)
- ***Logs***
- non-server logs, such as history files created by the browsers, intermediate logs recorded by firewall and proxy devices,
cookie logs on individual clients and intermediating networks.
- PC-Meters (measuring at the source, like TV-meters) (watch 'em at home),
perhaps used in panels of participating households (likely to happen in
major markets, but unlikely to capture the essence of the web, as the sample is
already biased).
- Eyeball-trackers and other physiological measures
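To make the "CGI environment variables" source above concrete, here is a minimal sketch of a CGI script that echoes what the server tells it about the visitor. The variable names are standard CGI meta-variables. The helper name `visitor_profile` and the choice of which variables to report are illustrative assumptions, and a given server may leave some of them unset.

```python
import os

# Standard CGI meta-variables that describe the visitor and the request.
# Which ones are actually populated depends on the server and the browser.
VISITOR_VARIABLES = (
    "REMOTE_ADDR",
    "REMOTE_HOST",
    "HTTP_USER_AGENT",
    "HTTP_REFERER",
    "REQUEST_METHOD",
    "QUERY_STRING",
)


def visitor_profile(environ=os.environ):
    """Collect whichever visitor-related CGI variables the server supplied."""
    return {name: environ[name] for name in VISITOR_VARIABLES if name in environ}


if __name__ == "__main__":
    # Minimal CGI response: a header line, a blank line, then the body.
    print("Content-Type: text/plain")
    print()
    for name, value in visitor_profile().items():
        print(f"{name}={value}")
```

Run outside a web server, the script usually prints only the header, since none of these variables are set in an ordinary shell.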
Different types of LOGS:
Logs and their attendant analysis products can tell managers
who is accessing their site, when, and what is being accessed.
Different servers create a variety of logs:
- Error
- Security
- Referrer
- Browser (Agent)
- Access-Logs
What do log files count (esp. access_log files)?
Remember that
web sessions are "connectionless" or "stateless"!
The common access log file
contains data entries for (read from left to right):
1. Remote host name or IP number
2. User_logname - often not implemented and replaced by "-"
3. Authenticated_user - replaced by "-" if not an authenticated request
4. Date and time
5. Request from client (name of document, file or command)
6. HTTP status code returned to client (200 is success)
7. Number of bytes sent
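As a minimal sketch, the seven fields listed above could be split apart in Python with a regular expression. The names `CLF_PATTERN`, `LogEntry` and `parse_line` are illustrative assumptions, not part of any server or analysis package, and the example line is one of the JCMC entries from this document's sample.

```python
import re
from typing import NamedTuple, Optional

# One named group per field, in the left-to-right order described above.
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) '
    r'\[(?P<when>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


class LogEntry(NamedTuple):
    host: str
    logname: Optional[str]
    user: Optional[str]
    when: str
    request: str
    status: int
    size: int


def _dash_to_none(value: str) -> Optional[str]:
    """The log writes '-' for a field that is not available."""
    return None if value == "-" else value


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one access_log line into its seven fields; None if malformed."""
    m = CLF_PATTERN.match(line.strip())
    if m is None:
        return None
    return LogEntry(
        host=m["host"],
        logname=_dash_to_none(m["logname"]),
        user=_dash_to_none(m["user"]),
        when=m["when"],
        request=m["request"],
        status=int(m["status"]),
        size=0 if m["size"] == "-" else int(m["size"]),
    )


entry = parse_line(
    'shelley.gre.ac.uk - - [04/Jan/1998:22:01:25 +0200] '
    '"POST /cgi-jcmc/srch.vol1no3.cgi HTTP/1.0" 200 1802'
)
print(entry.host, entry.request, entry.status, entry.size)
```

Returning `None` for lines that do not match keeps a tally from crashing on the occasional truncated or unusual entry.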
Following is a sample of a few lines from an access_log
of one of the servers of the Journal of Computer Mediated Communication
(JCMC): (http://jcmc.huji.ac.il and http://www.ascusc.org/jcmc)
annex246-38.bmi.net - - [04/Jan/1998:21:53:36 +0200] "GET /vol1/issue4/giff2sm.gif HTTP/1.0" 200 38381
annex246-38.bmi.net - - [04/Jan/1998:21:53:37 +0200] "GET /vol1/issue4/giff1sm.gif HTTP/1.0" 200 35289
annex246-38.bmi.net - - [04/Jan/1998:21:53:44 +0200] "GET /vol1/issue4/watz2sm.gif HTTP/1.0" 200 40960
annex246-38.bmi.net - - [04/Jan/1998:21:54:45 +0200] "GET /vol1/issue4/mclaugh.html HTTP/1.0" 200 77702
annex246-38.bmi.net - - [04/Jan/1998:21:55:48 +0200] "GET /vol1/issue4/watz1sm.gif HTTP/1.0" 200 40960
lib21.lib.edinboro.edu - - [04/Jan/1998:21:55:56 +0200] "GET /bluebar.jpg HTTP/1.0" 200 1367
lib21.lib.edinboro.edu - - [04/Jan/1998:21:55:56 +0200] "GET /address2.gif HTTP/1.0" 200 1049
lib21.lib.edinboro.edu - - [04/Jan/1998:21:55:57 +0200] "GET /names.gif HTTP/1.0" 200 1393
lib21.lib.edinboro.edu - - [04/Jan/1998:21:55:57 +0200] "GET /Platsite.jpg HTTP/1.0" 200 15716
lib21.lib.edinboro.edu - - [04/Jan/1998:21:55:57 +0200] "GET /SCOUTSEL.GIF HTTP/1.0" 200 1531
lib21.lib.edinboro.edu - - [04/Jan/1998:21:56:15 +0200] "GET / HTTP/1.0" 200 5469
port327.cjnetworks.com - - [04/Jan/1998:21:58:02 +0200] "GET / HTTP/1.0" 200 5469
ww-tp06.proxy.aol.com - - [04/Jan/1998:21:58:49 +0200] "GET /vol1/issue3/crede.html HTTP/1.0" 200 82310
port327.cjnetworks.com - - [04/Jan/1998:21:59:28 +0200] "GET /journal.html HTTP/1.0" 200 5469
ww-tp06.proxy.aol.com - - [04/Jan/1998:21:59:50 +0200] "GET /vol1/issue3/vol1no3.html HTTP/1.0" 200 5508
ww-tp06.proxy.aol.com - - [04/Jan/1998:22:00:12 +0200] "GET /vol3/issue2/ HTTP/1.0" 200 11668
ww-tp06.proxy.aol.com - - [04/Jan/1998:22:00:20 +0200] "GET /vol3/issue2/head.jpg HTTP/1.0" 200 7203
cumin.mcc.ac.uk - - [04/Jan/1998:22:01:07 +0200] "GET /vol1/issue3/search.vol1no3.html HTTP/1.0" 200 923
shelley.gre.ac.uk - - [04/Jan/1998:22:01:25 +0200] "POST /cgi-jcmc/srch.vol1no3.cgi HTTP/1.0" 200 1802
cumin.mcc.ac.uk - - [04/Jan/1998:22:01:31 +0200] "GET /vol1/ HTTP/1.0" 200 764
saffron.lut.ac.uk - - [04/Jan/1998:22:01:53 +0200] "GET /icons/back.gif HTTP/1.0" 200 216
sorrel.mcc.ac.uk - - [04/Jan/1998:22:01:53 +0200] "GET /icons/blank.gif HTTP/1.0" 200 148
sorrel.mcc.ac.uk - - [04/Jan/1998:22:01:53 +0200] "GET /icons/folder.gif HTTP/1.0" 200 225
saffron.lut.ac.uk - - [04/Jan/1998:22:01:57 +0200] "GET / HTTP/1.0" 200 5470
Note that this tiny sample captures eight simultaneous "hosts",
"visiting" over a randomly chosen period of just under ten minutes
in "an evening in the life of an online journal".
Even in this small sample one can sense that some of the "sessions" are "directed", some "meandering".
Some are passive, just reading textual HTML files or requesting graphic "gif" files.
Other "sessions" are more interactive. One visitor generated a "POST" record that results in running a CGI program.
Note how the visitors are from different countries (two in this case).
Several visitors originate from an educational/academic institution
while others are using commercial and network accounts/connections.
If we follow the log long enough, we can begin to discern a
pattern of interests for individual visitors and in the aggregate for types of visitors.
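The kind of aggregate pattern described here can be approximated with a small tally. The sketch below groups the distinct hosts from the sample by the last label of their host name. Treating that label as a hint about country or institution type is an assumption that often fails (a ".com" or ".net" name says little about location), and the function names are illustrative only.

```python
from collections import Counter

# Remote host field of some sample entries, repeats included.
HOSTS = [
    "annex246-38.bmi.net",
    "annex246-38.bmi.net",
    "lib21.lib.edinboro.edu",
    "port327.cjnetworks.com",
    "ww-tp06.proxy.aol.com",
    "cumin.mcc.ac.uk",
    "shelley.gre.ac.uk",
    "saffron.lut.ac.uk",
    "sorrel.mcc.ac.uk",
    "sorrel.mcc.ac.uk",
]


def top_level_domain(host: str) -> str:
    """Return the last dot-separated label, or 'unresolved' for a bare IP number."""
    label = host.rsplit(".", 1)[-1]
    return "unresolved" if label.isdigit() else label.lower()


def tally_domains(hosts):
    """Count distinct hosts (not hits) per top-level domain."""
    return Counter(top_level_domain(h) for h in set(hosts))


for tld, n in tally_domains(HOSTS).most_common():
    print(f"{tld}: {n} host(s)")
```

Counting distinct hosts rather than log lines keeps one image-heavy page from inflating a domain's share.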
So there are
a lot of "hooks" upon which we can hang the
analysis of data.
- Hit: Any request for a document, a script, an image, sound, video or any other file
which causes the server to respond in some way.
If an HTML page is requested this may cause further requests
for images or other embedded objects to be sent to the server.
Each line in the log is a "hit". Hits are both the easiest
to grasp and the most deceptive element in the log file.
Ostensibly, the log file is the record of the number of times a page or file was accessed.
Hits have gotten a bad rap -- mostly because of the way log analysis
programs are misused (see below). Just because each page or image file accessed
is recorded as a hit in the log doesn't mean each page or file should be
used by the log analysis program. Hits should not be counted as people,
visits or sessions.
- Visit: Each time a specific user accesses a page or file is
considered a visit. Multiple visits by a single user reflect
a high degree of interest in the content and presentation of the site.
Frequently updated and popular sites generate multiple visits.
- Session: The activities of a user during a single visit are
referred to as a session. There is considerable interest in being able to
track the length of a session and the path that a user
follows within a web site. Since client software doesn't send the
server a "good-bye for now" message, it is difficult to tell just when
a session ends.
- Date and Time: The date and time are provided based on the server's internal clock,
and usually reported in terms of the time zone in which the server is located. A correction factor
synchronizing the server's time zone to GMT is part of the time stamp. Note that the time stamp is
accurate to the resolution level of seconds, and can be set to record even higher resolutions.
- Status code:
Distinguish between counting:
- Exposure
- Click through
- Interactivity
- Outcome
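Because the client never says "good-bye", analysts usually end a session after a period of inactivity. The sketch below shows one such heuristic. The 30-minute cutoff is an assumed convention, not something the log provides, and treating a host name as a single user is a simplification that proxies readily break. The function names are illustrative.

```python
from datetime import datetime, timedelta

# Assumed cutoff: a silence longer than this starts a new session.
SESSION_TIMEOUT = timedelta(minutes=30)


def parse_timestamp(stamp: str) -> datetime:
    """Parse an access_log time stamp, keeping its offset from GMT.

    Note: %b expects English month abbreviations (the default C locale).
    """
    return datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")


def sessionize(hits, timeout=SESSION_TIMEOUT):
    """Group (host, time stamp) hits into per-host sessions.

    Returns {host: [[datetime, ...], ...]}, one inner list per session.
    """
    sessions = {}
    for host, stamp in sorted(hits, key=lambda h: parse_timestamp(h[1])):
        when = parse_timestamp(stamp)
        host_sessions = sessions.setdefault(host, [])
        if host_sessions and when - host_sessions[-1][-1] <= timeout:
            host_sessions[-1].append(when)  # continues the current session
        else:
            host_sessions.append([when])  # first hit, or the gap was too long
    return sessions


hits = [
    ("port327.cjnetworks.com", "04/Jan/1998:21:58:02 +0200"),
    ("port327.cjnetworks.com", "04/Jan/1998:21:59:28 +0200"),
    ("ww-tp06.proxy.aol.com", "04/Jan/1998:21:58:49 +0200"),
    ("ww-tp06.proxy.aol.com", "04/Jan/1998:22:00:20 +0200"),
]
for host, host_sessions in sessionize(hits).items():
    for s in host_sessions:
        print(host, len(s), "hits, lasting", s[-1] - s[0])
```

The measured duration runs only to the last request, so the time spent reading the final page is not captured.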
Analysis
Logs can be watched in real time (using tools like "tail"), or "eyeballed" informally.
More systematic approaches would require some tallying and crosstabulation, at least.
Log analyses can be done manually, for example by using
spreadsheet tools, unix counting tools like 'wc' and 'grep', or by writing simple scripts.
But they can also be done using any one of a variety of log analysis tools.
Here are some simple examples for outputs of statistical log analyses:
The above examples, taken from different servers, were created using
Accesswatch, Analog, http-analyze, 3Dstats 3.1, and wwwstats. Please see
Yahoo's list of log analysis tools, cited below.
More sophisticated analysis tools, and certainly the commercially available tools,
will also conform to some external standard of reporting, which would make audits easier
and allow the aggregation of data from different sites.
Additionally, click-through analyses could be combined with the analysis
of "cookie" data, and data from other sources.
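For the manual route mentioned in this section, a "simple script" in the spirit of 'wc' and 'grep' might look like the sketch below. It tallies hits, bytes sent, and requests per document from raw access_log lines. The function name `summarize` is an illustrative assumption, and the three lines come from the JCMC sample earlier in this document.

```python
from collections import Counter


def summarize(lines):
    """Tally hits, bytes sent, and requests per document."""
    hits = 0
    total_bytes = 0
    per_document = Counter()
    for line in lines:
        try:
            # The request is the quoted field; status and size follow it.
            _, request, tail = line.rsplit('"', 2)
            path = request.split()[1]
            status, size = tail.split()
        except (ValueError, IndexError):
            continue  # skip lines that do not look like access_log entries
        hits += 1
        total_bytes += int(size) if size.isdigit() else 0
        per_document[path] += 1
    return hits, total_bytes, per_document


LINES = [
    'lib21.lib.edinboro.edu - - [04/Jan/1998:21:56:15 +0200] "GET / HTTP/1.0" 200 5469',
    'port327.cjnetworks.com - - [04/Jan/1998:21:58:02 +0200] "GET / HTTP/1.0" 200 5469',
    'saffron.lut.ac.uk - - [04/Jan/1998:22:01:53 +0200] "GET /icons/back.gif HTTP/1.0" 200 216',
]
hits, total_bytes, per_document = summarize(LINES)
print(hits, "hits,", total_bytes, "bytes sent")
for path, n in per_document.most_common(5):
    print(n, path)
```

To run it on a real log, pass an open file object in place of `LINES`.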
Problems and drawbacks with logs:
(biases and distortions)
- Inappropriate use
(mistaking "hits" for "accesses",
not investigating "sessions",
not aggregating,
not separating graphics from text (wheat from chaff),
not separating "pages received" from "pages sent").
- Deliberate "fudging" or worse
- What was sent, not what was received
- Dynamic content
- Robots
- Offline reading
- Proxies
- Caches, forwarding
- Printed copies
- Multicasting
- plug-ins and media objects
No record of what happens offsite (inside java, javascript, etc.)
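Two of the distortions listed above, graphics counted as pages and robot traffic, can be partly filtered out before counting. The sketch below is one rough heuristic under stated assumptions: the set of graphic extensions is a guess, treating any host that requested /robots.txt as a robot misses robots that never ask for it, and the "crawler.example.com" entries are hypothetical data added for illustration.

```python
import os

# Assumed set of extensions treated as embedded graphics rather than pages.
GRAPHIC_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png"}


def is_page(path: str) -> bool:
    """True if the path looks like a document rather than an embedded image."""
    ext = os.path.splitext(path.split("?", 1)[0])[1].lower()
    return ext not in GRAPHIC_EXTENSIONS


def page_views(requests):
    """Count successful page requests, skipping graphics and likely robots.

    requests is an iterable of (host, path, status) tuples.
    """
    requests = list(requests)
    robots = {host for host, path, _ in requests if path == "/robots.txt"}
    return sum(
        1
        for host, path, status in requests
        if host not in robots and status == 200 and is_page(path)
    )


requests = [
    ("lib21.lib.edinboro.edu", "/bluebar.jpg", 200),
    ("lib21.lib.edinboro.edu", "/SCOUTSEL.GIF", 200),
    ("lib21.lib.edinboro.edu", "/", 200),
    ("port327.cjnetworks.com", "/journal.html", 200),
    ("crawler.example.com", "/robots.txt", 200),    # hypothetical robot
    ("crawler.example.com", "/journal.html", 200),  # hypothetical robot
]
print(len(requests), "hits,", page_views(requests), "page views")
```

Caches, proxies and offline reading leave no trace in the server log, so no filter of this kind can correct for them.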
Ethics issues!
- Should Web stats be made public?
- Would it be ethical to NOT make logs public?
- Should web stats be made public even if only in aggregate?
- Should decisions, allocation, pricing, strategy, etc. be based on log stats?
Sources:
- Reach and hits stats, Salon article part 1
- Reach and hits stats, Salon article part 2
- Barnes, B. (1997) You Count : How many people read SLATE?
Slate, 8 August 1997 http://www.slate.com/Webhead/97-08-08/WebHead.asp
- Reach for the Hits, S. Rosenberg, Salon Magazine
- InformationWeek's Tapping the Pipeline
- Doug Linder: Interpreting WWW statistics
Doug Linder, Synetics webmaster@nara.gov
http://www.ario.ch/etc/webstats.html
- Yahoo's list of log analysis tools, at:
http://www.yahoo.com/Computers_and_Internet/Software/Internet/World_Wide_Web/Servers/Log_Analysis_Tools/
- UMICH loggate at http://www.umich.edu/~websvcs/umweb/index.html
- Analog: http://brendanr.simplenet.com/analog
- BBM bureau of measurement, a primer on web-site measurement
http://www.bbm.ca/newmedia/whitep.htm
- Bhatia, M. (1997) Web Audience Measurement: Issues, Challenges and
Solutions, Nielsen Interactive Services, July 23, 1997.
http://www.nielsenmedia.com/nypres/sld001.htm
- Goldberg: Why web usage statistics are (worse than) meaningless
http://www.cranfield.ac.uk/docs/stats/
- GVU's Log File Analysis tools survey
http://www.gvu.gatech.edu/user_surveys/survey-1997-10/graphs/webmaster/Log_File_Analysis_Tools.html
- Musciano, C. (1996, March 1). Collecting and using server statistics. SunWorld
Online [Online], 10 paragraphs.
Available: http://www.sun.com/sunworldonline/swol-03-1996/swol-03-webmaster.html [1997, March 25]
- Noonan: Making Sense of Web Usage Statistics
http://www.piperinfo.com/pl01/usage.html
- Novak, T. P. & Hoffman, D. L. (1996). New Metrics for New Media: Toward the
Development of Web Measurement Standards, [Online]. Available:
http://www2000.ogsm.vanderbilt.edu/novak/web.standards/webstand.html
- Internet.com Product Watch List of Analyzers:
http://ipw.internet.com/analysis/index.html
- HitBox rating service and java visitor tracker
- Hebrew explanation
- Comprehensive article
- Another
- Products on the market (zdnet)
- Statistical inaccuracies
- The Drudge Report Traffic Stats "bragging page"
- Junkbusters, about data profiling
- Software to capture mouseclicks and keystrokes:
- Visual Basic's Keypress event,
- Invisible KeyLogger, from Amecisco, www.amecisco.com/support.htm
- Stealth Keyboard Interceptor, from ANNA Ltd., www.geocities.com/SiliconValley/Hills/8839/skin984.html
- BackOrifice 2000, http://www.bo2k.com
- WinRunner from Mercury Interactive, www.winrunner.com
- Invisible KeyLogger 97 and Invisible KeyLogger Stealth (IKS) www.cnte.com/content/gadgets/guides/terrors/
- phantom.exe, Superkey, Investigator, www.winwhatwhere.com/ Ventana Group Systems.
This page was last updated by sr.