The logeval GPL Project
Copyright (C) 2002 by Steve Litt
NO WARRANTY!
There is no warranty for anything contained in the
logeval distribution or documentation or its web pages, to the extent
permitted by applicable law. Except when otherwise stated in writing
the copyright holders and/or other parties provide the program, documentation
and web pages "as is" without warranty of any kind, either expressed or
implied, including, but not limited to, the implied warranties of merchantability
and fitness for a particular purpose. The entire risk as to the quality
and performance of the program is with you. Should the program, documentation
or web pages prove defective, you assume the cost of all necessary servicing,
repair or correction.
logeval is a program that analyzes a set of UNIX/Apache log files and
produces meaningful statistics.
Project charter
The logeval program is intended to give daily statistics important in using
a website as an advertisement. Like other log analysis programs, it prints
out all web pages in reverse order of traffic. But in addition, it allows
you to flag specific web pages to analyze. Typically these would be advertisement
pages.
Also, unlike most analysis programs, it prints the 10 most-hit pages
for each day, together with the total visits for the day and the total
distinct IP addresses for the day. This daily refinement enables you to
more quickly and accurately gauge the effect of changes in advertisements,
correlating changes in content with both changes in traffic and changes
in sales.
Another feature is the ability to place special events in an event file,
so that each event prints just before the report entry for the day on which
it occurred. For example, a June 4, 2002 event called "Got a link from
bigsite.com" prints above the June 4, 2002 report entry, reminding you why
your stats went up 15%.
This object-oriented program can be enhanced as desired. The Accumulator
class compiles totals for a given period. Each day gets an Accumulator
object, and there's an Accumulator object for the entire report. It would
be easy to create weekly or monthly Accumulator objects, or an Accumulator
object for the last 7 days.
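
The Accumulator idea can be sketched roughly as follows. This is an illustration in Python rather than the program's Perl, and the names Accumulator, add, merge, top and combine are assumptions made for the example, not the program's actual interface. It shows why longer periods are cheap to add: a weekly or "last 7 days" total is just several daily totals folded together.

```python
from collections import Counter


class Accumulator:
    """Totals for one period: hits per URL and the distinct visitor IPs."""

    def __init__(self, label):
        self.label = label
        self.hits = Counter()   # URL -> number of hits
        self.ips = set()        # distinct IP addresses seen in the period

    def add(self, ip, url):
        """Record a single page access."""
        self.hits[url] += 1
        self.ips.add(ip)

    def merge(self, other):
        """Fold another period's totals into this one."""
        self.hits.update(other.hits)
        self.ips.update(other.ips)

    def total_visits(self):
        return sum(self.hits.values())

    def top(self, n=10):
        """The n most-hit URLs, busiest first."""
        return self.hits.most_common(n)


def combine(label, accumulators):
    """Build a longer-period Accumulator (week, month, last 7 days) from daily ones."""
    combined = Accumulator(label)
    for acc in accumulators:
        combined.merge(acc)
    return combined
```

Under these assumptions, a "last 7 days" total would be combine("last 7 days", daily_accumulators[-7:]), where daily_accumulators is a hypothetical list of daily objects in date order.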
There are a few downsides. The program is written in Perl, and is therefore
slower than you might expect. To minimize the Perl effect, the program
has a preprocessor (logeval.cgi) which pre-trims the log file using highly
efficient and thoroughly tested UNIX utilities. Other downsides include
the fact that it ignores graphic files and it ignores bandwidth. This would
NOT be a good tool to analyze bandwidth.
Depending on your web host and the size of your log(s), this program
might be runnable on the web host. However, it might time out, in which
case your best alternative is to use ftp every day to incrementally get
(reget) the new parts of your log file, and then run the program on your
desktop computer. The Troubleshooters.Com logs from 3/24/2002 through 6/4/2002
comprise 518703 html page accesses, and the analysis takes 3 minutes and
11 seconds to run on my dual Celeron 450 with 512 MB of RAM and Mandrake
Linux 8.2.
To repeat, logeval is built to analyze the immediate effect of content
changes on traffic patterns and sales.
Project specifications
This program consists of the following files:
- logeval.cgi
- logfilelist.cgi
- logeval_worker.cgi
- logeval.conf
- specialevents.list
logeval.cgi
This is a shell script that cats the files listed by logfilelist.cgi and
pipes them through grep statements to remove graphic file records and other
file types that aren't being tracked, as well as accesses that didn't produce
a 200 result. It then pipes the result to logeval_worker.cgi, which does all
the analysis.
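
To make the filtering step concrete, here is a minimal sketch of the same idea. It is written in Python rather than as the cat and grep pipeline the script actually uses, and it makes several assumptions: the logs are in Apache common or combined format, and the list of untracked extensions is hypothetical (the real grep patterns may differ). The names keep and prefilter are invented for the example.

```python
import re

# Assumed Apache common/combined log line layout:
#   ip ident user [timestamp] "METHOD /path PROTOCOL" status bytes ...
LINE_RE = re.compile(r'^\S+ \S+ \S+ \[[^\]]+\] "\S+ (\S+)[^"]*" (\d{3}) ')

# Hypothetical list of untracked file types; the real patterns may differ.
SKIPPED_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".ico", ".css", ".js")


def keep(line):
    """True if the line is a successful (200) request for a tracked file type."""
    match = LINE_RE.match(line)
    if match is None:
        return False
    url, status = match.groups()
    path = url.split("?", 1)[0].lower()   # ignore any query string
    return status == "200" and not path.endswith(SKIPPED_EXTENSIONS)


def prefilter(lines):
    """Yield only the lines worth handing to the analysis stage."""
    return (line for line in lines if keep(line))
```

Whatever tool does the trimming, the point is the same: the analysis stage sees far fewer lines than the raw log contains.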
logfilelist.cgi
Based on a wildcard in configuration file $HOME/.logeval/logeval.conf,
this program outputs a list of log files, sorted in date order from earliest
to latest. This program may need to be changed to accommodate the way your
ISP names log files.
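
As a rough illustration of what this step does, the sketch below expands a wildcard and returns the matching files oldest first. It is Python rather than the actual script, the function name list_log_files is made up, and it assumes that a file's modification time reflects the period it covers. If your ISP encodes dates in the file names, you would sort on those instead.

```python
import glob
import os


def list_log_files(wildcard):
    """Return the log files matching the wildcard, oldest first.

    Assumption: modification time is a good stand-in for the period a
    log file covers. Replace the sort key if your file names carry dates.
    """
    return sorted(glob.glob(wildcard), key=os.path.getmtime)
```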
logeval_worker.cgi
This is an OOP Perl program that does all the analysis work. For best performance
with high traffic sites, the per-line algorithm looks something like this:
    foreach line
        parse the line
        if datestamp != previous datestamp
            do break logic
        add to current daily Accumulator
On a site like Troubleshooters.Com, only 1 out of every 5000 lines triggers
a date change, so by offloading anything date related to the break logic,
and by updating ONLY the daily Accumulator, you maximize performance. Other
accumulators are updated during break logic by accumulating the proper
daily Accumulators.
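
The loop above is a classic control-break pattern. The following sketch shows one way it could look, in Python rather than the program's Perl. The function name analyze and the (date, url) record shape are assumptions for the example, and plain Counter objects stand in for the Accumulator class. Note that the report total is touched only when the date changes, never per line.

```python
from collections import Counter


def analyze(records):
    """Process (date, url) records that arrive in date order.

    Returns (daily, report): per-day URL counts and whole-report URL counts.
    """
    daily = {}            # date -> Counter of URL hits for that day
    report = Counter()    # whole-report totals, updated only at date breaks
    current_date = None
    current = None

    for date, url in records:
        if date != current_date:
            # Break logic: runs once per date change, not once per line.
            if current is not None:
                report.update(current)
            current_date = date
            current = daily[date] = Counter()
        current[url] += 1   # the only per-line bookkeeping

    if current is not None:
        report.update(current)   # fold in the final day
    return daily, report
```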
Two classes are intended to be substantially modified: the Breaklogic
and Writer classes. You can customize the report by modifying these. As
for the Writer class, you're probably better off subclassing it. For
instance, you could have a DailyWriter, WeeklyWriter, MonthlyWriter, Last7FullDayWriter,
and ReportWriter, all descended from the Writer class. The current program
just uses Writer to write both the daily Accumulators and the Report Accumulator.
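
Subclassing might look something like the sketch below. It is Python rather than Perl, and everything in it (the heading and write methods, the output layout, the top_n parameter) is invented for illustration. The real Writer class has its own interface.

```python
class Writer:
    """Formats one period's totals; subclasses override heading()."""

    def heading(self, label):
        return f"Period: {label}"

    def write(self, label, hits, top_n=10):
        """Return report lines for a {url: count} mapping, busiest URLs first."""
        lines = [self.heading(label)]
        ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
        for url, count in ranked[:top_n]:
            lines.append(f"{count:8d} {url}")
        return lines


class DailyWriter(Writer):
    def heading(self, label):
        return f"Day {label}"


class ReportWriter(Writer):
    def heading(self, label):
        return f"Whole report: {label}"

    def write(self, label, hits, top_n=None):
        # The whole-report section lists every page, not just the top 10.
        return super().write(label, hits, top_n)
```

The appeal of this layout is that a WeeklyWriter or Last7FullDayWriter would only need to override the parts that differ.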
The following is a list of classes in this program:
- Breaklogic: Updates any non-daily Accumulators, closes out the current day,
  and starts the next day.
- Writer: Writes the data for an Accumulator.
- Specialevents: Tracks special events from ~/.logeval/specialevents.list.
- Options: Storage for global variables.
- Accumulator: Accumulates statistics for a given period (minimum 1 day).
- Programlogic: Handles the top level logic of the program.
logeval.conf
This is the config file for the program. At present it contains only 2
types of lines:
- Log file wildcard
- Special URLs to track
There's only 1 log file wildcard record, and it looks something like this:
log wildcard = /scratch/tclogs/troubleshooters.com-access_log*
Each special URL record defines one URL to track separately, and looks
like this:
special url = /bookstore/order.htm
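
Reading a file in this format takes only a few lines. The sketch below is Python, not the program's own parser, and parse_conf is a made-up name. It assumes that whitespace around the equals sign is insignificant and that unrecognized lines can be skipped.

```python
def parse_conf(text):
    """Parse logeval.conf-style text into (log_wildcard, special_urls)."""
    log_wildcard = None
    special_urls = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or "=" not in line:
            continue   # assumption: blank or unrecognized lines are ignored
        key, value = (part.strip() for part in line.split("=", 1))
        if key == "log wildcard":
            log_wildcard = value
        elif key == "special url":
            special_urls.append(value)
    return log_wildcard, special_urls
```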
specialevents.list
This is a list of major special events that you believe would explain changes
in traffic or sales. For instance, on 4/21/2002 I changed the bookstore
main page to be an order form, and aimed ALL Troubleshooters.Com links
at that main page. Main page traffic skyrocketed, but sales plummeted.
Troubleshooters.Com's readers obviously needed to read about the book before
purchasing it. On 5/12/2002 I aimed Troubleshooters.Com book links at the
pages for the specific books, instead of the main page. Main page (order form)
visits dropped like a rock, but book page visits skyrocketed and sales
returned to their pre-4/21 level.
The following is my ~/.logeval/specialevents.list:
2002/04/21@15:00 Bookstore main page becomes order form, all links aimed there
2002/05/12@17:00 Move links back to book ads
These entries print above the output for their respective days, yielding
a very clear view of exactly what happened.
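
The event file format shown above is easy to parse. Here is a minimal sketch in Python (not the program's Specialevents class) that groups events by day so they can be printed above that day's report entry. The names parse_events and lines_for_day, and the "EVENT" output prefix, are assumptions for the example.

```python
from collections import defaultdict


def parse_events(text):
    """Group event descriptions by day, keyed as 'YYYY/MM/DD'."""
    events = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Each line is "YYYY/MM/DD@HH:MM description".
        stamp, _, description = line.partition(" ")
        day, _, time_of_day = stamp.partition("@")
        events[day].append((time_of_day, description.strip()))
    return events


def lines_for_day(events, day):
    """Lines to print above the given day's report entry, in time order."""
    return [f"EVENT {day} {time}: {text}" for time, text in sorted(events.get(day, []))]
```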
Maintainers list
Needed Programming and Documentation Tasks
Here are some items on the todo list:
- Implement a "last 7 full days" accumulator to print right before the report
  accumulator.
- Add command line arguments to define the start and end dates for the report.
  Include logfilelist.cgi modifications so that if the dates exclude a specific
  log file, that log file is not processed, thus speeding up the program.
- Change the Accumulator objects to include a delta for each URL, and print
  the URLs that change the most. This will help alert the webmaster to new
  incoming links.
How to Participate
Currently this project isn't mature enough for multiple programmers. If
you make what you consider a valuable change to the program, please feel
free to email me describing the change.
Instructions on how to join the project mailing list
No mailing list currently.
FAQ (Frequently Asked Questions) list.
None currently.
HTMLized versions of the project documentation
None currently.
Links to related projects.
None.
Dedication: We Stand On Their Shoulders
Larry originated the language that made this an easy 1 day project, Linus
originated the OS that it runs on, and Richard originated the license that
made all the rest possible.
Progress
To be announced