About Analytics

Discover where your site visitors come from, what pages they visit, how long they stay, what they buy, what makes them give up, and how often they return.

Friday, January 23, 2009

Error Codes

It’s worth noting that Google Analytics, by its nature, doesn’t collect information
on HTTP errors. Those errors are part of the log file, and Google Analytics
doesn’t look at logs. The only real point of interest is the 404 “not found”
errors. If you click the 404 link, it’ll take you to a page such as the one in
Figure 3-21 that lists all the URLs for the pages visitors requested that could
not be found. The web server won’t know where to send visitors looking for those
pages. If
the URL is pointing to an old name that no longer exists, or shows that there’s
a misspelling in your HTML, you can fix it. On the other hand, it may be an
attack on your web server if the URL isn’t familiar (especially if it contains
funny characters like \x05), or if the URL has a long, long string of nonsense
after it, or if it contains things like admin or .dll, which you don’t have on
your web site. But unless those attacks are successful or overwhelming in
number, you’re probably safe in ignoring them.

We’re Done!

We made it! Now that we’ve tied up all the loose ends and, we hope, taught
you all the basics, it’s time to move on to Google Analytics. You’ll see, as you
go on, more of what Google Analytics can do that AWStats can’t, but you’ll
also see that what you’ve learned about AWStats is valuable in itself.
Onward!

Key Words and Key Phrases

Speaking of targeted traffic, if you want to know what people are searching for
in those search engines, look no further than the Keywords and Keyphrases tables.

These search terms are bringing visitors to your site. Unless you’re an 800-
pound gorilla yourself, the Keywords table won’t mean a whole lot, except
that having your best keywords appear the most is desirable. See how “figure”
appears 5,586 times and “skating” clocks in at 6,128, but “figure skating”
brings in only 135? That’s what we mean. SkateFic is ranked so far down in the
search for “figure skating” that it seldom gets found. The traffic you do see in
the table is actually brought in by AdWords.
For the most part, however, SkateFic’s key-phrase performance is pretty
good. There are only two anomalies (Kristina Lenko and Kristina Cousins),
which bring people in but don’t actually appear anywhere on the site. The rest
of the key phrases are on topic and likely point to relevant content. By clicking
Full List, one would see that “figure skating” appears in roughly half the
searches, meaning people are generally interested in the subject matter
SkateFic offers. This is important. Years ago, the two top searches were “Tina
Wild,” a porn star, and “hockey wives” —don’t ask, we don’t know either.
Obviously, those searches did not bring in people who were interested in what
SkateFic had to offer: a skating serial chapter that mentioned hockey players’
wives and where the main character Tina had a wild-hair day.

Miscellaneous

At present, as shown in Figures 3-18 and 3-19, the only part of the Miscellaneous
table that’s working is the tally of bookmarks.
The measures of “favorite” bookmarking in the two figures above are
important for at least two reasons. First, they tell you how many people liked
your site enough to bookmark it —meaning that they plan to return again and
again. They may never actually return, but some do. This is called “stickiness.”
You want your site to be sticky. The “favorites” metric helps you keep tabs on
how big your core audience is—the people who intend to come back.
Second, if you follow this over the months, you can see whether your content
is becoming more compelling or less—are more or fewer people intending
to come back? SkateFic.com ran at 4.7 percent bookmarks for the five
years or so before February 2006, as recorded cumulatively in Figure 3-18.
Figure 3-19, covering February 2006 to February 2007, shows a big change. Not
only did a healthy 5,749 enter the site during that year, but 2,156 bookmarked
it—37.5 percent compared to the average of 4.7 percent for previous years.
One could explore the reasons for this, but Google Analytics records only
effects; it won’t tell you why. What it does give you are hard numbers showing
that the stickiness of SkateFic.com more than tripled in a year, from 709 to
2,156. The site has gotten stickier, and it’s doing better at attracting an audience
interested in figure-skating fiction.
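The percentages above are simple arithmetic, and it can be handy to recompute them when you compare one period with another. Here is a minimal Python sketch using the figures quoted in the text; the function name bookmark_rate is our own invention, not something AWStats provides.

```python
def bookmark_rate(bookmarks: int, unique_visitors: int) -> float:
    """Return bookmarks as a percentage of unique visitors."""
    if unique_visitors <= 0:
        raise ValueError("unique_visitors must be positive")
    return 100.0 * bookmarks / unique_visitors


# Figures quoted in the text for February 2006 to February 2007.
rate = bookmark_rate(2156, 5749)

# Bookmark count for the year compared with the earlier count of 709.
growth = 2156 / 709

print(f"bookmark rate: {rate:.1f} percent, growth factor: {growth:.2f}")
```

Tracking the rate rather than the raw count keeps the comparison fair when total traffic changes from one period to the next.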

Connect to Site from

The “Connect to site from” report has two sections: top and bottom. Figure 3-14
shows the upper part of the top section.
First is the traffic (pages and hits) coming from people who type your
URL—direct addresses—or use a bookmark. These are your regular customers
or readers, your core traffic. They know where your site is from memory
or they have your site bookmarked. Chances are, they’ll be back because
they know you have what they want.
The second line of the upper section is for people coming in from newsgroups.
Newsgroups are one of the more ancient forms of Internet communication,
the killer app of 1991. Newsgroups, which tend to be very uncontrolled
and egalitarian, are falling by the wayside, whereas conduits where content
can be controlled (such as mailing lists and web forums) are on the rise. There
was no incoming traffic from newsgroups. If there had been, it would have
indicated that there was some word of mouth about your site and that people
were visiting based on recommendations from other visitors.
The rest of the upper section lists search engine activity. Google tends to rule
this list, with five times more traffic than everyone else put together. The first
line has numbers for the aggregate of all search engines. The rest of the lines
have names of individual search engines with two unlabeled numbers. Those
numbers should be labeled, from left to right, Pages and Hits.
The bottom section of the table lists the external URLs that drive the most
traffic. The top URL in Figure 3-15 is a Google AdWords ad. Fourteen (of 25)
other URLs in the top-external-links list are either ad forms that repurpose
Google results or are AdWords-for-content placements from third-party web
sites using the Google AdSense program.
Clicking Full List will give you a full list of all external URLs. On
SkateFic.com, the vast majority of those URLs are AdWords-for-content placements.
But you have to know what to look for. You can’t count on AWStats to tell the
difference between a real external link and yet another AdWords placement or
search engine result.
You can, however, filter the full list results. Figure 3-16 shows only the
results that explicitly come from Google (there will be others that come from
Google but don’t say so). It’s worthwhile to note that the percentages given on
a filtered full list refer to the percentage of that filtered data set, not to the overall
full list of external links. So 63.1 percent of all the ads that reference
Google come directly from AdWords placement on Google’s own web site.
So what does this all mean? Should you be concerned with the raw numbers
or only with the percentages? How should your percentages of direct-address,
search engine, and external-link traffic compare?

It’s like this: You want to keep current readers and customers coming back.
You also want new readers and customers to find you. A very low percentage
of direct addresses may indicate that people are not returning after their first
visit or that your offline promotional efforts are not effective. This means that
your site is not sufficiently sticky, or that people get to your site and don’t find
what they need. It means you’re not building a core audience.
A low percentage of search engine-driven traffic can mean that your site is
not well optimized for search engines and people are not finding it. About two
years ago, Mary overhauled SkateFic.com with search engine optimization
(SEO) in mind. The percentage of traffic driven by search engines doubled, as
did total traffic.
If the external links aren’t bringing in the traffic, you need to be concerned
about word of mouth and viral marketing. This is especially so if most of your
external-link traffic is coming from repurposed Google searches, small search
engines, and AdWords placements. It means that you don’t have a lot of sites
that spontaneously link to yours.
So how is SkateFic.com doing? Search engine traffic is about 50 percent—
not too shabby. Bringing in that many new people every month is growing the
core readership by hundreds of eyes every month. Direct-address traffic is
about 43 percent, which means SkateFic.com has a happy and returning fan
base and is a healthy content site. But with only 6.3 percent of page views coming
from external links, and many of them from small search engines and
AdWords, SkateFic.com isn’t doing very well as far as word of mouth. Putting
more effort into getting links from other sites, especially figure skating–related
sites, could pay off handsomely in the long run. Independent external links are
a crucial part of an SEO strategy and would improve search-engine results,
bringing in more, better-targeted traffic.

Operating Systems and Browsers

If you’re a Mac or Linux person, how many times have you heard what
amounts to “We only care about Windows users”? Certain designers and even
web site owners want to design sites only for the very newest Windows version,
the very newest Internet Explorer browser. Cross-platform compatibility
be damned! “So few people use Mac or Linux (or Netscape or FireFox or visit
from their PDA or mobile phone) that we don’t need to support it.”
But is that really true? Is it really true that all you need to support is the
newest IE and the newest Windows? According to Figure 3-11, it would indeed
seem that 87.4 percent of hits come from Windows machines and 75 percent
from IE.

It’s not as exact as it would be if AWStats gave us pages or unique visitors,
but it’s the best we’ve got. Looks like a lot of Windows users. It might lead you
to decide that the right thing to do is to support the newest IE 7 and the newest
Windows Vista.
And you would be dead wrong.
Let’s do some estimates. The earlier examples and screenshots show February
2006 (so don’t get confused), but in February 2007, there were 5,749 unique
visitors and 72,780 hits. That’s about 12.66 hits per visitor. The 63,653 hits from Windows
machines work out to about 5,000 visitors. The other 750-odd visitors are
on Mac or another OS. Maybe 12 percent doesn’t seem like much, but are you
willing to turn away more than 750 potential readers and customers? Mary
doesn’t happen to be, so even if she weren’t a Mac-hack-from-way-back, she’d
be putting the extra time and dollars into cross-platform compatibility. It’s
good business.
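The estimate above can be reproduced in a few lines. This is only a sketch: it assumes every visitor generates the same number of hits regardless of platform, which is the same rough assumption the text makes, and the function name estimate_visitors is ours. Carried through without rounding, the numbers land in the same ballpark as the rounded figures quoted above (a little over 5,000 Windows visitors and something over 700 others).

```python
def estimate_visitors(platform_hits: int, total_hits: int,
                      unique_visitors: int) -> float:
    """Estimate visitors on one platform from its share of hits.

    Assumes hits per visitor are the same on every platform.
    """
    hits_per_visitor = total_hits / unique_visitors
    return platform_hits / hits_per_visitor


# February 2007 figures quoted in the text.
TOTAL_HITS = 72_780
UNIQUE_VISITORS = 5_749
WINDOWS_HITS = 63_653

hits_per_visitor = TOTAL_HITS / UNIQUE_VISITORS
windows_visitors = estimate_visitors(WINDOWS_HITS, TOTAL_HITS, UNIQUE_VISITORS)
other_visitors = UNIQUE_VISITORS - windows_visitors

print(f"hits per visitor: {hits_per_visitor:.2f}")
print(f"Windows visitors: about {windows_visitors:.0f}")
print(f"other visitors:   about {other_visitors:.0f}")
```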
But say you’re willing to sacrifice 12 percent of possible customers. You’re
sticking to the major-OS/major-browser strategy to save money. Saving
money is good business. Are you sure you’re saving money only supporting
the most recent IE version?
Take a look at Figure 3-12 to see who’s using IE 7. Certainly not the majority.
An estimated 2,700 visitors are using IE 6. About 1,500 are using IE7 (up from
14 in the first edition of this book). On the flip side, the approximately 400 people
using versions of IE5 in February 2006 have dropped to 78 in February 2007.
IE4 has dropped to single digits, and IE3 has dropped off the radar completely.
It’s probably time to drop support for IE3 and IE4 and to consider dropping
support for IE5. But IE7? If you only supported IE7 (which is notoriously
finicky), you’d be leaving the majority of your visitors who are still on IE6
behind.
And other browsers: FireFox, Safari, Netscape, Mozilla and so on? A scant
hundred fewer visitors use those browsers than use IE7. So in an effort to support
1,500 users, a whole lot of sites are ignoring 1,400 users. Supporting only
the latest and greatest is starting to look foolhardy indeed, isn’t it?
What’s more, Mozilla, FireFox, Netscape, and Camino are all related, as are
Safari and Konqueror. Support FireFox and Safari and you’re likely to support
Netscape, Mozilla, Camino, and Konqueror as well, with little extra effort. It’s
a six-for-the-price-of-two sale!
Now what exactly does “Unknown” mean? Many of those “unknown”
browsers are not as unknown as you might think. Being book authors distinctly
lacking in curiosity, no one here ever clicked that Unknown link in the
title bar to find what you now see in Figure 3-13.

Pages-URL

The Pages-URL report (see Figure 3-8) lists the top 25 URLs by the number of
times that page was viewed. Links across the title bar will take you to the Full
List of all URLs recorded for your site.
The Entry and Exit links (see Figure 3-9) go to pages showing the full list of
URLs sorted by the most entries and most exits, respectively.
The Entry and Exit lists, as with many of the secondary pages, allow you to
filter the list with Regular Expressions. A Regular Expression (abbreviated
RegEx) matches patterns using a special syntax that we’ll discuss in more
depth in Chapter 6. Also in Figure 3-9, the RegEx .*/serials/.* matches all
the URLs that contain the directory /serials/. At SkateFic.com, the serials
directory contains all the currently running serial novels. From a business
standpoint, knowing how to filter the Pages-URL list gives you the ability to
look at different sections of your web site —that is, if your web site is structured
so that different sectors of your business correspond to different structural
parts of your site.
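If you want to try the same filter outside AWStats, the pattern behaves the same way in most regular-expression engines. Below is a small Python sketch; only the .*/serials/.* pattern comes from the text, and the URL list is made up for illustration.

```python
import re

# Hypothetical URL list; only the /serials/ pattern comes from the text.
urls = [
    "/serials/example-story/chapter1.html",
    "/figure-skating-trivia/",
    "/chapters/index.php",
    "/fiction/serials/index.html",
]

pattern = re.compile(r".*/serials/.*")

# fullmatch requires the whole URL to fit the pattern; the leading and
# trailing .* let any URL that contains /serials/ qualify.
serial_urls = [url for url in urls if pattern.fullmatch(url)]

print(serial_urls)
```

Swapping /serials/ for another directory name gives you a per-section view, provided your site’s directories line up with the parts of the business you care about.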
What if they don’t? What if you use variables to steer people to different parts
of your site? For example, in Figure 3-9, the top two URLs are for /chapters/
index.php. While not immediately apparent, those two URLs can represent
hundreds of individual chapters, because each of them comes with a variable
such as: /chapters/index.php?Chapter=23 for Chapter 23 of the serial.
A business with an online catalogue might have one catalogue page that
uses an item number to pull item descriptions from a database. A site that uses
a content management system (CMS) may have very few actual pages and
may only differentiate pages by a series of variables in the URL. See any of
those variables in the URLs that AWStats shows? Nope?
We don’t either.
This is another one of those things that AWStats doesn’t do that you’ll find
you need. It’s great to know how many people read chapters of one serial or
another (or read articles or visit the catalogue). But it’s not as helpful as knowing
that 2,000 people read the newest chapter (or article) and that 337 people
read 10 other chapters or that 1,500 people looked at the week’s sale item and
that 1,800 people looked at a bunch of other catalogue items.
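To show what a per-variable breakdown would look like, here is a hedged Python sketch. It assumes you can get at the requested URLs with their query strings intact (for example, from the raw server log), which is not something the AWStats report gives you. The request list is invented, and views_by_variable is our own helper name.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Hypothetical requests; the Chapter variable is the one from the text.
requests = [
    "/chapters/index.php?Chapter=23",
    "/chapters/index.php?Chapter=23",
    "/chapters/index.php?Chapter=7",
    "/chapters/index.php",
]


def views_by_variable(urls, variable):
    """Count page views per (path, value) for one query-string variable."""
    counts = Counter()
    for url in urls:
        parts = urlsplit(url)
        # parse_qs returns a dict of lists; fall back when the variable is absent.
        values = parse_qs(parts.query).get(variable, ["(none)"])
        counts[(parts.path, values[0])] += 1
    return counts


chapter_views = views_by_variable(requests, "Chapter")
for (path, chapter), views in chapter_views.most_common():
    print(path, chapter, views)
```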
Here’s another important piece of information that you both need and
don’t.
The /figure-skating-trivia/ directory contains a single page with
numerous short biographies of figure skaters. It has turned out to be a top
search term for SkateFic.com. It’s also the most visited page on a regular basis.
Look at the Entry and Exit numbers. You would think they’d have some
relationship to one another, but they don’t. A person could enter the site on
another page, poke around a while, find the trivia page, read for a while and
then leave for another site (or a cuppa joe) —no entry, one exit, one view. A
person could do the reverse, enter at the trivia page, exit elsewhere—one entry,
no exit, one view. A person could enter on a different page, read some, check out
the trivia, and end up reading one of the poems in a different part of the site—
no entry, no exit, one view. Finally, a person could enter the site on the trivia page
and leave immediately—a “bounce”—one entry, one exit, one view. That’s the
person we want to know more about! Do we know anything about them? No.
The trivia page is only a draw insomuch as it lures people further into the site.
The trivia page on SkateFic.com is like a controversial article on a content site or
a sale item on an e-commerce site. It’s all well and good that people look at
that page, but what you really want is people to be pulled further into the site.
It’s that supercheap sale item at the grocery store, a loss leader. How effective
your loss leader is depends on how many people get further into your site
from that page.
AWStats can’t tell you that. It can say how many people viewed a page. It
can say how many people entered there. It can say how many people exited.
What it doesn’t say is how many people saw that page and only that page.
That particular analytical association is a crucial one.
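The four cases described above can be tallied once you have each visit as an ordered list of pages, which is exactly the association AWStats does not report. The following Python sketch uses made-up visits, one per case; the page paths other than the trivia and serials directories are hypothetical, and page_stats is our own helper.

```python
def page_stats(sessions, page):
    """Tally views, entries, exits and single-page visits for one page.

    Each session is the ordered list of pages requested during one visit.
    """
    stats = {"views": 0, "entries": 0, "exits": 0, "bounces": 0}
    for pages in sessions:
        if not pages:
            continue
        stats["views"] += pages.count(page)
        if pages[0] == page:
            stats["entries"] += 1
        if pages[-1] == page:
            stats["exits"] += 1
        if pages == [page]:
            stats["bounces"] += 1
    return stats


TRIVIA = "/figure-skating-trivia/"

# Four hypothetical visits, one for each case described in the text.
sessions = [
    ["/index.html", "/serials/", TRIVIA],  # no entry, one exit, one view
    [TRIVIA, "/serials/"],                 # one entry, no exit, one view
    ["/index.html", TRIVIA, "/poems/"],    # no entry, no exit, one view
    [TRIVIA],                              # bounce: one entry, one exit
]

trivia_stats = page_stats(sessions, TRIVIA)
print(trivia_stats)
```

The entry and exit totals alone (two each here) cannot tell you that only one of those visits was a bounce; you need the per-visit sequences for that.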

Days and Hours

The Days of Month, plus Days of Week and Hours reports (see Figure 3-3), all
answer the same basic questions: “Is traffic to the web site cyclical?” and “Did
any special events influence traffic?” Days of Month gives you a daily breakdown,
lets you compare against the average, and shows how AWStats arrived
at the Summary numbers.
From a business standpoint, comparing monthly reports shows that
SkateFic has a much stronger showing in the winter, during the figure-skating
season—duh. The 2006 Olympics also boosted traffic considerably in February
2006. There aren’t any particular intramonthly trends, even when comparing
across months.
Too bad that the Days of Week and Hours reports aren’t as useful. In the Days
of Week report, averaging tends to even out both anomalous bumps and meaningful
anomalies. The Hours chart, unlike the Days of Week chart, gives you
aggregate numbers where averages would be more meaningful. The Hours
graph is the saving grace, showing peak hours around 8:00 a.m., 2:00 p.m. to
3:00 p.m., and 9:00 p.m. (remember those are Central Time).
What does it mean in a business sense? The Days of Week chart means
absolutely nothing because averaging kills any bumps that might have meant
something. The Hours chart shows that SkateFic is busy before work, after
school, and after the nightly news. Most visitors are probably from the continental
U.S. because the site is busiest during the U.S. day. There’s a significant
population of night owls and people from the Eastern Hemisphere because
there is a base line of traffic even while westerners are sound asleep. This
raises the geographical question.

Countries

Americans have a terribly bad habit of being Amero-centric. AWStats uses
reverse Domain Name System (DNS) lookups to figure out where site visitors are coming
from. The top 25 countries of origin are listed on the main page in order
from most traffic to least. Usually, there are a significant number of incoming
IP addresses that cannot be resolved. These are listed as “Unknown.”
By clicking the Full List link, you can see all the countries that showed up in
the logs. Would you think that people in 96 countries—including Iran,
Bermuda, Nigeria, Mongolia—would be interested in figure-skating fiction?
That seems to surprise everyone who isn’t still laughing over the idea that
figure-skating fiction actually exists.
Your site may have a much greater reach than you realize. Knowing this can
influence decisions about content and e-commerce. Would your site strategy
change if you knew that 35 percent of your traffic was coming from the European
Union?
We thought so.

Hosts

The hosts list (see Figure 3-5) offers several different views of the same information:
the host names and IP addresses of visitors. This is the same information
used to tell which country visitors hail from.
On the main page of AWStats, the first line after the title bar gives an
overview of how many known and unknown/unresolved hosts there were, as
well as how many unique visitors this represents. Then the main report starts
with the host who requested the most pages, listing hosts in descending order
from most traffic to least.
In Figure 3-5, you should note two interesting points. First, unlike the other
reports that show only “people,” the hosts list shows both “people” and “not
people.” Spiders and other robots are not second-class citizens on the hosts
list. Second, Google spiders have the top five wrapped up. What does this
mean? Well, Google indexes the site for new content at least once a week,
sometimes twice. For a small site, this is very good news. It means that the 800-
pound gorilla of search engines has taken notice and indexes regularly. New
content will not languish in obscurity.

Robots and Spiders

In Chapter 2, we talked about visitors who are people and visitors who are not
people. One particularly important kind of visitor that is not a person is an
indexing spider or web crawler. The Robots/Spiders report (see Figure 3-6)
lists the various named and unnamed but identified web crawlers that have
run their sticky little legs all over your pages.
Named spiders are known robots from known entities: Google, Inktomi,
MSN, Yahoo, and so forth. Other spiders are not known, but when they hit a
special file on the top level of the web site called robots.txt, the server marks
them as spiders. Robots.txt tells spiders where they are allowed to go and
what they are allowed to index. For example, if you didn’t want the pictures
on your web site indexed, you could put a line in your robots.txt to make the
whole images directory off limits to spiders. Most good spiders pay attention
to these directives, but there’s no money-back guarantee.
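As an illustration of the images example, the sketch below embeds a two-line robots.txt in a Python string and checks it with the standard library’s urllib.robotparser, much as a well-behaved spider might. The robots.txt content, the example.com URLs, and the ExampleBot name are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps all spiders out of /images/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite spider would run a check like this before each request.
image_allowed = parser.can_fetch(
    "ExampleBot", "http://www.example.com/images/logo.png")
page_allowed = parser.can_fetch(
    "ExampleBot", "http://www.example.com/index.html")

print("images:", image_allowed, "pages:", page_allowed)
```

Nothing forces a spider to perform this check, which is why the directives come with no guarantee.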
Hits from spiders are reported a little differently from hits by other entities.
For each spider, the first number under Hits is the number of requests the spider
made. Then there’s a plus sign and the number of times the spider successfully
“saw” the robots.txt file. As you can see from Figure 3-7, different spiders
hit the robots.txt file in greatly varying numbers. Those numbers could mean
anything from lots of spider visits to very inefficient spidering methods. In general,
spiders are good. Being indexed is good. Being found is even better.

Visits Duration

Why does this report make us cringe? Okay, it’s just Mary who cringes; it’s her
web site, after all. The Visits Duration report shows how long visits were. The average visit is
about 2.5 minutes. That’s not too bad. But then, you look at the numbers that
went into those 2.5 minutes. Fewer than 2,000 people stayed more than two
minutes. Only 15 percent stayed more than 30 seconds! For a content site,
that’s enough to shake an editor to her soul.
One of the measures of a successful content web site is how “sticky” that site
is. Stickiness is about whether visitors bounce in and then bounce out just as
fast. Apparently, lots of people do. Either they find what they want and leave,
or they don’t find what they want and leave. Either way, they leave before they
get deeper into the site.
This observation in itself is valuable. But where did most of these people
come from? How did they encounter the site? Did they leave immediately, or
did they try to load another page? Did they find what they wanted and leave?
Or didn’t they look? Those last two are very different things.
AWStats can’t tell us. While AWStats provides the raw data of “who came,
how many, where?” it can’t say “who came and left immediately, how many
dug in deeper, and where did they go?” For that, you need Google Analytics.
This is one park where the Little Leaguer, good as he is, can’t hit a homer.

Monthly History

The Monthly History has two parts: a bar chart and a table of values. The values
in the chart and the numbers in the table correspond to the Summary
information for each month. Each column of the chart has a total at the bottom
that appears on the earlier –Year– Summary. As with the –Year– Summary, the
total of Unique Visitors is not accurate. (This is the problem discussed near the
end of Chapter 2.)
In the bar chart, each colored bar is in proportion to other bars of that color.
However, there is no correlation between different colored bars. In Figure 3-1,
the tallest yellow bar and the tallest turquoise bar are the same height. But the
tallest yellow bar is 18,530 visits, whereas the tallest turquoise bar is 173,849 hits.
The Monthly History has a simple purpose. It exists solely so that you can
compare traffic numbers from month to month. Why did traffic double in February?
Why did it drop off in March?
These questions are as much business related as site related. In the specific
case at SkateFic.com, the 2006 Winter Olympics were in February, driving
interest in figure skating through the roof for a short period. But then, despite
TV coverage of the world championships in March, traffic fell as casual fans
went back to their regularly scheduled programs. With eight years of historical
data behind us, it’s easy to see that the pattern of activity was the same during
the 1998 and 2002 Olympics.
This is another benefit of having metrics. You can discern both short-term
and long-term patterns, sometimes just by looking. Does your web site peak in
August every year? Did editorial coverage in a major magazine spike traffic in
January? Do you get a lot of traffic around a particular real-world event? What
are the long-term and short-term trends?
Another way of looking at traffic, by days and hours, is shown in Figure 3-2.

Unique Visitors

The big problem with counting unique visitors is that it’s impossible to figure
out from server logs who’s unique and who’s a visitor. Figure 2-7 deals with
this problem.
There are caveats aplenty here because you’re counting visits from unique
IP addresses, not actual people:
■■ Any sort of local area network connected by a single Internet gateway
may have several users with the same apparent IP address.
■■ A proxy server owned by an ISP that caches frequently accessed pages
will show up as one unique visitor even though it represents hundreds,
if not thousands, of users. You can put a no-cache directive on your pages,
but it works only if the proxy pays attention to it. And using such a directive
may slow your site for some users.
■■ In the home, it is very common to have more than one person using the
same computer. You may have three different people visiting from one
IP address.
■■ People visit from different places: from home, work, school, or from a
laptop at the coffee shop. What looks like four unique visitors may
actually be only one.
■■ People on dial-up change IP addresses almost every time they log in. If
a person visits every day from a different IP, that person looks like 20 or
30 people, depending on how the ISP assigns IP addresses.

There isn’t much you can do about these issues. It’s the nature of the
beast — and log analyzers. Google Analytics is script-based, so it does not
have many of these problems, but it has a series of issues of its own. The bottom
line is that you can’t measure unique visitors with complete accuracy. You
measure unique visitors as well as you can and you make sure to compare
apples to apples. As far as the technology goes, AWStats Unique Visitors is the
number of unique IP addresses that made requests to your web server. It’s the
best measurement a log analyzer can provide of how many people visited.

Yearly Summary

AWStats calculates its metrics on a monthly basis. To produce yearly metrics,
it adds the results from all months, with a warning. While this strategy doesn’t
affect the other metrics, it doesn’t produce an accurate number of unique
visitors. If a particular IP appeared in January, March, and July, it would add
three unique visitors rather than just one. It’s not practical to save all the
logs and run the analysis on one huge lump every time
the user wants a year-to-date. Suffice it to say that the AWStats unique-visitors
metric is not accurate in the aggregate.
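The overcount is easy to see with sets. In this Python sketch the IP addresses are invented (taken from the ranges reserved for documentation); one address shows up in January, March, and July, just as in the example above.

```python
# Hypothetical monthly sets of visitor IP addresses.
monthly_ips = {
    "January": {"192.0.2.1", "192.0.2.2"},
    "March": {"192.0.2.1", "198.51.100.7"},
    "July": {"192.0.2.1"},
}

# Adding up monthly totals counts 192.0.2.1 three times.
summed_monthly_uniques = sum(len(ips) for ips in monthly_ips.values())

# A true yearly figure needs the union of addresses across all months.
true_yearly_uniques = len(set().union(*monthly_ips.values()))

print("sum of monthly uniques:", summed_monthly_uniques)
print("unique across the year:", true_yearly_uniques)
```

Computing the union requires keeping every address seen all year, which is the impractical "one huge lump" the text mentions.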

People and Not People

First off, there’s the difference between Traffic Viewed and Traffic Not Viewed.
In general terms, Traffic Viewed is generated by people. This isn’t a completely
sure thing, but it’s close enough for most purposes. Traffic Not Viewed is generally
generated by things that are not people. This includes robots, worms, and
replies with special HTTP status codes.
Robots are software programs that access web pages for their own purposes.
Search-engine crawlers (also known as spiders) are robots that index web pages
for inclusion in their search results. There are other spiders with less savory
purposes such as harvesting e-mail addresses for use by spammers. Worms
attack your web server, either to shut the server down (a denial-of-service
attack) or to break into the server. Either way, worms can create a large amount
of traffic that is of no interest beyond making sure it doesn’t overwhelm your
server completely. We’ll get into “special status” HTTP requests a bit later. But
in general, these are “noncontent” responses that redirect the visitor to another
page or inform the user that the page cannot be found.

Bandwidth

The bandwidth measurement is a webmaster’s first lesson in the importance
of collecting useful metrics as opposed to useless ones. With the exception of
knowing whether a site is nearing or over its bandwidth limits, there is pretty
much no useful business purpose to a measurement of bandwidth. Most web
sites don’t benefit from knowing the size of the average download.
With one small exception. Here in the United States, we tend to think of
everyone as having high-speed Internet. The fact is that broadband penetration
is less than 50 percent in the United States. According to the Organization
for Economic Co-operation and Development (www.oecd.org) only 137 million
people have high-speed access worldwide. Such figures could mean that half
of the people who visit your web site are using dial-up at 56 Kbps or less.
At 56 Kbps, loading time for pages and other content such as multimedia is
a big issue. It used to be that you had about 10 seconds for your page to load
before a user would abandon the page. Now you have about two seconds. You
can use the average bandwidth per visit along with the average pages per visit
to get a very rough estimate of how much data your average visitor is downloading
and how much time it takes.
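That rough estimate can be worked out as follows. The sketch assumes that 56 Kbps means 56,000 bits per second and ignores latency, compression, and protocol overhead, so treat the result as a best case. The 300 KB and six-page averages are invented inputs, and download_seconds is our own function name.

```python
def download_seconds(bytes_per_visit: float, pages_per_visit: float,
                     link_kbps: float = 56.0) -> float:
    """Rough seconds needed to download one average page.

    Assumes 1 Kbps = 1,000 bits per second and no overhead or latency.
    """
    if pages_per_visit <= 0 or link_kbps <= 0:
        raise ValueError("pages_per_visit and link_kbps must be positive")
    bytes_per_page = bytes_per_visit / pages_per_visit
    bits_per_page = bytes_per_page * 8
    return bits_per_page / (link_kbps * 1000)


# Hypothetical averages: 300,000 bytes per visit spread over 6 pages.
seconds = download_seconds(bytes_per_visit=300_000, pages_per_visit=6)
print(f"about {seconds:.1f} seconds per page at 56 Kbps")
```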

Hits

For the first few years that we had web sites, we all quoted the number of
“hits.” It wasn’t until 1997 that we realized hits are another meaningless metric.
Why? To a web server, any access of any document — a page, a script, a
multimedia file, an image, and so on — is a hit. Because one page or site may
have lots of images, and another may be mostly text, hits become a particularly
poor measure of a site’s performance and an even worse measure of how
a site performs in comparison to other sites.

Pages

Finally, we’ve reached a meaningful metric — pages, also known as page
views or page hits, the subject of Figure 2-4.
Back in the dark ages of 1997, when we were all using page counters, page
views were what we were actually trying to count. In AWStats, the Pages metric
is the aggregate of page requests.
Still looking at the summary on the main page, scroll down to (or click the
navigation link for) Files Type. The Pages total, 37,395, includes 19,037 static
HTML page views, 18,330 dynamic views for pages with a .php extension, 27
CGI script accesses, and 1 “com” page, which has no description. You wouldn’t
be a dummy if you didn’t even know what that file type was. As it happens, it’s
a command file, a program, but exactly what it does is beyond our scope here.
Is that com file a page? Why? A program can output a page. Not always, but
that’s one of the caveats of analytics software — assumptions. AWStats makes
the assumption that a com file is a program that outputs a page, and it counts
an access of that com file as a page.
Is it a page for business purposes? Unless you have a com file that you
specifically know produces a viewable page, it probably isn’t. And that means,
for business purposes, that this portion of the Pages metric is meaningless.
Only pages that are pages should count. If you have a lot of pages that are not
pages counting, it’s a problem. If it’s only a few, a small percentage of your
total, you’re probably safe to ignore the pages that are not pages.
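
The arithmetic behind that judgment call can be checked in a few lines of Python. The counts are the ones quoted above; treating only the "com" entry as a doubtful page is an assumption made for this sketch.

```python
# Page counts by file type, as quoted above from the AWStats summary.
pages_by_type = {"html": 19_037, "php": 18_330, "cgi": 27, "com": 1}

total_pages = sum(pages_by_type.values())

# Assumption for illustration: only the "com" entry is a page that isn't a page.
doubtful_pages = pages_by_type["com"]
doubtful_share = doubtful_pages / total_pages * 100

print(total_pages)               # 37395
print(f"{doubtful_share:.4f}%")  # 0.0027%
```

At well under a hundredth of a percent, this is firmly in "safe to ignore" territory.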

Number of Visits

The Number of Visits a web site receives should be straightforward. That
would be nice and easy, wouldn’t it? Of course, it would.
No such luck, as Figure 2-6 indicates.
Like Pages, Number of Visits rests on two key assumptions: how long a visit
lasts, and how much time has to pass between page loads before one person is
counted as making two visits. Fortunately, there are industry standards. After
all, this isn’t 1997.

A visit is as long as it is. As long as the visitor keeps clicking from page to page,
it’s still one visit. However, when the user stops clicking for 30 minutes, the
visit ends. If the user starts clicking again, it’s a new visit. Thirty minutes is the
industry-standard timeout for visits.
So, say a user toddles into SkateFic.com at 9:00 a.m., and between 9:00 and
9:30 she clicks from page to page, reading her favorite serial fiction. At 9:30 she
gets a phone call. For the next 28 minutes, she talks on the phone. When she
hangs up at 9:58, she finishes reading the page she left to answer the call and
loads the next page at 9:59. That’s one visit, because the break between page
loads was less than 30 minutes.
Now, say the same user is having a Grand Central Terminal sort of day. The
phone rings again at 10:00 a.m. This time the user talks for 31 minutes. When
she goes back to reading and loads a new page, she’s initiating a second visit
as far as AWStats is concerned. Same person, same day — and, if you asked
the user, same visit — but for pretty much every stats and analytics package,
it’s two different visits.
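
The 30-minute rule is easy to sketch in Python. This is not AWStats's actual code, just a minimal illustration of the timeout logic. The helper names, the date, and the exact page-load times (including a load at 10:31 after the second call) are assumptions chosen to match the story above. The sketch also assumes a gap of exactly 30 minutes starts a new visit.

```python
from datetime import datetime, timedelta

VISIT_TIMEOUT = timedelta(minutes=30)  # the industry-standard timeout


def count_visits(page_loads):
    """Count visits for one visitor, given the times of her page loads."""
    visits = 0
    previous = None
    for moment in sorted(page_loads):
        # A visit starts on the first load, or after a gap of 30 minutes or more.
        if previous is None or moment - previous >= VISIT_TIMEOUT:
            visits += 1
        previous = moment
    return visits


def at(hhmm):
    """Hypothetical helper: a time of day on an arbitrary fixed date."""
    return datetime.strptime("2009-01-23 " + hhmm, "%Y-%m-%d %H:%M")


# Reading from 9:00 to 9:30, a 28-minute call, then the next page at 9:59.
morning = [at("09:00"), at("09:15"), at("09:30"), at("09:59")]
print(count_visits(morning))                   # 1

# A 31-minute call starting at 10:00, then a new page at 10:31.
print(count_visits(morning + [at("10:31")]))   # 2
```

The gap between 9:30 and 9:59 is 29 minutes, so the first call doesn't break the visit. The gap between 9:59 and 10:31 is 32 minutes, so the second one does.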
The average of 1.23 visits per visitor varies in meaningfulness. For a site that
gets a lot of returning visitors, it might have some meaning. For a site where
90 percent of visitors never return, the average doesn’t mean much, because it
is dragged down by the vast bulk of people who never return. You could have
10 people who average three visits per month and 90 people who come once and
never come back. Average visits will be 1.2, but it won’t be a very useful metric,
except to tell you that most of your visitors don’t return after the first visit.
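
Here is that hypothetical audience worked through in Python. The 10 and 90 visitors are the made-up figures from the paragraph above, not real data.

```python
# 10 people who visit three times each, 90 people who visit once.
visits_per_person = [3] * 10 + [1] * 90

average_visits = sum(visits_per_person) / len(visits_per_person)
returning = sum(1 for v in visits_per_person if v > 1)
returning_share = returning / len(visits_per_person)

print(average_visits)             # 1.2
print(f"{returning_share:.0%}")   # 10%
```

The average of 1.2 hides the fact that the loyal tenth of the audience visits three times as often as everyone else, which is why the share of returning visitors is often the more telling number.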


Analytics and AWStats


AWStats (Advanced Web Statistics) is an open source log analyzer written in Perl that can use a variety of log formats and runs on a variety of operating systems. The official documentation of AWStats is mostly targeted to system administrators rather than to owners of web site businesses. In short, it’s not much help in figuring out what the statistics mean.
Wait a minute! This is a book about Google Analytics, so why the heck are we talking about some open source stats program? Because the thing about analytics is that to make any sense, there needs to be some data. It’s going to take at least a couple of days to get any data into Google Analytics. It’ll be months before there’s enough data to make any sense. But you may already have a wealth of historical data right there in AWStats. Never looked at it, you say? Thought so. That data you’ve probably got in AWStats, which maybe you never really understood because there’s no in-depth documentation on it, is still valuable. This is your past.
For some things, bigger and newer isn’t necessarily better. Google Analytics and AWStats have different features with different strengths and weaknesses. For some things (many things, in fact) Google Analytics blows AWStats out of the water. For other things, Google Analytics uses a different methodology, with its own limitations.
There are two main differences between Google Analytics and AWStats. First, AWStats is primarily a site statistics program. AWStats counts more than it calculates. It has far fewer metrics and capabilities than Google Analytics. It’s intended to be a simpler sort of program, and there’s nothing wrong with that. Google Analytics is intended from the get-go to be a business-strength program. It calculates as much as it counts and gives you metrics that, as a business person, you’ll want.
Second, AWStats is a log analyzer. Google Analytics relies on cookies and JavaScript (referred to as “scripting” from here on out). This has several far-ranging implications.
For example, to a log analyzer, all traffic coming from a single IP address is one “user.” When using scripting, you set a cookie on an individual user’s machine, or even in a particular account profile. Then, if five computers share an outside IP on a local area network, and there are three user accounts on each computer, you “see” 15 users, not one. On the other hand, if users turn off cookies, or don’t allow “third-party” cookies, you may not be able to track them at all with Google Analytics. At best, you may be able to track them for a particular session, but a half-hour later (or the next day), they will look like brand-new visitors.
Another excellent example is tracking search engine visits (see the section “Robots and Spiders” in Chapter 3). A log analyzer has to identify search engines from lists of known spiders, by the spider’s identifying itself, or by a wild guess. Some small percentage of a log analyzer’s traffic may be misidentified as a real person when it’s not. On the other hand, most spiders, robots, and search engines, by default, don’t execute JavaScript code. Google Analytics won’t misidentify these sorts of false visitors. Of course, Google Analytics won’t pick up real visitors who have JavaScript turned off, either.
Just as you can argue Mac vs. PC or football vs. figure skating, you can argue script-based tracking vs. log analysis. I’m not going to say one is intrinsically better than the other. There are tradeoffs with either methodology. As long as you know what those tradeoffs are, and what effect they may have on your metrics, you can allow for any ambiguity that might arise. At some point, no matter how you gather data, you’re going to have to plow into the nit-picky little boring stuff: log analysis vs. scripts, nobodies vs. people, pages that are pages vs. pages that really aren’t.
So because we work hard and play hard — and you note which comes first — we’re going to dig in and go through some of the details, the basic concepts that will make what you see in Google Analytics mean something.
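
To make the shared-IP example concrete, here is a small Python sketch of the two ways of counting. The IP address and cookie IDs are invented for illustration, and real tools are more involved than this. It only contrasts counting unique IP addresses with counting unique cookies.

```python
# Hypothetical office: 5 computers behind one outside IP, 3 user accounts each.
# The IP address and the cookie IDs are made up for this sketch.
SHARED_IP = "203.0.113.7"

page_requests = [
    {"ip": SHARED_IP, "cookie_id": f"pc{pc}-user{user}"}
    for pc in range(1, 6)
    for user in range(1, 4)
]

# A log analyzer sees one address; scripting sees one cookie per account.
users_seen_by_log_analyzer = len({r["ip"] for r in page_requests})
users_seen_by_scripting = len({r["cookie_id"] for r in page_requests})

print(users_seen_by_log_analyzer)  # 1
print(users_seen_by_scripting)     # 15
```

The same 15 requests yield one user by IP address and 15 by cookie, which is why the two kinds of tools rarely agree on visitor counts.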



AWStats Browser

We’re going to get under way by taking a look at the AWStats window
(http://awstats.sourceforge.net/) shown in Figure 2-1.

The AWStats window has a left-hand and a right-hand frame. The right-hand
frame shows the reports. The left-hand frame shows the domain name
for the site statistics you’re viewing followed by a text link navigation list. You
can go directly to sections of the main report from any flush-left link. Secondary
reports, left-indented with a tiny AWStats icon, replace the main report
in the right-hand frame when you click the navigation link.

AWStats Dashboard

AWStats doesn’t have many controls on the dashboard (shown in Figure 2-2).
Much of what can be configured is set by your web host at install time. The
dashboard appears at the top of the main report. AWStats notes the time of the
last update. Most web hosts update in the middle of the night. The time listed
is on the server’s time zone and is not necessarily your time zone. You can
force an update by clicking the Update Now link.

If you need up-to-the-minute results, or if your site is very busy during a
specific part of the day, it’s probably smart to force an update before you look
at the stats. If the update has to cover a couple of days of traffic, it can take
some serious time (upwards of a half-hour), depending on how busy your
web site is. If your site is not very busy, or if it has been only a couple of hours
since the last update, the update may take no longer than a normal page
reload.
Use the drop-down menus to change the month and year. To view a whole
year, choose Year from the month menu and then the year from the year menu.
Click the globe to go to the AWStats home page at SourceForge.net. Click the
flags below the globe to change the reporting language. Available languages
depend on which ones your web host has installed. In this screenshot, French,
German, Italian, Dutch, and Spanish are installed, as well as the English
default.


What Analytics Is Not

The short answer is: Google Analytics is not magic. It’s not some mystical force
that will automatically generate traffic to your web site. Nor is it the flashing
neon sign that says, “Hey, you really should be doing this instead of that.” And
it’s most certainly not the answer to all your web site traffic problems. No, analytics
is none of those things.
What analytics is is a tool for you to use to understand how visitors behave
when they visit your web site. What you do with that information is up to you.
If you simply look at it and keep doing what you’re doing, you’re going to
keep getting what you’re getting.
You wouldn’t place a screwdriver on the hood of a car and expect it to fix
the engine. So don’t enable Google Analytics on your site and expect the application
to create miracles. Use it as the tool that helps you figure out how to
achieve your goals.

If Analytics Are So Great, Why Don’t We Have Them?

The short and simple answer to this is that medium and large companies that
can afford analytics do have them. There are many analytics software packages
that cost money, among them WebTrends, HitBox Professional, and Manticore
Technology’s Virtual Touchstone. The low-end price for web analytics is $200
per month. The high-end price? A couple grand a month is not unusual. To the
microsite, the small site, the web merchant on a shoestring, the mom-and-pop
site, the struggling e-zine, the blogger who aspires to be Wonkette but isn’t
yet (that is, to most of the sites on the web), two hundred bucks a month
sounds like a lot of money!
Then, in mid-2005, Google rocked the boat, buying a small company called
Urchin. Urchin was no Oliver Twist. It was, in fact, a runner-up for the 2004
ClickZ Marketing Excellence Award for Best Small Business Analytics Tool. Its
product, Urchin Analytics, had a monthly cost on the low end of the market
(about $200 a month) and was designed for small businesses.
Six months later, Google did something completely unprecedented. It
rebranded Urchin’s service as Google Analytics with the intention of releasing
it as a free application. Google prelaunched it to a number of large web publications
(among them NewsForge.com, where Mary Tyler is a contributing editor).
And shortly after that, Google opened it to the public, apparently
completely underestimating the rush of people who would sign up — a quarter
of a million in two days.
Google quickly limited the number of sites that registrants could manage to
three, although if you knew HTML at all, the limitation was pathetically easy
to bypass. Google also initiated a sign-up list for people who were interested,
which eventually morphed into an invitation system reminiscent of the controlled
launch of Google’s Gmail. The moral of this story is, “Don’t underestimate
the attraction of free.”

Why Analytics?

First there were log files and only people who bought really expensive software
could figure out what the heck the half-million lines of incomprehensible
gobbledygook really meant. The rest of us used web-page counters. Anyone
could see how many people had come to a page. As long as the counter didn’t
crash, or corrupt its storage, or overflow and start again at zero, there would
be a nifty little graphic of numbers that looked like roller skates (or pool balls
or stadium scoreboard numbers or whatnot).
Around 1998, the arbiters of taste on the Internet (i.e., everybody) decided
that page counters were so 1997 and that there must be a better way.
And also about that time, web site statistics packages or “stats” came into
common use—not common use by huge businesses that could afford thousands
of dollars for software but common use by us peons who rent our web
space from hosting companies for as little as $5 a month. Stats packages basically
collect data but leave you to analyze that data. So they tell you what happens;
they just don’t put what happens into any type of business context.
If you have Windows-based hosting, you may have a Windows-specific
stats package, or your host may use the Windows version of one of the open
source stats packages. If you have hosting on a Linux web server running
Apache (and about 60 percent of web servers run Linux and Apache), you’ll
most likely have Analog, Webalizer, or AWStats, and you may have all three.
These software packages are open source under various versions of the GNU
General Public License (GPL). This neatly explains their ubiquity.

Basic Analytics

Having web site statistics is one thing. Understanding what they mean and
what you should do with them is another thing altogether. If what you want is
to get into the nitty-gritty, reams of information are available to you. If, however,
what you’re really looking for is a quick, easy-to-understand explanation
of analytics and why you should care, read on.
This part of the book gives you the working knowledge you need to understand
the importance of analytics, all in three short chapters. When you’ve
finished reading these first three chapters, you’ll understand basic web measurements,
how they apply to your web site, and the difference between site
statistics and analytics. Then you’ll be ready to tackle Google Analytics.