The bugfix that could make the internet 5% faster

I’ve been working with Google Analytics for the last 3 years. It was already a huge player in the market when I started, but I’ve seen enormous growth over these years. Google Analytics is the most used web analytics solution in the world. It’s currently used on 44.67% of the top million websites on the internet. ga.js is the most popular JavaScript snippet in the history of the internet.

Google Analytics Usage on top websites:

source: builtwith.com

Imagine the responsibility of the Google engineering team that maintains the ga.js JavaScript file. While dealing with frequent changes and new features in Google Analytics, they still have to make sure their code runs as fast as possible on every browser that exists. They must support IE 5.5 and low-end mobile devices, otherwise these browsers wouldn’t show up in Google Analytics reports. And they must do it all without hurting website performance.

I must say they do great work maintaining that code. The asynchronous syntax, while confusing at first, is a very clever way to push code loading and execution far down the queue, so browsers don’t delay page loading to register a GA pageview. It’s clear that the GA team takes great care over how fast and seamless their code is.

The one point that still bothers me a lot regarding performance is the Google Analytics cookies. Let’s take a look at what GA cookies look like:

>document.cookie
"__utma=96182344.347392035.1326382423.1326382423.1326382423.1; __utmb=96182344.1.10.1326382423; __utmc=96182344; __utmz=96182344.1326382423.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)"
>document.cookie.length
188

This is a minimal GA cookie set. It can get longer if you use Custom Variables and Google Website Optimizer, but let’s settle for the minimum for now.
These cookies are used internally by GA to keep state and are manipulated by the code in the ga.js JavaScript file. Unlike most other cookies you might see out there, these cookies never need to reach your web servers. Still, they are sent to your website with every single HTTP request.

According to Google’s SPDY whitepaper, the average HTTP request is 700-800 bytes long. That means the 188 bytes of GA cookies represent about 25% of the average HTTP request. Once you notice that GA is present on about 50% of top websites, you realize that useless GA cookies account for roughly 12% of all bytes sent in HTTP requests.
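The arithmetic behind those percentages can be checked with a few lines. This is only a back-of-the-envelope sketch: the 750-byte figure is an assumed midpoint of the 700-800 range, and the 50% share is the rounded usage number from the text.

```javascript
// Figures taken from the text above; the midpoint is an assumption.
var cookieBytes = 188;        // document.cookie.length of the minimal GA cookies
var avgRequestBytes = 750;    // assumed midpoint of the 700-800 byte range
var gaShareOfSites = 0.5;     // rough share of top sites running GA

// Share of one average request taken up by GA cookies
var cookieShare = cookieBytes / avgRequestBytes;
// Share of all request bytes, given that only about half the sites carry them
var overallShare = cookieShare * gaShareOfSites;

console.log((cookieShare * 100).toFixed(1) + "% of an average request");  // 25.1%
console.log((overallShare * 100).toFixed(1) + "% of all request bytes");  // 12.5%
```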

I posted a bug about this issue on GA-Issues a while ago. The idea is to use HTML5 localStorage instead of cookies to store this state on browsers that support it. It has attracted no attention so far. This bug fix could easily make the average HTTP request around 5% faster. We’re talking about the average speed of the whole internet.
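To make the idea concrete, here is a minimal sketch of what such a fallback could look like. It is not how ga.js works; createStateStore and the cookieFallback object are hypothetical names invented for this illustration.

```javascript
// Hypothetical helper: keep tracker state in a Storage-like object
// (window.localStorage in a browser) and fall back to a cookie-backed
// object when storage is not available.
function createStateStore(storage, cookieFallback) {
  var useStorage = !!storage &&
    typeof storage.getItem === "function" &&
    typeof storage.setItem === "function";

  return {
    // Which backend ended up being used: "storage" or "cookie"
    backend: useStorage ? "storage" : "cookie",
    get: function (name) {
      return useStorage ? storage.getItem(name) : cookieFallback.get(name);
    },
    set: function (name, value) {
      if (useStorage) {
        storage.setItem(name, value); // never sent over HTTP
      } else {
        cookieFallback.set(name, value); // sent with every request
      }
    }
  };
}
```

In a browser you would pass window.localStorage as the first argument. One design caveat: localStorage is scoped to a single origin, so state stored this way is not shared across sub-domains the way a cookie set on a parent domain is.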

The real picture is not that bad, since this only affects HTTP requests and not HTTP responses, and responses are where the real data is. Still, it’s funny to see something that huge going around unnoticed.

Track social interactions as events for the google +1 button

Google Analytics recently released the _trackSocial method. It was part of a bigger release across several social applications, including Google+. Sometimes things have to be pushed out before they’re extensively tested, and a couple of bugs may come up.

With social tracking, one specific bug bit me the other day. Google Analytics won’t apply hostname filters to social interactions, so a profile that is filtered to only include traffic to domain A may show social interactions for domain B. From there all sorts of bad things follow: 0-pageview visits, lower pages/visit and so on.

At first I thought about disabling social tracking on the +1 buttons, but it seems the API doesn’t support that yet. However, I found an undocumented feature to disable it. Now you can disable social tracking on the +1 button and use Events instead, since events go through the filters before showing up in your Google Analytics profile.

You’ll only want to use it if you are having problems with social tracking and hostname filters. Otherwise the default behavior is much better, since it populates the separate Social reports.
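The Events half of this approach can be sketched as follows. The sketch does not show the undocumented switch that disables the built-in social tracking. It only shows a +1 callback that records the click as an event. The function name plusOneToEvent and the "Social" category are assumptions made for this example.

```javascript
// Classic async queue; ga.js replaces this array with its own object once loaded.
var _gaq = _gaq || [];

// Callback for the +1 button. The button calls it with an object holding
// the target URL (href) and the new state ("on" or "off").
function plusOneToEvent(result) {
  var action = result.state === "on" ? "+1" : "-1";
  // Category and action names are arbitrary choices for this sketch.
  _gaq.push(["_trackEvent", "Social", action, result.href]);
}
```

The callback has to live in the global namespace and is attached by name on the button tag, for example callback="plusOneToEvent".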

Update

If you are using the asynchronous code to load the Google +1 button, the syntax is a little bit different.

Thanks Fábio Phms.

Google Analytics source override precedence

Google Analytics keeps track of 5 campaign variables. All this information goes into the __utmz cookie, which has the following format:
__utmz=1.1267299040.3.4.utmcsr=Source|utmccn=Campaign|utmcmd=Media|utmcct=Content|utmctr=Term
There are basically 4 types of origins:
  • Campaigns: this means the user clicked on an AdWords link or a link with campaign variables.
    • You can customize all 5 campaign variables
    • If this is an AdWords visit the cookie is slightly different: it has the gclid number. GA will pull the correct values for the 5 variables from AdWords, provided the accounts are linked.
  • Organic: When the user clicks on a link from a Search engine (e.g. google, bing, yahoo!, etc)
    • Source: google/bing/yahoo/etc
    • Campaign: (organic)
    • Media: organic
    • Term: Searched keyword
  • Referral: When a user clicks on a link from another site.
    • Source: www.referral-site.com
    • Campaign: (referral)
    • Media: referral
    • Content: /path/from/clicked/link
  • Direct: When Google can’t determine a better origin it uses this one. Usually it means the user typed the address directly into the address bar. But it could also mean the user bookmarked the link, or clicked the link in MSN or another desktop application.
    • Source: (direct)
    • Campaign: (direct)
    • Media: (none)
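The cookie format above can be taken apart with a small parser. This is a sketch, not ga.js code: parseUtmz is a hypothetical name, and the meaning given to the four leading numbers (domain hash, timestamp, session count, campaign count) is my reading of the format rather than something stated above.

```javascript
// Hypothetical parser for a __utmz cookie value such as
// "1.1267299040.3.4.utmcsr=Source|utmccn=Campaign|utmcmd=Media"
function parseUtmz(value) {
  // Four dot-separated numbers, then the campaign variables.
  // The source itself may contain dots, so only the first four are split off.
  var match = /^(\d+)\.(\d+)\.(\d+)\.(\d+)\.(.*)$/.exec(value);
  if (!match) return null;

  var labels = {
    utmcsr: "source",
    utmccn: "campaign",
    utmcmd: "medium",   // called Media in the list above
    utmcct: "content",
    utmctr: "term"
  };
  var result = {
    domainHash: match[1],
    timestamp: Number(match[2]),
    sessionCount: Number(match[3]),
    campaignCount: Number(match[4])
  };
  match[5].split("|").forEach(function (pair) {
    var eq = pair.indexOf("=");
    if (eq === -1) return;
    var key = pair.slice(0, eq);
    // Unknown keys are kept under their raw name
    result[labels[key] || key] = pair.slice(eq + 1);
  });
  return result;
}
```

For a direct visit, source and campaign both come back as "(direct)" and medium as "(none)", matching the list above.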
What happens if your source changes during a visit? What if it changes in a later visit?
It depends. Here are the basic rules of precedence.

Returning Visitor

  • Direct never overrides
  • Campaign always overrides
  • Referral always overrides
  • Organic always overrides

Same Visit

  • Direct never overrides
  • Campaign always overrides
  • Referral never overrides
  • Organic always overrides

Extra

I created a graphic to illustrate precedence.

Update 1:

People complained about the graphic being insanely hard to read. Here are the rules behind the graphic. They look simpler, but the graphic took me too much time to just remove it now. Besides, it makes the post look good.

  • Campaign, Organic and Referral source always override a previous source
  • Direct never overrides a previous source
  • If it’s inside the same session a referral source will never override previous source
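The three rules are small enough to write down as a function. This is just an illustration of the rules as stated here; overridesPrevious and the type names are made up for the example.

```javascript
// Returns true when a new source of the given type replaces the stored one.
// newType: "campaign", "organic", "referral" or "direct"
// sameVisit: true when the new source shows up inside the current session
function overridesPrevious(newType, sameVisit) {
  switch (newType) {
    case "campaign":
    case "organic":
      return true;         // always override
    case "referral":
      return !sameVisit;   // override only in a later visit
    case "direct":
      return false;        // never override
    default:
      throw new Error("unknown source type: " + newType);
  }
}
```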

Update 2:

You can also use the parameter utm_nooverride=1 in your URLs. If a URL carries this parameter and the visitor already has a previous origin, the new origin will never override the existing one.
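As a rough sketch of that rule, assuming a runtime with the standard URL class available (the function names are hypothetical):

```javascript
// True when the landing URL asks GA not to override an existing origin
function hasNoOverride(landingUrl) {
  return new URL(landingUrl).searchParams.get("utm_nooverride") === "1";
}

// The new origin is stored unless the flag is set AND an origin already exists
function mayReplaceOrigin(landingUrl, hasPreviousOrigin) {
  return !(hasPreviousOrigin && hasNoOverride(landingUrl));
}
```

So a tagged link such as http://example.com/?utm_source=news&utm_nooverride=1 (example.com being a placeholder) still sets the origin for a brand-new visitor, but leaves an existing origin alone.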

Update 3:

Google has changed the way it creates new visits. Note that the rules here still apply. The only difference now is that any time a new source overwrites a previous source, a new visit is created.

Before this change, if the source changed during a visit, a new visit would not be spawned but the new origin was still sent to Analytics. This could cause sources with 0 visits in Google Analytics reports.

Google Analytics _setAllowHash bug

GATC (Google Analytics Tracking Code) has an annoying bug with _setAllowHash.

Suppose you have something like this:

  • Domain http://test.cereto.net/ that has a GATC for multiple sub-domains.
  • Domain http://test2.cereto.net/ that has both our tracker and a secondary default tracker that we don’t control.

With this configuration everything works as expected. Two sets of cookies are created: one set for the domain test2.cereto.net and a second set for .cereto.net. GATC knows which cookie to look at in both cases.

But now suppose you also want to track domain www.my-other-domain.com in the same account. What you’d need to do is:

  • Use _setAllowLinker(true) on both accounts.
  • Use _getLinkerUrl() or _link() on the links that go from one site to another and vice versa.
  • Use _setAllowHash(false) on both domains.

_link() and _getLinkerUrl() carry the cookies from one domain to the other. _setAllowLinker(true) is needed so that GATC looks in the URI for cookie parameters.
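With the async syntax, the three steps listed above would look roughly like this. The account ID is a placeholder and goToOtherSite is a hypothetical helper name. Everything else is the regular ga.js command queue.

```javascript
var _gaq = _gaq || [];
_gaq.push(["_setAccount", "UA-XXXXX-X"]);   // placeholder account ID
_gaq.push(["_setAllowLinker", true]);       // read cookie data passed in the URL
_gaq.push(["_setAllowHash", false]);        // stop checking the domain hash
_gaq.push(["_trackPageview"]);

// Click handler for links that cross from one site to the other.
// _link appends the cookie data to the URL and performs the navigation,
// so the handler returns false to cancel the browser's default one.
function goToOtherSite(url) {
  _gaq.push(["_link", url]);
  return false;
}
```

The handler would be wired up on the cross-domain links, for example onclick="return goToOtherSite(this.href);". The configuration calls have to be queued before _trackPageview on both sites.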

Q: Now why would you need _setAllowHash(false)? A: The GA cookies carry a domain hash (the first number in the cookie value, as shown below). Of course the hashes from our two domains are going to be different, and Google will trash the cookie when it sees that the hash doesn’t match the current domain. So we set _setAllowHash(false) and everything is fine. Or is it?

Cookie with a domain hash
__utma=253008534.504424944.1258547704.1258547704.1258547704.1

Cookie with Domain Hash disabled
__utma=1.504424944.1258547704.1258547704.1258547704.1

But now we don’t have the hash anymore, and it’s very important to GATC. When we read cookies with JavaScript we get no information about a cookie besides its name and value. The hash is how GATC knows which cookie belongs to that specific tracker.

In our setup we’ll have two sets of cookies available from test2.directperformance.com.br. If we disable the domain hash on both, there’s no way for GATC to pick the correct one. This leads to ga.js firing pageviews with data mixed from both cookies: origin, user hash, custom variables and more, generating unexpected results in the Analytics interface.
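The ambiguity is easy to reproduce with plain string handling. In the sketch below the function names are hypothetical, and the second cookie's hash and visitor ID are made-up values for illustration. The first cookie value is the one shown earlier in this post.

```javascript
// Collect every value stored under a cookie name. document.cookie only
// exposes "name=value" pairs, with no domain or path attached.
function findCookies(cookieString, name) {
  return cookieString.split(/;\s*/)
    .filter(function (pair) { return pair.indexOf(name + "=") === 0; })
    .map(function (pair) { return pair.slice(name.length + 1); });
}

// Keep only the values whose leading number matches the given domain hash
function pickByHash(values, domainHash) {
  return values.filter(function (v) { return v.split(".")[0] === String(domainHash); });
}

// Two __utma cookies visible on the same page, hashes enabled
var hashed = "__utma=253008534.504424944.1258547704.1258547704.1258547704.1; " +
             "__utma=111111111.777777777.1258547999.1258547999.1258547999.1";
// The same two cookies with the domain hash disabled: both start with "1."
var unhashed = "__utma=1.504424944.1258547704.1258547704.1258547704.1; " +
               "__utma=1.777777777.1258547999.1258547999.1258547999.1";

console.log(pickByHash(findCookies(hashed, "__utma"), 253008534).length); // 1
console.log(pickByHash(findCookies(unhashed, "__utma"), 1).length);       // 2
```

With hashes the tracker can single out its own cookie. Without them both candidates match, and nothing in the value says which tracker owns which.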

This is an old bug. You should avoid this setup, but sometimes there’s no way around it. The bug is present whether you use the default _gat tracker or the new async tracker _gaq.

Example

I created a little test to illustrate the issue:

  • Access the first domain using a URL with campaign variables. This domain has a single tracker and a single cookie.
  • Now access the second domain directly. It has two trackers, each on a different domain, so 2 different cookies.
  • The cookies were created accordingly, one for each tracker. The first one still has the campaign origin, and the second should now be a referral from this blog.
  • Since you have _setAllowHash(false) on both trackers, GATC doesn’t know which cookie to parse.
  • You can see, using HttpFox or a similar tool, that both pageviews have the same origin.

As explained, GATC didn’t know which was the correct cookie and took the first one.

Solution

There’s no good solution at this time, besides avoiding this setup.

It could all be solved if _setAllowLinker(true) simply ignored the domain hash on imported cookies and used the hash for the current domain instead. After all, it makes no sense to check the domain hash on cookies you’re importing.

There’s an undocumented feature in ga.js that seems to fix it: the function _setNamespace('ns'). If you use it on one or both trackers (with a different namespace for each, of course), the problem is gone. But it’s not safe to rely on undocumented features, as they might change or be removed completely in the future, generating unexpected results. You won’t want to use that in your production code.

This post is intended to get this bug properly documented, since there’s no public bug tracker for ga.js and I didn’t get a proper response or position from Google on any of the related user groups out there.