Response to: Google Bypassing User Privacy Settings

The clueless guys from the IE blog published this post a couple of hours ago:

http://blogs.msdn.com/b/ie/archive/2012/02/20/google-bypassing-user-privacy-settings.aspx

These are my comments on that post.

They invoke one of the most broken standards of all time as the basis for their accusations: P3P, a standard that only Internet Explorer implements.

But my point here is not to beat the dead horse that P3P is, but to comment on Microsoft’s accusations.

“By default, IE blocks third-party cookies unless the site presents a P3P Compact Policy Statement indicating how the site will use the cookie and that the site’s use does not include tracking the user. Google’s P3P policy causes Internet Explorer to accept Google’s cookies even though the policy does not state Google’s intent.”

P3P is not a standard to block tracking. It’s a standard to declare what you’re tracking and for what purposes. IE simply blocks any third-party cookie from a domain that sends no P3P CP and accepts everything from domains that send any string at all inside the P3P CP. Tracking cookies are not merely allowed by P3P; supporting them is the whole idea of P3P.

Later they show Microsoft’s own P3P settings, as if they were a good example of P3P usage.

By the way, here’s Microsoft’s P3P header:

P3P: CP="ALL IND DSP COR ADM CONo CUR CUSo IVAo IVDo PSA PSD TAI TELo OUR SAMo CNT COM INT NAV ONL PHY PRE PUR UNI"

What they don’t explain in the post is what that means. So here I try to list what these letters stand for. They indicate what information Microsoft tracks, for what purposes, and who has access to that data.

  • ALL – All Identified Data: access is given to all identified data.
  • IND – Indefinitely: Information is retained for an indeterminate period of time. The absence of a retention policy would be reflected under this option. Where the recipient is a public fora, this is the appropriate retention policy.
  • DSP – Disputes: the full policy contains a DISPUTES section describing how disputes about the privacy practices can be resolved.
  • COR – Errors or wrongful actions arising in connection with the privacy policy will be remedied by the service.
  • ADM – Web Site and System Administration: Information may be used for the technical support of the Web site and its computer system. This would include processing computer account information, information used in the course of securing and maintaining the site, and verification of Web site activity by the site or its agents.
  • CONo – Contacting Visitors for Marketing of Services or Products: Information may be used to contact the individual, through a communications channel other than voice telephone, for the promotion of a product or service. This includes notifying visitors about updates to the Web site. This does not include a direct reply to a question or comment or customer service for a single transaction; in those cases, CUR would be used. In addition, this does not include marketing via customized Web content or banner advertisements embedded in sites the user is visiting; these cases would be covered by the TAI, PSA and PSD, or IVA and IVD purposes.
  • CUR – Completion and Support of Activity For Which Data Was Provided: Information may be used by the service provider to complete the activity for which it was provided, whether a one-time activity such as returning the results from a Web search, forwarding an email message, or placing an order; or a recurring activity such as providing a subscription service, or allowing access to an online address book or electronic wallet.
  • IVAo – Individual Analysis: Information may be used to determine the habits, interests, or other characteristics of individuals and combine it with identified data for the purpose of research, analysis and reporting. For example, an online Web site for a physical store may wish to analyze how online shoppers make offline purchases.
  • IVDo – Individual Decision: Information may be used to determine the habits, interests, or other characteristics of individuals and combine it with identified data to make a decision that directly affects that individual. For example, an online store suggests items a visitor may wish to purchase based on items he has purchased during previous visits to the Web site.
  • PSA – Pseudonymous Analysis: Information may be used to create or build a record of a particular individual or computer that is tied to a pseudonymous identifier, without tying identified data (such as name, address, phone number, or email address) to the record. This profile will be used to determine the habits, interests, or other characteristics of individuals for purpose of research, analysis and reporting, but it will not be used to attempt to identify specific individuals. For example, a marketer may wish to understand the interests of visitors to different portions of a Web site.
  • PSD – Pseudonymous Decision: Information may be used to create or build a record of a particular individual or computer that is tied to a pseudonymous identifier, without tying identified data (such as name, address, phone number, or email address) to the record. This profile will be used to determine the habits, interests, or other characteristics of individuals to make a decision that directly affects that individual, but it will not be used to attempt to identify specific individuals. For example, a marketer may tailor or modify content displayed to the browser based on pages viewed during previous visits.
  • TAI – One-time Tailoring: Information may be used to tailor or modify content or design of the site where the information is used only for a single visit to the site and not used for any kind of future customization. For example, an online store might suggest other items a visitor may wish to purchase based on the items he has already placed in his shopping basket.
  • TELo – Contacting Visitors for Marketing of Services or Products Via Telephone: Information may be used to contact the individual via a voice telephone call for promotion of a product or service.
  • OUR – Ourselves and/or entities acting as our agents or entities for whom we are acting as an agent: An agent in this instance is defined as a third party that processes data only on behalf of the service provider for the completion of the stated purposes. (e.g., the service provider and its printing bureau which prints address labels and does nothing further with the information.)
  • SAMo – Legal entities following our practices: Legal entities who use the data on their own behalf under equable practices. (e.g., consider a service provider that grants the user access to collected personal information, and also provides it to a partner who uses it once but discards it. Since the recipient, who has otherwise similar practices, cannot grant the user access to information that it discarded, they are considered to have equable practices.)
  • CNT – Content: The words and expressions contained in the body of a communication, such as the text of email, bulletin board postings, or chat room communications.
  • COM – Computer Information: Information about the computer system that the individual is using to access the network — such as the IP number, domain name, browser type or operating system.
  • INT – Interactive Data: Data actively generated from or reflecting explicit interactions with a service provider through its site — such as queries to a search engine, or logs of account activity.
  • NAV – Navigation and Click-stream Data: Data passively generated by browsing the Web site — such as which pages are visited, and how long users stay on each page.
  • ONL – Online Contact Information: Information that allows an individual to be contacted or located on the Internet — such as email. Often, this information is independent of the specific computer used to access the network. (See the category “Computer Information”)
  • PHY – Physical Contact Information: Information that allows an individual to be contacted or located in the physical world — such as telephone number or address.
  • PRE – Preference Data: Data about an individual’s likes and dislikes — such as favorite color or musical tastes.
  • PUR – Purchase Information: Information actively generated by the purchase of a product or service, including information about the method of payment.
  • UNI – Unique Identifiers: Non-financial identifiers, excluding government-issued identifiers, issued for purposes of consistently identifying or recognizing the individual. These include identifiers issued by a Web site or service.
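
To make the header easier to read, its tokens can be grouped by what they describe. The following is a minimal sketch, not a full P3P parser: the group names follow my reading of the P3P 1.0 compact policy syntax, the table only covers tokens explained in the list above, and treating a trailing "o" as an opt-out marker is likewise taken from the spec rather than from the IE post. The function name parse_compact_policy is made up for this example.

```python
import re

# The header exactly as quoted in the post.
HEADER = ('CP="ALL IND DSP COR ADM CONo CUR CUSo IVAo IVDo PSA PSD TAI TELo '
          'OUR SAMo CNT COM INT NAV ONL PHY PRE PUR UNI"')

# Only tokens explained in the list above; grouping is an assumption
# based on the P3P 1.0 compact policy syntax.
GROUPS = {
    "access": {"ALL"},
    "retention": {"IND"},
    "disputes": {"DSP"},
    "remedies": {"COR"},
    "purpose": {"ADM", "CON", "CUR", "IVA", "IVD", "PSA", "PSD", "TAI", "TEL"},
    "recipient": {"OUR", "SAM"},
    "category": {"CNT", "COM", "INT", "NAV", "ONL", "PHY", "PRE", "PUR", "UNI"},
}


def parse_compact_policy(header_value):
    """Group the tokens of a CP="..." value; unknown tokens go to 'unlisted'."""
    match = re.search(r'CP\s*=\s*"([^"]*)"', header_value)
    tokens = match.group(1).split() if match else []
    grouped = {}
    for token in tokens:
        # A fourth character of a/i/o is a consent suffix (o = opt-out).
        base = token[:3] if len(token) == 4 and token[3] in "aio" else token
        group = next((name for name, members in GROUPS.items()
                      if base in members), "unlisted")
        grouped.setdefault(group, []).append(token)
    return grouped


policy = parse_compact_policy(HEADER)
for group, tokens in policy.items():
    print(f"{group:10} {' '.join(tokens)}")
```

Anything the table does not know, such as CUSo, ends up under "unlisted" instead of being guessed at.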

They basically use almost every option available to tell you that they keep as much freedom to use your data as P3P allows, and that they track nearly everything P3P lets them declare.

So what’s the difference between sending a non-compliant P3P CP just to make things work in Internet Explorer and sending a compliant one that says you own all of your users’ information?


Serve static files locally with python


This is just a note to self and of course can’t be used in production. But if you want to test something static real quick and want to avoid the file:// protocol, you can set up a convenient web server with Python.

Just cd to the root you want to serve and run this (on Python 3 the module is called http.server instead):

python -m SimpleHTTPServer

Now just point your browser to localhost:8000.
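
If you would rather pick the directory and port from a script, roughly the same thing can be sketched with the Python 3.7+ standard library. The helper name start_static_server is made up for this example, and like the one-liner it is meant for quick local tests only.

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def start_static_server(directory=".", port=8000):
    """Serve `directory` on 127.0.0.1:`port` in a background thread."""
    # The `directory` keyword of SimpleHTTPRequestHandler needs Python 3.7+.
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    server = ThreadingHTTPServer(("127.0.0.1", port), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

Call server = start_static_server("some/dir") to start it, and server.shutdown() followed by server.server_close() when you are done.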

Bye bye local Apache.

The bugfix that could make the internet 5% faster

I’ve been working with Google Analytics for the last 3 years. It was already a huge player in the market when I started, but I’ve seen enormous growth over these years. Google Analytics is the most used web analytics solution in the world: it currently runs on 44.67% of the top million websites on the internet. ga.js is probably the most popular JavaScript snippet in the history of the internet.

[Chart: Google Analytics usage on top websites. Source: builtwith.com]

Imagine the responsibility of the Google engineering team that maintains the ga.js file. While dealing with the many recent changes and new features in Google Analytics, they still have to make sure their code runs as fast as possible on every browser that exists. They must support IE 5.5 and low-end mobile devices, otherwise these browsers wouldn’t show up in Google Analytics reports. And they must do all of it without letting the code affect website performance.

I must say they do great work maintaining that code. The asynchronous syntax, while confusing at first, is a very clever way to push code loading and execution way down the queue, so browsers don’t delay page loading to register a GA pageview. It’s clear that the GA team takes great care over how fast and seamless their code is.

The one point that still bothers me a lot regarding performance is the Google Analytics cookies. Let’s take a look at what GA cookies look like:

>document.cookie
"__utma=96182344.347392035.1326382423.1326382423.1326382423.1; __utmb=96182344.1.10.1326382423; __utmc=96182344; __utmz=96182344.1326382423.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)"
>document.cookie.length
188

This is a minimal set of GA cookies. It can get longer if you use Custom Variables and Google Website Optimizer, but let’s stick with the minimum for now.
These cookies are used internally by GA to keep state and are manipulated by the code in ga.js. Unlike most other cookies you might see out there, these cookies never need to reach your web servers. Still, they are sent to your website with every single HTTP request.

According to Google’s SPDY whitepaper, the average HTTP request is 700-800 bytes long. That means the GA cookies make up about 25% of that request size (188 bytes out of roughly 750). Since GA is present on close to 50% of top websites, useless GA cookies going around the internet account for roughly 12% of the bytes in all HTTP requests sent to top websites.
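
Here is the back-of-the-envelope arithmetic as a small Python sketch. The 750-byte figure is just the midpoint of the 700-800 range, and the calculation assumes every request to a GA-enabled site carries the full minimal cookie, which overstates things for assets served from cookieless domains.

```python
# The minimal cookie string shown above, copied verbatim from the post.
cookie = (
    "__utma=96182344.347392035.1326382423.1326382423.1326382423.1; "
    "__utmb=96182344.1.10.1326382423; "
    "__utmc=96182344; "
    "__utmz=96182344.1326382423.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)"
)

cookie_bytes = len(cookie)         # same number as document.cookie.length
typical_request_bytes = 750        # assumption: midpoint of 700-800 bytes
ga_site_share_measured = 0.4467    # builtwith.com figure quoted earlier
ga_site_share_rounded = 0.50       # the "about 50%" used in the estimate

# Fraction of one request taken up by the GA cookies.
share_of_request = cookie_bytes / typical_request_bytes

# Fraction of all request bytes, averaged over top websites.
overall_measured = share_of_request * ga_site_share_measured
overall_rounded = share_of_request * ga_site_share_rounded

print(f"cookie size: {cookie_bytes} bytes")
print(f"share of one request: {share_of_request:.1%}")
print(f"share of all request bytes: {overall_measured:.1%} to {overall_rounded:.1%}")
```

With the measured 44.67% instead of the rounded 50%, the overall share comes out a bit above 11% rather than 12.5%.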

I posted a bug about this on GA-Issues a while ago. The idea is to keep this state in HTML5 localStorage instead of cookies on browsers that support it. It has attracted no attention so far. This bug fix could easily make the average HTTP request around 5% faster. We’re talking about the average speed of the whole internet.

The real picture is not that bad, since this only affects HTTP requests and not HTTP responses, and responses are where the real data is. Still, it’s funny to see something that huge going around unnoticed.

Track social interactions as events for the google +1 button

Google Analytics recently released the _trackSocial method. It was part of a bigger release across several social applications, including Google+. Sometimes things have to be pushed out before they’re extensively tested, and a couple of bugs may come up.

One specific bug in the social tracking bit me the other day. Google Analytics won’t apply hostname filters to social interactions, so a profile that is filtered to only include traffic to domain A may show social interactions for domain B. From there all sorts of bad things follow: visits with 0 pageviews, lower pages/visit and so on.

At first I thought about disabling the social tracking on the +1 buttons, but it seems the API doesn’t support that yet. Then I found an undocumented feature that disables it. Now you can disable the social tracking on the +1 button and use Events instead, since they go through the filters before showing up in your Google Analytics profile.

You’ll only want to use this if you are having problems with social tracking and hostname filters. Otherwise the default behavior is much better, since the data shows up in the separate Social reports.

Update

If you load the Google +1 button with the asynchronous code, the syntax is a little bit different.

Thanks Fábio Phms.

Cleaning up files with eval(base64 Malware

This blog was recently infected with eval(base64 malware. This kind of malware uses site vulnerabilities to inject a long list of links at the beginning of pages, in theory to improve the SEO performance of the linked sites.

This kind of strategy is just sad, speaking from the perspective of an SEO.

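I came up with a nice one-liner to clear out all that nasty code. It works great for me and may be useful for others. Note that it deletes every line matching the pattern, so check the .old backups before removing them.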

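# Find PHP files with an eval(...base64...) line, keep a .old backup of each,
# and delete every matching line. Assumes GNU grep for the -Z option.
find . -name "*.php" -print0 | \
xargs -0 -n 1 grep -l -Z 'eval.*base64' | \
xargs -0 -n 1 sed -i'.old' '/eval.*base64/ d'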
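
For systems without GNU grep, roughly the same cleanup can be sketched in Python. This is an equivalent I wrote for illustration, not the command I actually ran; the function name clean_php_files is made up, and it works on raw bytes so line endings and odd encodings are left untouched.

```python
import re
import shutil
from pathlib import Path

# Same pattern as the sed/grep one-liner, applied line by line to raw bytes.
PATTERN = re.compile(rb"eval.*base64")


def clean_php_files(root):
    """Delete matching lines from *.php files under root; return changed paths."""
    changed = []
    for path in sorted(Path(root).rglob("*.php")):
        if not path.is_file():
            continue
        lines = path.read_bytes().splitlines(keepends=True)
        kept = [line for line in lines if not PATTERN.search(line)]
        if len(kept) != len(lines):
            # Keep a backup next to the original, like sed -i'.old'.
            shutil.copy2(path, path.with_name(path.name + ".old"))
            path.write_bytes(b"".join(kept))
            changed.append(path)
    return changed
```

Calling clean_php_files(".") from the site root returns the list of files it touched, which makes it easy to review the .old backups afterwards.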