Google Human Raters – How Processes and 7 Biases influence results

Google Search results are not based entirely on algorithms. Human Raters play a crucial role in evaluating algorithm changes. They first have to understand the query type: navigational, informational, or transactional. Then they have to evaluate the pages in the search results for relevance and for the value they offer to the user. Based on relevance, Human Raters (Quality Raters) rate each page as Vital, Useful, Relevant, Not Relevant, or Off-Topic. The Vital rating is mostly seen in navigational searches. For example, if I search for “bestbuy cameras”, my intent is to browse the cameras available at Best Buy. Here are the top results and the most likely evaluation that a Human Rater might perform.

Figure: Best Buy camera query example, with the likely evaluation by Human Raters

Before an algorithm change goes live, the search quality team follows this process:


1) Search Result Not Satisfactory: Search engineers notice that the search results in a particular query space are not satisfactory.

2) Idea from Engineer to Improve Results: Engineers propose ideas to improve the search results.

3) Implement Idea in a Sandbox: The ideas are then implemented in a Sandbox or a test environment.

4) Send the Before and After Results: The search results before and after implementing the idea are shown side by side to Human Raters. Note that the Human Raters have no idea which results are from before the algorithm change and which are from after it.

5) Send a Fraction of the Traffic to the New Results: A fraction of live traffic is sent to the results with the algorithm change to evaluate how users respond to the new results, and whether those results answer users’ questions and provide a better user experience.

6) Analyst Evaluates Results: The analyst evaluates the real-time response to the algorithm change together with the ratings from Human Raters, and a comprehensive report is sent to a Search Committee.

7) Result Accepted or Rejected: A detailed discussion follows, and the algorithm change is accepted or rejected.

If the change is rejected, the results for the query space remain the same, and the engineers have to come up with another idea to tweak the algorithm.
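Steps 4 and 5 can be illustrated with a short Python sketch. This is not Google's implementation, which is not public; the function names, the coin-flip ordering, and the hash-based bucketing are all assumptions chosen only to show the idea of a blind side-by-side comparison and a small live-traffic experiment.

```python
import hashlib
import random
from collections import Counter


def blind_pair(before, after, rng):
    """Return (left, right, key) with the two result lists in random order.

    `key` records which side holds which version. In this hypothetical
    setup the analyst keeps it and the rater never sees it.
    """
    if rng.random() < 0.5:
        return before, after, {"left": "before", "right": "after"}
    return after, before, {"left": "after", "right": "before"}


def tally_preferences(choices, keys):
    """Map each rater's 'left'/'right' choice back to 'before'/'after'."""
    return Counter(key[choice] for choice, key in zip(choices, keys))


def in_experiment(user_id, fraction):
    """Assign a stable fraction of users to the changed results.

    Hashing the (hypothetical) user id makes the assignment repeatable,
    so the same user sees the same version on every visit.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return bucket < fraction
```

A rater would be shown only `left` and `right` and asked which is better. The analyst would then call `tally_preferences` with the stored keys to see how often the changed results won, and `in_experiment(user_id, 0.01)` would route roughly 1% of users to the new results.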

Although the process seems foolproof, human raters bring biases that can influence the quality of search results. Here are seven:

1) Design Bias: When a human rater evaluates a website, the design of the page plays a major role in the judgment of whether the website is current or not. Several WordPress and Blogger themes can look primitive. First impressions matter, and web design can easily create a false notion of quality.

2) Domain Bias: Another major factor in evaluating the quality of a page is the domain of the site. A webpage hosted on a .biz or .info domain is not likely to give the same impression as a .org or .com page.

3) Knowledge Bias: Human raters are selected through a rigorous recruitment process and are allowed to rate for one year, after which they have to wait 3 months to reapply. But if your job is to evaluate query spaces for relevance on a regular basis, you are likely to learn more about SEO and Internet marketing. With knowledge come biases, mostly against certain thought leaders, and if the query space is related to the rater’s educational background or to SEO, those biases will come into play during the evaluation.

4) Amateur Bias: Lack of knowledge creates its own bias. Amateur raters cannot differentiate between an expert site with original content and a site that is good at copying and deriving content from experts. Also, raters without knowledge of an industry cannot spot spam unless the spamming site uses obvious techniques like keyword stuffing.

5) Intent Bias: Understanding relevance is completely different from understanding intent. For short keywords, it is extremely difficult to understand intent unless you have expertise in that industry. Human raters are unlikely to add much value in technical query spaces.

6) Relevance Bias: Content scrapers might rank on the first page for certain keywords, ahead of the original content creators, who might be ranked on the fourth page. Human raters are likely to rate the results on the first page and might look at pages 2 and 3, so they can miss the original page that should have ranked on the first page. Since Human raters are asked to evaluate relevance for a particular keyword, they are unlikely to look for the original content, and judged on content alone, a scraper site might be rated better than the original site.

7) Cultural Bias: Although raters are recruited from all around the world, biases against cultures are ingrained in the psyche. How often do we associate quality with German engineering, outsourcing with India, and cheap goods with China? Humans tend to generalize cultural traits, and this can affect how raters evaluate a page.

Although the Human Rater training material shows that a thorough understanding of relevance is required before evaluation, a rater's knowledge of a query space, or lack of it, can help or hamper how the search results are evaluated.

In the following video, Matt Cutts explains how Human Raters are used in Web Search.


To see how algorithm changes are implemented, watch the following video.