Is This The Algorithm for Google’s Helpful Content?

According to a Google research paper, an algorithm has been developed that can identify low-quality webpages, similar to the function of the helpful content signal.

Google recently published a research paper detailing the use of artificial intelligence to assess webpage quality. The specifics of the algorithm described in the paper closely resemble the known capabilities of the helpful content algorithm, and the research represents a significant advancement in the field.

Google Doesn’t Disclose Its Algorithm Technologies

It is not certain that the research paper in question is the basis for the helpful content signal, because Google does not disclose the technologies used in its algorithms to the public. This is consistent with Google’s general practice of not revealing the inner workings of its various algorithms, such as Penguin, Panda, or SpamBrain.

For the same reason, it is impossible to state definitively that the algorithm described in the research paper is the same as the helpful content algorithm. However, the similarities between the two are noteworthy and worth considering. Ultimately, one can only speculate and offer an informed opinion on the matter.

The Helpful Content Signal

1. It Improves a Classifier’s Performance

Although Google has given some indication about the helpful content signal, much speculation remains about its true nature. The first hints about this signal were provided in a tweet on December 6, 2022, which announced the rollout of the December 2022 helpful content update. The tweet stated:

“This improvement to our classifier is effective for all types of content in any language worldwide.”

A classifier in machine learning is a tool that categorizes data into distinct categories or classes.
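To make that concrete, here is a minimal sketch of a text classifier in Python using scikit-learn. The tiny training set and its “helpful”/“unhelpful” labels are invented purely for illustration; they are not Google’s data or categories.

```python
# A minimal text classifier: it learns to assign documents to one of a
# fixed set of classes. The training texts and labels below are invented
# for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "Step-by-step guide with worked examples and cited sources.",
    "Original research explained clearly for readers.",
    "Buy now cheap best top keyword keyword keyword.",
    "Click here click here amazing deal deal deal.",
]
train_labels = ["helpful", "helpful", "unhelpful", "unhelpful"]

# Vectorize the text and fit a simple linear classifier.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(train_texts, train_labels)

# The classifier assigns a new document to one of the learned classes.
print(classifier.predict(["A clear tutorial with worked examples."]))
```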

2. It Is Not a Manual or Spam Action

According to Google’s explanation of the Helpful Content algorithm, it is neither spam nor a manual penalty. This information can be found in the “What creators should know about Google’s August 2022 helpful content update” explainer.

“This classifier method is automated, using a machine-learning model. It is not a manual action nor a spam action.”

3. It Is a Ranking-Related Signal

According to the helpful content update explainer, the helpful content algorithm is a ranking signal used to determine the position of content in search results.

“… it’s just a recent signal and one of many signals Google considers to rank content.”


4. It Checks Whether People Create the Content

The helpful content signal evaluates whether content was created by people rather than machines. According to Google’s blog post on the Helpful Content Update, entitled “More content by people, for people in Search,” the signal is used to identify content created by and for people. Danny Sullivan of Google stated:

“We are introducing a series of enhancements to Search that will make it simpler for users to locate helpful content created by and for people. We are excited to continue building upon this work to make it even more effortless to find authentic content produced by and for real people in the future.”

The phrase “by people” appears three times in the announcement, which suggests that it is a key characteristic of the helpful content signal. If the content is not “by people,” it will likely be machine-generated. This is an important distinction, as the algorithm is designed to detect machine-generated content.

5. Is the Helpful Content Signal Multiple Factors?

Lastly, Google’s blog post suggests that the Helpful Content Update is not a single entity, such as a standalone algorithm. Danny Sullivan writes that it is a “series of improvements,” which implies that multiple algorithms or systems work together to identify and eliminate unhelpful content. This is what he wrote:

“We are introducing a range of enhancements to Search to make it simpler for users to locate helpful content created by and for people.”


Text Generation Models Can Predict Webpage Quality

This research paper demonstrates that large language models (LLMs) such as GPT-2 can identify low-quality content accurately. The study utilized classifiers trained to recognize machine-generated text and found that they could also identify low-quality text, even though they were not specifically trained for this purpose.

Large language models can learn new tasks they were not originally trained to perform. For example, an article from Stanford University about GPT-3 discusses how the model independently learned to translate text from English to French simply by being given more data to learn from, something that did not happen with GPT-2, which was trained on a smaller dataset.

The article highlights how increasing the amount of training data can lead to the emergence of new behaviors. The term “emerge” is significant because it refers to a machine learning to do something it was not explicitly trained to do. This is a product of unsupervised training, in which a model learns from unlabeled data rather than from explicit examples of the task.

According to the Stanford University article on GPT-3:

“Workshop attendees were surprised to see this behavior emerge from increasing the amount of data and computational resources and were curious to see what other capabilities could be discovered through further scaling.”

The research paper discusses the emergence of a new ability, specifically a machine-generated text detector’s ability to predict low-quality content. The researchers state:

“Our research has two main objectives: firstly, we demonstrate through human evaluation that classifiers trained to distinguish between human-generated and machine-generated text can also act as unsupervised predictors of ‘page quality’ and can detect low-quality content without any additional training. This enables the rapid establishment of quality indicators in resource-constrained environments.

Secondly, we conducted an extensive qualitative and quantitative analysis of 500 million web articles to understand better the prevalence and nature of low-quality pages on the web. This is the largest study ever conducted on this topic.”

The key takeaway from this analysis is that using a text generation model trained to identify machine-generated content led to the emergence of a new behavior: the ability to detect low-quality pages.

Read Next: Using AI for SEO in Content Creation

OpenAI GPT-2 Detector

The researchers evaluated the performance of two systems for detecting low-quality content. One of these systems employed RoBERTa, an enhanced version of BERT’s pretraining method. The two systems that were tested are:

  • OpenAI’s RoBERTa-based GPT-2 detector
  • GLTR (Statistical Detection and Visualization of Generated Text), which uses machine-generated content’s “statistical signature” to identify it; this system employs BERT and GPT-2

The researchers found that OpenAI’s GPT-2 detector was more effective at detecting low-quality content. The characteristics of the test results closely match those of the helpful content signal.
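As an illustration of how such a detector is queried in practice (this is not the paper’s code), here is a short sketch using the publicly released RoBERTa-based GPT-2 output detector through the Hugging Face transformers library; the checkpoint name and its “Real”/“Fake” label meanings should be verified against the model card.

```python
# Sketch: scoring a passage with OpenAI's RoBERTa-based GPT-2 output
# detector via Hugging Face transformers. Verify the checkpoint name and
# label meanings ("Real" = human-written, "Fake" = machine-generated)
# against the model card before relying on them.
from transformers import pipeline

detector = pipeline(
    "text-classification",
    model="openai-community/roberta-base-openai-detector",
)

result = detector("The quick brown fox jumps over the lazy dog.")[0]
print(result["label"], round(result["score"], 3))
```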

AI Can Identify All Types of Language Spam

According to the research paper, there are numerous quality signals, but this approach examines only linguistic quality. In the paper, the phrases “page quality” and “language quality” are used interchangeably.

The key innovation in this research is using the OpenAI GPT-2 detector’s prediction of whether a piece of content is machine-generated as a score for language quality. The researchers write:

“Documents with a high probability of being machine-written tend to have low language quality. This makes machine authorship detection an effective way to assess the quality of the text. This method doesn’t require labeled examples, only a corpus of text to train on.

This is especially useful when there is a limited amount of labeled data or when the data is too complex to be accurately sampled. For example, it can be difficult to create a labeled dataset for certain applications.”

This means the system does not need to be trained to detect specific types of low-quality content; instead, it learns on its own to identify all forms of low quality. This is a powerful method for identifying low-quality pages.
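A minimal sketch of that idea, assuming a detector that returns the probability a page is machine-written: the inversion below is an illustrative reading of the paper’s observation, not its exact formula, and the `p_machine` inputs are hypothetical.

```python
# Sketch: turning a machine-text detector's output into a language-quality
# score. A high probability of machine authorship is read as low quality,
# so the probability is simply inverted. Inputs below are hypothetical.

def language_quality(p_machine: float) -> float:
    """Map P(machine-written) in [0, 1] to a quality score in [0, 1]."""
    return 1.0 - p_machine

pages = {"page_a": 0.05, "page_b": 0.55, "page_c": 0.97}
for page, p in pages.items():
    print(page, "quality =", round(language_quality(p), 2))
```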

Outcomes Mirror Helpful Content Update

The researchers tested this system on a sample of 500 million web pages, evaluating the pages based on various attributes such as document length, age of the content, and topic. The age of the content was not used to mark newer content as low quality. Rather, the web content analysis revealed a significant increase in low-quality pages starting in 2019, coinciding with the increasing use of machine-generated content.
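The shape of such an analysis might look like the following sketch, which aggregates per-page quality scores by year and topic; the column names, data, and low-quality threshold are invented for illustration and are not from the paper.

```python
# Sketch: aggregating per-page language-quality scores by attribute,
# loosely mirroring the kind of analysis described. All data, column
# names, and the 0.5 low-quality threshold are invented for illustration.
import pandas as pd

df = pd.DataFrame({
    "year":    [2017, 2018, 2019, 2019, 2020, 2020],
    "topic":   ["law", "government", "essays", "essays", "essays", "law"],
    "quality": [0.90, 0.85, 0.40, 0.35, 0.30, 0.88],
})

# Share of low-quality pages (quality < 0.5) per publication year.
print((df["quality"] < 0.5).groupby(df["year"]).mean())

# Mean quality per topic area.
print(df.groupby("topic")["quality"].mean())
```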

The analysis by topic showed that certain topic areas tended to have higher quality pages, such as legal and government topics. Interestingly, the researchers found many low-quality pages in the education sector, which they attributed to websites providing students with essays. This is noteworthy because education is a topic specifically mentioned by Google as being impacted by the Helpful Content update. In a blog post written by Danny Sullivan, Google stated:

“…our testing has found it will improve results related to online education….”


Three Language Quality Scores

Google’s Quality Raters Guidelines (PDF) utilize four quality scores: low, medium, high, and very high. The researchers employed three quality scores for testing the new system, plus a fourth category called undefined.

Documents with an undefined rating were excluded because they could not be evaluated. Document ratings ranged from 0 to 2, with 2 being the highest.

Here is a breakdown of the Language Quality Scores:

  • 0 Low LQ: The text is low quality and is hard to understand or make sense of.
  • 1 Medium LQ: The text is understandable but poorly written, with many grammatical and syntactic mistakes.
  • 2 High LQ: The text is understandable and generally well-constructed, with only occasional grammatical or syntactical mistakes.
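As a sketch, a continuous quality score could be bucketed into these three ratings roughly as follows; the 0.33 and 0.66 thresholds are invented for illustration and do not come from the paper.

```python
# Sketch: bucketing a continuous language-quality score into the paper's
# three ratings. The thresholds are invented for illustration.

def lq_rating(quality: float) -> int:
    """Return 0 (Low LQ), 1 (Medium LQ), or 2 (High LQ)."""
    if quality < 0.33:
        return 0  # Low LQ
    if quality < 0.66:
        return 1  # Medium LQ
    return 2      # High LQ

for q in (0.10, 0.50, 0.90):
    print(q, "->", lq_rating(q))
```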

Here is the definition provided in the Quality Raters Guidelines for what is considered low quality:

Lowest Quality: MC created without the effort, originality, talent, or skill necessary to achieve the purpose of the page to a satisfactory degree, with negligible attention to essential elements like clarity and structure. Some substandard material is created with little effort in order to have something to generate income, rather than to produce original or thoughtful content that helps people.

Filler material may also be included, particularly at the beginning of the webpage, forcing people to scroll down to reach the main text. The writing itself may be unprofessional, with various syntax and punctuation mistakes.

The instructions provided to quality raters contain a more thorough explanation of what is considered poor quality than the algorithm does. Notably, the algorithm takes account of mistakes in grammar and syntax. Syntax refers to the arrangement of words.

Words in the wrong order can sound jarring, similar to how the character Yoda speaks in Star Wars (“Impossible to see, the future is”).

The algorithm behind the Helpful Content signal may rely on grammar and syntax signals, but other aspects also play a role in determining how helpful a piece of content is. The algorithm was refined with input from the quality raters’ guidelines between 2021 and 2022.

The Algorithm Is Powerful

It’s beneficial to review the conclusions of research papers when determining the effectiveness of an algorithm. Doing so can give insight into whether or not the algorithm is suitable for use in search results. Research papers often end with a statement indicating that further research is needed or that the improvements are minimal. However, the most intriguing papers are the ones that claim to have achieved new state-of-the-art results.

The researchers note that this algorithm is robust and outperforms the baselines.


One of the reasons why this algorithm is a suitable candidate for the Helpful Content signal is its ability to operate with limited resources while still being able to handle a large-scale web-based application.

In the conclusion, the researchers reiterate the favorable outcomes:

“The study suggests that using detectors trained to distinguish between text written by humans and machines can accurately predict the language quality of webpages, with better results than a supervised spam classification model used as a comparison.”

The conclusion of the research paper had a positive outlook on the advancements made and expressed optimism that others will build on the findings. Notably, it did not indicate a need for further research.

This research paper presents a significant advancement in identifying low-quality web pages. The conclusion suggests that the proposed algorithm could be incorporated into a search engine’s systems, such as Google’s. Additionally, the algorithm is designed to function in a “web-scale,” “low-resource” setting, making it well suited for deployment and continuous operation, similar to the Helpful Content signal.

It cannot be confirmed whether this research is tied to the recent Helpful Content update, but it is undoubtedly a significant advancement in detecting low-quality content.

You might also like: Google Search Essentials is Replacing the Google Webmaster Guidelines.
