SEMrush Ranking Factors Study 2017 — Methodology Demystified

Xenia Volynchuk

After the SEMrush Ranking Factors Study 2017 was published a month ago, many brows were raised in disbelief: direct website visits are usually assumed to be the result of higher SERP positions, not vice versa. And yet site visits are exactly what our study named the most important Google ranking factor among those we analyzed. Moreover, the methodology we used was unique to the field of SEO studies: we traded correlation analysis for the Random Forest algorithm. As the ultimate goal of our study was to help SEOs prioritize tasks and do their jobs more effectively, we would like to reveal the details of our methodology and bust some popular misconceptions, so that you can safely rely on our takeaways.

Jokes aside, this post is for real nerds, so here is a short glossary:

Decision tree: a tree-like structure that represents a machine learning algorithm usually applied to classification tasks. It splits a training sample dataset into homogeneous groups/subsets based on the most significant of all the attributes.

Supervised machine learning: a type of machine learning algorithm that trains a model to find patterns in the relationship between input variables (features, A) and an output variable (target value, B): B = f(A). The goal of SML is to train this model on a sample of the data so that, when offered out-of-sample data, the algorithm is able to predict the target value precisely, based on the feature set offered. The training dataset plays the role of a teacher supervising the learning process. The training is considered successful and terminates when the algorithm achieves an acceptable level of performance.

Feature (or attribute, or input variable): a characteristic of a separate data entry used in analysis. For our study and this blog post, features are the alleged ranking factors.

Binary classification: a type of classification task that falls into the supervised learning category. The goal of this task is to predict a target value (= class) for each data entry; for binary classification, it can only be either 1 or 0.

Using the Random Forest Algorithm For the Ranking Factors Study

The Random Forest algorithm was developed by Leo Breiman and Adele Cutler, building on ideas from the mid-1990s; Breiman’s paper describing it was published in 2001. It hasn’t undergone any major changes since then, which speaks to its quality and versatility: it is used for classification, regression, clustering, feature selection and other tasks.

Although the Random Forest algorithm is not very well known to the general public, we picked it for a number of good reasons:

  • It is one of the most popular machine learning algorithms and is known for its high accuracy. One of its best-known applications is ranking the importance of variables (its nature suits this task well, as we’ll cover later in this post), so it seemed an obvious choice.

  • The algorithm treats data in a certain way that minimizes errors:

    1. The random subspace method offers each learner a random sample of the features, not all of them. This helps ensure that the learner won’t be overly focused on a pre-defined set of features and won’t make biased decisions about an out-of-sample dataset.

    2. The bagging, or bootstrap aggregating, method also improves precision. Its main point is offering each learner not the whole dataset, but a random sample of the data.

Because we have not a single decision tree but a whole forest of hundreds of trees, each feature and each pair of URLs is very likely to be analyzed approximately the same number of times. This is what makes the Random Forest method stable and keeps its errors low.

The Pairwise Approach: Pre-Processing Input Data

We based our study on a set of 600,000 keywords from the worldwide database (US, Spain, France, Italy, Germany and others), the URL position data for the top 20 search results, and a list of alleged ranking factors. As we were not going to use correlation analysis, we first had to recast the data as a binary classification task before applying a machine learning algorithm to it. This was done with the Pairwise approach, one of the most popular machine-learned ranking methods, used by Microsoft, among others, in its research projects.

The Pairwise approach implies that instead of examining the entire dataset at once, each SERP is studied individually: we compare all possible pairs of URLs (the first result on the page with the fifth, the seventh result with the second, etc.) with regard to each feature. Each pair is assigned a set of numeric values, one per feature, where each value is the quotient of the feature value for the first URL divided by the feature value for the second URL. On top of that, each pair is also assigned a target value that indicates whether the first URL is positioned higher than the second one on the SERP (target value = 1) or lower (target value = 0).

Procedure outcomes:

  1. Each URL pair receives a quotient for each feature and a target value of either 1 or 0. This set of numbers will be used as a training dataset for the decision trees.
  2. We are now able to make statistical observations that certain feature values and their combinations tend to result in a higher SERP position for a URL. This allows us to build a hypothesis about the importance of certain features and make a forecast about whether a certain set of feature values will lead to higher rankings.
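
To make the pairwise pre-processing more tangible, here is a minimal Python sketch of how one SERP could be turned into training rows. It is an illustration under assumptions, not the code used in the study: the function name `build_pairwise_rows`, the factor names and all numbers are made up, and skipping zero or missing denominators is our own simplification for the example.

```python
from itertools import permutations

def build_pairwise_rows(serp):
    """Turn one SERP into pairwise training rows.

    `serp` is a list of (position, features) tuples, where `features` maps a
    hypothetical factor name to a number.
    """
    rows = []
    # Ordered pairs, so both (1st, 5th) and (5th, 1st) are produced.
    for (pos_a, feats_a), (pos_b, feats_b) in permutations(serp, 2):
        # One quotient per feature: value for the first URL / value for the second.
        quotients = {
            name: feats_a[name] / feats_b[name]
            for name in feats_a
            if feats_b.get(name)  # assumption: skip missing or zero denominators
        }
        # Target is 1 when the first URL ranks higher (smaller position number).
        target = 1 if pos_a < pos_b else 0
        rows.append((quotients, target))
    return rows

# Toy SERP with made-up numbers, for illustration only.
serp = [
    (1, {"direct_visits": 9000, "backlinks": 120}),
    (2, {"direct_visits": 3000, "backlinks": 300}),
    (3, {"direct_visits": 1500, "backlinks": 60}),
]
rows = build_pairwise_rows(serp)
```

Each SERP is processed on its own, so quotients are only ever computed between URLs that compete for the same keyword.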

Growing the Decision Tree Ensemble: Supervised Learning

The dataset we obtained in the previous step is generic and can be used with any machine learning algorithm. Our preferred choice was Random Forest, an ensemble of decision trees.

Before the trees can make any reasonable decisions, they have to train — this is when the supervised machine learning takes place. To make sure the training is done correctly and unbiased decisions about the main data set are made, the bagging and random subspace methods are used.

Bagging is the process of creating a training dataset by sampling with replacement. Let’s say we have X lines of data. According to bagging principles, we create a separate training dataset for each decision tree, and each of these sets has the same number of lines, X. However, each set is populated randomly and with replacement, so it includes only about two-thirds of the original X lines, and some lines appear more than once. About one-third of the original lines are left out of a given tree’s sample and are used once the learning is over.
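
The "two-thirds in, one-third out" proportion is easy to check with a short simulation. The sketch below is only an illustration: the helper name `bootstrap_sample`, the seed and the row count are arbitrary choices for the example.

```python
import random

def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement (one tree's training set)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

rng = random.Random(0)  # fixed seed so the run is repeatable
n_rows = 10_000         # stands in for the X lines of data
in_bag = bootstrap_sample(n_rows, rng)

unique_in_bag = set(in_bag)
out_of_bag = set(range(n_rows)) - unique_in_bag

# For large X, the share of distinct lines drawn approaches 1 - 1/e (about 0.632),
# and the share left out approaches 1/e (about 0.368).
share_used = len(unique_in_bag) / n_rows
share_unused = len(out_of_bag) / n_rows
```

The sample has as many lines as the original data, but roughly a third of the original lines never make it in, which is what leaves each tree with data it has not seen.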

We did a similar thing for the features using the random subspace method: the decision trees were trained on random samples of features instead of the entire feature set.

No single tree uses the whole dataset and the whole list of features. But having a forest of many trees allows us to say that every value and every feature is very likely to be used approximately the same number of times.
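
The "approximately the same number of times" claim can be illustrated by counting how often each feature is drawn across a forest. This is a sketch under assumptions: the placeholder factor names, the number of trees and the subset size are invented for the example and are not the settings used in the study.

```python
import random
from collections import Counter

FEATURES = [f"factor_{i}" for i in range(1, 13)]  # 12 placeholder factor names

def sample_feature_subset(features, k, rng):
    """Pick k distinct features for one tree (random subspace method)."""
    return rng.sample(features, k)

rng = random.Random(42)
n_trees, k = 500, 4  # both values are illustrative, not taken from the study

usage = Counter()
for _ in range(n_trees):
    usage.update(sample_feature_subset(FEATURES, k, rng))

# With uniform sampling, each feature is expected in n_trees * k / 12 trees.
expected = n_trees * k / len(FEATURES)
```

Printing `usage` next to `expected` shows counts that scatter around the expected value rather than matching it exactly, which is why more trees give more even coverage.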

Growing the Forest

Each decision tree repeatedly partitions the training sample dataset based on the most important variable and does so until each subset consists of homogeneous data entries. The tree scans the training dataset and chooses the most important feature and its precise value, which becomes a kind of pivot point (node) and splits the data into two groups. For one group, the chosen condition is true; for the other, it is false (the YES and NO branches). All final subgroups (leaf nodes) receive an average target value based on the target values of the URL pairs that were placed into that subgroup.
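
For the curious, a single split step might look roughly like the sketch below. The post does not say which criterion was used to pick "the most important feature and its precise value", so Gini impurity is our assumption here; the function names and the toy data are hypothetical as well.

```python
def gini(targets):
    """Gini impurity of a list of 0/1 targets (0.0 means perfectly homogeneous)."""
    if not targets:
        return 0.0
    p = sum(targets) / len(targets)
    return 2 * p * (1 - p)

def best_split(rows, targets):
    """Return (feature_index, threshold, weighted_impurity) of the best split.

    A row goes to the YES branch when row[feature_index] <= threshold,
    otherwise to the NO branch.
    """
    best = None
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            yes = [t for row, t in zip(rows, targets) if row[f] <= threshold]
            no = [t for row, t in zip(rows, targets) if row[f] > threshold]
            if not yes or not no:
                continue  # a split must produce two non-empty groups
            score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(targets)
            if best is None or score < best[2]:
                best = (f, threshold, score)
    return best

# Toy quotient rows: column 0 separates the targets cleanly, column 1 does not.
rows = [[0.5, 1.2], [0.8, 0.7], [1.5, 1.1], [2.0, 0.6]]
targets = [0, 0, 1, 1]
split = best_split(rows, targets)
```

A full tree would apply `best_split` again to each of the two resulting groups until the groups are homogeneous, then store the average target value in each leaf.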

Since the trees use the sample dataset to grow, they learn while growing. Their learning is considered successful and high-quality when a target percentage of correctly guessed target values is achieved.

Once the whole ensemble of trees is grown and trained, the magic begins — the trees are now allowed to process the out-of-sample data (about one-third of the original dataset). A URL pair is offered to a tree only if it hasn’t encountered the same pair during training. This means that a URL pair is not offered to 100 percent of the trees in the forest. Then, voting takes place: for each pair of URLs, a tree gives its verdict, aka the probability of one URL taking a higher position in the SERP compared to the second one. The same action is taken by all other trees that meet the ‘haven’t seen this URL pair before’ requirement, and in the end, each URL pair gets a set of probability values. Then all the received probabilities are averaged. Now there is enough data for the next step.
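
The voting step can be sketched as follows. This is a simplified illustration, not the study's code: `oob_probabilities` is a hypothetical helper, and the trained trees are replaced by a table of made-up probabilities so that the averaging logic stays visible.

```python
def oob_probabilities(in_bag_sets, tree_predict, pairs):
    """Average per-tree probabilities over trees that did not train on each pair.

    in_bag_sets: one set of pair indices per tree (its training sample).
    tree_predict: function (tree_index, pair) -> probability that the first
        URL outranks the second; a stand-in for a trained tree.
    pairs: the full list of URL pairs.
    """
    averaged = {}
    for i, pair in enumerate(pairs):
        votes = [
            tree_predict(t, pair)
            for t, in_bag in enumerate(in_bag_sets)
            if i not in in_bag  # only trees that have not seen this pair vote
        ]
        if votes:  # a pair seen by every tree gets no out-of-sample estimate
            averaged[i] = sum(votes) / len(votes)
    return averaged

# Toy setup: 3 trees, 4 pairs, and hard-coded per-tree probabilities.
pairs = ["pair_0", "pair_1", "pair_2", "pair_3"]
in_bag_sets = [{0, 1}, {1, 2}, {1, 3}]
fake_probs = [[0.9, 0.5, 0.2, 0.6], [0.7, 0.5, 0.4, 0.8], [0.5, 0.5, 0.6, 0.4]]
result = oob_probabilities(
    in_bag_sets, lambda t, pair: fake_probs[t][pairs.index(pair)], pairs
)
```

In this toy run, `pair_1` was in every tree's training sample, so it receives no averaged probability at all; with hundreds of trees, that situation becomes very unlikely.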

Estimating Attribute Importance with Random Forest

Random Forest produces highly credible results when it comes to attribute importance estimation. The assessment is conducted as follows:

  1. The values of one attribute are shuffled across all URL pairs, and this updated set of values is offered to the algorithm.

  2. Any changes in the algorithm’s quality or stability are measured (whether the percentage of correctly guessed target values remains the same or not).

  3. Then, based on the values received, conclusions can be made:

    • If the algorithm’s quality drops significantly, the attribute is important: the steeper the drop in quality, the more important the attribute.

    • If the algorithm’s quality remains the same, the attribute is of minor importance.

The procedure is repeated for all the attributes. As a result, a rating of the most important ranking factors is obtained.
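
The shuffle-and-measure procedure above can be sketched in a few lines of Python. Treat it as an illustration under assumptions: the function names are hypothetical, the data is synthetic, and a hand-written rule stands in for the trained forest so that the example stays self-contained.

```python
import random

def accuracy(model, rows, targets):
    """Share of rows for which the model guesses the target correctly."""
    hits = sum(model(row) == t for row, t in zip(rows, targets))
    return hits / len(rows)

def permutation_importance(model, rows, targets, n_features, rng):
    """Return the accuracy drop caused by shuffling each feature column."""
    baseline = accuracy(model, rows, targets)
    drops = []
    for f in range(n_features):
        column = [row[f] for row in rows]
        rng.shuffle(column)  # mix this attribute's values across all rows
        shuffled = [row[:f] + [v] + row[f + 1:] for row, v in zip(rows, column)]
        drops.append(baseline - accuracy(model, shuffled, targets))
    return drops

# Synthetic data: the target depends only on column 0; column 1 is pure noise.
rng = random.Random(1)
rows = [[rng.uniform(0.1, 3.0), rng.uniform(0.1, 3.0)] for _ in range(400)]
targets = [1 if row[0] > 1.0 else 0 for row in rows]

def model(row):
    """Stand-in for a trained forest: it has 'learned' the true rule."""
    return 1 if row[0] > 1.0 else 0

drops = permutation_importance(model, rows, targets, n_features=2, rng=rng)
```

Here shuffling column 0 hurts accuracy noticeably, while shuffling column 1 changes nothing, so sorting attributes by their drop gives the importance rating.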

Why We Think Correlation Analysis is Bad for Ranking Factors Studies

We intentionally abandoned the general practice of using correlation analysis, yet we still received quite a few comments like “Correlation doesn’t mean causation” and “Those don’t look like ranking factors, but more like correlations.” Therefore, we feel this point deserves a section of its own.

First and foremost, we would like to stress again that the initial dataset used for the study is a set of highly variable values. Remember that we examined not one, but 600,000 SERPs. Each SERP is characterized by its own average attribute values, and this uniqueness is completely disregarded in the process of correlation analysis. We believe that each SERP should be treated separately and with respect to its individual characteristics.

Correlation analysis gives reliable results only when examining the relationship between two variables (for example, the impact of the number of backlinks on a SERP position). “Does this particular factor influence position?” This question can be answered quite precisely when only one impacting variable is involved. But are we in a position to study each factor in isolation? Probably not, as we all know that there is a whole bunch of factors that influence a URL’s position in a SERP.

Another quality criterion for correlation analysis is the spread of the resulting correlation coefficients. For example, if there is a lineup of correlation coefficients like (-1, 0.3 and 0.8), then it is pretty fair to say that there is one parameter that is more important than the others. The closer the coefficient’s absolute value, or modulus, is to one, the stronger the correlation. If the coefficient’s modulus is under 0.3, such a correlation can be disregarded: the dependency between the two variables, in this case, is too weak to draw any trustworthy conclusions. For all the factors we analyzed, the modulus of the correlation coefficient was under 0.3, so we had to drop this method.

One more reason to dismiss this analysis method was the high sensitivity of the correlation value to outliers and noise, and the data for various keywords contains a lot of both. Adding a single extreme data entry to the dataset can change the correlation coefficient noticeably. Hence this metric isn’t viable in the case of multiple variables, e.g. in a ranking factors study, and can even lead to incorrect deductions.
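
The sensitivity to outliers is easy to reproduce. The sketch below computes the Pearson correlation coefficient by hand on made-up numbers, then appends one extreme entry; the data is invented for illustration and is not from the study.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up values: a factor that moves in lockstep with position.
factor = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
position = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
r_clean = pearson(factor, position)

# The same data with one extreme entry appended.
r_with_outlier = pearson(factor + [100], position + [1])
```

On the clean data the coefficient is 1 (a perfect linear relationship); with the single outlier added it turns negative, even though ten of the eleven points still lie on the same line.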

Finally, it is hard to believe that one or two factors with a correlation coefficient modulus close to one exist. If this were true, anyone could easily hack Google’s algorithms, and we would all be in position 1!

Frequently Asked Questions

Although we tried to answer most of the frequently raised questions above, here are some more for the more curious readers.

Why didn’t we use artificial neural networks (ANNs)?

Although artificial neural networks are perfect for tasks with a large number of variables, e.g. image recognition (where each pixel is a variable), they produce results that are difficult to interpret and don’t allow you to compare the weight of each factor. Besides, ANNs require a massive dataset and a huge number of features to produce reliable results, and the input data we had collected didn’t match this description.

Unlike Random Forest, where each decision tree votes independently and thus a high level of reliability is guaranteed, neural networks process data in one pot. There is nothing to indicate that using ANNs for this study would result in more accurate results.

Our main requirements for a research method were stability and the ability to identify the importance of the factors. That being said, Random Forest was a perfect fit for our task, which is proven by numerous ranking tasks of a similar nature, also implemented with the help of this algorithm.

Why are website visits the most important Google ranking factor?

Hands down, this was the most controversial takeaway of our study. When we saw the results of our analysis, we were equally surprised. At the same time, our algorithm was trained on a solid body of data, so we decided to double-check the facts. We excluded organic search, as well as social and referral traffic, and took into account only direct traffic, and the results were pretty much the same: the position distribution remained unchanged (the graphs on pp. 25-26 of the study illustrate this point).

To us, this finding makes perfect sense and confirms that Google prioritizes domains with more authority, as described in its Search Quality Evaluator Guidelines. Although it may seem that domain authority is just a lame excuse and a very vague and ephemeral concept, these guidelines dispel this myth completely. So, back in 2015 Google introduced this handbook to help estimate website quality and “reflect what Google thinks search users want.”

The handbook lists E-A-T, which stands for Expertise, Authoritativeness, and Trustworthiness, as an important webpage-quality indicator. Main content quality and amount, website information (i.e. who is responsible for the website), and website reputation all influence the E-A-T of a website. We suggest thinking of it in the following way: if a URL ranks in the top 10, by default, it contains content that is relevant to a user search query.

But to distribute the places between these ten leaders, Google starts to count additional parameters. We all know that there is a whole team of search quality raters behind the scenes, which is responsible for training Google’s search algorithms and improving the relevance of search results. As advised by the Google Search Quality Evaluator Guidelines, raters should give priority to high-quality pages and teach the algos to do so as well. So, the ranking algorithm is trained to assign a higher position to pages that belong to trusted and highly authoritative domains, and we think this may be the reason behind the data we received for direct traffic and for its importance as a signal. For more information, check out our EAT and YMYL: New Google Search Guidelines Acronyms of Quality Content blog post.

Domain reputation and E-A-T — Google Search Quality Evaluator Guidelines

What does this mean for you? Well, brand awareness (estimated, among other things, by your number of direct and social website visits) strongly affects your SEO.

Difference in Ranking Factors for Branded vs. Non-Branded Keywords

As you may have spotted, every graph from our study has a noticeable spike for the second position. We assume that this deviation is related to branded keywords. A domain will probably take the first position in the SERP for any search query that contains its branded keywords. And regardless of how well the website is optimized, it will rank number one anyway, so this has nothing to do with SEO efforts. This explains why ranking factors affect a SERP’s second position more than the first one.

We will definitely dig deeper into this topic in our future studies — stay tuned.

Conclusion: Understanding the Cause-and-Effect Relationship

There is no guarantee that if you improve your website’s metrics for any of the above factors your pages will start to rank higher. We conducted a very thorough study that allowed us to make reliable conclusions about the importance of these 12 factors to ranking higher in Google. Yet, this is just a reverse-engineering job well done, not a universal action plan — and this is what each and every ranking factor study is about. Nobody but Google knows all the secrets. We encourage you to take a closer look at our study, reconsider the E-A-T concept and get yourself a good, fact-based SEO strategy!


Online Marketing Specialist at SEMrush.
Thomas Herman
So, does website visits as #1 mean that basically the system favors the incumbents? IE it's hard to get into the top rankings because they're already dominated by a small number of sites who get the most visits because they are already "one of the top ranking sites"? And because it's "direct website visits" you can't even buy Adwords to get into the top rankings?
Xenia Volynchuk
Thomas Herman
Hi Thomas, thanks for your Q! My answer would be 'not exactly', and here is why. Of course, it would be very hard to revolt and jump from the 10th page to position 1. At the same time, the distribution of rankings between the first 20 pages is something one can try to influence, and succeed. Although domain authority as such and the number of direct visits mean a lot for higher rankings, these are not the only ranking factors and can't be considered in isolation. Google prioritizes users' needs, e.g. relevance of the content to their queries, source authority, user security and much more. Speaking of AdWords, you have every chance to get a sweet spot above organic results; however, one needs to remember that landing page quality (as a component of Quality Score) also affects the chances of your ad showing up and its position. So, the bottom line here is that all your digital activities should be user-centered -- Google knows how to distinguish it and will return the favour :)
"Direct website visits are the most important ranking factor" - This is only on websites using Google Analytics, I take it? Correct? Otherwise, Google could never know how many direct visits you get. With that being said, how do sites without analytics installed fare? Also, if this truly is the biggest ranking factor, perhaps you could write a post explaining the best way to increase direct visits?
Paul Gillooly
Even if there is no GA, Google still knows how much traffic this website gets because they also have Chrome.
Xenia Volynchuk
Paul Gillooly
Hey Paul, thanks for your questions! Although there is no public evidence that Google uses GA data to estimate direct website visits, the fact that Chrome sends part of users' data, including bookmarks, is well known (details can be found in its privacy policy). Since the Google guys don't put their cards on the table, we can only guess whether they use GA data or not. However, as described in the Search Quality Evaluator Guidelines, there is a precise workflow that allows the raters to judge domain/page authority, and it doesn't involve GA. As to increasing direct visits -- it's an excellent idea, definitely going to cover it in one of our next posts. Cheers
what about domain age :/ isn't a ranking factor
Xenia Volynchuk
Adeel Aquarious
Hi Adeel, thanks for your comment! We haven't analyzed domain age in this particular study, just the 12 factors listed above. However, this is definitely not the last ranking factors study, so maybe we'll have a look at domain age later. Stay tuned!
thanks, you are genius!!!
Ghayoor Shaikh
Thanks for the precious post on the ranking factors. I have seen the [link removed by moderator]. They sell all the tools to give needy freelancers the best access. By the way, Semrush is one of the most favorite tools used by SEOs.
Nishi Chandra
Nice...Number Driven Ranking Factors help us to understand quickly.
Xenia Volynchuk
Nishi Chandra
Happy you found it helpful, Nishi!
Xenia Volynchuk
Thanks, Luis!