In the second edition of the SEMrush Ranking Factors Study 2017 we’ve added 5 more backlink-related factors and compared the strength of their influence on a particular URL vs. an entire domain. As is our tradition, we offer you a deeper look at our methodology. Back in June, when the first edition of the study was published, many brows were raised in disbelief: direct website visits are usually assumed to be the result of higher SERP positions, not vice versa. And yet both editions of our study found site visits to be the most important Google ranking factor among those we analyzed. Moreover, the methodology we used was unique to the field of SEO studies: we traded correlation analysis for the Random Forest machine learning algorithm. As the ultimate goal of our study was to help SEOs prioritize tasks and do their jobs more effectively, we would like to reveal the behind-the-scenes details of our research and bust some popular misconceptions, so that you can safely rely on our takeaways.
Jokes aside, this post is for real nerds, so here is a short glossary:
Decision tree: a tree-like structure that represents a machine learning algorithm usually applied to classification tasks. It splits a training sample dataset into homogeneous groups/subsets based on the most significant of all the attributes.
Supervised machine learning: a type of machine learning algorithm that trains a model to find patterns in the relationship between input variables (features, A) and an output variable (target value, B): B = f(A). The goal of SML is to train this model on a sample of the data so that, when offered out-of-sample data, the algorithm is able to predict the target value precisely, based on the feature set offered. The training dataset plays the role of a teacher supervising the learning process. The training is considered successful and terminates when the algorithm achieves an acceptable performance quality.
Feature (or attribute, or input variable): a characteristic of a separate data entry used in analysis. For our study and this blog post, features are the alleged ranking factors.
Binary classification: a type of classification task that falls into the supervised learning category. The goal of this task is to predict a target value (=class) for each data entry, and for binary classification, it can be either 1 or 0 only.
Using the Random Forest Algorithm For the Ranking Factors Study
The Random Forest algorithm in its current form was developed by Leo Breiman and Adele Cutler and described in Breiman’s 2001 paper, building on the random decision forests that Tin Kam Ho proposed in 1995. It hasn’t undergone any major changes since then, which speaks to its quality and versatility: it is used for classification, regression, clustering, feature selection and other tasks.
Although the Random Forest algorithm is not very well known to the general public, we picked it for a number of good reasons:
It is one of the most popular machine learning algorithms and is known for its high accuracy. One of its most common applications is ranking the importance of variables (its nature suits this task well, as we’ll cover later in this post), so it seemed an obvious choice.
The algorithm treats data in a certain way that minimizes errors:
The random subspace method offers each learner a random sample of the features, not all of them. This helps ensure that the learner won’t be overly focused on a pre-defined set of features and won’t make biased decisions about an out-of-sample dataset.
The bagging, or bootstrap aggregating, method also improves precision. Its main point is to offer each learner not the whole dataset, but a random sample of the data.
Given that we do not have a single decision tree, but rather a whole forest of hundreds of trees, we can be sure that each feature and each pair of domains will be analyzed approximately the same number of times. Therefore, the Random Forest method is stable and operates with minimum errors.
The Pairwise Approach: Pre-Processing Input Data
We decided to base our study on a set of 600,000 keywords from the worldwide database (US, Spain, France, Italy, Germany and others), the URL position data for the top 20 search results, and a list of alleged ranking factors. As we were not going to use correlation analysis, we had to frame the problem as a binary classification task before applying the machine learning algorithm to it. This was done with the Pairwise approach, one of the most popular machine-learned ranking methods, used, among others, by Microsoft in its research projects.
The Pairwise approach implies that instead of examining the entire dataset at once, each SERP is studied individually: we compare all possible pairs of URLs (the first result on the page with the fifth, the seventh result with the second, etc.) with regard to each feature. Each pair is assigned a set of absolute values, where each value is the quotient obtained by dividing the feature value for the first URL by the feature value for the second URL. On top of that, each pair is also assigned a target value that indicates whether the first URL is positioned higher than the second one on the SERP (target value = 1) or lower (target value = 0).
- Each URL pair receives a quotient for each feature and a target value of either 1 or 0. This set of numbers will be used as a training dataset for the decision trees.
- We are now able to make statistical observations that certain feature values and their combinations tend to result in a higher SERP position for a URL. This allows us to build a hypothesis about the importance of certain features and make a forecast about whether a certain set of feature values will lead to higher rankings.
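The pairwise pre-processing described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function name `build_pairs`, the feature names and the sample numbers are all made up, ordered pairs are assumed (so both "first vs. fifth" and "fifth vs. first" appear), and the handling of a zero denominator is a placeholder because the post does not describe it.

```python
from itertools import permutations

def build_pairs(serp, feature_names):
    """Turn one SERP into pairwise training rows.

    serp: list of dicts, each with a 'position' (1 = top result) and
          numeric feature values.
    Returns a list of (quotients, target) tuples, one per ordered URL pair.
    """
    rows = []
    for first, second in permutations(serp, 2):
        quotients = []
        for name in feature_names:
            # Assumption: the post does not say how a zero denominator is
            # treated; infinity is used here purely as a placeholder.
            if second[name]:
                quotients.append(first[name] / second[name])
            else:
                quotients.append(float("inf"))
        # Target is 1 when the first URL ranks higher (smaller position number).
        target = 1 if first["position"] < second["position"] else 0
        rows.append((quotients, target))
    return rows

# Made-up SERP with two hypothetical features.
serp = [
    {"position": 1, "direct_visits": 1200, "referring_domains": 80},
    {"position": 2, "direct_visits": 600, "referring_domains": 160},
    {"position": 3, "direct_visits": 300, "referring_domains": 40},
]
rows = build_pairs(serp, ["direct_visits", "referring_domains"])
for quotients, target in rows:
    print(quotients, target)
```

Each row printed here corresponds to one URL pair: a list of per-feature quotients followed by the 1/0 target value.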
Growing the Decision Tree Ensemble: Supervised Learning
The dataset we received after the previous step is generic and can be used with any supervised machine learning algorithm. Our preferred choice was Random Forest, an ensemble of decision trees.
Before the trees can make any reasonable decisions, they have to train — this is when the supervised machine learning takes place. To make sure the training is done correctly and unbiased decisions about the main data set are made, the bagging and random subspace methods are used.
Bagging is the process of creating a training dataset by sampling with replacement. Let’s say we have X lines of data. According to bagging principles, we are going to create a training dataset for each decision tree, and this set will have the same number of X lines. However, each sample set will be populated randomly and with replacement, so it will include only approximately two-thirds of the original X lines, and some lines will be duplicated. About one-third of the original lines remain untouched and will be used once the learning is over.
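The "approximately two-thirds" figure follows from the sampling itself: the chance that a given line is picked at least once in X draws with replacement is 1 - (1 - 1/X)^X, which approaches 1 - 1/e (about 0.632) as X grows. A small simulation, assuming a made-up dataset size of 10,000 lines and a hypothetical helper named `bootstrap_sample`, shows the same effect:

```python
import random

def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement (one bag for one tree)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

rng = random.Random(42)   # fixed seed so the sketch is repeatable
n_rows = 10_000           # made-up dataset size
bag = bootstrap_sample(n_rows, rng)

in_bag = set(bag)                          # lines the tree trains on
out_of_bag = set(range(n_rows)) - in_bag   # lines the tree never sees

unique_share = len(in_bag) / n_rows
print(f"unique lines in the bag: {unique_share:.3f}")
print(f"lines left out of the bag: {len(out_of_bag) / n_rows:.3f}")
```

The lines left out of the bag are the ones that can later be used to check the tree on data it has not seen.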
We did a similar thing for the features using the random subspace method: the decision trees were trained on random samples of features instead of the entire feature set.
Not a single tree uses the whole dataset and the whole list of features. But having a forest of multiple trees allows us to say that every value and every feature is very likely to be used approximately the same number of times.
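To see why a large forest evens out feature usage, here is a rough sketch that hands each tree a random subset of features and counts how often each feature is drawn. The number of trees (500), the subset size (4) and the placeholder feature names are assumptions for illustration only; the post does not state the values used in the study.

```python
import random
from collections import Counter

# 17 placeholder names standing in for the ranking factors in the study.
features = [f"factor_{i}" for i in range(17)]

rng = random.Random(0)   # fixed seed so the sketch is repeatable
n_trees = 500            # assumption: forest size is not given in the post
subset_size = 4          # assumption: features offered to each tree

usage = Counter()
for _ in range(n_trees):
    # Each tree gets its own random sample of features, without repeats.
    usage.update(rng.sample(features, subset_size))

expected = n_trees * subset_size / len(features)
print(f"expected uses per feature: {expected:.1f}")
print(f"least used: {min(usage.values())}, most used: {max(usage.values())}")
```

With hundreds of trees, the least and most used features end up fairly close to the expected count, which is the point made in the paragraph above.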
Growing the Forest
Each decision tree repeatedly partitions the training sample dataset based on the most important variable and does so until each subset consists of homogeneous data entries. The tree scans the whole training dataset and chooses the most important feature and its precise value, which becomes a kind of pivot point (node) and splits the data into two groups. For one group, the condition chosen above is true; for the other, it is false (the YES and NO branches). All final subgroups (leaf nodes) receive an average target value based on the target values of the URL pairs that were placed into that subgroup.
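A single split step can be sketched as follows. The post does not name the criterion used to pick "the most important feature and its precise value", so this example assumes Gini impurity, a common choice for classification trees. The function names and the tiny dataset are made up.

```python
def gini(targets):
    """Gini impurity of a list of 0/1 targets (0.0 means fully homogeneous)."""
    if not targets:
        return 0.0
    p = sum(targets) / len(targets)
    return 2 * p * (1 - p)

def best_split(rows, targets):
    """Return (feature_index, threshold, weighted_impurity) of the best split.

    Every feature and every observed value is tried as a pivot; the YES
    branch holds rows with value <= threshold, the NO branch the rest.
    """
    best = None
    n = len(rows)
    for j in range(len(rows[0])):
        for threshold in sorted({row[j] for row in rows}):
            yes = [t for row, t in zip(rows, targets) if row[j] <= threshold]
            no = [t for row, t in zip(rows, targets) if row[j] > threshold]
            if not yes or not no:
                continue  # a split must put something on both branches
            score = (len(yes) * gini(yes) + len(no) * gini(no)) / n
            if best is None or score < best[2]:
                best = (j, threshold, score)
    return best

# Made-up pairwise rows: [quotient for feature 0, quotient for feature 1].
rows = [[2.0, 0.5], [4.0, 2.0], [0.5, 2.0], [0.25, 0.5], [3.0, 1.5], [0.8, 1.2]]
targets = [1, 1, 0, 0, 1, 0]
feature, threshold, impurity = best_split(rows, targets)
print(feature, threshold, impurity)
```

A full tree would apply `best_split` again to each branch until the subsets are homogeneous; that recursion is left out to keep the sketch short.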
Since the trees use the sample dataset to grow, they learn while growing. Their learning is considered successful and high-quality when a target percentage of correctly guessed target values is achieved.
Once the whole ensemble of trees is grown and trained, the magic begins — the trees are now allowed to process the out-of-sample data (about one-third of the original dataset). A URL pair is offered to a tree only if it hasn’t encountered the same pair during training. This means that a URL pair is not offered to 100 percent of the trees in the forest. Then, voting takes place: for each pair of URLs, a tree gives its verdict, aka the probability of one URL taking a higher position in the SERP compared to the second one. The same action is taken by all other trees that meet the ‘haven’t seen this URL pair before’ requirement, and in the end, each URL pair gets a set of probability values. Then all the received probabilities are averaged. Now there is enough data for the next step.
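The voting step can be sketched like this. The "trees" below are hypothetical stand-ins (a set of row indices seen in training plus a simple rule that returns a probability), not trained models; the aim is only to show that each URL pair is scored by the trees that never saw it and that the resulting probabilities are averaged.

```python
def oob_probabilities(trees, rows):
    """Average each row's probability over the trees that did not train on it.

    trees: list of (in_bag_indices, predict) pairs, where predict(row)
           returns the probability that the first URL outranks the second.
    """
    averaged = {}
    for i, row in enumerate(rows):
        votes = [predict(row) for in_bag, predict in trees if i not in in_bag]
        if votes:  # a row seen by every tree gets no out-of-sample estimate
            averaged[i] = sum(votes) / len(votes)
    return averaged

# Hypothetical stand-ins for trained trees: (rows seen in training, rule).
trees = [
    ({0, 1}, lambda row: 0.9 if row[0] > 1 else 0.2),
    ({1, 2}, lambda row: 0.7 if row[0] > 1 else 0.1),
    ({0, 2}, lambda row: 0.8 if row[1] < 1 else 0.3),
    ({1}, lambda row: 0.6 if row[0] > 1 else 0.4),
]
rows = [[2.0, 0.5], [0.5, 2.0], [3.0, 1.5]]  # made-up URL-pair quotients
probs = oob_probabilities(trees, rows)
for index, probability in probs.items():
    print(index, round(probability, 2))
```

In this toy setup the second pair is scored by a single tree, while the other two pairs are each scored by two trees, which mirrors the statement that a URL pair is not offered to 100 percent of the forest.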
Estimating Attribute Importance with Random Forest
Random Forest produces highly credible results when it comes to estimating attribute importance. The assessment is conducted as follows:
The attribute values are mixed up across all URL pairs, and these updated sets of values are offered to the algorithm.
Any changes in the algorithm’s quality or stability are measured (whether the percentage of correctly guessed target values remains the same or not).
Then, based on the values received, conclusions can be made:
If the algorithm’s quality drops significantly, the attribute is important. The steeper the drop in quality, the more important the attribute is.
If the algorithm’s quality remains the same, then the attribute is of minor importance.
The procedure is repeated for all the attributes. As a result, a rating of the most important ranking factors is obtained.
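The shuffling procedure above is often called permutation importance. Below is a minimal sketch under stated assumptions: the `predict` function is a hypothetical stand-in for the trained forest (it looks only at the first feature), the data is randomly generated, and quality is measured as the share of correctly guessed target values.

```python
import random

def accuracy(predict, rows, targets):
    """Share of rows whose predicted class matches the target value."""
    hits = sum(1 for row, t in zip(rows, targets) if predict(row) == t)
    return hits / len(rows)

def permutation_importance(predict, rows, targets, rng):
    """Return the drop in accuracy after shuffling each feature in turn."""
    baseline = accuracy(predict, rows, targets)
    drops = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        rng.shuffle(column)  # mix this feature's values up across all rows
        shuffled = [row[:j] + [value] + row[j + 1:]
                    for row, value in zip(rows, column)]
        drops.append(baseline - accuracy(predict, shuffled, targets))
    return drops

rng = random.Random(7)  # fixed seed so the sketch is repeatable
# Made-up data: the target depends on feature 0 only; feature 1 is noise.
rows = [[rng.uniform(0.1, 3.0), rng.uniform(0.1, 3.0)] for _ in range(1000)]
targets = [1 if row[0] > 1 else 0 for row in rows]

def predict(row):
    """Hypothetical stand-in for the trained forest."""
    return 1 if row[0] > 1 else 0

drops = permutation_importance(predict, rows, targets, rng)
print([round(d, 3) for d in drops])
```

Shuffling the feature the model relies on causes a clear drop in accuracy, while shuffling the noise feature changes nothing, so the first feature would be rated as the more important one.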
Why We Think Correlation Analysis is Bad for Ranking Factors Studies
We intentionally abandoned the general practice of using correlation analysis, and yet we still received quite a few comments like “Correlation doesn’t mean causation” and “Those don’t look like ranking factors, but more like correlations.” Therefore we feel this point deserves a separate section.
First and foremost, we would like to stress again that the initial dataset used for the study is a set of highly changeable values. As a reminder, we examined not one, but 600,000 SERPs. Each SERP is characterized by its own average attribute value, and this uniqueness is completely disregarded in the process of correlation analysis. That being said, we believe that each SERP should be treated separately and with respect to its originality.
Correlation analysis gives reliable results only when examining the relationship between two variables (for example, the impact of the number of backlinks on a SERP position). “Does this particular factor influence position?” is a question that can be answered quite precisely when only one impacting variable is involved. But are we in a position to study each factor in isolation? Probably not, as we all know that there is a whole bunch of factors that influence a URL position in a SERP.
Another quality criterion for correlation analysis is the spread of the correlation coefficients obtained. For example, if there is a lineup of correlation coefficients like -1, 0.3 and 0.8, then it is fair to say that there is one parameter that is more important than others. The closer the coefficient’s absolute value, or modulus, is to one, the stronger the correlation. If the coefficient’s modulus is under 0.3, such a correlation can be disregarded: the dependency between the two variables, in this case, is too weak to draw any trustworthy conclusions. For all the factors we analyzed, the correlation coefficient was under 0.3, so we had to drop this method.
One more reason to dismiss this analysis method was the high sensitivity of the correlation value to outliers and noise, and the data for various keywords contains a lot of both. Adding even a single extreme data entry to the dataset can change the correlation coefficient noticeably. Hence this metric is not viable in the case of multiple variables, e.g. in a ranking factors study, and can even lead to incorrect deductions.
Coming down to the final curtain, it is hard to believe that one or two factors with a correlation coefficient modulus close to one exist. If this were true, anyone could easily hack Google’s algorithms, and we would all be in position 1!
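The sensitivity to outliers mentioned above is easy to reproduce. The sketch below computes the Pearson correlation coefficient by hand on a small made-up dataset (backlink counts vs. SERP positions), then adds one extreme entry and recomputes. The numbers are invented for illustration and do not come from the study.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up data: more backlinks go with a better (smaller) position number.
backlinks = [10, 20, 30, 40, 50, 60, 70, 80]
positions = [18, 16, 15, 12, 10, 9, 6, 4]
r_clean = pearson(backlinks, positions)

# One extra entry: a page with a huge backlink count that still ranks 20th.
r_outlier = pearson(backlinks + [5000], positions + [20])

print(f"without the outlier: {r_clean:.2f}")
print(f"with the outlier:    {r_outlier:.2f}")
```

In this toy example a single added entry is enough to flip the sign of the coefficient, from strongly negative to positive.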
Frequently Asked Questions
Although we tried to answer most of the frequently raised questions above, here are some more for the more curious readers.
Where does the study dataset come from? Is it SEMrush data?
The traffic and user behavior data within our dataset is anonymized clickstream data that comes from third-party data providers. The data is accumulated from the behavior of over 100 million real internet users, and over a hundred different apps and browser extensions are used to collect it.
Why didn’t we use artificial neural networks (ANNs)?
Although artificial neural networks are perfect for tasks with a large number of variables, e.g. image recognition (where each pixel is a variable), they produce results that are difficult to interpret and don’t allow you to compare the weight of each factor. Besides, ANNs require a massive dataset and a huge number of features to produce reliable results, and the input data we had collected didn’t match this description.
Unlike Random Forest, where each decision tree votes independently and thus a high level of reliability is guaranteed, neural networks process data in one pot. There is nothing to indicate that using ANNs for this study would result in more accurate results.
Our main requirements for a research method were stability and the ability to identify the importance of the factors. That being said, Random Forest was a perfect fit for our task, which is proven by numerous ranking tasks of a similar nature, also implemented with the help of this algorithm.
Why are website visits the most important Google ranking factor?
Hands down, this was probably the most controversial takeaway of our study. When we saw the results of our analysis, we were equally surprised. At the same time, our algorithm was trained on a solid scope of data, so we decided to double-check the facts. We excluded the organic and paid search data, as well as social and referral traffic, and took into account only the direct traffic, and the results were pretty much the same: the position distribution remained unchanged (the graphs on pp. 40-41 of the study illustrate this point).
To us, this finding makes perfect sense and confirms that Google prioritizes domains with more authority, as described in its Search Quality Evaluator Guidelines. Although it may seem that domain authority is just a lame excuse and a very vague and ephemeral concept, these guidelines dispel this myth completely. So, back in 2015 Google introduced this handbook to help estimate website quality and “reflect what Google thinks search users want.”
The handbook lists E-A-T, which stands for Expertise, Authoritativeness, and Trustworthiness, as an important webpage-quality indicator. Main content quality and amount, website information (i.e. who is responsible for the website), and website reputation all influence the E-A-T of a website. We suggest thinking of it in the following way: if a URL ranks in the top 10, by default, it contains content that is relevant to a user search query.
But to distribute the places between these ten leaders, Google starts to count the additional parameters. We all know that there is a whole team of search quality raters behind the scenes, which is responsible for training Google’s search algorithms and improving the relevance of search results. As advised by the Google Quality Evaluator Guidelines, raters should give priority to high-quality pages and teach the algos to do so as well. So, the ranking algorithm is trained to assign a higher position to pages that belong to trusted and highly authoritative domains, and we think this may be the reason behind the data we received for direct traffic and for its importance as a signal. For more information, check out our EAT and YMYL: New Google Search Guidelines Acronyms of Quality Content blog post.
Here’s more: at the recent SMX East conference, Google’s Gary Illyes confirmed that ‘how people perceive your site will affect your business.’ And although this, according to Illyes, does not necessarily affect how Google ranks your site, it still seems important to invest in earning users’ loyalty: happy users = happy Google.
What does this mean to you again? Well, brand awareness (estimated, among other things, by your number of direct website visits) strongly affects your rankings and deserves as much of your effort as SEO.
Difference in Ranking Factors Impact on a URL vs a Domain
As you may have spotted, every graph from our study shows a noticeable spike for the second position. We promised to have a closer look at this deviation and thus added a new dimension to our study. The second edition covers the impact of the three most important factors (direct website visits, time on site and the number of referring domains) on the rankings of a particular URL, rather than just the domain that it resides on.
One would assume that the websites in the first position are the most optimized, and yet we saw that every trend line showed a drop in the first position.
We connected this deviation with branded keyword search queries. A domain will probably take the first position in the SERP for any search query that contains its branded keywords. And regardless of how well the website is optimized, it will rank number one anyway, so this has nothing to do with SEO efforts. This explains why ranking factors affect a SERP’s second position more than the first one.
To prove this, we decided to look at our data from a new angle: we investigated how the ranking factors impact single URLs that appear on the SERP. For each factor, we built separate graphs showing the distribution of URLs and domains across the first 10 SERP positions (please see pp. 50-54). Although the study includes graphs only for the top three most influential factors, the tendency that we discovered persists for other factors as well.
What does this mean to you as a marketer? When a domain is ranking for a branded keyword, many factors lose their influence. However, when optimizing for non-branded keywords, keep in mind that the analyzed ranking factors have more influence on the positions of the particular URL than on the domain on which it resides. That means that the rankings of a specific page are more sensitive to on-page optimization, link-building efforts and other optimization techniques.
Conclusion: How to Use the SEMrush Ranking Factors Study
There is no guarantee that, if you improve your website’s metrics for any of the above factors, your pages will start to rank higher. We conducted a very thorough study that allowed us to draw reliable conclusions about the importance of these 17 factors to ranking higher on Google SERPs. Yet, this is just a reverse-engineering job well done, not a universal action plan — and this is what each and every ranking factors study is about. No one but Google knows all the secrets. However, here is a workflow that we suggest for dealing with our research:
Step 1. Understand which keywords you rank for — do they belong to low, medium or high search volume groups?
Step 2. Benchmark yourself against the competition: take a closer look at the methods they use to hit the top 10 and at their metrics. Do they have a large number of backlinks? Are their domains secured with HTTPS?
Step 3. Using this study, pick and start implementing the optimization techniques that will yield the best results based on your keywords and the competition level on SERPs.
Once again, we encourage you to take a closer look at our study, reconsider the E-A-T concept and get yourself a good, fact-based SEO strategy!