
SEMrush Ranking Factors Study 2017 — Methodology Demystified


Xenia Volynchuk

In the second edition of the SEMrush Ranking Factors Study 2017, we've added five more backlink-related factors and compared the strength of their influence on a particular URL vs. an entire domain. As is our tradition, we offer you a deeper look at our methodology.

Back in June, when the first edition of the study was published, many brows were raised in disbelief: direct website visits are usually assumed to be the result of higher SERP positions, not vice versa. And yet site visits are exactly what our study confirmed, both times, to be the most important Google ranking factor among those we analyzed. Moreover, the methodology we used was unique in the field of SEO studies: we traded correlation analysis for the Random Forest machine learning algorithm.

As the ultimate goal of our study was to help SEOs prioritize tasks and do their jobs more effectively, we would like to reveal the behind-the-scenes details of our research and bust some popular misconceptions, so that you can safely rely on our takeaways.

SEMrush Ranking Factors Study 2017

Jokes aside, this post is for real nerds, so here is a short glossary:

Decision tree — a tree-like structure that represents a machine learning algorithm usually applied to classification tasks. It splits a training sample dataset into homogeneous groups/subsets based on the most significant of all the attributes.

Supervised machine learning — a type of machine learning algorithm that trains a model to find patterns in the relationship between input variables (features, A) and an output variable (target value, B): B = f(A). The goal of SML is to train this model on a sample of the data so that, when offered out-of-sample data, the algorithm can predict the target value precisely, based on the feature set offered. The training dataset plays the role of the teacher overseeing the learning process. The training is considered successful and terminates when the algorithm achieves an acceptable performance quality.

Feature (or attribute, or input variable) — a characteristic of a separate data entry used in analysis. For our study and this blog post, features are the alleged ranking factors.

Binary classification — a type of classification task that falls into the supervised learning category. The goal of this task is to predict a target value (= class) for each data entry; for binary classification, it can only be 1 or 0.
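A tiny end-to-end sketch can tie these glossary terms together. This is an illustrative example (not from the study) assuming scikit-learn is available; the feature values and classes are invented:

```python
# Hypothetical example: two features per data entry, binary target (1 or 0).
from sklearn.tree import DecisionTreeClassifier

X = [[5, 1], [7, 0], [2, 1], [1, 0], [8, 1], [3, 0]]  # features (input variables)
y = [1, 1, 0, 0, 1, 0]                                # target values (classes)

# Supervised learning: the training dataset acts as the "teacher".
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Binary classification of an out-of-sample data entry.
print(tree.predict([[6, 0]]))  # → [1]
```

Here the tree learns that the first feature separates the two classes, which is exactly the "split on the most significant attribute" behavior described above.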

Using the Random Forest Algorithm For the Ranking Factors Study

The Random Forest algorithm was developed by Leo Breiman and Adele Cutler (Breiman published the seminal paper in 2001). It hasn't undergone any major changes since then, which speaks to its quality and universality: it is used for classification, regression, clustering, feature selection and other tasks.

Although the Random Forest algorithm is not very well known to the general public, we picked it for a number of good reasons:

  • It is one of the most popular machine learning algorithms, known for its excellent accuracy. One of its foremost applications is ranking the importance of variables (its nature is perfect for this task, as we'll cover later in this post), so it seemed an obvious choice.

  • The algorithm treats data in a certain way that minimizes errors:

    1. The random subspace method offers each learner random samples of features, not all of them. This guarantees that the learner won’t be overly focused on a pre-defined set of features and won’t make biased decisions about an out-of-sample dataset.

    2. The bagging or bootstrap aggregating method also improves precision. Its main point is offering learners not a whole dataset, but random samples of data.

Given that we do not have a single decision tree, but rather a whole forest of hundreds of trees, we can be sure that each feature and each pair of domains will be analyzed approximately the same number of times. Therefore, the Random Forest method is stable and operates with minimum errors.
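As a sketch of how these two ideas fit together in practice, here is a minimal, hypothetical setup using scikit-learn's RandomForestClassifier (not the study's actual code): `bootstrap` corresponds to bagging, `max_features` to the random subspace method, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))          # 10 hypothetical features
y = (X[:, 0] + X[:, 3] > 0).astype(int)  # binary target

forest = RandomForestClassifier(
    n_estimators=500,      # "a whole forest of hundreds of trees"
    bootstrap=True,        # bagging: each tree sees a resampled dataset
    max_features="sqrt",   # random subspace: each split sees a feature subset
    oob_score=True,        # evaluate on the held-out (out-of-bag) rows
    random_state=0,
)
forest.fit(X, y)
print(round(forest.oob_score_, 2))
```

The out-of-bag score uses exactly the "about one-third of rows each tree never saw" property described below, so no separate test set is needed for a quick quality check.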

The Pairwise Approach: Pre-Processing Input Data

We have decided to base our study on a set of 600,000 keywords from the worldwide database (US, Spain, France, Italy, Germany and others), the URL position data for top 20 search results, and a list of alleged ranking factors. As we were not going to use correlation analysis, we had to conduct binary classification prior to applying the machine learning algorithm to it. This task was implemented with the Pairwise approach — one of the most popular machine-learned ranking methods used, among others, by Microsoft in its research projects.

The Pairwise approach implies that instead of examining an entire dataset, each SERP is studied individually: we compare all possible pairs of URLs (the first result on the page with the fifth, the seventh result with the second, etc.) with regard to each feature. Each pair is assigned a set of absolute values, where each value is the quotient of the feature value for the first URL divided by the feature value for the second URL. On top of that, each pair is also assigned a target value that indicates whether the first URL is positioned higher than the second one on the SERP (target value = 1) or lower (target value = 0).

Procedure outcomes:

  1. Each URL pair receives a set of quotients for each feature and a target value of either 1 or 0. This variety of numbers will be used as a training dataset for the decision trees.
  2. We are now able to make statistical observations that certain feature values and their combinations tend to result in a higher SERP position for a URL. This allows us to build a hypothesis about the importance of certain features and forecast whether a certain set of feature values will lead to higher rankings.
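A hypothetical sketch of the pairwise transformation for a single three-result SERP (feature names and numbers are invented):

```python
# Build (quotients, target) training rows from one SERP, as described above.
from itertools import permutations

serp = {  # position -> feature values (invented numbers)
    1: {"visits": 9000, "backlinks": 120},
    2: {"visits": 4000, "backlinks": 300},
    3: {"visits": 2500, "backlinks": 80},
}

rows = []
for (pos_a, feats_a), (pos_b, feats_b) in permutations(serp.items(), 2):
    quotients = {f: feats_a[f] / feats_b[f] for f in feats_a}
    target = 1 if pos_a < pos_b else 0  # 1 = first URL ranks higher
    rows.append((quotients, target))

print(len(rows))  # 6 ordered pairs from a 3-result SERP
```

Repeating this for every SERP in the dataset yields the quotient-plus-target rows the decision trees are trained on.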

Growing the Decision Tree Ensemble: Supervised Learning

The dataset we received after the previous step is absolutely universal and can be used for any machine learning algorithm. Our preferred choice was Random Forest, an ensemble of decision trees.

Before the trees can make any reasonable decisions, they have to be trained; this is when supervised machine learning takes place. To make sure the training is done correctly and that unbiased decisions are made about the main dataset, the bagging and random subspace methods are used.

Using the Random Forest algorithm for the ranking factors study

Bagging is the process of creating a training dataset by sampling with replacement. Let's say we have X lines of data. According to bagging principles, we are going to create a training dataset for each decision tree, and this set will have the same number of X lines. However, these sample sets are populated randomly and with replacement, so each set will include only approximately two-thirds of the original distinct lines, and some values will be duplicated. About one-third of the original lines remain untouched and will be used once the learning is over.
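The "about two-thirds" figure follows from sampling with replacement: the chance that a given line is never picked in N draws is (1 - 1/N)^N ≈ 1/e ≈ 36.8%. A quick simulation (illustrative only, not from the study):

```python
import random

random.seed(0)
N = 100_000
# Bootstrap sample: N draws with replacement from N line indices.
sample = [random.randrange(N) for _ in range(N)]
unique_share = len(set(sample)) / N
print(round(unique_share, 3))  # close to 1 - 1/e ≈ 0.632
```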

We did a similar thing for the features using the random subspace method: the decision trees were trained on random samples of features instead of the entire feature set.

Not a single tree uses the whole dataset or the whole list of features. But having a forest of multiple trees allows us to say that every value and every feature are very likely to be used approximately the same number of times.

Growing the Forest

Each decision tree repeatedly partitions the training sample dataset based on the most important variable and does so until each subset consists of homogeneous data entries. The tree scans the whole training dataset and chooses the most important feature and its precise value, which becomes a kind of pivot point (node) and splits the data into two groups. For one group the chosen condition is true; for the other it is false (the YES and NO branches). All final subgroups (leaf nodes) receive an average target value based on the target values of the URL pairs that were placed into each subgroup.

Since the trees use the sample dataset to grow, they learn while growing. Their learning is considered successful and high-quality when a target percentage of correctly guessed target values is achieved.

Once the whole ensemble of trees is grown and trained, the magic begins — the trees are now allowed to process the out-of-sample data (about one-third of the original dataset). A URL pair is offered to a tree only if it hasn’t encountered the same pair during training. This means that a URL pair is not offered to 100 percent of the trees in the forest. Then, voting takes place: for each pair of URLs, a tree gives its verdict, aka the probability of one URL taking a higher position in the SERP compared to the second one. The same action is taken by all other trees that meet the ‘haven’t seen this URL pair before’ requirement, and in the end, each URL pair gets a set of probability values. Then all the received probabilities are averaged. Now there is enough data for the next step.
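The voting step itself is just an average of per-tree probabilities. An invented example with four trees:

```python
# Each tree that hasn't seen this URL pair during training casts a vote:
# its estimate of P(first URL ranks higher than the second).
tree_votes = [0.80, 0.65, 0.90, 0.70]  # invented probabilities

pair_probability = sum(tree_votes) / len(tree_votes)
print(round(pair_probability, 4))  # 0.7625
```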

Estimating Attribute Importance with Random Forest

Random Forest produces extremely credible results when it comes to attribute importance estimation. The assessment is conducted as follows:

  1. The attribute values are mixed up across all URL pairs, and these updated sets of values are offered to the algorithm.

  2. Any changes in the algorithm’s quality or stability are measured (whether the percentage of correctly guessed target values remains the same or not).

  3. Then, based on the values received, conclusions can be made:

  • If the algorithm’s quality drops significantly, the attribute is important: the heavier the slump in quality, the more important the attribute.

  • If the algorithm’s quality remains the same, then the attribute is of minor importance.

The procedure is repeated for all the attributes. As a result, a rating of the most important ranking factors is obtained.
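The procedure above is widely known as permutation importance. A minimal sketch on synthetic data, assuming scikit-learn, with one deliberately useless feature:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))           # columns 0 and 1 matter, column 2 is noise
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Shuffle one attribute at a time and measure the drop in accuracy.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
print(result.importances_mean.round(2))  # the noise column lands near 0
```

A large accuracy drop when a column is shuffled marks an important attribute; no drop marks an unimportant one, matching the two cases above.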

Why We Think Correlation Analysis is Bad for Ranking Factors Studies

We intentionally abandoned the general practice of using correlation analysis, and we have still received quite a few comments like “Correlation doesn’t mean causation,” “Those don’t look like ranking factors, but more like correlations.” Therefore we feel this point deserves a separate paragraph.

First and foremost, we would like to stress again that the initial dataset used for the study is a set of highly changeable values. Remember that we examined not one, but 600,000 SERPs. Each SERP is characterized by its own average attribute values, and this uniqueness is completely disregarded in correlation analysis. We believe that each SERP should be treated separately and with respect to its originality.

Correlation analysis gives reliable results only when examining the relationship between two variables (for example, the impact of the number of backlinks on a SERP position). “Does this particular factor influence position?” — this question can be answered quite precisely, since only one influencing variable is involved. But are we in a position to study each factor in isolation? Probably not, as we all know that a whole bunch of factors influence a URL's position in a SERP.

Another quality criterion for correlation analysis is the variety of the received correlation ratios. For example, if there is a lineup of correlation ratios like (-1, 0.3 and 0.8), then it is fair to say that one parameter is more important than the others. The closer the ratio's absolute value, or modulus, is to one, the stronger the correlation. If the ratio's modulus is under 0.3, the correlation can be disregarded: the dependency between the two variables is too weak to support any trustworthy conclusions. For all the factors we analyzed, the correlation ratio was under 0.3, so we had to drop this method.
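For illustration, here is how such a weak correlation looks numerically. The data is synthetic (not the study's); numpy's `corrcoef` computes the Pearson ratio:

```python
import numpy as np

rng = np.random.default_rng(1)
position = np.arange(1, 21).repeat(100).astype(float)  # SERP positions 1..20
# A factor with a real but weak link to position, drowned in noise.
factor = -0.1 * position + rng.normal(scale=5.0, size=position.size)

r = np.corrcoef(position, factor)[0, 1]
print(abs(r) < 0.3)  # the |r| < 0.3 rule of thumb discards this relationship
```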

One more reason to dismiss this analysis method was the high sensitivity of the correlation value to outliers and noise, and the data for various keywords contains a lot of them. If one extra data entry is added to the dataset, the correlation ratio changes immediately. Hence this metric is not viable in the case of multiple variables, e.g. in a ranking factors study, and can even lead to incorrect deductions.

Finally, it is hard to believe that one or two factors with a correlation ratio modulus close to one exist; if they did, anyone could easily hack Google's algorithms, and we would all be in position 1!

Frequently Asked Questions

Although we tried to answer most of the frequently raised questions above, here are some more for the more curious readers.

Where does the study dataset come from? Is it SEMrush data?

The traffic and user behavior data within our dataset is the anonymized clickstream data that comes from third party data providers. The data is accumulated from the behavior of over 100 million real internet users, and over a hundred different apps and browser extensions are used to collect it.

Why didn’t we use artificial neural networks (ANNs)?

Although artificial neural networks are perfect for tasks with a large number of variables, e.g. image recognition (where each pixel is a variable), they produce results that are difficult to interpret and don’t allow you to compare the weight of each factor. Besides, ANNs require a massive dataset and a huge number of features to produce reliable results, and the input data we had collected didn’t match this description.

Unlike Random Forest, where each decision tree votes independently and thus a high level of reliability is guaranteed, neural networks process data in one pot. There is nothing to indicate that using ANNs for this study would result in more accurate results.

Our main requirements for a research method were stability and the ability to identify the importance of the factors. With that in mind, Random Forest was a perfect fit for our task, as proven by numerous ranking tasks of a similar nature that have also been implemented with this algorithm.

Why are website visits the most important Google ranking factor?

Hands down, this was probably the most controversial takeaway of our study. When we saw the results of our analysis, we were equally surprised. At the same time, our algorithm was trained on a solid scope of data, so we decided to double-check the facts. We excluded the organic and paid search data, as well as social and referral traffic, and took into account only direct traffic, and the results were pretty much the same: the position distribution remained unchanged (the graphs on pp. 40-41 of the study illustrate this point).

To us, this finding makes perfect sense and confirms that Google prioritizes domains with more authority, as described in its Search Quality Evaluator Guidelines. Although domain authority may seem a vague and ephemeral concept, these guidelines dispel that impression completely. Back in 2015, Google introduced this handbook to help estimate website quality and “reflect what Google thinks search users want.”

The handbook lists E-A-T, which stands for Expertise, Authoritativeness, and Trustworthiness, as an important webpage-quality indicator. Main content quality and amount, website information (i.e. who is responsible for the website), and website reputation all influence the E-A-T of a website. We suggest thinking of it in the following way: if a URL ranks in the top 10, by default, it contains content that is relevant to a user search query.

But to distribute the places among these ten leaders, Google starts counting additional parameters. We all know that there is a whole team of search quality raters behind the scenes, responsible for training Google's search algorithms and improving the relevance of search results. As advised by the Google Quality Evaluator Guidelines, raters should give priority to high-quality pages and teach the algorithms to do so as well. So, the ranking algorithm is trained to assign a higher position to pages that belong to trusted and highly authoritative domains, and we think this may be the reason behind the data we received for direct traffic and its importance as a signal. For more information, check out our EAT and YMYL: New Google Search Guidelines Acronyms of Quality Content blog post.

Domain reputation and E-A-T — Google Search Quality Evaluator Guidelines

Here’s more: at the recent SMX East conference, Google’s Gary Illyes confirmed that ‘how people perceive your site will affect your business.’ And although this, according to Illyes, does not necessarily affect how Google ranks your site, it still seems important to invest in earning users’ loyalty: happy users = happy Google.

What does this mean for you? Well, brand awareness (estimated, among other things, by your number of direct website visits) strongly affects your rankings and deserves effort on par with SEO.

Difference in Ranking Factors Impact on a URL vs a Domain

As you may have spotted, every graph from our study shows a noticeable spike for the second position. We promised to have a closer look at this deviation and thus added a new dimension to our study. The second edition covers the impact of the three most important factors (direct website visits, time on site and the number of referring domains) on the rankings of a particular URL, rather than just the domain that it resides on.

One would assume that the websites on the first position are the most optimized, and yet we saw that every trend line showed a drop on the first position.

We connected this deviation with branded keyword search queries. A domain will probably take the first position in the SERP for any search query that contains its branded keywords. No matter how well a website is optimized, it will rank number one for those queries anyway, so this has little to do with SEO efforts. This explains why ranking factors affect a SERP's second position more than the first one.

To prove this, we decided to look at our data from a new angle: we investigated how the ranking factors impact single URLs that appear on the SERP.  For each factor, we built separate graphs showing the distribution of URLs and domains across the first 10 SERP positions (please see pp. 50-54). Although the study includes graphs only for the top three most influential factors, the tendency that we discovered persists for other factors as well.  

What does this mean for you as a marketer? When a domain ranks for a branded keyword, many factors lose their influence. However, when optimizing for non-branded keywords, keep in mind that the analyzed ranking factors have more influence on the position of a particular URL than on the domain on which it resides. That means the rankings of a specific page are more sensitive to on-page optimization, link building and other optimization techniques.

Conclusion: How to Use the SEMrush Ranking Factors Study

There is no guarantee that, if you improve your website’s metrics for any of the above factors, your pages will start to rank higher. We conducted a very thorough study that allowed us to draw reliable conclusions about the importance of these 17 factors to ranking higher on Google SERPs. Yet, this is just a reverse-engineering job well done, not a universal action plan — and this is what each and every ranking factors study is about. No one but Google knows all the secrets. However, here is a workflow that we suggest for dealing with our research:

  • Step 1. Understand which keywords you rank for — do they belong to low, medium or high search volume groups?

  • Step 2. Benchmark yourself against the competition: take a closer look at the methods they use to hit the top 10 and at their metrics. Do they have a large number of backlinks? Are their domains secured with HTTPS?

  • Step 3. Using this study, pick and start implementing the optimization techniques that will yield the best results based on your keywords and the competition level on SERPs.

Once again, we encourage you to take a closer look at our study, reconsider the E-A-T concept and get yourself a good, fact-based SEO strategy!

Community Education Manager at SEMrush and SEMrush All Stars group admin. Throw your questions and product feedback at me!
This is an awesome study and research work you guys have done. It made me learn something really useful after quite some time. Would you like to review the SEO strategy and keyword research of zumexo dot com? We might need your expertise.

Team Zumexo
Interesting. Can I ask where you get your data on bounce rate?
Sasa Rebic
Hi John,

this is a great study. I agree with most of what is written, except about brand keywords. Our SEO agency Net Vision from Serbia has had a very different experience. Namely, all the clients we have been doing link building for in order to strengthen the site's authority (brand linking) are 5 times more advanced than other sites. In 3 months, we positioned one clinic (Atlas Clinic) for 55 keywords in the top 3 positions, in addition to 3 strong competitors. I also think Google has said it's working on branding. Am I right or wrong?
Not sure where my last comment went, but here it is again:

I read several docs on random forest, but it doesn't make sense in this application (detailed examples are missing).

This still looks very correlational to me. What you should do is test a bunch of things that you already know the answer to, to see if your methodology holds up. Specifically, things that are known to be wrong, but in increasing use by higher-ranking sites. That way, if your report says "more social icons on a page lead to higher-ranked pages," then we can say this method doesn't tell us much that is useful.

For example: sites with more social network icons, or social shares; larger page size; other design-related stuff; server response time; number of IPs hosted on the server (since top-ranking sites are probably more likely to be on dedicated hosting); etc. So, for example, if you show that higher-ranking sites are more likely to have a dedicated IP, then this whole report is not useful, to me at least.

I would say meta keywords, but no one uses those anymore. Age of domain probably wouldn't help either.
Xenia Volynchuk
john smith
John, I respectfully disagree with you on the Random Forest algo being an inappropriate method. It was created to estimate the importance of variables (as stated in the original work by Leo Breiman and several other materials). And although this method hasn't been used for SEO ranking factors studies before, nothing makes us think that it doesn't suit this purpose.

It may seem correlational, but it isn't. Yes, we chose the list of factors to 'feed' to the Random Forest, but we've been very open about this and stated that we explore some of the most controversial and most discussed ones. To pick these factors, we listened to the community and to common sense. That's why your examples may not be very illustrative. Also, we admit that if we add more factors next time, the results might be different.

Re the direct traffic: yes, it may seem strange, but if you think of it as an indicator of users' trust and effective PR efforts, it won't look that bizarre anymore.

Best, Xenia
Martin Woods
Thanks for the in-depth SEO study, very useful.

One question - What would happen if you removed brand name keywords from your study in terms of the relative importance of the 17 factors? I suspect that you would see some very different results, for example 'Direct website visits' may be a very important factor for a brand name keyword, but for other keywords, I would imagine that it's a lot less important.

You've obviously included a lot of brand names within the 600,000 keywords, otherwise you wouldn't have seen such a significant drop in your graphs for the 1st position, as you explain on page 50 of the report. Typically my clients all already rank 1st for their own brand name, as most sites do (e.g. our company is Indigoextra, so we naturally rank 1st for "Indigoextra" with Google).

You could probably filter this by simply removing any keywords that are part of the domain name. I don't know how much work it would be, but I'd be extremely interested in the result.
Roman Delcarmen
Interesting Study. I will read it carefully....thanks guys
Xenia Volynchuk
Roman Delcarmen
Thanks, Roman! If you have any questions / thoughts after reading, we'd be happy to discuss!
Thanks for an interesting study. I have a question about the definition of Direct page visits (p. 51): does this include only direct traffic entries on a given page, or does it also include direct traffic that may have entered on another page (e.g. the homepage) and then navigated to a given page?
Xenia Volynchuk
Britt hult
Hi Britt! Thanks for reading and asking questions. In our study, we opted to stick to Google Analytics' definitions, which means that page traffic is counted based on sessions. Each session can include several page views, which share the same source (= the first page view's source). So yes, direct traffic to a given page may include those who entered on another page (e.g. the homepage) and then navigated to the given page.
Christian Højbo Møller
Hi guys

Great job. I am glad you are continuing to do these, and even happier that you tried to innovate by stepping away from correlation analysis.

However, I am wondering why you guys aren't featuring the Random Forest Machine Learning results? Why are you sticking to the graphs as indicators when it sounds like you have done some serious statistical work with Random Forest?

All the best, Christian
Christian Højbo Møller
Christian Højbo Møller
Hi SEMRush

Should I assume your silence as a way of saying: "The results in Random Forest said nothing meaningful."?
Xenia Volynchuk
Christian Højbo Møller
Hi Christian! No, it just means I've been on vacation :) Basically, the results of processing the data with the Random Forest algo can be seen right in the study. We haven't included any interim results and even had to get rid of some less interesting graphs to keep our e-book readable. Hope it makes sense!
Christian Højbo Møller
Xenia Volynchuk
Hi Xenia

Are you talking about the 17 variables ranked from "not important" to "very important" on page 8 / 55?

All the best, Christian
Xenia Volynchuk
Christian Højbo Møller
Christian — yes. This factor distribution gave us the base for our further research.
Ajay Rai
A really great piece of information which will be helpful in enhancing my skills.

Keep sharing
Xenia Volynchuk
Ajay Rai
Thanks so much, Ajay! Glad it was helpful.
HI Xenia,

Did you also look into the type of industry, i.e. information-based websites vs. e-retailer websites? A high bounce rate for an information-based website isn't necessarily a bad indicator. For instance, if someone searches for “what is the distance between X and Y”, he/she would land on a specific page, get the information (Z miles) and exit the page/site. In this scenario, the user gets what he/she wants and therefore has no need to navigate to any other pages within the website.

However, the same can't be said for an e-retail website (or any website with the aim of making money), where a high bounce rate can be seen as a negative indicator. So if 70% of your keywords/analysis were information-based and only 30% e-retail, then I would imagine your bounce rate conclusion would be somewhat diluted, wouldn't you agree?

Also, did you look into the users' intent as high intent related keywords prove to be more profitable despite their low volume and potentially quicker to rank for compared to high volume keywords (top of the funnel) which could drive more traffic with minimal impact on conversion rate, depending on the industry?
Xenia Volynchuk
Stephen Andrews
Hi Stephen, thanks for your questions and comments! We haven't looked at the separate industries in our study — the dataset consists of 600,000+ randomly picked keywords, and we split the results by keyword search volume instead. Both are correct, just different approaches :)

I completely agree that some signals may be more relevant to one industry and less relevant to another, and your example with the bounce rate illustrates that well. As we suggest here, one should look at the outcomes of the study and benchmark themselves against the competition, and base any further actions on both of these, not just one. This workflow allows you to take industry specificity into account.

We haven't looked at the users' intent, however this would definitely be an interesting thing to investigate and makes perfect sense.

Overall, I liked your ideas; we'll see whether we can incorporate those new ways of segmentation into our next research. Thank you!
Hmm, I've got to agree with Nick. This isn't a causal study :( It's very much a correlational one. Although you're not using Pearson or Spearman, the conclusion shouldn't be a causal one.
Xenia Volynchuk
Thanks for your feedback, Timothy! Although I still wouldn't call it a correlational study, I admit that we should improve our interpretation in future research.
Nick Li
First, let me apologise if my tone sounded harsh in my previous comments; I may have transferred part of my frustration, because I clicked on a hyperlink on this page and had to type everything again.

I did more reading on Random Forest for variable importance ranking, and as quoted from a statistician's blog, being an important variable means “variables that they believe to be important in predicting the outcome”. It only means this variable has higher prediction value for the outcome based on the trained algo; it does not suggest that one causes the other.

If you train an algo to look at different factors and a person's wealth, you may find the value of real estate a person owns is the most “important factor”, meaning that if A has real estate worth more than B's, A is highly likely wealthier than B. It does not mean buying more real estate leads to wealth, nor does it suggest wealthier people tend to buy more real estate (though it is likely one way or another). I did not say your data is wrong, just that the interpretation is misleading. In the case of direct traffic, it is equally possible that websites have built brand awareness by ranking higher, and therefore have higher direct traffic because of their higher ranking.

While I accept that the study aims to provide insight to SEO technicians who have already pushed their page into the top 20 and are looking for factors that help them further, Google does not have a separate algo for the top 20 results compared to the rest of the 100k+ results in that SERP, if not 10M+. Only meaningful sampling can make your statistics significant.

Sidenote: I believe the comparison between volume groups is totally legit, as 600k/4 = 150k samples in each ranking position, and the comparison statistically makes sense; it would be better if a statistical test were applied (t-test or ANOVA).
Xenia Volynchuk
Nick Li
Thanks for doing additional research, Nick! And sorry about that link clicking incident — I know how annoying it may be :) Thanks for the reference to the Andrew Landgraf's blog, I see what you mean.

Re the Google algo: we're not claiming that Google has a different algo for page 1 and page 100. We intentionally showed the influence of the analyzed factors on positions 1 to 20 only, as we wanted to offer some advanced insights rather than just pointing to basic things like fixing tech mistakes, eliminating toxic links, etc.

Thanks again for all your comments — I now have a lot to discuss with my team here at SEMrush. I think I got your point on the interpretation of the data. And if you have in mind any scientific methods that would be more suitable for resolving the ranking factors task, please let me know!
Nick Li
This study is a "statistical" study. While I am not a statistician, I am not convinced that the methodology chosen for this study is appropriate, nor the conclusions drawn from it.

Data Sample: Why 600,000 SERPs but looking at the top 20 only? The SERP for any given keyword will return at least 100k results; 20 out of 100k is less than 0.02% of the population, which is such a small sample that it does not represent the whole picture. On the other hand, you only need >1,000 randomly chosen keywords from each traffic volume group to be representative of that group.

Correlation vs Clustering: Random Forest is a method for clustering studies. Correlation studies look at the correlation between factors (i.e. alleged ranking factors) and the observation (i.e. ranking), while clustering is meant to group individuals with a similar configuration (e.g. websites with keywords in the title but not in the content) and check whether there is a pattern (e.g. all ranked in the top 20). The score used for the "importance" of a factor is exactly what "correlation" means: whether a factor changes the positioning of the website in the cluster.

Interpretation: Therefore, what you have presented is, once again, the correlation of these factors with the ranking. That means direct browsing is strongly and positively correlated with ranking; the data did not suggest that it, or any other factor investigated, is the cause of a higher ranking.

I think you should consult a statistician before carrying out this research again in the future; it really is not convincing when there is no error bar for an average of 600,000 data points on a graph.
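Error bars for an average are cheap to compute. A minimal sketch (synthetic numbers, purely illustrative) of the standard error and an approximate 95% confidence interval for a per-position mean:

```python
import numpy as np

rng = np.random.default_rng(1)
values = rng.normal(loc=50, scale=10, size=600_000)  # stand-in for one factor

mean = values.mean()
sem = values.std(ddof=1) / np.sqrt(len(values))  # standard error of the mean
ci = (mean - 1.96 * sem, mean + 1.96 * sem)      # ~95% confidence interval
print(f"mean = {mean:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

With 600,000 points the interval is tiny, which is exactly why showing it would strengthen the charts rather than clutter them.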

If the top 20 richest men all drink red wine, play golf, and work 12 hours a day, your conclusion that keywords in the title and description don't matter is like saying hard work is not a factor because all of them work 12 hours, while concluding that direct traffic (drinking wine and playing golf) is the most important factor because there is a correlation.
Xenia Volynchuk
Nick Li
Hey Nick, thanks for taking the time to carefully read the study and share your thoughts! May I ask why you think the methodology we chose is inappropriate for this study? The Random Forest algo was created to solve not only clustering tasks, but also 'ranking the importance of variables' (that's a Wikipedia quote), so it seems natural to use it to estimate the importance of factors.

Data set: we looked at the top 100 positions, but are showing the results for the top 20 positions only. Although your math is 100% correct, let's be honest: normally people don't go further than page 2, so analyzing what's happening on page 100 doesn't make much sense. Also, we're pretty open about the goal of our study. It is aimed at those who are done with on-page and technical SEO and are looking for some advanced techniques that may work when you already rank in the top 10-20.

With all due respect, I can't agree with your comments on the interpretation. Direct visits were named the #1 ranking factor because once we fed the list of alleged factors and the data set to the machine learning algo, that was what it returned. And the takeaway here is that investing in brand awareness (i.e. PR and similar activities) will pay off with a higher number of direct visits. It's also important to keep in mind that this factor has a much stronger influence on the high-volume keyword group.

I hope this makes sense, and I'm happy to discuss further. Thanks again for your comments and tips.
Hi there, great study. How do you measure "website visits"? Google Analytics? Or is this SEMrush traffic data? How did you pick the 600,000 keywords? Randomly? With some criteria? Is there a list of your "features"? Thanks :)
Xenia Volynchuk
Syl Vain
Hi there! Thanks for your questions — great to see that this post provoked some thoughts! We used clickstream data to look at the number of direct visits. As to the keywords, yes, they were picked randomly to ensure the results are unbiased, and we split them in two ways (by search volume and by keyword length) to show the different angles and provide more opportunities to use this data in one's SEO strategy. Hope it helps!
Thomas Herman
So, does website visits as #1 mean that basically the system favors the incumbents? IE it's hard to get into the top rankings because they're already dominated by a small number of sites who get the most visits because they are already "one of the top ranking sites"? And because it's "direct website visits" you can't even buy Adwords to get into the top rankings?
Xenia Volynchuk
Thomas Herman
Hi Thomas, thanks for your question! My answer would be 'not exactly', and here is why. Of course, it would be very hard to revolt and jump from the 10th page to position 1. At the same time, the distribution of rankings across the first 20 positions is something one can influence and succeed at. Although domain authority and the number of direct visits mean a lot for higher rankings, they are not the only ranking factors and can't be considered in isolation. Google prioritizes users' needs, e.g. relevance of the content to their queries, source authority, user security, and much more. Speaking of AdWords, you have every chance to get a sweet spot above the organic results; however, one needs to remember that landing page quality (as a component of the Quality Score) also affects the chances of your ad showing up and its position. So, the bottom line here is that all your digital activities should be user-centered. Google knows how to distinguish this and will return the favour :)
"Direct website visits are the most important ranking factor" - this applies only to websites using Google Analytics, I take it? Correct? Otherwise, Google could never know how many direct visits you get. With that being said, how do sites without Analytics installed fare? Also, if this truly is the biggest ranking factor, perhaps you could write a post explaining the best way to increase direct visits?
Paul Gillooly
Even if there is no GA, Google still knows how much traffic a website gets, because they also have Chrome.
Xenia Volynchuk
Paul Gillooly
Hey Paul, thanks for your questions! Although there is no public evidence that Google uses GA data to estimate direct website visits, the fact that Chrome sends some user data, including bookmarks, is well known (details can be found in its privacy policy). Since the Google guys don't put their cards on the table, we can only guess whether they use GA data or not. However, as described in the Search Quality Evaluator Guidelines, there is a precise workflow that allows the raters to judge domain/page authority, and it doesn't involve GA. As to increasing direct visits: it's an excellent idea, and we're definitely going to cover it in one of our next posts. Cheers
What about domain age :/ isn't it a ranking factor?
Xenia Volynchuk
Adeel Aquarious
Hi Adeel, thanks for your comment! We haven't analyzed domain age in this particular study, just the 12 factors listed above. However, this is definitely not the last ranking factors study, so maybe we'll have a look at domain age later. Stay tuned!
Thanks, you are a genius!!!
Thanks for the precious post on the ranking factors
Ghayoor Shaikh
Thanks for the precious post on the ranking factors. I have seen the [link removed by moderator]. They sell all the tools to give the best access to needy freelancers. By the way, SEMrush is one of the most favorite tools used by all SEOs.
Nishi Chandra
Nice... number-driven ranking factors help us understand quickly.
Xenia Volynchuk
Nishi Chandra
Happy you found it helpful, Nishi!
Xenia Volynchuk
Thanks, Luis!
