How can you become a mini data scientist?
I previously had a go at answering this question in another article I wrote for the Semrush blog, “How to be a Mini SEO Data Scientist.” In the end, that article was a bit underwhelming by my own standards; I dove into social media and whether shares affected traffic.
So, this time, I decided to start again. In this article, I will focus on SEO and data science.
My Background in SEO and Data Science
To give you some bearings here, I own an SEO agency which focuses on competitive verticals. We get involved with some quite "hardcore" clients who know what they’re talking about and are very demanding in the right way (i.e. they want results).
When we get results, budgets are effectively unlimited; in these sectors, the revenues our clients earn when we achieve agreed-upon goals are massive.
As a result, there is a constant pressure to improve systems and processes, especially around link acquisition.
In my “SEO data science” journey, I’ve worked with a PhD economist specializing in econometrics and a PhD data scientist. I’ve also worked with three specialist data analysts.
My Moment of Realization
Having burned a lot of money working with people who had the best intentions but the wrong knowledge, I’ve had to accept that I’m the one who understands the data we're working with and it’s up to me to tell these data people what to do.
Now that we have the right data skills within the business, it's healthy again. Looking back, success boils down to:
- Having a specific requirement/pain point.
- Asking a good question.
- Being clear on the ideal outcome for the question.
- Understanding the meaning and relative importance of the metrics you’re dealing with.
- Having an answer to the question.
- Being able to use Excel properly so you can test the hypothesis by manipulating the data you have.
It all sounds very simple of course. But it’s not. And where I had gone wrong in the past was that:
- I had a good hypothesis.
- I gave it to the data person who didn’t understand the metrics and what they meant.
- Everything fell over.
It’s also important not to confuse this with data harvesting, which is a very different thing. Tools like SEO Tools for Excel, a data grabber, or SEOgadget for Excel just make the process of gathering data easier.
Note: 50 MB of data doesn’t really do much for you; it’s the insight and actionable outcomes you get from the data which turns into results for your clients.
Data Science: A Case Study
I’ll share a particular journey I went through recently with this whole idea of outcome, hypothesis, metrics and Excel.
The Specific Requirement
As I mentioned earlier, link acquisition is a huge thing for us. Of course, I wouldn’t like you to think that we're all about links, because links are gradually diminishing in importance and engagement seems to be filling that vacuum. We're about sustainable results however they come.
The Pain Point
One of our great pain points is finding enough good and relevant websites to reach out to. The better our content, the more likely we are to earn guest post placements with these websites. But, finding the most relevant websites is really hard work and involves a lot of manual review.
The underlying driver is that site owners are always afraid of Google penalizing them for taking on irrelevant guest posts with links to sites within competitive verticals.
Therefore, you must:
- Find relevant websites.
- Give them genuinely really good content.
- Ensure they appreciate this gesture.
- Make sure any link is editorially justifiable.
In short, you reduce the fear people have of Google and you get a link placement that's worth something. Finding relevant websites efficiently is a huge deal.
Asking Good Questions
- How can I find relevant websites easily?
- How can I do this at scale?
- What is a relevant website?
- Can I make this a sortable list?
If these questions aren’t answered, the project will fall over. If you don’t know what you’re looking for, how will you ever find it? Once you know what you’re after, the rest is fairly straightforward.
Mainly, success happens when:
- You understand what you want to achieve.
- You know what you need to know in order to achieve it.
Clarity on Ideal Outcome
Of course, the ideal outcome is that I can harvest websites, run them through some sort of algorithm and find out which of these websites is most relevant to me, at a low cost for the data.
On a bigger level, I want to reduce human input into the process so I can standardize it more and create far greater efficiencies.
Understanding the Data and Metrics
As you know, there are two questions:
- What sites are relevant?
- How do you find relevant sites?
To make progress, I needed:
- A solid indicator for relevance.
- For that indicator to be applied to as many websites as possible.
- One that was cheap enough for me to use at scale.
For a long time, I thought about Majestic Topical Trust Flow. But I’ve never been able to really get my head around it because in many ways the categorizations didn't make sense to me.
An aside: Majestic is a link analysis tool and Topical Trust Flow is a metric Majestic uses for describing topical characteristics of a website.
Going back to my question about what sites are relevant: as long as Majestic’s Topical Trust Flow categorization is consistent for every website it interrogates, then the metric should be pretty robust.
But there are lots of different categories for the same website, and each category has a weighting. So, for example, one website I looked at had a high confidence that it’s about society and government and a relatively low confidence that it’s about computers and the Internet.
I needed to come up with a way of easily identifying which websites are most relevant to my clients' outreach targets.
Having Answers to Your Questions
In this case, the answer is to build a seed list of websites I know are relevant to my outreach project and come up with a way of scoring the websites by Topical Trust Flow relevance.
A good way of getting a seed list together is to get the best backlinks for websites that rank really well in the vertical I’m interested in. By “best,” I mean links that have lots of Majestic Trust Flow.
A small sample of the best backlinks.
Now I’ve got a list of websites that I’m confident represent the kinds of sites I want to get links from.
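To make the seed-list step concrete, here's a minimal Python sketch. It assumes you've exported backlink data to CSV; the column names (`SourceURL`, `TrustFlow`) and the sample rows are illustrative, not Majestic's actual export schema.

```python
import csv
import io
from urllib.parse import urlparse

# Hypothetical backlink export for illustration only -- the columns
# and rows below are assumptions, not a real Majestic export.
SAMPLE_EXPORT = """SourceURL,TrustFlow
https://example-news.com/article-1,45
https://example-news.com/article-2,41
https://tiny-blog.net/post,4
https://gov-resource.org/page,60
"""

def build_seed_list(csv_text, min_trust_flow=30):
    """Keep the root domains of backlinks above a Trust Flow threshold."""
    seeds = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if int(row["TrustFlow"]) >= min_trust_flow:
            seeds.add(urlparse(row["SourceURL"]).netloc)
    return sorted(seeds)

print(build_seed_list(SAMPLE_EXPORT))
# ['example-news.com', 'gov-resource.org']
```

The threshold is the judgment call here: set it too low and the seed list dilutes the "best backlinks" signal; too high and you won't have enough sites to aggregate.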
All of these websites have categorizations and weightings for each category. What I need to do is aggregate the whole lot into one simple list like this:
Weightings for different categories
Earlier, I pointed out that categorizations have weightings.
I've gone through the worksheet and added up the weightings for each category, and from that I have a sortable list. This list has 497 categories (apparently there are around 1000 Topical Trust Flow categorizations).
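The aggregation step above can be sketched in a few lines of Python. The sites, categories, and weightings below are made up for illustration; in practice each entry would be one seed site's Topical Trust Flow data.

```python
from collections import Counter

# Hypothetical per-site Topical Trust Flow data: each seed site maps
# categories to weightings (sites and numbers are invented).
seed_site_categories = {
    "example-news.com": {"Society/Government": 40, "News/Media": 25},
    "gov-resource.org": {"Society/Government": 55, "Reference/Education": 10},
    "policy-blog.com": {"Society/Government": 30, "Computers/Internet": 5},
}

def build_master_list(site_categories):
    """Sum the weightings for each category across all seed sites and
    return (category, total) pairs sorted heaviest-first."""
    totals = Counter()
    for categories in site_categories.values():
        totals.update(categories)
    return totals.most_common()

for category, weight in build_master_list(seed_site_categories):
    print(category, weight)
# Society/Government 125
# News/Media 25
# Reference/Education 10
# Computers/Internet 5
```

This is exactly the "add up the weightings for each category" pass described above, just automated instead of done by hand in a worksheet.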
Now that I have a weighting against a category, the rest is easy. I just have to go through lists of sites and order them by relevance to my master list of categories.
As long as Majestic is consistent in the way it categorizes websites, then everything should work fine.
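One simple way to turn the master category list into a sortable relevance score is a dot product: multiply each of a candidate site's category weightings by that category's total weight in the master list, and sum. A minimal sketch, with invented sites and numbers:

```python
# Master category weights from the seed-site aggregation (illustrative).
master_weights = {
    "Society/Government": 125,
    "News/Media": 25,
    "Computers/Internet": 5,
}

def relevance_score(site_categories, master=master_weights):
    """Higher scores mean the site's topical profile overlaps more
    heavily with the categories that dominate the seed list."""
    return sum(weight * master.get(cat, 0)
               for cat, weight in site_categories.items())

candidates = {
    "civic-journal.org": {"Society/Government": 35, "News/Media": 20},
    "gadget-reviews.net": {"Computers/Internet": 45, "Shopping": 30},
}

# Sort candidates by relevance, most relevant first.
ranked = sorted(candidates, key=lambda s: relevance_score(candidates[s]),
                reverse=True)
print(ranked)
# ['civic-journal.org', 'gadget-reviews.net']
```

Categories that never appear in the master list contribute nothing, which is what pushes topically irrelevant sites to the bottom of the sorted list.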
List of sites I've come up with
As you can see from the screen grab, I’ve got a sortable list that is very consistent with my master category list.
Here’s another grab from the bottom end of the list; you can see how irrelevant these sites seem to be:
The least relevant sites on my list
Using Excel Properly
When you open a blank Excel spreadsheet, there are a number of ways you can analyze data. The most important thing is to understand what Excel can do with the metrics you’re interested in manipulating.
Pivot tables are remarkable because of how they can aggregate data. VLOOKUPs let you do much bigger calculations really easily, and Excel's large library of formulas will handle a huge number of specific tasks for you.
I don’t claim to be that good with Excel, but I know the answers I need. So I wireframe things out in Excel, and if I need VBA programming to make my spreadsheets far more powerful, I get my VBA guy in.
The main point is this: no data expert can help you if you’re not clear on what you’re asking them to do for you. I know, because I burned a lot of money hoping experts could guide me when, in fact, I had to help them to get results.
All of this data stuff may seem a little overwhelming. When I look at people like Dr Pete, Jono Alderson or Alex Chudnovsky, or look at the scale of what SEMrush can do (558,257 users, 106,600,000+ keywords, 45,000,000+ domains, 27 databases), I just feel like a total noob.
Of course, instead of sitting there crying, I’ve gotten my head around Excel and built some tools to answer important questions. The Relevance Scoring Spreadsheet is just one example.
My best advice:
- Define your questions properly.
- Spend the time understanding the metrics and data which help you.
- Get your Excel skills up to speed.
- And start answering questions you care about.
Do you have any questions? Just comment below and I’ll answer them.