#SEOisAEO: How to get Entities and Their Attributes in the Knowledge Graph


Transcript

Jason: I'm particularly excited about this episode because I initially thought I'd been really smart: I got my place in the knowledge graph by leveraging the fact that I was a musician and I'd made some cartoons for TV.

[Image: Jason's Knowledge Graph entry]

Then I started talking to other people about their experiences, and I realized that I had it pretty easy: I was leveraging notable things that made it easy to get myself into the knowledge graph. Cheating like that is fine, but it doesn't necessarily bring you the low-down, the real information about how to do it. So we're going to look at that today with David, Andrew, and Sam.

"In order to own the conceptual space where your company surfaces, think in terms of semantic density." Can you explain?

David: Let's go back to pre-semantic search days when the web was simpler, search was simpler, and our job was simpler. We thought in terms of keyword density. In the worst-case scenario, that is simply the number of times a keyword appears in a web document. And that would help that document to surface for a search query that contained that keyword. Simple. Now we have semantic search and semantic density. The same simplicity of approach applies, but our work is infinitely more complex. So what do we mean by semantic density? In order to be provided as a relevant answer to a search query in semantic search, we need to establish the credibility of the information we provide. The only way of doing so is by building a presence as an entity in relation to that particular query. And that means having the relevant information corroborated just about everywhere, so Google sees it and is confident. Only then will Google actually surface you in relation to a particular query.

We mention domain authority in relation to websites. Both Gary Illyes and John Mueller have said Google doesn't use domain authority as a ranking signal.

But then Google has also said that when a new page surfaces on an existing website, they use a site-wide signal to rank that particular web page. That is essentially domain authority. So mixed signals here.

Jason: You said website authority. But if you have your entity in the knowledge graph, and you've managed to connect the brand to the website, then it's no longer Domain Authority - we are talking about Brand Authority. Up to now, Google had to base trust on links and websites because those were something they could actually get a grip on. But now that they can get a grip on brands, we're looking at brand authority, not website authority. Would you agree with that?

David: Yeah, definitely. And essentially the way we establish semantic density is by getting outside our own tiny website, which is controlled by us, which makes its trustworthiness questionable. So we have to get corroboration on the unstructured web, which is the social media space, structured databases and citations in different publications and so on. That's semantic density. And there are two vital things. First of all, it's a lot of work. Second thing, there are no shortcuts.

Jason: Yes. That is something people often fail to understand. They know (and can talk about) themselves, but they don't realize that until they've convinced Google that they are actually telling the truth, they have no authority.

David: Yes, absolutely. And I think you're making a very important point here about convincing Google. Also remember that, because Google's algorithms are constantly changing and completely opaque, we always work in the dark. We probe by doing something, reading the results and seeing if that works. A lot of the stuff we work with is theoretical.

At the same time, however, it's based on mathematics that is public knowledge. Look at how domain authority is developed in the context of graph theory; you need four things:

- The prestige of a website and of its authors - there are equations which give us that.
- The quality of the information which is presented - there are mathematically assessed signals which give us that.
- The website centrality - how close to the center, and how important, a website is in a knowledge graph.
- The competitive situation around the subject - the more popular the subject, the greater the competition and the more people doing it, so the harder it is for you to rise as an authority above the noise and the similarity with everybody else.

Those four things are no different from how we operate in the real world as flesh-and-blood entities. Now translate them into the digital domain: they can be broken down into practical steps which can be applied at a daily-goal, daily-task level. And that will produce results.
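To make two of those signals concrete, here is a toy sketch in Python using the networkx library: PageRank as a stand-in for prestige, and closeness centrality for website centrality, computed over a tiny hypothetical link graph. Google's actual equations are not public; this only illustrates that such quantities are computable.

```python
# Toy illustration of two graph-theory signals David mentions.
# The link graph below is entirely hypothetical.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("nytimes.com", "jasonbarnard.example"),
    ("wikipedia.org", "jasonbarnard.example"),
    ("jasonbarnard.example", "kalicube.pro"),
    ("smallblog.example", "kalicube.pro"),
])

prestige = nx.pagerank(G)                 # link-based prestige per node
centrality = nx.closeness_centrality(G)   # how "central" a node is in the graph

for site in G.nodes:
    print(f"{site}: prestige={prestige[site]:.3f}, centrality={centrality[site]:.3f}")
```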

A concrete example of how this works in the real world: for any kind of product, the traditional marketing mix gives us product, price, place, and promotion. In AEO, that becomes solutions instead of product, access instead of price, value instead of place, and education instead of promotion. And there we have a template for what the content should be doing on any kind of platform, whether it's a website, the social web, or a database.

Jason: Brilliant.


Knowledge graph, knowledge vault. Give us the rundown on where we're at with them today.

Sam: The knowledge graph is something people are more aware of. It is crowdsourced - Google gets data from Wikidata, Freebase... So that's what the knowledge graph is: information from structured, curated sources. It is very human-created, and there are huge gaps in that knowledge. For example, 71% of people in Freebase have no place of birth listed, and 75% have no nationality assigned. Because it's human-created, it is limited by the actual amount of manpower behind it. The knowledge vault tries to fix that by collecting data in an unstructured way, by scraping the web. And that has its own issues with how factual the information actually is. When it finds something on the web - an HTML DOM, a table, web text, or annotations - it is finding RDF triples: someone making a statement about something with a subject, predicate, and an object. Something like "Obama is American" or "Bob is 30" - any kind of statement like that about an entity. Obviously, on the web anyone can say anything, so what Google has spoken about doing with the knowledge vault is verifying. If it sees the same information repeated over and over again on multiple different domains, it can conclude, "Okay, this is more likely to be a fact." The stat I've seen is that, of the triples it finds across the web, it is very confident in the veracity of only 16%. So there's a huge amount where Google is not 100% sure. Using their knowledge graph, they can cross-verify the unstructured data they've found across the web.
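To illustrate the cross-domain corroboration Sam describes, here is a minimal sketch. The triples and URLs are made up, and real fact fusion (as described in Google's Knowledge Vault paper) uses machine-learned confidence scores rather than a simple domain count; this only shows the repetition-across-domains idea.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical extracted (subject, predicate, object) triples with source URLs.
triples = [
    (("Obama", "nationality", "American"), "https://en.wikipedia.org/wiki/Barack_Obama"),
    (("Obama", "nationality", "American"), "https://www.britannica.com/biography/Barack-Obama"),
    (("Bob", "age", "30"), "https://example.com/about-bob"),
]

# Count the number of distinct domains asserting each triple:
# the more independent domains repeat a statement, the more it looks like a fact.
sources = defaultdict(set)
for triple, url in triples:
    sources[triple].add(urlparse(url).netloc)

for triple, domains in sources.items():
    print(triple, "asserted by", len(domains), "distinct domain(s)")
```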

Jason: The idea of cross-verification is incredibly important. And then it seems that the more you provide information that turns out to be factual, the more you become a trusted source. And that could be a way in the future of building trustworthiness.

Next up, the Knowledge Graph. Here’s a way to check if you are in there: https://kalicube.pro/knowledge-graph-explorer.
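Google also exposes a public Knowledge Graph Search API that returns matching entities with a relevance score. A minimal sketch, assuming you have created an API key in Google Cloud (the key below is a placeholder):

```python
import json
import urllib.parse
import urllib.request

API_KEY = "YOUR_API_KEY"  # placeholder: a Google Cloud key with the KG Search API enabled

def kg_lookup(query: str, limit: int = 3) -> list:
    """Query Google's public Knowledge Graph Search API for an entity."""
    params = urllib.parse.urlencode({"query": query, "key": API_KEY, "limit": limit})
    url = f"https://kgsearch.googleapis.com/v1/entities:search?{params}"
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return [
        (item["result"].get("name"), item["result"].get("@type"), item.get("resultScore"))
        for item in data.get("itemListElement", [])
    ]

print(kg_lookup("Jason Barnard"))  # name, entity types, and Google's relevance score
```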

[Image: Kalicube Knowledge Graph Explorer]

How do you get in the knowledge graph? I found this when searching Google. Not very complete.

[Image: Google's answer on getting into the knowledge graph]

What is the way to get an entity into the knowledge graph?

Andrew: Schema is a given. Start with that on your site. The biggest data source that Google has is us. State the facts in Schema, then make sure they are corroborated elsewhere on the web. Wikipedia, for example. The majority of us are not notable: we haven't written books, we're not famous musicians, we haven't invented things that everybody knows about. In short, we aren't notable and cannot claim a Wikipedia page… Tough, but you can add information to Wikidata. Beyond that, you need to think about what Google is trying to do. It's trying to come up with the best answer about who you are and what you do. You need to think about the ways that you can feed Google this information. Google My Business is important since you're directly feeding the machine. Then you can get yourself cited in the Better Business Bureau, TripAdvisor, Yelp, Crunchbase and so on. Google is also scraping all this kind of stuff. People are an interesting case. Names are ambiguous. Simon's asked about this, and he has this issue because he's got a very common name. If you search for Simon Cox, you get 50 trillion results, so then you search again for Simon Cox footballer, or Simon Cox SEO, or Simon Cox musician... then Google knows which one you mean because it is associating people with their attributes. You need to get those search terms related to your name (brand or person). A brand like Optimisey is easier because it is unique. But then, what the heck is Optimisey? Initially, Google doesn't know. First, explain on the official site that it is an SEO company. Hopefully, people start searching for Optimisey SEO, giving Google supporting evidence. But you also need crawlable corroboration. Interestingly, it comes back to some really basic SEO stuff that we've all covered before: link building, citations, mentions. Context is a really big thing here.
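As a concrete illustration of "state the facts in Schema", here is a minimal sketch that emits schema.org Organization markup as JSON-LD, with sameAs links pointing at corroborating profiles. The URLs and the Wikidata ID below are placeholders, not real records.

```python
import json

# A minimal sketch of schema.org Organization markup, serialized as JSON-LD.
# All URLs and the Wikidata item below are placeholders for illustration.
org = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Optimisey",
    "url": "https://www.example.com/",
    "description": "An SEO company based in Cambridge.",
    "sameAs": [
        "https://www.wikidata.org/wiki/Q00000000",        # hypothetical Wikidata item
        "https://www.crunchbase.com/organization/example",
        "https://twitter.com/example",
    ],
}

# Embed the output on the site inside a <script type="application/ld+json"> tag.
print(json.dumps(org, indent=2))
```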

Jason: Great. What strikes me about many of those sources you mentioned - Google My Business, my own website, Wikipedia, Wikidata, etc. - is that it's all stuff that I can control. So it's actually not trustworthy at all, because I can add myself to Wikipedia, or to Wikidata… Perhaps the crux of it all is placing information on trustworthy sources where we have no control, so that Google can cross-check and believe all this information.

David: Everything we've covered up to now has been building up to this really nicely. The trustworthiness of the information on the web is key. Essentially the knowledge vault is supposed to address that by carrying out computations between the knowledge graph and unstructured data across the web, and identifying some sort of signal. What we want in search is to index the web in one database that can give us all the relevant answers in relation to everything that happens online. But that is next to impossible to do on the fly right now because the data is so fluid and complex. But there are things happening in the background. The entity database gets built up year on year, and entities are trustworthy sources. The connections, the edges between the nodes in the graph, are being calculated more precisely year on year. Google's computational power goes up every year, and the cost goes down. And that is critical, because managing the amount of money they spend doing this is what allows them to keep on doing it. As the automated process improves, more data is included and the results get faster and smarter... we will get to the stage where there is independent corroboration, just like in real life. In real life, if you go to Cambridge and ask “who is the best Cambridge SEO”, everybody will say Andrew, right? Well, to make that happen, Andrew has actually done a lot of work. He's networked, he's visible, his work is everywhere, and his customers talk about him. If we do the same on the web, at some point this will start surfacing independently. It hasn't happened yet, but we can see that the pathways are there. In some ways, it's like the discussion we had in 2012, when semantic search results were very imprecise. We were getting mixed results: from the old source, where it was just links, and from related semantic search results, because the semantic search graph hadn't been populated yet. Once the knowledge vault is populated better, we'll see better results.

Jason: Yeah. So one thing I pull out of that is that we should be networking more online, with Google as the eternal listener. Which is a nice idea: Google watches us network, it can see where all our connections are coming from, and the more you do that, the more you build up an image of yourself.

Andrew: There's a really interesting thing there, Jason. Google killed Google+ recently. But remember Google Circles? That was Google's attempt to get the social graph. With that it knows that Jason the SEO is connected to David the SEO, and is also connected to Andrew the SEO, so chances are that the Simon Cox who is also in this group is the SEO guy and not the footballer.

David: It's interesting. Google+ has gone away, but that doesn't mean there's a vacuum in its place. Google has simply substituted the wholly-controlled parameters of Google+ with looking at the wider web. And my guess is it's betting on algorithmic improvements to its machine learning to actually deliver the results.

Jason: Great! A quick aside about semantic triples in text. I talked to Doc Sheldon and he was saying “keep in mind the concept of RDF triples. Incorporate clarifying statements into reference content in a way which is supported by the markup.” Brilliant. David Amerland went further and said you should add a semantic layer by identifying meaningful elements and rewriting those parts to leverage subject-predicate-object. You could even employ somebody whose job is to pull out the semantic triples and reinsert them in a way that Google can easily read. Then look at Wavii, who were automatically identifying relationships between subjects and objects/entities using verbs such as “acquired”, “graduated from”, “is author of”, “is based in”, “studied at”, and so on. So clearly correct grammar is vital. Taking that a step further, look at context clouds. That is a recent Google patent about identifying entities within a document, seeing what other entities are around them, and trying to build understanding about entities, relationships, and attributes through the context in which they are found… And then building ontology-based context clouds that can link things together. I think I explained that quite badly.
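As a rough illustration of pulling subject-predicate-object triples out of prose, here is a naive sketch using the spaCy NLP library. It is a simple dependency-tree heuristic for demonstration only (for instance, it will miss copula statements like "Obama is American", where the verb is tagged as an auxiliary), not how Google or Wavii actually parsed text.

```python
# Naive subject-verb-object extraction with spaCy.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_triples(text: str):
    """Return rough (subject, verb, object) triples from each sentence."""
    triples = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ == "VERB":
                subjects = [t for t in token.lefts if t.dep_ in ("nsubj", "nsubjpass")]
                objects = [t for t in token.rights if t.dep_ in ("dobj", "attr")]
                for s in subjects:
                    for o in objects:
                        triples.append((s.text, token.lemma_, o.text))
    return triples

print(extract_triples("Google acquired Wavii. Jason Barnard plays the double bass."))
```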

[Image: context clouds]

Could you explain semantic writing and context clouds in a way that everybody else will understand?

Sam: Sure. When it comes to writing content for semantics, it's very different from ten years ago, when you used the same word over and over to make sure Google realized, "Oh yeah, you're actually talking about this topic." Nowadays, with the knowledge graph and knowledge vault, it's very much about mentioning related entities. So if you're going to talk about Obama, it also expects you to say things like President, America, Washington, Michelle Obama. Those relationships trigger: "Okay, I'm confident now you're talking about President Obama because you're talking about all these other things”. That's where your writing should be going. Ping the knowledge graph to see what it views as related and contextual to an entity. Then write around those. And because of how the knowledge vault is cross-checking facts, make sure you're using triples.

Jason: Indeed. So, it's building up the corpus of your site with semantic triples, talking about related entities and attributes to create a context cloud, and using sentences that are clearly subject-verb-object. William Rock has just pointed out that Wavii was acquired by Google. I might be completely wrong about this because I got a bit confused after reading too many of Bill Slawski's articles, but that might've been the basis on which the context cloud was built.

David: Google has a long history of acquiring technologies that it incorporates in its knowledge graph and search algorithms… this is probably a case in point.

Andrew: As SEOs, we sometimes forget that we have all these tools to find related topics. Google itself is a really good source. Google Trends is great! People often use it to compare trends in search term popularity - compare the trend of “blue shoes” versus “red shoes” over time. But Trends offers more. It tells you what topics Google thinks are related to the query. You stick in Barack Obama and it tells you the topics it sees as related: Donald Trump, President, Michelle Obama, all the things that Sam was talking about. Wow. Google will tell you what it thinks are the related topics around that subject. Use it. People Also Ask. Search for Barack Obama and the knowledge panel will tell you what people are looking for related to that data set. Who is his wife? How tall is he? Was he the 44th or the 45th president? All that stuff is in the knowledge graph. But the importance of relationships is not standardized. If you search for John Wayne, you get a list of all his movies. You search for David Beckham, you get a list of his kids. Because that is what people ask. It doesn't have the same knowledge graph data for every single person; it's not a cookie-cutter template, it's based on what people look for.
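For example, here is a quick sketch using pytrends, an unofficial, community-maintained Python client for Google Trends. The interface isn't guaranteed by Google, so treat it as illustrative.

```python
# Pull the topics Google Trends associates with a query.
# Requires: pip install pytrends (an unofficial Google Trends client)
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US", tz=0)
pytrends.build_payload(["Barack Obama"], timeframe="today 12-m")

# related_topics() returns, per keyword, "top" and "rising" DataFrames of
# topics Google sees as related - the entity relationships Andrew describes.
related = pytrends.related_topics()["Barack Obama"]
if related.get("top") is not None:
    print(related["top"][["topic_title", "topic_type"]].head(10))
```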

Jason: Yeah. And you were saying there is a process. There are several questions: David Beckham goes from “who are his kids” to “buy David Beckham t-shirts”. That is where people can make money out of the knowledge graph. Oooooh. I've just solved everybody's money-making problems for the next 20 years :)

[Image: knowledge vault diagram - known, semi-structured, and unknown relationships]

I really like that last part - the idea of learning through known relationships, i.e. things that have often been manually curated; then semi-structured data, where we can estimate relationships between entities; and even the unknown ones, because we can guess that there is some kind of relationship from the frequency of co-occurrence.

[Image: knowledge panels]

Here we've got a company, an operating system, and a person.

What are the specifics of different problems faced for companies, people, software and products?

Andrew: What Google hates is being wrong. It doesn't want to give an answer where it thinks, "mmmm not sure". It needs a high degree of certainty. You have to make it really clear what your offer is. The key thing here is schema. Mark up all your pages. Context is really important. Make sure the context is there in your content. One issue now is that people create companies and software with stupid names. Purple Flamingo or Tangerine Umbrella...

Jason: Or Wavii.

Andrew: Or Wavii. What the hell is Wavii? Out of context, Wavii doesn't mean anything. Bill Slawski uses the example of “horse”. Are you talking about the animal, the one you jump on as a gymnast, or the one you use to balance your carpentry stuff? The surrounding content, all the citations, all the links, will give context and allow Google to disambiguate.

Jason: Yeah. That's what you were talking about yesterday - the ambiguity of everything. Companies tend not to be ambiguous within their industry and within their country. There are rules about sharing names and trademarks on company names. Whereas people have this enormous problem with ambiguity. I certainly have it. With my name, there's a footballer in South Africa, a dentist in America, a preacher in America, and a guy who's been in prison. There are loads of Jason Barnards around, so that ambiguity becomes a problem. You mentioned barnacling to me.

Andrew: So if you're not famous in your particular area, then you need to be mentioned where famous people in that area are. So let's say I was starting out as a double bass player, and I'm like, "Nobody's heard of me, how am I going to get into this world? Jason Barnard, he's a really famous double bass player. I can get mentioned where he is." It's piggybacking on those other people, because that's how you get the context. Andrew Cox-Starkey says he is a double bass player. Google's not going to take my word for it. But if I'm mentioned in the same context, in the same place that Jason Barnard is mentioned - "Oh, we know who he is, he's a named entity, he's a double bass player, he has all these albums" - some of that credibility rubs off. It's barnacling: hitching yourself onto the famous double bass player. It might even be worth signing up for a double bass school - not because you actually bother going, but because it makes you look credible: you're signed up to that school and you're an alumnus of that double bass school.

Jason: Great, so we can credibilize by piggybacking on the CEO or the board of directors, and that can help us push the company. So whatever is notable - the company, people, products, software - take the one that is notable and leverage it as much as you can to pull the others up there. It's a self-fulfilling prophecy: the more you do it, the better it gets.

Andrew: I was lucky enough to have Marie Haynes come and speak at one of my events, and she and I were chatting afterward. She has this theory that Google is doing a lot of work to devalue links. There are just too many links, and so many of them are spam. Link volume on its own doesn't work anymore because of the whole context thing. If you have 100 links in crappy directories, no good. If you have one link from that double bass school, that's much more relevant and much more useful. And Marie was talking about this distance from trusted sources: Wikipedia, Wall Street Journal, New York Times, whatever it is. The question is how far away are you from those trusted sources? Are you linked to from a site that's linked to from the New York Times? Or are you linked to from a site that's linked to from a site that's linked ... You might not be able to get yourself on the front page of the Huffington Post tomorrow, but can you get yourself on the front page of a site that's linked to from the Huffington Post?

Jason: Or even get yourself mentioned in the same breath, on the same page, in the same sentence, as somebody who works for the Huffington Post.

Andrew: Exactly. Getting in that context, getting in that circle.
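The "distance from trusted sources" idea can be sketched as a simple breadth-first search over a link graph. The sites below are hypothetical, and real trust propagation (TrustRank-style algorithms, for example) weights links rather than just counting hops; this only illustrates the hop-count intuition.

```python
# Toy illustration of hop distance from a trusted seed in a link graph.
from collections import deque

links = {  # hypothetical: page -> pages it links to
    "nytimes.com": ["doublebass-school.example", "bigblog.example"],
    "doublebass-school.example": ["jasonbarnard.example"],
    "bigblog.example": ["somesite.example"],
    "somesite.example": ["yoursite.example"],
}

def hops_from_trusted(seed: str) -> dict:
    """Return the minimum number of link hops from a trusted seed to each site."""
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        site = queue.popleft()
        for target in links.get(site, []):
            if target not in dist:
                dist[target] = dist[site] + 1
                queue.append(target)
    return dist

print(hops_from_trusted("nytimes.com"))
# jasonbarnard.example is 2 hops from the seed; yoursite.example is 3.
```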

[Image: knowledge graphs]

What about all these other knowledge graphs? They're obviously gunning for a place.

David: Everybody is building their own knowledge graph. Google, of course. Amazon have got their own knowledge graph, which is product-related and revolves around the purchaser. Airbnb have their own. Facebook have an interrelationship graph. The question here is always: how do these relate to each other, and what's going to happen next? It's a floating question. We don't know precisely. There is value in the connectivity of these data graphs. However, Facebook is a walled garden. And everybody else at the moment tends to exchange data to a certain degree, so data is portable in the sense that it's transparent to search, and search becomes a unifying latticework through which most of these things surface. Look at Amazon. They have a lot of data in their graph. They have movies, they have music, they have books, they have other products. Plus they have also pushed Neptune out into the corporate world - essentially a database that allows companies to take their data and turn it into a knowledge graph. If Amazon feel at some point they have sufficient knowledge about an individual to pull that person into their universe, then they might think: okay, why don't we wall off our information from Google? In which case you're going to get a very fragmented web, where we will have to leapfrog from silo to silo in order to get something done. But all this is happening against a backdrop of constant change in terms of privacy and legislative initiatives to address those issues. So it is also quite possible that we'll get to the stage where we control the central data of our identity, perhaps through a web ID. That could be blockchain powered. There are a lot of initiatives on that front right now. For example, the Legal Entity Identifier framework, which right now is aimed at corporations. It's a non-profit, blockchain-powered initiative. And it's only a small step to extend that to us as individuals. If we get to that stage, then we control our data the same way that we control our personality in the real world. Companies will have to give us some kind of value to convince us to part with the data and allow them to use it.

Jason: Wow. Taking back control of the data is something I also heard from Andrea Volpini. I encourage my clients to communicate with the knowledge graph and try and shove all their information into it. But in fact what we're doing is throwing this information at Google, desperately hoping we can get some short-term value. But the long term could be a complete disaster.

David: It could be. It comes down to how valuable something is and what happens next. Google has a lot of data on us and it's incredibly valuable to them. If that value is abused and legislative initiatives are put in place that prevent them from exploiting it, then they will effectively have killed the goose that lays the golden egg. I don't think that this is something they want to do. So there's a dynamic there between what we lose and what we gain with every interaction. A great example is voice-activated assistants like Google Home and Alexa. We are aware that they listen to us, and we use them even though we don't trust them. We're in a space which I call a trust truce. Essentially we use them because we get something done. We're aware that they get some data, but that data isn't incredibly valuable to us and it doesn't do us any damage if they have it. If we get past that, we'll need to rework the relationship and find different ways of working.

Jason: Crumbs. Brilliant stuff, wonderful. Next: the Linked Open Data cloud. 1,224 datasets in one linked open data cloud, across nine knowledge domains: geography, government, linguistics, life science, media, publications, social networking, user-generated and cross-domain. The idea is that all these knowledge graphs link together, but the IDs don't match, so there is a problem of portability. Despite that, is this an opportunity? Can we just fill up the LOD?

David: In theory, yeah. The original vision of the semantic web was that everybody did their bit and put data out there - clearly tagged, marked up and structured - because we all benefited. That should've worked in theory. And we know it doesn't, because the dynamics in the real world aren't right. The reward systems in our brains aren't activated - we're unlikely to engage in that kind of behavior unless we personally gain something. So here the question is who's going to gain. It's reminiscent of directories at the beginning of the web, where people contributed and created a loose taxonomy of information… but that got abused and deprecated, and now those directories aren't worth anything. So I think the way forward is through clear identification of data in terms of its provenance, and clear structuring of data in terms of its value. That will give it a relevance-value once it's indexed and used in search.
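Reading from the Linked Open Data cloud is already straightforward. Here is a minimal sketch that queries Wikidata's public SPARQL endpoint from Python; the entity (Q76, Barack Obama) and property (P106, occupation) are real Wikidata identifiers, and the User-Agent string is a placeholder you should replace with your own.

```python
# Query Wikidata's public SPARQL endpoint for an entity's occupations.
import json
import urllib.parse
import urllib.request

SPARQL = """
SELECT ?occupationLabel WHERE {
  wd:Q76 wdt:P106 ?occupation .    # Q76 = Barack Obama, P106 = occupation
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

url = "https://query.wikidata.org/sparql?" + urllib.parse.urlencode(
    {"query": SPARQL, "format": "json"}
)
req = urllib.request.Request(url, headers={"User-Agent": "kg-demo/0.1 (example)"})
with urllib.request.urlopen(req) as resp:
    data = json.load(resp)

for row in data["results"]["bindings"]:
    print(row["occupationLabel"]["value"])
```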

What does the future of managing your Knowledge Graph entries look like?

Andrew: That's a big question. I'm worried that Google is becoming the source of truth for everything.

Jason: Yeah, short term we’re playing with fire. For the moment I'm enjoying it, so I'll keep going. Sam, what do you think the future of managing our Knowledge Graph entries is going to look like?

Sam: As Google gets better at understanding whether something is a fact, they're going to invest more heavily. Answer bots are part of the future. Google's going to get smarter. Face up to that.

Jason: Oh yes. You touched on one point - trust and confidence and credibility. If I can leverage myself to be the trusted source about myself, and build on that confidence… Google becomes confident that I'm telling the truth because I've always told the truth about myself. Indirectly I get more control over my own data simply because these machines will tend to believe me thanks to my historical truthfulness.

David: You mentioned trust. It's one of those woolly things - we think “I know it when I see it, but I don't know how to create it.” That isn't true. There is a granular approach to creating trust. The steps that give us expertise: is your content good or is it not? Does it answer the question or does it not? Does it have the relevant technical information or does it not? Does it have all the relevant links or does it not? It is easy to show you are an expert. The steps which give us trust are: contact, perception, assessment, and connection. And they're linear in terms of how they evolve. The steps that give us authority are: identity, conversations, sharing, presence, relationships, reputation, and groups/communities. Big question - are you an authority on a subject? An authority is cited by everybody. Compare that to “he knows his stuff but nobody knows of him”. Making that happen is very labor intensive, you need a strategic approach. Also because it's the web, it has to be intentional. For the first time, we need to create those things in a very intentional, planned way. As opposed to how we do them in the real world.

Jason: Absolutely phenomenal. Brilliant. That explanation of EAT is brilliant. Importantly, it is very labor intensive. We don't have the choice. We have to build our EAT… feeding that to these Knowledge Graphs. Definitely not something we can put to one side. A lot of companies are saying, "We'll leave that till later because we need those short-term gains." That is terribly foolish of them.

Credibility and Position zero

David: Position zero will always rely on relevance and trustworthiness.

Jason: Perfect last word. Wonderful stuff. Thank you Sam, Andrew and David, that was absolutely brilliant. Lots of really interesting insights, lots of great conversation. A couple of good jokes as well. Please come along next Tuesday, same time, same place, for “How does Google use the Knowledge Graph in its AE algorithm”. Bye bye, thanks a lot, guys.


Check out other webinars from this series