Episode 41: Building Better AI Apps with Knowledge Graphs and RAG


[00:00:00] Dr Genevieve Hayes: Hello and welcome to Value Driven Data Science, brought to you by Genevieve Hayes Consulting. I'm Dr. Genevieve Hayes, and today I'm joined by Kirk Marple to discuss how knowledge graphs and retrieval augmented generation can be leveraged to improve the quality of generative AI. Kirk is the CEO and technical founder of Graphlit, a serverless cloud native platform that streamlines the development of AI apps by automating unstructured data workflows and leveraging retrieval augmented generation.
[00:00:35] Kirk, welcome to the show.
[00:00:37] Kirk Marple: Yeah, thank you so much for having me here.
[00:00:40] Dr Genevieve Hayes: When ChatGPT was first released, there was talk that it would lead to traditional search engines like Google, as well as many people's jobs, soon becoming obsolete. That was until users discovered generative AI's one major flaw: it makes stuff up. And in the 18 months or so since its release, there have been a number of stories in the press about people who've embarrassed themselves immensely by believing ChatGPT without question.
[00:01:10] Because of the stochastic nature of ChatGPT, it's never going to be possible to completely eliminate hallucinations.
[00:01:19] however, there are ways to work around this problem. One such way is through leveraging knowledge graphs and retrieval augmented generation or RAG. And that's what we're going to be talking about in this episode. Now, there are a number of commercial tools available that can assist software developers and data scientists in integrating knowledge graphs and RAG into the AI apps they're developing.
[00:01:47] And one such tool is Graphlit, which Kirk happens to be the CEO and founder of. So for any of our listeners who haven't come across Graphlit before, can you tell us a bit about it?
[00:02:03] Kirk Marple: Yeah, for sure. So, as developers have been working with unstructured data for the last number of years, there hasn't been the tooling that you might see for the structured data world. I mean, there's not the Snowflakes or the Fivetrans or those kinds of tools to make unstructured data work with large language models.
[00:02:23] And so what we have is an end to end platform that makes it easy to put data in. We structure it by extracting text, doing audio transcriptions, things like that, and extract the kind of people, places and things in the data to create a knowledge graph automatically. And so really it's aimed at not even just data scientists, but really any kind of application developer who wants to leverage unstructured data and then integrate that with large language models in their applications.
[00:02:54] Dr Genevieve Hayes: Is this a business to business product or would you also aim at hobbyist developers as well?
[00:02:59] Kirk Marple: Yeah, it's anybody. I mean, it's developer focused, we say that. We actually started building an application on top of the platform three years ago, and as we got started, we realized that this platform, the API, could be useful to anyone building applications in this space.
[00:03:15] And so that's what we sell now: really just an API for everyone, so that they can very easily put data into the system and then have conversations with it, query it, and even repurpose the content using these language models.
[00:03:30] Dr Genevieve Hayes: Okay. So you just said that you started work on this three years ago.
[00:03:34] Kirk Marple: Yeah,
[00:03:34] Dr Genevieve Hayes: So ChatGPT, I believe, was released about 18 months ago. So that sounds like you started work about 18 months before the launch of ChatGPT.
[00:03:44] Kirk Marple: And that's what's really interesting. We were focused on the ingestion and the cataloging side of unstructured data, so we could pull in documents and images and even geospatial data, and we were making essentially a search type catalog interface for this data using natural language algorithms,
[00:04:05] and also computer vision models, but it was pre-LLM days. And so what we realized when ChatGPT came out and the LLMs became publicly available was, wow, this is like peanut butter and jelly. The two things go together really well with what we had already built, and we quickly merged the two together and got more value out of what we were already building.
[00:04:30] Dr Genevieve Hayes: What you just described, that's almost identical to a project that I worked on in a previous job, pre-ChatGPT. So we actually had a large volume of unstructured data. So we had text, audio, video, and images. And basically my job there was to build a data extraction pipeline that turned the unstructured data into structured data and then loaded that into a knowledge graph.
[00:05:00] Kirk Marple: Yeah. Yeah.
[00:05:01] Dr Genevieve Hayes: So it sounds like what you were doing would basically have automated what took me quite a bit of time to achieve.
[00:05:08] Kirk Marple: Yeah. Yeah. I had been CTO of a few different places trying to build pipelines like that as well. And so I saw how you basically had to do it yourself. I mean, you could use Spark, you could use object storage, you could use JSON databases or graph databases, but you had to put all those pieces together yourself.
[00:05:26] And so really what I saw is, look, there had to be a better, more elegant pipeline, a kind of graph ETL, you could almost call it. I had also done a lot of work in the multimedia space, building media management tools and metadata archives and things like that, where there's a lot of overlap with how you get a video for a broadcaster on TV and have to manage
[00:05:52] search and metadata and all that kind of stuff. I'd worked in that area for 10 years, and so this is kind of the next incarnation of a lot of those concepts, really just being media agnostic and making sure that we can handle anything, really any kind of unstructured data.
[00:06:08] Dr Genevieve Hayes: I remember from when we were doing it, the one that we really struggled with was the named entity extraction. So we were using the Google AI APIs to do all of this. And with the named entity extraction, it had a dubious level of accuracy. So it was good with names of, you know, cities, for example, things that were really, really common, and really common names.
[00:06:41] So if you had Tom Cruise as a name, for example, something that's really obvious. But once you started to get into names that weren't common, probably in the US where it was developed, it started to get more and more dodgy. And I remember at one point it picked up a particular expletive that I won't mention in this recording and said that it was a city in Sweden or something.
[00:07:16] Kirk Marple: Yeah, with the early ones, the false positive rate was pretty high. And we had started using Azure, Cognitive Services I guess they called it, their text analytics, which is very similar to that. And that's what we originally started with. And it was pretty good, but you're right.
[00:07:32] I mean, it's not perfect. And it was only good at named entities, not the metadata around them. And that's what's really been interesting with the large language models and what we just released this year: basically using LLMs to do the entity extraction, but also get the full fidelity of, hey, it's a place,
[00:07:55] what's its address? Or here's an event, what's the ticket price? When does the event start? When does the event end? And so we map everything to, you're probably familiar with, the JSON-LD or schema.org taxonomy. And so we're actually very opinionated about the data model. Everything that we pull out maps to those entity types, like place, product, organization, person.
[00:08:20] And that's a little different than some other entity extraction packages, but it lets us kind of canonicalize that data better for a set of use cases that a lot of our customers are using.
[00:08:31] Dr Genevieve Hayes: I would imagine that using an LLM would be way better than using something like Azure Cognitive Services.
[00:08:38] Kirk Marple: And we support both. We've looked at maybe using a hybrid, like maybe we do one pass with something like Cognitive Services and use that to guide the LLM. It's something I want to try, but honestly, the LLMs work. The higher quality ones like GPT-4 and GPT-4o are noticeably better than the Haiku or Sonnet models from Anthropic, kind of the lower level ones.
[00:09:03] Sonnet's not bad. Haiku does a really poor job at entity extraction. And even some of the other ones, like the Llama ones, don't do a good job. But GPT-4o is our current default and does a really pretty nice job.
[00:09:16] Dr Genevieve Hayes: How does it compare in price to the other ones that aren't as effective?
[00:09:21] Kirk Marple: It gets expensive. I think GPT-4 is almost 10x. The cost difference between the low end and the high end is something like 30 cents per million tokens, up to some that go like 15 bucks per million tokens. And so if you're not careful, and this is actually something we were just dealing with, you could dump in a ton of data.
[00:09:41] And if you have entity extraction enabled, it's very, very token intensive. And so we had somebody throwing in multi-hundred page PDFs, doing entity extraction at like 400,000 tokens a document, and it adds up pretty quick.
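To make that arithmetic concrete, here's a quick back-of-the-envelope sketch using the rough per-million-token prices Kirk quotes; the model labels and exact figures are illustrative, not a price list:

```python
# Back-of-the-envelope cost of LLM-based entity extraction, using the
# rough prices quoted above (illustrative only; real prices vary by
# provider and change over time).
PRICE_PER_MILLION_TOKENS = {"low_end_model": 0.30, "high_end_model": 15.00}
TOKENS_PER_DOCUMENT = 400_000  # the multi-hundred-page PDF example

for model, price in PRICE_PER_MILLION_TOKENS.items():
    cost = TOKENS_PER_DOCUMENT / 1_000_000 * price
    print(f"{model}: ${cost:.2f} per document")
# low_end_model: $0.12 per document
# high_end_model: $6.00 per document -- it adds up quickly at volume
```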
[00:09:56] Dr Genevieve Hayes: Yeah, I can imagine that. I remember back when I was doing that project, because I was the main person working with the Google APIs, I was the only person who knew how to interpret the bill. And I remember our bill going from the free quota, when I was in the original proof of concept stage, to, you know, six, 12 months down the line, we're getting like a 60,000 bill or something.
[00:10:24] Kirk Marple: Wow.
[00:10:24] Dr Genevieve Hayes: We were using that much data. And yeah, I remember being asked, could you please explain why we have a 60,000 bill?
[00:10:32] Kirk Marple: It's really true. And because we're kind of model agnostic and you can swap the different models, I've been looking at what's good quality versus price. But we're finding we have to educate our customers, because you can end up spending a decent amount of money.
[00:10:45] And so there is an educational component to it for sure.
[00:10:48] Dr Genevieve Hayes: Back in the day when data science first came out, there was this naive belief a lot of people had that data should be free. So I remember having this boss who wanted us to build this particular model, to predict, it was defaults of a particular customer. And I remember him telling me and this other woman I was working with in my team to just go and phone up
[00:11:20] the big four consultancies that we worked with, and some academics, and ask them if they'd give us their data, because they'd want to share it with us. And I think people have gotten over that level of naivety now, but I think the naivety is now coming through with tools like ChatGPT, for example, because people are used to being able to just log on to the free version of ChatGPT, and they don't seem to understand, yeah, these are businesses and they want to make money and they've got a business model.
[00:11:53] Kirk Marple: I mean, there's GPUs and power and storage to pay for. None of this is free under the hood. And even for us, we're trying to make it cost effective where, sure, we have to add margin on top of our costs, but really you want to enable volume, because you want to make it easy for people to put in a lot of data and get value out of a lot of data.
[00:12:12] So it can't be super expensive either.
[00:12:17] Dr Genevieve Hayes: Going back to what I was saying at the start of the episode, I understand that one of the big problems with LLMs like ChatGPT is their stochastic nature. And from what I've read about RAG, I believe this goes some way to solving this problem by introducing determinism into the knowledge retrieval aspect of a chatbot.
[00:12:42] And it's doing this through knowledge graphs. But that's about where my knowledge of RAG ends. So, is there anything more to RAG than that? I'm guessing there is.
[00:12:57] Kirk Marple: Yeah, the way I look at it is, a lot of it is the R in RAG, which is the retrieval, or the search over your content. And the way we look at it, it's not just vector search, which is what a lot of people think: oh, you just run these embedding models, you create vector embeddings, and you try and find similar text.
[00:13:16] And that's kind of what you throw at the LLM. But really it gets a lot more complex than that, where you may want to add in more classical metadata filters like time or geospatial areas, or even where the data came from. Like, maybe I only want to ask about data that came from a Slack channel and a SharePoint folder.
[00:13:37] And so you can't just look at it as this holistic, big, massive set of data. You really have to use more classic search techniques of, okay, let's get that haystack smaller first, and that gives you your set of sources. And then you also need to do things like re-ranking. The default sort order you get from a vector search may not really be semantically relevant. Maybe it gets you 50 hits, but maybe 10 of those are really good hits and the other 40 are kind of so-so. And so that's where these re-ranking models have come in, essentially as a post-processor, where you do your search, get a smaller set of data, and throw it at a re-ranking model.
[00:14:18] It comes back and says, okay, here's a confidence, based on what you're asking, of how relevant we really think each hit is. And then you can take the output of that and give that to the LLM. And that increases your quality a lot, because you're basically just reducing the noise in terms of what sources you're providing.
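A minimal sketch of that filter-then-rerank flow might look like the following, where `rerank` is a hypothetical callable standing in for a real re-ranking model and the hits come pre-scored from a vector index:

```python
from dataclasses import dataclass

@dataclass
class Hit:
    text: str
    source: str   # e.g. "slack" or "sharepoint", for metadata filtering
    score: float  # similarity score from the vector index

def retrieve(query, hits, allowed_sources, rerank, top_k=10):
    # 1. Shrink the haystack first with classic metadata filters.
    candidates = [h for h in hits if h.source in allowed_sources]
    # 2. Keep the top vector-search results (say, 50 hits).
    candidates = sorted(candidates, key=lambda h: h.score, reverse=True)[:50]
    # 3. Re-rank as a post-processor: score each hit against the query
    #    with a semantic relevance model, then keep the best few.
    candidates.sort(key=lambda h: rerank(query, h.text), reverse=True)
    return candidates[:top_k]  # the low-noise set handed to the LLM
```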
[00:14:36] But then when you get into the graph side of this, what we've started doing with what people are now terming graph RAG, you think of, okay, I have text, say it's a Slack message, an email and a podcast transcript. And there are people mentioned, there are topics, there are places. And those are just words in the vector index, but really, if you extract those and enrich them, there's a lot of metadata that goes with them.
[00:15:05] Say I mention I'm in Seattle. You could go to Wikipedia and get a ton of information on, like, the population, the weather, whatever. That data isn't going to be in the document sources or the content sources that you provide the LLM, but if you enrich it with the data in the knowledge graph, you're kind of giving background material to the LLM, to say, okay, I see you use the word Seattle in there, but I'm going to give you some extra stuff with this. And that's what we're doing.
[00:15:37] So we give it a list of sources, a list of entities, and then a list of relationships between those entities, and the LLM can have all of that to answer from. I've been calling it color commentary. It seems to answer better in some situations, because it kind of knows the background material of what you're asking about, and it may not be based on just where the search hit was found.
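As a rough illustration of what "sources plus entities plus relationships" might look like once flattened into LLM context, here is a sketch; the format is invented for the example, not Graphlit's actual payload:

```python
def build_graph_rag_context(sources, entity_facts, relationships):
    # Flatten the three lists into one text block for the LLM:
    # content sources, enriched entities (the "color commentary"),
    # and the relationships between those entities.
    lines = ["Sources:"]
    lines += [f"- {s}" for s in sources]
    lines.append("Entities:")
    lines += [f"- {name}: {facts}" for name, facts in entity_facts.items()]
    lines.append("Relationships:")
    lines += [f"- {a} {rel} {b}" for a, rel, b in relationships]
    return "\n".join(lines)

context = build_graph_rag_context(
    sources=["Slack message: 'Met the team in Seattle last week...'"],
    entity_facts={"Seattle": "city in Washington, US (enriched from Wikipedia)"},
    relationships=[("Kirk", "works_for", "Graphlit")],
)
print(context)
```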
[00:16:04] And the other thing I didn't mention is there's another concept of, when you give the content sources to the LLM, the window around where you found the hit can be something that you optimize. So you could say, okay, I'm just going to give the paragraph where Seattle was found, or where my question seems to be answered, or I can expand that to the whole page of the document, or the section, like the chapter. Or you can even say,
[00:16:31] oh, it's a Slack message, it's an email, just give me the whole thing. I don't even care about just giving you the piece; I just always want the whole piece of data. And that really is another knob that you can turn. And that then gives the LLM more context, so that it saw the whole conversation or the whole chapter.
[00:16:51] And that's another variable that's really important for RAG quality.
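That "window" knob could be as simple as this minimal sketch, which optionally expands a retrieved chunk to its neighbours or to the whole document before handing it to the LLM (the mode names are made up for illustration):

```python
def expand_window(chunks, hit_index, mode="chunk"):
    # Expand the retrieved hit before handing it to the LLM:
    # "chunk"    -> just the paragraph-sized piece that matched
    # "section"  -> the hit plus its neighbouring chunks
    # "document" -> the whole thing (e.g. a short email or Slack message)
    if mode == "chunk":
        return chunks[hit_index]
    if mode == "section":
        lo, hi = max(0, hit_index - 1), min(len(chunks), hit_index + 2)
        return " ".join(chunks[lo:hi])
    if mode == "document":
        return " ".join(chunks)
    raise ValueError(f"unknown window mode: {mode}")
```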
[00:16:55] Dr Genevieve Hayes: Okay, so there's a lot to unpack in that answer. So just going back to the start, you keep mentioning vector search. Can you explain what's meant by that? Because I keep hearing that term again and again, and I've never actually seen a definition of it.
[00:17:08] Kirk Marple: Yeah, the simplest way to look at it is, you're giving it a piece of text. It's just some number of tokens of text, say around a few pages' worth, that you can hand to a model, and the model gives you back a vector of floating point numbers.
[00:17:24] And it's really just, it's almost a compression kind of idea, where it's generating a numerical representation of the data that you can run analysis on and find similarity between. So they talk about cosine similarity: you can take a vector of one piece of text and a vector of another piece of text, compare those numbers, and get a number back that says, hey, how similar do those things look?
[00:17:53] There are different embedding models. OpenAI has some, Cohere has some, there's a bunch of different ones out there. The length of the vector, how many numbers are in it, can differ. And so there's a lot of variability there, but you can tune it so that, essentially, you can retrieve the best quality data. That similarity search is really at the heart of how pretty much all RAG concepts work, rather than using a classic keyword search.
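Cosine similarity itself is only a few lines. Here's a toy version with made-up four-dimensional vectors; real embedding models return hundreds to thousands of dimensions:

```python
import math

def cosine_similarity(a, b):
    # 1.0 means the vectors point the same way (very similar text);
    # values near 0 mean unrelated text.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

embedding_a = [0.9, 0.1, 0.8, 0.2]  # toy embedding of one piece of text
embedding_b = [0.8, 0.2, 0.7, 0.3]  # toy embedding of similar text
print(cosine_similarity(embedding_a, embedding_b))  # close to 1.0
```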
[00:18:22] Dr Genevieve Hayes: So with these vector embeddings, are these equivalent to the embeddings that you would use if you were fitting an NLP model?
[00:18:31] Kirk Marple: I believe so. I don't come from a classic data science world. I think of it more as, it's almost like a fingerprint. Like we used to do audio fingerprints, or fingerprints of different imagery, things like that. It's kind of like this fingerprint of the data that lets you see how similar things are and sort of cluster that data together.
[00:18:48] And what you're doing during the ingestion process is creating the embeddings, which is also based on decisions about how you chunk up your data. If you have a 300 page document, you can't just stuff that all into one embedding. You have to break it up.
[00:19:03] And then they talk about the different chunking strategies. You could change the size of how much text is in each one. You can decide, do you want to put metadata in it or just the raw text? There's a whole bunch of decisions you can make. But at the end of the day, what you're doing during retrieval is saying, okay, somebody types in a prompt, you create an embedding of the prompt, and then you hold them up next to each other and go, okay, find me data similar to this.
[00:19:31] And what it gives you back is basically just the things that seem representative of the question you're asking.
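A minimal sketch of that ingestion-and-retrieval loop, where `embed` and `similarity` are hypothetical placeholders for an embedding-model call and a similarity function (such as the cosine similarity above), not any specific API:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    # Naive fixed-size chunking with overlap; real strategies also split
    # on paragraphs or sections and may prepend metadata to each chunk.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def nearest_chunks(prompt, chunk_embeddings, embed, similarity, k=5):
    # At query time: embed the prompt, then "hold it up next to" every
    # stored chunk embedding and keep the k most similar chunks.
    q = embed(prompt)
    scored = [(similarity(q, emb), chunk) for chunk, emb in chunk_embeddings]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]
```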
[00:19:38] Dr Genevieve Hayes: So because of this, if you're using a standard keyword search and we're searching for, I don't know, let's say King Charles versus Stephen King, the term King is going to come up in both of those.
[00:19:53] Kirk Marple: Yeah,
[00:19:54] Dr Genevieve Hayes: A keyword search would struggle with that, but with vector embeddings, you get that additional context that would be able to distinguish between the two types of King.
[00:20:04] Kirk Marple: That's a great example. And it's interesting, because the one place where this breaks down is, as you're typing in a question, you're not really optimizing it for search. You're optimizing it for what you want to retrieve and get the answer back. But the search mechanism may not really match that. I mean, if I say, hey, I don't know, that
[00:20:26] paper that I read last year, it may not know what that is. It's going to look for where I wrote the word paper in a document, but that's not really what I care about. I care about the context of what that was related to, or what the paper was about. And so that's where what they call naive RAG kind of breaks down, because the two things aren't necessarily related, but you have to work with them.
[00:20:49] Dr Genevieve Hayes: Does it have to be a graph database that underpins a RAG system, or could you do it with a regular relational database, a SQL database?
[00:20:58] Kirk Marple: Yeah, yeah. The whole graph concept is really more of, I don't know what you want to say, an advanced version of RAG. It's not required by any means. Classic RAG is really just a vector database where you can do the similarity search, plus some storage of the raw text in a way that is indexed,
[00:21:20] so you can basically get back from the vector index to a textual index, because you have to have the text to give to the model. And that could be in a file system, it could be in a database, it could be whatever. And there are very naive ways to do it. The graph part of this is really kind of an extension of the basics. It's really saying, look, the naive techniques maybe hallucinate, or they don't give enough context for the questions.
[00:21:51] Here's a technique to give it more context and give better answers. And it's more of a pattern; it's not a product. The initial version is, let's just use the text we found was similar and give that to the LLM.
[00:22:06] And here we're saying, let's go farther than that and incorporate the vector index and a graph index and use both of them together. And that's really what graph RAG is about.
[00:22:17] Dr Genevieve Hayes: So it's about getting the context with regard to the background information, but also the context with regard to the relationships between the pieces of information.
[00:22:29] Kirk Marple: Yep. Yep. Because what can happen is you have a piece of text that was found based on your prompt, but it may not be obvious that two of those pieces of information are related. The model may have been trained on that, but if it wasn't trained on that, how would it know that, like, Kirk works for Graphlit?
[00:22:47] GPT-4 is not going to know that, but if that happens to be on Wikipedia somewhere, or in the knowledge graph that we've extracted, that can give extra color and extra feedback to the model.
[00:23:01] Dr Genevieve Hayes: When the model is accessing the information that's in this vector database, what if there isn't information in the vector database about something you're querying? Would it then go and search the internet for information?
[00:23:17] Kirk Marple: Optionally. By default, this is where hallucination can happen. When you do a vector search and there are no content sources, nothing relevant, you need to be able to handle that case. And partially, you can handle it by prompting. You can say, look, if you don't find any content sources, or if you're not provided any, just respond back to the user,
[00:23:37] like, hey, I can't answer this. Because if you don't, by default, the models are going to try and guess, and you can get into weird situations with user facing applications. But it's really one of those things where there may be situations where you want it to pull on its trained material, and other situations where you just want to use it
[00:23:58] almost as a language engine, but not use its knowledge. You want to be able to hand it what you want it to answer from and let it do its thing of how to collate that into something useful, but you're not actually asking it for things it might've memorized during training.
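A hedged sketch of that "respond back to the user" prompting; the wording is invented for illustration, not Graphlit's actual prompt:

```python
SYSTEM_PROMPT = """\
Answer ONLY from the content sources provided below.
If the sources do not contain the answer, reply exactly:
"I can't answer that from the available content."
Do not fall back on knowledge memorized during training.
"""

def build_messages(sources, user_prompt):
    # Hand the model only what we want it to answer from, plus the
    # explicit "out" for when retrieval comes back empty.
    context = "\n\n".join(sources) if sources else "(no relevant sources found)"
    return [
        {"role": "system", "content": SYSTEM_PROMPT + "\nSources:\n" + context},
        {"role": "user", "content": user_prompt},
    ]
```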
[00:24:15] Dr Genevieve Hayes: So my understanding of how this works is, if I had a RAG-based chatbot, say, I would go into the chatbot and say, tell me everything you know about Seattle, for example. The chatbot would then translate that into some sort of a query, which would go to the knowledge graph database. It would retrieve the relevant information about Seattle, and then the LLM would summarize that specific piece of information it's retrieved, the most relevant one that is, and then say, here is everything I know about Seattle.
[00:24:54] Is that right?
[00:24:55] Kirk Marple: So there's kind of two ways. When you get a prompt, you can do entity extraction on the prompt. So if you're asking about, say, Seattle and Graphlit, the first pass would actually extract an organization entity and a place entity, and you can use that to search your graph and pull out more information.
[00:25:15] The text of the prompt also has an embedding created from it, which you can then use to talk to your vector database and pull out the raw text. And so you're kind of doing those things in parallel, where you have a text based query and a graph based query. Then, at least in what we do, we put them back together.
[00:25:33] And so we're coalescing the results and saying, okay, here's a whole set of things we got from two different places. Now give your answer based on these two sets of information, not just one.
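Sketching that two-pronged retrieval; the four callables are placeholders for real model and index calls, not actual Graphlit functions:

```python
def answer_with_graph_rag(prompt, extract_entities, graph_query,
                          vector_query, llm):
    # 1. Entity extraction on the prompt itself,
    #    e.g. "Graphlit" -> organization, "Seattle" -> place.
    entities = extract_entities(prompt)
    # 2. Two queries, conceptually in parallel: a graph-based query
    #    from the entities and a text/vector query from the prompt.
    graph_facts = graph_query(entities)   # entities + relationships
    text_chunks = vector_query(prompt)    # similar raw text
    # 3. Coalesce both result sets into one context for the LLM.
    return llm(prompt, context=graph_facts + text_chunks)
```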
[00:25:45] Dr Genevieve Hayes: How do you stop the LLM from hallucinating when it's giving the answer based on those pieces of information?
[00:25:52] Kirk Marple: That, today, is typically through prompting, and so you have to give it an out. A lot of times, LLMs, and I think I've heard other people say this, want to make you happy. They have a very hard time saying no, or just not giving an answer.
[00:26:04] And so they'll try and conform to something in an answer. And so what we found is you have to give it an out and prompt it to say, okay, if you can't answer with confidence, give a default string back. And to your point before, which I missed, about whether they can call out to the internet for web search: that would be part of the retrieval process.
[00:26:26] And so you could add what are called tools, and it would identify an extra tool that could get called during retrieval to say, oh, I don't have information in my knowledge graph, I'm going to go live search something, or just make a database query somewhere, pull some extra information, and then add that to my set of context.
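One simple shape that fallback can take, sketched with placeholder callables (`vector_query` and `web_search` stand in for an in-house index call and whatever search tool you wire in):

```python
def retrieve_with_fallback(prompt, vector_query, web_search):
    # Try in-house retrieval first (vector index / knowledge graph).
    sources = vector_query(prompt)
    if not sources:
        # Nothing relevant found: call the extra tool and add its
        # results to the context instead of letting the model guess.
        sources = web_search(prompt)
    return sources
```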
[00:26:46] Dr Genevieve Hayes: How much data would you need to ingest into a RAG-based chatbot to make it worthwhile, so that it doesn't constantly have to go out and look on the internet and things like that?
[00:26:58] Kirk Marple: We've seen good results even with, the sort of easy button is just chat with a PDF. You could literally just put one document in. With standard RAG, you can get value just from understanding it, like doing summarization; it's especially good at that. But
[00:27:14] the interesting area to me is where you can coalesce very different sources. Like, you have a workplace and you have Slack, you have Notion, you have Google email, and you want your retrieved data to go across all those sources. And it doesn't take much for it to give a good answer back, because
[00:27:33] if it's going to find something through vector search or through a keyword search, it's going to get something to answer from. Now, the question is just, is it the right thing, is it what you really want it to be answering from? And I think the other place is where there's a lot of duplicate data. If it finds a lot of things that are too similar, it doesn't know how to prioritize, and it can get kind of led astray during the retrieval process.
[00:27:57] I found this with, like, web footers that have the same text on every web page. They'll tend to get priority because they exist on every page. And so if you're searching for something and it happens to have a hit there, it'll find it on every page of the website, because that footer will be similar.
[00:28:12] So it does take a bit of an art to get these things to work properly for your content.
[00:28:18] Dr Genevieve Hayes: What happens if you have two sources that are diametrically opposed? So, for example, suppose you were doing something to do with politics, and you had one source of information from one side of politics and one from the other. They're probably each going to be saying our side's right, the other side's wrong.
[00:28:41] So you'll be having direct contradictions all along. How would it cope with something like that?
[00:28:47] Kirk Marple: That's a great question. I don't actually know the answer. I think it depends a lot on the training of the model. What I've found for things like that, where there might be disparate answers, is to ask it to do pros and cons. I mean, explicitly say, compare and contrast these things.
[00:29:01] And so you let it speak to both sides of it, and that's given pretty good results. How the models work specifically for that is kind of beyond my knowledge, but you can sometimes guide it to give you a balanced answer, I've found.
[00:29:18] Dr Genevieve Hayes: So it sounds like these are really good tools to assist you, but you can't just let the tool take the driver's seat. You have to actually know how to guide it.
[00:29:30] Kirk Marple: We're all building tools now to make it easier, but I think there has to be a knowledge that this isn't just a magic bullet. It's something to augment our knowledge and help us process knowledge better, and explore knowledge that's in content that's being generated every day.
[00:29:48] But the models just aren't at a point where they're going to replace the reasoning of your own brain. I hear people say they're as good as having a big pool of interns sitting in a room that can process data. And you don't have to go crazy thinking about how much these are going to replace jobs and this and that; they're not even close to that right now.
[00:30:09] I mean, they don't even have memory, essentially. They're read only models. But it's changing so fast that, with the trajectory of all this, we don't know where it's going to be in a year or two.
[00:30:21] Dr Genevieve Hayes: I'd like to have a discussion about the knowledge graphs that are underpinning these. One of the biggest challenges I found when I was developing that knowledge graph in that previous job, pre-ChatGPT, was coming up with a schema that worked for our data, especially in the early stages. So I had an idea for a schema, but then we'd get new data and it'd turn out, oh, hang on, this is a better way.
[00:30:46] And you say that with Graphlit, it's automatically producing a knowledge graph for people. Do they have to define the schema, or is it capable of coming up with a schema itself?
[00:31:00] Kirk Marple: Yeah, so that's a decision that I made early on, once I understood schema.org and what JSON-LD is and how Google creates their knowledge graph from that. It covers a wide swath of the types of objects and types of entities that you want to capture. Obviously there are gaps, and there are subclasses of these taxonomies for, like, healthcare, and I've seen them for manufacturing.
[00:31:24] But for the average thing that people are usually talking about, it really does have a ton of coverage in terms of what schema is available. So we took a very prescriptive approach to say, look, we're going to ask the LLMs to extract what they know about in JSON-LD format. And they're actually well trained for that,
[00:31:45] especially the OpenAI ones. So they don't have any problem pulling out the formatted data, the structured data, including the relationships. I mean, if there's a document that mentions a university and a couple of professors that work there, it'll give me formatted data back that says, okay, here's an organization,
[00:32:08] it's of this type, this class; it's essentially figuring out the class hierarchy. And then it's figuring out, okay, this person may work for this university, so it's picking out that relationship. It'll even know, oh, are they also an alumni of that university, and it'll create that relationship as well.
[00:32:27] And it does all of it automatically. It's literally just a prompt that gives you back JSON, and it's not perfect, but it's so much faster than having a human analyze that data. And so we look at it as, if we can get that data out there, then as a post process we essentially clean up the knowledge graph,
[00:32:49] where you're going to have to do entity resolution, like deduplication of similar entities, or correcting metadata, which we call enrichment. And so we may find a university in the data, but when we call Wikipedia or call some other data source, that may have been a made up name, or it may be, oh, it's actually a place, not
[00:33:12] a company. And so that's all the data curation side. That's still very much required, because it may not even be the model's fault. It may just be that it couldn't really tell, or maybe it's the same term used in two different ways. And so I think a lot of what knowledge graph creation and knowledge graph management has been over the years is that curation of the whole data set, and making sure that you still have a human in the loop for that.
[00:33:37] And we still see that on the LLM side of the world too.
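To illustrate, here's the kind of schema.org JSON-LD an LLM can be prompted to return for a document mentioning a university and a professor who works there. The names and values are made up, and the exact shape of Graphlit's output may differ:

```python
extracted = {
    "@context": "https://schema.org",
    "@graph": [
        {
            "@type": "CollegeOrUniversity",  # class picked from the hierarchy
            "@id": "#univ",
            "name": "Example University",
        },
        {
            "@type": "Person",
            "name": "Jane Smith",
            "worksFor": {"@id": "#univ"},   # extracted relationship
            "alumniOf": {"@id": "#univ"},   # second, inferred relationship
        },
    ],
}
```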
[00:33:40] Dr Genevieve Hayes: So basically what you're saying is that the LLMs are good at producing a generic schema that covers the vast majority of use cases, and then you could fine tune that if there was anything additional you needed.
[00:33:55] Kirk Marple: Yeah, I mean, they're already trained on the schema.org taxonomy, and so they know it very well. You can basically just prompt it, give it some suggestions, some examples, like few-shot prompting, and it'll do a nice job. If you want something beyond that, we're talking to folks in the healthcare and pharmaceutical industries, where they may have their own classification.
[00:34:15] And that would be something where you would need to give the schema to the LLM and say, here's the hierarchy, the taxonomy that I want to use; output me structured data based on that. And so that's more of a guided schema approach, where you're providing your own schema. We can kind of do both today, but we lean more heavily
[00:34:40] on a prescriptive schema, the schema.org one, today.
[00:34:43] Dr Genevieve Hayes: When you were speaking before, you were talking about entity resolution. That's something that really fascinates me. How do you do that? Does the LLM do it, or is there a separate tool?
[00:34:53] Kirk Marple: That's a great question. It's something we're working deeply on right now. There are a couple of different ways, and there are companies dedicated to this, that focus on just this part. But we do some basic comparison, kind of deduplication, on creation. And so we'll look, just using string distance and things like that, to see if there are other entities that are similar enough that maybe we shouldn't create a new entity;
[00:35:14] we should link to an existing entity. So that's kind of the easiest part. But one thing actually on my roadmap for the next couple of weeks is using embeddings to create clusters of the entities in the knowledge graph, and looking to do merging, or to identify, okay, it's Kirk from Seattle and Kirk from Chicago,
[00:35:35] but oh wait, they worked for the same company before, so they're probably the same Kirk. And I've actually been testing that with LLMs, and it's not bad. I literally was just doing this a couple of days ago, where I gave it the JSON of a few entities and I gave it a prompt, and I said, hey, create me a coalesced version of any entities that need to get merged, and merge the properties, fix up their names.
[00:36:03] Like, if you see one has a middle initial and one doesn't, give me the version with the middle initial. And even with a pretty quick prompting structure, they do a pretty nice job. And so if you can give it the sources of entities, it can actually resolve them together in a reasonable way. What we want to see is, okay, how reliable is that?
[00:36:22] And we need to do more testing on it. I think in a simplistic sense it actually does work, but we've got to look at scale and see how it really works at scale.
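That cheap first pass, string distance on creation, can be sketched with nothing more than the standard library; the threshold here is an arbitrary illustrative choice, and real systems compare the full entity metadata, not just names:

```python
import difflib

def looks_like_duplicate(name_a, name_b, threshold=0.85):
    # First-pass dedup: cheap string similarity before any embedding
    # clustering or LLM-based merging of the full entity records.
    ratio = difflib.SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    return ratio >= threshold

print(looks_like_duplicate("University of Melbourne",
                           "The University of Melbourne"))  # True
print(looks_like_duplicate("Kirk Marple", "Kirk T. Marple"))  # True
```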
[00:36:32] Dr Genevieve Hayes: What I'm wondering about is here in Melbourne one of our main universities is the University of Melbourne, and you would not believe how many different ways you can write the name of the University of Melbourne.
[00:36:44] Kirk Marple: Yeah. Yeah. I just did a couple. I took an example where I ingested a bunch of data, and then I handpicked things that looked kind of similar, but that I could tell needed to be resolved. And I just copy pasted it into the GPT playground with some prompting, and I was just curious to see really how well it could do.
[00:37:03] And the nice thing about it, and this was kind of my original idea, is we're not just giving it the name. We're giving it the full metadata, the full extracted structure. So it could have the place and the geolocation, and it would have all that context to it, whatever we were able to get out of the structured data, and then the LLM works from that.
[00:37:23] And I've seen some other products out there that essentially just train models to do this. The nice thing about the LLM approach is you can then guide it, and it's more programmable in a sense, because you can prompt it rather than having a pre-trained model.
[00:37:41] And I'd love to compare how good the new GPT models are versus those.
[00:37:47] Dr Genevieve Hayes: And this sounds like it's coming down to one of those things, context versus keyword search. The University of Melbourne is often referred to just as Melbourne, and context would allow you to know that if you're referring to Melbourne in the sense of the university, there's a difference between that and Melbourne, the city, whereas keyword search wouldn't be able to pick up the difference.
[00:38:11] Kirk Marple: And I think that's another interesting one. When you do the entity extraction, often we get both. I mean, you'd get a place version and an organization version of the same word, because it's picking that up from the context, and you wouldn't want to just deduplicate that away. You'd want to store some metadata with it to know that, okay, this is truly a university, this is truly a city. And that's why I really love the idea that these aren't just names in the old named entity sense.
[00:38:41] These really are entities, where they have fidelity and metadata with them, and that's what makes it more valuable.
[00:38:49] Dr Genevieve Hayes: So in just 18 months, we've gone from regular LLM chatbots like ChatGPT to RAG-based applications. Where do you see this technology heading 18 months from now?
[00:39:02] Kirk Marple: I mean, I think we'll still see a curve of GPT-5, GPT-6. I think those are still going to be really beneficial to all this kind of work that we're doing. It's only going to get better, and so I'm confident that curve is still going to go up. The other techniques for RAG are going to evolve.
[00:39:20] I think it'll be really interesting to see. I don't see RAG as standardizing. I don't think it's just going to be, you go buy RAG off the shelf and everybody has the same thing. It's more people's interpretations, and it's a pattern, so I think it gives people room to play and figure out where the best value is. And also, I think there's going to be more need for evaluation tools.
[00:39:43] We've used some open source projects to do RAG evaluations, and I'm sure there's going to be more in that realm. And then the one missing piece, I think, is an off the shelf graph embedding model that's equivalent to a text embedding model or, like, a CLIP image embedding model.
[00:40:00] There are open source models for graph embeddings, but they're not really turnkey; you can't just go grab one off the shelf from OpenAI today and call an API. And I think if graph RAG becomes more popular and starts to show more value, hopefully companies will invest a bit more, so that we could give the graph to a new type of model,
[00:40:23] and it could figure out the entity resolution on its own. Say we'd use graph embeddings just to cluster and be like, hey, here's a cluster of data, go shake it out and make sure it's the best quality and it's deduplicated and all that, and then we merge that back into our graph database.
[00:40:40] And so I think there's so much room for new things to happen in that space.
[00:40:45] Dr Genevieve Hayes: Do you use an off the shelf graph database like Neo4j?
[00:40:49] Kirk Marple: Yeah, so we use Cosmos DB there, with a Gremlin API, partially just because that was what I first prototyped with years ago, and it worked pretty well. It's a property graph, pretty standard. But we're not doing graph analytics per se;
[00:41:05] we're really using it more as another index for the data.
[00:41:09] Dr Genevieve Hayes: Okay, so if you're just using it as an index, you wouldn't then be able to tap into that graph database and do searches on it.
[00:41:17] Kirk Marple: No, we can do that. We put a GraphQL API on top of our database, and that lets you walk from one node to the next. You could essentially say, okay, for this content, show me what other entities it's related to, and then you could go from those entities and walk to their relationships.
[00:41:34] And those actually do follow the graph; they're doing graph queries under the hood. So we do leverage that today.
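A sketch of what walking the graph through a GraphQL layer might look like; the field names here are hypothetical, not Graphlit's actual schema:

```python
# Hypothetical GraphQL query: content -> entities it mentions ->
# each entity's relationships. Each nested level is a graph traversal
# executed as a graph query under the hood.
WALK_QUERY = """
query WalkFromContent($contentId: ID!) {
  content(id: $contentId) {
    name
    entities {
      name
      relationships {
        type
        target { name }
      }
    }
  }
}
"""
```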
[00:41:40] Dr Genevieve Hayes: Okay, so a data scientist could tap into the graph database that's under the hood of these RAG systems and then say, find the shortest path between me and you, for example.
[00:41:51] Kirk Marple: Oh, right, right. There are two sides to that. The data is there, and if they had access to the database, they could, but we don't expose that to our customers today. But say you used our ETL and we extracted the graph; one option we're looking at is to export that, so you could take the resulting graph and put it into another database, or do whatever you want with it.
[00:42:12] But today we wrap it in our API, so we're not giving that kind of low level access.
[00:42:16] Dr Genevieve Hayes: Okay. And if you were to download that database, it could potentially take up a lot of space and compute power. So it's probably not something that too many people would want to do, I'd imagine.
[00:42:29] Kirk Marple: We do give access to the nodes and edges for visualization. So you could do a query, even a vector query, today and get a resulting graph from that. And when you do a prompted conversation with RAG, we give you back the resulting subgraph related to the sources that were used.
[00:42:49] So we do give at least enough data there that you can visualize it, but nobody's asked for it yet in terms of analytics.
[00:42:57] Dr Genevieve Hayes: You haven't had a client who's nerdy enough.
[00:43:00] Kirk Marple: I mean, it's a great idea. I think there's so much we could do with it. The data's there; we would just have to expose it to people in the right way.
[00:43:08] Dr Genevieve Hayes: So what final advice would you give to data scientists looking to create business value from data?
[00:43:14] Kirk Marple: It's a good question. I think there's so much domain knowledge captured at organizations, especially around people that are retiring or moving on to different jobs.
[00:43:24] And that's an area that really interests me. It's an area that people could focus on more: capturing that kind of embedded knowledge in organizations. Maybe the person that's been there for 30 years is going to retire; how do you capture them in, like, a ChatGPT?
[00:43:43] And I think if there are ways to capture that knowledge, it would benefit organizations so much, because you don't have to retrain people from scratch. And if someone could figure out a way to build better tooling, even on top of what we have, to make it easier,
[00:44:06] almost like a recorder for organizations, to just embed that knowledge for the next wave of people that get hired. It's such an untapped area, and it would be such an interesting one: a workplace recorder that could benefit people. I always thought that's an interesting area to look into.
[00:44:18] Dr Genevieve Hayes: It'll be interesting to see if someone manages to develop something like that based on individuals' work while they're in a particular organization. Because if you took every report someone had written, every report they'd read, and every email they'd written, you'd go a long way toward downloading all of their knowledge.
[00:44:41] Kirk Marple: It's really true. And we had worked early on with some companies more in, like, the built world space, I mean construction and engineering and ports and railways, and there's so much embedded knowledge there. They would talk to me about, you know, 20 years of data
[00:44:57] that this one consultancy had collected, and that's an amazing amount of knowledge just sitting there unused. And to think of how you could transfer that knowledge into something that's actually usable by the people that are on staff right now. I think there are companies doing this for workplace search, but I don't know if they're framing it as, how can you have an agent that knows everything from the last 20 years of the organization and can help you do your daily job, at least by being the extra smart person in the room.
[00:45:31] Dr Genevieve Hayes: On that note, for listeners who want to learn more about you or Graphlit, or to get in contact, what can they do?
[00:45:38] Kirk Marple: Yeah. So we're just graphlit.com, and @graphlit on Twitter. Right now we're really just looking for developers that want to build with these new models and take advantage of unstructured data. You can find us there, or find me on LinkedIn, and I'm always happy to chat about the different things people are trying to build.
[00:45:58] Dr Genevieve Hayes: Okay. Thank you for joining me today, Kirk.
[00:46:01] Kirk Marple: Yeah, thank you so much. This is fun.
[00:46:03] Dr Genevieve Hayes: And for those in the audience, thank you for listening. I'm Dr. Genevieve Hayes, and this has been Value Driven Data Science brought to you by Genevieve Hayes Consulting.
