Episode 52: Automating the Automators – How AI and ML are Transforming Data Teams


[00:00:00] Dr Genevieve Hayes: Hello and welcome to Value Driven Data Science brought to you by Genevieve Hayes Consulting. I'm Dr. Genevieve Hayes and today I'm joined by Barzan Mozafari to talk about how AI and machine learning are helping data professionals, including data scientists and engineers, do their jobs more effectively.
[00:00:21] Barzan is the co-founder and CEO of Keebo, a turnkey data learning platform for automating and accelerating enterprise analytics. He's also an associate professor of computer science at the University of Michigan and has won several awards for his research at the intersection of machine learning and database systems.
[00:00:44] Barzan, welcome to the show.
[00:00:47] Prof Barzan Mozafari: Thanks so much for having me on the show, Genevieve.
[00:00:50] Dr Genevieve Hayes: In many organizations, data scientists and data engineers exist as support staff. Data engineers are there to make data accessible to data scientists, and data scientists are there to make use of the data to support the rest of the business. But in helping everyone else in the business, data professionals can often forget to help themselves.
[00:01:12] However, just as AI and machine learning can be used to help others in the organization perform their jobs more effectively, there's no reason why they can't also be used to help data professionals excel in their own jobs. And as experts in applying these techniques, data scientists are perfectly placed to leverage them.
[00:01:33] And that's what we're going to be discussing in today's episode. Using AI and machine learning to make data professionals, in particular data engineers, more efficient is something, Barzan, that you've devoted much of your career to, both through your research and through your company, Keebo. So, to begin with, can you give us a brief summary of your work in this area, and how it led to the creation of Keebo?
[00:02:02] Prof Barzan Mozafari: Sure. As you said, I've spent pretty much two decades of my career at the intersection of AI/machine learning and database systems. Essentially, my body of work over the last almost 20 years has been all about how we can leverage statistics or machine learning to build smarter data systems, where smarter could mean faster, easier to deploy, easier to scale, or cheaper overall.
[00:02:26] And we've done it in different contexts: we've done it for transactional databases, for analytical workloads, and so on. As for how it led to Keebo, it was a succession of different ideas, and it was a byproduct of what we saw in the market. If you step back and look at what we witnessed over the last decade or so in this space, and I like how you were saying that data scientists and data engineers leverage data and AI to help others
[00:02:52] but sometimes they forget to help themselves. I think that's pretty much how the story started with us as well. If you zoom out and look at modern data pipelines, one of the most impactful things that have happened over the last decade and a half, I would say, is the rise of what people know as data clouds: essentially the likes of Snowflake, BigQuery, Databricks, and others, who have done an amazing job of lowering the barrier to entry for people to actually start leveraging their data. Back in the day, you'd have to have a significant upfront investment in servers, hardware, software licenses, and trained DBAs.
[00:03:33] And, you know, it would have been at least a nine-month to one-year effort, and a significant amount of capital investment, before you could actually start leveraging your data. Modern cloud data warehousing completely turned this upside down. What they did is allow innovation by data teams, whether it's a team of data scientists or data engineers, to happen very quickly, especially along the edges of an organization.
[00:03:59] But as a side effect of lowering that adoption barrier, what we saw is that people were creating more complex data pipelines, and those data pipelines were becoming even more expensive. And the reason is that, obviously, whenever you make it easier for people to tap into their data, they're going to tap into more data, they're going to try to combine more data sources, they're going to combine them in more interesting ways, but you also have less technical or less database-savvy users empowered to query their data.
[00:04:27] So the modern data pipelines, I would say, are significantly more complex, more expensive, and much, much harder to optimize than what we used to see a decade ago. So now a lot of these data teams are finding themselves faced with a very complex data pipeline.
[00:04:48] They're spending a lot of their time manually tweaking these things, trying to optimize them. And with the global economy where it is, a lot of these data teams are finding themselves having to do more with less. The cost of data infrastructure is going through the roof. You have to hire more data ops
[00:05:06] people, more data engineers. Data scientists have to worry about the bill and whether someone's going to yell at them if they overspend their budget when they're doing their ad hoc analytics and whatnot. So that's where we started thinking about what we could do better than this.
[00:05:21] The likes of Snowflake and others have lowered that adoption barrier and have turned this essentially CAPEX problem into an OPEX problem. So the question was: how can we help lower the operating barrier for more people, so that they can take advantage of data clouds without having to spend an enormous amount of—
[00:05:41] Dr Genevieve Hayes: So, what exactly does Keebo do?
[00:05:44] Prof Barzan Mozafari: So, at a very high level, we built a learning platform where we learn from how users and applications interact with their data and use their data in the cloud. And then we use those trained models to automate and accelerate the tedious aspects of that interaction.
[00:05:58] So, for example, we start automatically optimizing their cloud data warehouse, and we save them a bunch of money. That's actually one of our most popular modules on the platform: people connect to Keebo, and then overnight we drop their bill, sometimes by up to 30 to 40 percent, and we just charge them a small fraction of whatever we've automatically saved on their cloud data warehousing bill. Or we use those models to automate FinOps for them.
[00:06:26] So we derive insights about their workload: here's a 360 review of your query health, your data health, your storage health, your workload health, your warehouse health, all of that, with a detailed breakdown of the major causes of their spend.
[00:06:42] What are the queries that are inefficient? How can they optimize what's suboptimal? And so on. Or a variety of other ways. For example, we also have smart query routing, where we use that knowledge we derive, or those models that we train, to route every query to an optimal warehouse, instead of sending a bunch of different queries to a warehouse that might be over-provisioned or under-provisioned.
[00:07:06] We can actually send every query to a perfectly optimal warehouse at runtime. So there's a lot of different use cases, but the main benefit of using our platform, the main value proposition we offer our customers, is this: we lower their cost, and we significantly reduce the amount of engineering hours spent on operating and optimizing these data pipelines.
[00:07:28] So we free up engineering resources. And then we give a lot of visibility and FinOps capabilities to the customers of data cloud.
[00:07:35] Dr Genevieve Hayes: As a data scientist, I'm really interested in the types of AI and ML you're using under the hood to do all these things. So are these forecasting models that you're using, or what sort of ML are you using?
[00:07:49] Prof Barzan Mozafari: That's a very good question. So it's not just one type of model. So we actually made a number of really important design decisions from early on that really helped us get the traction that we're getting right now and really ease the adoption in the market that we're seeing.
[00:08:03] One of those decisions was that we shouldn't make any assumptions about the customer's workload. No two customers are identical. Even the same customer has many different workloads with many different SLAs and performance or cost requirements. So when you think about forecasting, there are workloads that are extremely predictable and forecasting would sound like a good idea.
[00:08:22] But there's probably a lot more workloads that are not at all forecastable or predictable. You know, you can have an ad hoc workload where the data scientist is looking at X this week, the next week they're on vacation, and then the week after that the CEO has asked them a question, and they're looking at, a whole bunch of other things,
[00:08:40] So at the end of the day, we leverage whatever is best for the task at hand. To give you a small example: when we try to auto-optimize a customer's Snowflake, for instance, and save them a bunch of money, we have a variety of different algorithms under the hood, but one of our most effective and popular ways of doing it is leveraging reinforcement learning,
[00:09:00] because, and I say this as someone who has spent his entire career teaching databases, building databases, and selling databases, database systems are among the least predictable systems out there. They're extremely non-linear. You can run two very similar queries and get wildly different performance every time,
[00:09:19] so the way that reinforcement learning works really fits the bill. It's just like how you and I learn: we try something; if it works, we do more of it; if it doesn't, we think about it and do something else. So you train an agent, where each of these data warehouse vendors has a number of knobs,
[00:09:39] so you expose those knobs as actions to your reinforcement learning agent. And then it starts pulling these knobs and then you define the reward function as a combination of cost and performance. So, for example, you might have a customer who has a warehouse that's extremely mission critical. Like if you cause a slowdown on this warehouse, you're going to get fired.
[00:09:57] Maybe it's used by their CEO, or it's a customer-facing warehouse. So then you want to make sure that your reward function is heavily geared towards protecting performance. Whereas there's another warehouse that's an offline ETL task:
[00:10:13] as long as it finishes before 8 a.m. the next morning, you're fine, and you're just trying to save money. So you define the reward function accordingly, and we do all of this under the hood. The customer doesn't have to do any of it. The user interface is very intuitive: it's just a slider where the customer can either define their SLA
[00:10:27] or say how aggressive or conservative they want the algorithms to be. And then the agent starts actually operating. Whenever it takes an action that saves money, it gets rewarded. Whenever it takes an action that causes a slowdown for that warehouse, or doesn't save as much money for that customer, it gets penalized.
[00:10:45] So these agents learn without having to make strong assumptions about what that workload is, or whether a particular warehouse is mission-critical for this customer or not; all of that is encapsulated in the reward function. And then the agents start learning, and you'd be surprised how quickly these agents converge and start actually saving a lot of money for customers.
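To make the setup Barzan describes concrete, here is a minimal sketch in Python: warehouse "knobs" (here just the size) become the agent's actions, and the reward blends cost savings against a heavily weighted performance penalty. All names, weights, and numbers are illustrative assumptions, not Keebo's actual system.

```python
import random

# Candidate warehouse sizes are the "knobs" exposed as actions.
SIZES = ["XS", "S", "M", "L", "XL"]

def reward(cost_saved: float, slowdown: float, perf_weight: float = 0.8) -> float:
    """Reward = savings minus a weighted penalty for slowdowns.
    A mission-critical warehouse would use a high perf_weight."""
    return (1 - perf_weight) * cost_saved - perf_weight * slowdown

# A tiny epsilon-greedy bandit over the action set: try a knob, observe
# the reward, and shift toward actions that worked.
q = {s: 0.0 for s in SIZES}       # running estimate of each action's value
counts = {s: 0 for s in SIZES}

def choose(epsilon: float = 0.1) -> str:
    if random.random() < epsilon:          # explore occasionally
        return random.choice(SIZES)
    return max(q, key=q.get)               # otherwise exploit the best-known knob

def update(size: str, r: float) -> None:
    counts[size] += 1
    q[size] += (r - q[size]) / counts[size]   # incremental mean of observed reward
```

A real system would use a far richer state and action space, but this captures the core loop: act, observe cost and performance, and let the reward function encode how much a slowdown hurts relative to a dollar saved.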
[00:11:08] Dr Genevieve Hayes: That was going to be my next question. How long does it take for the algorithm to converge?
[00:11:12] Prof Barzan Mozafari: Because of some of these design decisions that we've made, the majority of customers actually start seeing value very quickly. First of all, the onboarding takes about half an hour. As someone who came from academia, I spent most of my career writing peer-reviewed papers in this area, pushing the edges of whatever we were doing: here's the state of the art, here's how we improved on it, and whatnot.
[00:11:35] I was always interested in getting wide adoption for these things, which was kind of unusual, to be honest with you. A lot of my colleagues in academia just get excited about writing the papers themselves. The papers get published, everyone's excited, you present it, and you move on to the next paper.
[00:11:50] But for me, the question was always: how can we take this to the next level so that it doesn't just remain a paper, but turns into something real that actually makes the world a better place? With AI in particular, one of the early lessons from a previous startup that I was involved in was that you need to solve for four types of barriers,
[00:12:14] and one of those was that the time to value has to be extremely short. So most of our customers actually start seeing savings within the first 24 hours. And then, as the agent learns more about that workload over time, the savings increase. The savings actually keep increasing, usually even within the first week, but they converge pretty quickly.
[00:12:35] Maybe within the first 24 hours you only get 10 to 15 percent savings. By the time the model's been in place and the agent's been learning, within the first week it might go up to 25, 30, 40 percent, depending on the customer's appetite for more savings.
[00:12:48] Dr Genevieve Hayes: Pretty awesome. I did some reinforcement learning when I was doing my master's and all of that was based on, the Bellman equation. So is this just an extension of the Bellman equation that I learned about in my master's?
[00:13:02] Prof Barzan Mozafari: Yeah, the Bellman equation is one of the most influential ideas in reinforcement learning, but there are many different flavors of it: how you learn, whether it's model-free or not, how you think about reward, how you think about the exploration-exploitation trade-off, and whatnot.
[00:13:17] So there's a lot more that goes into making reinforcement learning work in real life than what we teach our students in academia. The Bellman equation is a concept, but there are a lot of interesting real-life considerations that you have to think through.
[00:13:34] For instance, just to give you an idea of why things are trickier: if you're training a reinforcement learning model to learn chess, well, the rules of chess are well known. You just create a simulator and let the model play against itself, or play against some history, and the model learns, and once it learns, you say: okay, now I'm ready.
[00:13:56] And then you go and play against the world champion. It's not as easy when it comes to optimizing a customer's cloud data warehouse, for a number of reasons. Number one, you don't have all the time in the world. Number two, the cost of a mistake is extremely high, so you need to learn without burning that customer.
[00:14:13] We're fully autonomous, or rather, we also allow customers to have safeguards in place and whatnot, because not everyone's okay with fully autonomous, but Keebo is the only solution that actually has the ability to be fully autonomous as it does its optimizations.
[00:14:28] Whenever you have something that's fully autonomous, well, the saying we have at Keebo is that our first slowdown is our last slowdown. The analogy I give is that we're offering a fully autonomous car. If you have a fully autonomous car, you can't just train it on the live streets in Melbourne, for example, because your first crash is going to be your last crash. In our case, our first slowdown for the customer will be our last slowdown.
[00:14:53] The customer doesn't want you to make mistakes that cause a problem to that customer. So you have to have a lot of interesting safeguards in place. Because at the end of the day, if you don't make mistakes, you're not going to learn, just like humans, like, if you never take a particular action, you're never going to know whether it's a good action or not.
[00:15:10] So at some point you need to explore, but you need to have safeguards in place so that your learning does not burn that customer. So yes, the Bellman equation is an important part of reinforcement learning, but interestingly enough, there are a lot of real-life considerations that you have to engineer for and put in place to bring something to life that works and creates value for the customer.
[00:15:34] Dr Genevieve Hayes: Yeah, I remember when I was learning reinforcement learning. We had to build for an assignment a reinforcement learning agent that played a video game. So you had to land this spaceship on the moon between these two flags. And you could watch it training on your computer, And basically the first half hour was just watching this spaceship crash again and again.
[00:15:57] And I was just imagining the database equivalent of that, where your reinforcement agent is doing insane things and crashing the database system again and again. How do you put a guardrail in place so that if it pushes this funny lever and does the metaphorical equivalent of a crash, that doesn't actually translate through to a real crash?
[00:16:23] Prof Barzan Mozafari: So, the short answer is: with lots of difficulty. But I'm pretty sure you're expecting more details. There's a lot of engineering that goes into it. For instance, one of the safeguards we have in place is called the performance guardrail. We look at the historical statistics on that warehouse.
[00:16:39] For example: hey, here's the number of queries that are typically queued. One of the most interesting things we've learned at Keebo is that something as innocent as "performance" has widely different meanings and interpretations depending on whom you're talking to. Customer A's notion of performance is "as long as my job doesn't fail", whereas customer B's notion of performance is "if my 99th percentile query latency exceeds two seconds, then you're in trouble",
[00:17:09] and even within the same customer, what the data team considers good performance might be very different from someone in their supply chain management department, or the CFO, or someone else using the system. So one of the other decisions we've made at Keebo is that we shouldn't make assumptions about the user.
[00:17:25] We're not going to make decisions in terms of what's important to them; we allow them to express what's important. So one of those things is called, as I mentioned, the performance guardrail, where we actually look at their historical data and say: hey, you know what, on this warehouse, here's your average latency, here's your 99th percentile latency,
[00:17:45] on average you have this many queries queued at any point in time, here's the total queue time on this warehouse, and whatnot. So then we use that as a baseline: as long as we're within those same parameters, we have not caused a noticeable impact to the customer. So the agent can take actions that it's fairly confident are going to keep it within those bounds.
[00:18:06] But we also allow customers to express those SLAs explicitly if they wish to do so. And then, as soon as we see those metrics deviate, the backup mechanism kicks in and overrides the reinforcement learning's decision. Say the reinforcement learning says: hey, you know what, I think if I crank the size of this warehouse down from an X-Large all the way to an X-Small, I can save this customer a lot of money.
[00:18:28] And then the performance guardrail goes and says: yes, thank you, but no, thank you. We're not going to do that; it's going to cause a slowdown. So it's an interesting game. At the end of the day, we've never had customers complain, "why did you only save me 35 percent today instead of 36 percent?", but they will get pretty upset if their queries become slower.
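To make the guardrail idea concrete, here is a minimal Python sketch: derive a baseline from the warehouse's historical statistics, then veto any agent action predicted to push performance outside that envelope. The field names, tolerance, and veto string are illustrative assumptions, not Keebo's actual interface.

```python
from dataclasses import dataclass

@dataclass
class Baseline:
    p99_latency_s: float     # historical 99th-percentile query latency (seconds)
    avg_queued: float        # average number of queries queued at any time
    tolerance: float = 1.2   # allow up to 20% deviation before intervening

def within_guardrail(b: Baseline, p99_latency_s: float, queued: float) -> bool:
    """True if the predicted state stays within the historical envelope."""
    return (p99_latency_s <= b.p99_latency_s * b.tolerance
            and queued <= b.avg_queued * b.tolerance)

def apply_action(b: Baseline, predicted_p99: float, predicted_queued: float,
                 action: str) -> str:
    # The guardrail overrides the agent whenever a slowdown is predicted.
    if within_guardrail(b, predicted_p99, predicted_queued):
        return action
    return "keep-current-size"   # veto: protect performance over savings
```

The key design choice this illustrates is the asymmetry Barzan describes: when savings and performance conflict, the veto always favors performance.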
[00:18:53] So that's why, whenever we're at a fork in the road between protecting performance and saving more money, we err on the side of protecting performance, because that's what most customers prefer, even though we make more money when we save more for the customer. In our business model, we share a percentage of the savings with the customer if they're using us for cost saving (we have different use cases, as I mentioned). But we want to make sure the money we save is what I'd call no-brainer savings.
[00:19:28] If you just wanted to save money at whatever the cost, you wouldn't need a company, you wouldn't need reinforcement learning, you wouldn't need a PhD, and you wouldn't need any of the patents that we have. I can tell you: just go and reduce the size of all your warehouses to X-Small, and I promise you're going to save a lot of money. But then you're going to get fired the next day. So the idea is to find savings that do not come at the cost of a slowdown.
[00:19:45] As a simplified, intuitive visualization, I tell people: listen, if your warehouse is underutilized by more than 50 percent, that's a good sign that reducing its size by half is not going to cause a slowdown. Does that make sense? It's like a bus: if more than half of the seats are empty, you can move your
[00:20:08] passengers to a bus that has half the number of seats and save money without compromising people's comfort.
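The bus analogy above can be sketched as a tiny heuristic: when a warehouse is utilized less than half the time, stepping its size down one notch is unlikely to cause a slowdown. The size ladder and the 50 percent threshold are assumptions for illustration only, not Keebo's actual sizing logic.

```python
# Snowflake-style warehouse size ladder, smallest first.
SIZES = ["XS", "S", "M", "L", "XL", "2XL"]

def suggest_size(current: str, utilization: float) -> str:
    """Suggest a one-step downsize when utilization is below 50%."""
    i = SIZES.index(current)
    if utilization < 0.5 and i > 0:
        return SIZES[i - 1]   # half the "seats" still fit all the passengers
    return current            # busy warehouse (or already smallest): leave it
```

For example, a Large warehouse at 30 percent utilization would be stepped down to a Medium, while a busy one would be left alone.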
[00:20:15] Dr Genevieve Hayes: Does it help that pretty much all of your clients would probably use one of the standard cloud databases, so Snowflake, BigQuery, Databricks, things like that? Is there knowledge that you can take from the fact that you've optimized other similar databases, and use that as a starting point when optimizing the database for a new client?
[00:20:39] Prof Barzan Mozafari: That's a great question. I'm going to break your question into two parts, because my answer to part of it is yes and part of it is no. Do our models, our know-how, our technology generalize to other database systems, or is it specific to Snowflake? Yes, they generalize. Each database system is different
[00:20:58] and the knobs are different. For example, a database workload behaves differently when you go from a Large to a Medium, so the models and the agents learn that. But there's a lot of common best practices from an engineering perspective that translate from one to another. So the answer to the first question is yes.
[00:21:14] However, we're only focused on cloud data warehouses. For example, our warehouse optimization is GA for Snowflake, and we have other offerings on our platform, like workload intelligence or smart query routing, where we're signing up beta customers for Databricks and other platforms.
[00:21:30] But we're not doing any of this for on-prem databases. The reason is not the technology; it's simply because the world is moving towards the cloud. So, in business terms, we're just following the money. You know, there's this joke where they asked a bank robber: why do you rob banks?
[00:21:46] They said: because that's where the money is. So the reason we're supporting Snowflake is because that's where the money is. It's one of the best, but also most expensive, cloud data warehouses out there, and that's why people have the most acute pain when it comes to paying their Snowflake bill.
[00:22:02] So that's where we can move the needle a lot. But the second part of your question is: do we transfer knowledge? In other words, do we do transfer learning from one customer to another? The answer is no, for two reasons. Number one, there's a lot of privacy concerns: customers don't want their workload, even if it's just metadata, to benefit someone else's workload.
[00:22:23] There's compliance and privacy concerns that every data scientist is aware of. But the second part of it is, to be honest with you, every customer, and even every single workload and every single warehouse, is pretty unique. What people run on a Small at customer A is wildly different from what this other customer is running on theirs,
[00:22:45] and what this customer considers good performance is very different from what that customer does. So what we've done is build a platform that actually gets smarter for each customer, independently of other customers, based on their own usage and their own workloads.
[00:22:58] Dr Genevieve Hayes: As I mentioned in the introduction, traditionally the role of the data engineer has been to support the data scientist in doing their job. But Keebo essentially flips that on its head. By making use of Keebo, data engineers are being supported by data scientists to do their jobs more effectively, albeit the data scientists who are supporting them are those employed by Keebo to train your reinforcement learning models, rather than in-house data scientists from the organization.
[00:23:31] Now that's a pretty big shift in the whole dynamics of data engineering, data science teams. How do you see the relationship between data scientists and data engineers evolving as machine learning becomes more integrated into data systems, both through third party tools such as Keebo and tools built in house?
[00:23:57] Prof Barzan Mozafari: I think that's a very good question, actually. Traditionally, it's exactly as you mentioned: data engineers provide the data so data scientists can glean insights from it. But as things have progressed, as I mentioned earlier, these data pipelines have become so complex.
[00:24:15] They've actually become complex enough not to be easily manageable and optimizable by humans. Like, at Keebo, I've seen queries that sometimes span multiple pages. Like as someone who's been teaching databases, I have to stare at it the whole day to really understand what this query is doing.
[00:24:31] A lot of them are even auto-generated. So it does make sense, actually, at some point for people to start eating their own dog food: the data scientists, in this case us, building automations using machine learning. And I think there's this thing people talk about, databases for AI, or systems for AI, and now AI for systems too.
[00:24:51] So I think the world is changing. People are realizing there's a limit to how much you can do manually. I call our warehouse optimization tool at Keebo an infinitely patient, infinitely competent DBA, because at the end of the day, suppose you have a DBA and you're running one million queries a day, which is not uncommon for a customer of, let's say, Snowflake, who might have millions of queries on a weekly basis.
[00:25:15] Now, we're expecting a DBA or a data engineer to kind of stare at one million queries and say, Oh, I think here's how you can optimize them. Obviously, that's not humanly possible. It's just beyond the comprehension of anyone to actually make optimal decisions, even if they had the time or the skills to do that.
[00:25:31] And even if they did, a lot of these cloud workloads change. And, with all due respect to our data scientist friends, a lot of what data scientists do also changes rapidly and drastically over time. What my data scientists are looking at this week is very different the next week, let alone a quarter from now.
[00:25:48] So even if I, as a data engineering team, manually optimize everything for them today, that's not going to be optimal a month from now. This idea of manual optimization is a never-ending exercise. So that's a lot of what I think is going to change: people are going to start building more AI-powered automations for data engineering as well.
[00:26:08] And in fact, that's one of the biggest challenges we're facing, to be honest with you: engineers, by nature, and I say that as someone who's an engineer by training himself, are out there to build things. Building is first nature to engineers. So a lot of the time we see data engineering teams where they're like: oh, I could just build that myself.
[00:26:29] And I think that's being on the wrong side of history, because machine learning and AI in general have reached a point where, number one, they can automate a lot of these tedious tasks that humans cannot, or should not, be doing manually. But also, think about it. I always give this example.
[00:26:46] You're a marketing company. Your data engineering team should be doing things that drive your business. They should not be building Snowflake optimizations. There's a limit. One of the biggest mistakes I see, in engineering orgs in particular, across the board, is that people have a tendency to build versus buy. Sometimes that's the right decision, but more often than not, nine out of ten times, people are trying to build something where they could buy something significantly better at a fraction of the cost, because they just think about the dollar value.
[00:27:20] They don't realize that, hey, if I have two engineers work on this for three months, even if there's zero maintenance afterwards, that's half a year's salary of a full-time engineer. And number two, there's a huge opportunity cost: what could those two engineers be working on instead? Instead of saving money or optimizing this thing, what if they actually worked on something that grew my top-line business and revenue?
[00:27:45] If I'm a software gaming company, my engineering resources should be invested in creating a better, more fun computer game; or if I'm a marketing company, I should be focused on my marketing product. So I think a lot of this is going to change, but the biggest challenge we're seeing is that, because
[00:28:05] there's that tendency of build versus buy, there's still a lot of resistance. But history is one of the most powerful forces in the universe. I think people will eventually realize that, hey, they're going to be replaced by machines if they don't focus on the right thing. If they continue focusing on building things that can be automated, I think they're losing that battle.
[00:28:28] But at some point, I see that AI and data scientists will be delivering a lot more value.
[00:28:34] Dr Genevieve Hayes: Yeah, that whole build-versus-buy thing that you've observed among data engineers, that's something you also get with data scientists. So I can imagine some data scientists, and this was me in the past, would see something like Keebo and think: wow, I love reinforcement learning.
[00:28:52] Let me have a go at building this myself. Whereas I imagine there would be a massive investment in data science hours required to build something that would not be even a fraction as good as what you've done at Keebo. But that's what a lot of data scientists want to do. Do you foresee a time when the role of in-house data scientists will expand to include supporting data engineers, or do you expect that most of the data science support for data engineering will come via third-party tools such as Keebo?
[00:29:25] Prof Barzan Mozafari: That's a very difficult question. I would hope that it's the latter, but I think, to be honest with you, the answer will depend on how sophisticated these organizations are in the way they consume AI. There are some organizations where the CIO or the CDO are extremely AI-savvy.
[00:29:43] They know exactly what's possible. I always say the key is to find out what can and should be automated. A lot of effort is wasted automating things that should not be automated, or trying to automate things that cannot be automated with today's AI. So organizations that are well run and well managed, with the right visionaries, will make sure they have guardrails and decision-making processes in place so that organizational resources are not wasted on side projects.
[00:30:10] I just remembered an interesting story about an organization we once crossed paths with. They basically had two engineers, and I happened to be in that meeting. So we have a number of different offerings.
[00:30:21] One of our offerings is a patented, fully automated query rewriting tool that takes a query, automatically rewrites it, and makes that query 10 to 100 times faster. This is particularly useful for BI workloads, and it does so with correctness guarantees. A tremendous amount of work has gone into verification, making sure that we automatically detect changes in the base data.
[00:30:44] We automatically update the smart models that we're creating, automatically rewrite queries, and automatically enforce the SLAs that the customer has. Many PhDs have spent years working on this; people whose PhD was literally on database optimization have spent thousands of hours building this tool.
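The loop described here — build pre-aggregated "smart models", detect changes in the base data, and rewrite a query only when it's provably safe, otherwise fall back to the original — can be illustrated with a deliberately tiny sketch. This is not Keebo's patented implementation; the table names, the regex-level "rewriter", and the version-based freshness check are all hypothetical stand-ins:

```python
# Toy sketch of freshness-guarded query rewriting (illustrative only).
import re

# Hypothetical registry of summary ("smart model") tables: each maps a
# (base table, group-by column) pair to a pre-aggregated table and records
# the base-table version it was built from.
SUMMARIES = {
    ("orders", "region"): {"summary_table": "orders_by_region",
                           "built_at_version": 42},
}

# Current version of each base table, bumped whenever the base data changes.
BASE_VERSIONS = {"orders": 42}


def rewrite(query: str) -> str:
    """Rewrite a simple `SELECT SUM(x) FROM t GROUP BY c` query to read a
    summary table instead, but only if that summary is still fresh.
    In every other case, return the query unchanged -- that fallback is
    what gives the correctness guarantee."""
    m = re.match(r"SELECT SUM\((\w+)\) FROM (\w+) GROUP BY (\w+)",
                 query, re.IGNORECASE)
    if not m:
        return query  # pattern not recognized: leave the query untouched
    _, table, col = m.groups()
    entry = SUMMARIES.get((table, col))
    if entry is None:
        return query  # no summary covers this query
    if entry["built_at_version"] != BASE_VERSIONS[table]:
        return query  # base data changed: summary is stale, don't use it
    # Summary tables here are assumed to store one pre-summed row per group.
    return f"SELECT {col}, total FROM {entry['summary_table']}"


print(rewrite("SELECT SUM(amount) FROM orders GROUP BY region"))
# -> SELECT region, total FROM orders_by_region
print(rewrite("SELECT SUM(amount) FROM payments GROUP BY region"))
# -> unchanged: no summary exists for that table
```

A production system has to prove equivalence for arbitrary SQL, keep the summaries incrementally up to date, and route around SLA violations — exactly the corner cases that make "a side project over a couple of months" unrealistic.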
[00:31:05] And then we came across a couple of engineers and they were like, oh, that's a pretty cool tool, we're thinking of doing this as a side project over the next couple of months. And you start thinking to yourself, they have no idea what they're talking about.
[00:31:17] A lot of times you feel like, oh, I could do that. And then when you start actually writing code, you realize, gosh, there are so many corner cases, so many things I hadn't accounted for. So it was just really funny. I just keep thinking about it. But first of all, they were not a software company.
[00:31:32] They were in a completely different business. If I'm being honest, they had no business building any internal tooling, for that matter. But even then, they would think of building this thing over the next couple of months, and doing so as a side project. Every time I remember it, it just brings a smile to my face.
[00:31:47] But that's an example of how, a lot of the time, we as engineers tend to underestimate the time, effort, resources, and opportunity cost of building things that we have no business building. And it applies to us at Keebo as well. I always say to our team, people are buying Keebo because we have the smartest technology for optimizing their Snowflake, for giving them FinOps, for giving them smart query routing.
[00:32:11] That's where we should be focusing our resources. We have no business building a new logging or monitoring tool; we can just go buy that. We should not be building things that are readily available. So, to your original question, do I see people building it in house? I hope not.
[00:32:27] But I think it's going to continue happening until the field matures. People graduating with a CS degree these days are a lot more AI and machine learning savvy than the previous generation that graduated even 10 years ago, for whom most of what they'd heard about AI came from social media or sci-fi movies.
[00:32:45] But with the newer generation of data scientists, or even executives, you're seeing more leaders who are capable of deciding what should be bought versus built.
[00:32:57] Dr Genevieve Hayes: It reminds me of something that happened to me on a job I had, about five years ago now, where initially senior management wanted the data science team to build all these models that effectively replicated the models you could buy from the likes of AWS, Google Cloud Platform, Microsoft, et cetera: translation models, transcription models, things like that.
[00:33:24] We didn't have the data to do it. We didn't have the GPU farm. We didn't have any of these things. We basically failed at it for a couple of months. And then something changed and they needed these models by the end of the month. Suddenly the attitude of senior management shifted: they handed over the credit card, we got the GCP models, and we had it all built out by the end of the month.
[00:33:49] And it was one of the greatest moments, because we went from these models that were just completely failing to something that was working perfectly in the space of about four weeks.
[00:33:59] Prof Barzan Mozafari: That's amazing, that's exactly, yeah, what I'm talking about as well.
[00:34:05] Dr Genevieve Hayes: And for me, that was a light bulb moment, because before that I was very pro-build, because I'd done this degree and I wanted to build everything. But just seeing how quickly we could get this brilliant thing up and running.
[00:34:17] After that, my attitude just shifted: if it exists, just pay for it, and you'll have it now, and it'll be better than anything you can build in house.
[00:34:26] Prof Barzan Mozafari: And you can spend your talent on building on top of it, instead of reinventing the wheel. I think there's also a bit of psychology there, actually. I once asked the engineering leader at one of the FAANG companies, why do you guys keep building these tools?
[00:34:39] I promise you, for everything they build there's already a tool out there; they've built a lot of different versions of storage systems and whatnot. And he was like, listen, at the end of the day, when you write code, you have to debug that code and maintain it. And we would much rather debug our own crappy code than debug someone else's crappy code. Which I think is true for a lot of the open source things, where there's code there.
[00:35:02] You have to maintain it. But I feel like that excuse is no longer valid now, with a lot of these tools being fully managed hosted services where there's nothing to debug. You just get an out-of-the-box, no-code or low-code kind of solution. But yeah, I remember him telling me very honestly, like, we can't say this publicly,
[00:35:21] but we'd much rather debug our own crappy code than someone else's crappy code. That's why they keep reinventing the wheel.
[00:35:28] Dr Genevieve Hayes: Well, I mean, that's the thing, all those pre-built models, they're not perfect. I spent about a month doing all these tests on them to try and find out where they broke. But the thing is, once you figure out where they break, your own models are going to break too. And I actually prefer trying to find the problems in someone else's crappy code when that someone else happens to be Google or Microsoft or Amazon, than trying to find them in mine.
[00:35:59] Prof Barzan Mozafari: Oh, well said. That's a good point.
[00:36:03] Dr Genevieve Hayes: So, looking ahead, what emerging trends and technologies do you see having the biggest impact on the data space in the coming five or so years?
[00:36:12] Prof Barzan Mozafari: Oh, that's a pretty good question. One of the things that makes life a little bit hard these days is that LLMs and GenAI have become buzzwords, so it's that much harder for people to discern wannabes and marketing buzzwords from the actual technology. I think that's going to fade away, hopefully.
[00:36:30] I don't know if that's my prediction or my hope, but I think that's one of the biggest challenges we're facing as technologists. As people who came from research, we actually publish peer-reviewed publications and patents, and we care about this; we have a lot of machine learning and database PhDs working on what we're doing.
[00:36:46] But then we show up to these trade shows. I think we were at Snowflake's summit last year, and I just took a walk and looked at the different booths. I don't think I came across a single booth that did not have the word AI on it.
[00:37:00] And there is no way that every single vendor there is using AI, but that's just how marketing works. When something becomes hot, everyone feels like they need to use it. It's like how big data used to be a buzzword two decades ago, when people felt like, hey, we need to put big data on everything, and people will come.
[00:37:18] And I think that's made it hard, not just on the vendors who actually do AI, because now everyone's talking about AI, but it's made it harder on buyers, enterprise buyers particularly, because they cannot discern real AI from vendors merely claiming to use it. I think a lot of that is going to go away.
[00:37:35] And when things become more real, I think we're going to see a real impact, because if people know, hey, this thing is actually real, some of these adoption barriers will come down. I think that's one thing that's going to happen. The second thing is that people are going to have a better idea of when and how to leverage AI.
[00:37:52] Right now it's not just build versus buy; there's also a lot of unreasonable, I would say even futile, resistance to AI by ICs, engineers, data scientists, because they feel like, hey, that's coming to take my job away. And my response is always: if you're doing something that can be automated,
[00:38:10] your job is going to be taken from you. It's just a matter of time. So instead of resisting it, the best solution is to educate yourself and figure out how you can leverage whatever AI is available on the market. Maybe you can hold that ground for another three months, another year, another two years, but that's a wave that's going to come and crush you.
[00:38:30] That applies if all you're doing is things that can be automated. If you want job security, be the innovator, be the one who actually adopts AI. We've literally had customers who won internal awards from their CFOs. I remember a customer at one company who won the CFO's strategic initiative award for finding out about Keebo. All the guy did was schedule a trial, and they saved about $500,000.
[00:38:55] So knowing how to leverage AI, how to best leverage it, how to build on top of it, I think that's a great skill to have. But related to it, my third prediction is that, and it's almost not a prediction because we're already seeing early signs of it,
[00:39:09] people are seeing what AI is, and a lot of people, even people who graduated a decade ago, are going back to educate themselves. People are taking stats classes, retraining themselves in some ways. I think those are the three things that are going to be really transformative.
[00:39:24] So instead of just talking about what the next generation of LLMs is going to do, I see the real opportunity in how we perceive them, how we leverage them, and how we prepare and educate ourselves.
[00:39:36] Dr Genevieve Hayes: And what final advice would you give to data scientists looking to create business value from data?
[00:39:42] Prof Barzan Mozafari: I would go back to the last two points I was mentioning; that's my prediction of what's going to happen. AI is coming for the low-value, manual, repetitive, tedious tasks. Make sure you're on the right side of history. The way you can be on the right side of history is that, instead of being the one who has to convince their boss why they should not buy this AI, be the one who embraces it, be the innovator who educates your team.
[00:40:06] And part of that is educating yourself. If you didn't take AI classes when you got your degree, we're living in a time and age where it's extremely easy: there are plenty of online courses, even free ones, and with a very small investment you can learn just enough to be dangerous.
[00:40:22] Learn about supervised learning, learn the basics of reinforcement learning, learn basic statistics and machine learning. Go back, brush up on these skills, and then be the innovator, be the eyes and ears of your business: hey, here's what we're doing.
[00:40:39] Here's how we can leverage AI, how we can best leverage it, and here's what I can bring to the table. Because otherwise you're going to be replaced. So my advice, to summarize, is be on the right side of history.
[00:40:50] Dr Genevieve Hayes: So, for listeners who want to learn more about you or get in contact, what can they do?
[00:40:54] Prof Barzan Mozafari: You can hit me up on LinkedIn if any of this interests you; we're always interested in collaborating with others. I also recommend people go to our website, keebo.ai, where they can learn about the latest that we've done. You can sign up for a free trial if you're using Snowflake.
[00:41:13] Or any cloud data warehouse, in any capacity: reach out to us, get a free trial, try it out for yourself, read some of our papers. You can also find my academic publications on my home page if you're interested.
[00:41:23] Dr Genevieve Hayes: Thank you for joining me today, Barzan.
[00:41:25] Prof Barzan Mozafari: Thank you so much for having me. It's been a great pleasure.
[00:41:28] Dr Genevieve Hayes: And for those in the audience, thank you for listening. I'm Dr. Genevieve Hayes, and this has been Value Driven Data Science, brought to you by Genevieve Hayes Consulting.
