Episode 39: The Impact of Data Science on Data Orchestration
[00:00:00] Dr Genevieve Hayes: Hello and welcome to Value Driven Data Science, brought to you by Genevieve Hayes Consulting. I'm Dr. Genevieve Hayes, and today I'm joined by Sandy Ryza to discuss the role of data science in data orchestration. Sandy is a data scientist turned data engineer and is currently the lead engineer on the Dagster project, an open source data orchestration platform used in MLOps, data science, IoT, and analytics.
[00:00:29] He is also the co-author of Advanced Analytics with Spark. Sandy, welcome to the show.
[00:00:35] Sandy Ryza: Hi Genevieve, it's great to be here. Thanks for having me on.
[00:00:38] Dr Genevieve Hayes: One of the big promises of data science is that it gives organizations the ability to combine multiple disparate data sets to produce value creating insights. But this is only possible if you can get all those disparate data sets together in the one location to begin with. Having endured the early days of data science where this wasn't the case, thankfully, this is something that most organizations are now well aware of.
[00:01:07] And this has led to the rise of the data engineer and the data orchestration platform. In this episode, we're going to be looking at the role of data science in the domain of data orchestration and how data orchestration has evolved and continues to evolve in response to the recent rise in the use of AI and ML.
[00:01:30] But before we start discussing the evolution of data orchestration platforms, Sandy, I'm interested in learning a bit more about the evolution of your career. In particular, by all indications, you had a very successful career as a data scientist to the point where you literally wrote a book on the topic.
[00:01:52] What led you to changing directions and deciding to become an engineer?
[00:01:57] Sandy Ryza: Great question. Yeah, so I think I had experiences in data science that are probably similar to a lot of others, or a lot of other data scientists that I observed. I got into data science in the first place because I was really excited about the possibility of using data to help build understanding, both in an automated way through machine learning models, and also in a more hands-on way through analytics.
[00:02:23] And as you kind of alluded to, when I tried to do this, I encountered these issues that I think a lot of data scientists face, which is that they're asked to build a machine learning model or answer a question. And doing that requires data, and they don't always have access to that data or have it in the format that they care about.
[00:02:45] And then, even more so, once they've gone and answered that question, the next question they have to answer, or the next machine learning model that they have to build, often ends up relying on very similar sets of sub-questions. So, to answer how many weekly active users you have, you need to develop an idea of what a user is and what that means to your organization.
[00:03:07] And you want to be able to not throw away that work as soon as you do it. So I was in a couple of different data roles and basically encountered the same kind of challenges in each of them. So for example, I was working in healthcare data science at a company called Clover Health.
[00:03:25] We were trying to build models that would help us deliver better care to our patients. I was building transportation models at a company called KeepTruckin, trying to understand the behavior of truck drivers. And in all those cases, I found that probably more than 50 percent of my time was spent wrangling data and doing these frustrating tasks.
[00:03:46] I have a background in software and computer science. And so when I was faced with these frustrations, I would spend time trying to build software at these organizations that would empower myself, and then the teams that I was later on leading, to not have to spend so much time on these repeated tasks.
[00:04:04] And I basically built these mini layers at multiple different companies that were kind of helping me organize these data pipelines. So, let's say I needed some definition of users to be able to answer questions or build an ML model. I would build a little framework that would allow me to define a users table, and maybe define how that users table gets calculated and how it gets calculated from other tables.
[00:04:31] And ultimately, I was just spending most of my time at these companies building and maintaining these internal frameworks to make the job of my team easier. So I kind of naturally transitioned from data science to this engineering role, just through trying to help myself be better at my job
[00:04:49] at those companies, and then finally moved over to this full-time engineering role at Dagster Labs, basically trying to build a more general version of this tool that I had built multiple times inside these organizations, with the goal that it could be used at a bunch of different organizations.
[00:05:04] If I built it in a slightly more general way than I had been building it in the past, it could be a lot more widely useful.
[00:05:11] Dr Genevieve Hayes: What you described there about having to come up with that standard definition of a user, that's something that I've been through in my own career. One thing that struck me while you were saying that is: how did you get everyone else in the data science and data engineering team on board with your definition of what a user was?
[00:05:30] Sandy Ryza: Yeah, so it's always hard. And I think the way you talk about it, you almost present it as an easier problem than it is, because it's not just getting everybody on the data science and the data engineering team on board with your definition of a user.
[00:05:41] It's getting everybody on the business team who actually looks at that metric, and maybe even the CEO of the company, to agree with your definition of the user. I don't think there's any silver bullet when it comes to that, but I think what happens in a lot of organizations is, because there's no silver bullet, basically each data scientist or each team will come up with their own definition of a user, which works great for getting things done really quickly, but then suddenly you're looking at a dashboard and seeing data that is just totally inconsistent.
[00:06:08] It gets really bad when you start thinking in terms of money, when you have the finance team that has one model and the sales team that has a different model and their data scientists have different views of the world and, like, the dollar values just don't match up.
[00:06:20] The only thing I have to add there was kind of trying to use technology to force people to have these conversations before they built a ton on top of their different definitions of users and, coming in saying like, Oh, it looks like you're trying to define what a user is over here.
[00:06:35] Why don't you have a conversation with Andrew? Who's also trying to define what a user is over here? Before we sort of set that into stone.
[00:06:43] Dr Genevieve Hayes: What you described with the problems with the finance team, I had that exact problem because I spent a lot of my early career working in insurance. So I was in the actuarial and analytics team and we'd have one definition of all sorts of things. And the finance team had a different one. And we both reported to the chief financial officer.
[00:07:03] The number of times I got phone calls from the CFO saying, why don't your numbers match with the numbers from the finance team? And yeah, eventually it led to us sitting down and having a conversation and sorting everything out.
[00:07:17] Sandy Ryza: That's right. Yeah, it's a really interesting thing. Like, often people will be hired as maybe the first data science hire at a company, and the narrative will be that they are starting data at the company, but they'll get in, and this happened to me at KeepTruckin, and they'll find out that
[00:07:32] people aren't stupid. They've been thinking about data the entire time. And there's all these different functions that are already thinking about data, right? So finance has a data platform, and maybe it's not a fancy data platform with Spark and machine learning and everything, but they're doing stuff with data and they're getting by.
[00:07:48] And then maybe the engineering team will also have its own data platform, because they need to understand the performance of their application and catch all sorts of errors. So often the task of the first quote-unquote data hire at a company isn't starting the data function at that company, but actually unifying all these existing data functions that already exist at that company.
[00:08:09] Dr Genevieve Hayes: And also things like when you're dealing with finance, they often have to work with a particular set of definitions. So you might think that you've got a better definition, but legally finance has to use whatever the definition they have is. So you find that you actually have to shift to what they're going with rather than convince them to go your way, even though your way is definitely better.
[00:08:33] Sandy Ryza: That's right. That's right. And I don't want to misconstrue when I say there should only be one definition of a user. Like, sometimes there are different valid ways of looking at the same truth. But I think being intentional about that is much better than kind of letting it evolve in a very chaotic way, naturally.
[00:08:52] Dr Genevieve Hayes: Being able to articulate why you're doing something is actually more important than what you're doing, I think.
[00:09:00] Sandy Ryza: That's right. That's right. And I think that goes back to these tools that I've focused on at these organizations, and now have been building more general-purpose versions of: being able to add context to the data that you produce is so important. I think that there's an easy failure mode where you go into your data warehouse and you have a bajillion tables, many of which overlap in what they seek to represent.
[00:09:27] And being able to sort of impose some order on those tables, so that when someone comes across a table they can kind of understand why it was generated, what it's downstream of, what the intention was behind it, is really important.
[00:09:41] Dr Genevieve Hayes: Taking a bit of a step back, I can understand why you wanted to become an engineer, but at the same time, an alternative approach you could have taken is to just work with engineers and say, as a data scientist, this is what I would like you to do. Could you implement that? Why did you decide, okay, I want to do this myself, rather than working with engineers to get them to implement it?
[00:10:07] Sandy Ryza: Yeah, I think part of it was that I had a bit of an engineering background, so I felt comfortable working on that. I think there's often a bit of a rift between data scientists and the engineers that are there to support them. Even though those engineers are often well-intentioned, they don't necessarily have the sort of mindset of the data scientists as deeply in mind as the data scientists do.
[00:10:31] So, living a set of workflows day in and day out gives you a perspective that's a little bit harder to learn just by speaking to someone. And engineers who support data scientists often have kind of a variety of competing priorities. So, they'll be responsible for supporting other parts of the platform.
[00:10:49] They'll be responsible for supporting other data teams in the organization, or perhaps machine learning teams, and just can't necessarily spend the amount of time that it would take to get there and build the exact tool that the data scientist wants. Ideally, there's some technology and some sort of framework that allows engineers and data scientists to collaborate really well.
[00:11:11] So, the data scientists can be thinking very heavily in terms of the data and not have to think about the platform so much. But at least in the organizations and with the level of technology that I had experience with, there wasn't enough of a baseline that enabled that separation. I think in a way that's kind of what we're trying to get to with Dagster.
[00:11:32] Dr Genevieve Hayes: Before working on this episode, I hadn't previously come across Dagster, so can you tell us a bit more about the Dagster project?
[00:11:40] Sandy Ryza: Yes. So Dagster is a data orchestrator. The job of Dagster is to execute data pipelines that keep data assets up to date. So there's a lot of jargon in there that maybe it would be helpful to unpack. I think the biggest piece, at least coming from the perspective of being a data scientist, is this notion of a data asset.
[00:12:01] A data asset is some object that captures some understanding of the world. Normally it's a table; it could be a data set, it could be a machine learning model, it could be a dashboard. It's ultimately the output of some sort of process that operates on data. In some cases, data assets are kind of the end goal of your job as a data scientist.
[00:12:22] So, for example, if you're building a machine learning model, that model might be a data asset you kind of expose to the rest of your organization. In other cases, data assets are these intermediate objects that are helpful for other tasks that you might do. So it might be your users table that you query every time someone asks you a question about users and you need to produce an analysis.
[00:12:44] Ultimately the job of a data pipeline is to produce a set of data assets that are useful in some sort of way. So it starts with input data that comes from some source, and it produces a set of data assets that are useful to an organization.
[00:13:00] Dr Genevieve Hayes: And how does that differ from regular ETL?
[00:13:04] Sandy Ryza: I think the word ETL gets thrown around a lot, and ETL is a pattern of data pipeline. The way I kind of think about what most data engineers and data scientists are doing, realistically, is something like ETL, TL, TL, TL, in the sense that they start by extracting data from some source system.
[00:13:26] They bring it into a data warehouse, and then they constantly go back and forth between transforming it, doing some sort of derivation, and then loading it back into that data warehouse, developing sort of a chain of intermediate tables that at each point are more clean, more close to the end product they're trying to get to.
[00:13:45] So, ultimately, this overlaps with ETL. ETL is a bit of an ancient term that has somehow made it into our modern data engineering lexicon.
[00:13:54] Dr Genevieve Hayes: At what stage in the Dagster project's development did you become involved? Were you one of the founding members, or had development commenced prior to you joining?
[00:14:04] Sandy Ryza: I joined when the company was fairly small, but I wasn't one of the founding members. I helped take Dagster in a direction that was a lot closer to the tools that I had built when I was working at these companies. So I came in kind of like sold on this vision of what we now call asset centric orchestration.
[00:14:23] So really thinking about the data assets like tables that you're trying to produce when you define a data pipeline, as opposed to just thinking in terms of like, do this, then that, like, task oriented orchestration. And so I joined fairly early and kind of in many ways redesigned the Dagster APIs to follow that paradigm.
[00:14:45] Dr Genevieve Hayes: That's interesting. So was it going in that task oriented direction before you arrived?
[00:14:51] Sandy Ryza: Yeah, it's an interesting story. The original vision of Dagster, and what had sort of drawn me to it in the first place, was this asset-oriented vision. But when I joined, I think based on some early feedback from a few very opinionated users, the direction of the project had kind of veered into this task-oriented direction.
[00:15:13] So a lot of my initial work there was to kind of get the project back on track to this vision that had sort of motivated it in the first place, but hadn't been as present in the recent development.
Dr Genevieve Hayes: That idea of the asset-oriented vision, how does that align with your background in data science? Did your background in data science lead you to the conclusion that that was the way to go, or was it something else that caused that to resonate for you?
[00:15:43] Sandy Ryza: Yeah, absolutely. I think it was really the union of two things, my background in data science and working on orchestration, that sort of produced this asset-oriented direction. And I think what it came down to is that in data science and machine learning,
[00:15:57] you're always thinking about the data. So, when I was assigned some task, let's say I was asked to produce a report that would answer some sort of question as a data scientist, or to produce a machine learning model, the first thing my mind would go to is: what data do I need in order to produce this?
[00:16:14] Where do I start? What is a user, let's say? And then, in figuring out what a user is, I would end up needing to think about even more granular data sets. Okay, to define what a user is, we need to understand, maybe, a user is defined as someone who logs into our website.
[00:16:32] So that means I need to understand the data about website logins. And what you end up with is this network, or as we call it, a DAG of data sets that, you know, ends with the ultimate thing you're trying to understand or create, but then depends on a set of intermediate data sets that capture some intermediate understanding,
[00:16:54] going all the way back to the source data that you're working with. So for me, working as a data scientist, I was just sort of constantly confronted with this natural way of thinking about what I was doing, which is basically producing more and more refined versions of the data that I had access to. And I wanted an orchestration tool that, when I was actually scheduling things and building data pipelines, would allow me to think in those terms.
[00:17:19] Dr Genevieve Hayes: When you express it like that, I can see how that's the way I work as a data scientist. I just never realized that that was how I worked. I can't envisage how that translates to a tool. And I can see this is probably how Dagster ended up
[00:17:36] diverging into that task-oriented way, because presumably it is a lot easier to produce a tool that's based on: here's some functionality that can perform this task, and here's some functionality that can perform this task, and you can put them together to do whatever you want. What does it look like when you've got that asset-oriented vision?
[00:18:01] Sandy Ryza: Sure, sure. And first, I want to preface this and say that, when we talk about tasks, we're talking about computation, you know, stuff happening. So Dagster obviously has to model the fact that computation is happening and tasks are happening, but the way it does it is all in terms of data sets.
[00:18:19] To answer your question directly, though, it basically looks like a network of boxes, with arrows pointing between those boxes, and each box is a data set. So you can imagine at the beginning is your source data, and then there's an arrow pointing from your source data set to every downstream table
[00:18:38] that is produced using that data set. So maybe you produce it using a select query, like select star from my raw users table, and you get an arrow between those two. And it looks like this big graph of connections. Sometimes data catalogs will present kind of a similar graph.
[00:18:58] What's kind of cool and fundamentally unique about Dagster is you can go and click on any of those nodes in that graph and say: materialize this asset. And what that'll do is it'll actually run the computation to recreate that table from the upstream tables that it's based on. So you might click the users table and say, regenerate the users table.
[00:19:20] And it'll regenerate it from the upstream tables, the raw data that the users table is meant to be regenerated from. And you can click whole sections of the graph and regenerate them, and Dagster will handle running everything in the right order.
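To make that picture concrete, here is a minimal sketch of what that raw-users-to-users arrow can look like in Dagster's Python API. The column name, the pandas filtering logic, and the in-memory source data are illustrative assumptions rather than anything from the episode:

```python
import pandas as pd
from dagster import asset


@asset
def raw_users() -> pd.DataFrame:
    # Source data; in practice this would be loaded from a production system.
    return pd.DataFrame({"user_id": [1, 2, 3], "login_count": [5, 0, 2]})


@asset
def users(raw_users: pd.DataFrame) -> pd.DataFrame:
    # Roughly "select * from raw_users where login_count > 0", expressed in pandas.
    # Naming the parameter after the upstream asset is what draws the arrow
    # from raw_users to users in Dagster's asset graph.
    return raw_users[raw_users["login_count"] > 0]
```

In the web UI these show up as two nodes joined by an arrow, and materializing a selection of the graph runs the selected computations in dependency order, which is the behavior Sandy describes.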
[00:19:33] Dr Genevieve Hayes: Okay, so what I'm imagining from what you're saying is that the assets are what the end user would see when they're seeing this graph on the screen, as opposed to a chain of tasks that I've seen in some other products, where they would be used for producing things. I'm thinking of SAS Enterprise Guide from many years ago, when it had that sort of format.
[00:20:02] Sandy Ryza: Yeah, that's exactly right. And even though I use tables in my example, assets could be machine learning models, they could be reports. So that makes it different from tools like dbt, which are a little bit similar to Dagster but think primarily in terms of tables.
[00:20:16] Dr Genevieve Hayes: And the people who are creating this still have to create the tasks, but what's displayed is the assets rather than the tasks, because the assets are what's important. Is that right?
[00:20:26] Sandy Ryza: That's exactly right. Yeah,
[00:20:27] Dr Genevieve Hayes: Yeah. Because there's no way you can completely remove the tasks, otherwise the platform wouldn't know what you were doing.
[00:20:34] Sandy Ryza: That's right. Yeah. You still ultimately have to define how the users table is derived from the raw users table, like all the logic that says filter out these bad users, et cetera. But the way it's represented is all in terms of the assets that it produces, compared to other tools that kind of don't have that awareness and thus offer this much more fractured view.
[00:20:53] Dr Genevieve Hayes: Okay. So who's the target user for Dagster? Is it data engineers or is it data scientists?
[00:21:01] Sandy Ryza: Yeah, it's an interesting question. This is something we talk about a lot. The way that I kind of think about it in my mind is that there's this spectrum, and at one side of the spectrum is someone who's kind of a full-on platform engineer. So they don't need to understand the domain of the business at all.
[00:21:19] Their entire job is to sort of set up technology that empowers other people to do their jobs. And then at the other end of the spectrum, you have someone who's maybe a pure analyst. And the main thing that they're comfortable with doing is, going into a business intelligence tool and let's say typing out a SQL query to understand something.
[00:21:38] And then in the middle, you have the spectrum of roles where people are sort of thinking more about technology versus more about the data domain. So, like, if you go one step towards the middle from the data platform engineer, you get a data engineer, and a data engineer is maybe doing a mix of sort of pure technology tasks, but they're also thinking about the data a little bit.
[00:21:59] They're more responsible for producing and maintaining particular, important data sets. Then you move a little bit over and you have a data scientist, and a data scientist is also often sort of called on to do some technical work and have technical skills, but maybe the majority of where their head is at is thinking more about the data itself, statistics on the data,
[00:22:19] and so on and so forth. And I think Dagster kind of caters to a few different personas. So if you're kind of like a pure analyst, you probably would not use Dagster, because someone else would be using Dagster for you. But if you're a data platform engineer, a data engineer, or a data scientist, you would be using Dagster, because your job involves some degree of thinking about data pipelines and thinking about producing durable data assets.
[00:22:46] Dr Genevieve Hayes: So as a data scientist, suppose I was working for an organization that was using Dagster. How would I use it? Would I be setting things up in Dagster, or would I just be using a graph that a data engineer or a platform engineer had set up, and just use it to regenerate assets, for example?
[00:23:06] Sandy Ryza: Yeah. And so do you mind if I throw half of that question back at you? And so you're this data scientist, what exactly is your particular job in this organization? So, are you producing a machine learning model? Are you answering business questions?
[00:23:21] Is part of your job producing data sets that you might use in the future to answer questions?
[00:23:27] Dr Genevieve Hayes: I would imagine that it would be all of them, but the focus of the jobs I've tended to have has been more in answering business questions. So let's go with that.
[00:23:36] Sandy Ryza: Cool. Yeah. So, let's say your job is to answer business questions, but often in the process of doing that, you want to build tables that will make it easier for you to answer similar business questions in the future. So in that case, you often would be using Dagster.
[00:23:55] What you'd be doing is writing either Python code or SQL code (Dagster's API is primarily a Python API) which would define how a particular table is derived, and then basically informing Dagster about that code. So you'd probably be committing it to some sort of Git repository that Dagster has an awareness of, and then once it's checked in, Dagster will be aware of it. Or you might be developing with Dagster on your laptop.
[00:24:21] And you might be editing Python code, and that Python code that you edit changes what shows up inside Dagster.
[00:24:27] Dr Genevieve Hayes: Okay. That's interesting. And I can see there's clearly a strong influence of your data science background on the creation of Dagster. So you're using Python, you've got the Git integration, and I would imagine that these are some things that other data orchestration platforms might not necessarily think to include if their developers weren't data scientists.
[00:24:53] Is that right? Or is everyone just using Python these days?
[00:24:56] Sandy Ryza: That's totally right. Another big piece of it that I didn't get a chance to talk about already: half of what makes Dagster especially useful for data scientists, and kind of unique in that way, is this asset-oriented thing that I talked a lot about. The other one is being able to run in a really lightweight way.
[00:25:14] So data scientists have these very experiment-based workflows. They don't want to be spending their time using heavyweight Python frameworks where they have to type a bunch of boilerplate to get anything done. And they also don't want to be spending much of their time dealing with heavy infrastructure where they have to
[00:25:30] set up a bunch of long-running services or connect to external services to do basic data processing tasks. And so a big part of the way that we've designed Dagster is to be very lightweight. So all you have to do is type some Python code into a file. Usually you're just writing a basic Python function and then adding a decorator to that function to say: this is the table that this function produces, and here are the other tables that it depends on.
[00:26:00] It's very lightweight, and then you're basically able to just type a single command and get a little web UI and interact with your asset graph and your computations in that way. And yeah, I think that was very directly informed by the frustrations of myself and others in dealing with orchestrators like Airflow, where just doing basic local development tasks felt impossibly heavyweight.
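For a rough sense of how little boilerplate that involves, here is a sketch of a complete single-file Dagster project. The asset names and data are hypothetical, and it assumes the `dagster dev` command that recent versions provide for local development:

```python
# defs.py: a complete, minimal Dagster project in a single file
import pandas as pd
from dagster import Definitions, asset


@asset
def website_logins() -> pd.DataFrame:
    # Hypothetical source data; in practice this might be read from a warehouse.
    return pd.DataFrame({"user_id": [1, 1, 2], "day": ["mon", "tue", "mon"]})


@asset
def weekly_active_users(website_logins: pd.DataFrame) -> pd.DataFrame:
    # Derived table: one row per user who logged in during the week.
    return website_logins[["user_id"]].drop_duplicates()


# Point Dagster at the assets defined in this file.
defs = Definitions(assets=[website_logins, weekly_active_users])
```

Running something like `dagster dev -f defs.py` from a terminal is the single command Sandy mentions: it starts the local web UI, where these two assets appear as a small graph that can be materialized and inspected.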
[00:26:22] And I just didn't find myself wanting to deal with those technologies.
[00:26:25] Dr Genevieve Hayes: Yeah, I remember in a previous job that I had, we were working with a platform, not an orchestration platform exactly, though I guess it would have been a data orchestration platform, that was feeding the data into a particular tool. And it was developed by software engineers who were used to working in Java.
[00:26:45] So the whole thing was in Java, and you could write functions that would create particular things, but because they were used to working in Java, it supported Java. And if you wanted to use Python, you had to use Jython. And have you ever tried to write in Jython?
[00:27:03] Sandy Ryza: I've heard of Jython. I've never touched it. Seems terrifying.
[00:27:06] Dr Genevieve Hayes: It's awful. It's like Python, but it doesn't work half the time, because every so often there's random Java put in there, and there is no documentation for it. I remember I had this deadline that I had to meet, and it was awful.
[00:27:25] Sandy Ryza: I'm sorry you had to go through that. That sounds terrible.
[00:27:28] Dr Genevieve Hayes: It was a learning experience. It taught me what not to do. So yeah, I can see that Dagster has learned this and is not doing this. Now Python's basically the lingua...
[00:27:39] Sandy Ryza: Yeah. I think the existence of Python has just been a major major step forward in the ability of engineers and data scientists to Work and think in the same language.
[00:27:51] Dr Genevieve Hayes: ...franca of data teams now.
[00:27:55] Sandy Ryza: That's right. That's right.
[00:27:57] Dr Genevieve Hayes: Yeah. Are there any other special features of Dagster that are specifically included because of your background in data science or to assist data scientists in performing their work?
[00:28:10] Sandy Ryza: So the big two are this asset orientation and this piece about lightweight development. There's a bunch of other smaller things that I think have really come through as well. One piece is this metadata system that Dagster has. So, I talked a lot about this notion of wanting to have context. Whenever you run a
[00:28:33] computation to, let's say, train a machine learning model or develop some data set, there's often stuff you want to know about what went on. So when you're dealing with machine learning, for example, often you want to know various evaluation metrics about the model. There's all sorts of numbers
[00:28:47] that are just useful to understand, both the parameters that went into training the model as well as statistics about the model that was produced. And so Dagster focuses a lot on being able to record this data in a structured way that makes it easy, when you're trying to understand an asset that you come across in the wild or the results of a past computation, to figure out what the deal with it is.
[00:29:12] So you run a job, or click on your machine learning model and execute it inside Dagster, so it gets materialized, refreshed. And then someone else comes to it later and is curious: what parameters were used to train this model? Or what's the area under the ROC curve?
[00:29:28] And that data is just there and available in Dagster. So it's really easy to understand context about your data assets.
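Here is a hedged sketch of what recording that context can look like: a model asset that returns an `Output` carrying a metadata dictionary, so the metrics show up alongside the materialization in the UI. It assumes an upstream `users` table asset like the one sketched earlier, and the model, parameters, and metric value are invented stand-ins for illustration:

```python
import pandas as pd
from dagster import Output, asset


@asset
def churn_model(users: pd.DataFrame):
    # Stand-in for a real training step: the "model" is just a dict of
    # coefficients, and the metric is hard-coded purely for illustration.
    params = {"regularization": 0.1}
    model = {"weights": [0.2, -0.5], "params": params}
    auc = 0.87  # in a real pipeline, computed on a held-out evaluation set

    return Output(
        model,
        metadata={
            "roc_auc": auc,                   # evaluation metric, visible on the asset's page
            "training_params": str(params),   # the parameters that went into training
            "num_training_rows": len(users),  # size of the input data
        },
    )
```

Anyone who later clicks on `churn_model` in Dagster can then see the area under the ROC curve and the training parameters recorded against each past materialization, which is the kind of question Sandy describes.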
[00:29:36] Dr Genevieve Hayes: When you say you could click on an ML model asset and refresh it, would that mean that it would retrain it using the latest data?
[00:29:45] Sandy Ryza: That's right. Exactly.
[00:29:46] Dr Genevieve Hayes: That would be really useful. So you could basically have continual training using Dagster.
[00:29:52] Sandy Ryza: That's right. So, I've talked a lot about this sort of clicking and stuff. A huge part of any orchestrator, including Dagster, is scheduling and basically automatically running stuff. And so the idea is that once you have your graph of data assets, like this big network of machine learning models and tables, you can automatically refresh stuff inside that network.
[00:30:13] So you can set things on schedules so that, once a day, you'll refresh everything. Or you can be even more reactive and say: every time this table gets changed, automatically retrain the machine learning model that depends on it.
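Both styles can be sketched roughly as follows, reusing a hypothetical `users` table and `churn_model` asset. The specific names (`define_asset_job`, `ScheduleDefinition`, `AutoMaterializePolicy.eager`) reflect recent Dagster versions and may be spelled differently in older or newer releases:

```python
import pandas as pd
from dagster import (
    AssetSelection,
    AutoMaterializePolicy,
    Definitions,
    ScheduleDefinition,
    asset,
    define_asset_job,
)


@asset
def users() -> pd.DataFrame:
    # Hypothetical upstream table.
    return pd.DataFrame({"user_id": [1, 2, 3]})


# Reactive style: eagerly retrain the model whenever `users` is re-materialized.
@asset(auto_materialize_policy=AutoMaterializePolicy.eager())
def churn_model(users: pd.DataFrame) -> dict:
    return {"trained_on_rows": len(users)}  # stand-in for a real trained model


# Scheduled style: refresh every asset in the graph once a day, at midnight.
daily_refresh = ScheduleDefinition(
    job=define_asset_job("daily_refresh", selection=AssetSelection.all()),
    cron_schedule="0 0 * * *",
)

defs = Definitions(assets=[users, churn_model], schedules=[daily_refresh])
```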
[00:30:26] Dr Genevieve Hayes: Could you then build into it alerts if there are problems that are occurring like data drift or something that's causing the performance of that ML model to decline?
[00:30:36] Sandy Ryza: That's right. Yeah, exactly. That's a big part of the way that people use Dagster. Something that we introduced recently, that I spent a bunch of time on, is this feature called asset checks. And the idea is that every data asset, whether it's a table or a machine learning model, can have a set of properties that you expect it to hold to.
[00:30:57] So maybe you always expect your table to have more than 20 rows, because if it has less than that, something's definitely wrong. Or you always expect the performance of your machine learning model to be kind of in line with recent performance. So you can write kind of arbitrary little Python functions that verify some property of your asset.
[00:31:17] And if those fail, Dagster will tell you about it. There's all sorts of options. You can make it actually stop the bad data from propagating through the rest of the pipeline, or you can just get a little alert about it.
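Here is a minimal sketch of that row-count check using Dagster's asset check API, with a hypothetical `users` asset; the 20-row threshold mirrors Sandy's example:

```python
import pandas as pd
from dagster import AssetCheckResult, asset, asset_check


@asset
def users() -> pd.DataFrame:
    # Hypothetical table that the check will run against.
    return pd.DataFrame({"user_id": range(25)})


@asset_check(asset=users)
def users_has_enough_rows(users: pd.DataFrame) -> AssetCheckResult:
    # Fewer than 20 rows means something upstream is definitely wrong.
    row_count = len(users)
    return AssetCheckResult(
        passed=row_count > 20,
        metadata={"row_count": row_count},
    )
```

Recent versions also allow a check to be marked as blocking, which would correspond to the option Sandy mentions of stopping bad data from propagating to downstream assets in the same run.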
[00:31:28] Dr Genevieve Hayes: So this is very much an orchestration tool that's targeted at data scientists.
[00:31:34] Sandy Ryza: I think so.
[00:31:35] Dr Genevieve Hayes: Yeah. I've had limited experience with data orchestration tools, but what you're describing here seems quite unique. Is this unique, or are these things you'd expect to find in all orchestration platforms?
[00:31:50] Sandy Ryza: I think it's unique. And I was honestly a little surprised by that. One thing I sort of discovered in the earlier stages of, both being a data scientist who needed orchestration and also, eventually working on Dagster is so many organizations end up building internal tools that do these functions. So you go to like any sophisticated tech organization and they'll be using, some orchestrator, but then they'll have like a whole team that is building a framework for data quality on top of their orchestrator.
[00:32:18] And they'll have a whole team that basically has built what we would call an asset layer on top of their orchestrator. And there's all these frictions between these different layers, right? So the orchestrator will think about things in one way, the asset layer will think about things another way.
[00:32:32] Like, people are clearly thinking about things in these terms. So it was extremely surprising to me that this didn't exist when I was originally working on it, because from my perspective, both having worked on data and having seen the way other people worked on data,
[00:32:47] it was the obvious way to think about things. But by and large, other orchestrators do not look at the world in this way.
[00:32:53] Dr Genevieve Hayes: I think part of the problem, cause I'm thinking back to that orchestration tool that I used, it's the one where I had to program in Jython and with that, I was told that it was basically built on top of Apache Camel. So I suspect a lot of these other orchestration tools are built on top of open source tools that don't deal with these things.
[00:33:15] And because all of them built them on top of the same open source tools, you just keep perpetuating the same problems.
[00:33:22] Sandy Ryza: Yeah, I think it maybe goes back a little bit to your question about, sort of, why not have the engineers build the tool? And I think part of it is that engineers love to generalize. And I personally understand this perspective, as an engineer. So if an engineer is asked to build an orchestration tool, they'll think, okay, what's the most generic orchestration tool that I can build that will satisfy the largest number of use cases?
[00:33:45] And when you think in those terms, tasks are great, because you can use tasks for everything. And so I think one of the things that was helpful about being a data scientist was this more myopic worldview where I was like, I don't care about all the other stuff people want to do with tasks.
[00:34:01] Like, I'm trying to get my job done. I want to build a tool that will work well for my job. And I think limiting Dagster to really think about data pipelines, as opposed to just generic workflows, has allowed us to build capabilities that make it very specialized for that purpose.
[00:34:20] Dr Genevieve Hayes: Yeah. So I can see how your data science background has influenced you in your engineering job. Is there anything you've learned from your time working as an engineer that, were you to go back to data science, would make you a better data scientist?
[00:34:38] Sandy Ryza: That's a great question. Thinking about the things that I've learned in my job as an engineer at Dagster Labs, this in a way will be slightly contrary to the last point that I made. It's easy to want to build sort of the most purpose-built tool, the thing that will just totally scream and be the best for a particular use case.
[00:34:59] And that's kind of a luxury that you often have when you're building tooling inside of a particular organization. But working on more general-purpose software has kind of taught me that you always have to give users an escape hatch. You can't anticipate, because it's sort of an unknown-unknowns situation,
[00:35:15] you can't anticipate everything that everyone is going to want to do with your technology. And so you have to develop a little bit of humility that says, even if I can't totally understand why someone would ever want to be able to do X, you often have to make the tool a little bit more general than you think it should be, to accommodate those use cases that are eventually going to come up.
[00:35:39] Dr Genevieve Hayes: And I think there's probably some analogy to be drawn there, as a data scientist building dashboards, building machine learning models, answering questions: to think a little bit more generally about the way that what you're producing will be used or misused, relative to the way that you envision it being used. So if you did go back to data science, how would you do things differently from your last stint as a data scientist, because of your experiences with Dagster?
[00:36:08] Sandy Ryza: Yeah, I would think a little bit more intentionally about the platform that I was implicitly building as I did data science. So as a data scientist, you're often making choices about technology, even if that's not your primary job. A default role that you end up with is: what technology do I use?
[00:36:26] What technology do I use to answer a question? What technology do I use to build a machine learning model? But also, what technology do I use to communicate? And I think the lens that I would probably bring back is thinking more intentionally about the
[00:36:41] data platform that I'm implicitly building with these technology choices. So, if I choose to use notebooks, or use a particular notebooking technology for a particular part of my workflow, where is that going to end up? If I keep doing this for the next six months, am I going to have a wild proliferation of notebooks where no one can ever figure out what the right one is, or find the one that answered a particular question? Or am I going to find some sort of way to organize it all, and have a naming convention or organization scheme for getting my notebooks in order?
[00:37:15] So I think just kind of thinking ahead and trying to make choices that will reduce chaos in the long run and make it easier to understand what's been going on in my data platform.
[00:37:29] Dr Genevieve Hayes: And what final advice would you give to data scientists looking to create business value from data?
[00:37:36] Sandy Ryza: Yeah, I mean, I think it's going to echo the last thing that I said a little bit. Maybe I'll try to say it in a slightly clearer way, which is: think a little bit like a data platform engineer. Try to work in a way that is going to make the future version of yourself, and other people at your organization, more productive.
[00:37:56] And sometimes that means doing a little bit extra to accomplish a task than you might have otherwise done: adding a little bit of documentation, or spending a few more minutes trying to see if there's a table out there that already does some part of what you're trying to do.
[00:38:10] And then talking to the person who developed that table and modifying it to do exactly what you want to do, instead of building your own by just copying and pasting their SQL query and doing your own version of it. So make choices that lead to order in the long run, and that will make your job, and the jobs of the people you work with, a lot more serene.
[00:38:29] Dr Genevieve Hayes: So don't create technical debt.
[00:38:31] Sandy Ryza: Maybe that's the simplest way of saying it.
[00:38:34] Dr Genevieve Hayes: For listeners who want to learn more about you or get in contact, what can they do?
[00:38:39] Sandy Ryza: Yeah. You can find me on Twitter at S underscore R Y Z, or just find me on LinkedIn, Sandy Ryza. And if you want to check out Dagster, it's at dagster.io.
[00:38:50] Dr Genevieve Hayes: Thank you for joining me today, Sandy.
[00:38:53] Sandy Ryza: Thanks so much for having me on Genevieve. I really enjoyed this conversation.
[00:38:55] Dr Genevieve Hayes: And for those in the audience, thank you for listening. I'm Dr. Genevieve Hayes, and this has been Value Driven Data Science brought to you by Genevieve Hayes Consulting.