Engineers of Scale


Hello everyone, welcome to the Engineers of Scale podcast. In this podcast, we go back in time and give you an insider’s view on the projects that have completely transformed the infrastructure industry. Most importantly, we celebrate the heroes who created and led those projects. sudipchakrabarti.substack.com

Data Engineering: The Past, Present and Future with Joseph Hellerstein

In our Engineers of Scale podcast, we relive and celebrate the pivotal projects in Enterprise IT Software that have changed the course of the industry. We interview the engineering “heroes” who had led those projects to tell us the insider’s story. For each such project, we go back in time and do a deep-dive into the project - historical context, technical breakthroughs, team, successes and learnings - to help the next generation of engineers learn from those transformational projects. We kicked off our first “season” with the topic of Data Engineering, covering the projects that defined and shaped the data infrastructure industry. In our previous episodes, we have hosted Doug Cutting [https://www.linkedin.com/in/cutting/] and Mike Cafarella [https://www.linkedin.com/in/mikecafarella/] for a fascinating look back [https://sudipchakrabarti.substack.com/p/when-hadoop-was-king-and-yahoo-was] on Hadoop, Reynold Xin [https://www.linkedin.com/in/rxin/], co-creator of Apache Spark [https://en.wikipedia.org/wiki/Apache_Spark] and co-founder of Databricks [https://www.databricks.com/] for a technical deep-dive [https://sudipchakrabarti.substack.com/p/from-spark-to-databricks-sparks-origins] into Spark, Ryan Blue [https://www.linkedin.com/in/rdblue?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAAAzlA6sBKpw5AAsa7SgDV425Ay1w6My0b4U&lipi=urn%3Ali%3Apage%3Ad_flagship3_search_srp_all%3BohWsnzB2Rw2ipZdP53XWlg%3D%3D], creator of Apache Iceberg [https://github.com/apache/iceberg] on the technical breakthroughs [https://sudipchakrabarti.substack.com/p/iceberg-the-open-table-format-for] that made Iceberg possible, and Stephan Ewen [https://www.linkedin.com/in/stephanewen/?originalSubdomain=de], creator of Apache Flink [https://en.wikipedia.org/wiki/Apache_Flink]. In this episode, we host Joseph Hellerstein [https://www.linkedin.com/in/joehellerstein/], Professor at UC Berkeley and founder of Trifacta and RunLLM [https://runllm.com/]. 
Joe helps us step back and explore the evolution of Data Engineering over the past several decades while also discussing the future innovations on the horizon.

Show Notes

Timestamps
* [00:00:01] Introduction and Joe’s background
* [00:01:38] What got Joe interested in data engineering
* [00:03:59] Defining data engineering and its key components
* [00:05:16] Significant trends and changes fueling data engineering over the last 20 years
* [00:06:30] Key components of data engineering and the role of each
* [00:08:07] Contrasting modern data stack with traditional data stack
* [00:12:10] Developers vs. data engineers/analysts in building data pipelines
* [00:14:12] Role of AI and LLMs in data preparation and cleaning
* [00:16:51] Journey from data warehouses to data lakes to data lakehouses
* [00:21:14] Role of data catalogs in data engineering going forward
* [00:32:57] Unified data platforms vs. best-of-breed tools for data engineering
* [00:37:03] Possibility of one system serving both OLTP and OLAP use cases
* [00:40:46] Impact of AI on the data stack and data engineering
* [00:43:23] Interesting future research directions in data engineering
* [00:46:02] Lightning round: Acceleration, unsolved questions, key message

Transcript

Sudip [00:00:01]: Welcome to the Engineers of Scale podcast. I am Sudip Chakrabarti, your host. In this podcast, we interview legendary engineers who have created infrastructure software that has completely transformed and shaped the industry. We celebrate those heroes and their work and give you an insider's view of the technical breakthroughs and learnings they have had building such pioneering products. So today I have the great pleasure of hosting Joe Hellerstein, professor at UC Berkeley and founder of Trifacta and Aqueduct. Joe, welcome to the Engineers of Scale podcast. Joe [00:00:36]: Thanks, it's fun to be here. Sudip [00:00:37]: Thank you so much. 
So I'm not going to walk our listeners through an overview of your background because you truly need no introduction. When it comes to innovation in the field of data engineering, there are really very few people who even come close to what you have done. I will, however, mention a fun fact that I learned recently, even though I think I've known you for many years, and that is your interest in music. Not only are you a musician on the side, you actually even minored in music during your PhD at Wisconsin. So how do you balance all of your research and startup work with your interest in music? Joe [00:01:12]: Well, I'm a believer that you should have a rich life and that people who spend 24-7 on computing are spending a little too much time maybe. So I enjoy computing and data engineering and all that good stuff, and I love to geek out about those things, but it's one of a bunch of things I value in life, family, hobbies, and so on. And I'm sure a lot of your listeners are the same. And anyone who tells you that you have to do something 24-7 to excel, I think is telling you a lie. Sudip [00:01:38]: So then what got you interested in data engineering as an area of research in the first place? Joe [00:01:44]: Yeah, well, my background, going back to my training after college, was in database systems. And my first job right out of college was at IBM Research, which was the founding lab out in San Jose that built the first relational databases. And a bunch of those people were still there. So I was really brought up by some of the founders of the field of database systems. After that, I went to Berkeley and then Wisconsin for my schooling, which was more of the community of the folks who really pioneered the database system space. So I'm an old hand, even though I'm not as old as most of those people by about a generation, I still feel like I'm an old database hand. I come from that lineage. 
And what's happened over my career since the, you know, I got my PhD in the mid-90s, is that the process of managing data and the computation that goes around it has become more and more central to all of computing and the way it projects on the real world. So your listeners know better than any, probably, that really, we shouldn't talk about computer science. We should talk about data science, data engineering, because without data, computing is kind of meaningless. And this is a truth that emerged, you know, in the last 20 years, really. But it was one that the database people were working on well before that. And I feel kind of blessed to have been born into that community because the relevance of data engineering to all things computation and therefore much of society is so apparent today. Sudip [00:03:01]: I would say that both the schools that you have been involved with, Wisconsin, and of course, UC Berkeley now, I think, have had tremendous quality of work coming out in database and data systems and data engineering together. So it just, you know, has had such a big history. It's an awesome tradition. Joe [00:03:19]: I mean, when I was coming up, there were three places to do real work in data systems. It was IBM, Berkeley, and Wisconsin. And I had the fortune, and to some degree, I took measures to interact with all those people when I was very young, straight out of college. And they were really the center of all activity because a lot of academic computer science at that time didn't get it. MIT, when I interviewed in the mid-90s, hadn't had a person on the faculty doing databases for over a decade. And it was very clear when I got there that they did not think it was an intellectual activity. They thought it was something that businesses do. And that's all changed radically in the course of my academic career. Now, data-driven computing is all of computing, really. 
Sudip [00:03:59]: Taking a step back, if you were to describe to someone what data engineering is, and I know you teach a very popular course on that at Berkeley, too, how do you describe it? Joe [00:04:12]: Yeah, it's been tricky, actually, because I think we're in a time of transition. And so you have to talk about things relative to where things are right now. So the way I talk about it with people is they understood that there was a shift from traditional computer science to what was being called data science over the last, say, decade, where clearly data had to be at the center of things, or at least some things. But what happened in the data science programs that evolved is they were largely developed as sort of statistics meets algorithms. And that left out all the engineering concerns of how do you build systems around these foundations, particularly systems that drive large amounts of data? Because the statisticians traditionally didn't use large amounts of data. Incredible what's achievable with statistics with very small amounts of data, of course. And so that's what I talk about. It's like, well, how do we systematize and build engineered artifacts that reflect the importance of data and computation together? Sudip [00:05:02]: And looking back in the last 20 years or so, since you started working at Berkeley and obviously started your two companies, are there certain significant trends or changes that have really fueled data engineering? Joe [00:05:16]: Yeah, I mean, you know, there's a long enough scope that you have to include the existence of the World Wide Web as one of them. So, you know, you go back to the 90s and data was all about relational databases because that was the data that was entered into computers. And the web changed all that. Now there's all sorts of nonsense that you could harvest. And I remember joking in the early aughts, maybe late 1990s, my buddy was saying, my gas station just got on the internet. 
Goodness knows why I would ever want to run a query against my gas station, right? But nowadays we realize that all that sort of light recording of ambient stuff and people's thoughts and ideas and conversations is highly valuable. That was not at all clear when the web started out. You know, web search is like, well, I might want to find some stuff. Most of it's irrelevant to me, but I want to find a few things, right? That was web search. But what you see, you know, if LLMs are a compression of the web, what we're seeing today is having a compressed version of everything anybody's ever said is outrageously powerful, even if the technology is pretty simple. So the rise of kind of ambient human information, something I did not anticipate whatsoever. Sudip [00:06:21]: Got it. Today, as we know data engineering, what would you say are the key components of data engineering and kind of what's the role of each? Joe [00:06:30]: You know, we often talk about pipelines, right? And I think it's not a bad way to think is to kind of start down the pipeline, look at what feeds what. Where does the data get generated? How does it get ingested? What data didn't I measure at all? Actually, we start there. The statisticians always start there. There's a universe. Things are happening. Some of it gets measured. That's called sampling. That's the start of any pipeline is there's phenomena out there that we could record. We choose to record some of them. And then, of course, there's the pipelines we think of from ingest to processing to feedback loops that happen, right? When you're learning from outputs and how people react to them. So thinking about the long-term pipelines all the way from what did we measure to how did people react to it in apps? And then we measured the apps and we closed the loop. That's modern data engineering. Unfortunately, it's too big for any one organization to own, right? 
You go into any company and there isn't one org inside the company that owns that whole thing. And it's definitely too big for any one person's head. And so the other reality of data engineering, like a lot of real-world engineering, is teamwork. And it's cross-disciplinary. And it's a lot about people. Sudip [00:07:36]: Absolutely. Yeah, people, culture, team, all of that plays into the efficiency of a data engineering team, 100%. I think over the last few years, particularly with the last decade or so, we kind of transitioned from what used to be called traditional data stack, like much more on-prem, much more built around old generation technologies. And now we keep hearing about modern data stack, much more cloud-native and so on. In your view, what is modern data stack? What are the components? How do you contrast that with traditional data stack? Joe [00:08:07]: I have some opinions about the branding of all this. So modern data stack is a brand that was basically promoted by a couple of startups that were venture-backed. They tried to ascribe a particular meaning to that. And of course, the word modern is kind of hilarious because to me, it's kind of Mad Men 1950s modern furniture, right? But it is true that we do things differently today than we did in the 80s and 90s. So let me talk about it in those terms rather than trying to say it's two particular companies that tried to brand the modern data stack, okay? Because there's also by now enough blowback with that terminology that I just don't even want to walk down that path. Sure. Having said all that, when I was coming up, it was the beginning of the data warehouse movement. And that itself was a reaction to just having a database. So once upon a time, you had data, you put it in a database and that was it. And IBM would sell you a mainframe, right? You'd run the database. And then what happened was there were lots of databases and they were all over the place. 
And so there were many of these operational databases. The data warehousing movement came along and said, people want to see the big picture across all these databases. So we're going to have this ETL process, right? We're going to extract from all the operational stores, transform the data into a common format, load the warehouse. And that was a story that made many consultants very rich. And it also opened opportunities for some software vendors, Informatica, Teradata, right, that were tuned for that workload. And that was the status quo for about a decade. And then what happened was the world became too complicated for that as well. So the fiction that there'd be a single data warehouse that would really cover the business was a fiction all along, but it became a painfully untrue fiction sort of post the web, really. We had lots of data that really didn't want to go in the relational database at all. It wanted to be text search or it wanted to be image files in a file system, right? And then we saw the cracks emerge there in the data warehousing relational database movement. We heard NoSQL and there was a bunch of things like that. Where we are today, I think, is kind of the end of that road. And at the end of that road, we have a majority of data that is not rows and columns when it's born, that is highly valuable, that needs to be managed. So you can't just throw it in the file system because it's too important and people need to be able to version it, know where it came from, know its provenance, know how it's getting used and do governance on it, all the things that are managing the data. So the things that used to be easy in the database are really hard on this messy data, all this stuff about governance and organization. It's very disparate and it's all over. So what's inevitable in this setting is that we're going to have to kind of stitch together more stuff even than before. And that's where we get into kind of the state of things today. 
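The ETL flow Joe describes in this answer (extract from the operational stores, transform into a common format, load the warehouse) can be sketched in a few lines. This is only an illustration under stated assumptions: the two operational stores, their table and column names, and the `fact_money` warehouse table are all hypothetical, and in-memory SQLite stands in for real systems.

```python
import sqlite3


def make_operational_stores():
    """Create two hypothetical operational databases with mismatched schemas."""
    sales = sqlite3.connect(":memory:")
    sales.execute("CREATE TABLE orders (id INTEGER, amount_cents INTEGER, country TEXT)")
    sales.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                      [(1, 1250, "us"), (2, 990, "DE")])

    support = sqlite3.connect(":memory:")
    support.execute("CREATE TABLE refunds (order_ref TEXT, amount_dollars REAL, region TEXT)")
    support.executemany("INSERT INTO refunds VALUES (?, ?, ?)", [("1", 2.50, "US")])
    return sales, support


def extract(sales, support):
    """Extract: pull raw rows out of each operational store."""
    orders = sales.execute("SELECT id, amount_cents, country FROM orders").fetchall()
    refunds = support.execute(
        "SELECT order_ref, amount_dollars, region FROM refunds").fetchall()
    return orders, refunds


def transform(orders, refunds):
    """Transform: map both sources onto one format:
    (order_id, kind, amount_cents, country), with refunds as negative amounts."""
    rows = [(oid, "order", cents, country.upper()) for oid, cents, country in orders]
    rows += [(int(ref), "refund", -round(dollars * 100), region.upper())
             for ref, dollars, region in refunds]
    return rows


def load(warehouse, rows):
    """Load: write the unified rows into the warehouse table."""
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS fact_money "
        "(order_id INTEGER, kind TEXT, amount_cents INTEGER, country TEXT)")
    warehouse.executemany("INSERT INTO fact_money VALUES (?, ?, ?, ?)", rows)
    warehouse.commit()


def run_etl():
    sales, support = make_operational_stores()
    warehouse = sqlite3.connect(":memory:")
    load(warehouse, transform(*extract(sales, support)))
    return warehouse


if __name__ == "__main__":
    wh = run_etl()
    # The "big picture" query that no single operational store could answer.
    for row in wh.execute(
            "SELECT country, SUM(amount_cents) FROM fact_money "
            "GROUP BY country ORDER BY country"):
        print(row)
```

The point of the sketch is the shape of the work: each source keeps its own schema, and all of the agreement about units, keys, and casing is concentrated in the transform step.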
And there's names people like to use for it, like Lakehouse and so on. But that's a fine name actually, doesn't mean a whole lot, but neither did Warehouse so that's okay. But the bottom line is there's going to be rich data of many facets and there's going to be many uses for that data. So a big fan-in of lots of data sources from lots of places, a big fan-out of lots of use cases. We'd like management in the middle of that hourglass, but we're not going to be able to assume that the data is structured for that management. The data is probably going to retain its original format and then extracts of various kinds kind of go out, right? So that's kind of where we are. And I think, you know, if it was a data filtration system or a data hourglass or something that might be more helpful than a lakehouse, which to me is a cabin on the side of a lake. I don't know if that's a useful analogy. It's a cute neologism. It is this kind of problem that we're trying to manage. Sudip [00:11:24]: So I think, you know, it's like from data warehouses, which are primarily ETL. And when it came to data ingestion, we are now more in the ELT world, right? Which is data lakes. And then also, like there is a small movement around capital E, small T, and then LT (EtLT), which is like you extract, do some transform, load, and then transform again. So I just wanted to talk a little bit about the users and the builders of these data pipelines. And there are, you know, two primary kinds of personas: one is developers and the other is data engineers. In your view, do you see like developers, you know, kind of taking on more and more of this complex world of building data pipelines? Or do you see more like analysts and domain experts, you know, do things on their own using local tools, LLMs and so on? Joe [00:12:10]: I'm going to take a slightly different slice on it than you started with. 
Not because I quibble with it, but I want to make sure I'm kind of on the grounds that I want to talk about. So I think we can kind of split the world into people whose primary focus is computational. So think of data engineers, data scientists, IT professionals. Okay? That's not a cool name. So we try not to use that anymore. And then people whose job is fundamentally not about data or computing. Their job is something else. Like people in the line of business, people in science, people in what have you, journalists who use data. And you can think of them as consumers, if you will, whereas you think of the first body of people as maybe maintainers or something like that. They're the people who plumb the data. I think both constituencies are really important. I think they're both big markets for venture and for startups and for people who have technical skills to work at. But I do think Silicon Valley and academia both (because academia in computing is computer scientists like me) are very much more comfortable with the technical folks, with the developers. That's home base for us. And it's really fun to write software that we can use ourselves to make ourselves more productive. And so we do a lot of this dogfooding. Open source is another big kind of cultural contribution. You want to build a project to get your friends to use it. They like it. But it's, you know, we're kind of taking care of ourselves. There's this huge constituency of people who are going to get huge value out of data and they're the ones who understand how it's getting used. So they're actually the people where like the cha-ching happens. These are the people who are going to monetize the data. In the sciences, it wouldn't be money. It would be scientific innovation. In journalism, it would be the big story. Sudip [00:13:41]: They extract value from the data. Joe [00:13:44]: Exactly. Yeah. And those people arguably are more important. On my humble days, I try to admit that. 
So I think it is kind of two things. And I think we can ask a useful question, which is, will one set of technologies suit both of those constituencies? Maybe. I think that's a good conversation to do next. But I do feel like setting up that there's kind of broadly these two camps because two is easier to think about than 20. So let's not get too fine grained. How should we think about building and architecting data systems? And when I say systems, I do mean in the aggregate, like organizational systems to serve both those constituencies. And what kind of software do they need? I think it's a good, healthy frame for kind of the bigger picture. Sudip [00:14:26]: And do you see any convergence in terms of the kind of systems both constituencies might use? And does everyone become a Python programmer in some ways? Joe [00:14:36]: I'm so glad you brought that up. There's been a strong movement and I see this in the data science program at Berkeley as well. And we were a pioneering program in having a data science major. If you would just learn some Python, then you could do data science. And it just feels very backward. And I don't think that we should be expecting that all of the people who extract value from data will be programmers. Thinking programmatically, so having exposure to the notion of step-by-step problem breakdowns, instructing a computer to do things, all important, right? Do you have to learn a traditional programming language like Python to do that? I think not. I've felt this way for a long time, but I feel like LLMs are a really wonderful way to surface this to the general populace. If you are sufficiently disciplined step-by-step in telling the LLM what you want, it will probably understand you and probably help you get your job done, even if you're not really a programmer. 
And there's going to be other technologies, obviously, over time that'll be better than just a chat box on the internet for doing this, both user interfaces and models and programs. So yeah, I do feel that this idea that, oh, everybody will do everything in Python is super dev-centric. And that people like me and probably you, I don't want to put too much on you, and your listeners, that's what we're good at. But we have to have some empathy for the people who are the value extractors. And when I say empathy, I don't mean treating them like children. I mean giving them tools that make them super powerful. Right? Sudip [00:16:01]: Yeah, I think in particular over the last five to 10 years in general, the venture ecosystem which I can speak for has definitely probably over-rotated a bit more on the developer-centric experience of data engineering even at the cost of ignoring the actual constituents of users down the line like you were saying. I would kind of shift gears a little bit more on a core component of data engineering that you have done some incredible work on which is data preparation and cleaning. To this day, it's still like it continues to consume major time and resources. And you not only did a lot of work including your data wrangler paper and so on, but also went on to found one of the pioneering companies in this area, Trifacta. Looking back where we are now, do you feel we have managed to solve that problem of data preparation and cleaning yet? Joe [00:16:51]: Yeah, somebody pointed out that data transformation is kind of like cancer research. It's like a lifetime employment guarantee because you're going to help, right? And you may do brilliant, amazing things that help people's lives, but you're probably not going to solve it if you look forward with any sense of arrogance. There's lots to be done still, but there have been some good things that have been done that we can build on. 
Sudip [00:17:18]: And do you see like AI and LLMs kind of changing the game a fair amount going forward? Like any fundamental shift you see there coming? Joe [00:17:27]: This is great. So yes and no. And let me see if I can frame this up. When we started Trifacta, it was 2012. And the hypothesis in the research was that you want to build a feedback loop between the human and the computer. And the way it would work is this. The human would somehow guide the computer to, I want my data to look like this. And then the computer would say, well, here's a few things I could do to make your data more like this. What do you think about each of them? Pick the one you like best. I call this the guide-decide loop. So there's a human in the middle that's guiding the computer, and then the computer is making suggestions and the human's deciding which ones to use. And this was with a user interface that was visual. So I worked very closely with data visualization leaders like Jeff Heer, who's one of the fathers of D3.js and was a co-founder at Trifacta, and a joint student, to make sure that the visualization loop and the interaction part, the human part of that, were really powerful. So you could really see anomalies in the data, you could see examples of the data, and then you could interact by doing things like pointing at a bar chart and saying, this bar looks funny, or pointing at a cell in a spreadsheet and saying, what are the features that you give to the inference engine, to give to the algorithm to come up with suggestions on how to address those features? So a lot of our energy went in there. The AI that we advertised behind it was dead simple. It got a little better over the course of the company, over 10 years, our models got more sophisticated, but the user experience only got marginally better because the key issue was that interaction model. We built an interaction model around cleaning data, wrangling data. 
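The guide-decide loop described here can be illustrated with a toy example. This is not Trifacta's algorithm; it is a minimal sketch that assumes a tiny, hypothetical menu of string transforms (`CANDIDATES`) and a single before/after example as the user's "guide" input.

```python
# Hypothetical menu of transforms the system knows how to suggest.
CANDIDATES = {
    "strip whitespace": str.strip,
    "lowercase": str.lower,
    "uppercase": str.upper,
    "title case": str.title,
    "strip then title case": lambda s: s.strip().title(),
}


def suggest(example_in, example_out):
    """Guide: the user shows one before/after example; return every
    candidate transform that is consistent with that example."""
    return [name for name, fn in CANDIDATES.items() if fn(example_in) == example_out]


def preview(name, column, limit=3):
    """Show what a suggestion would do to real values, so the user can judge it."""
    fn = CANDIDATES[name]
    return [(value, fn(value)) for value in column[:limit]]


def apply_choice(name, column):
    """Decide: the user picks one suggestion and it is applied to the column."""
    return [CANDIDATES[name](value) for value in column]


column = ["  new york", "BOSTON ", "san francisco"]

# Guide: "I want 'san francisco' to look like 'San Francisco'."
suggestions = suggest("san francisco", "San Francisco")

# The system offers every consistent option with a preview on the data.
for name in suggestions:
    print(name, preview(name, column))

# Decide: the previews differ on the messy rows, so the human chooses.
cleaned = apply_choice("strip then title case", column)
print(cleaned)
```

Two suggestions fit the example equally well, and only the preview on other rows reveals that they behave differently on values with stray whitespace. That is the role of the human in the loop: the inference step proposes, the person decides.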
Now, we sold Trifacta some months before the launch of ChatGPT. And part of the deal was that I didn't go with the software to the acquirers. Sudip [00:19:13]: And this was Alteryx, the public company. Joe [00:19:15]: Alteryx, yes. So Alteryx acquired Trifacta. I will say that Google Cloud Dataprep is still the Trifacta product. It even says Trifacta on it. They haven't changed the branding. So both Google and Alteryx are using the tech directly. Better inference will make that product better, but it will not fundamentally change the hypothesis that started around this guide-decide loop and this idea that you have to give people the opportunity to decide if the outputs of their prompts are right. You know, that's the whole thing with ChatGPT. You ask it a question, it gives you an answer. It's the same story. So if you're asking it for code, and here I'll speak to the developers in the audience. You know, please write me code that will pivot this table and remove all the blanks. Okay, it'll spit out some Python code. How do you know it's the right Python code? Well, maybe you should run it on some sample data and see how it looks. Could you build a tool that would allow that iteration to go faster? You know, don't fill the blanks with zero. Fill the blanks by doing linear interpolation, right? Something like that. So you need this feedback loop and users need to be able to see or evaluate whether the suggestions that they're getting from the AI are right. That piece of the puzzle, the turn of the crank through the user is a big piece of it. So I guess in sum, user experience and a deep understanding of what you do when you're wrangling data so that you surface it to people and so that they can say to the computer what they mean, those are independent of how good the AI is. Similarly, the AI being really good doesn't remove the need for those experiences. So I would love to be working on this problem right now because you plug in GPT-4 into this. 
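The pivot-and-fill example Joe mentions is easy to make concrete. The sketch below uses only the Python standard library and hypothetical sample data; it pivots long-format records into a table, then runs two competing ways of filling the blanks (zeros versus linear interpolation) on the same sample column so the results can be compared side by side before choosing one.

```python
def pivot(records, row_key, col_key, value_key):
    """Turn long-format records into {row: {col: value}}; missing cells are None."""
    rows = sorted({r[row_key] for r in records})
    cols = sorted({r[col_key] for r in records})
    table = {row: {col: None for col in cols} for row in rows}
    for r in records:
        table[r[row_key]][r[col_key]] = r[value_key]
    return table


def fill_zero(values):
    """Candidate 1: replace every blank with zero."""
    return [0 if v is None else v for v in values]


def fill_linear(values):
    """Candidate 2: fill interior blanks by linear interpolation between the
    nearest known neighbours. Leading and trailing blanks stay as None."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


# Hypothetical sample: sensor "a" has no readings on days 2 and 3.
sample = [
    {"day": 1, "sensor": "a", "reading": 10.0},
    {"day": 4, "sensor": "a", "reading": 40.0},
    {"day": 1, "sensor": "b", "reading": 5.0},
    {"day": 2, "sensor": "b", "reading": 6.0},
    {"day": 3, "sensor": "b", "reading": 7.0},
    {"day": 4, "sensor": "b", "reading": 8.0},
]

table = pivot(sample, "day", "sensor", "reading")
column_a = [table[day]["a"] for day in sorted(table)]

# Run both candidates on the sample so a person can compare them.
candidates = {
    "fill_zero": fill_zero(column_a),
    "fill_linear": fill_linear(column_a),
}
for name, result in candidates.items():
    print(name, result)
```

Whether generated code chose zeros or interpolation is invisible until it is run on data, which is why running each candidate on a small sample is the cheap check described in the conversation.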
This is going to be way better than the inference we got. I think the qualitative user experience will be better in some cases by a lot and better in other cases only by a little. I mean, that depends on the task. But it is fun times, no doubt, for this technology. Sudip [00:20:58]: So some of the technical stack probably you could use some of the off-the-shelf models and so on. But what you're saying is the secret sauce around user experience, the real knotty problem is there. And that hasn't gone away or hasn't gotten any simpler. Joe [00:21:14]: Yeah, and I would say it's the same and your audience probably has hands-on experience by now with things like Copilot. If you embed Copilot in the IDE in a nice way as they've done with Visual Studio, it really helps the programmer quite a lot. But if you don't embed it well, for instance, if you ask it to write you an entire program instead of the next line it's going to do a rotten job because now you have to read that entire program, figure it out, etc. So this thoughtful combination of understand your domain, which in my case is data, understand what the technology can do pretty well, and then build the right feedback loop around that, that's going to be the game for a lot of products over the next few years around LLMs. And that is so true because we talk a lot about what is the moat for some of the companies that are using, obviously, LLMs and AI, and user experience comes up so frequently. If you get it right, it's, after all, a probabilistic way of thinking of the world, right? So if you do not get it right, it doesn't work. Sudip [00:21:53]: Couldn't agree more. I want to touch on a different topic, which is, you know, as I was doing my research on some of your work, I found this paper way back from 2005 that you wrote with Michael Stonebraker. And it was titled, What Goes Around Comes Around. 
I think, you know, the short summary was you guys kind of discussed this fascinating cycle of data models over the previous four decades and how data models have gone from complex to simple and then back again to complex. I'm curious a little bit, like, you know, now that, you know, it has been several years since your publication, where do you think we are in that data model cycle today? Joe [00:22:46]: Yeah, that's a fun topic for your listeners. If you don't know Mike Stonebraker, he was my master's advisor. He's one of the founders of relational databases. He won the Turing Award for that. He started his career in like 1970. He's 80 years old. He's still going. He's still like at every meeting running things. The guy's, he's amazing. And he's founded a number of influential companies. And of course, the Postgres project that many of us use that I was a grad student on. So Stonebraker's a legend. He's a super opinionated guy and he likes to kind of have his say. So that paper is written entirely by him. I don't disagree with what it says and we were putting a book together, but that was his chapter. And boy, is it him. So it's very black and white. And I see the world in shades of gray a little more than Mike does. But what I would say is the high level message of pretty much tabular simple data with a well understood query language that's pretty clean is going to always win over time over any custom complicated thing. I agree with that. Data is too important an asset to have behind fancy stuff. You want to have it behind relatively clean stuff. Having said that, as Stonebraker actually did in Postgres, you can put a lot of interesting data into Postgres and query it with pretty simple SQL. And it's not really flat tables anymore. It's something more than that. 
And, you know, if you go off and read database theory papers from the last five, ten years and you hang out with the right folks, you'll realize that generalizing the relational algebra to richer mathematical structures can give you actually more flexibility in this space than I think that paper accounts for. So I actually think over the next ten years we will see another generation of extensions to the relational model that will make it amenable to new data types in ways that we haven't seen before. So I can give you some concrete examples. Traditionally, like if you had something like an image in your database, it was just a blob, right? It's a binary large object. It's just an array of bytes, right? Unix-style stream of bytes. I think increasingly what we're going to see is, and Postgres actually has the infrastructure for this, it has since like 1989, but we're going to see this in the field. Point at any blob, you have a featurization function that pulls out a bunch of attributes for that blob. Those attributes, they're columns. So, you know, think about the image that a self-driving car takes at any given time. Bounding boxes around a bunch of regions in XY space, each of which may be tagged with a class or a list of possible classes. I think this is a pedestrian. I think this is a car. I think this is a stop sign. And then it's got a time ticker, and that's a row in a time series database. You know, if you want to start building queries over what happened at a busy intersection at a particular time, it's going to be a time series query over something that eventually looks like rows and columns, but it's actually video. And so we want to extend our tools and unify our tools, right? So the technologies like LLMs and image processing and so on are generating features that we can easily query. And we're going to need to build systems that are a little smarter for that than what we've got in Postgres and the like. 
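The blob-to-columns idea above can be sketched with ordinary tools. In the sketch below, `featurize` is a hypothetical stand-in for a real vision model (it just parses a toy text encoding so the example runs without one), the frame data is invented, and in-memory SQLite stands in for an extensible database such as Postgres. The blob stays in its original form while its extracted features become rows that plain SQL can query over time.

```python
import sqlite3


def featurize(frame_blob):
    """Hypothetical stand-in for an object-detection model.
    Returns (label, x, y, width, height) bounding boxes for one frame.
    A real system would run a trained model on image bytes here."""
    detections = []
    for item in frame_blob.decode().split(";"):
        label, x, y, w, h = item.split(",")
        detections.append((label, int(x), int(y), int(w), int(h)))
    return detections


def build_db(frames):
    """Store each raw frame as a BLOB, plus one row per detected object."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE frames (ts INTEGER PRIMARY KEY, image BLOB)")
    db.execute(
        "CREATE TABLE detections "
        "(ts INTEGER, label TEXT, x INTEGER, y INTEGER, w INTEGER, h INTEGER)")
    for ts, blob in frames:
        db.execute("INSERT INTO frames VALUES (?, ?)", (ts, blob))
        db.executemany(
            "INSERT INTO detections VALUES (?, ?, ?, ?, ?, ?)",
            [(ts, *box) for box in featurize(blob)])
    return db


# Invented frames from one camera at an intersection, keyed by a time tick.
frames = [
    (100, b"car,10,20,50,30;pedestrian,80,40,10,25"),
    (101, b"car,12,20,50,30"),
    (102, b"pedestrian,82,41,10,25;stop_sign,5,5,8,8"),
]
db = build_db(frames)

# A time-series question about video, asked in plain SQL over rows and columns.
pedestrians_over_time = db.execute(
    "SELECT ts, COUNT(*) FROM detections "
    "WHERE label = 'pedestrian' AND ts BETWEEN 100 AND 102 "
    "GROUP BY ts ORDER BY ts").fetchall()
print(pedestrians_over_time)
```

Here the features are materialized at load time for simplicity. A database with user-defined functions could instead call the featurizer inside the query, which is closer to what is described in the conversation.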
But I'm optimistic that the road between where we are with stuff like Postgres and its children and where we need to get to, that's a bridgeable gap. And so I think there's a nice opportunity here for a next generation of powerful analytic databases to be built that will be extensions of what we have today. And it's not really a circle, what goes around and comes around. It's a spiral, right? You're going forward as you're moving, right? And I think there is progress that's required and that's going to happen. Sudip [00:26:12]: That's a fascinating example because today if you have to kind of extract the features from a blob of image data, you have to build all these fancy pipelines and stitch together a number of different tools. If you could expose all of that through a simple SQL interface to someone who only knows SQL, now you're just empowering that person so much more. Joe [00:26:33]: And now, if you will, think about data governance in that world. That function you wrote that extracts all those bounding boxes and labels, that's a model. That model had training data. And there was this ML team that owns the training pipeline for that model and maintains it. Now we have governance questions that are like not just what data did you look at, but in your query, which functions did you use? Were any of those functions model-based? What was the data that trained those models? And so if you're doing something like the right to be forgotten in GDPR, I want anything you recorded of me on the street to be deleted from the database because I'm no longer using that insurance company, let's say. You can't just look at which queries touched it through SQL. You need to also look at which functions in your queries were model-based and trained on video that I was in. This is now across teams. It's across what we currently think of as totally different data pipelines. But this is the future of data governance and metadata management. 
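To make the governance question concrete, here is a small sketch of the lookup Joe describes: given a source that must be forgotten, find queries that read it directly and also queries that called a function backed by a model trained on it. The metadata dictionaries, model and function names, and the `QueryRecord` type are all hypothetical; a real deployment would pull this from a catalog or lineage store rather than from in-memory dictionaries.

```python
from dataclasses import dataclass, field

# Hypothetical metadata, normally kept in a catalog or lineage store.
# Which raw sources each model was trained on:
MODEL_TRAINING_SOURCES = {
    "bbox_model_v3": {"video_0042", "video_0107"},
}
# Which model (if any) backs each SQL-callable function:
FUNCTION_TO_MODEL = {
    "extract_bounding_boxes": "bbox_model_v3",
    "upper": None,  # plain function, no model behind it
}


@dataclass
class QueryRecord:
    name: str
    sources_read: set = field(default_factory=set)
    functions_used: set = field(default_factory=set)


def affected_queries(source_id, queries):
    """Return (query name, reason) pairs for queries tied to source_id."""
    hits = []
    for q in queries:
        if source_id in q.sources_read:
            hits.append((q.name, "read the source directly"))
            continue
        for fn in sorted(q.functions_used):
            model = FUNCTION_TO_MODEL.get(fn)
            if model and source_id in MODEL_TRAINING_SOURCES.get(model, set()):
                hits.append((q.name, f"used {fn}, backed by {model}"))
                break
    return hits


QUERY_LOG = [
    QueryRecord("q1", {"video_0042"}, set()),
    QueryRecord("q2", {"video_0900"}, {"extract_bounding_boxes"}),
    QueryRecord("q3", {"video_0900"}, {"upper"}),
]

for name, reason in affected_queries("video_0042", QUERY_LOG):
    print(name, "-", reason)
```

Query `q2` never touched the video itself, yet it is flagged because the function it called depends on a model trained on that video, which is the cross-team dependency the discussion points at.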
So it's got pretty big implications for how organizations are run. Sudip [00:27:32]: That is actually a really good segue into one of the things I wanted to ask you, which was the role of data catalogs. I know you're not a fan of the term, but in that kind of modern data stack, where do you think we are with data catalogs? I mean, some time ago, there were probably a lot of users that came out of the web-scale companies. There are, of course, companies like Alation and Collibra, who are further ahead in terms of commercialization. Do you see a role for data catalogs in data engineering going forward? What does that look like? Joe [00:28:02]: I think, inevitably, they have to exist in some form. And I saw this when we were selling Trifacta, and I saw this in the research we did in this space. And I did some of that research in collaboration with LinkedIn back at the time, and they were one of the first DataHub vendors we had. And I continue, actually, to advise Acryl. So I should just make a public statement about that. So they're one of the DataHub vendors. But, you know, if you have many data sources under different systems, some of which are proprietary, some of which are open source or different flavors of proprietary managed open source, they're not going to come with a common catalog. It's in no one's incentive, as a software vendor in the space, to build the common catalog. The pitch would be: if you're wrangling your data with Trifacta, then you'll catalog it with Trifacta. We'll own everyone's metadata. We'll be very powerful. Customers did not buy that. They knew that was a scam. So it's a lock-in scam. So a neutral data catalog is a reality, I think, for any large organization, even today, honestly, and going forward all the more so as data gets richer and systems proliferate. And it's a hard problem that merits full-time tech focus. So again, at Trifacta, we wanted to build a data catalog, but by golly, that's going to be a big lift. We were plenty busy building data wrangling tools. 
We were happy to partner with the likes of Collibra, Alation and so on because they were doing a good job. And it takes a whole team just to do that stuff. Sudip [00:29:26]: 100%. On a different note, you actually have had a ringside seat, and not only that, actually worked on several technologies in this whole movement that you were talking about a little bit earlier, which is that we went from data warehouses to, now, being kind of in the middle of getting to data lakes, and even the early days of data lakehouses. Any thoughts on what is fueling this? Like what are the trends behind this journey from warehouses to lakes to lakehouses? Joe [00:29:57]: Yeah. I mean, I think the easy answer is software logs were kind of the first high volume source of data that folks like us and the people on your podcast had to manage that just didn't really make sense to put in a relational database. It was too expensive. You know, they sort of had rows and columns, but they kind of didn't too because there's lots of text in those logs that you want to look at. I mean, they're not always structured the same way and so on. And so you saw the likes of Splunk and their following competitors carve out a very large niche on log processing that as an academic you're sort of like, yeah, it's kind of information retrieval, it's kind of databases. I don't see anything new here. La la la. Academic ivory tower stuff. But, you know, really important business problem with really good tech out in the field. That's the tip of the spear. A new data type that is just a little bit different. It's got a different cost structure and value structure and different queries. You don't want to put it in Teradata, right? And you don't want to put it in Amazon Aurora either, right? You kind of want to just leave it lie. Now, Splunk didn't work that way because it was early. So with Splunk, the thing that customers hated was it was so expensive to put your logs in Splunk. They were charging by the byte. 
Essentially, I heard complaints about that all day long, every day. Sudip [00:31:03]: And still do. I still do. Joe [00:31:05]: Yeah. Yeah, which is why it's kind of great they got bought by Cisco. I feel like it's old school pricing that'll last for a while. But realistically, that was the first use case. And what we're going to see now, because featurization with LLMs is so practical, you can really get structured data out of anything now. You can get structured data out of your web chats. You can get structured data out of your images and your security cameras. Very low budget sources of data are going to turn into columns. Is the customer happy or sad? Is this a complaint or praise? Which product are they complaining about? These are all things that you get out of a customer chat, right? Those all get to be columns now. And you're going to want to load them into something, some customer relationship management system, right? So there's going to be lots of modalities of data that we're going to extract structure out of, not just log files and traditional relational data. But I feel like log files were the first big use case and we're just going to see lots more. So an open question is, does a vendor like Databricks that wants to give you soup to nuts data lake house manage to give you the full spectrum of that stuff in a nice package? Or is it more like Splunk? Is it more like Splunk? You know, where you get kind of someone who's tuned up to be really good at a particular kind of pipeline, a kind of data and a kind of query. And then they can monetize that in a vertical application. I think those are really interesting questions for the space going forward. Sudip [00:32:23]: And that is actually a really interesting thread I just want to kind of pull on a little bit, which is, I think historically data engineering has been mostly about using best of breed tools and then stitching them together to implement your data pipelines, right? 
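As a rough illustration of chats turning into columns (happy or sad, complaint or praise, which product), the sketch below featurizes free-text messages into a fixed set of fields. The keyword rules are only a stand-in for an LLM call, which is what the discussion actually has in mind; the word lists, product names, and the `featurize_chat` function are assumptions made up for this example.

```python
# Hypothetical vocabulary; a real featurizer would call a language model
# instead of matching keywords.
NEGATIVE_WORDS = ("terrible", "broken", "refund")
PRODUCTS = ("router", "modem")


def featurize_chat(text):
    """Turn one free-text chat message into structured columns."""
    lowered = text.lower()
    is_negative = any(word in lowered for word in NEGATIVE_WORDS)
    # First known product mentioned in the message, or None if absent.
    product = next((p for p in PRODUCTS if p in lowered), None)
    return {
        "sentiment": "sad" if is_negative else "happy",
        "kind": "complaint" if is_negative else "praise",
        "product": product,
    }


CHATS = [
    "My router keeps dropping the connection, this is terrible.",
    "Thanks, the new modem works great!",
]

ROWS = [featurize_chat(chat) for chat in CHATS]
for row in ROWS:
    print(row)
```

Each resulting dictionary is a row that could be loaded into a CRM or warehouse table alongside traditional relational data.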
Now, of course, we have companies like Databricks, which you are intimately familiar with. And then to a certain extent, even Snowflake, they talk about their unified data platform. Do you feel like we are heading to a world where enterprises will standard on a unified data platform? Or do we still have this, you know, kind of duct taping of best of breed tools? Joe [00:32:57]: You know, it's funny. We have this conversation often. And when I say we, I mean, those of us in the community, not just you and me. You say Databricks and then you say Snowflake and you think about them. And then you remind yourself that AWS and Google and Microsoft have those things and 17 more, right? That are data solutions. So if we broaden the scope a little to be like, what tools in the AWS toolbox should I be using? And will I stay only at AWS? You know, we know the answer to that. The answer is no, right? For lots of reasons. Now, as to who's going to be good at what, as opposed to maybe I want to split my bets and I don't want to have one vendor relationship. It's going to be possible to be an 80% solution on a bigger, 80% of more stuff, right? The relational database was always winning because it was good for 80% of your data problems. Now, you know, what is 80% of your data problems is a much, your problems are much broader. I do think there are sort of 60 to 80% solutions there that you could get from a single vendor. I think there will be solutions that under the hood will have a lot of pieces that today we think of as different pieces of software. You know, one of the things that happened with the big data era and open source and I got a little salty about this with some vendors at a conference recently, I should say, is, you know, we went through 10 years of Hadoop, right? And it was awful. And the reason it was awful was partly because it wasn't the greatest software in the world. It was kind of open source and a lot of it was immature and never really matured. 
But a lot of it was that it was 14 or 16 pieces of software that weren't super mature. Each of which had a logo and a fan that like, you know, a community around it. So there were five or six super fans followed by 25 fans. It was like identity politics. It's like, no, no, no. You have to integrate the queue into the database system because the queue is cool. It's got a name and a logo and a bunch of fans, you know, whereas if it had been run by a business, they might have consolidated business units over time. And so what I think we're dealing with right now is going to be consolidation of what we currently think of as pieces of the pipeline just because it's going to make technical sense to consolidate them. They're close enough to each other. They should just not have two teams and two products and AWS is rife with this, which is the most confusing. I do think we're going to enter into an era of consolidation around that. Open source will be the last to do it though because of all the cultural issues I just mentioned. The way you get open source to move forward is you build an inner team that really is super fans. And so that causes fragmentation of product. It's hard to build a big enough team of super fans to build a big enough product and to merge teams over time. So I got into a fight at this conference because somebody said that the future of data systems is the stuff emerging from Facebook and Voltron and other places right now. And I was like, I'll believe you when you show it to me, but the last 10 years suggests otherwise. What I see from Amazon, Google, Oracle, Databricks, Snowflake, all that stuff is way better than what's coming out of open source technically. And it's not like they're hiding the technology. Actually, they're publishing about it, especially like Amazon and Microsoft. They write lots of papers about what they're building. Those papers are more sophisticated on average than what we're seeing in open source. 
So much as I'm a huge advocate of open source and a postgres guy from way, way, way back, what I'd say is that in terms of these big stack problems, we're going to see consolidation. It's going to happen first at the bigger companies or at startups that are willing to take on a big risk and that some of this like piecewise stuff is going to fade away. Sudip [00:36:17]: It's a fascinating view. Yeah. Thank you. I have a similar question, but more around use cases. So which is traditionally we always have had systems that were transactional, so doing OLTP. And then we had systems built for analytics, OLAP. And I think, you know, I actually myself got into a debate five, six years ago about whether it is possible for one system to cater to both. And I think there was a terminology around that time that came up, which was HTAP, Hybrid Transaction Analytics Processing or something like that. Do you see like a world where it's one system to serve both use cases or do you think those use cases will forever be served by separate systems? Joe [00:37:03]: To me, this is totally a cost-benefit analysis conversation. So is it possible to build a system that does both? Well, I believe it's possible. In fact, one way to build it is to just sell both systems and put a little glue underneath it and kind of try to hide it from the customer. But that's not a very good instance. But you can build these things. And in fact, former PhD students of mine have done great work on this at IBM and other places. The possibility to do it well is there. It's hard to do because you're basically meeting multiple SLOs with a single piece of software. Some people want very low latency, high transaction rate and then transactional semantics. And other people's SLOs, I want very large data volume. High latency is fine, but huge data volume, I want lots of throughput of bytes. 
And to meet both those SLOs in a single system without introducing 700 tuning knobs, one of which might be run in OLTP mode versus run in OLAP mode. But to really get down to what are all the knob settings that make it work well in one or the other, it's really hard to build that well and make it usable. And then you ask the question, well, if I built it and I invested in doing that, and let's say I have infinite budget, let's say I'm AWS, is there enough customer demand to fund that? And to my guess, the answer is no. Most people probably can live off some kind of data loading in the background that's not too hard to manage, not too expensive in human time to do the ETL, ELT, call it what you will. And have two systems and two different organizations that manage them. And then there's the governance issues. So the governance issues for your operational databases are typically very few people get access to them, right? And it's real easy to lock down who gets to see what because very few people get to see any of it. But once you get it into the warehouse and you've torn apart all the little pieces that are accessible to different people from each other in some sense, now you have a management problem only on the OLAP side, but it's hard enough over there. So getting governance working in HTAP is also a big challenge outside of the other technical challenges. The organizational governance challenges are hard. Again, I don't think most orgs really want all that noise in their production operational databases. My advice to the entrepreneur doing HTAP is that's brave and risky. But if you did it and you won, man, that'd be awesome. You'd get two markets instead of just one. But I think it's high risk. Sudip [00:39:18]: I will take that. By the way, one comment I just wanted to make was a conversation we had over the last half an hour, 40 minutes. 
I've heard you so many times bring up governance, which was fascinating, you know, given you're an academic first, but I imagine a lot of that comes from your entrepreneurial experience. Joe [00:39:33]: Yeah, absolutely. Nobody, none of my colleagues know much about this except for the ones who are actually beginning to get interested in fairness. So some of my colleagues who are working on the boundary like data fairness, AI fairness, they get it actually very deeply in ways that I often don't think about because in industry we don't think about that so much. But most core computer scientists think about how fast things go and how useful the outputs are. And they tend not to think about governance. It's absolutely right. And it's hard to teach, honestly, like giving a lecture on that stuff's pretty dry. I don't know if you've ever gone through a class on like access control, but it's not real inspiring. So I think it's one of those things you've got to get out there, get your hands dirty. And they realize it's like the biggest problem sometimes. It is not fun because you are mostly talking about locking down as opposed to, you know, obviously empowering people. Yeah, it's a little bit like security in that sense where, you know, you have to scare people enough and put a lot of drama around doing it wrong. To get people fired up to want to do it right. Sudip [00:40:25]: 100%. I want to ask you a little bit about how you see the whole data stack evolving with AI. I mean, everybody now wants to be an AI company. Do you see some fundamental changes in the data stack and how we have done data engineering over the last, you know, two, three decades now that we are going to add up all of this in a fascinating LLM technology? Joe [00:40:46]: I guess what I'd say is a couple of things. 
I'm beginning to get the feeling after beavering away at this at Aqueduct for a couple of years with my co-founders who are brilliant Berkeley PhDs and professors that, you know, picks and shovels broad purpose tooling for LLMs for the enterprise is not ready yet. It's too early for that. Most enterprises that are going to go in this direction probably will either use what they can get from Microsoft and the like or they'll do some stuff in-house until they figure out what they really need. And this just hasn't settled down enough to do a broad-based solution. I think that decentralized solutions are going to be earlier to adoption and earlier to generating real value. You know, the best examples that we see in the wild are the difference between ChatGPT and Copilot. I know exactly why I would pay for Copilot. I have less reason to pay for ChatGPT as an individual. I mean, it's kind of cool as long as it's cheap. But, you know, broad-based answer any question I can think of pretty well, I don't know what that's for, really. But, you know, if developers are going to be doing the tooling because developers are going to give very crisp feedback as to what they want in this space, that is a good place to focus. I think Microsoft's been very smart focusing on Copilot because the value of it to their constituency is very clear and they can dogfood it in-house for a long time and figure out how to make it better at the tasks they already do anyway. Medical, by contrast, like on the one hand, sounds great, right? And on the other hand, yikes! How do we make it work well for this kind of thing when it hasn't been trained on it? I had a conversation with a major medical provider recently and they said they have like a stack list of stuff they want to try LLMs for and it's got like 250 use cases. And I said, wow, I'd love to see that. That sounds fascinating. 
But, you know, the ones that we settled on talking to them about that seemed like they were actionable were ones inside of IT. Because inside of IT, first of all, we wouldn't kill anybody. Second of all, we could really figure out whether it's working or not pretty well. And so doing pilots inside of sort of technology settings, I think there's going to be a lot of easier pickings there for a while. And we'll get better at it by using it ourselves as the developer community more quickly. Of course, people will succeed applying it to specific verticals too. And that's also good. But doing broad-based right now, I think we're too early. Sudip [00:43:06]: Too early to settle. I want to ask you one last question before we move on to the lightning round. And you probably have one of the best vantage points to answer this. What are some of the most interesting future directions and research projects in data engineering that you are excited about? Joe [00:43:23]: So I always like to answer these questions in terms of things that I'm not working on because I have my biases, right? But actually on this one, since I'm not doing Trifacta anymore, I would say that actually data transformation, the T in your E-T-L-T-L-T-L-T-L, whatever, star, I think it's an awesome petri dish for LLM technologies because of a bunch of things. First of all, it's algorithmically trivial. An LLM doesn't have to invent new algorithms to do ETL. We're not doing things to the data that involve computing Fibonacci numbers or even sorting often. I mean, like, you don't have to implement quicksort. You don't have to invent any new algorithms. You just need to apply building blocks in the right way to the problem. So I think it should be good at it to a first approximation. Secondly, it's a very hard user interface problem. So it's one thing for me, and I don't mean to minimize things like Midjourney because actually I think it's totally fascinating and awesome. 
But if I say, give me a picture of two people talking on a podcast and it's got this crazy awesome picture, it's really gratifying. But it doesn't matter if it's right or wrong. And I didn't define what's right and wrong. Sudip [00:44:26]: How delighted were you? Joe [00:44:27]: I'm like, I'm delighted. Try again. That's not engineering. That's authoring. And authoring has got different constraints than engineering. So I would say in the engineering world, the great thing about data transformation is it's hard to know if you got it right. It's a huge piece of data. So whether you got it right is highly contextual. Example, you got log files like Splunk. The marketing department is going to put it in the CRM to try to figure out targeted ads. They want to do one thing with that data. The IT department wants to look at downtimes of servers. You're going to do a very different thing with that data. And whether the data is cleaned adequately for one or the other is a completely different objective function for the optimization. So saying there's an LLM that does this, well, maybe with the right prompting. Yeah, maybe. But the prompting is going to have to be very interesting. And my point there is because there's many different correct answers to clean up this data, the evaluation of the output becomes the hard part. And this gets all the way back to the beginning of the podcast. Did you build the right user experience so the user can guide the system and decide if the outputs are right? That guide decide loop comes back. So I think it's a wonderful petri dish and LLMs are great at some things right now in data cleaning and terrible at others. We can talk about that and it's one I know well. So that's also easy for me to suggest. Sudip [00:45:42]: So we end each of our episodes with three quick questions. We call it the lightning round. So the first one is around acceleration. So I'm going to ask you a little bit around your space, which is data engineering. 
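Joe's point that "cleaned adequately" depends on who is asking can be shown with a small sketch: the same raw log lines, cleaned two different ways, with both results being correct for their own purpose. The log format, field names, and the two cleaning functions are invented for illustration and do not reflect Splunk's format or any particular tool.

```python
# Hypothetical raw log lines: a timestamp followed by key=value pairs.
RAW_LOGS = [
    "2024-01-05T10:00:00 host=web1 status=200 user=alice path=/checkout",
    "2024-01-05T10:00:05 host=web2 status=503 user=- path=/health",
    "2024-01-05T10:00:09 host=web1 status=500 user=bob path=/cart",
]


def parse(line):
    """Split a log line into a dict of its fields plus the timestamp."""
    ts, *pairs = line.split()
    record = dict(pair.split("=", 1) for pair in pairs)
    record["ts"] = ts
    return record


def clean_for_marketing(lines):
    """Marketing cares who visited which page; anonymous traffic is noise."""
    rows = [parse(line) for line in lines]
    return [
        {"user": r["user"], "path": r["path"]}
        for r in rows
        if r["user"] != "-"
    ]


def clean_for_it(lines):
    """IT cares about server errors; successful requests are noise."""
    rows = [parse(line) for line in lines]
    return [
        {"ts": r["ts"], "host": r["host"], "status": int(r["status"])}
        for r in rows
        if int(r["status"]) >= 500
    ]


print(clean_for_marketing(RAW_LOGS))
print(clean_for_it(RAW_LOGS))
```

The health-check line is dropped by one function and kept by the other, so neither output can be judged right or wrong without knowing the objective, which is why evaluating the output is the hard part.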
What do you think has already happened in data engineering that you probably had thought would take much longer? Joe [00:46:02]: The one that surprised me with the speed at which it went is the disaggregation of the stack. I would have expected to keep the tightly coupled, what they call shared-nothing architecture, where you have memory and disk all in one, and that's your building block, and you knit those together and you parallelize across these full machines. That transitioned in the cloud very quickly to what amounts to shared disk. We have a disk tier, a storage tier, think of S3, and we have a query processing tier and a log generation tier maybe, and the log generation tier hydrates the storage tier. I did not expect that to happen, or to happen so quickly, and it makes sense and it forced a lot of design changes. I'm very impressed actually with the teams at Microsoft and Amazon and Google who've led on this, but that shift to a disaggregated stack went faster than I would have guessed. Sudip [00:46:47]: In large scale data processing, generally speaking, what do you think is still the most interesting unsolved question? I mean, data cleaning is definitely one I can think of from your background. Joe [00:46:58]: I think the most exciting thing I'm not working on right now is querying unstructured data. I think we're going to see tons of progress on that in the next five years. I don't even think you need 10. In five years you'll have a SQL interface to everything, and you'll be able to ask questions of any kind of object and get some kind of answers. But you may not have time to train GPT-4 on all that data every time, so how are you going to plumb together all the pieces so I can ask SQL on all my data that's in the lake? I think that's happening. It's happening in research and I foresee it happening in industry over the next short window. Sudip [00:47:28]: Fantastic. Last question. What's one message you would have for everyone listening? 
Joe [00:47:35]: One message that, I mean, this audience probably already knows, but it's always about the data. There's going to be lots of innovations on computation, there's going to be lots of cool algorithms, there's going to be new kinds of models. It's always all about the data, and that means where did it come from, what data did you choose to acquire, and then of course how you bake the cake with it, right? You know, whether it's training a model or building a warehouse or whatever, that's important, but it's always about all the data and where it came from. It's always about that. But so are the systems you roll out and the algorithms you run, they're always in service of the data, and the traditional field of computer science really is always all about the data. Sudip [00:48:14]: That is actually a really fascinating answer, you know, given we are talking about data engineering and given your background. Could not agree more. So on that note, Joe, it was a real pleasure and privilege to host you today. Thank you so much for your time! Joe [00:48:29]: My pleasure, Sudip. Thanks for having me. Sudip [00:48:31]: All right. This has been the Engineers of Scale podcast. I hope you all had as much fun as I did. Make sure you subscribe to stay up to date on all our upcoming episodes and content. I am Sudip Chakrabarti and I look forward to our next conversation. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit sudipchakrabarti.substack.com [https://sudipchakrabarti.substack.com?utm_medium=podcast&utm_campaign=CTA_1]

Dec 12, 2024 - 48 min

From Spark to Databricks: Spark's Origins, Innovations, and What's Next - with Reynold Xin

In our Engineers of Scale podcast, we relive and celebrate the pivotal projects in Infrastructure Software that have changed the course of the industry. We interview the engineering “heroes” who had led those projects to tell us the insider’s story. For each such project, we go back in time and do an in-depth analysis of the project - historical context, technical breakthroughs, team, successes and learnings - to help the next generation of engineers learn from those transformational projects. We kicked off our first “season” with the topic of Data Engineering, covering the projects that defined and shaped the data infrastructure industry. In our previous episode [https://sudipchakrabarti.substack.com/p/when-hadoop-was-king-and-yahoo-was], we hosted Doug Cutting [https://www.linkedin.com/in/cutting/] and Mike Cafarella [https://www.linkedin.com/in/mikecafarella/] for a fascinating discussion on Hadoop. In this episode, we are incredibly fortunate to have Reynold Xin [https://www.linkedin.com/in/rxin/], co-creator of Apache Spark [https://en.wikipedia.org/wiki/Apache_Spark] and co-founder of Databricks [https://www.databricks.com/], share with us the fascinating origin story of Spark, why Spark gained unprecedented adoption in a very short time, the technical innovations that made Spark truly special, and the core conviction that has made Databricks the most successful Data+AI company. 
Timestamps * Introduction [00:00:00] * Origin story of Spark [00:03:12] * How Spark benefited from Hadoop [00:07:09] * How Spark leveraged RAM to monopolize large-scale data processing [00:09:27] * RDDs demystified [00:11:43] * Three reasons behind Spark’s amazing adoption [00:21:47] * Technical breakthroughs that sped up Spark 100x [00:27:05] * Streaming in Spark [00:31:13] * Balancing open source ethos with commercialization plans [00:37:45] * The core conviction behind Databricks [00:40:40] * Future of Spark in the Generative AI era [00:44:39] * Lightning round [00:49:39] Transcript Sudip: Welcome to the Engineers of Scale podcast. I am Sudip Chakrabarti, your host and partner at Decibel.VC, where we back technical founders building technical products. In this podcast, we interview legendary engineers who have created infrastructure software that has completely transformed and shaped the industry. We celebrate those heroes and their work, and give you an insider's view of the technical breakthroughs and learnings they have had building such pioneering products. So today, I have the great pleasure of welcoming Reynold Xin, co-founder of Databricks and co-creator of Spark to our podcast. Hey, Reynold, welcome. [00:00:37] Reynold: Hey, Sudip. [00:00:38] Sudip: Thank you so much for being on our podcast. Really appreciate it! [00:00:41] Reynold: Pleasure to be here. [00:00:42] Sudip: All right, we are going to talk a lot about Spark, the project that you created and that is behind the company that is now Databricks. You went to University of Toronto, is that right? [00:00:54] Reynold: I did go to the University of Toronto, spent about five years in Canada, and then came to UC Berkeley for my PhD. So, been in the Bay Area for over, I think almost 15 years by now. [00:01:05] Sudip: What brought you to Berkeley specifically? [00:01:07] Reynold: It's an interesting point. 
So when I was considering where to pursue my PhD studies, I looked at all the usual suspects, the top schools, and one of the things that really attracted me to Berkeley - actually, it was two things. One is there's a very strong collaborative culture, in particular across disciplines, because in many PhD programs, the way it works in academic research is you have a PI, a principal investigator, a professor who leads a bunch of students, and they collaborate within that group. But one thing that's really unique about Berkeley is that they brought together all these different people from very different disciplines - machine learning, computer systems, databases - and have them all sit in one big open space and collaborate. So it led to a lot of research that was previously much more difficult to do because you really needed that cross discipline. The second part was Berkeley always had this DNA of building real-world systems. A lot of academic research kind of stop at publishing, but Berkeley sort of has had this tradition of going back to BSD, UNIX, Postgres, RAID, RISC, and all of that, actual systems that have a real-world industry impact. And that's really what attracted me. [00:02:20] Sudip: And obviously, that's what you guys did with Spark, too. [00:02:23] Reynold: Yeah, we tried to continue that tradition. So we didn't stop at just the papers. [00:02:27] Sudip: Yes, absolutely. And I think that's a very common criticism of a lot of academic work, right? Great in quality, but doesn't go that last mile to get to production or get to actual users. [00:02:40] Reynold: It's not necessarily a wrong thing either, just different approaches. You could argue, hey, let academia figure out the innovative ideas and validate them and have industry productize them. It's not necessarily the strength of academia to productize systems. To some extent, it's just different ways of doing things. 
[00:02:57] Sudip: Let me go back to when you guys started Spark, and this is circa 2009. I'm guessing you had just joined the PhD program, and this was still the AMP Lab - Algorithms, Machines, People Lab - right? [00:03:11] Reynold: Yeah. [00:03:12] Sudip: Which, of course, is now Sky Lab and was RISE Lab in between. Can you give us a little bit of an idea of what the motivations were to start the research behind Spark? I mean, this was the time when Hadoop was still king, right? Like, why do Spark? [00:03:27] Reynold: It was an interesting story. So Spark actually started technically the year before I showed up. By the time I showed up, there was this very early, early thing already. So Netflix back then in the 2000s, I think even a little bit before 2009, had this Netflix Prize, which is the competition they created in which they anonymized their movie rating datasets so anybody could participate in the competition to come up with better recommendation models for movies. And whoever could improve the baseline the most would get a million dollars. [00:03:58] Sudip: Yeah, I remember that. [00:04:01] Reynold: That was a big deal. Eventually, it was shut down for privacy reasons. I think maybe there were lawsuits that happened, but it was a big deal in computer science and in the history of machine learning. And this particular PhD student, Lester, was really into these kinds of competitions, and also a million dollars was a lot of money. [00:04:17] Sudip: Sure. For a grad student in particular, right? [00:04:20] Reynold: A grad student makes about $2,000 a month. So he tried to compete, and one thing that he noticed was that this dataset was much larger than the toy datasets he used to work with for academic research, and it wouldn't fit on his laptop anymore. So he needed something to scale out to be able to process all this data and implement machine learning algorithms. 
And one of the keys with machine learning is that you are not done when you come up with the first model. It is a continuously iterative process to improve it over time. The velocity of iteration is very important. And he tried Hadoop first because that was the only thing - if you wanted to do distributed data processing, you used Hadoop back in 2009. And he realized it was horribly inefficient to run. Every single run took minutes, and it was also horribly inefficient to write. So iterating on the program itself was very difficult because the API was very complicated and very clunky. So he kind of walked down the aisle - the nice thing about having a giant open space with people from very different disciplines - and talked to Matei, who was also a PhD student back then, one of my fellow co-founders at Databricks. He said, hey, I have this challenge, and I think if you have those kinds of primitives, I could really do my competition much faster. So he and Matei basically worked together over the weekend and came up with the very first version of Spark, which was only 600 lines of code. It was an extremely simple system that aimed to do two very simple things. One was a very, very simple, elegant API that exposed distributed data sets as if those were a single local data collection. And second, it could put or cache that data set in memory. So now you can repeatedly run computation on it, which is very important for machine learning because a lot of machine learning algorithms are iterative. So with those two primitives, they were able to make progress much faster for the Netflix Prize. And I think Lester's team even tied for the first place in terms of accuracy. [00:06:24] Sudip: Did he get the money? [00:06:24] Reynold: He did not get the money because their team was 20 minutes late in the submission. So they lost a million dollars for a 20-minute difference. 
So if Matei had worked a little bit harder and had Spark maybe 20 minutes earlier, Lester might have been a million dollars richer. [00:06:44] Sudip: That's such an amazing story. Wow, I actually did not know that! [00:06:48] Reynold: So when Spark started from research, it kind of started for a competition and really just the collaborative open space and the opportunity that all of those people just happened to be there at the right time led to its very, very first version. Now, obviously Spark today looks very, very different from what the original 600 lines of code were. But that's how it got started. [00:07:09] Sudip: One question I have since you touched on Hadoop, do you feel that Spark benefited from Hadoop being already there? Did you guys use some components of the Hadoop ecosystem? Like for example, if Hadoop hadn't existed, do you think Spark could have still been created? [00:07:23] Reynold: Spark definitely benefited enormously from Hadoop early on. There is also baggage that we carry from Hadoop that is still there today. But it definitely benefited massively. The first example was that Hadoop solved the storage problem for Spark. And as a result, Spark never had to deal with storage. Spark more or less considered storage as a commodity. To a large extent, organizations were able to store a large amount of data reliably and cheaply, which was key. [00:07:56] Sudip: And this is HDFS in particular? [00:07:59] Reynold: HDFS, yeah. And later on HDFS largely faded and got replaced by object stores. But Spark never had to worry about, hey, how do you store a large amount of data? And that was very, very important. And Spark piggybacked onto the Hadoop deployments. All the initial deployments of Spark were on Hadoop clusters themselves. So the existence of those clusters made it easier because if the hardware resources were not there, that would also have been very problematic for any user of large-scale data processing systems. 
And Spark leveraged a lot of the Hadoop code itself, especially the storage layer, retrieval as well as the data formats. So Spark definitely benefited enormously from it. At the same time, it's a lot of baggage, unfortunately. There's no free lunch. [00:08:45] Sudip: Right, right. And we actually had Doug Cutting and Mike Cafarella, on the previous episode here, and Doug was talking about how he fully anticipated Hadoop eventually evolving and being replaced by a more advanced system. So it's sort of generations of advancement that happened. One of the key game-changers behind Spark at a very high level was Spark's use of RAM, the memory, keeping data in memory. Can you talk a little bit about how Spark uses memory and what are the benefits? Obviously, there are benefits in speed, iterative computation, but can you talk a little bit about that? And then one other question I'll ask at the same time is why was it so difficult for Hadoop to maybe adopt memory? [00:09:27] Reynold: This is actually a very complicated topic, but let's try to maybe explain it in just a few minutes. There are two places where Spark uses memory in a fairly clever way that Hadoop didn't do and those really led to the dramatic improvement. One is the ability to simply keep the data in memory. And as we said earlier, that's very important for any kind of iterative computation in which you want to repeatedly scan the same data. So it was critical for machine learning workloads. But at the same time, it's also the same primitive that's very useful for any interactive data science because you're often looking at the same data sets over and over and over again when you're doing interactive data science. Those actually happened to be the first maybe two killer use cases of Spark. The second place, it's a little bit less about the direct use of memory, but because Spark exposes fundamentally much higher-level abstraction compared with Hadoop. 
Hadoop is very simple - it's Map, Shuffle, Reduce - MapReduce. It's a very simple paradigm. There's no concept of joins and no concept of filters. You basically create filters yourself and map them back to Map and Reduce. Spark exposes a higher-level abstraction that has the concept of filters, joins, cogroup, and all of this. And as a result, it can chain a more complex computation as DAGs, directed acyclic graphs, of tasks. And as part of it, it knows, for example, hey, if you're running a Map right after a filter, you don't have to persist the output of filter onto disk or onto HDFS and then read it back in. As a matter of fact, it can just stream through them. So this particular optimization now removes the need for data to go to disk repeatedly in a larger computation. But it's a little bit less about just memory. It has to do with a combination of both, hey, let's just pass through data, stream through data in memory, as well as having the ability to express that more complex computation graph. [00:11:23] Sudip: So if you had like three stages in that DAG, Hadoop would’ve written the results to disk after every stage and read them back, whereas Spark can keep it in memory because it knows that DAG? [00:11:36] Reynold: It knows it's the same thing. I mean, it has to do with having the completeness of the computation rather than only Map and Reduce. [00:11:43] Sudip: Right. And then one of the primary concepts behind Spark is what is called RDD, Resilient Distributed Datasets. Can you talk a little bit about that for someone who might not be familiar? [00:11:54] Reynold: I think maybe from two perspectives, one's from a system perspective, the other one's from the user's perspective. And I think the user's perspective probably should go first. The brilliant thing about RDD is that distributed computation used to be super difficult. And if you think about message passing, Hadoop made it slightly simpler to have the MapReduce concept. 
RDD took it much further and basically said, hey, if you have a large dataset, the way you program large datasets should just be like how you program a collection of data in memory on a single node. If you were to write a Java program, a Scala program, a Python program, everybody knows what a list is, an array is. They're all collections and there are ways to transform collections. Maybe, the way we should be programming large datasets should be identical to that. And that really means the API now for programming a distributed program against a large amount of data is as if you're just manipulating some data that's on a single node. So that dramatically decreased the complexity in the API surface. Now, the second big innovation in RDD is this. One of the big things with distributed computation is, hey, you have all those machines, you might have thousands of machines that could fail. How do you deal with failures? So RDD, while the user-facing API exposes just a bunch of collections, internally, it creates a lineage for every collection. So it doesn't literally materialize when you say, hey, this collection is just a filter on the previous one. Instead of materializing the collection, it is actually lazy. It just tracks, hey, this collection is simply formed through doing a filter on the previous collection. So it gives you that lineage of how you create the datasets. And again, when you really need the result, for example, you want to output the data, you want to get back how many rows there are, it would trigger this computation graph. And if there's a failure on any of the machines, it runs something very simple - it analyzes the computation graph and checks, hey, so what are the downstream, upstream dependencies? If that node fails, I just need to reproduce the data on that node. So it creates a minimum plan to reproduce the partial dataset on that node to handle failures. And it does that all gracefully, without the user having to know anything about it. 
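The lazy lineage tracking and per-partition recovery described above can be illustrated with a toy, single-process sketch. This is not Spark's implementation or API: `ToyRDD`, its fields, and its `compute` method are hypothetical names invented for illustration, and "partitions" here are just list slices rather than data on separate machines.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional


@dataclass(frozen=True)
class ToyRDD:
    """A lazily evaluated, partitioned collection that records its lineage.

    Hypothetical toy class: it stores either a source function or a parent
    plus a transformation, never the materialized rows.
    """
    num_partitions: int
    source: Optional[Callable[[int], List[int]]] = None
    parent: Optional["ToyRDD"] = None
    transform: Optional[Callable[[Iterator[int]], Iterator[int]]] = None

    def filter(self, pred: Callable[[int], bool]) -> "ToyRDD":
        # Only records "this is a filter on the previous collection".
        return ToyRDD(self.num_partitions, parent=self,
                      transform=lambda rows: (r for r in rows if pred(r)))

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        return ToyRDD(self.num_partitions, parent=self,
                      transform=lambda rows: (fn(r) for r in rows))

    def compute(self, partition: int) -> Iterator[int]:
        """Rebuild one partition by replaying its lineage from the source.

        Generators stream rows through each step, so intermediate results
        are never written out between the filter and the map.
        """
        if self.parent is None:
            return iter(self.source(partition))
        return self.transform(self.parent.compute(partition))

    def collect(self) -> List[int]:
        """Trigger the computation for every partition."""
        out: List[int] = []
        for p in range(self.num_partitions):
            out.extend(self.compute(p))
        return out


# Two partitions holding 0..4 and 5..9.
base = ToyRDD(num_partitions=2, source=lambda p: list(range(p * 5, p * 5 + 5)))
result = base.filter(lambda x: x % 2 == 0).map(lambda x: x * 10)
# Nothing has run yet; `result` only describes how to build the data.
all_rows = result.collect()
# Simulate losing partition 1: replay only that partition's lineage.
recovered = list(result.compute(1))
```

In this sketch, recovering a lost partition touches only the source rows of that partition, which is the "minimum plan" idea in miniature.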
So that's really the two brilliant parts of RDD. The first is, it creates a new programming paradigm for distributed data processing, which makes you program basically single-node collections. And the second is all the underlying system techniques to make fault recovery work really well. [00:14:29] Sudip: Got it. And then in 2013, you guys introduced DataFrame, and then in 2015, you introduced Datasets. So what are those APIs and how do they connect or relate to RDDs? [00:14:43] Reynold: So in 2014, we introduced DataFrames. I remember because in 2015, at Strata conference, which is a big conference, I gave a talk about DataFrame GA. After we started Databricks and we started working a lot more closely with the Spark users. I mean, we always work very closely with Spark users but after the company started, now we no longer had the academic research to worry about. [00:15:08] Sudip: No paper to write. [00:15:10] Reynold: No paper to write. No exams to take. No courses to take. And then we realized at some point that, even though there's a lot of unstructured data and semi-structured data out there, at some point people introduced structure onto their data. Structure could be, hey, here's an array of floating point numbers for my machine learning vectors. Structure could be, here's a column called email description, which is a pile of text, right? Those are all structures. Probably like 95% of the programs become some sort of loosely defined structured programs. And the collection of data, which while it's a very powerful abstraction, is still not high enough for structured programming. And that involves also a lot of user-defined functions. Like imagine if you're programming in Python, you want to traverse a list, you want to do something to it. You write a lot of code to say, for example, let's do a for loop across the list. And then for each of the element, you try to compare it to say the number one. If it's number one or it's greater than one, you keep it. 
If it's less than one, you ignore it. There's a lot of code you're writing, expressing in Python. The problem with that code is that the system cannot optimize it because it is Python code. It is Python code with very strong Python-specific semantics that we can't do anything about. But we do know that often people are just doing very basic comparisons. They're doing very basic expressions that exist in a more structured context. So the reason we created DataFrame was twofold. One is we want to raise the level of structure in the API even higher that makes structured programs easier to write. So users have to write less code. The second is we want to be able to capture more and more of the semantics of the computation. So instead of the user writing a lot of user code in Python, they will just express what they want to do in the DSL in Python still. But this time it's the DSL that tells us the semantics of all those computations. And then we can optimize that under the hood. As a matter of fact, we did. Before Spark 2.0, Python was probably 5 to 10 times slower than any JVM language on Scala and Java for Spark users. And even today, if you Google or ask ChatGPT, it's very likely ChatGPT will tell you that if you use Scala, you get better performance with Spark. And the reason for that is not because Spark itself behaves very differently. Simply because if you had used Python before, we had to run a lot of your code in Python, and the Python code would inherently be slower. But with DataFrame, we're able to capture the semantic information and actually generate an execution plan that, regardless of what language you use, will be the same execution plan and we'll optimize it and we'll make it run faster and faster over time. And that was incredibly powerful. And it basically got Python to have exactly the same performance as the JVM languages. So it's a big deal because these days probably 80% of the Spark users use Python. [00:18:27] Sudip: Right. 
I mean, it's a language for data scientists. [00:18:30] Reynold: Yeah, exactly. And then the Dataset API came, although I actually would not recommend people to use the Dataset API. In hindsight, I would not have created the Dataset API. [00:18:40] Sudip: I see. [00:18:41] Reynold: So Python is weakly typed, dynamically typed. Scala is statically typed. And a lot of Scala programmers really love typed information because that’s powerful. One of the things with DataFrame is that it becomes dynamically typed. So even when we declare DataFrame in Scala, it doesn't actually know what exactly is the type of that DataFrame, what are the columns in it. All of those are dynamically generated. So Dataset was our attempt and I think we did our best job out there to create a statically typed program that allows you to bind in runtime but give you that compile-time safety throughout the program so you would know, hey, this is actually a Dataset of, for example, a student. And student is a class with all these fields. And we give you the compile-time safety, but with the final validation that when you read this data in, the first time you create a Dataset of student, we actually run the validation to see all the data actually have all these fields you need and how does it map to the student class. The reason I said maybe in hindsight, I would not have created it is, as it turned out, it only served a very small percentage of users. And second, it's extremely complicated. The most complicated code in Spark are one, the scheduler - schedulers are always complicated. Second is all this type mapping and static sort of type, dynamic type binding stuff. Those codes are super hairy and very error-prone and very few people know how to change it. So a lot of investment for high technical complexity for very small percentage of users. It is a very cool idea though in abstract. [00:20:21] Sudip: That's a fascinating story. 
Talking about programming language, one quick question I had was, the choice of programming language for Spark was Scala. Why was that? [00:20:31] Reynold: Every month some new engineer joins Databricks and asks, why do we use Scala? It was actually an easy choice back then. In 2009-2010, because the Hadoop ecosystem was in Java, in order to be able to leverage most of the Hadoop ecosystem, Spark needed to be on the JVM. At the same time, we really wanted something that could be interactive. We felt that the interactive experience was extremely important. Java was not interactive. Even today, Java is not interactive. There's no way you just hand-type Java without an IDE. It's not concise. It is super verbose. So Scala was a language that was identified as basically much more concise than Java, with a REPL that we could actually hack. So now somebody could just interactively, in the command line, start issuing one line of Spark code and run it across like a thousand machines in the cloud. And there was no other language that would fit the requirements of being on the JVM and being interactive. [00:21:33] Sudip: That actually clears up a big question for me I've always had, why Scala? [00:21:39] Reynold: I think Spark and Scala had a symbiotic relationship and really a lot of Spark users came to Scala because of Spark. [00:21:47] Sudip: Yeah, exactly. A lot of people learned Scala, including myself, just to use Spark. Now, shifting gears a little bit, Reynold, one of the things that obviously we all have witnessed over the last 10 years is the amazing adoption of Spark. Now that you are looking back and it's 2023 now, looking back at the last 10-12 years of Spark history, if you were to pick, let's say, three factors that you think were the most important in driving this adoption of Spark, what do you think those would be? [00:22:13] Reynold: The first and foremost would be, I think, the focus on the end user. 
Everything I talked about so far I always started with a level of abstraction to make it much simpler to program. These days I don't need to evangelize Spark anymore because it's everywhere. But one thing I often say is that, many of you will come because of the performance improvement. Because you've heard that you can make your program 10 or 100 times faster than Hadoop, but you’ll really stay for the programmability. You'll never go back to a MapReduce program once you program Spark. We didn't end with just the RDD API. We have continued pushing the boundary, introduce streaming APIs, introduce DataFrames and all this. And they're all about how do we help the end users to be a lot more productive, make the APIs more expressive for the tasks they are supposed to do. And that focus on the end user programmability is key. The second one would be a culture of innovation which is not surprising because a bunch of us came from academia and really wanted to apply bleeding-edge ideas to a real-world system and see how we could improve it. So Spark brought a lot of innovative ideas that were never ever done in other systems or only done in very niche systems. But, Spark brought those to the masses. The third one I would pick is unlike Hadoop and many other systems in the past, Spark had a batteries-included approach, which means, hey, what are the most common things people want to do? Let's make it doable out of the box, instead of having another extension framework or project you have to go download, install and configure in the system. We have had, for the longest time, all the popular data types. You could just use the built-in APIs to read them, like popular data types for distributed computation, not necessarily for local stuff because the focus was on large-scale data sets. And we added the whole machine learning library to Spark if you want to run logistic regressions - just run it out of the box, you have it. 
[00:24:17] Sudip: And you are talking about MLlib here. [00:24:19] Reynold: Exactly. All of this made Spark much more powerful because one of the big pain points for a lot of users is having to configure a system with dependencies and app frameworks and extensions. Whereas in Spark, you install it, and now you have all of this power. [00:24:37] Sudip: And going back to the usability piece, I mean, I like how you put it, like, come for performance and stay for usability, right? I believe one of those driving factors was when you guys added SQL support. You, in particular, I believe, were behind the original Shark project and then the Spark SQL project. I'm just curious, at a high level, was there any particular technical challenge that you had to resolve to bring SQL to a distributed data processing system? [00:25:09] Reynold: Over the years we have done it, we've been redoing it. Actually, Shark refers to what I did during my PhD before Databricks even started. The way Shark was architected was that it basically took Hive, which is the SQL-on-Hadoop system. We took the physical plan generated by Hive and converted it into a Spark program. And then we were able to run SQL somewhere between 10 and 100 times faster than what Hive would be able to do back then. One of the main challenges there was really the Hive codebase, which was potentially the most spaghetti codebase I've ever seen. I'm hoping I'm not offending anybody here and it probably had nothing to do with the creator. I think it's just because it was created at Facebook for a specific set of use cases and then it suddenly got insanely popular. A lot of use cases and different requirements got piled onto it and people added a lot of code very quickly. It was so bad that when Databricks first started, in the first year one of our founding engineers actually quit after working on this codebase for a few months and that was the sole reason given. 
And up until this day when I talk to him he's like, yeah, honestly, that was the real reason. That codebase was so difficult to work with. So when Michael Armbrust came to Databricks in early 2014, initially he was actually hired to build a query optimizer for large-scale data. At some point I walked up to him and said, hey, you should kill this thing I created. This nine-headed monster is impossible to deal with and the code is too difficult to maintain. I think if we start from scratch and build it from scratch and have nothing to do with Hive, it would be substantially simpler. It would be much easier to actually evolve. And he did. And then that became Spark SQL which is honestly much easier to set up, much easier to maintain, much easier to iterate on. [00:26:55] Sudip: So you basically ended up killing your own creation. [00:26:57] Reynold: Yeah, a lot of people didn't expect that but I was jumping over it. Some people were like, are you happy that now it happened? I'm like, yeah, that's incredible that it happened. [00:27:05] Sudip: That's so funny. And then going back a few years, talking about the other thing that kind of led to adoption of Spark, I remember back in, I think 2015 or somewhere around that, you guys made a number of key performance improvements to Spark Core. Can you talk a little bit about what those improvements were and what were some of the major architectural changes you guys had done? [00:27:31] Reynold: So 2015-2016 was a big year in terms of improvement and that's when we released Spark 2.0. And the claim at the time was Spark 2.0 was an order of magnitude, 10 times faster than Spark 1. It might not be exactly 10 times for every workload, but it was dramatically faster. And a lot of it had to do, first of all, with the DataFrame API. We talked about DataFrame API that gives an even higher level abstraction. Now we convey semantic information so we can optimize under the hood. And we fully leveraged that in Spark 2.0. 
So Spark 1 released the DataFrame API, raised the level of abstraction, and enabled us to do the optimization in Spark 2.0. And so those optimizations basically boiled down to two big things. One is we hyper-optimized for Parquet and actually created the first vectorized Parquet reader. Parquet itself is a columnar format, but Parquet-MR at the time was basically implemented row-wise. It was not fully extracting the performance out of a columnar format. So we implemented vectorized Parquet and that got a massive speed-up just in terms of read performance. And then for the entire query execution engine, we took an idea that existed only in academic papers and academic systems, called whole-stage code generation, and put that into Spark itself. The idea is we would take a query plan and generate the actual code that's needed to execute that query. As if you're building a query engine purpose-built just for executing that one query. And the reason that would make it much more efficient is because a generic system has a lot of overhead because it has to be generic. For example, it has a lot of function calls because query engines in particular have the concept of an iterator model in which you chain a bunch of iterators and each sub-operation is an iterator and every iterator call, when you say next, it generates a virtual function call which is very slow. Then the compiler is not great at optimizing it because those are complex programs. But if you, for example, have a simple program that does the aggregation. If you were to purpose-build a program to do that, you would just write a for loop from the beginning of the data to the end of the data and then sum something up into a local variable and then return the local variable. It would be a three-line program. Compilers would be amazing at optimizing that. 
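The contrast between the generic iterator model and a purpose-built loop can be sketched in a few lines of Python. This is only an analogy under stated assumptions: the operator classes `Scan`, `Filter`, and `Sum` and the helper `generate_fused_sum` are hypothetical names, and Python's `exec` stands in for the step where Spark generates code and hands it to the JVM to compile.

```python
from typing import Callable, Iterator, List


class Scan:
    """Leaf operator: yields rows one at a time."""
    def __init__(self, rows: List[int]) -> None:
        self.rows = rows

    def __iter__(self) -> Iterator[int]:
        yield from self.rows


class Filter:
    """Generic operator: pulls from its child and applies a predicate."""
    def __init__(self, child, pred: Callable[[int], bool]) -> None:
        self.child = child
        self.pred = pred

    def __iter__(self) -> Iterator[int]:
        for row in self.child:
            if self.pred(row):
                yield row


class Sum:
    """Generic aggregation over whatever its child produces."""
    def __init__(self, child) -> None:
        self.child = child

    def execute(self) -> int:
        total = 0
        for row in self.child:  # one indirect "next" call per row per operator
            total += row
        return total


def generate_fused_sum(threshold: int) -> Callable[[List[int]], int]:
    """Emit and compile source code specialized for one query:
    SELECT SUM(x) WHERE x > threshold."""
    src = (
        "def fused(rows):\n"
        "    total = 0\n"
        "    for x in rows:\n"
        f"        if x > {threshold}:\n"
        "            total += x\n"
        "    return total\n"
    )
    namespace: dict = {}
    exec(src, namespace)  # stand-in for compiling the generated code
    return namespace["fused"]


data = list(range(10))
generic = Sum(Filter(Scan(data), lambda x: x > 5)).execute()
fused = generate_fused_sum(5)(data)
```

Both paths compute the same answer; the generated function is the "for loop with a local variable" from the discussion, with no per-row calls across operator boundaries.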
So we basically treated Spark itself as a compiler that compiles SQL queries into actual purpose-built code for executing those SQL queries. And SQL here doesn't just apply to SQL. A big, key idea in DataFrame is that DataFrame is not very different from SQL and SQL is not very different from DataFrame. They all generate similar query plans under the hood and once you generate that purpose-built code for that query, you compile it with the JVM. The JVM now is much better at optimizing that. So basically with vectorization and CodeGen combined, we actually got a dramatic speed-up. By the way, we didn't invent either of these. In academia, vectorization always existed but it was never built for an open-source SQL system. This kind of CodeGen was pioneered by Thomas Neumann at TU Munich and they had an academic system called HyPer which was eventually commercialized and got bought by Tableau. So we took that idea and really brought it to wide adoption. Probably more people benefited from Thomas Neumann's idea through that work than through the HyPer system itself. [00:31:13] Sudip: I want to talk a little bit about streaming. There was Spark Streaming and I believe now you guys have Structured Streaming, right? I'd love to understand a little bit what is the difference between those two and then one other related question is, originally the way I know streaming was implemented in Spark was through micro-batches, right? Is it still the case and how is Structured Streaming different from the original idea? [00:31:37] Reynold: Incidentally, the relationship between Structured Streaming and Spark Streaming is very similar to DataFrame and RDDs. It's all about raising the level of abstraction. Spark Streaming was a fairly innovative thing because it basically came with this insight which is that, if you just run a batch that is small enough and fast enough, at some point you approximate streaming. 
To the extreme, you run a batch for every row which is basically one row at-a-time processing. And that was pretty popular because it introduced a whole new class of workloads that previously people thought would be very, very difficult and require super specialized systems. Just with a lot of IoT devices, sensor data, message buses becoming more popular as Spark was growing, streaming workloads started growing too. There was a very important problem with Spark Streaming which was actually not about micro-batches. I personally think the big micro-batch versus true streaming debate is overblown. The biggest problem with Spark Streaming was that the window of streaming is completely tied to the physical batch size. So the physical property of its execution is leaked into the programming abstraction. The most common operation in streaming is windowing. If you want to window by, for example, a bunch of records, in Spark Streaming, the design is that you would actually run that batch as a window. And this really limits a lot of optimization opportunities and also limits the programmability. It also limits another very important thing, which is, if you have late-arriving data, now you can't deal with it because you're already done with that batch. Your data shows up later and cannot be considered a part of that window anymore. So the time and everything has to be physically tied to the way it was executed. So after we did DataFrames, we started thinking about how to improve all those issues of streaming. One very obvious one was, many people want the concept of time that's logical instead of physical. Physical meaning whenever the event showed up to the system, that's the time it showed up. Logical means the event has some property called time, maybe a column or a field, and that is the actual time. So we thought that for streaming, let's completely decouple the execution from the API and just think about how you can program a streaming job. 
We looked at a lot of streaming jobs and realized the intent people express and the transformation people express, were not very different from a regular DataFrame. It's a program, except they want to run that in a continuous mode. And for any data that comes in, they want to be able to run that instead of just running once. So we came up, I think it was 2016 or 2017, I remember I was giving the announcement at Spark Summit and I said that, the easiest way to do streaming is that you don't have to think about streaming at all. So Structured Streaming basically introduced the concept of streaming DataFrames. There's no separate API for streaming. It is just a DataFrame. And the only difference is how you create a DataFrame. If you created a DataFrame using a streaming reader because it's coming from some message bus or even just a pile of files that might continuously arrive on object stores, you have a streaming DataFrame. And you can run the same operations just like your normal batch DataFrames. And everything else is the same. And that really made it much simpler to program streaming because now people don't have to learn a new paradigm. They don't have to think about, hey, what is windowing? Well, it turned out windowing is a group by some time. So that was very, very powerful. And that also led us to realize that a lot of people, when they do streaming, they're not even actually trying to do things in super real time. What they really, really want is they want the incremental-ness that happens in streaming as data flows in. Because virtually every data pipeline is continuous, they always have data coming in. They might come in once a day or once a week, but it's always new data coming in. It's very rare you have a data pipeline you run once and never worry about. Now, as data continues coming in, now you have to worry about, okay, so you don't want to reprocess all of your data all the time because that's highly inefficient. 
So people invented their own way of doing incremental processing. And it turned out the number one use case for Structured Streaming was people just using it for incremental processing because now they no longer have to worry about the state. The funniest thing is that they would build a streaming pipeline, they would run it, and it processed the data. The data's actually coming in, for example, once a week. So after half an hour, once it had finished processing their data, they’d actually shut it down manually. And then a week later, when there's new data coming in, they rerun their streaming pipeline. And because it's fault-tolerant and because it tracks all the incremental states, it just does this whole end-to-end incremental processing. Because of that, we introduced a concept called streaming once, which is literally that when you run the streaming pipeline it finishes processing all the data it currently sees, and then it shuts down. And the next time you want to do it, you relaunch it again. And that itself, actually, it probably generates hundreds of millions of workloads just on Databricks today. People are running streaming in a batch mode. [00:37:07] Sudip: Exactly. Yeah, that's what I was going to say. It makes it so much easier. But it's a unified programming interface. It's a unified understanding. And I'm sure it also made it easier on the engine side, right? Reynold: Exactly, the unification. Yeah, because we don't have to build so many different engines. Another question is, so does it still run in micro-batch? There are actually different modes today. There are certain things that are running in micro-batch because for example, obviously, the streaming once mode, it is just a giant batch job, except it does all the incrementalism. There are also continuous modes in which it actually processes data one record at a time, in which you can get from records coming in to records going out in some milliseconds. 
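The run-once pattern described above, where the pipeline processes whatever is new, records its progress, and stops, can be sketched in a few lines. This is a plain Python illustration and not how Spark implements it. The `run_once` function, the in-memory `checkpoint` dictionary, and the list used as a source are hypothetical stand-ins for a real pipeline, durable checkpoint storage, and a message bus or file directory.

```python
def run_once(source, checkpoint):
    """Process every record not seen before, save progress, then stop."""
    offset = checkpoint.get("offset", 0)
    new_records = source[offset:]
    # The "job" here is a running sum; only new records are touched.
    checkpoint["total"] = checkpoint.get("total", 0) + sum(new_records)
    checkpoint["offset"] = len(source)
    return len(new_records)


source = [1, 2, 3]
checkpoint = {}

first = run_once(source, checkpoint)    # first launch: sees three records

source.extend([10, 20])                 # a week later, new data has arrived
second = run_once(source, checkpoint)   # relaunch: only the two new records

third = run_once(source, checkpoint)    # relaunch with nothing new to do

print(first, second, third, checkpoint["total"])
```

Because progress lives in the checkpoint rather than in the running process, shutting the job down between runs loses nothing, which is what makes the weekly relaunch workflow safe.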
[00:37:45] Sudip: I want to shift gears just a little bit and go to like, you know, circa 2013 when you guys started Databricks, right? And then you had this really fast growing open source community, which was Spark, and then you had this very early company, Databricks, right? And one of the challenges a lot of open source creators have, particularly when they start a company, is how do you balance between the open source community and the commercialization effort? How did you guys manage to do it? Were there certain guiding principles that you guys had that really helped you? [00:38:21] Reynold: It's difficult. I think Ali, the CEO of Databricks, often joked that we should never start another company based on open source projects. The reason for that is you need two strikes, right? Normally to start a company and if you want the company to do well, you focus on the business problem of the company and with all the stars aligned and you're lucky and you work super hard and you have an amazing strategy, it works out. And that is a very difficult problem. If you add open source to it, now you have a two-step process. You have to have the open source project taking off and doing well, and that itself is not a trivial thing. It's probably a little bit easier than building proprietary software, but it's not a guarantee. Most open source projects don't work out. And then after that, you need to have an amazing business model and work towards a great strategy. And now the actual end-to-end success requires you to multiply the two probabilities, which makes it very low. That is probably the reason why there are very few companies that have heavy open source roots and have been successful. One thing we focus on a lot is we have different teams doing open source and they have different mandates. The open source teams' KPIs are about adoption metrics of the open source projects. We always open source all of the APIs. 
We don't want to lock customers in by having another API that makes it super difficult to migrate out if they ever need to. And we focus a lot on evangelism of the open source project. We try to walk a very careful line. For the longest time, the Spark Summit was called the Spark Summit and there was very little Databricks content in the first day keynote. It was all about open source. We saw this change as time went on and then as we also renamed the project. There was a lot of AI content so a lot of things have changed, but those are a lot of the things we did early on. It's about creating a cleaner delineation, both in terms of organizational structure, KPI tracking, events, and all that. But it's not an easy task. Actually, if we were to do it again, I'm not sure if we could do it. [00:40:21] Sudip: I'm laughing because that's coming from the founder of the most successful company in the open source ecosystem, right? [00:40:29] Reynold: I know a lot of people think of Databricks and say, hey, there's a great business to be made out of open source. I'm not sure. I mean, it's been doable, but it's not an easy job. [00:40:40] Sudip: When you guys started Databricks, I remember 2014-15 time frame, I used to go to some of the board meetings. I remember there was a pretty heavy debate about should Databricks stay all cloud, or should Databricks go and support an on-prem version of Spark? And there was a lot of customer pressure because Spark adoption was increasing. [00:41:03] Reynold: Everyone wanted to pay us - like ten million dollars for Spark. [00:41:05] Sudip: And there was a lot of competitive pressure. Cloudera was really going at the time, Hortonworks was around. But you guys had a very deep conviction that it is either Databricks cloud or nothing. Can you talk a little bit about that, like where that conviction came from? [00:41:23] Reynold: Yeah. 
I mean, in a kind of sense, we're really glad that we didn't go on-prem and become a support company. It ultimately comes down to the longer-term vision of where we see the puck is going and we try to go towards that rather than capturing what's right in front of us. And that's something easy to say because there are also scenarios in which people think too long term and they're dead before the long-term future and vision could even manifest. But I think one of the reasons that really got us going there was it actually had to do with Berkeley also. So at Berkeley in 2009-2010, there was this very famous paper called the Berkeley view of the cloud. [00:42:05] Sudip: Yeah, above the cloud, right? [00:42:07] Reynold: The view of the cloud. There are variants of the paper. The initial title and then later it's just a view of the cloud. But that paper is probably the most cited technical paper on cloud computing. It was so popular that many business schools incorporated that in various of their classes. Like my wife, for example, went to business school and read that paper. Not because I showed her the paper, but it was actually part of the class. And that paper predicted that the vast majority of the compute and computing infrastructure will move to the cloud and it will be finally true, this computing as a utility. Very few companies and houses have their own generators. They are just using electricity from the grid, right? It's something that's reliable enough, something that's viewed as virtually infinite and the economics simply makes sense. There are niche use cases, those will never go away, but the vast majority don't think about it as much. And that paper predicted that future by analyzing not just the technical foundations, but also what that type of allocations would be possible and the economics and the accounting and all that that have pretty profound influence on, I would say, the way we think about all this at Databricks. 
And I think some, maybe not everybody, but a couple of Databricks founders were also involved in that paper as well. So we always thought, hey, we really wanted the ability to be able to release software super quickly. We always wanted the ability to be able to provision and get a POC going in a matter of days instead of a matter of months, because otherwise you have to go procure the hardware. We always thought a lot of the complexity of software, especially infrastructure software, is in the operations of it, not just in the building of it. And all of those are enormous values we can create, but we can only do that if it's in the cloud. And we were so early on at the time that we just viewed, and even today we kind of view, most software in our stack as pretty broken. We need to continuously improve it. Velocity is key. And having simplified environments, not having to worry about the 20 different variants of Linux, 50 variants of IP wireless thing and firewalls and all that would be enormously beneficial. So that's kind of what got us started. [00:44:27] Sudip: And clearly, it paid off so well for you guys. [00:44:31] Reynold: But it is difficult, because every time we hire an exec, it's like, hey, I have a great idea to increase our revenue by 10x. [00:44:39] Sudip: Yeah, go on-prem. Before I switch to lightning round, one final question about Spark. What's next for Spark? What's in the future? [00:44:49] Reynold: I think the API is actually pretty good. We've been doing a lot of incremental refinement. For example, one of the biggest complaints about Spark is when you use Python, the error messages are simply nonsense because those include JVM stack traces and all that. And we actually spent a lot of time improving those to the point that you could probably still see it, but most of the time, you don't even feel there's a JVM that's running the Python program. 
So there are all these sharp edges we're trying to remove, which ultimately, they're not big fundamental ideas, but they're really the ones that create friction that gets in the way of everyday users. So a lot of work goes into that. With a lot of the GenAI use cases, it's 2023, everybody has to talk about GenAI, we have noticed a lot of the Spark programs that were generated by ChatGPT and all the other things. I bet there will be more Spark programs generated by machines than by humans in the next year. And most of them don't have good practices because many of them were generated and learned on a giant corpus on Stack Overflow and whatever is on Reddit, mostly based on malpractices from the past. Some were written before certain new things were introduced and were never updated. So we're working on this thing called the English SDK, which basically, if you think of it, it's really just how do we teach ChatGPT with the right prompting so it generates best practice Spark as opposed to generating malpractice Spark that now some other human experts come in and try to fix. Things like this would make Spark's adoption go far wider and really benefit a lot of users that previously just felt Spark was too daunting. It can be somebody who is reasonably technical but they use Spark and they generate a Spark program in ChatGPT, they realize: oh it crashes for this data. But it turned out with a better chatbot, it will actually generate things that don't crash. We want to get to that point. I think another very big opportunity for Spark is that I fundamentally believe the biggest innovation of Spark is not performance, it's not just the use cases, but rather for something that's very old in data, which is data engineering. In the old school days, they used ETL tools like Informatica and all of that. And later, there's a little bit of Modern Data Stack that got super popular the last couple of years and people write SQL queries again. 
There's something fundamentally wrong with SQL for data engineering. It's a very hyperbolic statement, but what's fundamentally wrong with SQL? SQL was not designed with engineering practices in mind. You cannot easily test a SQL query. There's no abstractions in SQL. You could use recursive common table expressions, but again, it's just a pile of text. There's no variables, there's no for loops, there's no functions, there's no classes, there's no CI/CD framework. The key word is engineering. Engineering requires a sense of rigor, and rigor is backed by fundamental engineering principles. And what are the fundamental engineering principles? They are abstractions. I'm talking about software engineering here, right? They're abstractions. They are testing. They are CI/CD. They are how you roll out. And SQL is just terrible for all of those. I love SQL. I did my PhD in databases. I spent a decade optimizing, figuring out how to build systems to run SQL better. But we actually have a solution in front of us. It's real programming languages that can do everything SQL does, which is actually the Python DataFrame API. If you compose exactly the same SQL program in the Python DataFrame API, it looks almost the same in readability. You can now actually test it using just vanilla Python code. You could decompose your program because it doesn't have to be a pile of text. You can have multiple files backing them. Each file is a Python file. You can have classes. You can have functions. All of these are great tools. By the way, it's just off the shelf and available because it's Python. People build far more complicated Python programs than the most complicated data engineering programs. So certainly you could use that. We haven't done a good enough job to explain to the world what's the value of Python. Python is now very popular in data science and machine learning. But to the SQL folks, many of them don’t even know Python. 
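The engineering argument, that a pipeline written in a general-purpose language can be split into named functions and tested with ordinary code, can be shown with a small sketch. It uses plain Python lists of dictionaries as a stand-in for DataFrames, and the function names and the order records are invented for illustration. Each step is roughly what a `WHERE` clause and a `GROUP BY` with `SUM` would express in SQL.

```python
def only_completed(orders):
    """Roughly: WHERE status = 'completed'."""
    return [order for order in orders if order["status"] == "completed"]


def revenue_by_customer(orders):
    """Roughly: SELECT customer, SUM(amount) ... GROUP BY customer."""
    totals = {}
    for order in orders:
        totals[order["customer"]] = totals.get(order["customer"], 0) + order["amount"]
    return totals


def pipeline(orders):
    """Compose the steps; each one can be reused and tested on its own."""
    return revenue_by_customer(only_completed(orders))


# A tiny hand-written fixture doubles as a unit test input.
sample_orders = [
    {"customer": "a", "status": "completed", "amount": 10},
    {"customer": "a", "status": "cancelled", "amount": 5},
    {"customer": "b", "status": "completed", "amount": 7},
    {"customer": "a", "status": "completed", "amount": 3},
]

result = pipeline(sample_orders)
print(result)
```

Because each step is an ordinary function, it can live in its own file, be imported elsewhere, and be checked in a CI run with a standard test runner.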
We haven't really done a good enough job in educating them. I think that would be one of the most important things that Spark can get right, to tell the world, hey, here's how you can do data engineering. And by the way, it's not that hard. It's just vanilla Python. And the DataFrame API is just the equivalent way of writing SQL. As a matter of fact, everything you're familiar with in SQL translates here, except now you have all the toolkits to do serious engineering. [00:49:39] Sudip: That's fascinating. So, Reynold, we end every one of our episodes with three quick questions. We call it the lightning round, starting with the first one being acceleration. So in your view, what has already happened in Big Data that you thought would take much, much longer? [00:49:57] Reynold: Yeah, the death of Hadoop. I would have thought it would take a lot longer for Hadoop to die. Enterprise software never really goes away, but Hadoop today is largely irrelevant. It happened faster than I thought. [00:50:09] Sudip: You definitely had something to do with it, didn't you? [00:50:11] Reynold: Yes, we had a part in that, but if you asked me 10 years ago, I would tell you by 2030, maybe you would see a rapid decline. [00:50:21] Sudip: Then the second question around exploration, what do you think is the most interesting unsolved question in your space, you know, largely Big Data processing? [00:50:31] Reynold: There are many. I'll just pick one. I think how do you combine unstructured data and structured data. Especially with GenAI now, there's great ways of processing unstructured data and analyzing unstructured data, but then how do you combine them is unclear. I think there's a lot of value that can be generated currently. [00:50:48] Sudip: Final question, what's one message you want everyone listening today to remember? [00:50:55] Reynold: I mean, to maybe the builders - the open source framework builders would be - put the user first, think from their shoes. 
Think about simplicity to the users, I would say. The last thing I said, Python, data engineering, you want to use a real programming language for data engineering to bring the engineering rigor into data. [00:51:15] Sudip: Reynold, thank you so much for sharing your insights. This was a real pleasure and frankly a privilege to have you on. Thank you. [00:51:23] Reynold: Thanks a lot for the invitation, Sudip. [00:51:25] Sudip: This has been the Engineers of Scale podcast. I hope you all had as much fun as I did. Make sure you subscribe to stay up to date on all our upcoming episodes and content. I am Sudip Chakrabarti and I look forward to our next conversation. [00:51:41] This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit sudipchakrabarti.substack.com [https://sudipchakrabarti.substack.com?utm_medium=podcast&utm_campaign=CTA_1]

13 Dec 2023 - 51 min
