Billede af showet Engineers of Scale

Engineers of Scale

Podcast af Sudip Chakrabarti, Partner at Decibel.vc

engelsk

Nyheder & politik

Begrænset tilbud

2 måneder kun 19 kr.

Derefter 99 kr. / månedOpsig når som helst.

  • 20 lydbogstimer pr. måned
  • Podcasts kun på Podimo
  • Gratis podcasts
Kom i gang

Læs mere Engineers of Scale

Hello everyone, welcome to the Engineers of Scale podcast. In this podcast, we go back in time and give you an insider’s view on the projects that have completely transformed the infrastructure industry. Most importantly, we celebrate the heroes who created and led those projects. sudipchakrabarti.substack.com

Alle episoder

5 episoder

episode Data Engineering: The Past, Present and Future with Joseph Hellerstein cover

Data Engineering: The Past, Present and Future with Joseph Hellerstein

In our Engineers of Scale podcast, we relive and celebrate the pivotal projects in Enterprise IT Software that have changed the course of the industry. We interview the engineering “heroes” who had led those projects to tell us the insider’s story. For each such project, we go back in time and do a deep-dive into the project - historical context, technical breakthroughs, team, successes and learnings - to help the next generation of engineers learn from those transformational projects. We kicked off our first “season” with the topic of Data Engineering, covering the projects that defined and shaped the data infrastructure industry. In our previous episodes, we have hosted Doug Cutting [https://www.linkedin.com/in/cutting/] and Mike Cafarella [https://www.linkedin.com/in/mikecafarella/] for a fascinating look back [https://sudipchakrabarti.substack.com/p/when-hadoop-was-king-and-yahoo-was] on Hadoop, Reynold Xin [https://www.linkedin.com/in/rxin/], co-creator of Apache Spark [https://en.wikipedia.org/wiki/Apache_Spark] and co-founder of Databricks [https://www.databricks.com/] for a technical deep-dive [https://sudipchakrabarti.substack.com/p/from-spark-to-databricks-sparks-origins] into Spark, Ryan Blue [https://www.linkedin.com/in/rdblue?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAAAzlA6sBKpw5AAsa7SgDV425Ay1w6My0b4U&lipi=urn%3Ali%3Apage%3Ad_flagship3_search_srp_all%3BohWsnzB2Rw2ipZdP53XWlg%3D%3D], creator of Apache Iceberg [https://github.com/apache/iceberg] on the technical breakthroughs [https://sudipchakrabarti.substack.com/p/iceberg-the-open-table-format-for] that made Iceberg possible, and Stephan Ewen [https://www.linkedin.com/in/stephanewen/?originalSubdomain=de], creator of Apache Flink [https://en.wikipedia.org/wiki/Apache_Flink]. In this episode, we host Joseph Hellerstein [https://www.linkedin.com/in/joehellerstein/], Professor at UC Berkeley and founder of Trifacta and RunLLM [https://runllm.com/]. Joe helps us step back and explore the evolution of Data Engineering over the past several decades while also discussing the future innovations on the horizon. Show Notes Timestamps * [00:00:01] Introduction and Joe’s background * [00:01:38] What got Joe interested in data engineering * [00:03:59] Defining data engineering and its key components * [00:05:16] Significant trends and changes fueling data engineering over the last 20 years * [00:06:30] Key components of data engineering and the role of each * [00:08:07] Contrasting modern data stack with traditional data stack * [00:12:10] Developers vs. data engineers/analysts in building data pipelines * [00:14:12] Role of AI and LLMs in data preparation and cleaning * [00:16:51] Journey from data warehouses to data lakes to data lakehouses * [00:21:14] Role of data catalogs in data engineering going forward * [00:32:57] Unified data platforms vs. best-of-breed tools for data engineering * [00:37:03] Possibility of one system serving both OLTP and OLAP use cases * [00:40:46] Impact of AI on the data stack and data engineering * [00:43:23] Interesting future research directions in data engineering * [00:46:02] Lightning round: Acceleration, unsolved questions, key message Transcript Sudip [00:00:01]: Welcome to the Engineers of Scale podcast. I am Sudip Chakrabarti, your host. In this podcast, we interview legendary engineers who have created infrastructure software that have completely transformed and shaped the industry. We celebrate those heroes and their work and give you an insider's view of the technical breakthroughs and learnings they have had building such pioneering products. So today I have the great pleasure of hosting Joe Hellerstein, professor at UC Berkeley and founder of Trifecta and Aqueduct. Joe, welcome to the Engineers of Scale podcast. Joe [00:00:36]: Thanks, it's fun to be here. Sudip [00:00:37]: Thank you so much. So I'm not going to walk our listeners through an overview of your background because you truly need no introduction. When it comes to innovation in the field of data engineering, there are really very few people who even come close to what you have done. I will, however, mention a fun fact that I learned recently, even though I think I've known you for many years, and that is your interest in music. Not only are you a musician on the side, you actually had even minored in music during your PhD at Wisconsin. So how do you balance all of your research and startup work with your interest in music? Joe [00:01:12]: Well, I'm a believer that you should have a rich life and that people who spend 24-7 on computing are spending a little too much time maybe. So I enjoy computing and data engineering and all that good stuff, and I love to geek out about those things, but it's one of a bunch of things I value in life, family, hobbies, and so on. And I'm sure a lot of your listeners are the same. And anyone who tells you that you have to do something 24-7 to Excel, I think is telling you a lie. Sudip [00:01:38]: So then what got you interested in data engineering as a quality of research in the first place? Joe [00:01:44]: Yeah, well, my background, going back to my training after college, was in database systems. And my first job right out of college was at IBM Research, which was the founding lab out in San Jose that built the first relational databases. And a bunch of those people were still there. So I was really brought up by some of the founders of the field of database systems. After that, I went to Berkeley and then Wisconsin for my schooling, which was more of the community of the folks who really pioneered the database system space. So I'm an old hand, even though I'm not as old as most of those people by about a generation, I still feel like I'm an old database hand. I come from that lineage. And what's happened over my career since the, you know, I got my PhD in the mid-90s, is that the process of managing data and the computation that goes around it has become more and more central to all of computing and the way it projects on the real world. So your listeners know better than any, probably, that really, we shouldn't talk about computer science. We should talk about data science, data engineering, because without data, computing is kind of meaningless. And this is a truth that emerged, you know, in the last 20 years, really. But it was one that the database people were working on well before that. And I feel kind of blessed to have been born into that community because the relevance of data engineering to all things computation and therefore much of society is so apparent today. Sudip [00:03:01]: I would say that both the schools that you have involved with, Wisconsin, and of course, UC Berkeley now, I think, have had tremendous quality of work coming out in database and data systems and data engineering together. So it just, you know, has had such a big history. It's an awesome tradition. Joe [00:03:19]: I mean, when I was coming up, there were three places to do real work in data systems. It was IBM, Berkeley, and Wisconsin. And I had the fortune, and to some degree, I took measures to interact with all those people when I was very young, straight out of college. And they were really the center of all activity because a lot of academic computer science at that time didn't get it. MIT, when I interviewed in the mid-90s, hadn't had a person on the faculty doing databases for over a decade. And it was very clear when I got there that they did not think it was an intellectual activity. They thought it was something that businesses do. And that's all changed radically in the course of my academic career. We're now data-driven computing is all of computing, really. Sudip [00:03:59]: Taking a step back, if you were to describe to someone what data engineering is, and I know you teach a very popular course on that at Berkeley, too, how do you describe it? Joe [00:04:12]: Yeah, it's been tricky, actually, because I think we're in a time of transition. And so you have to talk about things relative to where things are right now. So the way I talk about it with people is they understood that there was a shift from traditional computer science to what was being called data science over the last, say, decade, where clearly data had to be at the center of things, or at least some things. But what happened in the data science programs that evolved is they were largely developed as sort of statistics meets algorithms. And that left out all the engineering concerns of how do you build systems around these foundations, particularly systems that drive large amounts of data? Because the statisticians traditionally didn't use large amounts of data. Incredible what's achievable with statistics with very small amounts of data, of course. And so that's what I talk about. It's like, well, how do we systematize and build engineered artifacts that reflect the importance of data and computation together? Sudip [00:05:02]: And looking back in the last 20 years or so, since you started working at Berkeley and obviously started your two companies, are there certain significant trends or changes that have really fueled data engineering? Joe [00:05:16]: Yeah, I mean, you know, there's a long enough scope that you have to include the existence of the World Wide Web as one of them. So, you know, you go back to the 90s and data was all about relational databases because that was the data that was entered into computers. And the web changed all that. Now there's all sorts of nonsense that you could harvest. And I remember joking in the early aughts, maybe late 1990s, my buddy was saying, my gas station just got on the internet. Goodness knows why I would ever want to run a query against my gas station, right? But nowadays we realize that all that sort of light recording of ambient stuff and people's thoughts and ideas and conversations is highly valuable. That was not at all clear when the web started out. You know, web search is like, well, I might want to find some stuff. Most of it's irrelevant to me, but I want to find a few things, right? That was web search. But what you see, you know, if LLMs are a compression of the web, what we're seeing today is having a compressed version of everything anybody's ever said is outrageously powerful, even if the technology is pretty simple. So the rise of kind of ambient human information, something I did not anticipate whatsoever. Sudip [00:06:21]: Got it. Today, as we know data engineering, what would you say are the key components of data engineering and kind of what's the role of each? Joe [00:06:30]: You know, we often talk about pipelines, right? And I think it's not a bad way to think is to kind of start down the pipeline, look at what feeds what. Where does the data get generated? How does it get ingested? What data didn't I measure at all? Actually, we start there. The statisticians always start there. There's a universe. Things are happening. Some of it gets measured. That's called sampling. That's the start of any pipeline is there's phenomena out there that we could record. We choose to record some of them. And then, of course, there's the pipelines we think of from ingest to processing to feedback loops that happen, right? When you're learning from outputs and how people react to them. So thinking about the long-term pipelines all the way from what did we measure to how did people react to it in apps? And then we measured the apps and we closed the loop. That's modern data engineering. Unfortunately, it's too big for any one organization to own, right? You go into any company and there isn't one org inside the company that owns that whole thing. And it's definitely too big for any one person's head. And so the other reality of data engineering, like a lot of real-world engineering, is teamwork. And it's cross-disciplinary. And it's a lot about people. Sudip [00:07:36]: Absolutely. Yeah, people, culture, team, all of that plays into the efficiency of a data engineering team, 100%. I think over the last few years, particularly with the last decade or so, we kind of transitioned from what used to be called traditional data stack, like much more on-prem, much more built around old generation technologies. And now we keep hearing about modern data stack, much more cloud-native and so on. In your view, what is modern data stack? What are the components? How do you contrast that with traditional data stack? Joe [00:08:07]: I have some opinions about the branding of all this. So modern data stack is a brand that was basically promoted by a couple of startups that were venture-backed. They tried to ascribe a particular meaning to that. And of course, the word modern is kind of hilarious because to me, it's kind of Mad Men 1950s modern furniture, right? But it is true that we do things differently today than we did in the 80s and 90s. So let me talk about it in those terms rather than trying to say it's two particular companies that tried to brand the modern data stack, okay? Because there's also by now enough blowback with that terminology that I just don't even want to walk down that path. Sure. Having said all that, when I was coming up, it was the beginning of the data warehouse movement. And that itself was a reaction to just having a database. So once upon a time, you had data, you put it in a database and that was it. And IBM would sell you a mainframe, right? You'd run the database. And then what happened was there were lots of databases and they were all over the place. And so there were many of these operational databases. The data warehousing movement came along and said, people want to see the big picture across all these databases. So we're going to have this ETL process, right? We're going to load, extract from all the operational stores, transform the data into a common format, load the warehouse. And that was a story that made many consultants very rich. And it also opened opportunities for some software vendors, Informatica, Teradata, right, that were tuned for that workload. And that was the status quo for about a decade. And then what happened was the world became too complicated for that as well. So the fiction that there'd be a single data warehouse that would really cover the business was a fiction all along, but it became a painfully untrue fiction sort of post the web, really. We had lots of data that really didn't want to go in the relational database at all. It wanted to be text search or it wanted to be image files in a file system, right? And then we saw the cracks emerge there in the data warehousing relational database movement. We heard NoSQL and there was a bunch of things like that. Where we are today is, I think, is kind of the end of that road. And in the end of that road, we have a majority of data that is not rows and columns when it's born, that is highly valuable, that needs to be managed. So you can't just throw it in the file system because it's too important and people need to be able to version it, know where it came from, know its provenance, know how it's getting used and do governance on it, all the things that are managing the data. So the things that used to be easy in the database are really hard on this messy data, all this stuff about governance and organization. It's very disparate and it's all over. So what's inevitable in this setting is that we're going to have to kind of stitch together more stuff even than before. And that's where we get into kind of the state of things today. And there's names people like to use for it, like Lakehouse and so on. But that's a fine name actually, doesn't mean a whole lot, but neither did Warehouse so that's okay. But the bottom line is there's going to be rich data of many facets and there's going to be many uses for that data. So big fan in of lots of data sources from lots of places, big fan out of lots of use cases. We'd like management in the middle of that hourglass, but we're not going to be able to assume that the data is structured for that management. The data is going to retain its probably original format and then extracts of various kinds kind of go out, right? So that's kind of where we are. And I think, you know, if it was a data filtration system or a data hourglass or something that might be more helpful than a lakehouse, which to me is a cabin on the side of a lake. I don't know if that's a useful analogy. It's cute neologism. It is this kind of problem that we're trying to manage. Sudip [00:11:24]: So I think, you know, it's like from data warehouses, which are primarily ETL. And when it came to data ingestion, we are now more in the ELT world, right? Which is data lakes. And then also, like there is a small movement around capital E, small T, and then LT, which is like you extract, do some transform, load, and then transform again. So I just wanted to talk a little bit about the users and the builders of this data pipelines. And there are, you know, two primary kinds of personals one are developers and the other ones are data engineers. In your view, do you see like developers, you know, kind of taking on more and more of this complex world of building data pipelines using it? Or do you see more like analysts and domain experts, you know, do things on their own using local tools, LLMs and so on? Joe [00:12:10]: I'm going to take a slightly different slice on it than you started with. Not because I quibble with it, but I want to make sure I'm kind of on the grounds that I want to talk about. So I think we can kind of split the world into people whose primary focus is computational. So think of data engineers, data scientists, IT professionals. Okay? That's not a cool name. So we try not to use that anymore. And then people whose job is fundamentally not about data or computing. Their job is something else. Like people in the line of business, people in science, people in what have you, journalists who use data. And you can think of them as consumers, if you will, whereas you think of the first body of people as maybe maintainers or something like that. They're the people who plumb the data. I think both constituencies are really important. I think they're both big markets for venture and for startups and for people who have technical skills to work at. But I do think Silicon Valley and academia both because academia and computing is computer scientists like me. We are very much more comfortable with the technical folks, with the developers. That's home base for us. And it's really fun to write software that we can use ourselves to make ourselves more productive. And so we do a lot of this dogfooding. Open source is another big kind of cultural contribution. You want to build a project to get your friends to use it. They like it. But it's, you know, we're kind of taking care of ourselves. There's this huge constituency of people who are going to get huge value out of data and they're the ones who understand how it's getting used. So they're actually people where like the cha-ching happens. These are the people who are going to monetize the data. In the sciences, it wouldn't be money. It would be scientific innovation. In journalism, it would be the big story. Sudip [00:13:41]: They extract value from the data. Joe [00:13:44]: Exactly. Yeah. And those people arguably are more important on my humble base, I try to admit that. So I think it is kind of two things. And I think we can ask a useful question, which is, will one set of technologies suit both of those constituencies? Maybe. I think that's a good conversation to do next. But I do feel like setting up that there's kind of broadly these two camps because two is easier to think about than 20. So let's not get too fine grained. How should we think about building and architecting data systems? And when I say systems, I do mean in the aggregate, like organizational systems to serve both those constituencies. And what kind of software do they need? I think it's a good, healthy frame for kind of the bigger picture. Sudip [00:14:26]: And do you see any convergence in terms of the kind of systems both constituencies might use? And does everyone become a Python programmer in some ways? Joe [00:14:36]: I'm so glad you brought that up. There's been a strong movement and I see this in the data science program at Berkeley as well. And we were a pioneering program in having a data science major. If you would just learn some Python, then you could do data science. And it just feels very backward. And I don't think that we should be expecting that all of the people who extract value from data will be programmers. Thinking programmatically, so having exposure to the notion of step-by-step problem breakdowns, instructing a computer to do things, all important, right? Do you have to learn a traditional programming language like Python to do that? I think not. I've felt this way for a long time, but I feel like LLMs are a really wonderful way to surface this to the general populace. If you are sufficiently disciplined step-by-step in telling the LLM what you want, it will probably understand you and probably help you get your job done, even if you're not really a programmer. And there's going to be other technologies, obviously, over time that'll be better than just a chat box on the internet for doing this, both user interfaces and models and programs. So yeah, I do feel that this idea that, oh, everybody will do everything in Python is super dev-centric. And that people like me and probably you, I don't want to put too much on you, and your listeners, that's what we're good at. But we have to have some empathy for the people who are the value extractors. And when I say empathy, I don't mean treating them like children. I mean giving them tools that make them super powerful. Right? Sudip [00:16:01]: Yeah, I think in particular over the last five to 10 years in general, the venture ecosystem which I can speak for has definitely probably over-rotated a bit more on the developer-centric experience of data engineering even at the cost of ignoring the actual constituents of users down the line like you were saying. I would kind of shift gears a little bit more on a core component of data engineering that you have done some incredible work on which is data preparation and cleaning. To this day, it's still like it continues to consume major time and resources. And you not only did a lot of work including your data wrangler paper and so on, but also went on to found one of the pioneering companies in this area, Trifacta. Looking back where we are now, do you feel we have managed to solve that problem of data preparation and cleaning yet? Joe [00:16:51]: Yeah, somebody pointed out that data transformation is kind of like cancer research. It's like a lifetime employment guarantee because you're going to help, right? And you may do brilliant, amazing things that help people's lives, but you're probably not going to solve it if you look forward with any sense of arrogance. There's lots to be done still, but there have been some good things that have been done that we can build on. Sudip [00:17:18]: And do you see like AI and LLMs kind of changing the game a fair amount going forward? Like any fundamental shift you see there coming? Joe [00:17:27]: This is great. So yes and no. And let me see if I can frame this up. When we started Trifacta, it was 2012. And the hypothesis in the research was that you want to build a feedback loop between the human and the computer. And the way it would work is this. The human would somehow guide the computer to, I want my data to look like this. And then the computer would say, well, here's a few things I could do to make your data more like this. What do you think about each of them? Pick the one you like best. I call this the guide-decide loop. So there's a human in the middle that's guiding the computer, and then the computer is making suggestions and the human's deciding which ones to use. And this was with a user interface that was visual. So I worked very closely with data visualization leaders like Jeff Hare, who's one of the fathers of D3.js, was a co-founder at Trifacta and a joint student to make sure that that visualization loop and the interaction part, the human part of that, were really powerful. So you could really see anomalies in the data, you could see examples of the data, and then you could interact by doing things like pointing at a bar chart and saying, this bar looks funny, or pointing at a cell in a spreadsheet and saying, what are the features that you give to the inference engine, to give to the algorithm to come up with suggestions on how to address those features? So a lot of our energy went in there. The AI that we advertised behind it was dead simple. It got a little better over the course of the company, over 10 years, our models got more sophisticated, but the user experience only got marginally better because the key issue was that interaction model. We built an interaction model around cleaning data, wrangling data. Now, we sold Trifacta some months before the launch of ChatGPT. And part of the deal was that I didn't go with the software to the acquirers. Sudip [00:19:13]: And this was Alteryx, the public company. Joe [00:19:15]: Alteryx, yes. So Alteryx acquired Trifacta. I will say that Google Cloud Dataprep is still the Trifacta product. It even says Trifacta on it. They haven't changed the branding. So both Google and Alteryx are using the tech directly. Better inference will make that product better, but it will not fundamentally change the hypothesis that started around this guide-to-side loop and this idea that you have to give people the opportunity to decide if the outputs of their prompts are right. You know, that's the whole thing with ChatGPT. You ask it a question, it gives you an answer. It's the same story. So if you're asking it for code, there's something I'll speak to the developers in the audience. You know, please write me code that will pivot this table and remove all the blanks. Okay, it'll spit out some Python code. How do you know it's the right Python code? Well, maybe you should run it on some sample data and see how it looks. Could you build a tool that would allow that iteration to go faster? You know, don't fill the blanks with zero. Fill the blanks by doing linear interpolation, right? Something like that. So you need this feedback loop and users need to be able to see or evaluate whether the suggestions that they're getting from the AI are right. That piece of the puzzle, the turn of the crank through the user is a big piece of it. So I guess in sum, user experience and a deep understanding of what you do when you're wrangling data so that you surface it to people and so that they can say to the computer what they mean, those are independent of how good the AI is. Similarly, the AI being really good doesn't remove the need for those experiences. So I would love to be working on this problem right now because you plug in GPT-4 into this. This is going to be way better than the inference we got. I think the qualitative user experience will be better in some cases by a lot and better in other cases only by a little. I mean, that depends on the task. But it is fun times, no doubt, for this technology. Sudip [00:20:58]: So some of the technical stack probably you could use some of the off-the-shelf models and so on. But what you're saying is the secret sauce around user experience, the real naughty problem is there. And that hasn't gone away or hasn't gotten any simpler. Joe [00:21:14]: Yeah, and I would say it's the same and your audience probably has hands-on experience by now with things like Copilot. If you embed Copilot in the IDE in a nice way as they've done with Visual Studio, it really helps the programmer quite a lot. But if you don't embed it well, for instance, if you ask it to write you an entire program instead of the next line it's going to do a rotten job because now you have to read that entire program, figure it out, etc. So this thoughtful combination of understand your domain, which in my case is data, understand what the technology can do pretty well, and then build the right feedback loop around that, that's going to be the game for a lot of products over the next few years around LLMs. And that is so true because we talk a lot about what is the mode for some of the companies that are using, obviously, LLMs and AIs and user experience comes up so frequently. If you get it right, it's, after all, a probabilistic way of thinking of the world, right? So if you do not get it right, it doesn't work. Sudip [00:21:53]: Couldn't agree more. I want to touch on a different topic, which is, you know, as I was doing my research on some of your work, I found this paper way back from 2005 that you wrote with Michael Stonebraker. And it was titled, What Goes Around Comes Around. I think, you know, the short summary was you guys kind of discussed this fascinating cycle of data models over the previous four decades and how data models have gone from complex to simple and then back again to complex. I'm curious a little bit, like, you know, now that, you know, it has been several years since your publication, where do you think we are in that data model cycle today? Joe [00:22:46]: Yeah, that's a fun topic for your listeners. If you don't know Mike Stonebraker, he was my master's advisor. He's one of the founders of relational databases. He won the Turing Award for that. He started his career in like 1970. He's 80 years old. He's still going. He's still like at every meeting running things. The guy's, he's amazing. And he's founded a number of influential companies. And of course, the Postgres project that many of us use that I was a grad student on. So Stonebraker's a legend. He's a super opinionated guy and he likes to kind of have his say. So that paper is written entirely by him. I don't disagree with what it says and we were putting a book together, but that was his chapter. And boy, is it him. So it's very black and white. And I see the world in shades of gray a little more than Mike does. But what I would say is the high level message of pretty much tabular simple data with a well understood query language that's pretty clean is going to always win over time over any custom complicated thing. I agree with that. Data is too important an asset to have behind fancy stuff. You want to have it behind relatively clean stuff. Having said that, as Stonebraker actually did in Postgres, you can put a lot of interesting data into Postgres and query it with pretty simple SQL. And it's not really flat tables anymore. It's something more than that. And, you know, if you go off and read database theory papers from the last five, ten years and you hang out with the right folks, you'll realize that generalizing the relational algebra to richer mathematical structures can give you actually more flexibility in this space than I think that paper gives account for. So I actually think over the next ten years we will see another generation of extended extensions to the relational model that will make it amenable to new data types in ways that we haven't seen before. So I can give you some concrete examples. Traditionally, like if you had something like an image in your database, it was just a blob, right? It's a binary large object. It's just a route of bytes, right? Unix-style stream of bytes. I think increasingly what we're going to see is, and Postgres actually has the infrastructure for this, it has since like 1989, but we're going to see this in the field. Point at any blob, you have a featurization function that pulls out a bunch of attributes for that blob. Those attributes, they're columns. So, you know, think about the image that a self-driving car takes at any given time. Bounding boxes around a bunch of regions in XY space, each of which may be tagged with a class or a list of possible classes. I think this is a pedestrian. I think this is a car. I think this is a stop sign. And then it's got a time ticker, and that's a row in a time series database. You know, if you want to start building queries over what happened at this intersection, at a busy intersection at a particular time, it's going to be a time series query over something that eventually looks like rows and columns, but it's actually video. And so we want to extend our tools and unify our tools, right? So the technologies like LLMs and image processing and so on are generating features that we can easily query. And we're going to need to build systems that are a little smarter for that than what we've got in Postgres and the like. But I'm optimistic that the road between where we are with stuff like Postgres and its children and where we need to get to, that's a bridgeable gap. And so I think there's a nice opportunity here for a next generation of powerful analytic databases to be built that will be extensions of what we have today. And it's not really a circle, what goes around and comes around. It's a spiral, right? You're going forward as you're moving, right? And I think there is progress that's required and that's going to happen. Sudip [00:26:12]: That's a fascinating example because today if you have to kind of extract the feature from a blob of image data, you have to build all these fancy pipelines and stitch together a different number of different tools. If you could expose all of that to a simple SQL interface to someone who only knows SQL, now you're just empowering that person so much more. Joe [00:26:33]: And now, if you may, think about data governance in that world. That function you wrote that extracts all those bounding boxes and labels, that's a model. That model had training data. And there was this ML team that owns the training pipeline for that model and maintains it. Now we have governance questions that are like not just what data did you look at, but in your query, which functions did you use? Were any of those functions model-based? What was the data that trained those models? And so if you're doing something like the right to be forgotten in GDPR, I want anything you're carded on the street to be deleted from the database because you're no longer using that insurance company, let's say. I can't just look at which queries touched it through SQL. I need to also look at which functions in your query were trained on a model based video that you were in. This is now across teams. It's across what we currently think of as totally different data pipelines. But this is the future of data governance and metadata management. So it's got pretty big implications for how organizations are run. Sudip [00:27:32]: That is actually a really good segue into one of the things I wanted to ask you, which was the role of data catalogs. I know you're not a fan of the term, but in that kind of modern data stack, what do you think we are with data catalog? I mean, some time ago, there were probably a lot of users that came out of the web-scale companies. There are, of course, companies like Alation and Colibra, who are more ahead in terms of commercialization. Do you see a data catalog as a role in the data engineering going forward? What does that look like? Joe [00:28:02]: I think, inevitably, they have to exist in some form. And I saw this when we were selling Trifacta, and I saw this in the research we did in this space. And I did some of that research in collaboration with LinkedIn back at the time, and they were one of the first data hub vendors we had. And I continue, actually, to advise Acro. So I should just make a public statement about that. So they're one of the data hub vendors. But, you know, if you have many data sources under different systems, some of which are proprietary, some of which are open source or different flavors of proprietary managed open source, they're not going to come with a common catalog. It's in no one's incentive, as a software vendor in the space, to build the catalog and if you're wrangling your data, then you'll catalog it with Trifacta. We'll own everyone's metadata. We'll be very powerful. Customers did not buy that. They knew that was a scam. So it's a lock-in scam. So a neutral data catalog is a reality, I think, for any large organization, even today, honestly, going forward all the more so as data gets richer and systems proliferate. And it's a hard problem that merits full-time tech focus. So again, at Trifacta, we wanted to build a data catalog, but by golly, that's going to be a big lift. We were plenty busy building data wrangling tools. We were happy to partner with the likes of Colibra Innovation and so on because they were doing a good job. And it takes a whole team just to do that stuff. Sudip [00:29:26]: 100%. On a different note, you actually have had a ringside seat, and not only that, actually worked on several technologies in this whole movement that you were talking about a little bit earlier, which is we went from data warehouse to we are kind of in the middle of getting to data lake and then even early days of data lake houses. Any thoughts on what is fueling this? Like what are the trends behind this journey from warehouse to lakes to lake houses? Joe [00:29:57]: Yeah. I mean, I think the easy answer is software logs were kind of the first high volume source of data that folks like us and the people on your podcast had to manage that just didn't really make sense to put in a relational database. It was too expensive. You know, they sort of had rows and columns, but they kind of didn't too because there's lots of text in those logs that you want to look at. I mean, they're not always structured the same way and so on. And so you saw the likes of Splunk and their following competitors carve out a very large niche on log processing that as an academic you're sort of like, yeah, it's kind of information retrieval is kind of databases. I don't see anything new here. La la la. Academic ivory tower stuff. But, you know, really important business problem with really good tech out in the field. That's the tip of the spear. New data type is just a little bit different. It's got different cost structure and value structure and different queries. You don't want to put it in Teradata, right? And you don't want to put it in Amazon Aurora either, right? You kind of want to just leave it lie. Now, Splunk didn't work that way because it was early. So Splunk, the thing that customers hated was it was so expensive to put your logs in Splunk. They were charging by the byte. Essentially, I heard complaints about that all day long, every day. Sudip [00:31:03]: And still do. I still do. Joe [00:31:05]: Yeah. Yeah, which is why it's kind of great they got bought by Cisco. I feel like it's old school pricing that'll last for a while. But realistically, that was the first use case. And what we're going to see now, because featurization with LLMs is so practical, you can really get structured data out of anything now. You can get structured data out of your web chats. You can get structured data out of your images and your security cameras. Very low budget sources of data are going to turn into columns. Is the customer happy or sad? Is this a complaint or praise? Which product are they complaining about? These are all things that you get out of a customer chat, right? Those all get to be columns now. And you're going to want to load them into something, some customer relationship management system, right? So there's going to be lots of modalities of data that we're going to extract structure out of, not just log files and traditional relational data. But I feel like log files were the first big use case and we're just going to see lots more. So an open question is, does a vendor like Databricks that wants to give you soup to nuts data lake house manage to give you the full spectrum of that stuff in a nice package? Or is it more like Splunk? Is it more like Splunk? You know, where you get kind of someone who's tuned up to be really good at a particular kind of pipeline, a kind of data and a kind of query. And then they can monetize that in a vertical application. I think those are really interesting questions for the space going forward. Sudip [00:32:23]: And that is actually a really interesting thread I just want to kind of pull on a little bit, which is, I think historically data engineering has been mostly about using best of breed tools and then stitching them together to implement your data pipelines, right? Now, of course, we have companies like Databricks, which you are intimately familiar with. And then to a certain extent, even Snowflake, they talk about their unified data platform. Do you feel like we are heading to a world where enterprises will standard on a unified data platform? Or do we still have this, you know, kind of duct taping of best of breed tools? Joe [00:32:57]: You know, it's funny. We have this conversation often. And when I say we, I mean, those of us in the community, not just you and me. You say Databricks and then you say Snowflake and you think about them. And then you remind yourself that AWS and Google and Microsoft have those things and 17 more, right? That are data solutions. So if we broaden the scope a little to be like, what tools in the AWS toolbox should I be using? And will I stay only at AWS? You know, we know the answer to that. The answer is no, right? For lots of reasons. Now, as to who's going to be good at what, as opposed to maybe I want to split my bets and I don't want to have one vendor relationship. It's going to be possible to be an 80% solution on a bigger, 80% of more stuff, right? The relational database was always winning because it was good for 80% of your data problems. Now, you know, what is 80% of your data problems is a much, your problems are much broader. I do think there are sort of 60 to 80% solutions there that you could get from a single vendor. I think there will be solutions that under the hood will have a lot of pieces that today we think of as different pieces of software. You know, one of the things that happened with the big data era and open source and I got a little salty about this with some vendors at a conference recently, I should say, is, you know, we went through 10 years of Hadoop, right? And it was awful. And the reason it was awful was partly because it wasn't the greatest software in the world. It was kind of open source and a lot of it was immature and never really matured. But a lot of it was that it was 14 or 16 pieces of software that weren't super mature. Each of which had a logo and a fan that like, you know, a community around it. So there were five or six super fans followed by 25 fans. It was like identity politics. It's like, no, no, no. You have to integrate the queue into the database system because the queue is cool. It's got a name and a logo and a bunch of fans, you know, whereas if it had been run by a business, they might have consolidated business units over time. And so what I think we're dealing with right now is going to be consolidation of what we currently think of as pieces of the pipeline just because it's going to make technical sense to consolidate them. They're close enough to each other. They should just not have two teams and two products and AWS is rife with this, which is the most confusing. I do think we're going to enter into an era of consolidation around that. Open source will be the last to do it though because of all the cultural issues I just mentioned. The way you get open source to move forward is you build an inner team that really is super fans. And so that causes fragmentation of product. It's hard to build a big enough team of super fans to build a big enough product and to merge teams over time. So I got into a fight at this conference because somebody said that the future of data systems is the stuff emerging from Facebook and Voltron and other places right now. And I was like, I'll believe you when you show it to me, but the last 10 years suggests otherwise. What I see from Amazon, Google, Oracle, Databricks, Snowflake, all that stuff is way better than what's coming out of open source technically. And it's not like they're hiding the technology. Actually, they're publishing about it, especially like Amazon and Microsoft. They write lots of papers about what they're building. Those papers are more sophisticated on average than what we're seeing in open source. So much as I'm a huge advocate of open source and a postgres guy from way, way, way back, what I'd say is that in terms of these big stack problems, we're going to see consolidation. It's going to happen first at the bigger companies or at startups that are willing to take on a big risk and that some of this like piecewise stuff is going to fade away. Sudip [00:36:17]: It's a fascinating view. Yeah. Thank you. I have a similar question, but more around use cases. So which is traditionally we always have had systems that were transactional, so doing OLTP. And then we had systems built for analytics, OLAP. And I think, you know, I actually myself got into a debate five, six years ago about whether it is possible for one system to cater to both. And I think there was a terminology around that time that came up, which was HTAP, Hybrid Transaction Analytics Processing or something like that. Do you see like a world where it's one system to serve both use cases or do you think those use cases will forever be served by separate systems? Joe [00:37:03]: To me, this is totally a cost-benefit analysis conversation. So is it possible to build a system that does both? Well, I believe it's possible. In fact, one way to build it is to just sell both systems and put a little glue underneath it and kind of try to hide it from the customer. But that's not a very good instance. But you can build these things. And in fact, former PhD students of mine have done great work on this at IBM and other places. The possibility to do it well is there. It's hard to do because you're basically meeting multiple SLOs with a single piece of software. Some people want very low latency, high transaction rate and then transactional semantics. And other people's SLOs, I want very large data volume. High latency is fine, but huge data volume, I want lots of throughput of bytes. And to meet both those SLOs in a single system without introducing 700 tuning knobs, one of which might be run in OLTP mode versus run in OLAP mode. But to really get down to what are all the knob settings that make it work well in one or the other, it's really hard to build that well and make it usable. And then you ask the question, well, if I built it and I invested in doing that, and let's say I have infinite budget, let's say I'm AWS, is there enough customer demand to fund that? And to my guess, the answer is no. Most people probably can live off some kind of data loading in the background that's not too hard to manage, not too expensive in human time to do the ETL, ELT, call it what you will. And have two systems and two different organizations that manage them. And then there's the governance issues. So the governance issues for your operational databases are typically very few people get access to them, right? And it's real easy to lock down who gets to see what because very few people get to see any of it. But once you get it into the warehouse and you've torn apart all the little pieces that are accessible to different people from each other in some sense, now you have a management problem only on the OLAP side, but it's hard enough over there. So getting governance working in HTAP is also a big challenge outside of the other technical challenges. The organizational governance challenges are hard. Again, I don't think most orgs really want all that noise in their production operational databases. My advice to the entrepreneur doing HTAP is that's brave and risky. But if you did it and you won, man, that'd be awesome. You'd get two markets instead of just one. But I think it's high risk. Sudip [00:39:18]: I will take that. By the way, one comment I just wanted to make was a conversation we had over the last half an hour, 40 minutes. I've heard you so many times bring up governance, which was fascinating, you know, given you're an academic first, but I imagine a lot of that comes from your entrepreneurial experience. Joe [00:39:33]: Yeah, absolutely. Nobody, none of my colleagues know much about this except for the ones who are actually beginning to get interested in fairness. So some of my colleagues who are working on the boundary like data fairness, AI fairness, they get it actually very deeply in ways that I often don't think about because in industry we don't think about that so much. But most core computer scientists think about how fast things go and how useful the outputs are. And they tend not to think about governance. It's absolutely right. And it's hard to teach, honestly, like giving a lecture on that stuff's pretty dry. I don't know if you've ever gone through a class on like access control, but it's not real inspiring. So I think it's one of those things you've got to get out there, get your hands dirty. And they realize it's like the biggest problem sometimes. It is not fun because you are mostly talking about locking down as opposed to, you know, obviously empowering people. Yeah, it's a little bit like security in that sense where, you know, you have to scare people enough and put a lot of drama around doing it wrong. To get people fired up to want to do it right. Sudip [00:40:25]: 100%. I want to ask you a little bit about how you see the whole data stack evolving with AI. I mean, everybody now wants to be an AI company. Do you see some fundamental changes in the data stack and how we have done data engineering over the last, you know, two, three decades now that we are going to add up all of this in a fascinating LLM technology? Joe [00:40:46]: I guess what I'd say is a couple of things. I'm beginning to get the feeling after beavering away at this at Aqueduct for a couple of years with my co-founders who are brilliant Berkeley PhDs and professors that, you know, picks and shovels broad purpose tooling for LLMs for the enterprise is not ready yet. It's too early for that. Most enterprises that are going to go in this direction probably will either use what they can get from Microsoft and the like or they'll do some stuff in-house until they figure out what they really need. And this just hasn't settled down enough to do a broad-based solution. I think that decentralized solutions are going to be earlier to adoption and earlier to generating real value. You know, the best examples that we see in the wild are the difference between ChatGPT and Copilot. I know exactly why I would pay for Copilot. I have less reason to pay for ChatGPT as an individual. I mean, it's kind of cool as long as it's cheap. But, you know, broad-based answer any question I can think of pretty well, I don't know what that's for, really. But, you know, if developers are going to be doing the tooling because developers are going to give very crisp feedback as to what they want in this space, that is a good place to focus. I think Microsoft's been very smart focusing on Copilot because the value of it to their constituency is very clear and they can dogfood it in-house for a long time and figure out how to make it better at the tasks they already do anyway. Medical, by contrast, like on the one hand, sounds great, right? And on the other hand, yikes! How do we make it work well for this kind of thing when it hasn't been trained on it? I had a conversation with a major medical provider recently and they said they have like a stack list of stuff they want to try LLMs for and it's got like 250 use cases. And I said, wow, I'd love to see that. That sounds fascinating. But, you know, the ones that we settled on talking to them about that seemed like they were actionable were ones inside of IT. Because inside of IT, we could figure out, first of all, we wouldn't kill anybody. Second of all, and I didn't really figure out like if it's working or not pretty well. And so doing pilots inside of sort of technology settings, I think there's going to be a lot of easier pickings there for a while. And we'll get better at it by using it ourselves as the developer community more quickly. Of course, people will succeed applying it to specific verticals too. And that's also good. But doing broad-based right now, I think we're too early. Sudip [00:43:06]: Too early to settle. I want to ask you one last question before we move on to the lightning round. And you probably have one of the best vantage points to answer this. What are some of the most interesting future directions, research projects, and data engineering that you are excited about? Joe [00:43:23]: So I always like to answer these questions in terms of that I'm not working on because I have my biases, right? But actually on this one, since I'm not doing Trifacta anymore, I would say that actually data transformation, the T in your E-T-L-T-L-T-L-T-L, whatever star, I think it's an awesome petri dish for LLM technologies because of a bunch of things. First of all, algorithmically trivial. It doesn't have to invent new algorithms to do E-T-L. We're not doing things to the data that involve computing Fibonacci numbers or even sorting often. I mean, like, you don't have to implement quicksort. You don't have to invent any new algorithms. You just need to apply building blocks in the right way to the problem. So I think it should be good at it to a first approximation. Secondly, very hard user interface problem. So it's one thing for me, and I don't mean to minimize things like mid-journey because actually I think it's totally fascinating and awesome. But if I say, give me a picture of two people talking on a podcast and it's got this crazy awesome picture, it's really gratifying. But it doesn't matter if it's right or wrong. And I didn't define what's right and wrong. Sudip [00:44:26]: How delighted were you? Joe [00:44:27]: I'm like, I'm delighted. Try again. That's not engineering. That's authoring. And authoring has got different constraints than engineering. So I would say in the engineering world, the great thing about data transformation is it's hard to know if you got it right. It's a huge piece of data. So whether you got it right is highly contextual. Example, you got log files like Splunk. The marketing department is going to put it in the CRM to try to figure out targeted ads. They want to do one thing with that data. The IT department wants to look at downtimes of servers. You're going to do a very different thing with that data. And whether the data is cleaned adequately for one or the other is a completely different objective function for the optimization. So saying there's an LLM that does this, well, maybe with the right prompting. Yeah, maybe. But the prompting is going to have to be very interesting. And my point there is because there's many different correct answers to clean up this data, the evaluation of the output becomes the hard part. And this gets all the way back to the beginning of the podcast. Did you build the right user experience so the user can guide the system and decide if the outputs are right? That guide decide loop comes back. So I think it's a wonderful petri dish and LLMs are great at some things right now in data cleaning and terrible at others. We can talk about that and it's one I know well. So that's also easy for me to suggest. Sudip [00:45:42]: So we end each of our episodes with three quick questions. We call it the lightning round. So the first one is around acceleration. So I'm going to ask you a little bit around your space, which is data engineering. What do you think has already happened in data engineering that you probably had thought would take much longer? Joe [00:46:02]: The one that surprised me at the speed it went is the disaggregation of the stack. I would have expected that kind of the tightly coupled what they call shared nothing architecture where you have memory disk all in one and then you that's your building block and you knit those together and you parallelize across these full machines. That transitioned in the cloud very quickly to what amounts to shared disk. We have a disk tier, a storage tier, think of S3 and we've query processing tier and a log generation tier maybe and the log generation tier hydrates the storage tier. I did not expect that to happen or happen so quickly and it makes sense and it forced a lot of design changes. I'm very impressed actually with the teams at Microsoft and Amazon and Google who've led on this but that shift to disaggregated stack went faster than I would have guessed. Sudip [00:46:47]: In large scale data processing generally speaking what do you think is still the most interesting unsolved question? I mean data cleaning is definitely one I can think of from your background. Joe [00:46:58]: I think the most exciting thing I'm not working on right now is querying unstructured data. I think we're going to see tons of progress on that in the next five years. I don't even think you need 10 to be five years you have SQL interface to everything and you'll be able to ask questions of any kind of object and get some kind of answers but you may not have time to train GPT-4 on all that data every time so how are you going to plumb together all the pieces so I can ask SQL on all my data that's in the lake? I think that's happening it's happening in research and I foresee it happening in industry over the next short window. Sudip [00:47:28]: Fantastic. Last question. What's one message you would have for everyone listening? Joe [00:47:35]: One message that I mean this audience probably already knows but it's always about the data. There's going to be lots of innovations on computation there's going to be lots of cool algorithms there's going to be new kinds of models it's always all about the data and that means where did it come from what data did you choose to acquire and then of course how you bake the cake with it right you know whether it's training a model or building a warehouse or whatever that's important but it's always about all the data and where it came from it's always about that but so are the systems you roll out and the algorithms you run they're always in service of the data and the traditional field of computer science really is always all about the data. Sudip [00:48:14]: That is actually a really fascinating answer you know given we are talking about data entering and given you're back could not agree more. So on that note Joe it was a real pleasure and privilege to host you today. Thank you so much for your time! Joe [00:48:29]: My pleasure Sudip. Thanks for having me. Sudip [00:48:31]: All right. This has been the Engineers of Scale podcast. I hope you all had as much fun as I did. Make sure you subscribe to stay up to date on all our upcoming episodes and content. I am Sudip Chakrabarti and I look forward to our next conversation. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit sudipchakrabarti.substack.com [https://sudipchakrabarti.substack.com?utm_medium=podcast&utm_campaign=CTA_1]

12. dec. 2024 - 48 min
episode Flink: The Unified Stream and Batch Processing Engine - with Stephan Ewen cover

Flink: The Unified Stream and Batch Processing Engine - with Stephan Ewen

In our Engineers of Scale podcast, we relive and celebrate the pivotal projects in Enterprise IT Software that have changed the course of the industry. We interview the engineering “heroes” who had led those projects to tell us the insider’s story. For each such project, we go back in time and do a deep-dive into the project - historical context, technical breakthroughs, team, successes and learnings - to help the next generation of engineers learn from those transformational projects. We kicked off our first “season” with the topic of Data Engineering, covering the projects that defined and shaped the data infrastructure industry. In our previous episodes, we have hosted Doug Cutting [https://www.linkedin.com/in/cutting/] and Mike Cafarella [https://www.linkedin.com/in/mikecafarella/] for a fascinating look back [https://sudipchakrabarti.substack.com/p/when-hadoop-was-king-and-yahoo-was] on Hadoop, Reynold Xin [https://www.linkedin.com/in/rxin/], co-creator of Apache Spark [https://en.wikipedia.org/wiki/Apache_Spark] and co-founder of Databricks [https://www.databricks.com/] for a technical deep-dive [https://sudipchakrabarti.substack.com/p/from-spark-to-databricks-sparks-origins] into Spark, and Ryan Blue [https://www.linkedin.com/in/rdblue?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAAAzlA6sBKpw5AAsa7SgDV425Ay1w6My0b4U&lipi=urn%3Ali%3Apage%3Ad_flagship3_search_srp_all%3BohWsnzB2Rw2ipZdP53XWlg%3D%3D], creator of Apache Iceberg [https://github.com/apache/iceberg] on the technical breakthroughs [https://sudipchakrabarti.substack.com/p/iceberg-the-open-table-format-for] that made Iceberg possible. In this episode, we host Stephan Ewen [https://www.linkedin.com/in/stephanewen/?originalSubdomain=de], creator of Apache Flink [https://en.wikipedia.org/wiki/Apache_Flink], the most popular open-source framework for unified batch and streaming data processing. Stephan shares with us the history of Flink, the technical breakthroughs that made it possible to build a unified engine to process both streaming and batch data, and his take on the emerging streaming market and use cases. Show Notes Timestamps * [00:00] Introduction and background on Flink * [01:44] Relationship between the Stratosphere research project and Flink * [10:28] How Flink evolved to focus on stream processing over time * [13:08] Technical innovations in Flink that allowed it to support both high throughput and low latency * [18:56] How Flink handles missing data and mission failures * [21:47] Interesting use cases of Flink in production * [26:02] Factors that led to widespread adoption of Flink in the community * [29:18] Comparison of Flink with other stream processing engines like Storm and Samza * [37:07] Current state of stream processing and where the technology is headed * [39:47] Lightning round questions Transcript Sudip: Welcome to the Engineers of Scale podcast. I am Sudip Chakrabarti, your host. In this podcast, we interview legendary engineers who have created infrastructure software that have completely transformed and shaped the industry. We celebrate those heroes and their work, and give you an insider's view of the technical breakthroughs and learnings they have had building such pioneering products. Sudip: Today, I have the great pleasure of having Stephan Ewen, the creator of Flink on our podcast. Hey Stephan, welcome and thanks for your time. Stephan: Hello. Hey, thanks for having me. Nice to be here. Sudip: All right! So you created Flink, which is one of the defining stream processing engine in Big Data. As I understand, if I go back a few years, it all actually had started when you were still doing a PhD in computer science from the famous Technical University in Berlin and working on a project called Stratosphere with Professor Markle. Can you talk a little bit about what was the relationship between Stratosphere and Flink, if any? Stephan: When we started in 2009 with the work on Stratosphere, Hadoop was all the rage and the industry was starting to pick it up. But also in academia, everybody was looking at it. Hadoop had some cool properties, but it threw away a lot of wisdom from databases. And there were a bunch of projects that were trying to reconcile both worlds - how can we get some of the Hadoop goodness and database goodness in the same project. And Stratosphere was one of those. It was doing basically general-purpose batch Big Data on a mixed Hadoop database engine. When we were done with that project, we kind of liked the result too much to say, “okay, that's it, it goes into an academic drawer.” We actually made it open source and started working on it. So the open source version of Stratosphere became Apache Flink. We had to pick a new name when we made it open source and donated it to Apache. Sudip: What was the story behind the name? Stephan: I kind of picked up on that reading some of the old interviews. Sudip: Like, was there a conflict with another trademark? Stephan: Yeah. I think in the end, all naming decisions are actually driven either by trademark conflicts or search engine optimization. It was actually no different for Stratosphere. I think it was a horrible name to get good search ranking for. And also, it was a registered trademark in so many industries already that we felt, okay, there's no chance this will work for us. And I think it was fashionable at that point in time to pick project names based on words in other languages, because they were usually not trademarked in the US. And we did the same thing. Flink just means quick and nimble in German. And because, you know, it's kind of where it started from, we picked that as a name. It was a good choice in hindsight, because it did lend itself to a cute squirrel as the mascot - and we really liked that. If you want to go back to the question, like Stratosphere to Flink, Stratosphere was the starting point, it was an experiment with mixed database engines. The first year of Apache Flink was also pretty much that. We're doing batch processing, graph processing, ML algorithms on the same engine. And then we explored stream processing as a possible use case, because we knew there were some interesting properties in the engine. But stream processing wasn't a very big thing back then. When we started exploring that and published the first APIs, we saw that folks really, really loved it. And that's when we shifted the priority of the project to stream processing. So 2015 is when Flink went all in on stream processing. Sudip: So what Flink is today actually has very little to do with what Stratosphere was, as it happens with so many projects. Stratosphere was like the research phase, the first prototype, and then you found your niche with Flink. And over time, kind of almost rewrote the project. I want to go back a little to the motivations that started the research project. Like you said, Hadoop threw away some of the wisdom from the database world. Were there particular problems that you guys had in mind that you wanted to solve with this new system that Hadoop could not? Stephan: Databases have these two components - they have a declarative language and a query optimizer component around that. And they kind of decouple how things get executed from how you specify the actual analytics programs. That is for a good reason, because the number of programmers that can write very good low level code is actually very, very small. And the data analysts that can do that is even smaller. So this decoupling is there for a reason. Hadoop just threw that away and made everybody write MapReduce jobs. And of course, that's why one of the first things that came was building SQL engines on top of Hadoop again. But then those threw away some of the generality of the MapReduce model. The nice thing about MapReduce is that you could implement a lot of things that you couldn't as easily express in SQL. So the question was: was there a nice place in between? Can we use DSLs or embeddings or compiler extensions in programming languages to capture some of the program semantics and optimize them in a database query optimization style and then map them to a suitable execution engine and get some of the database query optimization code generation magic - not for just SQL, but for a more general programming paradigm. That was one of the starting points. And then, of course, the Hadoop execution engine is sort of simplistic, right? Like just map and reduce steps. Sure, it's a powerful primitive, but also it's not for everything the most efficient primitive. If you look at the systems that dominate the space today, they all have way, way more general and powerful engines that support a lot of the operations which came from distributed databases. So we worked on what distributed execution primitives you need, and what's the minimum orthogonal set to build a very flexible engine that can then run these programs that are going through a translation and query optimization layer. Sudip: If you were to describe Flink to an engineer who doesn't really know what it is, what would be a description that you would use? Stephan: Yeah, so we called it on the website, and I think the website still does that today, stateful computation over data streams. The idea of Flink is that data streams are very, very fundamental primitives. It's not just real time streams but it's actually the shape in which most data gets produced. Unless you're running a particle accelerator experiment that literally dumps a petabyte on you in a few milliseconds, most of the data in real life get created as a sequence of interactions from users with services and sensors. So almost all data actually gets created as a stream. And it's a very natural way to process it as streams. You do preserve sort of causal relationships. You take into account that there's a time dimension through which data moves and so on. So Flink is really a system that's built to work with this idea of a data stream. And it can be a real-time data stream. That's where it is probably most famous for being useful and where its strongest capabilities lie - to actually be very strong in analyzing real time data streams. But it doesn't stop there. It can also process archived data streams. If you think of a file as an archived data stream, like a snapshot of a database table as a stream of records, Flink works equally well with those. It even works when you mix those, like you have a stream that keeps continuously moving versus a stream that is more of a slow moving, or even a static snapshot of data. And then how do you combine those? So, Flink handles all sorts of stream processing: real-time stream processing, and if you wish, offline time-lagged stream processing, under a unified programming model and an execution engine. Although the strongest differentiating parts, I would say, are still on the real-time side. Sudip: If I go back and watch some of the presentations you guys did when you started Flink, I believe one underlying motivation was to unify batch processing with stream processing. It's almost like you guys built a batch processing system on top of stream processing by making batch processing a special case of streams. I'm curious what were some of the architectural innovations that you guys had to do? And what were some of the challenges you faced in doing that? Stephan: That's actually one of my favorite topics. I said it before that conceptually, the stream is a very, very fundamental primitive. And it's not just at a conceptual level - it also works like that in the runtime layer. Conceptually, if you're very good in processing real-time streams, you can start thinking that a batch is really just a stream that starts and then stops. Maybe I'm not getting the data bit by bit, like you know records that I draw from Kafka, but I'm getting it as a firehose over a read from a file in an S3 bucket. So naively thinking of a batch as a special case of a bounded stream also works well on the runtime. But then there are a ton of things you have to do differently to make it work well from a resource efficiency and performance perspective. There are a few things with the real-time streams and the batch streams that are actually different. And it's important to understand how to take care of those different characteristics. On the real-time side, a stream always has a trade off between latency and completeness. You have to understand how long you want to wait before you emit a result in order to reflect all relevant data versus what's the latency you introduce by holding the result back. It's kind of a very specific thing. You have to make the engine very flexible in configuring that trade off, because every use case has a slightly different requirement. It's where the watermarking mechanism in Flink ultimately boils down to, in configuring that trade off. The next thing is that in the streaming case, you're obviously interested in what's that time to first result - you want the result as soon as possible. On the batch side, because the data is little older anyway, time to first result doesn't matter. You always assume that batch is as good as it gets - like, I don't get more data because I wait longer. And those two things let you use completely different scheduling algorithms, data structures, and so on on the batch side, which ultimately is what you need to do to make it really competitive with batch systems. So what Flink ultimately did over the years was that it started as almost like two engines in one system, like a completely different batch engine that only exploited bounded streams, and a real-time streaming engine. Later, we started to unify the engines. And that had implications on every level. You need to build a network stack that's extremely good in decoupling execution stages for flexible scheduling. That affects the way you organize memory management in your network stack. It also means that the way the operators are implemented is very different. Like a batch join works fundamentally different from a streaming join. And whenever an update comes from one side, you probe against what you remember from the opposite side. And then there's all sorts of optimizations you can do. But if you want to get all that in the same engine, the same engine has to support both of those patterns very well. So one way to think about this is that what Flink evolved to over time was almost a very sophisticated, distributed flow control network where different operators can activate, deactivate, and prioritize different inputs over others to exploit all these characteristics. It implements a sophisticated flow control model over the network stack that extends to the entire streaming graph and couples the way all operators together decide on how data flows. It's fascinating and also took a while to get there. And it's still one of my favorite parts in the architectural stack. It was a fun learning and discovery and engineering experience over time. Sudip: How long did it take you guys to build this unified engine? Stephan: I don't think it's actually fully done, to be honest. When did this unification effort start? 2016 or 2017, maybe. I would say good parts are done, but in certain areas, it's still ongoing. Let me give you one example. One of the really interesting case where you start to combine batch and streaming execution is when you actually have a stream that is kind of a real-time stream, but it already has a lot of data. Let's just assume you're connecting to a Kafka topic that has half a year of data and retention, but it's also has new data coming in. And this may be models, user interactions from your service, and you want to bootstrap a new recommender with that data. You'd really like to fast process through the previous half year of buffer data, and then you want to keep working in real time on the new data as it comes in. It's kind of like bootstrapping a new service based on a stream of data in the past that keeps updating with real-time data, which is actually a really nice use case of this unified batch streaming model. It is a really nice way to show how a data stream is a unifying fundamental piece, because it represents at the same time the old data and the new data. And you don't even have to think about, okay, I have to first build a separate job that reads the old data and kind of does something, then I have to have a different system that updates it with the real time. It's actually just like one streaming computation and the stream processor just crunches through the past and then updates the present. Conceptually, very, very easy, right? But practically not that easy to do because you can process the entire stream in a streaming fashion, but it's going to be very inefficient. The batch data structures and algorithms tend to be much more efficient than the streaming ones. And if you only use the streaming ones, you also use the streaming way of executing in the past just because you want to continue with the streaming way of executing in the present. You end up wasting a lot of efficiency. It's a real thing and people do make it happen, but they throw way more resources at it than they really want to. What you really want is a system that executes up to the point today in a batch fashion and then switches over to executing in a streaming fashion. That work is still ongoing in Flink to make that type of behavior really nice and smooth. The whole unified batch streaming is a gigantic topic that I'm not sure it will ever be fully done, to be honest. Almost like holy grail for data processing. If you want to build a unified engine, it's also really hard to make it competitive with the specialized engines on both sides. That's kind of the added challenge. There's a value in having the unification, but ultimately you don't want to be much worse than any specialized system because then you're also not giving folks a good reason to use it. So that's always the challenge. Sudip: And since we're on this topic, I wanted to ask you about what do you think about micro-batch processing that something like Spark Streaming does? When you compare Flink's stream processing with what Spark has historically supported through micro-batches, what are the pluses and minuses of that? Stephan: If you have a batch engine, then micro-batching is interesting, I don't know if I should say that it is a hack to do something that comes somewhat close to stream processing for certain use cases. The issue with micro-batching is not necessarily just the latency because of the batching, but as pipelines get more and more complex if you try to execute those pipelines in a micro-batch fashion, you'll be going through so many stages of scheduling that it will be very hard to make that really well behaved. It just flows much more naturally in a real streaming engine that has the right operators, just keeps the data moving, and then implements the fault tolerance as an asynchronous background operation, which is the architecture of Flink. That being said, if you're looking at the batch processing side, you’d start with micro-batching first because that’s a very good match. On the streaming side though, I always found that it hits a limit if you try to add all the things you need, and you get very close to implementing a streaming engine again. Sudip: Right. You are basically using the batch processing system to behave like a stream and then introducing a lot of inefficiencies; so, might as well build a stream processing engine - got it. When you guys started in Hadoop, like you said earlier, Hadoop was really prevalent and widely used. Did you guys benefit from any of the Hadoop ecosystem? Stephan: I would say so, both in a good way and a bad way. One of the things we benefited from was that, with the introduction of Hadoop, the idea of building data lakes liberated data out of central databases. That was a necessary precursor for everything else to happen. So that goes for Flink as well, even though I would say that Kafka was probably even more important for Flink. What Hadoop HDFS did for data at rest, Kafka did for data in motion. But still, in a way Hadoop kicked off the whole thing. Plus one thing that Flink relies on for its fault tolerance model is a place to put asynchronous snapshots in. And the first sort of mass storage that most companies tended to have was HDFS. Today, most folks use S3, but HDFS was the first checkpoint target for Flink. So we usually were working together with the Hadoop vendors. Plus the big data community was almost synonymous with the Hadoop community at that point in time. So when we started, all the first Flink presentations were at the Hadoop summits. In spite of the things that Hadoop did not do right, it was a super important step in the evolution. And they actually did a good job in building a community. And that was really important to unlock projects like Flink and Spark and so on. Sudip: Can I switch to some of the interesting use cases that you guys were seeing with stream processing in particular? You had a company, Data Artisans, at that time, which was building on top of Flink. So I imagine a lot of your users and customers had stream processing use cases. What were some of the most interesting use cases that you saw where people were using Flink? Stephan: It's very hard to pick because there were so many interesting use cases. The first one that blew my mind and one I was super proud of was a telco company that used Flink to understand the health of their cell tower network and the utilization of all the cells to do load predictions. We once visited them and saw their central control room where they were trying to operate the entire network. You could actually see those dashboards and prediction graphs that were all powered by our technology. That was a really fun moment. For apps that I use day to day, whenever I know that Flink is running behind and I understand what is happening as I'm scrolling through Netflix, or I'm ordering an Uber, etc., those things still fascinate me. In hindsight, the most mind blowing use case was a smart city project in China and Southeast Asia where they had an insane amount of sensors to optimize traffic flow in real time to reduce travel time for drivers and carbon emissions from exhaust. Also in case there was an accident, they could evacuate certain roads for ambulances. How they changed the traffic lights, how they aggregated and predicted the traffic, running all of that through the streaming engine was pretty amazing. That is, and maybe that's the most impressive one over all time with real world implications too. Wow. It was not something that just made people click ads. Those folks could actually see huge reduction in exhaust fume emissions and massive reduction in travel time for ambulances and so on. That's nice - literally saving lives. Sudip: That's fascinating! Thank you for sharing that. Now looking back, over the last eight to nine years since Flink was created, I would say it's probably the most popular stream processing engine that not only has been really popular within the open source community, but actually has gotten into production in multiple, real use cases. Now, if you were to look back, what do you think were maybe the most important two, three things that really led to that community adoption which in turn led to Flink being deployed in production? Stephan: I think there were a few. Some technical, some more community. On the technical side, it was the fact that Flink actually managed to do this combination of streaming with low latency, good throughput, stateful exactly-one semantics, while supporting large states, I mean, really large states in terabytes. Flink was able to handle all of that because it had a way of making snapshots incremental, and using those for backup, rollback, etc. It was a stream processor that worked and scaled and one you didn’t have to worry about. The only dependency that it really had was S3, which probably is the cheapest dependency you can probably have in data processing. Other than that, you didn't have to worry about adding more load to any other system. So it was fairly easy to just add new jobs and the needed S3 and compute. So from a technical side, I think that was really good. On the community side, the biggest thing that I can actually pinpoint is this. When certain companies started adopting it, I think when Netflix and Uber and Alibaba started showing off how they're using Flink for very, very serious stuff, that pulled many other companies in their tailwinds. I think those were the years in hindsight that made Flink. It still took a long time for Flink to hit mainstream adoption, but that was mostly because lots of companies just took a while to adopt stream processing because, organizationally or from an infrastructure perspective, it's much harder to adopt stream processing than batch processing. Mainstream takeoff actually did still take a while, but I would say that adoption within early, name-brand adopters helped other companies pick up stream processing. Sudip: So those were like your Lighthouse users, Lighthouse customers, who had serious production workloads and that drew the community and the rest of the market to use Flink. Stephan: Yeah, I would say in hindsight, that would be my perception. It's always hard with these open source projects to understand what exactly drives the user to pick it up. You can sometimes do surveys, but still for 95% of the downloads, you never know where they actually came from. But my perception is those Lighthouse users drew a lot of people, yeah. Sudip: Talking about what was really interesting about Flink, you guys actually supported very high throughput and low event latency at the same time. Was there a particular architectural innovation that allowed you to do both together? Stephan: There's one kind of design principle in there that is exactly the opposite of batching. It's this idea that data has to keep moving through the system. Of course, there's a network boundary. There's a certain amount of buffering, just because you want to use the network in an efficient way. But it's very short and it's very much a locally isolated decision. There's no coordination with any other processes. It's like data keeps flowing and all the fault tolerance work is an asynchronous background process. The way to make this work was based on the distributed snapshot algorithm. That sounds easy, but it's actually not that easy to make it work in practice. Another thing that was a fortunate synergy was the relationship between Flink and RocksDB. The architecture that we came up with in Flink with aligned very well with RocksDB. It's not that we had designed Flink for RocksDB, but when we found RocksDB we felt that it actually matched the Flink design principles pretty well - keep data flowing, make fault tolerance an asynchronous background process; if you do buffering, make it a local decision, and don't have synchronization barriers across parallel processes. Sudip: That kind of reminds me of the paper that Mike Stonebraker wrote on the eight rules of a stream processing system. I think the first rule was exactly that - keep data moving, and don't add costly storage as a step to process your messages. It sounds like you guys did it really, really well and hit it out of the park. Stephan: Yeah, that work would definitely stand the test of time. Sudip: Yeah, absolutely. How did Flink handle missing data or mission failures, given that you were not really persisting the data as much as possible to avoid storage latency? Stephan: The core idea of the asynchronous snapshot method is that you rely on the fact that you can replay a certain amount of the data. But you don't want to replay too much because if you replay too much, then that's going to cost time. So that's actually also where Flink worked really well with Kafka. The way it would work is, on a failure you would go back a couple of records in your Kafka topic, restore the snapshot state to where that specific point was when you were at a certain offset in Kafka and then replay and rebuild from there. That means you don't have to actually track every tuple or every event. What you really need to do is to create a new snapshot, a new fallback point, and tag that position and your progress with logical markers. And then you kick off an asynchronous background process that says, okay, let me persist everything I need now, but in the incremental variant, like building on previous snapshots, to build a new fallback point. And then on failure, it would restore that fallback point, understand to what offset in the input stream that corresponded to that fallback point and then start replaying from there. This is actually not unlike how many databases work. You have your transaction log where you persist stuff once in a while, you checkpoint, you flush all the materialized tables so you can truncate your transaction log and you don't need to replay too much from your transaction log. It's a similar philosophy, but generalized through a distributed data flow graph. Sudip: Which goes back to the original points you were making about bringing some of the database techniques into distributed data processing as the motivation behind your Stratosphere research project in the first place, right? Stephan: Yeah, yeah. It's very interesting. I think the premise of that project was so true, there's a lot to be learned also from old research in that area. So maybe here is an interesting anecdote. How did we actually come up with this whole algorithm in Flink? Yes, we were seeing string processing as an interesting use case and that we can build interesting APIs. But how do you actually make this fault tolerant and efficient? And there were lots of ideas floating around. There was micro-batching that Spark had just written about and there was Storm that tracked things per tuple, but none of them really clicked with me. And the one thing that did click with me was that I still had a paper reading list from my academic times. I had a paper on a system called Distributed Graph Lab that showed an evaluation against a fault tolerance model based on some distributed snapshot method called Cheney-Lamport snapshots, from the early 1980s or so. And for some reason, I thought, okay, that sounds interesting. I want to read more about that snapshot method from the early 80s. And that turned out to be the inspiration in the end. The algorithm that Flink uses is a modified version of that algorithm from the early 80s. The core idea of the Cheney-Lamport algorithm is that, let's assume you have lots of independent asynchronous processes running around and at any point in time you want to know the state of the system as if a photographer had taken a picture. How do you even do that? And that's what that algorithm was about and that's exactly what we needed. We could actually simplify it a bit because we had a very specific setup - a directed acyclic data flow graph. But the core principle is the same, use logical markers to track and see where processes stand in relationship to each other. That's actually one of the discoveries made by computer scientists in the 70s and 80s. And all we do is continuously rediscover the same things. I wouldn't go that far, but there's a lot of the fundamental stuff that are still very, very relevant. Sometimes the biggest thing is that it's very hard to read those papers from back then, but absolutely worth not discarding that research. Sudip: That's a fascinating point you're making, and particularly for up-and-coming engineers who are into the space. It's like going back to some of this old literature in academia is really worth a shot. Stephan: Yeah. Although it's hard to find the relevant one. So what has worked really well for me is to keep looking at other systems that do interesting stuff. For us back then, it was Distributed Graph Lab, which was somewhat of a totally unrelated system in many ways, but still interesting in its implementation. So it just fascinated me. I wanted to learn more about that. And then their related work list also. I really love related work lists and citation lists and so on. And, dig through the ones that actually sound interesting, like use it as a bit of a guided search algorithm. Sudip: Now, when you guys were building Flink, there were two others that were getting built around the same time, maybe even were already in the Apache Foundation at that time. One was Storm. The other one was Samza. I'm curious how you'd compare Flink to those two. Where did you guys really shine? What were some of the things that those engines did better? Stephan: I would say Storm was almost the incumbent in a way back then. It's just the first system that started stream processing, even though I don't think it even called itself a stream processor. It was probably more like event processing. And Storm had this tricky notion of trying to track individual events and the flow of events in order to understand event failure and replay those, which for certain things was helpful. However, it was ultimately impossible to do that at scale - you cannot track fault tolerance related information on each individual event. It doesn't work, doesn't scale, not even for simple event pipelines, let alone for more complex ones. Even if you go to complex streaming SQL queries, this wouldn't work. Flink did not do that. The Snapshot algorithm has the characteristic of saying we're tracking progress in a much more coarse grained manner, in a much cheaper manner. Plus we can actually do it exactly once, not at least once, right? That was something I think that made a big difference. Plus, Flink actually was a stateful system. And I think Storm later added some extensions to sort of manage state, but I think the integration never went as deep. So stateful exactly-once cheap fault tolerance was something that Flink had. In a way, even Spark streaming had sort of a stateful nature. I think that's why folks were calling Flink and Spark back then the generation two streaming engines where Storm was generation one. And you can actually see that all newer engines are not built after the Storm pattern. If anything, they're closer to the Flink and Spark model or the Kafka Streams model. I think Streams was very, very interesting. It was in some way, I think the most interesting one at the time and one that from a technical perspective, I was initially most worried about. It had this very interesting notion of how it used Kafka. But just by the limitations of Kafka back then, Streams could not do exactly-once characteristics on state, which Flink could do, which was a big deal to users. I think the second part is there's something good and bad about coupling stream processing to Kafka. And the same still holds true for Kafka Streams. I think it's good in a way that if Kafka is your only dependency and you already have Kafka, it's a very easy way to get started. But at the same time, Kafka is a lot more expensive to operate than S3. Like Flink's minimal dependency is S3, Streams and Kafka's minimal dependency is Kafka. If you actually deploy a streaming pipeline that triples the load on your Kafka cluster, it's not the easiest thing for your infra team. Putting more data in the file system or the object store was always the easier thing, in my opinion. I'm forgetting the fact that in hindsight, Flink has had a pretty good case in its APIs as well. Like the whole event time model, the combination of processing time, event time and so on was a big deal when it came out. It's something that Flink didn't fully come up with by itself. A lot of it came out of the Google Dataflow project, which later open sourced into the Beam project. So I think they were the pioneers of that. But even before Beam got created, Flink picked up those semantics from some research papers on Google Dataflow, and was the first system that implemented those in open source. And those semantics were actually very, very powerful. I have an absolutely love-hate relationship with that thing up to today, because it is crazy powerful, but it's also so hard to get it right that I think it's one of those things that causes people most operational friction. So if I ever rewrite Flink, I'm wondering if there would be a way to just get rid of that. Sudip: And that is a great segue into the next thing I wanted to ask you. If you were to build Flink again today, fundamentally, what would you do differently, if anything at all? Stephan: I would do 100 things differently. Of course, hindsight is 20-20, right? To maybe stick to the previous point, the whole event time model. I don't have a perfect answer on this. It's both an incredible piece and an annoying piece because it allows you to do very, very powerful things, but also how do you actually tune your watermark? How do you understand? How do you build your notion of completeness? Is that something you should even try and do? Or is that something that you can just never get right and we should actually just build, try to build APIs, not assuming we can actually do that. It's one of these questions I haven't answered for myself, but it's definitely something I would revisit. Things that are much more clear are on the runtime architecture side or on the whole deployment side. So when Flink started, it was pre-Kubernetes, almost like cloud was there, but not everything was there. A lot was still on-prem. I think Yarn from the Hadoop project was the most popular scheduler. So Flink and the philosophy there was still like projects have to worry about their scheduling themselves. That's how Yarn and Mesos integrations worked. And so Flink actually built a lot of stuff that you just would not build in a system today, and would just try to be a nice player in Kubernetes and that's it. Over time, we evolved Flink through that, but if you look in certain areas, you can still see the heritage. It was built as a system to run on Yarn initially and not on Kubernetes in the cloud. So definitely, that would be completely different. On the runtime side, cloud native architecture will be the first big thing. That includes more than just like the way you schedule and build your whole deployment. It also means that the whole internal stateful engine, I would build that as a disaggregated engine. I would try to build something closer to a file system, distributed file cache or file system caching layer on top of S3 and build the whole snapshotting more into that layer than into the RocksDB layer. I think there's some fascinating stuff to do. It would be possible today because, first of all, systems like S3 or generally cloud storage and object stores have gotten so much better. The network bandwidth is just phenomenally better than it was when Flink started out. The networks in the cloud actually give you more stable net latencies. That part is so much more mature that one should absolutely build on that today. But it wasn't the case when Flink started. So you can see quite a lot of those changes or like bit by bit coming into the system. But it's an evolution of an architecture that was initially built in a different era and then brought into this new era. You can probably take some shortcuts if you build it directly for this era. So on that side, definitely more cloud native, more taking into account cloud storage and cloud network architectures in a much better way. Sudip: Cool! So my final question to you, Stephan, is as someone who has built a system that is synonymous with stream processing today and have even done built startup in that space. Looking back over the last few years and where we are in the stream processing market, do you feel that stream processing as a space has realized the potential and promise that it had? Where do you think we are on that journey today? Stephan: That's a very good question. I'm not sure if I have a great answer for that. I can just give you some thoughts. I think in many ways, stream processing is still at the point where it's breaking through into the mainstream. I remember that when I left my active work in Flink about one and a half year ago, we started noticing a bit of a shift of the type of companies that started using it. Like it was not just tech companies anymore. It was also non-tech companies that started using, that started asking for support, that wanted to become customers. That's a fairly recent development, I would say. And it goes back to something I said earlier. I feel stream processing, compared to some other technologies, sits so much more centrally in your application architecture that it's not the easiest thing to adopt. You need so many other things in place for it to really deliver value. And I think that's just why it takes time for companies. It took longer for the market to mature than I had expected. But I think it is actually here now. I think it is getting into the mainstream. So I would say it hasn't yet reached its full potential yet. At the same time, you know, things like streaming SQL, which are making stream processing more accessible, also hasn't been around for that long, right? Yet, they're really making a difference, I think, in the breadth of the audience that can use streaming. The streaming potential is higher as it integrates more with other systems, databases that hold other data, etc. There's a surge in popularity of CDC style patterns where you take CDC data out of another database. That, I think, is one of the things that is also really driving a new class of use cases in stream processing. That's also something that hasn’t been happening for too long. Looking beyond Flink, the integration of stream processing into other major databases by players like Snowflake, working on incrementally materialized views and so on, is actually good. The thing that I'm not 100% sure of is what actually will be the shape of stream processing 10 years from now. I think Flink will play a role. It might also look very different from anything we have today. So, I think it's still a fascinating space to watch. There's still a lot more to come and happen. Sudip: Fantastic! That brings me to the final thing I wanted to ask you, which is our lightning round. So we end every episode with three very short questions. The first one is around acceleration. In your view, what has already happened in big data processing that you thought would take much, much longer? Stephan: The fact that so many, rather virtually all, data workloads have moved to the cloud. When we started working on Flink, the assumption working with a bank was that data workloads will never run in the cloud, not in 20 years. And now like five years later, they're all in the cloud. That was much faster than I had thought. Sudip: The second question is, what do you think is the most interesting unsolved question in large scale data processing still? Stephan: Yeah, that's an interesting one. I'm not sure if it is most interesting, but the most challenging one isn't actually very much a technical question. It's more of an organizational question or a question of integrating data. I feel that the data processing technology is evolving, but it's actually well understood what needs to happen there. What's much harder is to actually make all those different systems, teams, sources, formats, semantics, everything come together in a meaningful way. I feel that folks struggle a lot to make that happen. I don't think there's a silver bullet for that. Sudip: And my final question to you is, what is one message you would want everyone that is listening today to remember? Stephan: It's a tough one. If I have to pick one, I would say the journey of Flink was a relatively long one with many steps. If you actually take into account the research project that started it all, we started working on this in 2009. Maybe one and a half years ago, it really started to get into the mainstream. That means 12 years after inception, right? That is a long time. So sometimes things take a while to mature, but it is worth sticking with them. If you actually believe in the idea, if you actually see that there's potential. And so I'm like, it's absolutely worth sticking it out, which maybe is a little ironic to say because we actually sold our company relatively early. But still, I think at least we stuck with the project, we continued building and driving the vision. But, that seems like something that is not very popular today. It feels like folks are hopping between opportunities relatively fast - okay, I do this now for two years and then I go do the next thing. And then in six months, let me try another project. Sometimes, it's worth staying on certain things. And specifically for a technology like Flink, I said this a few times, on the simple idea of asynchronous snapshots, work is still ongoing till today to make it more reliable, more stable, more scalable, more predictable and handle ever larger scenarios. These things sometimes take years to mature, but when they mature, they become really powerful. So sometimes it's actually worth sticking it out. Sudip: That's a really fascinating and insightful point! Thank you so much, Stephan. Stephan, it was really a pleasure and frankly a privilege hosting you today. I thank you so much for your time. Stephan: Yeah, thank you so much. It was, I had a lot of fun. Thank you for hosting me. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit sudipchakrabarti.substack.com [https://sudipchakrabarti.substack.com?utm_medium=podcast&utm_campaign=CTA_1]

25. okt. 2024 - 49 min
episode Iceberg: The Open Table Format for Petabyte Scale Analytics - with Ryan Blue cover

Iceberg: The Open Table Format for Petabyte Scale Analytics - with Ryan Blue

In our Engineers of Scale podcast, we relive and celebrate the pivotal projects in Enterprise IT Software that have changed the course of the industry. We interview the engineering “heroes” who had led those projects to tell us the insider’s story. For each such project, we go back in time and do a deep-dive into the project - historical context, technical breakthroughs, team, successes and learnings - to help the next generation of engineers learn from those transformational projects. We kicked off our first “season” with the topic of Data Engineering, covering the projects that defined and shaped the data infrastructure industry. In our previous episodes, we have hosted Doug Cutting [https://www.linkedin.com/in/cutting/] and Mike Cafarella [https://www.linkedin.com/in/mikecafarella/] for a fascinating “look back [https://sudipchakrabarti.substack.com/p/when-hadoop-was-king-and-yahoo-was]” on Hadoop, and Reynold Xin [https://www.linkedin.com/in/rxin/], co-creator of Apache Spark [https://en.wikipedia.org/wiki/Apache_Spark] and co-founder of Databricks [https://www.databricks.com/] for a technical deep-dive [https://sudipchakrabarti.substack.com/p/from-spark-to-databricks-sparks-origins] into Spark. In this episode, we host Ryan Blue [https://www.linkedin.com/in/rdblue?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAAAzlA6sBKpw5AAsa7SgDV425Ay1w6My0b4U&lipi=urn%3Ali%3Apage%3Ad_flagship3_search_srp_all%3BohWsnzB2Rw2ipZdP53XWlg%3D%3D], creator of Apache Iceberg [https://github.com/apache/iceberg], the most popular open table format that is driving much of the adoption of Data Lake Houses today. Ryan shares with us what led him to create Iceberg, the technical breakthroughs that have made it possible to handle petabytes of data safely and securely, and the critical role he sees Iceberg playing as more and more enterprises adopt the modern data stack. Show Notes Timestamps * [00:00:01] Introduction and background on Apache Iceberg * [00:02:50] The origin story of Apache Iceberg * [00:12:00] Where Iceberg sits in the modern data stack * [00:10:38] Transactional consistency in Iceberg * [00:14:38] Top features that drive Iceberg’s adoption * [00:20:00] The technical underpinnings of Iceberg * [00:21:33] How Iceberg makes "time travel" for data possible * [00:24:08] Storage system independence in Iceberg * [00:30:13] Query performance improvements with Iceberg * [00:35:08] Alternatives to Iceberg and pros/cons * [00:40:45] Future roadmap and planned features for Apache Iceberg Transcript Sudip: Welcome to the Engineers of Scale podcast. I am Sudip Chakrabarti, your host. In this podcast, we interview legendary engineers who have created infrastructure software that have completely transformed and shaped the industry. We celebrate those heroes and their work and give you an insider's view of the technical breakthroughs and learnings they have had building such pioneering products. Today, we have another episode in our data engineering series. We are going to dive deep into Apache Iceberg, an amazing project that is redefining the modern data stack. For those of you who already don't know Iceberg, it is an open table format designed for huge petabyte-scale tables. It provides a database-type functionality on top of object stores such as Amazon S3. And the function of a table format is to determine how you manage, organize, and track all of the files that make up a table. You can think of it as an abstraction layer between your physical data files written in Parquet or other formats and how they are structured to form a table. The goal of the Iceberg project is really to allow organizations to finally build true data lake houses in an open architecture, avoiding any vendor and technology lock-in that we all trade off. And just to give you a little bit of history, the Iceberg development started in 2017. In November 2018, the project was open-sourced and donated to the Apache Foundation. And in May 2020, the Iceberg project graduated to become a top-level Apache project. Today, I have the pleasure of welcoming Ryan Blue, the creator of Iceberg project, to our podcast. Welcome, Ryan. It is so great to have you on and thanks for making the time. Ryan: Thanks for having me on. I always enjoy doing these. It's pretty fun to talk about this stuff. Sudip: Awesome. Maybe I'll start with having you tell us the origin story of Iceberg. Why did you create it in the first place? I imagine it probably goes back all the way to your days at Cloudera. Anything you can help us with connecting the dots between where you are at Cloudera, then Netflix, and then now building Iceberg? Ryan: Yeah, you are absolutely correct. It stemmed from my Cloudera days, as well as without Netflix, it wouldn't have happened. So basically what was going on at Cloudera was you had the two main products. You had Hive and Impala in the query space. And they did a just terrible job interacting with one another. I mean, even three years into the life of Iceberg as a project, you were having to have this command to invalidate Hive state in Impala, to pull it over from the meta store. And even then, you just had really tough situations where you want to go update a certain part of a table, and you have to basically work around the limitations of the Hive table format. So if you're overwriting something or modifying data in place, which is a fairly common operation for data engineers, you would have to read the data, union it, and sort of deduplicate or do your merging logic with the new data, and then write it back out as an overwrite. And that's super dangerous, because you're reading all the data essentially into your Hadoop cluster at the time, and then you're deleting all that data from the source of truth. You're saying, okay, this partition is now empty. It doesn't exist anymore. And then you're adding back the data in its place. And if something goes wrong, can you recover? Who knows? And so it was just a mess, and people knew it was a mess all over the place. I talked to engineers at Databricks at the time and was like, hey, we should really collaborate on this and fix the Hive table format and come out with something better. But it was just not to be at Cloudera, because they were distributors. They had so many different products, and it was very hard to get the will behind something like this, which is where Netflix certainly helped. Sudip: What was the scale of data at Netflix when you got there? Just rough estimates. Ryan: I think from around that time, we were saying we had about 50 to 100 petabytes in our data warehouse. It's hard to tell how accurate that is at any given time because of data duplication. Exactly what I was talking about a second ago means you're rewriting a lot. And that write amplification, because your write amplification is at the partition level in Hive tables, you would just do so much extra writing and keep that data around for a few days, and it's very hard to actually know what was live or active. Sudip: So you landed at Netflix. They have this amazing scale of data. You came from Cloudera with some very specific frustrations, and I would imagine some very specific thoughts on how you're going to solve it at Netflix scale. Maybe walk us a little bit through how those first year or two was at Netflix. How did you go about creating what is now Iceberg? Ryan: So we didn't actually work on it for a year or two. First, we moved over to Spark, and that was fun. So Netflix had a pre-existing solution to sort of make tables a little bit better. What we had was called the batch pattern. And this was basically, we hacked our Hive meta store so that you could swap partitions. So you could say, instead of deleting all these data files and then overwrite new data files in the same place, we just used different directories with what we called a batch ID. And then we would swap from the old partition to the new partition. And that allowed us to make atomic changes to our tables at a partition level. And that was really, really useful and got us a long way. In fact, we still see this in practice in other organizations where Netflix data engineers have gone off and gotten involved in infrastructure at other companies. And it worked well, as long as the only changes you needed to make swapped partitions. So again, the granularity was just off. But this affected Iceberg primarily because we really felt the usability challenges. We had to retrain basically all of our data engineers to think in terms of the underlying structure of partitions in a table and what they could do by swapping. So rather than having them think in a more modern row-oriented fashion, saying, okay, take all my new rows, remove anything where the row already exists, and then write it into the table, they had to do all of that manually and then think about swapping because you can't just insert or append. And so the two things were the usability was terrible, and it really got us thinking about how do we not only solve our challenge here, make the infrastructure better, but how do we also make a huge difference in the lives of the customers and everyone on our platform to make it so that they don't have to think about these lower levels. Usability has always been important to me in my career, so I had already said we're just going to fix schema evolution, but really thinking about the operations that people needed to be able to run on these tables, I think, influenced the design. And it also showed me because I'd ported our internal table format to Spark twice by the time we started the Iceberg project, I knew I needed to change in Spark. Sudip: I can imagine. How would you describe the way you guys were storing data on Netflix? Is that closest to something what we all now call Data Lake? Is that roughly what it was? Ryan: The terms are very nebulous here, but I think of Data Lake as basically the post-Hadoop world where everyone basically moved to object stores. And I know that Hortonworks was calling their system a Data Lake at the time, but I think we've really coalesced around this picture of a Data Lake as a really Hadoop-style infrastructure with an object store as your data store. And then, of course, now we have Lake Houses, right, which again plays a pretty meaningful role in Iceberg's adoption too. Sudip: Any particular thought on what's driving this movement towards Lake Houses? Ryan: Well, I'm going to have to answer a question with a question, which is, what do you mean by Lake House? Because there's a broad spectrum here. I kind of see it as where you can actually do warehouse style analytics on Data Lakes. Basically, Data Lake is the way I think about it. You dump whatever data you want in whatever format you want, and then you need some kind of map, obviously, to navigate that. And that's when I think the Data Lake graduates into becoming a Lake House, in my mind. Sudip: Okay. I think that that corresponds with what I hear from most practitioners. Ryan: And this is absolutely what we were targeting with Iceberg. We were saying, let's take the guarantees and semantics of a traditional SQL data warehouse, and let's bring those into the Hadoop world. Let's give full transaction asset semantics to the tables. Let's have schema evolution that just always works, no matter what the underlying data format is. And let's also restore the table abstraction so that people using a table don't need to know what's underneath it. They don't need to be experts in building that structure. They don't need to be experts in how this is laid out so they can query that structure and so on. You should just be able to use the logical table space. And I say this a lot. We're going back to 1992. What I think is difficult about the term Lake House in general is you've got all these other definitions. Starburst thinks about Lake House as an engine that can pull together the highly structured Iceberg tables, semi-structured JSON, completely unstructured storage, and documents and be able to make sense of those, as well as to talk to transactional databases or even Kafka topics and things like that. So they see it from this vision looking downward at a disparate collection of data sources. On the other hand, we, as a single source, we think of it at Tabular, my company, as what all can I connect to Iceberg tables? We've got this rich layer of storage, and so the closest definition for me is Lake House is this architecture where I can connect Flink and Snowflake and BigQuery and use them all. And use them all equally. And then you still have other vendors who actually claim to be Lake House vendors and use it in their marketing, but don't support schema evolution. Don't actually support transactions and ACID semantics. And so I think that there's just a really big space out there. So for our part, we think of independent or separation of compute and storage as sort of defining what we do. Because you get that flexibility to use any engine with the same single source of truth. We layer in security and make sure that no matter if you're coming in through a Python process on someone's laptop, or a big warehouse component like Redshift, that you get that excellent SQL behavior and security. Sudip: That makes sense. I want to kind of maybe double click on something you were just about started talking about, which is transactional consistency. And it is, I would say, one of the most important things that probably makes a Data Lake House in my mind. And you guys have a really interesting way of how you went about solving that. Would you mind expanding on that a little bit? What does it look like from a technical breakthrough standpoint, the way you'd be able to do it? Ryan: I'm actually pretty happy with that being one of the defining features. I would throw in the SQL behavior stuff as well, because I think that really is a critical part. But yeah, I think that the ability to use multiple engines safely is probably the defining characteristic of basically the shift in data architecture in the next 10 years. I didn't even realize that until now. It was a bit of a happy accident because we were looking to make sure that our users at Netflix could use Spark and Pig at the same time. And they could also have Flink and these background jobs that pick data up in one region and move it to our processing region and all of those things need to commit to the same table. And what we didn't realize was that that's not something that you can do in the data warehouse world. In the data warehouse world, everything comes through your query layer and you're sort of bottlenecked on the query layer. Whereas we knew that we wanted to be bottlenecked by the scalability of S3 itself. And we needed to come up with this strategy for coordinating across completely different frameworks and engines that knew nothing about one another. And that was the happy accident. We were doing it so that we had transactions. I was literally thinking at the time, I want someone to be able to run a merge command rather than doing this little dance of how do I get something rather than doing this little dance of how do I structure my job so that I'm overwriting whole partitions. That was our problem was that we wanted to make fine-grained changes to these tables. We wanted to put more of that logic into the engine with the merge and update commands and things like that. But the background was that we had very different writers in three, four different frameworks. And we needed to support all of them. And in designing a system that could support all of them concurrently, we actually unlocked that larger problem which was how do I build data architecture from all these different engines? And that's what we're seeing today as this explosion. We're seeing everyone has data architecture with four, five different data products and engines. What they want is to share storage underneath all of those so that you don't have data silos. So that it's not like, oh, I put all this data in I put all this other data in Snowflake and in order to actually work across those boundaries I have to copy and I have to maintain sync jobs and I have to worry is it up to date? Did today's sync job run? Am I securing that data in both places according to my governance policies? It's a nightmare. And so we're actually seeing this happy accident of, hey, now we have formats that can support both Redshift and Snowflake at the same time. And that is like I said, I think that this is going to drive the next 10 years of change in the analytic data space. Sudip: That's awesome! Can you tell us a little more about how you guys went about really making ACID transactions possible in Iceberg? And I'm probably alluding to other things, real cool things you guys do with snapshots and so on. Can you talk a little more about what is the technical underpinning of that? Ryan: So the idea is to take an already atomic operation and have some strategy to scale up that atomic operation to potentially petabyte scale or even larger. I don't think that there is a limit to the volume of data you can commit in a single transaction. The trade-off is how often you can do transactions rather than the actual size of the transaction itself. So what we do is we start with a simple tree structure. The Hive format that we were using before had a tree structure as well. You had data files as the leaves, directories as nodes, and then a database of directories essentially. So you had this multi-level structure. And what you would do is you'd select the directories you need through additional filters and then go list those directories to get the data files and then you'd scan all of those data files. We wanted something where you didn't have to list directories because in an object store that is really, really bad. In fact, at the beginning it was eventually consistent and so we just had massive problems with correctness that we had to solve. In fact, S3 Guard actually came out of some of the earlier work on our team before I was there. That's a fun little tidbit. What we wanted to do with Iceberg was mimic the same structure, but we wanted to have essentially files that track all of the leaves in this tree structure. We started with files at the bottom of the tree and the leaves and then you have nodes called manifests that list those data files and then you have a manifest list that corresponds to a different version of that and that just says, hey, these four manifests, those make up my table and then you can make changes to that tree structure over time efficiently. You're only rewriting a very small portion of that tree structure and it all sort of rolls up into this one central table level metadata file that stores essentially 100% of the source of truth metadata about that table. You've got metadata file, manifest list for every known version, manifests that are shared across all of those versions, and then finally data files that are also shared across versions. The very basic commit mechanism here is just pointing from one tree root to another. You can make any change you want in this giant tree, write it all out, and then ask the catalog to swap. I'm going from V1 to V2 or V2 to V3 and if two people ask to swap, replace V2 at the same time, one of them succeeds and one of them fails. And that gives us this linear history and is essentially the basis for both isolation, keeping readers and writers from seeing intermediate states as well as the atomicity. Sudip: Got it. And I imagine this architecture is also what makes time travel possible for Iceberg and just for listeners, the way I think about time travel is it allows you to roll back to prior versions if you want to correct issues in case of errors with data processing or even help data scientists to recreate historical versions of their analysis. Can you help us understand a bit more how time travel has been made possible for Iceberg? Ryan: Yeah, absolutely. So any given Iceberg table tracks multiple versions of the table, which we call snapshots. And because you have multiple versions of the table at the same time, that's one of the ways we isolate things. So if I commit a new version, say, snapshot 4 in the table, and you're reading snapshot 3, I can't just go clean up snapshot 3 and all the files that are no longer referenced. I wouldn't want to, but it might also mess up someone who's reading that table. So we keep them around for some period of time. By default, it's five days. And our thinking there is just enough time for you to have a four-day weekend and a problem on Friday, and you can still fix it on Tuesday and know what happened. So we keep them around for five days, and that actually means that by having that isolation, we have the ability to go back in time and say, okay, well, what was the state of the table five days ago? And we've iterated on that quite a bit to bring into use cases that we're talking about. So now you can tag, which is essentially to name one of those snapshots and say, hey, I'm going to keep this around as Q3 2023. And this is the version of the table I use for our, say, audited financials or to train a model. And you want to keep that for a lot longer. So you can set retention policies on tags, and it just keeps that version of the table around. And as long as the bulk of the table is the same between that version and the current version, it costs you very little to keep all that data around. Sudip I know there's no typical scenario, but in general, if you compare the metadata versus the actual data, what is the overhead typically in a snapshot of the metadata you store? What is the volume of data to metadata? Ryan: That sort of ratio? It depends. So wider tables, more columns are going to stack up more metadata. I would say that there's generally a 1000x or three or four orders of magnitude difference between the amount of metadata and the amount of data in a table. Now that differs based on how large your files are, because you can really push that. When we collect file range metadata for each column so that we can at a very fine granularity, this is another way that we improve on the Hive table standard, where you had to read everything in a directory. We can actually go and filter down based on column ranges. You can really push that. So you can have column ranges that are effective, but then within those column or data files, you can have further structures that allow you to skip a lot of data. And so where your trade-off is there is a little fuzzy. You can have lots of data files in a table, or you can compact them and have just a few data files in a table. It kind of depends on the use case which way you would go. Sudip: That makes sense. You have that knob to turn depending on how granular you want to go. I want to talk a little bit about one really striking feature of Iceberg, which is you don't really have any storage system dependency. You could go from S3 to Azure to something else. I'm curious a little bit how you guys have achieved that. It's like, you know, on the Snowflake side, you have separation of compute and storage. In your case, again, you kind of broke away from any dependency on the storage system. Can you tell us a little bit more about how you're able to achieve that? Ryan: I think when we started this project, people had been trying to remove Hadoop from various libraries in the Hadoop ecosystem for a very, very long time. And we actually still have a dependency on Hadoop. You can use libraries, but you don't have to. What we wanted to do was lean into the fact that everyone was moving towards object stores instead of file systems. Hadoop is a file system, and it has things like directory structures and all of that, which led us to this listing directories for our table representation and some of these anti-patterns that we're trying to get rid of with the modern table formats. What we wanted to do was have a simpler abstraction than a file system and just go with what the object store itself could do. No renames, no listing, no directory operations. We just wanted simple put and get operations. We designed Iceberg itself to have no dependencies on these expensive, silly operations. We did that based on our experience actually working with S3 as a Hadoop file system. We had our own implementation of the S3 file system, and we kept turning off guarantees to make it faster. For example, if you go to read a file in S3, like S3A, the Hadoop implementation that talks to S3, you'll actually go and list it as though it were a prefix. The reason why is you want to make sure that it's not a directory. It could be a directory if there are files that have that file name slash and then some other name. We said, well, that's silly. We always want this to go and read it as a file. We don't want that extra round trip going to make sure, hey, this isn't a directory, is it? We were able to really squeeze some extra performance out of our I.O. abstraction by letting go of this idea that it should function and behave like a file system. Let's embrace S3 for what it is. It's an object store, and let's use it properly. Sudip: That's super interesting. Thank you for sharing that. And if I may step back a little bit, when you think about your Apache Iceberg users who are really getting value out of it, what do you think are the top, maybe three features they are particularly excited about? What really makes them move from whatever they are using today to Iceberg? Ryan: Time travel is one of the big ones. And the others, I don't know that I would really consider them features. Granted, from a project perspective, they're features. But I've always talked to people about Iceberg by opening the conversation with, I don't want you to care about Iceberg. Iceberg is a way of making files in an object store appear like a table. And you should care about tables. You should be able to add, drop, rename, reorder columns. And you shouldn't care about things beyond that. So Iceberg, the purpose of it is to restore this table abstraction that we had hopelessly broken in the Hadoop days. If you talk about core features that are super important for Iceberg, it's stuff like being able to make reliable modifications to your tables. Having that isolation between reads and writes, and knowing that as I'm changing a table, you're not going to get bogus results. It is being able to forget about what's underneath the table. In the Hadoop days, you had to know, okay, am I using CSV? Or am I using JSON? Or am I using Parquet? Because all three have different capabilities when it comes to changing their schema. It's all ridiculous, right? CSV, no deleting columns. But you can rename them. Parquet, you can't rename, but you can drop columns. Those sorts of things, they're just paper cuts. One of the things I'm most proud of with Iceberg is that we got rid of all the paper cuts. It just works. It doesn't always give you exactly what you would want because it's a never-ending process to make it better, but it just works all the time. That's what I hear from people who are moving over to it, that it's just like, hey, I didn't have to think about this. The defaults are, for the most part, really good. It's dialed in. It just works, and I don't have to think about these challenges anymore. Those challenges are actually very significant. Things like hidden partitioning. I talked about restoring the table abstraction to the SQL standard in 1992. Partitioning in the Hive-like table formats, they all require additional filters because you're taking some data property that you like, like timestamp. When did this event happen? You're chunking the data up into hours or days along that dimension and storing the data so that you can find it more easily. But in Hive-like formats, that's a manual process, which is ridiculous. Especially if you switch the timestamp mechanism, too. If you want to go from hours to days, that is, again, ridiculous. You can't do that at all. In order to switch the partitioning of a table in Hive or Hive-like tables, you have to completely rewrite the data and all the queries that touched that table. Because it's all manual, the queries are tied to the table structure itself. You have a column that corresponds to a directory and you have to filter by that column or else you scan everything. It's incredibly error-prone. On the right side, you have to remember, what time zone am I using to derive the date? Right? Because that's different. Hey, which date column did I use for this? What date format, potentially, if you're going to a string, did I use? And if you get any of those things wrong, and you just mis-categorize data, you might not see that. The users might not see that. It's just wrong. We've got an example that we did with Starburst. It's a data engineering example where you go and you connect Tabular and Starburst together and you run through this New York City Taxi dataset example. I love this example because you grab a month's worth of New York City Taxi data and you put it into a directory and you read it and it's like stuff from 2008 even though it's 2003 in the dataset. You're like, where did all this extra data come from? And that's exactly what you're doing with manual partitioning all the time. Someone said, hey, this is the data for this month, and you just trusted it and you put it there, and now when you actually go look at it, you find stuff from like 10 years in the future. It's a massive problem. And on the read side, you don't know if they put the data in correctly. You don't know if you're querying it correctly. You might not even query it correctly. You can get different results if you just say, hey, I want timestamps between A and B, or if you say, I want timestamps between A and B but make sure I'm only reading the correct days. You can actually get different results because of where data was misplaced. It's a huge problem. Again, in Iceberg, we just said, hey, let's just handle that. Let's keep track of the relationship between timestamps and how you want the data laid out. Let's have a very, very clear definition of how to go between the two and then be able to take filters and say, oh, you want this timestamp range? Well, I know the data files you're going to need for that. So we do a much better job of just keeping track of data and eliminating those data engineering errors. So you're kind of completely upstarting away everything that data engineers had to do to just keep the data consistent, transactionally valid, all of that. You're kind of taking all of that away and just exposing the table formats to the different engines. We take inspiration from SQL. SQL is declarative. Tell me what you want, not how to get there. And we've really compromised that abstraction in the Hadoop landscape. We've said, oh, in order to cluster the data effectively for reading, you're going to want to sort your data before you write it into the table. Which is kind of crazy, because we don't actually sort it because we want it sorted. We sort it because we want it clustered and written a certain way, and the engine happens to do that if we write it that way. It's a really backwards way of thinking. And in Iceberg, we've actually gone and said, okay, this is the sort order I want on a table. This is the structure I want on a table. We've made all of these things declarative. And then, we've gone back through, and this has been a process of years, we've gone to make the engines respect those things. So, Spark and Trino will take a look at the table configuration and say, oh, okay, I know how to produce data into this table for the downstream consumption after the fact. And you shouldn't have to figure out how to fiddle with your SQL query in order to get Trino to do the right thing. And the same thing in Spark. Everything should just respect those settings. And we're just taking it back to a declarative approach, which is the basis of databases in the first place. I want to touch on one other thing, which is what does all of this mean for query performances? I constantly hear that people who are using Iceberg, they're really happy about the performance they get. Sudip: Can you talk a little bit about how that is possible, given of course you are doing a lot of the work under the hood, which is very complex. Ryan: Yeah. There are a number of dimensions to this. The first thing that we did was we just made job planning faster. Our early, early presentations on Iceberg were like, hey, we had a job that took us four hours to plan. And that was just listing directories. And sure, we took that down to one hour because we were able to list directories in S3 in parallel. And then we said, okay, well, what if we actually made this tree structure and we used metadata files to track the data files for that query? And we get it down to something ridiculous. And then we said, okay, well, what if we then add more metadata? Because we're not relying on directory listing that only gives us this file exists. What if we kept more metadata about that file? What if we knew when it was added to the table and what column ranges were in that file? And things like that. Well, we could then use those statistics to further prune. And so we got this thing that used to not even complete and took four hours just to plan. And we got that down to 45 seconds to plan and run the entire thing because we were able to basically narrow down to just the data that we needed. So, you know, there are success stories like that. We also have a success story where we replaced an Elasticsearch cluster that was multiple millions of dollars per year with an Iceberg table because we were just looking up based on, you know, the one key. Now, ElasticSearch does a lot more than Iceberg, and I don't want to misrepresent there, but having an online system that is constantly indexing by just one thing, you can just use an Iceberg table and its primary index because Iceberg has a multi-level index of data and metadata. It's another reason for that tree that we were talking about earlier. So there's that aspect. Can we find the data faster and things like that? Iceberg also has rich metadata provided by this tree that is actually accurate because you know exactly the files you're going to read, which is a far better input to cost based optimization and that sort of thing. So we just keep accumulating these dimensions. The thing that we've been exploring the most lately is actually unrelated to really better or more metadata about the data. Granted, we are going into, can we keep NDV statistics and things that are better for cost based optimizers and those sorts of things. But what we've been finding out at my company, Tabular, is that automated optimization, which is like background rewrites of data, is amazingly impactful. So let me go into a little of why that's unlocked by Iceberg. In the before time in Hive table formats, you couldn't maintain data. You had to write it the correct way the first time because you couldn't make atomic changes to that data. Whenever I touch data, your query might give you the wrong answer. The solution across your organization is never touch data. Write it once and forget about it. That is, unfortunately, the status quo in most organizations. Write it once, try and do a good job, but don't worry about it after that. And that try and do a good job means only the most important tables actually get someone's attention. If you've got a table where there are a thousand tiny files, well, a thousand tiny files isn't really enough to worry about. It's when we get to hundreds of thousands of tiny files that we start caring. There's this huge, long tale of just performance problems. It's not awful, but it's not great. Automated optimization is allowing us to basically fix things after the fact. The fact that we fixed atomicity, the fact that we can make changes to our table, enable us to combine with the declarative approach I was talking about earlier and actually have processes that go through and look for anti-patterns and problems and fix them. Automated compaction is one where you might have a ton of overhead just because you have tiny files and that means you have more metadata in the table and everything is generally slower and it's not great. If you just have an automated service going and looking for all those tables with one partition with a thousand files in it that are all 5k, turns out you can really save a lot on your storage costs and your compute costs and all of those things and it's cheap. It's automated so it's not like you're spending an entire person doing that. This is not a problem that data engineers have time to care about, but it has a tremendous value to the organization in terms of just everything functions more smoothly. Sudip: So you can basically go back and clean up all those data stores without putting any engineers behind it. Ryan: Yeah, exactly. I mentioned combining this with the declarative approach. What Iceberg has done is it's moved a lot of tuning settings to the table. So we can tune things like compression codec, compression codec level, row group size, page size, different parquet data properties. We can go and figure out the ideal settings for just that data set and then you can have some automated process go and use those settings. So there's a ton of data out there sitting in snappy compression that's probably four or five years old, something like that. What if you had a process that could go turn that into the LZ4 or something safely and cut down your AWS S3 bill by half? That's the kind of impact that we're talking about and while you're cutting down the size of your data by 50%, that also generally makes your queries 50% faster. That is pretty amazing. Anyone with historical data would sign up for that in a heartbeat, right? Because we all are storing data that we never go back and compress and yet pay AWS for. Let's hope. That's kind of where Tabular is building a business. So yeah, I hope so. Sudip: That's fantastic! I want to shift gears a little bit and talk a bit about what are the other alternatives to Iceberg? There is, of course, Delta Lake, Apache Hudi that come up. How do you think about those solutions versus Iceberg? What are the pros and cons versus Iceberg that you can think about? Ryan: I think most people are standardizing on Iceberg these days. I have theories for that, but one of the biggest impacts is the reach of Iceberg into proprietary engines. That's largely been the past year, year and a half, that you've seen real big investments from companies like Snowflake and Google. Redshift announced support. Cloudera is moving basically all of their customers onto Iceberg. The broad commercial adoption for Iceberg in particular I think is one of the biggest selling points. If you're thinking about this space, either as a customer or as a vendor, customers have this extra concern of what vendors support the thing. But vendors and customers have two things in common. The first is really is there a technical fit? Is the technical foundation there? For formats like Hudi and Hive that use Hive partitioning, they don't support schema evolution, they basically don't support the ACID transactions either. Hudi's gone a long way towards getting close to ACID transactions, but it's still not quite there. I fired it up on a laptop last year and was able to drop data in a streaming job, duplicate rows and all sorts of things. So it's just not quite there. So I think Iceberg has that technical foundation, that commit protocol that we were talking about in the beginning. It's the same approach that Delta uses, by the way. That is really solid and reliable. So I think that that's one thing that both vendors and early adopters are looking to. Another technical foundation one is just portability. Can you actually build in another language and support all of the table formats properties? So this is the classic example here is Bucketing and Hive used Java native hashcodes and reimplementing Java native hashing on some other system is not going to happen. So you basically had this whole set of things that was not portable in Java. Hudi has many similar issues there and I don't think that it's going to be something that can be ported to C++ or Python or some of these other languages that you really want to fold into the group. So I think the distinguishing factor between Delta and Iceberg and Hudi is that that technical foundation is just not there. On the other side, I think you have the openness. So assuming that that technical foundation is there, is this a project that people can invest in? Can you become a committer? Can someone from Snowflake become a committer? And that really influences vendors' decisions because it's a very tough spot to support a data format that is wholly controlled by your competitor. And I think that that's why most vendors have chosen to recommend Iceberg because Iceberg has that technical foundation. It's also a very neutral product. It was designed to be that neutral from the start. Even before we realized that independence and neutrality in this space was going to be absolutely critical, again, we didn't think of being able to share data between Databricks and Snowflake. We thought, we want Snowflake to be able to query our data at Netflix. We didn't think about the dynamics of these two giants needing to share a dataset. And do you trust Databricks to expose data to Snowflake or Snowflake to expose data to Databricks? Do you think that that is performant? Do you trust it? There are a lot of issues there. I'm not making any accusations with those two in particular. All I'm saying is you want to know that your format is neutral and puts everyone on a level playing field. And especially if you're a vendor, you want to know that you're on a level playing field with other vendors. Sudip: Absolutely, and I think that comment of openness applies in particular to Delta. That's where I think both vendors and customers have their concerns around. Let me ask you this question. Do you imagine you guys would ever build an engine yourselves since you own the table format, you probably could build a pretty optimized engine. Is that ever a plan? Ryan: Let me correct you there because we don't own the table format. We contribute to the community, but we participate on the same terms as everyone else. And that's actually a really, really important distinction for us because we don't want to turn into a closed community. The adoption of Iceberg has been fueled by the neutrality of the community itself and the ability for all of these different vendors and large tech companies to invest and know that they have a say in what's going on. So I'm always very careful not to say Tabular is not the company behind Iceberg. We're contributors to Iceberg and we know quite a bit about it. I'm a PMC member but I never want to do anything to compromise that neutrality itself. But getting back to your question, are we going to build a compute engine? I don't think so because I think that there is a real opportunity here in independent storage. The vision that I have for our platform as Tabular is to be that neutral vendor. To be the person looking at your query logs coming in from some engine and saying, hey, if we clustered your data this way, it would save you a whole bunch of money. I don't think that we've ever had the incentive structure to do that because we've never had shared data. So the database industry for the last 30 years has lived under the Oracle model. And Oracle I think is just the most prominent person here. But we sell you compute, we sell you storage, and those two things are inseparable. The fact that all of your data is in this silo sort of keeps you coming back for more compute, for upgrades, for the contracts, and it sort of locks you in. And I don't think that that is unnatural. It's just been the reality for databases for 30 years. Now, the shared storage model is overturning that. And so I think there is a really huge opportunity here for us to rebuild the database industry where you don't automatically get those compute dollars simply because someone put data into your database, or wrote with your database. And so I think that it's actually structurally important for the market to try for this separation, for independent storage that tries to make your data work as well as possible with all of the different uses out there. We don't want anyone tipping the scales and saying, you know, this table is going to perform better if you use our streaming solution or our SQL solution. I think you want an independent vendor sitting there that is looking out for your interests as a customer and really helping you find those inefficiencies. Now, I know that some vendors do a great job of making things faster all the time and things like that. I just don't think that the incentives are quite aligned there. And in fact, throughout the majority of the analytic database history, you have largely been responsible for finding your own efficiencies. Oh, you need to put an index on this column. You need to think about clustering your data this way. And you hear horror stories where, oh, we didn't have the data clustered right, and we were spending $5 million too much a year on this use case. I think it is the responsibility of independent storage vendors to find those things and to fix them for you. And so I really think that that could be a valuable piece of the database industry in the future, and that's where we're headed as a company. Not towards compute, but actually being a pure storage vendor. Sudip: And I'd say this probably has not been tried before at the layer that you are operating at. Ryan: Yes, at the lowest storage layer maybe, but not at the layer that really sits in between the engines and the storage layer. Definitely have not seen that. It hasn't been possible to share storage. And it's going to, by the way, have a huge impact on other areas. The example I always give is governance and access controls. Where if you can now share storage between two vastly different databases, it no longer makes any sense to secure the query layer instead of the data layer. And so we have all of these query engine vendors that have security in their query layer. Well, that does you no good if you have three other query layers, or if you have a Python process that's going to come use DuckDB and read that data. I mean, it's a very messy world, and I think we're just starting down this path of what does the analytic database space look like if you separate compute and storage? Sudip: Last question before I ask you a couple of lightning round questions. What is the future roadmap for Iceberg? And when I say future, I mean, let's say in the next 18 to 24 months. What are some of the big things on your mind that you want to tackle? Ryan: There are a few things. As always, getting better, improving performance, and things like that. A couple of the larger ones are multi-table transactions. So Iceberg is actually uniquely suited to be able to have multi-table transactions just like data warehouses. So we're going further and further along that path of data warehouse capabilities. We're also coming out with cross-engine views. So you'll at least be able to say, hey, here's a view for Trino, and here's the same view for Spark, and be able to have a single object in your catalog that functions as a view in both places. Now that you're sharing a table space across vastly different engines, we also need to have the rest of the database world. So views are another area where once you're thinking about having a common data layer, you need to expand that a little bit. You need to expand that to access controls across databases. You also need to expand that to views. Views are a really critical piece of how we model data transformation and different processes without necessarily materializing everything immediately. Views are another area. We're also working on the next revision of the format itself, which is going to include data and metadata encryption, which is a big one, as well as just getting closer and closer to the SQL standard in terms of schema evolution. Schema evolution is actually really critical, and this has been coming up lately in CDC discussions. CDC is a process where you're capturing changes, hence change data capture, from a transactional database, and then using that change log, you're keeping a mirror in your analytic space up to date. And if you can't keep up with the same changes from the transactional system, you can't really mirror that data. So schema evolution is extremely important here because if someone renames a column, and I already talked about if you're just using Parquet or something, you just can't rename columns. If someone renames a column in that upstream database, you need to be able to do the same thing. Iceberg is, I think, ahead of the game in that we handle most schema transitions, but we actually need more. We need more type promotions, we need the default value handling, things like that to really get to feature parity with the upstream systems. Now these are mostly edge cases, but it makes a really big difference when you're running a system at scale, and someone in a different org made a change to their production transactional database, and now you've got a week's worth of time to reload all that data. So we're working on filling out the spec there and adding some new types, stuff like timestamp with nanosecond precision, and then blob stores, I think, as well. Sudip: Wonderful. A lot to look forward to then in the next 12 to 18 months. This has been phenomenal, Ryan, so thank you so much for sharing the whole Iceberg story with our listeners. I want to end with one quick lightning round. We do it at the end of all of our podcasts. Just, you know, real quick questions. So the first one is what has already happened in your space that you thought would take much longer? Ryan: I have been fairly surprised at how rapidly large companies have not only adopted Iceberg, but really put money and effort behind it. AWS is building extensions into Lake Formation and Glue, which is a very significant investment. The work that Snowflake is doing is incredible. We're very happy to have seen that Databricks, even though they back Delta, has also added Iceberg support to their platform, basically acknowledging the fact that Iceberg is the open standard for data interchange. I think I've been pretty shocked there. I would definitely have thought it would take longer for companies to ramp up, but they seem to have really hit the ground running. Sudip: Fantastic. Second question. What do you think is the most interesting unsolved question in your space, again? Ryan: Independent versus tied storage. Sudip: Tell me more. What do you mean? Ryan: I have a blog post, The Case for Independent Storage, that summarizes my thoughts here that we've talked about. This is basically the rise of open table formats has created a world in which you can share data underneath Redshift and Snowflake and Databricks and all these commercial database engines. Everyone is rushing, like the answer to my last question, to support and own data in open storage formats. I think that a big question is going to be whether people trust a single vendor that they also use for compute to manage the storage that is used by other vendors. I think that that is probably the biggest question. I've gone all in. I've placed all my chips on the table. We think that it's going to be independent. That's a central strategic bet of Tabular as a company, is that people are going to want an independent storage layer that connects and treats all of these others the same and tries to represent your interests as a customer and get you the best performance in whatever engine is right for the task. Is that going to matter in the marketplace? Or is it going to be that people are perfectly happy having an integrated solution where I buy storage and compute from one vendor and I sort of add on another, say, streaming vendor that uses that same storage. Is that the model or is it going to be independent? I think that the transition is to independence, and that's because it destroys that vendor lock-in. Even if you think today that this vendor for compute is head and shoulders beyond the rest and is going to be amazing. What about in five years or in ten years? Are they going to get disrupted? Are you going to have to move that data? I'm very bullish on the case for independent storage, but whether it will matter to customers is not something I get to choose. Spark versus Hadoop is a classic analogy there. The compute engine is getting disrupted. Last question. What's one message you'd want everyone to remember today? I would just say go check out Iceberg. It can make your life a whole lot easier. That's what we wanted to do. Sudip: Absolutely. This has been the Engineers of Scale podcast. I hope you all had as much fun as I did. Make sure you subscribe to stay up to date on all our upcoming episodes and content. I am Sudip Chakrabarti, and I look forward to our next conversation. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit sudipchakrabarti.substack.com [https://sudipchakrabarti.substack.com?utm_medium=podcast&utm_campaign=CTA_1]

1. aug. 2024 - 57 min
episode From Spark to Databricks: Spark's Origins, Innovations, and What's Next - with Reynold Xin cover

From Spark to Databricks: Spark's Origins, Innovations, and What's Next - with Reynold Xin

In our Engineers of Scale podcast, we relive and celebrate the pivotal projects in Infrastructure Software that have changed the course of the industry. We interview the engineering “heroes” who had led those projects to tell us the insider’s story. For each such project, we go back in time and do an in-depth analysis of the project - historical context, technical breakthroughs, team, successes and learnings - to help the next generation of engineers learn from those transformational projects. We kicked off our first “season” with the topic of Data Engineering, covering the projects that defined and shaped the data infrastructure industry. In our previous episode [https://sudipchakrabarti.substack.com/p/when-hadoop-was-king-and-yahoo-was], we hosted Doug Cutting [https://www.linkedin.com/in/cutting/] and Mike Cafarella [https://www.linkedin.com/in/mikecafarella/] for a fascinating discussion on Hadoop. In this episode, we are incredibly fortunate to have Reynold Xin [https://www.linkedin.com/in/rxin/], co-creator of Apache Spark [https://en.wikipedia.org/wiki/Apache_Spark] and co-founder of Databricks [https://www.databricks.com/], share with us the fascinating origin story of Spark, why Spark gained unprecedented adoption in a very short time, the technical innovations that made Spark truly special, and the core conviction that has made Databricks the most successful Data+AI company. Timestamps * Introduction [00:00:00] * Origin story of Spark [00:03:12] * How Spark benefited from Hadoop [00:07:09] * How Spark leveraged RAM to monopolize large-scale data processing [00:09:27] * RDDs demystified [00:11:43] * Three reasons behind Spark’s amazing adoption [[00:21:47] * Technical breakthroughs that speeded up Spark 100x [00:27:05] * Streaming in Spark [00:31:13] * Balancing open source ethos with commercialization plans [00:37:45] * The core conviction behind Databricks [00:40:40] * Future of Spark in the Generative AI era [00:44:39] * Lightning round [00:49:39] Transcript Sudip: Welcome to the Engineers of Scale podcast. I am Sudip Chakrabarti, your host and partner at Decibel.VC, where we back technical founders building technical products. In this podcast, we interview legendary engineers who have created infrastructure software that have completely transformed and shaped the industry. We celebrate those heroes and their work, and give you an insider's view of the technical breakthroughs and learnings they have had building such pioneering products. So today, I have the great pleasure of welcoming Reynold Xin, co-founder of Databricks and co-creator of Spark to our podcast. Hey, Reynold, welcome. [00:00:37] Reynold: Hey, Sudip. [00:00:38] Sudip: Thank you so much for being on our podcast. Really appreciate it! [00:00:41] Reynold: Pleasure to be here. [00:00:42] Sudip: All right, we are going to talk a lot about Spark, the project that you created and is behind the company that is now Databricks. You went to University of Toronto, is that right? [00:00:54] Reynold: I did go to the University of Toronto, spent about five years in Canada, and then came to UC Berkeley for my PhD. So, been in the Bay Area for over, I think almost 15 years by now. [00:01:05] Sudip: What brought you to Berkeley specifically? [00:01:07] Reynold: It's an interesting point. So when I was considering where to pursue my PhD studies, I looked at all the usual suspects, the top schools, and one of the things that really attracted me to Berkeley - actually, it was two things. One is there's a very strong collaborative culture, in particular across disciplines, because in many PhD programs, the way it works in academic research is you have a PI, a principal investigator, a professor who leads a bunch of students, and they collaborate within that group. But one thing that's really unique about Berkeley is that they brought together all these different people from very different disciplines - machine learning, computer systems, databases - and have them all sit in one big open space and collaborate. So it led to a lot of research that was previously much more difficult to do because you really needed that cross discipline. The second part was Berkeley always had this DNA of building real-world systems. A lot of academic research kind of stop at publishing, but Berkeley sort of has had this tradition of going back to BSD, UNIX, Postgres, RAID, RISC, and all of that, actual systems that have a real-world industry impact. And that's really what attracted me. [00:02:20] Sudip: And obviously, that's what you guys did with Spark, too. [00:02:23] Reynold: Yeah, we tried to continue that tradition. So we didn't stop at just the papers. [00:02:27] Sudip: Yes, absolutely. And I think that's a very common criticism of a lot of academic work, right? Great in quality, but doesn't go that last mile to get to production or get to actual users. [00:02:40] Reynold: It's not necessarily a wrong thing either, just different approaches. You could argue, hey, let academia figure out the innovative ideas and validate them and have industry productize them. It's not necessarily the strength of academia to productize systems. To some extent, it's just different ways of doing things. [00:02:57] Sudip: Let me go back to when you guys started Spark, and this is circa 2009. I'm guessing you had just joined the PhD program, and this was still the AMP Lab - Algorithms, Machine, People Lab - right? [00:03:11] Reynold: Yeah. [00:03:12] Sudip: Which, of course, is now Sky Lab and was RISE Lab in between. Can you give us a little bit of an idea of what the motivations were to start the research behind Spark? I mean, this was the time when Hadoop was still king, right? Like, why do Spark? [00:03:27] Reynold: It was an interesting story. So Spark actually started technically the year before I showed up. By the time I showed up, there was this very early, early thing already. So Netflix back then in the 2000s, I think even a little bit before 2009, had this Netflix Prize, which is the competition they created in which they anonymized their movie rating datasets so anybody can participate in the competition to come up with better recommendation models for movies. And whoever can improve the baseline the most would get a million dollars. [00:03:58] Sudip: Yeah, I remember that. [00:04:01] Reynold: That was a big deal. Eventually, it was shut down for privacy reasons. I think maybe there were lawsuits that happened, but it was a big deal in computer science: and in the history of machine learning. And this particular PhD student, Lester, was really into this kind of competitions: and also a million dollars was a lot of money. [00:04:17] Sudip: Sure. For a grad student in particular, right? [00:04:20] Reynold: A grad student makes about $2,000 a month. So he tried to compete, and one thing that he noticed was that this dataset was much larger than the toy dataset he used to work with for academic research, and it wouldn't fit on his laptop anymore. So he needed something to scale out to be able to process all this data and implement machine learning algorithms. And one of the keys with machine learning is that you are not done when you come up with the first model. It is a continuously iterative process to improve it over time. The velocity of iteration is very important. And he tried Hadoop first because that was the unique thing - if you wanted to do distributed data processing, you used Hadoop back in 2009. And he realized it was horribly inefficient to run. Every single run takes minutes, and it was also horribly inefficient to write. So the productivity for iterating on the program itself was very difficult because the API was very complicated, it's very clunky. So he kind of walked down the aisle - the nice thing about having a giant open space with people from very different disciplines - and talked to Matei, who was also a PhD student back then, one of my fellow co-founders at Databricks. He said, hey, I have this challenge, and I think if you have those kind of primitives, I could really do my competition much faster. So Matei and him basically worked together over the weekend and came up with the very first version of Spark, which was only 600 lines of code. It was an extremely simple system that aimed to do two very simple things. One was a very, very simple, elegant API that exposed distributed data sets as if those were a single local data collection. And second, it could put or cache that data set in memory. So now you can repeatedly run computation on it, which is very important for machine learning because a lot of machine learning algorithms are iterative. So with those two primitives, they were able to make progress much faster for the Netflix Prize. And I think Lester's team even tied for the first place in terms of accuracy. [00:06:24] Sudip: Did he get the money? [00:06:24] Reynold: He did not get the money because their team were 20 minutes late in the submission. So they lost a million dollars for a 20-minute difference. So if Matei had worked a little bit harder and had Spark maybe 20 minutes earlier, Lester might have been a million dollar richer. [00:06:44] Sudip: That's such an amazing story. Wow, I actually did not know that! [00:06:48] Reynold: So when Spark started from research, it kind of started for a competition and really just the collaborative open space and the opportunity that all of those people just happened to be there at the right time led its very, very first version. Now, obviously Spark today looks very, very different from what the original 600 lines of code was. But that's how it got started. [00:07:09] Sudip: One question I have since you touched on Hadoop, do you feel that Spark benefited from Hadoop being already there? Did you guys use some components of the Hadoop ecosystem? Like for example, if Hadoop hadn't existed, do you think Spark could have still been created? [00:07:23] Reynold: Spark definitely benefited enormously from Hadoop early on. There are also baggages that we carry from Hadoop that up until today are still there. It's definitely benefited massively. The first example was that Hadoop solved the storage problem for Spark. And as a result, Spark never had to deal with storage. Spark more or less considered storage as a commodity. To a large extent, organization were able to store a large amount of data reliably and cheaply, which was key. [00:07:56] Sudip: And this is HDFS in particular? [00:07:59] Reynold: HDFS, yeah. And later on HDFS largely faded and got replaced by object stores. But Spark never had to worry about, hey, how do you store a large amount of data? And that was very, very important. And Spark piggybacked onto the Hadoop deployments. All the initial deployment of Spark were sort of onto Hadoop clusters themselves. So the existence of those clusters made it easier because if the hardware resources were not there, that would also have been very problematic for any user of large-scale data processing systems. And Spark leveraged a lot of the Hadoop code itself, especially the storage layer, retrieval as well as the data formats. So Spark definitely benefited enormously from it. At the same time, it's a lot of baggage, unfortunately. There's no free lunch. [00:08:45] Sudip: Right, right. And we actually had Doug Cutting and Mike Cafarella, on the previous episode here, and Doug was talking about how he fully anticipated Hadoop eventually evolving and being replaced by a more advanced system. So it's sort of generations of advancement that happened. One of the key game-changers behind Spark at a very high level was Spark's use of RAM, the memory, keeping data in memory. Can you talk a little bit about how Spark uses memory and what are the benefits? Obviously, there are benefits in speed, iterative computation, but can you talk a little bit about that? And then one other question I'll ask at the same time is why was it so difficult for Hadoop to maybe adopt memory? [00:09:27] Reynold: This is actually a very complicated topic, but let's try to maybe explain it in just a few minutes. There are two places where Spark uses memory in a fairly clever way that Hadoop didn't do and those really led to the dramatic improvement. One is the ability to simply keep the data in memory. And as we said earlier, that's very important for any kind of iterative computation in which you want to repeatedly scan the same data. So it was critical for machine learning workloads. But at the same time, it's also the same primitive that's very useful for any interactive data science because you're often looking at the same data sets over and over and over again when you're doing interactive data science. Those actually happened to be the first maybe two killer use cases of Spark. The second place, it's a little bit less about the direct use of memory, but because Spark exposes fundamentally much higher-level abstraction compared with Hadoop. Hadoop is very simple - it's Map, Shuffle, Reduce - MapReduce. It's a very simple paradigm. There's no concept of joins and no concept of filters. You basically create filters yourself and map it back to Map and Reduce. Spark exposes a higher-level abstraction that has the concept of filters, joins, code group, and all of this. And as a result, it can train a more complex computation by DAGs, direct cyclic graphs, of tasks. And as part of it, it knows, for example, hey, if you're running a Map right after a filter, you don't have to persist the output of filter onto disk or onto HDFS and then read it back in. As a matter of fact, it can just stream through them. So this particular optimization now removes the need for data to go to disk repeatedly in a larger computation. But it's a little bit less about just memory. It has to do with a combination of both, hey, let's just pass through data, stream through data and memory, as well as having the ability to express that more complex computation diagram. [00:11:23] Sudip: So if you had like three stages in that DAG, Hadoop would’ve returned the results after every stage to the disk and read it back, whereas Spark can keep it in memory because it knows that DAG? [00:11:36] Reynold: It knows it's the same thing. I mean, it has to do with having the completeness of the computation rather than only Map and Reduce. [00:11:43] Sudip: Right. And then one of the primary concepts behind Spark is what is called RDD, Resilient Distributed Datasets. Can you talk a little bit about that for someone who might not be familiar? [00:11:54] Reynold: I think maybe from two perspectives, one's from a system perspective, the other one's from the user's perspective. And I think the user's perspective probably should go first. The brilliant thing about RDD is that distributed computation used to be super difficult. And if you think about message passing, Hadoop made it slightly simpler to have the MapReduce concept. RDD took it much further and basically said, hey, if you have a large dataset, the way you program large datasets should just be like how you program a collection of data in memory on a single node. If you were to write a Java program, a Scala program, a Python program, everybody knows what a list is, an array is. They're all collections and there are ways to transform collections. Maybe, the way we should be programming large datasets should be identical to that. And that really means the API now for programming a distributed program against a large amount of data is as if you're just manipulating some data that's on a single node. So that dramatically decreased the complexity in the API surface. Now, the second big innovation in RDD is this. One of the big things with distributed computation is, hey, you have all those machines, you might have thousands of machines that could fail. How do you deal with failures? So RDD, while the user-facing API exposes just a bunch of collections, internally, it creates a lineage for every collection. So it doesn't literally materialize when you say, hey, this collection is just a filter on the previous one. Instead of materializing the collection, it is actually lazy. It just tracks, hey, this collection is simply formed through doing a filter on the previous collection. So it gives you that lineage of how you create the datasets. And again, when you really need the result, for example, you want to output the data, you want to get back how many rows there are, it would trigger this computation graph. And if there's a failure on any of the machines, it runs something very simple - it analyzes the computation graph and checks, hey, so what are the downstream, upstream dependencies? If that node fails, I just need to reproduce the data on that node. So it creates a minimum plan to reproduce the partial dataset on that node to handle failures. And it does that all gracefully, without the user having to know anything about it. So that's really the two brilliant parts of RDD. The first is, it creates a new programming paradigm for distributed data processing, which makes you program basically single-node collections. And the second is all the underlying system techniques to make fault recovery work really well. [00:14:29] Sudip: Got it. And then in 2013, you guys introduced DataFrame, and then in 2015, you introduced Datasets. So what are those APIs and how do they connect or relate to RDDs? [00:14:43] Reynold: So in 2014, we introduced DataFrames. I remember because in 2015, at Strata conference, which is a big conference, I gave a talk about DataFrame GA. After we started Databricks and we started working a lot more closely with the Spark users. I mean, we always work very closely with Spark users but after the company started, now we no longer had the academic research to worry about. [00:15:08] Sudip: No paper to write. [00:15:10] Reynold: No paper to write. No exams to take. No courses to take. And then we realized at some point that, even though there's a lot of unstructured data and semi-structured data out there, at some point people introduced structure onto their data. Structure could be, hey, here's an array of floating point numbers for my machine learning vectors. Structure could be, here's a column called email description, which is a pile of text, right? Those are all structures. Probably like 95% of the programs become some sort of loosely defined structured programs. And the collection of data, which while it's a very powerful abstraction, is still not high enough for structured programming. And that involves also a lot of user-defined functions. Like imagine if you're programming in Python, you want to traverse a list, you want to do something to it. You write a lot of code to say, for example, let's do a for loop across the list. And then for each of the element, you try to compare it to say the number one. If it's number one or it's greater than one, you keep it. If it's less than one, you ignore it. There's a lot of code you're writing, expressing in Python. The problem with that code is that the system cannot optimize it because it is Python code. It is Python code with very strong Python-specific semantics that we can't do anything about. But we do know that often people are just doing very basic comparisons. They're doing very basic expressions that exist in a more structured context. So the reason we created DataFrame was twofold. One is we want to raise the level of structure in the API even higher that makes structured programs easier to write. So users have to write less code. The second is we want to be able to capture more and more of the semantics of the computation. So instead of the user writing a lot of user code in Python, they will just express what they want to do in the DSL in Python still. But this time it's the DSL that tells us the semantics of all those computations. And then we can optimize that under the hood. As a matter of fact, we did. Before Spark 2.0, Python was probably 5 to 10 times slower than any JVM language on Scala and Java for Spark users. And even today, if you Google or ask ChatGPT, it's very likely ChatGPT will tell you that if you use Scala, you get better performance with Spark. And the reason for that is not because Spark itself behaves very differently. Simply because if you had used Python before, we had to run a lot of your code in Python, and the Python code would inherently be slower. But with DataFrame, we're able to capture the semantic information and actually generate an execution plan that, regardless of what language you use, will be the same execution plan and we'll optimize it and we'll make it run faster and faster over time. And that was incredibly powerful. And it basically got Python to have exactly the same performance as the JVM languages. So it's a big deal because these days probably 80% of the Spark users use Python. [00:18:27] Sudip: Right. I mean, it's a language for data scientists. [00:18:30] Reynold: Yeah, exactly. And then the Dataset API came, although I actually would not recommend people to use the Dataset API. In hindsight, I would not have created the Dataset API. [00:18:40] Sudip: I see. [00:18:41] Reynold: So Python is weakly typed, dynamically typed. Scala is statically typed. And a lot of Scala programmers really love typed information because that’s powerful. One of the things with DataFrame is that it becomes dynamically typed. So even when we declare DataFrame in Scala, it doesn't actually know what exactly is the type of that DataFrame, what are the columns in it. All of those are dynamically generated. So Dataset was our attempt and I think we did our best job out there to create a statically typed program that allows you to bind in runtime but give you that compile-time safety throughout the program so you would know, hey, this is actually a Dataset of, for example, a student. And student is a class with all these fields. And we give you the compile-time safety, but with the final validation that when you read this data in, the first time you create a Dataset of student, we actually run the validation to see all the data actually have all these fields you need and how does it map to the student class. The reason I said maybe in hindsight, I would not have created it is, as it turned out, it only served a very small percentage of users. And second, it's extremely complicated. The most complicated code in Spark are one, the scheduler - schedulers are always complicated. Second is all this type mapping and static sort of type, dynamic type binding stuff. Those codes are super hairy and very error-prone and very few people know how to change it. So a lot of investment for high technical complexity for very small percentage of users. It is a very cool idea though in abstract. [00:20:21] Sudip: That's a fascinating story. Talking about programming language, one quick question I had was, the choice of programming language for Spark was Scala. Why was that? [00:20:31] Reynold: Every month some new engineer joins Databricks and asks, why do we use Scala? It was actually an easy choice back then. In 2010-2009, because the Hadoop ecosystem was in Java, in order to be able to leverage most of the Hadoop ecosystem, Spark needed to be in the JVM. At the same time, we really wanted something that could be interactive. We felt that the interactive experience was extremely important. Java was not interactive. Even today, Java is not interactive. There's no way you just hand-type Java without an IDE. It's not concise. It is super verbose. So Scala was a language that was identified as basically much more concise than Java with a repo that we could actually hack. So now somebody could just interactively in command line start issuing a one-line Spark code and run across like a thousand machines in the cloud. And there was no other language that would fit the requirements of being in JVM and being interactive [00:21:33] Sudip: That actually clears up a big question for me I've always had, why Scala? [00:21:39] Reynold: I think Spark and Scala had a coexistent relationship and really a lot of Spark users came to Scala because of Spark. [00:21:47] Sudip: Yeah, exactly. A lot of people learned Scala, including myself, just to use Spark. Now, shifting gears a little bit, Reynold, one of the things that obviously we all have witnessed over the last 10 years is the amazing adoption of Spark. Now that you are looking back and it's 2023 now, looking back at the last 10-12 years of Spark history, if you were to pick, let's say, three factors that you think were the most important in driving this adoption of Spark, what do you think those would be? [00:22:13] Reynold: The first and foremost would be, I think, the focus on the end user. Everything I talked about so far I always started with a level of abstraction to make it much simpler to program. These days I don't need to evangelize Spark anymore because it's everywhere. But one thing I often say is that, many of you will come because of the performance improvement. Because you've heard that you can make your program 10 or 100 times faster than Hadoop, but you’ll really stay for the programmability. You'll never go back to a MapReduce program once you program Spark. We didn't end with just the RDD API. We have continued pushing the boundary, introduce streaming APIs, introduce DataFrames and all this. And they're all about how do we help the end users to be a lot more productive, make the APIs more expressive for the tasks they are supposed to do. And that focus on the end user programmability is key. The second one would be a culture of innovation which is not surprising because a bunch of us came from academia and really wanted to apply bleeding-edge ideas to a real-world system and see how we could improve it. So Spark brought a lot of innovative ideas that were never ever done in other systems or only done in very niche systems. But, Spark brought those to the masses. The third one I would pick is unlike Hadoop and many other systems in the past, Spark had a batteries-included approach, which means, hey, what are the most common things people want to do? Let's make it doable out of the box, instead of having another extension framework or project you have to go download, install and configure in the system. We have had, for the longest time, all the popular data types. You could just use the built-in APIs to read them, like popular data types for distributed computation, not necessarily for local stuff because the focus was on large-scale data sets. And we added the whole machine learning library to Spark if you want to run logistic regressions - just run it out of the box, you have it. [00:24:17] Sudip: And you are talking about MLLib here. [00:24:19] Reynold: Exactly. All of this made Spark much more powerful because one of the big pain points for a lot of users is having to configure a system with dependencies and app frameworks and extensions. Whereas in Spark, you install it, and now you have all of this power. [00:24:37] Sudip: And going back to the usability piece, I mean, I like how you put it, like, come for performance and stay for usability, right? And going back to the usability piece, I believe one of those driving factors was when you guys added SQL support. You, in particular, I believe, were behind the original Shark project and then the Spark SQL project. I'm just curious, at a high level, was there any particular technical challenge that you had to resolve to bring SQL in a distributed data processing system? [00:25:09] Reynold: Over the years we have done it, we've been redoing it. Actually, what Shark referred to is actually what I did during my PhD before Databricks even started. The way Shark was architected was that it basically took Hive, which is the SQL Hadoop system. We took the physical client generated by Hive and converted it into a Spark program. And then we were able to run SQL somewhere between 10 to 100 times faster than what Hive would be able to do back then. One of the main challenges there was really the Hive codebase, which was potentially the most spaghetti codebase I've ever seen. I'm hoping I'm not offending anybody here and it probably had nothing to do with the creator. I think it's just because it was created at Facebook for a specific set of use cases and then it suddenly got insanely popular. A lot of use cases and different requirements got piled onto it and people added a lot of code very quickly. It was so bad that when Databricks first started, in the first year one of our founding engineers actually quit after working on this codebase for a few months and that was the sole reason given. And up until this day when I talk to him he's like, yeah, honestly, that was the real reason. That codebase was so difficult to work with. So when Michael Armbrust came to Databricks in early 2014, initially he was actually hired to build a query optimizer for large-scale data. At some point I walked up to him and said, hey, you should kill this thing I created. This nine-headed monster is impossible to deal with and the code is too difficult to maintain. I think if we start from scratch and build it from scratch and have nothing to do with Hive, it would be substantially simpler. It would be much easier to actually evolve. And he did. And then that became Spark SQL which is honestly much easier to set up, much easier to maintain, much easier to iterate on. [00:26:55] Sudip: So you basically ended up killing your own creation. [00:26:57] Reynold: Yeah, a lot of people didn't expect that but I was jumping over it. Some people were like, are you happy that now it happened? I'm like, yeah, that's incredible that it happened. [00:27:05] Sudip: That's so funny. And then going back a few years, talking about the other thing that kind of led to adoption of Spark, I remember back in, I think 2015 or somewhere around that, you guys made a number of key performance improvements to Spark Core. Can you talk a little bit about what those improvements were and what were some of the major architectural changes you guys had done? [00:27:31] Reynold: So 2015-2016 was a big year in terms of improvement and that's when we released Spark 2.0. And the claim at the time was Spark 2.0 was an order of magnitude, 10 times faster than Spark 1. It might not be exactly 10 times for every workload, but it was dramatically faster. And a lot of it had to do, first of all, with the DataFrame API. We talked about DataFrame API that gives an even higher level abstraction. Now we convey semantic information so we can optimize under the hood. And we fully leveraged that in Spark 2.0. So Spark 1 released the DataFrame API, raised the level of abstraction, and enabled us to do the optimization in Spark 2.0. And so those optimizations basically boiled down to two big things. One is we hyper-optimized for Parquet and actually created the first vectorized Parquet reader. And that led to the use of Parquet itself as the columnar format, and Parquet MR at the time was basically implemented row-wise. It was not fully extracting the performance out of a columnar format. So we implemented vectorized Parquet and that got massive speed up just in terms of read performance. And then for the entire query execution engine, we put this idea that exists only in academic papers and academic systems called Hosted Code Generation and put that into Spark itself. The idea is we would take a query plan and generate the actual code that's needed to execute that query. As if you're building a query engine purpose-built just for executing that one query. And the reason that would make it much more efficient is because a generic system has a lot of overhead because it has to be generic. For example, it has a lot of function calls because your query engines in particular have the concept of an iterator model in which you chain a bunch of iterators and each sub-operation is an iterator and every iterator call, when you say next, it generates a virtual function call which is very slow. Then the compiler is not great at optimizing it because those are complex programs. But if you, for example, have a simple program that does the aggregation. If you were to purpose-build a program to do that, you would just write a for loop from the beginning of the data to the end of the data and then sum something up into a local variable and then return the local variable. It would be a three-line program. Compilers would be amazing at optimizing that. So we basically treated Spark itself as a compiler that compiles SQL queries into actual purposely written code for executing those SQL queries. And SQL here doesn't just apply to SQL. A big, cute idea in DataFrame is that DataFrame is not very different from SQL and SQL is not very different from DataFrame. They all generate similar query plans under the hood and once you generate those purpose-built code for that query, you compile it with the JVM. The JVM now is much better at optimizing that. So basically vectorization and CodeGen combined, we actually got to a dramatic speed-up. By the way, we didn't invent either of this. In academia, vectorization always existed but it was never built for the open-source SQL system. This kind of CodeGen was pioneered by Thomas Neumann at TU Munich and they had an academic system called HyPer which was eventually commercialized and got bought by Tableau. So we took that idea and really made it into adoption. Probably more people benefited from that because of that work and benefited from Thomas Neumann's idea than the HyPer system itself. [00:31:13] Sudip: I want to talk a little bit about streaming. There was Spark Streaming and I believe now you guys have Structured Streaming, right? I'd love to understand a little bit what is the difference between those two and then one other related question is, originally the way I know streaming was implemented in Spark was through micro-batches, right? Is it still the case and how is Structured Streaming different from the original idea? [00:31:37] Reynold: Incidentally, the relationship between Structured Streaming and Spark streaming is very similar to DataFrame and RDDs. It's all about raising the level of abstraction. Spark Streaming was a fairly innovative thing because it basically came with this insight which is that, if you just run a batch that is small enough and fast enough, at some point you have the approximate streaming. To the extreme, you run a batch for every row which is basically one row at-a-time processing. And that was pretty popular because it introduced a whole new class and workloads that previously people thought would be very, very difficult and require super specialized systems for. Just with a lot of IoT devices, sensor data, message buses becoming more popular as Spark was growing, streaming workloads started growing too. There was a very important problem with Spark Streaming which was actually not about micro-batches. I personally think the big micro-batch versus true streaming debate is overblown too much. The biggest problem with Spark Streaming was that the window of streaming is completely tied to the physical batch size. So the physical property of its execution is leaked into the programming abstraction. The most common operation in streaming is windowing. If you want to window by, for example, a bunch of records, in Spark Streaming, the way of design is that you would actually run that batch as a window. And this really limits a lot of optimization activities and also limits the programmability. It also limits another very important thing, which is, if you have late-arriving data, now you can't deal with it because you're already done with that batch. Your data shows up later and cannot be considered a part of that window anymore. So the time and everything has to be physically tied to the way it was executed. So after we did DataDrames, we started thinking about how to improve all those issues of streaming. One very obvious one was, many people want the concept of time that's logical instead of physical. Physical meaning whenever the event showed up to the system, that's the time it showed up. Logical means the event has some property called time, maybe a column or a field, and that is the actual time. So we thought that for streaming, let's completely decouple the execution from the API and just think about how you can program a streaming job. We looked at a lot of streaming jobs and realized the intent people express and the transformation people express, were not very different from a regular DataFrame. It's a program, except they want to run that in a continuous mode. And for any data that comes in, they want to be able to run that instead of just running once. So we came up, I think it was 2016 or 2017, I remember I was giving the announcement at Spark Summit and I said that, the easiest way to do streaming is that you don't have to think about streaming at all. So Structured Streaming basically introduced the concept of streaming DataFrames. There's no separate API for streaming. It is just a DataFrame. And the only difference is how you create a DataFrame. If you created a DataFrame using a streaming reader because it's coming from some message bus or even just a pile of files that might continuously arrive on object stores, you have a streaming DataFrame. And you can run the same operations just like your normal batch DataFrames. And everything else is the same. And that really made it much simpler to program streaming because now people don't have to learn a new paradigm. They don't have to think about, hey, what is windowing? Well, it turned out windowing is a group by some time. So that was very, very powerful. And that also led us to realize that a lot of people, when they do streaming, they're not even actually trying to do things in super real time. What they really, really want is they want the incremental-ness that happens in streaming as data flows in. Because virtually every data pipeline is continuous, they always have data coming in. They might come in once a day or once a week, but it's always new data coming in. It's very rare you have a data pipeline you run once and never worry about. Now, as data continues coming in, now you have to worry about, okay, so you don't want to reprocess all of your data all the time because that's highly inefficient. So people invented their own way of doing incremental processing. And it turned out the number one use case for Structured Streaming was people just using it for incremental processing because now they no longer have to worry about the state. The funniest thing is that they would build a streaming pipeline, they would run it, and it processed the data. The data's actually coming in, for example, once a week. So if right after half an hour, if they finish processing their data, they’d actually shut it down manually. And then a week later, when there's new data coming in they rerun their streaming pipeline. And because it's fault-tolerant and because it tracks all the incremental states, it just does this whole end-to-end incremental processing. Because of that, we introduced a concept called streaming once, which is literally that when you run the streaming pipeline it finishes processing all the data it currently sees, and then it shuts it down. And the next time you want to do it, you relaunch it again. And that itself, actually, it probably generates hundreds of millions of workloads just on Databricks today. People are running streaming in a batch mode. [00:37:07] Sudip: Exactly. Yeah, that's what I was going to say. It makes it so much easier. But it's a unified programming interface. It's a unified understanding. And I'm sure it also made it easier on the engine side, right? Reynold: Exactly, the unification. Yeah, because we don't have to build so many different engines. Another question is, so does it still run in micro-batch? There're actually different modes today. There are certain things that are running in micro-batch because for example, obviously, the streaming once mode, it is just a giant batch job, except it does all the incrementalism. There are also continuous modes in which it actually processes data all at a time, in which you can get from records coming in to records going out in some milliseconds. [00:37:45] Sudip: I want to shift gears just a little bit and go to like, you know, circa 2013 when you guys started Databricks, right? And then you had this really fast growing open source community, which was Spark, and then you had this very early company, Databricks, right? And one of the challenges a lot of open source creators have, particularly when they start a company, is how do you balance between the open source community with the commercialization effort? How did you guys manage to do it? Were there certain guiding principles that you guys had that really helped you? [00:38:21] Reynold: It's difficult. I think Ali, the CEO of Databricks often joked about that we should never start another company based on open source projects. The reason for that is you need two strikes, right? Normally to start a company and if you want the company to do well, you focus on the business problem of the company and with all the stars aligned and you're lucky and you work super hard and you have an amazing strategy, it works out. And that is a very difficult problem. If you add open source to it, now you have a two-step process. You have to have the open source project taking off and doing well, and that itself is not a trivial thing. It's probably a little bit easier than building proprietary software, but it's not a guarantee. Most open source projects don't work out. And then after that, you need to have an amazing business model and work towards a great strategy. And now the actual end-to-end success requires you have to multiply the two probabilities, which makes it very low. And probably the reason why there are very few companies that have heavy open source roots, and been successful. One thing we focus on a lot is we have different teams doing open source and they have different mandates. The open source teams are tracking so their KPIs are about adoption metrics of the open source projects. We always open source everything in API. We don't want to lock customers in because we have another kill API that makes it super difficult to migrate out if they ever need to. And we focus a lot on evangelism of the open source project. We try to walk a very careful line. For the longest time, the Spark Summit was called the Spark Summit and there was very little Databricks content in the first day keynote. It was all about open source. We saw this change as time went on and then as we also renamed the project. There was a lot of AI content so a lot of things have changed, but those are a lot of the things we did early on. There's creating a more cleaner delineation, both in terms of organizational structure, KPI tracking, events, and all that. But it's not an easy task. Actually, if we were to redo it again, I'm not sure if we could do it. [00:40:21] Sudip: I'm laughing because that's coming from the founder of the most successful company in open source ecosystem, right? [00:40:29] Reynold: I know a lot of people think of Databricks and say, hey, there's a great business to be made about open source. I'm not sure. I mean, it's not been doable, but it's not an easy job. [00:40:40] Sudip: When you guys started Databricks, I remember 2014-15 time frame, I used to go to some of the board meetings. I remember there was a pretty heavy debate about should Databricks stay all cloud, or should Databricks go and support an on-prem version of Spark? And there was a lot of customer pressure because Spark adoption was increasing. [00:41:03] Reynold: Everyone wanted to pay us - like ten million dollars for Spark. [00:41:05] Sudip: And there was a lot of competitive pressure. Cloudera was really going at the time, Hortonworks was around. But you guys had a very deep conviction that it is either Databricks cloud or nothing. Can you talk a little bit about that, like where that conviction came from? [00:41:23] Reynold: Yeah. I mean, in a kind of sense, we're really glad that we didn't go on-prem and become a support company. It ultimately comes down to the longer-term vision of where we see the puck is going and we try to go towards that rather than capturing what's right in front of us. And that's something easy to say because there are also scenarios in which people think too long term and they're dead before the long-term future and vision could even manifest. But I think one of the reasons that really got us going there was it actually had to do with Berkeley also. So at Berkeley in 2009-2010, there was this very famous paper called the Berkeley view of the cloud. [00:42:05] Sudip: Yeah, above the cloud, right? [00:42:07] Reynold: The view of the cloud. There are variants of the paper. The initial title and then later it's just a view of the cloud. But that paper is probably the most cited technical paper on cloud computing. It was so popular that many business schools incorporated that in various of their classes. Like my wife, for example, went to business school and read that paper. Not because I showed her the paper, but it was actually part of the class. And that paper predicted that the vast majority of the compute and computing infrastructure will move to the cloud and it will be finally true, this computing as a utility. Very few companies and houses have their own generators. They are just using electricity from the grid, right? It's something that's reliable enough, something that's viewed as virtually infinite and the economics simply makes sense. There are niche use cases, those will never go away, but the vast majority don't think about it as much. And that paper predicted that future by analyzing not just the technical foundations, but also what that type of allocations would be possible and the economics and the accounting and all that that have pretty profound influence on, I would say, the way we think about all this at Databricks. And I think some, maybe not everybody, but a couple of Databricks founders are also involved in that paper as well. So we always thought, hey, we really wanted the ability to be able to release software super quickly. We always wanted the ability to be able to provision and get a POC going in a matter of days instead of a matter of months, because now you have to go procure the hardware. We always thought a lot of the complexity of software, especially infrastructure software, is in the operations of it, not just in the building of it. And all of those are enormous values we can create, but we can only do that if it's in the cloud. And we're so early on at the time that we just view, and even today, we kind of view most software in our stack are pretty broken. We need to continuously improve them. Velocity is key. And having simplified environments, not having to worry about the 20 different variants of Linux, 50 variants of IP wireless thing and firewalls and all that would be enormously beneficial. So that's kind of what got us started. [00:44:27] Sudip: And clearly, it paid out so well for you guys. [00:44:31] Reynold: But, it is difficult Because every time we hire an exec, it's like, hey, I have a great idea to increase our revenue by 10x. [00:44:39] Sudip: Yeah, go on-prem. Before I switch to lightning round, one final question about Spark. What's next for Spark? What's in the future? [00:44:49] Reynold: I think the API is actually pretty good. We've been doing a lot of incremental refinement. For example, one of the biggest complaints of Spark is when you use Python, the error messages are simply nonsense because those include JVM stack traces and all that. And we actually spent a lot of time improving those to the point that you could probably still see it, but most of the time, you don't even feel there's a JVM that's running the Python program. So there's all this sharp edges we're trying to remove, which ultimately, they're not big fundamental ideas, but they're really the ones that create friction that gets in the way of everyday users. So a lot of work goes into that. With a lot of the GenAI use cases, it's 2023, everybody has to talk about GenAI, we have noticed a lot of the Spark programs that were generated by ChatGPT and all the other things. I bet there will be more Spark programs generated by machines than by humans in the next year. And most of them don't have good practices because many of them were generated and learned on a giant corpus on Stack Overflow and whatever is on Reddit, mostly based on malpractices from the past. Some were written before certain new things were introduced and were never updated. So we're working on this thing called the English SDK, which basically, if you think of it, it's really just how do we teach ChatGPT with the right prompting so it generates best practice Spark as opposed to generating malpractice Spark that now some other human experts come in and try to fix it. Things like this would make Spark's adoption go far wider and really benefit a lot of users that previously just felt Spark was too daunting. It can be somebody who is reasonably technical but they use Spark and they generate a Spark program in ChatGPT, they realize: oh it crashes for this data. But it turned out with a better chatbot, it will actually generate things that don't crash. We want to get to that point. I think another very big opportunity for Spark is that I fundamentally believe the biggest innovation of Spark is not performance, it's not just the use cases, but rather for something that's very old in data, which is data engineering. In the old school days, they used ETL tools like Informatica and all of that. And later, there's a little bit of Modern Datastack that got super popular the last couple of years and people write SQL queries again. There's something fundamentally wrong with SQL for data engineering. It's a very hyperbole statement, but what's fundamentally wrong with SQL? SQL was not designed with engineering practices in mind. You cannot easily test a SQL query. There's no abstractions in SQL. You could use recursive common subtable expressions, but again, it's just a pile of text. There's no variables, there's no for loops, there's no functions, there's no classes, there's no CI/CD framework, which means the key word is engineering. Engineering requires a sense of rigor, and rigor is backed by fundamental engineering principles. And what are the fundamental engineering principles? They are abstractions. I'm talking about software engineering here, right? They're abstractions. They are testing. They are CI/CD. They are how you roll out. And SQL is just terrible for all of those. I love SQL. I did my PhD in databases. I spent a decade optimizing, figuring out how to build systems to run SQL better. But we actually have a solution in front of us. It's real programming languages that can do everything SQL does, which is actually the Python Dataframe API. If you compose exactly the same program in SQL in a Python Dataframe API, it looks almost the same by readability. You can now actually test it using just vanilla Python code. You could decompose your program because it doesn't have to be a pile of text. You can have multiple files backing them. Each file is a Python file. You can have classes. You can have functions. All of this are great tools. By the way, it's just off the shelf and available because it's Python. People build far more complicated Python programs than the most complicated data engineering programs. So certainly you could use that. We haven't done a good enough job to explain to the world what's the value of Python. Python is now very popular in data science and machine learning. But to the SQL folks, many of them don’t even know Python. We haven't really done a good enough job in educating them. I think that would be one of the most important things that Spark can get right is to tell the world, hey, here's how you can do data engineering. And by the way, it's not that hard. It's just vanilla Python. And the Dataframe API is just the equivalent way of writing SQL. As a matter of fact, everything you're familiar with in SQL translates here, except now you have all the toolkits to do serious engineering. [00:49:39] Sudip: That's fascinating. So, Reynold, we end every one of our episodes with three quick questions. We call it the lightning round, starting with the first one being acceleration. So in your view, what has already happened in Big Data that you thought would take much, much longer? [00:49:57] Reynold: Yeah, the death of Hadoop. I would think it would take a lot longer for Hadoop to die. Enterprise software never really goes away, but Hadoop today is largely irrelevant. I want to see it happen faster, I thought. [00:50:09] Sudip: You definitely had something to do with it, didn't you? [00:50:11] Reynold: Yes, we had a part in that, but if you asked me 10 years ago, I would tell you by 2030, maybe you would see a rapid decline. [00:50:21] Sudip: Then the second question around exploration, what do you think is the most interesting unsolved question in your space, you know, largely Big Data processing? [00:50:31] Reynold: There's many. I'll just pick one. I think how do you combine unstructured data and structured data. Especially with GenAI now, there's great ways of processing unstructured data and analyzing unstructured data, but then how do you combine them is unclear. I think there's a lot of value that can be generated currently. [00:50:48] Sudip: Final question, what's one message you want everyone listening today to remember? [00:50:55] Reynold: I mean, to maybe the builders - the open source framework builders would be - put the user first, think from their shoes. Think about simplicity to the users, I would say. The last thing I said, Python, data engineering, you want to use a real programming language for data engineering to bring the engineering rigor into data. [00:51:15] Sudip: Reynold, thank you so much for sharing your insights. This was a real pleasure and frankly a privilege to have you on. Thank you. [00:51:23] Reynold: Thanks a lot for the invitation, Sudip. [00:51:25] Sudip: This has been the Engineers of Scale podcast. I hope you all had as much fun as I did. Make sure you subscribe to stay up to date on all our upcoming episodes and content. I am Sudip Chakrabarti and I look forward to our next conversation. [00:51:41] This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit sudipchakrabarti.substack.com [https://sudipchakrabarti.substack.com?utm_medium=podcast&utm_campaign=CTA_1]

13. dec. 2023 - 51 min
episode When Hadoop was King and Yahoo was Cool - with Doug Cutting and Mike Cafarella cover

When Hadoop was King and Yahoo was Cool - with Doug Cutting and Mike Cafarella

In our Engineers of Scale podcast, we relive and celebrate the pivotal projects in infrastructure software that have changed the course of the industry. We interview the engineering “heroes” who had led those projects to tell us the insider story. For each such project, we go back in time and do in-depth analysis of the project - historical context, technical breakthroughs, team, successes and learnings - to educate the next generation of engineers who were not there when those transformational projects were created. In our first “season,” we start with the topic of Data Engineering, covering the projects that defined and shaped the data infrastructure industry. And what better than kicking off the Data Engineering season with an episode on Hadoop, a project that is synonymous with Big Data. We were incredibly fortunate to host the creators of Hadoop, Doug Cutting [https://www.linkedin.com/in/cutting/] and Mike Cafarella [https://www.linkedin.com/in/mikecafarella/], to share with us the untold history of Hadoop, how multiple technical breakthroughs and a little luck came together for them to create the project, and how Hadoop created a vibrant open source ecosystem that led to the next generation of technologies such as Spark. Timestamps * Introduction [00:00:00] * Origin story of Hadoop [00:03:26] * How Google’s work influenced Hadoop [00:05:47] * Yahoo’s contribution to Hadoop [00:13:51] * Major milestones for Hadoop [00:20:06] * Core components of Hadoop - the why’s and how’s [00:22:44] * Rise of Spark and how the Hadoop ecosystem reacted to it [00:27:19] * Hadoop vendors and the tension between Cloudera and Hortonworks [00:31:51] * Proudest moments for the Hadoop creators [00:33:56] * Lightning round [00:36:04] Transcript Sudip: Welcome to the inaugural episode of the Engineers of Scale podcast. In our first season, we'll cover the projects that have transformed and shaped the data engineering industry. And what's better than starting with Hadoop, the project that is synonymous with Big Data. Today, I have the great pleasure of hosting Doug Cutting [https://www.linkedin.com/in/cutting/] and Mike Cafarella [https://www.linkedin.com/in/mikecafarella/], the creators of Hadoop. And just for the record, Hadoop is an open source software framework for storing enormous data and distributed processing of very large data. Think hundreds and thousands of petabytes of data on again, hundreds and thousands of commodity hardware nodes. If you have anything to do with data ever, you certainly know of Hadoop and have either used it or definitely have benefited from it one way or another. In fact, I remember back in 2008, I was working on my second startup, and we were actually processing massive amounts of data from retailers coming from their point of sale systems and inventory. And as we looked around, Hadoop was the only choice we really had. So today I'm incredibly excited to have the two creators of Hadoop, Mike Caffarella and Doug Cutting with us today. Mike and Doug, welcome to the podcast. It is great having you both. [00:01:02] Doug: It's great to see you. Thank you. Thanks for having us. [00:01:10] Sudip: If you guys don't mind, I think for our listeners, it'll be great to know what you guys are up to these days. Mike, maybe I'll start with you and then Doug. [00:01:19] Mike: Sure. I'm a research scientist in the Data Systems group at MIT. [00:01:27] Doug: I'm a retired guy. I stopped working 18 months ago. My wife ran for public office and it was a good time for me to transition into being a home keeper, do shopping and cooking. But I also have a healthy hobby of mountain biking and doing trail advocacy and development, trying to build more trail systems around the area that I live in. [00:01:44] Sudip: Sounds like you're having real fun, Doug. One day we all aspire to get there, for sure. I'm really curious to know how you guys had met. I've seen some interviews of you guys. You kind of talked about how, I think, Doug, you were working on Lucene at that time and then connected with Mike somehow through a common friend. I'd love to know a little more detail on how you guys met and how you guys started working together. [00:02:06] Doug: It kind of goes back to Hadoop really. Hadoop was preceded by this project, Nutch. Nutch was initiated when a company called Overture, which we'll probably hear more about, called me up out of the blue as a guy who had experience in both web search engines and open source software and said, hey, how would you like to write an open source web search engine? And I said, that'd be cool. And they say they had money to pay me at least part time and maybe a couple other people. And did I know anyone? I didn't know anybody offhand, but I had friends. I called up my freshman roommate, a guy named Sammy Shio, who is a founder of Marimba. And I said, Sammy, do you know anybody? And he said, you should talk to Mike Cafarella. I think it was the only name that I got. And I called Mike and he said, yeah, sure, let's do this. [00:02:49] Mike: So at the time, this would be in like late summer, early fall of 02. I had worked in startups and in industry for a few years, but I was looking to go back to school. So I was putting together applications for grad school. And I was working with an old professor of mine to kind of scoop up my application a little bit because I had been out of research and so on for a while. And that was a fun project, but it wasn't consuming all my time. And so Sammy, who was one of the founders of Marimba, which was my first job out of college, he got in touch and said that his buddy, Doug, had an interesting project and I should make sure I go talk to him, which was great. I was looking for something to do and it came at just the right moment. [00:03:26] Sudip: That was quite a connection, Mike. And then going back to that timeframe, 2002-2003, I think, Doug, you started touching on how you started working on Nutch and eventually became Hadoop. Would you mind just maybe walking us through a little bit like the origin story of Hadoop? I mean, I know Overture funded you for writing the web crawler, but what was their interest in an open source web crawler in the first place? [00:03:49] Doug: I think that's a good question to get back to some of the business context. We want to mostly focus on tech here, but the business context matters, as is often the case. So I had worked on web search from 96 to 98 at a company called Excite. I'd been pretty much the sole engineer doing the backend search and indexing system. And then I transitioned away from that, written this search engine on the side called Lucene, which I ended up open sourcing in 2000. Also in 98, Google launched, and initially they were ad-free. All the other search engines, there were a handful of them, were totally encrusted and covered with display ads. So just think like magazine ads, just random ads that they managed to sell the space to advertisers. Google started with no ads, and they also really focused and spent a lot of effort trying to work on search quality. All they were doing was search. Everybody else was trying all kinds of things to get more ads in front of people, and Google just focused on making search better. And by 2000, they'd succeeded, and the combination of this really clean, simple interface and better quality search results, they had taken most of the search market share already. But they needed a revenue plan. This company called Overture had, in the meantime, invented a way to make a lot of money from web search by auctioning off keywords to advertisers and matching them to the query. Google copied that and started themselves minting money. Overture was nervous because they had this market, and they were licensing it to Yahoo and Microsoft and others, but they were worried that all of their customers were going to get beaten by Google and go out of business. So on one hand, they sued Google. That's an interesting side story. But on the other hand, they decided, we should build our own search engine to compete with Google. We somehow need to do this. They bought AltaVista. They tried to build something internally, and they also thought, you know, open source is this big trend. Let's do an open source one to have something to compete. So they called me, and I called Mike, and we worked with a small team of guys there at Overture, led by a guy named Dan Fain, and we started working on trying to build web search as open source. [00:05:47] Sudip: That is such a phenomenal historical context. Including myself. I don't think many, very many people had that. And then interestingly, Google also came out with their GFS paper in 2003, their MapReduce paper in 2004, which obviously influenced a lot of the work that I think you guys did down the line. I'm curious, what do you think might have caused Google to publish those papers in the first place? Any hypothesis on that? [00:06:14] Mike: I think you're putting your finger on something interesting and important, which was, at the time, that wasn't common practice to have a research paper that told you a lot of technical details about an important piece of infrastructure. I don't think it was part of some genius, long-term plan to profit down the road. It was part of a general culture at the place to emphasize the virtues of publishing and openness and science. Maybe it helped them with hiring or something like that, but if so, that was kind of an indirect benefit. And it was really trend-setting. I mean, they ended up publishing a ton of papers. I think Microsoft and Yahoo and other companies followed suit. There's a whole string of really interesting papers throughout the 2000s and early 2010s, systems that we might never have learned about had they remained totally closed. But it's interesting to think about the impact of the GFS paper, I think, on our experience, Doug, which was we had worked on Nutch for, I guess, about a year. And after about a year's time, I recall that it was indexing on the order of tens of millions of pages, but you couldn't get more than a month's worth of freshness because the disk head just couldn't move that fast in a month. So it was a single machine, but we were limited by storage capacity or by disk throughput on the seek side. If we wanted the index size to grow substantially larger, then we had to have some kind of distributed solution. I remember we spent something like six to nine months, Doug, working on a dedicated distributed indexer for Nutch. I remember the technical details. Maybe you can pitch in a little bit there. But I do remember finishing it, or at least thinking we were finished, and then about five minutes later reading the GFS paper and realizing that we should have done it that way. [00:07:51] Doug: I remember running it. I remember operating that thing. And I think we actually got up to 200 million web pages. We were basically doing, this was still well before the MapReduce paper, doing MapReduce by hand. We'd quickly learned that we couldn't do all of this on a single processor. I more or less knew that from my days at Excite. We sort of hit the limit of what you could do on single processing, and even then we were already doing some things distributed. So we needed to have a way to do it distributed, which was to chop up the problem into pieces and run them in parallel. Overture had bought some hardware for us. A friend of ours named Ben Lutch, another guy we brought on, was running that hardware in a data center somewhere, and we could farm off and run processes on these. But it was a lot of work. We'd run four things doing crawling, get that data down on the disks of those machines. Then we'd parse out all the links, and then we had to sort of combine, do a shuffle effectively in MapReduce terms, and do a merge of all those data on the different machines, and decide which pages to grab next, and then to do indexing. We had all that plumbing working, probably a 10-step process, each step of which was distributed across five machines. But it took you running and monitoring all of these processes for 10 steps, and shuffling files around by hand. We were just using SCP to move things between these nodes. It was laborious, and I don't think practical for more than five machines. We would have needed to start thinking about automating all that. Somewhere in there is when the MapReduce paper came out and automated all that, and added in a lot of other reliability considerations, as did the GFS thing. We didn't have to worry about drive failures and machines crashing. With five machines, that didn't happen, practically. But if we wanted to move up to 100 or 1,000 machines, then we knew it would, and that we'd need all that. So it was a pretty nice gift. I mean, back to motivations for Google, I think part of it was, I think as Mike sort of indicated, they came out of academia, they had this don't be evil motto, and felt like it was sharing this. I think there was also a little bit of an agenda, in that at some level didn't believe that technological edge was sustainable in any case, that what you really needed to do was build company culture and operations. I actually talked to Larry Page about that once, and he claimed that their only sustainable edge was operations, which I thought was an interesting claim to make. But also, they believed having an open source implementation would help them in recruiting, that people would already be familiar with these concepts, with this model, and when they came into Google, they could adapt more readily and come up to speed. Which again, says they weren't worried about competition at a technical level, which is interesting. I'm very grateful they had that sort of high-minded attitude, because we were able to benefit tremendously. That was a big project. It took them, you know, five years with a huge team to work through a lot of different alternatives, and come up with a solution that they published. And Mike and I could just go, hey, let's go implement that. We've got a blueprint here. We were very happy to take all their hard work off the shelf. [00:10:42] Sudip: How long did it take you to kind of incorporate the ideas from those two papers, the GFS and MapReduce, into what you guys were building? [00:10:53] Mike: I remember running about a 40-node version of HDFS roughly six months after the paper came out. So I think by the summer of 2004, we had something limping along. I do remember that that version, though, if one node had the misfortune of going down, a simple thing that the paper doesn't dwell a lot on, but one implements, is a machine goes down, a portion of the file system has now become a little bit more in danger because you have fewer copies of those bytes than you would like to have. So it's time to copy those to duplicate them more. And if a machine went down, the other machines were scrupulous to a fault, would absolutely blast as many bytes as possible to escape this dangerous situation right away. It would paralyze the entire cluster until it had done so. So you had to limit that a little bit, but it was roughly in that time that the basic system was limping along. [00:11:42] Doug: Mike tended to focus on the core GFS algorithms and the core MapReduce algorithms, and I tended to focus more on hooking that all into the rest of Nutch and then running it. Mike also did some crawls up at UW as well. [00:11:53] Mike: I should say, you know, one thing that I've always thought of as an underappreciated element of the project's success, and that Doug really took ownership of, was the readme and the out-of-the-box experience. You could download the thing, and an hour later, it could be working on 100 nodes. And at the time, that was just unfathomable. If you want to get, say, an eight-node distributed database system working at the time, well, you better go have a budget to go hire some consultants from IBM to help you set that up, right? But Nutch, almost Hadoop experience at this point, in a distributed setting was really smooth, and Doug focused on all that stuff. It was really, I think, a big ingredient in its success. [00:12:20] Sudip: I think, speaking from my own experience, it really shortened our time to value, and we didn't have to raise a whole lot of money because cloud was coming up. This was circa 2009. AWS was just about coming up, so we could get up and running so quickly because we had this nice project you guys had created. So, very belated thank you for that. [00:12:32] Sudip: It sounds like you guys had most of the plumbing before the two Google papers came out. Did you ever think of if those hadn't come out, how Hadoop might have looked now? [00:13:00] Doug: I think we would have struggled. I think we would have come up with some scripts to try to automate some of this stuff, and long-term, we would have struggled with reliability issues and scaling issues. [00:13:08] Mike: We were really interested in Nutch, the search engine at the time, right? The goal of all this work was to improve the search engine quality, to improve its coverage, ranking quality, speed to indexing, and so on. And so, in that alternate reality, if things had been a little bit different, I like to think that we would have encountered the same technical issues that the Google ads did, and we would have gone through a similar kind of technical discovery process just a little bit later than they did. Of course, they're very sharp. Would we have actually had as good an outcome? I don't know, but one thing that was nice is that focus on the search engine led us to see some of these problems later than the Google guys, but earlier than most other people. We were a pretty small team next to, I think, the resources that Google had. [00:13:51] Sudip: Yeah, resources and how important search was. Obviously, that was the entire company, entire business model, right? Speaking of which, another search giant at that time, Yahoo, had an amazing role, a disproportionate role in making Hadoop successful. And Doug, you, of course, went to work there, I believe, in 2006, if I'm not mistaken. Maybe if you could walk us through the role that you guys saw Yahoo playing in Hadoop in the early days, how they made it such a successful project. [00:14:17] Doug: Yahoo bought Overture, which was interesting, and they continued to support the work on Nutch. I think that deal closed in probably 2004 or something like that, 2005. And Yahoo was trying to build its own search engine to go against Google, and they bought a company called Inktomi, and a big team of engineers from that who had been running a web search engine. And we're trying to figure out the next generation of that. They recognized the need for a new data backend to hold all the crawl data to do the processing of it. They looked around, they saw the MapReduce paper and the GFS paper, and they said, we need that. And they saw this open source implementation, they thought that'd be a great starting point. And to boot, they had already funded it. I think that was a coincidence as much as anything, but it meant we already knew people there. I went and gave a tech talk about it in 2005. And in 06, they said, let's start, let's adopt this platform that you guys have in Nutch as our backend. We really want to invest in it. I said, great, I'll come work for you. That's what I would love to see is some serious investment. It's just Mike and I working on it. It's going to be a long tail of debugging to get to anywhere near the solidity that it needs. So I joined. They were like, we don't care about all the web search specific stuff, because we already have that. What we need is this backend stuff, the equivalent of GFS and MapReduce. We need the HDFS. And so we split it in two. They were also concerned about intellectual property stuff working in the search space. They wanted to work in as narrow a space as they could for exposure to patent issues and so on. So we split Hadoop out. I had that name waiting on a back burner. My son had this stuffed yellow elephant that he had named Hadoop. He just created this name, and I thought that would be good. It comes with a mascot. [00:15:55] Sudip: And I purposefully didn't ask you that question, because that's one answer everyone knows, for sure. [00:16:01] Doug: And so that was, I think, February, maybe, of 06. We split them out, refactored. It was a pretty small job. I think Mike and I had already factored the code reasonably well, so it was mostly just renaming a lot of things and putting them in different namespaces. And we were kind of off to the races. Yahoo immediately, the day I started, I think they had a team of 10 all of a sudden working on it. And we had a hundred machine cluster of really nice hardware compared to anything that Mike and I had ever had before. [00:16:27] Mike: Yeah, it was incredible. I mean, up until that point, Doug and I were working on it. I think Stefan, Stefan Grosjef, had been contributing on the open source side for Nutch. He was kind of a notable contributor. There were a few other people, but when Yahoo invested in it, it really was an epic change in the number of people paying attention. It was great. We really expanded the set of people who were participating. [00:16:47] Doug: By summer, they had probably a hundred engineers and a thousand node clusters on this. And I don't know if it was the end of 06 or was it in 07 when they actually transitioned their production search engine to running on top of this. It was a big process. Owen O'Malley started really making a lot of improvements there. Who were some of the other guys there, Mike? [00:17:06] Mike: Arun Murthy and Eric14 was the lead engineer, or the engineering manager of that at that time. Rami Stata, who I think had come to Yahoo as part of an acquisition, he was the manager and kind of our champion for the project inside Yahoo for a period of time. He was really instrumental. So one thread of the story is the contingent nature of the project. There were lots of things that had to go right for this to be a success. And at many points, there were individuals or companies who decided to listen to the better angels of their nature. And whether it was Overture funding the project initially, Yahoo deciding to fund it, or Google deciding to write those papers, a lot of things came up right and eventually yielded a good outcome. But it took lots of people working independently to kind of happen into it. [00:17:48] Doug: As an open source maintainer, my goal has always been to get a project to the point where I'm not needed, where it has a life of its own. It's built up enough of a user base, about enough of a developer base. And where Mike and I were in 05 with Nutch wasn't there. It was too raw. You could use it in the out-of-the-box experience was as good as it could be, but it was unproven. I was working as a freelance consultant. I was getting tired. I was looking for a full-time gig. And IBM talked to me and they said, we really want to invest in Lucene. And it would have been a nice job, but Lucene was fine. Lucene was off and running on its own. I didn't need to be there day to day to use Lucene because IBM was already using it. Whereas Nutch really needed a sponsor. And that's when Yahoo came along. And so it was really a great thing that they did and really, really took it and made it real, got it to that point where it really proved itself. And then Facebook and Twitter and all the rest could start using it. [00:18:39] Sudip: In hindsight, what was your take on Yahoo's relationship with an open source project? Was that smooth sailing internally where people really believed in it? Or was there some push to put it within the walls? [00:18:53] Doug: It wasn't something that they had done a lot of before. So it was new ground for them at a corporate level. I had learned because I'd been doing consulting in open source for a while that I needed a clause in my employment agreement that said that I could contribute to open source. None of the Yahoo employees had that. And so although the engineering management was using this and investing in this and committed to the vision of open source, the lawyers said, no, Yahoo employees, except for Doug, can't actually contribute to open source. So they had to submit all their additions and then I could commit them. It was this one little step that I had to do for well over a year. It took a change in Yahoo CEO before we could get somebody to override the legal department and say, this is actually okay for Yahoo employees to directly apply changes to Apache. So there were a lot of things like that. The other thing that was an issue is Yahoo, as I said, had 100 people working on this. And there were a handful of people in other companies that were starting to use Hadoop, but nobody had anything like that. And in open source, it's hard to not dominate when you've got that big of an imbalance. And we really wanted to build an egalitarian community where everybody weighs in. And it was hard for Yahoo to not be the 200 pound gorilla. And there were some growing pains around that over the years. [00:20:06] Sudip: Before I shift to talking about the main components of the Hadoop ecosystem, I want to kind of just read through a couple of milestones that I found. And I'd love to know if any of those sound completely off. So I found that in 2007, within less than a year after you joined Yahoo or Doug, they actually were using Hadoop on a thousand node cluster. And then in April 2008, apparently Hadoop defeated supercomputers and became the fastest processing system to sort through an entire terabyte of data. And then in April 2009, I think Hadoop was used to sort a terabyte of data in 62 seconds, beating Google's MapReduce implementation. And then finally, I think in 2009 also, it was used to sort through a petabyte of data and indexing billions of searches and indexing billions of web pages. These are like heady, heady milestones. I'm curious a little bit, what was the feeling inside Yahoo at that time as you guys were hitting those milestones, which actually went in some cases way beyond what I think you guys had set out to do? [00:21:08] Doug: That was mostly Arun and Owen doing those benchmarks and driving that forward. I think [00:21:13] Mike: We were pretty stoked. It was awesome. Doug's right. The point is to get it to the point where it won't die. And people that we knew, but we were not working with every day, they were taking it and doing amazing things with it. That's what you want to have happen. It was thrilling to see. It was really fun. Yeah, it also gets to motives back to why Google published those papers. [00:21:34] Doug: It gave their employees public visibility and employees like that. You want employees to be happy. I think that was part of the motive for Yahoo adopting an open source solution is people like to be visible in the outside world, more peer recognition, and being involved in open source gives you that. So it made Yahoo a more fun place to work and more rewarding. And also being able to try to beat these kind of records. Again, it's great for recruiting and retention, employee morale, so long as you believe that you're not giving away the bank. When you do that, I think you build a much stronger company. Yahoo's profile isn't what it used to be in the consumer space nowadays. But in the 2000s, they were not as maybe financially successful as Google, but the technical depth of the company was really good. They had a ton of people working on Hadoop. They had a lot of people in different parts of the company, like Yahoo Labs was a research lab that was fantastic at that time. The technical skills inside the company were great. Google was getting a lot of press. So I always felt like some of these guys, it felt great to go make a splash and get some numbers in the way that people in Mountain View were. And they certainly had the brains to do it. So I thought it was great. That's sick. [00:22:44] Sudip: I want to shift to discuss a little more technical stuff. So as I understand, Hadoop has had four main components, HDFS, Yarn, Hive, and Hadoop Common. I'm curious a little bit if you wouldn't mind spending a little bit of time on what was the motivation to create each of those components and what do you think was kind of the guiding principle to build that ecosystem of four main components? [00:23:09] Mike: I'll try to address some of this, Doug. I mean, you were more closely involved with some of these components than I was. I should say the Yahoo engagement in the 05 to 06 and so on, it was thrilling in a lot of ways. It was also kind of the beginning of the end for me because I was in grad school at the time. And when you have 100 people working on the project, I mean, people would file bugs and fix them before I could come in to put in my 10 hours a week. So at some point, I had to decide whether I was going to actually get my PhD or keep trying ineffectually to contribute next to everyone else. So I had to kind of pull out by 07 or so. So some of these later components, I don't have a ton of insight into, but I can comment on some of them. HDFS was the original Google file system element. The original version, I think, reflected a design set of choices that informed a lot of Nutch and to do, which was a focus on correctness rather than performance or other things. That's why Java was the programming language we chose for all this, not known for being super high performance at the time or arguably now. I think it's not as bad as people think, but I think it's okay. But the thing that was good about it was we were really worried like if the system is wrong or if no one can contribute, it'll die. If it's slow, people can probably live with it a little bit. And again, I think the performance penalty of that's a little bit overstated. So HDFS, the initial version, was really designed to emulate GFS in many ways. In some ways, we made decisions that reflected our particular use case, like there were certain kind of mutation operators or append operators in GFS that were not supported in the original version of HDFS. There was no traditional failover in the original version. That would have to come later with Zookeeper and a few other services that offered us high quality distributed system safety. So the original version of HDFS was an emphasis on correctness and just bulk storage. And that's turned out to be an enduring advantage of the whole project, right? That's still something that a long time later people really need. It's become technically much more sophisticated than it was originally. But that emphasis on like reliable storage that is as cheap per byte as possible has still proven to be a good idea. Yeah, I mean, I'll just add, you know, HDFS and MapReduce were kind of the original two functional components, each modeled after a paper from Google. The common portion is just the utilities that we needed to build this kind of system, you know, RPC, storage formats, just that kind of library. Yarn came along a little later. That was a project led by Arun to abstract the scheduling out of MapReduce and try to come up with a scheduler that was general purpose that other systems could use to schedule tasks across a large cluster once you had one built up for more than just use it for more than just MapReduce. [00:25:53] Mike: One of the cool things at that time, you know, this is like the late 2000s, were that people were learning from the MapReduce example, both at a systems perspective and kind of a science perspective. And they were building much more ambitious systems in the original MapReduce pattern. So there was Pig, which was like a SQL processor that was built on top of it. Hive would come out pretty soon. From Microsoft, maybe a year or two later, there's a system called Nyad. And all these were kind of arbitrary computation graph systems. The original infrastructure of MapReduce inside Hadoop couldn't support that stuff, but Yarn could, like they could be much more ambitious about the computation graph. [00:26:27] Sudip: And as I understand, Hive was the one that made someone use SQL to write MapReduce jobs, right? That kind of opened up the user base quite a bit. [00:26:38] Mike: Yeah, that's right. Pig was very similar, but it used a different syntax. So it didn't use SQL syntax, but it tried to obtain something similar. Mike and I weren't doing science. We were doing engineering on this project, and maybe some social work trying to build communities and so on. To really evolve, you need to have people experiment and try to build new kinds of systems. And I think Hadoop really inspired that in the open source space. I think we saw a huge number of things follow on, many of which succeeded, many of which didn't. So we gave people this example, and then they could do some science based on our engineering example. But in the case of Hadoop, MapReduce and GFS were really the science part. [00:27:19] Sudip: Hadoop really succeeded and was probably designed as a batch analytics system in the first place, right? And then as the new use cases were coming up, did you guys ever consider adding more like a streaming use case, more of an interactive analytics use case? Was that ever a consideration for you? [00:27:34] Doug: Not really. I mean, I think we saw other people coming along and addressing those kinds of systems with Spark and other systems that really, really addressed it properly. And it was layered on top. And these systems were designed from the outset to be more batch oriented. A lot of these things were incremental, right? Like the HDFS and MapReduce process were very batch oriented. If you wanted something that was more interactive time or immediate, then maybe you'd use a traditional relational database. And I think what's become apparent since then is that there are a lot of different points in between, right? At the time, batch oriented execution was what I thought made it distinctive. Having a MapReduce experience that was a little more interactive, although not what you would call like real time necessarily, that turned out to be a pretty compelling point in the design space. And I don't think that was obvious to me like in 2011 or 2012. The big advance was before that, you couldn't process that kind of data. Even if you had the hardware, there wasn't really commercial software. That was, I think, what really excited me when I read those papers was it opened up this whole realm of managing terabytes and being able to do computations over. My background originally was in computational linguistics and doing processing over large corpuses of text is critical to that, but you couldn't do it until we had this. And open source really enables that in a way that proprietary solutions aren't as accessible to everyone. That was a thing that really excited me about it was that the possibilities that it opened up for folks doing creative tasks with large amounts of data. I mean, you could do that stuff back then. People had computers that you could attach to a network. Distributed programming was possible, but the distributed programming libraries were for researchers, not everyday programmers. And the storage costs, if you wanted a lot more storage than what a small number of disks hanging off your PC could supply, you could go buy a RAID device or an EMC-style device that was incredibly expensive and didn't give you the scalability. So for most people, it really opened up a lot of stuff that they didn't have before. And the MapReduce API looked really clean and elegant and looked like it made it really simple to process huge amounts of things. Nowadays, people look at it and they say, ooh, how clumsy. You really have to do all that work. But at the time, it was really groundbreaking in the simplicity. [00:29:54] Sudip: I remember watching an interview of you, Doug, back in 2016 or something. Somebody asked you, what would you see as the success of Hadoop ecosystem? I think one comment you made was, I kind of expect some of the Hadoop things to shrink and be replaced by new tools, which turned out to be obviously a completely accurate prediction. So I'm curious a little bit, how did you guys think about when you saw something like Spark come up, which essentially was the newer version of MapReduce, using obviously in-memory. What are your first thoughts? Did you consider adding maybe an in-memory extension in Hadoop, particularly the MapReduce portion, to do something similar? How do you guys think about it? [00:30:32] Doug: I don't think of it as a competition where Hadoop had to keep up with these other projects. Rather, for me, the bigger mission is trying to get more technology in open source. And so that's a success. When Spark comes along and makes some fundamental improvements and provides something that can replace, in many cases, maybe not in every case, Hadoop, that's a good thing. More power to it. The process is working. We're making progress because we're not selling anything. It's very different than a commercial marketplace, where if a competitor comes up with something, you need to try to match it. In open source, we're much happier to have different projects complement one another. [00:31:09] Mike: I fully agree with everything Doug just said. The thing that made those sort benchmarks and the engagement with people inside Yahoo and elsewhere so thrilling was people cared enough to make it better. People who work on Spark cared enough to make it obsolete in some ways. It's not that different. They actually thought it was worthy of paying attention to, and they did a better job. [00:31:30] Sudip: Great, awesome! Let me last in a couple of things. One is the three vendors that came out in Hadoop. There was obviously Cloudera, which started in 2008, as I understand. MapR came out in 2009, and finally Hortonworks in 2011. Doug, of course, you were involved in Cloudera from pretty much day one. I'm guessing, Mike, you also probably advised some of those. [00:31:51] Mike: I did a little bit of advice, but I had gone on my professor career. I was out of the business mostly, but I did a little bit of consulting. [00:31:57] Sudip: I'm curious, looking back, how do you see the three vendors in the Hadoop space? What did they do right? What did they get wrong? And if you were to do any company based on Hadoop today, knowing what you know now, anything different you'd do? [00:32:11] Doug: There was definitely some unfortunate things that happened in there in those years. When VCs approached me in probably 2007, a group of folks from Accel and from a couple other firms and said, we want to start companies in this space. And I was like, fine. I'm not interested in that right now. I'm happy at Yahoo. But if you do, please get together and start one. It's going to be complicated enough for the open source project to deal with a startup, but to deal with multiple ones and them trying to stab each other, it's just going to poison the open source community. We really don't need that. And these folks I talked to started Cloudera and Cloudera got off to a start and I joined Cloudera. But unfortunately, we still had bad blood because Yahoo perceived it as a threat. Yahoo had gotten all this acclaim for investing in open source and building this amazing system. And now some of that acclaim was being taken by someone else, by a startup, and moreover, money was being taken, profit was being taken. And Cloudera had stock options for something which might become big and people at Yahoo were working at a big, already public company. And so there was resentment, which led to conflict in the open source community, which stymied the project there for a while and eventually led to the team from Yahoo creating Hortonworks as a competitor for Cloudera. And the two of us went at it pretty non-productively. It wasn't, I don't think, a really healthy competition where we egged each other on. Rather, we were selling very similar products in the same market and undercutting each other. Maybe it was good for consumers of the software. The customers, yeah. It wasn't great for either company and it wasn't great for the open source projects where the bitterness led over into. So when Hortonworks and Cloudera finally merged, it sort of put that to peace finally. It wasn't ideal, but it was what it was. [00:33:56] Sudip: Absolutely. Coming back to Hadoop, as you kind of look back, you know, since you guys started, which is like, I'm thinking of almost 20 years now, what would be the most proudest moments or moments in your view? Maybe Mike, I'll start with you and then Doug. [00:34:13] Mike: I'll mention one funny one, which is actually from the Nutch era rather than Hadoop. Okay. Nutch was successful in that a lot of its code lived on in Hadoop and so on. But as an actual system, were there that many Nutch users in the world? Not that many. However, there was one that was very notable to me personally, which was the Mitchell South Dakota Chamber of Commerce. I remember this very clearly. They ran like a small intranet search site on it. Maybe that's not that notable, but if you've ever driven cross country past the Corn Palace, I believe the Chamber of Commerce actually owns and operates, or at least did at one time, the Corn Palace. There's a piece of Americana that was tied to Nutch. Whenever some people asked me about Nutch, I would brag about the Corn Palace, how they were secretly running my code. That was pretty great. On Hadoop itself, you know, I think just the breadth of it, those sorting benchmarks were very exciting. The fact that for, you know, many universities based like undergraduate classes on it, that felt great. Those were all really notable moments for me. [00:35:09] Doug: For me, Hadoop exceeded my wildest expectations. When we started sort of trying to re-implement GFS and MapReduce, my goal was to provide open source implementation of things that researchers could use to manage large data collections. To some degrees, I think it didn't hit me the degree to which corporations had massive amounts of data that they weren't harnessing. And that was what occurred to these VCs. They knew that, they were talking to, but I was not involved in enterprise software at all prior to joining Cloudera. I didn't know anything about that whole market that was out there. And to see that explode, to see banks, insurance companies, governments, these kinds of institutions really take off using this stuff was pretty amazing. I didn't see this becoming a staple of enterprises by any means. It was far beyond my imagination for where we could go with this. [00:36:04] Sudip: That is a very nice segue into the last thing we like to do on every podcast, which is a lightning round. So maybe I'll start with you, Doug, since you kind of brought us here. Three quick questions. One is around acceleration. So what has happened in big data that you thought would take much longer? What has already happened? [00:36:23] Doug: I guess, I mean, I'm an optimist. I always think things are going to go quickly and go well. But that said, I'm not sure I expected to move to open source and to the cloud as rapidly as they have. I think we've really seen open source take root in enterprise data technologies and become an accepted way of doing things, which 20 years ago, it was not at all. I don't think it was everything. Everything in enterprise was proprietary, pretty much. And also running things in the cloud. People really wanted to keep their data on their servers and cloud was not widely trusted. And that's, we've seen a real 180 there. To me, it always was appealing. I don't want to run servers. I like being able to rent a server much better and treat it as a service. So that's been a great thing to see. Mike? [00:37:10] Mike: You know, I think if you're going to say what was surprising or that I thought would take longer, I mean, AI is a very true, but in some ways boring way to answer that question. Two things that are interesting about the kind of revolution in AI that we've seen, I would say going back to 2012, when some of the first vision models became really good, and it hasn't been stopping, is first to think about the extent to which big data has been enabling technology for the modern AI stack that we see now. Even if you had had the idea that neural approaches should be turbocharged, you probably couldn't have done anything with that observation in 2002. And the other thing, which is especially interesting to me, is the way that neural models, like these really large scale models that are produced at incredible expense, have migrated so quickly to open source. The open source AI stack is really good. And when you consider they face a lot of the challenges that we did around Nutch, which is like, you've got no hardware and you've got no data, but make it work anyway. It's really impressive to me what the open source AI community has been able to do. [00:38:10] Sudip: I think what is interesting is, you know, Google at that time, when I think you guys were building Hadoop, was kind of the protagonist right at the time. And now they are kind of playing defense in some ways, right? Because of AI and what is happening in open source, thanks to Facebook, Meta, and so on. So it's interesting to see the kind of tables turned a little bit in that way. Second question is around exploration. What do you think is the most interesting unsolved question in your space? Maybe, Mike, I'll start with you. Like, what do you think is still not solved that you'd love to see happen? [00:38:43] Mike: I've got a ton of answers to this question, which I hope doesn't make me seem ungrateful for everything good that's been happening. One thing I would say is that a lot of what people store in these enormous HDFS clusters is documents, right? Like we've got a huge store of company documents. But understanding of a document beyond just the text is pretty poor. So understanding images or plots or the kind of multimodal form of a document is generally not that great. I'm hopeful that some of these AI approaches would make that better. Another thing, which is maybe at the very top of the big data stack, is I'm getting kind of sick of dashboards. I've been seeing like the same, you know, really complicated dashboards for 15 plus years. Like, here's my data center or my complicated system. I've got a big data system underneath, maybe Hadoop, maybe MapReduce, maybe Spark that is collating a ton of data. And I boil it down into some neon acid green set of dashboards that is, you know, honestly, pretty unpleasant to manage. So I'd like to see us do something with all these, like, we have this extremely large and high dimensional data set. I'd like to move beyond just the big pile of dashboards that most people use to investigate what's going on. I don't even know exactly what the answer is, but we've been dealing with the dashboard metaphor for a long time. And I think we need some innovation there. [00:40:02] Sudip: I think on that point, Peter Baylis at Sisu Data is trying to do, you know, something like that, where he's trying to move you away from dashboards and really focus on what is going wrong in your business. [00:40:13] Mike: Yeah, I think that's one possibly really interesting direction. Yeah. Doug, for you? [00:40:17] Doug: One of the big challenges I think we haven't yet met, and I'm hoping we will, is really dealing with issues around privacy and consent and transparency. It's not strictly technical, but there's technical aspects. We want to get value from data. Much of the data which is most valuable is about people, but respecting those people's rights and getting value out of the data at the same time can be in conflict and coming up with mechanisms to really handle that conflict and deal with it in a reasonable way. I think it's hard. I think we're only in the early days of that. [00:40:54] Mike: I think we will see progress. [00:40:55] Doug: I think as a society, we've seen that as we adopt other technologies. It takes decades. If you look at, you know, food safety and automobile safety, that took a long time to evolve, and it's continuing to evolve and develop how that's managed and regulated. Healthcare safety, and I think data safety, we've got some very crude things we're doing so far, and there's a lot of room to advance that. I wish the industry took it more seriously and led rather than follows and has to deal with laws that are crafted by non-technical folks rather than trying to come up with strong technical solutions that really respect people. Anyway, that's one that I'm concerned about. [00:41:34] Sudip: Beautiful. Last question for you guys. What's one message you would like everyone to remember today? Maybe, Mike, if I can start with you. [00:41:42] Mike: I think I'll mention that the success of the Hadoop project over the last, I guess, 20 years, it certainly took a lot of hard work by a lot of people. It also required a huge amount of luck. If I look at the amount of work that I put into this or something else, and maybe I think Doug would say the same thing, Hadoop's been dramatically more successful than a lot of projects. I don't feel like the work on it is any better or worse than some others. There's a heavy amount of luck in it. As I mentioned, there's also a heavy amount of contingency. Like at a few crucial points, some people decided to do something pretty good for the universe. Google didn't have to publish those papers. Yahoo didn't have to keep funding an open-source project. But they did something a little bit better than they had to, and it turned out to be great. So if you're listening to this, and you're in a position to do something a little bit better for the tech universe than you otherwise might have to, maybe you have the seeds of Hadoop on your hands. You should go for it. [00:42:30] Sudip: That's a fantastic point. [00:42:31] Doug: Yeah, no, I definitely want to echo Mike in that we were incredibly lucky and at the right place at the right time with the right skills to move this along to the next step. That said, I think a strategy that I try to employ, and I assume Mike does as well, is you do want to aim big. You do want to aim high. And at the same time, watch your feet so you don't trip. And it's this constant challenge of how to maximize the outcome without compromising your goals. And I think that's the art of doing this, is try to find something which satisfies all these competing concerns. Obviously, we wanted this project to be successful, and we looked around and found the gifts that people had laid out there for us, and opened them and ran with it. It was luck guided by, I think, some successful ability to compromise and find the right path for it. [00:43:19] Mike: Doug, there's one question I was hoping you would answer during this conversation. Maybe I'll just ask it because I don't want it to end before knowing it. It's occasionally people ask me, if we were to do it all over again, would you still choose Java for it? And I give my answer, but I want to know yours. Would you still use Java to do all this stuff? [00:43:34] Doug: Yeah, no question. I did my share of C and C++ programming, and it's painful for this kind of thing in particular. We wanted to focus on algorithms, on the overall architecture, and keep things as simple as possible. And to have done it in another language, I think would have been premature optimization. I think we were able to get solid performance by optimizing where we needed for the most part. So yeah, I don't have a regret there. [00:44:00] Sudip: How about you, Mike? [00:44:01] Mike: I agree with all of Doug's points that I would not have done it in a lower level language. I think the only question I have sometimes is whether it should have been even higher level. Like, I never regretted the claimed performance problems with Java. I never really observed them, or maybe just at that point in the project, they weren't important. I wonder if we should have done it in Python or Perl. Like, how high up the stack could we have gone and still had a successful project? [00:44:21] Doug: Yeah, I think I'm a little bit more of a language snob. [00:44:23] Mike: Fair enough. [00:44:27] Sudip: Fantastic. Thank you so much, guys. [00:44:29] This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit sudipchakrabarti.substack.com [https://sudipchakrabarti.substack.com?utm_medium=podcast&utm_campaign=CTA_1]

29. nov. 2023 - 45 min
Tilmeld dig for at lytte
En fantastisk app med et enormt stort udvalg af spændende podcasts. Podimo formår virkelig at lave godt indhold, der takler de lidt mere svære emner. At der så også er lydbøger oveni til en billig pris, gør at det er blevet min favorit app.
En fantastisk app med et enormt stort udvalg af spændende podcasts. Podimo formår virkelig at lave godt indhold, der takler de lidt mere svære emner. At der så også er lydbøger oveni til en billig pris, gør at det er blevet min favorit app.
Rigtig god tjeneste med gode eksklusive podcasts og derudover et kæmpe udvalg af podcasts og lydbøger. Kan varmt anbefales, om ikke andet så udelukkende pga Dårligdommerne, Klovn podcast, Hakkedrengene og Han duo 😁 👍
Podimo er blevet uundværlig! Til lange bilture, hverdagen, rengøringen og i det hele taget, når man trænger til lidt adspredelse.

Vælg dit abonnement

Mest populære

Begrænset tilbud

Premium

20 timers lydbøger

  • Podcasts kun på Podimo

  • Ingen reklamer i podcasts fra Podimo

  • Opsig når som helst

2 måneder kun 19 kr.
Derefter 99 kr. / måned

Kom i gang

Premium Plus

100 timers lydbøger

  • Podcasts kun på Podimo

  • Ingen reklamer i podcasts fra Podimo

  • Opsig når som helst

Prøv gratis i 7 dage
Derefter 129 kr. / måned

Prøv gratis

Kun på Podimo

Populære lydbøger

Kom i gang

2 måneder kun 19 kr. Derefter 99 kr. / måned. Opsig når som helst.