Software at Scale

Podcast by Utsav Shah

Software at Scale is where we discuss the technical stories behind large software applications. www.softwareatscale.dev


All episodes

60 episodes
Software at Scale 60 - Data Platforms with Aravind Suresh

Aravind [https://www.linkedin.com/in/aravind0suresh/] was a Staff Software Engineer at Uber, and currently works at OpenAI.

Apple Podcasts [https://podcasts.apple.com/us/podcast/software-at-scale/id1543617436] | Spotify [https://open.spotify.com/show/7eGnnZnfb3mOGTqWR4gTjK?si=DMsJIWg3QjiiJS65A1nEbQ] | Google Podcasts [https://podcasts.google.com/feed/aHR0cHM6Ly9yZWxpYWJpbGl0eS5zdWJzdGFjay5jb20vZmVlZC8]

Edited Transcript

Can you tell us about the scale of data Uber was dealing with when you joined in 2018, and how it evolved?

When I joined Uber in mid-2018, we were handling a few petabytes of data. The company was going through a significant scaling journey, both in terms of launching in new cities and the corresponding increase in data volume. By the time I left, our data had grown to over an exabyte. To put it in perspective, the amount of data grew by a factor of about 20 in just a three to four-year period. Currently, Uber ingests roughly a petabyte of data daily. This includes some replication, but it's still an enormous amount. About 60-70% of this is raw data, coming directly from online systems or message buses. The rest is derived data sets and model data sets built on top of the raw data.

That's an incredible amount of data. What kinds of insights and decisions does this enable for Uber?

This scale of data enables a wide range of complex analytics and data-driven decisions. For instance, we can analyze how many concurrent trips we're handling throughout the year globally. This is crucial for determining how many workers and CPUs we need running at any given time to serve trips worldwide. We can also identify trends like the fastest growing cities or seasonal patterns in traffic. The vast amount of historical data allows us to make more accurate predictions and spot long-term trends that might not be visible in shorter time frames. Another key use is identifying anomalous user patterns.
For example, we can detect potentially fraudulent activities like a single user account logging in from multiple locations across the globe. We can also analyze user behavior patterns, such as which cities have higher rates of trip cancellations compared to completed trips. These insights don't just inform day-to-day operations; they can lead to key product decisions. For instance, by plotting heat maps of trip coordinates over a year, we could see overlapping patterns that eventually led to the concept of Uber Pool.

How does Uber manage real-time versus batch data processing, and what are the trade-offs?

We use both offline (batch) and online (real-time) data processing systems, each optimized for different use cases. For real-time analytics, we use tools like Apache Pinot. These systems are optimized for low latency and quick response times, which is crucial for certain applications. For example, our restaurant manager system uses Pinot to provide near-real-time insights. Data flows from the serving stack to Kafka, then to Pinot, where it can be queried quickly. This allows for rapid decision-making based on very recent data.

On the other hand, our offline flow uses the Hadoop stack for batch processing. This is where we store and process the bulk of our historical data. It's optimized for throughput: processing large amounts of data over time.

The trade-off is that real-time systems are generally 10 to 100 times more expensive than batch systems. They require careful tuning of indexes and partitioning to work efficiently. However, they enable us to answer queries in milliseconds or seconds, whereas batch jobs might take minutes or hours. The choice between batch and real-time depends on the specific use case. We always ask ourselves: does this really need to be real-time, or can it be done in batch? The answer to this question goes a long way in deciding which approach to use and in building maintainable systems.
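The batch-versus-real-time split described here can be sketched in a few lines. This is an illustrative toy, not Uber's actual pipeline: every event is written to both a bounded, low-latency view (the Pinot-like path) and an append-everything historical store (the Hadoop-like path), and each path answers the queries it is optimized for. All names and numbers are invented.

```python
from collections import deque, defaultdict

class RealTimePath:
    """Low-latency view over only the most recent events (Pinot-like)."""
    def __init__(self, window=1000):
        self.recent = deque(maxlen=window)  # bounded: recency over history

    def ingest(self, event):
        self.recent.append(event)

    def concurrent_trips(self, city):
        # Answerable quickly because the working set stays small.
        return sum(1 for e in self.recent
                   if e["city"] == city and e["status"] == "active")

class BatchPath:
    """High-throughput store for the full history (Hadoop-like)."""
    def __init__(self):
        self.all_events = []

    def ingest(self, event):
        self.all_events.append(event)

    def trips_per_city(self):
        # Scans everything; fine for a nightly job, too slow for a dashboard.
        counts = defaultdict(int)
        for e in self.all_events:
            counts[e["city"]] += 1
        return dict(counts)

rt, batch = RealTimePath(), BatchPath()
for i in range(5):
    event = {"city": "SF", "status": "active" if i < 3 else "completed"}
    rt.ingest(event)    # both paths see every event
    batch.ingest(event)

print(rt.concurrent_trips("SF"))      # 3
print(batch.trips_per_city()["SF"])   # 5
```

The "does this need to be real-time?" question from the transcript maps to which of the two paths a new query should target.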
What challenges come with maintaining such large-scale data systems, especially as they mature?

As data systems mature, we face a range of challenges beyond just handling the growing volume of data. One major challenge is the need for additional tools and systems to manage the complexity. For instance, we needed to build tools for data discovery. When you have thousands of tables and hundreds of users, you need a way for people to find the right data for their needs. We built a tool called Data Book at Uber to solve this problem.

Governance and compliance are also huge challenges. When you're dealing with sensitive customer data, you need robust systems to enforce data retention policies and handle data deletion requests. This is particularly challenging in a distributed system where data might be replicated across multiple tables and derived data sets. We built an in-house lineage system to track which workloads derive from what data. This is crucial for tasks like deleting specific data across the entire system. It's not just about deleting from one table: you need to track down and update all derived data sets as well.

Data deletion itself is a complex process. Because most files in the batch world are kept immutable for efficiency, deleting data often means rewriting entire files. We have to batch these operations and perform them carefully to maintain system performance.

Cost optimization is an ongoing challenge. We're constantly looking for ways to make our systems more efficient, whether that's by optimizing our storage formats, improving our query performance, or finding better ways to manage our compute resources.

How do you see the future of data infrastructure evolving, especially with recent AI advancements?

The rise of AI and particularly generative AI is opening up new dimensions in data infrastructure. One area we're seeing a lot of activity in is vector databases and semantic search capabilities.
Traditional keyword-based search is being supplemented or replaced by embedding-based semantic search, which requires new types of databases and indexing strategies. We're also seeing increased demand for real-time processing. As AI models become more integrated into production systems, there's a need to handle more GPUs in the serving flow, which presents its own set of challenges. Another interesting trend is the convergence of traditional data analytics with AI workloads. We're starting to see use cases where people want to perform complex queries that involve both structured data analytics and AI model inference. Overall, I think we're moving towards more integrated, real-time, and AI-aware data infrastructure. The challenge will be balancing the need for advanced capabilities with concerns around cost, efficiency, and maintainability.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev [https://www.softwareatscale.dev?utm_medium=podcast&utm_campaign=CTA_1]
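The lineage-driven deletion workflow described in this episode can be sketched as follows. The lineage graph, table names, and rows are all hypothetical, but the shape matches the description: walk the lineage to find every dataset derived from the raw table, then "delete" a user by rewriting each (immutable) dataset without their rows.

```python
from collections import deque

# Hypothetical lineage: raw table -> datasets derived from it.
lineage = {
    "raw_trips": ["trips_daily", "city_heatmap"],
    "trips_daily": ["trips_monthly"],
}

def downstream(table):
    """All datasets transitively derived from `table` (BFS over lineage)."""
    seen, queue = set(), deque([table])
    while queue:
        for child in lineage.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

def rewrite_without_user(rows, user_id):
    # Files are immutable, so deletion means a full rewrite minus the user.
    return [r for r in rows if r["user"] != user_id]

tables = {
    "raw_trips":     [{"user": "u1"}, {"user": "u2"}],
    "trips_daily":   [{"user": "u1"}],
    "city_heatmap":  [{"user": "u2"}],
    "trips_monthly": [{"user": "u1"}],
}

# Deleting u1 touches the raw table AND everything derived from it.
to_rewrite = {"raw_trips"} | downstream("raw_trips")
for name in to_rewrite:
    tables[name] = rewrite_without_user(tables[name], "u1")

print(sorted(to_rewrite))  # every dataset rewritten, not just the raw table
```

In a real system the rewrites would be batched and throttled, as the transcript notes, to avoid hurting query performance.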

05 aug 2024 - 34 min
Software at Scale 59 - Incident Management with Nora Jones

Nora is the CEO and co-founder of Jeli [https://www.jeli.io/], an incident management platform.

Apple Podcasts [https://podcasts.apple.com/us/podcast/software-at-scale/id1543617436] | Spotify [https://open.spotify.com/show/7eGnnZnfb3mOGTqWR4gTjK?si=DMsJIWg3QjiiJS65A1nEbQ] | Google Podcasts [https://podcasts.google.com/feed/aHR0cHM6Ly9yZWxpYWJpbGl0eS5zdWJzdGFjay5jb20vZmVlZC8]

Nora provides an in-depth look into incident management within the software industry and discusses the incident management platform Jeli. Nora's fascination with risk and its influence on human behavior stems from her early career in hardware and her involvement with a home security company. These experiences revealed the high stakes associated with software failures, uncovering the importance of learning from incidents and fostering a blame-aware culture that prioritizes continuous improvement.

In contrast to the traditional blameless approach, which seeks to eliminate blame entirely, a blame-aware culture acknowledges that mistakes happen and focuses on learning from them instead of assigning blame. This approach encourages open discussions about incidents, creating a sense of safety and driving superior long-term outcomes.

We also discuss chaos engineering, the practice of deliberately creating turbulent conditions in production to simulate real-world scenarios. This approach allows teams to experiment and acquire the necessary skills to effectively respond to incidents.

Nora then introduces Jeli, an incident management platform that places a high priority on the human aspects of incidents. Unlike other platforms that solely concentrate on technology, Jeli aims to bridge the gap between technology and people. By emphasizing coordination, communication, and learning, Jeli helps organizations reduce incident costs and cultivate a healthier incident management culture.
We discuss how customer expectations in the software industry have evolved over time, with users becoming increasingly intolerant of low reliability, particularly in critical services (Dan Luu has an incredible blog [https://danluu.com/everything-is-broken/] on the incidence of bugs in day-to-day software). This shift in priorities has compelled organizations to place greater importance on reliability and invest in incident management practices. We conclude by discussing how incident management will further evolve and how leaders can set their organizations up for success.

05 jul 2023 - 44 min
Software at Scale 58 - Measuring Developer Productivity with Abi Noda

Abi Noda is the CEO and co-founder of DX [https://getdx.com/], a developer productivity platform.

Apple Podcasts [https://podcasts.apple.com/us/podcast/software-at-scale/id1543617436] | Spotify [https://open.spotify.com/show/7eGnnZnfb3mOGTqWR4gTjK?si=DMsJIWg3QjiiJS65A1nEbQ] | Google Podcasts [https://podcasts.google.com/feed/aHR0cHM6Ly9yZWxpYWJpbGl0eS5zdWJzdGFjay5jb20vZmVlZC8]

My view on developer experience and productivity measurement aligns extremely closely with DX's view. The productivity of a group of engineers cannot be measured by tools alone: there are too many qualitative factors, like cross-functional stakeholder bureaucracy or inefficiency and inherent domain/codebase complexity, that tools cannot capture. At the same time, there are some metrics, like whether an engineer has committed any code changes in their first week/month, that serve as useful guardrails for engineering leadership. A combination of tools and metrics may provide a holistic view of, and insights into, the engineering organization's throughput.

In this episode, we discuss the DX platform and Abi's recently published research paper on developer experience, "The Developer-Centric Approach to Measuring and Improving Productivity". We talk about how organizations can use tools and surveys to iterate and improve upon developer experience, and ultimately, engineering throughput.

GPT-4 generated summary

In this episode, Abi Noda and I explore the landscape of engineering metrics and a quantifiable approach towards developer experience. Our discussion goes from the value of developer surveys and system-based metrics to the tangible ways in which DX is innovating the field. We initiate our conversation with a comparison of developer surveys and system-based metrics. Abi explains that while developer surveys offer a qualitative perspective on tool efficacy and user sentiment, system-based metrics present a quantitative analysis of productivity and code quality.
The discussion then moves to the real-world applications of these metrics, with Pfizer and eBay as case studies. Pfizer, for example, employs metrics for a detailed understanding of developer needs, subsequently driving strategic decision-making processes. They have used these metrics to identify bottlenecks in their development cycle and strategically address those pain points. eBay, on the other hand, uses the insights from developer sentiment surveys to design tools that directly enhance developer satisfaction and productivity.

Next, our dialogue around survey development centers on the dilemma between standardization and customization. While standardization offers cost efficiency and benchmarking opportunities, customization acknowledges the unique nature of every organization. Abi proposes a blend of both to cater to different aspects of developer sentiment and productivity metrics.

The highlight of the conversation is the introduction of DX's data platform. The platform consolidates data across internal and third-party tools in a ready-to-analyze format, giving users the freedom to build their own queries, reports, and metrics. The ability to combine survey and system data allows the unearthing of unique insights, marking a distinctive advantage of DX's approach.

In this episode, Abi Noda shares enlightening perspectives on engineering metrics and the role they play in shaping the developer experience. We delve into how DX's unique approach to data aggregation and its potential applications can lead organizations toward more data-driven and effective decision-making processes.
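The survey-plus-system blend discussed in this episode might look something like the sketch below. The teams, metric names, and thresholds are invented for illustration and are not DX's actual model; the point is that a qualitative signal (sentiment) and a quantitative one (PR cycle time) are joined into a single per-team view.

```python
survey = {          # team -> average developer-sentiment score (1-5)
    "payments": 4.2,
    "search":   2.9,
}
system = {          # team -> median PR cycle time in hours (lower is better)
    "payments": 18.0,
    "search":   41.0,
}

def team_report(team):
    """Combine one qualitative and one quantitative signal for a team."""
    sentiment = survey[team]
    cycle_hours = system[team]
    # Flag teams where either signal looks poor; neither alone tells the story.
    flags = []
    if sentiment < 3.5:
        flags.append("low sentiment")
    if cycle_hours > 24:
        flags.append("slow PR cycle")
    return {"team": team, "sentiment": sentiment,
            "pr_cycle_hours": cycle_hours, "flags": flags}

for team in survey:
    print(team_report(team))
```

A team can score well on one axis and badly on the other, which is exactly why the episode argues for combining both kinds of measurement.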

13 jun 2023 - 49 min
Software at Scale 57 - Scalable Frontends with Robert Cooke

Robert Cooke is the CTO and co-founder of 3Forge [https://www.3forge.com/], a real-time data visualization platform.

Apple Podcasts [https://podcasts.apple.com/us/podcast/software-at-scale/id1543617436] | Spotify [https://open.spotify.com/show/7eGnnZnfb3mOGTqWR4gTjK?si=DMsJIWg3QjiiJS65A1nEbQ] | Google Podcasts [https://podcasts.google.com/feed/aHR0cHM6Ly9yZWxpYWJpbGl0eS5zdWJzdGFjay5jb20vZmVlZC8]

In this episode, we delve into Wall Street's high-frequency trading evolution and the importance of high-volume trading data observability. We examine traditional software observability tools, such as Datadog, and contrast them with 3Forge's financial observability platform, AMI.

GPT-4 generated summary

In this episode of the Software at Scale podcast, Robert Cooke, CTO and co-founder of 3Forge, a comprehensive internal tools platform, shares his journey and insights. He outlines his career trajectory, which includes prominent positions such as Infrastructure Lead at Bear Stearns and Head of Infrastructure at Liquidnet, and his work on high-frequency trading systems that employ software and hardware to perform rapid, automated trading decisions based on market data.

Cooke elucidates how 3Forge empowers subject matter experts to automate trading decisions by encoding business logic. He underscores the criticality of robust monitoring systems around these automated trading systems, drawing an analogy with nuclear reactors due to the potential catastrophic repercussions of any malfunction.

The dialogue then shifts to the impact of significant events like the COVID-19 pandemic on high-frequency trading systems. Cooke postulates that these systems can falter under such conditions, as they are designed to follow developer-encoded instructions and lack the flexibility to adjust to unforeseen macro events.
He refers to past instances like the Facebook IPO and Knight Capital's downfall, where automated trading systems were unable to handle atypical market conditions, highlighting the necessity for human intervention in such scenarios.

Cooke then delves into how 3Forge designs software for mission-critical scenarios, making an analogy with military strategy. Utilizing the OODA loop concept (Observe, Orient, Decide, and Act), they can swiftly respond to situations like outages. He argues that traditional observability tools only address the first step, whereas their solution facilitates quick orientation and decision-making, substantially reducing reaction time. He cites a scenario involving a sudden surge in Facebook orders where their tool allows operators to detect the problem in real time, comprehend the context, decide on the response, and promptly act on it. He extends this example to situations like government incidents or emergencies where an expedited response is paramount.

Additionally, Cooke emphasizes the significance of low-latency UI updates in their tool. He explains that their software takes a reactive programming approach, responding to changes in real time and only updating the altered components. As data size increases and reaction time becomes more critical, this feature becomes increasingly important.

Cooke concludes this segment by discussing the evolution of their clients' use cases, from initially needing static data overviews to progressively demanding real-time information and interactive workflows. He gives the example of users being able to comment on a chart and that comment being immediately visible to others, akin to the real-time collaboration features in tools like Google Docs.

In the subsequent segment, Cooke shares his perspective on choosing the right technology to drive business decisions.
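The "only update the altered components" idea Cooke describes can be illustrated with a minimal diff over two UI state snapshots; component names and state shapes here are entirely hypothetical, and a real reactive system would track data bindings rather than comparing snapshots wholesale.

```python
def changed_components(prev, curr):
    """Return the ids of components whose bound data differs between snapshots."""
    return sorted(k for k in curr if prev.get(k) != curr[k])

# Two snapshots of the data bound to each UI component.
prev = {"orders_chart": [1, 2], "price_ticker": 101.5, "news_feed": ["a"]}
curr = {"orders_chart": [1, 2, 9], "price_ticker": 101.5, "news_feed": ["a"]}

to_render = changed_components(prev, curr)
print(to_render)  # only the chart re-renders; the other components are untouched
```

The benefit grows with data size: re-rendering one changed chart is cheap even when the full snapshot is large, which is the latency property the episode emphasizes.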
He stresses the importance of understanding the history and trends of technology, having experienced several shifts in the tech industry since his early software-writing days in the 1980s. He projects that while computer speeds might plateau, parallel computing will proliferate, leading to CPUs with more cores. He also predicts continued growth in memory, both in terms of RAM and disk space.

He further explains his preference for web-based applications due to their security and absence of installation requirements. He underscores the necessity of minimizing the data held in the web browser and shares how 3Forge has built every component from scratch to achieve this. Their components are designed to handle as much data as possible, constantly pulling in data based on user interaction.

He also emphasizes the importance of constructing a high-performing component library that integrates seamlessly, providing a consistent user experience. He asserts that developers often face confusion when required to combine components that behave differently. He envisions a future where software development involves no JavaScript or HTML, a concept that he acknowledges may be unsettling to some developers.

Using the example of a dropdown menu, Cooke explains how a component initially designed for a small amount of data might eventually need to handle much larger data sets. He emphasizes the need to design components to handle the maximum possible data from the outset to avoid such issues.

The conversation then pivots to the concept of over-engineering. Cooke argues that building a robust and universal solution from the start is not over-engineering but an efficient approach. He notes the significant overlap in application use cases, making it advantageous to create a component that can cater to a wide variety of needs.
In response to the host's query about selling software to Wall Street, Cooke advocates targeting the most demanding customers first. He believes that if a product can satisfy such customers, it's easier to sell to others. He argues that it's challenging to start with a simple product and then scale it up for more complex use cases, but feasible to start with a complex product and tailor it for simpler use cases.

Cooke further describes their process of creating a software product. Their strategy was to focus on core components, striving to make them as efficient and effective as possible. This involved investing years in foundational elements like string libraries and data marshalling. After establishing a robust foundation, they could then layer on additional features and enhancements. This approach allowed them to eventually produce a mature and capable product.

He also underscores the inevitability of users pushing software to its limits, regardless of its optimization, and thus argues for creating software that is as fast as possible right from the start. He refers to an interview with Steve Jobs, who argued that the best developers can create software that's substantially faster than others'. Cooke's team continually seeks ways to refine and improve the efficiency of their platform.

Next, the discussion shifts to team composition and the necessary attributes for software engineers. Cooke emphasizes the importance of a strong work ethic and a passion for crafting good software. He explains how his ambition to become the best software developer from a young age has shaped his company's culture, fostering a virtuous cycle of hard work and dedication among his team.

The host then emphasizes the importance of engineers working on high-quality products, suggesting that problems and bugs can sap energy and demotivate a team.
Cooke concurs, comparing the experience of working on high-quality software to working on an F1 race car, where the pursuit of refinement and optimization is a dream for engineers.

The conversation then turns to the importance of having a team with diverse thought processes and skill sets. Cooke recounts how the introduction of different disciplines and perspectives in 2019 profoundly transformed his company.

The dialogue then transitions to the state of software solutions before 3Forge's platform, touching upon the compartmentalized nature of systems in large corporations and the problems that arise from it. Cooke explains how their solution offers a more comprehensive and holistic overview that cuts across different risk categories.

Finally, in response to the host's question about open-source systems, Cooke expresses reservations about the use of open-source software in a corporate setting. He acknowledges the extensive overlap and redundancy among the many new systems being developed. Although he does not identify any specific groundbreaking technology, he believes the rapid proliferation of similar technologies might lead to considerable technical debt in the future.

Host Utsav wraps up the conversation by asking Cooke about his expectations and concerns for the future of technology and the industry. Cooke voices his concern about the continually growing number of different systems and technologies that companies are adopting, which makes integrating and orchestrating all these components a challenge. He advises companies to exercise caution when adopting multiple technologies simultaneously.

However, Cooke also expresses enthusiasm about the future of 3Forge, a platform he has devoted a decade of his life to developing. He expresses confidence in the unique approach and discipline employed in building the platform.
Cooke is optimistic about the company's growth and marketing efforts and their focus on fostering a developer community. He believes that the platform will thrive as developers share their experiences and the product gains momentum. Utsav acknowledges the excitement and potential challenges that lie ahead, especially in managing community-driven systems. They conclude by inviting Cooke to return for another discussion in the future to review how things have progressed, and both express their appreciation for the fruitful discussion before ending the podcast.

16 May 2023 - 55 min
Software at Scale 56 - SaaS cost with Roi Rav-Hon

Roi Rav-Hon is the co-founder and CEO of Finout, a SaaS cost management platform.

Apple Podcasts [https://podcasts.apple.com/us/podcast/software-at-scale/id1543617436] | Spotify [https://open.spotify.com/show/7eGnnZnfb3mOGTqWR4gTjK?si=DMsJIWg3QjiiJS65A1nEbQ] | Google Podcasts [https://podcasts.google.com/feed/aHR0cHM6Ly9yZWxpYWJpbGl0eS5zdWJzdGFjay5jb20vZmVlZC8]

In this episode, we review the challenge of keeping SaaS costs reasonable for tech companies. Usage-based pricing models for infrastructure lead to a gradual ramp-up of costs, and cost control has repeatedly snuck up the priority list in my career as an infrastructure/platform engineer. So I'm particularly interested in how engineering teams can better understand, track, and "shift left" infrastructure cost tracking and prevent regressions. We specifically go over Kubernetes cost management, and why costs need to be attributable to specific teams in order for cost management to be self-governing in an organization.
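The per-team cost attribution discussed here can be sketched roughly as below. The pod data, team labels, and blended CPU-hour rate are all assumptions for illustration; a real system would pull usage from the Kubernetes API (e.g. namespace or label metadata on each pod) and prices from a cloud billing export.

```python
# Hypothetical pod usage records, already tagged with an owning team,
# the kind of attribution the episode argues every cost needs.
pods = [
    {"name": "api-1",   "team": "checkout", "cpu_hours": 40.0},
    {"name": "api-2",   "team": "checkout", "cpu_hours": 35.0},
    {"name": "batch-1", "team": "data",     "cpu_hours": 125.0},
]

PRICE_PER_CPU_HOUR = 0.04  # assumed blended rate in dollars

def cost_by_team(pods):
    """Roll shared-cluster usage up into a per-team dollar figure."""
    totals = {}
    for pod in pods:
        totals[pod["team"]] = totals.get(pod["team"], 0.0) \
            + pod["cpu_hours"] * PRICE_PER_CPU_HOUR
    # Round to cents so each team sees a stable, reportable number.
    return {team: round(v, 2) for team, v in totals.items()}

print(cost_by_team(pods))  # {'checkout': 3.0, 'data': 5.0}
```

Once costs land on a specific team rather than a shared cloud bill, that team can watch its own number and catch regressions, which is the self-governance point from the episode.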

17 apr 2023 - 28 min