
Data Engineering Podcast
Podcast by Tobias Macey
This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
All episodes
437 episodes
SUMMARY
This episode features an insightful conversation with Petr Janda, the CEO and founder of Synq. Petr shares his journey from being an engineer to founding Synq, emphasizing the importance of treating data systems with the same rigor as engineering systems. He discusses the challenges and solutions in data reliability, including the need for transparency and ownership in data systems. Synq's platform helps data teams manage incidents, understand data dependencies, and ensure data quality by providing insights and automation capabilities. Petr emphasizes the need for a holistic approach to data reliability, integrating data systems into broader business processes. He highlights the role of data teams in modern organizations and how Synq is empowering them to achieve this.

ANNOUNCEMENTS
* Hello and welcome to the Data Engineering Podcast, the show about modern data management
* Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst [https://www.dataengineeringpodcast.com/starburst] and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
* Your host is Tobias Macey and today I'm interviewing Petr Janda about Synq, a data reliability platform focused on leveling up data teams by supporting a culture of engineering rigor

INTERVIEW
* Introduction
* How did you get involved in the area of data management?
* Can you describe what Synq is and the story behind it?
* Data observability/reliability is a category that grew rapidly over the past ~5 years and has several vendors focused on different elements of the problem. What are the capabilities that you saw as lacking in the ecosystem which you are looking to address?
* Operational/infrastructure engineers have spent the past decade honing their approach to incident management and uptime commitments. How do those concepts map to the responsibilities and workflows of data teams?
* Tooling only plays a small part in SLAs and incident management. How does Synq help to support the cultural transformation that is necessary?
* What does an on-call rotation for a data engineer/data platform engineer look like as compared with an application-focused team?
* How does the focus on data assets/data products shift your approach to observability as compared to a table/pipeline centric approach?
* With the focus on sharing ownership beyond the boundaries of the data team there is a strong correlation with data governance principles. How do you see organizations incorporating Synq into their approach to data governance/compliance?
* Can you describe how Synq is designed/implemented?
* How have the scope and goals of the product changed since you first started working on it?
* For a team who is onboarding onto Synq, what are the steps required to get it integrated into their technology stack and workflows?
* What are the types of incidents/errors that you are able to identify and alert on?
* What does a typical incident/error resolution process look like with Synq?
* What are the most interesting, innovative, or unexpected ways that you have seen Synq used?
* What are the most interesting, unexpected, or challenging lessons that you have learned while working on Synq?
* When is Synq the wrong choice?
* What do you have planned for the future of Synq?

CONTACT INFO
* LinkedIn [https://www.linkedin.com/in/petr-janda/?originalSubdomain=dk]
* Substack [https://substack.com/@petrjanda]

PARTING QUESTION
* From your perspective, what is the biggest gap in the tooling or technology for data management today?

CLOSING ANNOUNCEMENTS
* Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ [https://www.pythonpodcast.com] covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast [https://www.themachinelearningpodcast.com] helps you go from idea to production with machine learning.
* Visit the site [https://www.dataengineeringpodcast.com] to subscribe to the show, sign up for the mailing list, and read the show notes.
* If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com [hosts@dataengineeringpodcast.com] with your story.

LINKS
* Synq [https://www.synq.io/]
* Incident Management [https://www.pagerduty.com/resources/learn/what-is-incident-management/]
* SLA == Service Level Agreement [https://en.wikipedia.org/wiki/Service-level_agreement]
* Data Governance [https://en.wikipedia.org/wiki/Data_governance]
  * Podcast Episode [https://www.dataengineeringpodcast.com/nicola-askham-practical-data-governance-episode-428]
* PagerDuty [https://www.pagerduty.com/]
* OpsGenie [https://www.atlassian.com/software/opsgenie]
* Clickhouse [https://clickhouse.com/]
  * Podcast Episode [https://www.dataengineeringpodcast.com/clickhouse-data-warehouse-episode-88/]
* dbt [https://www.getdbt.com/]
  * Podcast Episode [https://www.dataengineeringpodcast.com/dbt-data-analytics-episode-81/]
* SQLMesh [https://sqlmesh.readthedocs.io/en/stable/]
  * Podcast Episode [https://www.dataengineeringpodcast.com/sqlmesh-open-source-dataops-episode-380]

The intro and outro music is from The Hug [http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug] by The Freak Fandango Orchestra [http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/] / CC BY-SA [http://creativecommons.org/licenses/by-sa/3.0/]

Sponsored By:
* Starburst [https://www.dataengineeringpodcast.com/starburst]: This episode is brought to you by Starburst - an end-to-end data lakehouse platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, the query engine Apache Iceberg was designed for, Starburst is an open platform with support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. Go to dataengineeringpodcast.com/starburst [https://www.dataengineeringpodcast.com/starburst]

Support Data Engineering Podcast [https://dataengineering.supercast.com/]

SUMMARY
Data lakehouse architectures have been gaining significant adoption. To accelerate adoption in the enterprise Microsoft has created the Fabric platform, based on their OneLake architecture. In this episode Dipti Borkar shares her experiences working on the product team at Fabric and explains the various use cases for the Fabric service.

ANNOUNCEMENTS
* Hello and welcome to the Data Engineering Podcast, the show about modern data management
* Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst [https://www.dataengineeringpodcast.com/starburst] and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
* Your host is Tobias Macey and today I'm interviewing Dipti Borkar about her work on Microsoft Fabric and performing analytics on data withou

INTERVIEW
* Introduction
* How did you get involved in the area of data management?
* Can you describe what Microsoft Fabric is and the story behind it?
* Data lakes in various forms have been gaining significant popularity as a unified interface to an organization's analytics. What are the motivating factors that you see for that trend?
* Microsoft has been investing heavily in open source in recent years, and the Fabric platform relies on several open components. What are the benefits of layering on top of existing technologies rather than building a fully custom solution?
* What are the elements of Fabric that were engineered specifically for the service?
* What are the most interesting/complicated integration challenges?
* How has your prior experience with Ahana and Presto informed your current work at Microsoft?
* AI plays a substantial role in the product. What are the benefits of embedding Copilot into the data engine?
* What are the challenges in terms of safety and reliability?
* What are the most interesting, innovative, or unexpected ways that you have seen the Fabric platform used?
* What are the most interesting, unexpected, or challenging lessons that you have learned while working on data lakes generally, and Fabric specifically?
* When is Fabric the wrong choice?
* What do you have planned for the future of data lake analytics?

CONTACT INFO
* LinkedIn [https://www.linkedin.com/in/diptiborkar/]

PARTING QUESTION
* From your perspective, what is the biggest gap in the tooling or technology for data management today?

CLOSING ANNOUNCEMENTS
* Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ [https://www.pythonpodcast.com] covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast [https://www.themachinelearningpodcast.com] helps you go from idea to production with machine learning.
* Visit the site [https://www.dataengineeringpodcast.com] to subscribe to the show, sign up for the mailing list, and read the show notes.
* If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com [hosts@dataengineeringpodcast.com] with your story.

LINKS
* Microsoft Fabric [https://www.microsoft.com/microsoft-fabric]
* Ahana episode [https://www.dataengineeringpodcast.com/ahana-presto-cloud-data-lake-episode-217]
* DB2 Distributed [https://www.ibm.com/docs/en/db2/11.5?topic=managers-designing-distributed-databases]
* Spark [https://spark.apache.org/]
* Presto [https://prestodb.io/]
* Azure Data [https://azure.microsoft.com/en-us/products#analytics]
* MAD Landscape [https://mattturck.com/mad2024/]
  * Podcast Episode [https://www.dataengineeringpodcast.com/mad-landscape-2023-data-infrastructure-episode-369]
  * ML Podcast Episode [https://www.themachinelearningpodcast.com/mad-landscape-2023-ml-ai-episode-21]
* Tableau [https://www.tableau.com/]
* dbt [https://www.getdbt.com/]
* Medallion Architecture [https://dataengineering.wiki/Concepts/Medallion+Architecture]
* Microsoft OneLake [https://learn.microsoft.com/fabric/onelake/onelake-overview]
* ORC [https://orc.apache.org/]
* Parquet [https://parquet.incubator.apache.org]
* Avro [https://avro.apache.org/]
* Delta Lake [https://delta.io/]
* Iceberg [https://iceberg.apache.org/]
  * Podcast Episode [https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52/]
* Hudi [https://hudi.apache.org/]
  * Podcast Episode [https://www.dataengineeringpodcast.com/hudi-streaming-data-lake-episode-209]
* Hadoop [https://hadoop.apache.org/]
* PowerBI [https://www.microsoft.com/power-platform/products/power-bi]
  * Podcast Episode [https://www.dataengineeringpodcast.com/power-bi-business-intelligence-episode-154]
* Velox [https://velox-lib.io/]
* Gluten [https://gluten.apache.org/]
* Apache XTable [https://xtable.apache.org/]
* GraphQL [https://graphql.org/]
* Formula 1 [https://www.formula1.com/]
* McLaren [https://www.mclaren.com/]

The intro and outro music is from The Hug [http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug] by The Freak Fandango Orchestra [http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/] / CC BY-SA [http://creativecommons.org/licenses/by-sa/3.0/]

Sponsored By:
* Starburst [https://www.dataengineeringpodcast.com/starburst]: This episode is brought to you by Starburst - an end-to-end data lakehouse platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, the query engine Apache Iceberg was designed for, Starburst is an open platform with support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. Go to dataengineeringpodcast.com/starburst [https://www.dataengineeringpodcast.com/starburst]

Support Data Engineering Podcast [https://dataengineering.supercast.com/]

SUMMARY
Stripe is a company that relies on data to power their products and business. To support that functionality they have invested in Trino and Iceberg for their analytical workloads. In this episode Kevin Liu shares some of the interesting features that they have built by combining those technologies, as well as the challenges that they face in supporting the myriad workloads that are thrown at this layer of their data platform.

ANNOUNCEMENTS
* Hello and welcome to the Data Engineering Podcast, the show about modern data management
* Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst [https://www.dataengineeringpodcast.com/starburst] and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
* Your host is Tobias Macey and today I'm interviewing Kevin Liu about his use of Trino and Iceberg for Stripe's data lakehouse

INTERVIEW
* Introduction
* How did you get involved in the area of data management?
* Can you describe what role Trino and Iceberg play in Stripe's data architecture?
* What are the ways in which your job responsibilities intersect with Stripe's lakehouse infrastructure?
* What were the requirements and selection criteria that led to the selection of that combination of technologies?
* What are the other systems that feed into and rely on the Trino/Iceberg service?
* What kinds of questions are you answering with table metadata?
* What use case/team does that support?
* Comparative utility of the Iceberg REST catalog
* What are the shortcomings of Trino and Iceberg?
* What are the most interesting, innovative, or unexpected ways that you have seen Iceberg/Trino used?
* What are the most interesting, unexpected, or challenging lessons that you have learned while working on Stripe's data infrastructure?
* When is a lakehouse on Trino/Iceberg the wrong choice?
* What do you have planned for the future of Trino and Iceberg at Stripe?

CONTACT INFO
* Substack [https://kevinjqliu.substack.com]
* LinkedIn [https://www.linkedin.com/in/kevinjqliu]

PARTING QUESTION
* From your perspective, what is the biggest gap in the tooling or technology for data management today?

CLOSING ANNOUNCEMENTS
* Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ [https://www.pythonpodcast.com] covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast [https://www.themachinelearningpodcast.com] helps you go from idea to production with machine learning.
* Visit the site [https://www.dataengineeringpodcast.com] to subscribe to the show, sign up for the mailing list, and read the show notes.
* If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com [hosts@dataengineeringpodcast.com] with your story.

LINKS
* Trino [https://trino.io/]
* Iceberg [https://iceberg.apache.org/]
* Stripe [https://stripe.com/]
* Spark [https://spark.apache.org/]
* Redshift [https://aws.amazon.com/redshift/]
* Hive Metastore [https://cwiki.apache.org/confluence/display/hive/design#Design-Metastore]
* Python Iceberg [https://py.iceberg.apache.org/]
* Python Iceberg REST Catalog [https://github.com/kevinjqliu/iceberg-rest-catalog]
* Trino Metadata Table [https://trino.io/docs/current/connector/iceberg.html#metadata-tables]
* Flink [https://flink.apache.org/]
  * Podcast Episode [https://www.dataengineeringpodcast.com/apache-flink-with-fabian-hueske-episode-57]
* Tabular [https://tabular.io/]
  * Podcast Episode [https://www.dataengineeringpodcast.com/tabular-iceberg-lakehouse-tables-episode-363]
* Delta Table [https://delta.io/]
  * Podcast Episode [https://www.dataengineeringpodcast.com/delta-lake-data-lake-episode-85/]
* Databricks Unity Catalog [https://www.databricks.com/product/unity-catalog]
* Starburst [https://www.starburst.io/]
* AWS Athena [https://aws.amazon.com/athena/]
* Kevin Trinofest Presentation [https://trino.io/blog/2023/07/19/trino-fest-2023-stripe.html]
* Alluxio [https://www.alluxio.io/]
  * Podcast Episode [https://www.dataengineeringpodcast.com/alluxio-distributed-storage-episode-70]
* Parquet [https://parquet.incubator.apache.org/]
* Hudi [https://hudi.apache.org/]
* Trino Project Tardigrade [https://trino.io/blog/2022/05/05/tardigrade-launch.html]
* Trino On Ice [https://www.starburst.io/blog/iceberg-table-partitioning/]

The intro and outro music is from The Hug [http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug] by The Freak Fandango Orchestra [http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/] / CC BY-SA [http://creativecommons.org/licenses/by-sa/3.0/]

Sponsored By:
* Starburst [https://www.dataengineeringpodcast.com/starburst]: This episode is brought to you by Starburst - an end-to-end data lakehouse platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, the query engine Apache Iceberg was designed for, Starburst is an open platform with support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. Go to dataengineeringpodcast.com/starburst [https://www.dataengineeringpodcast.com/starburst]

Support Data Engineering Podcast [https://dataengineering.supercast.com/]

SUMMARY
Streaming data processing enables new categories of data products and analytics. Unfortunately, reasoning about stream processing engines is complex and lacks sufficient tooling. To address this shortcoming Datorios created an observability platform for Flink that brings visibility to the internals of this popular stream processing system. In this episode Ronen Korman and Stav Elkayam discuss how the increased understanding provided by purpose built observability improves the usefulness of Flink.

ANNOUNCEMENTS
* Hello and welcome to the Data Engineering Podcast, the show about modern data management
* This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics. There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts. Search for "Code Comments" in your podcast player or go to dataengineeringpodcast.com/codecomments [https://www.dataengineeringpodcast.com/codecomments] today to subscribe. My thanks to the team at Code Comments for their support.
* Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst [https://www.dataengineeringpodcast.com/starburst] and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
* Your host is Tobias Macey and today I'm interviewing Ronen Korman and Stav Elkayam about pulling back the curtain on your real-time data streams by bringing intuitive observability to Flink streams

INTERVIEW
* Introduction
* How did you get involved in the area of data management?
* Can you describe what Datorios is and the story behind it?
* Data observability has been gaining adoption for a number of years now, with a large focus on data warehouses. What are some of the unique challenges posed by Flink?
* How much of the complexity is due to the nature of streaming data vs. the architectural realities of Flink?
* How has the lack of visibility into the flow of data in Flink impacted the ways that teams think about where/when/how to apply it?
* How have the requirements of generative AI shifted the demand for streaming data systems?
* What role does Flink play in the architecture of generative AI systems?
* Can you describe how Datorios is implemented?
* How have the design and goals of Datorios changed since you first started working on it?
* How much of the Datorios architecture and functionality is specific to Flink and how are you thinking about its potential application to other streaming platforms?
* Can you describe how Datorios is used in a day-to-day workflow for someone building streaming applications on Flink?
* What are the most interesting, innovative, or unexpected ways that you have seen Datorios used?
* What are the most interesting, unexpected, or challenging lessons that you have learned while working on Datorios?
* When is Datorios the wrong choice?
* What do you have planned for the future of Datorios?

CONTACT INFO
* Ronen
  * LinkedIn [https://www.linkedin.com/in/ronen-korman/]
* Stav
  * LinkedIn [https://www.linkedin.com/in/stav-elkayam-118a2795/?originalSubdomain=il]

PARTING QUESTION
* From your perspective, what is the biggest gap in the tooling or technology for data management today?

CLOSING ANNOUNCEMENTS
* Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ [https://www.pythonpodcast.com] covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast [https://www.themachinelearningpodcast.com] helps you go from idea to production with machine learning.
* Visit the site [https://www.dataengineeringpodcast.com] to subscribe to the show, sign up for the mailing list, and read the show notes.
* If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com [hosts@dataengineeringpodcast.com] with your story.

LINKS
* Datorios [https://datorios.com/]
* Apache Flink [https://flink.apache.org/]
  * Podcast Episode [https://www.dataengineeringpodcast.com/apache-flink-with-fabian-hueske-episode-57]
* ChatGPT-4o [https://openai.com/index/hello-gpt-4o/]

The intro and outro music is from The Hug [http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug] by The Freak Fandango Orchestra [http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/] / CC BY-SA [http://creativecommons.org/licenses/by-sa/3.0/]

Sponsored By:
* Starburst [https://www.dataengineeringpodcast.com/starburst]: This episode is brought to you by Starburst - an end-to-end data lakehouse platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, the query engine Apache Iceberg was designed for, Starburst is an open platform with support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. Go to dataengineeringpodcast.com/starburst [https://www.dataengineeringpodcast.com/starburst]
* Red Hat Code Comments Podcast [https://link.chtbl.com/codecomments?sid=podcast.dataengineering]: Putting new technology to use is an exciting prospect. But going from purchase to production isn't always smooth, even when it's something everyone is looking forward to. Code Comments covers the bumps, the hiccups, and the setbacks teams face when adjusting to new technology, and the triumphs they pull off once they really get going. Follow Code Comments anywhere you listen to podcasts [https://link.chtbl.com/codecomments?sid=podcast.dataengineering].

Support Data Engineering Podcast [https://dataengineering.supercast.com/]

SUMMARY
Modern businesses aspire to be data driven, and technologists enjoy working through the challenge of building data systems to support that goal. Data governance is the binding force between these two parts of the organization. Nicola Askham found her way into data governance by accident, and stayed because of the benefit that she was able to provide by serving as a bridge between the technology and business. In this episode she shares the practical steps to implementing a data governance practice in your organization, and the pitfalls to avoid.

ANNOUNCEMENTS
* Hello and welcome to the Data Engineering Podcast, the show about modern data management
* Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst [https://www.dataengineeringpodcast.com/starburst] and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
* This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics. There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts. Search for "Code Comments" in your podcast player or go to dataengineeringpodcast.com/codecomments [https://www.dataengineeringpodcast.com/codecomments] today to subscribe. My thanks to the team at Code Comments for their support.
* Your host is Tobias Macey and today I'm interviewing Nicola Askham about the practical steps of building out a data governance practice in your organization

INTERVIEW
* Introduction
* How did you get involved in the area of data management?
* Can you start by giving an overview of the scope and boundaries of data governance in an organization?
* At what point does a lack of an explicit governance policy become a liability?
* What are some of the misconceptions that you encounter about data governance?
* What impact has the evolution of data technologies had on the implementation of governance practices? (e.g. number/scale of systems, types of data, AI)
* Data governance can often become an exercise in boiling the ocean. What are the concrete first steps that will increase the success rate of a governance practice?
* Once a data governance project is underway, what are some of the common roadblocks that might derail progress?
* What are the net benefits to the data team and the organization when a data governance practice is established, active, and healthy?
* What are the most interesting, innovative, or unexpected ways that you have seen data governance applied?
* What are the most interesting, unexpected, or challenging lessons that you have learned while working on data governance/training/coaching?
* What are some of the pitfalls in data governance?
* What are some of the future trends in data governance that you are excited by?
* Are there any trends that concern you?

CONTACT INFO
* Website [https://www.nicolaaskham.com/]
* LinkedIn [https://www.linkedin.com/in/nicolaaskham/]

PARTING QUESTION
* From your perspective, what is the biggest gap in the tooling or technology for data management today?

CLOSING ANNOUNCEMENTS
* Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ [https://www.pythonpodcast.com] covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast [https://www.themachinelearningpodcast.com] helps you go from idea to production with machine learning.
* Visit the site [https://www.dataengineeringpodcast.com] to subscribe to the show, sign up for the mailing list, and read the show notes.
* If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com [hosts@dataengineeringpodcast.com] with your story.

LINKS
* Website [https://www.nicolaaskham.com/]
* Master Data Management [https://en.wikipedia.org/wiki/Master_data_management]
* Cartesian Join [https://www.geeksforgeeks.org/cartesian-join/]
* DAMA == Data Management Community [https://www.dama.org/]
* DMBOK == Data Management Body of Knowledge [https://www.dama.org/cpages/body-of-knowledge]
* DAMA DMBOK Wheel [https://www.dama.org/cpages/dmbok-2-wheel-images]
* CDMP (Certified Data Management Professional) Exam [https://www.dama.org/cpages/cdmp-information]
* Data Mesh [https://www.datamesh-architecture.com/]
* Data Governance First Steps Checklist [https://www.nicolaaskham.com/free-data-governance-checklist]
* The Never Normal [https://www.linkedin.com/newsletters/the-never-normal-6862024032934477824/]

The intro and outro music is from The Hug [http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug] by The Freak Fandango Orchestra [http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/] / CC BY-SA [http://creativecommons.org/licenses/by-sa/3.0/]

Sponsored By:
* Red Hat Code Comments Podcast [https://link.chtbl.com/codecomments?sid=podcast.dataengineering]: Putting new technology to use is an exciting prospect. But going from purchase to production isn't always smooth, even when it's something everyone is looking forward to. Code Comments covers the bumps, the hiccups, and the setbacks teams face when adjusting to new technology, and the triumphs they pull off once they really get going. Follow Code Comments anywhere you listen to podcasts [https://link.chtbl.com/codecomments?sid=podcast.dataengineering].
* Starburst [https://www.dataengineeringpodcast.com/starburst]: This episode is brought to you by Starburst - an end-to-end data lakehouse platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, the query engine Apache Iceberg was designed for, Starburst is an open platform with support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. Go to dataengineeringpodcast.com/starburst [https://www.dataengineeringpodcast.com/starburst]

Support Data Engineering Podcast [https://dataengineering.supercast.com/]