Free the Data Podcast
Data engineering has historically involved extracting data from disparate sources, transforming it to a standard layout, and then loading it into a new database for analytics. Usually these data engineering pipeline jobs would run on a schedule, such as nightly or weekly. In today's fast-paced, high-tech world, however, the need for data closer to real time, meaning close to when it was first generated, is higher than ever. In today's episode we hear from Dustin Vannoy, a consultant and blogger in the streaming data space, about how to use Apache Spark, one of the most popular streaming analytics platforms.

How to connect with Dustin:
- WEBSITE: https://dustinvannoy.com/
- TWITTER: https://twitter.com/dustinvannoy
- LINKEDIN: https://www.linkedin.com/in/dustinvannoy/
- YOUTUBE: @dustinvannoy https://www.youtube.com/channel/UCYdC0t9EFtyVAs0-cwqVCTw/videos/videos

Learn data skills at our academy and elevate your career.
Start for free at https://ftdacademy.com/YT

Chapters:
0:00:00 Intro
0:01:01 Dustin's Background
0:09:51 Transitioning from legacy databases to Big Data and Streaming
0:13:29 Microbatching vs Streaming
0:18:17 What is Spark and why use it?
0:22:33 Apache Spark vs Databricks
0:26:24 Pay for a hosted Spark version or roll your own?
0:28:27 Databricks setup
0:30:25 How Databricks executes queries
0:32:41 Scaling approaches to Spark
0:35:14 Connecting to external databases in Databricks
0:37:51 Visualizing data in Databricks
0:39:40 Using Spark for ETL work
0:42:50 What is real-time processing?
0:44:25 How to build a streaming job in Spark using Kafka
0:46:18 Streaming architecture overview
0:49:15 Pulling data from Kafka into Spark streaming
0:51:09 Why apps use Kafka
0:54:33 Why use Spark versus alternatives
0:57:37 What is Confluent?
0:59:38 Ways to learn Spark
1:02:04 How hard is Spark to learn?
1:04:16 Troubleshooting errors in Spark
1:07:03 How hard is it to transition to Spark from traditional databases?
1:11:51 Interviewing for a Spark job
1:15:46 Outro
10 episodes