Managing The DoorDash Data Platform

Data Engineering Podcast - A podcast by Tobias Macey - Sundays

Summary

The team at DoorDash has a complex set of optimization challenges to deal with using data that they collect from a multi-sided marketplace. In order to handle the volume and variety of information that they use to run and improve the business, the data team has to build a platform that analysts and data scientists can use in a self-service manner. In this episode the head of data platform for DoorDash, Sudhir Tonse, discusses the technologies that they are using, the approach that they take to adding new systems, and how they think about priorities for what to support for the whole company vs. what to leave as a specialized concern for a single team. This is a valuable look at how to manage a large and growing data platform that supports a variety of teams with varied and evolving needs.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage, and intelligent anomaly detection.
Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt, and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and Zendesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
Your host is Tobias Macey and today I’m interviewing Sudhir Tonse about how the team at DoorDash designed their data platform.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by giving a quick overview of what you do at DoorDash?
What are some of the ways that data is used to power the business?
How has the pandemic affected the scale and volatility of the data that you are working with?
Can you describe the type(s) of data that you are working with?
What are the primary sources of data that you collect?
What secondary or third party sources of information do you rely on?
Can you give an overview of the collection process for that data?
In selecting the technologies for the various components in your data stack, what are the primary factors that you consider when evaluating the build vs. buy decision?
In your recent post about how you are scaling the capabilities and capacity of your data platform you mentioned the concept of maintaining a "paved path" of supported technologies to simplify integration across teams. What are the technologies that you use and rely on for the "paved path"?
How are you managing quality and consistency of your data across its lifecycle?
What are some of the specific data quality solutions that you have integrated into the platform and "paved path"?
What are some of the technologies that were used early on at DoorDash that failed to keep up as the business scaled?
How do you manage the migration path for adopting new technologies or techniques?
In the same post you mentioned the tendency to allow for building point solutions before deciding whether to generalize a given use case into a platform capability. Can you give some examples of cases where a point solution remains a one-off versus when it needs to be expanded into a widely used component?
How do you identify and track cost factors in the data platform?
What do you do with that information?
What is your approach for identifying and measuring useful OKRs (Objectives and Key Results)?
How do you quantify potentially subjective metrics such as reliability and quality?
How have you designed the organizational structure for your data teams?
What are the responsibilities and organizational interfaces for data engineers within the company?
How have the organizational structures/patterns shifted or changed at different levels of scale/maturity for the business?
What are some of the most interesting, useful, unexpected, or challenging lessons that you have learned during your time as a data professional at DoorDash?
What are some of the upcoming projects or changes that you anticipate in the near to medium future?

Contact Info

LinkedIn
@stonse on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

How DoorDash is Scaling its Data Platform to Delight Customers and Meet our Growing Demand
DoorDash
Uber
Netscape
Netflix
Change Data Capture
Debezium (Podcast Episode)
SnowflakeDB (Podcast Episode)
Airflow (Podcast.__init__ Episode)
Kafka
Flink (Podcast Episode)
Pinot
GDPR
CCPA
Data Governance
AWS
LightGBM
XGBoost
Big Data Landscape
Kinesis
Kafka Connect
Cassandra
PostgreSQL (Podcast Episode)
Amundsen (Podcast Episode)
SQS
Feature Toggles
BigEye (Podcast Episode)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
