How Shopify Is Building Their Production Data Warehouse Using DBT
Data Engineering Podcast - A podcast by Tobias Macey - Sundays
Summary

With all of the tools and services available for building a data platform, it can be difficult to separate the signal from the noise. One of the best ways to get a true understanding of how a technology works in practice is to hear from people who are running it in production. In this episode Zeeshan Qureshi and Michelle Ark share their experiences using DBT to manage the data warehouse for Shopify. They explain how they structured the project to allow multiple teams to collaborate in a scalable manner, the additional tooling that they added to address the edge cases that they have run into, and the optimizations that they baked into their continuous integration process to provide fast feedback and reduce costs. This is a great conversation about the lessons learned from real-world use of a specific technology and how well it lives up to its promises.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days.
Datafold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage, and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt, and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
Today’s episode of the Data Engineering Podcast is sponsored by Datadog, the monitoring and analytics platform for cloud-scale infrastructure and applications. Datadog’s machine-learning based alerts, customizable dashboards, and 400+ vendor-backed integrations make it easy to unify disparate data sources and pivot between correlated metrics and events for faster troubleshooting. By combining metrics, traces, and logs in one place, you can easily improve your application performance. Try Datadog free by starting your 14-day trial and receive a free t-shirt once you install the agent. Go to dataengineeringpodcast.com/datadog today to see how you can unify your monitoring.
Your host is Tobias Macey and today I’m interviewing Zeeshan Qureshi and Michelle Ark about how Shopify is building their production data warehouse platform with DBT.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what the Shopify platform is?
What kinds of data sources are you working with?
Can you share some examples of the types of analysis, decisions, and products that you are building with the data that you manage?
How have you structured your data teams to be able to deliver those projects?
What are the systems that you have in place, technological or otherwise, to allow you to support the needs of the various data professionals and business users?
What was the tipping point that led you to reconsider your system design and start down the road of architecting a data warehouse?
What were your criteria when selecting a platform for your data warehouse?
What decision did those criteria lead you to make?
Once you decided to orient a large portion of your reporting around a data warehouse, what were the biggest unknowns that you were faced with while deciding how to structure the workflows and access policies?
What were your criteria for determining what toolchain to use for managing the data warehouse?
You ultimately decided to standardize on DBT. What were the other options that you explored and what were the requirements that you had for determining the candidates?
What was your process for onboarding users into the DBT toolchain and determining how to structure the project layout?
What are some of the shortcomings or edge cases that you ran into?
Rather than rely on the vanilla DBT workflow you created a wrapper project to add additional functionality. What were some of the features that you needed to add to suit your particular needs?
What has been your experience with extending and integrating with DBT to customize it for your environment?
Can you talk through how you manage testing of your DBT pipelines and the tables that it is responsible for?
How much of the testing are you able to do with out-of-the-box functionality from DBT?
What are the additional capabilities that you have bolted on to provide a more robust and scalable means of verifying your pipeline changes?
Can you share how you manage the CI/CD process for changes in your data warehouse?
What kinds of monitoring or metrics collection do you perform on the execution of your DBT pipelines?
How do you integrate the management of your data warehouse and DBT workflows with your broader data platform?
Now that you have been using DBT in production for a while, what are the challenges that you have encountered when using it at scale?
Are there any patterns that you and your team have found useful that are worth digging into for other teams who are considering DBT or are actively using it?
What are the opportunities and available mechanisms that you have found for introducing abstraction layers to reduce the maintenance burden for your data warehouse?
What is the data modeling approach that you are using? (e.g. Data Vault, Star/Snowflake Schema, wide tables, etc.)
As you continue to work with DBT and rely on the data warehouse for production use cases, what are some of the additional features/improvements that you have planned?
What are some of the unexpected/innovative/surprising use cases that you and your team have found for the Seamster tool or the data models that it generates?
What are the cases where you think that DBT or data warehousing is the wrong answer and teams should be looking to other solutions?
What are the most interesting, unexpected, or challenging lessons that you learned while working through the process of migrating a portion of your data workloads into the data warehouse and managing them with DBT?

Contact Info

Zeeshan: @zeeshanq on Twitter, Website
Michelle: @michellearky on Twitter, LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

How to Build a Production Grade Workflow with SQL Modelling
Shopify
JRuby
PySpark
Druid
Amplitude
Mode
Snowflake Schema
Data Vault (Podcast Episode)
BigQuery
Amazon Redshift
CI/CD
Great Expectations (Podcast Episode)
Master Data Management (Podcast Episode)
Flink SQL

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA