
Education
Technology
Tobias Macey
This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Total 447 episodes
An Agile Approach To Master Data Management with Mark Marinelli - Episode 46

Summary With the proliferation of data sources to give a more comprehensive view of the information critical to your business, it is even more important to have a canonical view of the entities that you care about. Is customer number 342 in your ERP the same as Bob Smith on Twitter? Using master data management to build a data catalog helps you answer these questions reliably and simplify the process of building your business intelligence reports. In this episode the head of product at Tamr, Mark Marinelli, discusses the challenges of building a master data set, why you should have one, and some of the techniques that modern platforms and systems provide for maintaining it. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Mark Marinelli about data mastering for modern platforms Interview Introduction How did you get involved in the area of data management? Can you start by establishing a definition of data mastering that we can work from? How does the master data set get used within the overall analytical and processing systems of an organization? What is the traditional workflow for creating a master data set? What has changed in the current landscape of businesses and technology platforms that makes that approach impractical? What are the steps that an organization can take to evolve toward an agile approach to data mastering? At what scale of company or project does it make sense to start building a master data set? What are the limitations of using ML/AI to merge data sets? What are the limitations of a golden master data set in practice? Are there particular formats of data or types of entities that pose a greater challenge when creating a canonical format for them? Are there specific problem domains that are more likely to benefit from a master data set? Once a golden master has been established, how are changes to that information handled in practice? (e.g. versioning of the data) What storage mechanisms are typically used for managing a master data set? Are there particular security, auditing, or access concerns that engineers should be considering when managing their golden master that go beyond the rest of their data infrastructure?
How do you manage latency issues when trying to reference the same entities from multiple disparate systems? What have you found to be the most common stumbling blocks for a group that is implementing a master data platform? What suggestions do you have to help prevent such a project from being derailed? What resources do you recommend for someone looking to learn more about the theoretical and practical aspects of data mastering for their organization? Contact Info LinkedIn Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Tamr Multi-Dimensional Database Master Data Management ETL EDW (Enterprise Data Warehouse) Waterfall Development Method Agile Development Method DataOps Feature Engineering Tableau Qlik Data Catalog PowerBI RDBMS (Relational Database Management System) The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
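To make the matching question above concrete ("is customer 342 in the ERP the same person as Bob Smith on Twitter?"), here is a minimal, hand-rolled sketch of rule-based entity resolution across two source systems. The records, field names, and similarity threshold are all hypothetical; a platform like Tamr learns its matching rules from data and human feedback rather than hard-coding them like this.

```python
# Minimal sketch of rule-based entity matching across two source systems.
# Records, fields, and the threshold are hypothetical illustrations only.
from difflib import SequenceMatcher

erp_records = [
    {"id": 342, "name": "Robert Smith", "email": "bob.smith@example.com"},
]
crm_records = [
    {"handle": "@bobsmith", "name": "Bob Smith", "email": "bob.smith@example.com"},
]

def similarity(a: str, b: str) -> float:
    """Normalized string similarity between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_same_entity(erp: dict, crm: dict, threshold: float = 0.85) -> bool:
    # An exact email match is a strong signal; otherwise fall back to a fuzzy name match.
    if erp["email"] and erp["email"] == crm["email"]:
        return True
    return similarity(erp["name"], crm["name"]) >= threshold

for e in erp_records:
    for c in crm_records:
        if is_same_entity(e, c):
            print(f"ERP customer {e['id']} matches CRM contact {c['handle']}")
```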
47:16 · 03/09/2018
Protecting Your Data In Use At Enveil with Ellison Anne Williams - Episode 45

Summary There are myriad reasons why data should be protected, and just as many ways to enforce it in transit or at rest. Unfortunately, there is still a weak point where attackers can gain access to your unencrypted information. In this episode Ellison Anne Williams, CEO of Enveil, describes how her company uses homomorphic encryption to ensure that your analytical queries can be executed without ever having to decrypt your data. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Ellison Anne Williams about Enveil, a pioneering data security company protecting Data in Use Interview Introduction How did you get involved in the area of data security? Can you start by explaining what your mission is with Enveil and how the company got started? One of the core aspects of your platform is the principle of homomorphic encryption. Can you explain what that is and how you are using it? What are some of the challenges associated with scaling homomorphic encryption? What are some difficulties associated with working on encrypted data sets? Can you describe the underlying architecture for your data platform? How has that architecture evolved from when you first began building it? What are some use cases that are unlocked by having a fully encrypted data platform? For someone using the Enveil platform, what does their workflow look like? A major reason for never decrypting data is to protect it from attackers and unauthorized access. What are some of the remaining attack vectors? What are some aspects of the data being protected that still require additional consideration to prevent leaking information? (e.g. identifying individuals based on geographic data, or purchase patterns) What do you have planned for the future of Enveil? Contact Info LinkedIn Parting Question From your perspective, what is the biggest gap in the tooling or technology for data security today? Links Enveil NSA GDPR Intellectual Property Zero Trust Homomorphic Encryption Ciphertext Hadoop PII (Personally Identifiable Information) TLS (Transport Layer Security) Spark Elasticsearch Side-channel attacks Spectre and Meltdown The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
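As a toy illustration of the homomorphic property discussed in the episode, the snippet below uses the python-paillier library (`phe`), which supports addition of ciphertexts and multiplication by plaintext scalars. This is only a sketch of the general idea of computing on encrypted values; it is not Enveil's scheme or implementation.

```python
# Toy demonstration of additively homomorphic encryption with python-paillier.
# Computation happens on ciphertexts; only the key holder ever sees plaintext.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# An untrusted party receives only encrypted values.
enc_a = public_key.encrypt(3)
enc_b = public_key.encrypt(5)

enc_sum = enc_a + enc_b   # add two ciphertexts without decrypting them
enc_scaled = enc_a * 4    # multiply a ciphertext by a plaintext scalar

# Only the private key holder can recover the results.
print(private_key.decrypt(enc_sum))     # 8
print(private_key.decrypt(enc_scaled))  # 12
```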
24:42 · 27/08/2018
Graph Databases In Production At Scale Using DGraph with Manish Jain - Episode 44

Summary The way that you store your data can have a huge impact on the ways that it can be practically used. For a substantial number of use cases, the optimal format for storing and querying that information is as a graph; however, databases architected around that use case have historically been difficult to use at scale or for serving fast, distributed queries. In this episode Manish Jain explains how DGraph is overcoming those limitations, how the project got started, and how you can start using it today. He also discusses the various cases where a graph storage layer is beneficial, and when you would be better off using something else. In addition he talks about the challenges of building a distributed, consistent database and the tradeoffs that were made to make DGraph a reality. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. If you have ever wished that you could use the same tools for versioning and distributing your data that you use for your software then you owe it to yourself to check out what the fine folks at Quilt Data have built. Quilt is an open source platform for building a sane workflow around your data that works for your whole team, including version history, metadata management, and flexible hosting. Stop by their booth at JupyterCon in New York City on August 22nd through the 24th to say Hi and tell them that the Data Engineering Podcast sent you! After that, keep an eye on the AWS marketplace for a pre-packaged version of Quilt for Teams to deploy into your own environment and stop fighting with your data. Python has quickly become one of the most widely used languages by both data engineers and data scientists, letting everyone on your team understand each other more easily. However, it can be tough learning it when you’re just starting out. Luckily, there’s an easy way to get involved. Written by MIT lecturer Ana Bell and published by Manning Publications, Get Programming: Learn to code with Python is the perfect way to get started working with Python. Ana’s experience as a teacher of Python really shines through, as you get hands-on with the language without being drowned in confusing jargon or theory. Filled with practical examples and step-by-step lessons to take on, Get Programming is perfect for people who just want to get stuck in with Python. Get your copy of the book with a special 40% discount for Data Engineering Podcast listeners by going to dataengineeringpodcast.com/get-programming and use the discount code PodInit40! Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Manish Jain about DGraph, a low latency, high throughput, native and distributed graph database. Interview Introduction How did you get involved in the area of data management? What is DGraph and what motivated you to build it? Graph databases and graph algorithms have been part of the computing landscape for decades.
What has changed in recent years to allow for the current proliferation of graph oriented storage systems? The graph space has become crowded in recent years. How does DGraph compare to the current set of offerings? What are some of the common uses of graph storage systems? What are some potential uses that are often overlooked? There are a few ways that graph structures and properties can be implemented, including the ability to store data in the edges connecting nodes and the structures that can be contained within the nodes themselves. How is information represented in DGraph and what are the tradeoffs in the approach that you chose? How does the query interface and data storage in DGraph differ from other options? What are your opinions on the graph query languages that have been adopted by other storage systems, such as Gremlin, Cypher, and GSQL? How is DGraph architected and how has that architecture evolved from when it first started? How do you balance the speed and agility of schema on read with the additional application complexity that is required, as opposed to schema on write? In your documentation you contend that DGraph is a viable replacement for RDBMS-oriented primary storage systems. What are the switching costs for someone looking to make that transition? What are the limitations of DGraph in terms of scalability or usability? Where does it fall along the axes of the CAP theorem? For someone who is interested in building on top of DGraph and deploying it to production, what does their workflow and operational overhead look like? What have been the most challenging aspects of building and growing the DGraph project and community? What are some of the most interesting or unexpected uses of DGraph that you are aware of? When is DGraph the wrong choice? What are your plans for the future of DGraph? Contact Info @manishrjain on Twitter manishrjain on GitHub Blog Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links DGraph Badger Google Knowledge Graph Graph Theory Graph Database SQL Relational Database NoSQL OLTP (On-Line Transaction Processing) Neo4J PostgreSQL MySQL BigTable Recommendation System Fraud Detection Customer 360 Usenet Express IPFS Gremlin Cypher GSQL GraphQL MetaWeb RAFT Spanner HBase Elasticsearch Kubernetes TLS (Transport Layer Security) Jepsen Tests The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
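For readers new to graph storage, here is a small, self-contained sketch of why graph-shaped questions (such as "friend of a friend") are natural in this model. The in-memory triple store and data below are purely illustrative; DGraph answers the same shape of query over a distributed, persisted graph using its own query language rather than a Python dictionary.

```python
# Toy in-memory graph of (subject, predicate, object) triples, illustrating the
# kind of multi-hop traversal that a graph database such as DGraph serves natively.
from collections import defaultdict

triples = [
    ("alice", "follows", "bob"),
    ("bob", "follows", "carol"),
    ("bob", "follows", "dave"),
    ("carol", "follows", "erin"),
]

# Index edges by (subject, predicate) for constant-time expansion per hop.
edges = defaultdict(list)
for s, p, o in triples:
    edges[(s, p)].append(o)

def friends_of_friends(start: str) -> set:
    """Two-hop traversal: who do the people I follow, follow?"""
    first_hop = edges[(start, "follows")]
    second_hop = {o for f in first_hop for o in edges[(f, "follows")]}
    return second_hop - set(first_hop) - {start}

print(friends_of_friends("alice"))  # {'carol', 'dave'}
```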
42:40 · 20/08/2018
Putting Airflow Into Production With James Meickle - Episode 43

Summary The theory behind how a tool is supposed to work and the realities of putting it into practice are often at odds with each other. Learning the pitfalls and best practices from someone who has gained that knowledge the hard way can save you from wasted time and frustration. In this episode James Meickle discusses his recent experience building a new installation of Airflow. He points out the strengths, design flaws, and areas of improvement for the framework. He also describes the design patterns and workflows that his team has built to allow them to use Airflow as the basis of their data science platform. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing James Meickle about his experiences building a new Airflow installation Interview Introduction How did you get involved in the area of data management? What was your initial project requirement? What tooling did you consider in addition to Airflow? What aspects of the Airflow platform led you to choose it as your implementation target? Can you describe your current deployment architecture? How many engineers are involved in writing tasks for your Airflow installation? What resources were the most helpful while learning about Airflow design patterns? How have you architected your DAGs for deployment and extensibility? What kinds of tests and automation have you put in place to support the ongoing stability of your deployment? What are some of the dead-ends or other pitfalls that you encountered during the course of this project? What aspects of Airflow have you found to be lacking that you would like to see improved? What did you wish someone had told you before you started work on your Airflow installation? If you were to start over would you make the same choice? If Airflow wasn’t available what would be your second choice? What are your next steps for improvements and fixes? Contact Info @eronarn on Twitter Website eronarn on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Quantopian Harvard Brain Science Initiative DevOps Days Boston Google Maps API Cron ETL (Extract, Transform, Load) Azkaban Luigi AWS Glue Airflow Pachyderm Podcast Interview AirBnB Python YAML Ansible REST (Representational State Transfer) SAML (Security Assertion Markup Language) RBAC (Role-Based Access Control) Maxime Beauchemin Medium Blog Celery Dask Podcast Interview PostgreSQL Podcast Interview Redis Cloudformation Jupyter Notebook Qubole Astronomer Podcast Interview Gunicorn Kubernetes Airflow Improvement Proposals Python Enhancement Proposals (PEP) The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
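For context on what "writing tasks for Airflow" looks like in practice, below is a minimal DAG sketch using the Airflow 1.x API that was current at the time of this episode. The task names and callables are placeholders, not the pipelines discussed in the interview.

```python
# Minimal Airflow 1.x DAG with two dependent Python tasks (placeholder logic).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x import path

def extract(**context):
    # Pull raw records from a source system (placeholder).
    return ["record-1", "record-2"]

def load(**context):
    # Read the upstream task's output from XCom and "load" it (placeholder).
    records = context["task_instance"].xcom_pull(task_ids="extract")
    print(f"loading {len(records)} records")

default_args = {
    "owner": "data-eng",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    "start_date": datetime(2018, 1, 1),
}

dag = DAG(
    dag_id="example_pipeline",
    default_args=default_args,
    schedule_interval="@daily",
    catchup=False,
)

extract_task = PythonOperator(
    task_id="extract", python_callable=extract, provide_context=True, dag=dag
)
load_task = PythonOperator(
    task_id="load", python_callable=load, provide_context=True, dag=dag
)

extract_task >> load_task  # load runs only after extract succeeds
```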
48:06 · 13/08/2018
Taking A Tour Of PostgreSQL with Jonathan Katz - Episode 42

Summary One of the longest running and most popular open source database projects is PostgreSQL. Because of its extensibility and a community focus on stability it has stayed relevant as the ecosystem of development environments and data requirements have changed and evolved over its lifetime. It is difficult to capture any single facet of this database in a single conversation, let alone the entire surface area, but in this episode Jonathan Katz does an admirable job of it. He explains how Postgres started and how it has grown over the years, highlights the fundamental features that make it such a popular choice for application developers, and the ongoing efforts to add the complex features needed by the demanding workloads of today’s data layer. To cap it off he reviews some of the exciting features that the community is working on building into future releases. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Are you struggling to keep up with customer request and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Jonathan Katz about a high level view of PostgreSQL and the unique capabilities that it offers Interview Introduction How did you get involved in the area of data management? How did you get involved in the Postgres project? For anyone who hasn’t used it, can you describe what PostgreSQL is? Where did Postgres get started and how has it evolved over the intervening years? What are some of the primary characteristics of Postgres that would lead someone to choose it for a given project? What are some cases where Postgres is the wrong choice? What are some of the common points of confusion for new users of PostGreSQL? (particularly if they have prior database experience) The recent releases of Postgres have had some fairly substantial improvements and new features. How does the community manage to balance stability and reliability against the need to add new capabilities? What are the aspects of Postgres that allow it to remain relevant in the current landscape of rapid evolution at the data layer? Are there any plans to incorporate a distributed transaction layer into the core of the project along the lines of what has been done with Citus or CockroachDB? 
What is in store for the future of Postgres? Contact Info @jkatz05 on Twitter jkatz on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links PostgreSQL Crunchy Data Venuebook Paperless Post LAMP Stack MySQL PHP SQL ORDBMS Edgar Codd A Relational Model of Data for Large Shared Data Banks Relational Algebra Oracle DB UC Berkeley Dr. Michael Stonebraker Ingres Informix QUEL ANSI C CVS BSD License UUID JSON XML HStore PostGIS BTree Index GIN Index GIST Index KNN GIST SPGIST Full Text Search BRIN Index WAL (Write-Ahead Log) SQLite PGAdmin Vim Emacs Linux OLAP (Online Analytical Processing) Postgres IRC Postgres Slack Postgres Conferences UPSERT Postgres Roadmap CockroachDB Podcast Interview Citus Data Podcast Interview The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
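A couple of the capabilities referenced in the links above (JSONB with a GIN index, and UPSERT via ON CONFLICT) can be shown in a short sketch using psycopg2. The connection details and table names are placeholders for a local database.

```python
# Sketch of two PostgreSQL features from the show notes: a GIN index over a
# JSONB column and UPSERT semantics with ON CONFLICT. Assumes a reachable
# local database named "demo"; adjust connection parameters as needed.
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect(host="localhost", dbname="demo", user="postgres")
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events (
            id      text PRIMARY KEY,
            payload jsonb NOT NULL
        )
    """)
    # A GIN index lets containment queries on the JSONB payload use the index.
    cur.execute("CREATE INDEX IF NOT EXISTS events_payload_gin ON events USING GIN (payload)")

    # UPSERT: insert the row, or update the payload if the id already exists.
    cur.execute(
        """
        INSERT INTO events (id, payload) VALUES (%s, %s)
        ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload
        """,
        ("evt-1", Json({"type": "signup", "plan": "pro"})),
    )

    cur.execute("""SELECT id FROM events WHERE payload @> '{"type": "signup"}'""")
    print(cur.fetchall())
```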
56:22 · 06/08/2018
Mobile Data Collection And Analysis Using Ona And Canopy With Peter Lubell-Doughtie - Episode 41

Summary With the attention being paid to the systems that power large volumes of high velocity data it is easy to forget about the value of data collection at human scales. Ona is a company that is building technologies to support mobile data collection, analysis of the aggregated information, and user-friendly presentations. In this episode CTO Peter Lubell-Doughtie describes the architecture of the platform, the types of environments and use cases where it is being employed, and the value of small data. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Are you struggling to keep up with customer request and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Peter Lubell-Doughtie about using Ona for collecting data and processing it with Canopy Interview Introduction How did you get involved in the area of data management? What is Ona and how did the company get started? What are some examples of the types of customers that you work with? What types of data do you support in your collection platform? What are some of the mechanisms that you use to ensure the accuracy of the data that is being collected by users? Does your mobile collection platform allow for anyone to submit data without having to be associated with a given account or organization? What are some of the integration challenges that are unique to the types of data that get collected by mobile field workers? Can you describe the flow of the data from collection through to analysis? To help improve the utility of the data being collected you have started building Canopy. What was the tipping point where it became worth the time and effort to start that project? What are the architectural considerations that you factored in when designing it? What have you found to be the most challenging or unexpected aspects of building an enterprise data warehouse for general users? What are your plans for the future of Ona and Canopy? Contact Info Email pld on Github Website Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? 
Links OpenSRP Ona Canopy Open Data Kit Earth Institute at Columbia University Sustainable Engineering Lab WHO Bill and Melinda Gates Foundation XLSForms PostGIS Kafka Druid Superset Postgres Ansible Docker Terraform The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
29:14 · 30/07/2018
Ceph: A Reliable And Scalable Distributed Filesystem with Sage Weil - Episode 40

Summary When working with large volumes of data that you need to access in parallel across multiple instances you need a distributed filesystem that will scale with your workload. Even better is when that same system provides multiple paradigms for interacting with the underlying storage. Ceph is a highly available, highly scalable, and performant system that has support for object storage, block storage, and native filesystem access. In this episode Sage Weil, the creator and lead maintainer of the project, discusses how it got started, how it works, and how you can start using it on your infrastructure today. He also explains where it fits in the current landscape of distributed storage and the plans for future improvements. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Are you struggling to keep up with customer request and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Sage Weil about Ceph, an open source distributed file system that supports block storage, object storage, and a file system interface. Interview Introduction How did you get involved in the area of data management? Can you start with an overview of what Ceph is? What was the motivation for starting the project? What are some of the most common use cases for Ceph? There are a large variety of distributed file systems. How would you characterize Ceph as it compares to other options (e.g. HDFS, GlusterFS, LionFS, SeaweedFS, etc.)? Given that there is no single point of failure, what mechanisms do you use to mitigate the impact of network partitions? What mechanisms are available to ensure data integrity across the cluster? How is Ceph implemented and how has the design evolved over time? What is required to deploy and manage a Ceph cluster? What are the scaling factors for a cluster? What are the limitations? How does Ceph handle mixed write workloads with either a high volume of small files or a smaller volume of larger files? In services such as S3 the data is segregated from block storage options like EBS or EFS. Since Ceph provides all of those interfaces in one project is it possible to use each of those interfaces to the same data objects in a Ceph cluster? 
In what situations would you advise someone against using Ceph? What are some of the most interesting, unexpected, or challenging aspects of working with Ceph and the community? What are some of the plans that you have for the future of Ceph? Contact Info Email @liewegas on Twitter liewegas on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Ceph Red Hat DreamHost UC Santa Cruz Los Alamos National Labs Dream Objects OpenStack Proxmox POSIX GlusterFS Hadoop Ceph Architecture Paxos relatime Prometheus Zabbix Kubernetes NVMe DNS-SD Consul EtcD DNS SRV Record Zeroconf Bluestore XFS Erasure Coding NFS Seastar Rook The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
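Since Ceph exposes an S3-compatible object interface through the RADOS Gateway, a standard S3 client such as boto3 can talk to it. The endpoint URL, credentials, and bucket name below are placeholders for a cluster that has the gateway enabled.

```python
# Sketch of using Ceph's object storage interface (RADOS Gateway) via boto3.
# Endpoint and credentials are placeholders, not a real deployment.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://ceph-rgw.example.com:7480",  # RADOS Gateway endpoint
    aws_access_key_id="CEPH_ACCESS_KEY",
    aws_secret_access_key="CEPH_SECRET_KEY",
)

s3.create_bucket(Bucket="analytics-raw")
s3.put_object(
    Bucket="analytics-raw",
    Key="events/2018-07-16.json",
    Body=b'{"ok": true}',
)

obj = s3.get_object(Bucket="analytics-raw", Key="events/2018-07-16.json")
print(obj["Body"].read())
```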
48:31 · 16/07/2018
Building Data Flows In Apache NiFi With Kevin Doran and Andy LoPresto - Episode 39

Summary Data integration and routing is a constantly evolving problem and one that is fraught with edge cases and complicated requirements. The Apache NiFi project models this problem as a collection of data flows that are created through a self-service graphical interface. This framework provides a flexible platform for building a wide variety of integrations that can be managed and scaled easily to fit your particular needs. In this episode project members Kevin Doran and Andy LoPresto discuss the ways that NiFi can be used, how to start using it in your environment, and plans for future development. They also explained how it fits in the broad landscape of data tools, the interesting and challenging aspects of the project, and how to build new extensions. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Are you struggling to keep up with customer request and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Kevin Doran and Andy LoPresto about Apache NiFi Interview Introduction How did you get involved in the area of data management? Can you start by explaining what NiFi is? What is the motivation for building a GUI as the primary interface for the tool when the current trend is to represent everything as code? How did you get involved with the project? Where does it sit in the broader landscape of data tools? Does the data that is processed by NiFi flow through the servers that it is running on (á la Spark/Flink/Kafka), or does it orchestrate actions on other systems (á la Airflow/Oozie)? How do you manage versioning and backup of data flows, as well as promoting them between environments? One of the advertised features is tracking provenance for data flows that are managed by NiFi. How is that data collected and managed? What types of reporting are available across this information? What are some of the use cases or requirements that lend themselves well to being solved by NiFi? When is NiFi the wrong choice? What is involved in deploying and scaling a NiFi installation? What are some of the system/network parameters that should be considered? What are the scaling limitations? What have you found to be some of the most interesting, unexpected, and/or challenging aspects of building and maintaining the NiFi project and community? 
What do you have planned for the future of NiFi? Contact Info Kevin Doran @kevdoran on Twitter Email Andy LoPresto @yolopey on Twitter Email Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links NiFi HortonWorks DataFlow HortonWorks Apache Software Foundation Apple CSV XML JSON Perl Python Internet Scale Asset Management Documentum DataFlow NSA (National Security Agency) 24 (TV Show) Technology Transfer Program Agile Software Development Waterfall Spark Flink Kafka Oozie Luigi Airflow FluentD ETL (Extract, Transform, and Load) ESB (Enterprise Service Bus) MiNiFi Java C++ Provenance Kubernetes Apache Atlas Data Governance Kibana K-Nearest Neighbors DevOps DSL (Domain Specific Language) NiFi Registry Artifact Repository Nexus NiFi CLI Maven Archetype IoT Docker Backpressure NiFi Wiki TLS (Transport Layer Security) Mozilla TLS Observatory NiFi Flow Design System Data Lineage GDPR (General Data Protection Regulation) The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
01:04:16 · 08/07/2018
Leveraging Human Intelligence For Better AI At Alegion With Cheryl Martin - Episode 38

Summary Data is often messy or incomplete, requiring human intervention to make sense of it before being usable as input to machine learning projects. This is problematic when the volume scales beyond a handful of records. In this episode Dr. Cheryl Martin, Chief Data Scientist for Alegion, discusses the importance of properly labeled information for machine learning and artificial intelligence projects, the systems that they have built to scale the process of incorporating human intelligence in the data preparation process, and the challenges inherent to such an endeavor. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Are you struggling to keep up with customer request and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Cheryl Martin, chief data scientist at Alegion, about data labelling at scale Interview Introduction How did you get involved in the area of data management? To start, can you explain the problem space that Alegion is targeting and how you operate? When is it necessary to include human intelligence as part of the data lifecycle for ML/AI projects? What are some of the biggest challenges associated with managing human input to data sets intended for machine usage? For someone who is acting as human-intelligence provider as part of the workforce, what does their workflow look like? What tools and processes do you have in place to ensure the accuracy of their inputs? How do you prevent bad actors from contributing data that would compromise the trained model? What are the limitations of crowd-sourced data labels? When is it beneficial to incorporate domain experts in the process? When doing data collection from various sources, how do you ensure that intellectual property rights are respected? How do you determine the taxonomies to be used for structuring data sets that are collected, labeled or enriched for your customers? What kinds of metadata do you track and how is that recorded/transmitted? Do you think that human intelligence will be a necessary piece of ML/AI forever? Contact Info LinkedIn Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? 
Links Alegion University of Texas at Austin Cognitive Science Labeled Data Mechanical Turk Computer Vision Sentiment Analysis Speech Recognition Taxonomy Feature Engineering The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
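One simple way to reason about the accuracy questions raised in the interview is label aggregation across multiple annotators. The sketch below uses plain majority voting with an agreement threshold; the data is made up, and a real labeling platform like Alegion applies considerably more sophisticated quality controls.

```python
# Majority-vote aggregation of crowd-sourced labels, with low-consensus items
# flagged for expert review instead of being auto-accepted. Data is hypothetical.
from collections import Counter

annotations = {
    "img-001": ["cat", "cat", "dog"],
    "img-002": ["dog", "dog", "dog"],
}

def aggregate(labels, min_agreement=0.66):
    """Return (label, agreement) if consensus is strong enough, else (None, agreement)."""
    (label, count), = Counter(labels).most_common(1)
    agreement = count / len(labels)
    return (label, agreement) if agreement >= min_agreement else (None, agreement)

for item, labels in annotations.items():
    label, agreement = aggregate(labels)
    status = label if label else "needs expert review"
    print(f"{item}: {status} (agreement {agreement:.0%})")
```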
46:14 · 02/07/2018
Package Management And Distribution For Your Data Using Quilt with Kevin Moore - Episode 37

Summary Collaboration, distribution, and installation of software projects is largely a solved problem, but the same cannot be said of data. Every data team has a bespoke means of sharing data sets, versioning them, tracking related metadata and changes, and publishing them for use in the software systems that rely on them. The CEO and founder of Quilt Data, Kevin Moore, was sufficiently frustrated by this problem to create a platform that attempts to be the means by which data can be as collaborative and easy to work with as GitHub and your favorite programming language. In this episode he explains how the project came to be, how it works, and the many ways that you can start using it today. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Are you struggling to keep up with customer request and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Kevin Moore about Quilt Data, a platform and tooling for packaging, distributing, and versioning data Interview Introduction How did you get involved in the area of data management? What is the intended use case for Quilt and how did the project get started? Can you step through a typical workflow of someone using Quilt? How does that change as you go from a single user to a team of data engineers and data scientists? Can you describe the elements of what a data package consists of? What was your criteria for the file formats that you chose? How is Quilt architected and what have been the most significant changes or evolutions since you first started? How is the data registry implemented? What are the limitations or edge cases that you have run into? What optimizations have you made to accelerate synchronization of the data to and from the repository? What are the limitations in terms of data volume, format, or usage? What is your goal with the business that you have built around the project? What are your plans for the future of Quilt? Contact Info Email LinkedIn Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? 
Links Quilt Data GitHub Jobs Reproducible Data Dependencies in Jupyter Reproducible Machine Learning with Jupyter and Quilt Allen Institute: Programmatic Data Access with Quilt Quilt Example: MissingNo Oracle Pandas Jupyter Ycombinator Data.World Podcast Episode with CTO Bryon Jacob Kaggle Parquet HDF5 Arrow PySpark Excel Scala Binder Merkle Tree Allen Institute for Cell Science Flask PostGreSQL Docker Airflow Quilt Teams Hive Hive Metastore PrestoDB Podcast Episode Netflix Iceberg Kubernetes Helm The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
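The links mention Merkle trees, which hints at how data packages can be versioned by content rather than by name. Below is a generic sketch of building a content-addressed manifest for a directory of files; it illustrates the idea only and is not Quilt's actual package format.

```python
# Sketch of content-addressed data packaging: hash every file, then hash the
# manifest itself so any change to any file yields a new package version.
# The ./data directory is a placeholder for your own files.
import hashlib
import json
from pathlib import Path

def file_digest(path: Path) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(data_dir: str) -> dict:
    # One entry per file; the top hash covers the sorted entries, so it acts
    # as a verifiable version identifier for the whole package.
    entries = {
        str(p.relative_to(data_dir)): file_digest(p)
        for p in sorted(Path(data_dir).rglob("*"))
        if p.is_file()
    }
    top_hash = hashlib.sha256(json.dumps(entries, sort_keys=True).encode()).hexdigest()
    return {"top_hash": top_hash, "entries": entries}

print(json.dumps(build_manifest("./data"), indent=2))
```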
41:43 · 25/06/2018
User Analytics In Depth At Heap with Dan Robinson - Episode 36

Summary Web and mobile analytics are an important part of any business, and difficult to get right. The most frustrating part is when you realize that you haven’t been tracking a key interaction, having to write custom logic to add that event, and then waiting to collect data. Heap is a platform that automatically tracks every event so that you can retroactively decide which actions are important to your business and easily build reports with or without SQL. In this episode Dan Robinson, CTO of Heap, describes how they have architected their data infrastructure, how they build their tracking agents, and the data virtualization layer that enables users to define their own labels. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Dan Robinson about Heap and their approach to collecting, storing, and analyzing large volumes of data Interview Introduction How did you get involved in the area of data management? Can you start by giving a brief overview of Heap? One of your differentiating features is the fact that you capture every interaction on web and mobile platforms for your customers. How do you prevent the user experience from suffering as a result of network congestion, while ensuring the reliable delivery of that data? Can you walk through the lifecycle of a single event from source to destination and the infrastructure components that it traverses to get there? Data collected in a user’s browser can often be messy due to various browser plugins, variations in runtime capabilities, etc. How do you ensure the integrity and accuracy of that information? What are some of the difficulties that you have faced in establishing a representation of events that allows for uniform processing and storage? What is your approach for merging and enriching event data with the information that you retrieve from your supported integrations? What challenges does that pose in your processing architecture? What are some of the problems that you have had to deal with to allow for processing and storing such large volumes of data? How has that architecture changed or evolved over the life of the company? What are some changes that you are anticipating in the near future? Can you describe your approach for synchronizing customer data with their individual Redshift instances and the difficulties that entails? 
What are some of the most interesting challenges that you have faced while building the technical and business aspects of Heap? What changes have been necessary as a result of GDPR? What are your plans for the future of Heap? Contact Info @danlovesproofs on twitter [email protected] @drob on github heapanalytics.com / @heap on twitter https://heapanalytics.com/blog/category/engineering?utm_source=rss&utm_medium=rss Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Heap Palantir User Analytics Google Analytics Piwik Mixpanel Hubspot Jepsen Chaos Engineering Node.js Kafka Scala Citus React MobX Redshift Heap SQL BigQuery Webhooks Drip Data Virtualization DNS PII SOC2 The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
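The core idea of capturing every interaction and defining the events you care about after the fact can be illustrated in a few lines of Python. The event fields below are invented for the example and are not Heap's actual schema or API.

```python
# "Capture everything, define later": raw autocaptured interactions are stored
# untyped, and a named event is just a filter applied retroactively to history.
raw_events = [
    {"type": "click", "css_selector": "#signup-button", "path": "/pricing", "user": 1},
    {"type": "click", "css_selector": ".nav-link", "path": "/docs", "user": 2},
    {"type": "pageview", "path": "/pricing", "user": 3},
]

def is_signup_click(event: dict) -> bool:
    """A 'virtual event' defined weeks after the raw data was collected."""
    return event["type"] == "click" and event.get("css_selector") == "#signup-button"

matching = [e for e in raw_events if is_signup_click(e)]
print(f"{len(matching)} 'Clicked Signup' events, counted retroactively")
```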
45:27 · 17/06/2018
CockroachDB In Depth with Peter Mattis - Episode 35

Summary With the increased ease of gaining access to servers in data centers across the world has come the need for supporting globally distributed data storage. With the first wave of cloud era databases the ability to replicate information geographically came at the expense of transactions and familiar query languages. To address these shortcomings the engineers at Cockroach Labs have built a globally distributed SQL database with full ACID semantics in Cockroach DB. In this episode Peter Mattis, the co-founder and VP of Engineering at Cockroach Labs, describes the architecture that underlies the database, the challenges they have faced along the way, and the ways that you can use it in your own environments today. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Peter Mattis about CockroachDB, the SQL database for global cloud services Interview Introduction How did you get involved in the area of data management? What was the motivation for creating CockroachDB and building a business around it? Can you describe the architecture of CockroachDB and how it supports distributed ACID transactions? What are some of the tradeoffs that are necessary to allow for georeplicated data with distributed transactions? What are some of the problems that you have had to work around in the RAFT protocol to provide reliable operation of the clustering mechanism? Go is an unconventional language for building a database. What are the pros and cons of that choice? What are some of the common points of confusion that users of CockroachDB have when operating or interacting with it? What are the edge cases and failure modes that users should be aware of? I know that your SQL syntax is PostGreSQL compatible, so is it possible to use existing ORMs unmodified with CockroachDB? What are some examples of extensions that are specific to CockroachDB? What are some of the most interesting uses of CockroachDB that you have seen? When is CockroachDB the wrong choice? What do you have planned for the future of CockroachDB? Contact Info Peter LinkedIn petermattis on GitHub @petermattis on Twitter Cockroach Labs @CockroackDB on Twitter Website cockroachdb on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? 
Links CockroachDB Cockroach Labs SQL Google Bigtable Spanner NoSQL RDBMS (Relational Database Management System) “Big Iron” (colloquial term for mainframe computers) RAFT Consensus Algorithm Consensus MVCC (Multiversion Concurrency Control) Isolation Etcd GDPR Golang C++ Garbage Collection Metaprogramming Rust Static Linking Docker Kubernetes CAP Theorem PostGreSQL ORM (Object Relational Mapping) Information Schema PG Catalog Interleaved Tables Vertica Spark Change Data Capture The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
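Because CockroachDB speaks the PostgreSQL wire protocol, existing drivers such as psycopg2 can connect to it unmodified, and its UPSERT statement is one example of a CockroachDB-specific SQL extension. The host, port, and database below are placeholders for a local insecure test cluster.

```python
# Connecting to CockroachDB with a standard PostgreSQL driver (psycopg2).
# 26257 is CockroachDB's default SQL port; credentials are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=26257, user="root", dbname="bank", sslmode="disable"
)
conn.set_session(autocommit=True)

with conn.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS accounts (id INT PRIMARY KEY, balance DECIMAL)")
    # UPSERT is a CockroachDB SQL extension: insert or overwrite by primary key.
    cur.execute("UPSERT INTO accounts (id, balance) VALUES (1, 100.00), (2, 250.00)")
    cur.execute("SELECT id, balance FROM accounts ORDER BY id")
    for row in cur.fetchall():
        print(row)
```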
43:41 · 11/06/2018
ArangoDB: Fast, Scalable, and Multi-Model Data Storage with Jan Steemann and Jan Stücke - Episode 34

Summary Using a multi-model database in your applications can greatly reduce the amount of infrastructure and complexity required. ArangoDB is a storage engine that supports documents, key/value, and graph data formats, as well as being fast and scalable. In this episode Jan Steemann and Jan Stücke explain where Arango fits in the crowded database market, how it works under the hood, and how you can start working with it today. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Jan Stücke and Jan Steemann about ArangoDB, a multi-model distributed database for graph, document, and key/value storage. Interview Introduction How did you get involved in the area of data management? Can you give a high level description of what ArangoDB is and the motivation for creating it? What is the story behind the name? How is ArangoDB constructed? How does the underlying engine store the data to allow for the different ways of viewing it? What are some of the benefits of multi-model data storage? When does it become problematic? For users who are accustomed to a relational engine, how do they need to adjust their approach to data modeling when working with Arango? How does it compare to OrientDB? What are the options for scaling a running system? What are the limitations in terms of network architecture or data volumes? One of the unique aspects of ArangoDB is the Foxx framework for embedding microservices in the data layer. What benefits does that provide over a three tier architecture? What mechanisms do you have in place to prevent data breaches from security vulnerabilities in the Foxx code? What are some of the most interesting or surprising uses of this functionality that you have seen? What are some of the most challenging technical and business aspects of building and promoting ArangoDB? What do you have planned for the future of ArangoDB? Contact Info Jan Steemann jsteemann on GitHub @steemann on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links ArangoDB Köln Multi-model Database Graph Algorithms Apache 2 C++ ArangoDB Foxx Raft Protocol Target Partners RocksDB AQL (ArangoDB Query Language) OrientDB PostGreSQL OrientDB Studio Google Spanner 3-Tier Architecture Thomson-Reuters Arango Search Dell EMC Google S2 Index ArangoDB Geographic Functionality JSON Schema The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
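To give a flavor of the multi-model idea, the sketch below sends one AQL query that filters documents and walks graph edges in the same statement, using ArangoDB's HTTP cursor endpoint via the requests library. The server address, credentials, database, and collection names are placeholders, not a real deployment.

```python
# Minimal sketch of running an AQL query through ArangoDB's HTTP cursor API.
# Connection details and the users/follows collections are placeholders.
import requests

AQL = """
FOR user IN users
  FILTER user.active == true
  FOR friend IN 1..1 OUTBOUND user follows   // graph traversal in the same query
    RETURN {user: user.name, friend: friend.name}
"""

resp = requests.post(
    "http://localhost:8529/_db/_system/_api/cursor",
    json={"query": AQL},
    auth=("root", "password"),
)
resp.raise_for_status()
print(resp.json()["result"])
```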
40:05 · 04/06/2018
The Alooma Data Pipeline With CTO Yair Weinberger - Episode 33

Summary Building an ETL pipeline is a common need across businesses and industries. It’s easy to get one started but difficult to manage as new requirements are added and greater scalability becomes necessary. Rather than duplicating the efforts of other engineers it might be best to use a hosted service to handle the plumbing so that you can focus on the parts that actually matter for your business. In this episode CTO and co-founder of Alooma, Yair Weinberger, explains how the platform addresses the common needs of data collection, manipulation, and storage while allowing for flexible processing. He describes the motivation for starting the company, how their infrastructure is architected, and the challenges of supporting multi-tenancy and a wide variety of integrations. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Yair Weinberger about Alooma, a company providing data pipelines as a service Interview Introduction How did you get involved in the area of data management? What is Alooma and what is the origin story? How is the Alooma platform architected? I want to go into stream VS batch here What are the most challenging components to scale? How do you manage the underlying infrastructure to support your SLA of 5 nines? What are some of the complexities introduced by processing data from multiple customers with various compliance requirements? How do you sandbox users’ processing code to avoid security exploits? What are some of the potential pitfalls for automatic schema management in the target database? Given the large number of integrations, how do you maintain them all? What are some challenges when creating integrations? Isn’t it simply a matter of conforming to an external API? For someone getting started with Alooma what does the workflow look like? What are some of the most challenging aspects of building and maintaining Alooma? What are your plans for the future of Alooma? Contact Info LinkedIn @yairwein on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links Alooma Convert Media Data Integration ESB (Enterprise Service Bus) Tibco Mulesoft ETL (Extract, Transform, Load) Informatica Microsoft SSIS OLAP Cube S3 Azure Cloud Storage Snowflake DB Redshift BigQuery Salesforce Hubspot Zendesk Spark The Log: What every software engineer should know about real-time data’s unifying abstraction by Jay Kreps RDBMS (Relational Database Management System) SaaS (Software as a Service) Change Data Capture Kafka Storm Google Cloud PubSub Amazon Kinesis Alooma Code Engine Zookeeper Idempotence Kafka Streams Kubernetes SOC2 Jython Docker Python Javascript Ruby Scala PII (Personally Identifiable Information) GDPR (General Data Protection Regulation) Amazon EMR (Elastic Map Reduce) Sequoia Capital Lightspeed Investors Redis Aerospike Cassandra MongoDB The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
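As an illustration of the per-event processing discussed above (the Alooma Code Engine link in this list), here is a small sketch of a transform hook that normalizes events before they are loaded. The transform(event) entry point and all field names are assumptions made for illustration, not Alooma's documented contract.

```python
# Illustrative sketch of a per-event transform of the kind a hosted pipeline's
# "code engine" might run; the transform(event) signature and field names are
# assumptions for illustration only.
from datetime import datetime, timezone

def transform(event):
    """Normalize one incoming event before it is loaded to the warehouse."""
    # Drop obvious test traffic instead of loading it.
    if event.get("user_email", "").endswith("@example.com"):
        return None  # returning None drops the event

    # Rename an inconsistent field coming from one of many integrations.
    if "ts" in event and "timestamp" not in event:
        event["timestamp"] = event.pop("ts")

    # Stamp when the pipeline saw the event, useful for debugging lag.
    event["_ingested_at"] = datetime.now(timezone.utc).isoformat()
    return event
```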
47:50 · 28/05/2018
PrestoDB and Starburst Data with Kamil Bajda-Pawlikowski - Episode 32

Summary Most businesses end up with data in a myriad of places with varying levels of structure. This makes it difficult to gain insights from across departments, projects, or people. Presto is a distributed SQL engine that allows you to tie all of your information together without having to first aggregate it all into a data warehouse. Kamil Bajda-Pawlikowski co-founded Starburst Data to provide support and tooling for Presto, as well as contributing advanced features back to the project. In this episode he describes how Presto is architected, how you can use it for your analytics, and the work that he is doing at Starburst Data. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Kamil Bajda-Pawlikowski about Presto and his experiences with supporting it at Starburst Data Interview Introduction How did you get involved in the area of data management? Can you start by explaining what Presto is? What are some of the common use cases and deployment patterns for Presto? How does Presto compare to Drill or Impala? What is it about Presto that led you to building a business around it? What are some of the most challenging aspects of running and scaling Presto? For someone who is using the Presto SQL interface, what are some of the considerations that they should keep in mind to avoid writing poorly performing queries? How does Presto represent data for translating between its SQL dialect and the API of the data stores that it interfaces with? What are some cases in which Presto is not the right solution? What types of support have you found to be the most commonly requested? What are some of the types of tooling or improvements that you have made to Presto in your distribution? What are some of the notable changes that your team has contributed upstream to Presto? Contact Info Website E-mail Twitter – @starburstdata Twitter – @prestodb Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Starburst Data Presto Hadapt Hadoop Hive Teradata PrestoCare Cost Based Optimizer ANSI SQL Spill To Disk Tempto Benchto Geospatial Functions Cassandra Accumulo Kafka Redis PostGreSQL The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
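To show the federation idea in the summary, here is a minimal sketch of one Presto query joining tables from two catalogs without first copying anything into a warehouse. It assumes a coordinator on localhost:8080 with "hive" and "postgresql" catalogs configured and the presto-python-client package; all table and column names are hypothetical.

```python
# Minimal sketch of Presto federation: one SQL statement joins a Hive table with
# a PostgreSQL table by addressing them as catalog.schema.table.
# Assumes a local coordinator and the presto-python-client package; names are made up.
import prestodb

conn = prestodb.dbapi.connect(
    host="localhost", port=8080, user="analyst", catalog="hive", schema="default"
)
cur = conn.cursor()
cur.execute("""
    SELECT c.region, count(*) AS views
    FROM hive.web.page_views v
    JOIN postgresql.public.customers c ON v.customer_id = c.id
    GROUP BY c.region
    ORDER BY views DESC
""")
for region, views in cur.fetchall():
    print(region, views)
```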
42:08 · 21/05/2018
Brief Conversations From The Open Data Science Conference: Part 2 - Episode 31

Summary The Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up you’ll hear from Andy Eschbacher of Carto. He describes some of the complexities inherent to working with geospatial data, how they are handling it, and some of the interesting use cases that they enable for their customers. Next is Todd Blaschka, COO of TigerGraph. He explains how graph databases differ from relational engines, where graph algorithms are useful, and how TigerGraph is built to allow for fast and scalable operation. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Your host is Tobias Macey and last week I attended the Open Data Science Conference in Boston and recorded a few brief interviews on-site. In this second part you will hear from Andy Eschbacher of Carto about the challenges of managing geospatial data, as well as Todd Blaschka of TigerGraph about graph databases and how his company has managed to build a fast and scalable platform for graph storage and traversal. Interview Andy Eschbacher From Carto What are the challenges associated with storing geospatial data? What are some of the common misconceptions that people have about working with geospatial data? Contact Info andy-esch on GitHub @MrEPhysics on Twitter Website Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Carto Geospatial Analysis GeoJSON Todd Blaschka From TigerGraph What are graph databases and how do they differ from relational engines? What are some of the common difficulties that people have when dealing with graph algorithms? How does data modeling for graph databases differ from relational stores? Contact Info LinkedIn @toddblaschka on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links TigerGraph Graph Databases The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
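As a small, generic illustration of the geospatial complexities mentioned above (not anything from Carto's stack), the sketch below builds a GeoJSON point and highlights a classic gotcha: GeoJSON stores coordinates as [longitude, latitude], the reverse of the "lat, lon" order most people write by hand. The coordinates and names are placeholders.

```python
# Generic sketch of a GeoJSON point feature and the lon/lat ordering gotcha.
import json

boston = {
    "type": "Feature",
    "properties": {"name": "Boston"},
    "geometry": {
        "type": "Point",
        # GeoJSON order is [lon, lat]; swapping them puts the point far from Boston.
        "coordinates": [-71.0589, 42.3601],
    },
}

def lon_lat(feature):
    """Return (lon, lat) from a GeoJSON point feature."""
    return tuple(feature["geometry"]["coordinates"])

print(json.dumps(boston))
print(lon_lat(boston))  # (-71.0589, 42.3601)
```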
26:06 · 14/05/2018
Brief Conversations From The Open Data Science Conference: Part 1 - Episode 30

Summary The Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up you’ll hear from Alan Anders, the CTO of Applecart about their challenges with getting Spark to scale for constructing an entity graph from multiple data sources. Next I spoke with Stepan Pushkarev, the CEO, CTO, and Co-Founder of Hydrosphere.io about the challenges of running machine learning models in production and how his team tracks key metrics and samples production data to re-train and re-deploy those models for better accuracy and more robust operation. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey and this week I attended the Open Data Science Conference in Boston and recorded a few brief interviews on-site. First up you’ll hear from Alan Anders, the CTO of Applecart about their challenges with getting Spark to scale for constructing an entity graph from multiple data sources. Next I spoke with Stepan Pushkarev, the CEO, CTO, and Co-Founder of Hydrosphere.io about the challenges of running machine learning models in production and how his team tracks key metrics and samples production data to re-train and re-deploy those models for better accuracy and more robust operation. Interview Alan Anders from Applecart What are the challenges of gathering and processing data from multiple data sources and representing them in a unified manner for merging into single entities? What are the biggest technical hurdles at Applecart? Contact Info @alanjanders on Twitter LinkedIn Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Spark DataBricks DataBricks Delta Applecart Stepan Pushkarev from Hydrosphere.io What is Hydrosphere.io? What metrics do you track to determine when a machine learning model is not producing an appropriate output? How do you determine which data points to sample for retraining the model? How does the role of a machine learning engineer differ from data engineers and data scientists? Contact Info LinkedIn Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Hydrosphere Machine Learning Engineer The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
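To ground the "merging multiple sources into single entities" question above, here is a generic PySpark sketch (not Applecart's pipeline) that joins two sources on a shared identifier to produce one unified record per entity. It assumes a local PySpark installation; the data and column names are made up.

```python
# Generic sketch of merging two data sources into unified entities with Spark:
# records that share a key are joined into one row. Data is made up.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("entity-merge-sketch").getOrCreate()

crm = spark.createDataFrame(
    [("a1", "Bob Smith", "bob@example.com")], ["entity_id", "name", "email"]
)
social = spark.createDataFrame(
    [("a1", "@bobsmith"), ("a2", "@jdoe")], ["entity_id", "twitter_handle"]
)

entities = (
    crm.join(social, "entity_id", "full_outer")
       .withColumn("sources", F.lit("crm+social"))
)
entities.show(truncate=False)
spark.stop()
```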
32:39 · 07/05/2018
Metabase Self Service Business Intelligence with Sameer Al-Sakran - Episode 29

Summary Business Intelligence software is often cumbersome and requires specialized knowledge of the tools and data to be able to ask and answer questions about the state of the organization. Metabase is a tool built with the goal of making the act of discovering information and asking questions of an organization’s data easy and self-service for non-technical users. In this episode the CEO of Metabase, Sameer Al-Sakran, discusses how and why the project got started, the ways that it can be used to build and share useful reports, some of the useful features planned for future releases, and how to get it set up to start using it in your environment. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Sameer Al-Sakran about Metabase, a free and open source tool for self service business intelligence Interview Introduction How did you get involved in the area of data management? The current goal for most companies is to be “data driven”. How would you define that concept? How does Metabase assist in that endeavor? What is the ratio of users that take advantage of the GUI query builder as opposed to writing raw SQL? What level of complexity is possible with the query builder? What have you found to be the typical use cases for Metabase in the context of an organization? How do you manage scaling for large or complex queries? What was the motivation for using Clojure as the language for implementing Metabase? What is involved in adding support for a new data source? What are the differentiating features of Metabase that would lead someone to choose it for their organization? What have been the most challenging aspects of building and growing Metabase, both from a technical and business perspective? What do you have planned for the future of Metabase? Contact Info Sameer salsakran on GitHub @sameer_alsakran on Twitter LinkedIn Metabase Website @metabase on Twitter metabase on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Expa Metabase Blackjet Hadoop Imeem Maslow’s Hierarchy of Data Needs 2 Sided Marketplace Honeycomb Interview Excel Tableau Go-JEK Clojure React Python Scala JVM Redash How To Lie With Data Stripe Braintree Payments The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
44:46 · 30/04/2018
Octopai: Metadata Management for Better Business Intelligence with Amnon Drori - Episode 28

Summary The information about how data is acquired and processed is often as important as the data itself. For this reason metadata management systems are built to track the journey of your business data to aid in analysis, presentation, and compliance. These systems are frequently cumbersome and difficult to maintain, so Octopai was founded to alleviate that burden. In this episode Amnon Drori, CEO and co-founder of Octopai, discusses the business problems he witnessed that led him to starting the company, how their systems are able to provide valuable tools and insights, and the direction that their product will be taking in the future. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 200Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Amnon Drori about OctopAI and the benefits of metadata management Interview Introduction How did you get involved in the area of data management? What is OctopAI and what was your motivation for founding it? What are some of the types of information that you classify and collect as metadata? Can you talk through the architecture of your platform? What are some of the challenges that are typically faced by metadata management systems? What is involved in deploying your metadata collection agents? Once the metadata has been collected what are some of the ways in which it can be used? What mechanisms do you use to ensure that customer data is segregated? How do you identify and handle sensitive information during the collection step? What are some of the most challenging aspects of your technical and business platforms that you have faced? What are some of the plans that you have for OctopAI going forward? Contact Info Amnon LinkedIn @octopai_amnon on Twitter OctopAI @OctopaiBI on Twitter Website Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links OctopAI Metadata Metadata Management Data Integrity CRM (Customer Relationship Management) ERP (Enterprise Resource Planning) Business Intelligence ETL (Extract, Transform, Load) Informatica SAP Data Governance SSIS (SQL Server Integration Services) Vertica Airflow Luigi Oozie GDPR (General Data Protection Regulation) Root Cause Analysis The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
39:53 · 23/04/2018
Data Engineering Weekly with Joe Crobak - Episode 27

Summary The rate of change in the data engineering industry is alternately exciting and exhausting. Joe Crobak found his way into the work of data management by accident as so many of us do. After being engrossed with researching the details of distributed systems and big data management for his work he began sharing his findings with friends. This led to his creation of the Hadoop Weekly newsletter, which he recently rebranded as the Data Engineering Weekly newsletter. In this episode he discusses his experiences working as a data engineer in industry and at the USDS, his motivations and methods for creating a newsletter, and the insights that he has gleaned from it. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Joe Crobak about his work maintaining the Data Engineering Weekly newsletter, and the challenges of keeping up with the data engineering industry. Interview Introduction How did you get involved in the area of data management? What are some of the projects that you have been involved in that were most personally fulfilling? As an engineer at the USDS working on the healthcare.gov and medicare systems, what were some of the approaches that you used to manage sensitive data? Healthcare.gov has a storied history, how did the systems for processing and managing the data get architected to handle the amount of load that it was subjected to? What was your motivation for starting a newsletter about the Hadoop space? Can you speak to your reasoning for the recent rebranding of the newsletter? How much of the content that you surface in your newsletter is found during your day-to-day work, versus explicitly searching for it? After over 5 years of following the trends in data analytics and data infrastructure what are some of the most interesting or surprising developments? What have you found to be the fundamental skills or areas of experience that have maintained relevance as new technologies in data engineering have emerged? What is your workflow for finding and curating the content that goes into your newsletter? What is your personal algorithm for filtering which articles, tools, or commentary gets added to the final newsletter? How has your experience managing the newsletter influenced your areas of focus in your work and vice-versa? What are your plans going forward? Contact Info Data Eng Weekly Email Twitter – @joecrobak Twitter – @dataengweekly Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? 
Links USDS National Labs Cray Amazon EMR (Elastic Map-Reduce) Recommendation Engine Netflix Prize Hadoop Cloudera Puppet healthcare.gov Medicare Quality Payment Program HIPAA NIST National Institute of Standards and Technology PII (Personally Identifiable Information) Threat Modeling Apache JBoss Apache Web Server MarkLogic JMS (Java Message Service) Load Balancer COBOL Hadoop Weekly Data Engineering Weekly Foursquare NiFi Kubernetes Spark Flink Stream Processing DataStax RSS The Flavors of Data Science and Engineering CQRS Change Data Capture Jay Kreps The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
43:32 · 15/04/2018
Defining DataOps with Chris Bergh - Episode 26

Summary Managing an analytics project can be difficult due to the number of systems involved and the need to ensure that new information can be delivered quickly and reliably. That challenge can be met by adopting practices and principles from lean manufacturing and agile software development, and the cross-functional collaboration, feedback loops, and focus on automation in the DevOps movement. In this episode Christopher Bergh discusses ways that you can start adding reliability and speed to your workflow to deliver results with confidence and consistency. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Christopher Bergh about DataKitchen and the rise of DataOps Interview Introduction How did you get involved in the area of data management? How do you define DataOps? How does it compare to the practices encouraged by the DevOps movement? How does it relate to or influence the role of a data engineer? How does a DataOps oriented workflow differ from other existing approaches for building data platforms? One of the aspects of DataOps that you call out is the practice of providing multiple environments to provide a platform for testing the various aspects of the analytics workflow in a non-production context. What are some of the techniques that are available for managing data in appropriate volumes across those deployments? The practice of testing logic as code is fairly well understood and has a large set of existing tools. What have you found to be some of the most effective methods for testing data as it flows through a system? One of the practices of DevOps is to create feedback loops that can be used to ensure that business needs are being met. What are the metrics that you track in your platform to define the value that is being created and how the various steps in the workflow are proceeding toward that goal? In order to keep feedback loops fast it is necessary for tests to run quickly. How do you balance the need for larger quantities of data to be used for verifying scalability/performance against optimizing for cost and speed in non-production environments? How does the DataKitchen platform simplify the process of operationalizing a data analytics workflow? As the need for rapid iteration and deployment of systems to capture, store, process, and analyze data becomes more prevalent how do you foresee that feeding back into the ways that the landscape of data tools are designed and developed? 
Contact Info LinkedIn @ChrisBergh on Twitter Email Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links DataOps Manifesto DataKitchen 2017: The Year Of DataOps Air Traffic Control Chief Data Officer (CDO) Gartner W. Edwards Deming DevOps Total Quality Management (TQM) Informatica Talend Agile Development Cattle Not Pets IDE (Integrated Development Environment) Tableau Delphix Dremio Pachyderm Continuous Delivery by Jez Humble and Dave Farley SLAs (Service Level Agreements) XKCD Image Recognition Comic Airflow Luigi DataKitchen Documentation Continuous Integration Continuous Delivery Docker Version Control Git Looker The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
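One practice raised in the interview is testing data as it flows through a system. Here is a minimal, generic sketch of that idea (not DataKitchen's tooling): cheap assertions that run after a pipeline step and fail the run loudly instead of letting bad data reach a report. Column names and thresholds are made up for illustration.

```python
# Minimal sketch of "tests for data": invariants checked after a pipeline step.
import pandas as pd

def check_orders(df: pd.DataFrame) -> None:
    assert len(df) > 0, "orders extract is empty"
    assert df["order_id"].is_unique, "duplicate order_id values"
    assert df["amount"].notna().all(), "null amounts found"
    assert (df["amount"] >= 0).all(), "negative amounts found"

orders = pd.DataFrame({"order_id": [1, 2, 3], "amount": [9.99, 24.50, 5.00]})
check_orders(orders)  # raises AssertionError if any invariant is violated
print("orders passed data tests")
```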
54:31 · 08/04/2018
ThreatStack: Data Driven Cloud Security with Pete Cheslock and Patrick Cable - Episode 25

Summary Cloud computing and ubiquitous virtualization have changed the ways that our applications are built and deployed. This new environment requires a new way of tracking and addressing the security of our systems. ThreatStack is a platform that collects all of the data that your servers generate and monitors for unexpected anomalies in behavior that would indicate a breach and notifies you in near-realtime. In this episode ThreatStack’s director of operations, Pete Cheslock, and senior infrastructure security engineer, Patrick Cable, discuss the data infrastructure that supports their platform, how they capture and process the data from client systems, and how that information can be used to keep your systems safe from attackers. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Pete Cheslock and Pat Cable about the data infrastructure and security controls at ThreatStack Interview Introduction How did you get involved in the area of data management? Why don’t you start by explaining what ThreatStack does? What was lacking in the existing options (services and self-hosted/open source) that ThreatStack solves for? Can you describe the type(s) of data that you collect and how it is structured? What is the high level data infrastructure that you use for ingesting, storing, and analyzing your customer data? How do you ensure a consistent format of the information that you receive? How do you ensure that the various pieces of your platform are deployed using the proper configurations and operating as intended? How much configuration do you provide to the end user in terms of the captured data, such as sampling rate or additional context? I understand that your original architecture used RabbitMQ as your ingest mechanism, which you then migrated to Kafka. What was your initial motivation for that change? How much of a benefit has that been in terms of overall complexity and cost (both time and infrastructure)? How do you ensure the security and provenance of the data that you collect as it traverses your infrastructure? What are some of the most common vulnerabilities that you detect in your client’s infrastructure? For someone who wants to start using ThreatStack, what does the setup process look like? What have you found to be the most challenging aspects of building and managing the data processes in your environment? 
What are some of the projects that you have planned to improve the capacity or capabilities of your infrastructure? Contact Info Pete Cheslock @petecheslock on Twitter Website petecheslock on GitHub Patrick Cable @patcable on Twitter Website patcable on GitHub ThreatStack Website @threatstack on Twitter threatstack on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links ThreatStack SecDevOps Sonian EC2 Snort Snorby Suricata Tripwire Syscall (System Call) AuditD CloudTrail Naxsi Cloud Native File Integrity Monitoring (FIM) Amazon Web Services (AWS) RabbitMQ ZeroMQ Kafka Spark Slack PagerDuty JSON Microservices Cassandra ElasticSearch Sensu Service Discovery Honeypot Kubernetes PostGreSQL Druid Flink Launch Darkly Chef Consul Terraform CloudFormation The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
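The interview covers moving event ingestion from RabbitMQ to Kafka. As a generic illustration of that ingest pattern (not ThreatStack's actual code), the sketch below serializes a host event as JSON and publishes it to a Kafka topic. It assumes a local broker and the kafka-python package; the topic name and event fields are made up.

```python
# Generic sketch of agent events flowing into Kafka for downstream analysis.
import json
from datetime import datetime, timezone
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda e: json.dumps(e).encode("utf-8"),
)

event = {
    "host": "web-01",
    "syscall": "execve",
    "user": "deploy",
    "args": ["/usr/bin/curl", "http://169.254.169.254/"],
    "observed_at": datetime.now(timezone.utc).isoformat(),
}
producer.send("agent-events", value=event)
producer.flush()  # block until the broker has acknowledged the message
```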
51:52 · 01/04/2018
MarketStore: Managing Timeseries Financial Data with Hitoshi Harada and Christopher Ryan - Episode 24

Summary The data that is used in financial markets is time oriented and multidimensional, which makes it difficult to manage in either relational or timeseries databases. To make this information more manageable the team at Alpaca built a new data store specifically for retrieving and analyzing data generated by trading markets. In this episode Hitoshi Harada, the CTO of Alpaca, and Christopher Ryan, their lead software engineer, explain their motivation for building MarketStore, how it operates, and how it has helped to simplify their development workflows. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Christopher Ryan and Hitoshi Harada about MarketStore, a storage server for large volumes of financial timeseries data Interview Introduction How did you get involved in the area of data management? What was your motivation for creating MarketStore? What are the characteristics of financial time series data that make it challenging to manage? What are some of the workflows that MarketStore is used for at Alpaca and how were they managed before it was available? With MarketStore’s data coming from multiple third party services, how are you managing to keep the DB up-to-date and in sync with those services? What is the worst case scenario if there is a total failure in the data store? What guards have you built to prevent such a situation from occurring? Since MarketStore is used for querying and analyzing data having to do with financial markets and there are potentially large quantities of money being staked on the results of that analysis, how do you ensure that the operations being performed in MarketStore are accurate and repeatable? What were the most challenging aspects of building MarketStore and integrating it into the rest of your systems? Motivation for open sourcing the code? What is the next planned major feature for MarketStore, and what use-case is it aiming to support? Contact Info Christopher Email Hitoshi Email Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links MarketStore GitHub Release Announcement Alpaca IBM DB2 GreenPlum Algorithmic Trading Backtesting OHLC (Open-High-Low-Close) HDF5 Golang C++ Timeseries Database List InfluxDB JSONRPC Slait CircleCI GDAX The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
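To give a flavor of the OHLC bucketing that market data stores like MarketStore revolve around, here is a generic sketch done with pandas rather than MarketStore's own API: raw ticks are collapsed into one-minute open/high/low/close bars. The tick data is made up.

```python
# Generic sketch of aggregating ticks into one-minute OHLC bars with pandas.
import pandas as pd

ticks = pd.DataFrame(
    {
        "price": [100.0, 100.5, 99.8, 100.2, 101.0],
        "time": pd.to_datetime(
            ["2018-03-20 09:30:05", "2018-03-20 09:30:40",
             "2018-03-20 09:31:10", "2018-03-20 09:31:30",
             "2018-03-20 09:31:55"]
        ),
    }
).set_index("time")

bars = ticks["price"].resample("1Min").ohlc()
print(bars)  # one row per minute with open, high, low, close columns
```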
33:28 · 25/03/2018
Stretching The Elastic Stack with Philipp Krenn - Episode 23

Summary Search is a common requirement for applications of all varieties. Elasticsearch was built to make it easy to include search functionality in projects built in any language. From that foundation, the rest of the Elastic Stack has been built, expanding to many more use cases in the process. In this episode Philipp Krenn describes the various pieces of the stack, how they fit together, and how you can use them in your infrastructure to store, search, and analyze your data. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Philipp Krenn about the Elastic Stack and the ways that you can use it in your systems Interview Introduction How did you get involved in the area of data management? The Elasticsearch product has been around for a long time and is widely known, but can you give a brief overview of the other components that make up the Elastic Stack and how they work together? Beyond the common pattern of using Elasticsearch as a search engine connected to a web application, what are some of the other use cases for the various pieces of the stack? What are the common scaling bottlenecks that users should be aware of when they are dealing with large volumes of data? What do you consider to be the biggest competition to the Elastic Stack as you expand the capabilities and target usage patterns? What are the biggest challenges that you are tackling in the Elastic stack, technical or otherwise? What are the biggest challenges facing Elastic as a company in the near to medium term? Open source as a business model: https://www.elastic.co/blog/doubling-down-on-open What is the vision for Elastic and the Elastic Stack going forward and what new features or functionality can we look forward to? Contact Info @xeraa on Twitter xeraa on GitHub Website Email Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Elastic Vienna – Capital of Austria What Is Developer Advocacy? NoSQL MongoDB Elasticsearch Cassandra Neo4J Hazelcast Apache Lucene Logstash Kibana Beats X-Pack ELK Stack Metrics APM (Application Performance Monitoring) GeoJSON Split Brain Elasticsearch Ingest Nodes PacketBeat Elastic Cloud Elasticon Kibana Canvas SwiftType The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
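To make the "search as a foundation" point concrete, here is a minimal sketch of the core Elasticsearch loop: index a JSON document, then run a full-text query against it. It assumes a local node on port 9200 and a 7.x version of the official Python client (the request-body style shown here changed in later client releases); the index name and fields are made up.

```python
# Minimal sketch: index one document and search it with the elasticsearch client.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

es.index(
    index="app-logs",
    body={"service": "checkout", "level": "error", "message": "payment timeout"},
    refresh="true",  # make the document searchable immediately (demo only)
)

hits = es.search(
    index="app-logs",
    body={"query": {"match": {"message": "timeout"}}},
)
for hit in hits["hits"]["hits"]:
    print(hit["_source"])
```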
51:02 · 19/03/2018
Database Refactoring Patterns with Pramod Sadalage - Episode 22

Summary As software lifecycles move faster, the database needs to be able to keep up. Practices such as version controlled migration scripts and iterative schema evolution provide the necessary mechanisms to ensure that your data layer is as agile as your application. Pramod Sadalage saw the need for these capabilities during the early days of the introduction of modern development practices and co-authored a book to codify a large number of patterns to aid practitioners, and in this episode he reflects on the current state of affairs and how things have changed over the past 12 years. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers Your host is Tobias Macey and today I’m interviewing Pramod Sadalage about refactoring databases and integrating database design into an iterative development workflow Interview Introduction How did you get involved in the area of data management? You first co-authored Refactoring Databases in 2006. What was the state of software and database system development at the time and why did you find it necessary to write a book on this subject? What are the characteristics of a database that make them more difficult to manage in an iterative context? How does the practice of refactoring in the context of a database compare to that of software? How has the prevalence of data abstractions such as ORMs or ODMs impacted the practice of schema design and evolution? Is there a difference in strategy when refactoring the data layer of a system when using a non-relational storage system? How has the DevOps movement and the increased focus on automation affected the state of the art in database versioning and evolution? What have you found to be the most problematic aspects of databases when trying to evolve the functionality of a system? Looking back over the past 12 years, what has changed in the areas of database design and evolution? How has the landscape of tooling for managing and applying database versioning changed since you first wrote Refactoring Databases? What do you see as the biggest challenges facing us over the next few years? Contact Info Website pramodsadalage on GitHub @pramodsadalage on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? 
Links Database Refactoring Website Book Thoughtworks Martin Fowler Agile Software Development XP (Extreme Programming) Continuous Integration The Book Wikipedia Test First Development DDL (Data Definition Language) DML (Data Manipulation Language) DevOps Flyway Liquibase DBMaintain Hibernate SQLAlchemy ORM (Object Relational Mapper) ODM (Object Document Mapper) NoSQL Document Database MongoDB OrientDB CouchBase CassandraDB Neo4j ArangoDB Unit Testing Integration Testing OLAP (On-Line Analytical Processing) OLTP (On-Line Transaction Processing) Data Warehouse Docker QA (Quality Assurance) HIPAA (Health Insurance Portability and Accountability Act) PCI DSS (Payment Card Industry Data Security Standard) Polyglot Persistence Toplink Java ORM Ruby on Rails ActiveRecord Gem The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
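As a small illustration of the versioned, iterative schema evolution discussed in this episode, here is a sketch of a "rename column" refactoring done in expand/contract style so old and new application versions can run side by side. The table and column names are hypothetical; it assumes a local PostgreSQL database and the psycopg2 driver, and is not a complete migration tool.

```python
# Sketch of an expand/contract schema migration applied with psycopg2.
import psycopg2

EXPAND = """
ALTER TABLE customers ADD COLUMN IF NOT EXISTS full_name text;
UPDATE customers SET full_name = name WHERE full_name IS NULL;
"""

CONTRACT = """
-- run only after every application version reads and writes full_name
ALTER TABLE customers DROP COLUMN IF EXISTS name;
"""

def apply(sql: str) -> None:
    # The connection context manager commits the transaction on success.
    with psycopg2.connect("dbname=app user=app") as conn:
        with conn.cursor() as cur:
            cur.execute(sql)

apply(EXPAND)      # deploy alongside the old code
# transition period: triggers or application code keep both columns in sync
# apply(CONTRACT)  # final step once nothing references the old column
```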
49:06 · 12/03/2018
The Future Data Economy with Roger Chen - Episode 21

Summary Data is an increasingly sought after raw material for business in the modern economy. One of the factors driving this trend is the increase in applications for machine learning and AI which require large quantities of information to work from. As the demand for data becomes more widespread the market for providing it will begin to transform the ways that information is collected and shared among and between organizations. With his experience as a chair for the O’Reilly AI conference and an investor for data driven businesses Roger Chen is well versed in the challenges and solutions facing us. In this episode he shares his perspective on the ways that businesses can work together to create shared data resources that will allow them to reduce the redundancy of their foundational data and improve their overall effectiveness in collecting useful training sets for their particular products. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers A few announcements: The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20% If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register. Your host is Tobias Macey and today I’m interviewing Roger Chen about data liquidity and its impact on our future economies Interview Introduction How did you get involved in the area of data management? You wrote an essay discussing how the increasing usage of machine learning and artificial intelligence applications will result in a demand for data that necessitates what you refer to as ‘Data Liquidity’. Can you explain what you mean by that term? What are some examples of the types of data that you envision as being foundational to multiple organizations and problem domains? Can you provide some examples of the structures that could be created to facilitate data sharing across organizational boundaries? Many companies view their data as a strategic asset and are therefore loath to provide access to other individuals or organizations. What encouragement can you provide that would convince them to externalize any of that information? What kinds of storage and transmission infrastructure and tooling are necessary to allow for wider distribution of, and collaboration on, data assets? 
What do you view as being the privacy implications from creating and sharing these larger pools of data inventory? What do you view as some of the technical challenges associated with identifying and separating shared data from those that are specific to the business model of the organization? With broader access to large data sets, how do you anticipate that impacting the types of businesses or products that are possible for smaller organizations? Contact Info @rgrchen on Twitter LinkedIn Angel List Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Electrical Engineering Berkeley Silicon Nanophotonics Data Liquidity In The Age Of Inference Data Silos Example of a Data Commons Cooperative Google Maps Moat: An article describing how Google Maps has refined raw data to create a new product Genomics Phenomics ImageNet Open Data Data Brokerage Smart Contracts IPFS Dat Protocol Homomorphic Encryption FileCoin Data Programming Snorkel Website Podcast Interview The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
42:48 · 05/03/2018
Honeycomb Data Infrastructure with Sam Stokes - Episode 20

Summary One of the sources of data that often gets overlooked is the systems that we use to run our businesses. This data is not used to directly provide value to customers or understand the functioning of the business, but it is still a critical component of a successful system. Sam Stokes is an engineer at Honeycomb where he helps to build a platform that is able to capture all of the events and context that occur in our production environments and use them to answer all of your questions about what is happening in your system right now. In this episode he discusses the challenges inherent in capturing and analyzing event data, the tools that his team is using to make it possible, and how this type of knowledge can be used to improve your critical infrastructure. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers A few announcements: There is still time to register for the O’Reilly Strata Conference in San Jose, CA March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20% The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20% If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register. Your host is Tobias Macey and today I’m interviewing Sam Stokes about his work at Honeycomb, a modern platform for observability of software systems Interview Introduction How did you get involved in the area of data management? What is Honeycomb and how did you get started at the company? Can you start by giving an overview of your data infrastructure and the path that an event takes from ingest to graph? What are the characteristics of the event data that you are dealing with and what challenges does it pose in terms of processing it at scale? In addition to the complexities of ingesting and storing data with a high degree of cardinality, being able to quickly analyze it for customer reporting poses a number of difficulties. Can you explain how you have built your systems to facilitate highly interactive usage patterns? A high degree of visibility into a running system is desirable for developers and systems administrators, but they are not always willing or able to invest the effort to fully instrument the code or servers that they want to track. 
What have you found to be the most difficult aspects of data collection, and do you have any tooling to simplify the implementation for users? How does Honeycomb compare to other systems that are available off the shelf or as a service, and when is it not the right tool? What have been some of the most challenging aspects of building, scaling, and marketing Honeycomb? Contact Info @samstokes on Twitter Blog samstokes on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Honeycomb Retriever Monitoring and Observability Kafka Column Oriented Storage Elasticsearch Elastic Stack Django Ruby on Rails Heroku Kubernetes Launch Darkly Splunk Datadog Cynefin Framework Go-Lang Terraform AWS The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
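To illustrate the wide, high-cardinality events described in this episode, here is a sketch of sending one structured event per request over HTTP. The endpoint and header names follow Honeycomb's public events API as I understand it, but treat them as assumptions; the write key, dataset name, and fields are placeholders.

```python
# Illustrative sketch: one wide, structured event per request, sent over HTTP.
import requests

event = {
    "request_id": "b9c1f2",           # placeholder identifier
    "endpoint": "/checkout",
    "status": 500,
    "duration_ms": 1240,
    "db_calls": 7,
    "user_plan": "enterprise",        # high-cardinality context is the point
}

requests.post(
    "https://api.honeycomb.io/1/events/production-api",  # assumed endpoint
    headers={"X-Honeycomb-Team": "YOUR_WRITE_KEY"},       # assumed header name
    json=event,
    timeout=5,
)
```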
41:33 · 26/02/2018
Data Teams with Will McGinnis - Episode 19

Summary The responsibilities of a data scientist and a data engineer often overlap and occasionally come to cross purposes. Despite these challenges it is possible for the two roles to work together effectively and produce valuable business outcomes. In this episode Will McGinnis discusses the opinions that he has gained from experience on how data teams can play to their strengths to the benefit of all. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers A few announcements: There is still time to register for the O’Reilly Strata Conference in San Jose, CA March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20% The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20% If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register. Your host is Tobias Macey and today I’m interviewing Will McGinnis about the relationship and boundaries between data engineers and data scientists Interview Introduction How did you get involved in the area of data management? The terms “Data Scientist” and “Data Engineer” are fluid and seem to have a different meaning for everyone who uses them. Can you share how you define those terms? What parallels do you see between the relationships of data engineers and data scientists and those of developers and systems administrators? Is there a particular size of organization or problem that serves as a tipping point for when you start to separate the two roles into the responsibilities of more than one person or team? What are the benefits of splitting the responsibilities of data engineering and data science? What are the disadvantages? What are some strategies to ensure successful interaction between data engineers and data scientists? How do you view these roles evolving as they become more prevalent across companies and industries? Contact Info Website wdm0006 on GitHub @willmcginniser on Twitter LinkedIn Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? 
Links Blog Post: Tendencies of Data Engineers and Data Scientists Predikto Categorical Encoders DevOps SciKit-Learn The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
28:39 · 19/02/2018
TimescaleDB: Fast And Scalable Timeseries with Ajay Kulkarni and Mike Freedman - Episode 18

Summary As communications between machines become more commonplace the need to store the generated data in a time-oriented manner increases. The market for timeseries data stores has many contenders, but they are not all built to solve the same problems or to scale in the same manner. In this episode the founders of TimescaleDB, Ajay Kulkarni and Mike Freedman, discuss how Timescale was started, the problems that it solves, and how it works under the covers. They also explain how you can start using it in your infrastructure and their plans for the future. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers Your host is Tobias Macey and today I’m interviewing Ajay Kulkarni and Mike Freedman about Timescale DB, a scalable timeseries database built on top of PostGreSQL Interview Introduction How did you get involved in the area of data management? Can you start by explaining what Timescale is and how the project got started? The landscape of time series databases is extensive and oftentimes difficult to navigate. How do you view your position in that market and what makes Timescale stand out from the other options? In your blog post that explains the design decisions for how Timescale is implemented you call out the fact that the inserted data is largely append only which simplifies the index management. How does Timescale handle out of order timestamps, such as from infrequently connected sensors or mobile devices? How is Timescale implemented and how has the internal architecture evolved since you first started working on it? What impact has the 10.0 release of PostGreSQL had on the design of the project? Is timescale compatible with systems such as Amazon RDS or Google Cloud SQL? For someone who wants to start using Timescale what is involved in deploying and maintaining it? What are the axes for scaling Timescale and what are the points where that scalability breaks down? Are you aware of anyone who has deployed it on top of Citus for scaling horizontally across instances? What has been the most challenging aspect of building and marketing Timescale? When is Timescale the wrong tool to use for time series data? One of the use cases that you call out on your website is for systems metrics and monitoring. How does Timescale fit into that ecosystem and can it be used along with tools such as Graphite or Prometheus? What are some of the most interesting uses of Timescale that you have seen? Which came first, Timescale the business or Timescale the database, and what is your strategy for ensuring that the open source project and the company around it both maintain their health? What features or improvements do you have planned for future releases of Timescale? 
Contact Info Ajay LinkedIn @acoustik on Twitter Timescale Blog Mike Website LinkedIn @michaelfreedman on Twitter Timescale Blog Timescale Website @timescaledb on Twitter GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Timescale PostgreSQL Citus Timescale Design Blog Post MIT NYU Stanford SDN Princeton Machine Data Timeseries Data List of Timeseries Databases NoSQL Online Transaction Processing (OLTP) Object Relational Mapper (ORM) Grafana Tableau Kafka When Boring Is Awesome PostgreSQL RDS Google Cloud SQL Azure DB Docker Continuous Aggregates Streaming Replication PGPool II Kubernetes Docker Swarm Citus Data Website Data Engineering Podcast Interview Database Indexing B-Tree Index GIN Index GIST Index STE Energy Redis Graphite Prometheus pg_prometheus OpenMetrics Standard Proposal Timescale Parallel Copy Hadoop PostGIS KDB+ DevOps Internet of Things MongoDB Elastic DataBricks Apache Spark Confluent New Enterprise Associates MapD Benchmark Ventures Hortonworks 2σ Ventures CockroachDB Cloudflare EMC Timescale Blog: Why SQL is beating NoSQL, and what this means for the future of data The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
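For listeners who want a feel for what getting started with Timescale looks like in practice, here is a minimal, hedged sketch. It assumes a PostgreSQL instance that already has the TimescaleDB extension installed; the connection string, table, and columns are illustrative placeholders, not anything taken from the episode.

```python
# Minimal sketch: creating and querying a TimescaleDB hypertable from Python.
# Assumes PostgreSQL with the TimescaleDB extension available; the DSN and
# schema below are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=metrics user=postgres host=localhost")  # placeholder DSN
cur = conn.cursor()

# Enable the extension and define an ordinary PostgreSQL table for readings.
cur.execute("CREATE EXTENSION IF NOT EXISTS timescaledb;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS conditions (
        time        TIMESTAMPTZ      NOT NULL,
        device_id   TEXT             NOT NULL,
        temperature DOUBLE PRECISION
    );
""")

# Turn the table into a hypertable partitioned on the time column.
cur.execute("SELECT create_hypertable('conditions', 'time', if_not_exists => TRUE);")

# Inserts look like plain SQL.
cur.execute(
    "INSERT INTO conditions (time, device_id, temperature) VALUES (now(), %s, %s);",
    ("sensor-1", 21.4),
)

# time_bucket() gives time-oriented rollups for dashboards and monitoring.
cur.execute("""
    SELECT time_bucket('1 hour', time) AS bucket, avg(temperature)
    FROM conditions
    GROUP BY bucket
    ORDER BY bucket;
""")
print(cur.fetchall())

conn.commit()
cur.close()
conn.close()
```

Because a hypertable behaves like a regular table, existing ORMs and BI tools mentioned in the links above can query it without special handling.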
01:02:40 11/02/2018
Pulsar: Fast And Scalable Messaging with Rajan Dhabalia and Matteo Merli - Episode 17

Summary One of the critical components for modern data infrastructure is a scalable and reliable messaging system. Publish-subscribe systems have been popular for many years, and recently stream oriented systems such as Kafka have been rising in prominence. This week Rajan Dhabalia and Matteo Merli discuss the work they have done on Pulsar, which supports both options, in addition to being globally scalable and fast. They explain how Pulsar is architected, how to scale it, and how it fits into your existing infrastructure. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers A few announcements: There is still time to register for the O’Reilly Strata Conference in San Jose, CA March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20% The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20% If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register. Your host is Tobias Macey and today I’m interviewing Rajan Dhabalia and Matteo Merli about Pulsar, a distributed open source pub-sub messaging system Interview Introduction How did you get involved in the area of data management? Can you start by explaining what Pulsar is and what the original inspiration for the project was? What have been some of the most challenging aspects of building and promoting Pulsar? For someone who wants to run Pulsar, what are the infrastructure and network requirements that they should be considering and what is involved in deploying the various components? What are the scaling factors for Pulsar and what aspects of deployment and administration should users pay special attention to? What projects or services do you consider to be competitors to Pulsar and what makes it stand out in comparison? The documentation mentions that there is an API layer that provides drop-in compatibility with Kafka. Does that extend to also supporting some of the plugins that have developed on top of Kafka? One of the popular aspects of Kafka is the persistence of the message log, so I’m curious how Pulsar manages long-term storage and reprocessing of messages that have already been acknowledged? When is Pulsar the wrong tool to use? 
What are some of the improvements or new features that you have planned for the future of Pulsar? Contact Info Matteo merlimat on GitHub @merlimat on Twitter Rajan @dhabaliaraj on Twitter rhabalia on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Pulsar Publish-Subscribe Yahoo Streamlio ActiveMQ Kafka Bookkeeper SLA (Service Level Agreement) Write-Ahead Log Ansible Zookeeper Pulsar Deployment Instructions RabbitMQ Confluent Schema Registry Podcast Interview Kafka Connect Wallaroo Podcast Interview Kinesis Athenz The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
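As a rough illustration of the publish/subscribe workflow discussed in this interview, here is a hedged sketch using the Python pulsar-client package. The broker URL, topic, and subscription names are placeholders, and a real deployment also involves the ZooKeeper and BookKeeper components covered in the episode.

```python
# Sketch of a basic Pulsar publish/subscribe round trip with the Python
# `pulsar-client` package. Broker URL, topic, and subscription are placeholders.
import pulsar

client = pulsar.Client("pulsar://localhost:6650")  # placeholder broker address

# Producer side: publish a few messages to a topic.
producer = client.create_producer("persistent://public/default/sensor-events")
for i in range(3):
    producer.send(f"reading-{i}".encode("utf-8"))

# Consumer side: a named subscription lets Pulsar track acknowledgements,
# so unacknowledged messages are redelivered.
consumer = client.subscribe(
    "persistent://public/default/sensor-events",
    subscription_name="demo-subscription",
)
for _ in range(3):
    msg = consumer.receive(timeout_millis=5000)
    print(msg.data().decode("utf-8"))
    consumer.acknowledge(msg)

client.close()
```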
53:47 04/02/2018
Dat: Distributed Versioned Data Sharing with Danielle Robinson and Joe Hand - Episode 16

Summary Sharing data across multiple computers, particularly when it is large and changing, is a difficult problem to solve. In order to provide a simpler way to distribute and version data sets among collaborators the Dat Project was created. In this episode Danielle Robinson and Joe Hand explain how the project got started, how it functions, and some of the many ways that it can be used. They also explain the plans that the team has for upcoming features and uses that you can watch out for in future releases. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers A few announcements: There is still time to register for the O’Reilly Strata Conference in San Jose, CA March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20% The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20% If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register. Your host is Tobias Macey and today I’m interviewing Danielle Robinson and Joe Hand about Dat Project, a distributed data sharing protocol for building applications of the future Interview Introduction How did you get involved in the area of data management? What is the Dat project and how did it get started? How have the grants to the Dat project influenced the focus and pace of development that was possible? Now that you have established a non-profit organization around Dat, what are your plans to support future sustainability and growth of the project? Can you explain how the Dat protocol is designed and how it has evolved since it was first started? How does Dat manage conflict resolution and data versioning when replicating between multiple machines? One of the primary use cases that is mentioned in the documentation and website for Dat is that of hosting and distributing open data sets, with a focus on researchers. 
How does Dat help with that effort and what improvements does it offer over other existing solutions? One of the difficult aspects of building a peer-to-peer protocol is that of establishing a critical mass of users to add value to the network. How have you approached that effort and how much progress do you feel that you have made? How does the peer-to-peer nature of the platform affect the architectural patterns for people wanting to build applications that are delivered via dat, vs the common three-tier architecture oriented around persistent databases? What mechanisms are available for content discovery, given the fact that Dat URLs are private and unguessable by default? For someone who wants to start using Dat today, what is involved in creating and/or consuming content that is available on the network? What have been the most challenging aspects of building and promoting Dat? What are some of the most interesting or inspiring uses of the Dat protocol that you are aware of? Contact Info Dat datproject.org Email @dat_project on Twitter Dat Chat Danielle Email @daniellecrobins Joe Email @joeahand on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Dat Project Code For Science and Society Neuroscience Cell Biology OpenCon Mozilla Science Open Education Open Access Open Data Fortune 500 Data Warehouse Knight Foundation Alfred P. Sloan Foundation Gordon and Betty Moore Foundation Dat In The Lab Dat in the Lab blog posts California Digital Library IPFS Dat on Open Collective – COMING SOON! ScienceFair Stencila eLIFE Git BitTorrent Dat Whitepaper Merkle Tree Certificate Transparency Dat Protocol Working Group Dat Multiwriter Development – Hyperdb Beaker Browser WebRTC IndexedDB Rust C Keybase PGP Wire Zenodo Dryad Data Sharing Dataverse RSync FTP Globus Fritter Fritter Demo Rotonde how to Joe’s website on Dat Dat Tutorial Data Rescue – NYTimes Coverage Data.gov Libraries+ Network UC Conservation Genomics Consortium Fair Data principles hypervision hypervision in browser The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Click here to read the unedited transcript… Tobias Macey 00:13 Hello and welcome to the data engineering podcast the show about modern data management. When you’re ready to launch your next project, you’ll need somewhere to deploy it, you should check out Linotype data engineering podcast.com slash load and get a $20 credit to try out there fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to date engineering podcast com to subscribe to the show. Sign up for the newsletter read the show notes and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes or Google Play Music, tell your friends and co workers and share it on social media. I’ve got a couple of announcements before we start the show. There’s still time to register for the O’Reilly strata conference in San Jose, California how from March 5 to the eighth. Use the link data engineering podcast.com slash strata dash San Jose to register and save 20% off your tickets. The O’Reilly AI conference is also coming up happening April 29. To the 30th. In New York, it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. 
Go to data engineering podcast.com slash AI con dash new dash York to register and save 20% off the tickets. Also, if you work with data or want to learn more about how the projects you have heard about on the show get used in the real world, then join me at the Open Data Science Conference happening in Boston from May 1 through the fourth. It has become one of the largest events for data scientists, data engineers and data driven businesses to get together and learn how to be more effective. To save 60% of your tickets go to data engineering podcast.com slash o d s c dash East dash 2018 and register. Your host is Tobias Macey. And today I’m interviewing Danielle Robinson and Joe hand about the DAP project the distributed data sharing protocol for building applications of the future. So Danielle, could you start by introducing yourself? Sure. Danielle Robinson 02:10 My name is Danielle Robinson. And I’m the CO executive director of code for science and society, which is the nonprofit that supports that project. I’ve been working on debt related projects first as a partnerships director for about a year now. And I’m here with my colleague, Joe hand, take it away, Joe. Joe Hand 02:32 Joe hand and I’m the other co executive director and the director of operations at code for science and society. And I’ve been a core contributor for about two years now. Tobias Macey 02:42 And Danielle, starting with you again, can you talk about how you first got involved and interested in the area of data management? Sure. Danielle Robinson 02:48 So I have a PhD in neuroscience. I finished that about a year and a half ago. And what I did during my PhD, my research was focused on cell biology Gee, really, without getting into the weeds too much on that a lot of time microscopes collecting some kind of medium sized aging data. And during that process, I became pretty frustrated with the academic and publishing systems that seemed to be limiting the access of access of people to the results of taxpayer funded research. So publications are behind paywalls. And data is either not published along with the paper or sometimes is published but not well archived and becomes inaccessible over time. So sort of compounding this traditionally, code has not really been thought of as an academic, a scholarly work. So that’s a whole nother conversation. But even though these things are changing data and code aren’t shared consistently, and are pretty inconsistently managed within labs, I think that’s fair to say. So and what that does is it makes it really hard to reproduce or replicate other people’s research, which is important for the scientific process. So during my PhD, I got really active in the open con and Mozilla science communities, which I encourage your listeners to check out. These communities build inter interdisciplinary connections between the open source world and open education, open access and open data communities. And that’s really important to like build things that people will actually use and make big cultural and policy changes that will make it easier to access research and share data. So it sort of I got involved, because of the partly because of the technical challenge. But also I’m interested in the people problems. So the changes to the incentive structure and the culture of research that are needed to make data management better on a day to day and make our research infrastructure stronger and more long lasting. Tobias Macey 04:54 And Joe, how did you get involved in data management? 
Joe Hand 04:57 Yeah, I’ve sort of gone back and forth between the sort of more academic or research a management and more traditional software side. So I really got started involved in data management when I was at a data visualization agency. And we basically built, you know, pretty web based visualization, interactive visualizations, for variety clients. This was cool, because it sort of allowed me to see like a large variety of data management techniques. So there was like the small scale, spreadsheet and manually updating data and spreadsheets, and then sending that off to visualize and to like, big fortune 500 companies that had data warehouses and full internal API’s that we got access to. So it’s really cool to see that sort of variety of, of data collection and data usage between all those organizations. So that was also good, because it, it sort of helped me understand how how to use data effectively. And that really means like telling a story around it. So you know, in order to sort of use data, you have to either use some math or some visual representation and the best the best stories around data combined, sort of bit of both of those. And then from there, I moved to a Research Institute. And we were tasked with building a data platform for international NGO. And they that group basically does census data collection in slums all over the world. And so as a research group, we were sort of trying interested in in using that data for research, but we also had to help them figure out how to collect that data. So before we came in with that project, they’d basically doing 30 years of data collection on paper, and then simulate sometimes manually entering that data into spreadsheets, and then trying to sort of share that around through thumb drives or Dropbox or sort of whatever tools they had access to. So this was cool, because it really gave me a great opportunity to see the other side of data management and analysis. So, you know, we work with the corporate clients, which sort of have big, lots of resources and computer computer resources and cloud servers. And this was sort of the other side where there’s, there’s very few resources, most of the data analysis happens offline. And a lot of the data transfer happens offline. So it was really cool to an interesting to see that, that a lot of the tools I’d been taking for granted sort of weren’t, couldn’t be applied in those in those areas. And then on the research side of things, I saw that, you know, as scientists and governments, they were just sort of haphazardly organizing data in the same way. So I was sort of trying to collect and download census data from about 30 countries. And we had to email right fax people, we got different CDs and paper documents and PDFs and other languages. So that really illustrated that there’s like a lot of data manage out there in a way that that I wasn’t totally familiar with. And it’s just, it’s just very crazy how everybody manages their data in different way. And that’s sort of a long, what I like to call the long tail of data management. So people that don’t use sort of traditional databases or manage it in their sort of unique ways. And most people managing day that in that way, you probably wouldn’t call it data, but it’s just sort of what they use to get their job done. And so once I started to sort of look at alternatives to managing that research data, I found that basically, and was hooked and started to contribute. So that’s sort of how I found that. 
Tobias Macey 08:16 So that leads us nicely into talking about what the project is. And as much of the origin story each of you might be aware of. And Joe, you already mentioned how you got involved in the project. But Danielle, if you could also share your involvement or how you got started with it as well, Danielle Robinson 08:33 yeah, I can tell the origin story. So the DAP project is an open source community building a protocol for peer to peer data sharing. And as a protocol, it’s similar to HTTP and how the protocols used today, but that adds extra security and automatic provisioning, and allows users to connect to a decentralized network in a decentralized network. You can store the data anywhere, either in a cloud or in a local computer, and it does work offline. And so data is built to make it easy for developers to build decentralized applications without worrying about moving data around and the people who originally developed it. And that’ll be Mathias, and Max and Chris, they’re scratching their own itch for building software to share and archive public and research data. And this is how Joe got involved, like he was saying before. And so it originally started as an open source project. And then that got a grant from the Knight Foundation in 2013, as a prototype grant focusing on government data, and then that was followed up in 2014, by a grant from the Alfred P. Sloan Foundation, and that grant focus more on scientific research and allowed the project to put a little more effort into working with researchers. And since then, we’ve been working to solve research data management problems by developing software on top of the debt protocol. And the most recent project is funded by the Gordon and anymore foundation. And now, that project started 2016. And that supports us it’s called debt in the lab, and I can get you a link to it on our blog. It supports us to work with California Digital Library and research groups in the University of California system to make it easier to move files around version data sets from support researchers through automating archiving. And so that’s a really cool project, because we get to work directly with researchers and do the kind of participatory design software stuff that we enjoy doing and create things that people will actually use. And we get to learn about really exciting research very, very different from the research, I did my PhD, one of the labs were working with a study see star Wasting Disease. So it’s really fascinating stuff. And we get to work right with them to make things that we’re going to fit into their workflows. So I started working with that, in the summer, right before that grant was funded. So I guess maybe six month before that grant was funded. And so I was came on as a consultant initially to help write grants and start talking about how to work directly with researchers and what to build that researchers will really help them move their data around and version control it. So So yeah, that’s how I became involved. And then in the fall, I transitioned to a partnerships position, and then the ED position in the last month. Tobias Macey 11:27 And you mentioned that a lot of the sort of boost to the project has come in the form of grants from a few different foundations. So I’m wondering if you can talk a bit about how those different grants have influenced the focus and pace of the development that was possible for the project? 
Joe Hand 11:42 Yeah, I mean, that really occupies a unique position in the open source world with that grant funding. So you know, for the first few years, it was closer to sort of a research project than a traditional product focused startup and other projects, other open source projects like that might be done part time as a side project, or just sort of for fun. But the grant funding really allowed the original developers to sign on and work full time, really solving harder problems that they might might be able to otherwise. So since we sort of got those grants, we’ve been able to toe the line between more user facing product and some research software. And the grant really gave us opportunity to, to tow that line, but also getting a field and connect with researchers and end users. So we can sort of innovate in with technical solutions, but really ground those real in reality with with specific scientific use cases. So you know, this balances really only possible because of that grant funding, which sort of gives us more flexibility and might have a little longer timeline than then VC money or or just like a open source, side project. But now we’re really at a critical juncture, I’d say we’re grant funding is not quite enough to cover what we want to do. But we’re lucky the protocol is really getting in a more stable position. And we’re starting to, to look at those user facing products on top and starting to build those those around around the core protocol. Tobias Macey 13:10 And the fact that you have received so many different rounds of grant funding, sort of lends credence to the fact that you’re solving a critical problem that lots of people are coming up against. And I’m wondering if there are any other projects or companies or organizations that are trying to tackle similar or related problems that you sort of view as co collaborators or competitors in the space? Where do you think that the DAP project is fairly uniquely positioned to solve the specific problems that it’s addressing? Joe Hand 13:44 Yeah, I mean, I would say we have, you know, there are other similar use cases and tools. And you know, a lot of that is around sharing open data sets, and sort of that the publishing of data, which Daniel might be able to talk more about, but on the on the sort of technical side, there is, you know, other I guess the biggest competitor or similar thing might be I PFS, which is another sort of decentralized protocol for for sharing and, and storing data in different ways. But we’re really we’re actually, you know, excited to work with these various companies. So you know, I PFS is more of a storage focus format. So basically allows content based storage on a distributed network. And that’s really more about sort of the the transfer protocol and, and being very interoperable without all these other solutions. So yeah, you know, that’s what we’re more excited about it is trying to understand how we can how we can use that in collaboration with all these other groups. Yeah, Danielle Robinson 14:41 I think I’m just close one, what Joe said, through my time coming up in the open con community and the Mozilla science community, there are a lot of people trying to improve access to data broadly. And I, most of the people, I know everyone in the space really takes collaboration, not competition, sort of approach, because there are a lot of different ways to solve the problem, depending on who what the end user wants. And there are there’s a lot of great projects working in the space. 
I would agree with Joe, I guess that IP address is the thing that people sometimes you know, like I’ll be at a an event and someone will say, what’s the difference between detonate, PFS, and I answered pretty much how judges answered. But it’s important to note that we know those people, and we have good relationships with them. And we’ve actually just been emailing with them about some kind of collaboration over the next year. So it’s there’s a lot of there’s a lot of really great projects in the open data and improving access to data space. And I basically support them all. So hopefully, there’s so much work to be done that I think there’s room for all the people in the space. Tobias Macey 15:58 And now that you have a style, a nonprofit organization around that, are there any particular plans that you have to support future sustainability and growth for the project? Danielle Robinson 16:09 Yes, future sustainability and growth for the project is what we wake up and think about every day, sometimes in the middle of the night. That’s the most important thing. And incorporating the nonprofit was a big step that happened, I think, the end of 2016. And so it’s critical as we move towards a self sustaining future. And importantly, it will also allow us to continue to support and incubate other open source projects in the space, which is something that I’m really excited about. For dat, our goal is to support a core group of top contributors through grants, revenue sharing, and donations. And so over the next 12 months will be pursuing grants and corporate donations, as well as rolling out an open collective page to help facilitate smaller donations, and continuing to develop products with an eye towards things that can generate revenue and support that idea that ecosystem at the same time, we’re also focusing on sustainability within the project itself. And what I mean by that is, you know, governance, community management. And so we are right now working with the developer community to formalize the technical process on the protocol through a working group. And those are really great calls, lots of great people are involved in that. And we really want to make sure that protocol decisions are made transparently. And it can involve a wider group of the community in the process. And we also want to make the path to participation, involvement and community leadership clear for newcomers. So by supporting the developer community, we hope to encourage like new and exciting implementations of the DAP protocol, some of the stuff that happened 2017, you know, from my perspective, working in the science and sort of came out of nowhere, and people are building, you know, amazing new social networks based on that. And it was really fun and exciting. And so just keeping the community healthy, and making sure that the the technical process and how decisions get made is really clear and transparent, I think was going to facilitate even more of that. And just another comment about being a nonprofit because code for science, and society is a nonprofit, we also act as a fiscal sponsor. And what that means is that like minded projects, who get grant funding that are not nonprofits, so they can’t accept the grant on their grant through us. And then we take a small percentage of that grant. And we use that to help those projects by linking them up with our community. 
I work with them on grant writing, and fundraising and strategy will support their own community engagement efforts and sometimes offer technical support. And we see this is really important to the ecosystem and a way to help smaller projects develop and succeed. So right now we do that with two projects. One of them is called sin Silla. And it can send a link for that. And the other one is called science fair. scintilla is an open source project predictable documents software funded by the Alfred P. Sloan Foundation. It’s looking to support researchers from data collection to document offering. And science fair is a peer to peer library built on data, which is designed to make it easy for scholars to curate collections of research on a certain topic, annotate them and share it with their colleagues. And so that project was funded by a prototype grant from a publisher called life. And they’re looking for additional funding. So we’re working with both of them. And in the first quarter of this year, Joe and I are working to formalize the process of how we work with these other projects and what we can offer them and hopefully, we’ll be in the position take on additional projects later this year. But I really enjoy that work. And I think, as someone so I went through the Mozilla fellowship, which was like a 10 month long, crazy period where Mozilla invested a lot in me and making sure I was meeting people and learning how to write grants and learning how to give good talks and all kinds of awesome investment. And so for a person who goes through a program like that, or a person who has a side project, there’s kind of there’s a need for groups in the space, who can incubate those projects, and help them as they develop from from the incubator stage to the, you know, middle stage before they scale up. So I thinking there’s, so as a fiscal sponsor, we were hoping to be able to support projects in that space. Tobias Macey 20:32 And digging into the debt protocol itself. When I was looking through the documentation, it mentioned that the actual protocol itself is agnostic to the implementation. And I know that the current reference implementation is done in JavaScript. So I’m wondering if you could describe a bit about how the protocol itself is designed, how the reference implementation is done, and how the overall protocol has evolved since it was first started and what your approach is to version in the protocol itself to ensure that people who are implementing it and other technologies or formats are able to ensure that they’re compliant with specific versions of the protocol as it evolves. Joe Hand 21:19 Yeah, so that’s basically a combination of ideas from from get BitTorrent, and just the the web in general. And so there are a few key properties in that, but basically, any implementation has to recreate. And those are content, integrity, decentralized mirroring of the data sets, network, privacy, incremental version, and then random access to the data. So we have a white paper that sort of explains all these in depth, but I’ll sort of explain how they work maybe in a basic use case. So let’s say I want to send some data to Danielle, which I do all the time. And I have a spreadsheet where I keep track of my coffee intake intake. So I want to live Danielle’s computer so she can make sure I’m not over caffeinated myself. So sort of similar to how you get started with get, I would put my spreadsheet in a folder and create a new dat. And so whenever I create a new debt, it makes a new key pair. 
So one is the public key and one is the private key. The public key is basically the dat link, kind of like a URL, so you can use it in anything that speaks the Dat protocol and just open it up and look at all the files inside of the dat. The private key allows me to write files to the dat, and it’s used to sign any of the new changes. Those signatures allow Danielle to verify that the changes actually came from me, and that somebody else wasn’t trying to fake my data or man-in-the-middle my data while I was transferring it to her. So I add my spreadsheet to the dat, and what Dat does is break that file into little chunks, hash all of those chunks, and create a Merkle tree with them. That Merkle tree has lots of cool properties and is one of the key features of Dat. The Merkle tree allows us to sparsely replicate data: if we had a really big data set and you only want one file, we can use the Merkle tree to download just that file and still verify the integrity of that content even with an incomplete data set. The other part that allows us to do that is the register. All the file content is stored in one register, and all the metadata is stored in another register, and these registers are basically append-only ledgers. They’re also sometimes known as secure registers; Google has a project called Certificate Transparency that has similar ideas. Whenever a file changes, you append that change to the metadata register, and that register stores information about the structure of the file system, what version it is, and then any other metadata, like the creation time or the change time of that file. And right now, as you said, Tobias, we are very flexible on how things are implemented, but we basically store the files as files, which allows people to see the files and interact with them normally. The cool part is that the on-disk file storage can be really flexible: as long as the implementation has random access, it can store the data in any different way. We have, for example, a storage model built for the server that stores all of the files as a single file, which allows you to have fewer file descriptors open and gets the file I/O all constrained to one file. Once my file gets added, I can share my link privately with Danielle; I can send it over chat or just paste it somewhere. Then she can clone my dat using our command line tool or the desktop tool or the Beaker browser. When she clones my dat, our computers basically connect directly to each other. We use a variety of mechanisms to try to make that connection, and that’s been one of the challenges I can talk about later, how to connect peer to peer and the challenges around that. But once we do connect, we’ll transfer the data either over TCP or UDP. Those are the default network protocols that we use right now, but it can basically run on top of any other protocol. I think Mathias once said that if you could implement it over carrier pigeon, it would work fine, as long as you had a lot of pigeons. So we’re really open to how the protocol information gets transferred.
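To make the key pair and Merkle tree flow described above concrete, here is a small illustrative sketch in Python. It is not the actual Dat/hypercore implementation or wire format; it only mimics the steps Joe outlines (chunk the file, hash the chunks into a Merkle root, sign with the writer's private key, verify with the shared public key), and it assumes the PyNaCl library for Ed25519 signatures.

```python
# Illustrative sketch only -- not the actual Dat/hypercore implementation.
# Assumes the PyNaCl library (pip install pynacl) for Ed25519 signing.
import hashlib
from nacl.signing import SigningKey

CHUNK_SIZE = 64 * 1024  # split content into fixed-size chunks

def chunk(data: bytes):
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)] or [b""]

def merkle_root(chunks):
    """Hash every chunk, then pairwise-hash up to a single root."""
    level = [hashlib.sha256(c).digest() for c in chunks]
    while len(level) > 1:
        if len(level) % 2:            # duplicate the last node on odd levels
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

# Creating a "dat": generate a key pair; the public key acts like the link.
signing_key = SigningKey.generate()        # private key, kept by the writer
verify_key = signing_key.verify_key        # public key, shared as the address

# Writer side: chunk the file, compute the root, sign it.
spreadsheet = b"date,cups\n2018-01-30,3\n"
root = merkle_root(chunk(spreadsheet))
signed_root = signing_key.sign(root)

# Reader side: anyone holding the public key can verify the content came from
# the key holder and was not tampered with in transit.
verify_key.verify(signed_root)             # raises BadSignatureError if forged
assert merkle_root(chunk(spreadsheet)) == signed_root.message
print("content verified")
```

In the real protocol the hashes and signatures live in the registers discussed next, which is what lets a reader verify any individual chunk it downloads without fetching the whole data set.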
And we’re working on a Dat-over-HTTP implementation too. That wouldn’t be peer to peer, but it would allow a traditional server fallback if no peers are online, or for services that don’t want to run peer to peer for whatever reason. Once Danielle clones my dat, she can open it just like a normal file and plug it into R or Python or whatever, and use her equation to measure my caffeine level. Then let’s say I drink another cup of coffee and update my spreadsheet: the changes will automatically be synced to her as long as she’s still connected to me, and they will be synced throughout the network to anybody else that’s connected to me. The metadata register stores that updated file information, and the content register stores just the changed file blocks, so Danielle only has to sync the diff of that content change rather than the whole dataset again, which is really useful for big data sets. And yeah, we’ve had to design each of these pieces to be as modular as possible, both within our JavaScript reference implementation and in the protocol in general. So right now developers can swap in other network protocols or data storage. For example, if you want to use Dat in the browser, you can use WebRTC for the network and discovery and then use IndexedDB for data storage. IndexedDB has random access, so you can plug it directly into Dat, and we have some modules for those that should be working. We did have a WebRTC implementation we were supporting for a while, but we found it a bit inconsistent for our use cases, which are more around large file sharing; it still might be okay for chat and other more text-based things. So, yeah, all of our implementation is in Node right now. That was both for usability and developer friendliness, and also just being able to work in the browser and across platforms. We can distribute a binary of Dat pretty easily now, and you can run it in the browser or build Dat tools on Electron, so it allows a wide range of developer tools built on top of it. But we have a few community members now working on different implementations, and Rust and C I think are the two that are going right now. As far as protocol versioning, that was actually one of the big conversations we were having in the last working group meeting, and it’s to be decided, basically. Through the stages we’ve gone through, we’ve broken it quite a few times, and now we’re finally in a place where we want to make sure not to break it moving forward. There is space in the protocol for information like the version history or the version of the protocol, so we’ll probably use that to signal the version and figure out how the tools that implement it can fall back to the latest version. Before all the file-based stuff, the project went through a few different stages; it started really as more of a versioned, decentralized database. Then as Max and Mathias and Karissa moved to the scientific use cases, they removed more and more of the database architecture as it matured. That transition was really driven by user feedback and watching researchers work. We realized that so much of research data is still kept in files and basically moved manually between machines, so even if we were going to build a special database, a lot of researchers still wouldn’t be able to use it, because that requires more infrastructure than they have time to support. So we really just kept working to build a general purpose solution that allows other people to build tools to solve those more specific problems. And the last point is that right now, all Dat transfer is basically one way, so only one person can update the source. This is really useful for a lot of our research cases where they’re getting data from lab equipment, where there’s a specific source and you just want to disseminate that information to various computers, but it really doesn’t work for collaboration. So that’s the next thing that we’re working on. We really want to make sure to solve this one-way problem before we move to the harder problem of collaborative data sets. That last major iteration is sort of the hardest, and it’s what we’re working on right now: allowing multiple users to write to the same dat. And with that, we get into problems like conflict resolution, duplicate updates, and other harder distributed computing problems.
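Here is a similarly hedged sketch of the register idea Joe describes: an append-only log of metadata entries, where a reader that already holds version N only syncs the entries appended after N. The structure and field names below are illustrative only, not the real hypercore/Dat format.

```python
# Minimal sketch of an append-only "register": appending metadata entries and
# syncing only the diff to a reader. Not the actual hypercore wire format.
import hashlib
import json
import time

class Register:
    def __init__(self):
        self.entries = []          # append-only list of metadata entries

    def append(self, path, content_hash):
        entry = {
            "seq": len(self.entries),
            "path": path,
            "content_hash": content_hash,
            "mtime": time.time(),
        }
        self.entries.append(entry)
        return entry["seq"]

    def version(self):
        return len(self.entries)

    def diff_since(self, version):
        """A reader that already has `version` entries only syncs the rest."""
        return self.entries[version:]

# Writer appends a change each time the spreadsheet is updated.
metadata = Register()
metadata.append("coffee.csv", hashlib.sha256(b"date,cups\n2018-01-30,3\n").hexdigest())

reader_version = metadata.version()        # reader is now up to date

metadata.append("coffee.csv", hashlib.sha256(b"date,cups\n2018-01-30,4\n").hexdigest())

# Reader pulls only the new entries rather than the whole history.
for entry in metadata.diff_since(reader_version):
    print(json.dumps(entry, indent=2))
```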
And we realized that so much of research data is still kept in files and basically moved manually between machines. So even if we were going to build like a special database, a lot of researchers still won’t be able to use that, because that sort of requires more more infrastructure than there they have time to support. So we really just kept working to build a general purpose solution that allows other people to build tools to solve those, those more specific problems. And the last point is that right now, all that transfer is basically one way so only one person can update the source. This is really useful for a lot of our research escape research cases where they’re getting data from lab equipment, where there’s like a specific source, and you just want to disseminate that information to various computers. But it really doesn’t work for collaboration. So that’s sort of the next thing that we’re working on. But we really want to make sure to solve, solve this sort of one way problem before we move to the harder problem of collaborative data sets. And this last major iteration is sort of the hardest. And that’s what we’re working in right now. But it’s sort of allows multiple users to write to the same that. And with that, we sort of get into problems like conflict resolution and, and duplicate updates and other other sort of harder distributed computing problems. Tobias Macey 30:24 And that partially answers one of the next questions I had, which was to ask about conflict resolution. But if there’s only one source that’s allowed to update the information, then that solves a lot of the problems that might arise by sinking all these data sets between multiple machines, because there aren’t going to be multiple parties changing the data concurrently. So you don’t have to worry about how to handle those use cases. And another question that I had from what you were talking about is the cryptography aspect of that sounds as though when you initialize the data, it just automatically generates the pressure private key. And so that private key is chronically linked with that particular data set. But is there any way to use for instance, Coinbase or jpg, to sign the source that in addition to the generated key to establish your identity for some for when you’re trying to share that information publicly? And not necessarily via some channel that already has established trust? Joe Hand 31:27 Yeah, I mean, you can sort of so once, I mean, you could, like do that within the that. We don’t really have any mechanism for doing that on top of that. So it’s, you know, we’re sort of going to throw that into user land right now. But, yeah, I mean, that’s a good good question. And we’ve we’ve had some people, I think, experimenting with different identity systems and and how to solve that problem. And I think we’re pretty excited about the, the new wire app, because that’s open source, and it uses end to end encryption and has some identity system and are sort of trying to see if we can sort of build that on top of wire. So that’s, that’s one of the things that we’re sort of experimenting with. Tobias Macey 32:09 And one of the primary use cases that is mentioned in the documentation, and the website for that is being able to host and distribute open data sets with a focus being on researchers and academic use cases. 
So I’m wondering if you can talk some more about how that helps with that particular effort and what improvements it offers over some of the existing solutions that researchers were using prior Danielle Robinson 32:33 there are solutions for both hosting and distributing data. And in terms of hosting and distribution. There’s a lot of great work, focused on data publication and making sure that data associated with publications is available online and thinking about the noto and Dryad or data verse. There are also other data hosting platforms such as see can or data dot world. And we really love the work these people do and we’ve collaborated with some of them are were involved in like, the organization of friendly org people life for the open source Alliance for open scholarship has some people from Dryad who are involved in it. And so it’s nice to work with them. And we’d love to work with them to use that to upload and distribute data. But right now, if researchers need to feed if researchers need to share files between many machines and keep them updated, and version, so for example, if there’s a large live updating data set, there really aren’t great solutions to address data version and sharing. So in terms of sharing, transferring lots of researchers still manually copy files between machines and servers, or use tools like our sink or FTP, which is how I handled it during my PhD. Other software such as Globus or even Dropbox box can require more IT infrastructure than small research group may have researchers like you know, they are all operating on limited grant funding. And they also depend on the it structure of their institution to get them access to certain things. So a researcher like me might spend all day collecting a terabyte of data on a microscope and then wait for hours or wait overnight to move it to another location. And the ideal situation from a data management perspective is that those raw data are automatically archived to the web server and sent to the researchers computer for processing. So you have an archived copy of the raw data that came off of the equipment. And in the process, files also need to be archived. So you need archives of the imaging files, in this case at each step in processing. And then when a publication is ready, the data processing pipeline, in order for it to be fully reproducible, you’ll need the code and you’ll need the data at different stages. And even without access to to compete, the computer, the cluster where the analysis was done, a person should be able to repeat that. And I say ideally, because this isn’t really how it’s happening. Now. archiving data, a different steps can be the some of the things that stop that from happening, or just cost of storage, and the availability of storage and researcher habits. So I definitely, you know, know some researchers who kept data on hard drives and Tupperware to protect them in case the sprinklers ever went off, which isn’t really like a long term solution, true facts. So that can make on can automate these archiving steps at different checkpoints and make the backups easier for researchers. As a former researcher, I’m interested in anything that makes better data management automatic for researchers. 
And so we’re also interested in version computer environments to help labs avoid the drawer full of jobs tribes problem, which is sadly, a quote from a senior scientist who was describing a bunch of data collected by her lab that she can no longer access, she has the drawer, she has the jazz drives, she can’t get in them, that data is essentially lost. And so researchers are really motivated to make sure when things are archived, they’re archived in a forum where they can actually be accessed. But I think, because researchers are so busy, it’s really hard to know like, when that is, so I think because we’re so focused on essentially like filling in the gaps between the services that researchers use, and it worked well for them and automating things, I think that that’s in a really good position to solve some of these problems. And if you have, you know, some of the researchers that we’re working with now, I’m thinking of one person who has a large data set and bioinformatics pipeline, and he’s at a UC lab, and he wants to get all the information to his closet right here in Washington State. And it’s taken months, and he has not been able to do it or he can get he can’t, he just can’t move that data across institutional lines. So and that’s a much longer conversation as to like why exactly that isn’t working. But we’re working with him to try to just make him make it possible for him to move the data and create a version iteration or a version emulation of his compute environment so that his collaborator can just do what he was doing and not need to spend four months worrying about dependencies and stuff. So yeah, hopefully, that’s the question. Tobias Macey 37:39 And one of the other difficult aspects of building a peer to peer protocol is the fact that in order for there to be sufficient value in the protocol itself is there needs to be a network behind it of people to be able to share that information with and share the bandwidth requirements for being able to distribute that in front. So I’m wondering how you have approached the effort of building up that network, and how much progress you feel you have made in that effort? Joe Hand 38:08 Yeah, I’m not sure we really view that as as that traditional peer to peer protocol, I’m using that model sort of relying on on network effects to scale. So you know, as Danielle said, we’re just trying to get data from A to B. And so our critical mass is basically to users on a given data set. So obviously, we want to first build something that offers better tools for those to users over traditional cloud or client server model. So if I’m transferring files to another researcher using Dropbox, you know, we have to transfer files via a third party and a third computer before it can get to the other computer. So rather than going direct between two computers, we have to go through a detour. And this has implications for speed, but also security bandwidth usage, and even something like energy usage. So by cutting off at their computer, we feel like we’re we’re already about adding value to the network, we’re sort of hoping that when when researchers that are doing this HDB transfer, they they can sort of see the value of going directly. And and using something that is version and can like be life synced over existing tools, like our st corrected E or, or the commercial services that might store data in the cloud. And you know, we really don’t have anything against the centralized services, we sort of recognize that they’re very useful sometimes. 
But they also aren’t the answer to everything. Depending on the use case, a decentralized system might make more sense than a centralized one, and we want to offer developers and users the option to make that choice, which we don’t really have right now. But in order to do that, we really have to start with peer-to-peer tools first. Once we have that decentralized network, we can basically limit the network to one server peer and many clients, and then all of a sudden it’s centralized. So we understand that it’s easy to go from decentralized to centralized, but it’s harder to go the other way around; we have to start with a peer-to-peer network in order to solve all these different problems. The other thing is that we know file systems are not going away. We know that web browsers will continue to support static files. And we also know that people will want to move these things between computers, back them up, archive them, share them to different computers. So we know files are going to be transferred a lot in the future, and that’s something we can depend on. They probably even want to do this in a secure way sometimes, and maybe in an offline environment or a local network. So we’re basically trying to build from those basic principles, using peer-to-peer transfer as the bedrock of all of that, and that’s how we got to where we are now with the peer-to-peer network. But we’re not really worried that we need a certain critical mass of users to add value, because we feel like by building the right tools with these principles, we can start adding value whether it’s a decentralized network or a centralized network. Tobias Macey 40:59 And one of the other use cases that’s been built on top of Dat is being able to build websites and applications that can be viewed in a web browser and distributed peer to peer in that manner. So I’m wondering how much uptake you’ve seen in usage for that particular application of the protocol, and how much development effort is being focused on that particular use case? Joe Hand 41:20 Yeah, so if I open my Beaker browser right now, which is the main web implementation we have, and which Paul Frazee and Tara Vancil are working on, I think I usually have 50 to 100, or sometimes 200, peers that I connect to right away. That’s through some of the social network apps, like Rotonde and Fritter, and then just some personal sites. And we’ve been working with the Beaker browser folks for probably two years now, sort of co-developing the protocol and seeing what they need support for in Beaker. But it comes back to that basic principle that a lot of websites are static files, and if we can just support static files in the best way possible, then you can browse a lot of websites. That even gives you the benefit for things that are more interactive: we know that they have to be developed so they work offline too. So both Rotonde and Twitter can work offline, and then once you get back online, you can just sync the data seamlessly. That’s the most exciting part about those. Danielle Robinson 42:29 You mean Fritter, not Twitter. Fritter is the Twitter clone that Tara Vancil and Paul made. Beaker’s a lot of fun.
And if you’ve never played around with it, I would encourage you to download it. I think it’s just beakerbrowser.com. I’m not a developer by trade, but I have seriously enjoyed playing around in Beaker, and I think some of the more frivolous things like Fritter that have come out of it are a lot of fun and really speak to the potential of peer-to-peer networks in today’s era, as people are becoming increasingly frustrated with the centralized platforms. Tobias Macey 43:13 And given that the content that’s being distributed via Dat using the browser is primarily static in nature, I’m wondering how that affects the architectural patterns that people are used to using with the common three-tier architecture. And you’ve already mentioned a couple of social network applications that have been built on top of it, but I’m wondering if there are any others that are built on top of and delivered via Dat that you’re aware of and could talk about, that speak to some of the ways that people are taking advantage of Dat in more of the consumer space? Joe Hand 43:47 Yeah, I think one of the big shifts that has made this easier is having databases in the browser, so things like IndexedDB or other local storage databases, and then being able to sync those to other computers. I think people are trying to build games off this. So you could build a chess game where I write to my local database, and then you have some logic for determining if a move is valid or not, and then sync that to your competitor. It’s a more constrained environment, but I think that also gives you the benefit of being able to constrain your development and not requiring these external services or external database calls or whatever. I know that I’ve tried a few times to develop projects, just fun little things, and it is a challenge, because you have to think differently about how those things work and you can’t necessarily rely on external services, whether that’s something as simple as loading fonts from an external service, or CSS styles, or external JavaScript; you want that all to be packaged within one dat if you want to ensure it’s all going to work. So Dat has you thinking a little differently even on those simple things. But yeah, it does constrain the sort of bigger applications. The other area where we could see development is more in Electron applications. So maybe not in Beaker, but Electron, using that framework as a platform for other types of applications that might need those more flexible models. ScienceFair, which is one of our hosted projects, is a really good example of how to use Dat in a way to distribute data but still have a full application. Basically, you can distribute all the data for the application over Dat and keep it updated through the live syncing, and users can download the PDFs that they need to read, or the journals or the figures they want to read.
Tobias Macey 46:15
One of the other challenges that's posed, particularly for the public distribution use case, is content discovery, because by default the URLs that are generated are private and unguessable, since they're essentially just hashes of the content. So I'm wondering whether there are any particular mechanisms that you have built, planned, or started discussing for facilitating content discovery of the information that's being distributed by these different networks?

Joe Hand 46:50
Yeah, this is definitely an open question. I'll fall back on my usual answer, which is that it depends on the tool we're talking about and the different communities, and there are going to be different approaches; some might be more decentralized and some might be centralized. For example, with dataset discovery there are a lot of good centralized services for dataset publishing, as Danielle mentioned, like Zenodo or Dataverse. Those are places that already have discovery engines, I guess we'll say, and they publish datasets, so you could similarly publish the dat URL along with those datasets so that people have an alternative way to download them. That's one way we've been thinking about discovery: leveraging existing solutions that are doing a really good job in their domain, and trying to work with them to start using Dat for their data management. Another solution, a hacky one I'll admit, is using existing domains and DNS. Basically, you can publish a regular HTTP site at your URL and give it a specific well-known file that points to your dat address, and then the Beaker browser can find that file and tell you that a peer-to-peer version of that site is available. So we're leveraging the existing DNS infrastructure to start discovering content just with existing URLs. And I think a lot of the discovery will be more community based. In Fritter and Rotonde, for example, people are starting to build crawlers and search bots to discover users or to search, so it's basically about looking at where there is a need, identifying the different types of crawlers to build, and figuring out how to connect those communities in different ways. We're really excited to see what ideas pop up in that area, and they'll probably come in a decentralized way, we hope.
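Here is a rough sketch of the well-known-file approach Joe describes: an ordinary HTTP(S) site serves a small text file whose first line is its dat address, and a client fetches that file to discover the peer-to-peer version. The /.well-known/dat path and the line-oriented format are my reading of the convention at the time and should be treated as assumptions, as should the example.com domain.

```typescript
// Sketch of DNS/HTTPS-based discovery for a dat:// mirror of a normal website.
// Assumption: the site serves plain text at /.well-known/dat whose first line
// is a dat:// address (later lines may carry metadata such as a TTL).
async function discoverDatVersion(domain: string): Promise<string | null> {
  const wellKnownUrl = `https://${domain}/.well-known/dat`;
  try {
    const response = await fetch(wellKnownUrl);
    if (!response.ok) {
      return null; // no peer-to-peer version advertised
    }
    const body = await response.text();
    const firstLine = body.split('\n')[0].trim();
    // Accept only lines that actually look like a dat address.
    return firstLine.startsWith('dat://') ? firstLine : null;
  } catch {
    return null; // network error: treat as "not available"
  }
}

// Example: check whether a domain advertises a dat:// mirror.
discoverDatVersion('example.com').then((datUrl) => {
  console.log(datUrl
    ? `Peer-to-peer version available at ${datUrl}`
    : 'No dat:// version advertised for this domain');
});
```

This mirrors what Joe describes Beaker doing on the user's behalf: the familiar domain name stays the entry point, while the content itself can come from peers.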
Tobias Macey 48:46
And for somebody who wants to start using Dat, what is involved in creating or consuming the content that's available on the network? And are there any particular resources available to help somebody get up to speed on how it works and some of the different uses they could put it to?

Danielle Robinson 49:05
Sure, I can take that, and Joe can chime in if he thinks of anything else. We built a tutorial for our work with the labs and for MozFest this year; I think it's at try-dat.com. The tutorial takes you through how to work with the command line tool and some basics about Beaker. And please tell us if you find a bug, because there may be bugs remaining, but it was working pretty well the last time I used it, and it runs in the browser. You can either share data with yourself (it spins up a little virtual machine for that) or you can do it with a friend and share data with your friend. Beaker is also super easy for a user who wants to get started: you can visit pages over dat just like you would a normal web page. For example, you can go to a website (we'll give Tobias the link for the show notes) and just change the http to dat, so the address looks like dat:// followed by the site name. Beaker also has this fun thing that lets you create a new site with a single click, and you can fork sites, edit them, and make your own copies of things, which is fun if you're learning how to build websites. You can go to beakerbrowser.com to learn more about that. I think we've already talked about Rotonde and Fritter, and we'll add links for people who want to learn more about those. Then, for data-focused users, you can use Dat for sharing or transferring files with either the desktop application or the command line interface. If you're interested, we encourage you to play around; the community is really friendly and helpful to new people. Joe and I are always on the IRC channel or on Twitter, so if you have questions, feel free to ask. We love talking to new people, because that's how all the exciting stuff happens in this community.

Tobias Macey 50:58
And what have been some of the most challenging aspects of building the project and the community, and of promoting the use cases and capabilities of the project?

Danielle Robinson 51:10
I can speak a little bit to promoting it in academic research. In academic research, probably similar to many of the industries where your listeners work, software decisions are not always made for entirely rational reasons. There's tension between what your boss wants, what the IT department has approved, which often comes down to institutional data security needs, and the perceived time cost of developing a new workflow and getting used to a new protocol. So we try to work directly with researchers to make sure the things we build are easy and secure, but it takes a lot of promotion and outreach to get scientists to try a new workflow. They're really busy, and the incentives are all about getting more grants, doing more projects, and publishing more papers, so even if something will eventually make your life easier, it's hard to sink in the time up front. One thing I've noticed, and this is probably common to all industries, is that I'll be talking to someone and they'll say, "Oh, archiving the data from my research group is not a problem for me," and then they'll proceed to describe a super problematic data management workflow. It's not a problem for them anymore because they're used to it, so it doesn't hurt day to day. But doing things like waiting until the point of publication and then trying to go back and archive all the raw data, when maybe some of it was collected by a postdoc who's now gone and some by a summer student who used a non-standard naming scheme for all the files: there are just a million ways that stuff can go wrong. So for now we're focusing on developing real-world use cases and participating in community education around data management. We want to build things that are meaningful for researchers and others who work with data, and we think that working with people, doing the nonprofit thing, and getting grants is going to be the way to get us there.
Joe, do you want to talk a little bit about building it?

Joe Hand 53:03
Yeah, sure. In terms of building it, I haven't done too much work on the core protocol, so I can't say much about the difficult design decisions there. I'm the main developer on the command line tool, and most of the challenging decisions there are about user interfaces, not necessarily technical problems. So, as Danielle said, it's as much about people as it is about software. But I think the most challenging thing we've run into a lot is basically network issues. In a peer-to-peer network you have to figure out how to connect peers directly, on networks that might not want you to do that. I think a lot of that comes from BitTorrent, which led different institutions to restrict peer-to-peer networking in different ways, so we're having to fight that battle against these existing restrictions, trying to find out how these networks are restricted and how we can keep successfully connecting peers directly rather than through a third-party server. And it's funny, or maybe not funny, but some of the strictest networks we've found are actually at academic institutions. For example, at one of the UC campuses we found that computers can never connect directly to other computers on the same network. So if we wanted to transfer data between two computers sitting right next to each other, we would basically have to go through an external cloud server just to get the data to the computer sitting right next to it, or else use a hard drive or a thumb drive or whatever. All of these different network configurations are one of the hardest parts, both in terms of implementation and in terms of testing, since we can't readily get into these UC campuses to see what the network setup is. So we're trying to create more tools around network testing, both testing networks in the wild and using virtual networks to simulate different types of network setups, and to leverage those two things combined to try to get around all of these network connection issues. I would love for you to ask Mathias this question about the design decisions in the core protocol, but I can't really say much about that, unfortunately.
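As a rough illustration of the kind of check involved when debugging these restricted networks, here is a toy probe that attempts a direct TCP connection to another peer and reports whether it succeeds before a timeout. It is not how Dat's networking stack works internally; it only uses Node's standard net module, and the host, port, and timeout values are placeholders.

```typescript
// Toy diagnostic: can this machine open a direct TCP connection to a peer?
// On networks like the campus example above, this kind of probe fails and
// transfers have to fall back to an external relay (or a thumb drive).
import * as net from 'net';

function probeDirectConnection(host: string, port: number, timeoutMs = 5000): Promise<boolean> {
  return new Promise((resolve) => {
    const socket = net.connect({ host, port });
    let settled = false;

    const finish = (reachable: boolean) => {
      if (settled) return;     // report the first outcome only
      settled = true;
      socket.destroy();
      resolve(reachable);
    };

    socket.setTimeout(timeoutMs);
    socket.once('connect', () => finish(true));   // direct path works
    socket.once('timeout', () => finish(false));  // silently dropped, common on filtered networks
    socket.once('error', () => finish(false));    // actively refused or unreachable
  });
}

// Example: probe a hypothetical peer on the local network.
probeDirectConnection('192.168.1.42', 3282).then((reachable) => {
  console.log(reachable
    ? 'Direct peer connection succeeded'
    : 'Direct connection blocked; a relay (or sneakernet) may be needed');
});
```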
Tobias Macey 55:29
And are there any particularly interesting or inspiring uses of Dat that you're aware of and would like to share?

Danielle Robinson 55:36
Sure, I can share a couple of things we were involved in. Last January we were involved in the data rescue and Libraries+ Network community, the movement to archive government-funded research at trusted public institutions like libraries and archives. As part of that we got to work with some of the really awesome people at the California Digital Library. The California Digital Library is really cool because it's a digital library with a mandate to preserve, archive, and steward the data produced in the UC system, and it supports the entire UC system; and the people are great. We worked with them to make the first-ever backup of data.gov that January, and I think my colleague had 40 terabytes of metadata sitting in his living room for a while as we were working up to the transfer. So that was a really cool project, and it produced a useful thing, and we got to work with some of the data.gov people to make it happen. They were like, "Really, it has never been backed up?" So it was a good time to do it. But believe it or not, it's actually pretty hard to find funding for that work, and we have more work we'd like to do in that space. Archiving copies of federally funded research at trusted institutions is a really critical step toward ensuring the long-term preservation of the research that gets done in this country, so hopefully 2018 will see those projects funded, or new collaborations in that space. It's also a fantastic community, with a lot of really interesting librarians and archivists who have great perspective on long-term data preservation, and I love working with them, so hopefully we can do something else there. The other thing I'm really excited about is the Dat in the Lab project and the work on the Dat container issue. I don't want to run too far over time, so I don't know how much I should go into this, but we've learned a lot about really interesting research. We're working to develop a container-based simulation of a research computing cluster that can run on any machine or in the cloud. By creating a container that includes the complete software environment of the cluster, researchers across the UC system can quickly get the analysis pipelines they're working on usable in other locations. Believe it or not, this is a big problem: I was surprised when one researcher told me she had been working for four months to get a pipeline running at UC Merced that had been developed at UCLA. You could drive back and forth between Merced and UCLA a bunch of times in four months, but it's this little stuff that really slows research down, so I'm really excited about the potential there. We've written a couple of blog posts on that, and I can add the links to those in the follow-up.

Joe Hand 58:36
And I'd say the most novel use that I'm excited about is called Hypervision, which is basically video streaming built on Dat. Mathias Buus, one of the lead developers on Dat, is prototyping something similar with Danish public TV; they basically want to live stream their channels over the peer-to-peer network. I'm excited about that because I'd really love to get more public television and public radio distributing content peer to peer, so we can reduce their infrastructure costs and hopefully allow more of that great content to come out.

Tobias Macey 59:09
Are there any other topics that we didn't discuss yet that you think we should talk about before we close out the show?

Danielle Robinson 59:15
Um, I think I'm feeling pretty good. What about you, Joe?

Joe Hand 59:18
Yeah, I think that's it for me.

Tobias Macey 59:20
Okay. So for anybody who wants to keep up to date with the work you're doing or get in touch, I'll have you each add your preferred contact information to the show notes.
And as a final question, to give the listeners something else to think about: from your perspective, what is the biggest gap in the tooling or technology that's available for data management today?

Joe Hand 59:42
I'd say transferring files, which feels really funny to say, but to me it's still a problem that's not really well solved: how do you get files from A to B in a consistent and easy-to-use manner? You especially want a solution that doesn't require a command line, is still secure, and hopefully doesn't go through a third-party service, because that hopefully means it works offline. A lot of what I saw in the developing world is the need for data management that works offline, and I think that's one of the biggest gaps that we don't really address yet. There are a lot of great data management tools out there, but I think they're aimed more at data scientists or software-focused users who might use managed databases or something like Hadoop. There's really a ton of users out there who don't have tools for this. Most of the world is still offline or has inconsistent internet, and putting everything through servers in the cloud isn't really feasible, but the alternatives right now require careful, manual data management if you don't want to lose all your data. So we really hope to find a good balance between those two needs and those two use cases.

Danielle Robinson 01:00:48
Plus one to what Joe said about transferring files. It does feel funny to say, but it is still a problem in a lot of industries, and especially where I come from in research science. From my perspective, the other issue is that the people problems are always as hard as, or harder than, the technical problems. If people don't think it's important to share data or archive data in an accessible and usable form, we could have the world's best, easiest-to-use tool and it wouldn't impact the landscape or the accessibility of data. Similarly, if people are sharing data that's not usable, because it's missing experimental context, or it's in a proprietary format, or it's shared under a restrictive license, it's also not going to impact the landscape or be useful to the scientific community or the public. So we want to build great tools, but I also want to work to change the incentive structure in research to ensure that good data management practices are rewarded and that data is shared in a usable form. That's really key. I'll add a link in the show notes to the FAIR data principles, which say data should be findable, accessible, interoperable, and reusable; that's something your listeners might want to check out if they're not familiar with it. It's a framework developed in academia, but I'm not sure how much impact it has had outside of that sphere, so it would be interesting to talk with your listeners a little bit about that. And I'll put my contact info in the show notes. I'd love to connect with anyone and answer any further questions about Dat and what we're going to try to do with Code for Science & Society over the next year. So thanks a lot, Tobias, for inviting us.

Tobias Macey 01:02:30
Yeah, absolutely. Thank you both for taking the time out of your days to join me and talk about the work you're doing. It's definitely a very interesting project with a lot of useful potential.
And so I'm excited to see where you go from here into the future. So thank you both for your time, and I hope you enjoy the rest of your evening.

Unknown Speaker 01:02:48
Thank you. Thank you.

Transcribed by https://otter.ai

Support Data Engineering Podcast
01:02:58 · 29/01/2018
Snorkel: Extracting Value From Dark Data with Alex Ratner - Episode 15

Snorkel: Extracting Value From Dark Data with Alex Ratner - Episode 15

Summary The majority of the conversation around machine learning and big data pertains to well-structured and cleaned data sets. Unfortunately, that is just a small percentage of the information that is available, so the rest of the sources of knowledge in a company are housed in so-called “Dark Data” sets. In this episode Alex Ratner explains how the work that he and his fellow researchers are doing on Snorkel can be used to extract value by leveraging labeling functions written by domain experts to generate training sets for machine learning models. He also explains how this approach can be used to democratize machine learning by making it feasible for organizations with smaller data sets than those required by most tooling. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers Your host is Tobias Macey and today I’m interviewing Alex Ratner about Snorkel and Dark Data Interview Introduction How did you get involved in the area of data management? Can you start by sharing your definition of dark data and how Snorkel helps to extract value from it? What are some of the most challenging aspects of building labelling functions and what tools or techniques are available to verify their validity and effectiveness in producing accurate outcomes? Can you provide some examples of how Snorkel can be used to build useful models in production contexts for companies or problem domains where data collection is difficult to do at large scale? For someone who wants to use Snorkel, what are the steps involved in processing the source data and what tooling or systems are necessary to analyse the outputs for generating usable insights? How is Snorkel architected and how has the design evolved over its lifetime? What are some situations where Snorkel would be poorly suited for use? What are some of the most interesting applications of Snorkel that you are aware of? What are some of the other projects that you and your group are working on that interact with Snorkel? What are some of the features or improvements that you have planned for future releases of Snorkel? Contact Info Website ajratner on Github @ajratner on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Stanford DAWN HazyResearch Snorkel Christopher Ré Dark Data DARPA Memex Training Data FDA ImageNet National Library of Medicine Empirical Studies of Conflict Data Augmentation PyTorch Tensorflow Generative Model Discriminative Model Weak Supervision The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
37:13 · 22/01/2018
CRDTs and Distributed Consensus with Christopher Meiklejohn - Episode 14

CRDTs and Distributed Consensus with Christopher Meiklejohn - Episode 14

Summary As we scale our systems to handle larger volumes of data, geographically distributed users, and varied data sources the requirement to distribute the computational resources for managing that information becomes more pronounced. In order to ensure that all of the distributed nodes in our systems agree with each other we need to build mechanisms to properly handle replication of data and conflict resolution. In this episode Christopher Meiklejohn discusses the research he is doing with Conflict-Free Replicated Data Types (CRDTs) and how they fit in with existing methods for sharing and sharding data. He also shares resources for systems that leverage CRDTs, how you can incorporate them into your systems, and when they might not be the right solution. It is a fascinating and informative treatment of a topic that is becoming increasingly relevant in a data driven world. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers Your host is Tobias Macey and today I’m interviewing Christopher Meiklejohn about establishing consensus in distributed systems Interview Introduction How did you get involved in the area of data management? You have dealt with CRDTs with your work in industry, as well as in your research. Can you start by explaining what a CRDT is, how you first began working with them, and some of their current manifestations? Other than CRDTs, what are some of the methods for establishing consensus across nodes in a system and how does increased scale affect their relative effectiveness? One of the projects that you have been involved in which relies on CRDTs is LASP. Can you describe what LASP is and what your role in the project has been? Can you provide examples of some production systems or available tools that are leveraging CRDTs? If someone wants to take advantage of CRDTs in their applications or data processing, what are the available off-the-shelf options, and what would be involved in implementing custom data types? What areas of research are you most excited about right now? Given that you are currently working on your PhD, do you have any thoughts on the projects or industries that you would like to be involved in once your degree is completed? Contact Info Website cmeiklejohn on GitHub Google Scholar Citations Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? 
Links Basho Riak Syncfree LASP CRDT Mesosphere CAP Theorem Cassandra DynamoDB Bayou System (Xerox PARC) Multivalue Register Paxos RAFT Byzantine Fault Tolerance Two Phase Commit Spanner ReactiveX Tensorflow Erlang Docker Kubernetes Erleans Orleans Atom Editor Automerge Martin Kleppmann Akka Delta CRDTs Antidote DB Kops Eventual Consistency Causal Consistency ACID Transactions Joe Hellerstein The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
45:43 · 15/01/2018
Citus Data: Distributed PostGreSQL for Big Data with Ozgun Erdogan and Craig Kerstiens - Episode 13

Citus Data: Distributed PostGreSQL for Big Data with Ozgun Erdogan and Craig Kerstiens - Episode 13

Summary PostGreSQL has become one of the most popular and widely used databases, and for good reason. The level of extensibility that it supports has allowed it to be used in virtually every environment. At Citus Data they have built an extension to support running it in a distributed fashion across large volumes of data with parallelized queries for improved performance. In this episode Ozgun Erdogan, the CTO of Citus, and Craig Kerstiens, Citus Product Manager, discuss how the company got started, the work that they are doing to scale out PostGreSQL, and how you can start using it in your environment. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers Your host is Tobias Macey and today I’m interviewing Ozgun Erdogan and Craig Kerstiens about Citus, worry free PostGreSQL Interview Introduction How did you get involved in the area of data management? Can you describe what Citus is and how the project got started? Why did you start with Postgres vs. building something from the ground up? What was the reasoning behind converting Citus from a fork of PostGres to being an extension and releasing an open source version? How well does Citus work with other Postgres extensions, such as PostGIS, PipelineDB, or Timescale? How does Citus compare to options such as PostGres-XL or the Postgres compatible Aurora service from Amazon? How does Citus operate under the covers to enable clustering and replication across multiple hosts? What are the failure modes of Citus and how does it handle loss of nodes in the cluster? For someone who is interested in migrating to Citus, what is involved in getting it deployed and moving the data out of an existing system? How do the different options for leveraging Citus compare to each other and how do you determine which features to release or withhold in the open source version? Are there any use cases that Citus enables which would be impractical to attempt in native Postgres? What have been some of the most challenging aspects of building the Citus extension? What are the situations where you would advise against using Citus? What are some of the most interesting or impressive uses of Citus that you have seen? What are some of the features that you have planned for future releases of Citus? 
Contact Info Citus Data citusdata.com @citusdata on Twitter citusdata on GitHub Craig Email Website @craigkerstiens on Twitter Ozgun Email ozgune on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Citus Data PostGreSQL NoSQL Timescale SQL blog post PostGIS PostGreSQL Graph Database JSONB Data Type PipelineDB Timescale PostGres-XL Aurora PostGres Amazon RDS Streaming Replication CitusMX CTE (Common Table Expression) HipMunk Citus Sharding Blog Post Wal-e Wal-g Heap Analytics HyperLogLog C-Store The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
46:44 · 08/01/2018
Wallaroo with Sean T. Allen - Episode 12

Wallaroo with Sean T. Allen - Episode 12

Summary Data oriented applications that need to operate on large, fast-moving sterams of information can be difficult to build and scale due to the need to manage their state. In this episode Sean T. Allen, VP of engineering for Wallaroo Labs, explains how Wallaroo was designed and built to reduce the cognitive overhead of building this style of project. He explains the motivation for building Wallaroo, how it is implemented, and how you can start using it today. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers Your host is Tobias Macey and today I’m interviewing Sean T. Allen about Wallaroo, a framework for building and operating stateful data applications at scale Interview Introduction How did you get involved in the area of data engineering? What is Wallaroo and how did the project get started? What is the Pony language, and what features does it have that make it well suited for the problem area that you are focusing on? Why did you choose to focus first on Python as the language for interacting with Wallaroo and how is that integration implemented? How is Wallaroo architected internally to allow for distributed state management? Is the state persistent, or is it only maintained long enough to complete the desired computation? If so, what format do you use for long term storage of the data? What have been the most challenging aspects of building the Wallaroo platform? Which axes of the CAP theorem have you optimized for? For someone who wants to build an application on top of Wallaroo, what is involved in getting started? Once you have a working application, what resources are necessary for deploying to production and what are the scaling factors? What are the failure modes that users of Wallaroo need to account for in their application or infrastructure? What are some situations or problem types for which Wallaroo would be the wrong choice? What are some of the most interesting or unexpected uses of Wallaroo that you have seen? What do you have planned for the future of Wallaroo? Contact Info IRC Mailing List Wallaroo Labs Twitter Email Personal Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? 
Links Wallaroo Labs Storm Applied Apache Storm Risk Analysis Pony Language Erlang Akka Tail Latency High Performance Computing Python Apache Software Foundation Beyond Distributed Transactions: An Apostate’s View Consistent Hashing Jepsen Lineage Driven Fault Injection Chaos Engineering QCon 2016 Talk Codemesh in London: How did I get here? CAP Theorem CRDT Sync Free Project Basho Wallaroo on GitHub Docker Puppet Chef Ansible SaltStack Kafka TCP Dask Data Engineering Episode About Dask Beowulf Cluster Redis Flink Haskell The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
59:13 · 25/12/2017
SiriDB: Scalable Open Source Timeseries Database with Jeroen van der Heijden - Episode 11

SiriDB: Scalable Open Source Timeseries Database with Jeroen van der Heijden - Episode 11

Summary Time series databases have long been the cornerstone of a robust metrics system, but the existing options are often difficult to manage in production. In this episode Jeroen van der Heijden explains his motivation for writing a new database, SiriDB, the challenges that he faced in doing so, and how it works under the hood. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers Your host is Tobias Macey and today I’m interviewing Jeroen van der Heijden about SiriDB, a next generation time series database Interview Introduction How did you get involved in the area of data engineering? What is SiriDB and how did the project get started? What was the inspiration for the name? What was the landscape of time series databases at the time that you first began work on Siri? How does Siri compare to other time series databases such as InfluxDB, Timescale, KairosDB, etc.? What do you view as the competition for Siri? How is the server architected and how has the design evolved over the time that you have been working on it? Can you describe how the clustering mechanism functions? Is it possible to create pools with more than two servers? What are the failure modes for SiriDB and where does it fall on the spectrum for the CAP theorem? In the documentation it mentions needing to specify the retention period for the shards when creating a database. What is the reasoning for that and what happens to the individual metrics as they age beyond that time horizon? One of the common difficulties when using a time series database in an operations context is the need for high cardinality of the metrics. How are metrics identified in Siri and is there any support for tagging? What have been the most challenging aspects of building Siri? In what situations or environments would you advise against using Siri? Contact Info joente on Github LinkedIn Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links SiriDB Oversight InfluxDB LevelDB OpenTSDB Timescale DB KairosDB Write Ahead Log Grafana The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
33:52 · 18/12/2017
Confluent Schema Registry with Ewen Cheslack-Postava - Episode 10

Confluent Schema Registry with Ewen Cheslack-Postava - Episode 10

Summary To process your data you need to know what shape it has, which is why schemas are important. When you are processing that data in multiple systems it can be difficult to ensure that they all have an accurate representation of that schema, which is why Confluent has built a schema registry that plugs into Kafka. In this episode Ewen Cheslack-Postava explains what the schema registry is, how it can be used, and how they built it. He also discusses how it can be extended for other deployment targets and use cases, and additional features that are planned for future releases. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers Your host is Tobias Macey and today I’m interviewing Ewen Cheslack-Postava about the Confluent Schema Registry Interview Introduction How did you get involved in the area of data engineering? What is the schema registry and what was the motivating factor for building it? If you are using Avro, what benefits does the schema registry provide over and above the capabilities of Avro’s built in schemas? How did you settle on Avro as the format to support and what would be involved in expanding that support to other serialization options? Conversely, what would be involved in using a storage backend other than Kafka? What are some of the alternative technologies available for people who aren’t using Kafka in their infrastructure? What are some of the biggest challenges that you faced while designing and building the schema registry? What is the tipping point in terms of system scale or complexity when it makes sense to invest in a shared schema registry and what are the alternatives for smaller organizations? What are some of the features or enhancements that you have in mind for future work? Contact Info ewencp on GitHub Website @ewencp on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Kafka Confluent Schema Registry Second Life Eve Online Yes, Virginia, You Really Do Need a Schema Registry JSON-Schema Parquet Avro Thrift Protocol Buffers Zookeeper Kafka Connect The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
49:22 · 10/12/2017
data.world with Bryon Jacob - Episode 9

data.world with Bryon Jacob - Episode 9

Summary We have tools and platforms for collaborating on software projects and linking them together, wouldn’t it be nice to have the same capabilities for data? The team at data.world are working on building a platform to host and share data sets for public and private use that can be linked together to build a semantic web of information. The CTO, Bryon Jacob, discusses how the company got started, their mission, and how they have built and evolved their technical infrastructure. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers This is your host Tobias Macey and today I’m interviewing Bryon Jacob about the technology and purpose that drive data.world Interview Introduction How did you first get involved in the area of data management? What is data.world and what is its mission and how does your status as a B Corporation tie into that? The platform that you have built provides hosting for a large variety of data sizes and types. What does the technical infrastructure consist of and how has that architecture evolved from when you first launched? What are some of the scaling problems that you have had to deal with as the amount and variety of data that you host has increased? What are some of the technical challenges that you have been faced with that are unique to the task of hosting a heterogeneous assortment of data sets that intended for shared use? How do you deal with issues of privacy or compliance associated with data sets that are submitted to the platform? What are some of the improvements or new capabilities that you are planning to implement as part of the data.world platform? What are the projects or companies that you consider to be your competitors? What are some of the most interesting or unexpected uses of the data.world platform that you are aware of? Contact Information @bryonjacob on Twitter bryonjacob on GitHub LinkedIn Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links data.world HomeAway Semantic Web Knowledge Engineering Ontology Open Data RDF CSVW SPARQL DBPedia Triplestore Header Dictionary Triples Apache Jena Tabula Tableau Connector Excel Connector Data For Democracy Jonathan Morgan The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
46:24 · 03/12/2017
Data Serialization Formats with Doug Cutting and Julien Le Dem - Episode 8

Data Serialization Formats with Doug Cutting and Julien Le Dem - Episode 8

Summary With the wealth of formats for sending and storing data it can be difficult to determine which one to use. In this episode Doug Cutting, creator of Avro, and Julien Le Dem, creator of Parquet, dig into the different classes of serialization formats, what their strengths are, and how to choose one for your workload. They also discuss the role of Arrow as a mechanism for in-memory data sharing and how hardware evolution will influence the state of the art for data formats. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers This is your host Tobias Macey and today I’m interviewing Julien Le Dem and Doug Cutting about data serialization formats and how to pick the right one for your systems. Interview Introduction How did you first get involved in the area of data management? What are the main serialization formats used for data storage and analysis? What are the tradeoffs that are offered by the different formats? How have the different storage and analysis tools influenced the types of storage formats that are available? You’ve each developed a new on-disk data format, Avro and Parquet respectively. What were your motivations for investing that time and effort? Why is it important for data engineers to carefully consider the format in which they transfer their data between systems? What are the switching costs involved in moving from one format to another after you have started using it in a production system? What are some of the new or upcoming formats that you are each excited about? How do you anticipate the evolving hardware, patterns, and tools for processing data to influence the types of storage formats that maintain or grow their popularity? Contact Information Doug: cutting on GitHub Blog @cutting on Twitter Julien Email @J_ on Twitter Blog julienledem on GitHub Links Apache Avro Apache Parquet Apache Arrow Hadoop Apache Pig Xerox Parc Excite Nutch Vertica Dremel White Paper Twitter Blog on Release of Parquet CSV XML Hive Impala Presto Spark SQL Brotli ZStandard Apache Drill Trevni Apache Calcite The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
51:43 · 22/11/2017
Buzzfeed Data Infrastructure with Walter Menendez - Episode 7

Buzzfeed Data Infrastructure with Walter Menendez - Episode 7

Summary Buzzfeed needs to be able to understand how its users are interacting with the myriad articles, videos, etc. that they are posting. This lets them produce new content that will continue to be well-received. To surface the insights that they need to grow their business they need a robust data infrastructure to reliably capture all of those interactions. Walter Menendez is a data engineer on their infrastructure team and in this episode he describes how they manage data ingestion from a wide array of sources and create an interface for their data scientists to produce valuable conclusions. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers Your host is Tobias Macey and today I’m interviewing Walter Menendez about the data engineering platform at Buzzfeed Interview Introduction How did you get involved in the area of data management? How is the data engineering team at Buzzfeed structured and what kinds of projects are you responsible for? What are some of the types of data inputs and outputs that you work with at Buzzfeed? Is the core of your system using a real-time streaming approach or is it primarily batch-oriented and what are the business needs that drive that decision? What does the architecture of your data platform look like and what are some of the most significant areas of technical debt? Which platforms and languages are most widely leveraged in your team and what are some of the outliers? What are some of the most significant challenges that you face, both technically and organizationally? What are some of the dead ends that you have run into or failed projects that you have tried? What has been the most successful project that you have completed and how do you measure that success? Contact Info @hackwalter on Twitter walterm on GitHub Links Data Literacy MIT Media Lab Tumblr Data Capital Data Infrastructure Google Analytics Datadog Python Numpy SciPy NLTK Go Language NSQ Tornado PySpark AWS EMR Redshift Tracking Pixel Google Cloud Don’t try to be google Stop Hiring DevOps Engineers and Start Growing Them The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
43:40 · 14/11/2017
Astronomer with Ry Walker - Episode 6

Astronomer with Ry Walker - Episode 6

Summary Building a data pipeline that is reliable and flexible is a difficult task, especially when you have a small team. Astronomer is a platform that lets you skip straight to processing your valuable business data. Ry Walker, the CEO of Astronomer, explains how the company got started, how the platform works, and their commitment to open source. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at www.dataengineeringpodcast.com/linode?utm_source=rss&utm_medium=rss and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers This is your host Tobias Macey and today I’m interviewing Ry Walker, CEO of Astronomer, the platform for data engineering. Interview Introduction How did you first get involved in the area of data management? What is Astronomer and how did it get started? Regulatory challenges of processing other people’s data What does your data pipelining architecture look like? What are the most challenging aspects of building a general purpose data management environment? What are some of the most significant sources of technical debt in your platform? Can you share some of the failures that you have encountered while architecting or building your platform and company and how you overcame them? There are certain areas of the overall data engineering workflow that are well defined and have numerous tools to choose from. What are some of the unsolved problems in data management? What are some of the most interesting or unexpected uses of your platform that you are aware of? Contact Information Email @rywalker on Twitter Links Astronomer Kiss Metrics Segment Marketing tools chart Clickstream HIPAA FERPA PCI Mesos Mesos DC/OS Airflow SSIS Marathon Prometheus Grafana Terraform Kafka Spark ELK Stack React GraphQL PostGreSQL MongoDB Ceph Druid Aries Vault Adapter Pattern Docker Kinesis API Gateway Kong AWS Lambda Flink Redshift NOAA Informatica SnapLogic Meteor The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
42:50 · 06/08/2017
Rebuilding Yelp's Data Pipeline with Justin Cunningham - Episode 5

Rebuilding Yelp's Data Pipeline with Justin Cunningham - Episode 5

Summary Yelp needs to be able to consume and process all of the user interactions that happen in their platform in as close to real-time as possible. To achieve that goal they embarked on a journey to refactor their monolithic architecture to be more modular and modern, and then they open sourced it! In this episode Justin Cunningham joins me to discuss the decisions they made and the lessons they learned in the process, including what worked, what didn’t, and what he would do differently if he was starting over today. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at www.dataengineeringpodcast.com/linode?utm_source=rss&utm_medium=rss and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers Your host is Tobias Macey and today I’m interviewing Justin Cunningham about Yelp’s data pipeline Interview with Justin Cunningham Introduction How did you get involved in the area of data engineering? Can you start by giving an overview of your pipeline and the type of workload that you are optimizing for? What are some of the dead ends that you experienced while designing and implementing your pipeline? As you were picking the components for your pipeline, how did you prioritize the build vs buy decisions and what are the pieces that you ended up building in-house? What are some of the failure modes that you have experienced in the various parts of your pipeline and how have you engineered around them? What are you using to automate deployment and maintenance of your various components and how do you monitor them for availability and accuracy? While you were re-architecting your monolithic application into a service oriented architecture and defining the flows of data, how were you able to make the switch while verifying that you were not introducing unintended mutations into the data being produced? Did you plan to open-source the work that you were doing from the start, or was that decision made after the project was completed? What were some of the challenges associated with making sure that it was properly structured to be amenable to making it public? What advice would you give to anyone who is starting a brand new project and how would that advice differ for someone who is trying to retrofit a data management architecture onto an existing project? Keep in touch Yelp Engineering Blog Email Links Kafka Redshift ETL Business Intelligence Change Data Capture LinkedIn Data Bus Apache Storm Apache Flink Confluent Apache Avro Game Days Chaos Monkey Simian Army PaaSta Apache Mesos Marathon SignalFX Sensu Thrift Protocol Buffers JSON Schema Debezium Kafka Connect Apache Beam The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SASupport Data Engineering Podcast
42:28 · 18/06/2017
ScyllaDB with Eyal Gutkind - Episode 4

Summary If you like the features of Cassandra but wish it ran faster with fewer resources, then ScyllaDB is the answer you have been looking for. In this episode Eyal Gutkind explains how Scylla was created and how it differentiates itself in the crowded database market. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers Your host is Tobias Macey and today I’m interviewing Eyal Gutkind about ScyllaDB Interview Introduction How did you get involved in the area of data management? What is ScyllaDB and why would someone choose to use it? How do you ensure sufficient reliability and accuracy of the database engine? The large draw of Scylla is that it is a drop-in replacement for Cassandra with faster performance and no requirement to manage the JVM. What are some of the technical and architectural design choices that have enabled you to do that? Deployment and tuning What challenges are introduced as a result of needing to maintain API compatibility with a different product? Do you have visibility or advance knowledge of what new interfaces are being added to the Apache Cassandra project, or are you forced to play a game of keep up? Are there any issues with compatibility of plugins for Cassandra running on Scylla? For someone who wants to deploy and tune Scylla, what are the steps involved? Is it possible to join a Scylla cluster to an existing Cassandra cluster for live data migration and a zero-downtime swap? What prompted the decision to form a company around the database? What are some other uses of Seastar? Keep in touch Eyal LinkedIn ScyllaDB Website @ScyllaDB on Twitter GitHub Mailing List Slack Links Seastar Project DataStax XFS TitanDB OpenTSDB KairosDB CQL Pedis The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA. Support Data Engineering Podcast
35:07 · 18/03/2017
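To make the "drop-in replacement" point concrete, here is a small sketch showing that code written against the standard Cassandra Python driver can be pointed at a Scylla node unchanged, because Scylla speaks the same CQL protocol. The contact point, keyspace, and table below are hypothetical.

```python
# Hypothetical sketch: the DataStax Cassandra driver talking to a Scylla node.
# Nothing here is Scylla-specific -- that is the point of CQL compatibility.
from uuid import uuid4

from cassandra.cluster import Cluster  # pip install cassandra-driver

cluster = Cluster(["127.0.0.1"])        # point this at a Scylla contact node
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.events (
        id uuid PRIMARY KEY,
        payload text
    )
""")

session.execute(
    "INSERT INTO demo.events (id, payload) VALUES (%s, %s)",
    (uuid4(), "hello from the CQL driver"),
)
for row in session.execute("SELECT id, payload FROM demo.events LIMIT 5"):
    print(row.id, row.payload)

cluster.shutdown()
```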
Defining Data Engineering with Maxime Beauchemin - Episode 3

Summary What exactly is data engineering? How has it evolved in recent years and where is it going? How do you get started in the field? In this episode, Maxime Beauchemin joins me to discuss these questions and more. Transcript provided by CastSource Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers Your host is Tobias Macey and today I’m interviewing Maxime Beauchemin Questions Introduction How did you get involved in the field of data engineering? How do you define data engineering and how has that changed in recent years? Do you think that the DevOps movement over the past few years has had any impact on the discipline of data engineering? If so, what kinds of cross-over have you seen? For someone who wants to get started in the field of data engineering what are some of the necessary skills? What do you see as the biggest challenges facing data engineers currently? At what scale does it become necessary to differentiate between someone who does data engineering vs data infrastructure and what are the differences in terms of skill set and problem domain? How much analytical knowledge is necessary for a typical data engineer? What are some of the most important considerations when establishing new data sources to ensure that the resulting information is of sufficient quality? You have commented on the fact that data engineering borrows a number of elements from software engineering. Where does the concept of unit testing fit in data management and what are some of the most effective patterns for implementing that practice? How has the work done by data engineers and managers of data infrastructure bled back into mainstream software and systems engineering in terms of tools and best practices? How do you see the role of data engineers evolving in the next few years? Keep In Touch @mistercrunch on Twitter mistercrunch on GitHub Medium Links Datadog Airflow The Rise of the Data Engineer Druid.io Luigi Apache Beam Samza Hive Data Modeling The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA. Support Data Engineering Podcast
45:21 · 05/03/2017
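One of the questions above asks where unit testing fits in data management. As a rough illustration (the table, column names, and rules are invented for the example), a data-quality check can be written as an ordinary pytest-style test over a pandas DataFrame:

```python
# Hypothetical sketch: asserting invariants on a dataset the same way you
# would assert behavior in application code.
from typing import List

import pandas as pd


def check_orders(df: pd.DataFrame) -> List[str]:
    """Return a list of data-quality violations for an orders table."""
    problems = []
    if df["order_id"].duplicated().any():
        problems.append("duplicate order_id values")
    if (df["amount"] < 0).any():
        problems.append("negative order amounts")
    if df["customer_id"].isna().any():
        problems.append("orders with no customer_id")
    return problems


def test_orders_are_clean():
    # In a real pipeline this fixture would be a sample or staging extract.
    df = pd.DataFrame(
        {
            "order_id": [1, 2, 3],
            "customer_id": [10, 11, 12],
            "amount": [9.99, 0.0, 25.50],
        }
    )
    assert check_orders(df) == []
```

In practice the checks codify the invariants that downstream reports depend on, and they run whenever the pipeline code or the source data changes.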
Dask with Matthew Rocklin - Episode 2

Summary There is a vast constellation of tools and platforms for processing and analyzing your data. In this episode Matthew Rocklin talks about how Dask fills the gap between a task-oriented workflow tool and an in-memory processing framework, and how it brings the power of Python to bear on the problem of big data. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers Your host is Tobias Macey and today I’m interviewing Matthew Rocklin about Dask and the Blaze ecosystem. Interview with Matthew Rocklin Introduction How did you get involved in the area of data engineering? Dask began its life as part of the Blaze project. Can you start by describing what Dask is and how it originated? There are a vast number of tools in the field of data analytics. What are some of the specific use cases that Dask was built for that the existing options couldn’t solve? One of the compelling features of Dask is the fact that it is a Python library that allows for distributed computation at a scale that has largely been the exclusive domain of tools in the Hadoop ecosystem. Why do you think that the JVM has been the reigning platform in the data analytics space for so long? Do you consider Dask, along with the larger Blaze ecosystem, to be a competitor to the Hadoop ecosystem, either now or in the future? Are you seeing many Hadoop or Spark solutions being migrated to Dask? If so, what are the common reasons? There is a strong focus on using Dask as a tool for interactive exploration of data. How does it compare to something like Apache Drill? For anyone looking to integrate Dask into an existing code base that is already using NumPy or Pandas, what does that process look like? How do the task graph capabilities compare to something like Airflow or Luigi? Looking through the documentation for the graph specification in Dask, it appears that there is the potential to introduce cycles or other bugs into a large or complex task chain. Is there any built-in tooling to check for that before submitting the graph for execution? What are some of the most interesting or unexpected projects that you have seen Dask used for? What do you perceive as being the most relevant aspects of Dask for data engineering/data infrastructure practitioners, as compared to the end users of the systems that they support? What are some of the most significant problems that you have been faced with, and which still need to be overcome in the Dask project? I know that the work on Dask is largely performed under the umbrella of PyData and sponsored by Continuum Analytics. What are your thoughts on the financial landscape for open source data analytics and distributed computation frameworks as compared to the broader world of open source projects?
Keep in touch @mrocklin on Twitter mrocklin on GitHub Links http://matthewrocklin.com/blog/work/2016/09/22/cluster-deployments https://opendatascience.com/blog/dask-for-institutions/ Continuum Analytics 2sigma X-Array Tornado Website Podcast Interview Airflow Luigi Mesos Kubernetes Spark Dryad Yarn Read The Docs XData The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA. Support Data Engineering Podcast
46:01 · 22/01/2017
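For readers new to Dask, here is a minimal sketch of the model discussed in the episode: a pandas-like API that builds a lazy task graph and only executes it when results are requested, which is what lets it scale past memory and across machines. The file path and column names are hypothetical.

```python
# Hypothetical sketch: lazy, partitioned dataframes with a pandas-like API.
import dask.dataframe as dd  # pip install "dask[dataframe]"

# Each matching CSV becomes one or more partitions; nothing is read yet.
df = dd.read_csv("logs/2017-*.csv")

# Operations build up a task graph instead of executing immediately.
daily_errors = (
    df[df["status"] >= 500]
    .groupby("date")["status"]
    .count()
)

# .compute() hands the graph to a scheduler (local threads, processes,
# or a distributed cluster) and returns an ordinary pandas result.
print(daily_errors.compute().head())
```

Calling `.compute()` is the point where the graph is actually executed, which is also how Dask ties together the "workflow tool" and "in-memory framework" roles mentioned in the summary.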
Pachyderm with Daniel Whitenack - Episode 1

Summary Do you wish that you could track the changes in your data the same way that you track the changes in your code? Pachyderm is a platform for building a data lake with a versioned file system. It also lets you use whatever languages you want to run your analysis with its container-based task graph. This week Daniel Whitenack shares the story of how the project got started, how it works under the covers, and how you can get started using it today! Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers Your host is Tobias Macey and today I’m interviewing Daniel Whitenack about Pachyderm, a modern container-based system for building and analyzing a versioned data lake. Interview with Daniel Whitenack Introduction How did you get started in the data engineering space? What is Pachyderm and what problem were you trying to solve when the project was started? Where does the name come from? What are some of the competing projects in the space and what features does Pachyderm offer that would convince someone to choose it over the other options? Because the analysis code and the data that it acts on are all versioned together, it is possible to track the provenance of the end result. Why is this such an important capability in the context of data engineering and analytics? What does Pachyderm use for the distribution and scaling mechanism of the file system? Given that you can version your data and track all of the modifications made to it in a manner that allows for traversal of those changesets, how much additional storage is necessary over and above the original capacity needed for the raw data? For a typical use of Pachyderm, would someone keep all of the revisions in perpetuity or are the changesets primarily just useful in the context of an analysis workflow? Given that the state of the data is calculated by applying the diffs in sequence, what impact does that have on processing speed and what are some of the ways of mitigating that? Another compelling feature of Pachyderm is the fact that it natively supports the use of any language for interacting with your data. Why is this such an important capability and why is it more difficult with alternative solutions? How did you implement this feature so that it would be maintainable and easy to implement for end users? Given that the intent of using containers is for encapsulating the analysis code from experimentation through to production, it seems that there is the potential for the implementations to run into problems as they scale. What are some things that users should be aware of to help mitigate this? The data pipeline and dependency graph tooling is a useful addition to the combination of file system and processing interface. Does that preclude any requirement for external tools such as Luigi or Airflow? I see that the docs mention using the map-reduce pattern for analyzing the data in Pachyderm. Does it support other approaches such as streaming or tools like Apache Drill? What are some of the most interesting deployments and uses of Pachyderm that you have seen?
What are some of the areas where you are looking for help from the community, and are there any particular issues that listeners can check out to get started with the project? Keep in touch Daniel Twitter – @dwhitena Pachyderm Website Free Weekend Project GopherNotes Links AirBnB RethinkDB Flocker Infinite Project Git LFS Luigi Airflow Kafka Kubernetes Rkt SciKit Learn Docker Minikube General Fusion The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA. Support Data Engineering Podcast
44:42 · 14/01/2017
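To give a feel for the model described in the episode, below is an illustrative pipeline specification built as a Python dict: data lives in versioned repos, and a pipeline is a container image plus a command that runs whenever new data is committed to its input. The repo name, image, paths, and exact field names are assumptions, and the spec schema has changed across Pachyderm versions, so treat this as a shape rather than a working config.

```python
# Hypothetical sketch of a Pachyderm pipeline spec. Field names approximate
# the documented JSON spec and vary by version -- illustration only.
import json

pipeline_spec = {
    "pipeline": {"name": "word-count"},
    "transform": {
        # Any language works because the unit of execution is a container.
        "image": "python:3.6",
        "cmd": ["python3", "/code/count_words.py"],
    },
    "input": {
        # Run the transform over each top-level path in the 'articles' repo.
        "pfs": {"repo": "articles", "glob": "/*"},
    },
}

with open("pipeline.json", "w") as f:
    json.dump(pipeline_spec, f, indent=2)

# The resulting file would then be submitted with the pachctl CLI,
# e.g. something along the lines of: pachctl create pipeline -f pipeline.json
```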
Introducing The Show

Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, share it on social media, and tell your friends and co-workers. I’m your host, Tobias Macey, and today I’m speaking with Maxime Beauchemin about what it means to be a data engineer. Interview Who am I Systems administrator and software engineer, now DevOps, focus on automation Host of Podcast.__init__ How did I get involved in data management Why am I starting a podcast about Data Engineering Interesting area with a lot of activity Not currently any shows focused on data engineering What kinds of topics do I want to cover Data stores Pipelines Tooling Automation Monitoring Testing Best practices Common challenges Defining the role/job hunting Relationship with data engineers/data analysts Get in touch and subscribe Website Newsletter Twitter Email The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
04:24 · 08/01/2017