Cost-Effective Data Pipelines: Balancing Trade-Offs When Developing Pipelines in the Cloud
- Author: Sev Leonard
- ISBN-10: 1492098647
- ISBN-13: 978-1492098645
- Edition: 1st
- Publisher: O'Reilly Media
- Publication date: August 22, 2023
- Language: English
- Dimensions: 7 x 0.6 x 9.19 inches
- Print length: 286 pages
From the brand
Sharing the knowledge of experts
O'Reilly's mission is to change the world by sharing the knowledge of innovators. For over 40 years, we've inspired companies and individuals to do new things (and do them better) by providing the skills and understanding that are necessary for success.
Our customers are hungry to build the innovations that propel the world forward. And we help them do just that.
From the Publisher
Who This Book Is For
I’ve geared the content toward an intermediate to advanced audience. I assume you have some familiarity with software development best practices, some basics about working with cloud compute and storage, and a general idea about how batch and streaming data pipelines operate.
This book is written from my experience in the day-to-day development of data pipelines. If this is work you either do already or aspire to do in the future, you can consider this book a virtual mentor, advising you of common pitfalls and providing guidance honed from working on a variety of data pipeline projects.
If you’re coming from a data analysis background, you’ll find advice on software best practices to help you build testable, extendable pipelines. This will aid you in connecting analysis with data acquisition and storage to create end-to-end systems.
Developer velocity and cost-conscious design are areas everyone from individual contributors to managers should have on their mind. In this book, you’ll find advice on how to build quality into the development process, make efficient use of cloud resources, and reduce costs. Additionally, you’ll see the elements that go into monitoring to not only keep tabs on system health and performance but also gain insight into where redesign should be considered.
If you manage data engineering teams, you’ll find helpful tips on effective development practices, areas where costs can escalate, and an overall approach to putting the right practices in place to help your team succeed.
What You Will Learn
If you would like to learn how to do the following, or to improve your skills in these areas, this book will be a useful guide:
- Reduce cloud spend with lower-cost cloud service offerings and smart design strategies.
- Minimize waste without sacrificing performance by right-sizing compute resources.
- Drive pipeline evolution, head off performance issues, and quickly debug with cost-effective monitoring and logging.
- Set up development and test environments that minimize cloud service costs.
- Create data pipeline codebases that are testable and extensible, reducing development time and accelerating pipeline evolution.
- Limit costly data downtime by improving data quality and pipeline operation through validation and testing.
What This Book Is Not
This is not an architecture book. There are aspects that tie back into architecture and system requirements, but I will not be discussing different architectural approaches or trade-offs. I do not cover topics such as data governance, data cataloging, or data lineage.
While I provide advice on how to manage the innate cost–performance trade-offs of building data pipelines in the cloud, this book is not a financial operations (FinOps) text. Where a FinOps book would, for example, direct you to look for unused compute instance hours as potential opportunities to reduce costs, this book gets into the nitty-gritty details of reducing instance hours and associated costs.
The design space of data pipelines is constantly growing and changing. The biggest value I can provide is to describe design techniques that can be applied in a variety of circumstances as the field evolves. Where relevant, I mention some specific, fully managed data ingestion services such as Amazon Web Services (AWS) Glue or Google Dataflow, but the focus of this book is on classes of services that apply across many vendors. Understanding these foundational services will help you get the most out of vendor-managed services.
The cloud service offerings I focus on include object storage such as AWS S3 and Google Cloud Storage (GCS), serverless functions such as AWS Lambda, and cluster compute services such as AWS Elastic Compute Cloud (EC2), AWS Elastic MapReduce (EMR), and Kubernetes. Although system boundaries, identity management, and security are part of working with these services, I do not cover them in this book.
I do not provide advice about database services in this book, as the choice of databases and configurations is highly dependent on specific use cases.
You will learn what you need to log and monitor, but I will not cover the details on how to set up monitoring, as tools used for monitoring vary from company to company.