Most ebook files are in PDF format, so you can easily read them using various software such as Foxit Reader, or directly in the Google Chrome browser.
Some ebook files are released by publishers in other formats such as .azw, .mobi, .epub, and .fb2. You may need to install specific software, such as Calibre, to read these formats on mobile or PC.
Please read the tutorial at this link: https://ebookbell.com/faq
We offer FREE conversion to the popular format you request; however, this may take some time. Therefore, please email us right after payment, and we will provide the service as quickly as possible.
For exceptional file formats or broken links (if any), please do not open a dispute. Instead, email us first, and we will try to assist within a maximum of 6 hours.
EbookBell Team
4.4 (32 reviews)
ISBN 10: 1801073236
ISBN 13: 9781801073233
Author: Will Girten
Get up to speed with the Databricks Data Intelligence Platform to build and scale modern data applications, leveraging the latest advancements in data engineering
Key Features
Learn how to work with real-time data using Delta Live Tables
Unlock insights into the performance of data pipelines using Delta Live Tables
Apply your knowledge to Unity Catalog for robust data security and governance
Purchase of the print or Kindle book includes a free PDF eBook
Book Description
With so many tools to choose from in today’s data engineering development stack, as well as growing operational complexity, data engineers are often overwhelmed, spending more time maintaining complex data pipelines than gleaning value from their data. Guided by a lead specialist solutions architect at Databricks with 10+ years of experience in data and AI, this book shows you how the Delta Live Tables framework simplifies data pipeline development by letting you focus on defining input data sources, transformation logic, and output table destinations.

The book gives you an overview of the Delta Lake format, the Databricks Data Intelligence Platform, and the Delta Live Tables framework. It teaches you how to apply data transformations by implementing the Databricks medallion architecture and how to continuously monitor the data quality of your pipelines. You’ll learn to handle incoming data using the Databricks Auto Loader feature, automate real-time data processing using Databricks workflows, and recover automatically from runtime errors.

By the end of this book, you’ll be able to build a real-time data pipeline from scratch using Delta Live Tables, leverage CI/CD tools to deploy data pipeline changes automatically across deployment environments, and monitor, control, and optimize cloud costs.
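For readers who want a feel for the framework before starting, the following is a minimal, hypothetical Delta Live Tables sketch (not taken from the book): a bronze table ingested incrementally with Auto Loader and a silver table that cleans it. The landing path, table names, and column names are illustrative assumptions, and the code is meant to run inside a DLT pipeline notebook, where the spark session is provided.

import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Bronze: raw orders ingested incrementally with Auto Loader")
def orders_bronze():
    return (
        spark.readStream.format("cloudFiles")      # Auto Loader source
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/raw/orders/")         # hypothetical landing path
    )

@dlt.table(comment="Silver: orders with basic typing and null filtering applied")
def orders_silver():
    return (
        dlt.read_stream("orders_bronze")
        .withColumn("order_ts", F.to_timestamp("order_ts"))
        .filter(F.col("order_id").isNotNull())
    )

Run as a DLT pipeline, these two definitions create and keep both tables up to date; the framework handles checkpointing, retries, and the dependency ordering between them.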
What you will learn
Deploy near-real-time data pipelines in Databricks using Delta Live Tables
Orchestrate data pipelines using Databricks workflows
Implement data validation policies and monitor/quarantine bad data
Apply slowly changing dimension (SCD) Type 1 and Type 2 changes to lakehouse tables (see the sketch after this list)
Secure data access across different groups and users using Unity Catalog
Automate continuous data pipeline deployment by integrating Git with build tools such as Terraform and Databricks Asset Bundles
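As a taste of the SCD item above, here is a short, hypothetical sketch of applying SCD Type 2 changes with the DLT Python API; the source table, key, and sequencing column are illustrative assumptions, and the book's own examples may differ.

import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Change feed of customer records landed from the source system")
@dlt.expect_or_drop("valid_key", "customer_id IS NOT NULL")   # drop rows that fail validation
def customers_cdc():
    return spark.readStream.table("main.raw.customers_cdc")   # hypothetical source table

# Target streaming table that will hold the full SCD Type 2 history.
dlt.create_streaming_table("customers_scd2")

dlt.apply_changes(
    target="customers_scd2",
    source="customers_cdc",
    keys=["customer_id"],
    sequence_by=F.col("updated_at"),
    stored_as_scd_type=2,   # 2 keeps history rows; 1 would overwrite in place
)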
Who this book is for
This book is for data engineers looking to streamline data ingestion, transformation, and orchestration tasks. Data analysts responsible for managing and processing lakehouse data for analysis, reporting, and visualization will also find this book beneficial. Additionally, DataOps/DevOps engineers will find this book helpful for automating the testing and deployment of data pipelines, optimizing table tasks, and tracking data lineage within the lakehouse. Beginner-level knowledge of Apache Spark and Python is needed to make the most out of this book.
Part 1: Near-Real-Time Data Pipelines for the Lakehouse
Chapter 1: An Introduction to Delta Live Tables
Technical requirements
The emergence of the lakehouse
The Lambda architectural pattern
Introducing the medallion architecture
The Databricks lakehouse
The maintenance predicament of a streaming application
What is the DLT framework?
How is DLT related to Delta Lake?
Introducing DLT concepts
Streaming tables
Materialized views
Views
Pipeline
Pipeline triggers
Workflow
Types of Databricks compute
Databricks Runtime
Unity Catalog
A quick Delta Lake primer
The architecture of a Delta table
The contents of a transaction commit
Supporting concurrent table reads and writes
Tombstoned data files
Calculating Delta table state
Time travel
Tracking table changes using change data feed
A hands-on example – creating your first Delta Live Tables pipeline
Summary
Chapter 2: Applying Data Transformations Using Delta Live Tables
Technical requirements
Ingesting data from input sources
Ingesting data using Databricks Auto Loader
Scalability challenge in structured streaming
Using Auto Loader with DLT
Applying changes to downstream tables
APPLY CHANGES command
The DLT reconciliation process
Publishing datasets to Unity Catalog
Why store datasets in Unity Catalog?
Creating a new catalog
Assigning catalog permissions
Data pipeline settings
The DLT product edition
Pipeline execution mode
Databricks Runtime
Pipeline cluster types
Serverless compute versus traditional compute
Loading external dependencies
Data pipeline processing modes
Hands-on exercise – applying SCD Type 2 changes
Summary
Chapter 3: Managing Data Quality Using Delta Live Tables
Technical requirements
Defining data constraints in Delta Lake
Using temporary datasets to validate data processing
An introduction to expectations
Expectation composition
Hands-on exercise – writing your first data quality expectation
Acting on failed expectations
Hands-on example – failing a pipeline run due to poor data quality
Applying multiple data quality expectations
Decoupling expectations from a DLT pipeline
Hands-on exercise – quarantining bad data for correction
Summary
Chapter 4: Scaling DLT Pipelines
Technical requirements
Scaling compute to handle demand
Hands-on example – setting autoscaling properties using the Databricks REST API
Automated table maintenance tasks
Why auto compaction is important
Vacuuming obsolete table files
Moving compute closer to the data
Optimizing table layouts for faster table updates
Rewriting table files during updates
Data skipping using table partitioning
Delta Lake Z-ordering on MERGE columns
Improving write performance using deletion vectors
Serverless DLT pipelines
Introducing Enzyme, a performance optimization layer
Summary
Part 2: Securing the Lakehouse Using the Unity Catalog
Chapter 5: Mastering Data Governance in the Lakehouse with Unity Catalog
Technical requirements
Understanding data governance in a lakehouse
Introducing the Databricks Unity Catalog
A problem worth solving
An overview of the Unity Catalog architecture
Unity Catalog-enabled cluster types
Unity Catalog object model
Enabling Unity Catalog on an existing Databricks workspace
Identity federation in Unity Catalog
Data discovery and cataloging
Tracking dataset relationships using lineage
Observability with system tables
Tracing the lineage of other assets
Fine-grained data access
Hands-on example – applying data masking to healthcare datasets
Summary
Chapter 6: Managing Data Locations in Unity Catalog
Technical requirements
Creating and managing data catalogs in Unity Catalog
Managed data versus external data
Saving data to storage volumes in Unity Catalog
Setting default locations for data within Unity Catalog
Isolating catalogs to specific workspaces
Creating and managing external storage locations in Unity Catalog
Storing cloud service authentication using storage credentials
Querying external systems using Lakehouse Federation
Hands-on lab – extracting document text for a generative AI pipeline
Generating mock documents
Defining helper functions
Choosing a file format randomly
Creating/assembling the DLT pipeline
Summary
Chapter 7: Viewing Data Lineage Using Unity Catalog
Technical requirements
Introducing data lineage in Unity Catalog
Tracing data origins using the Data Lineage REST API
Visualizing upstream and downstream transformations
Identifying dependencies and impacts
Hands-on lab – documenting data lineage across an organization
Summary
Part 3: Continuous Integration, Continuous Deployment, and Continuous Monitoring
Chapter 8: Deploying, Maintaining, and Administrating DLT Pipelines Using Terraform
Technical requirements
Introducing the Databricks provider for Terraform
Setting up a local Terraform environment
Importing the Databricks Terraform provider
Configuring workspace authentication
Defining a DLT pipeline source notebook
Applying workspace changes
Configuring DLT pipelines using Terraform
name
notification
channel
development
continuous
edition
photon
configuration
library
cluster
catalog
target
storage
Automating DLT pipeline deployment
Hands-on exercise – deploying a DLT pipeline using VS Code
Setting up VS Code
Creating a new Terraform project
Defining the Terraform resources
Deploying the Terraform project
Summary
Chapter 9: Leveraging Databricks Asset Bundles to Streamline Data Pipeline Deployment
Technical requirements
Introduction to Databricks Asset Bundles
Elements of a DAB configuration file
Specifying a deployment mode
Databricks Asset Bundles in action
User-to-machine authentication
Machine-to-machine authentication
Initializing an asset bundle using templates
Hands-on exercise – deploying your first DAB
Hands-on exercise – simplifying cross-team collaboration with GitHub Actions
Setting up the environment
Configuring the GitHub Action
Testing the workflow
Versioning and maintenance
Summary
Chapter 10: Monitoring Data Pipelines in Production
Technical requirements
Introduction to data pipeline monitoring
Exploring ways to monitor data pipelines
Using DBSQL alerts to notify about data validity
Pipeline health and performance monitoring
Hands-on exercise – querying data quality events for a dataset
Data quality monitoring
Introducing Lakehouse Monitoring
Hands-on exercise – creating a lakehouse monitor
Best practices for production failure resolution
Handling pipeline update failures
Recovering from table transaction failure
Hands-on exercise – setting up a webhook alert when a job runs longer than expected
Summary
Index
Why subscribe?
Other Books You May Enjoy
Packt is searching for authors like you
Share Your Thoughts
Download a free PDF copy of this book