ISBN 13: 9781835461228
Author: Matthew Topol
Harness the power of Apache Arrow to optimize tabular data processing and develop robust, high-performance data systems with its standardized, language-independent columnar memory format
Key Features
Explore Apache Arrow's data types and integration with pandas, Polars, and Parquet
Work with Arrow libraries such as Flight SQL, Acero compute engine, and Dataset APIs for tabular data
Enhance and accelerate machine learning data pipelines using Apache Arrow and its subprojects
Purchase of the print or Kindle book includes a free PDF eBook
Book Description
Apache Arrow is an open source, columnar, in-memory data format designed for efficient data processing and analytics. Drawing on the author's 15 years of experience, this book shows you a standardized way to work with tabular data across programming languages and environments, enabling high-performance data processing and exchange.

This updated second edition gives you an overview of the Arrow format, highlighting its versatility and benefits through real-world use cases. It guides you through enhancing data science workflows, optimizing performance with Apache Parquet and Spark, and ensuring seamless data translation. You'll explore data interchange and storage formats, and Arrow's relationships with Parquet, Protocol Buffers, FlatBuffers, JSON, and CSV. You'll also discover Apache Arrow subprojects, including Flight, Flight SQL, Arrow Database Connectivity (ADBC), and nanoarrow. You'll learn to streamline machine learning workflows, use the Arrow Dataset APIs, and integrate with popular analytical data systems such as Snowflake, Dremio, and DuckDB.

The later chapters present real-world examples and case studies of products powered by Apache Arrow, offering practical insights into its applications. By the end of this book, you'll have all the building blocks to create efficient and powerful analytical services and utilities with Apache Arrow.
What you will learn
Use Apache Arrow libraries to access data files, both locally and in the cloud
Understand the zero-copy elements of the Apache Arrow format
Improve the read performance of data pipelines by memory-mapping Arrow files
Produce and consume Apache Arrow data efficiently by sharing memory with the C API
Leverage the Arrow compute engine, Acero, to perform complex operations
Create Arrow Flight servers and clients for transferring data quickly
Build the Arrow libraries locally and contribute to the community
Who this book is for
This book is for developers, data engineers, and data scientists looking to explore the capabilities of Apache Arrow from the ground up. Whether you're building utilities for data analytics and query engines or full pipelines for tabular data, this book can help regardless of your preferred programming language. A basic understanding of data analysis concepts is helpful, but not necessary. Code examples are provided in C++, Python, and Go throughout the book.
Part 1: Overview of What Arrow Is, Its Capabilities, Benefits, and Goals
Chapter 1: Getting Started with Apache Arrow
Technical requirements
Understanding the Arrow format and specifications
Why does Arrow use a columnar in-memory format?
Learning the terminology and physical memory layout
Quick summary of physical layouts, or TL;DR
How to speak Arrow
Arrow format versioning and stability
Would you download a library? Of course!
Setting up your shooting range
Using PyArrow for Python
C++ for the 1337 coders
Go, Arrow, go!
Summary
References
Chapter 2: Working with Key Arrow Specifications
Technical requirements
Playing with data, wherever it might be!
Working with Arrow tables
Accessing data files with PyArrow
Accessing data files with Arrow in C++
Bears firing arrows
Putting pandas in your quiver
Making pandas run fast
Keeping pandas from running wild
Polar bears use Rust-y arrows
Sharing is caring… especially when it’s your memory
Diving into memory management
Managing buffers for performance
Crossing boundaries
Summary
Chapter 3: Format and Memory Handling
Technical requirements
Storage versus runtime in-memory versus message-passing formats
Long-term storage formats
In-memory runtime formats
Message-passing formats
Summing up
Passing your Arrows around
What is this sorcery?!
Producing and consuming Arrows
Learning about memory cartography
The base case
Parquet versus CSV
Mapping data into memory
Too long; didn’t read (TL;DR) – computers are magic
Leaving the CPU – using device memory
Starting with a few pointers
Device-agnostic buffer handling
Summary
Part 2: Interoperability with Arrow: The Power of Open Standards
Chapter 4: Crossing the Language Barrier with the Arrow C Data API
Technical requirements
Using the Arrow C data interface
The ArrowSchema structure
The ArrowArray structure
Example use cases
Using the C data API to export Arrow-formatted data
Importing Arrow data with Python
Exporting Arrow data with the C Data API from Python to Go
Streaming Arrow data between Python and Go
What about non-CPU device data?
The ArrowDeviceArray struct
Using ArrowDeviceArray
Other use cases
Some exercises
Summary
Chapter 5: Acero: A Streaming Arrow Execution Engine
Technical requirements
Letting Acero do the work for you
Input shaping
Value casting
Types of functions in Acero
Invoking functions
Using the C++ compute library
Using the compute library in Python
Picking the right tools
Adding a constant value to an array
Compute Add function
A simple for loop
Using std::for_each and reserve space
Divide and conquer
Always have a plan
Where does Acero fit?
Acero’s core concepts
Let’s get streaming!
Simplifying complexity
Summary
Chapter 6: Using the Arrow Datasets API
Technical requirements
Querying multifile datasets
Creating a sample dataset
Discovering dataset fragments
Filtering data programmatically
Expressing yourself – a quick detour
Using expressions for filtering data
Deriving and renaming columns (projecting)
Using the Datasets API in Python
Creating our sample dataset
Discovering the dataset
Using different file formats
Filtering and projecting columns with Python
Streaming results
Working with partitioned datasets
Writing partitioned data
Connecting everything together
Summary
Chapter 7: Exploring Apache Arrow Flight RPC
Technical requirements
The basics and complications of gRPC
Building modern APIs for data
Efficiency and streaming are important
Arrow Flight’s building blocks
Horizontal scalability with Arrow Flight
Adding your business logic to Flight
Other bells and whistles
Understanding the Flight Protobuf definitions
Using Flight, choose your language!
Building a Python Flight server
Building a Go Flight server
What is Flight SQL?
Setting up a performance test
Everyone gets a containerized development environment!
Running the performance test
Flight SQL, the new kid on the block
Summary
Chapter 8: Understanding Arrow Database Connectivity (ADBC)
Technical requirements
ODBC takes an Arrow to the knee
Lost in translation
Arrow adoption in ODBC drivers
The benefits of standards around connectivity
The ADBC specification
ADBC databases
ADBC connections
ADBC statements
ADBC error handling
Using ADBC for performance and adaptability
ADBC with C/C++
Using ADBC with Python
Using ADBC with Go
Summary
Chapter 9: Using Arrow with Machine Learning Workflows
Technical requirements
SPARKing new ideas on Jupyter
Understanding the integration of Arrow in Spark
Containerization makes life easier
SPARKing joy with Arrow and PySpark
Facehuggers implanting data
Setting up your environment
Proving the benefits by checking resource usage
Using Arrow with the standard tools for ML
More GPU, more speed!
Summary
Part 3: Real-World Examples, Use Cases, and Future Development
Chapter 10: Powered by Apache Arrow
Swimming in data with Dremio Sonar
Clarifying Dremio Sonar’s architecture
The library of the gods…of data analysis
Spicing up your data workflows
Arrow in the browser using JavaScript
Gaining a little perspective
Taking flight with Falcon
An Influx of connectivity
Summary
Chapter 11: How to Leave Your Mark on Arrow
Technical requirements
Contributing to open source projects
Communication is key
You don’t necessarily have to contribute code
There are a lot of reasons why you should contribute!
Preparing your first pull request
Creating and navigating GitHub issues
Setting up Git
Orienting yourself in the code base
Building the Arrow libraries
Creating the pull request
Understanding Archery and the CI configuration
Find your interest and expand on it
Getting that sweet, sweet approval
Finishing up with style!
C++ code styling
Python code styling
Go code styling
Summary
Chapter 12: Future Development and Plans
Globetrotting with data – GeoArrow and GeoParquet
Collaboration breeds success
Expanding ADBC adoption
Final words
Index
Why subscribe?
Other Books You May Enjoy
Packt is searching for authors like you
Share Your Thoughts