Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale



Author: White, Tom

Publisher: O'Reilly Media

Publication date: 2015-04

Edition: 4

ISBN: 9781491901632

Binding: Other

Pages: 756

Category: Foreign-Language Used and Rare Books > English Books > Computers and Internet


Get ready to unlock the power of your data. With the fourth edition of this comprehensive guide, you'll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.

Using Hadoop 2 exclusively, author Tom White presents new chapters on YARN and several Hadoop-related projects such as Parquet, Flume, Crunch, and Spark. You'll learn about recent changes to Hadoop, and explore new case studies on Hadoop's role in healthcare systems and genomics data processing.

Learn fundamental components such as MapReduce, HDFS, and YARN
Explore MapReduce in depth, including steps for developing applications with it
Set up and maintain a Hadoop cluster running HDFS and MapReduce on YARN
Learn two data formats: Avro for data serialization and Parquet for nested data
Use data ingestion tools such as Flume (for streaming data) and Sqoop (for bulk data transfer)
Understand how high-level data processing tools like Pig, Hive, Crunch, and Spark work with Hadoop
Learn the HBase distributed database and the ZooKeeper distributed configuration service

Counsels programmers and administrators for big and small organizations on how to work with large-scale application datasets using Apache Hadoop, discussing its capacity for storing and processing large amounts of data while demonstrating best practices for building reliable and scalable distributed systems. Original.

        Hadoop Fundamentals

        Chapter 1. Meet Hadoop

        Data!

        Data Storage and Analysis

        Querying All Your Data

        Beyond Batch

        Comparison with Other Systems

        A Brief History of Apache Hadoop

        What’s in This Book?

        Chapter 2. MapReduce

        A Weather Dataset

        Analyzing the Data with Unix Tools

        Analyzing the Data with Hadoop

        Scaling Out

        Hadoop Streaming

        Chapter 3. The Hadoop Distributed Filesystem

        The Design of HDFS

        HDFS Concepts

        The Command-Line Interface

        Hadoop Filesystems

        The Java Interface

        Data Flow

        Parallel Copying with distcp

        Chapter 4. YARN

        Anatomy of a YARN Application Run

        YARN Compared to MapReduce 1

        Scheduling in YARN

        Further Reading

        Chapter 5. Hadoop I/O

        Data Integrity

        Compression

        Serialization

        File-Based Data Structures

        MapReduce

        Chapter 1. Developing a MapReduce Application

        The Configuration API

        Setting Up the Development Environment

        Writing a Unit Test with MRUnit

        Running Locally on Test Data

        Running on a Cluster

        Tuning a Job

        MapReduce Workflows

        Chapter 2. How MapReduce Works

        Anatomy of a MapReduce Job Run

        Failures

        Shuffle and Sort

        Task Execution

        Chapter 3. MapReduce Types and Formats

        MapReduce Types

        Input Formats

        Output Formats

        Chapter 4. MapReduce Features

        Counters

        Sorting

        Joins

        Side Data Distribution

        MapReduce Library Classes

        Hadoop Operations

        Chapter 1. Setting Up a Hadoop Cluster

        Cluster Specification

        Cluster Setup and Installation

        Hadoop Configuration

        Security

        Benchmarking a Hadoop Cluster

        Chapter 2. Administering Hadoop

        HDFS

        Monitoring

        Maintenance

        Related Projects

        Chapter 1. Avro

        Avro Data Types and Schemas

        In-Memory Serialization and Deserialization

        Avro Datafiles

        Interoperability

        Schema Resolution

        Sort Order

        Avro MapReduce

        Sorting Using Avro MapReduce

        Avro in Other Languages

        Chapter 2. Parquet

        Data Model

        Parquet File Format

        Parquet Configuration

        Writing and Reading Parquet Files

        Parquet MapReduce

        Chapter 3. Flume

        Installing Flume

        An Example

        Transactions and Reliability

        The HDFS Sink

        Fan Out

        Distribution: Agent Tiers

        Sink Groups

        Integrating Flume with Applications

        Component Catalog

        Further Reading

        Chapter 4. Sqoop

        Getting Sqoop

        Sqoop Connectors

        A Sample Import

        Generated Code

        Imports: A Deeper Look

        Working with Imported Data

        Importing Large Objects

        Performing an Export

        Exports: A Deeper Look

        Further Reading

        Chapter 5. Pig

        Installing and Running Pig

        An Example

        Comparison with Databases

        Pig Latin

        User-Defined Functions

        Data Processing Operators

        Pig in Practice

        Further Reading

        Chapter 6. Hive

        Installing Hive

        An Example

        Running Hive

        Comparison with Traditional Databases

        HiveQL

        Tables

        Querying Data

        User-Defined Functions

        Further Reading

        Chapter 7. Crunch

        An Example

        The Core Crunch API

        Pipeline Execution

        Crunch Libraries

        Further Reading

        Chapter 8. Spark

        Installing Spark

        An Example

        Resilient Distributed Datasets

        Shared Variables

        Anatomy of a Spark Job Run

        Executors and Cluster Managers

        Further Reading

        Chapter 9. HBase

        HBasics

        Concepts

        Installation

        Clients

        Building an Online Query Application

        HBase Versus RDBMS

        Praxis

        Further Reading

        Chapter 10. ZooKeeper

        Installing and Running ZooKeeper

        An Example

        The ZooKeeper Service

        Building Applications with ZooKeeper

        ZooKeeper in Production

        Further Reading

        Case Studies

        Chapter 1. Composable Data at Cerner

        From CPUs to Semantic Integration

        Enter Apache Crunch

        Building a Complete Picture

        Integrating Healthcare Data

        Composability over Frameworks

        Moving Forward

        Chapter 2. Biological Data Science: Saving Lives with Software

        The Structure of DNA

        The Genetic Code: Turning DNA Letters into Proteins

        Thinking of DNA as Source Code

        The Human Genome Project and Reference Genomes

        Sequencing and Aligning DNA

        ADAM, A Scalable Genome Analysis Platform

        From Personalized Ads to Personalized Medicine

        Join In

        Chapter 3. Cascading

        Fields, Tuples, and Pipes

        Operations

        Taps, Schemes, and Flows

        Cascading in Practice

        Flexibility

        Hadoop and Cascading at ShareThis

        Summary

        Appendix Installing Apache Hadoop

        Prerequisites

        Installation

        Configuration

        Appendix Cloudera’s Distribution Including Apache Hadoop

        Appendix Preparing the NCDC Weather Data

        Appendix The Old and New Java MapReduce APIs


