Cassandra High Performance Cookbook
Over 150 recipes to design and optimize large scale Apache Cassandra deployments
- Cover
- Copyright
- Credits
- About the Author
- About the Reviewers
- www.PacktPub.com
- Table of Contents
- Preface
Chapter 1: Getting Started
- Introduction
- A simple single node Cassandra installation
- Reading and writing test data using the command-line interface
- Running multiple instances on a single machine
- Scripting a multiple instance installation
- Setting up a build and test environment for tasks in this book
- Running in the foreground with full debugging
- Calculating ideal Initial Tokens for use with Random Partitioner
- Choosing Initial Tokens for use with Partitioners that preserve ordering
- Insight into Cassandra with JConsole
- Connecting with JConsole over a SOCKS proxy
- Connecting to Cassandra with Java and Thrift
Chapter 2: The Command-line Interface
- Connecting to Cassandra with the CLI
- Creating a keyspace from the CLI
- Creating a column family with the CLI
- Describing a keyspace
- Writing data with the CLI
- Reading data with the CLI
- Deleting rows and columns from the CLI
- Listing and paginating all rows in a column family
- Dropping a keyspace or a column family
- CLI operations with super columns
- Using the assume keyword to decode column names or column values
- Supplying time to live information when inserting columns
- Using built-in CLI functions
- Using column metadata and comparators for type enforcement
- Changing the consistency level of the CLI
- Getting help from the CLI
- Loading CLI statements from a file
Chapter 3: Application Programmer Interface
- Introduction
- Connecting to a Cassandra server
- Creating a keyspace and column family from the client
- Using MultiGet to limit round trips and overhead
- Writing unit tests with an embedded Cassandra server
- Cleaning up data directories before unit tests
- Generating Thrift bindings for other languages (C++, PHP, and others)
- Using the Cassandra Storage Proxy "Fat Client"
- Using range scans to find and remove old data
- Iterating all the columns of a large key
- Slicing columns in reverse
- Batch mutations to improve insert performance and code robustness
- Using TTL to create columns with self-deletion times
- Working with secondary indexes
Chapter 4: Performance Tuning
- Introduction
- Choosing an operating system and distribution
- Choosing a Java Virtual Machine
- Using a dedicated Commit Log disk
- Choosing a high performing RAID level
- File system optimization for hard disk performance
- Boosting read performance with the Key Cache
- Boosting read performance with the Row Cache
- Disabling Swap Memory for predictable performance
- Stopping Cassandra from using swap without disabling it system-wide
- Enabling Memory Mapped Disk modes
- Tuning Memtables for write-heavy workloads
- Saving memory on 64bit architectures with compressed pointers
- Tuning Concurrent Readers and Writers for throughput
- Setting compaction thresholds
- Garbage collection tuning to avoid JVM pauses
- Raising the open file limit to deal with many clients
- Increasing performance by scaling up
Chapter 5: Consistency, Availability, and Partition Tolerance with Cassandra
- Introduction
- Working with the formula for strong consistency
- Supplying the timestamp value with write requests
- Disabling the hinted handoff mechanism
- Adjusting read repair chance for less intensive data reads
- Confirming schema agreement across the cluster
- Adjusting replication factor to work with quorum
- Using write consistency ONE, read consistency ONE for low latency operations
- Using write consistency QUORUM, read consistency QUORUM for strong consistency
- Mixing levels write consistency QUORUM, read consistency ONE
- Choosing consistency over availability consistency ALL
- Choosing availability over consistency with write consistency ANY
- Demonstrating how consistency is not a lock or a transaction
Chapter 6: Schema Design
- Introduction
- Saving disk space by using small column names
- Serializing data into large columns for smaller index sizes
- Storing time series data effectively
- Using Super Columns for nested maps
- Using a lower Replication Factor for disk space saving and performance enhancements
- Hybrid Random Partitioner using Order Preserving Partitioner
- Storing large objects
- Using Cassandra for distributed caching
- Storing large or infrequently accessed data in a separate column family
- Storing and searching edge graph data in Cassandra
- Developing secondary data orderings or indexes
Chapter 7: Administration
- Defining seed nodes for Gossip Communication
- Nodetool Move: Moving a node to a specific ring location
- Nodetool Remove: Removing a downed node
- Nodetool Decommission: Removing a live node
- Joining nodes quickly with auto_bootstrap set to false
- Generating SSH keys for password-less interaction
- Copying the data directory to new hardware
- A node join using external data copy methods
- Nodetool Repair: When to use anti-entropy repair
- Nodetool Drain: Stable files on upgrade
- Lowering gc_grace for faster tombstone cleanup
- Scheduling Major Compaction
- Using nodetool snapshot for backups
- Clearing snapshots with nodetool clearsnapshot
- Restoring from a snapshot
- Exporting data to JSON with sstable2json
- Nodetool cleanup: Removing excess data
- Nodetool Compact: Defragment data and remove deleted data from disk
Chapter 8: Multiple Datacenter Deployments
- Changing debugging to determine where read operations are being routed
- Using IPTables to simulate complex network scenarios in a local environment
- Choosing IP addresses to work with RackInferringSnitch
- Scripting a multiple datacenter installation
- Determining natural endpoints, datacenter, and rack for a given key
- Manually specifying Rack and Datacenter configuration with a property file snitch
- Troubleshooting dynamic snitch using JConsole
- Quorum operations in multi-datacenter environments
- Using traceroute to troubleshoot latency between network devices
- Ensuring bandwidth between switches in multiple rack environments
- Increasing rpc_timeout for dealing with latency across datacenters
- Changing consistency level from the CLI to test various consistency levels with multiple datacenter deployments
- Using the consistency levels TWO and THREE
- Calculating Ideal Initial Tokens for use with Network Topology Strategy and Random Partitioner
Chapter 9: Coding and Internals
- Introduction
- Installing common development tools
- Building Cassandra from source
- Creating your own type by sub classing abstract type
- Using the validation to check data on insertion
- Communicating with the Cassandra developers and users through IRC and e-mail
- Generating a diff using subversion's diff feature
- Applying a diff using the patch command
- Using strings and od to quickly search through data files
- Customizing the sstable2json export utility
- Configure index interval ratio for lower memory usage
- Increasing phi_convict_threshold for less reliable networks
- Using the Cassandra maven plugin
Chapter 10: Libraries and Applications
- Introduction
- Building the contrib stress tool for benchmarking
- Inserting and reading data with the stress tool
- Running the Yahoo! Cloud Serving Benchmark
- Hector, a high-level client for Cassandra
- Doing batch mutations with Hector
- Cassandra with Java Persistence Architecture (JPA)
- Setting up Solandra for full text indexing with a Cassandra backend
- Setting up Zookeeper to support Cages for transactional locking
- Using Cages to implement an atomic read and set
- Using Groovandra as a CLI alternative
- Searchable log storage with Logsandra
Chapter 11: Hadoop and Cassandra
- Introduction
- A pseudo-distributed Hadoop setup
- A Map-only program that reads from Cassandra using the ColumnFamilyInputFormat
- A Map-only program that writes to Cassandra using the CassandraOutputFormat
- Using MapReduce to do grouping and counting with Cassandra input and output
- Setting up Hive with Cassandra Storage Handler support
- Defining a Hive table over a Cassandra Column Family
- Joining two Column Families with Hive
- Grouping and counting column values with Hive
- Co-locating Hadoop Task Trackers on Cassandra nodes
- Setting up a "Shadow" data center for running only MapReduce jobs
- Setting up DataStax Brisk the combined stack of Cassandra, Hadoop, and Hive
Chapter 12: Collecting and Analyzing Performance Statistics
- Finding bottlenecks with nodetool tpstats
- Using nodetool cfstats to retrieve column family statistics
- Monitoring CPU utilization
- Adding read/write graphs to find active column families
- Using Memtable graphs to profile when and why they flush
- Graphing SSTable count
- Monitoring disk utilization and having a performance baseline
- Monitoring compaction by graphing its activity
- Using nodetool compaction stats to check the progress of compaction
- Graphing column family statistics to track average/max row sizes
- Using latency graphs to profile time to seek keys
- Tracking the physical disk size of each column family over time
- Using nodetool cfhistograms to see the distribution of query latencies
- Tracking open networking connections
Chapter 13: Monitoring Cassandra Servers
- Introduction
- Forwarding Log4j logs to a central server
- Using top to understand overall performance
- Using iostat to monitor current disk performance
- Using sar to review performance over time
- Using JMXTerm to access Cassandra JMX
- Monitoring the garbage collection events
- Using tpstats to find bottlenecks
- Creating a Nagios Check Script for Cassandra
- Keep an eye out for large rows with compaction limits
- Reviewing network traffic with IPTraf
- Keep on the lookout for dropped messages
- Inspecting column families for dangerous conditions
- Index
This is a cookbook, and every task is approached as a recipe. A recipe describes a task and outlines the steps necessary to complete it. Some recipes are examples of writing code, such as a recipe that stores and accesses the entries of a phone book in Cassandra. Such a recipe describes the program, gives a full code example, runs the example, displays the output, and finally explains the process or code in greater detail in its ‘how it works’ section. Other recipes describe a task, such as taking a snapshot backup of data in Cassandra. That recipe describes the process, shows how to run the snapshot command and confirm that it worked, explains what the snapshot command does behind the scenes, and finally, in its ‘see also’ section, references related recipes such as the one for restoring a snapshot. This book is designed for administrators, developers, and data architects who are interested in Apache Cassandra for redundant, high-performance, and scalable data storage. Typically, these users should have experience working with a database technology, multiple-node computer clusters, and high-availability solutions.
Book Details
Publication year: 2011
License: All rights reserved ©

