Introduction to Apache BookKeeper

Apache BookKeeper is a storage system, originally developed at Yahoo, that provides low-latency, consistent, and durable storage. It is a fault-tolerant, scalable, enterprise-grade service optimized for real-time workloads. BookKeeper began in 2011 as a subproject of Apache ZooKeeper and became a top-level Apache project in 2015. Many enterprises, such as Salesforce, Yahoo, and Twitter, have adopted BookKeeper to store and serve mission-critical data across a variety of use cases.


BookKeeper was designed to meet the following requirements of an enterprise-grade, real-time storage platform.

1) Read and write streams of entries with minimum delay.

2) Store data consistently, durably, and in a fault-tolerant way.

3) Stream, or tail, data as it is written.

4) Efficiently store and provide access to both historical and real-time data.

It is widely deployed for several use cases, such as:

  • Providing high availability or replication facilities for distributed systems.
  • Providing replication within a single cluster and also across clusters.
  • Serving as the storage service for pub-sub (publish and subscribe) messaging systems, for example EventBus and Apache Pulsar.
  • Storing immutable objects for a series of jobs.

[Figure: Block diagram of a reliable system, with the labels "User Libraries" and "Replication, Consistency, Recovery".]

BookKeeper provides reliable, persistent storage of sequences of log entries (records) called ledgers. BookKeeper's task is to replicate stored entries across more than one server.

Basic Concepts

  • Each unit of a log is an entry.
  • A sequence of log entries is a ledger.
  • Bookies are the individual servers that store ledgers.

Entry: –

An entry is the user data, together with some metadata, that is written to a ledger. Entries are sequences of bytes written to a ledger. An entry has the following fields:

    Field name            Java type     Description
1)  Ledger number         long          Unique ID of the ledger in which the entry is written.
2)  Entry number          long          Unique ID of the entry.
3)  Last confirmed        long          ID of the last recorded entry.
4)  Data                  byte array    The data written by the client application.
5)  Authentication code   byte array    Message authentication code that covers all other fields in the entry.
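The five fields above can be modeled in a few lines. The following is only an illustrative sketch, not BookKeeper's wire format: the names `Entry`, `compute_auth_code`, `make_entry`, and `verify_entry` are hypothetical, and the choice of HMAC-SHA256 over big-endian packed fields is an assumption made for the example.

```python
import hashlib
import hmac
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Illustrative model of the five entry fields (not BookKeeper's wire format)."""
    ledger_id: int       # ledger number (a Java long)
    entry_id: int        # entry number (a Java long)
    last_confirmed: int  # ID of the last recorded entry (a Java long)
    data: bytes          # payload written by the client application
    auth_code: bytes     # authentication code covering all other fields


def compute_auth_code(key: bytes, ledger_id: int, entry_id: int,
                      last_confirmed: int, data: bytes) -> bytes:
    # Assumption: pack the three longs as big-endian signed 64-bit integers,
    # append the payload, and authenticate the result with HMAC-SHA256.
    header = struct.pack(">qqq", ledger_id, entry_id, last_confirmed)
    return hmac.new(key, header + data, hashlib.sha256).digest()


def make_entry(key: bytes, ledger_id: int, entry_id: int,
               last_confirmed: int, data: bytes) -> Entry:
    code = compute_auth_code(key, ledger_id, entry_id, last_confirmed, data)
    return Entry(ledger_id, entry_id, last_confirmed, data, code)


def verify_entry(key: bytes, entry: Entry) -> bool:
    # Recompute the code from the other four fields and compare in constant time.
    expected = compute_auth_code(key, entry.ledger_id, entry.entry_id,
                                 entry.last_confirmed, entry.data)
    return hmac.compare_digest(expected, entry.auth_code)
```

Because the code covers every other field, changing the data or any of the IDs after the entry is built makes `verify_entry` return False.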


Ledger: –

A ledger is a sequence of entries. Entries are written to a ledger sequentially and at most once.

BookKeeper provides two types of ledger managers: the flat ledger manager and the hierarchical ledger manager.

  • The flat ledger manager (FLM) is implemented in the FlatLedgerManager class and stores the metadata of all ledgers in child nodes of a single ZooKeeper path. The FLM creates sequential nodes to ensure the uniqueness of each ledger ID.
  • The hierarchical ledger manager (HLM) is implemented in the HierarchicalLedgerManager class. It first obtains a globally unique ID from ZooKeeper using an EPHEMERAL_SEQUENTIAL node.

Note: – Ledgers have append-only semantics. Once an entry has been written to a ledger, it cannot be modified.
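Append-only semantics can be pictured with a small in-memory model. This is a toy sketch under stated assumptions, not BookKeeper code: `AppendOnlyLedger` and its methods are hypothetical names, and entry IDs are assumed to start at 0.

```python
from typing import List


class AppendOnlyLedger:
    """Toy in-memory ledger: entries can be appended and read, never modified."""

    def __init__(self, ledger_id: int) -> None:
        self.ledger_id = ledger_id
        self._entries: List[bytes] = []

    def add_entry(self, data: bytes) -> int:
        # Store an immutable copy so later changes to the caller's buffer
        # cannot alter what was written.
        self._entries.append(bytes(data))
        # Entry IDs are assigned sequentially (assumed to start at 0).
        return len(self._entries) - 1

    def read_entry(self, entry_id: int) -> bytes:
        if not 0 <= entry_id < len(self._entries):
            raise KeyError(f"no entry {entry_id} in ledger {self.ledger_id}")
        return self._entries[entry_id]

    def last_entry_id(self) -> int:
        # -1 means the ledger is still empty.
        return len(self._entries) - 1
```

The class deliberately offers no update or delete operation for individual entries; the only way to change a ledger is to append to it.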


Bookie: –

A bookie is an individual BookKeeper server that manages ledgers. A bookie stores fragments of ledgers, not entire ledgers. When entries are written to a ledger, each entry is written to a subgroup of bookies rather than to all of the bookies.
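One way to picture how an entry lands on a subgroup of bookies is round-robin placement over the ledger's ensemble. The sketch below is an illustration under that assumption; the function name `write_set`, the parameter `write_quorum`, and the bookie names are hypothetical.

```python
from typing import List


def write_set(entry_id: int, ensemble: List[str], write_quorum: int) -> List[str]:
    """Pick the bookies that store one entry, round-robin over the ensemble."""
    if not 1 <= write_quorum <= len(ensemble):
        raise ValueError("write_quorum must be between 1 and the ensemble size")
    # Assumption: the first bookie is chosen by entry ID modulo ensemble size,
    # and the remaining replicas go to the bookies that follow it.
    start = entry_id % len(ensemble)
    return [ensemble[(start + i) % len(ensemble)] for i in range(write_quorum)]
```

With three bookies and two copies per entry, each bookie ends up holding about two thirds of the entries, which is why a bookie stores only fragments of a ledger.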

Metadata Storages

BookKeeper requires a metadata storage service to keep track of ledgers and available bookies. For this purpose, BookKeeper uses ZooKeeper.

Data Management in bookies

  • Journals
  • Entry logs
  • Index files
  • Ledger cache
  • Data flush
  • Adding entries
  • Data compaction
  • ZooKeeper metadata


  • Journals: A journal file contains the BookKeeper transaction logs. Before any update to a ledger takes place, the bookie ensures that a transaction describing the update is written to non-volatile storage. A new journal file is created when the bookie starts or when the current journal file reaches the journal file size threshold.
  • Entry logs: An entry log file manages the written entries received from BookKeeper clients. Entries from different ledgers are aggregated and written sequentially, and their offsets are kept as pointers in the ledger cache for fast lookup.
  • Index files: For each ledger there is an index file, made up of several fixed-length index pages, that records the offsets of the data stored in the entry log files.
  • Ledger cache and data flush: Ledger index pages are cached in a memory pool, which allows more efficient management of disk head scheduling. Ledger index pages are flushed to the index files in the following two cases:
      • When the ledger cache reaches its memory limit.
      • When the background sync thread, which is responsible for flushing index pages, runs.
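The two flush cases can be sketched with a small in-memory model. This is a simplified illustration, not the bookie's actual implementation: `LedgerCache`, `page_limit`, and `entries_per_page` are hypothetical names, a dictionary stands in for the on-disk index files, and a flush is assumed to empty the cache.

```python
from typing import Dict, Tuple

PageKey = Tuple[int, int]  # (ledger_id, page_number)


class LedgerCache:
    """Toy ledger cache: index pages live in memory until they are flushed."""

    def __init__(self, page_limit: int, entries_per_page: int = 4) -> None:
        self.page_limit = page_limit
        self.entries_per_page = entries_per_page
        self.cached_pages: Dict[PageKey, Dict[int, int]] = {}  # in memory
        self.index_files: Dict[PageKey, Dict[int, int]] = {}   # stand-in for disk

    def _page_key(self, ledger_id: int, entry_id: int) -> PageKey:
        return (ledger_id, entry_id // self.entries_per_page)

    def put_offset(self, ledger_id: int, entry_id: int, offset: int) -> None:
        page = self.cached_pages.setdefault(self._page_key(ledger_id, entry_id), {})
        page[entry_id] = offset
        # Case 1: the cache has grown past its limit, so flush immediately.
        if len(self.cached_pages) > self.page_limit:
            self.flush()

    def flush(self) -> None:
        # Case 2: a background sync thread would call this periodically.
        for key, page in self.cached_pages.items():
            self.index_files.setdefault(key, {}).update(page)
        self.cached_pages.clear()

    def get_offset(self, ledger_id: int, entry_id: int) -> int:
        key = self._page_key(ledger_id, entry_id)
        cached = self.cached_pages.get(key, {})
        if entry_id in cached:
            return cached[entry_id]
        return self.index_files[key][entry_id]
```

Lookups check the cached page first and fall back to the flushed index, so an offset stays readable whether or not its page has been written out.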

When a client instructs a bookie to write an entry to a ledger, the entry goes through the following steps to be persisted on disk:

  1. The entry is appended to an entry log.
  2. The index of the entry is updated in the ledger cache.
  3. A transaction corresponding to this entry update is appended to the journal.
  4. A response is sent to the BookKeeper client.
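The four steps can be traced in a toy in-memory bookie. This is a sketch for illustration only: `ToyBookie` and its attributes are hypothetical, plain Python containers stand in for the on-disk entry log and journal, and the "OK" return value is an assumed stand-in for the client response.

```python
from typing import Dict, List, Tuple


class ToyBookie:
    """Toy bookie that follows the four write steps in memory."""

    def __init__(self) -> None:
        self.entry_log = bytearray()  # one log shared by all ledgers
        # (ledger_id, entry_id) -> (offset, length) in the entry log
        self.ledger_cache: Dict[Tuple[int, int], Tuple[int, int]] = {}
        self.journal: List[Tuple[int, int, bytes]] = []

    def add_entry(self, ledger_id: int, entry_id: int, data: bytes) -> str:
        # Step 1: append the entry to the entry log.
        offset = len(self.entry_log)
        self.entry_log += data
        # Step 2: update the entry's index in the ledger cache.
        self.ledger_cache[(ledger_id, entry_id)] = (offset, len(data))
        # Step 3: append a transaction for this update to the journal.
        self.journal.append((ledger_id, entry_id, bytes(data)))
        # Step 4: respond to the client.
        return "OK"

    def read_entry(self, ledger_id: int, entry_id: int) -> bytes:
        offset, length = self.ledger_cache[(ledger_id, entry_id)]
        return bytes(self.entry_log[offset:offset + length])
```

Writing entries for two different ledgers in turn shows how they are interleaved in the single entry log while the ledger cache keeps each one addressable.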


  • Data compaction: Minor and major compaction are two types of compaction that run at different frequencies. They differ in their threshold values and intervals.
  • The garbage collection threshold is expressed as the fraction of an entry log file's size that is still occupied by undeleted ledgers. The default thresholds for minor and major compaction are 0.2 and 0.8, respectively.
  • The garbage collection interval is how frequently the compaction runs. The default intervals for minor and major compaction are one hour and one day, respectively.

Note: – Compaction is disabled if either the threshold or the interval is set to a value less than or equal to zero.
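The defaults and the disabling rule can be expressed as a small decision function. This is a hedged sketch: the constant and function names are hypothetical, and the rule that a file is compacted when its live fraction falls below the threshold is an assumption made for the example.

```python
MINOR_THRESHOLD = 0.2          # default minor compaction threshold
MAJOR_THRESHOLD = 0.8          # default major compaction threshold
MINOR_INTERVAL_S = 60 * 60     # default minor interval: one hour
MAJOR_INTERVAL_S = 24 * 60 * 60  # default major interval: one day


def compaction_enabled(threshold: float, interval_s: int) -> bool:
    # Compaction is disabled when either setting is zero or negative.
    return threshold > 0 and interval_s > 0


def should_compact(live_bytes: int, total_bytes: int,
                   threshold: float, interval_s: int) -> bool:
    """Decide whether one entry log file is a compaction candidate."""
    if not compaction_enabled(threshold, interval_s) or total_bytes <= 0:
        return False
    # Assumption: a file qualifies when the share of its size still used by
    # undeleted ledgers is below the threshold.
    return live_bytes / total_bytes < threshold
```

Under these assumptions, an entry log that is 50% live is skipped by minor compaction but picked up by major compaction.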

Installation of Bookkeeper

There are two ways to install BookKeeper: download a gzipped tarball package, or clone the BookKeeper repository. Once you have downloaded or cloned BookKeeper, you have to build it locally.

BookKeeper has the following installation requirements:

  • A UNIX environment.
  • Java Development Kit 1.6 or later.
  • Maven 3.0 or later.


You can download Apache BookKeeper from any of the many Apache mirrors.


You can clone the repository from either the GitHub mirror or the Apache repository. We suggest the GitHub mirror.


Whether you downloaded or cloned BookKeeper, you build it locally with Maven:

$ mvn package

BookKeeper 4.8.0 introduced the table service. If you want to build the table service, you can build it with the stream profile.

Running mvn package also runs the tests, but you can skip them by adding the -DskipTests flag.

Other useful Maven commands are as follows.

S. No   Command                                       Action
1.      mvn clean                                     Removes build artifacts.
2.      mvn compile                                   Compiles JAR files from the Java sources.
3.      mvn compile spotbugs:spotbugs                 Compiles using the Maven SpotBugs plugin.
4.      mvn install                                   Installs the BookKeeper JAR in your local Maven cache.
5.      mvn deploy                                    Deploys the BookKeeper JAR to the Maven repo.
6.      mvn verify                                    Performs a wide variety of verification and validation tasks.
7.      mvn apache-rat:check                          Runs Maven using the Apache Rat plugin.
8.      mvn compile javadoc:aggregate                 Builds the Javadoc locally.
9.      mvn -am -pl bookkeeper-dist/server package    Builds a server distribution using the Maven Assembly plugin.

Package directory

The BookKeeper project has several subfolders, which are as follows:

S. No   Subfolder                    Contains
1.      bookkeeper-server            The BookKeeper server and clients.
2.      bookkeeper-benchmark         A benchmarking suite for measuring BookKeeper performance.
3.      bookkeeper-stats             The BookKeeper stats library.
4.      bookkeeper-stats-providers   The BookKeeper stats providers.

Run Bookies Locally

The localbookie command of the bookkeeper CLI tool runs an ensemble of bookies locally on a single machine. You specify the number of bookies you would like to include in the ensemble.

For example, this starts up an ensemble with ten bookies:

$ bin/bookkeeper localbookie 10

When you start up an ensemble using localbookie, all bookies run in a single JVM process.

Manual deployment of Bookkeeper

The easiest way to deploy BookKeeper is with a scheduler such as DC/OS, but you can also deploy a BookKeeper cluster manually. A BookKeeper cluster has two primary components:

  • A ZooKeeper cluster, used for configuration and coordination tasks.
  • An ensemble of bookies.

Zookeeper setup

To set up a ZooKeeper cluster, we recommend following the ZooKeeper setup guide.

Starting up bookies

Once your ZooKeeper cluster is up and running, you can start up as many bookies as you like to form a cluster. Before starting each bookie, modify its configuration so that it points to the right ZooKeeper cluster:

  1. Download the BookKeeper package as a tarball on each bookie host.
  2. Configure the bookie by setting values in the bookkeeper-server/conf/bk_server.conf configuration file.
  3. Set the zkServers parameter to the ZooKeeper connection string for your cluster.
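A ZooKeeper connection string is a comma-separated list of host:port pairs. The helper below is a minimal sketch of how such a zkServers line could be assembled; the function names and the host names in the usage note are hypothetical, and 2181 is ZooKeeper's default client port.

```python
from typing import Iterable, Tuple


def zk_connection_string(servers: Iterable[Tuple[str, int]]) -> str:
    """Join (host, port) pairs into a ZooKeeper connection string."""
    return ",".join(f"{host}:{port}" for host, port in servers)


def zk_servers_line(servers: Iterable[Tuple[str, int]]) -> str:
    """Render the zkServers line for bk_server.conf."""
    return f"zkServers={zk_connection_string(servers)}"
```

For example, calling zk_servers_line with the hypothetical hosts ("zk1.example.com", 2181) and ("zk2.example.com", 2181) yields a line that lists both servers after zkServers=.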

Once the bookie configuration is complete, you can start the bookie with the bookie command of the bookkeeper CLI tool:

$ bin/bookkeeper bookie

Note: – The number of bookies you should run in a BookKeeper cluster depends on the quorum mode you have chosen, the desired throughput, and the number of clients using the cluster simultaneously.

Quorum type              Minimum number of bookies
Generic quorum           4
Self-verifying quorum    3

Increasing the number of bookies generally enables higher throughput.

Cluster metadata setup

After setting up a cluster of bookies, you need to set up the cluster metadata with the following command:

$ bin/bookkeeper shell metaformat

The metaformat command performs all the necessary ZooKeeper cluster metadata tasks. You need to run it only once, from any bookie in the BookKeeper cluster. Once formatting is complete, your BookKeeper cluster is ready to go.

Bookkeeper Protocols

To guarantee persistent storage of entries in an ensemble of bookies, BookKeeper uses a special replication protocol.

This section assumes that you are familiar with leader election and log replication and with how they are used in distributed systems.


  • Ledger metadata
  • Ensemble
  • Write quorums
  • Ack quorums
  • Guarantees

Writing to ledger

  • Closing a ledger as a reader
  • Closing a ledger as a writer

Ledger to log

  • Opening a log
  • Rolling ledgers

Command-line tools


Command        Description                                                    Usage
bookie         Starts up a bookie.                                            $ bin/bookkeeper bookie
localbookie    Starts up an ensemble of N bookies in a single JVM process.    $ bin/bookkeeper localbookie N
autorecovery   Runs the autorecovery service daemon.                          $ bin/bookkeeper autorecovery
upgrade        Upgrades the bookie's file system.                             $ bin/bookkeeper upgrade
shell          Runs the bookie shell for admin commands.                      $ bin/bookkeeper shell
help           Displays the help message for the bookkeeper tool.             $ bin/bookkeeper help

Data Distribution





Conclusion: – BookKeeper provides distributed logs, which are known as ledgers. A deployment is the combination of BookKeeper clients and bookies. In summary, the client API can be described as:

  • createLedger() -> ledgerId
  • addEntry(data) -> entryId
  • readEntry(ledgerId, entryId)
  • deleteLedger(ledgerId)
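The four calls can be tried out against an in-memory stand-in. This is a toy model and not the BookKeeper client library: `ToyBookKeeperClient` is a hypothetical name, the method names are Python-style renderings of the calls above, and add_entry takes the ledger ID explicitly as a simplification.

```python
import itertools
from typing import Dict, List


class ToyBookKeeperClient:
    """In-memory stand-in for the summarized client API (no real bookies)."""

    def __init__(self) -> None:
        self._next_ledger_id = itertools.count()
        self._ledgers: Dict[int, List[bytes]] = {}

    def create_ledger(self) -> int:
        ledger_id = next(self._next_ledger_id)
        self._ledgers[ledger_id] = []
        return ledger_id

    def add_entry(self, ledger_id: int, data: bytes) -> int:
        entries = self._ledgers[ledger_id]
        entries.append(bytes(data))
        return len(entries) - 1  # entry ID of the entry just written

    def read_entry(self, ledger_id: int, entry_id: int) -> bytes:
        if entry_id < 0:
            raise IndexError("entry IDs are non-negative")
        return self._ledgers[ledger_id][entry_id]

    def delete_ledger(self, ledger_id: int) -> None:
        # Removes the whole ledger; individual entries are never deleted.
        del self._ledgers[ledger_id]
```

A typical session creates a ledger, appends a few entries, reads them back by ID, and finally deletes the ledger as a whole.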

BookKeeper is an efficient and reliable storage system that provides low latency, consistency, and durability. It is a service optimized for real-time workloads.
