Bigtable

Reading notes

Motivation

  • Lots of (semi) structure data: URLs, Geo
  • Scale is large
  • want asynchronous processes to be continuously updating different pieces of data

Big table

Building Blocks

  • GFS: stores persistent data
  • SSTable (file format): is used internally to store Bigtable data.
    • (key, value) (both are arbitrary byte strings)
    • each SSTable contains a sequence of blocks (typically each block is 64KB in size, but this is configurable)
    • A block index (stored at the end of the SSTable) is used to locate blocks; the index is loaded into memory when the SSTable is opened.
  • Scheduler: schedules jobs onto machines
  • Lock service (Chubby):
    • Lock service: distributed lock manager
    • master election, location bootstrapping
  • MapReduce

Data model

  • (row:string, column:string, time:int64) → string
  • Rows

    • Arbitrary string
    • Access to data in a row is atomic
    • Ordered lexicographically
  • Column

    • Tow-level name structure:
    • family: qualifier
    • Column Family is the unit of access control
  • Timestamps

    • Store different versions of data in a cell
    • Lookup options
      • Return most recent K values
      • Return all values

Implementation

  • Single-master distributed system
  • The Bigtable implementation has three major components:
    1. a library that is linked into every client
    2. one master server
      • is responsible for assigning tablets to tablet servers
      • detecting the addition and expiration of tablet servers
      • balancing tablet-server load
      • garbage collection of files in GFS.
      • In addition, it handles schema changes such as table and column family creations.
    3. many tablet servers (can be dynamically added or removed)
      • clients communicate directly with tablet servers for reads and write.

Tablet location

  • Given a row, how do clients find the location of the tablet whose row range covers the target row?

  • METADATA: Key: table id + end row, Data: location

  • Aggressive Caching and Prefetching at Client side

  • Chubby file

    • The first level is a file stored in Chubby that contains the location of the root tablet.
  • Root tablet
    • contains the location of all tablets in a special METADATA table.
    • first tablet in the METADATA table
    • never split to ensure 3-level hierarchy

Tablet Assignment

  • Each tablet is assigned to one tablet server at a time
  • uses Chubby to keep track of tablet servers
    • When a tablet server starts, it creates, and acquires an exclusive lock on, a uniquely-named file in a specific Chubby directory
  • Chubby
    • Tablet server registers itself by getting a lock in a specific directory chubby
      • Chubby gives “lease” on lock, must be renewed periodically
      • Server loses lock if it gets disconnected
    • Master monitors this directory to find which servers exist/are alive
      • If server not contactable/has lost lock, master grabs lock and reassigns tablets
      • GFS replicates data. Prefer to start tablet server on same machine that the data is already at

Tablet Serving

memtable: the recently committed ones are stored in memory in a sorted buffer.

Compactions

  1. minor compaction

    • When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS
    • two goals:
      • it shrinks the memory usage of the tablet server
      • it reduces the amount of data that has to be read from the commit log during recovery if this server dies.
  2. merging compaction

    • A merging compaction reads the contents of a few SSTables and the memtable, and writes out a new SSTable.
  3. major compaction

    • A merging compaction that rewrites all SSTables into exactly one SSTable

Refinements

  1. Locality groups
  2. Compression
  3. Caching for read performance
    • two level
      • Scan Cache: cache key-value pairs
      • Block Cache: cache SSTables blocks (better for sequential reads)
  4. Bloom filters
    • A Bloom filter allows us to ask whether an SSTable might contain any data for a specified row/column pair.
  5. Commit-log implementation
    • per tablet server
    • co-mingling mutations for different tablets
    • high performance, but complicates recovery.
  6. Speeding up tablet recovery
    1. the source tablet server first does a minor compaction on that tablet
    2. the tablet server stops serving the tablet
    3. minor compaction to eliminate any remaining uncompacted state in the tablet server’s log
    4. the tablet can be loaded on another tablet server
  7. Exploiting immutability
    • (SSTables is immutable)
    • Each tablet’s SSTables are registered in the METADATA table.
    • The master removes obsolete SSTables as a mark-and-sweep garbage collection over the set of SSTables, where the METADATA table contains the set of roots.
    • the immutability of SSTables enables us to split tablets quickly (let the child tablets share the SSTables of the parent tablet)

Class notes

A NoSQL (originally referring to "non SQL" or "non relational")[1] database provides a mechanism for storage and retrieval of data which is modeled in means other than the tabular relations used in relational databases.

APIs

  • Metadata operations
    • Create/delete tables, column families, change metadata
  • Writes
    • Set(): write cells in a row
    • DeleteCells(): delete cells in a row
    • DeleteRow(): delete all cells in a row
  • Reads
    • Scanner: read arbitrary cells in a bigtable
      • Each row read is atomic
      • Can restrict returned rows to a particular range
      • Can ask for just data from 1 row, all rows, etc.
      • Can ask for all columns, just certain column families, or specific columns

Typical Cluster

Chubby

  • {lock/file/name} service
  • Coarse-grained locks
  • Each clients has a session with Chubby.
    • The session expires if it is unable to renew its session lease within the lease expiration time.
  • 5 replicas, need a majority vote to be active

Questions

What is the responsibility of the master node?

  • Assigning tablets to tablet servers
  • load balancing across tablet servers
  • detecting tablet server status

results matching ""

    No results matching ""