# Efficient Batching

One of the most important techniques for making caching efficient is batching and pipelining. Similar to the N+1 problem with databases, doing batched gets is much more efficient than doing gets one by one.

There are several points in a caching solution where batching can improve effectiveness dramatically:

- Batching gets for keys of the same item type: for example, a multi-get for multiple product infos.
- Batching gets for keys of different item types, when there is no data dependency between the items: for example, getting product information and getting product prices are often independent and can be fetched at the same time.
- Batching gets to the database (it would be an N+1 problem otherwise) and batching sets to write the data back to the cache server when many keys are missed at the same time, especially at startup when the cache is empty.
- Batching gets and "batching" sleeps in the algorithm that prevents thundering herd: instead of sleeping one key at a time, all keys sleep for the same duration, and then a multi-get retries all the keys at once (see the sketch after this list).
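
To make the last point concrete, here is a minimal sketch of "batching" the sleep. The `Cache` interface, its `MultiGet` method, and the retry loop below are hypothetical illustrations, not this library's actual API:

```go
package example

import "time"

// Cache is a hypothetical multi-get client, used only for this sketch.
type Cache interface {
	// MultiGet returns the values found and the keys still missing.
	MultiGet(keys []string) (found map[string][]byte, missed []string)
}

// getWithRetry sleeps once for ALL missed keys, then retries them with a
// single multi-get, instead of sleeping and retrying key by key.
func getWithRetry(client Cache, keys []string) map[string][]byte {
	results := make(map[string][]byte)
	pending := keys
	for attempt := 0; attempt < 3 && len(pending) > 0; attempt++ {
		found, missed := client.MultiGet(pending)
		for key, value := range found {
			results[key] = value
		}
		pending = missed
		if len(pending) > 0 {
			time.Sleep(100 * time.Millisecond) // one shared sleep, not one per key
		}
	}
	return results
}
```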

This library helps solve all of these problems.

## Efficient Batching through Deferred Function Calls

### Functional Batching in Pipeline

The idea starts with the `Pipeline` interface:

```go
package memproxy

type Pipeline interface {
	LeaseGet(key string, options LeaseGetOptions) func() (LeaseGetResponse, error)
	LeaseSet(key string, data []byte, cas uint64, options LeaseSetOptions) func() (LeaseSetResponse, error)
	Delete(key string, options DeleteOptions) func() (DeleteResponse, error)

	// Execute flushes the collected commands to the network
	Execute()

	// Finish must be called after creating a Pipeline, often with defer
	Finish()

	// LowerSession returns a lower-priority session
	LowerSession() Session
}
```

Let's focus on these three functions:

```go
LeaseGet(key string, options LeaseGetOptions) func() (LeaseGetResponse, error)
LeaseSet(key string, data []byte, cas uint64, options LeaseSetOptions) func() (LeaseSetResponse, error)
Delete(key string, options DeleteOptions) func() (DeleteResponse, error)
```

Instead of returning a result immediately, these three functions, when called, only collect the operations. The collected operations are actually executed only after the returned anonymous functions are called.

For example, in the following code:

```go
fn1 := pipeline.LeaseGet("key01", LeaseGetOptions{})
fn2 := pipeline.LeaseGet("key02", LeaseGetOptions{})
fn3 := pipeline.LeaseGet("key03", LeaseGetOptions{})

resp1, err := fn1()
resp2, err := fn2()
resp3, err := fn3()
```

The first three calls do nothing except collect the operations in an internal buffer. Only at the fourth call, `resp1, err := fn1()`, are the operations actually flushed to the network, waiting for the results to come back from the memcached servers.

The fifth and sixth lines:

```go
resp2, err := fn2()
resp3, err := fn3()
```

do not actually do much besides mapping the results already received during the wait in

```go
resp1, err := fn1()
```

to their corresponding operations.

Readers familiar with Promises in JavaScript will notice that this function signature looks like a Promise. Instead of returning the result right away, the function returns a Promise from which the result can be obtained in the future.

In the case of Pipeline, the Promise is a simple and efficient anonymous function that facilitates batching.
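
To see how such an anonymous function can facilitate batching, here is a minimal sketch of the idea. The `miniPipeline` type and its flush logic are illustrative only, not the library's real implementation:

```go
package example

type getResult struct {
	value []byte
	err   error
}

// miniPipeline collects Get operations into a buffer and flushes them all
// at once the first time any of the returned functions is called.
type miniPipeline struct {
	keys    []string
	results []getResult
	flushed bool
}

func (p *miniPipeline) Get(key string) func() ([]byte, error) {
	index := len(p.keys)
	p.keys = append(p.keys, key) // only collect the operation
	return func() ([]byte, error) {
		if !p.flushed {
			p.flush() // the first call executes every collected operation
		}
		r := p.results[index]
		return r.value, r.err
	}
}

func (p *miniPipeline) flush() {
	// a real implementation would send one batched request over the
	// network here and fill p.results with the responses
	p.results = make([]getResult, len(p.keys))
	p.flushed = true
}
```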

## Batching between Different Types of Data

Using the idea of returning anonymous functions, we can make batching between different data types easier to implement.

For example, consider the following code:

```go
package main

import (
	"context"
	"fmt"
)

type ProductInfo struct {
	// detail fields
}

type ProductPrice struct {
	// detail fields
}

type Repository interface {
	GetProductInfo(ctx context.Context, sku string) func() (ProductInfo, error)
	GetProductPrice(ctx context.Context, sku string) func() (ProductPrice, error)
}

func main() {
	var repo Repository // assumed to be initialized with a correct implementation
	ctx := context.Background()

	// collect all four operations first, without executing anything
	infoFunc1 := repo.GetProductInfo(ctx, "SKU01")
	infoFunc2 := repo.GetProductInfo(ctx, "SKU02")

	priceFunc1 := repo.GetProductPrice(ctx, "SKU01")
	priceFunc2 := repo.GetProductPrice(ctx, "SKU02")

	// the first call can flush all collected operations in a single batch
	info1, err := infoFunc1()
	info2, err := infoFunc2()

	price1, err := priceFunc1()
	price2, err := priceFunc2()

	fmt.Println(info1, info2, price1, price2, err)
}
```

We can see that the four operations on two kinds of data can be batched efficiently using this approach.
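
How could such a repository batch under the hood? Here is a sketch of one possible implementation on top of a Pipeline-style client. The `getPipeline` interface is a simplified, hypothetical stand-in (the library's real `LeaseGet` takes options and returns a `LeaseGetResponse`), and the key format and JSON encoding are assumptions for illustration:

```go
package example

import (
	"context"
	"encoding/json"
)

// ProductInfo mirrors the type from the example above.
type ProductInfo struct {
	// detail fields
}

// getPipeline is a simplified, hypothetical stand-in for the library's
// Pipeline interface.
type getPipeline interface {
	LeaseGet(key string) func() ([]byte, error)
}

type repoImpl struct {
	pipeline getPipeline
}

// GetProductInfo collects the LeaseGet operation immediately, so calls for
// different data types end up in the same batch; decoding is deferred until
// the returned function is called.
func (r *repoImpl) GetProductInfo(ctx context.Context, sku string) func() (ProductInfo, error) {
	fn := r.pipeline.LeaseGet("product:info:" + sku) // hypothetical key format
	return func() (ProductInfo, error) {
		data, err := fn()
		if err != nil {
			return ProductInfo{}, err
		}
		var info ProductInfo
		err = json.Unmarshal(data, &info)
		return info, err
	}
}
```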

## Chaining Anonymous Functions for Complex Batching

Simply returning anonymous functions helps with batching between unrelated kinds of data. But how do we batch when the actions are related?

For example, how do we batch this chain of actions:

get from cache => if not found, get from DB => then set back to the cache

The idea is to use the help of the `Session` interface:

```go
package memproxy

type Session interface {
	AddNextCall(fn func())
	Execute()

	// other functions
}
```

The AddNextCall function simply appends to a list of deferred calls. The Execute function loops through this list and executes each deferred function one by one.
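
A minimal sketch of such a session might look like the following; the `simpleSession` type is illustrative, not the library's real implementation. The important detail is that calls added *during* Execute are also drained, which is what makes chaining possible:

```go
package example

// simpleSession drains its list of deferred calls round by round. A call
// may add further calls while it runs; they execute in the next round.
type simpleSession struct {
	nextCalls []func()
}

func (s *simpleSession) AddNextCall(fn func()) {
	s.nextCalls = append(s.nextCalls, fn)
}

func (s *simpleSession) Execute() {
	for len(s.nextCalls) > 0 {
		calls := s.nextCalls
		s.nextCalls = nil
		for _, fn := range calls {
			fn() // fn may call AddNextCall again, scheduling another round
		}
	}
}
```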

Assuming the three operations above belong to the same interface, we can implement it as below:

```go
package example

import (
	"context"
	"errors"
)

var ErrNotFound = errors.New("not found")

type UserData struct {
	// other fields
}

type Session interface {
	AddNextCall(fn func())
	Execute()
}

type Repository interface {
	GetFromCache(ctx context.Context) func() (UserData, error)
	GetFromDB(ctx context.Context) func() (UserData, error)
	SetToCache(ctx context.Context, data UserData) func() error
}

func GetCache(
	ctx context.Context,
	repo Repository,
	sess Session,
) func() (UserData, error) {
	// collect the cache get immediately, so it can batch with other callers
	cacheFunc := repo.GetFromCache(ctx)

	var result UserData
	var err error

	sess.AddNextCall(func() {
		result, err = cacheFunc()
		if errors.Is(err, ErrNotFound) {
			// cache miss: collect the DB get, deferred to the next round
			getDBFunc := repo.GetFromDB(ctx)
			sess.AddNextCall(func() {
				result, err = getDBFunc()
				if err != nil {
					return
				}
				// finally, collect the write-back to the cache
				setCacheFunc := repo.SetToCache(ctx, result)
				sess.AddNextCall(func() {
					_ = setCacheFunc()
				})
			})
		}
	})

	return func() (UserData, error) {
		sess.Execute()
		return result, err
	}
}
```

The way the GetCache function is implemented makes the three operations:

- GetFromCache
- GetFromDB
- SetToCache

behave in a batching manner.
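
For example, in the following hypothetical usage (continuing in the same package as GetCache above), two chains share one session, so a batching repository implementation can combine both GetFromCache operations into one round trip, both GetFromDB operations into another, and both SetToCache operations into a third:

```go
func useGetCache(ctx context.Context, repo Repository, sess Session) {
	// collect both chains first; nothing is executed yet
	fn1 := GetCache(ctx, repo, sess)
	fn2 := GetCache(ctx, repo, sess)

	// the first call drains the session: each round executes the
	// deferred steps of BOTH chains together
	data1, err1 := fn1()
	data2, err2 := fn2()

	_, _, _, _ = data1, err1, data2, err2
}
```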

The actual implementation is more complicated, because of the many options and because it has to deal with sleeping for thundering herd protection. But the main idea remains the same.