
# Inference Request

The main class used to describe requests to `GptManager` is `InferenceRequest`. It is structured as a map of named tensors plus a `uint64_t` `requestId`. The mandatory tensors required to create a valid `InferenceRequest` object are described below. Sampling Config parameters are documented in more detail here, so their descriptions are omitted from the table:

| Name | Shape | Type | Description |
| --- | --- | --- | --- |
| `request_output_len` | [1, 1] | `int32_t` | Maximum number of output tokens |
| `input_ids` | [1, num_input_tokens] | `int32_t` | Tensor of input tokens |
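For illustration, the mandatory tensors can be modeled as nested Python lists keyed by name. This is only a sketch of the tensor layout described in the table above, not the actual C++ `InferenceRequest` API; the helper name `make_minimal_request` is invented for this example.

```python
# Sketch: model the mandatory tensors of an InferenceRequest as a dict of
# nested lists. Shapes follow the table above; the real object is a C++
# map of named tensors plus a uint64_t requestId.
def make_minimal_request(request_id, input_tokens, max_output_len):
    return {
        "requestId": request_id,                   # uint64_t
        "request_output_len": [[max_output_len]],  # shape [1, 1], int32
        "input_ids": [input_tokens],               # shape [1, num_input_tokens], int32
    }

req = make_minimal_request(42, [101, 2023, 2003, 102], 64)
assert len(req["request_output_len"]) == 1       # first dim is 1
assert len(req["request_output_len"][0]) == 1    # second dim is 1
assert len(req["input_ids"][0]) == 4             # num_input_tokens
```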

Optional tensors that can be supplied to `InferenceRequest` are shown below. Default values, where applicable, are specified:

| Name | Shape | Type | Description |
| --- | --- | --- | --- |
| `streaming` | [1] | `bool` | (Default=`false`) When `true`, stream out tokens as they are generated; when `false`, return only once the full generation has completed |
| `beam_width` | [1] | `int32_t` | (Default=1) Beam width for this request; set to 1 for greedy sampling |
| `temperature` | [1] | `float` | Sampling Config param: `temperature` |
| `runtime_top_k` | [1] | `int32_t` | Sampling Config param: `topK` |
| `runtime_top_p` | [1] | `float` | Sampling Config param: `topP` |
| `len_penalty` | [1] | `float` | Sampling Config param: `lengthPenalty` |
| `repetition_penalty` | [1] | `float` | Sampling Config param: `repetitionPenalty` |
| `min_length` | [1] | `int32_t` | Sampling Config param: `minLength` |
| `presence_penalty` | [1] | `float` | Sampling Config param: `presencePenalty` |
| `frequency_penalty` | [1] | `float` | Sampling Config param: `frequencyPenalty` |
| `random_seed` | [1] | `uint64_t` | Sampling Config param: `randomSeed` |
| `end_id` | [1] | `int32_t` | End token ID |
| `pad_id` | [1] | `int32_t` | Pad token ID |
| `embedding_bias` | [1] | `float` | Embedding bias |
| `bad_words_list` | [2, num_bad_words] | `int32_t` | Bad words list |
| `stop_words_list` | [2, num_stop_words] | `int32_t` | Stop words list |
| `prompt_embedding_table` | [1] | `float16` | P-tuning prompt embedding table |
| `prompt_vocab_size` | [1] | `int32_t` | P-tuning prompt vocab size |
| `lora_weights` | [num_lora_modules_layers, D x Hi + Ho x D] | `float` (model data type) | Weights for a LoRA adapter; see the LoRA docs for more details |
| `lora_config` | [3] | `int32_t` | LoRA configuration tensor: [module_id, layer_idx, adapter_size (D, a.k.a. R value)]; see the LoRA docs for more details |
| `return_log_probs` | [1] | `bool` | When `true`, include log probs in the output |
| `return_context_logits` | [1] | `bool` | When `true`, include context logits in the output |
| `return_generation_logits` | [1] | `bool` | When `true`, include generation logits in the output |
| `draft_input_ids` | [num_draft_tokens] | `int32_t` | Draft tokens to be leveraged in the generation phase to potentially generate multiple output tokens in one in-flight batching iteration |
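The `bad_words_list` and `stop_words_list` tensors above use a 2-row layout. The sketch below assumes the packing convention used by FasterTransformer-style runtimes (row 0 holds all word token IDs concatenated, row 1 holds the exclusive end offset of each word, padded with -1 to match row 0's width); the helper name `pack_word_list` is invented for this example, so verify the exact encoding against the runtime's word-list documentation.

```python
# Sketch (assumed convention): pack a list of words (each a list of token
# IDs) into the 2-row layout used for bad_words_list / stop_words_list.
#   row 0: all token IDs, concatenated
#   row 1: exclusive end offset of each word, padded with -1 to row 0's width
def pack_word_list(words):
    flat, offsets, total = [], [], 0
    for word in words:
        flat.extend(word)
        total += len(word)
        offsets.append(total)
    offsets.extend([-1] * (len(flat) - len(offsets)))  # pad row 1 to match row 0
    return [flat, offsets]

# Two stop words: tokens [5, 9] and token [3]
packed = pack_word_list([[5, 9], [3]])
assert packed == [[5, 9, 3], [2, 3, -1]]
```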