The main class used to describe requests to `GptManager` is `InferenceRequest`. It is structured as a map of tensors and a `uint64_t` requestId.

The mandatory tensors required to create a valid `InferenceRequest` object are described below. Sampling Config parameters are documented in more detail here, so their descriptions are omitted from the table:
| Name | Shape | Type | Description |
|---|---|---|---|
| `request_output_len` | [1, 1] | `int32_t` | Max number of output tokens |
| `input_ids` | [1, num_input_tokens] | `int32_t` | Tensor of input tokens |
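As a concrete illustration of the mandatory tensors, the sketch below builds the tensor map for a minimal request with NumPy. Note this dictionary representation is only illustrative: the real `InferenceRequest` is a C++ class holding named tensors plus a requestId, and the Python types here are an assumption, not the actual API.

```python
import numpy as np

# Illustrative tensor map for a minimal request. Keys match the
# mandatory tensor names in the table above.
input_ids = np.array([[1, 7, 42, 1024]], dtype=np.int32)  # [1, num_input_tokens]
request_output_len = np.array([[64]], dtype=np.int32)     # [1, 1]

request = {
    "input_ids": input_ids,
    "request_output_len": request_output_len,
}
```

Both tensors carry a leading batch dimension of 1, since each `InferenceRequest` describes a single request.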
Optional tensors that can be supplied to `InferenceRequest` are shown below. Default values are specified where applicable:
| Name | Shape | Type | Description |
|---|---|---|---|
| `streaming` | [1] | `bool` | (Default=`false`) When `true`, stream out tokens as they are generated; when `false`, return only once the full generation has completed |
| `beam_width` | [1] | `int32_t` | (Default=1) Beam width for this request; set to 1 for greedy sampling |
| `temperature` | [1] | `float` | Sampling Config param: `temperature` |
| `runtime_top_k` | [1] | `int32_t` | Sampling Config param: `topK` |
| `runtime_top_p` | [1] | `float` | Sampling Config param: `topP` |
| `len_penalty` | [1] | `float` | Sampling Config param: `lengthPenalty` |
| `repetition_penalty` | [1] | `float` | Sampling Config param: `repetitionPenalty` |
| `min_length` | [1] | `int32_t` | Sampling Config param: `minLength` |
| `presence_penalty` | [1] | `float` | Sampling Config param: `presencePenalty` |
| `frequency_penalty` | [1] | `float` | Sampling Config param: `frequencyPenalty` |
| `random_seed` | [1] | `uint64_t` | Sampling Config param: `randomSeed` |
| `end_id` | [1] | `int32_t` | End token id |
| `pad_id` | [1] | `int32_t` | Pad token id |
| `embedding_bias` | [1] | `float` | Embedding bias |
| `bad_words_list` | [2, num_bad_words] | `int32_t` | Bad words list |
| `stop_words_list` | [2, num_stop_words] | `int32_t` | Stop words list |
| `prompt_embedding_table` | [1] | `float16` | P-tuning prompt embedding table |
| `prompt_vocab_size` | [1] | `int32_t` | P-tuning prompt vocab size |
| `lora_weights` | [num_lora_modules_layers, D x Hi + Ho x D] | `float` (model data type) | Weights for a LoRA adapter; see the LoRA docs for more details |
| `lora_config` | [3] | `int32_t` | LoRA configuration tensor: [module_id, layer_idx, adapter_size (D, a.k.a. the R value)]; see the LoRA docs for more details |
| `return_log_probs` | [1] | `bool` | When `true`, include log probs in the output |
| `return_context_logits` | [1] | `bool` | When `true`, include context logits in the output |
| `return_generation_logits` | [1] | `bool` | When `true`, include generation logits in the output |
| `draft_input_ids` | [num_draft_tokens] | `int32_t` | Draft tokens to be leveraged in the generation phase to potentially produce multiple output tokens in one in-flight batching iteration |
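To illustrate the shapes of a few optional tensors, the following sketch extends the same dictionary-style tensor map. As before, the dictionary form is only an illustration of the tensor names, shapes, and dtypes in the table, not the actual C++ API; the row-0/row-1 encoding shown for `stop_words_list` is an assumption based on TensorRT-LLM's word-list convention.

```python
import numpy as np

# Illustrative optional tensors; names, shapes, and dtypes follow the table above.
optional = {
    "streaming": np.array([True]),                      # [1], bool
    "beam_width": np.array([1], dtype=np.int32),        # [1], greedy sampling
    "temperature": np.array([0.7], dtype=np.float32),   # [1]
    "runtime_top_k": np.array([40], dtype=np.int32),    # [1]
    # [2, num_stop_words]: assumed layout with flattened token ids in row 0
    # and per-word end offsets in row 1, padded with -1.
    "stop_words_list": np.array([[13, 2, 0],
                                 [1, 3, -1]], dtype=np.int32),
}
```

Per-request sampling tensors such as `temperature` and `runtime_top_k` override the corresponding Sampling Config defaults for that request only.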