
# Inference Request

The main class used to describe requests to `GptManager` is `InferenceRequest`. It is structured as a map of named tensors plus a `uint64_t` `requestId`. The mandatory tensors required to create a valid `InferenceRequest` object are described below. Sampling Config parameters are documented in more detail here, so their descriptions are omitted from the table:

| Name | Shape | Type | Description |
| --- | --- | --- | --- |
| `request_output_len` | [1, 1] | `int32_t` | Maximum number of output tokens |
| `input_ids` | [1, num_input_tokens] | `int32_t` | Tensor of input tokens |
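For illustration, the mandatory tensors can be modeled as nested Python lists keyed by name. This is only a sketch of the tensor layout described in the table above, not the actual C++ `InferenceRequest` API; the helper name `make_minimal_request` is invented for this example.

```python
# Sketch: model the mandatory tensors of an InferenceRequest as a dict of
# nested lists. Shapes follow the table above; the real object is a C++
# map of named tensors plus a uint64_t requestId.
def make_minimal_request(request_id, input_tokens, max_output_len):
    return {
        "requestId": request_id,                   # uint64_t
        "request_output_len": [[max_output_len]],  # shape [1, 1], int32
        "input_ids": [input_tokens],               # shape [1, num_input_tokens], int32
    }

req = make_minimal_request(42, [101, 2023, 2003, 102], 64)
assert len(req["request_output_len"]) == 1       # first dim is 1
assert len(req["request_output_len"][0]) == 1    # second dim is 1
assert len(req["input_ids"][0]) == 4             # num_input_tokens
```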

Optional tensors that can be supplied to `InferenceRequest` are shown below. Default values, where applicable, are specified:

| Name | Shape | Type | Description |
| --- | --- | --- | --- |
| `streaming` | [1] | `bool` | (Default=`false`) When `true`, stream out tokens as they are generated; when `false`, return only once the full generation has completed |
| `beam_width` | [1] | `int32_t` | (Default=1) Beam width for this request; set to 1 for greedy sampling |
| `temperature` | [1] | `float` | Sampling Config param: `temperature` |
| `runtime_top_k` | [1] | `int32_t` | Sampling Config param: `topK` |
| `runtime_top_p` | [1] | `float` | Sampling Config param: `topP` |
| `len_penalty` | [1] | `float` | Sampling Config param: `lengthPenalty` |
| `repetition_penalty` | [1] | `float` | Sampling Config param: `repetitionPenalty` |
| `min_length` | [1] | `int32_t` | Sampling Config param: `minLength` |
| `presence_penalty` | [1] | `float` | Sampling Config param: `presencePenalty` |
| `frequency_penalty` | [1] | `float` | Sampling Config param: `frequencyPenalty` |
| `random_seed` | [1] | `uint64_t` | Sampling Config param: `randomSeed` |
| `end_id` | [1] | `int32_t` | End token ID |
| `pad_id` | [1] | `int32_t` | Pad token ID |
| `embedding_bias` | [1] | `float` | Embedding bias |
| `bad_words_list` | [2, num_bad_words] | `int32_t` | Bad words list |
| `stop_words_list` | [2, num_stop_words] | `int32_t` | Stop words list |
| `prompt_embedding_table` | [1] | `float16` | P-tuning prompt embedding table |
| `prompt_vocab_size` | [1] | `int32_t` | P-tuning prompt vocab size |
| `lora_weights` | [num_lora_modules_layers, D x Hi + Ho x D] | `float` (model data type) | Weights for a LoRA adapter; see the LoRA docs for more details |
| `lora_config` | [3] | `int32_t` | LoRA configuration tensor: [module_id, layer_idx, adapter_size (D, a.k.a. R value)]; see the LoRA docs for more details |
| `return_log_probs` | [1] | `bool` | When `true`, include log probs in the output |
| `return_context_logits` | [1] | `bool` | When `true`, include context logits in the output |
| `return_generation_logits` | [1] | `bool` | When `true`, include generation logits in the output |
| `draft_input_ids` | [num_draft_tokens] | `int32_t` | Draft tokens to be leveraged in the generation phase to potentially generate multiple output tokens in one in-flight batching iteration |
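The `bad_words_list` and `stop_words_list` tensors above use a 2-row layout. The sketch below assumes the packing convention used by FasterTransformer-style runtimes (row 0 holds all word token IDs concatenated, row 1 holds the exclusive end offset of each word, padded with -1 to match row 0's width); the helper name `pack_word_list` is invented for this example, so verify the exact encoding against the runtime's word-list documentation.

```python
# Sketch (assumed convention): pack a list of words (each a list of token
# IDs) into the 2-row layout used for bad_words_list / stop_words_list.
#   row 0: all token IDs, concatenated
#   row 1: exclusive end offset of each word, padded with -1 to row 0's width
def pack_word_list(words):
    flat, offsets, total = [], [], 0
    for word in words:
        flat.extend(word)
        total += len(word)
        offsets.append(total)
    offsets.extend([-1] * (len(flat) - len(offsets)))  # pad row 1 to match row 0
    return [flat, offsets]

# Two stop words: tokens [5, 9] and token [3]
packed = pack_word_list([[5, 9], [3]])
assert packed == [[5, 9, 3], [2, 3, -1]]
```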