Stream ModelOutputs #29545

Closed. dmarx wants to merge 49 commits.
Conversation

dmarx (Author) commented Mar 8, 2024

What does this PR do?

(NB: this PR is still a work in progress, but it's close to ready.)

The "streaming" feature for text generation is currently constrained to streaming token ids. If a user requests an output dict, the stream still only yields token ids, without other attributes that may have been requested, such as scores, raw logits, or activations. Only after the stream has been consumed does the function return its final output, which is the originally requested output dict.

This PR aligns the return type of the streamer with the requested return type by encapsulating the logic that determines how the return value is constructed. In doing so, it also lets users stream richer representations than just token ids. EDIT: Additionally, this exposes a useful probe for testing, as demonstrated by the discovery of #29551.
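
For context, here is a minimal sketch of the current behavior (the checkpoint name is just illustrative): even with an output dict requested, the stream only carries tokens, and the richer output is only available from generate()'s return value after streaming has finished.

```python
from threading import Thread

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

tok = AutoTokenizer.from_pretrained("gpt2")  # illustrative checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("Hello there,", return_tensors="pt")

streamer = TextIteratorStreamer(tok, skip_prompt=True)
gen_kwargs = dict(
    **inputs,
    max_new_tokens=20,
    streamer=streamer,
    return_dict_in_generate=True,  # an output dict is requested...
    output_scores=True,
)
thread = Thread(target=model.generate, kwargs=gen_kwargs)
thread.start()

for text_piece in streamer:
    # ...but the stream itself only carries decoded token text; the scores
    # are only available in generate()'s final return value, which this
    # background thread discards.
    print(text_piece, end="")
thread.join()
```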

Summary of changes:

  • Adds a ._prepare_output() method to GenerationMixin that encapsulates the output-formatting logic
  • ._prepare_output() replaces the nested conditionals that previously built return values (beam samplers excluded)
  • Streamer.put() now receives the result of the ._prepare_output() invocation rather than being restricted to a token-id tensor (see the usage sketch below)
  • Inputs are no longer moved to the CPU before being passed to Streamer.put()
    • When desired, responsibility for moving the tensor is delegated to the provided streamer
  • Adds OutputStreamer and OutputIteratorStreamer classes
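
A rough sketch of the intended usage with the new OutputIteratorStreamer (the import path and the per-step attribute names shown here are assumptions and may change as the PR evolves):

```python
from threading import Thread

from transformers import AutoModelForCausalLM, AutoTokenizer
# hypothetical import path: the PR adds the class alongside the existing streamers
from transformers.generation.streamers import OutputIteratorStreamer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("Hello there,", return_tensors="pt")

streamer = OutputIteratorStreamer()
gen_kwargs = dict(
    **inputs,
    max_new_tokens=20,
    streamer=streamer,
    return_dict_in_generate=True,
    output_scores=True,
    output_logits=True,
)
Thread(target=model.generate, kwargs=gen_kwargs).start()

for step in streamer:
    # each streamed step mirrors the requested output dict, so scores and raw
    # logits (attribute names assumed) are available while generation proceeds
    print(step.sequences, step.scores)
```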

Before submitting

  • Did you read the contributor guideline, Pull Request section?
  • Did you write any new necessary tests?
    • OutputIteratorStreamerTester still needs to be modularized with fixtures and DRYed
  • test against hydra-node
  • fix inconsistent list output from OutputStreamer.on_ready
  • flesh out rest of intermediate return values
    • scores
    • logits
    • attention
    • hidden
    • NB: not currently streaming KV cache and encoder states
  • add some kind of "output formatter" factory to ensure outputs are uniform across the class without repeating a ton of logic
    • encapsulate output construction logic in a function
    • refactor to use function
    • add tests validating output type consistency and parity with streamer.put()
  • break out tests into more isolated test cases (rather than monolithic permutation) for better reporting
  • Did you update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
    • Send the CW documentation team a CR
    • Docs pages updated
    • docstrings
    • typehints
    • explicit input args for _prepare_output() (maybe it's fine as is?)
  • fix import
  • rebase onto main
  • fix sigterm on hydra-node
  • revisit ECHO behavior? (see the sketch after this list)
    • on our end, just skip over streamed outputs when output.scores is None
    • should be sufficient on HF's end if we just document that the ECHO response will have null scores/logits/activations
  • reimplement the existing TextIteratorStreamer using the new OutputIteratorStreamer
  • Fix support for assisted decoding
    • I propose making this out of scope for this PR and addressing it in a future PR, similar to how streaming currently doesn't support beam decoding.
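
For the ECHO item above, a minimal sketch of the consumer-side handling we have in mind (the scores attribute follows the usual generate output-dict convention; handle() is a hypothetical downstream consumer):

```python
def consume_stream(streamer, handle):
    """Skip echoed prompt steps, which carry no predictions, and forward the rest."""
    for step in streamer:
        if step.scores is None:  # ECHO steps arrive with null scores/logits/activations
            continue
        handle(step)
```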

Who can review?

@gante

dmarx mentioned this pull request Mar 8, 2024

dmarx (Author) commented Mar 8, 2024

NB: I think I've uncovered an issue in the contrastive decoding implementation. Currently, scores and (raw) logits are the same values, but they should not be. I think this is because the scores are warped by logic encapsulated in the _ranking_fast function. I currently have this PR set up so that the streaming outputs match the baseline, but I achieved that by providing the same values to the streamed logits and scores attributes.
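
To make the expected distinction concrete, here is a toy sketch (not the contrastive-search code itself) of how processed scores normally diverge from raw logits once any warping is applied:

```python
import torch
from transformers import LogitsProcessorList, TemperatureLogitsWarper

raw_logits = torch.tensor([[2.0, 1.0, 0.5]])               # what output_logits should report
warper = LogitsProcessorList([TemperatureLogitsWarper(0.7)])
scores = warper(torch.tensor([[0]]), raw_logits.clone())   # what output_scores should report

print(raw_logits)  # tensor([[2.0000, 1.0000, 0.5000]])
print(scores)      # tensor([[2.8571, 1.4286, 0.7143]]) -- warped, so not equal to the raw logits
```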

gante (Member) commented Mar 12, 2024

Hi @dmarx 👋 Thank you for opening this PR!

We're not accepting complex updates to streaming at the moment, as we have an ongoing redesign that is not far from landing (and it touches other non-streaming goals for generate).

In a nutshell, we are going to add the option to yield stuff from generate :)

dmarx (Author) commented Mar 12, 2024

Nice! Looking forward to the change; fingers crossed that users will be able to yield output dicts with scores and logits in addition to token_ids (i.e. the motivation for this PR) :)

Any chance you could estimate an ETA for the yield from generate change?

dmarx (Author) commented Mar 13, 2024

Actually, better yet: @gante, could you maybe point me to the issue or PR I should follow to keep tabs on the "yield from generate" updates?

gante (Member) commented Mar 14, 2024

@dmarx Here's the roadmap:

  1. Add support for torch.compile (under way; I estimate 1-2 months of additional work). This ensures we have a generate structure that can be optimized, as well as a test suite to prevent regressions.
  2. Generate blockwise refactor. Currently generate is a monolith, which prevents adding new modalities or optionally changing return into yield without massive if/else blocks in many places.
  3. Streaming 2.0. Now that we have optimization features and a lego-style generate function, enable a codepath with yield (see the sketch below).

No trackers for 2. and 3. yet. I'd estimate 3 months super optimistically, 6 months if a flurry of new generation techniques and model modifications come out in the near future :)
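
Purely as an illustration of the idea in step 3 (not the planned API), a greedy decoding loop that yields per-step outputs instead of returning everything at the end might look roughly like this:

```python
import torch

def generate_stream(model, input_ids, max_new_tokens=20):
    """Illustrative only: greedy decoding that yields each step's outputs."""
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(input_ids).logits[:, -1, :]         # raw logits for the last position
            next_token = logits.argmax(dim=-1, keepdim=True)   # greedy pick
            input_ids = torch.cat([input_ids, next_token], dim=-1)
            yield {"token_ids": next_token, "logits": logits}  # consumed step by step by the caller
```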

dmarx (Author) commented Mar 14, 2024

@gante Refactoring messy research code is sorta my jam (just ask poli). Let me know if I could help accelerate some of this roadmap; I'd be happy to help with (2) if you can elaborate on your design plan a bit more.

Also, let me know if it would be helpful to re-articulate this PR's motivations as an issue for your user story tracking. I can just leave this PR open if you prefer.

gante (Member) commented Mar 14, 2024

@dmarx extra hands would indeed be handy (pun intended)!

I'm going to chat with @zucchini-nlp soon (tomorrow?) so we can gather a set of requirements and then publicly share a concrete plan for the factorization of generate. I'm sure some of the tasks can be done in parallel with the torch.compile goal without causing (big) conflicts. In fact, the refactor is already in motion -- this PR was written with the refactor in mind 😉

If I don't reply within a week, please don't hesitate to ping me -- I do want to take you up on your offer!

dmarx (Author) commented Mar 29, 2024

@gante @zucchini-nlp bump re the GenerationMixin.generate refactoring roadmap and design plans

gante (Member) commented Apr 18, 2024

@dmarx Not forgotten; we are finishing up gathering requirements internally across the different teams/repos :)

dmarx (Author) commented May 9, 2024

@gante just checking in.

gante (Member) commented May 16, 2024

@dmarx #30810 :D

dmarx (Author) commented May 20, 2024

I had kept this open for communication purposes; closing in favor of #30810.
