
Separate workspace/allocations from graph #57

Open
jerinphilip opened this issue Sep 21, 2021 · 7 comments
Labels
enhancement New feature or request


Feature description

In the current state of browsermt/marian-dev, the concept of a workspace, which manages the allocation of tensors, sits behind a graph accessible to the library API that bergamot-translator uses. This leads to a temporarily inefficient implementation of multiple-model handling (browsermt/bergamot-translator#210), where workspace memory grows in proportion to the number of active models.

@XapaJIaMnu and @kpu have previously solved swapping multiple models by swapping tensors onto an active graph. This is "dynamic", and a reference implementation is available at https://github.com/kpu/marian-dev/blob/dynamic_swap_mvp/src/translator/swappable.cpp. While this is doable without much expense when models share an architecture, a change in architecture involves reconstructing the graph (e.g., a tied-embedding model swapped out for a non-tied-embedding model).

It would be better to bind the workspace to the active threads/workers instead, keeping the graph and architecture separate, so that memory usage does not blow up beyond what is actually required.

This issue is intended to investigate how best to make the modifications to solve the above problem in this repository.

/cc @graemenail

@jerinphilip jerinphilip added the enhancement New feature or request label Sep 21, 2021

jerinphilip commented Sep 21, 2021

Found this function, which replaces tensors_ with something supplied externally:

```cpp
void reuseWorkspace(Ptr<ExpressionGraph> graph) {
  tensors_ = graph->tensors_;
}
```

The following example snippet from the translator is what eventually makes its way into bergamot-translator.

```cpp
auto graph = New<ExpressionGraph>(true);
auto prec = options_->get<std::vector<std::string>>("precision", {"float32"});
graph->setDefaultElementType(typeFromString(prec[0]));
graph->setDevice(device);
graph->getBackend()->configureDevice(options_);
graph->reserveWorkspaceMB(options_->get<size_t>("workspace"));
graphs_[id] = graph;

std::vector<Ptr<Scorer>> scorers;
if(options_->get<bool>("model-mmap", false)) {
  scorers = createScorers(options_, model_mmaps_);
} else {
  scorers = createScorers(options_, model_items_);
}

for(auto scorer : scorers) {
  scorer->init(graph);
  if(shortlistGenerator_)
    scorer->setShortlistGenerator(shortlistGenerator_);
}

scorers_[id] = scorers;
graph->forward();
```

This is super stateful; below is an attempt at figuring out what is happening:

```cpp
// Creates an empty vessel.
auto graph = New<ExpressionGraph>(/*inference=*/true);
auto prec = options_->get<std::vector<std::string>>("precision", {"float32"});
graph->setDefaultElementType(typeFromString(prec[0]));

// Configures the device (CPU/GPU) and prepares a Tensors/TensorAllocator.
graph->setDevice(device);

// GEMM configuration.
graph->getBackend()->configureDevice(options_);

// Grows the static storage on the Tensors/TensorAllocator and manages
// allocations ahead of time, to avoid repeated malloc costs.
graph->reserveWorkspaceMB(options_->get<size_t>("workspace"));

// Could this be replaced with a graph supplied externally by the worker?
// Overwrites an already-allocated something, which hasn't grown to
// workspace size. (Why are things like this?)
graph->reuseWorkspace(workerGraph);

// Later down the line:
scorer->init(graph);
// ^ Only here do the notions of an architecture (transformer, s2s, etc.)
// come in, so this should be where the graph is constructed and allocations
// are done. When are these deallocated?
```

The above has to be "dynamic", i.e., check whether the worker already has a graph (and related state) for the particular TranslationModel, and trigger the creation routine only if not. If a thread never sees a TranslationModel, it then creates no unnecessary allocations for it.

We additionally want the allocations created on this workerGraph by a given TranslationModel M to be freed when M's lifetime ends. Figuring out lifetimes seems tricky with New<...> and the graph being handed around everywhere. Can I trust whatever memory management is inside to figure out whether something is in use and free space accordingly?

Current understanding is that scorer->init(graph) triggers the creation of architecture notions on the graph.

There is long-term and short-term memory, of which long-term corresponds to constant nodes that are not "parameters" (can't all weights be treated that way in inference mode?). Short-term memory is everything else, so that should be gradients, parameters being modified, inputs and the like. I'm guessing keeping the long-term memory around has performance benefits.

Where are these getting freed?

The following clears everything except long-term storage.

```cpp
void clear() {
  // clear everything apart from parameters and memoized nodes
  count_ = 0;
  nodesForward_.clear();
  nodesBackward_.clear();
  topNodes_.clear();
  tensors_->clear();
}
```

I understand this to be triggered on every BeamSearch call:

```cpp
for(auto scorer : scorers_) {
  scorer->clear(graph);
}
```

Marian always operates with one model? Does this imply that the long-term memory created for constants or params stays alive forever?

```cpp
void clearLongtermMemory() { longterm_->clear(); }
```

The above function is available to clear everything in long-term memory, but appears unused in the source(?).


jerinphilip commented Sep 22, 2021

If I understand the graph subsystem to be an EDSL whose storage is some hash-map from string identifiers to the corresponding allocated tensors, there seems to be a way to hijack the creation of scorers onto the same graph by modifying fname below:

```cpp
for(auto ptr : ptrs) {
  std::string fname = "F" + std::to_string(i);
  // load options specific for the scorer
  auto modelOptions = New<Options>(options->clone());
  if(!options->get<bool>("ignore-model-config")) {
    YAML::Node modelYaml;
    io::getYamlFromModel(modelYaml, "special:model.yml", ptr);
    if(!modelYaml.IsNull()) {
      LOG(info, "Loaded model config");
      modelOptions->merge(modelYaml, true);
    } else {
      LOG(warn, "No model settings found in model file");
    }
  }
  scorers.push_back(scorerByType(fname, weights[i], ptr, modelOptions));
```

I expect all further parameters to be prefixed (namespaced) by this string.

Do I create a new method that finds all variable names by prefix and triggers a free alongside TranslationModel destruction? Seems like an idea: I can filter the keys in storage and manually free them.

@jerinphilip

Global config object fail. I'm guessing I have to fix the API to take only the parameters required (this part is common to TranslationModels; I no longer have an Options object to create this from. Going to do a horror for now).

```cpp
void configureDevice(Ptr<const Options> options) override {
  setClip(options->get<float>("clip-gemm"));
  setGemmPrecision(options);
  setLegacyBatchedGemm(options->get<bool>("use-legacy-batching"));
}
```


jerinphilip commented Sep 22, 2021

The case of same-architecture models being initialized on the same graph: if the param dims match, they get loaded and all sorts of funny things can happen?

```cpp
(p, params) = findParams(name, elementType, typeSpecified);
if(!params) {
  params = New<Parameters>(elementType);
  params->init(backend_);
  paramsByElementType_.insert({elementType, params});
} else {
  if(p) {
    // if yes add to tape and return
    ABORT_IF(shape != p->shape(),
             "Requested shape {} for existing parameter '{}' does not match "
             "original shape {}",
             shape,
             name,
             p->shape());
    p->setTrainable(!fixed);
    add(p);
    return p;
  }
```

This can potentially be fixed by providing an alternate way for createScorers to namespace these correctly?

```cpp
for(auto ptr : ptrs) {
  std::string fname = "F" + std::to_string(i);
  // load options specific for the scorer
  auto modelOptions = New<Options>(options->clone());
  if(!options->get<bool>("ignore-model-config")) {
    YAML::Node modelYaml;
    io::getYamlFromModel(modelYaml, "special:model.yml", ptr);
    if(!modelYaml.IsNull()) {
      LOG(info, "Loaded model config");
      modelOptions->merge(modelYaml, true);
    } else {
      LOG(warn, "No model settings found in model file");
    }
  }
  scorers.push_back(scorerByType(fname, weights[i], ptr, modelOptions));
```


jerinphilip commented Sep 23, 2021

@XapaJIaMnu suspects the above namespace hijack might not be feasible, due to possible autodiff assertions that all nodes in the graph be connected.

In this case, the map from variable names to tensors lives with the expression graph, at:

```cpp
ElementTypeParamsMap paramsByElementType_;
```

In that case, we could have a dummy graph at each worker (workerGraph), place another graph with each TranslationModel, and share storage by grabbing it from workerGraph with the following function?

```cpp
void reuseWorkspace(Ptr<ExpressionGraph> graph) {
  tensors_ = graph->tensors_;
}
```

Nobody appears to be using the above function, at least in master, so this is uncharted territory. On the bright side, we can do with this function what we want?

@graemenail mentioned a concern that this might not include the cache, but it looks like it does:

```cpp
class Tensors {
private:
  Ptr<TensorAllocator> tensors_;
  Ptr<TensorAllocator> cache_;
```


jerinphilip commented Oct 5, 2021

> There is a long-term and short-term memory of which long-term corresponds to constant nodes which are not "parameters" (Can't all weights be treated thus in inference mode?).

So in inference mode, params become memoized... possibly long-term, and stick around forever...

```cpp
ParamNode::ParamNode(Ptr<ExpressionGraph> graph,
                     const Shape& shape,
                     const Ptr<inits::NodeInitializer>& init,
                     Type valueType,
                     bool fixed)
    : Node(graph, shape, valueType),
      init_(init),
      initialized_(false) {
  init_->setAllocator(graph->allocator());
  setTrainable(!fixed);
  setMemoize(graph->isInference());
}
```

@jerinphilip

browsermt/bergamot-translator#225 tried to share workspaces (which are identified as TensorAllocators) and ran into tensors that are active during translation; clearing these manually would corrupt the translations that follow.

While I can identify which tensors are associated with which models, I'd need mutexes to protect the TensorAllocator while clearing the tensors associated with a model (workspaces are associated with threads, hence races). @kpu's recommendation is to get the processed form (after conversion to intgemm or whatever), store it as const for the lifetime of the model, and let all threads read-access it as needed. When the model is destructed, workers hold no shared_ptrs, and it is safe to clear these tensors since no translation with this model is active and/or waiting.

Development is on pause until the higher powers decide the right direction/approach.
