Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CPU Kernel Tests #1439

Open
wants to merge 25 commits into
base: master
Choose a base branch
from
Open

CPU Kernel Tests #1439

wants to merge 25 commits into from

Conversation

oOTigger
Copy link
Contributor

@oOTigger oOTigger commented Jul 11, 2024

Tests for:

  • cast_kernels
  • reverse_kernels
  • combine_kernels
  • replicate_kernels
  • managed_ff_stream
  • managed_per_device_handle

Changes to kernels/test_utils, GenericTensorAccessors, and the addition of local_cpu_allocator


This change is Reviewable

@oOTigger oOTigger added repo-refactor kernels Kernels library labels Jul 11, 2024
@oOTigger oOTigger self-assigned this Jul 11, 2024
Copy link

codecov bot commented Jul 11, 2024

Codecov Report

Attention: Patch coverage is 11.02662% with 234 lines in your changes missing coverage. Please review.

Project coverage is 77.55%. Comparing base (1d5140d) to head (51c3eb7).

Files with missing lines Patch % Lines
lib/kernels/src/accessor.cc 23.25% 66 Missing ⚠️
lib/kernels/src/cpu/cast_kernels.cc 0.00% 23 Missing ⚠️
lib/kernels/include/kernels/accessor.h 0.00% 21 Missing ⚠️
lib/kernels/src/cpu/reverse_kernels.cc 0.00% 20 Missing ⚠️
lib/kernels/src/managed_per_device_ff_handle.cc 0.00% 20 Missing ⚠️
lib/kernels/src/cpu/combine_kernels.cc 0.00% 17 Missing ⚠️
lib/kernels/src/cpu/replicate_kernels.cc 0.00% 17 Missing ⚠️
lib/kernels/src/local_cpu_allocator.cc 0.00% 15 Missing ⚠️
lib/kernels/src/managed_ff_stream.cc 0.00% 10 Missing ⚠️
lib/local-execution/src/ops/linear.cc 0.00% 9 Missing ⚠️
... and 5 more
Additional details and impacted files
@@                Coverage Diff                @@
##           repo-refactor    #1439      +/-   ##
=================================================
- Coverage          78.16%   77.55%   -0.61%     
=================================================
  Files                860      866       +6     
  Lines              27994    28374     +380     
  Branches             770      791      +21     
=================================================
+ Hits               21881    22006     +125     
- Misses              6113     6368     +255     
Flag Coverage Δ
unittests 77.55% <11.02%> (-0.61%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
lib/kernels/include/kernels/allocation.h 83.33% <ø> (ø)
lib/kernels/include/kernels/attention_kernels.h 0.00% <ø> (ø)
lib/kernels/include/kernels/local_cuda_allocator.h 0.00% <ø> (ø)
lib/kernels/include/kernels/transpose_kernels.h 0.00% <ø> (ø)
lib/kernels/src/allocation.cc 72.72% <100.00%> (+10.22%) ⬆️
lib/kernels/src/array_shape.cc 20.83% <ø> (ø)
...tion/include/local-execution/local_cpu_allocator.h 100.00% <ø> (ø)
...cution/include/local-execution/tracked_allocator.h 0.00% <ø> (ø)
lib/local-execution/src/local_cpu_allocator.cc 57.14% <100.00%> (+7.14%) ⬆️
...ocal-execution/src/local_task_argument_accessor.cc 70.68% <100.00%> (ø)
... and 16 more

... and 113 files with indirect coverage changes

@lockshaw lockshaw marked this pull request as ready for review July 14, 2024 06:12
Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 23 files at r1, 21 of 29 files at r2, 12 of 12 files at r3, all commit messages.
Reviewable status: all files reviewed, 18 unresolved discussions (waiting on @oOTigger and @reyna-abhyankar)


lib/kernels/include/kernels/cast_kernels_cpu.h line 15 at r3 (raw file):

                    GenericTensorAccessorW const &output,
                    DataType input_type,
                    DataType output_type);

Prefer function names over extra namespaces, the namespaces in kernels are more of a legacy holdover than something that should be used more

Suggestion:

void cpu_forward_kernel(GenericTensorAccessorR const &input,
                    GenericTensorAccessorW const &output,
                    DataType input_type,
                    DataType output_type);

lib/kernels/src/local_cpu_allocator.cc line 6 at r3 (raw file):

namespace FlexFlow {
void *LocalCPUAllocator::allocate(size_t requested_memory_size) {
  void *ptr = calloc(1, requested_memory_size);

Just use malloc

Code quote:

calloc(

lib/kernels/src/local_cpu_allocator.cc line 28 at r3 (raw file):

LocalCPUAllocator::~LocalCPUAllocator() {
  for (auto ptr : ptrs) {

Suggestion:

  for (void *ptr : this->ptrs) {

lib/kernels/src/local_cuda_allocator.cc line 8 at r3 (raw file):

  void *ptr;
  checkCUDA(cudaMalloc(&ptr, requested_memory_size));
  checkCUDA(cudaMemset(ptr, 0, requested_memory_size));

I don't think code should assume that the allocated memory has been zero'd, unless there's some reason I'm not thinking of?


lib/kernels/src/cpu/reverse_kernels.cc line 14 at r3 (raw file):

                            coord_t in_blk_size) {
  coord_t total_elements = num_out_blks * reverse_dim_size * in_blk_size;
  for (coord_t i = 0; i < total_elements; ++i) {

This is really hard to read right now, make more readable (maybe pull out some helper functions or something?)


lib/kernels/src/cuda/ops/reverse_kernels.cu line 41 at r3 (raw file):

// }

/* I mentioned this earlier, but I still think the reverse_forward_kernel code

See my message in slack


lib/kernels/test/src/test_batch_norm_kernel.cc line 91 at r3 (raw file):

      std::vector<float> host_bias_grad_data =
          load_accessor_data<DataType::FLOAT>(
              read_only_accessor_from_write_accessor(bias_grad_accessor));

Why is read_only_accessor_from_write_accessor necessary here? Aren't GenericTensorWs assumed to be RW @reyna-abhyankar ? If so, add an overload to load_accessor_data


lib/kernels/test/src/test_combine_kernel.cc line 64 at r3 (raw file):

      GenericTensorAccessorR input_accessor_gpu =
          read_only_accessor_from_write_accessor(
              create_random_filled_accessor_w(input_shape, gpu_allocator));

Why not also define a create_random_filled_accessor_r?

Code quote:

create_random_filled_accessor_w

lib/kernels/test/src/test_utils.h line 11 at r3 (raw file):

#include <random>

enum class GpuDirection {

If we're going to have GenericTensorAccessors allocated on different devices it would probably be good to have a field on GenericTensorAccessor that tracks that, so it can at least be checked at runtime if you try to access it incorrectly, etc.?

This would also allow you to infer a lot of locations (CPU vs GPU) rather than having to pass additional arguments to the transfer functions


lib/kernels/test/src/test_utils.h line 12 at r3 (raw file):

enum class GpuDirection {
  HostToDevice = 0,

Why are concrete values needed?


lib/kernels/test/src/test_utils.h line 65 at r3 (raw file):

template <typename IDT, typename ODT, typename F>
GenericTensorAccessorW create_transformed_accessor_w(TensorShape const &shape,

It seems like this is just being used to create a random tensor, seems like that might be a better behavior?


lib/kernels/test/src/test_utils.h line 86 at r3 (raw file):

}

template <DataType DT>

Where possible, probably best to add a dynamically-dispatched function as well as one that takes the argument by template. This isn't possible for some functions (such as those that return a type based on the template type), but for some of these in this function it seems possible.


lib/kernels/test/src/test_utils.h line 88 at r3 (raw file):

template <DataType DT>
GenericTensorAccessorW
    copy_tensor_between_memories(GenericTensorAccessorR accessor,

The behavior of this method seems rather confusing, I'd stick to explicitly stating the direction you'd like to transfer (either from Host to Device, or reversed)/


lib/kernels/test/src/test_utils.h line 89 at r3 (raw file):

GenericTensorAccessorW
    copy_tensor_between_memories(GenericTensorAccessorR accessor,
                                 TensorShape const &shape,

GenericTensorAccessor already has a shape, no need to pass an additional shape separately


lib/kernels/test/src/test_utils.h line 104 at r3 (raw file):

}

template <DataType DT>

Pass as a normal argument, not a template argument


lib/kernels/test/src/test_utils.h line 116 at r3 (raw file):

template <DataType DT>
std::vector<real_type<DT>> load_accessor_data(GenericTensorAccessorR accessor,
                                              bool on_host = false) {

Change to an enum and make passing required/copy

Code quote:

bool on_host = false

lib/kernels/test/src/test_utils.cc line 8 at r3 (raw file):

  GenericTensorAccessorW accessor = allocator.allocate_tensor(shape);
  size_t volume = accessor.shape.num_elements();
  std::vector<float> host_data(volume);

Can't TensorShape have different datatypes? If you're going to only do this for float you should at the very least have a check


lib/kernels/test/src/test_utils.cc line 17 at r3 (raw file):

  }

  transfer_memory(static_cast<float *>(accessor.ptr),

Would it be better to have a function that transfers a GenericTensorAccessor, rather than having to do casting to transfer each time? That also saves you from having to pass volume, etc.

Marsella8 and others added 2 commits July 31, 2024 04:52
…lexflow#1429)

* initial commit for machine view adjacent modules

* Formatting

* Tests for new machine_view.cc functions

* formatting

* Minor Test correction

* formatting

* PR fixes

* PR Fixes

---------

Co-authored-by: Pietro Max Marsella <[email protected]>
Copy link
Contributor Author

@oOTigger oOTigger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 50 files reviewed, 18 unresolved discussions (waiting on @lockshaw and @reyna-abhyankar)


lib/kernels/include/kernels/cast_kernels_cpu.h line 15 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Prefer function names over extra namespaces, the namespaces in kernels are more of a legacy holdover than something that should be used more

Done.


lib/kernels/src/local_cpu_allocator.cc line 6 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Just use malloc

Done.


lib/kernels/src/local_cpu_allocator.cc line 28 at r3 (raw file):

LocalCPUAllocator::~LocalCPUAllocator() {
  for (auto ptr : ptrs) {

Done.


lib/kernels/src/local_cuda_allocator.cc line 8 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I don't think code should assume that the allocated memory has been zero'd, unless there's some reason I'm not thinking of?

Ended up adding an allocate_and_zero function to the Allocator class, not sure if this is too big of a change though?


lib/kernels/src/cpu/reverse_kernels.cc line 14 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

This is really hard to read right now, make more readable (maybe pull out some helper functions or something?)

Is this "too" broken down?


lib/kernels/src/cuda/ops/reverse_kernels.cu line 41 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

See my message in slack

Currently talking to Xinhao


lib/kernels/test/src/test_batch_norm_kernel.cc line 91 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why is read_only_accessor_from_write_accessor necessary here? Aren't GenericTensorWs assumed to be RW @reyna-abhyankar ? If so, add an overload to load_accessor_data

Done.


lib/kernels/test/src/test_combine_kernel.cc line 64 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why not also define a create_random_filled_accessor_r?

Done.


lib/kernels/test/src/test_utils.h line 11 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

If we're going to have GenericTensorAccessors allocated on different devices it would probably be good to have a field on GenericTensorAccessor that tracks that, so it can at least be checked at runtime if you try to access it incorrectly, etc.?

This would also allow you to infer a lot of locations (CPU vs GPU) rather than having to pass additional arguments to the transfer functions

Ideally, this is using the same enum class as in allocation.h, but for some reason, whenever I define that class in accessor.h and try and include it as part of the class I get a bunch of compile errors? I think IT has to do with FF_VISITABLE_STRUCT_NONSTANDARD_CONSTRUCTION? Is there some rule that I'm not aware of of what can and can't be passed in?

I also defined an initializer list as a lot of places in the code base reference GenericTensorAccessors and they initialize it using a three parameter constructor (ie. {datatype, shape, pointer}) so needed to add a way to default tensor location. Is it safe to say that previous references are on GPU or do I need to sift through this manually?

In addition, allocator's also now store additional state as well on if they allocate things on CPU or GPU so we can initialize the values for GenericTensorAccessors.


lib/kernels/test/src/test_utils.h line 12 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why are concrete values needed?

Done.


lib/kernels/test/src/test_utils.h line 65 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

It seems like this is just being used to create a random tensor, seems like that might be a better behavior?

This is mainly for debugging purposes, so can create like an iota filled accessor or etc.


lib/kernels/test/src/test_utils.h line 88 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

The behavior of this method seems rather confusing, I'd stick to explicitly stating the direction you'd like to transfer (either from Host to Device, or reversed)/

Done.


lib/kernels/test/src/test_utils.h line 89 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

GenericTensorAccessor already has a shape, no need to pass an additional shape separately

Done.


lib/kernels/test/src/test_utils.h line 104 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Pass as a normal argument, not a template argument

Done.


lib/kernels/test/src/test_utils.h line 116 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Change to an enum and make passing required/copy

Just removed entirely


lib/kernels/test/src/test_utils.cc line 8 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Can't TensorShape have different datatypes? If you're going to only do this for float you should at the very least have a check

Done.


lib/kernels/test/src/test_utils.cc line 17 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Would it be better to have a function that transfers a GenericTensorAccessor, rather than having to do casting to transfer each time? That also saves you from having to pass volume, etc.

Done.

Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 48 of 48 files at r4, all commit messages.
Reviewable status: all files reviewed, 35 unresolved discussions (waiting on @oOTigger and @reyna-abhyankar)


lib/kernels/include/kernels/accessor.h line 31 at r4 (raw file):

  double *get_double_ptr() const;
  half *get_half_ptr() const;

Code snippet:

GenericTensorAccessorW() = delete;

lib/kernels/include/kernels/accessor.h line 33 at r4 (raw file):

  GenericTensorAccessorW(DataType dt,
                         ArrayShape sh,

Suggestion:

                         ArrayShape const &sh,

lib/kernels/include/kernels/accessor.h line 34 at r4 (raw file):

  GenericTensorAccessorW(DataType dt,
                         ArrayShape sh,
                         req<void *> p,

Suggestion:

                         void *p,

lib/kernels/include/kernels/accessor.h line 35 at r4 (raw file):

                         ArrayShape sh,
                         req<void *> p,
                         bool on_dev = true)

Suggestion:

  GenericTensorAccessorW(DataType data_type,
                         ArrayShape shape,
                         req<void *> ptr,
                         bool on_dev = true)

lib/kernels/include/kernels/accessor.h line 36 at r4 (raw file):

                         req<void *> p,
                         bool on_dev = true)
      : data_type(dt), shape(sh), ptr(p), on_device(on_dev) {}

Move to .cc file


lib/kernels/include/kernels/accessor.h line 41 at r4 (raw file):

  DataType data_type;
  ArrayShape shape;
  req<void *> ptr;

Suggestion:

  void * ptr;

lib/kernels/include/kernels/accessor.h line 42 at r4 (raw file):

  ArrayShape shape;
  req<void *> ptr;
  bool on_device;

Suggestion:

  DeviceType device_type;

lib/kernels/include/kernels/accessor.h line 43 at r4 (raw file):

  req<void *> ptr;
  bool on_device;
};

Suggestion:

  bool operator==(GenericTensorAccessorW const &) const;
  bool operator!=(GenericTensorAccessorW const &) const;

public:
  DataType data_type;
  ArrayShape shape;
  req<void *> ptr;
  bool on_device;
private:
  std::tuple<decltype(data_type) const &,
             decltype(shape) const &,
             decltype(ptr) const &,
             decltype(on_device) const &>
    tie() const;
};

// in .cc file
std::tuple<DataType const &,
           ArrayShape const &,
           void *,
           DeviceType const &>
    GenericTensorAccessorW::tie() const {
  return std::tie(this->data_type, this->shape, this->ptr, this->on_device);     
}

bool GenericTensorAccessorW::operator==(GenericTensorAccessorW const &other) const {
  return this->tie() == other.tie();
}

bool GenericTensorAccessorW::operator!=(GenericTensorAccessorW const &other) const {
  return this->tie() != other.tie(); 
}

lib/kernels/include/kernels/accessor.h line 45 at r4 (raw file):

};
FF_VISITABLE_STRUCT_NONSTANDARD_CONSTRUCTION(
    GenericTensorAccessorW, data_type, shape, ptr, on_device);

visitable is deprecated


lib/kernels/include/kernels/accessor.h line 64 at r4 (raw file):

  double const *get_double_ptr() const;
  half const *get_half_ptr() const;

Code snippet:

GenericTensorAccessorR() = delete;

lib/kernels/include/kernels/accessor.h line 65 at r4 (raw file):

  half const *get_half_ptr() const;

  GenericTensorAccessorR(DataType dt,

Move to .cc file


lib/kernels/include/kernels/accessor.h line 77 at r4 (raw file):

  bool on_device;
};
FF_VISITABLE_STRUCT_NONSTANDARD_CONSTRUCTION(

Make same tie changes as above to add ==, !=, hash back in (they're currently added by visitable)


lib/kernels/include/kernels/allocation.h line 14 at r4 (raw file):

struct IAllocator {
  virtual void *allocate(size_t) = 0;
  virtual void *allocate_and_zero(size_t) = 0;

Delete, if you want to zero the allocation you can zero it with some other function, don't make it part of the allocator interface


lib/kernels/include/kernels/allocation.h line 16 at r4 (raw file):

  virtual void *allocate_and_zero(size_t) = 0;
  virtual void deallocate(void *) = 0;

Code snippet:

virtual DeviceType get_allocation_device_type() = 0;

lib/kernels/include/kernels/allocation.h line 25 at r4 (raw file):

  GenericTensorAccessorW allocate_tensor(TensorShape const &tensor_shape);
  GenericTensorAccessorW
      allocate_tensor_and_zero(TensorShape const &tensor_shape);

Delete


lib/kernels/include/kernels/allocation.h line 28 at r4 (raw file):

  void *allocate(size_t mem_size);
  void *allocate_and_zero(size_t mem_size);

Delete


lib/kernels/include/kernels/allocation.h line 40 at r4 (raw file):

  Allocator(std::shared_ptr<IAllocator> ptr) : i_allocator(ptr){};

  AllocLocation alloc_location;

Delete


lib/kernels/include/kernels/local_cpu_allocator.h line 13 at r4 (raw file):

  void *allocate(size_t) override;
  void *allocate_and_zero(size_t) override;

Delete


lib/kernels/include/kernels/local_cuda_allocator.h line 13 at r4 (raw file):

  void *allocate(size_t) override;
  void *allocate_and_zero(size_t) override;

Delete


lib/kernels/src/array_shape.cc line 63 at r4 (raw file):

}

TensorShape get_tensor_shape(ArrayShape const &shape, DataType DT) {

Use the implementation from @reyna-abhyankar's branch: https://github.com/reyna-abhyankar/FlexFlow/blob/e1a8a01e9e4f782805aa62c163bf50799f0638b4/lib/kernels/src/array_shape.cc#L75-L78


lib/kernels/src/local_cpu_allocator.cc line 5 at r4 (raw file):

namespace FlexFlow {
void *LocalCPUAllocator::allocate(size_t requested_memory_size) {

Use the implementations from @reyna-abhyankar's branch: https://github.com/reyna-abhyankar/FlexFlow/blob/e1a8a01e9e4f782805aa62c163bf50799f0638b4/lib/local-execution/src/local_cpu_allocator.cc


lib/kernels/src/cpu/combine_kernels.cc line 4 at r4 (raw file):

#include "kernels/datatype_dispatch.h"

namespace FlexFlow {

FYI we're now on C++17, so you can instead write namespace FlexFlow::Kernels::Combine {


lib/kernels/src/cpu/reverse_kernels.cc line 14 at r3 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

Is this "too" broken down?

The comments are helpful, but I think the big thing that would help is the access-by-coordinate function for GenericTensorAccessor


lib/kernels/src/cuda/ops/reverse_kernels.cu line 41 at r3 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

Currently talking to Xinhao

Can you change your status on this comment then to "Working": see https://docs.reviewable.io/discussions.html#dispositions-and-resolution


lib/kernels/test/src/test_batch_norm_kernel.cc line 36 at r4 (raw file):

    GenericTensorAccessorW input_accessor =
        create_random_filled_accessor_w<DataType::FLOAT>(input_shape,

Why is the extra template parameter suddenly necessary?


lib/kernels/test/src/test_cast_kernel.cc line 90 at r4 (raw file):

          DataType::FLOAT,
          DataType::INT32);
      std::cout << "Before GPU load" << std::endl;

Delete


lib/kernels/test/src/test_cast_kernel.cc line 103 at r4 (raw file):

          DataType::FLOAT,
          DataType::INT32);
      std::cout << "Before CPU load" << std::endl;

Delete


lib/kernels/test/src/test_combine_kernel.cc line 111 at r4 (raw file):

              output_grad_accessor_gpu, cpu_allocator);
      GenericTensorAccessorW input_grad_accessor_cpu =
          cpu_allocator.allocate_tensor_and_zero(input_shape);

Or even add a helper function create_zero_filled_accessor_w that does these two operations

Suggestion:

      GenericTensorAccessorW input_grad_accessor_cpu =
          cpu_allocator.allocate_tensor(input_shape);
      fill_with_zeros(input_grad_accessor_cpu);

lib/kernels/test/src/test_reverse_kernels.cc line 147 at r4 (raw file):

      // Run CPU Cast Backward Kernel
      GenericTensorAccessorW output_grad_accessor_cpu =
          copy_tensor_between_memories<DataType::FLOAT>(

Add an explicit destination (semantics should either be copy to CPU or copy to GPU, not dependent on the current placement)

Code quote:

copy_tensor_between_memories<DataType::FLOAT>(

lib/kernels/test/src/test_utils.h line 11 at r3 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

Ideally, this is using the same enum class as in allocation.h, but for some reason, whenever I define that class in accessor.h and try and include it as part of the class I get a bunch of compile errors? I think IT has to do with FF_VISITABLE_STRUCT_NONSTANDARD_CONSTRUCTION? Is there some rule that I'm not aware of of what can and can't be passed in?

I also defined an initializer list as a lot of places in the code base reference GenericTensorAccessors and they initialize it using a three parameter constructor (ie. {datatype, shape, pointer}) so needed to add a way to default tensor location. Is it safe to say that previous references are on GPU or do I need to sift through this manually?

In addition, allocator's also now store additional state as well on if they allocate things on CPU or GPU so we can initialize the values for GenericTensorAccessors.

There shouldn't be that much code explicitly creating GenericTensorAccessors I think? Can you point me to some examples?


lib/kernels/test/src/test_utils.h line 65 at r3 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

This is mainly for debugging purposes, so can create like an iota filled accessor or etc.

Either specialize to a function that creates a random tensor, or pass the current tensor index to transform--the current abstraction is a bit awkwardly intermediate


lib/kernels/test/src/test_utils.h line 88 at r3 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

Done.

Still seeing the old behavior? The destination placement should be stated either as part of the function name or as an argument--if it's an argument, rename the function to copy_tensor_to_memory


lib/kernels/test/src/test_utils.h line 33 at r4 (raw file):

}

template <DataType DT>

Why not just get the datatype from the shape?


lib/kernels/test/src/test_utils.cc line 11 at r4 (raw file):

}

TensorShape make_tensor_shape_from_legion_dims(FFOrdered<size_t> dims,

This doesn't seem correct, shouldn't this be taking a LegionOrdered?

Suggestion:

legion_dims(FFOrdered<size_t> const &dims,

lib/local-execution/include/local-execution/tracked_allocator.h line 15 at r4 (raw file):

  void *allocate(size_t) override;
  void *allocate_and_zero(size_t) override;

Delete

@lockshaw lockshaw assigned rupanshusoi and unassigned oOTigger Sep 26, 2024
Copy link
Contributor Author

@oOTigger oOTigger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 8 of 83 files reviewed, 35 unresolved discussions (waiting on @lockshaw and @reyna-abhyankar)


lib/kernels/include/kernels/accessor.h line 31 at r4 (raw file):

  double *get_double_ptr() const;
  half *get_half_ptr() const;

Done.


lib/kernels/include/kernels/accessor.h line 33 at r4 (raw file):

  GenericTensorAccessorW(DataType dt,
                         ArrayShape sh,

Done.


lib/kernels/include/kernels/accessor.h line 34 at r4 (raw file):

  GenericTensorAccessorW(DataType dt,
                         ArrayShape sh,
                         req<void *> p,

Done.


lib/kernels/include/kernels/accessor.h line 35 at r4 (raw file):

                         ArrayShape sh,
                         req<void *> p,
                         bool on_dev = true)

Done.


lib/kernels/include/kernels/accessor.h line 36 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Move to .cc file

Done.


lib/kernels/include/kernels/accessor.h line 41 at r4 (raw file):

  DataType data_type;
  ArrayShape shape;
  req<void *> ptr;

Done.


lib/kernels/include/kernels/accessor.h line 42 at r4 (raw file):

  ArrayShape shape;
  req<void *> ptr;
  bool on_device;

Done.


lib/kernels/include/kernels/accessor.h line 43 at r4 (raw file):

  req<void *> ptr;
  bool on_device;
};

Done.


lib/kernels/include/kernels/accessor.h line 45 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

visitable is deprecated

Done.


lib/kernels/include/kernels/accessor.h line 64 at r4 (raw file):

  double const *get_double_ptr() const;
  half const *get_half_ptr() const;

Done.


lib/kernels/include/kernels/accessor.h line 65 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Move to .cc file

Done.


lib/kernels/include/kernels/accessor.h line 77 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Make same tie changes as above to add ==, !=, hash back in (they're currently added by visitable)

Done.


lib/kernels/include/kernels/allocation.h line 14 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Delete, if you want to zero the allocation you can zero it with some other function, don't make it part of the allocator interface

Done.


lib/kernels/include/kernels/allocation.h line 16 at r4 (raw file):

  virtual void *allocate_and_zero(size_t) = 0;
  virtual void deallocate(void *) = 0;

Done.


lib/kernels/include/kernels/allocation.h line 25 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Delete

Done.


lib/kernels/include/kernels/allocation.h line 28 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Delete

Done.


lib/kernels/include/kernels/allocation.h line 40 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Delete

Done.


lib/kernels/include/kernels/local_cpu_allocator.h line 13 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Delete

Done.


lib/kernels/include/kernels/local_cuda_allocator.h line 13 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Delete

Done.


lib/kernels/src/array_shape.cc line 63 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Use the implementation from @reyna-abhyankar's branch: https://github.com/reyna-abhyankar/FlexFlow/blob/e1a8a01e9e4f782805aa62c163bf50799f0638b4/lib/kernels/src/array_shape.cc#L75-L78

Done.


lib/kernels/src/local_cpu_allocator.cc line 5 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Use the implementations from @reyna-abhyankar's branch: https://github.com/reyna-abhyankar/FlexFlow/blob/e1a8a01e9e4f782805aa62c163bf50799f0638b4/lib/local-execution/src/local_cpu_allocator.cc

Done.


lib/kernels/src/cpu/combine_kernels.cc line 4 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

FYI we're now on C++17, so you can instead write namespace FlexFlow::Kernels::Combine {

Done


lib/kernels/src/cpu/reverse_kernels.cc line 14 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

The comments are helpful, but I think the big thing that would help is the access-by-coordinate function for GenericTensorAccessor

Done.


lib/kernels/test/src/test_batch_norm_kernel.cc line 36 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why is the extra template parameter suddenly necessary?

I wanted create_random_filled_accessor to just return a random accessor filled with decimal values. Initially, I had it where it would always be a float, but realized from doing the cast_kernel operator, that we could sometimes want it to be a double as well rather than float


lib/kernels/test/src/test_cast_kernel.cc line 90 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Delete

Done.


lib/kernels/test/src/test_cast_kernel.cc line 103 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Delete

Done.


lib/kernels/test/src/test_combine_kernel.cc line 111 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Or even add a helper function create_zero_filled_accessor_w that does these two operations

Done.


lib/kernels/test/src/test_reverse_kernels.cc line 147 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Add an explicit destination (semantics should either be copy to CPU or copy to GPU, not dependent on the current placement)

Done.


lib/kernels/test/src/test_utils.h line 11 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

There shouldn't be that much code explicitly creating GenericTensorAccessors I think? Can you point me to some examples?

I guess you're right, not sure what I was seeing earlier... I could probably sift through all initialization of GenericTensorAccessor's and specify their DeviceType, but maybe our current behavior is fine with the initializer is fine anyways?


lib/kernels/test/src/test_utils.h line 65 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Either specialize to a function that creates a random tensor, or pass the current tensor index to transform--the current abstraction is a bit awkwardly intermediate

Just deleted entirely


lib/kernels/test/src/test_utils.h line 86 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Where possible, probably best to add a dynamically-dispatched function as well as one that takes the argument by template. This isn't possible for some functions (such as those that return a type based on the template type), but for some of these in this function it seems possible.

Done.


lib/kernels/test/src/test_utils.h line 88 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Still seeing the old behavior? The destination placement should be stated either as part of the function name or as an argument--if it's an argument, rename the function to copy_tensor_to_memory

Made more explicit


lib/kernels/test/src/test_utils.h line 33 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why not just get the datatype from the shape?

Done.


lib/kernels/test/src/test_utils.cc line 11 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

This doesn't seem correct, shouldn't this be taking a LegionOrdered?

Done.


lib/local-execution/include/local-execution/tracked_allocator.h line 15 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Delete

Done.

Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 72 of 75 files at r5, 3 of 3 files at r6, all commit messages.
Reviewable status: all files reviewed, 33 unresolved discussions (waiting on @oOTigger and @reyna-abhyankar)


lib/kernels/CMakeLists.txt line 33 at r6 (raw file):

  nccl
  utils
  pcg

Why?


lib/kernels/include/kernels/accessor.h line 15 at r6 (raw file):

namespace FlexFlow {

class GenericTensorAccessorW {

@reyna-abhyankar Is there a reason we have both GenericTensorAccessorW and GenericTensorAccessorR over just having one GenericTensorAccessor and passing it as either const & or mutable &?


lib/kernels/include/kernels/accessor.h line 45 at r6 (raw file):

  template <DataType DT, typename... Indices>
  real_type_t<DT> &at(Indices... indices) {
    if (this->device_type != DeviceType::CPU) {

Any reason for this over the simpler at(std::vector<int> const &)?

Code quote:

  template <DataType DT, typename... Indices>
  real_type_t<DT> &at(Indices... indices) {
    if (this->device_type != DeviceType::CPU) {

lib/kernels/include/kernels/reverse_kernels_cpu.h line 5 at r6 (raw file):

#include "accessor.h"
#include "device.h"

Prefer full include paths

Suggestion:

#include "kernels/accessor.h"
#include "kernels/device.h"

lib/kernels/src/accessor.cc line 30 at r6 (raw file):

  }

  size_t offset = 0;

@Marsella8 Do you think we could reasonably add some generic coord -> index & index apsce concept in utils and then move this code and part of the 1458 over to use it? Feels like this logic is getting unnecessarily reimplemented


lib/kernels/src/accessor.cc line 47 at r6 (raw file):

    offset += cur_idx * multiplier;
    multiplier *= this->shape[legion_dim_t(i)];

Prefer at for bounds checking (ArrayShape may have bounds-checking for both, but other containers don't so it's a good habit)

Suggestion:

    multiplier *= this->shape.at(legion_dim_t(i));

lib/kernels/src/allocation.cc line 22 at r6 (raw file):

  void *ptr = this->allocate(get_size_in_bytes(tensor_shape));
  return {
      tensor_shape.data_type, tensor_shape, ptr, get_allocation_device_type()};

Prefer explicit this->

Suggestion:

     tensor_shape.data_type, tensor_shape, ptr, this->get_allocation_device_type()};

lib/kernels/src/cpu/replicate_kernels.cc line 7 at r6 (raw file):

template <typename T>
void cpu_replicate_backward_kernel(T *input,

Move over to use the new GenericTensorAccessor access-by-multidimensional-index support?


lib/kernels/src/cpu/reverse_kernels.cc line 14 at r3 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

Done.

Yeah this is much clearer, thanks (also it would be good to rename blk to block as it's a bit nonobvious that blk means block and not bulk)


lib/kernels/src/cpu/reverse_kernels.cc line 13 at r6 (raw file):

                  GenericTensorAccessorW &output,
                  coord_t num_out_blks,
                  coord_t reverse_dim_size,

Can't we just get this from the shape information contained in input?

Code quote:

                 coord_t reverse_dim_size,

lib/kernels/src/cpu/reverse_kernels.cc line 49 at r6 (raw file):

  DataTypeDispatch1<CPUReverseForwardKernel>{}(input_accessor.data_type,
                                               input_accessor,
                                               std::ref(output_accessor),

Why is std::ref necessary?

Code quote:

std::ref(output_accessor),

lib/kernels/test/CMakeLists.txt line 17 at r6 (raw file):

    cudart
    cublas
    pcg

Why?


lib/kernels/test/src/test_batch_norm_kernel.cc line 36 at r4 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

I wanted create_random_filled_accessor to just return a random accessor filled with decimal values. Initially, I had it where it would always be a float, but realized from doing the cast_kernel operator, that we could sometimes want it to be a double as well rather than float

Sure, but the DataType is already present in the input_shape, so probably better to just use that than force the user to pass the DataType twice


lib/kernels/test/src/test_cast_kernel.cc line 87 at r6 (raw file):

      std::vector<int32_t> result_data_gpu =
          load_accessor_data<DataType::INT32>(output_accessor_gpu);

Why not just get back a GenericTensorAccessorR with its data on CPU? Throwing away all the information by dropping down to a std::vector feels unnecessary

Code quote:

      std::vector<int32_t> result_data_gpu =
          load_accessor_data<DataType::INT32>(output_accessor_gpu);

lib/kernels/test/src/test_cast_kernel.cc line 91 at r6 (raw file):

      // Run CPU Forward Kernel
      GenericTensorAccessorW input_accessor_cpu =
          create_random_filled_accessor_w<DataType::FLOAT>(input_shape,

If the data is random, why would I expect input_accessor_cpu to have the same data as input_accessor_gpu? Would it be cleaner to generate a GenericTensorAccessor with random data on CPU, then create a copy of that on GPU, and then run the functions?


lib/kernels/test/src/test_concat_kernel.cc line 9 at r6 (raw file):

TEST_SUITE(FF_TEST_SUITE) {
  TEST_CASE("Test concat kernel forward and backward") {
    size_t num_inputs = 2;

Why these changes?


lib/kernels/test/src/test_replicate_kernel.cc line 78 at r6 (raw file):

      GenericTensorAccessorW output_accessor_gpu =
          gpu_allocator.allocate_tensor(output_shape);
      fill_with_zeros(output_accessor_gpu);

Wouldn't filling with random data here give you more assurance the function is correct?


lib/kernels/test/src/test_reverse_kernels.cc line 150 at r6 (raw file):

      GenericTensorAccessorW input_grad_accessor_cpu =
          cpu_allocator.allocate_tensor(input_shape);
      fill_with_zeros(input_grad_accessor_cpu);

I'd recommend creating a helper function for creating zero-filled tensors in test_utils to avoid any "use before initialized" issues

Code quote:

          cpu_allocator.allocate_tensor(input_shape);
      fill_with_zeros(input_grad_accessor_cpu);

lib/kernels/test/src/test_utils.h line 11 at r3 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

I guess you're right, not sure what I was seeing earlier... I could probably sift through all initialization of GenericTensorAccessor's and specify their DeviceType, but maybe our current behavior is fine with the initializer is fine anyways?

Not sure what you mean by "but maybe our current behavior is fine with the initializer is fine anyways". Can you clarify?


lib/kernels/test/src/test_utils.h line 33 at r4 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

Done.

Then why is the template parameter still here?


lib/kernels/test/src/test_utils.h line 34 at r6 (raw file):

template <typename DT>
void transfer_memory(GenericTensorAccessorW dst_accessor,

Suggestion:

void transfer_memory(GenericTensorAccessorW &dst_accessor,

lib/kernels/test/src/test_utils.h line 35 at r6 (raw file):

template <typename DT>
void transfer_memory(GenericTensorAccessorW dst_accessor,
                     const DT *src,

Why is one parameter a TensorAccessor and the other is a raw array? Feels inconsistent

Code quote:

void transfer_memory(GenericTensorAccessorW dst_accessor,
                     const DT *src,

lib/kernels/test/src/test_utils.h line 37 at r6 (raw file):

                     const DT *src,
                     DeviceType src_device_type) {
  size_t bytes = dst_accessor.shape.get_volume() * sizeof(DT);

Suggestion:

  size_t num_bytes = dst_accessor.shape.get_volume() * sizeof(DT);

lib/kernels/test/src/test_utils.h line 78 at r6 (raw file):

template <DataType DT>
GenericTensorAccessorR create_random_filled_accessor_r(TensorShape const &shape,

Get the DataType from the TensorShape


lib/kernels/test/src/test_utils.h line 88 at r6 (raw file):

template <typename T>
GenericTensorAccessorW create_filled_accessor_w(TensorShape const &shape,

Get the DataType from the TensorShape


lib/kernels/test/src/test_utils.h line 124 at r6 (raw file):

template <DataType DT>
std::vector<real_type_t<DT>>
    load_accessor_data(GenericTensorAccessorW accessor) {

Suggestion:

    load_accessor_data(GenericTensorAccessorW const &accessor) {

lib/kernels/test/src/test_utils.h line 149 at r6 (raw file):

template <typename T>
bool vectors_are_approx_equal(T lhs, T rhs) {

Now that we have support for GenericTensorAccessors on CPU, using vectors no longer seems necessary I think

Code quote:

template <typename T>
bool contains_non_zero(std::vector<T> &data) {
  return !all_of(data, [](T const &val) { return val == 0; });
}

template <typename T>
bool vectors_are_approx_equal(T lhs, T rhs) {

lib/kernels/test/src/test_utils.cc line 5 at r6 (raw file):

namespace FlexFlow {

bool device_on_cpu(DeviceType device_type) {

Is this function really necessary? Doesn't seem like you're saving much typing or much complexity


lib/kernels/test/src/test_utils.cc line 44 at r6 (raw file):

  return DataTypeDispatch1<CopyTensorAccessorW>{}(
      src_accessor.data_type, src_accessor, std::ref(allocator));
}

The ability to copy tensors feels like a useful primitive by itself, might be better moved into kernels itself?

Code quote:

template <DataType DT>
struct CopyTensorAccessorW {
  GenericTensorAccessorW operator()(GenericTensorAccessorW const &src_accessor,
                                    Allocator &allocator) {
    TensorShape shape =
        get_tensor_shape(src_accessor.shape, src_accessor.data_type);
    GenericTensorAccessorW copied_tensor = allocator.allocate_tensor(shape);

    transfer_memory(
        copied_tensor, src_accessor.get<DT>(), src_accessor.device_type);

    return copied_tensor;
  }
};

GenericTensorAccessorW
    copy_tensor_accessor_w(GenericTensorAccessorW const &src_accessor,
                           Allocator &allocator) {
  return DataTypeDispatch1<CopyTensorAccessorW>{}(
      src_accessor.data_type, src_accessor, std::ref(allocator));
}

lib/local-execution/src/local_task_argument_accessor.cc line 31 at r6 (raw file):

        tensor_backing.shape,
        tensor_backing.ptr,
        this->allocator.get_allocation_device_type()};

At this point probably better to just add a function in accessor.h for converting from a GenericTensorAccessorW to a GenericTensorAccessorR

Code quote:

    GenericTensorAccessorR readonly_tensor_backing = {
        tensor_backing.data_type,
        tensor_backing.shape,
        tensor_backing.ptr,
        this->allocator.get_allocation_device_type()};

lib/local-execution/src/local_task_argument_accessor.cc line 31 at r6 (raw file):

        tensor_backing.shape,
        tensor_backing.ptr,
        this->allocator.get_allocation_device_type()};

Why not just use the device type from tensor_backing?


lib/local-execution/src/local_task_argument_accessor.cc line 49 at r6 (raw file):

    for (GenericTensorAccessorW const &tensor_backing :
         variadic_tensor_backing) {
      readonly_variadic_tensor_backing.push_back(

At this point probably better to just add a function in accessor.h for converting from a GenericTensorAccessorW to a GenericTensorAccessorR


lib/local-execution/src/local_task_argument_accessor.cc line 53 at r6 (raw file):

           tensor_backing.shape,
           tensor_backing.ptr,
           this->allocator.get_allocation_device_type()});

Just use the device type from tensor_backing?

Copy link
Contributor Author

@oOTigger oOTigger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 51 of 85 files reviewed, 33 unresolved discussions (waiting on @lockshaw and @reyna-abhyankar)


lib/kernels/CMakeLists.txt line 33 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why?

I'm currently using the DeviceType enum from pcg/device_type.dtg.h


lib/kernels/include/kernels/accessor.h line 45 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Any reason for this over the simpler at(std::vector<int> const &)?

Mainly such that you can just use .at

(1,2,3) instead of .at
({1,2,3})... no need for unneeded initializer list


lib/kernels/include/kernels/reverse_kernels_cpu.h line 5 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Prefer full include paths

Done.


lib/kernels/src/accessor.cc line 47 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Prefer at for bounds checking (ArrayShape may have bounds-checking for both, but other containers don't so it's a good habit)

Done.


lib/kernels/src/allocation.cc line 22 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Prefer explicit this->

Done.


lib/kernels/src/cpu/replicate_kernels.cc line 7 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Move over to use the new GenericTensorAccessor access-by-multidimensional-index support?

Done.


lib/kernels/src/cpu/reverse_kernels.cc line 13 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Can't we just get this from the shape information contained in input?

Done.


lib/kernels/src/cpu/reverse_kernels.cc line 49 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why is std::ref necessary?

I run into lvalue, rvalue issues with output_accessor without utilizing std::ref. For some reason DataTypeDispatch1 I think is forwarding output accessor as an rvalue which is causing an error? I think it's related to output_accessor being non const, which it has to be in order for the .at method to return a non const pointer which we can assign values to. std::ref seems to fix this issue though.

If possible, I think we can also just modify how forwarding works in DataTypeDispatch1 to fix this issue as well though?


lib/kernels/test/CMakeLists.txt line 17 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why?

I'm currently using the DeviceType enum from pcg/device_type.dtg.h


lib/kernels/test/src/test_batch_norm_kernel.cc line 36 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Sure, but the DataType is already present in the input_shape, so probably better to just use that than force the user to pass the DataType twice

Done.


lib/kernels/test/src/test_cast_kernel.cc line 87 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why not just get back a GenericTensorAccessorR with its data on CPU? Throwing away all the information by dropping down to a std::vector feels unnecessary

Done.


lib/kernels/test/src/test_cast_kernel.cc line 91 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

If the data is random, why would I expect input_accessor_cpu to have the same data as input_accessor_gpu? Would it be cleaner to generate a GenericTensorAccessor with random data on CPU, then create a copy of that on GPU, and then run the functions?

Didn't catch this as test was going from float between 0 and 1 to an int lol, so it was basically just comparing if a tensor of all zeros was equal to another tensor of all zeros


lib/kernels/test/src/test_concat_kernel.cc line 9 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why these changes?

Was printing outputs and was easier to visually compare


lib/kernels/test/src/test_replicate_kernel.cc line 78 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Wouldn't filling with random data here give you more assurance the function is correct?

The problem I was running into is that the replicate_kernels.cu just directly adds to the value inside of the output accessor, leading to some cases when I was running the test where one of the tensors were allocated with randomized values inside of it, resulting in the test failing as the output tensors were not both initialized with the same initial state


lib/kernels/test/src/test_reverse_kernels.cc line 150 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I'd recommend creating a helper function for creating zero-filled tensors in test_utils to avoid any "use before initialized" issues

Done.


lib/kernels/test/src/test_utils.h line 33 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Then why is the template parameter still here?

Done.


lib/kernels/test/src/test_utils.h line 34 at r6 (raw file):

template <typename DT>
void transfer_memory(GenericTensorAccessorW dst_accessor,

Done.


lib/kernels/test/src/test_utils.h line 35 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why is one parameter a TensorAccessor and the other is a raw array? Feels inconsistent

It's mainly as sometimes I'm wanting to transfer memory from a vector to an accessor => my current logic for random filled accessors is dependent on this...

I think it becomes potentially more messy creating things like these random filled accessors without this construct as then we'd increment based on ptr


lib/kernels/test/src/test_utils.h line 37 at r6 (raw file):

                     const DT *src,
                     DeviceType src_device_type) {
  size_t bytes = dst_accessor.shape.get_volume() * sizeof(DT);

Done.


lib/kernels/test/src/test_utils.h line 78 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Get the DataType from the TensorShape

Done.


lib/kernels/test/src/test_utils.h line 88 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Get the DataType from the TensorShape

I think we need to use a template here, as we don't know we have to get the value that we're going to fill the accessor with too and we don't know the type of that to take as input?


lib/kernels/test/src/test_utils.h line 124 at r6 (raw file):

template <DataType DT>
std::vector<real_type_t<DT>>
    load_accessor_data(GenericTensorAccessorW accessor) {

Done.


lib/kernels/test/src/test_utils.h line 149 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Now that we have support for GenericTensorAccessors on CPU, using vectors no longer seems necessary I think

Done.


lib/kernels/test/src/test_utils.cc line 5 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Is this function really necessary? Doesn't seem like you're saving much typing or much complexity

Done.


lib/kernels/test/src/test_utils.cc line 44 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

The ability to copy tensors feels like a useful primitive by itself, might be better moved into kernels itself?

Done.


lib/local-execution/src/local_task_argument_accessor.cc line 31 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

At this point probably better to just add a function in accessor.h for converting from a GenericTensorAccessorW to a GenericTensorAccessorR

Done.


lib/local-execution/src/local_task_argument_accessor.cc line 31 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why not just use the device type from tensor_backing?

Done.


lib/local-execution/src/local_task_argument_accessor.cc line 49 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

At this point probably better to just add a function in accessor.h for converting from a GenericTensorAccessorW to a GenericTensorAccessorR

Done.


lib/local-execution/src/local_task_argument_accessor.cc line 53 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Just use the device type from tensor_backing?

Done.

Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 33 of 34 files at r7, 1 of 1 files at r8, 7 of 7 files at r9, all commit messages.
Dismissed @lockshaw from a discussion.
Reviewable status: all files reviewed, 27 unresolved discussions (waiting on @oOTigger and @reyna-abhyankar)


lib/kernels/include/kernels/accessor.h line 15 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

@reyna-abhyankar Is there a reason we have both GenericTensorAccessorW and GenericTensorAccessorR over just having one GenericTensorAccessor and passing it as either const & or mutable &?

At least in the meantime it seems like it would make sense to have GenericTensorAccessorW non-explicitly coerce to GenericTensorAccessorR I think to avoid all of the overloading?


lib/kernels/include/kernels/accessor.h line 45 at r6 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

Mainly such that you can just use .at

(1,2,3) instead of .at
({1,2,3})... no need for unneeded initializer list

Let's do the std::vector version unless there's some additional advantage you haven't mentioned yet. Yes the template syntax is cute and all, but it's substantially more complicated feature-wise and all it seems to do is remove the need to type two characters (also it doesn't allow for passing in coordinates of runtime-determined dimension)


lib/kernels/include/kernels/accessor.h line 14 at r9 (raw file):

namespace FlexFlow {

struct Allocator;

Better to #include in the header that contains Allocator


lib/kernels/include/kernels/accessor.h line 264 at r9 (raw file):

    GenericTensorAccessorR const &src_accessor);

void transfer_data_between_accessors(

Makes parameter order clear, and makes the operation a bit clearer ("transfer" sounds more like a "move" operation)

Suggestion:

void copy_accessor_data_to_l_from_r(

lib/kernels/include/kernels/replicate_kernels.h line 15 at r9 (raw file):

void backward_kernel(ffStream_t stream,
                     GenericTensorAccessorR const &output,
                     GenericTensorAccessorW const &input,

Why reorder input and output?


lib/kernels/include/kernels/replicate_kernels_cpu.h line 13 at r9 (raw file):

void cpu_backward_kernel(GenericTensorAccessorR const &output,
                         GenericTensorAccessorW &input,

Why reorder input and output?


lib/kernels/src/accessor.cc line 30 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

@Marsella8 Do you think we could reasonably add some generic coord -> index & index apsce concept in utils and then move this code and part of the 1458 over to use it? Feels like this logic is getting unnecessarily reimplemented

Tracked in #1528


lib/kernels/src/accessor.cc line 32 at r9 (raw file):

                         num_bytes,
                         cudaMemcpyDeviceToDevice));
  }

Suggestion:

  } else {
    assert (src_device_type == DeviceType::GPU);
    assert (dst_device_type == DeviceType::GPU);
    checkCUDA(cudaMemcpy(dst_accessor.ptr,
                         src_accessor.ptr,
                         num_bytes,
                         cudaMemcpyDeviceToDevice));
  }

lib/kernels/src/managed_ff_stream.cc line 14 at r9 (raw file):

ManagedFFStream &ManagedFFStream::operator=(ManagedFFStream &&other) noexcept {
  if (this != &other) {

Why this change?


lib/kernels/src/managed_per_device_ff_handle.cc line 22 at r9 (raw file):

ManagedPerDeviceFFHandle &ManagedPerDeviceFFHandle::operator=(
    ManagedPerDeviceFFHandle &&other) noexcept {
  if (this != &other) {

Why change this from swap?


lib/kernels/src/managed_per_device_ff_handle.cc line 40 at r9 (raw file):

    checkCUDA(cudaFree(this->handle->workSpace));
    delete this->handle;
    this->handle = nullptr;

Why explicitly assign the handle to nullptr?


lib/kernels/src/cpu/reverse_kernels.cc line 49 at r6 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

I run into lvalue, rvalue issues with output_accessor without utilizing std::ref. For some reason DataTypeDispatch1 I think is forwarding output accessor as an rvalue which is causing an error? I think it's related to output_accessor being non const, which it has to be in order for the .at method to return a non const pointer which we can assign values to. std::ref seems to fix this issue though.

If possible, I think we can also just modify how forwarding works in DataTypeDispatch1 to fix this issue as well though?

I'm tentatively guessing that might be due to DataTypeDispatch1 and DataTypeDispatch2 taking their arguments as Args ... rather than as Args &&...--can you try making that change and seeing if it fixes the problem?


lib/kernels/src/cuda/ops/linear_kernels.cu line 138 at r9 (raw file):

                           in_dim,
                           &alpha,
                           (void *)weight_ptr,

Avoid c-style casts

Suggestion:

static_cast<void *>(weight_ptr),

lib/kernels/test/src/test_cast_kernel.cc line 83 at r9 (raw file):

                                        output_accessor_cpu,
                                        DataType::FLOAT,
                                        DataType::DOUBLE);

Any reason not to just get these DataTypes from the accessors' shape fields?

Code quote:

                                        DataType::FLOAT,
                                        DataType::DOUBLE);

lib/kernels/test/src/test_cast_kernel.cc line 85 at r9 (raw file):

                                        DataType::DOUBLE);

      CHECK(w_accessors_are_equal<DataType::DOUBLE>(output_accessor_gpu,

Why is a template parameter needed here? Isn't the type information already present in the shape field of the accessors?

Code quote:

<DataType::DOUBLE>(o

lib/kernels/test/src/test_managed_ff_stream.cc line 10 at r9 (raw file):

    ManagedFFStream base_stream{};

    SUBCASE("Test ManagedFFStream Move Constructor") {

Add checks for if it's the same object on both sides of the assignment.


lib/kernels/test/src/test_managed_per_device_ff_handle.cc line 8 at r9 (raw file):

TEST_SUITE(FF_TEST_SUITE) {
  TEST_CASE("Test Managed Per Device FF Handle") {
    ManagedPerDeviceFFHandle base_handle{};

Can we force the user to pass arguments here instead? I'm not a huge fan of default constructors where there's no obvious single correct value that should be set. If you want to add a function that creates one with some default arguments and user that that's fine, but I'd rather that be an explicit function call to make it clear that that's kinda an arbitrary choice supplied somewhere rather than intrinsic to the object semantics

Code quote:

    ManagedPerDeviceFFHandle base_handle{};

lib/kernels/test/src/test_managed_per_device_ff_handle.cc line 16 at r9 (raw file):

    SUBCASE("Test ManagedPerDeviceFFHandle Move Constructor") {
      PerDeviceFFHandle const *base_handle_ptr = &base_handle.raw_handle();

Add checks for if it's the same object on both sides of the assignment.


lib/kernels/test/src/test_reverse_kernels.cc line 60 at r9 (raw file):

  TEST_CASE("Check Reverse Forward and Backward Kernels against CPU Kernels") {
    std::size_t num_out_blks = 1;

Might be nice to have this be >1 as having it be 1 could be hiding bugs (as iteration order doesn't matter if your looop has bound 1)

Code quote:

    std::size_t num_out_blks = 1;

lib/kernels/test/src/test_utils.h line 35 at r6 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

It's mainly as sometimes I'm wanting to transfer memory from a vector to an accessor => my current logic for random filled accessors is dependent on this...

I think it becomes potentially more messy creating things like these random filled accessors without this construct as then we'd increment based on ptr

Seems like this has been removed since this comment, so I guess it's fine now?


lib/kernels/test/src/test_utils.h line 88 at r6 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

I think we need to use a template here, as we don't know we have to get the value that we're going to fill the accessor with too and we don't know the type of that to take as input?

You could technically get around this by passing in a DataTypeValue which is a variant of the real_type_ts of all of the DataTypes, but I'm fine with the current implementation too. I'd say the big thing is that it should be checked against the DataType in the shape and an error raised if the the chosen T and the TensorShape's DataType don't match


lib/kernels/test/src/test_utils.h line 101 at r9 (raw file):

  if (accessor_a.data_type != accessor_b.data_type) {
    return false;
  }

Suggestion:

  if (accessor_a.shape != accessor_b.shape) {
    throw mk_runtime_error(fmt::format("w_accessors_are_equal expected accessors to have the same shape, but received: {} != {}", accessor_a.shape, accessor_b.shape));  
  }

lib/kernels/test/src/test_utils.h line 115 at r9 (raw file):

  for (size_t i = 0; i < accessor_a.shape.num_elements(); i++) {
    if (a_data_ptr[i] != b_data_ptr[i]) {
      print_accessor(cpu_accessor_a);

printing by default seems like a bad idea, this should probably be handled by the assertion check, not the actual boolean condition itself.


lib/kernels/test/src/test_utils.cc line 126 at r9 (raw file):

GenericTensorAccessorR
    create_cpu_compatible_accessor_r(GenericTensorAccessorR const &accessor,

Suggestion:

copy_to_accessor_to_cpu_if_necessary

lib/kernels/test/src/test_utils.cc line 158 at r9 (raw file):

};

void print_accessor(GenericTensorAccessorR const &accessor) {

Suggestion:

void print_tensor_accessor_contents(

lib/kernels/test/src/test_utils.cc line 158 at r9 (raw file):

};

void print_accessor(GenericTensorAccessorR const &accessor) {

Might be nice to have this return std::string or take in a std::ostream & so the user can choose where it goes rather than just forcing it to std::cout

Code quote:

void 

lib/local-execution/src/ops/linear.cc line 152 at r9 (raw file):

                 per_device_state,
                 input.get_float_ptr(),
                 (float *)input_grad.get_float_ptr(),

No cast should be necessary as you're already calling get_float_ptr?

Copy link
Contributor Author

@oOTigger oOTigger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 57 of 95 files reviewed, 27 unresolved discussions (waiting on @lockshaw and @reyna-abhyankar)


lib/kernels/include/kernels/accessor.h line 15 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

At least in the meantime it seems like it would make sense to have GenericTensorAccessorW non-explicitly coerce to GenericTensorAccessorR I think to avoid all of the overloading?

Done.


lib/kernels/include/kernels/accessor.h line 45 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Let's do the std::vector version unless there's some additional advantage you haven't mentioned yet. Yes the template syntax is cute and all, but it's substantially more complicated feature-wise and all it seems to do is remove the need to type two characters (also it doesn't allow for passing in coordinates of runtime-determined dimension)

Done.


lib/kernels/include/kernels/accessor.h line 14 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Better to #include in the header that contains Allocator

I can't "#include/kernels/allocation.h" as circular dependency between allocator and accessor


lib/kernels/include/kernels/accessor.h line 264 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Makes parameter order clear, and makes the operation a bit clearer ("transfer" sounds more like a "move" operation)

Done.


lib/kernels/include/kernels/replicate_kernels.h line 15 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why reorder input and output?

Slightly unneeded, but I prefer the format of output before input as it matches the other cpu_kernels I have written so far.

As it relates to all other kernels though, a lot of the kernels vary on whether they have input or output first... should probably standardize this?


lib/kernels/include/kernels/replicate_kernels_cpu.h line 13 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why reorder input and output?

same as prev


lib/kernels/src/accessor.cc line 32 at r9 (raw file):

                         num_bytes,
                         cudaMemcpyDeviceToDevice));
  }

Done.


lib/kernels/src/managed_ff_stream.cc line 14 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why this change?

Similar to per_device_ff_handle but I also thought it'd be safer to fully transfer ownership rather than potentially leaving dangling resources


lib/kernels/src/managed_per_device_ff_handle.cc line 22 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why change this from swap?

I thought it's good practice to invalidate the other ManagedPerDeviceFFHandle in a move constructor, so changed to exchange to set other handle's ptr now as nullptr


lib/kernels/src/managed_per_device_ff_handle.cc line 40 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why explicitly assign the handle to nullptr?

Done.


lib/kernels/src/cpu/reverse_kernels.cc line 49 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I'm tentatively guessing that might be due to DataTypeDispatch1 and DataTypeDispatch2 taking their arguments as Args ... rather than as Args &&...--can you try making that change and seeing if it fixes the problem?

Fixed the issue


lib/kernels/src/cuda/ops/linear_kernels.cu line 138 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Avoid c-style casts

reinterpret cast okay here?


lib/kernels/test/src/test_cast_kernel.cc line 83 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Any reason not to just get these DataTypes from the accessors' shape fields?

Done.


lib/kernels/test/src/test_cast_kernel.cc line 85 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why is a template parameter needed here? Isn't the type information already present in the shape field of the accessors?

Done.


lib/kernels/test/src/test_managed_ff_stream.cc line 10 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Add checks for if it's the same object on both sides of the assignment.

Done.


lib/kernels/test/src/test_managed_per_device_ff_handle.cc line 8 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Can we force the user to pass arguments here instead? I'm not a huge fan of default constructors where there's no obvious single correct value that should be set. If you want to add a function that creates one with some default arguments and user that that's fine, but I'd rather that be an explicit function call to make it clear that that's kinda an arbitrary choice supplied somewhere rather than intrinsic to the object semantics

Done.


lib/kernels/test/src/test_managed_per_device_ff_handle.cc line 16 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Add checks for if it's the same object on both sides of the assignment.

Done.


lib/kernels/test/src/test_reverse_kernels.cc line 60 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Might be nice to have this be >1 as having it be 1 could be hiding bugs (as iteration order doesn't matter if your looop has bound 1)

Done.


lib/kernels/test/src/test_utils.h line 35 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Seems like this has been removed since this comment, so I guess it's fine now?

Yeah, this was fixed, I just moved the function into accessor.h => this is now the copy_accessor_data_to_l_from_r function


lib/kernels/test/src/test_utils.h line 88 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

You could technically get around this by passing in a DataTypeValue which is a variant of the real_type_ts of all of the DataTypes, but I'm fine with the current implementation too. I'd say the big thing is that it should be checked against the DataType in the shape and an error raised if the the chosen T and the TensorShape's DataType don't match

Done.


lib/kernels/test/src/test_utils.h line 101 at r9 (raw file):

  if (accessor_a.data_type != accessor_b.data_type) {
    return false;
  }

Done.


lib/kernels/test/src/test_utils.h line 115 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

printing by default seems like a bad idea, this should probably be handled by the assertion check, not the actual boolean condition itself.

Done.


lib/kernels/test/src/test_utils.cc line 126 at r9 (raw file):

GenericTensorAccessorR
    create_cpu_compatible_accessor_r(GenericTensorAccessorR const &accessor,

Done.


lib/kernels/test/src/test_utils.cc line 158 at r9 (raw file):

};

void print_accessor(GenericTensorAccessorR const &accessor) {

Done.


lib/kernels/test/src/test_utils.cc line 158 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Might be nice to have this return std::string or take in a std::ostream & so the user can choose where it goes rather than just forcing it to std::cout

Done.


lib/local-execution/src/ops/linear.cc line 152 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

No cast should be necessary as you're already calling get_float_ptr?

Needed to make some changes to permissions of accessors to enable this, due to const types with bias_ptr and output_grad ptr

Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 38 of 38 files at r10, all commit messages.
Reviewable status: all files reviewed, 72 unresolved discussions (waiting on @oOTigger and @reyna-abhyankar)


lib/kernels/include/kernels/accessor.h line 14 at r9 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

I can't "#include/kernels/allocation.h" as circular dependency between allocator and accessor

Fair enough. In that case I'd recommend pulling copy_tensor_accessor_r and copy_tensor_accessor_w (I believe these are the only two usages of Allocator in this file) into a separate header file--this header file is getting rather large already anyway


lib/kernels/include/kernels/replicate_kernels.h line 15 at r9 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

Slightly unneeded, but I prefer the format of output before input as it matches the other cpu_kernels I have written so far.

As it relates to all other kernels though, a lot of the kernels vary on whether they have input or output first... should probably standardize this?

Sure, feel free to standardize the kernel functions in a future PR (that would be great actually), though we should make sure we actually have tests up and running for all of the kernels to make sure we don't accidentally break anything. I've created an issue to track this and assigned it to you: #1540


lib/kernels/src/managed_ff_stream.cc line 14 at r9 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

Similar to per_device_ff_handle but I also thought it'd be safer to fully transfer ownership rather than potentially leaving dangling resources

Makes sense, my only concern is the degree of code duplication. Is there anything reasonable we can do to reduce it to minimize bugs when this file is edited?


lib/kernels/src/managed_per_device_ff_handle.cc line 22 at r9 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

I thought it's good practice to invalidate the other ManagedPerDeviceFFHandle in a move constructor, so changed to exchange to set other handle's ptr now as nullptr

That's fine (and not a bad practice), it's just currently leading to a lot of duplicated code between this function and the destructor. With the previous solution you implicitly get the destructor call when the moved-from value is deallocated. Is there a way the duplicated code in this solution could be reduced? The current structure makes editing this file a bit bugprone


lib/kernels/src/cuda/ops/linear_kernels.cu line 138 at r9 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

reinterpret cast okay here?

Generally you want to use the weakest/most-restrictive cast construction that you can, and static_cast<void const *> works fine here so you should use that throughout this file.


lib/local-execution/src/ops/linear.cc line 152 at r9 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

Needed to make some changes to permissions of accessors to enable this, due to const types with bias_ptr and output_grad ptr

What do you mean--I'm not seeing any changes to this in accessors.h? Am I missing something?


lib/kernels/include/kernels/accessor.h line 166 at r7 (raw file):

std::ostream &operator<<(std::ostream &, GenericTensorAccessorR const &);

int32_t *get_int32_ptr(GenericTensorAccessorW const &);

Why were these removed? Not necessarily objecting, just curious


lib/kernels/src/accessor.cc line 29 at r10 (raw file):

  } else {
    assert(src_device_type == DeviceType::GPU);
    assert(src_device_type == DeviceType::CPU);

I don't think this is possible as src_device_type can't have two values at once

Code quote:

    assert(src_device_type == DeviceType::GPU);
    assert(src_device_type == DeviceType::CPU);

lib/kernels/src/accessor.cc line 70 at r10 (raw file):

  for (size_t i = 0; i < this->shape.num_dims(); i++) {
    if (indices[i] >= this->shape.at(legion_dim_t(i))) {

Suggestion:

indices.at(i) 

lib/kernels/src/accessor.cc line 75 at r10 (raw file):

                      "when only {} indexes exist",
                      i,
                      indices[i],

Suggestion:

indices.at(i),

lib/kernels/src/accessor.cc line 79 at r10 (raw file):

    }

    offset += indices[i] * multiplier;

Suggestion:

indices.at(i)

lib/kernels/src/accessor.cc line 145 at r10 (raw file):

    std::vector<size_t> const &indices) const {

  if (indices.size() != this->shape.num_dims()) {

Pull out into a separate helper function to avoid the duplication between GenericTensorAccessorR::calculate_index_offset and GenericTensorAccessorW::calculuate_index_offset (and then honestly it's probably a good idea to just remove those methods as I'm not sure they're doing much then other than just forwarding arguments to the helper function.


lib/kernels/src/accessor.cc line 153 at r10 (raw file):

  ssize_t offset = 0;
  size_t multiplier = 1;

Prefer int

Code quote:

  ssize_t offset = 0;
  size_t multiplier = 1;

lib/kernels/src/accessor.cc line 156 at r10 (raw file):

  for (size_t i = 0; i < this->shape.num_dims(); i++) {
    if (indices[i] >= this->shape.at(legion_dim_t(i))) {

Suggestion:

indices.at(i)

lib/kernels/src/accessor.cc line 161 at r10 (raw file):

                      "when only {} indexes exist",
                      i,
                      indices[i],

Suggestion:

                indices.at(i),

lib/kernels/src/accessor.cc line 165 at r10 (raw file):

    }

    offset += indices[i] * multiplier;

Suggestion:

    offset += indices.at(i) * multiplier;

lib/kernels/src/accessor.cc line 166 at r10 (raw file):

    offset += indices[i] * multiplier;
    multiplier *= this->shape.at(legion_dim_t(i));

Prefer {} initialization

Suggestion:

    multiplier *= this->shape.at(legion_dim_t{i});

lib/kernels/include/kernels/accessor.h line 45 at r10 (raw file):

  template <DataType DT>
  real_type_t<DT> const &at(std::vector<size_t> const &indices) const {

Prefer int over size_t to make issues from using negative values at least somewhat detectable

Code quote:

std::vector<size_t> const &indices

lib/kernels/include/kernels/accessor.h line 45 at r10 (raw file):

  template <DataType DT>
  real_type_t<DT> const &at(std::vector<size_t> const &indices) const {

This function should probably be bound-checked.


lib/kernels/include/kernels/accessor.h line 112 at r10 (raw file):

  template <DataType DT>
  real_type_t<DT> &at(std::vector<size_t> const &indices) {

This function should probably be bounds-checked


lib/kernels/test/src/test_gather_kernels.cc line 8 at r10 (raw file):

TEST_SUITE(FF_TEST_SUITE) {
  TEST_CASE("Test Gather Forward and Backward Kernel") {
    ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);

Bracket initialization and argument name comments (see test_concat_kernel.cu)


lib/kernels/test/src/test_flat_kernel.cc line 10 at r10 (raw file):

    Allocator allocator = create_local_cuda_memory_allocator();

    ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);

Bracket initialization and argument name comments (see test_concat_kernel.cu)


lib/kernels/test/src/test_flat_kernel.cc line 19 at r10 (raw file):

    GenericTensorAccessorR input_accessor =
        read_only_accessor_from_write_accessor(create_filled_accessor_w(
            input_shape, allocator, DataTypeValue(2.0f)));

Use explicit construction functions for DataTypeValue (see test_batch_norm_kernels.cu)


lib/kernels/test/src/test_flat_kernel.cc line 34 at r10 (raw file):

    SUBCASE("backward_kernel") {
      GenericTensorAccessorW output_grad_accessor = create_filled_accessor_w(
          output_shape, allocator, DataTypeValue(0.0f));

Use explicit construction functions for DataTypeValue (see test_batch_norm_kernels.cu)


lib/kernels/test/src/test_flat_kernel.cc line 36 at r10 (raw file):

          output_shape, allocator, DataTypeValue(0.0f));
      GenericTensorAccessorW input_grad_accessor =
          create_filled_accessor_w(input_shape, allocator, DataTypeValue(1.0f));

Use explicit construction functions for DataTypeValue (see test_batch_norm_kernels.cu)


lib/kernels/test/src/test_concat_kernel.cc line 13 at r10 (raw file):

    ff_dim_t concat_axis = ff_dim_t(1);

    ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);

Prefer {} initialization (https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#es23-prefer-the--initializer-syntax)

Suggestion:

    ManagedPerDeviceFFHandle managed_handle{1024 * 1024, true};

lib/kernels/test/src/test_concat_kernel.cc line 13 at r10 (raw file):

    ff_dim_t concat_axis = ff_dim_t(1);

    ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);

Add argument comments where argument meanings are not clear. Normally I'd refrain from comments, but these can actually be automatically checked by https://clang.llvm.org/extra/clang-tidy/checks/bugprone/argument-comment.html

Suggestion:

    ManagedPerDeviceFFHandle managed_handle(/*workSpaceSize=*/1024 * 1024, 
                                            /*allowTensorOpMathConversion=*/true)

lib/kernels/test/src/test_layer_norm_kernels.cc line 20 at r10 (raw file):

        make_tensor_shape_from_legion_dims({feature_size}, DataType::FLOAT);

    ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);

Bracket initialization and argument name comments (see test_concat_kernel.cu)


lib/kernels/test/src/test_layer_norm_kernels.cc line 36 at r10 (raw file):

        create_random_filled_accessor_r(input_shape, allocator);
    GenericTensorAccessorW gamma_accessor =
        create_filled_accessor_w(feature_shape, allocator, DataTypeValue(1.0f));

Use explicit construction functions for DataTypeValue (see test_batch_norm_kernels.cu)


lib/kernels/test/src/test_layer_norm_kernels.cc line 42 at r10 (raw file):

          allocator.allocate_tensor(output_shape);
      GenericTensorAccessorW beta_accessor = create_filled_accessor_w(
          feature_shape, allocator, DataTypeValue(0.0f));

Use explicit construction functions for DataTypeValue (see test_batch_norm_kernels.cu)


lib/kernels/src/managed_per_device_ff_handle.cc line 7 at r9 (raw file):

ManagedPerDeviceFFHandle::ManagedPerDeviceFFHandle() {
  this->handle = new PerDeviceFFHandle;

Suggestion:

  this->handle = new PerDeviceFFHandle{};

lib/kernels/test/src/test_managed_per_device_ff_handle.cc line 7 at r10 (raw file):

TEST_SUITE(FF_TEST_SUITE) {
  TEST_CASE("Test Managed Per Device FF Handle") {

Suggestion:

 TEST_CASE("ManagedPerDeviceFFHandle") {

lib/kernels/test/src/test_managed_per_device_ff_handle.cc line 8 at r10 (raw file):

TEST_SUITE(FF_TEST_SUITE) {
  TEST_CASE("Test Managed Per Device FF Handle") {
    ManagedPerDeviceFFHandle base_handle{1024 * 1024, true};

Argument name comments (see test_concat_kernel.cu)


lib/kernels/test/src/test_managed_per_device_ff_handle.cc line 11 at r10 (raw file):

    PerDeviceFFHandle const *base_handle_ptr = &base_handle.raw_handle();

    SUBCASE("Test ManagedPerDeviceFFHandle Constructor") {

Suggestion:

    SUBCASE("constructor") {

lib/kernels/test/src/test_managed_per_device_ff_handle.cc line 16 at r10 (raw file):

    }

    SUBCASE("Test ManagedPerDeviceFFHandle Move Constructor") {

Suggestion:

    SUBCASE("move constructor") {

lib/kernels/test/src/test_managed_per_device_ff_handle.cc line 24 at r10 (raw file):

    SUBCASE("Test ManagedPerDeviceFFHandle Assignment Operator") {
      ManagedPerDeviceFFHandle new_handle{1024 * 1024, true};

Argument name comments (see test_concat_kernel.cu)


lib/kernels/test/src/test_managed_per_device_ff_handle.cc line 34 at r10 (raw file):

      base_handle = std::move(base_handle);
      CHECK(&base_handle.raw_handle() == base_handle_ptr);
    }

Suggestion:

    SUBCASE("move assignment operator") {
      SUBCASE("move assign to other") {
        ManagedPerDeviceFFHandle new_handle{1024 * 1024, true};
        new_handle = std::move(base_handle);

        CHECK(&base_handle.raw_handle() == nullptr);
        CHECK(&new_handle.raw_handle() == base_handle_ptr);
      }

      SUBCASE("move assign to self") {
        base_handle = std::move(base_handle);
        CHECK(&base_handle.raw_handle() == base_handle_ptr);
      }

lib/kernels/src/cpu/reverse_kernels.cc line 14 at r10 (raw file):

    assert(input.data_type == DT && output.data_type == DT);

    size_t num_out_blocks = input.shape.at(legion_dim_t(0));

Prefer int

Code quote:

    size_t num_out_blocks = input.shape.at(legion_dim_t(0));

lib/kernels/test/src/test_managed_ff_stream.cc line 7 at r10 (raw file):

TEST_SUITE(FF_TEST_SUITE) {
  TEST_CASE("Test Managed FF Stream") {

Suggestion:

  TEST_CASE("ManagedFFStream") {

lib/kernels/test/src/test_managed_ff_stream.cc line 11 at r10 (raw file):

    ffStream_t const *base_stream_ptr = &base_stream.raw_stream();

    SUBCASE("Test ManagedFFStream Move Constructor") {

Suggestion:

    SUBCASE("move constructor") {

lib/kernels/test/src/test_managed_ff_stream.cc line 21 at r10 (raw file):

      new_stream = std::move(base_stream);
      CHECK(&base_stream.raw_stream() == nullptr);
      CHECK(&new_stream.raw_stream() == base_stream_ptr);

Suggestion:

    SUBCASE("move assignment") {
      SUBCASE("move assign to other") {
        ManagedFFStream new_stream{};
        new_stream = std::move(base_stream);
        CHECK(&base_stream.raw_stream() == nullptr);
        CHECK(&new_stream.raw_stream() == base_stream_ptr);
      } 
      
      SUBCASE("move assign to self") {
        ...
      }

lib/kernels/test/src/test_batch_matmul_kernel.cc line 18 at r10 (raw file):

    ManagedFFStream managed_stream{};
    ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);

Bracket initialization and argument name comments (see test_concat_kernel.cu)


lib/kernels/test/src/test_reverse_kernels.cc line 17 at r10 (raw file):

    TensorShape output_shape = input_shape;

    ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);

Bracket initialization and argument name comments (see test_concat_kernel.cu)


lib/kernels/test/src/test_reverse_kernels.cc line 25 at r10 (raw file):

      GenericTensorAccessorR input_accessor =
          read_only_accessor_from_write_accessor(create_filled_accessor_w(
              input_shape, allocator, DataTypeValue(1.0f)));

Use explicit construction functions for DataTypeValue (see test_batch_norm_kernels.cu)


lib/kernels/test/src/test_reverse_kernels.cc line 68 at r10 (raw file):

    TensorShape output_shape = input_shape;

    ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);

Bracket initialization and argument name comments (see test_concat_kernel.cu)


lib/kernels/test/src/test_partition_kernel.cc line 9 at r10 (raw file):

TEST_SUITE(FF_TEST_SUITE) {
  TEST_CASE("Test Partition Forward and Backward") {
    ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);

Bracket initialization and argument name comments (see test_concat_kernel.cu)


lib/kernels/test/src/test_partition_kernel.cc line 23 at r10 (raw file):

    SUBCASE("forward_kernel") {
      GenericTensorAccessorR input_accessor =
          create_filled_accessor_r(input_shape, allocator, DataTypeValue(1.0f));

Use explicit construction functions for DataTypeValue (see test_batch_norm_kernels.cu)


lib/kernels/test/src/test_partition_kernel.cc line 35 at r10 (raw file):

    SUBCASE("backward_kernel") {
      GenericTensorAccessorR output_grad_accessor = create_filled_accessor_r(
          output_shape, allocator, DataTypeValue(1.0f));

Use explicit construction functions for DataTypeValue (see test_batch_norm_kernels.cu)


lib/kernels/test/src/test_partition_kernel.cc line 37 at r10 (raw file):

          output_shape, allocator, DataTypeValue(1.0f));
      GenericTensorAccessorW input_grad_accessor =
          create_filled_accessor_w(input_shape, allocator, DataTypeValue(2.0f));

Use explicit construction functions for DataTypeValue (see test_batch_norm_kernels.cu)


lib/kernels/test/src/test_batch_norm_kernel.cc line 12 at r10 (raw file):

    ManagedFFStream managed_stream{};
    ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);

Bracket initialization and argument name comments (see test_concat_kernel.cu)


lib/kernels/test/src/test_batch_norm_kernel.cc line 40 at r10 (raw file):

        create_random_filled_accessor_w(output_shape, allocator);
    GenericTensorAccessorW scale_accessor =
        create_filled_accessor_w(scale_shape, allocator, DataTypeValue(1.0f));

Due to the implicit coercions between all of the numerical types, I recommend creating explicit functions for creating DataTypeValues of different types to avoid ambiguity, e.g., make_float_data_type_value, make_int32_data_type_value, etc. and use them in call of the places you're constructing DataTypeValues


lib/kernels/test/src/test_attention_kernel.cc line 16 at r10 (raw file):

    ManagedFFStream managed_stream{};
    ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);

Bracket initialization and argument name comments (see test_concat_kernel.cu)


lib/kernels/test/src/test_transpose_kernel.cc line 10 at r10 (raw file):

    std::size_t num_dims = 2;

    std::vector<ff_dim_t> perm = {ff_dim_t(0), ff_dim_t(1)};

Prefer bracket initialization (see test_concat_kernel.cu)


lib/kernels/test/src/test_transpose_kernel.cc line 12 at r10 (raw file):

    std::vector<ff_dim_t> perm = {ff_dim_t(0), ff_dim_t(1)};

    ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);

Bracket initialization and argument name comments (see test_concat_kernel.cu)


lib/kernels/test/src/test_split_kernel.cc line 15 at r10 (raw file):

    coord_t num_blks = 1;

    ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);

Bracket initialization and argument name comments (see test_concat_kernel.cu)


lib/kernels/test/src/test_split_kernel.cc line 53 at r10 (raw file):

      GenericTensorAccessorW input_grad_accessor =
          create_filled_accessor_w(input_shape, allocator, DataTypeValue(0.0f));

Use explicit construction functions for DataTypeValue (see test_batch_norm_kernels.cu)


lib/kernels/test/src/test_dropout.cc line 21 at r10 (raw file):

    ManagedFFStream managed_stream{};
    ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);

Bracket initialization and argument name comments (see test_concat_kernel.cu)


lib/kernels/test/src/test_softmax_kernel.cc line 11 at r10 (raw file):

    int input_n = 1, input_c = 1, input_h = 1, input_w = 100, channels = 100;

    ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);

Bracket initialization and argument name comments (see test_concat_kernel.cu)


lib/kernels/test/src/test_replicate_kernel.cc line 16 at r10 (raw file):

        make_tensor_shape_from_legion_dims({100}, DataType::FLOAT);

    ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);

Bracket initialization and argument name comments (see test_concat_kernel.cu)


lib/kernels/test/src/test_replicate_kernel.cc line 56 at r10 (raw file):

        make_tensor_shape_from_legion_dims({5, num_replicas}, DataType::FLOAT);

    ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);

Bracket initialization and argument name comments (see test_concat_kernel.cu)


lib/local-execution/test/src/test_local_cost_estimator.cc line 15 at r10 (raw file):

//   TEST_CASE("Local Cost Estimator") {
//     // local backing initialization
//     ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);

Bracket initialization and argument name comments (see test_concat_kernel.cu)


lib/kernels/test/src/test_combine_kernel.cc line 9 at r10 (raw file):

TEST_SUITE(FF_TEST_SUITE) {
  TEST_CASE("Call Combine Forward and Backward Kernels") {
    ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);

Bracket initialization and argument name comments (see test_concat_kernel.cu)


lib/kernels/test/src/test_reduction_kernel.cc line 13 at r10 (raw file):

        {10, 10, 10, 10, 10}, DataType::FLOAT);

    ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);

Bracket initialization and argument name comments (see test_concat_kernel.cu)


lib/kernels/test/src/test_reduction_kernel.cc line 39 at r10 (raw file):

      GenericTensorAccessorR output_grad_accessor = create_filled_accessor_r(
          output_shape, allocator, DataTypeValue(1.0f));

Use explicit construction functions for DataTypeValue (see test_batch_norm_kernels.cu)


lib/kernels/test/src/test_reshape_kernel.cc line 8 at r10 (raw file):

TEST_SUITE(FF_TEST_SUITE) {
  TEST_CASE("Test Reshape Forward and Backward") {
    ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);

Bracket initialization and argument name comments (see test_concat_kernel.cu)


lib/kernels/test/src/test_utils.cc line 147 at r10 (raw file):

    T const *data_ptr = accessor.get<DT>();
    for (size_t i = 0; i < accessor.shape.num_elements(); i++) {
      stream << data_ptr[i] << " ";

Ideally should use the shape information of accessor to print something more semantically meaningfull


lib/kernels/test/src/test_utils.cc line 147 at r10 (raw file):

    T const *data_ptr = accessor.get<DT>();
    for (size_t i = 0; i < accessor.shape.num_elements(); i++) {
      stream << data_ptr[i] << " ";

Suggestion:

    if (i != 0) {
      stream << " ";
    }
    stream << data_ptr[i];

lib/kernels/test/src/test_utils.cc line 149 at r10 (raw file):

      stream << data_ptr[i] << " ";
    }
    stream << "\n";

Suggestion:

    stream << std::endl;

lib/kernels/test/src/test_utils.cc line 202 at r10 (raw file):

  GenericTensorAccessorW operator()(TensorShape const &shape,
                                    Allocator &allocator,
                                    DataTypeValue val) {

Would probably be better to move DataTypeValue over to a dtgen .variant.toml type--any reason it isn't already?

Code quote:

DataTypeValue

lib/kernels/test/src/test_pool_2d_kernels.cc line 15 at r10 (raw file):

    PoolOp pool_type = PoolOp::MAX;

    ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);

Bracket initialization and argument name comments (see test_concat_kernel.cu)


lib/kernels/test/src/test_pool_2d_kernels.cc line 60 at r10 (raw file):

    SUBCASE("backward_kernel") {
      GenericTensorAccessorW output_grad_accessor = create_filled_accessor_w(
          output_shape, allocator, DataTypeValue(1.0f));

Use explicit construction functions for DataTypeValue (see test_batch_norm_kernels.cu)

Copy link
Contributor Author

@oOTigger oOTigger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 64 of 108 files reviewed, 72 unresolved discussions (waiting on @lockshaw and @reyna-abhyankar)


lib/kernels/include/kernels/accessor.h line 166 at r7 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why were these removed? Not necessarily objecting, just curious

Got sloppy, my bad ~ for some reason didn't see one of them returned const and thought were same


lib/kernels/include/kernels/accessor.h line 14 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Fair enough. In that case I'd recommend pulling copy_tensor_accessor_r and copy_tensor_accessor_w (I believe these are the only two usages of Allocator in this file) into a separate header file--this header file is getting rather large already anyway

Done.


lib/kernels/include/kernels/accessor.h line 45 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Prefer int over size_t to make issues from using negative values at least somewhat detectable

Done.


lib/kernels/include/kernels/accessor.h line 45 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

This function should probably be bound-checked.

Done.


lib/kernels/include/kernels/accessor.h line 112 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

This function should probably be bounds-checked

Done.


lib/kernels/include/kernels/replicate_kernels.h line 15 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Sure, feel free to standardize the kernel functions in a future PR (that would be great actually), though we should make sure we actually have tests up and running for all of the kernels to make sure we don't accidentally break anything. I've created an issue to track this and assigned it to you: #1540

Sounds good


lib/kernels/src/accessor.cc line 29 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I don't think this is possible as src_device_type can't have two values at once

Done.


lib/kernels/src/accessor.cc line 145 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Pull out into a separate helper function to avoid the duplication between GenericTensorAccessorR::calculate_index_offset and GenericTensorAccessorW::calculuate_index_offset (and then honestly it's probably a good idea to just remove those methods as I'm not sure they're doing much then other than just forwarding arguments to the helper function.

Done.


lib/kernels/src/accessor.cc line 153 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Prefer int

Done.


lib/kernels/src/accessor.cc line 166 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Prefer {} initialization

Done.


lib/kernels/src/managed_ff_stream.cc line 14 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Makes sense, my only concern is the degree of code duplication. Is there anything reasonable we can do to reduce it to minimize bugs when this file is edited?

Done.


lib/kernels/src/managed_per_device_ff_handle.cc line 22 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

That's fine (and not a bad practice), it's just currently leading to a lot of duplicated code between this function and the destructor. With the previous solution you implicitly get the destructor call when the moved-from value is deallocated. Is there a way the duplicated code in this solution could be reduced? The current structure makes editing this file a bit bugprone

Done.


lib/kernels/src/cpu/reverse_kernels.cc line 14 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Prefer int

Done.


lib/kernels/src/cuda/ops/linear_kernels.cu line 138 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Generally you want to use the weakest/most-restrictive cast construction that you can, and static_cast<void const *> works fine here so you should use that throughout this file.

Done.


lib/kernels/test/src/test_attention_kernel.cc line 16 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Bracket initialization and argument name comments (see test_concat_kernel.cu)

Done.


lib/kernels/test/src/test_batch_matmul_kernel.cc line 18 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Bracket initialization and argument name comments (see test_concat_kernel.cu)

Done.


lib/kernels/test/src/test_batch_norm_kernel.cc line 12 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Bracket initialization and argument name comments (see test_concat_kernel.cu)

Done.


lib/kernels/test/src/test_batch_norm_kernel.cc line 40 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Due to the implicit coercions between all of the numerical types, I recommend creating explicit functions for creating DataTypeValues of different types to avoid ambiguity, e.g., make_float_data_type_value, make_int32_data_type_value, etc. and use them in call of the places you're constructing DataTypeValues

Done.


lib/kernels/test/src/test_combine_kernel.cc line 9 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Bracket initialization and argument name comments (see test_concat_kernel.cu)

Done.


lib/kernels/test/src/test_concat_kernel.cc line 13 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Prefer {} initialization (https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#es23-prefer-the--initializer-syntax)

Done.


lib/kernels/test/src/test_concat_kernel.cc line 13 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Add argument comments where argument meanings are not clear. Normally I'd refrain from comments, but these can actually be automatically checked by https://clang.llvm.org/extra/clang-tidy/checks/bugprone/argument-comment.html

Done.


lib/kernels/test/src/test_dropout.cc line 21 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Bracket initialization and argument name comments (see test_concat_kernel.cu)

Done.


lib/kernels/test/src/test_flat_kernel.cc line 10 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Bracket initialization and argument name comments (see test_concat_kernel.cu)

Done.


lib/kernels/test/src/test_flat_kernel.cc line 19 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Use explicit construction functions for DataTypeValue (see test_batch_norm_kernels.cu)

Done.


lib/kernels/test/src/test_flat_kernel.cc line 34 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Use explicit construction functions for DataTypeValue (see test_batch_norm_kernels.cu)

Done.


lib/kernels/test/src/test_flat_kernel.cc line 36 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Use explicit construction functions for DataTypeValue (see test_batch_norm_kernels.cu)

Done.


lib/kernels/test/src/test_gather_kernels.cc line 8 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Bracket initialization and argument name comments (see test_concat_kernel.cu)

Done.


lib/kernels/test/src/test_layer_norm_kernels.cc line 20 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Bracket initialization and argument name comments (see test_concat_kernel.cu)

Done.


lib/kernels/test/src/test_layer_norm_kernels.cc line 36 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Use explicit construction functions for DataTypeValue (see test_batch_norm_kernels.cu)

Done.


lib/kernels/test/src/test_layer_norm_kernels.cc line 42 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Use explicit construction functions for DataTypeValue (see test_batch_norm_kernels.cu)

Done.


lib/kernels/test/src/test_managed_per_device_ff_handle.cc line 8 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Argument name comments (see test_concat_kernel.cu)

Done.


lib/kernels/test/src/test_managed_per_device_ff_handle.cc line 24 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Argument name comments (see test_concat_kernel.cu)

Done.


lib/kernels/test/src/test_partition_kernel.cc line 9 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Bracket initialization and argument name comments (see test_concat_kernel.cu)

Done.


lib/kernels/test/src/test_partition_kernel.cc line 23 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Use explicit construction functions for DataTypeValue (see test_batch_norm_kernels.cu)

Done.


lib/kernels/test/src/test_partition_kernel.cc line 35 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Use explicit construction functions for DataTypeValue (see test_batch_norm_kernels.cu)

Done.


lib/kernels/test/src/test_partition_kernel.cc line 37 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Use explicit construction functions for DataTypeValue (see test_batch_norm_kernels.cu)

Done.


lib/kernels/test/src/test_pool_2d_kernels.cc line 15 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Bracket initialization and argument name comments (see test_concat_kernel.cu)

Done.


lib/kernels/test/src/test_pool_2d_kernels.cc line 60 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Use explicit construction functions for DataTypeValue (see test_batch_norm_kernels.cu)

Done.


lib/kernels/test/src/test_reduction_kernel.cc line 13 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Bracket initialization and argument name comments (see test_concat_kernel.cu)

Done.


lib/kernels/test/src/test_reduction_kernel.cc line 39 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Use explicit construction functions for DataTypeValue (see test_batch_norm_kernels.cu)

Done.


lib/kernels/test/src/test_replicate_kernel.cc line 16 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Bracket initialization and argument name comments (see test_concat_kernel.cu)

Done.


lib/kernels/test/src/test_replicate_kernel.cc line 56 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Bracket initialization and argument name comments (see test_concat_kernel.cu)

Done.


lib/kernels/test/src/test_reshape_kernel.cc line 8 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Bracket initialization and argument name comments (see test_concat_kernel.cu)

Done.


lib/kernels/test/src/test_reverse_kernels.cc line 17 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Bracket initialization and argument name comments (see test_concat_kernel.cu)

Done.


lib/kernels/test/src/test_reverse_kernels.cc line 25 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Use explicit construction functions for DataTypeValue (see test_batch_norm_kernels.cu)

Done.


lib/kernels/test/src/test_reverse_kernels.cc line 68 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Bracket initialization and argument name comments (see test_concat_kernel.cu)

Done.


lib/kernels/test/src/test_softmax_kernel.cc line 11 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Bracket initialization and argument name comments (see test_concat_kernel.cu)

Done.


lib/kernels/test/src/test_split_kernel.cc line 15 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Bracket initialization and argument name comments (see test_concat_kernel.cu)

Done.


lib/kernels/test/src/test_split_kernel.cc line 53 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Use explicit construction functions for DataTypeValue (see test_batch_norm_kernels.cu)

Done.


lib/kernels/test/src/test_transpose_kernel.cc line 10 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Prefer bracket initialization (see test_concat_kernel.cu)

Done.


lib/kernels/test/src/test_transpose_kernel.cc line 12 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Bracket initialization and argument name comments (see test_concat_kernel.cu)

Done.


lib/kernels/test/src/test_utils.h line 11 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Not sure what you mean by "but maybe our current behavior is fine with the initializer is fine anyways". Can you clarify?

This is no longer an issue. Initially brought up as a constructor defaulted to assigning a device type but this is no longer the case. Device type is currently determined based on whatever allocator is used to create tensor.


lib/kernels/test/src/test_utils.cc line 202 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Would probably be better to move DataTypeValue over to a dtgen .variant.toml type--any reason it isn't already?

Addressed in Slack


lib/local-execution/src/ops/linear.cc line 152 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

What do you mean--I'm not seeing any changes to this in accessors.h? Am I missing something?

Oh, I was referring to making:

auto bias = acc.get_tensorPermissions::RO(BIAS);
auto output_grad = acc.get_tensor_gradPermissions::RO(OUTPUT);

to:

auto bias = acc.get_tensorPermissions::RW(BIAS);
auto output_grad = acc.get_tensor_gradPermissions::RW(OUTPUT);

as backward_kernel needs a non-const pointer to be passed


lib/local-execution/test/src/test_local_cost_estimator.cc line 15 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Bracket initialization and argument name comments (see test_concat_kernel.cu)

Done.


lib/kernels/src/accessor.cc line 70 at r10 (raw file):

  for (size_t i = 0; i < this->shape.num_dims(); i++) {
    if (indices[i] >= this->shape.at(legion_dim_t(i))) {

Done.


lib/kernels/src/accessor.cc line 75 at r10 (raw file):

                      "when only {} indexes exist",
                      i,
                      indices[i],

Done.


lib/kernels/src/accessor.cc line 79 at r10 (raw file):

    }

    offset += indices[i] * multiplier;

Done.


lib/kernels/src/accessor.cc line 156 at r10 (raw file):

  for (size_t i = 0; i < this->shape.num_dims(); i++) {
    if (indices[i] >= this->shape.at(legion_dim_t(i))) {

Done.


lib/kernels/src/accessor.cc line 161 at r10 (raw file):

                      "when only {} indexes exist",
                      i,
                      indices[i],

Done.


lib/kernels/src/accessor.cc line 165 at r10 (raw file):

    }

    offset += indices[i] * multiplier;

Done.


lib/kernels/test/src/test_managed_ff_stream.cc line 7 at r10 (raw file):

TEST_SUITE(FF_TEST_SUITE) {
  TEST_CASE("Test Managed FF Stream") {

Done.


lib/kernels/test/src/test_managed_ff_stream.cc line 11 at r10 (raw file):

    ffStream_t const *base_stream_ptr = &base_stream.raw_stream();

    SUBCASE("Test ManagedFFStream Move Constructor") {

Done.


lib/kernels/test/src/test_managed_ff_stream.cc line 21 at r10 (raw file):

      new_stream = std::move(base_stream);
      CHECK(&base_stream.raw_stream() == nullptr);
      CHECK(&new_stream.raw_stream() == base_stream_ptr);

Done.


lib/kernels/test/src/test_utils.cc line 147 at r10 (raw file):

    T const *data_ptr = accessor.get<DT>();
    for (size_t i = 0; i < accessor.shape.num_elements(); i++) {
      stream << data_ptr[i] << " ";

Done.


lib/kernels/test/src/test_utils.cc line 149 at r10 (raw file):

      stream << data_ptr[i] << " ";
    }
    stream << "\n";

Done.


lib/kernels/src/managed_per_device_ff_handle.cc line 7 at r9 (raw file):

ManagedPerDeviceFFHandle::ManagedPerDeviceFFHandle() {
  this->handle = new PerDeviceFFHandle;

Done.


lib/kernels/test/src/test_managed_per_device_ff_handle.cc line 7 at r10 (raw file):

TEST_SUITE(FF_TEST_SUITE) {
  TEST_CASE("Test Managed Per Device FF Handle") {

Done.


lib/kernels/test/src/test_managed_per_device_ff_handle.cc line 11 at r10 (raw file):

    PerDeviceFFHandle const *base_handle_ptr = &base_handle.raw_handle();

    SUBCASE("Test ManagedPerDeviceFFHandle Constructor") {

Done.


lib/kernels/test/src/test_managed_per_device_ff_handle.cc line 16 at r10 (raw file):

    }

    SUBCASE("Test ManagedPerDeviceFFHandle Move Constructor") {

Done.


lib/kernels/test/src/test_managed_per_device_ff_handle.cc line 34 at r10 (raw file):

      base_handle = std::move(base_handle);
      CHECK(&base_handle.raw_handle() == base_handle_ptr);
    }

Done.

@lockshaw lockshaw changed the base branch from repo-refactor to master December 16, 2024 08:37
Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 31 of 44 files at r11, 63 of 67 files at r12, all commit messages.
Reviewable status: 140 of 144 files reviewed, 27 unresolved discussions (waiting on @oOTigger and @reyna-abhyankar)


lib/kernels/test/src/test_managed_per_device_ff_handle.cc line 24 at r10 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

Done.

Not seeing this done?


lib/kernels/test/src/test_utils.cc line 202 at r10 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

Addressed in Slack

I can't seem to find the explanation. Can you re-explain here or point me to where it was in slack?


lib/local-execution/src/ops/linear.cc line 152 at r9 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

Oh, I was referring to making:

auto bias = acc.get_tensorPermissions::RO(BIAS);
auto output_grad = acc.get_tensor_gradPermissions::RO(OUTPUT);

to:

auto bias = acc.get_tensorPermissions::RW(BIAS);
auto output_grad = acc.get_tensor_gradPermissions::RW(OUTPUT);

as backward_kernel needs a non-const pointer to be passed

But why? The kernel doesn't modify the bias, does it? You control the signature of backward_kernel so you can always change the signature if it doesn't make sense, it's not set in stone.


lib/kernels/include/kernels/accessor.h line 74 at r12 (raw file):

      offset += indices.at(i) * multiplier;
      multiplier *= this->shape.at(legion_dim_t{i});

Pull offset calculation (and probably also bounds checking) out into a separate helper function

Code quote:

    int offset = 0;
    int multiplier = 1;
    for (int i = 0; i < this->shape.num_dims(); i++) {
      if (indices.at(i) >= this->shape.at(legion_dim_t{i})) {
        throw mk_runtime_error(
            fmt::format("In {} dimension, attempting to access index {} "
                        "when only {} indexes exist",
                        i,
                        indices.at(i),
                        this->shape.at(legion_dim_t{i})));
      }

      offset += indices.at(i) * multiplier;
      multiplier *= this->shape.at(legion_dim_t{i});

lib/kernels/include/kernels/accessor.h line 159 at r12 (raw file):

      offset += indices.at(i) * multiplier;
      multiplier *= this->shape.at(legion_dim_t{i});

Pull offset calculation (and probably also bounds checking) out into a separate helper function

Code quote:

    int offset = 0;
    int multiplier = 1;
    for (int i = 0; i < this->shape.num_dims(); i++) {
      if (indices.at(i) >= this->shape.at(legion_dim_t{i})) {
        throw mk_runtime_error(
            fmt::format("In {} dimension, attempting to access index {} "
                        "when only {} indexes exist",
                        i,
                        indices.at(i),
                        this->shape.at(legion_dim_t{i})));
      }

      offset += indices.at(i) * multiplier;
      multiplier *= this->shape.at(legion_dim_t{i});

lib/kernels/include/kernels/accessor.h line 185 at r12 (raw file):

    T const *data_ptr = static_cast<T const *>(this->ptr);
    int offset = 0;
    int multiplier = 1;

Pull offset calculation (and probably also bounds checking) out into a separate helper function


lib/kernels/include/kernels/flat_kernels.h line 13 at r12 (raw file):

                    float *output_ptr);

void backward_kernel(cudaStream_t stream,

Why change from ffStream_t to cudaStream_t?

Code quote:

cudaStream_t stream,

lib/kernels/include/kernels/managed_per_device_ff_handle.h line 29 at r12 (raw file):

  void cleanup();

private:

Suggestion:

private:  
  void cleanup();

private:

lib/kernels/include/kernels/managed_ff_stream.h line 25 at r12 (raw file):

private:
  ffStream_t *stream;

Suggestion:

private:  
  void cleanup();

private:
  ffStream_t *stream;

lib/kernels/include/kernels/loss_function_kernels.h line 4 at r12 (raw file):

#define _FLEXFLOW_KERNELS_INCLUDE_KERNELS_LOSS_FUNCTION_KERNELS_H

#include "kernels/device.h"

Why remove this? We generally prefer absolute includes over relative includes


lib/kernels/include/kernels/metrics_kernels.h line 18 at r12 (raw file):

void update_metrics_label_kernel_wrapper(float const *logit_ptr,
                                         float const *label_ptr,
                                         MetricsAttrs const *me,

Why take as a pointer rather than by const ref?


lib/kernels/test/src/test_utils.cc line 151 at r12 (raw file):

    for (int i = 0; i < rows; i++) {
      for (int j = 0; j < cols; j++) {
        stream << data_ptr[i * cols + j];

Don't we have cleaner indexing syntax now for accessors that can take something like (i, j) rather than doing the calculation inline?


lib/kernels/test/src/test_utils.cc line 153 at r12 (raw file):

        stream << data_ptr[i * cols + j];

        if (j < cols - 1) {

Might be better to use join_strings from utils rather than re-implementing the logic here.


lib/local-execution/src/ops/pool_2d.cc line 33 at r12 (raw file):

  auto output = acc.get_tensor<Permissions::WO>(OUTPUT);

  int input_w = input.shape.at(ff_dim_t{0}) + 1;

Why all these + 1s? I suppose also a question for @reyna-abhyankar

Code quote:

  int input_w = input.shape.at(ff_dim_t{0}) + 1;

lib/pcg/src/pcg/metric.cc line 5 at r12 (raw file):

namespace FlexFlow {
MetricsAttrs::MetricsAttrs(LossFunction _loss_type,
                           std::vector<Metric> const &metrics)

Would probably make more sense to take in a std::unordered_set as there shouldn't be duplicates anyway and ordering doesn't matter?

Code quote:

                           std::vector<Metric> const &metrics)

lib/pcg/src/pcg/metric.cc line 33 at r12 (raw file):

      default:
        throw mk_runtime_error(
            "Initializing MetricsAttrs with unrecogonized metrics type");

Suggestion:

        throw mk_runtime_error(fmt::format(
            "Initializing MetricsAttrs with unrecogonized metrics type {}", m));

lib/kernels/include/kernels/pool_2d_kernels.h line 70 at r12 (raw file):

                    void *output_ptr);

void backward_kernel(cudaStream_t stream,

This should stay as ffStream_t--these header files are intended to be stable between the hip and cuda kernels, with the ff* typedefs used to handle the different types used between the two. If you hardcode the type to cudaStream_t this fails.

Code quote:

cudaStream_t stream,

lib/kernels/include/kernels/per_device_op_state.variant.toml line 0 at r12 (raw file):
Why was this moved to kernels?


lib/kernels/src/cuda/embedding_kernels.cu line 40 at r12 (raw file):

}

template <typename TI, typename TD>

Why remove one template parameter? No particular objection, just curious

Code quote:

emplate <typename TI, typename TD>

lib/kernels/src/cuda/embedding_kernels.cu line 100 at r12 (raw file):

                                        int in_dim,
                                        int batch_size,
                                        std::optional<AggregateOp> aggr) {

Why remove the optional?

Code quote:

std::optional<AggregateOp> aggr)

lib/op-attrs/include/op-attrs/make_datatype_value.h line 8 at r12 (raw file):

namespace FlexFlow {

DataTypeValue make_float_data_type_value(float value);

These seem to be duplicated (here and in datatype_value.h)?


lib/op-attrs/include/op-attrs/aggregate_op.enum.toml line 17 at r12 (raw file):

[[values]]
name = "NONE"

Don't add a NONE value--if you want an optional aggregate op, instead use std::optional<AggregateOp>

Code quote:

[[values]]
name = "NONE"

lib/op-attrs/src/op-attrs/make_datatype_value.cc line 5 at r12 (raw file):

namespace FlexFlow {

DataTypeValue make_float_data_type_value(float value) {

Can you make sure these get tested to double check that the right variant gets created?


lib/pcg/include/pcg/metric.h line 17 at r12 (raw file):

  ROOT_MEAN_SQUARED_ERROR,
  MEAN_ABSOLUTE_ERROR,
};

Move over to a .enum.toml so fmt, etc. is auto-generated

Code quote:

enum class Metric {
  ACCURACY,
  CATEGORICAL_CROSSENTROPY,
  SPARSE_CATEGORICAL_CROSSENTROPY,
  MEAN_SQUARED_ERROR,
  ROOT_MEAN_SQUARED_ERROR,
  MEAN_ABSOLUTE_ERROR,
};

lib/pcg/include/pcg/metric.h line 32 at r12 (raw file):

  bool measure_root_mean_squared_error;
  bool measure_mean_absolute_error;
};

Move over to dtgen so all the boilerplate (equality, hashing, fmt, etc.) gets auto-generated?

Code quote:

class MetricsAttrs {
public:
  MetricsAttrs() = delete;
  MetricsAttrs(LossFunction, std::vector<Metric> const &);

public:
  LossFunction loss_type;
  bool measure_accuracy;
  bool measure_categorical_crossentropy;
  bool measure_sparse_categorical_crossentropy;
  bool measure_mean_squared_error;
  bool measure_root_mean_squared_error;
  bool measure_mean_absolute_error;
};

lib/pcg/include/pcg/metric.h line 70 at r12 (raw file):

};

} // namespace fmt

Remove in favor of dtgen-generated fmt

Code quote:

namespace fmt {

template <>
struct formatter<::FlexFlow::Metric> : formatter<string_view> {
  template <typename FormatContext>
  auto format(::FlexFlow::Metric m, FormatContext &ctx) const
      -> decltype(ctx.out()) {
    using namespace FlexFlow;

    string_view name = "unknown";
    switch (m) {
      case Metric::ACCURACY:
        name = "Accuracy";
        break;
      case Metric::CATEGORICAL_CROSSENTROPY:
        name = "CategoricalCrossEntropy";
        break;
      case Metric::SPARSE_CATEGORICAL_CROSSENTROPY:
        name = "SparseCategoricalCrossEntropy";
        break;
      case Metric::MEAN_SQUARED_ERROR:
        name = "MeanSquaredError";
        break;
      case Metric::ROOT_MEAN_SQUARED_ERROR:
        name = "RootMeanSquaredError";
        break;
      case Metric::MEAN_ABSOLUTE_ERROR:
        name = "MeanAbsoluteError";
        break;
    }
    return formatter<string_view>::format(name, ctx);
  }
};

} // namespace fmt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kernels Kernels library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants