-
Notifications
You must be signed in to change notification settings - Fork 233
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CPU Kernel Tests #1439
base: master
Are you sure you want to change the base?
CPU Kernel Tests #1439
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 23 files at r1, 21 of 29 files at r2, 12 of 12 files at r3, all commit messages.
Reviewable status: all files reviewed, 18 unresolved discussions (waiting on @oOTigger and @reyna-abhyankar)
lib/kernels/include/kernels/cast_kernels_cpu.h
line 15 at r3 (raw file):
GenericTensorAccessorW const &output, DataType input_type, DataType output_type);
Prefer function names over extra namespaces, the namespaces in kernels
are more of a legacy holdover than something that should be used more
Suggestion:
void cpu_forward_kernel(GenericTensorAccessorR const &input,
GenericTensorAccessorW const &output,
DataType input_type,
DataType output_type);
lib/kernels/src/local_cpu_allocator.cc
line 6 at r3 (raw file):
namespace FlexFlow { void *LocalCPUAllocator::allocate(size_t requested_memory_size) { void *ptr = calloc(1, requested_memory_size);
Just use malloc
Code quote:
calloc(
lib/kernels/src/local_cpu_allocator.cc
line 28 at r3 (raw file):
LocalCPUAllocator::~LocalCPUAllocator() { for (auto ptr : ptrs) {
Suggestion:
for (void *ptr : this->ptrs) {
lib/kernels/src/local_cuda_allocator.cc
line 8 at r3 (raw file):
void *ptr; checkCUDA(cudaMalloc(&ptr, requested_memory_size)); checkCUDA(cudaMemset(ptr, 0, requested_memory_size));
I don't think code should assume that the allocated memory has been zero'd, unless there's some reason I'm not thinking of?
lib/kernels/src/cpu/reverse_kernels.cc
line 14 at r3 (raw file):
coord_t in_blk_size) { coord_t total_elements = num_out_blks * reverse_dim_size * in_blk_size; for (coord_t i = 0; i < total_elements; ++i) {
This is really hard to read right now, make more readable (maybe pull out some helper functions or something?)
lib/kernels/src/cuda/ops/reverse_kernels.cu
line 41 at r3 (raw file):
// } /* I mentioned this earlier, but I still think the reverse_forward_kernel code
See my message in slack
lib/kernels/test/src/test_batch_norm_kernel.cc
line 91 at r3 (raw file):
std::vector<float> host_bias_grad_data = load_accessor_data<DataType::FLOAT>( read_only_accessor_from_write_accessor(bias_grad_accessor));
Why is read_only_accessor_from_write_accessor
necessary here? Aren't GenericTensorW
s assumed to be RW
@reyna-abhyankar ? If so, add an overload to load_accessor_data
lib/kernels/test/src/test_combine_kernel.cc
line 64 at r3 (raw file):
GenericTensorAccessorR input_accessor_gpu = read_only_accessor_from_write_accessor( create_random_filled_accessor_w(input_shape, gpu_allocator));
Why not also define a create_random_filled_accessor_r
?
Code quote:
create_random_filled_accessor_w
lib/kernels/test/src/test_utils.h
line 11 at r3 (raw file):
#include <random> enum class GpuDirection {
If we're going to have GenericTensorAccessor
s allocated on different devices it would probably be good to have a field on GenericTensorAccessor
that tracks that, so it can at least be checked at runtime if you try to access it incorrectly, etc.?
This would also allow you to infer a lot of locations (CPU vs GPU) rather than having to pass additional arguments to the transfer functions
lib/kernels/test/src/test_utils.h
line 12 at r3 (raw file):
enum class GpuDirection { HostToDevice = 0,
Why are concrete values needed?
lib/kernels/test/src/test_utils.h
line 65 at r3 (raw file):
template <typename IDT, typename ODT, typename F> GenericTensorAccessorW create_transformed_accessor_w(TensorShape const &shape,
It seems like this is just being used to create a random tensor, seems like that might be a better behavior?
lib/kernels/test/src/test_utils.h
line 86 at r3 (raw file):
} template <DataType DT>
Where possible, probably best to add a dynamically-dispatched function as well as one that takes the argument by template. This isn't possible for some functions (such as those that return a type based on the template type), but for some of these in this function it seems possible.
lib/kernels/test/src/test_utils.h
line 88 at r3 (raw file):
template <DataType DT> GenericTensorAccessorW copy_tensor_between_memories(GenericTensorAccessorR accessor,
The behavior of this method seems rather confusing, I'd stick to explicitly stating the direction you'd like to transfer (either from Host to Device, or reversed)/
lib/kernels/test/src/test_utils.h
line 89 at r3 (raw file):
GenericTensorAccessorW copy_tensor_between_memories(GenericTensorAccessorR accessor, TensorShape const &shape,
GenericTensorAccessor
already has a shape, no need to pass an additional shape
separately
lib/kernels/test/src/test_utils.h
line 104 at r3 (raw file):
} template <DataType DT>
Pass as a normal argument, not a template argument
lib/kernels/test/src/test_utils.h
line 116 at r3 (raw file):
template <DataType DT> std::vector<real_type<DT>> load_accessor_data(GenericTensorAccessorR accessor, bool on_host = false) {
Change to an enum and make passing required/copy
Code quote:
bool on_host = false
lib/kernels/test/src/test_utils.cc
line 8 at r3 (raw file):
GenericTensorAccessorW accessor = allocator.allocate_tensor(shape); size_t volume = accessor.shape.num_elements(); std::vector<float> host_data(volume);
Can't TensorShape
have different datatypes? If you're going to only do this for float
you should at the very least have a check
lib/kernels/test/src/test_utils.cc
line 17 at r3 (raw file):
} transfer_memory(static_cast<float *>(accessor.ptr),
Would it be better to have a function that transfers a GenericTensorAccessor
, rather than having to do casting to transfer each time? That also saves you from having to pass volume
, etc.
…lexflow#1429) * initial commit for machine view adjacent modules * Formatting * Tests for new machine_view.cc functions * formatting * Minor Test correction * formatting * PR fixes * PR Fixes --------- Co-authored-by: Pietro Max Marsella <[email protected]>
…n, other minor fixes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 2 of 50 files reviewed, 18 unresolved discussions (waiting on @lockshaw and @reyna-abhyankar)
lib/kernels/include/kernels/cast_kernels_cpu.h
line 15 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Prefer function names over extra namespaces, the namespaces in
kernels
are more of a legacy holdover than something that should be used more
Done.
lib/kernels/src/local_cpu_allocator.cc
line 6 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Just use malloc
Done.
lib/kernels/src/local_cpu_allocator.cc
line 28 at r3 (raw file):
LocalCPUAllocator::~LocalCPUAllocator() { for (auto ptr : ptrs) {
Done.
lib/kernels/src/local_cuda_allocator.cc
line 8 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
I don't think code should assume that the allocated memory has been zero'd, unless there's some reason I'm not thinking of?
Ended up adding an allocate_and_zero function to the Allocator class, not sure if this is too big of a change though?
lib/kernels/src/cpu/reverse_kernels.cc
line 14 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
This is really hard to read right now, make more readable (maybe pull out some helper functions or something?)
Is this "too" broken down?
lib/kernels/src/cuda/ops/reverse_kernels.cu
line 41 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
See my message in slack
Currently talking to Xinhao
lib/kernels/test/src/test_batch_norm_kernel.cc
line 91 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why is
read_only_accessor_from_write_accessor
necessary here? Aren'tGenericTensorW
s assumed to beRW
@reyna-abhyankar ? If so, add an overload toload_accessor_data
Done.
lib/kernels/test/src/test_combine_kernel.cc
line 64 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why not also define a
create_random_filled_accessor_r
?
Done.
lib/kernels/test/src/test_utils.h
line 11 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
If we're going to have
GenericTensorAccessor
s allocated on different devices it would probably be good to have a field onGenericTensorAccessor
that tracks that, so it can at least be checked at runtime if you try to access it incorrectly, etc.?This would also allow you to infer a lot of locations (CPU vs GPU) rather than having to pass additional arguments to the transfer functions
Ideally, this is using the same enum class as in allocation.h, but for some reason, whenever I define that class in accessor.h and try and include it as part of the class I get a bunch of compile errors? I think IT has to do with FF_VISITABLE_STRUCT_NONSTANDARD_CONSTRUCTION? Is there some rule that I'm not aware of of what can and can't be passed in?
I also defined an initializer list as a lot of places in the code base reference GenericTensorAccessors and they initialize it using a three parameter constructor (ie. {datatype, shape, pointer}) so needed to add a way to default tensor location. Is it safe to say that previous references are on GPU or do I need to sift through this manually?
In addition, allocator's also now store additional state as well on if they allocate things on CPU or GPU so we can initialize the values for GenericTensorAccessors.
lib/kernels/test/src/test_utils.h
line 12 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why are concrete values needed?
Done.
lib/kernels/test/src/test_utils.h
line 65 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
It seems like this is just being used to create a random tensor, seems like that might be a better behavior?
This is mainly for debugging purposes, so can create like an iota filled accessor or etc.
lib/kernels/test/src/test_utils.h
line 88 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
The behavior of this method seems rather confusing, I'd stick to explicitly stating the direction you'd like to transfer (either from Host to Device, or reversed)/
Done.
lib/kernels/test/src/test_utils.h
line 89 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
GenericTensorAccessor
already has a shape, no need to pass an additionalshape
separately
Done.
lib/kernels/test/src/test_utils.h
line 104 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Pass as a normal argument, not a template argument
Done.
lib/kernels/test/src/test_utils.h
line 116 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Change to an enum and make passing required/copy
Just removed entirely
lib/kernels/test/src/test_utils.cc
line 8 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Can't
TensorShape
have different datatypes? If you're going to only do this forfloat
you should at the very least have a check
Done.
lib/kernels/test/src/test_utils.cc
line 17 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Would it be better to have a function that transfers a
GenericTensorAccessor
, rather than having to do casting to transfer each time? That also saves you from having to passvolume
, etc.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 48 of 48 files at r4, all commit messages.
Reviewable status: all files reviewed, 35 unresolved discussions (waiting on @oOTigger and @reyna-abhyankar)
lib/kernels/include/kernels/accessor.h
line 31 at r4 (raw file):
double *get_double_ptr() const; half *get_half_ptr() const;
Code snippet:
GenericTensorAccessorW() = delete;
lib/kernels/include/kernels/accessor.h
line 33 at r4 (raw file):
GenericTensorAccessorW(DataType dt, ArrayShape sh,
Suggestion:
ArrayShape const &sh,
lib/kernels/include/kernels/accessor.h
line 34 at r4 (raw file):
GenericTensorAccessorW(DataType dt, ArrayShape sh, req<void *> p,
Suggestion:
void *p,
lib/kernels/include/kernels/accessor.h
line 35 at r4 (raw file):
ArrayShape sh, req<void *> p, bool on_dev = true)
Suggestion:
GenericTensorAccessorW(DataType data_type,
ArrayShape shape,
req<void *> ptr,
bool on_dev = true)
lib/kernels/include/kernels/accessor.h
line 36 at r4 (raw file):
req<void *> p, bool on_dev = true) : data_type(dt), shape(sh), ptr(p), on_device(on_dev) {}
Move to .cc
file
lib/kernels/include/kernels/accessor.h
line 41 at r4 (raw file):
DataType data_type; ArrayShape shape; req<void *> ptr;
Suggestion:
void * ptr;
lib/kernels/include/kernels/accessor.h
line 42 at r4 (raw file):
ArrayShape shape; req<void *> ptr; bool on_device;
Suggestion:
DeviceType device_type;
lib/kernels/include/kernels/accessor.h
line 43 at r4 (raw file):
req<void *> ptr; bool on_device; };
Suggestion:
bool operator==(GenericTensorAccessorW const &) const;
bool operator!=(GenericTensorAccessorW const &) const;
public:
DataType data_type;
ArrayShape shape;
req<void *> ptr;
bool on_device;
private:
std::tuple<decltype(data_type) const &,
decltype(shape) const &,
decltype(ptr) const &,
decltype(on_device) const &>
tie() const;
};
// in .cc file
std::tuple<DataType const &,
ArrayShape const &,
void *,
DeviceType const &>
GenericTensorAccessorW::tie() const {
return std::tie(this->data_type, this->shape, this->ptr, this->on_device);
}
bool GenericTensorAccessorW::operator==(GenericTensorAccessorW const &other) const {
return this->tie() == other.tie();
}
bool GenericTensorAccessorW::operator!=(GenericTensorAccessorW const &other) const {
return this->tie() != other.tie();
}
lib/kernels/include/kernels/accessor.h
line 45 at r4 (raw file):
}; FF_VISITABLE_STRUCT_NONSTANDARD_CONSTRUCTION( GenericTensorAccessorW, data_type, shape, ptr, on_device);
visitable
is deprecated
lib/kernels/include/kernels/accessor.h
line 64 at r4 (raw file):
double const *get_double_ptr() const; half const *get_half_ptr() const;
Code snippet:
GenericTensorAccessorR() = delete;
lib/kernels/include/kernels/accessor.h
line 65 at r4 (raw file):
half const *get_half_ptr() const; GenericTensorAccessorR(DataType dt,
Move to .cc
file
lib/kernels/include/kernels/accessor.h
line 77 at r4 (raw file):
bool on_device; }; FF_VISITABLE_STRUCT_NONSTANDARD_CONSTRUCTION(
Make same tie
changes as above to add ==
, !=
, hash
back in (they're currently added by visitable
)
lib/kernels/include/kernels/allocation.h
line 14 at r4 (raw file):
struct IAllocator { virtual void *allocate(size_t) = 0; virtual void *allocate_and_zero(size_t) = 0;
Delete, if you want to zero the allocation you can zero it with some other function, don't make it part of the allocator interface
lib/kernels/include/kernels/allocation.h
line 16 at r4 (raw file):
virtual void *allocate_and_zero(size_t) = 0; virtual void deallocate(void *) = 0;
Code snippet:
virtual DeviceType get_allocation_device_type() = 0;
lib/kernels/include/kernels/allocation.h
line 25 at r4 (raw file):
GenericTensorAccessorW allocate_tensor(TensorShape const &tensor_shape); GenericTensorAccessorW allocate_tensor_and_zero(TensorShape const &tensor_shape);
Delete
lib/kernels/include/kernels/allocation.h
line 28 at r4 (raw file):
void *allocate(size_t mem_size); void *allocate_and_zero(size_t mem_size);
Delete
lib/kernels/include/kernels/allocation.h
line 40 at r4 (raw file):
Allocator(std::shared_ptr<IAllocator> ptr) : i_allocator(ptr){}; AllocLocation alloc_location;
Delete
lib/kernels/include/kernels/local_cpu_allocator.h
line 13 at r4 (raw file):
void *allocate(size_t) override; void *allocate_and_zero(size_t) override;
Delete
lib/kernels/include/kernels/local_cuda_allocator.h
line 13 at r4 (raw file):
void *allocate(size_t) override; void *allocate_and_zero(size_t) override;
Delete
lib/kernels/src/array_shape.cc
line 63 at r4 (raw file):
} TensorShape get_tensor_shape(ArrayShape const &shape, DataType DT) {
Use the implementation from @reyna-abhyankar's branch: https://github.com/reyna-abhyankar/FlexFlow/blob/e1a8a01e9e4f782805aa62c163bf50799f0638b4/lib/kernels/src/array_shape.cc#L75-L78
lib/kernels/src/local_cpu_allocator.cc
line 5 at r4 (raw file):
namespace FlexFlow { void *LocalCPUAllocator::allocate(size_t requested_memory_size) {
Use the implementations from @reyna-abhyankar's branch: https://github.com/reyna-abhyankar/FlexFlow/blob/e1a8a01e9e4f782805aa62c163bf50799f0638b4/lib/local-execution/src/local_cpu_allocator.cc
lib/kernels/src/cpu/combine_kernels.cc
line 4 at r4 (raw file):
#include "kernels/datatype_dispatch.h" namespace FlexFlow {
FYI we're now on C++17, so you can instead write namespace FlexFlow::Kernels::Combine {
lib/kernels/src/cpu/reverse_kernels.cc
line 14 at r3 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
Is this "too" broken down?
The comments are helpful, but I think the big thing that would help is the access-by-coordinate function for GenericTensorAccessor
lib/kernels/src/cuda/ops/reverse_kernels.cu
line 41 at r3 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
Currently talking to Xinhao
Can you change your status on this comment then to "Working": see https://docs.reviewable.io/discussions.html#dispositions-and-resolution
lib/kernels/test/src/test_batch_norm_kernel.cc
line 36 at r4 (raw file):
GenericTensorAccessorW input_accessor = create_random_filled_accessor_w<DataType::FLOAT>(input_shape,
Why is the extra template parameter suddenly necessary?
lib/kernels/test/src/test_cast_kernel.cc
line 90 at r4 (raw file):
DataType::FLOAT, DataType::INT32); std::cout << "Before GPU load" << std::endl;
Delete
lib/kernels/test/src/test_cast_kernel.cc
line 103 at r4 (raw file):
DataType::FLOAT, DataType::INT32); std::cout << "Before CPU load" << std::endl;
Delete
lib/kernels/test/src/test_combine_kernel.cc
line 111 at r4 (raw file):
output_grad_accessor_gpu, cpu_allocator); GenericTensorAccessorW input_grad_accessor_cpu = cpu_allocator.allocate_tensor_and_zero(input_shape);
Or even add a helper function create_zero_filled_accessor_w
that does these two operations
Suggestion:
GenericTensorAccessorW input_grad_accessor_cpu =
cpu_allocator.allocate_tensor(input_shape);
fill_with_zeros(input_grad_accessor_cpu);
lib/kernels/test/src/test_reverse_kernels.cc
line 147 at r4 (raw file):
// Run CPU Cast Backward Kernel GenericTensorAccessorW output_grad_accessor_cpu = copy_tensor_between_memories<DataType::FLOAT>(
Add an explicit destination (semantics should either be copy to CPU or copy to GPU, not dependent on the current placement)
Code quote:
copy_tensor_between_memories<DataType::FLOAT>(
lib/kernels/test/src/test_utils.h
line 11 at r3 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
Ideally, this is using the same enum class as in allocation.h, but for some reason, whenever I define that class in accessor.h and try and include it as part of the class I get a bunch of compile errors? I think IT has to do with FF_VISITABLE_STRUCT_NONSTANDARD_CONSTRUCTION? Is there some rule that I'm not aware of of what can and can't be passed in?
I also defined an initializer list as a lot of places in the code base reference GenericTensorAccessors and they initialize it using a three parameter constructor (ie. {datatype, shape, pointer}) so needed to add a way to default tensor location. Is it safe to say that previous references are on GPU or do I need to sift through this manually?
In addition, allocator's also now store additional state as well on if they allocate things on CPU or GPU so we can initialize the values for GenericTensorAccessors.
There shouldn't be that much code explicitly creating GenericTensorAccessor
s I think? Can you point me to some examples?
lib/kernels/test/src/test_utils.h
line 65 at r3 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
This is mainly for debugging purposes, so can create like an iota filled accessor or etc.
Either specialize to a function that creates a random tensor, or pass the current tensor index to transform
--the current abstraction is a bit awkwardly intermediate
lib/kernels/test/src/test_utils.h
line 88 at r3 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
Done.
Still seeing the old behavior? The destination placement should be stated either as part of the function name or as an argument--if it's an argument, rename the function to copy_tensor_to_memory
lib/kernels/test/src/test_utils.h
line 33 at r4 (raw file):
} template <DataType DT>
Why not just get the datatype from the shape
?
lib/kernels/test/src/test_utils.cc
line 11 at r4 (raw file):
} TensorShape make_tensor_shape_from_legion_dims(FFOrdered<size_t> dims,
This doesn't seem correct, shouldn't this be taking a LegionOrdered
?
Suggestion:
legion_dims(FFOrdered<size_t> const &dims,
lib/local-execution/include/local-execution/tracked_allocator.h
line 15 at r4 (raw file):
void *allocate(size_t) override; void *allocate_and_zero(size_t) override;
Delete
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 8 of 83 files reviewed, 35 unresolved discussions (waiting on @lockshaw and @reyna-abhyankar)
lib/kernels/include/kernels/accessor.h
line 31 at r4 (raw file):
double *get_double_ptr() const; half *get_half_ptr() const;
Done.
lib/kernels/include/kernels/accessor.h
line 33 at r4 (raw file):
GenericTensorAccessorW(DataType dt, ArrayShape sh,
Done.
lib/kernels/include/kernels/accessor.h
line 34 at r4 (raw file):
GenericTensorAccessorW(DataType dt, ArrayShape sh, req<void *> p,
Done.
lib/kernels/include/kernels/accessor.h
line 35 at r4 (raw file):
ArrayShape sh, req<void *> p, bool on_dev = true)
Done.
lib/kernels/include/kernels/accessor.h
line 36 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Move to
.cc
file
Done.
lib/kernels/include/kernels/accessor.h
line 41 at r4 (raw file):
DataType data_type; ArrayShape shape; req<void *> ptr;
Done.
lib/kernels/include/kernels/accessor.h
line 42 at r4 (raw file):
ArrayShape shape; req<void *> ptr; bool on_device;
Done.
lib/kernels/include/kernels/accessor.h
line 43 at r4 (raw file):
req<void *> ptr; bool on_device; };
Done.
lib/kernels/include/kernels/accessor.h
line 45 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
visitable
is deprecated
Done.
lib/kernels/include/kernels/accessor.h
line 64 at r4 (raw file):
double const *get_double_ptr() const; half const *get_half_ptr() const;
Done.
lib/kernels/include/kernels/accessor.h
line 65 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Move to
.cc
file
Done.
lib/kernels/include/kernels/accessor.h
line 77 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Make same
tie
changes as above to add==
,!=
,hash
back in (they're currently added byvisitable
)
Done.
lib/kernels/include/kernels/allocation.h
line 14 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Delete, if you want to zero the allocation you can zero it with some other function, don't make it part of the allocator interface
Done.
lib/kernels/include/kernels/allocation.h
line 16 at r4 (raw file):
virtual void *allocate_and_zero(size_t) = 0; virtual void deallocate(void *) = 0;
Done.
lib/kernels/include/kernels/allocation.h
line 25 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Delete
Done.
lib/kernels/include/kernels/allocation.h
line 28 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Delete
Done.
lib/kernels/include/kernels/allocation.h
line 40 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Delete
Done.
lib/kernels/include/kernels/local_cpu_allocator.h
line 13 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Delete
Done.
lib/kernels/include/kernels/local_cuda_allocator.h
line 13 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Delete
Done.
lib/kernels/src/array_shape.cc
line 63 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Use the implementation from @reyna-abhyankar's branch: https://github.com/reyna-abhyankar/FlexFlow/blob/e1a8a01e9e4f782805aa62c163bf50799f0638b4/lib/kernels/src/array_shape.cc#L75-L78
Done.
lib/kernels/src/local_cpu_allocator.cc
line 5 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Use the implementations from @reyna-abhyankar's branch: https://github.com/reyna-abhyankar/FlexFlow/blob/e1a8a01e9e4f782805aa62c163bf50799f0638b4/lib/local-execution/src/local_cpu_allocator.cc
Done.
lib/kernels/src/cpu/combine_kernels.cc
line 4 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
FYI we're now on C++17, so you can instead write
namespace FlexFlow::Kernels::Combine {
Done
lib/kernels/src/cpu/reverse_kernels.cc
line 14 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
The comments are helpful, but I think the big thing that would help is the access-by-coordinate function for
GenericTensorAccessor
Done.
lib/kernels/test/src/test_batch_norm_kernel.cc
line 36 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why is the extra template parameter suddenly necessary?
I wanted create_random_filled_accessor to just return a random accessor filled with decimal values. Initially, I had it where it would always be a float, but realized from doing the cast_kernel operator, that we could sometimes want it to be a double as well rather than float
lib/kernels/test/src/test_cast_kernel.cc
line 90 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Delete
Done.
lib/kernels/test/src/test_cast_kernel.cc
line 103 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Delete
Done.
lib/kernels/test/src/test_combine_kernel.cc
line 111 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Or even add a helper function
create_zero_filled_accessor_w
that does these two operations
Done.
lib/kernels/test/src/test_reverse_kernels.cc
line 147 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Add an explicit destination (semantics should either be copy to CPU or copy to GPU, not dependent on the current placement)
Done.
lib/kernels/test/src/test_utils.h
line 11 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
There shouldn't be that much code explicitly creating
GenericTensorAccessor
s I think? Can you point me to some examples?
I guess you're right, not sure what I was seeing earlier... I could probably sift through all initialization of GenericTensorAccessor's and specify their DeviceType, but maybe our current behavior is fine with the initializer is fine anyways?
lib/kernels/test/src/test_utils.h
line 65 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Either specialize to a function that creates a random tensor, or pass the current tensor index to
transform
--the current abstraction is a bit awkwardly intermediate
Just deleted entirely
lib/kernels/test/src/test_utils.h
line 86 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Where possible, probably best to add a dynamically-dispatched function as well as one that takes the argument by template. This isn't possible for some functions (such as those that return a type based on the template type), but for some of these in this function it seems possible.
Done.
lib/kernels/test/src/test_utils.h
line 88 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Still seeing the old behavior? The destination placement should be stated either as part of the function name or as an argument--if it's an argument, rename the function to
copy_tensor_to_memory
Made more explicit
lib/kernels/test/src/test_utils.h
line 33 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why not just get the datatype from the
shape
?
Done.
lib/kernels/test/src/test_utils.cc
line 11 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
This doesn't seem correct, shouldn't this be taking a
LegionOrdered
?
Done.
lib/local-execution/include/local-execution/tracked_allocator.h
line 15 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Delete
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 72 of 75 files at r5, 3 of 3 files at r6, all commit messages.
Reviewable status: all files reviewed, 33 unresolved discussions (waiting on @oOTigger and @reyna-abhyankar)
lib/kernels/CMakeLists.txt
line 33 at r6 (raw file):
nccl utils pcg
Why?
lib/kernels/include/kernels/accessor.h
line 15 at r6 (raw file):
namespace FlexFlow { class GenericTensorAccessorW {
@reyna-abhyankar Is there a reason we have both GenericTensorAccessorW
and GenericTensorAccessorR
over just having one GenericTensorAccessor
and passing it as either const &
or mutable &
?
lib/kernels/include/kernels/accessor.h
line 45 at r6 (raw file):
template <DataType DT, typename... Indices> real_type_t<DT> &at(Indices... indices) { if (this->device_type != DeviceType::CPU) {
Any reason for this over the simpler at(std::vector<int> const &)
?
Code quote:
template <DataType DT, typename... Indices>
real_type_t<DT> &at(Indices... indices) {
if (this->device_type != DeviceType::CPU) {
lib/kernels/include/kernels/reverse_kernels_cpu.h
line 5 at r6 (raw file):
#include "accessor.h" #include "device.h"
Prefer full include paths
Suggestion:
#include "kernels/accessor.h"
#include "kernels/device.h"
lib/kernels/src/accessor.cc
line 30 at r6 (raw file):
} size_t offset = 0;
@Marsella8 Do you think we could reasonably add some generic coord -> index
& index apsce concept in utils
and then move this code and part of the 1458 over to use it? Feels like this logic is getting unnecessarily reimplemented
lib/kernels/src/accessor.cc
line 47 at r6 (raw file):
offset += cur_idx * multiplier; multiplier *= this->shape[legion_dim_t(i)];
Prefer at
for bounds checking (ArrayShape
may have bounds-checking for both, but other containers don't so it's a good habit)
Suggestion:
multiplier *= this->shape.at(legion_dim_t(i));
lib/kernels/src/allocation.cc
line 22 at r6 (raw file):
void *ptr = this->allocate(get_size_in_bytes(tensor_shape)); return { tensor_shape.data_type, tensor_shape, ptr, get_allocation_device_type()};
Prefer explicit this->
Suggestion:
tensor_shape.data_type, tensor_shape, ptr, this->get_allocation_device_type()};
lib/kernels/src/cpu/replicate_kernels.cc
line 7 at r6 (raw file):
template <typename T> void cpu_replicate_backward_kernel(T *input,
Move over to use the new GenericTensorAccessor
access-by-multidimensional-index support?
lib/kernels/src/cpu/reverse_kernels.cc
line 14 at r3 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
Done.
Yeah this is much clearer, thanks (also it would be good to rename blk
to block
as it's a bit nonobvious that blk
means block
and not bulk
)
lib/kernels/src/cpu/reverse_kernels.cc
line 13 at r6 (raw file):
GenericTensorAccessorW &output, coord_t num_out_blks, coord_t reverse_dim_size,
Can't we just get this from the shape information contained in input
?
Code quote:
coord_t reverse_dim_size,
lib/kernels/src/cpu/reverse_kernels.cc
line 49 at r6 (raw file):
DataTypeDispatch1<CPUReverseForwardKernel>{}(input_accessor.data_type, input_accessor, std::ref(output_accessor),
Why is std::ref
necessary?
Code quote:
std::ref(output_accessor),
lib/kernels/test/CMakeLists.txt
line 17 at r6 (raw file):
cudart cublas pcg
Why?
lib/kernels/test/src/test_batch_norm_kernel.cc
line 36 at r4 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
I wanted create_random_filled_accessor to just return a random accessor filled with decimal values. Initially, I had it where it would always be a float, but realized from doing the cast_kernel operator, that we could sometimes want it to be a double as well rather than float
Sure, but the DataType is already present in the input_shape
, so probably better to just use that than force the user to pass the DataType
twice
lib/kernels/test/src/test_cast_kernel.cc
line 87 at r6 (raw file):
std::vector<int32_t> result_data_gpu = load_accessor_data<DataType::INT32>(output_accessor_gpu);
Why not just get back a GenericTensorAccessorR
with its data on CPU? Throwing away all the information by dropping down to a std::vector
feels unnecessary
Code quote:
std::vector<int32_t> result_data_gpu =
load_accessor_data<DataType::INT32>(output_accessor_gpu);
lib/kernels/test/src/test_cast_kernel.cc
line 91 at r6 (raw file):
// Run CPU Forward Kernel GenericTensorAccessorW input_accessor_cpu = create_random_filled_accessor_w<DataType::FLOAT>(input_shape,
If the data is random, why would I expect input_accessor_cpu
to have the same data as input_accessor_gpu
? Would it be cleaner to generate a GenericTensorAccessor
with random data on CPU, then create a copy of that on GPU, and then run the functions?
lib/kernels/test/src/test_concat_kernel.cc
line 9 at r6 (raw file):
TEST_SUITE(FF_TEST_SUITE) { TEST_CASE("Test concat kernel forward and backward") { size_t num_inputs = 2;
Why these changes?
lib/kernels/test/src/test_replicate_kernel.cc
line 78 at r6 (raw file):
GenericTensorAccessorW output_accessor_gpu = gpu_allocator.allocate_tensor(output_shape); fill_with_zeros(output_accessor_gpu);
Wouldn't filling with random data here give you more assurance the function is correct?
lib/kernels/test/src/test_reverse_kernels.cc
line 150 at r6 (raw file):
GenericTensorAccessorW input_grad_accessor_cpu = cpu_allocator.allocate_tensor(input_shape); fill_with_zeros(input_grad_accessor_cpu);
I'd recommend creating a helper function for creating zero-filled tensors in test_utils
to avoid any "use before initialized" issues
Code quote:
cpu_allocator.allocate_tensor(input_shape);
fill_with_zeros(input_grad_accessor_cpu);
lib/kernels/test/src/test_utils.h
line 11 at r3 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
I guess you're right, not sure what I was seeing earlier... I could probably sift through all initialization of GenericTensorAccessor's and specify their DeviceType, but maybe our current behavior is fine with the initializer is fine anyways?
Not sure what you mean by "but maybe our current behavior is fine with the initializer is fine anyways". Can you clarify?
lib/kernels/test/src/test_utils.h
line 33 at r4 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
Done.
Then why is the template parameter still here?
lib/kernels/test/src/test_utils.h
line 34 at r6 (raw file):
template <typename DT> void transfer_memory(GenericTensorAccessorW dst_accessor,
Suggestion:
void transfer_memory(GenericTensorAccessorW &dst_accessor,
lib/kernels/test/src/test_utils.h
line 35 at r6 (raw file):
template <typename DT> void transfer_memory(GenericTensorAccessorW dst_accessor, const DT *src,
Why is one parameter a TensorAccessor
and the other is a raw array? Feels inconsistent
Code quote:
void transfer_memory(GenericTensorAccessorW dst_accessor,
const DT *src,
lib/kernels/test/src/test_utils.h
line 37 at r6 (raw file):
const DT *src, DeviceType src_device_type) { size_t bytes = dst_accessor.shape.get_volume() * sizeof(DT);
Suggestion:
size_t num_bytes = dst_accessor.shape.get_volume() * sizeof(DT);
lib/kernels/test/src/test_utils.h
line 78 at r6 (raw file):
template <DataType DT> GenericTensorAccessorR create_random_filled_accessor_r(TensorShape const &shape,
Get the DataType
from the TensorShape
lib/kernels/test/src/test_utils.h
line 88 at r6 (raw file):
template <typename T> GenericTensorAccessorW create_filled_accessor_w(TensorShape const &shape,
Get the DataType
from the TensorShape
lib/kernels/test/src/test_utils.h
line 124 at r6 (raw file):
template <DataType DT> std::vector<real_type_t<DT>> load_accessor_data(GenericTensorAccessorW accessor) {
Suggestion:
load_accessor_data(GenericTensorAccessorW const &accessor) {
lib/kernels/test/src/test_utils.h
line 149 at r6 (raw file):
template <typename T> bool vectors_are_approx_equal(T lhs, T rhs) {
Now that we have support for GenericTensorAccessor
s on CPU, using vectors no longer seems necessary I think
Code quote:
template <typename T>
bool contains_non_zero(std::vector<T> &data) {
return !all_of(data, [](T const &val) { return val == 0; });
}
template <typename T>
bool vectors_are_approx_equal(T lhs, T rhs) {
lib/kernels/test/src/test_utils.cc
line 5 at r6 (raw file):
namespace FlexFlow { bool device_on_cpu(DeviceType device_type) {
Is this function really necessary? Doesn't seem like you're saving much typing or much complexity
lib/kernels/test/src/test_utils.cc
line 44 at r6 (raw file):
return DataTypeDispatch1<CopyTensorAccessorW>{}( src_accessor.data_type, src_accessor, std::ref(allocator)); }
The ability to copy tensors feels like a useful primitive by itself, might be better moved into kernels
itself?
Code quote:
template <DataType DT>
struct CopyTensorAccessorW {
GenericTensorAccessorW operator()(GenericTensorAccessorW const &src_accessor,
Allocator &allocator) {
TensorShape shape =
get_tensor_shape(src_accessor.shape, src_accessor.data_type);
GenericTensorAccessorW copied_tensor = allocator.allocate_tensor(shape);
transfer_memory(
copied_tensor, src_accessor.get<DT>(), src_accessor.device_type);
return copied_tensor;
}
};
GenericTensorAccessorW
copy_tensor_accessor_w(GenericTensorAccessorW const &src_accessor,
Allocator &allocator) {
return DataTypeDispatch1<CopyTensorAccessorW>{}(
src_accessor.data_type, src_accessor, std::ref(allocator));
}
lib/local-execution/src/local_task_argument_accessor.cc
line 31 at r6 (raw file):
tensor_backing.shape, tensor_backing.ptr, this->allocator.get_allocation_device_type()};
At this point probably better to just add a function in accessor.h
for converting from a GenericTensorAccessorW
to a GenericTensorAccessorR
Code quote:
GenericTensorAccessorR readonly_tensor_backing = {
tensor_backing.data_type,
tensor_backing.shape,
tensor_backing.ptr,
this->allocator.get_allocation_device_type()};
lib/local-execution/src/local_task_argument_accessor.cc
line 31 at r6 (raw file):
tensor_backing.shape, tensor_backing.ptr, this->allocator.get_allocation_device_type()};
Why not just use the device type from tensor_backing
?
lib/local-execution/src/local_task_argument_accessor.cc
line 49 at r6 (raw file):
for (GenericTensorAccessorW const &tensor_backing : variadic_tensor_backing) { readonly_variadic_tensor_backing.push_back(
At this point probably better to just add a function in accessor.h
for converting from a GenericTensorAccessorW
to a GenericTensorAccessorR
lib/local-execution/src/local_task_argument_accessor.cc
line 53 at r6 (raw file):
tensor_backing.shape, tensor_backing.ptr, this->allocator.get_allocation_device_type()});
Just use the device type from tensor_backing
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 51 of 85 files reviewed, 33 unresolved discussions (waiting on @lockshaw and @reyna-abhyankar)
lib/kernels/CMakeLists.txt
line 33 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why?
I'm currently using the DeviceType enum from pcg/device_type.dtg.h
lib/kernels/include/kernels/accessor.h
line 45 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Any reason for this over the simpler
at(std::vector<int> const &)
?
Mainly such that you can just use .at
lib/kernels/include/kernels/reverse_kernels_cpu.h
line 5 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Prefer full include paths
Done.
lib/kernels/src/accessor.cc
line 47 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Prefer
at
for bounds checking (ArrayShape
may have bounds-checking for both, but other containers don't so it's a good habit)
Done.
lib/kernels/src/allocation.cc
line 22 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Prefer explicit
this->
Done.
lib/kernels/src/cpu/replicate_kernels.cc
line 7 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Move over to use the new
GenericTensorAccessor
access-by-multidimensional-index support?
Done.
lib/kernels/src/cpu/reverse_kernels.cc
line 13 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Can't we just get this from the shape information contained in
input
?
Done.
lib/kernels/src/cpu/reverse_kernels.cc
line 49 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why is
std::ref
necessary?
I run into lvalue, rvalue issues with output_accessor without utilizing std::ref. For some reason DataTypeDispatch1 I think is forwarding output accessor as an rvalue which is causing an error? I think it's related to output_accessor being non const, which it has to be in order for the .at method to return a non const pointer which we can assign values to. std::ref seems to fix this issue though.
If possible, I think we can also just modify how forwarding works in DataTypeDispatch1 to fix this issue as well though?
lib/kernels/test/CMakeLists.txt
line 17 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why?
I'm currently using the DeviceType enum from pcg/device_type.dtg.h
lib/kernels/test/src/test_batch_norm_kernel.cc
line 36 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Sure, but the DataType is already present in the
input_shape
, so probably better to just use that than force the user to pass theDataType
twice
Done.
lib/kernels/test/src/test_cast_kernel.cc
line 87 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why not just get back a
GenericTensorAccessorR
with its data on CPU? Throwing away all the information by dropping down to astd::vector
feels unnecessary
Done.
lib/kernels/test/src/test_cast_kernel.cc
line 91 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
If the data is random, why would I expect
input_accessor_cpu
to have the same data asinput_accessor_gpu
? Would it be cleaner to generate aGenericTensorAccessor
with random data on CPU, then create a copy of that on GPU, and then run the functions?
Didn't catch this as test was going from float between 0 and 1 to an int lol, so it was basically just comparing if a tensor of all zeros was equal to another tensor of all zeros
lib/kernels/test/src/test_concat_kernel.cc
line 9 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why these changes?
Was printing outputs and was easier to visually compare
lib/kernels/test/src/test_replicate_kernel.cc
line 78 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Wouldn't filling with random data here give you more assurance the function is correct?
The problem I was running into is that the replicate_kernels.cu just directly adds to the value inside of the output accessor, leading to some cases when I was running the test where one of the tensors were allocated with randomized values inside of it, resulting in the test failing as the output tensors were not both initialized with the same initial state
lib/kernels/test/src/test_reverse_kernels.cc
line 150 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
I'd recommend creating a helper function for creating zero-filled tensors in
test_utils
to avoid any "use before initialized" issues
Done.
lib/kernels/test/src/test_utils.h
line 33 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Then why is the template parameter still here?
Done.
lib/kernels/test/src/test_utils.h
line 34 at r6 (raw file):
template <typename DT> void transfer_memory(GenericTensorAccessorW dst_accessor,
Done.
lib/kernels/test/src/test_utils.h
line 35 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why is one parameter a
TensorAccessor
and the other is a raw array? Feels inconsistent
It's mainly as sometimes I'm wanting to transfer memory from a vector to an accessor => my current logic for random filled accessors is dependent on this...
I think it becomes potentially more messy creating things like these random filled accessors without this construct as then we'd increment based on ptr
lib/kernels/test/src/test_utils.h
line 37 at r6 (raw file):
const DT *src, DeviceType src_device_type) { size_t bytes = dst_accessor.shape.get_volume() * sizeof(DT);
Done.
lib/kernels/test/src/test_utils.h
line 78 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Get the
DataType
from theTensorShape
Done.
lib/kernels/test/src/test_utils.h
line 88 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Get the
DataType
from theTensorShape
I think we need to use a template here, as we don't know we have to get the value that we're going to fill the accessor with too and we don't know the type of that to take as input?
lib/kernels/test/src/test_utils.h
line 124 at r6 (raw file):
template <DataType DT> std::vector<real_type_t<DT>> load_accessor_data(GenericTensorAccessorW accessor) {
Done.
lib/kernels/test/src/test_utils.h
line 149 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Now that we have support for
GenericTensorAccessor
s on CPU, using vectors no longer seems necessary I think
Done.
lib/kernels/test/src/test_utils.cc
line 5 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Is this function really necessary? Doesn't seem like you're saving much typing or much complexity
Done.
lib/kernels/test/src/test_utils.cc
line 44 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
The ability to copy tensors feels like a useful primitive by itself, might be better moved into
kernels
itself?
Done.
lib/local-execution/src/local_task_argument_accessor.cc
line 31 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
At this point probably better to just add a function in
accessor.h
for converting from aGenericTensorAccessorW
to aGenericTensorAccessorR
Done.
lib/local-execution/src/local_task_argument_accessor.cc
line 31 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why not just use the device type from
tensor_backing
?
Done.
lib/local-execution/src/local_task_argument_accessor.cc
line 49 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
At this point probably better to just add a function in
accessor.h
for converting from aGenericTensorAccessorW
to aGenericTensorAccessorR
Done.
lib/local-execution/src/local_task_argument_accessor.cc
line 53 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Just use the device type from
tensor_backing
?
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 33 of 34 files at r7, 1 of 1 files at r8, 7 of 7 files at r9, all commit messages.
Dismissed @lockshaw from a discussion.
Reviewable status: all files reviewed, 27 unresolved discussions (waiting on @oOTigger and @reyna-abhyankar)
lib/kernels/include/kernels/accessor.h
line 15 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
@reyna-abhyankar Is there a reason we have both
GenericTensorAccessorW
andGenericTensorAccessorR
over just having oneGenericTensorAccessor
and passing it as eitherconst &
or mutable&
?
At least in the meantime it seems like it would make sense to have GenericTensorAccessorW
non-explicitly coerce to GenericTensorAccessorR
I think to avoid all of the overloading?
lib/kernels/include/kernels/accessor.h
line 45 at r6 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
Mainly such that you can just use .at
(1,2,3) instead of .at ({1,2,3})... no need for unneeded initializer list
Let's do the std::vector
version unless there's some additional advantage you haven't mentioned yet. Yes the template syntax is cute and all, but it's substantially more complicated feature-wise and all it seems to do is remove the need to type two characters (also it doesn't allow for passing in coordinates of runtime-determined dimension)
lib/kernels/include/kernels/accessor.h
line 14 at r9 (raw file):
namespace FlexFlow { struct Allocator;
Better to #include
in the header that contains Allocator
lib/kernels/include/kernels/accessor.h
line 264 at r9 (raw file):
GenericTensorAccessorR const &src_accessor); void transfer_data_between_accessors(
Makes parameter order clear, and makes the operation a bit clearer ("transfer" sounds more like a "move" operation)
Suggestion:
void copy_accessor_data_to_l_from_r(
lib/kernels/include/kernels/replicate_kernels.h
line 15 at r9 (raw file):
void backward_kernel(ffStream_t stream, GenericTensorAccessorR const &output, GenericTensorAccessorW const &input,
Why reorder input
and output
?
lib/kernels/include/kernels/replicate_kernels_cpu.h
line 13 at r9 (raw file):
void cpu_backward_kernel(GenericTensorAccessorR const &output, GenericTensorAccessorW &input,
Why reorder input
and output
?
lib/kernels/src/accessor.cc
line 30 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
@Marsella8 Do you think we could reasonably add some generic
coord -> index
& index apsce concept inutils
and then move this code and part of the 1458 over to use it? Feels like this logic is getting unnecessarily reimplemented
Tracked in #1528
lib/kernels/src/accessor.cc
line 32 at r9 (raw file):
num_bytes, cudaMemcpyDeviceToDevice)); }
Suggestion:
} else {
assert (src_device_type == DeviceType::GPU);
assert (dst_device_type == DeviceType::GPU);
checkCUDA(cudaMemcpy(dst_accessor.ptr,
src_accessor.ptr,
num_bytes,
cudaMemcpyDeviceToDevice));
}
lib/kernels/src/managed_ff_stream.cc
line 14 at r9 (raw file):
ManagedFFStream &ManagedFFStream::operator=(ManagedFFStream &&other) noexcept { if (this != &other) {
Why this change?
lib/kernels/src/managed_per_device_ff_handle.cc
line 22 at r9 (raw file):
ManagedPerDeviceFFHandle &ManagedPerDeviceFFHandle::operator=( ManagedPerDeviceFFHandle &&other) noexcept { if (this != &other) {
Why change this from swap
?
lib/kernels/src/managed_per_device_ff_handle.cc
line 40 at r9 (raw file):
checkCUDA(cudaFree(this->handle->workSpace)); delete this->handle; this->handle = nullptr;
Why explicitly assign the handle
to nullptr
?
lib/kernels/src/cpu/reverse_kernels.cc
line 49 at r6 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
I run into lvalue, rvalue issues with output_accessor without utilizing std::ref. For some reason DataTypeDispatch1 I think is forwarding output accessor as an rvalue which is causing an error? I think it's related to output_accessor being non const, which it has to be in order for the .at method to return a non const pointer which we can assign values to. std::ref seems to fix this issue though.
If possible, I think we can also just modify how forwarding works in DataTypeDispatch1 to fix this issue as well though?
I'm tentatively guessing that might be due to DataTypeDispatch1
and DataTypeDispatch2
taking their arguments as Args ...
rather than as Args &&...
--can you try making that change and seeing if it fixes the problem?
lib/kernels/src/cuda/ops/linear_kernels.cu
line 138 at r9 (raw file):
in_dim, &alpha, (void *)weight_ptr,
Avoid c-style casts
Suggestion:
static_cast<void *>(weight_ptr),
lib/kernels/test/src/test_cast_kernel.cc
line 83 at r9 (raw file):
output_accessor_cpu, DataType::FLOAT, DataType::DOUBLE);
Any reason not to just get these DataType
s from the accessors' shape fields?
Code quote:
DataType::FLOAT,
DataType::DOUBLE);
lib/kernels/test/src/test_cast_kernel.cc
line 85 at r9 (raw file):
DataType::DOUBLE); CHECK(w_accessors_are_equal<DataType::DOUBLE>(output_accessor_gpu,
Why is a template parameter needed here? Isn't the type information already present in the shape field of the accessors?
Code quote:
<DataType::DOUBLE>(o
lib/kernels/test/src/test_managed_ff_stream.cc
line 10 at r9 (raw file):
ManagedFFStream base_stream{}; SUBCASE("Test ManagedFFStream Move Constructor") {
Add checks for if it's the same object on both sides of the assignment.
lib/kernels/test/src/test_managed_per_device_ff_handle.cc
line 8 at r9 (raw file):
TEST_SUITE(FF_TEST_SUITE) { TEST_CASE("Test Managed Per Device FF Handle") { ManagedPerDeviceFFHandle base_handle{};
Can we force the user to pass arguments here instead? I'm not a huge fan of default constructors where there's no obvious single correct value that should be set. If you want to add a function that creates one with some default arguments and user that that's fine, but I'd rather that be an explicit function call to make it clear that that's kinda an arbitrary choice supplied somewhere rather than intrinsic to the object semantics
Code quote:
ManagedPerDeviceFFHandle base_handle{};
lib/kernels/test/src/test_managed_per_device_ff_handle.cc
line 16 at r9 (raw file):
SUBCASE("Test ManagedPerDeviceFFHandle Move Constructor") { PerDeviceFFHandle const *base_handle_ptr = &base_handle.raw_handle();
Add checks for if it's the same object on both sides of the assignment.
lib/kernels/test/src/test_reverse_kernels.cc
line 60 at r9 (raw file):
TEST_CASE("Check Reverse Forward and Backward Kernels against CPU Kernels") { std::size_t num_out_blks = 1;
Might be nice to have this be >1 as having it be 1 could be hiding bugs (as iteration order doesn't matter if your looop has bound 1)
Code quote:
std::size_t num_out_blks = 1;
lib/kernels/test/src/test_utils.h
line 35 at r6 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
It's mainly as sometimes I'm wanting to transfer memory from a vector to an accessor => my current logic for random filled accessors is dependent on this...
I think it becomes potentially more messy creating things like these random filled accessors without this construct as then we'd increment based on ptr
Seems like this has been removed since this comment, so I guess it's fine now?
lib/kernels/test/src/test_utils.h
line 88 at r6 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
I think we need to use a template here, as we don't know we have to get the value that we're going to fill the accessor with too and we don't know the type of that to take as input?
You could technically get around this by passing in a DataTypeValue
which is a variant of the real_type_t
s of all of the DataType
s, but I'm fine with the current implementation too. I'd say the big thing is that it should be checked against the DataType
in the shape
and an error raised if the the chosen T
and the TensorShape
's DataType
don't match
lib/kernels/test/src/test_utils.h
line 101 at r9 (raw file):
if (accessor_a.data_type != accessor_b.data_type) { return false; }
Suggestion:
if (accessor_a.shape != accessor_b.shape) {
throw mk_runtime_error(fmt::format("w_accessors_are_equal expected accessors to have the same shape, but received: {} != {}", accessor_a.shape, accessor_b.shape));
}
lib/kernels/test/src/test_utils.h
line 115 at r9 (raw file):
for (size_t i = 0; i < accessor_a.shape.num_elements(); i++) { if (a_data_ptr[i] != b_data_ptr[i]) { print_accessor(cpu_accessor_a);
print
ing by default seems like a bad idea, this should probably be handled by the assertion check, not the actual boolean condition itself.
lib/kernels/test/src/test_utils.cc
line 126 at r9 (raw file):
GenericTensorAccessorR create_cpu_compatible_accessor_r(GenericTensorAccessorR const &accessor,
Suggestion:
copy_to_accessor_to_cpu_if_necessary
lib/kernels/test/src/test_utils.cc
line 158 at r9 (raw file):
}; void print_accessor(GenericTensorAccessorR const &accessor) {
Suggestion:
void print_tensor_accessor_contents(
lib/kernels/test/src/test_utils.cc
line 158 at r9 (raw file):
}; void print_accessor(GenericTensorAccessorR const &accessor) {
Might be nice to have this return std::string
or take in a std::ostream &
so the user can choose where it goes rather than just forcing it to std::cout
Code quote:
void
lib/local-execution/src/ops/linear.cc
line 152 at r9 (raw file):
per_device_state, input.get_float_ptr(), (float *)input_grad.get_float_ptr(),
No cast should be necessary as you're already calling get_float_ptr
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 57 of 95 files reviewed, 27 unresolved discussions (waiting on @lockshaw and @reyna-abhyankar)
lib/kernels/include/kernels/accessor.h
line 15 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
At least in the meantime it seems like it would make sense to have
GenericTensorAccessorW
non-explicitly coerce toGenericTensorAccessorR
I think to avoid all of the overloading?
Done.
lib/kernels/include/kernels/accessor.h
line 45 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Let's do the
std::vector
version unless there's some additional advantage you haven't mentioned yet. Yes the template syntax is cute and all, but it's substantially more complicated feature-wise and all it seems to do is remove the need to type two characters (also it doesn't allow for passing in coordinates of runtime-determined dimension)
Done.
lib/kernels/include/kernels/accessor.h
line 14 at r9 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Better to
#include
in the header that containsAllocator
I can't "#include/kernels/allocation.h" as circular dependency between allocator and accessor
lib/kernels/include/kernels/accessor.h
line 264 at r9 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Makes parameter order clear, and makes the operation a bit clearer ("transfer" sounds more like a "move" operation)
Done.
lib/kernels/include/kernels/replicate_kernels.h
line 15 at r9 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why reorder
input
andoutput
?
Slightly unneeded, but I prefer the format of output before input as it matches the other cpu_kernels I have written so far.
As it relates to all other kernels though, a lot of the kernels vary on whether they have input or output first... should probably standardize this?
lib/kernels/include/kernels/replicate_kernels_cpu.h
line 13 at r9 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why reorder
input
andoutput
?
same as prev
lib/kernels/src/accessor.cc
line 32 at r9 (raw file):
num_bytes, cudaMemcpyDeviceToDevice)); }
Done.
lib/kernels/src/managed_ff_stream.cc
line 14 at r9 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why this change?
Similar to per_device_ff_handle but I also thought it'd be safer to fully transfer ownership rather than potentially leaving dangling resources
lib/kernels/src/managed_per_device_ff_handle.cc
line 22 at r9 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why change this from
swap
?
I thought it's good practice to invalidate the other ManagedPerDeviceFFHandle in a move constructor, so changed to exchange to set other handle's ptr now as nullptr
lib/kernels/src/managed_per_device_ff_handle.cc
line 40 at r9 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why explicitly assign the
handle
tonullptr
?
Done.
lib/kernels/src/cpu/reverse_kernels.cc
line 49 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
I'm tentatively guessing that might be due to
DataTypeDispatch1
andDataTypeDispatch2
taking their arguments asArgs ...
rather than asArgs &&...
--can you try making that change and seeing if it fixes the problem?
Fixed the issue
lib/kernels/src/cuda/ops/linear_kernels.cu
line 138 at r9 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Avoid c-style casts
reinterpret cast okay here?
lib/kernels/test/src/test_cast_kernel.cc
line 83 at r9 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Any reason not to just get these
DataType
s from the accessors' shape fields?
Done.
lib/kernels/test/src/test_cast_kernel.cc
line 85 at r9 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why is a template parameter needed here? Isn't the type information already present in the shape field of the accessors?
Done.
lib/kernels/test/src/test_managed_ff_stream.cc
line 10 at r9 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Add checks for if it's the same object on both sides of the assignment.
Done.
lib/kernels/test/src/test_managed_per_device_ff_handle.cc
line 8 at r9 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Can we force the user to pass arguments here instead? I'm not a huge fan of default constructors where there's no obvious single correct value that should be set. If you want to add a function that creates one with some default arguments and user that that's fine, but I'd rather that be an explicit function call to make it clear that that's kinda an arbitrary choice supplied somewhere rather than intrinsic to the object semantics
Done.
lib/kernels/test/src/test_managed_per_device_ff_handle.cc
line 16 at r9 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Add checks for if it's the same object on both sides of the assignment.
Done.
lib/kernels/test/src/test_reverse_kernels.cc
line 60 at r9 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Might be nice to have this be >1 as having it be 1 could be hiding bugs (as iteration order doesn't matter if your looop has bound 1)
Done.
lib/kernels/test/src/test_utils.h
line 35 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Seems like this has been removed since this comment, so I guess it's fine now?
Yeah, this was fixed, I just moved the function into accessor.h => this is now the copy_accessor_data_to_l_from_r function
lib/kernels/test/src/test_utils.h
line 88 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
You could technically get around this by passing in a
DataTypeValue
which is a variant of thereal_type_t
s of all of theDataType
s, but I'm fine with the current implementation too. I'd say the big thing is that it should be checked against theDataType
in theshape
and an error raised if the the chosenT
and theTensorShape
'sDataType
don't match
Done.
lib/kernels/test/src/test_utils.h
line 101 at r9 (raw file):
if (accessor_a.data_type != accessor_b.data_type) { return false; }
Done.
lib/kernels/test/src/test_utils.h
line 115 at r9 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Done.
lib/kernels/test/src/test_utils.cc
line 126 at r9 (raw file):
GenericTensorAccessorR create_cpu_compatible_accessor_r(GenericTensorAccessorR const &accessor,
Done.
lib/kernels/test/src/test_utils.cc
line 158 at r9 (raw file):
}; void print_accessor(GenericTensorAccessorR const &accessor) {
Done.
lib/kernels/test/src/test_utils.cc
line 158 at r9 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Might be nice to have this return
std::string
or take in astd::ostream &
so the user can choose where it goes rather than just forcing it tostd::cout
Done.
lib/local-execution/src/ops/linear.cc
line 152 at r9 (raw file):
Previously, lockshaw (Colin Unger) wrote…
No cast should be necessary as you're already calling
get_float_ptr
?
Needed to make some changes to permissions of accessors to enable this, due to const types with bias_ptr and output_grad ptr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 38 of 38 files at r10, all commit messages.
Reviewable status: all files reviewed, 72 unresolved discussions (waiting on @oOTigger and @reyna-abhyankar)
lib/kernels/include/kernels/accessor.h
line 14 at r9 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
I can't "#include/kernels/allocation.h" as circular dependency between allocator and accessor
Fair enough. In that case I'd recommend pulling copy_tensor_accessor_r
and copy_tensor_accessor_w
(I believe these are the only two usages of Allocator
in this file) into a separate header file--this header file is getting rather large already anyway
lib/kernels/include/kernels/replicate_kernels.h
line 15 at r9 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
Slightly unneeded, but I prefer the format of output before input as it matches the other cpu_kernels I have written so far.
As it relates to all other kernels though, a lot of the kernels vary on whether they have input or output first... should probably standardize this?
Sure, feel free to standardize the kernel functions in a future PR (that would be great actually), though we should make sure we actually have tests up and running for all of the kernels to make sure we don't accidentally break anything. I've created an issue to track this and assigned it to you: #1540
lib/kernels/src/managed_ff_stream.cc
line 14 at r9 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
Similar to per_device_ff_handle but I also thought it'd be safer to fully transfer ownership rather than potentially leaving dangling resources
Makes sense, my only concern is the degree of code duplication. Is there anything reasonable we can do to reduce it to minimize bugs when this file is edited?
lib/kernels/src/managed_per_device_ff_handle.cc
line 22 at r9 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
I thought it's good practice to invalidate the other ManagedPerDeviceFFHandle in a move constructor, so changed to exchange to set other handle's ptr now as nullptr
That's fine (and not a bad practice), it's just currently leading to a lot of duplicated code between this function and the destructor. With the previous solution you implicitly get the destructor call when the moved-from value is deallocated. Is there a way the duplicated code in this solution could be reduced? The current structure makes editing this file a bit bugprone
lib/kernels/src/cuda/ops/linear_kernels.cu
line 138 at r9 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
reinterpret cast okay here?
Generally you want to use the weakest/most-restrictive cast construction that you can, and static_cast<void const *>
works fine here so you should use that throughout this file.
lib/local-execution/src/ops/linear.cc
line 152 at r9 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
Needed to make some changes to permissions of accessors to enable this, due to const types with bias_ptr and output_grad ptr
What do you mean--I'm not seeing any changes to this in accessors.h
? Am I missing something?
lib/kernels/include/kernels/accessor.h
line 166 at r7 (raw file):
std::ostream &operator<<(std::ostream &, GenericTensorAccessorR const &); int32_t *get_int32_ptr(GenericTensorAccessorW const &);
Why were these removed? Not necessarily objecting, just curious
lib/kernels/src/accessor.cc
line 29 at r10 (raw file):
} else { assert(src_device_type == DeviceType::GPU); assert(src_device_type == DeviceType::CPU);
I don't think this is possible as src_device_type
can't have two values at once
Code quote:
assert(src_device_type == DeviceType::GPU);
assert(src_device_type == DeviceType::CPU);
lib/kernels/src/accessor.cc
line 70 at r10 (raw file):
for (size_t i = 0; i < this->shape.num_dims(); i++) { if (indices[i] >= this->shape.at(legion_dim_t(i))) {
Suggestion:
indices.at(i)
lib/kernels/src/accessor.cc
line 75 at r10 (raw file):
"when only {} indexes exist", i, indices[i],
Suggestion:
indices.at(i),
lib/kernels/src/accessor.cc
line 79 at r10 (raw file):
} offset += indices[i] * multiplier;
Suggestion:
indices.at(i)
lib/kernels/src/accessor.cc
line 145 at r10 (raw file):
std::vector<size_t> const &indices) const { if (indices.size() != this->shape.num_dims()) {
Pull out into a separate helper function to avoid the duplication between GenericTensorAccessorR::calculate_index_offset
and GenericTensorAccessorW::calculuate_index_offset
(and then honestly it's probably a good idea to just remove those methods as I'm not sure they're doing much then other than just forwarding arguments to the helper function.
lib/kernels/src/accessor.cc
line 153 at r10 (raw file):
ssize_t offset = 0; size_t multiplier = 1;
Prefer int
Code quote:
ssize_t offset = 0;
size_t multiplier = 1;
lib/kernels/src/accessor.cc
line 156 at r10 (raw file):
for (size_t i = 0; i < this->shape.num_dims(); i++) { if (indices[i] >= this->shape.at(legion_dim_t(i))) {
Suggestion:
indices.at(i)
lib/kernels/src/accessor.cc
line 161 at r10 (raw file):
"when only {} indexes exist", i, indices[i],
Suggestion:
indices.at(i),
lib/kernels/src/accessor.cc
line 165 at r10 (raw file):
} offset += indices[i] * multiplier;
Suggestion:
offset += indices.at(i) * multiplier;
lib/kernels/src/accessor.cc
line 166 at r10 (raw file):
offset += indices[i] * multiplier; multiplier *= this->shape.at(legion_dim_t(i));
Prefer {}
initialization
Suggestion:
multiplier *= this->shape.at(legion_dim_t{i});
lib/kernels/include/kernels/accessor.h
line 45 at r10 (raw file):
template <DataType DT> real_type_t<DT> const &at(std::vector<size_t> const &indices) const {
Prefer int
over size_t
to make issues from using negative values at least somewhat detectable
Code quote:
std::vector<size_t> const &indices
lib/kernels/include/kernels/accessor.h
line 45 at r10 (raw file):
template <DataType DT> real_type_t<DT> const &at(std::vector<size_t> const &indices) const {
This function should probably be bound-checked.
lib/kernels/include/kernels/accessor.h
line 112 at r10 (raw file):
template <DataType DT> real_type_t<DT> &at(std::vector<size_t> const &indices) {
This function should probably be bounds-checked
lib/kernels/test/src/test_gather_kernels.cc
line 8 at r10 (raw file):
TEST_SUITE(FF_TEST_SUITE) { TEST_CASE("Test Gather Forward and Backward Kernel") { ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);
Bracket initialization and argument name comments (see test_concat_kernel.cu)
lib/kernels/test/src/test_flat_kernel.cc
line 10 at r10 (raw file):
Allocator allocator = create_local_cuda_memory_allocator(); ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);
Bracket initialization and argument name comments (see test_concat_kernel.cu)
lib/kernels/test/src/test_flat_kernel.cc
line 19 at r10 (raw file):
GenericTensorAccessorR input_accessor = read_only_accessor_from_write_accessor(create_filled_accessor_w( input_shape, allocator, DataTypeValue(2.0f)));
Use explicit construction functions for DataTypeValue
(see test_batch_norm_kernels.cu
)
lib/kernels/test/src/test_flat_kernel.cc
line 34 at r10 (raw file):
SUBCASE("backward_kernel") { GenericTensorAccessorW output_grad_accessor = create_filled_accessor_w( output_shape, allocator, DataTypeValue(0.0f));
Use explicit construction functions for DataTypeValue
(see test_batch_norm_kernels.cu
)
lib/kernels/test/src/test_flat_kernel.cc
line 36 at r10 (raw file):
output_shape, allocator, DataTypeValue(0.0f)); GenericTensorAccessorW input_grad_accessor = create_filled_accessor_w(input_shape, allocator, DataTypeValue(1.0f));
Use explicit construction functions for DataTypeValue
(see test_batch_norm_kernels.cu
)
lib/kernels/test/src/test_concat_kernel.cc
line 13 at r10 (raw file):
ff_dim_t concat_axis = ff_dim_t(1); ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);
Prefer {}
initialization (https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#es23-prefer-the--initializer-syntax)
Suggestion:
ManagedPerDeviceFFHandle managed_handle{1024 * 1024, true};
lib/kernels/test/src/test_concat_kernel.cc
line 13 at r10 (raw file):
ff_dim_t concat_axis = ff_dim_t(1); ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);
Add argument comments where argument meanings are not clear. Normally I'd refrain from comments, but these can actually be automatically checked by https://clang.llvm.org/extra/clang-tidy/checks/bugprone/argument-comment.html
Suggestion:
ManagedPerDeviceFFHandle managed_handle(/*workSpaceSize=*/1024 * 1024,
/*allowTensorOpMathConversion=*/true)
lib/kernels/test/src/test_layer_norm_kernels.cc
line 20 at r10 (raw file):
make_tensor_shape_from_legion_dims({feature_size}, DataType::FLOAT); ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);
Bracket initialization and argument name comments (see test_concat_kernel.cu)
lib/kernels/test/src/test_layer_norm_kernels.cc
line 36 at r10 (raw file):
create_random_filled_accessor_r(input_shape, allocator); GenericTensorAccessorW gamma_accessor = create_filled_accessor_w(feature_shape, allocator, DataTypeValue(1.0f));
Use explicit construction functions for DataTypeValue
(see test_batch_norm_kernels.cu
)
lib/kernels/test/src/test_layer_norm_kernels.cc
line 42 at r10 (raw file):
allocator.allocate_tensor(output_shape); GenericTensorAccessorW beta_accessor = create_filled_accessor_w( feature_shape, allocator, DataTypeValue(0.0f));
Use explicit construction functions for DataTypeValue
(see test_batch_norm_kernels.cu
)
lib/kernels/src/managed_per_device_ff_handle.cc
line 7 at r9 (raw file):
ManagedPerDeviceFFHandle::ManagedPerDeviceFFHandle() { this->handle = new PerDeviceFFHandle;
Suggestion:
this->handle = new PerDeviceFFHandle{};
lib/kernels/test/src/test_managed_per_device_ff_handle.cc
line 7 at r10 (raw file):
TEST_SUITE(FF_TEST_SUITE) { TEST_CASE("Test Managed Per Device FF Handle") {
Suggestion:
TEST_CASE("ManagedPerDeviceFFHandle") {
lib/kernels/test/src/test_managed_per_device_ff_handle.cc
line 8 at r10 (raw file):
TEST_SUITE(FF_TEST_SUITE) { TEST_CASE("Test Managed Per Device FF Handle") { ManagedPerDeviceFFHandle base_handle{1024 * 1024, true};
Argument name comments (see test_concat_kernel.cu)
lib/kernels/test/src/test_managed_per_device_ff_handle.cc
line 11 at r10 (raw file):
PerDeviceFFHandle const *base_handle_ptr = &base_handle.raw_handle(); SUBCASE("Test ManagedPerDeviceFFHandle Constructor") {
Suggestion:
SUBCASE("constructor") {
lib/kernels/test/src/test_managed_per_device_ff_handle.cc
line 16 at r10 (raw file):
} SUBCASE("Test ManagedPerDeviceFFHandle Move Constructor") {
Suggestion:
SUBCASE("move constructor") {
lib/kernels/test/src/test_managed_per_device_ff_handle.cc
line 24 at r10 (raw file):
SUBCASE("Test ManagedPerDeviceFFHandle Assignment Operator") { ManagedPerDeviceFFHandle new_handle{1024 * 1024, true};
Argument name comments (see test_concat_kernel.cu)
lib/kernels/test/src/test_managed_per_device_ff_handle.cc
line 34 at r10 (raw file):
base_handle = std::move(base_handle); CHECK(&base_handle.raw_handle() == base_handle_ptr); }
Suggestion:
SUBCASE("move assignment operator") {
SUBCASE("move assign to other") {
ManagedPerDeviceFFHandle new_handle{1024 * 1024, true};
new_handle = std::move(base_handle);
CHECK(&base_handle.raw_handle() == nullptr);
CHECK(&new_handle.raw_handle() == base_handle_ptr);
}
SUBCASE("move assign to self") {
base_handle = std::move(base_handle);
CHECK(&base_handle.raw_handle() == base_handle_ptr);
}
lib/kernels/src/cpu/reverse_kernels.cc
line 14 at r10 (raw file):
assert(input.data_type == DT && output.data_type == DT); size_t num_out_blocks = input.shape.at(legion_dim_t(0));
Prefer int
Code quote:
size_t num_out_blocks = input.shape.at(legion_dim_t(0));
lib/kernels/test/src/test_managed_ff_stream.cc
line 7 at r10 (raw file):
TEST_SUITE(FF_TEST_SUITE) { TEST_CASE("Test Managed FF Stream") {
Suggestion:
TEST_CASE("ManagedFFStream") {
lib/kernels/test/src/test_managed_ff_stream.cc
line 11 at r10 (raw file):
ffStream_t const *base_stream_ptr = &base_stream.raw_stream(); SUBCASE("Test ManagedFFStream Move Constructor") {
Suggestion:
SUBCASE("move constructor") {
lib/kernels/test/src/test_managed_ff_stream.cc
line 21 at r10 (raw file):
new_stream = std::move(base_stream); CHECK(&base_stream.raw_stream() == nullptr); CHECK(&new_stream.raw_stream() == base_stream_ptr);
Suggestion:
SUBCASE("move assignment") {
SUBCASE("move assign to other") {
ManagedFFStream new_stream{};
new_stream = std::move(base_stream);
CHECK(&base_stream.raw_stream() == nullptr);
CHECK(&new_stream.raw_stream() == base_stream_ptr);
}
SUBCASE("move assign to self") {
...
}
lib/kernels/test/src/test_batch_matmul_kernel.cc
line 18 at r10 (raw file):
ManagedFFStream managed_stream{}; ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);
Bracket initialization and argument name comments (see test_concat_kernel.cu)
lib/kernels/test/src/test_reverse_kernels.cc
line 17 at r10 (raw file):
TensorShape output_shape = input_shape; ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);
Bracket initialization and argument name comments (see test_concat_kernel.cu)
lib/kernels/test/src/test_reverse_kernels.cc
line 25 at r10 (raw file):
GenericTensorAccessorR input_accessor = read_only_accessor_from_write_accessor(create_filled_accessor_w( input_shape, allocator, DataTypeValue(1.0f)));
Use explicit construction functions for DataTypeValue
(see test_batch_norm_kernels.cu
)
lib/kernels/test/src/test_reverse_kernels.cc
line 68 at r10 (raw file):
TensorShape output_shape = input_shape; ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);
Bracket initialization and argument name comments (see test_concat_kernel.cu)
lib/kernels/test/src/test_partition_kernel.cc
line 9 at r10 (raw file):
TEST_SUITE(FF_TEST_SUITE) { TEST_CASE("Test Partition Forward and Backward") { ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);
Bracket initialization and argument name comments (see test_concat_kernel.cu)
lib/kernels/test/src/test_partition_kernel.cc
line 23 at r10 (raw file):
SUBCASE("forward_kernel") { GenericTensorAccessorR input_accessor = create_filled_accessor_r(input_shape, allocator, DataTypeValue(1.0f));
Use explicit construction functions for DataTypeValue
(see test_batch_norm_kernels.cu
)
lib/kernels/test/src/test_partition_kernel.cc
line 35 at r10 (raw file):
SUBCASE("backward_kernel") { GenericTensorAccessorR output_grad_accessor = create_filled_accessor_r( output_shape, allocator, DataTypeValue(1.0f));
Use explicit construction functions for DataTypeValue
(see test_batch_norm_kernels.cu
)
lib/kernels/test/src/test_partition_kernel.cc
line 37 at r10 (raw file):
output_shape, allocator, DataTypeValue(1.0f)); GenericTensorAccessorW input_grad_accessor = create_filled_accessor_w(input_shape, allocator, DataTypeValue(2.0f));
Use explicit construction functions for DataTypeValue
(see test_batch_norm_kernels.cu
)
lib/kernels/test/src/test_batch_norm_kernel.cc
line 12 at r10 (raw file):
ManagedFFStream managed_stream{}; ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);
Bracket initialization and argument name comments (see test_concat_kernel.cu)
lib/kernels/test/src/test_batch_norm_kernel.cc
line 40 at r10 (raw file):
create_random_filled_accessor_w(output_shape, allocator); GenericTensorAccessorW scale_accessor = create_filled_accessor_w(scale_shape, allocator, DataTypeValue(1.0f));
Due to the implicit coercions between all of the numerical types, I recommend creating explicit functions for creating DataTypeValue
s of different types to avoid ambiguity, e.g., make_float_data_type_value
, make_int32_data_type_value
, etc. and use them in call of the places you're constructing DataTypeValue
s
lib/kernels/test/src/test_attention_kernel.cc
line 16 at r10 (raw file):
ManagedFFStream managed_stream{}; ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);
Bracket initialization and argument name comments (see test_concat_kernel.cu)
lib/kernels/test/src/test_transpose_kernel.cc
line 10 at r10 (raw file):
std::size_t num_dims = 2; std::vector<ff_dim_t> perm = {ff_dim_t(0), ff_dim_t(1)};
Prefer bracket initialization (see test_concat_kernel.cu)
lib/kernels/test/src/test_transpose_kernel.cc
line 12 at r10 (raw file):
std::vector<ff_dim_t> perm = {ff_dim_t(0), ff_dim_t(1)}; ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);
Bracket initialization and argument name comments (see test_concat_kernel.cu)
lib/kernels/test/src/test_split_kernel.cc
line 15 at r10 (raw file):
coord_t num_blks = 1; ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);
Bracket initialization and argument name comments (see test_concat_kernel.cu)
lib/kernels/test/src/test_split_kernel.cc
line 53 at r10 (raw file):
GenericTensorAccessorW input_grad_accessor = create_filled_accessor_w(input_shape, allocator, DataTypeValue(0.0f));
Use explicit construction functions for DataTypeValue
(see test_batch_norm_kernels.cu
)
lib/kernels/test/src/test_dropout.cc
line 21 at r10 (raw file):
ManagedFFStream managed_stream{}; ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);
Bracket initialization and argument name comments (see test_concat_kernel.cu)
lib/kernels/test/src/test_softmax_kernel.cc
line 11 at r10 (raw file):
int input_n = 1, input_c = 1, input_h = 1, input_w = 100, channels = 100; ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);
Bracket initialization and argument name comments (see test_concat_kernel.cu)
lib/kernels/test/src/test_replicate_kernel.cc
line 16 at r10 (raw file):
make_tensor_shape_from_legion_dims({100}, DataType::FLOAT); ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);
Bracket initialization and argument name comments (see test_concat_kernel.cu)
lib/kernels/test/src/test_replicate_kernel.cc
line 56 at r10 (raw file):
make_tensor_shape_from_legion_dims({5, num_replicas}, DataType::FLOAT); ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);
Bracket initialization and argument name comments (see test_concat_kernel.cu)
lib/local-execution/test/src/test_local_cost_estimator.cc
line 15 at r10 (raw file):
// TEST_CASE("Local Cost Estimator") { // // local backing initialization // ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);
Bracket initialization and argument name comments (see test_concat_kernel.cu)
lib/kernels/test/src/test_combine_kernel.cc
line 9 at r10 (raw file):
TEST_SUITE(FF_TEST_SUITE) { TEST_CASE("Call Combine Forward and Backward Kernels") { ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);
Bracket initialization and argument name comments (see test_concat_kernel.cu)
lib/kernels/test/src/test_reduction_kernel.cc
line 13 at r10 (raw file):
{10, 10, 10, 10, 10}, DataType::FLOAT); ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);
Bracket initialization and argument name comments (see test_concat_kernel.cu)
lib/kernels/test/src/test_reduction_kernel.cc
line 39 at r10 (raw file):
GenericTensorAccessorR output_grad_accessor = create_filled_accessor_r( output_shape, allocator, DataTypeValue(1.0f));
Use explicit construction functions for DataTypeValue
(see test_batch_norm_kernels.cu
)
lib/kernels/test/src/test_reshape_kernel.cc
line 8 at r10 (raw file):
TEST_SUITE(FF_TEST_SUITE) { TEST_CASE("Test Reshape Forward and Backward") { ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);
Bracket initialization and argument name comments (see test_concat_kernel.cu)
lib/kernels/test/src/test_utils.cc
line 147 at r10 (raw file):
T const *data_ptr = accessor.get<DT>(); for (size_t i = 0; i < accessor.shape.num_elements(); i++) { stream << data_ptr[i] << " ";
Ideally should use the shape information of accessor to print something more semantically meaningfull
lib/kernels/test/src/test_utils.cc
line 147 at r10 (raw file):
T const *data_ptr = accessor.get<DT>(); for (size_t i = 0; i < accessor.shape.num_elements(); i++) { stream << data_ptr[i] << " ";
Suggestion:
if (i != 0) {
stream << " ";
}
stream << data_ptr[i];
lib/kernels/test/src/test_utils.cc
line 149 at r10 (raw file):
stream << data_ptr[i] << " "; } stream << "\n";
Suggestion:
stream << std::endl;
lib/kernels/test/src/test_utils.cc
line 202 at r10 (raw file):
GenericTensorAccessorW operator()(TensorShape const &shape, Allocator &allocator, DataTypeValue val) {
Would probably be better to move DataTypeValue
over to a dtgen
.variant.toml
type--any reason it isn't already?
Code quote:
DataTypeValue
lib/kernels/test/src/test_pool_2d_kernels.cc
line 15 at r10 (raw file):
PoolOp pool_type = PoolOp::MAX; ManagedPerDeviceFFHandle managed_handle(1024 * 1024, true);
Bracket initialization and argument name comments (see test_concat_kernel.cu)
lib/kernels/test/src/test_pool_2d_kernels.cc
line 60 at r10 (raw file):
SUBCASE("backward_kernel") { GenericTensorAccessorW output_grad_accessor = create_filled_accessor_w( output_shape, allocator, DataTypeValue(1.0f));
Use explicit construction functions for DataTypeValue
(see test_batch_norm_kernels.cu
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 64 of 108 files reviewed, 72 unresolved discussions (waiting on @lockshaw and @reyna-abhyankar)
lib/kernels/include/kernels/accessor.h
line 166 at r7 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why were these removed? Not necessarily objecting, just curious
Got sloppy, my bad ~ for some reason didn't see one of them returned const and thought were same
lib/kernels/include/kernels/accessor.h
line 14 at r9 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Fair enough. In that case I'd recommend pulling
copy_tensor_accessor_r
andcopy_tensor_accessor_w
(I believe these are the only two usages ofAllocator
in this file) into a separate header file--this header file is getting rather large already anyway
Done.
lib/kernels/include/kernels/accessor.h
line 45 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Prefer
int
oversize_t
to make issues from using negative values at least somewhat detectable
Done.
lib/kernels/include/kernels/accessor.h
line 45 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
This function should probably be bound-checked.
Done.
lib/kernels/include/kernels/accessor.h
line 112 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
This function should probably be bounds-checked
Done.
lib/kernels/include/kernels/replicate_kernels.h
line 15 at r9 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Sure, feel free to standardize the kernel functions in a future PR (that would be great actually), though we should make sure we actually have tests up and running for all of the kernels to make sure we don't accidentally break anything. I've created an issue to track this and assigned it to you: #1540
Sounds good
lib/kernels/src/accessor.cc
line 29 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
I don't think this is possible as
src_device_type
can't have two values at once
Done.
lib/kernels/src/accessor.cc
line 145 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Pull out into a separate helper function to avoid the duplication between
GenericTensorAccessorR::calculate_index_offset
andGenericTensorAccessorW::calculuate_index_offset
(and then honestly it's probably a good idea to just remove those methods as I'm not sure they're doing much then other than just forwarding arguments to the helper function.
Done.
lib/kernels/src/accessor.cc
line 153 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Prefer int
Done.
lib/kernels/src/accessor.cc
line 166 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Prefer
{}
initialization
Done.
lib/kernels/src/managed_ff_stream.cc
line 14 at r9 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Makes sense, my only concern is the degree of code duplication. Is there anything reasonable we can do to reduce it to minimize bugs when this file is edited?
Done.
lib/kernels/src/managed_per_device_ff_handle.cc
line 22 at r9 (raw file):
Previously, lockshaw (Colin Unger) wrote…
That's fine (and not a bad practice), it's just currently leading to a lot of duplicated code between this function and the destructor. With the previous solution you implicitly get the destructor call when the moved-from value is deallocated. Is there a way the duplicated code in this solution could be reduced? The current structure makes editing this file a bit bugprone
Done.
lib/kernels/src/cpu/reverse_kernels.cc
line 14 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Prefer int
Done.
lib/kernels/src/cuda/ops/linear_kernels.cu
line 138 at r9 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Generally you want to use the weakest/most-restrictive cast construction that you can, and
static_cast<void const *>
works fine here so you should use that throughout this file.
Done.
lib/kernels/test/src/test_attention_kernel.cc
line 16 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Bracket initialization and argument name comments (see test_concat_kernel.cu)
Done.
lib/kernels/test/src/test_batch_matmul_kernel.cc
line 18 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Bracket initialization and argument name comments (see test_concat_kernel.cu)
Done.
lib/kernels/test/src/test_batch_norm_kernel.cc
line 12 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Bracket initialization and argument name comments (see test_concat_kernel.cu)
Done.
lib/kernels/test/src/test_batch_norm_kernel.cc
line 40 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Due to the implicit coercions between all of the numerical types, I recommend creating explicit functions for creating
DataTypeValue
s of different types to avoid ambiguity, e.g.,make_float_data_type_value
,make_int32_data_type_value
, etc. and use them in call of the places you're constructingDataTypeValue
s
Done.
lib/kernels/test/src/test_combine_kernel.cc
line 9 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Bracket initialization and argument name comments (see test_concat_kernel.cu)
Done.
lib/kernels/test/src/test_concat_kernel.cc
line 13 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Prefer
{}
initialization (https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#es23-prefer-the--initializer-syntax)
Done.
lib/kernels/test/src/test_concat_kernel.cc
line 13 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Add argument comments where argument meanings are not clear. Normally I'd refrain from comments, but these can actually be automatically checked by https://clang.llvm.org/extra/clang-tidy/checks/bugprone/argument-comment.html
Done.
lib/kernels/test/src/test_dropout.cc
line 21 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Bracket initialization and argument name comments (see test_concat_kernel.cu)
Done.
lib/kernels/test/src/test_flat_kernel.cc
line 10 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Bracket initialization and argument name comments (see test_concat_kernel.cu)
Done.
lib/kernels/test/src/test_flat_kernel.cc
line 19 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Use explicit construction functions for
DataTypeValue
(seetest_batch_norm_kernels.cu
)
Done.
lib/kernels/test/src/test_flat_kernel.cc
line 34 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Use explicit construction functions for
DataTypeValue
(seetest_batch_norm_kernels.cu
)
Done.
lib/kernels/test/src/test_flat_kernel.cc
line 36 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Use explicit construction functions for
DataTypeValue
(seetest_batch_norm_kernels.cu
)
Done.
lib/kernels/test/src/test_gather_kernels.cc
line 8 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Bracket initialization and argument name comments (see test_concat_kernel.cu)
Done.
lib/kernels/test/src/test_layer_norm_kernels.cc
line 20 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Bracket initialization and argument name comments (see test_concat_kernel.cu)
Done.
lib/kernels/test/src/test_layer_norm_kernels.cc
line 36 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Use explicit construction functions for
DataTypeValue
(seetest_batch_norm_kernels.cu
)
Done.
lib/kernels/test/src/test_layer_norm_kernels.cc
line 42 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Use explicit construction functions for
DataTypeValue
(seetest_batch_norm_kernels.cu
)
Done.
lib/kernels/test/src/test_managed_per_device_ff_handle.cc
line 8 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Argument name comments (see test_concat_kernel.cu)
Done.
lib/kernels/test/src/test_managed_per_device_ff_handle.cc
line 24 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Argument name comments (see test_concat_kernel.cu)
Done.
lib/kernels/test/src/test_partition_kernel.cc
line 9 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Bracket initialization and argument name comments (see test_concat_kernel.cu)
Done.
lib/kernels/test/src/test_partition_kernel.cc
line 23 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Use explicit construction functions for
DataTypeValue
(seetest_batch_norm_kernels.cu
)
Done.
lib/kernels/test/src/test_partition_kernel.cc
line 35 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Use explicit construction functions for
DataTypeValue
(seetest_batch_norm_kernels.cu
)
Done.
lib/kernels/test/src/test_partition_kernel.cc
line 37 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Use explicit construction functions for
DataTypeValue
(seetest_batch_norm_kernels.cu
)
Done.
lib/kernels/test/src/test_pool_2d_kernels.cc
line 15 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Bracket initialization and argument name comments (see test_concat_kernel.cu)
Done.
lib/kernels/test/src/test_pool_2d_kernels.cc
line 60 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Use explicit construction functions for
DataTypeValue
(seetest_batch_norm_kernels.cu
)
Done.
lib/kernels/test/src/test_reduction_kernel.cc
line 13 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Bracket initialization and argument name comments (see test_concat_kernel.cu)
Done.
lib/kernels/test/src/test_reduction_kernel.cc
line 39 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Use explicit construction functions for
DataTypeValue
(seetest_batch_norm_kernels.cu
)
Done.
lib/kernels/test/src/test_replicate_kernel.cc
line 16 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Bracket initialization and argument name comments (see test_concat_kernel.cu)
Done.
lib/kernels/test/src/test_replicate_kernel.cc
line 56 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Bracket initialization and argument name comments (see test_concat_kernel.cu)
Done.
lib/kernels/test/src/test_reshape_kernel.cc
line 8 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Bracket initialization and argument name comments (see test_concat_kernel.cu)
Done.
lib/kernels/test/src/test_reverse_kernels.cc
line 17 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Bracket initialization and argument name comments (see test_concat_kernel.cu)
Done.
lib/kernels/test/src/test_reverse_kernels.cc
line 25 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Use explicit construction functions for
DataTypeValue
(seetest_batch_norm_kernels.cu
)
Done.
lib/kernels/test/src/test_reverse_kernels.cc
line 68 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Bracket initialization and argument name comments (see test_concat_kernel.cu)
Done.
lib/kernels/test/src/test_softmax_kernel.cc
line 11 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Bracket initialization and argument name comments (see test_concat_kernel.cu)
Done.
lib/kernels/test/src/test_split_kernel.cc
line 15 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Bracket initialization and argument name comments (see test_concat_kernel.cu)
Done.
lib/kernels/test/src/test_split_kernel.cc
line 53 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Use explicit construction functions for
DataTypeValue
(seetest_batch_norm_kernels.cu
)
Done.
lib/kernels/test/src/test_transpose_kernel.cc
line 10 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Prefer bracket initialization (see test_concat_kernel.cu)
Done.
lib/kernels/test/src/test_transpose_kernel.cc
line 12 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Bracket initialization and argument name comments (see test_concat_kernel.cu)
Done.
lib/kernels/test/src/test_utils.h
line 11 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Not sure what you mean by "but maybe our current behavior is fine with the initializer is fine anyways". Can you clarify?
This is no longer an issue. Initially brought up as a constructor defaulted to assigning a device type but this is no longer the case. Device type is currently determined based on whatever allocator is used to create tensor.
lib/kernels/test/src/test_utils.cc
line 202 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Would probably be better to move
DataTypeValue
over to adtgen
.variant.toml
type--any reason it isn't already?
Addressed in Slack
lib/local-execution/src/ops/linear.cc
line 152 at r9 (raw file):
Previously, lockshaw (Colin Unger) wrote…
What do you mean--I'm not seeing any changes to this in
accessors.h
? Am I missing something?
Oh, I was referring to making:
auto bias = acc.get_tensorPermissions::RO(BIAS);
auto output_grad = acc.get_tensor_gradPermissions::RO(OUTPUT);
to:
auto bias = acc.get_tensorPermissions::RW(BIAS);
auto output_grad = acc.get_tensor_gradPermissions::RW(OUTPUT);
as backward_kernel needs a non-const pointer to be passed
lib/local-execution/test/src/test_local_cost_estimator.cc
line 15 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Bracket initialization and argument name comments (see test_concat_kernel.cu)
Done.
lib/kernels/src/accessor.cc
line 70 at r10 (raw file):
for (size_t i = 0; i < this->shape.num_dims(); i++) { if (indices[i] >= this->shape.at(legion_dim_t(i))) {
Done.
lib/kernels/src/accessor.cc
line 75 at r10 (raw file):
"when only {} indexes exist", i, indices[i],
Done.
lib/kernels/src/accessor.cc
line 79 at r10 (raw file):
} offset += indices[i] * multiplier;
Done.
lib/kernels/src/accessor.cc
line 156 at r10 (raw file):
for (size_t i = 0; i < this->shape.num_dims(); i++) { if (indices[i] >= this->shape.at(legion_dim_t(i))) {
Done.
lib/kernels/src/accessor.cc
line 161 at r10 (raw file):
"when only {} indexes exist", i, indices[i],
Done.
lib/kernels/src/accessor.cc
line 165 at r10 (raw file):
} offset += indices[i] * multiplier;
Done.
lib/kernels/test/src/test_managed_ff_stream.cc
line 7 at r10 (raw file):
TEST_SUITE(FF_TEST_SUITE) { TEST_CASE("Test Managed FF Stream") {
Done.
lib/kernels/test/src/test_managed_ff_stream.cc
line 11 at r10 (raw file):
ffStream_t const *base_stream_ptr = &base_stream.raw_stream(); SUBCASE("Test ManagedFFStream Move Constructor") {
Done.
lib/kernels/test/src/test_managed_ff_stream.cc
line 21 at r10 (raw file):
new_stream = std::move(base_stream); CHECK(&base_stream.raw_stream() == nullptr); CHECK(&new_stream.raw_stream() == base_stream_ptr);
Done.
lib/kernels/test/src/test_utils.cc
line 147 at r10 (raw file):
T const *data_ptr = accessor.get<DT>(); for (size_t i = 0; i < accessor.shape.num_elements(); i++) { stream << data_ptr[i] << " ";
Done.
lib/kernels/test/src/test_utils.cc
line 149 at r10 (raw file):
stream << data_ptr[i] << " "; } stream << "\n";
Done.
lib/kernels/src/managed_per_device_ff_handle.cc
line 7 at r9 (raw file):
ManagedPerDeviceFFHandle::ManagedPerDeviceFFHandle() { this->handle = new PerDeviceFFHandle;
Done.
lib/kernels/test/src/test_managed_per_device_ff_handle.cc
line 7 at r10 (raw file):
TEST_SUITE(FF_TEST_SUITE) { TEST_CASE("Test Managed Per Device FF Handle") {
Done.
lib/kernels/test/src/test_managed_per_device_ff_handle.cc
line 11 at r10 (raw file):
PerDeviceFFHandle const *base_handle_ptr = &base_handle.raw_handle(); SUBCASE("Test ManagedPerDeviceFFHandle Constructor") {
Done.
lib/kernels/test/src/test_managed_per_device_ff_handle.cc
line 16 at r10 (raw file):
} SUBCASE("Test ManagedPerDeviceFFHandle Move Constructor") {
Done.
lib/kernels/test/src/test_managed_per_device_ff_handle.cc
line 34 at r10 (raw file):
base_handle = std::move(base_handle); CHECK(&base_handle.raw_handle() == base_handle_ptr); }
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 31 of 44 files at r11, 63 of 67 files at r12, all commit messages.
Reviewable status: 140 of 144 files reviewed, 27 unresolved discussions (waiting on @oOTigger and @reyna-abhyankar)
lib/kernels/test/src/test_managed_per_device_ff_handle.cc
line 24 at r10 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
Done.
Not seeing this done?
lib/kernels/test/src/test_utils.cc
line 202 at r10 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
Addressed in Slack
I can't seem to find the explanation. Can you re-explain here or point me to where it was in slack?
lib/local-execution/src/ops/linear.cc
line 152 at r9 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
Oh, I was referring to making:
auto bias = acc.get_tensorPermissions::RO(BIAS);
auto output_grad = acc.get_tensor_gradPermissions::RO(OUTPUT);to:
auto bias = acc.get_tensorPermissions::RW(BIAS);
auto output_grad = acc.get_tensor_gradPermissions::RW(OUTPUT);as backward_kernel needs a non-const pointer to be passed
But why? The kernel doesn't modify the bias, does it? You control the signature of backward_kernel
so you can always change the signature if it doesn't make sense, it's not set in stone.
lib/kernels/include/kernels/accessor.h
line 74 at r12 (raw file):
offset += indices.at(i) * multiplier; multiplier *= this->shape.at(legion_dim_t{i});
Pull offset calculation (and probably also bounds checking) out into a separate helper function
Code quote:
int offset = 0;
int multiplier = 1;
for (int i = 0; i < this->shape.num_dims(); i++) {
if (indices.at(i) >= this->shape.at(legion_dim_t{i})) {
throw mk_runtime_error(
fmt::format("In {} dimension, attempting to access index {} "
"when only {} indexes exist",
i,
indices.at(i),
this->shape.at(legion_dim_t{i})));
}
offset += indices.at(i) * multiplier;
multiplier *= this->shape.at(legion_dim_t{i});
lib/kernels/include/kernels/accessor.h
line 159 at r12 (raw file):
offset += indices.at(i) * multiplier; multiplier *= this->shape.at(legion_dim_t{i});
Pull offset calculation (and probably also bounds checking) out into a separate helper function
Code quote:
int offset = 0;
int multiplier = 1;
for (int i = 0; i < this->shape.num_dims(); i++) {
if (indices.at(i) >= this->shape.at(legion_dim_t{i})) {
throw mk_runtime_error(
fmt::format("In {} dimension, attempting to access index {} "
"when only {} indexes exist",
i,
indices.at(i),
this->shape.at(legion_dim_t{i})));
}
offset += indices.at(i) * multiplier;
multiplier *= this->shape.at(legion_dim_t{i});
lib/kernels/include/kernels/accessor.h
line 185 at r12 (raw file):
T const *data_ptr = static_cast<T const *>(this->ptr); int offset = 0; int multiplier = 1;
Pull offset calculation (and probably also bounds checking) out into a separate helper function
lib/kernels/include/kernels/flat_kernels.h
line 13 at r12 (raw file):
float *output_ptr); void backward_kernel(cudaStream_t stream,
Why change from ffStream_t
to cudaStream_t
?
Code quote:
cudaStream_t stream,
lib/kernels/include/kernels/managed_per_device_ff_handle.h
line 29 at r12 (raw file):
void cleanup(); private:
Suggestion:
private:
void cleanup();
private:
lib/kernels/include/kernels/managed_ff_stream.h
line 25 at r12 (raw file):
private: ffStream_t *stream;
Suggestion:
private:
void cleanup();
private:
ffStream_t *stream;
lib/kernels/include/kernels/loss_function_kernels.h
line 4 at r12 (raw file):
#define _FLEXFLOW_KERNELS_INCLUDE_KERNELS_LOSS_FUNCTION_KERNELS_H #include "kernels/device.h"
Why remove this? We generally prefer absolute includes over relative includes
lib/kernels/include/kernels/metrics_kernels.h
line 18 at r12 (raw file):
void update_metrics_label_kernel_wrapper(float const *logit_ptr, float const *label_ptr, MetricsAttrs const *me,
Why take as a pointer rather than by const ref?
lib/kernels/test/src/test_utils.cc
line 151 at r12 (raw file):
for (int i = 0; i < rows; i++) { for (int j = 0; j < cols; j++) { stream << data_ptr[i * cols + j];
Don't we have cleaner indexing syntax now for accessors that can take something like (i, j)
rather than doing the calculation inline?
lib/kernels/test/src/test_utils.cc
line 153 at r12 (raw file):
stream << data_ptr[i * cols + j]; if (j < cols - 1) {
Might be better to use join_strings
from utils
rather than re-implementing the logic here.
lib/local-execution/src/ops/pool_2d.cc
line 33 at r12 (raw file):
auto output = acc.get_tensor<Permissions::WO>(OUTPUT); int input_w = input.shape.at(ff_dim_t{0}) + 1;
Why all these + 1
s? I suppose also a question for @reyna-abhyankar
Code quote:
int input_w = input.shape.at(ff_dim_t{0}) + 1;
lib/pcg/src/pcg/metric.cc
line 5 at r12 (raw file):
namespace FlexFlow { MetricsAttrs::MetricsAttrs(LossFunction _loss_type, std::vector<Metric> const &metrics)
Would probably make more sense to take in a std::unordered_set
as there shouldn't be duplicates anyway and ordering doesn't matter?
Code quote:
std::vector<Metric> const &metrics)
lib/pcg/src/pcg/metric.cc
line 33 at r12 (raw file):
default: throw mk_runtime_error( "Initializing MetricsAttrs with unrecogonized metrics type");
Suggestion:
throw mk_runtime_error(fmt::format(
"Initializing MetricsAttrs with unrecogonized metrics type {}", m));
lib/kernels/include/kernels/pool_2d_kernels.h
line 70 at r12 (raw file):
void *output_ptr); void backward_kernel(cudaStream_t stream,
This should stay as ffStream_t
--these header files are intended to be stable between the hip and cuda kernels, with the ff*
typedefs used to handle the different types used between the two. If you hardcode the type to cudaStream_t
this fails.
Code quote:
cudaStream_t stream,
lib/kernels/include/kernels/per_device_op_state.variant.toml
line 0 at r12 (raw file):
Why was this moved to kernels
?
lib/kernels/src/cuda/embedding_kernels.cu
line 40 at r12 (raw file):
} template <typename TI, typename TD>
Why remove one template parameter? No particular objection, just curious
Code quote:
emplate <typename TI, typename TD>
lib/kernels/src/cuda/embedding_kernels.cu
line 100 at r12 (raw file):
int in_dim, int batch_size, std::optional<AggregateOp> aggr) {
Why remove the optional
?
Code quote:
std::optional<AggregateOp> aggr)
lib/op-attrs/include/op-attrs/make_datatype_value.h
line 8 at r12 (raw file):
namespace FlexFlow { DataTypeValue make_float_data_type_value(float value);
These seem to be duplicated (here and in datatype_value.h
)?
lib/op-attrs/include/op-attrs/aggregate_op.enum.toml
line 17 at r12 (raw file):
[[values]] name = "NONE"
Don't add a NONE
value--if you want an optional aggregate op, instead use std::optional<AggregateOp>
Code quote:
[[values]]
name = "NONE"
lib/op-attrs/src/op-attrs/make_datatype_value.cc
line 5 at r12 (raw file):
namespace FlexFlow { DataTypeValue make_float_data_type_value(float value) {
Can you make sure these get tested to double check that the right variant gets created?
lib/pcg/include/pcg/metric.h
line 17 at r12 (raw file):
ROOT_MEAN_SQUARED_ERROR, MEAN_ABSOLUTE_ERROR, };
Move over to a .enum.toml
so fmt, etc. is auto-generated
Code quote:
enum class Metric {
ACCURACY,
CATEGORICAL_CROSSENTROPY,
SPARSE_CATEGORICAL_CROSSENTROPY,
MEAN_SQUARED_ERROR,
ROOT_MEAN_SQUARED_ERROR,
MEAN_ABSOLUTE_ERROR,
};
lib/pcg/include/pcg/metric.h
line 32 at r12 (raw file):
bool measure_root_mean_squared_error; bool measure_mean_absolute_error; };
Move over to dtgen
so all the boilerplate (equality, hashing, fmt, etc.) gets auto-generated?
Code quote:
class MetricsAttrs {
public:
MetricsAttrs() = delete;
MetricsAttrs(LossFunction, std::vector<Metric> const &);
public:
LossFunction loss_type;
bool measure_accuracy;
bool measure_categorical_crossentropy;
bool measure_sparse_categorical_crossentropy;
bool measure_mean_squared_error;
bool measure_root_mean_squared_error;
bool measure_mean_absolute_error;
};
lib/pcg/include/pcg/metric.h
line 70 at r12 (raw file):
}; } // namespace fmt
Remove in favor of dtgen-generated fmt
Code quote:
namespace fmt {
template <>
struct formatter<::FlexFlow::Metric> : formatter<string_view> {
template <typename FormatContext>
auto format(::FlexFlow::Metric m, FormatContext &ctx) const
-> decltype(ctx.out()) {
using namespace FlexFlow;
string_view name = "unknown";
switch (m) {
case Metric::ACCURACY:
name = "Accuracy";
break;
case Metric::CATEGORICAL_CROSSENTROPY:
name = "CategoricalCrossEntropy";
break;
case Metric::SPARSE_CATEGORICAL_CROSSENTROPY:
name = "SparseCategoricalCrossEntropy";
break;
case Metric::MEAN_SQUARED_ERROR:
name = "MeanSquaredError";
break;
case Metric::ROOT_MEAN_SQUARED_ERROR:
name = "RootMeanSquaredError";
break;
case Metric::MEAN_ABSOLUTE_ERROR:
name = "MeanAbsoluteError";
break;
}
return formatter<string_view>::format(name, ctx);
}
};
} // namespace fmt
Tests for:
Changes to kernels/test_utils, GenericTensorAccessors, and the addition of local_cpu_allocator
This change is