Use rocprofv2 instead of rocprof. #1672
base: develop
Conversation
MIOPENDRIVER = '/opt/rocm/bin/MIOpenDriver'
BENCHMARKING_RESULT_FILE_NAME = 'results.stats.csv'
BENCHMARKING_METRICS_FILE_NAME = 'results.csv'
BENCHMARKINGV1_RESULT_FILE_NAME = 'results.stats.csv'
Of course things move around for rocprofv2, and there's no way I found to make them the same. In particular, that "pmc_1" directory inserts itself because of either --kernel-trace or the stats in -i, I forget which.
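If the lookup needs to tolerate both layouts, a minimal sketch (the pmc_1 name is taken from the comment above, and the helper name is hypothetical):

```python
import os

def findStatsFile(outDir, fileName='results.stats.csv'):
    # rocprof (v1) writes the stats CSV at the top level; rocprofv2
    # inserts a pmc_1/ subdirectory (per the comment above -- an
    # assumption about when, not a verified rule).
    for candidate in (os.path.join(outDir, fileName),
                      os.path.join(outDir, 'pmc_1', fileName)):
        if os.path.exists(candidate):
            return candidate
    return None
```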
mlir/utils/performance/perfRunner.py
Outdated
@@ -129,6 +139,9 @@ def create_paths(config_file_path, mlir_build_dir_path) -> Paths:

# utility functions.
def getNanoSeconds(fileName):
    pass
I'm going to assign V1 or V2 to getNanoSeconds. I don't really need this "pass" implementation.
mlir/utils/performance/perfRunner.py
Outdated
if not os.path.exists(fileName):
-    result = "NaN"
-    return result
+    return np.nan
We had not been consistent and used "nan", "NaN", and np.nan in different places.
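A quick illustration of why a single np.nan sentinel is safer than string spellings:

```python
import numpy as np

value = np.nan          # one sentinel everywhere
print(np.isnan(value))  # True: detectable with one check
print(value * 2.0)      # nan: propagates through arithmetic
# A string sentinel like "NaN" raises TypeError in arithmetic and
# compares unequal to "nan", so mixed spellings can slip past comparisons.
```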
mlir/utils/performance/perfRunner.py
Outdated
with open(fileName, 'r') as csv_file:
    reader = csv.DictReader(csv_file, delimiter = ',')
    header = reader.fieldnames
    if 'LDSBankConflict' not in header:
        return np.nan

-    result = []
+    sum = 0
Counting the rows and accumulating the sum, vs accumulating a list and then calling sum and len on it.
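For comparison, a sketch of both styles (column name from the diff above; error handling omitted):

```python
import csv

def meanConflictsStreaming(fileName):
    # New style: count rows and accumulate the sum in a single pass.
    total, rows = 0.0, 0
    with open(fileName, 'r') as csv_file:
        for row in csv.DictReader(csv_file):
            total += float(row['LDSBankConflict'])
            rows += 1
    return total / rows if rows else float('nan')

def meanConflictsList(fileName):
    # Old style: build a list, then call sum() and len() on it.
    with open(fileName, 'r') as csv_file:
        values = [float(row['LDSBankConflict'])
                  for row in csv.DictReader(csv_file)]
    return sum(values) / len(values) if values else float('nan')
```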
@@ -209,10 +278,10 @@ def getMilliseconds(output):

    return float(result.group(1))

-def runPipeline(proc_specs):
+def runPipeline(proc_specs, initial_stdin=subprocess.DEVNULL):
"initial_stdin" exists to send some mlir text into the first stage, see below.
@@ -1048,50 +1117,7 @@ def findRunCommand(filename):
    print("WARNING: cannot find valid RUN command in ", filename)
    return None, None

-# Extract testVector and test function name from the test file
-def getFusionTestInfo(filename, paths: Paths):
I had missed getFusionTestInfo when I updated everything to use runPipeline. When I started updating it, I noticed that everything up to the tuningKey process was identical with runFusionKernel, so I abstracted the common part into makeBasicFusionPipeline.
Then I realised that it was doing redundant work, because we'd collect tests and call getFusionTestInfo on them, then loop through the collected tests and call runFusionKernel on those, and that would do the basic pipeline twice. I recast it to do the basics once and save the mlir, and merged the collection and running loops so it doesn't have to save the mlir for very long. More notes below.
def runFusionKernel(mlirfile, rocmlirGenArgs, paths: Paths):
Now takes as input a file of the mlir from the basic-fusion-pipeline.
if not futName:
    print("\tCannot find rocmlir-gen with -fut")
    continue
# Prepare test cases
Merged the two loops into one, with the split-k hack moved up first. Go through each .mlir file, extract the useful RUN: command if present, and do makeBasicFusionPipeline with it to make the initial mlir code. We have a temp file to hold that code, because I couldn't find a good way to save it as a string and then send it to another pipeline later without having a file.
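In outline, with hypothetical stand-ins for the real helpers:

```python
import glob
import tempfile

# Illustrative stubs; the real functions live in perfRunner.py.
def makeBasicFusionPipeline(path):
    return "module {}", "fut_name"    # (lowered MLIR text, -fut name)

def runFusionKernel(mlir_path):
    return 0.0                        # kernel time from the profiler

def runAllFusionTests(test_dir):
    results = {}
    for filename in sorted(glob.glob(f"{test_dir}/*.mlir")):
        mlirText, futName = makeBasicFusionPipeline(filename)
        if not futName:
            print("\tCannot find rocmlir-gen with -fut")
            continue
        # Hold the lowered MLIR in a temp file: a later pipeline stage
        # has to read it, and keeping it only as a string proved awkward.
        with tempfile.NamedTemporaryFile(mode='w', suffix='.mlir') as tmp:
            tmp.write(mlirText)
            tmp.flush()
            results[filename] = runFusionKernel(tmp.name)
    return results
```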
op = 'gemm'
config = GemmConfiguration.fromCommandLine(commandLine, arch, numCU)

# Find the best perf_config
Look for the perf-config for the kernel configuration, or make a dummy NaN one.
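Something like this sketch (the database lookup and entry layout are illustrative; the TFlops keys follow the diff below):

```python
import math

def findPerfConfig(tuningDb, testVector):
    # Use the tuned perf_config when one exists; otherwise fall back to
    # a placeholder entry whose timings are NaN, so downstream reporting
    # still records the test vector.
    if testVector in tuningDb:
        return tuningDb[testVector]
    return {'PerfConfig': None,
            'TFlops': math.nan,
            'MLIR TFlops': math.nan}
```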
perfResults[testVector] = oneEntry
continue

# Run fusion test
Run the kernel, reusing the MLIR in the temp file, and record its time. We anticipate duplicates -- e.g., in the bert tests there are 24 .mlir files but only eight unique kernels -- and just take the best-performing one. Given natural variation, times will be close but the winning file can be different every time.
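Keeping only the fastest duplicate can be as simple as this sketch (entry keys from the diff below):

```python
def recordBest(perfResults, testVector, oneEntry):
    # Several .mlir files can lower to the same kernel (24 bert files,
    # eight unique kernels), so keep whichever run performed best.
    previous = perfResults.get(testVector)
    if previous is None or oneEntry['TFlops'] > previous['TFlops']:
        perfResults[testVector] = oneEntry
```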
oneEntry['Fusion/MLIR'] = oneEntry['TFlops']/oneEntry['MLIR TFlops']
oneEntry['FileName'] = filename
perfResults[testVector] = oneEntry
# Run gemm or conv op with the same configuration
Run the generated kernel in the usual way for reference.
-xdlop_supported_gpus_str = xdlop_supported_gpus[0]
-for gpu in xdlop_supported_gpus[1:]:
-    xdlop_supported_gpus_str += '|' + gpu
+xdlop_supported_gpus_str = '|'.join(xdlop_supported_gpus)
Idiom.
p1.kill()
print("MIOpen tuning timed out")
_, errs = p1.communicate()
runPipeline([MIOpenDriverCommand])
Missed another one. Last one, I swear.
@@ -1285,7 +1308,7 @@ def getNumCU(chip):
    rocminfo = subprocess.check_output("/opt/rocm/bin/rocminfo",
                                       stderr=subprocess.PIPE)
except subprocess.CalledProcessError as e:
-    print(e.stderr.decode('utf-8'))
+    print(f"Process error: {e.stderr.decode('utf-8')}")
Still trying to identify the cause of the intermittent rocminfo failures. Highlight this case in the log file.
mlir/utils/performance/perfRunner.py
Outdated
parsed_args = parser.parse_args(args)

global getNanoSeconds, getBankConflict, ROCPROF, ROCPROF_OPTS
Swap functions, options, and filenames based on the option.
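Roughly like this sketch (stub parsers and assumed option lists; only the binary paths and the global names come from the code):

```python
def getNanoSecondsV1(fileName): ...   # stub: parses V1 results.stats.csv
def getNanoSecondsV2(fileName): ...   # stub: parses the V2 pmc_1/ layout

# The option lists here are placeholders, not the PR's exact flags.
ROCPROF_BY_VERSION = {
    1: ('/opt/rocm/bin/rocprof',   ['--stats']),
    2: ('/opt/rocm/bin/rocprofv2', ['--kernel-trace']),
}

def selectRocprofVersion(version):
    # Hot-swap the module-level globals according to --rocprof-version.
    global getNanoSeconds, ROCPROF, ROCPROF_OPTS
    getNanoSeconds = getNanoSecondsV1 if version == 1 else getNanoSecondsV2
    ROCPROF, ROCPROF_OPTS = ROCPROF_BY_VERSION[version]
```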
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##           develop    #1672     +/-   ##
===========================================
- Coverage    78.17%   77.97%   -0.21%
===========================================
  Files          100      100
  Lines        27994    27994
  Branches      4087     4087
===========================================
- Hits         21884    21827     -57
- Misses        4463     4506     +43
- Partials      1647     1661     +14

Flags with carried forward coverage won't be shown.
(Haven't fully read the PR, minor thoughts)
@@ -894,6 +964,7 @@ def benchmarkExternal(cls, commandLine, paths: Paths, arch, numCU):
    benchmarkArgs = config.generateMlirDriverCommandLine("")
    # remove the result file generated by rocprof in previous benchmarking
    os.system("rm -f "+BENCHMARKING_RESULT_FILE_NAME)
+   os.system("rm -f "+BENCHMARKING_METRICS_FILE_NAME)
We can call rm just once.
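A possible one-liner, reusing the file-name constants from the diff at the top (a sketch, not the PR's actual change):

```python
import os

BENCHMARKING_RESULT_FILE_NAME = 'results.stats.csv'
BENCHMARKING_METRICS_FILE_NAME = 'results.csv'

# One rm invocation removes both stale rocprof output files.
os.system(f"rm -f {BENCHMARKING_RESULT_FILE_NAME} {BENCHMARKING_METRICS_FILE_NAME}")
```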
... why are we keeping rocprof v1 support?
Keeping rocprof V1 in case we have discrepancies to investigate (though since we don't keep much history, that's probably not a big concern) and in case rocprofv2 stops supporting an architecture before we do. Mostly because it was pretty easy to do and might be helpful.
mlir/utils/performance/perfRunner.py
Outdated
return result

def getNanoSecondsV2(fileName):
Overall comment: perhaps `class Profiler:` is in order here, instead of hot-swapping functions onto variables?
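For example (names and layout illustrative; the commit below suggests this is roughly the direction the PR took):

```python
class Profiler:
    """Encapsulates one rocprof generation instead of swapped globals."""
    ROCPROF = None       # path to the profiler binary
    RESULT_FILE = None   # where the stats CSV lands

    def getNanoSeconds(self, fileName):
        raise NotImplementedError

class ProfilerV1(Profiler):
    ROCPROF = '/opt/rocm/bin/rocprof'
    RESULT_FILE = 'results.stats.csv'

    def getNanoSeconds(self, fileName):
        ...  # parse the V1 stats CSV

class ProfilerV2(Profiler):
    ROCPROF = '/opt/rocm/bin/rocprofv2'
    RESULT_FILE = 'pmc_1/results.stats.csv'  # pmc_1/ per the earlier comment

    def getNanoSeconds(self, fileName):
        ...  # parse the V2 CSV, accounting for MLIR_N_REPEATS

# e.g.: profiler = ProfilerV2() if args.rocprof_version == 2 else ProfilerV1()
```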
Abstract the boilerplate for collecting results from a process.
Account for .MLIR_N_REPEATS in rocprofv2 results, which don't include it.
Account for nrepeats in a smarter way -- count the rows, while verifying.
Don't run the attention benchmarks from perfRunner.py on gfx110x.
Don't run the CK benchmarking for gfx110x, because ck-benchmark-driver won't compile.
getFusionTestInfo and runFusionKernel turn out to be mostly the same.
Invent --rocprof-version to switch between rocprof and rocprofv2.
Change default to rocprofv2.
Made Profiler classes to handle the V1/V2 switch more cleanly. Made tuningRunner.py use Profiler to get consistent arguments. Some places in tuningRunner.py use runPipeline, some don't yet.
Force-pushed from 390b3fc to f5ba66a.
Use rocprofv2 instead of rocprof.
Account for .MLIR_N_REPEATS in rocprofv2 results, which don't include it.
Account for nrepeats in a smarter way -- count the rows, while verifying.
getFusionTestInfo and runFusionKernel turn out to be mostly the same.
Invent --rocprof-version to switch between rocprof and rocprofv2.
Change default to rocprofv2.