
Use rocprofv2 instead of rocprof. #1672

Open · pcf000 wants to merge 2 commits into develop
Conversation

pcf000 (Contributor) commented Oct 8, 2024

Use rocprofv2 instead of rocprof.
Account for .MLIR_N_REPEATS in rocprofv2 results, which don't include it.
Account for nrepeats in a smarter way -- count the rows, while verifying.
getFusionTestInfo and runFusionKernel turn out to be mostly the same.
Invent --rocprof-version to switch between rocprof and rocprofv2.
Change default to rocprofv2.

MIOPENDRIVER = '/opt/rocm/bin/MIOpenDriver'
BENCHMARKING_RESULT_FILE_NAME = 'results.stats.csv'
BENCHMARKING_METRICS_FILE_NAME = 'results.csv'
BENCHMARKINGV1_RESULT_FILE_NAME = 'results.stats.csv'
pcf000 (Author):

Of course things move around for rocprofv2, and I found no way to make the layouts the same. In particular, the "pmc_1" directory inserts itself because of either --kernel-trace or the stats in -i, I forget which.
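For readers following along, a minimal sketch of how the lookup can tolerate that extra directory. The helper name and layout fallback here are my assumptions, not the PR's code:

```python
import os

# Hypothetical helper: rocprofv2 may nest its CSVs under a pmc_1/
# subdirectory when counters or stats are requested, so look there first
# and fall back to the flat layout that rocprof (v1) uses.
def findResultsFile(outDir, fileName='results.stats.csv'):
    nested = os.path.join(outDir, 'pmc_1', fileName)
    if os.path.exists(nested):
        return nested
    return os.path.join(outDir, fileName)
```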

@@ -129,6 +139,9 @@ def create_paths(config_file_path, mlir_build_dir_path) -> Paths:

# utility functions.
def getNanoSeconds(fileName):
    pass
pcf000 (Author):

I'm going to assign V1 or V2 to getNanoSeconds. I don't really need this "pass" implementation.
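A minimal sketch of that plan, with placeholder bodies (the V1/V2 names follow the diffs below; the actual binding happens during argument parsing):

```python
def getNanoSecondsV1(fileName):
    pass  # parse rocprof (v1) results.stats.csv

def getNanoSecondsV2(fileName):
    pass  # parse rocprofv2 results, accounting for repeats

# Rebound to one of the two implementations once --rocprof-version is parsed.
getNanoSeconds = getNanoSecondsV2
```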

 if not os.path.exists(fileName):
-    result = "NaN"
-    return result
+    return np.nan
pcf000 (Author):

We had not been consistent and used "nan", "NaN", and np.nan in different places.
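The practical difference, as a small illustration (not code from the PR):

```python
import numpy as np

# np.nan propagates through float arithmetic and is detectable with
# np.isnan(); the string "NaN" only fails later, when something tries
# math on it.
v = np.nan
print(v + 1.0)      # nan -- no exception
print(np.isnan(v))  # True
# "NaN" + 1.0 would instead raise a TypeError at some distant call site.
```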

 with open(fileName, 'r') as csv_file:
     reader = csv.DictReader(csv_file, delimiter = ',')
     header = reader.fieldnames
     if 'LDSBankConflict' not in header:
         return np.nan

-    result = []
+    sum = 0
pcf000 (Author):

Counting the rows and accumulating the sum, instead of accumulating a list and then calling sum() and len() on it.
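A sketch of the counting approach, including the verification against the expected repeat count; the 'DurationNs' column name is my placeholder, not necessarily the PR's:

```python
import csv
import numpy as np

def averageKernelNs(fileName, nrepeats):
    total = 0.0
    rows = 0
    with open(fileName, 'r') as csv_file:
        for row in csv.DictReader(csv_file, delimiter=','):
            total += float(row['DurationNs'])  # assumed column name
            rows += 1
    # Verify while counting: the row count should be a multiple of nrepeats.
    if rows == 0 or rows % nrepeats != 0:
        return np.nan
    return total / rows
```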

@@ -209,10 +278,10 @@ def getMilliseconds(output):

     return float(result.group(1))

-def runPipeline(proc_specs):
+def runPipeline(proc_specs, initial_stdin=subprocess.DEVNULL):
pcf000 (Author):

"initial_stdin" exists to send some mlir text into the first stage, see below.

@@ -1048,50 +1117,7 @@ def findRunCommand(filename):
     print("WARNING: cannot find valid RUN command in ", filename)
     return None, None

 # Extract testVector and test function name from the test file
 def getFusionTestInfo(filename, paths: Paths):
pcf000 (Author):

I had missed getFusionTestInfo when I updated everything to use runPipeline. When I started updating it, I noticed that everything up to the tuningKey process was identical with runFusionKernel, so I abstracted the common part into makeBasicFusionPipeline.

Then I realised that it was doing redundant work, because we'd collect tests and call getFusionTestInfo on them, then loop through the collected tests and call runFusionKernel on those, and that would do the basic pipeline twice. I recast it to do the basics once and save the mlir, and merged the collection and running loops so it doesn't have to save the mlir for very long. More notes below.
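The resulting shape, roughly. makeBasicFusionPipeline and runFusionKernel are named in the PR; the wrapper and its stubs below are placeholders of mine:

```python
def makeBasicFusionPipeline(filename, paths):
    pass  # shared stages formerly duplicated in getFusionTestInfo/runFusionKernel

def runFusionKernel(mlirfile, rocmlirGenArgs, paths):
    pass  # benchmark the pre-built MLIR (signature from the diff below)

def processFusionTest(filename, rocmlirGenArgs, paths):
    # Build the basic fusion pipeline once, keep its MLIR output, and feed
    # that single artifact to the benchmarking step -- not twice as before.
    mlirfile = makeBasicFusionPipeline(filename, paths)
    return runFusionKernel(mlirfile, rocmlirGenArgs, paths)
```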


def runFusionKernel(mlirfile, rocmlirGenArgs, paths: Paths):
pcf000 (Author):

Now takes as input a file containing the MLIR produced by the basic fusion pipeline.

if not futName:
    print("\tCannot find rocmlir-gen with -fut")
    continue
# Prepare test cases
pcf000 (Author):

Merged the two loops into one, with the split-k hack moved up first. Go through each .mlir file, extract the useful RUN: command if present, and run makeBasicFusionPipeline with it to produce the initial mlir code. We use a temp file to hold that code, because I couldn't find a good way to save it as a string and then send it to another pipeline later without having a file.
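A sketch of the temp-file handoff; the helper name is illustrative:

```python
import tempfile

def saveMlir(mlirText: bytes) -> str:
    # Park the pipeline output in a named file so a later pipeline can
    # reopen it as initial_stdin; a raw Python string has no file
    # descriptor to hand to subprocess.
    with tempfile.NamedTemporaryFile(suffix='.mlir', delete=False) as tmp:
        tmp.write(mlirText)
        return tmp.name
```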

op = 'gemm'
config = GemmConfiguration.fromCommandLine(commandLine, arch, numCU)

# Find the best perf_config
pcf000 (Author):

Look for the perf-config for the kernel configuration, or make a dummy NaN one.
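In outline, with placeholder names for the database and key:

```python
import numpy as np

def findPerfConfig(tuningDb: dict, tuningKey: str):
    # Return the tuned perf_config if the tuning database has one, else a
    # dummy entry whose TFlops is NaN so the row still appears in results.
    perfConfig = tuningDb.get(tuningKey)
    if perfConfig is None:
        return {'PerfConfig': 'None', 'TFlops': np.nan}
    return {'PerfConfig': perfConfig}
```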

perfResults[testVector] = oneEntry
continue

# Run fusion test
pcf000 (Author):

Run the kernel, reusing the mlir in the temp file, and record its time. We anticipate duplicates -- e.g., in the bert tests there are 24 .mlir files but only eight unique kernels -- and just take the best-performing. Given natural variation, times will be close but the winning file can be different every time.
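The dedupe itself can be a guarded dictionary update; a sketch using the names that appear in the diff, wrapped in a hypothetical helper:

```python
def recordBest(perfResults: dict, testVector: str, oneEntry: dict):
    # Several .mlir files can lower to the same kernel (same testVector);
    # keep only the best-performing entry for each.
    best = perfResults.get(testVector)
    if best is None or oneEntry['TFlops'] > best['TFlops']:
        perfResults[testVector] = oneEntry
```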

oneEntry['Fusion/MLIR'] = oneEntry['TFlops']/oneEntry['MLIR TFlops']
oneEntry['FileName'] = filename
perfResults[testVector] = oneEntry
# Run gemm or conv op with the same configuration
pcf000 (Author):

Run generated kernel in the usual way for reference.

-xdlop_supported_gpus_str = xdlop_supported_gpus[0]
-for gpu in xdlop_supported_gpus[1:]:
-    xdlop_supported_gpus_str += '|' + gpu
+xdlop_supported_gpus_str = '|'.join(xdlop_supported_gpus)
pcf000 (Author):

Idiom.

     p1.kill()
     print("MIOpen tuning timed out")
     _, errs = p1.communicate()
+runPipeline([MIOpenDriverCommand])
pcf000 (Author):

Missed another one. Last one, I swear.

@@ -1285,7 +1308,7 @@ def getNumCU(chip):
         rocminfo = subprocess.check_output("/opt/rocm/bin/rocminfo",
                                            stderr=subprocess.PIPE)
     except subprocess.CalledProcessError as e:
-        print(e.stderr.decode('utf-8'))
+        print(f"Process error: {e.stderr.decode('utf-8')}")
pcf000 (Author):

Still trying to identify the cause of the intermittent rocminfo failures. Highlight this case in the log file.

parsed_args = parser.parse_args(args)

global getNanoSeconds, getBankConflict, ROCPROF, ROCPROF_OPTS
pcf000 (Author):

Swap functions, options, and filenames based on the option.
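Roughly like the following; the --rocprof-version option and the global names come from the PR, while the paths and option lists here are illustrative (--kernel-trace is mentioned above for v2; --stats is a rocprof v1 flag):

```python
def configureProfiler(parsed_args):
    # Illustrative stand-in for the post-parse_args block: pick the binary,
    # its options, and the matching parser functions in one place.
    global getNanoSeconds, getBankConflict, ROCPROF, ROCPROF_OPTS
    if parsed_args.rocprof_version == 1:
        ROCPROF = '/opt/rocm/bin/rocprof'
        ROCPROF_OPTS = ['--stats']        # illustrative
        getNanoSeconds = getNanoSecondsV1
    else:
        ROCPROF = '/opt/rocm/bin/rocprofv2'
        ROCPROF_OPTS = ['--kernel-trace'] # illustrative
        getNanoSeconds = getNanoSecondsV2
```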


codecov bot commented Oct 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 77.97%. Comparing base (99fc9d2) to head (f5ba66a).
Report is 4 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1672      +/-   ##
===========================================
- Coverage    78.17%   77.97%   -0.21%     
===========================================
  Files          100      100              
  Lines        27994    27994              
  Branches      4087     4087              
===========================================
- Hits         21884    21827      -57     
- Misses        4463     4506      +43     
- Partials      1647     1661      +14     
Flag Coverage Δ
mfma 77.97% <ø> (-0.21%) ⬇️

Flags with carried forward coverage won't be shown.


krzysz00 (Collaborator) left a comment

(Haven't fully read the PR, minor thoughts)

@@ -894,6 +964,7 @@ def benchmarkExternal(cls, commandLine, paths: Paths, arch, numCU):
     benchmarkArgs = config.generateMlirDriverCommandLine("")
     # remove the result file generated by rocprof in previous benchmarking
     os.system("rm -f "+BENCHMARKING_RESULT_FILE_NAME)
+    os.system("rm -f "+BENCHMARKING_METRICS_FILE_NAME)
krzysz00 (Collaborator):

We can call rm just once
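That is, the two removals could collapse into a single call:

```python
os.system(f"rm -f {BENCHMARKING_RESULT_FILE_NAME} {BENCHMARKING_METRICS_FILE_NAME}")
```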

parsed_args = parser.parse_args(args)

global getNanoSeconds, getBankConflict, ROCPROF, ROCPROF_OPTS
krzysz00 (Collaborator):

... why are we keeping rocprof v1 support?

pcf000 (Author) commented Oct 8, 2024

Keeping rocprof V1 in case we have discrepancies to investigate (though since we don't keep much history, that's probably not a big concern) and in case rocprofv2 stops supporting an architecture before we do. Mostly because it was pretty easy to do and might be helpful.

return result

def getNanoSecondsV2(fileName):
krzysz00 (Collaborator):

Overall comment: perhaps class Profiler: is in order here, instead of hot-swapping functions onto variables?
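A sketch of what that suggestion could look like; all names and values are illustrative, and per the commit notes below the PR later adopted Profiler classes:

```python
class Profiler:
    binary = None
    resultFile = None
    def getNanoSeconds(self, fileName):
        raise NotImplementedError

class RocprofV1(Profiler):
    binary = '/opt/rocm/bin/rocprof'
    resultFile = 'results.stats.csv'
    def getNanoSeconds(self, fileName):
        pass  # v1 CSV parsing

class RocprofV2(Profiler):
    binary = '/opt/rocm/bin/rocprofv2'
    resultFile = 'results.csv'
    def getNanoSeconds(self, fileName):
        pass  # v2 CSV parsing, accounting for MLIR_N_REPEATS

def makeProfiler(version: int) -> Profiler:
    return RocprofV1() if version == 1 else RocprofV2()
```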

Abstract the boilerplate for collecting results from a process.
Account for .MLIR_N_REPEATS in rocprofv2 results, which don't include it.
Account for nrepeats in a smarter way -- count the rows, while verifying.
Don't run attention in perfRunner.py on gfx110x.
Don't run the CK benchmarking for gfx110x, because ck-benchmark-driver won't compile.
getFusionTestInfo and runFusionKernel turn out to be mostly the same.
Invent --rocprof-version to switch between rocprof and rocprofv2.
Change default to rocprofv2.
Made Profiler classes to handle the V1/V2 switch more cleanly.
Made tuningRunner.py use Profiler to get consistent arguments.
Some places in tuningRunner.py use runPipeline, some don't yet.