
Use rocprofv2 instead of rocprof. #1672

Open · pcf000 wants to merge 2 commits into develop
Conversation

pcf000 (Contributor) commented Oct 8, 2024

Use rocprofv2 instead of rocprof.
Account for .MLIR_N_REPEATS in rocprofv2 results, which don't include it.
Account for nrepeats in a smarter way -- count the rows, while verifying.
getFusionTestInfo and runFusionKernel turn out to be mostly the same.
Invent --rocprof-version to switch between rocprof and rocprofv2.
Change default to rocprofv2.

MIOPENDRIVER = '/opt/rocm/bin/MIOpenDriver'
BENCHMARKING_RESULT_FILE_NAME = 'results.stats.csv'
BENCHMARKING_METRICS_FILE_NAME = 'results.csv'
BENCHMARKINGV1_RESULT_FILE_NAME = 'results.stats.csv'
pcf000 (Author):

Of course things move around for rocprofv2, and I found no way to make the layouts the same. In particular, the "pmc_1" directory inserts itself because of either --kernel-trace or the stats in -i, I forget which.
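For readers following along, a minimal sketch of how the lookup can tolerate that extra directory. The helper name and layout fallback here are my assumptions, not the PR's code:

```python
import os

# Hypothetical helper: rocprofv2 may nest its CSVs under a pmc_1/
# subdirectory when counters or stats are requested, so look there first
# and fall back to the flat layout that rocprof (v1) uses.
def findResultsFile(outDir, fileName='results.stats.csv'):
    nested = os.path.join(outDir, 'pmc_1', fileName)
    if os.path.exists(nested):
        return nested
    return os.path.join(outDir, fileName)
```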

@@ -129,6 +139,9 @@ def create_paths(config_file_path, mlir_build_dir_path) -> Paths:

# utility functions.
def getNanoSeconds(fileName):
    pass
pcf000 (Author):

I'm going to assign V1 or V2 to getNanoSeconds. I don't really need this "pass" implementation.
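A minimal sketch of that plan, with placeholder bodies (the V1/V2 names follow the diffs below; the actual binding happens during argument parsing):

```python
def getNanoSecondsV1(fileName):
    pass  # parse rocprof (v1) results.stats.csv

def getNanoSecondsV2(fileName):
    pass  # parse rocprofv2 results, accounting for repeats

# Rebound to one of the two implementations once --rocprof-version is parsed.
getNanoSeconds = getNanoSecondsV2
```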

 if not os.path.exists(fileName):
-    result = "NaN"
-    return result
+    return np.nan
pcf000 (Author):

We had not been consistent and used "nan", "NaN", and np.nan in different places.
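The practical difference, as a small illustration (not code from the PR):

```python
import numpy as np

# np.nan propagates through float arithmetic and is detectable with
# np.isnan(); the string "NaN" only fails later, when something tries
# math on it.
v = np.nan
print(v + 1.0)      # nan -- no exception
print(np.isnan(v))  # True
# "NaN" + 1.0 would instead raise a TypeError at some distant call site.
```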

 with open(fileName, 'r') as csv_file:
     reader = csv.DictReader(csv_file, delimiter = ',')
     header = reader.fieldnames
     if 'LDSBankConflict' not in header:
         return np.nan

-    result = []
+    sum = 0
pcf000 (Author):

Counting the rows and accumulating the sum, instead of accumulating a list and then calling sum() and len() on it.
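A sketch of the counting approach, including the verification against the expected repeat count; the 'DurationNs' column name is my placeholder, not necessarily the PR's:

```python
import csv
import numpy as np

def averageKernelNs(fileName, nrepeats):
    total = 0.0
    rows = 0
    with open(fileName, 'r') as csv_file:
        for row in csv.DictReader(csv_file, delimiter=','):
            total += float(row['DurationNs'])  # assumed column name
            rows += 1
    # Verify while counting: the row count should be a multiple of nrepeats.
    if rows == 0 or rows % nrepeats != 0:
        return np.nan
    return total / rows
```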

@@ -209,10 +278,10 @@ def getMilliseconds(output):

     return float(result.group(1))

-def runPipeline(proc_specs):
+def runPipeline(proc_specs, initial_stdin=subprocess.DEVNULL):
pcf000 (Author):

"initial_stdin" exists to send some mlir text into the first stage, see below.

@@ -1048,50 +1117,7 @@ def findRunCommand(filename):
     print("WARNING: cannot find valid RUN command in ", filename)
     return None, None

 # Extract testVector and test function name from the test file
 def getFusionTestInfo(filename, paths: Paths):
pcf000 (Author):

I had missed getFusionTestInfo when I updated everything to use runPipeline. When I started updating it, I noticed that everything up to the tuningKey process was identical with runFusionKernel, so I abstracted the common part into makeBasicFusionPipeline.

Then I realised that it was doing redundant work, because we'd collect tests and call getFusionTestInfo on them, then loop through the collected tests and call runFusionKernel on those, and that would do the basic pipeline twice. I recast it to do the basics once and save the mlir, and merged the collection and running loops so it doesn't have to save the mlir for very long. More notes below.
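The resulting shape, roughly. makeBasicFusionPipeline and runFusionKernel are named in the PR; the wrapper and its stubs below are placeholders of mine:

```python
def makeBasicFusionPipeline(filename, paths):
    pass  # shared stages formerly duplicated in getFusionTestInfo/runFusionKernel

def runFusionKernel(mlirfile, rocmlirGenArgs, paths):
    pass  # benchmark the pre-built MLIR (signature from the diff below)

def processFusionTest(filename, rocmlirGenArgs, paths):
    # Build the basic fusion pipeline once, keep its MLIR output, and feed
    # that single artifact to the benchmarking step -- not twice as before.
    mlirfile = makeBasicFusionPipeline(filename, paths)
    return runFusionKernel(mlirfile, rocmlirGenArgs, paths)
```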


def runFusionKernel(mlirfile, rocmlirGenArgs, paths: Paths):
pcf000 (Author):

Now takes as input a file containing the MLIR produced by the basic fusion pipeline.

if not futName:
    print("\tCannot find rocmlir-gen with -fut")
    continue
# Prepare test cases
pcf000 (Author):

Merged the two loops into one, with the split-k hack moved up first. Go through each .mlir file, extract the useful RUN: command if present, and run makeBasicFusionPipeline with it to produce the initial mlir code. We use a temp file to hold that code, because I couldn't find a good way to save it as a string and then send it to another pipeline later without having a file.
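A sketch of the temp-file handoff; the helper name is illustrative:

```python
import tempfile

def saveMlir(mlirText: bytes) -> str:
    # Park the pipeline output in a named file so a later pipeline can
    # reopen it as initial_stdin; a raw Python string has no file
    # descriptor to hand to subprocess.
    with tempfile.NamedTemporaryFile(suffix='.mlir', delete=False) as tmp:
        tmp.write(mlirText)
        return tmp.name
```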

op = 'gemm'
config = GemmConfiguration.fromCommandLine(commandLine, arch, numCU)

# Find the best perf_config
pcf000 (Author):

Look for the perf-config for the kernel configuration, or make a dummy NaN one.
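In outline, with placeholder names for the database and key:

```python
import numpy as np

def findPerfConfig(tuningDb: dict, tuningKey: str):
    # Return the tuned perf_config if the tuning database has one, else a
    # dummy entry whose TFlops is NaN so the row still appears in results.
    perfConfig = tuningDb.get(tuningKey)
    if perfConfig is None:
        return {'PerfConfig': 'None', 'TFlops': np.nan}
    return {'PerfConfig': perfConfig}
```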

perfResults[testVector] = oneEntry
continue

# Run fusion test
pcf000 (Author):

Run the kernel, reusing the mlir in the temp file, and record its time. We anticipate duplicates -- e.g., in the bert tests there are 24 .mlir files but only eight unique kernels -- and just take the best-performing. Given natural variation, times will be close but the winning file can be different every time.
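The dedupe itself can be a guarded dictionary update; a sketch using the names that appear in the diff, wrapped in a hypothetical helper:

```python
def recordBest(perfResults: dict, testVector: str, oneEntry: dict):
    # Several .mlir files can lower to the same kernel (same testVector);
    # keep only the best-performing entry for each.
    best = perfResults.get(testVector)
    if best is None or oneEntry['TFlops'] > best['TFlops']:
        perfResults[testVector] = oneEntry
```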

oneEntry['Fusion/MLIR'] = oneEntry['TFlops']/oneEntry['MLIR TFlops']
oneEntry['FileName'] = filename
perfResults[testVector] = oneEntry
# Run gemm or conv op with the same configuration
pcf000 (Author):

Run generated kernel in the usual way for reference.

-xdlop_supported_gpus_str = xdlop_supported_gpus[0]
-for gpu in xdlop_supported_gpus[1:]:
-    xdlop_supported_gpus_str += '|' + gpu
+xdlop_supported_gpus_str = '|'.join(xdlop_supported_gpus)
pcf000 (Author):

Idiom.

     p1.kill()
     print("MIOpen tuning timed out")
     _, errs = p1.communicate()
+runPipeline([MIOpenDriverCommand])
pcf000 (Author):

Missed another one. Last one, I swear.

@@ -1285,7 +1308,7 @@ def getNumCU(chip):
         rocminfo = subprocess.check_output("/opt/rocm/bin/rocminfo",
                                            stderr=subprocess.PIPE)
     except subprocess.CalledProcessError as e:
-        print(e.stderr.decode('utf-8'))
+        print(f"Process error: {e.stderr.decode('utf-8')}")
pcf000 (Author):

Still trying to identify the cause of the intermittent rocminfo failures. Highlight this case in the log file.

parsed_args = parser.parse_args(args)

global getNanoSeconds, getBankConflict, ROCPROF, ROCPROF_OPTS
pcf000 (Author):

Swap functions, options, and filenames based on the option.
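Roughly like the following; the --rocprof-version option and the global names come from the PR, while the paths and option lists here are illustrative (--kernel-trace is mentioned above for v2; --stats is a rocprof v1 flag):

```python
def configureProfiler(parsed_args):
    # Illustrative stand-in for the post-parse_args block: pick the binary,
    # its options, and the matching parser functions in one place.
    global getNanoSeconds, getBankConflict, ROCPROF, ROCPROF_OPTS
    if parsed_args.rocprof_version == 1:
        ROCPROF = '/opt/rocm/bin/rocprof'
        ROCPROF_OPTS = ['--stats']        # illustrative
        getNanoSeconds = getNanoSecondsV1
    else:
        ROCPROF = '/opt/rocm/bin/rocprofv2'
        ROCPROF_OPTS = ['--kernel-trace'] # illustrative
        getNanoSeconds = getNanoSecondsV2
```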


codecov bot commented Oct 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 77.97%. Comparing base (99fc9d2) to head (f5ba66a).
Report is 4 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1672      +/-   ##
===========================================
- Coverage    78.17%   77.97%   -0.21%     
===========================================
  Files          100      100              
  Lines        27994    27994              
  Branches      4087     4087              
===========================================
- Hits         21884    21827      -57     
- Misses        4463     4506      +43     
- Partials      1647     1661      +14     
Flag Coverage Δ
mfma 77.97% <ø> (-0.21%) ⬇️

Flags with carried forward coverage won't be shown.


krzysz00 (Collaborator) left a comment

(Haven't fully read the PR, minor thoughts)

@@ -894,6 +964,7 @@ def benchmarkExternal(cls, commandLine, paths: Paths, arch, numCU):
     benchmarkArgs = config.generateMlirDriverCommandLine("")
     # remove the result file generated by rocprof in previous benchmarking
     os.system("rm -f "+BENCHMARKING_RESULT_FILE_NAME)
+    os.system("rm -f "+BENCHMARKING_METRICS_FILE_NAME)
krzysz00 (Collaborator):

We can call rm just once
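That is, the two removals could collapse into a single call:

```python
os.system(f"rm -f {BENCHMARKING_RESULT_FILE_NAME} {BENCHMARKING_METRICS_FILE_NAME}")
```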

parsed_args = parser.parse_args(args)

global getNanoSeconds, getBankConflict, ROCPROF, ROCPROF_OPTS
krzysz00 (Collaborator):

... why are we keeping rocprof v1 support?

pcf000 (Author) commented Oct 8, 2024

Keeping rocprof V1 in case we have discrepancies to investigate (though since we don't keep much history, that's probably not a big concern) and in case rocprofv2 stops supporting an architecture before we do. Mostly because it was pretty easy to do and might be helpful.

return result

def getNanoSecondsV2(fileName):
krzysz00 (Collaborator):

Overall comment: perhaps class Profiler: is in order here, instead of hot-swapping functions onto variables?
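A sketch of what that suggestion could look like; all names and values are illustrative, and per the commit notes below the PR later adopted Profiler classes:

```python
class Profiler:
    binary = None
    resultFile = None
    def getNanoSeconds(self, fileName):
        raise NotImplementedError

class RocprofV1(Profiler):
    binary = '/opt/rocm/bin/rocprof'
    resultFile = 'results.stats.csv'
    def getNanoSeconds(self, fileName):
        pass  # v1 CSV parsing

class RocprofV2(Profiler):
    binary = '/opt/rocm/bin/rocprofv2'
    resultFile = 'results.csv'
    def getNanoSeconds(self, fileName):
        pass  # v2 CSV parsing, accounting for MLIR_N_REPEATS

def makeProfiler(version: int) -> Profiler:
    return RocprofV1() if version == 1 else RocprofV2()
```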

Abstract the boilerplate for collecting results from a process.
Account for .MLIR_N_REPEATS in rocprofv2 results, which don't include it.
Account for nrepeats in a smarter way -- count the rows, while verifying.
Don't run attention in perfRunner.py on gfx110x.
Don't run the CK benchmarking for gfx110x, because ck-benchmark-driver won't compile.
getFusionTestInfo and runFusionKernel turn out to be mostly the same.
Invent --rocprof-version to switch between rocprof and rocprofv2.
Change default to rocprofv2.
Made Profiler classes to handle the V1/V2 switch more cleanly.
Made tuningRunner.py use Profiler to get consistent arguments.
Some places in tuningRunner.py use runPipeline, some don't yet.