Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

file based filter #99

Open
wants to merge 45 commits into
base: main
Choose a base branch
from

Conversation

xinzhengzhang
Copy link

@xinzhengzhang xinzhengzhang commented Dec 23, 2022

#93

Usage

bazel run //:refresh_compile_commands -- --file=a/b/c.c
# TODO: Let's detect out-of-bazel, absolute  paths 
# TODO full paths on windows
# TODO be careful about this case with --file
  • # TODO consider running a preliminary aquery to make this more specific Feature/file based filter todo1 xinzhengzhang/bazel-compile-commands-extractor#1
  • # TODO simplify loop
  • # TODO for consistency, let's have this return a list
  • # TODO switch to --include_param_files
  • # TODO simplify/unify logic around warning about not being able to find commands * 3
  • # TODO for --file, don't continue traversing targets after the first command has been found

Copy link
Contributor

@cpsauer cpsauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xinzhengzhang, thank you. This is a good start.

I added some commits with the goal of quickly making small improvements and adding docs--seemed easier to just do than to discuss. Hopefully that's okay. Explanations of why I thought they were improvements are in the commit messages. If there's anything you don't like, just say or change them. I appreciate your giving me write access.

I can see at least a few more cases I'll need to ask you to work through, all around headers. They're tricky!

  • Not all (C++) headers end with .h. (Nor, indeed, any extension at all, sadly.)
    Suggestion: Flip the if statement the other way, checking if the file has a source extension, otherwise treating it as a header.
  • Header-only targets will still present a problem, I think, since they don't have any compile actions in the targets that list them in attrs. Any ideas on how we should solve this?
  • If the header implementation remains similar:
    • attrs seems to operate on a stringified list of targets, rather than paths. Maybe we can use this to be more specific somehow, but if not, let's document a little better what we're doing. It's probably okay to accidentally update a few extra compile commands as as long as it remains fast enough for you, though, of course, being specific is ideal.
    • I'm guessing that using a let variable definition for the header query is more efficient, saving a double evaluation of the target_statement, though I'm not certain.
  • Update in place: Let's put the logic that updates compile_commands.json with the new entries for the --file in this tool, rather than the plugin! That way it can be eventually shared across editor plugins and you don't have to deal with overwriting.
  • I think we probably want to omit the slow header search when not needed with --file?

That's what I can see on a quick read through. I'm sure additional important cases to think through remain. I'll need you to take the lead on making this robust and testing how it meets your needs.

Cheers,
Chris

P.S. I'm not sure which holidays you observe, but I'm wishing you a great set of winter celebrations!

@xinzhengzhang
Copy link
Author

xinzhengzhang commented Dec 24, 2022

Header-only targets will still present a problem, I think, since they don't have any compile actions in the targets that list them in attrs. Any ideas on how we should solve this?

Although they cannot be found in the target action but they should be defined in the BUILD file in attr unless using something like build --spawn_strategy=standalone and it is not recommended. The only question is to find which attr and I think srcs and hers is enough in most case

attrs seems to operate on a stringified list of targets, rather than paths. Maybe we can use this to be more specific somehow, but if not, let's document a little better what we're doing. It's probably okay to accidentally update a few extra compile commands as as long as it remains fast enough for you, though, of course, being specific is ideal.

Since bazel query runs on the post-loading phase target graph it doesn't have enough information to get e.g. the real file path. So we have two ways

  1. Write clearly in the doc, if it is a header file, you need to pass in the relative path starting from BUILD
  2. Only query the file name, like now (It will contain possibly more targets but is easier to use)

Update in place: Let's put the logic that updates compile_commands.json with the new entries for the --file in this tool, rather than the plugin! That way it can be eventually shared across editor plugins and you don't have to deal with overwriting.

Do you think we need a flag to mark override or merge? Or only --file uses merge?

@xinzhengzhang
Copy link
Author

xinzhengzhang commented Dec 24, 2022

Not all (C++) headers end with .h. (Nor, indeed, any extension at all, sadly.)
Suggestion: Flip the if statement the other way, checking if the file has a source extension, otherwise treating it as a header.

8e834df

I'm guessing that using a let variable definition for the header query is more efficient, saving a double evaluation of the target_statement, though I'm not certain.

4cac1f6

Let's put the logic that updates compile_commands.json with the new entries for the --file in this tool, rather than the plugin! That way it can be eventually shared across editor plugins and you don't have to deal with overwriting.

79cf4ab

I think we probably want to omit the slow header search when not needed with --file?

I don't really understand the meaning. If we don't specify --file=a.h, won't we go here? Or do we need an additional flag to filter out users even if they come in?


Thank you for caring and I wish you Merry Christmas~

@cpsauer
Copy link
Contributor

cpsauer commented Jan 30, 2023

Hey @xinzhengzhang! Sorry for being slow--got buried in tasks over here. Resurfacing. I'm so sorry; I promise I'm usually faster, like with my initial replies.

Thanks for getting back on these and for making improvements. As before, I also added some quick commits on small items that seemed better to fix than to spend time discussing. Hopefully that's okay. Explanations in the commit messages. If there's anything you don't like, just say or change them. I appreciate your giving me write access.

For the things that are still open, I'll try to expand on each in more detail:

The (relatively common) case of header-only targets:
Let's consider the following simplified example:
Say we're trying for --file=templated.hpp

cc_binary(
    name = "main",
    srcs = ["main.cpp"],
    deps = ["lib"],
)
cc_library(
    name = "lib",
    hdrs = ["templated.hpp"], # Or srcs
)

Won't our aquery currently first filter to just lib, since it's the one with templated.hpp in hdrs...and then find no actions, since no actions are actually generated by lib, since it's header-only?

If I'm right, we'll need a creative fix here. I think I suggested this one before, but the best I've thought of so far is to first try what you did here and then fall back to the inputs query type that we use for source files, running header search until we find the first one that actually uses the header. I'd love it if you'd think on it and try to come up with something even better!

[There's also the case of system headers, which won't be listed in an attribute, but I think those would also be caught by the above. There's also the case of a library that may have compile actions but none that include the header being requested via --file. More on that in the next section...]

Omitting the slow header search
Nearly all of the runtime of this tool is finding which headers are included for a given source file in _get_headers by running the clang/gcc preprocessor--that's the code that'd turned off by exclude_headers. We should similarly skip doing that work for when we use --file= for a source file, since we're only asking about the one source file, not the headers, and since you care about this running really fast.

The exception is when running --file= for a header. We'll need to use _get_headers there to figure out if an action actually includes the header specified (probably just taking the first for speed)! And if none do, then, as above, we'll need to fall back to the inputs based query.

Attr and stringified targets
What about using a preliminary aquery and the outputs function to get the target for a given path? Maybe there're even better ways.

Override vs merge
I think it's totally fine to just have --file imply mergeing with no separate flag, unless you see value otherwise?

I'm quite excited about this! But I'll need you to take the lead on making this robust.

Cheers,
Chris

@cpsauer
Copy link
Contributor

cpsauer commented Jan 30, 2023

Wishing you also a happy lunar new year :)

@xinzhengzhang
Copy link
Author

xinzhengzhang commented Jan 31, 2023

For headers how about we divide the header extractor into these steps

  1. query targets which attrs(hdrs&srcs) contains the header file. If there has other source extensions like .c then we just extract this one if not we go to step 2 to process header-only library
  2. query actions that using this header file as input (Can only be found without using module) if not found we go to step 3
  3. traverse the target graph in postorder sort which depends the header-only lib and find the first one that actually uses the header. If we still not found then we go to step 4 (the case of system headers)
  4. traverse the complete target graph in topological sort and compile-test a dummy source which include this header. To prevent excessive timeouts we assume a maximum of 50 attempts
# pseudocode
header = "/a/b/c.hpp"
targets = query(root_target, hdrs|srcs, header)
is_header_only_lib = targets.filter(srcs.contains(source_extension))

if targets.length:
    if is_header_only_lib:
      action = aquery(inputs, header_statement).first
      if action:
          return action.args
      else:
          targets = query(root_target, deps, targets_statement).order_by(postorder)
          args = targets.srcs.find( {
              src.get_headers().contains(header)
          } )
          if args:
              return args
    else:
        return targets.srcs.find( {
              src.get_headers().contains(header)
        } )

targets = query(deps(root_targets)).order_by(topological)
test_file = "include {header}"
args = targets[:50].find( {
    compile_test(test_file, target.actions.first.args) == true
})
if not args:
    return "extract failed"

Attr and stringified targets
What about using a preliminary aquery and the outputs function to get the target for a given path? Maybe there're even better ways.

Sorry, I don't quite understand what this given path is used for

@cpsauer
Copy link
Contributor

cpsauer commented Feb 6, 2023

Something like that! I'd suggest a couple tweaks:

For 1: I don't think we can assume that the .c will include the header. Instead, we'll have to run the header search process to check, falling back to 2 if not.

(I'm not quite sure I understand the module part; as much as possible we'll want to turn those off for our analysis so that we don't depend on a build.)

For 4: I think it's okay to require an example of use rather than creating a dummy target, like we currently do for header only libraries, since clangd will automatically fall back to generic flags anyway. But if you think otherwise, please do say!


As above, attrs seems to operate on a stringified list of targets, rather than paths, which looked like it was getting in the way and making the query less specific. So, rather than just getting the filename, I was proposing that we might want to run another aquery, using outputs() to figure out the name of the target for a given file path. Does that make sense?

@cpsauer
Copy link
Contributor

cpsauer commented Feb 28, 2023

Hey, @xinzhengzhang! I wanted to check back in on these to see how you're doing. Thanks for all your efforts; I know this is hard to get right!

@xinzhengzhang
Copy link
Author

Hey, @xinzhengzhang! I wanted to check back in on these to see how you're doing. Thanks for all your efforts; I know this is hard to get right!

Sorry to be too busy with work recently, I will implement a basic version this weekend

@xinzhengzhang xinzhengzhang force-pushed the feature/file-based-filter branch from af8d4e4 to f4441c3 Compare March 3, 2023 05:46
@xinzhengzhang
Copy link
Author

Sorry again for the long time of inactivity due to work.

I just rebased the main branch and squash the commits.
Regarding the header lib problem discussed above, I have carried out a basic implementation f4441c3
I added a big loop in _get_commands to check if the found commands match the expected, if this approach is ok, I will continue to add logic such as short-circuit finding process so as not to parse the complete graph and max attempt for finding header.

@cpsauer
Copy link
Contributor

cpsauer commented Mar 3, 2023

Totally get the busyness. I feel it, too :)

We'll definitely need to invoke query multiple times. A fixed-depth breaking loop like that works, but you might end up finding it cleaner/easier to factor out the query call-(as a nested function if you need to capture?) and then have control flow that explicitly falls through each case if empty?

(Some smaller things, trying to give fast feedback: finded -> found. And you'll want to be careful about the no-action error case, since that's now tolerated for the file_flags case.)

@xinzhengzhang xinzhengzhang force-pushed the feature/file-based-filter branch from f4441c3 to 251bbb2 Compare March 3, 2023 11:22
@xinzhengzhang
Copy link
Author

I have simplify the loop using nested function and fixed no-action case in _get_commands 251bbb2
Terminate the loop if header has been found or reached the max attempt times 8df2238

Since I don't have much experience with Python, and it is a weak type language, the implementation of passing lambda to thread handler seems very strange. If there is a good way to modify it, please also help me to modify it

@xinzhengzhang
Copy link
Author

xinzhengzhang commented Mar 16, 2023

Hi, Chris @cpsauer
I would like to ask if there is anything else that needs to be modified for this pr?

In addition, I want to change my vscode extension to bzlmod form, so I want to patch the dependency compile-commands. I would like to ask if you have the possibility of tagging and being referenced? Of course, it is also possible for me to use extension to use commit integration

@xinzhengzhang xinzhengzhang mentioned this pull request Mar 18, 2023
xinzhengzhang and others added 4 commits March 19, 2023 03:28
Promote multiple --file assert to log_error
Why? It's a usage error we're communicating to the user, not a code invariant we're checking in debug mode.
We could recover, but we don't, because we expect only automated usage and want to give clear feedback.

More pythonic if

Improve my --file flag messaging consistency

Call library to escape regex
Helps avoid edge cases! Also handles the most common special character, '.'

Add an error to catch unsupported --file <file> form

Tweak -- logic now that --file arguments could be after

--file docs

Make aquery spacing consistent

Flip the if statement in constructor target_statment

Use let to combine attr query

Merge compile_commands.json if --file

Small simplification: endswith already can take a tuple of possibilities (as elsewhere)

Small simplification of header path for file flag.
(This code section likely to be replaced later, though.)

Fix my typo on --file= discoverable docs

Some tweaks to merge
Adds note about behavior, some edge case handling around other flags that might start with --file, permissions issues, etc.
@xinzhengzhang xinzhengzhang force-pushed the feature/file-based-filter branch from 8df2238 to 0befc09 Compare March 18, 2023 19:50
@xinzhengzhang
Copy link
Author

Since the entire aquery logic is nested, there are many conflicts. I just rebase main and push -f

@cpsauer
Copy link
Contributor

cpsauer commented Mar 18, 2023

Hey @xinzhengzhang. Sorry for causing--and thanks for rebasing. Apologies for being slow--got sick over here and I'm kinda dragging and backlogged. I'll try to get through this soon.

@cpsauer
Copy link
Contributor

cpsauer commented Mar 27, 2023

Ran out of time--and am now traveling for the week. Will try to get to this weekend.

I definitely saw some bugs on a quick scan through. Could I ask you to polish in the meantime?

@xinzhengzhang
Copy link
Author

Thanks for getting back to me. I'll do my best to identify and fix the bugs. Have a great trip!

@cpsauer
Copy link
Contributor

cpsauer commented May 6, 2023

Hi, @xinzhengzhang! Thanks for your quick work, but most of these are going to require some more experimenting and thought before they work robustly enough that we can get them in. I'd love to have this feature land for you and others--and happy to clarify/help--but I'll still need you to really think through the cases and how to handle them robustly.

Replying to each. There are some important things for us to discuss, starting with the high level. Could I get you to first read down to the first horizontal bar (---) and reply and then we'll return to the code-review level?

Re: TODO discuss the below.

I'm not sure quite what you're asking. Could I ask you to rephrase?

To save round trips, here's what's in my head:
This comes up in the case where someone opens, say, my_header.h and their editor asks for --file=my_header.h, needing a compile command for my_header.h in order to provide good autocomplete. Only source files are run through the compiler directly. Therefore, to understand how my_header.h is compiled, we need to get the command for a source file that includes my_header.h, directly or indirectly. That command contains a bunch of things relevant to how the autocomplete should go--both flags that adjust the explicit include paths, like you list, but also many other flags that affect how the source is interpreted, like, for example, additional defines (-D). Does that align with your understanding and needs?

What that discussion item is about: As we search to find a source file that includes my_header.h, we get, for free, commands that apply to other files, like the other headers included by that source file. We need to decide which of those (file, command) pairs we should add to compile_commands.json.

On one hand, we could just add all the pairs we got for free. But that might kick off a lot of background indexing work, and my understanding is that you want this feature in order to minimize the indexing work required to browse files. Adding a few source files in the background is potentially quite useful for go-to-definition and the discovery of other symbols. But other headers seem fairly pointless and high cost: they don't introduce any symbols not already seen from processing a source file (which pulls in its includes) and they're likely to keep getting new (but equivalent) commands that trigger a lot of reindexing. Hence that tradeoff: Only add/update the command for my_header.h and no other headers--but still add/update commands for source files found along the way. What do you think of that tradeoff and why? Does that seem optimal or would you like something else? That was my guess, but you know your use case better than I do (like with allpaths!). You can also go with this for now, experiment around as you use it, and then adjust it if needed. Either way, we should eventually document what we decide and why, so please add to the comments until they'd have been sufficient for you to understand on a first read.

Aside: reload behavior

Are you doing anything special to get good reload behavior from clangd in the editor? I'm seeing that clangd needs a modify/save event to trigger it getting a new command. Is that what you're seeing?

--include_param_files

Right. (If I'm correctly understanding what you're writing.) Do you use Windows and understand what's going on with param files for Windows? (Totally ok if no on either.) Assuming I'm right in remembering that LoSealL does some development on Windows machines--we might want to ask him if he'd look into these Windows things. Either way, would you be willing to give LoSealL commit access to this branch of your fork so he can help out as needed?

TODO be careful about this case with --file

Also windows-specific stuff. No .d files involved here. Are you developing on windows at all, or should we be looking to ask LoSealL on these?

[To explain more: The concern is that MSVC on windows is producing absolute paths for all headers, including in the workspace, which then won't match in statements elsewhere like if file_path in headers.]

timeouts

I saw your email (now edited) about timeout not working. I was similarly having trouble getting it to trip--hence the ask that you investigate. How did that resolve? Were you able to get the timeout to activate?

It's not obvious to me what you're going for with the timeout code you added--nor why it's bad to be able to traverse many actions if it's done fast. Could you explain?

[I had been thinking that, if we could get the timeout to work, we might want to allocate fewer seconds if there were, e.g., 3 sequential calls to aquery for different top-level targets specified in refresh_compile_commands. Perhaps 1s each instead of 3s each, to maintain a total latency number. I wasn't envisioning any fancy calculating, but you may have thought of something good I haven't!]

TODO: Let's detect out-of-bazel, absolute paths and run this if and only if we're looking for a system header. We need to think about how we want to handle absolute paths more generally, perhaps normalizing them to relative if possible, like with the windows absolute path issue,

This is going to require thinking about what you might pass in from the outside. What about absolute paths to the canonical location of bazel-out or external (i.e. at the other end of the symlink)?

Further, once we've recognized system headers (i.e. absolute paths in none of those locations?) aren't those the only ones we want to run the (expensive) deps(target) query for?

[From your email/past edits] inputs(headers) and allpaths() being too slow

How did this resolve? We definitely want this to be fast and fast enough for you. To answer the questions you'd asked:
I thought I was seeing that headers never appeared under inputs, because they were hidden behind middlemen. Are you seeing otherwise? (It looks like you're seeing the same from you edits, but I want to make sure.)
If that's right, then we need an alternative. If allpaths is too slow, could you see if the --universe_scope & rdeps(,) alternative I'd suggested there might work? Or think about other possibilities? I really need your help here, since you're the one with the enormous codebase to test on.


Moving to commits

More generally, could I ask you to not delete TODOs until we're confident they've been handled robustly? And put back the ones you did? Makes tracking much easier.

Some significant problems remain [see below], so maybe give these a quick skim, but read the "looking back at the commits" sub-section at the bottom before editing.

From ef33509

Come on--we'll need to do quoting in all places where we splice a target into an aquery command, as it said! Please be proactive on these :) Wish I'd had it right from the start--but I figure you'd need to know for mutating the query and don't want to cause an unnecessary merge conflict.

From f8b046e

Not obvious this makes things better? Appreciate the effort, but how about the aquery with inputs(), to get the targets that produce the file, suggested for exploration in the TODO? If I've misunderstood and this works well, or that doesn't work, please tell me.

Not found warnings 9ce1716

Again, let's return to these at the end. We're still going to get a lot of erroneous ones of the first kind, I think, for the reasons in the (deleted) TODO.

40b9131

Thanks. Can you put back the documenting comments from the yield and not use SimpleNamespace, since I think we'll have to strip it for #118. I'm also realizing this part of the comment is obsolete ", with commas after each entry,"--could you delete? (more generally, if you happen upon mistakes like that, please just fix).

0453be4

I see you moved the loop inside--but not the target_flag_pairs template filled by bazel. Don't we want that inside, too? I think the structure here may simplify significantly if we end up with one aquery command per branch (but maybe we won't be able to get there.

Looking back at the commits

If it were me, I'd be pretty tempted to reset the branch back to bd7bde1 and move forward from there. There are some good things in the commits. But since each needs work and a bunch of (TODO and other) lines need to be restored anyway, I'd think restoring to bd7bde1 and then moving forward, copying in useful parts of the previous commits, might be the way to go. Maybe I should be just doing that, but I don't want to step on your toes too hard.

Moving forward, I'd like to ask that you make sure you've got higher confidence before deleting those TODOs and pushing code for review. It's totally ok to not resolve all the TODOs, especially all at once. But it's taking too much time to review and discuss rough versions of the code with issues I suspect you could easily have caught without me. Better to clarify first, discussing if needed, and tell me where you aren't sure or need me to do something. Then, focus individually on the ones where you have confidence, doing some self-review before pushing changes, making sure you've actually resolved the things in the comments and from our discussion. Since this is a big change, another tactic for changes where you're uncertain would be to file PRs against your own branch. That way, we can easily discuss each in its own context. Please don't get me wrong; I massively appreciate your energy and enthusiasm to contribute. I'm just trying to find more efficient ways for us to work together to make this happen.

Thanks,
Chris

@xinzhengzhang
Copy link
Author

Hi, @cpsauer!

Very very thanks for your super patiently mentoring. There are some places I really didn't get what you wanted to express, such as the part in #discuss, but I fully understand after your detailed delineation
As for deleting TODO, it may be my habit to rebase frequently based on a single commit, which will cause the history to be lost. It is indeed inconvenient to quote and discuss.
I will revert all this commits and file PRs which fix TODO from other branches in the future.

Re: TODO discuss the below.

Yes, it totally align with my understanding and needs.

We need to decide which of those (file, command) pairs we should add to compile_commands.json.

In my opinion, the header file should not write anything other than the frame header file, for example in the iOS should just import < Foundation/Foundation.h > and < UIKit/UIKit.h > All other nested dependencies should be in the form of previous declarations. In this view, in fact, we just need to find one source file that directive contains this header, other more indexes are just waste and interference.
Something like Xcode that indexes the entire project can also cause the header file to actually have the ability to prompt a large number of files, and we've even had a case where we wrote a source file with the wrong dependency in the header file.

// We had a bug like this, no matter how we modify file_c.m it didn't work cause file_a.m include all content in file_c.m
// When we modified file_c.m no changed with file_b.h and no dependencies changed with file_a.m cause it use cache.
# file_a.m
# import file_b.h

# file_b.h
# import file_c.m

That's why I lean towards this

Are you doing anything special to get good reload behavior from clangd in the editor? I'm seeing that clangd needs a modify/save event to trigger it getting a new command. Is that what you're seeing?

In the vscode extension, I observed change of BUILD and WORKSPACE to refresh the compile_commands

--include_param_files
TODO be careful about this case with --file

About file_path part, I will give commit access to @LoSealL to improve the windows part together.
ButI digged at _get_headers_msvc code even if it returns the absolute path did not see his relationship with --file, I understand that --file is just to narrow the scope of the query action, understand wrong?


I will reply to other comments later

…rgets after the first command has been found"

This reverts commit 0453be4.
… return a list of SimpleNamespace objects instead of yielding a dictionary & Removal type cast"

This reverts commit 40b9131.
…ment_candidates and warning"

This reverts commit 9ce1716.
@xinzhengzhang
Copy link
Author

timeouts
I saw your email (now edited) about timeout not working. I was similarly having trouble getting it to trip--hence the ask that you investigate. How did that resolve? Were you able to get the timeout to activate?

It's not obvious to me what you're going for with the timeout code you added--nor why it's bad to be able to traverse many actions if it's done fast. Could you explain?

I have write minimum code replicate this bug
I'm guessing that if you iterate outside of scope, the internal implementation of ThreadPoolExecutor will not pass the call next so that no timeoutError is thrown, and the full execution is performed

import os
import concurrent.futures
import time

def _foo(i):
    time.sleep(1)
    return i

print("worker count = {}".format(os.cpu_count()))

with concurrent.futures.ThreadPoolExecutor(
    max_workers=min(32, (os.cpu_count() or 1) + 4)
) as threadpool:
    task = [i for i in range(100)]
    output_iterator = threadpool.map(
        lambda i: _foo(i),
        task,
        timeout=3
    )
#   # work
#   try:
#       print("before worker current time = {}".format(time.time()), flush=True)
#       for output in output_iterator:
#           print("current time = {} output = {}".format(time.time(), output), flush=True)
#           
#   except concurrent.futures.TimeoutError:
#       print("timeout")

# not work
try:
    print("before worker current time = {}".format(time.time()), flush=True)
    for output in output_iterator:
        print("current time = {} output = {}".format(time.time(), output), flush=True)

except concurrent.futures.TimeoutError:
    print("timeout")
// https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ThreadPoolExecutor
map(func, *iterables, timeout=None, chunksize=1)
Similar to [map(func, *iterables)](https://docs.python.org/3/library/functions.html#map) except:

the iterables are collected immediately rather than lazily;
func is executed asynchronously and several calls to func may be made concurrently.
The returned iterator raises a [TimeoutError](https://docs.python.org/3/library/exceptions.html#TimeoutError) if [__next__()](https://docs.python.org/3/library/stdtypes.html#iterator.__next__) is called and the result isn’t available after timeout seconds from the original call to [Executor.map()](https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Executor.map). timeout can be an int or a float. If timeout is not specified or None, there is no limit to the wait time.

It's not obvious to me what you're going for with the timeout code you added--nor why it's bad to be able to traverse many actions if it's done fast. Could you explain?

Now that this bug is solved, this complex estimation method is no longer needed

@cpsauer
Copy link
Contributor

cpsauer commented May 6, 2023

Hey! Yes. Thanks for being so coachable and energetic.

You're right that some of my todo notes were fairly inscrutable. Sorry about that. They were initially meant to be notes to myself and then, well, I figured I'd just clean the up a little bit, hoping they'd either make sense or you ask. Wish I'd done a better job.


Other items:

I'm not understanding what you're saying about the headers. Just language barrier, I think. Could I ask you to take another shot at explaining? Wish I could read Chinese as well as you can English :)

Reload: Ah, sorry, I mean getting clangd to reload the new changes from an updated compile commands.json. Does that make more sense?

Thanks for giving LoSealL access! (I think just ignore that stuff for now. But if you're curious but the problem will be as above. If you pass in a relative path it won't match the absolute paths returned by MSVC.)

@cpsauer
Copy link
Contributor

cpsauer commented May 6, 2023

Re timeout: Whoa, that is very surprising to me. If you figure out why, I would be very curious. Sorry about leaving that bug for you.

(I'll consult with a friend who's a python expert to try to figure out what's going on. My guess is that the threadpool tries to finish all of its actions before exiting scope, but then neglects to check for timeouts?)

@xinzhengzhang
Copy link
Author

Hey! Yes. Thanks for being so coachable and energetic.

You're right that some of my todo notes were fairly inscrutable. Sorry about that. They were initially meant to be notes to myself and then, well, I figured I'd just clean the up a little bit, hoping they'd either make sense or you ask. Wish I'd done a better job.

Other items:

I'm not understanding what you're saying about the headers. Just language barrier, I think. Could I ask you to take another shot at explaining? Wish I could read Chinese as well as you can English :)

Reload: Ah, sorry, I mean getting clangd to reload the new changes from an updated compile commands.json. Does that make more sense?

Thanks for giving LoSealL access! (I think just ignore that stuff for now. But if you're curious but the problem will be as above. If you pass in a relative path it won't match the absolute paths returned by MSVC.)

header

In chines we usually call file which exposed interface are "头文件". Literally translated as header file is also the header I mentioned. For some high-level languages such as swift or python or cpp module usually we do not use the word header to describe their interface.

In case I'm not making my point, I'm going to go into more detail.

Under the c system language, using like #include.h or some variant like #import is actually to expand its content into source.
There is no specification for what should be written in the header file, so there are many usages, such as definitions, macros, umbrella headers, etc. Many usages. But I think the main use of header files should be exposed as interfaces. If exposed as a module(source file) interface, then if there is more interference from other contexts of the complete project, it will greatly interfere with the writing of R & D. And in fact, we have also had a lot of hidden problems that cause problems by writing casually in the header file. So I will require only dependency system headers in our team header file, because single responsibility one module should not have too many dependencies other as exposed.

Reload

I'm using clangd wrapped via sourcekit-lsp, and it works well when compile_commands changes, I'm not too deep into his specific implementation, you can start from https://github.com/apple/sourcekit-lsp/blob/90fd3addb28a7b3cb2ef40acca6670cd57cbdd53/Sources/SourceKitLSP/SourceKitServer.swift#L730 to refer to his line of sight if necessary

abspath for msvc

From here I saw that you will transfer absolute path extracted from process to relative path.
It seems that if you pass in the relative path, it can match. If you pass in the abs path, it will be converted to the relative path in the input. But I don't have the windows environment and experience now. I will test it on the windows platform later.

def _file_is_in_main_workspace_and_not_external(file_str: str):
    file_path = pathlib.PurePath(file_str)
    if file_path.is_absolute():
        # MSVC emits absolute paths for all headers, so we need to handle the case where there are absolute paths into the workspace
        # TODO be careful about this case with --file
        workspace_absolute = pathlib.PurePath(os.environ["BUILD_WORKSPACE_DIRECTORY"])
        if not _is_relative_to(file_path, workspace_absolute):
            return False
        file_path = file_path.relative_to(workspace_absolute)
    # You can now assume that the path is relative to the workspace.
    # [Already assuming that relative paths are relative to the main workspace.]

    # some/file.h, but not external/some/file.h
    # also allows for things like bazel-out/generated/file.h
    if _is_relative_to(file_path, pathlib.PurePath("external")):
        return False

@cpsauer
Copy link
Contributor

cpsauer commented May 6, 2023

Sorry--still totally don't understand the point about the headers. Like I understand all the words but not what you think we should do here.

Reload: Sounds like sourcekit has better behavior here. Heads up that clangd doesn't reload until the file is modified or saved, sadly. Not a problem for you using sourcekit, but something we'll need to keep in mind for this project.

Re msvc absolute header paths: Totally agree that _file_is_in_main_workspace_and_not_external itself works fine. I was just concerned about the behavior documented there causing issues elsewhere. Is that the confusion? If so, totally my fault for writing the TODO ambiguously and poorly--feel free to move it or delete it.

@xinzhengzhang
Copy link
Author

xinzhengzhang commented May 6, 2023

TODO: Let's detect out-of-bazel, absolute paths and run this if and only if we're looking for a system header. We need to think about how we want to handle absolute paths more generally, perhaps normalizing them to relative if possible, like with the windows absolute path issue,
This is going to require thinking about what you might pass in from the outside. What about absolute paths to the canonical location of bazel-out or external (i.e. at the other end of the symlink)?

Further, once we've recognized system headers (i.e. absolute paths in none of those locations?) aren't those the only ones we want to run the (expensive) deps(target) query for?

For now (in my extension) all --file are passed in as relative paths and to be honest, this is indeed the most bazelify input excepted.
Source file: srcs/a/b/c.m
External file: external/a/b/c.h

For absolute path contains WORKSPACE root /Users/xx/project/srcs/a/b/c I think it is easy to handle and it is current implementation.
But for some path like bazel-out/<CONFIGURATION>/bin/srcs/a/b/c.m /<OUTPUT_BASE>/execroot/_main/srcs/a/b/c.m I can indeed identify some features such as execroot/_main bezel-out to force the match, but I think this robustness is too poor and may be affected by the bazel change. Regarding this, your suggestion is that we directly prompt the error, or help them as much as possible to convert to the desired above two kind of paths?

inputs(headers) and allpaths() being too slow

How did this resolve? We definitely want this to be fast and fast enough for you. To answer the questions you'd asked:
I thought I was seeing that headers never appeared under inputs, because they were hidden behind middlemen. Are you seeing otherwise? (It looks like you're seeing the same from you edits, but I want to make sure.)
If that's right, then we need an alternative. If allpaths is too slow, could you see if the --universe_scope & rdeps(,) alternative I'd suggested there might work? Or think about other possibilities? I really need your help here, since you're the one with the enormous codebase to test on.

What I seeing

The reason I modified the comment is because I found that the reason why I can find the header file through inputs and it is not hidden by middleman is that I use objc_library dependency swift_library, and the implementation of swift is to provide the header in the https://bazel.build/rules/lib/providers/CcInfo.html#compilation_context so that I can query, but this is not a common practice

Benchmark

I have tested with two header file one is upper module in srcs/app and one is underlying module in srcs/base
bazel_version: 6.1.1
cmd: time bazel aquery "mnemonic('(Objc|Cpp)Compile', allpaths(//bili-universal:bili-universal, let v = deps(//bili-universal:bili-universal) in attr(hdrs, '//srcs/base/xxx:include/xx.h', \$v) + attr(srcs, '//srcs/base/xxx:include/xx.h', \$v)))" --output=jsonproto --include_artifacts=false --ui_event_filters=-info --noshow_progress --features=-compiler_param_file --features=-layering_check > /dev/null

srcs/app/x.h

allpaths
1.14s user 2.17s system 8% cpu 39.409 total
rdeps
1.17s user 2.13s system 8% cpu 37.865 total
rdeps with --infer_universe_scope
1.14s user 2.16s system 8% cpu 37.424 total

srcs/base/x.h

allpaths
1.17s user 2.13s system 8% cpu 39.709 total
rdeps
1.17s user 2.14s system 8% cpu 38.149 total
rdeps with --infer_universe_scope
1.18s user 2.16s system 8% cpu 39.708 total

other possibilities

Why not just add a switcher for expensive search for allpaths and deps to user? If we can't find their dependency from attr, it is unlikely not used in current practice. When it appears, it also jumps to some useless header files that do not need to be parsed.

@xinzhengzhang
Copy link
Author

xinzhengzhang commented May 6, 2023

What that discussion item is about: As we search to find a source file that includes my_header.h, we get, for free, commands that apply to other files, like the other headers included by that source file. We need to decide which of those (file, command) pairs we should add to compile_commands.json.

On one hand, we could just add all the pairs we got for free. But that might kick off a lot of background indexing work, and my understanding is that you want this feature in order to minimize the indexing work required to browse files. Adding a few source files in the background is potentially quite useful for go-to-definition and the discovery of other symbols. But other headers seem fairly pointless and high cost: they don't introduce any symbols not already seen from processing a source file (which pulls in its includes) and they're likely to keep getting new (but equivalent) commands that trigger a lot of reindexing. Hence that tradeoff: Only add/update the command for my_header.h and no other headers--but still add/update commands for source files found along the way. What do you think of that tradeoff and why? Does that seem optimal or would you like something else? That was my guess, but you know your use case better than I do (like with allpaths!). You can also go with this for now, experiment around as you use it, and then adjust it if needed. Either way, we should eventually document what we decide and why, so please add to the comments until they'd have been sufficient for you to understand on a first read.

Only add/update the command for my_header.h and no other headers--but still add/update commands for source files found along the way. I prefer this, the header file should has as few indexable symbols as possible. The reason is that for example a library has 10 source files and 10 header files, and there should be no or as few dependencies between header files and dependencies should be implemented in source files.

@xinzhengzhang
Copy link
Author

From f8b046e
TODO consider running a preliminary aquery to make this more specific, getting the targets that generate the given file. Use outputs() aquery function. Should also test the case where the file is generated/is coming in as a filegroup--that's likely to be fixed by this change.

Not obvious this makes things better? Appreciate the effort, but how about the aquery with inputs(), to get the targets that produce the file, suggested for exploration in the TODO? If I've misunderstood and this works well, or that doesn't work, please tell me.

See The reason I modified the comment is because I found that the reason why I can find the header file through inputs and it is not hidden by middleman is that I use objc_library dependency swift_library, and the implementation of swift is to provide the header in the https://bazel.build/rules/lib/providers/CcInfo.html#compilation_context so that I can query, but this is not a common practice

When I try to use aquery to get outputs which passed into next aquery process, there was hundred of actions was queried

# bazel aquery "(inputs('x.h', deus('<TARGET>'))" --output=summary
Mnemonics:
  Middleman: 2
  SwiftCompile: 118
  SwiftDumpAST: 118

Take SwiftCompile as an example, it dependency header to generate modulemap, but has nothing to do with this header in the compile parameter. Join all outputs from queried outputs not seems a good idea. But if I add a Middleman as Mnemonics qualifier, it can't handle logic like generated file. So I chose to use bazel query, both source files and generated files can be involved. The tradeoff is we can't handle the file_path like 'bazel-out/xx/generated_file.h'
In fact, when I was writing, I also considered reverse condition

# Currently
file_candidates = list(filter(lambda label: file_path in label.replace(':', '/'), label_candidates))

# Maybe
# The generated file always keeps its label path as a subset
filter( lambda label: label.normalize() in file_path, label_candidates))

@xinzhengzhang
Copy link
Author

Re timeout: Whoa, that is very surprising to me. If you figure out why, I would be very curious. Sorry about leaving that bug for you.

(I'll consult with a friend who's a python expert to try to figure out what's going on. My guess is that the threadpool tries to finish all of its actions before exiting scope, but then neglects to check for timeouts?)

After some digging from python source code. I found the cause of this issue

Start from entry concurrent.futures.ThreadPoolExecutor.map https://github.com/python/cpython/blob/263abd333d18b8825cf6d68a5051818826dbffce/Lib/concurrent/futures/_base.py#L621
We can find that it raise exception from here https://github.com/python/cpython/blob/263abd333d18b8825cf6d68a5051818826dbffce/Lib/concurrent/futures/_base.py#L458

# not raise exception if FINISHED
elif self._state == FINISHED:
      return self.__get_result()

We can see that shutdown will be called when exiting the scope
https://github.com/python/cpython/blob/263abd333d18b8825cf6d68a5051818826dbffce/Lib/concurrent/futures/_base.py#L647

And in shutdown all thread has been joined
https://github.com/python/cpython/blob/263abd333d18b8825cf6d68a5051818826dbffce/Lib/concurrent/futures/thread.py#L235

That is why no timeout will be thrown and wait until all operation queues finished

@cpsauer
Copy link
Contributor

cpsauer commented May 11, 2023

Absolute paths passed to --file:

I think we'll need to pass in system headers as absolute, since there's not a relative alternative.
e.g. "file": "/Applications/Xcode.app/Contents/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS.sdk/System/Library/Frameworks/UIKit.framework/Headers/UIViewConfigurationState.h",

I was thinking some editors might also try to resolve symlinks and therefore pass in absolute paths for, say, the workspace or external or bazel-out. I feel like I remember someone telling me about that...for emacs maybe? So I think we should handle it if we can do so easily... But I'd also be ok with us just issuing a fatal error if someone passes in a path that's inside bazel info output_base? (either get by resolving //external/.. or with a subprocess.run parallel to bazel dump) remember we've got the _is_relative_to helper for getting things to work pre python 3.9!

inputs()

Ah, interesting. Yeah, I'd just checked cc_library. Bummer that it doesn't work in general.

Benchmark

Wow! Yeah, 40s is definitely too slow. What's our target--10s max overall?

Could you see if adding a depth cap to rdeps (3rd argument) make things faster--or --noinclude_aspects? (Or allrdeps. See also the TODO. If that's fast enough, please also check order.) Thanks for testing on your very large codebase!

Other possibilities

I think header-only libraries (in c++) will unfortunately be pretty common, so we can't just attr everything. Same with system libraries, but that's a different case.

If the above doesn't work fast enough--and you have any clever ideas--I'm all ears.

Only add/update the command for my_header.h

I think it sounds like you agree, but I'm afraid I don't understand your logic about few symbols. Could you confirm? And please reexplain if you want me to understand why you agree. Sorry.

Preliminary Aquery to get target name

Ah sorry. the outputs function, not inputs, to figure out how a file is generated, if it's generated. But maybe it's not worth it in this first version.

Could be good to make the regex a little more specific, though, by including more of the path?

ThreadPoolExecutor

Very interesting! Thank you. The key piece I hadn't realized was that they waited for completion on scope exit.

I think that catches me back up!

Let me know what more you need to get this done.

@xinzhengzhang
Copy link
Author

Benchmark

You saved me! It has nothing to do with rdeps or the like, just because of aspects!

it tooks just one second after adding --include_aspects=false
0.02s user 0.03s system 5% cpu 1.039 total

Absolute paths passed to --file:

I wrote a piece of pseudocode about the handling of path is ok?

def path_handler(file):
    """
    Returns the relative path of the file from the workspace root or the absolute path if the file is not in the workspace
    """
    p = pathlib.PurePath(file)
    if p.is_absolute():
        output_base = `bazel info output_base`
        cwd = os.getcwd()

        if p.is_relative_to(output_base):
            if p.is_relative_to(os.path.join(output_base, "execroot/_main")):
                # Hardcoded path for bazel output_base
                # Returns srcs/xxx or bazel-out/xxx or external/xxx
                return p.relative_to(os.path.join(output_base, "execroot/_main"))
            elif p.is_relative_to(os.path.join(output_base, "external")):
                # Hardcoded path for bazel external
                # Returns external/xxx/
                return p.relative_to(output_base)
        elif p.is_relative_to(cwd):
            # Returns srcs/xxx or bazel-out/xxx or external/xxx
            return p.relative_to(cwd)
        else:
            # System files (headers, etc.)
            return file
    else:
        # throw error if file is not relative to cwd
        return p.relative_to(os.getcwd())

Preliminary Aquery to get target name

I have a question about implementation here. Because bazel aquery --output=jsonproto is output in the form of a path tree, there may be dozens of lines of code to parse the corresponding path tree. Is there any ready-made library for this piece to refer to? If not, I can do it - -

Only add/update the command for my_header.h

Let me state my position first, I think the content in the header file should only have the interface for a class or function which implementation in the source file.
But in the process of declaring the interface, there must be dependencies, such as system headers (stdlib, foundation), or other header files of the same module.
For dependency I want to generate as little as possible except stdlib, the reason is that foundation is highly stable over time, but our code is not. Too many dependencies can lead to high coupling and long compile time (dependency tree changes for incremental compile) and our codebase suffers from it... This is why I want to expose as few symbols as possible to the header, because only the forward declaration class MyClass; struct MyStruct; extern int variable; is required for dependency instead of including its specific interface and let source file gets as many symbols(candidate) as possible in its visible domain.
I don't know if the declare is clear... I'll write a pieces of code for example.

// Pizza.h
#import <Foundation/Foundation.h>

@class PizzaBuilder // I prefer to make it impossible to use any details of PizzaBuilder
// #import <Pizza/PizzaBuilder.h>

@interface Pizza : NSObject
@property NSInteger size;
+ (Model *)createPizzaWithBuilder:((^nullability)(PizzaBuilder))builder;
@end

// ApiClient.h
#import <Foundation/Foundation.h>
#import <Pizza/Pizza.h> // Don't use the Apple-recommended umbrealla header just to write less import
#import <Pizza/PizzaBuilder.h> 

@cpsauer
Copy link
Contributor

cpsauer commented May 12, 2023

Benchmark

:) Delighted to hear it. yay! (Nasty bug they have there--not sure if you read through the linked issue.)

Absolute paths

Seems reasonable--and great to just auto-handle--but could you check/refine it a bit more as you code?

A couple things that jump out:

  • We should probably still have an error for the case that's in-output-base but not external or execroot.
  • Heads that the __main__ workspace name can change, so it's probably worth calling bazel info execution_root
  • (Remember the backwards compatible relative_to helper, as above)
  • (I'm sure there's more I missed on a quick pass; I'm counting on you to experiment and track them down, and give it some solid manual testing.)

Thanks for thinking about handling this case!

Preliminary aquery

Might a different output format be easier to parse?
(No libraries I know of.)

Header file preferences

I understand now your stylistic preference for Obj-C, and I feel like that goes nicely with Include What You Use.
[C++, unfortunately, is less amenable to forward declarations like that.]

@xinzhengzhang
Copy link
Author

I added a todolist to the description of PR to track the todo follow-up. And I opened a sub pr to solve 3 todos.

For the input of file, I simplified the judgment of file conversion and manually tested several possibilities (listed in the description of pr). Can also help to see if there are other cases.
Then I mentioned another confused problem about aquery that may need help...

@cpsauer
Copy link
Contributor

cpsauer commented Jun 2, 2023

Thanks for trying out a new review structure with me! A great experiment given the trickiness of this change.

I'm starting to take a peek. (Went on a trip and scrambling to catch back up. Again, sorry for delays, but I'm doing my best.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants