Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Lazy indexing for faster subdirectory updates #1891

Open
alex-torok opened this issue Aug 26, 2024 · 8 comments · May be fixed by #1892
Open

Proposal: Lazy indexing for faster subdirectory updates #1891

alex-torok opened this issue Aug 26, 2024 · 8 comments · May be fixed by #1892

Comments

@alex-torok
Copy link
Contributor

alex-torok commented Aug 26, 2024

Problem

Currently, gazelle operates in indexing or non-indexing mode. When indexing is enabled, gazelle will walk the entire repo, using the Imports method of language extensions to build up a rule index. For large repos, this can lead to a substantial slowdown compared to operating in a non-indexing mode when trying to run gazelle in a subdirectory (~10s -> 0.8s in my case).

The time penalty from indexing is so great that there are ongoing discussions and work to allow gazelle to save and load its index to the disk - #1181

Not all languages are able to operate without indexing due to ambiguities in mapping source code import statements to the bazel target labels that provide those imports. For example, it is not possible to determine python target labels from import statements. As an example, how can you map the following imports to bazel target labels without an index?

# All of these could be //common/math:foo
import common.math.foo.some_func
import common.math.foo.MyFooClass
import common.math.foo

# Is this //common/math:foo or //common:math?
from common.math import foo

# Is this 'bar function' from //common/math:foo or 'bar module' //common/math/foo:bar
from common.math.foo import bar

Even though you cannot explicitly determine the target labels for each of those imports, you can reasonably guess which packages they would be in (common, common/math, common/math/foo). Because of common conventions for directory structure matching import path structure, we don't need to index the entire repository to resolve these imports. We only need to index the packages that we think the targets would be in.

Proposal

I would like to add a new -lazy_index flag to gazelle. When this flag is enabled (with -index=true), gazelle will not visit all directories in the repo. Instead, it will only visit the directories to update. The GenerateResult struct would gain a PkgsToIndex field that language extensions should populate. Once gazelle is finished generating rules for a given subdir, it would look at all of the PkgsToIndex, then index them prior to finalizing the ruleIndex and calling language extension Resolve methods.

For our above python code example, this leads to a result like,

PkgsToIndex: []string{"", "common", "common/math", "common/math/foo", "common/math/foo/some_func", "common/math/foo/MyFooClass", "common/math/foo/bar"}

Notably, it is okay for rules to suggest indexing packages that don't exist. We can have gazelle ignore directories that don't exist in the workspace when an extension asks to index them.

Benchmarking

I hacked together an implementation of lazy indexing with rules_python. In a repository with ~25k directories, it reduced the runtime of running gazelle in a single subdirectory from 5-10s to ~1s (depending on the directory and how much indexing was avoidable).

The gazelle changes can be seen in #1892.
The rules_python changes can be seen in alex-torok/rules_python#1

Questions / Challenges

  • Is this something that the community is interested in? Would language extension authors be willing to implement the additional behavior to support lazy indexing?
  • Are there languages where lazy indexing would not work or does not make any sense?
    • I'm mostly familiar with golang and python and would like feedback from people familiar with other gazelle language extensions.
  • Does the suggested interface change to add an additional PkgsToIndex field to the GenerateResult sound good, or should this be an entirely separate method?
    • I feel like it makes sense on GenerateResult, as the language extension should have all of the context it needs to make the recommendation to gazelle. Alternatively, there could be a new interface that extensions implement where we pass in the GenerateResult and have it return the packages to index.

This solution does pose problems for repos that don't have a package directory structure that matches the language import path structure. For example, a.b.c could come from //my/repo/is/wierd. In this case, people could write a language extension that has a nearly-no-op Generate method that returns a fixed set of nonconventional packages to index. Alternatively, there could be a -lazy_index_include=some/path to force additional packages (and their subdirs) to be indexed.

@fmeum
Copy link
Member

fmeum commented Aug 29, 2024

Cc @linzhp @tyler-french for Über's "conventions" approach

@jbedard
Copy link
Contributor

jbedard commented Jan 1, 2025

First thought... can we avoid any type of cli flag to opt in/out of this? A cli flag means everywhere gazelle is invoked needs to fully understand which gazelle languages are being used and if they need indexing, if lazy indexing is enough etc.

The first idea that comes to mind would be doing it based on a language interface and gazelle can decide what to do based on which languages are known, which are enabled, maybe how they are configured.

Just adding Language.RequiresIndexing() might be enough that the need for indexing can be automatically determined. Or maybe RequireIndexing(Config) so languages can take configuration into account. Some languages might have a directive that can determine if/when/where indexing is required, or some languages might do that based on the presence of a file (go mod, pnpm lockfile etc).

EDIT: I guess this RequiresIndexing is just a more complicated version of what you're suggesting, see comment below.

@jbedard
Copy link
Contributor

jbedard commented Jan 1, 2025

With the current PkgsToIndex suggestion can't we avoid the cli flag by simply respecting PkgsToIndex of every language? The default just has to work the same as today, then if/when languages implement this and return non-empty arrays they will get the improvements automatically?

@jbedard
Copy link
Contributor

jbedard commented Jan 1, 2025

I guess my RequiresIndexing idea allows languages to dynamically declare what needs to be indexed top-down, where PkgsToIndex allows languages to declare what needs to be indexed in addition to the BUILD being generated.

Maybe both are necessary?

The use case I keep thinking of is npm packages, and maybe pip is similar? I think maven is also similar, maybe go mod can be implemented the same way as well. We need to parse a lockfile, which might be in a subdirectory, and then that data can be used by a resolver to resolve package names.

The package resolver can use RequiresIndexing to traverse the minimal amount of the fs required to find the lockfile (top-down relative to the root), while PkgsToIndex can be used to expand the index from the BUILD like this issue is describing?

@jayconrod
Copy link
Contributor

High level thought: I'm basically in favor of this proposal. It could speed up Gazelle dramatically in large repos while not being too "magical" or requiring a lot of configuration from end users. I haven't yet thought about the details though, and I do think we need to consider what configuration will look like for a few languages.

First thought... can we avoid any type of cli flag to opt in/out of this? A cli flag means everywhere gazelle is invoked needs to fully understand which gazelle languages are being used and if they need indexing, if lazy indexing is enough etc.

The current default is for Gazelle to index everything (-index=true). I think if this proposal is implemented, it implies a change to that default. Perhaps we can defer that change a while until all public language extensions have implemented this wherever it makes sense. In the end though, users shouldn't have to think about this: it should just work and be fast.


Some historical context about indexing, though enough time has passed that my memory is a little muddled: I added indexing to deal with GOPATH and vendoring. At the time, people could put a vendor directory with a tree of Go code anywhere in their repo they wanted. Multiple vendor directories were allowed. Configuration in go_repository was difficult, so we just needed to deal with it. Protos were a problem too, since they don't follow such strict conventions. Most Go repos were small, so full repo scanning usually wasn't an issue.

@alex-torok
Copy link
Contributor Author

alex-torok commented Jan 2, 2025

To give a bit of an update on this -- I've been using this for about 4 months to roll out adoption of python gazelle and we see runtimes of ~0.7-1.5s when running in a subdirectory with fewer dependencies on other packages. It has been working well enough that we were willing to maintain our own patch on gazelle to support it. (We wrap the python lang extension and use reflect to inspect the python extension's imports struct and build up the packages to index list in order to avoid maintaining a python lang extension patch)

A global "use the index or not" configuration isn't what we needed. We still don't index to generate golang targets because the default conventions are strong enough and we historically avoided indexing because of the time penalty in our monorepo. To support lazy indexing for python and no indexing for go, I wrapped the golang language extension. We set the config struct's IndexLibraries field to false before calling the upstream Imports, Generate, etc methods so we didn't end up paying indexing penalty for golang.

Assuming lazy becomes the default mode, I think we have options. There could be a #gazelle:index LANG MODE directive to allow per-language configuration away from the lazy default. Perhaps this directive is only allowed in the root BUILD file? This further drives away from the usage of the -index flag into configuration via directives.

Alternatively, if the -index flag is kept as-is, the language extension itself could gain something like a #gazelle:go_lazy_index off to cause it to return an empty PkgsToIndex and no-op in the Imports method?

@jbedard
Copy link
Contributor

jbedard commented Jan 2, 2025

I mentioned this in the PR... but doesn't the API need the ability to index directories that are not subdirectories? What if imports could be relative such as ../foo? Or x that might be in a common/x or server/x?

@jayconrod
Copy link
Contributor

Assuming lazy becomes the default mode, I think we have options. There could be a #gazelle:index LANG MODE directive to allow per-language configuration away from the lazy default. Perhaps this directive is only allowed in the root BUILD file? This further drives away from the usage of the -index flag into configuration via directives.

I wonder if we could do something more general: like # gazelle:index path/to/dir or # gazelle:index path/to/dirs/..., which would index a specific directory, perhaps recursively. That could be used for subtrees that don't follow predictable conventions (for example, a bag of protos that generate libraries for multiple languages). No need to specify a language name or mode.

I mentioned this in the PR... but doesn't the API need the ability to index directories that are not subdirectories? What if imports could be relative such as ../foo? Or x that might be in a common/x or server/x?

I think all paths in PkgsToIndex should be relative to the workspace root.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants
@jbedard @jayconrod @fmeum @alex-torok and others