add a 'module graph' #73

Open
0xdevalias opened this issue Dec 12, 2023 · 6 comments
Labels
enhancement New feature or request

Comments

@0xdevalias

0xdevalias commented Dec 12, 2023

Let me share a bit of my current thoughts on this:

  1. Introducing module graph: Like Webpack and other bundlers, a module graph can help us unminify/rename identifiers and exports from bottom to top.
  2. Based on 1, the steps gonna be like [unpacked] -> [???] -> [unminify]. This new step will build the module graph, do module scanning, rename the file smartly, and provide this information to unminify.
  3. In the module graph, we can have a map for all exported names and top-level variables/functions, which also allows the user to guide the tool to improve the mapping.
  4. Module graph also brings the possibility of cross-module renaming. For example, un-indirect-call shall detect some pattern and rename the minified export name back to the real name.
  5. I like the idea of "AST fingerprinting". This can also be used in module scanning to replace the current regex implementation.

It's ok to not link this response everywhere as I'm still thinking about this. And it should be moved to a new issue.

Originally posted by @pionxzh in #34 (comment)

See Also

@0xdevalias
Author

0xdevalias commented Dec 12, 2023

Introducing module graph: Like Webpack and other bundlers, a module graph can help us unminify/rename identifiers and exports from bottom to top.

@pionxzh This sounds like an awesome idea!


Based on 1, the steps gonna be like [unpacked] -> [???] -> [unminify]. This new step will build the module graph, do module scanning, rename the file smartly, and provide this information to unminify.

@pionxzh I've only thought about this a little bit, and it depends on how 'all encompassing' you want the module graph to be, but I think it might even make sense for it (or some other metadata/graph) to capture the mapping from original files -> unmapped as well.

--

For some background context (to help understand some of the things I describe for the graph later on below), the workflow I've been thinking about/following for my own needs would probably be as follows:

  • My original workflow:
    • identify when a new build has been published + the manifest/chunk/etc URLs from that (Ref)
    • download all of the raw script files from the website and save them 'as is' in raw/ (Ref)
    • do a 'first stage' 'light unpack' of the relevant manifest/chunks/etc for this build from raw/ by stripping the hashes from the filenames/etc, running prettier on them, and saving them in unpacked-stage1/; I also manually figure out if any chunks have changed their identifier, and remove any chunks from the old build that no longer exist in the new build (Ref: 1, 2)
  • Additional steps now that I have wakaru:
    • do a 'wakaru unpack' of all of the relevant manifest/chunks/etc in unpacked-stage1/, and save them into unpacked-stage2/
    • do a 'wakaru unminify' of all the modules in unpacked-stage2/, and save them in unminified/

While that workflow might be overkill for a lot of people, I like that it keeps the outputs of each of the 'intermediary steps' available, so I can cross-reference between them if/as needed. As I start to use this more, I might find that some of those intermediate steps aren't worth keeping; but at least for now, that is my workflow.
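The 'strip the hashes from the filenames' part of the first stage could look something like the sketch below. It assumes the hash is a trailing `-<hex>` segment of at least six characters directly before the extension, which will not hold for every bundler configuration. `stripHash` and `buildRenameMap` are hypothetical helpers, not part of wakaru.

```javascript
// Hypothetical helper: drop a trailing "-<hex hash>" segment from a filename.
// Assumption: hashes are 6+ lowercase hex characters directly before the extension.
function stripHash(filename) {
  return filename.replace(/-[0-9a-f]{6,}(?=\.[^.\/]+$)/, "");
}

// Map each raw filename to its de-hashed name, so the original -> renamed
// mapping can be kept as metadata rather than being lost after the rename.
function buildRenameMap(filenames) {
  const renames = {};
  for (const name of filenames) {
    renames[name] = stripHash(name);
  }
  return renames;
}

console.log(buildRenameMap(["filefoo-abc123.js", "etc.js"]));
```

Keeping the rename map around (instead of only the renamed files) is what would later allow a graph to point back at the original filenames.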

--

Now with that background context, going back to my thoughts about the graph/etc; I think it would be useful to be able to have a graph/similar that shows:

  • a1-b1-c1-ha-sh/_buildManifest.js contains chunk files ["filefoo-abc123.js", "etc.js"] (Ref)
  • a1-b1-c1-ha-sh/_ssgManifest.js contains chunk files ["ssgbar-abc123.js", "ssg-etc.js"] (Ref)
  • webpack-a2b2c2hash.js contains chunk files ["aaaa-bbbb.js", "etc.js"] (Ref)
  • filefoo-abc123.js contains chunk [1337, ...]
  • chunk 1337
    • contains modules [1, 3, 7, 24]
    • which were renamed to ["module1.js", "aUsefulName.js", "a/path/and/a/reallyUsefulName.js", "module24.js"]

And then the actual 'internal module mapping' stuff of what imports/exports what, etc.
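As a rough illustration of that containment chain (chunk file -> chunk -> modules -> renamed files), here is a minimal sketch using the example names from the list above. The object layout and the `locateModule` helper are assumptions made for illustration, not an existing wakaru format.

```javascript
// Hypothetical metadata, reusing the example names from the list above.
const build = {
  chunkFiles: {
    "filefoo-abc123.js": { chunkIds: [1337] },
  },
  chunks: {
    1337: {
      moduleIds: [1, 3, 7, 24],
      renamedTo: {
        1: "module1.js",
        3: "aUsefulName.js",
        7: "a/path/and/a/reallyUsefulName.js",
        24: "module24.js",
      },
    },
  },
};

// Walk chunk file -> chunk -> module to find where a module ID came from.
function locateModule(meta, moduleId) {
  for (const [file, { chunkIds }] of Object.entries(meta.chunkFiles)) {
    for (const chunkId of chunkIds) {
      const chunk = meta.chunks[chunkId];
      if (chunk && chunk.moduleIds.includes(moduleId)) {
        return { file, chunkId, renamedTo: chunk.renamedTo[moduleId] };
      }
    }
  }
  return null; // module ID not present in any known chunk
}

console.log(locateModule(build, 7));
```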

I'm not sure exactly how to map the data, but I would probably start with identifying the main 'types' involved, and what makes sense to know/store about each of them. The following might not be complete, but it's what I came up with from a 'first pass':

  • a 'build'
    • all of the original file names
    • (some of the below may make sense to be nested under this, not sure)
  • build manifest (Ref)
    • original filename
    • build hash
    • renamed to filename
    • chunks (and I think the URL paths that map to them; at least for those related to pages (possibly a next.js thing) (Ref))
  • ssg manifest (Ref)
    • original filename
    • build hash
    • renamed to filename
    • etc? (I haven't actually looked at one of these with real data in it yet)
  • chunk files (of which the webpack.js chunk seems a bit special I think?) (Ref)
    • original filename
    • chunk hash
    • renamed to filename
    • chunk IDs that were included in it
  • chunks/modules
    • original chunk filename/etc?
      • (probably will be the same as the 'chunk files' section above; might be a better way to layout this data, but I thought it probably didn't make sense to nest it under the chunk files structure)
    • chunkID in the bundle
    • moduleIDs in the chunk
  • modules
    • chunkID that originally contained it
    • moduleID from the bundle/chunk
    • filename the module was renamed into
    • imported moduleIDs
    • exports
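One way to sanity-check such a metadata file is to store each 'type' as flat records and verify that the cross-references between them resolve. The sketch below does that for chunks and modules only; every field name and the `findDanglingReferences` helper are hypothetical.

```javascript
// Hypothetical flat records, one array per 'type' from the list above.
const metadata = {
  chunks: [
    { chunkId: 1337, originalFilename: "filefoo-abc123.js", moduleIds: [1, 3] },
  ],
  modules: [
    { moduleId: 1, chunkId: 1337, renamedTo: "module1.js", imports: [3], exports: ["default"] },
    { moduleId: 3, chunkId: 1337, renamedTo: "aUsefulName.js", imports: [], exports: ["a"] },
  ],
};

// Report references that point at chunks or modules missing from the metadata.
function findDanglingReferences(meta) {
  const chunkIds = new Set(meta.chunks.map((c) => c.chunkId));
  const moduleIds = new Set(meta.modules.map((m) => m.moduleId));
  const problems = [];
  for (const mod of meta.modules) {
    if (!chunkIds.has(mod.chunkId)) {
      problems.push(`module ${mod.moduleId}: unknown chunk ${mod.chunkId}`);
    }
    for (const dep of mod.imports) {
      if (!moduleIds.has(dep)) {
        problems.push(`module ${mod.moduleId}: unknown import ${dep}`);
      }
    }
  }
  return problems;
}

console.log(findDanglingReferences(metadata));
```

A dangling import is not necessarily an error; it may just mean the imported module lives in a chunk that has not been unpacked yet.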

This 'metadata file' / graph / etc could then potentially also include the stuff I've talked about before (Ref) for being able to 'guide' the variable/function/etc names used during unminification.

--

I haven't thought deeply through the above yet; it might turn out that some of the things I described there might make sense being split into 2 different things; but I wanted to capture it all while it was in my head.


In the module graph, we can have a map for all exported names and top-level variables/functions, which also allows the user to guide the tool to improve the mapping.

Module graph also brings the possibility of cross-module renaming. For example, un-indirect-call shall detect some pattern and rename the minified export name back to the real name.

@pionxzh 👌🏻🎉
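For the 'bottom to top' part of this, one plausible approach is to process modules in dependency order, so that a module's exports have already been renamed by the time its importers are visited. The sketch below computes such an order with a depth-first post-order walk; the graph contents and the `bottomUpOrder` name are made up for illustration.

```javascript
// Hypothetical graph: module ID -> IDs of the modules it imports.
const imports = new Map([
  [1, [3, 7]],
  [3, [7]],
  [7, []],
]);

// Depth-first post-order: a module is emitted only after everything it imports.
// Modules already seen are skipped, so import cycles do not loop forever
// (within a cycle the relative order is then arbitrary).
function bottomUpOrder(graph) {
  const order = [];
  const seen = new Set();
  const visit = (id) => {
    if (seen.has(id)) return;
    seen.add(id);
    for (const dep of graph.get(id) || []) visit(dep);
    order.push(id);
  };
  for (const id of graph.keys()) visit(id);
  return order;
}

console.log(bottomUpOrder(imports)); // [ 7, 3, 1 ]
```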


I like the idea of "AST fingerprinting". This can also be used in module scanning to replace the current regex implementation.

@pionxzh Definitely. Though I (or you, or someone) need to dig into the concepts a bit more and figure out a practical way to implement it; as currently it's sort of a theory in my mind, but not sure how practical it will be in reality.

Created a new issue for that exploration:

@0xdevalias
Author

0xdevalias commented Dec 21, 2023

I was wanting to visualize the dependencies between my unminified modules, and stumbled across this project:

It mentioned two of its dependencies, which sound like they could potentially be useful here:


Off the top of my head, I think the 'high level' module-graph within wakaru would probably make the most sense to be linked based on the module IDs, rather than the actual imports/exports / module filenames. That way it would be more robust/not need to change as things are renamed/moved around/etc. So these libraries may not be super useful 'as is' for this.
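A minimal sketch of that idea: edges are stored by module ID only, names live in a separate map, and a filename-keyed view is derived on demand for tools that expect one. The data and the `toNamedGraph` helper are hypothetical.

```javascript
// Edges are stored by module ID only; names live in a separate, mutable map.
const edges = { 1: [3], 3: [] }; // hypothetical module IDs
const names = { 1: "module1.js", 3: "module3.js" };

// Project the ID-keyed graph onto the current filenames, e.g. for a visualiser.
function toNamedGraph(edgesById, namesById) {
  const named = {};
  for (const [id, deps] of Object.entries(edgesById)) {
    named[namesById[id]] = deps.map((dep) => namesById[dep]);
  }
  return named;
}

// Renaming a module touches only the name map; the edges stay untouched.
names[3] = "aUsefulName.js";

console.log(toNamedGraph(edges, names));
```

The derived object has the same file -> dependencies shape as the JSON that madge emits, so a filename-based view remains available without making filenames the source of truth.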


Some useful commands for visualising module dependencies:

# Get the module dependencies as a static .svg image
madge --image graph.svg path/src/app.js

# Get the module dependencies as a graphviz DOT file
madge --dot path/src/app.js > graph.gv

# Get the module dependencies as json
madge --json path/src/app.js > dependencies.json
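The JSON output maps each module to the list of modules it depends on, which makes it easy to post-process. As a small sketch, the function below inverts that mapping to answer 'what imports this module?'. The sample data stands in for the contents of dependencies.json and is made up; `invertDependencies` is a hypothetical helper.

```javascript
// Hypothetical contents of dependencies.json: each file maps to the files it imports.
const dependencies = {
  "app.js": ["module1.js", "aUsefulName.js"],
  "module1.js": ["aUsefulName.js"],
  "aUsefulName.js": [],
};

// Invert the mapping: for each file, which files import it?
function invertDependencies(deps) {
  const dependents = {};
  for (const file of Object.keys(deps)) dependents[file] = [];
  for (const [file, imported] of Object.entries(deps)) {
    for (const target of imported) {
      if (!dependents[target]) dependents[target] = [];
      dependents[target].push(file);
    }
  }
  return dependents;
}

console.log(invertDependencies(dependencies));
```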

The graphviz dot output can then be further explored through an interactive tool such as:

If there are missing dependencies, these are worth noting for how to see/improve it:


In addition to the above, a couple of other 'dependency graph' viewers I came across when I was looking for tools for this today:

@0xdevalias
Author

0xdevalias commented Dec 23, 2023

I haven't deeply looked into this, and not for ages, but at one stage I remember having a thought that the chunks specified the other chunks they depended on somewhere (as well as the individual module imports within it) (Ref)

In the code I was mostly exploring, there are the _buildManifest.js (Ref) and webpack.js (Ref) chunks, which seemed to detail some of the 'high level' chunk loading/dependencies/etc; though there were also the chunks loaded directly in the HTML as well.

Looking at a fairly small/basic chunk, it seems like it doesn't have anywhere that specifies dependencies on other chunks (Ref)

But then looking at a far larger chunk file (pages/_app.js (Ref)), there is this section after all of the normal module definitions that looks like it might handle loading other chunks if they aren't already loaded, and module dependency order or similar:

function (U) {
  var B = function (B) {
    return U((U.s = B));
  };
  U.O(0, [774, 179], function () {
    return B(18992), B(9869), B(76281);
  }),
    (_N_E = U.O());
},

Originally posted by @0xdevalias in j4k0xb/webcrack#30 (comment)
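One reading of the snippet above is that `U.O(0, [774, 179], ...)` defers the callback until chunks 774 and 179 are available, and then requires modules 18992, 9869 and 76281 as entry points. That is an interpretation rather than something confirmed here. Under that assumption, the chunk and entry module IDs could be pulled out with a pattern match like the sketch below (`extractStartupInfo` is a hypothetical helper, and a regex like this is fragile compared to matching on the AST).

```javascript
// Part of the snippet from above, as a string (minified identifiers differ per build).
const source = `
  U.O(0, [774, 179], function () {
    return B(18992), B(9869), B(76281);
  }),
`;

// Assumption: the startup call has the shape
//   X.O(0, [chunkIds], function () { return Y(id), Y(id), ...; })
function extractStartupInfo(code) {
  const pattern =
    /\.O\(\s*0\s*,\s*\[([\d\s,]*)\]\s*,\s*function\s*\(\)\s*\{\s*return\s+([^;}]*)/;
  const match = pattern.exec(code);
  if (!match) return null;
  const chunkIds = (match[1].match(/\d+/g) || []).map(Number);
  // Only take numbers that appear as a call argument, e.g. B(18992).
  const entryModuleIds = [...match[2].matchAll(/\((\d+)\)/g)].map((m) => Number(m[1]));
  return { chunkIds, entryModuleIds };
}

console.log(extractStartupInfo(source));
```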


Another pattern I just noticed, in _app.js (Ref), presumably Next specific:

// module-9869.js
(window.__NEXT_P = window.__NEXT_P || []).push([
  "/_app",
  function () {
    return require(68502);
  },
]);
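If that pattern holds across pages, it would give a route -> module ID mapping that could be recorded in the graph. A rough sketch, assuming every registration has exactly the shape shown above (the `extractPageRegistrations` helper is hypothetical, and the name of the require function will differ in minified code):

```javascript
// The snippet from above, as a string.
const source = `
(window.__NEXT_P = window.__NEXT_P || []).push([
  "/_app",
  function () {
    return require(68502);
  },
]);
`;

// Assumption: each registration looks like
//   push(["<route>", function () { return <requireFn>(<moduleId>); }])
function extractPageRegistrations(code) {
  const pattern =
    /\.push\(\[\s*"([^"]+)"\s*,\s*function\s*\(\)\s*\{\s*return\s+\w+\((\d+)\)/g;
  return [...code.matchAll(pattern)].map(([, route, id]) => ({
    route,
    moduleId: Number(id),
  }));
}

console.log(extractPageRegistrations(source));
```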

@0xdevalias
Author

Not 100% sure, but Webpack's stats.json file sounds like it might be relevant here (if not directly, then maybe as a source of inspiration):

Even more tangentially related to this, I've pondered how much we could 're-construct' the files necessary to use tools like bundle analyzer, without having access to the original source (or if there would even be any benefit to trying to do so):

My gut feel is that we probably can figure out most of what we need for it; we probably just can't give accurate sizes for the original pre-minified code, etc; and the module names/etc might not be mappable to their originals unless we have module identification type features (see #41)
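To make that a bit more concrete, here is a sketch of building a stats-like object from recovered modules. It is only loosely modelled on a small subset of the fields in stats.json (`chunks`, `modules`, `id`, `name`, `size`); whether a given analyzer would accept it is untested, and the input data and `toStatsLike` helper are hypothetical.

```javascript
// Hypothetical recovered data: one record per unpacked module.
const recovered = [
  { moduleId: 1, chunkId: 1337, name: "module1.js", source: "export const a = 1;" },
  { moduleId: 3, chunkId: 1337, name: "aUsefulName.js", source: "export default 42;" },
];

// Build a stats-like object. Sizes are the string lengths of the recovered code,
// not the original pre-minified sizes, which cannot be recovered.
function toStatsLike(modules) {
  const chunks = new Map();
  for (const mod of modules) {
    const chunk = chunks.get(mod.chunkId) || { id: mod.chunkId, size: 0 };
    chunk.size += mod.source.length;
    chunks.set(mod.chunkId, chunk);
  }
  return {
    chunks: [...chunks.values()],
    modules: modules.map((m) => ({
      id: m.moduleId,
      name: m.name,
      size: m.source.length,
      chunks: [m.chunkId],
    })),
  };
}

console.log(JSON.stringify(toStatsLike(recovered), null, 2));
```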

Originally posted by @0xdevalias in 0xdevalias/chatgpt-source-watch#9 (comment)

Originally posted by @0xdevalias in #121 (comment)

@0xdevalias
Author

The Stack Graph / Scope Graph links/references I shared in #34 (comment) may be relevant to this issue as well.

@0xdevalias
Author

There has recently been a new source of discussion around code fingerprinting and module identification over on the humanify repo in this issue:

Originally posted by @0xdevalias in #74 (comment)
