#308 Add Scitools Understand Parser

Adds Scitools Understand Dependencies parser for files and classes.

Signed-off-by: Carlos Paradis <[email protected]>
Co-authored-by: Nicholas Beydler <[email protected]>
Co-authored-by: Carlos Paradis <[email protected]>
sailuh · Dec 8, 2024 · ac522b6
1 parent 513a3f0
commit ac522b6

Showing 38 changed files with 607 additions and 42 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -2,7 +2,7 @@ Package: kaiaulu
 Type: Package
 Title: Kaiaulu
 Version: 0.0.0.9700
-Description: Kaiaulu is an R package and common interface that helps with understanding evolving software development communities, and the artifacts (gitlog, mailing list, files, etc.) which developers collaborate and communicate about. See Paradis et al., (2012) <doi:10.1007/978-3-031-15116-3_6>. 
+Description: Kaiaulu is an R package and common interface that helps with understanding evolving software development communities, and the artifacts (gitlog, mailing list, files, etc.) which developers collaborate and communicate about. See Paradis et al., (2012) <doi:10.1007/978-3-031-15116-3_6>.
 Authors@R: c(
     person('Carlos', 'Paradis', role = c('aut', 'cre'),
       email = '[email protected]',
@@ -21,6 +21,7 @@ Authors@R: c(
     person('Anthony', 'Lau', role = c('ctb')),
     person('Sean', 'Sunoo', role = c('ctb')),
     person('Ian Jaymes', 'Iwata', role= c('ctb')),
+    person('Raven', 'Quiddaoen', role= c('ctb')),
     person('Nicholas', 'Beydler', role = c('ctb')),
     person('Mark', 'Burgess', role = c('ctb'))
     )

diff --git a/NAMESPACE b/NAMESPACE
@@ -3,6 +3,7 @@
 export(annotate_src_text)
 export(assign_exact_identity)
 export(bipartite_graph_projection)
+export(build_understand_project)
 export(commit_message_id_coverage)
 export(community_oslom)
 export(convert_pipermail_to_mbox)
@@ -42,6 +43,7 @@ export(example_notebook_alternating_function_in_files)
 export(example_notebook_function_in_code_blocks)
 export(example_renamed_file)
 export(example_test_example_src_repo)
+export(export_understand_dependencies)
 export(filter_by_commit_interval)
 export(filter_by_commit_size)
 export(filter_by_file_extension)
@@ -189,6 +191,7 @@ export(parse_r_dependencies)
 export(parse_r_function_definition)
 export(parse_r_function_dependencies)
 export(parse_rfile_ast)
+export(parse_understand_dependencies)
 export(query_src_text)
 export(query_src_text_class_names)
 export(query_src_text_namespace)
@@ -214,6 +217,7 @@ export(transform_gitlog_to_temporal_network)
 export(transform_r_dependencies_to_network)
 export(transform_reply_to_bipartite_network)
 export(transform_temporal_gitlog_to_adsmj)
+export(transform_understand_dependencies_to_network)
 export(weight_scheme_count_deleted_nodes)
 export(weight_scheme_cum_temporal)
 export(weight_scheme_pairwise_cum_temporal)

diff --git a/NEWS.md b/NEWS.md
@@ -3,6 +3,7 @@ __kaiaulu 0.0.0.9700 (in development)__
 
 ### NEW FEATURES
 
+ * `build`, `export`, `parse`, and `transform` functions for Scitools Understand have been added. [#308](https://github.com/sailuh/kaiaulu/issues/308)
  * The GitHUB API has been expanded to use refresh, along with other functions. `github_api_project_issue_search` has been added that makes the search/issues endpoint API calls. `github_api_project_issue_or_pr_comments_by_date` and `github_api_project_issue_by_date` have been added to download issue data and comments by date ranges. `github_parse_search_issues_refresh` has been added that parses the issue data downloaded from the search endpoint in the refresh_issues folder. `github_api_project_issue_refresh` and `github_api_project_issue_or_pr_comment_refresh` were added to download issue data or comments respectively that have not already been downloaded. `format_created_at_from_file` was added to retrieve the greatest date from a JSON file. See the Reference Docs on GitHub section for more details. [#282](https://github.com/sailuh/kaiaulu/issues/282)
  * `config.R` now contains a set of getter functions used to centralize the gathering of configuration data and these getter functions are used to refactor configuration file information gathering. For example, loading configuration file information with variable assignment is as follows `git_repo_path <- config_file[["version_control"]][["log"]]` but refactoring with a config.R getter function becomes `git_repo_path <- get_git_repo_path(config_file)`.  [#230](https://github.com/sailuh/kaiaulu/issues/230)
  * `refresh_jira_issues()` had been added. It is a wrapper function for the previous downloader and downloads only issues greater than the greatest key already downloaded. [#275](https://github.com/sailuh/kaiaulu/issues/275)
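
The four Understand functions named in the first entry are meant to be chained: build the DB, export the dependencies XML, parse it, then filter it. A minimal sketch of that order, assuming kaiaulu and Scitools Understand are installed. The wrapper `run_understand_pipeline`, the `und` location, the output file name, and all paths are hypothetical placeholders, not part of the package:

```r
# Hypothetical wrapper (not part of kaiaulu) chaining the four new functions.
run_understand_pipeline <- function(und_path, project_path, output_dir,
                                    language = "java",
                                    parse_type = "file",
                                    keep_types = c("Import", "Call", "Create", "Use")) {
  # 1. Create the Understand DB, add the sources, and analyze them.
  db_path <- kaiaulu::build_understand_project(
    scitools_path = und_path,
    project_path = project_path,
    language = language,
    output_dir = output_dir
  )
  # 2. Export file- or class-level dependencies to an XML file.
  #    The file name below is an arbitrary choice for this sketch.
  xml_path <- kaiaulu::export_understand_dependencies(
    scitools_path = und_path,
    db_filepath = db_path,
    parse_type = parse_type,
    output_filepath = file.path(output_dir, paste0(parse_type, "_dependencies.xml"))
  )
  # 3. Parse the XML into a list with node_list and edge_list tables.
  parsed <- kaiaulu::parse_understand_dependencies(dependencies_path = xml_path)
  # 4. Keep only the requested dependency kinds.
  kaiaulu::transform_understand_dependencies_to_network(parsed, weight_types = keep_types)
}

# Placeholder location of the `und` binary; adjust to the local install.
und_path <- "~/scitools/bin/und"
# Only attempt the run when both the binary and the package are available.
if (file.exists(path.expand(und_path)) && requireNamespace("kaiaulu", quietly = TRUE)) {
  graph <- run_understand_pipeline(und_path, "path/to/project/", "path/to/output/")
  print(head(graph[["edge_list"]]))
}
```

The guard at the end keeps the script harmless on machines without Understand; in a notebook the paths would more likely come from the project configuration file.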

diff --git a/R/src.R b/R/src.R
@@ -4,8 +4,174 @@
 # License, v. 2.0. If a copy of the MPL was not distributed with this
 # file, You can obtain one at https://mozilla.org/MPL/2.0/.
 
+############## Understand Project Builder ##############
+
+#' Build Understand DB
+#'
+#' Uses Scitools Understand to create a source code project Und Database.
+#'
+#' @param scitools_path path to the scitools binary `und`
+#' @param project_path path to the project source code folder to create the Understand DB.
+#' @param language the primary language of the project (language must be supported by Understand)
+#' @param output_dir path to output directory (formatted output_path/)
+#'
+#' @return The created Scitools Understand DB path
+#' @references See pg. 352 in https://documentation.scitools.com/pdf/understand.pdf Sept. 2024 Edition
+#' @export
+#' @family parsers
+build_understand_project <- function(scitools_path, project_path, language, output_dir){
+
+  scitools_path <- path.expand(scitools_path)
+
+  # Create variables for command line
+  command <- scitools_path
+  project_path <- shQuote(project_path) # Quoting the project path
+  db_dir <- file.path(output_dir, "Understand.und")
+  args <- c("create", "-db", db_dir, "-languages", language)
+
+  # Build the Understand project with the `und` command: create the DB, add the sources, then analyze them
+  build_output <- system2(command, args)
+  args <- c("-db", db_dir, "add", project_path)
+  db_output <- system2(command, args)
+  args <- c("analyze", db_dir)
+  analyze_output <- system2(command, args)
+
+  return(db_dir)
+
+}
+
+#' Extract Understand Dependencies
+#'
+#' Extract the XML dependency file for either class or file granularity from
+#' an understand DB.
+#'
+#' @param scitools_path path to the scitools binary `und`
+#' @param db_filepath path to the scitools DB (see \code{\link{build_understand_project}})
+#' @param parse_type Type of dependencies to generate into xml (either "file" or "class")
+#' @param output_filepath path to the output XML filepath of dependencies
+#'
+#' @return The path to the exported XML dependencies file, i.e. the output_filepath parameter.
+#' @references See pg. 352 in https://documentation.scitools.com/pdf/understand.pdf Sept. 2024 Edition
+#' @export
+#' @family parsers
+export_understand_dependencies <- function(scitools_path, db_filepath, parse_type = c("file", "class"), output_filepath){
+
+  scitools_path <- path.expand(scitools_path)
+
+  # Before running, check if parse_type is correct
+  parse_type <- match.arg(parse_type)
+
+  # Create the variables used in command lines
+  #db_dir <- file.path(understand_dir, "Understand.und")
+
+  #file_name <- paste0(parse_type, "Dependencies.xml")
+  #xml_dir <- file.path(db_dir, file_name)
+
+  # Generate the XML file
+  # Derived from pg. 352 in https://documentation.scitools.com/pdf/understand.pdf Sept. 2024 Edition
+  args <- c("export", "-dependencies", parse_type, "cytoscape", output_filepath, db_filepath)
+  output <- system2(scitools_path, args)
+
+  return(output_filepath)
+
+  # Generated XML file is assumed to be in this approximate format (regardless of parse_type) using Understand Build 1202
+  # <graph ...>
+  #   ... [Irrelevant graph attributes and rdf grandchildren]
+  #   <node id="67" label="ObjectMapper id:67">
+  #     <att type="string" name="node.shape" value="rect"/>
+  #     <att type="string" name="node.fontSize" value="5"/>
+  #     <att type="string" name="node.label" value="ObjectMapper"/>
+  #     <att type="string" name="longName" value="com.fasterxml.jackson.databind.ObjectMapper"/>
+  #     <att type="string" name="kind" value="Unknown Class"/>
+  #     <graphics type="RECTANGLE" h="35" w="35" x="0" y="0" fill="#ffffff" width="1" outline="#000000" cy:nodeTransparency="1.0" cy:nodeLabelFont="Default-0-8" cy:borderLineType="solid"/>
+  #   </node>
+  #   ... [Other nodes sharing the format]
+  #   <edge source="2" target="9" label="App(Depends On)CalculatorUI">
+  #     <att type="string" name="edge.targetArrowShape" value="ARROW"/>
+  #     <att type="string" name="edge.color" value="#0000FF"/>
+  #     <att type="string" name="canonicalName" value="App(Depends On)CalculatorUI"/>
+  #     <att type="string" name="interaction" value="Depends On"/>
+  #     <att type="string" name="dependency kind" value="Call, Create"/>
+  #   </edge>
+  #   ... [Other edges sharing the format]
+
+
+}
+
 ############## Parsers ##############
 
+#' Parse Scitools Understand Dependencies XML
+#'
+#' Parses either a file or class scitools understand dependency XML to table.
+#'
+#' @param dependencies_path path to the exported Understand dependencies file (see \code{\link{export_understand_dependencies}}).
+#' @export
+#' @family parsers
+parse_understand_dependencies <- function(dependencies_path) {
+
+  # Parse the XML file
+  xml_data <- xmlParse(dependencies_path)  # Creates pointer to file
+  xml_nodes <- xmlRoot(xml_data)  # Finds the head: graph
+  xml_nodes <- xmlChildren(xml_nodes)
+  # xml_nodes now contains the nodes and edges (which were children of graph) and also graph's atts
+
+  # From child nodes- filter for those with name "node"
+  # Create a list by iterating through all the children in xml_nodes
+  node_elements <- lapply(xml_nodes, function(child) {
+    if (xmlName(child) == "node") {  # We're searching for nodes, not att or edges
+      id <- xmlGetAttr(child, "id")  # Extract the id from the node line
+      att_nodes <- xmlChildren(child)  # To access the atts of the node
+      node_label <- xmlGetAttr(att_nodes[[3]], "value")  # Relevant att is the 3rd line
+      long_name <- xmlGetAttr(att_nodes[[4]], "value")  # Relevant att is the 4th line
+      return(data.table(node_label = node_label, id = id, long_name = long_name))  # Returns the table containing the filtered node data
+    } else {
+      return(NULL) # Return NULL for the entry to be filtered out later
+    }
+  })
+
+  # Remove NULLs and combine the results from the node_elements list
+  node_list <- rbindlist(node_elements[!sapply(node_elements, is.null)], use.names = TRUE, fill = TRUE)
+
+  # From child nodes- filter for those with name "edge"
+  # Create a list by iterating through all the children in xml_nodes
+  edge_elements <- lapply(xml_nodes, function(child) {
+    if (xmlName(child) == "edge") {  # We're searching for edges, not att or nodes
+      # Extract the id_from and id_to from the edge line
+      id_from <- xmlGetAttr(child, "source")
+      id_to <- xmlGetAttr(child, "target")
+      att_nodes <- xmlChildren(child)  # To access the atts of the edge
+      dependency_kind <- xmlGetAttr(att_nodes[[5]], "value")  # Relevant att is the 5th line
+      # Skip edges whose dependency_kind is NULL or empty: these occur even in well-formed exports,
+      # and without this check the split below raises an error.
+      if (!is.null(dependency_kind) && dependency_kind != "") {
+        dependency_kind <- unlist(stri_split(dependency_kind, regex = ",\\s*"))  # Separates the string into a vector
+        return(data.table(id_from = id_from, id_to = id_to, dependency_kind = dependency_kind)) # Returns the table containing the filtered edge data, one row per dependency kind
+      } else {
+        return(NULL) # Return NULL for the entry to be filtered out later
+      }
+    } else {
+      return(NULL) # Return NULL for the entry to be filtered out later
+    }
+  })
+
+  # Remove NULLs and combine the results from the edge_elements list
+  edge_list <- rbindlist(edge_elements[!sapply(edge_elements, is.null)], use.names = TRUE, fill = TRUE)
+
+  # Merge edges with nodes to get label_from
+  edge_list <- merge(edge_list, node_list[, .(id, node_label)], by.x = "id_from", by.y = "id", all.x = TRUE)
+  setnames(edge_list, "node_label", "label_from")
+
+  # Merge again to get label_to
+  edge_list <- merge(edge_list, node_list[, .(id, node_label)], by.x = "id_to", by.y = "id", all.x = TRUE)
+  setnames(edge_list, "node_label", "label_to")
+
+  # Reorder columns to have label_from and label_to on the left
+  edge_list <- edge_list[, .(label_from, label_to, id_from, id_to, dependency_kind)]
+
+  # Create a list of the network to return
+  graph <- list(node_list = node_list, edge_list = edge_list)
+  return(graph)
+}
 
 #' Parse dependencies from Depends
 #'
@@ -215,6 +381,42 @@ parse_r_dependencies <- function(folder_path){
 
 ############## Network Transform ##############
 
+#' Transform Understand Dependencies
+#'
+#' @description Filters the edge list parsed by \code{\link{parse_understand_dependencies}}
+#' to the requested dependency kinds and appends each node's id to its label.
+#'
+#' @param parsed Parsed table from \code{\link{parse_understand_dependencies}}
+#' @param weight_types The dependency kinds to keep, as reported by Understand (e.g. "Call", "Create"). Accepts a single string or a vector
+#' @export
+#' @family edgelists
+transform_understand_dependencies_to_network <- function(parsed, weight_types) {
+
+  nodes <- parsed[["node_list"]]
+  edges <- parsed[["edge_list"]]
+
+  # Append the id to each label, as the same file or class name
+  # may occur again in other parts of the code.
+
+  nodes$node_label <- stringi::stri_c(nodes$node_label,"|",nodes$id)
+
+  edges$label_from <- stringi::stri_c(edges$label_from,"|",edges$id_from)
+  edges$label_to <- stringi::stri_c(edges$label_to,"|",edges$id_to)
+
+  # Filter out by weights if vector provided
+  if (length(weight_types) > 0) {
+    edges <- edges[dependency_kind %in% weight_types]
+  }
+
+  # If filter removed all edges:
+  if (nrow(edges) == 0) {
+    stop("No edges found under weight_types.")
+  }
+
+  # Create a list to return
+  graph <- list(node_list = nodes, edge_list = edges)
+  return(graph)
+}
+
 #' Transform parsed dependencies into a network
 #'
 #' @param depends_parsed A parsed mbox by \code{\link{parse_dependencies}}.
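
The parser above splits each edge's comma-separated `dependency kind` attribute into one row per kind, and the transform then keeps only the kinds listed in `weight_types`. A minimal sketch of those two steps on a toy edge table follows. It uses base R rather than the `data.table` and `stringi` calls in the diff, and the toy ids and kinds are made up for illustration:

```r
# Toy edge table in the shape read from the XML: one row per edge,
# with the "dependency kind" attribute still a comma-separated string.
raw_edges <- data.frame(
  id_from = c("2", "2", "9"),
  id_to = c("9", "11", "11"),
  dependency_kind = c("Call, Create", "Import", "Use, Call"),
  stringsAsFactors = FALSE
)

# Split on a comma followed by optional whitespace (same regex as the parser).
kinds <- strsplit(raw_edges$dependency_kind, ",\\s*")

# Expand to one row per (edge, kind) pair by repeating each edge
# as many times as it has kinds.
n_kinds <- lengths(kinds)
edges <- data.frame(
  id_from = rep(raw_edges$id_from, n_kinds),
  id_to = rep(raw_edges$id_to, n_kinds),
  dependency_kind = unlist(kinds),
  stringsAsFactors = FALSE
)

# Keep only the requested kinds, as the transform does with weight_types.
weight_types <- c("Call", "Create")
filtered <- edges[edges$dependency_kind %in% weight_types, ]
print(filtered)
```

An edge with several kinds therefore appears several times in the edge list, once per kind, which is worth keeping in mind before counting edges.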

diff --git a/_pkgdown.yml b/_pkgdown.yml
@@ -25,8 +25,12 @@ reference:
     Notebooks for examples.
 - contents:
   - parse_dependencies
+  - build_understand_project
+  - export_understand_dependencies
+  - parse_understand_dependencies
   - parse_r_dependencies
   - transform_dependencies_to_network
+  - transform_understand_dependencies_to_network
   - transform_r_dependencies_to_network
 - subtitle: __Gang of Four Patterns__
   desc: >

diff --git a/conf/helix.yml b/conf/helix.yml
@@ -219,7 +219,20 @@ tool:
   #   project_path: ../../rawdata/kaiaulu/git_repo/understand/
   #   # Where the output for the understands analysis is stored
   #   output_path: ../../analysis/kaiaulu/understand/
-
+  understand:
+    # Accepts one language at a time: ada, assembly, c/c++, c#, fortran, java, jovial, delphi/pascal, python, vhdl, basic, javascript
+    code_language: java
+    # Specify which types of Dependencies to keep
+    keep_dependencies_type:
+      - Import
+      - Call
+      - Create
+      - Use
+      - Type GenericArgument
+    # Where the files to analyze should be stored
+    project_path: ../../rawdata/helix/git_repo/helix/
+    # Where the output for the understands analysis is stored
+    output_path: ../../analysis/helix/understand/
 # Analysis Configuration #
 analysis:
   # You can specify the intervals in 2 ways: window, or enumeration
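
The `understand` block added to this configuration file can be read with plain list indexing, in the same style the NEWS entry shows for other settings. A minimal sketch, assuming the `yaml` package is installed; the YAML is inlined as a string here so the example does not depend on the file's location, and the variable names are illustrative only:

```r
library(yaml)

# Inline copy of the `understand` block from the hunk above.
config_text <- "
tool:
  understand:
    code_language: java
    keep_dependencies_type:
      - Import
      - Call
      - Create
      - Use
      - Type GenericArgument
    project_path: ../../rawdata/helix/git_repo/helix/
    output_path: ../../analysis/helix/understand/
"

conf <- yaml.load(config_text)
understand_conf <- conf[["tool"]][["understand"]]

# Values that would feed the build and transform steps.
language <- understand_conf[["code_language"]]
keep_types <- unlist(understand_conf[["keep_dependencies_type"]])
project_path <- understand_conf[["project_path"]]
output_path <- understand_conf[["output_path"]]

print(language)
print(keep_types)
```

With a real checkout, `yaml::read_yaml()` on the configuration file path would replace the inline string.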

diff --git a/conf/kaiaulu.yml b/conf/kaiaulu.yml
@@ -208,21 +208,6 @@ tool:
   #     3. Use sudo ./gradlew build
   #     4. After building, locate the engine class files and specify as the class_folder_path:
   #        in this case they are in: /path/to/junit5/analysis/junit-platform-engine/build/classes/java/main/org/junit/platform/engine/
-  understand:
-    # Accepts one language at a time: ada, assembly, c/c++, c#, fortran, java, jovial, delphi/pascal, python, vhdl, basic, javascript
-    code_language: java
-    # Specify which types of Dependencies to keep
-    keep_dependencies_type:
-      - Import
-      - Call
-      - Create
-      - Use
-      - Type GenericArgument
-    # Where the files to analyze should be stored
-    project_path: ../../rawdata/kaiaulu/git_repo/understand/
-    # Where the output for the understands analysis is stored
-    output_path: ../../analysis/kaiaulu/understand/
-
 
 # Analysis Configuration #
 analysis:

diff --git a/man/build_understand_project.Rd b/man/build_understand_project.Rd