Filter external repos (#396)
* Add WIP config filter

* Add filter option for external repos

* Update for new filter option

* Move components re-ordering to VariableCode

* Align RegionCodeList creation with CodeList

* Update tests

* Add filtering test

* Update docs

* Update poetry.lock to poetry 1.8.3

* Add comment

* Make ruff

* Add test for RegionCodeList

* Fix bug from rebase

* Update paths for test files

* Remove necessary "name" attribute for repositories

* Remove no longer necessary name attribute

* Adjust docs

* Add Code.depth property

* Change string pattern matching to regex

* Apply suggestions from code review

Co-authored-by: Daniel Huppmann <[email protected]>

* Rename class

* Remove name keyword

* Add test for importing model mapping with missing regions

---------

Co-authored-by: Daniel Huppmann <[email protected]>
phackstock and danielhuppmann authored Nov 20, 2024
1 parent 3a76cba commit 701e68c
Showing 18 changed files with 343 additions and 68 deletions.
40 changes: 38 additions & 2 deletions docs/user_guide/config.rst
@@ -57,11 +57,47 @@ multiple external repositories can be used as the example below illustrates for
    mappings:
      repository: common-definitions
The value in *definitions.region.repository* needs to reference the repository in the
*repositories* section.
The value in *definitions.region.repository* can be a list or a single value.

For model mappings the process is analogous using *mappings.repository*.

Filter code lists imported from external repositories
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Importing the entirety of an external repository such as common-definitions is more
than most projects need, so the imported code list can be filtered using the
``include`` and ``exclude`` keywords. Each keyword takes a list of filters that are
applied to the code list from the given repository.

A filter can match on any attribute of a code:

.. code:: yaml

    repositories:
      common-definitions:
        url: https://github.com/IAMconsortium/common-definitions.git/
    definitions:
      variable:
        repository:
          name: common-definitions
          include:
            - name: [Primary Energy*, Final Energy*]
            - name: "Population*"
              tier: 1
          exclude:
            - name: "Final Energy|Industry*"
              depth: 2

If filters are used for a repository, the *name* attribute **must be used** to
identify that repository.

In the example above we are including:

1. All variables starting with *Primary Energy* or *Final Energy*
2. All variables starting with *Population* **and** with the tier attribute equal to 1

A code is included if it matches at least one of the filters, and it matches a filter
only if all attributes given in that filter match.

From this list we are then **excluding** all variables that match "Final
Energy|Industry\*" and have a depth of 2 (meaning that their names contain two pipe
"|" characters).

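
The selection logic described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the package's implementation: codes are represented as plain dictionaries, the helper names `wildcard_match`, `value_matches`, `matches` and `apply_filters` are made up for this example, `*` is assumed to be the only wildcard, and a code that lacks an attribute is assumed never to match a filter on that attribute.

```python
import re


def wildcard_match(pattern: str, value: str) -> bool:
    """Match `value` against `pattern`, where "*" is the only wildcard."""
    # escape everything (including "|"), then turn the escaped "*" into ".*"
    regex = re.escape(pattern).replace(r"\*", ".*")
    return re.fullmatch(regex, value) is not None


def value_matches(expected, actual) -> bool:
    """Compare one filter value (str, int or list of those) with a code attribute."""
    if isinstance(expected, list):
        # a list of values means "any of these"
        return any(value_matches(item, actual) for item in expected)
    if isinstance(expected, str) and isinstance(actual, str):
        return wildcard_match(expected, actual)
    return expected == actual


def matches(code: dict, flt: dict) -> bool:
    """A code matches a filter only if every attribute in the filter matches."""
    return all(
        code.get(attribute) is not None and value_matches(expected, code[attribute])
        for attribute, expected in flt.items()
    )


def apply_filters(codes: list, include: list, exclude: list) -> list:
    """Keep codes matching any include filter, then drop those matching any exclude filter."""
    selected = [c for c in codes if any(matches(c, f) for f in include)]
    return [c for c in selected if not any(matches(c, f) for f in exclude)]


# illustrative codes; "depth" is the number of "|" characters in the name
codes = [
    {"name": name, "tier": tier, "depth": name.count("|")}
    for name, tier in [
        ("Primary Energy|Coal", 2),
        ("Final Energy|Industry|Steel", 2),
        ("Final Energy|Industry", 1),
        ("Population", 1),
        ("Population|Urban", 2),
        ("GDP|PPP", 1),
    ]
]
include = [
    {"name": ["Primary Energy*", "Final Energy*"]},
    {"name": "Population*", "tier": 1},
]
exclude = [{"name": "Final Energy|Industry*", "depth": 2}]

selected_names = [c["name"] for c in apply_filters(codes, include, exclude)]
print(selected_names)
```

With these inputs, *Final Energy|Industry|Steel* is dropped because it matches the exclude pattern and has a depth of 2, while *Final Energy|Industry* is kept because its depth is 1.
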
Adding countries to the region codelist
---------------------------------------
18 changes: 17 additions & 1 deletion nomenclature/code.py
@@ -106,6 +106,10 @@ def flattened_dict_serialized(self):
for key, value in self.flattened_dict.items()
}

@property
def depth(self) -> int:
return self.name.count("|")

def replace_tag(self, tag: str, target: "Code") -> "Code":
"""Return a new instance with tag applied
@@ -188,7 +192,7 @@ class VariableCode(Code):
)
method: str | None = None
check_aggregate: bool | None = Field(default=False, alias="check-aggregate")
components: Union[List[str], List[Dict[str, List[str]]]] | None = None
components: Union[List[str], Dict[str, list[str]]] | None = None
drop_negative_weights: bool | None = None
model_config = ConfigDict(populate_by_name=True)

@@ -204,6 +208,18 @@ def deserialize_json(cls, v):
def convert_none_to_empty_string(cls, v):
return v if v is not None else ""

@field_validator("components", mode="before")
def cast_variable_components_args(cls, v):
"""Cast "components" list of single-key dicts to a dictionary"""

# translate a list of single-key dictionaries to a simple dictionary
if isinstance(v, list) and v and isinstance(v[0], dict):
comp = {}
for val in v:
comp.update(val)
return comp
return v

@field_serializer("unit")
def convert_str_to_none_for_writing(self, v):
return v if v != "" else None
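
The two additions to this file can be illustrated without pydantic. The following is a minimal sketch that uses plain functions in place of the `depth` property and the `components` validator; the function names `depth` and `merge_components` and the example values are hypothetical and chosen for this illustration only. The sketch checks that the list is non-empty before inspecting its first element.

```python
def depth(name: str) -> int:
    """Number of "|" separators in a code name, i.e. its nesting level."""
    return name.count("|")


def merge_components(components):
    """Merge a list of single-key dictionaries into one dictionary.

    Any other input (None, a list of strings, a dictionary) is returned unchanged.
    """
    if isinstance(components, list) and components and isinstance(components[0], dict):
        merged = {}
        for item in components:
            merged.update(item)
        return merged
    return components


top_level = depth("Final Energy")
nested = depth("Final Energy|Industry|Steel")
merged = merge_components([{"Fuels": ["Coal", "Gas"]}, {"Sectors": ["Industry"]}])
unchanged = merge_components(["Coal", "Gas"])
print(top_level, nested, merged, unchanged)
```
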
49 changes: 18 additions & 31 deletions nomenclature/codelist.py
@@ -233,13 +233,12 @@ def from_directory(
for repo in getattr(
config.definitions, name.lower(), CodeListConfig()
).repositories:
code_list.extend(
cls._parse_codelist_dir(
config.repositories[repo].local_path / "definitions" / name,
file_glob_pattern,
repo,
)
repository_code_list = cls._parse_codelist_dir(
config.repositories[repo.name].local_path / "definitions" / name,
file_glob_pattern,
repo.name,
)
code_list.extend(repo.filter_list_of_codes(repository_code_list))
errors = ErrorCollector()
mapping: Dict[str, Code] = {}
for code in code_list:
@@ -591,21 +590,6 @@ def check_weight_in_vars(cls, v):
)
return v

@field_validator("mapping")
@classmethod
def cast_variable_components_args(cls, v):
"""Cast "components" list of dicts to a codelist"""

# translate a list of single-key dictionaries to a simple dictionary
for var in v.values():
if var.components and isinstance(var.components[0], dict):
comp = {}
for val in var.components:
comp.update(val)
v[var.name].components = comp

return v

def vars_default_args(self, variables: List[str]) -> List[VariableCode]:
"""return subset of variables which does not feature any special pyam
aggregation arguments and where skip_region_aggregation is False"""
@@ -758,21 +742,25 @@ def from_directory(

# importing from an external repository
for repo in config.definitions.region.repositories:
repo_path = config.repositories[repo].local_path / "definitions" / "region"
repo_path = (
config.repositories[repo.name].local_path / "definitions" / "region"
)

code_list = cls._parse_region_code_dir(
code_list,
repo_list_of_codes = cls._parse_region_code_dir(
repo_path,
file_glob_pattern,
repository=repo,
repository=repo.name,
)
code_list = cls._parse_and_replace_tags(
code_list, repo_path, file_glob_pattern
repo_list_of_codes = cls._parse_and_replace_tags(
repo_list_of_codes, repo_path, file_glob_pattern
)
code_list.extend(repo.filter_list_of_codes(repo_list_of_codes))

# parse from current repository
code_list = cls._parse_region_code_dir(code_list, path, file_glob_pattern)
code_list = cls._parse_and_replace_tags(code_list, path, file_glob_pattern)
local_code_list = cls._parse_region_code_dir(path, file_glob_pattern)
code_list.extend(
cls._parse_and_replace_tags(local_code_list, path, file_glob_pattern)
)

# translate to mapping
mapping: Dict[str, RegionCode] = {}
@@ -808,13 +796,12 @@ def hierarchy(self) -> List[str]:
@classmethod
def _parse_region_code_dir(
cls,
code_list: List[Code],
path: Path,
file_glob_pattern: str = "**/*",
repository: str | None = None,
) -> List[RegionCode]:
""""""

code_list: List[RegionCode] = []
for yaml_file in (
f
for f in path.glob(file_glob_pattern)
131 changes: 108 additions & 23 deletions nomenclature/config.py
@@ -1,6 +1,7 @@
from enum import Enum
from pathlib import Path
from typing import Annotated, Optional
from typing import Any
import re

import yaml
from git import Repo
@@ -11,29 +12,94 @@
field_validator,
model_validator,
ConfigDict,
BeforeValidator,
)
from nomenclature.code import Code
from pyam.str import escape_regexp


class CodeListFromRepository(BaseModel):
name: str
include: list[dict[str, Any]] = [{"name": "*"}]
exclude: list[dict[str, Any]] = Field(default_factory=list)

def filter_function(self, code: Code, filter: dict[str, Any], keep: bool):
# if is list -> recursive
# if is str -> escape all special characters except "*" and use a regex
# if is int -> match exactly
# if is None -> Attribute does not exist therefore does not match
def check_attribute_match(code_value, filter_value):
if isinstance(filter_value, int):
return code_value == filter_value
if isinstance(filter_value, str):
pattern = re.compile(escape_regexp(filter_value) + "$")
return re.match(pattern, code_value) is not None
if isinstance(filter_value, list):
return any(
check_attribute_match(code_value, value) for value in filter_value
)
if filter_value is None:
return False
raise ValueError("Something went wrong with the filtering")

filter_match = all(
check_attribute_match(getattr(code, attribute, None), value)
for attribute, value in filter.items()
)
if keep:
return filter_match
else:
return not filter_match

def filter_list_of_codes(self, list_of_codes: list[Code]) -> list[Code]:
# include first
filter_result = [
code
for code in list_of_codes
if any(
self.filter_function(
code,
filter,
keep=True,
)
for filter in self.include
)
]

if self.exclude:
filter_result = [
code
for code in filter_result
if all(
self.filter_function(code, filter, keep=False)
for filter in self.exclude
)
]


def convert_to_set(v: str | list[str] | set[str]) -> set[str]:
match v:
case set(v):
return v
case list(v):
return set(v)
case str(v):
return {v}
case _:
raise TypeError("`repositories` must be of type str, list or set.")
return filter_result


class CodeListConfig(BaseModel):
dimension: str | None = None
repositories: Annotated[set[str], BeforeValidator(convert_to_set)] = Field(
default_factory=set, alias="repository"
repositories: list[CodeListFromRepository] = Field(
default_factory=list, alias="repository"
)
model_config = ConfigDict(populate_by_name=True)

@field_validator("repositories", mode="before")
@classmethod
def add_name_if_necessary(cls, v: list):
return [
{"name": repository} if isinstance(repository, str) else repository
for repository in v
]

@field_validator("repositories", mode="before")
@classmethod
def convert_to_list_of_repos(cls, v):
if not isinstance(v, list):
return [v]
return v

@property
def repository_dimension_path(self) -> str:
return f"definitions/{self.dimension}"
@@ -122,10 +188,10 @@ class DataStructureConfig(BaseModel):
"""

model: Optional[CodeListConfig] = Field(default_factory=CodeListConfig)
scenario: Optional[CodeListConfig] = Field(default_factory=CodeListConfig)
region: Optional[RegionCodeListConfig] = Field(default_factory=RegionCodeListConfig)
variable: Optional[CodeListConfig] = Field(default_factory=CodeListConfig)
model: CodeListConfig = Field(default_factory=CodeListConfig)
scenario: CodeListConfig = Field(default_factory=CodeListConfig)
region: RegionCodeListConfig = Field(default_factory=RegionCodeListConfig)
variable: CodeListConfig = Field(default_factory=CodeListConfig)

@field_validator("model", "scenario", "region", "variable", mode="before")
@classmethod
@@ -141,12 +207,30 @@ def repos(self) -> dict[str, str]:
}


class MappingRepository(BaseModel):
name: str


class RegionMappingConfig(BaseModel):
repositories: Annotated[set[str], BeforeValidator(convert_to_set)] = Field(
default_factory=set, alias="repository"
repositories: list[MappingRepository] = Field(
default_factory=list, alias="repository"
)
model_config = ConfigDict(populate_by_name=True)

@field_validator("repositories", mode="before")
@classmethod
def add_name_if_necessary(cls, v: list):
return [
{"name": repository} if isinstance(repository, str) else repository
for repository in v
]

@field_validator("repositories", mode="before")
def convert_to_set_of_repos(cls, v):
if not isinstance(v, list):
return [v]
return v


class DimensionEnum(str, Enum):
model = "model"
@@ -172,8 +256,9 @@ def check_definitions_repository(
mapping_repos = {"mappings": v.mappings.repositories} if v.mappings else {}
repos = {**v.definitions.repos, **mapping_repos}
for use, repositories in repos.items():
if repositories - v.repositories.keys():
raise ValueError((f"Unknown repository {repositories} in '{use}'."))
repository_names = [repository.name for repository in repositories]
if unknown_repos := repository_names - v.repositories.keys():
raise ValueError((f"Unknown repository {unknown_repos} in '{use}'."))
return v

def fetch_repos(self, target_folder: Path):
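
The configuration handling in this file accepts a repository reference as a single name, a mapping with a *name* key, or a list of either, and then checks that every referenced name is declared under *repositories*. The following is a minimal sketch of that normalisation using only the standard library. The function names `normalize_repositories` and `unknown_repositories` are hypothetical, as is the repository name `legacy-definitions`; the code in the commit performs these steps in pydantic validators.

```python
def normalize_repositories(value) -> list:
    """Return a list of {"name": ...} mappings for any accepted input shape."""
    # a single entry (string or mapping) becomes a one-element list
    entries = value if isinstance(value, list) else [value]
    # bare strings are shorthand for a mapping that only carries a name
    return [{"name": e} if isinstance(e, str) else dict(e) for e in entries]


def unknown_repositories(entries: list, declared: dict) -> set:
    """Names referenced by `entries` that are missing from `declared`."""
    return {entry["name"] for entry in entries} - set(declared)


declared = {
    "common-definitions": {
        "url": "https://github.com/IAMconsortium/common-definitions.git/"
    }
}

single = normalize_repositories("common-definitions")
filtered = normalize_repositories(
    {"name": "common-definitions", "include": [{"name": "Population*"}]}
)
# "legacy-definitions" is a made-up name that is not declared above
mixed = normalize_repositories(["common-definitions", {"name": "legacy-definitions"}])

missing = unknown_repositories(mixed, declared)
if missing:
    print(f"Unknown repository {missing} in 'definitions'.")
```

Normalising every input shape to a list of mappings means that the rest of the code only has to deal with one representation, whether or not filters are given.
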
2 changes: 1 addition & 1 deletion nomenclature/processor/region.py
@@ -486,7 +486,7 @@ def from_directory(cls, path: DirectoryPath, dsd: DataStructureDefinition):
mapping_files.extend(
f
for f in (
dsd.config.repositories[repository].local_path / "mappings"
dsd.config.repositories[repository.name].local_path / "mappings"
).glob("**/*")
if f.suffix in {".yaml", ".yml"}
)
2 changes: 1 addition & 1 deletion poetry.lock

20 changes: 20 additions & 0 deletions tests/data/config/external_repo_filters.yaml
@@ -0,0 +1,20 @@
repositories:
  common-definitions:
    url: https://github.com/IAMconsortium/common-definitions.git/
definitions:
  variable:
    repository:
      name: common-definitions
      include:
        - name: [Primary Energy*, Final Energy*]
        - name: "Population*"
          tier: 1
      exclude:
        - name: "Final Energy|*|*"
  region:
    repository:
      name: common-definitions
      include:
        - hierarchy: R5
      exclude:
        - name: Other (R5)