Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend model system #58

Draft
wants to merge 31 commits into
base: develop
Choose a base branch
from
Draft

Extend model system #58

wants to merge 31 commits into from

Conversation

ndaelman-hu
Copy link
Collaborator

@ndaelman-hu ndaelman-hu commented Apr 28, 2024

This PR adds inter-atomic distances and angles, with optional extension to dihedral angles, to the NOMAD schema.


Schema

These distributions are stored in AtomicCell.geometry_distributions: list[GeometryDistribution] as a repeating subsection. Information on the (cutoffs, used to set up the) bins, is stored under AtomicCell.geometry_analysis_cutoffs.

GeometryDistribution objects are effectively histograms by element pair / triple. This is a choice to facilitate searches and visualizations over the distribution themselves, which are normalized to reproduce the same frequency for primitive cells as supercells. Additional advantages include a limited storage consumption. It is, however, not suitable for extracting the exact atomic pairs / triples matching a given distance / angle.

Note: triples / quadruples are defined around a specific geometric primitive (point / line piece). Their notation accounts for this, by placing them up front in the elements list.

Additional functionalities

To keep state within AtomicCell simple and exert control over the stages in calculation, pure Python helper classes with the following functions are added, in line with the Single Responsibility Principle:

  • DistributionFactory scans the combinatorial space of elements and generates their Distributions.
  • Distribution leverages ase for computing the distances / angles for the elemental combo provided. It can also produce a DistributionHistogram of itself.
  • DistributionHistogram contain the histogram version of their combo distribution.

While only DistributionHistogram will eventually be written out to GeometryDistribution, the rest of the data artifacts are retained under AtomicCell during run-time. As such, once computed, they can be used in other computation, or data analysis (conversion to pandas.DataFrame in a separate PR).

Inclusion of the actual computation pipeline, meanwhile, is toggled via AtomicCell.analyze_geometry: bool.
The execution is handled by the normalizer.

Miscellaneous

To facilitate a cleaner check for missing NOMAD attributes, i.e. quantities, subsections, I added the check_attributes decorator. It is used in AtomicCell.set_ase_atoms, as an example. If accepted, I recommend that it be reallocated to utils.py.

To support a more functional pipeline notation, next to the current imperative style, I make all computational functions and normalize return self. I checked with Area D that the latter does not violate the normalize interface specifications.


P.s.: I'm considering adding this PR description as documentation. Let me know your thoughts.

ndaelman and others added 12 commits April 18, 2024 20:51
- Clarify some descriptions in the schema
- Update tests
- TODO: add angles
- Improve combo generation: unique combos only now
- Robustify `AtomicCell.get_distances()`
- Add quantities storing elemental combinations
- TODO: fix error for matrix storage of `str` (coming from elementatl combinations)
- Apply approximations in testing
- Add angles to the tests
- Add first prototyp of `decorator` checks
- TODO: extend testing
- TODO: check `Cell` structure
@ndaelman-hu ndaelman-hu added new feature New feature or request testing Testing additions or fixes labels Apr 28, 2024
@ndaelman-hu ndaelman-hu self-assigned this Apr 28, 2024
@ndaelman-hu ndaelman-hu marked this pull request as draft April 28, 2024 22:21
@JosePizarro3
Copy link
Collaborator

To support a more functional pipeline notation, next to the current imperative style, I make all computational functions and normalize return self. I checked with Area D that the latter does not violate the normalize interface specifications.

I doubt it does not break anything, especially the MetainfoNormalizer. And if so, I wonder whether you are willing to change ALL normalize() functions to now return self. In any case, I think this is not a good practice here: the only reason I'd see to return self is to chain methods one after another, i.e., if you want to do something like:

class Example:

    def method_a(self):
         ...
         return self

    def method_b(self):
         ...
         return self

    def method_c(self):
         ...
         return self


example = Example()
example_2 = example.method_c().method_a().method_c().method_c()

Did you check that the methods (besides normalize, in which I won't do it) you implemented this return, this kind of logic is what we want?

src/nomad_simulations/model_system.py Outdated Show resolved Hide resolved
src/nomad_simulations/model_system.py Outdated Show resolved Hide resolved
src/nomad_simulations/model_system.py Outdated Show resolved Hide resolved
src/nomad_simulations/model_system.py Show resolved Hide resolved
src/nomad_simulations/model_system.py Show resolved Hide resolved
src/nomad_simulations/model_system.py Show resolved Hide resolved
src/nomad_simulations/model_system.py Outdated Show resolved Hide resolved
src/nomad_simulations/model_system.py Outdated Show resolved Hide resolved
positions = Quantity(
type=np.float64,
shape=['n_cell_points', 3],
unit='meter',
description="""
Positions of all the atoms in Cartesian coordinates.
Positions of all the atoms in absolute, Cartesian coordinates.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this for exactly? I think perhaps the name should reflect the difference from the atomic_cell positions.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this for exactly?

With a (periodic) cell defined, you could either express the position vectors in terms of the lattice vector basis set, or the basis set in which the lattice vectors were defined (typically spanning a Cartesian space).
We end up going with the latter, since it simplifies large-scale manipulation. I think it's also the most common convention in ab initio.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question was what is the difference between the positions stored here and the positions stored in atomic_cell? And can/should the attribute name here denote that difference (i.e., not just the description)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ow, yes, I see. Tbh, I think some confusion entered the setup. Cell is both the parent to AtomicCell, but seems to also contain a grid, i.e. n_cell_points. The positions and the rest refer to them, though my correction may be wrong.

Let me take a closer look. Chema did push some changes here. I'll get back to you, after merging those.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JosePizarro3 could you pls clarify the meaning of Cell.n_cell_points, especially wrt to AtomicCell?
I actually added a note for you there when I asked for the review.

Copy link
Collaborator

@JosePizarro3 JosePizarro3 May 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a typo in the description. This should be fixed.

I tried to break the geometric quantities of a cell and the specific atoms sitting in those positions. Cell contains thus the lattice_vectors and the 0-dimensional points in real space, i.e., positions (and also velocities and other info about the Cell), while AtomsState has the info of the atom sitting in that position.

So essentially, these are the same both in Cell and AtomicCell. The only addition of the atomic cell is the atoms state in each of these positions. Whether these positions are occupied by entities which are atoms, this depends on the problem. It could be a more complex thing which is centered around them, and for modeling purposes is described like this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So essentially, these are the same both in Cell and AtomicCell.

Great! That's what I had to know. Then I'll remove positions and velocities from AtomicCell.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a typo in the description.

Where?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a typo in the description.

Where?

Change atoms for points. And you can rename the quantity above from n_cell_points to n_points.

Copy link
Collaborator

@JFRudzinski JFRudzinski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's good for now, just take a look at my comments.

src/nomad_simulations/model_system.py Outdated Show resolved Hide resolved
positions = Quantity(
type=np.float64,
shape=['n_cell_points', 3],
unit='meter',
description="""
Positions of all the atoms in Cartesian coordinates.
Positions of all the atoms in absolute, Cartesian coordinates.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question was what is the difference between the positions stored here and the positions stored in atomic_cell? And can/should the attribute name here denote that difference (i.e., not just the description)?

src/nomad_simulations/model_system.py Show resolved Hide resolved
src/nomad_simulations/model_system.py Outdated Show resolved Hide resolved
@JosePizarro3
Copy link
Collaborator

Let me know if you need a review. I opened a question about returning self, and whether if this is what we want.

@ndaelman-hu
Copy link
Collaborator Author

ndaelman-hu commented May 17, 2024

To support a more functional pipeline notation, next to the current imperative style, I make all computational functions and normalize return self. I checked with Area D that the latter does not violate the normalize interface specifications.

I doubt it does not break anything, especially the MetainfoNormalizer.

I held off on responding, until testing the normalization. No issues there. It's also unclear why any should occur.

the only reason I'd see to return self is to chain methods one after another

Exactly, it's a way of supporting functional pipelines. I mentioned adding this 2x before. It's not really a big deal, but we can go over the points you raised.

In any case, I think this is not a good practice here [...]
Did you check that the methods (besides normalize, in which I won't do it) you implemented this return, this kind of logic is what we want?

There's no special logic going on here, just functional composition. I.e. the behavior of these functions does not change ('), and therefore it does not invalidate any old code.

Regarding "good practices" (''), it's just supporting a more compact, often clearer style of writing code ('''). Unless we want to constrain devs to a very procedural style, there's no issue. Actually, these kinds of pipelines are very common in modules like pandas or react, and form the backbone of functional programming. They are indeed becoming popular again across the board.

(') 1 disclaimer: it does not change the behavior of functions that return None. A lot of attributes are now set by the normalizer, i.e. it calls an object function that returns a result. In some cases, like those decorated by @property this is fine.
However, I've had instances, where parsers need access to a computed property, and there this could be beneficial, see (''').

('') Good practices more so have to do with producing safe code (including CI/CD pipelines) and clear abstractions (supplemented with good docs). While style consistency can be conducive here, the real objective is readability. If a different style produces better readability in some subcases, then it's appropriate there. Python itself supports a mix of procedural and declaritive alternatives.

(''') The nice aspect of writing code this way, is that you explain all the manipulations on top of an object in 1 line. You can now produce complex, but well-defined states without having to trigger the section's entire normalization (which might fail if some data is missing).

And if so, I wonder whether you are willing to change ALL normalize() functions to now return self.

We can try it out here first. It would be good to apply it consistently, yes.
For our own normalizers, this is a small issue. Pretty trivial.

@JosePizarro3
Copy link
Collaborator

I think you can simply leave it out. I really doubt this does not break the MetainfoNormalizer functionality, and I won't lose time on even opening a discussion about this topic.

I also have my doubts on readability, as returning self and chaining methods are rather unreadable. The only benefit is just writing one-line code. But I guess computer scientists would have a different point of view, perhaps, on the fact that class1.method_a(...).method_b(...).method_c(...) is more readable... I mean, just look at the normalize() functions, they are pretty readable thanks to the fact that the functions are not chained one after another...

ndaelman added 6 commits May 17, 2024 17:24
…those in `Cell`.

- Rename `Cell.n_cell_points` to `Cell.n_objects`
- Update tests to be in-line with new elemental order (central atoms in the middle)
- TODO: add testing for dihedral angles
- TODO: decide on binning
@JosePizarro3
Copy link
Collaborator

Btw, it seems to me that this feature would live better in the nomad-analysis plugin, but we can leave it here for now.

ndaelman added 4 commits May 24, 2024 00:21
- Migrate cutoff quantities to `GeometryDistribution`
- Update neighbor_list in pipeline and normalization
- Update docs

- TODO: add bin units
- TODO: verify neighbor_list elements
- TODO: remove `check_attributes`
- clean up `to_ase_atoms`
- ensure that `neighbor_list`'s order matches `ase_atoms`
- Remove `type` in favor of `n_elem`
- Reorganize `AtomicCell.normalize()`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
new feature New feature or request testing Testing additions or fixes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants