Extend model system #58

ndaelman-hu · 2024-04-28T20:55:45Z

This PR adds inter-atomic distances and angles, with optional extension to dihedral angles, to the NOMAD schema.

Schema

These distributions are stored in AtomicCell.geometry_distributions: list[GeometryDistribution] as a repeating subsection. Information on the bins, and on the cutoffs used to set them up, is stored under AtomicCell.geometry_analysis_cutoffs.

GeometryDistribution objects are effectively histograms by element pair / triple. This is a choice to facilitate searches and visualizations over the distributions themselves, which are normalized to reproduce the same frequency for primitive cells as for supercells. An additional advantage is limited storage consumption. The format is, however, not suitable for extracting the exact atomic pairs / triples matching a given distance / angle.
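To illustrate the normalization mentioned above, here is a minimal sketch that builds a pair-distance histogram for a simple cubic lattice. It is not the PR's implementation: the function name `pair_histogram`, the per-atom normalization, the cubic cell and the bin settings are all assumptions chosen for the example.

```python
import itertools
import math


def pair_histogram(positions, cell_length, cutoff, bin_width):
    """Per-atom histogram of pair distances in a periodic cubic cell.

    Hypothetical helper: counts are divided by the number of atoms, which is
    one possible normalization (an assumption, not necessarily the PR's).
    """
    n_bins = math.ceil(cutoff / bin_width)
    counts = [0] * n_bins
    # Number of periodic images needed to cover the cutoff sphere.
    n_images = math.ceil(cutoff / cell_length)
    shifts = list(itertools.product(range(-n_images, n_images + 1), repeat=3))
    for r_i in positions:
        for r_j in positions:
            for shift in shifts:
                image = [x + s * cell_length for x, s in zip(r_j, shift)]
                d = math.dist(r_i, image)
                # Skip the atom itself and anything beyond the cutoff.
                if 0.0 < d < cutoff:
                    counts[int(d / bin_width)] += 1
    return [c / len(positions) for c in counts]


# Simple cubic lattice with lattice constant 1: primitive cell vs. 2x2x2 supercell.
primitive = [(0.0, 0.0, 0.0)]
supercell = [
    (float(i), float(j), float(k))
    for i, j, k in itertools.product(range(2), repeat=3)
]
hist_primitive = pair_histogram(primitive, 1.0, cutoff=1.5, bin_width=0.25)
hist_supercell = pair_histogram(supercell, 2.0, cutoff=1.5, bin_width=0.25)
```

With this normalization both cells yield the same histogram, since every atom sees the same 6 nearest and 12 next-nearest neighbors regardless of how the cell was chosen.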

Note: triples / quadruples are defined around a specific geometric primitive (a point / a line segment). The notation accounts for this by placing the elements that make up this primitive up front in the elements list.
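As a sketch of what "defined around a geometric primitive" can mean in practice, the functions below compute an angle around a central point and an unsigned dihedral around a central line segment, with the central primitive passed first. The function names and argument order are hypothetical, and the PR itself relies on ase for these computations rather than on code like this.

```python
import math


def _sub(a, b):
    return [x - y for x, y in zip(a, b)]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _angle_between(u, v):
    # Clamp to [-1, 1] to guard against rounding just outside acos' domain.
    cos = _dot(u, v) / math.sqrt(_dot(u, u) * _dot(v, v))
    return math.acos(max(-1.0, min(1.0, cos)))


def angle(center, a, b):
    """Angle a-center-b in radians; the central point comes first."""
    return _angle_between(_sub(a, center), _sub(b, center))


def dihedral(bond_start, bond_end, a, d):
    """Unsigned dihedral a-bond_start-bond_end-d in radians, in [0, pi].

    The central line segment comes first. Collinear input (a or d lying on
    the bond axis) is not handled in this sketch.
    """
    axis = _sub(bond_end, bond_start)

    def perpendicular(v):
        # Remove the component of v along the bond axis.
        scale = _dot(v, axis) / _dot(axis, axis)
        return [x - scale * y for x, y in zip(v, axis)]

    return _angle_between(
        perpendicular(_sub(a, bond_start)),
        perpendicular(_sub(d, bond_end)),
    )


right_angle = angle((0, 0, 0), (1, 0, 0), (0, 1, 0))
gauche_like = dihedral((0, 0, 0), (0, 0, 1), (1, 0, 0), (0, 1, 1))
trans = dihedral((0, 0, 0), (0, 0, 1), (1, 0, 0), (-1, 0, 1))
```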

Additional functionalities

To keep state within AtomicCell simple and to exert control over the stages of the calculation, pure Python helper classes with the following functions are added, in line with the Single Responsibility Principle:

- DistributionFactory scans the combinatorial space of elements and generates their Distributions.
- Distribution leverages ase for computing the distances / angles for the elemental combo provided. It can also produce a DistributionHistogram of itself.
- DistributionHistogram contains the histogram version of its combo distribution.
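The "combinatorial space of elements" scanned in the first step can be sketched as follows. This is only an illustration: `unique_combos` is a hypothetical function, not the DistributionFactory API, and the central-element-first ordering for triples is an assumption based on the note in the Schema section.

```python
from itertools import combinations_with_replacement


def unique_combos(elements, n_elem):
    """Unique element pairs (n_elem=2) or triples (n_elem=3).

    Hypothetical helper. For triples, the central element is listed first and
    the two outer elements are treated as an unordered pair.
    """
    species = sorted(set(elements))
    if n_elem == 2:
        return [list(pair) for pair in combinations_with_replacement(species, 2)]
    if n_elem == 3:
        return [
            [center, *outer]
            for center in species
            for outer in combinations_with_replacement(species, 2)
        ]
    raise ValueError("this sketch only covers pairs and triples")


ethane = ["C", "C", "H", "H", "H", "H", "H", "H"]
pairs = unique_combos(ethane, 2)
triples = unique_combos(ethane, 3)
```

For ethane this gives three pairs (C-C, C-H, H-H) and six triples, each of which would then get its own distribution.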

While only DistributionHistogram will eventually be written out to GeometryDistribution, the rest of the data artifacts are retained under AtomicCell at run-time. As such, once computed, they can be used in other computations or in data analysis (conversion to pandas.DataFrame will follow in a separate PR).

Inclusion of the actual computation pipeline, meanwhile, is toggled via AtomicCell.analyze_geometry: bool.
The execution is handled by the normalizer.

Miscellaneous

To facilitate a cleaner check for missing NOMAD attributes, i.e. quantities and subsections, I added the check_attributes decorator. It is used in AtomicCell.set_ase_atoms as an example. If accepted, I recommend moving it to utils.py.
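For readers unfamiliar with this pattern, a decorator of this kind could look roughly like the sketch below. The signature and the choice to raise an AttributeError are assumptions made for illustration; the actual check_attributes in this PR may behave differently (e.g. log and return early). `ToyCell` is a hypothetical stand-in for a NOMAD section.

```python
import functools


def check_attributes(*names):
    """Hypothetical sketch: refuse to run a method if attributes are unset."""

    def decorator(method):
        @functools.wraps(method)
        def wrapper(self, *args, **kwargs):
            # Treat both "not defined" and "set to None" as missing.
            missing = [n for n in names if getattr(self, n, None) is None]
            if missing:
                raise AttributeError(
                    f"{type(self).__name__} is missing: {', '.join(missing)}"
                )
            return method(self, *args, **kwargs)

        return wrapper

    return decorator


class ToyCell:
    """Stand-in for a section with two optional quantities."""

    def __init__(self, positions=None, lattice_vectors=None):
        self.positions = positions
        self.lattice_vectors = lattice_vectors

    @check_attributes("positions", "lattice_vectors")
    def n_points(self):
        return len(self.positions)
```

The check then lives in one place instead of being repeated as `if self.positions is None: ...` at the top of every method.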

To support a more functional pipeline notation, next to the current imperative style, I make all computational functions and normalize return self. I checked with Area D that the latter does not violate the normalize interface specifications.
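To make the difference between the two styles concrete, here is a toy sketch. `ToyPipeline` and its methods are hypothetical and unrelated to the actual classes in this PR; the point is only that returning self allows chaining without changing what each step does.

```python
class ToyPipeline:
    """Hypothetical example: each step mutates state and returns self."""

    def __init__(self, distances):
        self.distances = list(distances)
        self.histogram = None

    def apply_cutoff(self, cutoff):
        self.distances = [d for d in self.distances if d < cutoff]
        return self

    def bin(self, bin_width):
        counts = {}
        for d in self.distances:
            index = int(d / bin_width)
            counts[index] = counts.get(index, 0) + 1
        self.histogram = counts
        return self


# Pipeline style: the steps read left to right on one line.
chained = ToyPipeline([0.9, 1.1, 1.6, 3.2]).apply_cutoff(2.0).bin(0.5)

# Imperative style: still works, the return value is simply ignored.
stepwise = ToyPipeline([0.9, 1.1, 1.6, 3.2])
stepwise.apply_cutoff(2.0)
stepwise.bin(0.5)
```

Both variants end up in the same state, so existing imperative callers are unaffected.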

P.s.: I'm considering adding this PR description as documentation. Let me know your thoughts.


JosePizarro3 · 2024-04-29T05:00:02Z

> To support a more functional pipeline notation, next to the current imperative style, I make all computational functions and normalize return self. I checked with Area D that the latter does not violate the normalize interface specifications.

I doubt it does not break anything, especially the MetainfoNormalizer. And if so, I wonder whether you are willing to change ALL normalize() functions to now return self. In any case, I think this is not a good practice here: the only reason I'd see to return self is to chain methods one after another, i.e., if you want to do something like:

class Example:

    def method_a(self):
        ...
        return self

    def method_b(self):
        ...
        return self

    def method_c(self):
        ...
        return self


example = Example()
example_2 = example.method_c().method_a().method_c().method_c()

Did you check that the methods (besides normalize, in which I won't do it) you implemented this return, this kind of logic is what we want?


JFRudzinski · 2024-04-29T08:58:07Z

src/nomad_simulations/model_system.py

    positions = Quantity(
        type=np.float64,
        shape=['n_cell_points', 3],
        unit='meter',
        description="""
-        Positions of all the atoms in Cartesian coordinates.
+        Positions of all the atoms in absolute, Cartesian coordinates.


What is this for exactly? I think perhaps the name should reflect the difference from the atomic_cell positions.

> What is this for exactly?

With a (periodic) cell defined, you could either express the position vectors in terms of the lattice vector basis set, or the basis set in which the lattice vectors were defined (typically spanning a Cartesian space).
We ended up going with the latter, since it simplifies large-scale manipulation. I think it's also the most common convention in ab initio codes.
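For reference, the two representations are related by a linear map through the lattice vectors. A minimal sketch, assuming the lattice vectors are stored as rows in Cartesian coordinates (`fractional_to_cartesian` is a hypothetical helper, not part of the schema):

```python
def fractional_to_cartesian(fractional, lattice_vectors):
    """Convert positions from the lattice-vector basis to Cartesian.

    Assumes lattice_vectors[k] is the k-th lattice vector, given in
    Cartesian coordinates, so that r = f1 * a1 + f2 * a2 + f3 * a3.
    """
    return [
        [sum(f[k] * lattice_vectors[k][axis] for k in range(3)) for axis in range(3)]
        for f in fractional
    ]


# Tetragonal cell: 2 x 2 x 4 (arbitrary length unit).
tetragonal = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 4.0]]
cartesian = fractional_to_cartesian([[0.5, 0.5, 0.25]], tetragonal)

# Non-orthogonal cell: the second lattice vector is sheared along x.
sheared = [[1.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]]
cartesian_sheared = fractional_to_cartesian([[0.0, 1.0, 0.0]], sheared)
```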

The question was what is the difference between the positions stored here and the positions stored in atomic_cell? And can/should the attribute name here denote that difference (i.e., not just the description)?

Ow, yes, I see. Tbh, I think some confusion entered the setup. Cell is the parent of AtomicCell, but also seems to contain a grid, i.e. n_cell_points. The positions and the rest refer to those points, though my correction may be wrong.

Let me take a closer look. Chema did push some changes here. I'll get back to you, after merging those.

@JosePizarro3 could you pls clarify the meaning of Cell.n_cell_points, especially wrt AtomicCell?
I actually added a note for you there when I asked for the review.

There is a typo in the description. This should be fixed.

I tried to separate the geometric quantities of a cell from the specific atoms sitting at those positions. Cell thus contains the lattice_vectors and the 0-dimensional points in real space, i.e., positions (and also velocities and other info about the Cell), while AtomsState has the info on the atom sitting at that position.

So essentially, these are the same both in Cell and AtomicCell. The only addition in AtomicCell is the atoms state at each of these positions. Whether these positions are occupied by atoms depends on the problem: it could be a more complex entity centered around them that, for modeling purposes, is described like this.

> So essentially, these are the same both in Cell and AtomicCell.

Great! That's what I had to know. Then I'll remove positions and velocities from AtomicCell.

> There is a typo in the description.

Where?


Change "atoms" to "points" in the description. You can also rename the quantity above from n_cell_points to n_points.


JFRudzinski

It's good for now, just take a look at my comments.


JFRudzinski · 2024-05-16T12:03:57Z


The question was what is the difference between the positions stored here and the positions stored in atomic_cell? And can/should the attribute name here denote that difference (i.e., not just the description)?


JosePizarro3 · 2024-05-16T13:15:29Z

Let me know if you need a review. I opened a question about returning self, and whether this is what we want.


ndaelman-hu · 2024-05-17T14:54:29Z

> To support a more functional pipeline notation, next to the current imperative style, I make all computational functions and normalize return self. I checked with Area D that the latter does not violate the normalize interface specifications.

> I doubt it does not break anything, especially the MetainfoNormalizer.

I held off on responding, until testing the normalization. No issues there. It's also unclear why any should occur.

> the only reason I'd see to return self is to chain methods one after another

Exactly, it's a way of supporting functional pipelines. I mentioned adding this 2x before. It's not really a big deal, but we can go over the points you raised.

> In any case, I think this is not a good practice here [...]
> Did you check that the methods (besides normalize, in which I won't do it) you implemented this return, this kind of logic is what we want?

There's no special logic going on here, just functional composition. I.e. the behavior of these functions does not change ('), and therefore it does not invalidate any old code.

Regarding "good practices" (''), it's just supporting a more compact, often clearer style of writing code ('''). Unless we want to constrain devs to a very procedural style, there's no issue. Actually, these kinds of pipelines are very common in modules like pandas or react, and form the backbone of functional programming. They are indeed becoming popular again across the board.

(') 1 disclaimer: it does not change the behavior of functions that return None. A lot of attributes are now set by the normalizer, i.e. it calls an object function that returns a result. In some cases, like those decorated by @property this is fine.
However, I've had instances, where parsers need access to a computed property, and there this could be beneficial, see (''').

('') Good practices have more to do with producing safe code (including CI/CD pipelines) and clear abstractions (supplemented with good docs). While style consistency can be conducive here, the real objective is readability. If a different style produces better readability in some subcases, then it's appropriate there. Python itself supports a mix of procedural and declarative alternatives.

(''') The nice aspect of writing code this way, is that you explain all the manipulations on top of an object in 1 line. You can now produce complex, but well-defined states without having to trigger the section's entire normalization (which might fail if some data is missing).

> And if so, I wonder whether you are willing to change ALL normalize() functions to now return self.

We can try it out here first. It would be good to apply it consistently, yes.
For our own normalizers, this is a small issue. Pretty trivial.

JosePizarro3 · 2024-05-17T15:13:51Z

I think you can simply leave it out. I really doubt this does not break the MetainfoNormalizer functionality, and I won't lose time on even opening a discussion about this topic.

I also have my doubts on readability, as returning self and chaining methods are rather unreadable. The only benefit is just writing one-line code. But I guess computer scientists would have a different point of view, perhaps, on the fact that class1.method_a(...).method_b(...).method_c(...) is more readable... I mean, just look at the normalize() functions, they are pretty readable thanks to the fact that the functions are not chained one after another...


JosePizarro3 · 2024-05-18T05:31:48Z

Btw, it seems to me that this feature would live better in the nomad-analysis plugin, but we can leave it here for now.


ndaelman and others added 12 commits April 18, 2024 20:51

- Add first implementation of bond length calculation + tests

92785b9

- TODO: move to own subsection

Fix typos

e60bceb

Fix typos

6ddc77c

Improve pbc default system\

a1cd476

- Accomodate schema for hosting interatomic distances

68b5d32

- Clarify some descriptions in the schema - Update tests - TODO: add angles

- Add angles

8e3e859

- Improve combo generation: unique combos only now - Robustify `AtomicCell.get_distances()` - Add quantities storing elemental combinations - TODO: fix error for matrix storage of `str` (coming from elementatl combinations)

- Fix shape issue with elementals

c05e46e

- Apply approximations in testing - Add angles to the tests - Add first prototyp of `decorator` checks

- Restructure functionalities into a chain of factory classes

12246f1

- TODO: extend testing - TODO: check `Cell` structure

Fix some mistakes

e63c987

- Work out distance and angle values tests for ethane

479725a

- Fix minor bugs in src

Add test for DistributionFactory

d0506a3

- Add tests for DistributionHistogram

81b3ece

- Fix bugs in src

ndaelman-hu added the `new feature` and `testing` labels Apr 28, 2024

ndaelman-hu requested review from JFRudzinski, Bernadette-Mohr and JosePizarro3 April 28, 2024 20:55

ndaelman-hu self-assigned this Apr 28, 2024

Improve analyze_geometry check

897ccb6

ndaelman-hu marked this pull request as draft April 28, 2024 22:21

JFRudzinski reviewed Apr 29, 2024


Bernadette-Mohr reviewed May 3, 2024


ndaelman added 4 commits May 13, 2024 10:29

Apply reviewer feedback

5440b4d

Rename GeometryDistribution.elements to `GeometryDistribution.atom_labels` as per reviewer request

fd1bff9

- Split out GeometryDistribution in to subcases (resolving reviewer feedback) - Add dihedrals to computation

1539c15

Correct typing (mypy)

333f15c

ndaelman-hu requested review from JFRudzinski and Bernadette-Mohr May 13, 2024 11:02

JFRudzinski approved these changes May 16, 2024


ndaelman added 4 commits May 17, 2024 15:59

- Merge branch 'develop' into extend_model_system

45e12aa

- Resolve merge conflicts in `AtomicCell.to_ase_atoms()`

Fix indentation error AtomicCell.to_ase_atoms()

e8f59f8

Move bond_list from Cell ro AtomicCell upon reviewer request

208c2ba

Reapply @check_attributes to to_ase_atoms

a398857

ndaelman added 6 commits May 17, 2024 17:24

- Remove positions and velocities from AtomicCell, in favor of those in `Cell`. - Rename `Cell.n_cell_points` to `Cell.n_objects`

7265b53

Add test for GeometryDistribution (instantiation and normalization)

4333a53

- Add test case to test_distribution_histogram

2b9c287

- Update tests to be in-line with new elemental order (central atoms in the middle)

Change GeometricDistribution units for angles from degrees to radian

a3ac6a3

- Add documentation

9023837

- TODO: add testing for dihedral angles - TODO: decide on binning

Fix mypy typing nameclash (gosh, mypy sucks)

4da9060

ndaelman added 4 commits May 24, 2024 00:21

- Build in non-stored default for bins in archive storage

6a6305a

- Migrate cutoff quantities to `GeometryDistribution` - Update neighbor_list in pipeline and normalization - Update docs - TODO: add bin units - TODO: verify neighbor_list elements - TODO: remove `check_attributes`

- Remove check_attrributes

4c2bf8c

- clean up `to_ase_atoms`

- add bin units in pipeline classes

b498bd7

- ensure that `neighbor_list`'s order matches `ase_atoms`

- Fix mypy errors

4e92474

- Remove `type` in favor of `n_elem` - Reorganize `AtomicCell.normalize()`
