Cep27 #259

Open
wants to merge 6 commits into
base: source
287 changes: 287 additions & 0 deletions source/cep/cep27.rst
@@ -0,0 +1,287 @@
CEP 27 - |Cyclus| Database Restructuring
********************************************

:CEP: 27
:Title: |Cyclus| Database Restructuring
:Last-Modified: 2017-09-11
:Author: Jin whan Bae & Anthony Scopatz
:Status: Draft
:Type: Standards Track
:Created: 2013-09-11

Abstract
============
This CEP proposes to restructure the |cyclus| output database in order to
reduce the number of tables and the redundancy of data, and ultimately to reduce the number
of ``joins`` required for data analysis. Doing so would reduce the computing time
for end-user analysis and allow for a clearer, more concise output database.


Motivation
==========
The current output database requires the user to join multiple tables to acquire
meaningful material data, such as quantity and composition. This causes long
analysis computing times and confusion for the user.


Rationale
=========
The proposed restructure aims to reduce the number of tables the user has to query
for analysis. This can be done by two methods:

1. Combine redundant tables
2. Reduce a table (the ``Compositions`` table) to a single column holding a variable-type map.

Additionally, this CEP proposes to store both **Inventories** and **Transactions**
by default. Either table may be backed out of the other (with additional
information coming from **Materials**, etc.). However, this backing-out process has proven
extraordinarily expensive, inflating the number of operations needed to reconstruct the
table that is not present by millions to billions. Even for small databases, this has proven prohibitive.

While storing both **Inventories** and **Transactions** may seem inefficient, consider
that:

* Data storage is cheap,
* Material inventories are what most analysis tasks require, and
* This is precisely double-entry bookkeeping, as applied to the nuclear fuel cycle.

Double-entry bookkeeping was a huge innovation in accounting systems. When implemented
correctly and without fraud, it leads to a self-consistent system. This enables errors
to be discovered and corrected earlier. This CEP argues that |Cyclus| should provide
the information needed to verify the mass balances, if requested.
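
As a rough illustration of the double-entry idea, the sketch below cross-checks inventory records against transfers. It is not part of the CEP: the function name, the tuple layouts, and the premise that each transaction's quantity has already been looked up from the resource tables are all assumptions made for this example. It also ignores creation, decay, and transmutation inside an agent, so it only applies to agents that merely hold material.

```python
from collections import defaultdict


def check_mass_balance(inventories, transactions, tol=1e-9):
    """Return (agent, time) pairs whose inventory change disagrees with net transfers.

    inventories:  iterable of (agent_id, time, quantity)       -- hypothetical layout
    transactions: iterable of (sender_id, receiver_id, time, quantity)
    """
    net = defaultdict(float)  # (agent, time) -> received minus sent
    for sender, receiver, time, qty in transactions:
        net[(sender, time)] -= qty
        net[(receiver, time)] += qty

    held = {(agent, time): qty for agent, time, qty in inventories}
    mismatches = []
    for (agent, time), qty in sorted(held.items()):
        previous = held.get((agent, time - 1))
        if previous is None:
            continue  # no earlier record to compare against
        expected = previous + net.get((agent, time), 0.0)
        if abs(expected - qty) > tol:
            mismatches.append((agent, time))
    return mismatches


# Agent 1 ships 3.0 to agent 2 at time 1; agent 2's record at time 2 is inconsistent.
inventories = [(1, 0, 10.0), (1, 1, 7.0), (2, 0, 0.0), (2, 1, 3.0), (2, 2, 5.0)]
transactions = [(1, 2, 1, 3.0)]
print(check_mass_balance(inventories, transactions))
```

With both tables stored, a check of this kind needs only a single pass over each table rather than a reconstruction of one table from the other.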


Specification \& Implementation
===============================
The following tables in the current output database are considered for editing:

1. Compositions
2. Transactions
3. Recipes
4. ExplicitInventory
5. ExplicitInventoryCompact
6. Info
7. InfoExplicitInv
8. ResCreators
9. Resources


Material and Product
--------------------

Currently, both **Material** and **Product** are stored in the **Resources** table.
The internal state of **Material** is stored in the **Compositions** table, and
the internal state of **Product** is stored in the **Products** table.
This requires the user to perform joins to acquire the internal state
of the resources.

We can avoid unnecessary joins by creating **Materials** and
**Products** tables, each with the internal state (composition or quality)
as a column.

In short, we propose to replace the **Compositions**, **Products**, and
**Resources** tables with **Materials** and **Products** tables. In the
process, the **QualId** column would be removed.

Currently:

============ ==========
Resources
-----------------------
Column       Type
============ ==========
SimId        uuid
ResourceId   int
ObjId        int
Type         string
TimeCreated  int
Quantity     double
Units        string
QualId       int
Parent1      int
Parent2      int
============ ==========



============ ==========
Products
-----------------------
Column       Type
============ ==========
SimId        uuid
QualId       int
Quality      string
============ ==========




============ ==========
Compositions
-----------------------
Column       Type
============ ==========
SimId        uuid
QualId       int
NucId        int
MassFrac     double
============ ==========

Would be restructured to:


============ ===============
Materials
----------------------------
Column       Type
============ ===============
SimId        uuid
ResourceId   int
ObjId        int
TimeCreated  int
Parent1      int
Parent2      int
Units        string
Quantity     double
Composition  map<int,double>
============ ===============

Where the composition column would map <NucId, MassFrac>
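
The restructuring described here can be sketched as a small conversion routine. This is an illustration only, not code from |Cyclus|: the function name ``build_materials``, the use of plain dictionaries for rows, and the ``"sim-1"`` placeholder for the uuid are all assumptions.

```python
from collections import defaultdict


def build_materials(resources, compositions):
    """Fold per-nuclide Compositions rows into a map on each Material row."""
    # (SimId, QualId) -> {NucId: MassFrac}
    comp_by_qual = defaultdict(dict)
    for row in compositions:
        comp_by_qual[(row["SimId"], row["QualId"])][row["NucId"]] = row["MassFrac"]

    materials = []
    for res in resources:
        if res["Type"] != "Material":
            continue  # Product rows would go to the Products table instead
        # Drop the columns that the proposed Materials table no longer has.
        new_row = {k: v for k, v in res.items() if k not in ("Type", "QualId")}
        new_row["Composition"] = dict(comp_by_qual[(res["SimId"], res["QualId"])])
        materials.append(new_row)
    return materials


resources = [
    {"SimId": "sim-1", "ResourceId": 1, "ObjId": 1, "Type": "Material",
     "TimeCreated": 0, "Quantity": 2.0, "Units": "kg", "QualId": 7,
     "Parent1": 0, "Parent2": 0},
    {"SimId": "sim-1", "ResourceId": 2, "ObjId": 2, "Type": "Product",
     "TimeCreated": 0, "Quantity": 1.0, "Units": "NONE", "QualId": 3,
     "Parent1": 0, "Parent2": 0},
]
compositions = [
    {"SimId": "sim-1", "QualId": 7, "NucId": 922350000, "MassFrac": 0.04},
    {"SimId": "sim-1", "QualId": 7, "NucId": 922380000, "MassFrac": 0.96},
]
materials = build_materials(resources, compositions)
print(materials[0]["Composition"])
```

After the conversion, each material row carries its own composition, so reading a resource's isotopics no longer requires a join on **QualId**.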

============ ==========
Products
-----------------------
Column       Type
============ ==========
SimId        uuid
ResourceId   int
ObjId        int
TimeCreated  int
Parent1      int
Parent2      int
Units        string
Quantity     double
Quality      string
============ ==========

Also, since **QualId** is removed, the **Recipes** table
needs to be edited as well:

============ ==========
Recipes
-----------------------
Column       Type
============ ==========
SimId        uuid
Recipes      string
Composition  map<int,double>
Member

The previous layout anticipated the desire to select on nuclide in the query, and hence a different column for each NucId. Perhaps this has not emerged in the wild, but it seems that this change would make that no longer possible.

Contributor Author

Hello, that is a valid point.
Maybe it's necessary for us to define what
For example, if one wants the timeseries mass of Pu239,
the query would be like the following:

SELECT sum(massfrac*quantity) FROM resources
INNER JOIN compositions
ON resources.qualid = compositions.qualid
WHERE nucid=942390000
GROUP BY timecreated

In the newer database structure, it would be:

SELECT quantity, timecreated, composition FROM materials

followed by a script that processes the result:

Get query results
for every row of the query results,
    look in composition for nucid 942390000 and get its massfrac value
    multiply that by quantity
    add the product to the Pu239 timeseries list

So I assume that it would take longer to accomplish
what you mentioned (and it also needs additional scripting outside of the SQLite query)...

You probably know much more than me, but @scopatz and my initial thought was that
this would have more benefit than loss. What do you think?
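
The comparison in the comment above can be sketched with Python's standard ``sqlite3`` module. This is an illustration under stated assumptions, not the |Cyclus| backend: the tables are cut down to the columns the two queries use, and because SQLite has no native map type, the composition map is stored here as JSON text purely for the example.

```python
import json
import sqlite3

PU239 = 942390000
U238 = 922380000

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE resources (qualid INTEGER, timecreated INTEGER, quantity REAL);
    CREATE TABLE compositions (qualid INTEGER, nucid INTEGER, massfrac REAL);
    CREATE TABLE materials (timecreated INTEGER, quantity REAL, composition TEXT);
""")
conn.executemany("INSERT INTO resources VALUES (?, ?, ?)",
                 [(1, 0, 10.0), (2, 1, 4.0)])
conn.executemany("INSERT INTO compositions VALUES (?, ?, ?)",
                 [(1, PU239, 0.5), (1, U238, 0.5), (2, PU239, 0.25), (2, U238, 0.75)])
# Assumption: the map column is serialized as JSON (keys become strings).
conn.executemany("INSERT INTO materials VALUES (?, ?, ?)",
                 [(0, 10.0, json.dumps({PU239: 0.5, U238: 0.5})),
                  (1, 4.0, json.dumps({PU239: 0.25, U238: 0.75}))])

# Current layout: the nuclide filter and the sum happen inside the query.
old = dict(conn.execute("""
    SELECT timecreated, SUM(massfrac * quantity)
    FROM resources
    INNER JOIN compositions ON resources.qualid = compositions.qualid
    WHERE nucid = ?
    GROUP BY timecreated
""", (PU239,)).fetchall())

# Proposed layout: every row is fetched and the map is unpacked in the script.
new = {}
for timecreated, quantity, composition in conn.execute(
        "SELECT timecreated, quantity, composition FROM materials"):
    massfrac = json.loads(composition).get(str(PU239), 0.0)
    new[timecreated] = new.get(timecreated, 0.0) + massfrac * quantity

print(old, new)
```

Both approaches yield the same time series here. The difference is where the filtering happens: in the database engine for the current layout, and in client code after loading every composition for the proposed one.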

Member

I believe this will not be optimized for a large calculation (1000-10000 facilities). If you want to see the plutonium inventory in the fleet, you will need to load all the compositions, extract the information you need, and then regenerate a table.

I would prefer a system that allows us to filter using the facility's name and nucid, but I am not sure it is possible without having a gigantic table :(

============ ==========


Transactions
------------
The transactions table would be modified to have an integer flag indicating whether
the commodity is a material or a product. This flag lets anyone inspecting
the transactions table know which resource table (either **Materials** or
**Products**) holds the actual concrete resource.

**Current:**

============= ==========
Transactions
------------------------
Column        Type
============= ==========
SimId         uuid
TransactionId int
SenderId      int
ReceiverId    int
ResourceId    int
Commodity     string
Time          int
============= ==========

**Proposed:**

================ ==========
Transactions
---------------------------
Column           Type
================ ==========
SimId            uuid
TransactionId    int
SenderId         int
ReceiverId       int
**ResourceType** **int**
ResourceId       int
Commodity        string
Time             int
================ ==========

This table will now be optionally written to the database. The default will be to
write this table (true).
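
The lookup that the **ResourceType** flag enables can be sketched as follows. The flag values (0 for material, 1 for product) are assumptions, since this CEP does not fix them, and ``find_resource`` and the dictionary-based rows are hypothetical names used only for illustration.

```python
MATERIAL, PRODUCT = 0, 1  # assumed flag values; the CEP does not specify them

RESOURCE_TABLES = {MATERIAL: "Materials", PRODUCT: "Products"}


def find_resource(transaction, tables):
    """Return the concrete resource row that a transaction row refers to."""
    table_name = RESOURCE_TABLES[transaction["ResourceType"]]
    for row in tables[table_name]:
        if (row["SimId"] == transaction["SimId"]
                and row["ResourceId"] == transaction["ResourceId"]):
            return row
    return None  # no matching resource recorded


tables = {
    "Materials": [{"SimId": "sim-1", "ResourceId": 10, "Quantity": 2.5,
                   "Composition": {922350000: 1.0}}],
    "Products": [{"SimId": "sim-1", "ResourceId": 10, "Quantity": 1.0,
                  "Quality": "widget"}],
}
transaction = {"SimId": "sim-1", "TransactionId": 0, "SenderId": 3,
               "ReceiverId": 4, "ResourceType": PRODUCT, "ResourceId": 10,
               "Commodity": "widgets", "Time": 5}
resource = find_resource(transaction, tables)
print(resource["Quality"])
```

Without the flag, a reader of the transactions table would have to probe both resource tables for the same **ResourceId**.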


ResCreators
-----------
Along with **Transactions**, the **ResCreators**
table would need another column, ResourceType:

============ ==========
ResCreators
-----------------------
Column       Type
============ ==========
SimId        uuid
ResourceId   int
AgentId      int
ResourceType int
============ ==========


Merge ExplicitInventory & ExplicitInventoryCompact
----------------------------------------------------
The **ExplicitInventory** and **ExplicitInventoryCompact**
tables should be merged into a single table, called **Inventories**,
with the following columns:

Member

Since I'm unfamiliar with the previous tables, can you clarify what is changing here by showing the old table layout? (as you did with the others)

Contributor Author

I am not sure I completely understand the question, but the old table layout is shown so that the reader can clearly understand what is being modified / removed with the proposed change.

Member

@bam241 bam241 Sep 17, 2017

The ExplicitInventory table lists all the different inventories in all the different facilities:
SimId -- AgentId -- Time -- InventoryName -- NucId -- Quantity

whereas ExplicitInventoryCompact looks more like the new "Inventories" table:
SimId -- AgentId -- Time -- InventoryName -- Quantity -- Composition

Member

@bam241 answered my question - the old table layout was NOT shown for this case and perhaps should be...

Contributor Author

fixed!

============= ===============
Inventories
-----------------------------
Column        Type
============= ===============
SimId         uuid
AgentId       int
Time          int
InventoryName string
Quantity      double
Composition   map<int,double>
============= ===============

This table will be optionally written to the database. The default will be to
write this table (true).
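
To illustrate how per-nuclide **ExplicitInventory** rows relate to the merged layout, here is a minimal sketch. It assumes that the per-nuclide ``Quantity`` is a mass and that ``Composition`` holds mass fractions. The function name and tuple layout are hypothetical and follow the column order given in the review discussion.

```python
from collections import defaultdict


def merge_explicit_inventory(rows):
    """Collapse per-nuclide rows into one row per inventory with a composition map.

    rows: iterable of (SimId, AgentId, Time, InventoryName, NucId, Quantity)
    returns: list of (SimId, AgentId, Time, InventoryName, Quantity, Composition)
    """
    per_nuclide = defaultdict(dict)  # inventory key -> {NucId: mass}
    for sim_id, agent_id, time, name, nuc_id, qty in rows:
        key = (sim_id, agent_id, time, name)
        per_nuclide[key][nuc_id] = per_nuclide[key].get(nuc_id, 0.0) + qty

    merged = []
    for key, masses in per_nuclide.items():
        total = sum(masses.values())
        # Mass fractions are undefined for an empty inventory; use an empty map.
        composition = {n: m / total for n, m in masses.items()} if total > 0 else {}
        merged.append(key + (total, composition))
    return merged


rows = [
    ("sim-1", 5, 3, "core", 922350000, 1.0),
    ("sim-1", 5, 3, "core", 922380000, 3.0),
]
merged = merge_explicit_inventory(rows)
print(merged[0])
```

Each inventory then occupies a single row, however many nuclides it contains.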


Merge Info & InfoExplicitInv
----------------------------
We saw little reason to separate the two tables. Combining them is a matter of cleanliness.
Additionally, the single **Info** table will have to contain an extra column, **RecordTransactions**.
Furthermore, the **RecordInventory** column is no longer needed and will be removed.

Other informational tables may also be merged into the single table.


Backwards Compatibility
=======================
This CEP is not backwards compatible.

Document History
================
This document is released under the CC-BY 3.0 license.

References and Footnotes
========================

.. rubric:: References