Skip to content

Select runs and get values

Dmitry Romanov edited this page Dec 18, 2024 · 23 revisions

[DEPRECATED]

Better use select_values. See below why

Contents:


[DEPRECATED] Kind of...

For the most of the use cases the new select_values function is faster and better than select_runs.

  • faster, while using python if statement syntaxis for search queries, select_values relies on SQL search query as much as possible and do only the final selection steps in python. Moreover, it selects only required values, increasing the performance. The select_runs function contraty to that does a lot of filtering on python side and pulls full runs information.
  • beter output, select_values selects the resulting table of values for specified runs. The table of runs and values is what is needed in 99% of cases. select_runs returns a list of Run SQLAlchemy objects. Manipulating over Run ORM objects usually leads to more queries.
  • better introspection, results of select_values also has performance metrics and some other goodies. Not that it is needed for regular users, but can help to figure out why something is slow.

Dataset to practice on

To experiment with the examples on this page, one can download daily recreated SQLite database: https://halldweb.jlab.org/dist/rcdb.sqlite

Using connection string:

sqlite:///<path to file>/rcdb.sqlite

Or connect to readonly mysql:

mysql://[email protected]/rcdb


Selecting runs and getting values


Get values

Suppose, one wants to get all event_count-s and beam_current-s for production runs:

import rcdb

# Connect to database
db = rcdb.RCDBProvider("mysql://[email protected]/rcdb")

# Select runs and get values
table = db.select_runs("@is_production")\
          .get_values(['event_count', 'beam_current'], insert_run_number=True)
print table

As the result one gets something like:

[
[1023, 3984793, 0.145]
[1024, 4569873, 0.230]
...
]

The first column is a run number (we set insert_run_number=True above). The other two columns are 'event_count' and 'beam_current' as we gave it above.

If run number is not needed insert_run_number may be skipped:

table = db.select_runs("@is_production")
          .get_values(['event_count', 'beam_current'])

A nice way to iterate the values:

for row in table:
    event_count, beam_current = tuple(row)
    print event_count, beam_current

Run range

If one wants to apply a run range, say for a particular run period:

table = db.select_runs("@is_production", 10000, 20000)\
          .get_values(['event_count', 'beam_current'], True)

No filtration

To get values for all runs without filtration a search pattern may be skipped:

table = db.select_runs(run_min=10000, run_max=20000)\
               .get_values(['event_count', 'beam_current'], insert_run_number=True)

(note that parameter names are used here, so the python could figure function parameters out)


Sort order

The table is always sorted by run number. It is just a 'feature' of getting runs DB query (that is under the hood). However, the order in with run numbers are sorted could be changed:

table = db.select_runs(run_min=10000, run_max=20000, sort_desc=True)\
               .get_values(['event_count', 'beam_current'], insert_run_number=True)

sort_desc=True - makes rows to be sorted by descending run_number



Iterating over runs and getting conditions


Getting runs

select_runs function returns RunSelectionResult object that contains all selected runs and some other information about how the runs where selected. The RunSelectionResult implements list interface returning ```Run`-s. Thus one can do:

import rcdb
db = rcdb.RCDBProvider("mysql://[email protected]/rcdb")
result = db.select_runs("@is_production")
for run in result:
    print run.number

As one could guess the selected run numbers are printed as the result.


Get any condition of the run

Run has the next useful functions:

def get_conditions_by_name(self):              
    # Get all conditions and returns dictionary of condition.name -> condition
def get_condition(self, condition_name):       
    # Gets Condition object by name if such name condition exist or None
def get_condition_value(self, condition_name): 
    # Gets the condition value if such condition exist or None

So one can iterate selected runs and get any desired condition:

import rcdb
db = rcdb.RCDBProvider("mysql://[email protected]/rcdb")
result = db.select_runs("@is_production")
for run in result:
    print run.get_condition_value('event_count')



Performance

In the performance point of view, the fastest way to get values by using

db.select_runs(...).get_values(...)

Because get_values makes just a single database call to retrieve all values for selected runs.

In case of iterating:

result = db.select_runs("@is_production")
for run in result:
    print run.get_condition_value('event_count')

Database is queried on each get_condition_value