This repository has been archived by the owner on Sep 21, 2022. It is now read-only.

Script to convert proto graphs to Beam #230

Open
Shoeboxam opened this issue May 14, 2020 · 6 comments

@Shoeboxam
Member

This could just be written in Python, although I would be interested in learning how the Beam communications layer works first.

@mikephelan self-assigned this May 27, 2020
@mikephelan
Contributor

Is it correct to understand that the input to Apache Beam is the group of templates found in whitenoise-core/validator-rust/prototypes/components?

@Shoeboxam
Member Author

Yes! Each JSON file needs to have an equivalent Beam representation.

To be differentially private, there must be no two neighboring datasets (any two datasets that differ by one row) where the runtime succeeds on one and fails on the other.

If a runtime does not implement a component, then it will fail on every dataset, which is great, because it means runtimes (like Beam) do not need to implement every component.

You can also ignore components that have no concrete implementation, like min, max, dp_xxx, and to_xxx.
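
As a rough sketch (not a prescribed design), one way to guarantee that unimplemented components fail uniformly, regardless of the dataset's contents, is to dispatch through a registry and raise whenever a component variant is missing. The registry name, decorator, and argument names below are hypothetical:

```python
# Hypothetical registry from component variant names (as they appear in the
# proto graphs) to Beam implementations.
COMPONENT_IMPLEMENTATIONS = {}

def register(variant):
    """Decorator that records a Beam implementation for a component variant."""
    def decorator(func):
        COMPONENT_IMPLEMENTATIONS[variant] = func
        return func
    return decorator

def evaluate_component(variant, options, arguments, privacy_definition):
    """Run one component; fail unconditionally if it is not implemented.

    Because the failure does not depend on the data, it cannot distinguish
    between neighboring datasets, so the guarantee above still holds.
    """
    if variant not in COMPONENT_IMPLEMENTATIONS:
        raise NotImplementedError(f"Beam runtime does not implement {variant}")
    return COMPONENT_IMPLEMENTATIONS[variant](options, arguments, privacy_definition)
```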

@Shoeboxam
Member Author

Shoeboxam commented May 27, 2020

The overall signature is: given a computation graph, a privacy definition, and a release, return a release.

As for the overall function (let's call it distribute_release), I can help with the portion that traverses the graph and calls into the lower-level functions (next comment).
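
A minimal sketch of that traversal, reusing the hypothetical evaluate_component/registry idea from the previous comment; the graph representation (node_id -> {variant, options, arguments}) is assumed here, not the actual proto schema:

```python
def distribute_release(computation_graph, privacy_definition, release):
    """Hypothetical driver: evaluate each node once all of its arguments are ready.

    computation_graph: dict of node_id -> {"variant", "options", "arguments"},
        where "arguments" maps argument names to upstream node ids.
    release: dict of node_id -> previously released values; copied, extended,
        and returned as the new release.
    """
    evaluated = dict(release)          # node_id -> PCollection or constant
    remaining = dict(computation_graph)

    while remaining:
        progressed = False
        for node_id, component in list(remaining.items()):
            argument_ids = component["arguments"].values()
            if all(arg_id in evaluated for arg_id in argument_ids):
                arguments = {name: evaluated[arg_id]
                             for name, arg_id in component["arguments"].items()}
                evaluated[node_id] = evaluate_component(
                    component["variant"], component["options"],
                    arguments, privacy_definition)
                del remaining[node_id]
                progressed = True
        if not progressed:
            raise ValueError("graph has a cycle or references a missing node")

    return evaluated
```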

@Shoeboxam
Member Author

Shoeboxam commented May 27, 2020

From my initial glance at Beam, you might expect each Beam component you implement to take in a PCollection for each argument and elementary Python types for each option.

Each component implementation will need arguments similar to the Rust runtime's:

  • an options dict, equivalent to &self in Rust
  • an arguments dict of PCollections
  • a privacy definition

The privacy definition can generally be ignored for now. The only relevant information inside it is the set of user preferences to force constant time, constant memory, and so on, which we haven't matured enough to support yet.
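
For example, a Beam-side mean (ignoring privacy for the moment) might follow that shape; the @register decorator and argument names come from the hypothetical sketch above, while Mean.Globally is a real Apache Beam transform:

```python
import apache_beam as beam
from apache_beam.transforms import combiners

@register("Mean")
def mean(options, arguments, privacy_definition):
    """Mean over the 'data' argument.

    options: plain Python values, the analogue of &self in the Rust runtime.
    arguments: dict of PCollections; 'data' holds the numeric column.
    privacy_definition: unused for now, per the note above.
    """
    data = arguments["data"]
    return data | "Mean" >> combiners.Mean.Globally()
```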

Let me know if you see issues with this proposed code structure!

@mikephelan
Contributor

Using the analysis notebook, I'd like to pose an example here.
In the sixth code block, under the first comment line:

attempt 4 - succeeds!

there is an example using dp_mean and dp_variance.
In this case, would we be sending Apache Beam a DAG containing two nodes, one for dp_mean and one for dp_variance?

@Shoeboxam
Member Author

More likely, you would translate

materialize -> cast -> clamp -> impute -> resize -> (mean, variance)

into a Beam pipeline. I'm happy to add custom data-loading components if Beam doesn't make the same assumptions that materialize does.
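
A hand-written illustration of what that translated pipeline could look like (the helper functions, the beam.Create stand-in for materialize, and the clamp/impute bounds are all made up for the example; only the overall shape is the point):

```python
import apache_beam as beam
from apache_beam.transforms import combiners

# Hypothetical elementwise helpers standing in for cast, clamp, and impute.
def cast_float(value):
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

def clamp(value, lower, upper):
    return None if value is None else min(max(value, lower), upper)

def impute(value, default):
    return default if value is None else value

with beam.Pipeline() as pipeline:
    # materialize: stand-in source; the real component would load a CSV column
    data = pipeline | "Materialize" >> beam.Create(["23", "45", "oops", "88"])

    prepared = (
        data
        | "Cast" >> beam.Map(cast_float)
        | "Clamp" >> beam.Map(clamp, lower=0.0, upper=100.0)
        | "Impute" >> beam.Map(impute, default=50.0)
        # resize would slot in here; it needs a dataset-level transform, not a Map
    )

    # the two leaves of the example graph
    mean = prepared | "Mean" >> combiners.Mean.Globally()
    # variance has no built-in combiner; it would need a custom CombineFn
    # accumulating (count, sum, sum of squares)
```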
