Proposed demo dataset / output notebook / output CSV / output text #48
Conversation
Thanks for writing this up, Chuck!
demo.ipynb (Outdated)

    axes.bar(x_values_above, y_values_above, color=color, **shared)
    axes.bar(x_values_below, y_values_below, color="white", **shared)
    axes.hlines([y_cutoff], 0, len(y_values), colors=["black"], linestyles=["dotted"])
I see that you've made bars below the cutoff more faded by making their color white. Nice. I like the shorter code below, though. Maybe we should add a helper to the library for making the error-barred plot.
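If such a helper were added, the split behind the two `axes.bar` calls could be factored out along these lines; a minimal sketch, with a hypothetical function name and the plotting itself omitted:

```python
def split_by_cutoff(xs, ys, cutoff):
    """Partition (x, y) pairs around a cutoff so a plot helper can draw
    bars at or above the cutoff in color and the rest faded (e.g. white)."""
    above = [(x, y) for x, y in zip(xs, ys) if y >= cutoff]
    below = [(x, y) for x, y in zip(xs, ys) if y < cutoff]
    return above, below
```

The helper would then unzip each list back into the x and y arguments for the two bar calls.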
What I can do now to simplify things is move this CSV generation code into a helper script. The bar plot was useful to make sure the distribution was plausible, but doesn't need to be kept.
    plot_histogram(grade_histogram, error=grade_histogram_95_accuracy, cutoff=50)  # TODO: Set cutoff correctly.
Since you've specified public_info="keys" above, there is no thresholding/cutoff. You'd need to drop those descriptors and add a delta parameter. Then the threshold should appear in the summary table.
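To illustrate what the thresholding does when the key set is not public: a toy sketch, not the OpenDP implementation, of noising group counts and releasing only keys whose noisy count clears the threshold (all names here are hypothetical):

```python
import random

def release_stable_keys(counts, scale, threshold):
    """Toy sketch of key-set thresholding: noise each group's count and
    release only groups whose noisy count clears the threshold, so that
    group keys contributed by very few individuals stay unpublished."""
    released = {}
    for key, count in counts.items():
        # Difference of two exponentials is Laplace(scale) noise.
        noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
        noisy = count + noise
        if noisy >= threshold:
            released[key] = round(noisy)
    return released
```

With `public_info="keys"` the key set is assumed public, so this filtering step (and the threshold row in the summary) never comes into play.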
Sorry, I'm not understanding exactly what needs to change. There's already delta in the context:
privacy_loss=dp.loss_of(epsilon=epsilon, delta=delta),
but if I just change keys to lengths here:
("grade_bin",): dp.polars.Margin(
max_partition_length=max_possible_rows,
- public_info="keys",
+ public_info="lengths",
),
("class_year_bin",): dp.polars.Margin(
max_partition_length=max_possible_rows,
- public_info="keys",
+ public_info="lengths",
),
it errors at
grade_histogram_summary = grade_histogram_query.summarize()
with:
ValueError: unable to infer bounds
demo.ipynb (Outdated)

    TODO: Accuracy? See note above.
This is easier to think about when you aren't grouping, but when grouping, accuracy will be different for every group. I think we'd need to add a different kind of utility that also takes in a counts dataframe, and returns a dataframe with per-group accuracies.
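A minimal sketch of what such a utility might look like, under the assumption that each group's count comes with its own noise scale; the function name is hypothetical, and the continuous Laplace tail bound scale * ln(1/alpha) stands in for the discrete mechanism's exact accuracy:

```python
import math

def per_group_accuracies(scales, alpha=0.05):
    """Map each group's noise scale to a (1 - alpha)-level accuracy,
    using the continuous Laplace bound as an approximation."""
    return {key: scale * math.log(1 / alpha) for key, scale in scales.items()}
```

A dataframe version would just apply the same expression to a scale column and join the result back onto the counts.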
demo.ipynb (Outdated)

    class_year_histogram_scale = class_year_histogram_summary['scale'].item()
    # See the "distribution" in the summary above to confirm that discrete laplacian is correct.
    class_year_histogram_95_accuracy = dp.discrete_laplacian_scale_to_accuracy(class_year_histogram_scale, 0.05)
Just pass alpha to summarize?
alpha is a required positional parameter on discrete_laplacian_scale_to_accuracy: is that not correct?
never mind: accuracy is on the summary.
demo.ipynb (Outdated)

            'histogram': {v['class_year_bin']: v['len'] for v in class_year_histogram.to_dicts()}
        },
      }
    }
A couple of suggestions here.
- I think it would be better to more closely mirror the library API.
- You could handle all the serialization/deserialization logic for domain descriptors via the opendp-logger crate.
- Once you've more closely mirrored the library API, the 'grade' and 'class_year' keys are left in inputs. I think these would be better moved into outputs, and outputs renamed to releases, since they are set within the scope of building the query.
{
'data': {
'csv_path': csv_path,
},
'privacy_unit': {
'contributions': 1
},
'privacy_loss': {
'epsilon': 1.0,
'delta': 1e-7
},
'domain': {
[opendp_logger output here] (this is where max_possible_rows ends up)
},
'split_by_weights': weights,
'releases': {
'grade': {
'mean': grade_mean.item(),
'histogram': {v['grade_bin']: v['len'] for v in grade_histogram.to_dicts()},
'query': {
'min': grade_min,
'max': grade_max,
'bins_count': grade_bins_count,
}
},
'class_year': {
'mean': class_year_mean.item(),
'histogram': {v['class_year_bin']: v['len'] for v in class_year_histogram.to_dicts()},
'query': {
'min': class_year_min,
'max': class_year_max,
'bins_count': class_year_bins_count,
}
}
}
}
While I didn't write it here, the query itself could also be serialized via the OpenDP logger crate. The nice thing about using the logger crate is that a user (or us) could reconstruct domains and measurements themselves from the serialized output.
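Whatever the final shape, the proposed structure serializes cleanly with the standard library. A minimal round-trip sketch with placeholder values throughout (the shape mirrors the proposal above, the numbers are made up):

```python
import json

# Placeholder release structure; every value here is illustrative only.
release = {
    'data': {'csv_path': 'demo.csv'},
    'privacy_unit': {'contributions': 1},
    'privacy_loss': {'epsilon': 1.0, 'delta': 1e-7},
    'releases': {
        'grade': {
            'mean': 82.4,
            'histogram': {'0-50': 3, '50-100': 17},
            'query': {'min': 0, 'max': 100, 'bins_count': 2},
        },
    },
}

serialized = json.dumps(release, indent=2)
assert json.loads(serialized) == release  # lossless round trip
```

The opendp-logger output would slot into a 'domain' (and possibly 'query') key as another JSON subtree.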
If the logger API is something we would steer people to, then should it be on docs.opendp.org? I can see that it would be easier, and in many ways better, for us to produce a data structure, but until there is some documentation around it I'd be very hesitant to point anyone to the logger format as the right way to use OpenDP.
Possible next steps:
- Add logger tabs in the existing docs?
- or at least documentation of the schema?
Hmm, while logger output could be seen as another medium for specifying queries, I don't think that format is very human-editable, so I'd want to avoid logger tabs. If the lower editability is not consistent with your vision for this json output, that's fine.
Yeah, it would be good to spruce the package up and add to docs.opendp.org. It always seems to come up when people want to integrate with OpenDP. I'm not sure how exactly it would be added yet.
demo.ipynb (Outdated)

    ### CSV export?

    Flatten the data structure to key value pairs and make a two-column CSV unless there are other requirements?
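For concreteness, the flattening described in this cell could be done with the standard library alone; a small sketch with hypothetical helper names:

```python
import csv
import io

def flatten(d, prefix=''):
    """Yield dotted key/value pairs from a nested dict."""
    for key, value in d.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value

def to_two_column_csv(d):
    """Write the flattened pairs as a two-column CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(['key', 'value'])
    writer.writerows(flatten(d))
    return buffer.getvalue()
```

So a nested release like `{'grade': {'mean': 82.4}}` would become a row with key `grade.mean`.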
Not a big fan. Let the user do it if they want it?
A CSV export format was requested during planning. If you think it's not useful, then best next step might be to put it on the agenda for the next meeting?
Maybe this needs clarification: flattening out a nested data structure into a CSV is not very user-friendly. What is the purpose of such an output? I could totally see having one CSV per query. Maybe the JSON could be part of a zip that also holds CSVs, and each release in the JSON points to a different CSV?
Asked on Slack, and wasn't able to get an immediate answer, but Ceilyn did file a tracking issue. I'll check with Ellen tomorrow, but my sense is that no one here has a precise picture of what this should be. Maybe the end-users will, if we ask them, but it is not considered a blocker.
To move this forward, I'm going to make a new branch from here with just the outputs, so it's clear that they are separate from the rest of the notebook.
Co-authored-by: Michael Shoemate <[email protected]>
To answer your last question: In the grouping algorithm used here, delta corresponds to the probability of releasing a group key that is unique to an individual. Releasing any group key specific to an individual runs the risk of causing a privacy violation: say, for example, if the group keys were social security numbers. Bin labels aren't as apparently sensitive, but since DP always guards against the worst case, the math protects them both the same. An adversary could still construct bin edges for which they know a bin singles out an individual.
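The tail mass that delta must cover can be made concrete with a small illustration; this uses the continuous Laplace tail bound as an assumption, not the library's exact accounting:

```python
import math

def unique_key_release_prob(threshold, scale):
    """Probability that a group whose true count is 1 (a single
    individual) gets noisy count >= threshold and is therefore released,
    under Laplace(scale) noise. Delta must cover this tail mass."""
    return 0.5 * math.exp(-(threshold - 1) / scale)
```

Raising the threshold drives this probability, and hence the delta needed to cover it, down exponentially.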
This looks good to me in general, and I think the idea of a notebook is helpful for users. I think it would also be good to get feedback from someone who represents a typical user.
Thanks for feedback. I've addressed some points, and narrowed the scope. Two new PRs are filed: one for the --demo option (#7), and one towards "Explain utf8-lossy in templated code" (#42). When this is approved, it will be broken into pieces that can be generated by the user, so this is a really good time to make sure we're all happy with the target output, while it's all in one piece.

My understanding from Ceilyn is that they'll try to recruit some real users to give feedback, but we shouldn't wait for that: for now, we just need to use our own judgement. Mike, I've addressed several of your comments, but sometimes, for instance the accuracy on means, we would need to do something more complicated now to get results, pending new features in OpenDP. If we were just creating a report for users, workarounds might make sense so we could give them the fullest results, but since this is intended to be a model for users, I would incline towards keeping it simple, with perhaps at most a link to an issue if we want to explain that it is coming. Please re-review!

Some TODOs:

Notebook generation has been implemented. This PR was useful in setting direction, but is no longer needed.