
Proposed demo dataset / output notebook / output CSV / output text #48

Closed
wants to merge 20 commits

Conversation

mccalluc (Contributor) commented Oct 7, 2024

When this is approved, it will be broken into pieces that can be generated by the user, so this is a really good time to make sure we're all happy with the target output, while it's all in one piece.

@ekraffmiller:

  • Does it feel like the generated notebook would be useful for the users we're imagining?
  • Are the Text and CSV formats appropriate?
  • Does anyone else need to be looped in here?

@Shoeboxam:

  • Is the usage of OpenDP appropriate?
  • Should we add more links to the docs site, or any more explanation?
  • The right accuracy conversion function to use depends on the distribution in the summary, but the connection feels very manual: "Look above to confirm that these match." Should I do something else?
  • I believe there is some connection between delta and the cutoff level in the graph, but can you remind me what it is?

mccalluc mentioned this pull request Oct 8, 2024
Shoeboxam (Member) left a comment


Thanks for writing this up, Chuck!

demo.ipynb Outdated
Comment on lines 143 to 145
" axes.bar(x_values_above, y_values_above, color=color, **shared)\n",
" axes.bar(x_values_below, y_values_below, color=\"white\", **shared)\n",
" axes.hlines([y_cutoff], 0, len(y_values), colors=[\"black\"], linestyles=[\"dotted\"])\n",
Member


I see that you've made bars below the cutoff more faded by making their color white. Nice. I like the shorter code below, though. Maybe we should add a helper for making the error-barred plot to the library.
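Something along these lines, say (just a sketch of the idea, with made-up names, not library code):

import matplotlib.pyplot as plt

def plot_histogram_with_error(labels, counts, error, cutoff=None):
    # Hypothetical helper: DP counts as bars with +/- error whiskers,
    # plus an optional dotted threshold line.
    fig, axes = plt.subplots()
    axes.bar(labels, counts, yerr=error, capsize=4, color="skyblue")
    if cutoff is not None:
        axes.hlines([cutoff], -0.5, len(counts) - 0.5,
                    colors=["black"], linestyles=["dotted"])
    axes.set_ylim(bottom=0)  # noisy counts can dip below zero; keep the axis at zero
    return fig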

Contributor Author


What I can do now to simplify things is move this CSV generation code into a helper script. The bar plot was useful to make sure the distribution was plausible, but doesn't need to be kept.

}
],
"source": [
"plot_histogram(grade_histogram, error=grade_histogram_95_accuracy, cutoff=50) # TODO: Set cutoff correctly."
Member


Since you've specified public_info="keys" above, there is no thresholding/cutoff. You'd need to drop those descriptors and add a delta parameter. Then the threshold should appear in the summary table.

Contributor Author


Sorry, I'm not understanding exactly what needs to change. There's already delta in the context:

privacy_loss=dp.loss_of(epsilon=epsilon, delta=delta),

but if I just change keys to lengths here:

        ("grade_bin",): dp.polars.Margin(
            max_partition_length=max_possible_rows,
-            public_info="keys",
+            public_info="lengths",
        ),
        ("class_year_bin",): dp.polars.Margin(
            max_partition_length=max_possible_rows,
-            public_info="keys",
+            public_info="lengths",
        ),

It errors at

grade_histogram_summary = grade_histogram_query.summarize()
ValueError: unable to infer bounds
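Or is the intent to drop public_info from these margins entirely and lean on the delta that's already in privacy_loss? Something like this, maybe (an untested sketch; df stands in for the loaded dataframe, other names as elsewhere in the notebook):

context = dp.Context.compositor(
    data=df,
    privacy_unit=dp.unit_of(contributions=1),
    privacy_loss=dp.loss_of(epsilon=epsilon, delta=delta),
    split_by_weights=weights,
    margins={
        ("grade_bin",): dp.polars.Margin(
            max_partition_length=max_possible_rows,
            # no public_info: group keys would only be released above the threshold
        ),
        ("class_year_bin",): dp.polars.Margin(
            max_partition_length=max_possible_rows,
        ),
    },
)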

demo.ipynb Outdated
"cell_type": "markdown",
"metadata": {},
"source": [
"TODO: Accuracy? See note above."
Member


This is easier to think about when you aren't grouping, but when grouping, accuracy will be different for every group. I think we'd need to add a different kind of utility that also takes in a counts dataframe, and returns a dataframe with per-group accuracies.
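Purely as a sketch of the shape I mean (hypothetical, not in the library; it assumes the noise is discrete Laplace on a per-group sum with one shared scale, so the mean's accuracy is roughly the sum's accuracy divided by the group count):

import polars as pl
import opendp.prelude as dp

def per_group_mean_accuracy(counts: pl.DataFrame, scale: float, alpha: float = 0.05) -> pl.DataFrame:
    # accuracy of the noisy sum at confidence level 1 - alpha, shared by every group
    sum_accuracy = dp.discrete_laplacian_scale_to_accuracy(scale, alpha)
    # dividing by each group's count gives a rough per-group accuracy for the mean
    return counts.with_columns((sum_accuracy / pl.col("len")).alias("mean_accuracy"))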

demo.ipynb Outdated
"source": [
"class_year_histogram_scale = class_year_histogram_summary['scale'].item()\n",
"# See the \"distribution\" in the summary above to confirm that discrete laplacian is correct.\n",
"class_year_histogram_95_accuracy = dp.discrete_laplacian_scale_to_accuracy(class_year_histogram_scale, 0.05)\n",
Member


Just pass alpha to summarize?

Contributor Author


alpha is a required positional parameter on discrete_laplacian_scale_to_accuracy; is that not correct?

Contributor Author


Never mind: accuracy is on the summary.
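For the record, the simpler path looks roughly like this (assuming the summary has a single row, and guessing that the query variable is named class_year_histogram_query to match the summary):

class_year_histogram_summary = class_year_histogram_query.summarize(alpha=0.05)  # ask for accuracy at alpha=0.05
class_year_histogram_95_accuracy = class_year_histogram_summary["accuracy"].item()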

demo.ipynb Outdated
" 'histogram': {v['class_year_bin']: v['len'] for v in class_year_histogram.to_dicts()}\n",
" },\n",
" }\n",
"}\n",
Shoeboxam (Member), Oct 9, 2024


A couple suggestions here.

  • I think it would be better to more closely mirror the library API.
  • You could handle all the serialization/deserialization logic for domain descriptors via the opendp-logger crate.
  • Once you've more closely mirrored the library API, that leaves the 'grade' and 'class_year' keys in inputs. I think these would be better moved into outputs, and outputs renamed to releases, since they are set within the scope of building the query.
{
    'data': {
        'csv_path': csv_path,
    },
    'privacy_unit': {
        'contributions': 1
    },
    'privacy_loss': {
        'epsilon': 1.0,
        'delta': 1e-7
    },
    'domain': {
        [opendp_logger output here] (this is where max_possible_rows ends up)
    },
    'split_by_weights': weights,
    'releases': {
        'grade': {
            'mean': grade_mean.item(),
            'histogram': {v['grade_bin']: v['len'] for v in grade_histogram.to_dicts()},
            'query': {
                'min': grade_min,
                'max': grade_max,
                'bins_count': grade_bins_count,
            }
        },
        'class_year': {
            'mean': class_year_mean.item(),
            'histogram': {v['class_year_bin']: v['len'] for v in class_year_histogram.to_dicts()},
            'query': {
                'min': class_year_min,
                'max': class_year_max,
                'bins_count': class_year_bins_count,
            }
        }
    }
}

While I didn't write it here, the query itself could also be serialized via the OpenDP logger crate. The nice thing about using the logger crate is that a user (or we) could reconstruct domains and measurements from the serialized output.
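Roughly (from memory; the exact helper names here are assumptions to check against the package, since it isn't on docs.opendp.org yet):

import opendp.prelude as dp
from opendp_logger import enable_logging, make_load_json

dp.enable_features("contrib")
enable_logging()  # patch constructors so objects record how they were built

space = dp.vector_domain(dp.atom_domain(T=int)), dp.symmetric_distance()
counter = space >> dp.t.then_count()

serialized = counter.to_json()              # JSON recipe for how `counter` was constructed
reconstructed = make_load_json(serialized)  # rebuild the transformation from that recipe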

Contributor Author


If the logger API is something we would steer people to, then should it be on docs.opendp.org? I can see that it would be easier, and in many ways better, for us to produce a data structure, but until there is some documentation around it, I'd be very hesitant to point anyone to the logger format as the right way to use OpenDP.

Possible next steps:

  • Add logger tabs in the existing docs?
  • or at least documentation of the schema?

Member


Hmm, while logger output could be seen as another medium for specifying queries, I don't think that format is very human-editable, so I'd want to avoid logger tabs. If the lower editability is not consistent with your vision for this json output, that's fine.

Shoeboxam (Member), Oct 9, 2024


Yeah, it would be good to spruce the package up and add it to docs.opendp.org. It always seems to come up when people want to integrate with OpenDP. I'm not sure exactly how it would be added yet.

demo.ipynb Outdated
"source": [
"### CSV export?\n",
"\n",
"Flatten the data stucture to key value pairs and make a two-column CSV unless there are other requirements?"
Member


Not a big fan. Let the user do it if they want it?

Contributor Author


A CSV export format was requested during planning. If you think it's not useful, the best next step might be to put it on the agenda for the next meeting?

Shoeboxam (Member), Oct 9, 2024


Maybe this needs clarification: flattening a nested data structure into a csv is not very user-friendly. What is the purpose of such an output? I could totally see having one csv per query. Maybe the json could be part of a zip that also holds csvs, and each release in the json points to a different csv?
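Roughly this kind of bundle, say (a made-up sketch; the names are hypothetical, not part of this PR):

import json
import zipfile

def write_release_bundle(path, releases):
    # releases: mapping of release name -> (metadata dict, polars DataFrame)
    with zipfile.ZipFile(path, "w") as zf:
        index = {}
        for name, (metadata, df) in releases.items():
            csv_name = f"{name}.csv"
            # polars returns the CSV as a string when no path is given
            zf.writestr(csv_name, df.write_csv())
            index[name] = {**metadata, "csv": csv_name}
        zf.writestr("releases.json", json.dumps(index, indent=2))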

Contributor Author


I asked on Slack and wasn't able to get an immediate answer, but Ceilyn did file a tracking issue. I'll check with Ellen tomorrow, but my sense is that no one here has a precise picture of what this should be. Maybe the end users will, if we ask them, but it's not considered a blocker.

To move this forward, I'm going to make a new branch from here with just the outputs, so it's clear that they are separate from the rest of the notebook.

Co-authored-by: Michael Shoemate <[email protected]>
Shoeboxam (Member) commented Oct 9, 2024

To answer your last question: In the grouping algorithm used here, delta corresponds to the probability of releasing a group key that is unique to an individual. Releasing any group key specific to an individual runs the risk of causing a privacy violation: say, for example, if the group keys were social security numbers.

Bin labels aren't as obviously sensitive, but since DP always guards against the worst case, the math protects them both the same. An adversary could still construct bin edges for which they know a bin singles out an individual.

ekraffmiller (Member) left a comment


This looks good to me in general, and I think the idea of a notebook is helpful for users. I think it would also be good to get feedback from someone who represents a typical user.

mccalluc (Contributor Author)

Thanks for the feedback. I've addressed some points and narrowed the scope. Two new PRs are filed for

My understanding from Ceilyn is that they'll try to recruit some real users to give feedback, but we shouldn't wait for that; for now, we just need to use our own judgement.

Mike, I've addressed several of your comments, but in some cases (for instance, accuracy on means) we would need to do something more complicated to get results now, pending new features in OpenDP. If we were just creating a report for users, workarounds might make sense so we could give them the fullest results, but since this is intended to be a model for users, I lean towards keeping it simple, with perhaps at most a link to an issue if we want to explain that a feature is coming.

Please re-review!

mccalluc (Contributor Author)

Some TODOs:

  • The backticks around "Context" make it a little harder to see that it's a link, so take them off.
  • Pull out some of the parts of Context (like the privacy loss) and explain them in their own cells.
  • On the charts, make sure the y-axis doesn't go negative. (If there are negative values, we'll still see the whiskers.)

mccalluc (Contributor Author) commented Dec 5, 2024

Notebook generation has been implemented. This PR was useful in setting direction, but is no longer needed.

mccalluc closed this Dec 5, 2024
mccalluc deleted the 46-dp-demo-notebook branch December 5, 2024 19:25

Successfully merging this pull request may close these issues.

Notebook to demo DP histograms with cut