
Proposed demo dataset / output notebook / output CSV / output text #48

Closed
wants to merge 20 commits

Conversation

mccalluc (Contributor) commented Oct 7, 2024

When this is approved, it will be broken into pieces that can be generated by the user, so this is a really good time to make sure we're all happy with the target output, while it's all in one piece.

@ekraffmiller:

  • Does it feel like the generated notebook would be useful for the users we're imagining?
  • Are the Text and CSV formats appropriate?
  • Does anyone else need to be looped in here?

@Shoeboxam:

  • Is the usage of OpenDP appropriate?
  • Should we add more links to the docs site, or any more explanation?
  • The right accuracy conversion function to use depends on the distribution in the summary, but the connection feels very manual: "Look above to confirm that these match." Should I do something else?
  • I believe there is some connection between delta and the cutoff level in the graph, but can you remind me what it is?

mccalluc mentioned this pull request Oct 8, 2024
Shoeboxam (Member) left a comment


Thanks for writing this up, Chuck!

demo.ipynb Outdated
Comment on lines 143 to 145
" axes.bar(x_values_above, y_values_above, color=color, **shared)\n",
" axes.bar(x_values_below, y_values_below, color=\"white\", **shared)\n",
" axes.hlines([y_cutoff], 0, len(y_values), colors=[\"black\"], linestyles=[\"dotted\"])\n",
Member


I see that you've made bars below the cutoff more faded by making their color white. Nice. I like the shorter code below, though. Maybe we should add a helper for making the error-barred plot to the library.
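Something along these lines, say (just a sketch of the idea, with made-up names, not library code):

import matplotlib.pyplot as plt

def plot_histogram_with_error(labels, counts, error, cutoff=None):
    # Hypothetical helper: DP counts as bars with +/- error whiskers,
    # plus an optional dotted threshold line.
    fig, axes = plt.subplots()
    axes.bar(labels, counts, yerr=error, capsize=4, color="skyblue")
    if cutoff is not None:
        axes.hlines([cutoff], -0.5, len(counts) - 0.5,
                    colors=["black"], linestyles=["dotted"])
    axes.set_ylim(bottom=0)  # noisy counts can dip below zero; keep the axis at zero
    return fig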

Contributor Author


What I can do now to simplify things is move this CSV generation code into a helper script. The bar plot was useful to make sure the distribution was plausible, but doesn't need to be kept.

}
],
"source": [
"plot_histogram(grade_histogram, error=grade_histogram_95_accuracy, cutoff=50) # TODO: Set cutoff correctly."
Member


Since you've specified public_info="keys" above, there is no thresholding/cutoff. You'd need to drop those descriptors and add a delta parameter. Then the threshold should appear in the summary table.

Contributor Author


Sorry, I'm not understanding exactly what needs to change. There's already delta in the context:

privacy_loss=dp.loss_of(epsilon=epsilon, delta=delta),

but if I just change keys to lengths here:

        ("grade_bin",): dp.polars.Margin(
            max_partition_length=max_possible_rows,
-            public_info="keys",
+            public_info="lengths",
        ),
        ("class_year_bin",): dp.polars.Margin(
            max_partition_length=max_possible_rows,
-            public_info="keys",
+            public_info="lengths",
        ),

It errors at

grade_histogram_summary = grade_histogram_query.summarize()
ValueError: unable to infer bounds
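Or is the intent to drop public_info from these margins entirely and lean on the delta that's already in privacy_loss? Something like this, maybe (an untested sketch; df stands in for the loaded dataframe, other names as elsewhere in the notebook):

context = dp.Context.compositor(
    data=df,
    privacy_unit=dp.unit_of(contributions=1),
    privacy_loss=dp.loss_of(epsilon=epsilon, delta=delta),
    split_by_weights=weights,
    margins={
        ("grade_bin",): dp.polars.Margin(
            max_partition_length=max_possible_rows,
            # no public_info: group keys would only be released above the threshold
        ),
        ("class_year_bin",): dp.polars.Margin(
            max_partition_length=max_possible_rows,
        ),
    },
)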

demo.ipynb Outdated
"cell_type": "markdown",
"metadata": {},
"source": [
"TODO: Accuracy? See note above."
Member


This is easier to think about when you aren't grouping, but when grouping, accuracy will be different for every group. I think we'd need to add a different kind of utility that also takes in a counts dataframe, and returns a dataframe with per-group accuracies.
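Purely as a sketch of the shape I mean (hypothetical, not in the library; it assumes the noise is discrete Laplace on a per-group sum with one shared scale, so the mean's accuracy is roughly the sum's accuracy divided by the group count):

import polars as pl
import opendp.prelude as dp

def per_group_mean_accuracy(counts: pl.DataFrame, scale: float, alpha: float = 0.05) -> pl.DataFrame:
    # accuracy of the noisy sum at confidence level 1 - alpha, shared by every group
    sum_accuracy = dp.discrete_laplacian_scale_to_accuracy(scale, alpha)
    # dividing by each group's count gives a rough per-group accuracy for the mean
    return counts.with_columns((sum_accuracy / pl.col("len")).alias("mean_accuracy"))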

demo.ipynb Outdated
"source": [
"class_year_histogram_scale = class_year_histogram_summary['scale'].item()\n",
"# See the \"distribution\" in the summary above to confirm that discrete laplacian is correct.\n",
"class_year_histogram_95_accuracy = dp.discrete_laplacian_scale_to_accuracy(class_year_histogram_scale, 0.05)\n",
Member


Just pass alpha to summarize?

Contributor Author


alpha is a required positional parameter on discrete_laplacian_scale_to_accuracy; is that not correct?

Contributor Author


Never mind: accuracy is on the summary.
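For the record, the simpler path looks roughly like this (assuming the summary has a single row, and guessing that the query variable is named class_year_histogram_query to match the summary):

class_year_histogram_summary = class_year_histogram_query.summarize(alpha=0.05)  # ask for accuracy at alpha=0.05
class_year_histogram_95_accuracy = class_year_histogram_summary["accuracy"].item()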

demo.ipynb Outdated
" 'histogram': {v['class_year_bin']: v['len'] for v in class_year_histogram.to_dicts()}\n",
" },\n",
" }\n",
"}\n",
Shoeboxam (Member), Oct 9, 2024


A couple suggestions here.

  • I think it would be better to more closely mirror the library API.
  • You could handle all the serialization/deserialization logic for domain descriptors via the opendp-logger crate.
  • Once you've more closely mirrored the library API, that leaves the 'grade' and 'class_year' keys in inputs. I think these would be better moved into outputs, and outputs renamed to releases, since they are set within the scope of building the query.
{
    'data': {
        'csv_path': csv_path,
    },
    'privacy_unit': {
        'contributions': 1
    },
    'privacy_loss': {
        'epsilon': 1.0,
        'delta': 1e-7
    },
    'domain': {
        [opendp_logger output here] (this is where max_possible_rows ends up)
    },
    'split_by_weights': weights,
    'releases': {
        'grade': {
            'mean': grade_mean.item(),
            'histogram': {v['grade_bin']: v['len'] for v in grade_histogram.to_dicts()},
            'query': {
                'min': grade_min,
                'max': grade_max,
                'bins_count': grade_bins_count,
            }
        },
        'class_year': {
            'mean': class_year_mean.item(),
            'histogram': {v['class_year_bin']: v['len'] for v in class_year_histogram.to_dicts()},
            'query': {
                'min': class_year_min,
                'max': class_year_max,
                'bins_count': class_year_bins_count,
            }
        }
    }
}

While I didn't write it here, the query itself could also be serialized via the OpenDP logger crate. The nice thing about using the logger crate is that a user (or we) could reconstruct domains and measurements from the serialized output.
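Roughly (from memory; the exact helper names here are assumptions to check against the package, since it isn't on docs.opendp.org yet):

import opendp.prelude as dp
from opendp_logger import enable_logging, make_load_json

dp.enable_features("contrib")
enable_logging()  # patch constructors so objects record how they were built

space = dp.vector_domain(dp.atom_domain(T=int)), dp.symmetric_distance()
counter = space >> dp.t.then_count()

serialized = counter.to_json()              # JSON recipe for how `counter` was constructed
reconstructed = make_load_json(serialized)  # rebuild the transformation from that recipe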

Contributor Author


If the logger API is something we would steer people to, then should it be on docs.opendp.org? I can see that it would be easier, and in many ways better, for us to produce a data structure, but until there is some documentation around it, I'd be very hesitant to point anyone to the logger format as the right way to use OpenDP.

Possible next steps:

  • Add logger tabs in the existing docs?
  • or at least documentation of the schema?

Member


Hmm, while logger output could be seen as another medium for specifying queries, I don't think that format is very human-editable, so I'd want to avoid logger tabs. If the lower editability is not consistent with your vision for this json output, that's fine.

Shoeboxam (Member), Oct 9, 2024


Yeah, it would be good to spruce the package up and add it to docs.opendp.org. It always seems to come up when people want to integrate with OpenDP. I'm not sure exactly how it would be added yet.

demo.ipynb Outdated
"source": [
"### CSV export?\n",
"\n",
"Flatten the data stucture to key value pairs and make a two-column CSV unless there are other requirements?"
Member


Not a big fan. Let the user do it if they want it?

Contributor Author


A CSV export format was requested during planning. If you think it's not useful, the best next step might be to put it on the agenda for the next meeting?

Shoeboxam (Member), Oct 9, 2024


Maybe this needs clarification: flattening a nested data structure into a csv is not very user-friendly. What is the purpose of such an output? I could totally see having one csv per query. Maybe the json could be part of a zip that also holds csvs, and each release in the json points to a different csv?
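Roughly this kind of bundle, say (a made-up sketch; the names are hypothetical, not part of this PR):

import json
import zipfile

def write_release_bundle(path, releases):
    # releases: mapping of release name -> (metadata dict, polars DataFrame)
    with zipfile.ZipFile(path, "w") as zf:
        index = {}
        for name, (metadata, df) in releases.items():
            csv_name = f"{name}.csv"
            # polars returns the CSV as a string when no path is given
            zf.writestr(csv_name, df.write_csv())
            index[name] = {**metadata, "csv": csv_name}
        zf.writestr("releases.json", json.dumps(index, indent=2))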

Contributor Author


I asked on Slack and wasn't able to get an immediate answer, but Ceilyn did file a tracking issue. I'll check with Ellen tomorrow, but my sense is that no one here has a precise picture of what this should be. Maybe the end users will, if we ask them, but it's not considered a blocker.

To move this forward, I'm going to make a new branch from here with just the outputs, so it's clear that they are separate from the rest of the notebook.

Co-authored-by: Michael Shoemate <[email protected]>
Shoeboxam (Member) commented Oct 9, 2024

To answer your last question: In the grouping algorithm used here, delta corresponds to the probability of releasing a group key that is unique to an individual. Releasing any group key specific to an individual runs the risk of causing a privacy violation: say, for example, if the group keys were social security numbers.

Bin labels aren't as obviously sensitive, but since DP always guards against the worst case, the math protects them both the same. An adversary could still construct bin edges for which they know a bin singles out an individual.

ekraffmiller (Member) left a comment


This looks good to me in general, and I think the idea of a notebook is helpful for users. I think it would also be good to get feedback from someone who represents a typical user.

mccalluc (Contributor Author)

Thanks for the feedback. I've addressed some points and narrowed the scope. Two new PRs are filed for

My understanding from Ceilyn is that they'll try to recruit some real users to give feedback, but we shouldn't wait for that; for now, we just need to use our own judgement.

Mike, I've addressed several of your comments, but in some cases (for instance, accuracy on means) we would need to do something more complicated to get results now, pending new features in OpenDP. If we were just creating a report for users, workarounds might make sense so we could give them the fullest results, but since this is intended to be a model for users, I lean towards keeping it simple, with perhaps at most a link to an issue if we want to explain that a feature is coming.

Please re-review!

mccalluc (Contributor Author)

Some TODOs:

  • The backticks around "Context" make it a little harder to see that it's a link, so take them off.
  • Pull out some of the parts of Context (like the privacy loss) and explain them in their own cells.
  • On the charts, make sure the y-axis doesn't go negative. (If there are negative values, we'll still see the whiskers.)

mccalluc (Contributor Author) commented Dec 5, 2024

Notebook generation has been implemented. This PR was useful in setting direction, but is no longer needed.

mccalluc closed this Dec 5, 2024
mccalluc deleted the 46-dp-demo-notebook branch December 5, 2024 19:25

Successfully merging this pull request may close these issues.

Notebook to demo DP histograms with cut