Gemini LLM redacting implementation and docs #6

Merged
merged 1 commit into from
Aug 7, 2024
141 changes: 136 additions & 5 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

9 changes: 6 additions & 3 deletions Cargo.toml
@@ -9,7 +9,7 @@ repository = "https://github.com/abdolence/redacter-rs"
documentation = "https://docs.rs/redacter"
readme = "README.md"
include = ["Cargo.toml", "src/**/*.rs", "README.md", "LICENSE"]
rust-version = "1.77.0"
rust-version = "1.80.0"
keywords = ["redact", "pii", "dlp"]
categories = ["command-line-utilities"]
description = "Copy & Redact files cli tool utilizing Data Loss Prevention (DLP) capabilities"
@@ -19,7 +19,8 @@ default = []
ci-gcp = [] # For testing on CI/GCP
ci-aws = [] # For testing on CI/AWS
ci-ms-presidio = [] # For testing on CI/MS Presidio
ci = ["ci-gcp", "ci-aws", "ci-ms-presidio"]
ci-gcp-llm = [] # For testing on CI/GCP with LLM models
ci = ["ci-gcp", "ci-aws", "ci-ms-presidio", "ci-gcp-llm"]


[dependencies]
@@ -32,7 +33,7 @@ indicatif = { version = "0.17" }
clap = { version = "4.1", features = ["derive"] }
tokio = { version = "1.14", features = ["fs", "rt-multi-thread", "sync", "rt", "macros"] }
tokio-util = { version = "0.7", features = ["compat"] }
gcloud-sdk = { version = "0.25.4", features = ["google-privacy-dlp-v2", "google-rest-storage-v1"] }
gcloud-sdk = { version = "0.25.5", features = ["google-privacy-dlp-v2", "google-rest-storage-v1", "google-ai-generativelanguage-v1beta"] }
futures = "0.3"
sha2 = "0.10"
async-trait = "0.1"
@@ -51,6 +52,8 @@ aws-sdk-s3 = { version = "1" }
aws-sdk-comprehend = { version = "1" }
url = "2"
reqwest = { version = "0.12", default-features = false, features = ["multipart", "rustls-tls"] }
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["env-filter"] }


[dev-dependencies]
34 changes: 27 additions & 7 deletions README.md
@@ -4,8 +4,9 @@

# Redacter

Copy & Redact cli tool to securely copy and redact files across various sources and destinations,
utilizing Data Loss Prevention (DLP) capabilities.
Copy & Redact CLI tool to securely copy and redact files, removing Personally Identifiable Information (PII),
across various sources and destinations, utilizing Data Loss Prevention (DLP) capabilities.

The tool doesn't implement DLP itself, but rather relies on external models such as
Google Cloud Platform's DLP API.

@@ -25,11 +26,14 @@ Google Cloud Platform's DLP API.
* text, html, json files
* structured data table files (csv)
* images (jpeg, png, bmp, gif)
* [AWS Comprehend](https://aws.amazon.com/comprehend/) PII redaction for text files.
* [AWS Comprehend](https://aws.amazon.com/comprehend/) PII redaction:
* text, html, csv, json files
* [Microsoft Presidio](https://microsoft.github.io/presidio/) for PII redaction (open source project that you can
install on-prem).
* text, html, json files
* text, html, csv, json files
* images
* [Gemini LLM](https://ai.google.dev/gemini-api/docs) based redaction
* text, html, csv, json files
* ... more DLP providers can be added in the future.
* **CLI:** Easy-to-use command-line interface for streamlined workflows.
* Built with Rust to ensure speed, safety, and reliability.
@@ -63,7 +67,7 @@ Options:
-f, --filename-filter <FILENAME_FILTER>
Filter by name using glob patterns such as *.txt
-d, --redact <REDACT>
Redacter type [possible values: gcp-dlp, aws-comprehend, ms-presidio]
Redacter type [possible values: gcp-dlp, aws-comprehend, ms-presidio, gemini-llm]
--gcp-project-id <GCP_PROJECT_ID>
GCP project id that will be used to redact and bill API calls
--allow-unsupported-copies
@@ -78,6 +82,8 @@ Options:
URL for text analyze endpoint for MsPresidio redacter
--ms-presidio-image-redact-url <MS_PRESIDIO_IMAGE_REDACT_URL>
URL for image redact endpoint for MsPresidio redacter
--gemini-model <GEMINI_MODEL>
Gemini model name for Gemini LLM redacter. Default is 'models/gemini-1.5-flash'
-h, --help
Print help
```
@@ -97,8 +103,11 @@ Source/destination can be a local file or directory, or a file in GCS, S3, or a

### Google Cloud Platform DLP

To be able to use GCP DLP you need to authenticate using `gcloud auth application-default login` or provide a service
account key using `GOOGLE_APPLICATION_CREDENTIALS` environment variable.
To be able to use GCP DLP you need to:

- authenticate using `gcloud auth application-default login` or provide a service account key
  using the `GOOGLE_APPLICATION_CREDENTIALS` environment variable.
- provide a GCP project id using the `--gcp-project-id` option.

### AWS Comprehend

@@ -113,6 +122,17 @@ You can use Docker to run it locally or deploy it to your infrastructure.
You need to provide the URLs for text analysis and image redaction endpoints using `--ms-presidio-text-analyze-url` and
`--ms-presidio-image-redact-url` options.

### Gemini LLM

To be able to use Gemini LLM you need to:

- authenticate using `gcloud auth application-default login --client-id-file=<client_secret-file>.json` or provide a
  service account key using the `GOOGLE_APPLICATION_CREDENTIALS` environment variable.
  Please note that you also need to configure OAuth following the
  official [instructions](https://ai.google.dev/gemini-api/docs/oauth#set-cloud).
- provide a GCP project id using the `--gcp-project-id` option.
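
Once authentication is in place, an invocation might look like the dry-run sketch below. It only assembles and prints a command, so nothing is sent to Gemini. The project id and file paths are hypothetical placeholders, and the `cp` subcommand with positional source and destination arguments is an assumption, since the usage line is not part of this hunk. The `-d`, `--gcp-project-id`, and `--gemini-model` options come from the options list above.

```sh
#!/bin/sh
# Dry-run sketch: assemble a Gemini LLM redaction command and print it.
# All values below are placeholders, not real resources.
GCP_PROJECT_ID="my-gcp-project"
GEMINI_MODEL="models/gemini-1.5-flash"   # the default named in the options list
SRC="./input.txt"
DST="./redacted/input.txt"

# Uses only options from the help text: -d/--redact, --gcp-project-id, --gemini-model.
# The `cp` subcommand and the positional source/destination are assumptions.
CMD="redacter cp -d gemini-llm --gcp-project-id $GCP_PROJECT_ID --gemini-model $GEMINI_MODEL $SRC $DST"

# Print for review instead of executing.
echo "$CMD"
```

Since `--gemini-model` has a documented default, it can likely be omitted unless a different model is wanted.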

## Examples:

```sh