Gemini LLM redacting implementation and docs (#6)
abdolence authored Aug 7, 2024
1 parent ddddc42 commit 1c62f5f
Showing 8 changed files with 450 additions and 20 deletions.
141 changes: 136 additions & 5 deletions Cargo.lock

Some generated files are not rendered by default.

9 changes: 6 additions & 3 deletions Cargo.toml
@@ -9,7 +9,7 @@ repository = "https://github.com/abdolence/redacter-rs"
documentation = "https://docs.rs/redacter"
readme = "README.md"
include = ["Cargo.toml", "src/**/*.rs", "README.md", "LICENSE"]
-rust-version = "1.77.0"
+rust-version = "1.80.0"
keywords = ["redact", "pii", "dlp"]
categories = ["command-line-utilities"]
description = "Copy & Redact files cli tool utilizing Data Loss Prevention (DLP) capabilities"
@@ -19,7 +19,8 @@ default = []
ci-gcp = [] # For testing on CI/GCP
ci-aws = [] # For testing on CI/AWS
ci-ms-presidio = [] # For testing on CI/MS Presidio
-ci = ["ci-gcp", "ci-aws", "ci-ms-presidio"]
+ci-gcp-llm = [] # For testing on CI/GCP with LLM models
+ci = ["ci-gcp", "ci-aws", "ci-ms-presidio", "ci-gcp-llm"]


[dependencies]
@@ -32,7 +33,7 @@ indicatif = { version = "0.17" }
clap = { version = "4.1", features = ["derive"] }
tokio = { version = "1.14", features = ["fs", "rt-multi-thread", "sync", "rt", "macros"] }
tokio-util = { version = "0.7", features = ["compat"] }
-gcloud-sdk = { version = "0.25.4", features = ["google-privacy-dlp-v2", "google-rest-storage-v1"] }
+gcloud-sdk = { version = "0.25.5", features = ["google-privacy-dlp-v2", "google-rest-storage-v1", "google-ai-generativelanguage-v1beta"] }
futures = "0.3"
sha2 = "0.10"
async-trait = "0.1"
@@ -51,6 +52,8 @@ aws-sdk-s3 = { version = "1" }
aws-sdk-comprehend = { version = "1" }
url = "2"
reqwest = { version = "0.12", default-features = false, features = ["multipart", "rustls-tls"] }
+tracing = "0.1"
+tracing-subscriber = { version = "0.3", features = ["env-filter"] }


[dev-dependencies]
34 changes: 27 additions & 7 deletions README.md
@@ -4,8 +4,9 @@

# Redacter

-Copy & Redact cli tool to securely copy and redact files across various sources and destinations,
-utilizing Data Loss Prevention (DLP) capabilities.
+Copy & Redact CLI tool to securely copy files across various sources and destinations while redacting
+Personally Identifiable Information (PII), utilizing Data Loss Prevention (DLP) capabilities.
+
The tool doesn't implement DLP itself, but rather relies on external models such as
Google Cloud Platform's DLP API.

@@ -25,11 +26,14 @@ Google Cloud Platform's DLP API.
* text, html, json files
* structured data table files (csv)
* images (jpeg, png, bmp, gif)
-* [AWS Comprehend](https://aws.amazon.com/comprehend/) PII redaction for text files.
+* [AWS Comprehend](https://aws.amazon.com/comprehend/) PII redaction:
+* text, html, csv, json files
* [Microsoft Presidio](https://microsoft.github.io/presidio/) for PII redaction (open source project that you can
install on-prem).
-* text, html, json files
+* text, html, csv, json files
* images
+* [Gemini LLM](https://ai.google.dev/gemini-api/docs) based redaction
+* text, html, csv, json files
* ... more DLP providers can be added in the future.
* **CLI:** Easy-to-use command-line interface for streamlined workflows.
* Built with Rust to ensure speed, safety, and reliability.
@@ -63,7 +67,7 @@ Options:
-f, --filename-filter <FILENAME_FILTER>
Filter by name using glob patterns such as *.txt
-d, --redact <REDACT>
-Redacter type [possible values: gcp-dlp, aws-comprehend, ms-presidio]
+Redacter type [possible values: gcp-dlp, aws-comprehend, ms-presidio, gemini-llm]
--gcp-project-id <GCP_PROJECT_ID>
GCP project id that will be used to redact and bill API calls
--allow-unsupported-copies
@@ -78,6 +82,8 @@ Options:
URL for text analyze endpoint for MsPresidio redacter
--ms-presidio-image-redact-url <MS_PRESIDIO_IMAGE_REDACT_URL>
URL for image redact endpoint for MsPresidio redacter
+--gemini-model <GEMINI_MODEL>
+Gemini model name for Gemini LLM redacter. Default is 'models/gemini-1.5-flash'
-h, --help
Print help
```
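
As a rough illustration of the options listed above, the following Rust sketch maps the `--redact` values to an enum and applies the documented default for `--gemini-model`. The names `RedacterType`, `DEFAULT_GEMINI_MODEL`, and `resolve_gemini_model` are hypothetical, std-only stand-ins. The tool itself depends on clap (per Cargo.toml), and its real types may look different.

```rust
use std::str::FromStr;

/// Hypothetical mirror of the values accepted by `--redact`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RedacterType {
    GcpDlp,
    AwsComprehend,
    MsPresidio,
    GeminiLlm,
}

impl FromStr for RedacterType {
    type Err = String;

    /// Parses the exact strings shown in the help text.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "gcp-dlp" => Ok(Self::GcpDlp),
            "aws-comprehend" => Ok(Self::AwsComprehend),
            "ms-presidio" => Ok(Self::MsPresidio),
            "gemini-llm" => Ok(Self::GeminiLlm),
            other => Err(format!("unknown redacter type: {other}")),
        }
    }
}

/// Default model name as stated in the help text for `--gemini-model`.
const DEFAULT_GEMINI_MODEL: &str = "models/gemini-1.5-flash";

/// Returns the model passed on the command line, or the documented default.
fn resolve_gemini_model(cli_value: Option<&str>) -> &str {
    cli_value.unwrap_or(DEFAULT_GEMINI_MODEL)
}
```

With these definitions, `"gemini-llm".parse::<RedacterType>()` yields the `GeminiLlm` variant, and `resolve_gemini_model(None)` falls back to the default model name.
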
@@ -97,8 +103,11 @@ Source/destination can be a local file or directory, or a file in GCS, S3, or a

### Google Cloud Platform DLP

-To be able to use GCP DLP you need to authenticate using `gcloud auth application-default login` or provide a service
-account key using `GOOGLE_APPLICATION_CREDENTIALS` environment variable.
+To be able to use GCP DLP you need to:
+
+- authenticate using `gcloud auth application-default login` or provide a service account key
+  using the `GOOGLE_APPLICATION_CREDENTIALS` environment variable.
+- provide a GCP project id using the `--gcp-project-id` option.
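
The two authentication routes above correspond to the first two steps of Google's Application Default Credentials lookup: the `GOOGLE_APPLICATION_CREDENTIALS` variable is checked first, then the file written by `gcloud auth application-default login`. A minimal sketch of that order follows, assuming a Linux/macOS home directory layout. `CredentialSource` and `resolve_credentials` are hypothetical names, and the lookup performed by the `gcloud-sdk` crate may differ in detail.

```rust
use std::path::{Path, PathBuf};

/// Where credentials would be read from (illustrative only).
#[derive(Debug, PartialEq, Eq)]
enum CredentialSource {
    /// Service account key file named by GOOGLE_APPLICATION_CREDENTIALS.
    KeyFile(PathBuf),
    /// File written by `gcloud auth application-default login`.
    GcloudUserFile(PathBuf),
    /// Nothing usable was found locally.
    NotConfigured,
}

/// Pure helper: the environment value, home directory, and file check are
/// passed in so the lookup order can be exercised without touching the system.
fn resolve_credentials(
    key_file_env: Option<&str>,
    home_dir: Option<&str>,
    file_exists: impl Fn(&Path) -> bool,
) -> CredentialSource {
    // Step 1: an explicit, non-empty key file path wins.
    if let Some(path) = key_file_env.filter(|p| !p.is_empty()) {
        return CredentialSource::KeyFile(PathBuf::from(path));
    }
    // Step 2: fall back to the well-known gcloud user credentials file.
    if let Some(home) = home_dir {
        let candidate = PathBuf::from(home)
            .join(".config")
            .join("gcloud")
            .join("application_default_credentials.json");
        if file_exists(candidate.as_path()) {
            return CredentialSource::GcloudUserFile(candidate);
        }
    }
    CredentialSource::NotConfigured
}
```

In a real program the arguments would typically come from `std::env::var("GOOGLE_APPLICATION_CREDENTIALS").ok().as_deref()`, `std::env::var("HOME").ok().as_deref()`, and `|p| p.exists()`.
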

### AWS Comprehend

@@ -113,6 +122,17 @@ You can use Docker to run it locally or deploy it to your infrastructure.
You need to provide the URLs for text analysis and image redaction endpoints using `--ms-presidio-text-analyze-url` and
`--ms-presidio-image-redact-url` options.

+### Gemini LLM
+
+To be able to use Gemini LLM you need to:
+
+- authenticate using `gcloud auth application-default login --client-id-file=<client_secret-file>.json` or provide a
+  service account key
+  using the `GOOGLE_APPLICATION_CREDENTIALS` environment variable.
+  Please note that you also need to configure OAuth following the
+  official [instructions](https://ai.google.dev/gemini-api/docs/oauth#set-cloud).
+- provide a GCP project id using the `--gcp-project-id` option.
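
Once credentials are in place, a Gemini-based run combines `--redact gemini-llm`, `--gcp-project-id`, and optionally `--gemini-model`. The sketch below assembles such an invocation with `std::process::Command` without executing it. It assumes a binary named `redacter` on `PATH` and a `cp` subcommand that takes a source and a destination; neither appears in the options shown here, and `gemini_redact_command` is a hypothetical helper.

```rust
use std::process::Command;

/// Builds (but does not run) a redacter invocation that uses the Gemini LLM redacter.
fn gemini_redact_command(
    project_id: &str,
    model: Option<&str>,
    source: &str,
    destination: &str,
) -> Command {
    // Assumption: the CLI binary is called `redacter` and is on PATH.
    let mut cmd = Command::new("redacter");
    // Assumption: `cp` is the copy subcommand.
    cmd.arg("cp")
        .args(["--redact", "gemini-llm"])
        .args(["--gcp-project-id", project_id]);
    // Omit the flag to let the tool fall back to its default model.
    if let Some(model) = model {
        cmd.args(["--gemini-model", model]);
    }
    cmd.arg(source).arg(destination);
    cmd
}
```

Calling `.status()` on the returned `Command` would launch the tool; inspecting `.get_args()` first is a cheap way to log or review the exact command line.
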

## Examples:

```sh