Initial Commit
MarkChenYutian committed Oct 19, 2023
0 parents commit ee17c9d
Showing 90 changed files with 6,607 additions and 0 deletions.
28 changes: 28 additions & 0 deletions .gitignore
@@ -0,0 +1,28 @@
# IDE Settings
/.idea/
/.vscode/

# Cached Files
*.DS_Store
**/__pycache__/
**/wandb/
**/storage/
result/cache/*.json

# Dataset
*.jsonl

# Weight Files
*.pt
*.pickle
*.pkl

# Sensitive Files
**/secret.json
**/client_state.json
**/cookies.pkl
**/secrets.yaml

# Log Files
*.log
*.lock
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 Yutian Chen

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
27 changes: 27 additions & 0 deletions README.md
@@ -0,0 +1,27 @@
# T5-Sentinel-public

Release repo for our work "Token Prediction as Implicit Classification to Identify LLM-Generated Text"

## Requirements

Install the dependencies listed in `requirements.txt` in the root directory.

## Evaluate

1. Download the checkpoints `0622.hidden.a.pt`, `t5-small.0613.a.pt` and `solaiman-detector-base.pt` and place them in the `./data/checkpoint` directory. The models can be found on the Releases page of this repository.
2. Download the OpenLLMText dataset into the `./data/split` directory.

3. Run the following files:
    1. `./evaluator/calc/calc_accuracy.py` to calculate the accuracy under different settings for each module
    2. `./evaluator/interpret/integrated_gradient.ipynb` to calculate the integrated gradients for samples
    3. `./evaluator/interpret/sample_pca.py` to run the PCA analysis on the hidden layers for the test subset
    4. `./evaluator/plot/*.py` to generate plots of related metrics (confusion matrix, ROC, DET, etc.)
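
Before running the evaluation scripts, it can help to confirm that the checkpoints from step 1 are in place. The following is a minimal sketch, not part of the repository: the helper name `missing_checkpoints` is hypothetical, and only the three file names and the `./data/checkpoint` directory come from the steps above.

```python
from pathlib import Path
from typing import List

# Checkpoint file names as listed in step 1 of the Evaluate section.
REQUIRED_CHECKPOINTS = (
    "0622.hidden.a.pt",
    "t5-small.0613.a.pt",
    "solaiman-detector-base.pt",
)


def missing_checkpoints(checkpoint_dir: Path) -> List[str]:
    """Return the required checkpoint names that are absent from checkpoint_dir.

    Hypothetical helper for illustration; it only checks that the files exist,
    not that they are valid weight files.
    """
    return [
        name
        for name in REQUIRED_CHECKPOINTS
        if not (checkpoint_dir / name).is_file()
    ]
```

Calling `missing_checkpoints(Path("data", "checkpoint"))` from the repository root returns an empty list once all three files have been downloaded.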

## Train

1. Run `./detector/t5/arbitrary/__main__.py` to train the T5-Sentinel model.

    (The detailed hyperparameter setup we used to train the T5-Sentinel model in the paper is given in `settings_0613_full.yaml`.)

2. Run `./detector/t5/arbitrary_hidden/__main__.py` to train the T5-Hidden model.

3 changes: 3 additions & 0 deletions cache/readme.md
@@ -0,0 +1,3 @@
# Cache Directory

This directory contains intermediate results computed by other files and function calls, memoized so that repeated calculations run faster.
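
As an illustration of the idea, here is a minimal sketch of file-backed memoization. It is an assumption about how such a cache could work, not the repository's actual implementation: the function name `memoize_to_json` and the choice of JSON as the storage format are hypothetical.

```python
import json
from pathlib import Path
from typing import Callable


def memoize_to_json(cache_file: Path, compute: Callable[[], dict]) -> dict:
    """Return the cached result if cache_file exists, else compute and store it.

    Hypothetical helper: the result must be JSON-serializable.
    """
    if cache_file.is_file():
        # Cache hit: reuse the stored result instead of recomputing.
        with open(cache_file, "r", encoding="utf-8") as f:
            return json.load(f)

    # Cache miss: run the (possibly expensive) computation and persist it.
    result = compute()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with open(cache_file, "w", encoding="utf-8") as f:
        json.dump(result, f)
    return result
```

With this sketch, deleting the cache file is enough to force a recomputation on the next call.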
11 changes: 11 additions & 0 deletions data/baselines/openai_classifier_output/README.md
@@ -0,0 +1,11 @@
# OpenAI Classifier Output

This folder collects the classification results of the OpenAI text classifier (GPT detector):

https://platform.openai.com/ai-text-classifier

The results are collected automatically by the async web client in `./src/baseline/openai_client.py`.

* The file `gpt2-output-gpt-openai.jsonl` contains the classification results for `xl-1542M.test.jsonl` in the `GPT2-output` dataset.

* The file `gpt2-output-web-openai.jsonl` contains the classification results for `webtext.test.jsonl` in the `GPT2-output` dataset.
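
The result files are JSON Lines files, so they can be read one record at a time. The sketch below assumes each line is a JSON object with the keys `uid`, `extra` and `res`, which is what the collection client in this commit writes. The helper name `iter_results` is hypothetical, and the internal structure of `res` (the raw API response) is deliberately left uninterpreted.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_results(path: Path) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a .jsonl result file.

    Hypothetical helper; assumes records shaped like
    {"uid": ..., "extra": ..., "res": ...}.
    """
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline
                yield json.loads(line)
```

For example, `[r["uid"] for r in iter_results(Path("gpt2-output-gpt-openai.jsonl"))]` would list the identifiers of all classified samples in that file.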
7 changes: 7 additions & 0 deletions data/baselines/zerogpt_classifier_output/README.md
@@ -0,0 +1,7 @@
# ZeroGPT Classifier Output

This folder collects the classification results of the ZeroGPT text classifier (GPT detector):

https://www.zerogpt.com/

The results are collected automatically by the async web client in `./src/baseline/zerogpt_client.py`.
1 change: 1 addition & 0 deletions data/checkpoint/README.md
@@ -0,0 +1 @@
Where the checkpoints are stored.
53 changes: 53 additions & 0 deletions data/download.py
@@ -0,0 +1,53 @@
import io
import zipfile
import requests

from pathlib import Path

from pipeline.lib.sanitize_dataset import sanitize
from pipeline.lib.build_abalation import build_clean_variants
from pipeline.lib.report_entry_count import report

sources = ["gpt2-output", "open-gpt-text", "open-llama-text", "open-palm-text", "open-web-text"]
from_subsets = ["test-dirty.jsonl", "train-dirty.jsonl", "valid-dirty.jsonl"]
to_subsets = ["test.jsonl", "train.jsonl", "valid.jsonl"]

def downloadAndExtractTo(url: str, to: Path):
    print(f"Downloading: {url} => {to}")
    file = zipfile.ZipFile(io.BytesIO(requests.get(url, stream=True).content))
    file.extractall(to)


if __name__ == "__main__":
    from_files = [Path(source, from_subset)
                  for from_subset in from_subsets
                  for source in sources]
    to_files = [Path(source, to_subset)
                for to_subset in to_subsets
                for source in sources]

    downloadAndExtractTo("https://zenodo.org/records/8285326/files/GPT2.zip?download=1", Path("data", "split", "gpt2-output"))
    downloadAndExtractTo("https://zenodo.org/records/8285326/files/ChatGPT.zip?download=1", Path("data", "split", "open-gpt-text"))
    downloadAndExtractTo("https://zenodo.org/records/8285326/files/LLaMA.zip?download=1", Path("data", "split", "open-llama-text"))
    downloadAndExtractTo("https://zenodo.org/records/8285326/files/PaLM.zip?download=1", Path("data", "split", "open-palm-text"))
    downloadAndExtractTo("https://zenodo.org/records/8285326/files/Human.zip?download=1", Path("data", "split", "open-web-text"))
    downloadAndExtractTo("https://zenodo.org/records/8285326/files/ZeroGPT-baseline-response.zip?download=1", Path("data", "baselines", "zerogpt_classifier_output"))
    downloadAndExtractTo("https://zenodo.org/records/8285326/files/OpenAI-baseline-response.zip?download=1", Path("data", "baselines", "openai_classifier_output"))

    # Report
    print("Download Finished!\n\nDataset Statistics:\n")
    for source in sources:
        for subset in from_subsets:
            report(source, subset)
        print("\n")

    # Build cleaned up dataset version
    sanitize(from_files, to_files)

    # Build clean variants for the large ablation table
    build_clean_variants(Path("data", "split", "open-palm-text"))
    build_clean_variants(Path("data", "split", "open-web-text"))
    build_clean_variants(Path("data", "split", "open-gpt-text"))
    build_clean_variants(Path("data", "split", "gpt2-output"))
    build_clean_variants(Path("data", "split", "open-llama-text"))
124 changes: 124 additions & 0 deletions detector/openai_classifier/openai_classifier_client.py
@@ -0,0 +1,124 @@
"""
@brief: An async generator used to collect OpenAI's classifier's response on test dataset
@author: Yutian Chen <[email protected]>
@date: May 16, 2023
"""
import asyncio
import aiohttp
import yaml
import json
import time

from typing import TypedDict, List, Tuple
from pathlib import Path
from generator.client_base import AsyncRequestClient, TaskResult
from pipeline.component.text_component import TextEntry
import pipeline.component.text_component as P

# Typing

class OpenAIState(TypedDict):
    processed: set


class OpenAIConfig(TypedDict):
    InputDirectory: List[str]
    OutputDirectory: List[str]
    WaitTime: float
    Header: dict
    URL: str

OpenAIArgs = Tuple[TextEntry, Path]
OpenAI_Type = AsyncRequestClient[OpenAIState, OpenAIArgs, OpenAIConfig]
###

load_data_fn = P.FromJsonStr() >> P.WriteExtra({"pred_by": "openai", "variant": "original"})

async def openai_request_fn(self: OpenAI_Type, state: OpenAIState, *args: OpenAIArgs) -> TaskResult:
    entry: TextEntry
    destination: Path
    entry, destination = args

    submission = {
        "model": "model-detect-v2",
        "max_tokens": 1, "temperature": 1, "top_p": 1, "n": 1, "logprobs": 5,
        "stop": "\n", "stream": False,
        "prompt": entry["text"] + "<|disc_score|>"
    }

    async with self.worker_lock:
        start_time = time.time()
        try:
            async with aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=10)) as session:
                async with session.post(self.config["URL"], headers=self.config["Header"], json=submission) as response:
                    status_code = response.status
                    result = await response.json()

            duration = time.time() - start_time
            if status_code != 200:
                await asyncio.sleep(self.config["WaitTime"] - duration)
                return TaskResult.RETRY

            async with self.writer_lock:
                serializable = {
                    "uid": entry["uid"],
                    "extra": entry["extra"],
                    "res": result
                }
                with open(destination, "a", encoding="utf-8") as f:
                    f.write(json.dumps(serializable) + "\n")

            duration = time.time() - start_time
            await asyncio.sleep(self.config["WaitTime"] - duration)

        except (aiohttp.ClientError, aiohttp.ServerTimeoutError, aiohttp.ServerDisconnectedError):
            await asyncio.sleep(self.config["WaitTime"])
            return TaskResult.RETRY

        except Exception as e:
            print("[x]\tUnexpected exception: ", e)
            return TaskResult.CANCEL

    return TaskResult.FINISH


def openai_pred_fn(client: OpenAI_Type, state: OpenAIState, *args: OpenAIArgs) -> bool:
    entry: TextEntry
    entry, dest = args
    return entry["uid"] not in state["processed"]


def openai_task_generator(client: OpenAI_Type, state: OpenAIState) -> List[OpenAIArgs]:
    Tasks = []
    for input_file, output_file in zip(client.config["InputDirectory"], client.config["OutputDirectory"]):
        counter = 0
        print(f"{input_file} --> {output_file}", end="\tCount:")
        assert Path(input_file).exists()
        with open(input_file, "r") as f:
            for line in f.read().strip().split("\n"):
                Tasks.append((load_data_fn(line), Path(output_file)))
                counter += 1
        print(counter)
    return Tasks


def openai_state_initializer(client: OpenAI_Type) -> OpenAIState:
    return {"processed": set()}


if __name__ == "__main__":
    with open("./detector/openai_classifier/openai_classifier_client.yaml", "r") as f:
        openai_config = yaml.safe_load(f)

    with open("./detector/openai_classifier/secret.json", "r") as f:
        openai_secret = json.load(f)
    openai_config["Config"]["Header"].update(openai_secret)

    OpenAIClient = OpenAI_Type(
        openai_config,
        openai_request_fn,
        openai_pred_fn,
        openai_task_generator,
        openai_state_initializer,
        display_args=lambda args: args[0]["uid"]
    )
    asyncio.run(OpenAIClient.execute())
35 changes: 35 additions & 0 deletions detector/openai_classifier/openai_classifier_client.yaml
@@ -0,0 +1,35 @@
ClientName: "openai_classifier"
ClientRoot: "./detector/openai_classifier/"

MaxAsyncWorkerCnt: 120
MaxRetryCnt: 3

Config:
  InputDirectory:
    # - "./data/split/open-gpt-text/test-dirty.jsonl"
    # - "./data/split/open-web-text/test-dirty.jsonl"
    # - "./data/split/open-palm-text/test-dirty.jsonl"
    # - "./data/split/open-llama-text/test-dirty.jsonl"
    # - "./data/split/gpt2-output/test-dirty.jsonl"
    - "./data/split/hc3-test/hc3-human.jsonl"
    - "./data/split/hc3-test/hc3-chatgpt.jsonl"

  OutputDirectory:
    # - "./data/baselines/openai_classifier_output/open-gpt-text.jsonl"
    # - "./data/baselines/openai_classifier_output/open-web-text.jsonl"
    # - "./data/baselines/openai_classifier_output/open-palm-text.jsonl"
    # - "./data/baselines/openai_classifier_output/open-llama-text.jsonl"
    # - "./data/baselines/openai_classifier_output/gpt2-output.jsonl"
    - "./data/baselines/openai_classifier_output/hc3-human.jsonl"
    - "./data/baselines/openai_classifier_output/hc3-chatgpt.jsonl"

  WaitTime: 60
  URL: https://api.openai.com/v1/completions

  Header:
    Content-Type: application/json
    Referer: https://platform.openai.com/
    Origin: https://platform.openai.com
    User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36
    OpenAI-Organization: [in secret.json]
    Authorization: [in secret.json]