From 57ea904be9a8ae541a3c41db7f9857e300cf36ea Mon Sep 17 00:00:00 2001
From: "David B. Kinder"
Date: Thu, 5 Sep 2024 15:40:05 -0700
Subject: [PATCH] doc: fix headings and indents

* fix heading levels
* remove $ on command examples
* fix markdown coding errors: indenting and spaces in emphasis

Signed-off-by: David B. Kinder
---
 README.md                           | 31 ++++++++++++++++---------------
 evals/benchmark/stresscli/README.md | 14 +++++++-------
 evals/metrics/bleu/README.md        | 28 ++++------------------------
 3 files changed, 27 insertions(+), 46 deletions(-)

diff --git a/README.md b/README.md
index f7892fca..9216ba75 100644
--- a/README.md
+++ b/README.md
@@ -69,27 +69,27 @@ results = evaluate(args)
 
 1. setup a separate server with [GenAIComps](https://github.com/opea-project/GenAIComps/tree/main/comps/llms/lm-eval)
 
-```
-# build cpu docker
-docker build -f Dockerfile.cpu -t opea/lm-eval:latest .
+   ```
+   # build cpu docker
+   docker build -f Dockerfile.cpu -t opea/lm-eval:latest .
 
-# start the server
-docker run -p 9006:9006 --ipc=host -e MODEL="hf" -e MODEL_ARGS="pretrained=Intel/neural-chat-7b-v3-3" -e DEVICE="cpu" opea/lm-eval:latest
-```
+   # start the server
+   docker run -p 9006:9006 --ipc=host -e MODEL="hf" -e MODEL_ARGS="pretrained=Intel/neural-chat-7b-v3-3" -e DEVICE="cpu" opea/lm-eval:latest
+   ```
 
 2. evaluate the model
 
-- set `base_url`, `tokenizer` and `--model genai-hf`
+   - set `base_url`, `tokenizer` and `--model genai-hf`
 
-```
-cd evals/evaluation/lm_evaluation_harness/examples
+   ```
+   cd evals/evaluation/lm_evaluation_harness/examples
 
-python main.py \
-    --model genai-hf \
-    --model_args "base_url=http://{your_ip}:9006,tokenizer=Intel/neural-chat-7b-v3-3" \
-    --tasks "lambada_openai" \
-    --batch_size 2
-```
+   python main.py \
+       --model genai-hf \
+       --model_args "base_url=http://{your_ip}:9006,tokenizer=Intel/neural-chat-7b-v3-3" \
+       --tasks "lambada_openai" \
+       --batch_size 2
+   ```
 
 ### bigcode-evaluation-harness
 For evaluating the models on coding tasks or specifically coding LLMs, we follow the [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness) and provide the command line usage and function call usage. [HumanEval](https://huggingface.co/datasets/openai_humaneval), [HumanEval+](https://huggingface.co/datasets/evalplus/humanevalplus), [InstructHumanEval](https://huggingface.co/datasets/codeparrot/instructhumaneval), [APPS](https://huggingface.co/datasets/codeparrot/apps), [MBPP](https://huggingface.co/datasets/mbpp), [MBPP+](https://huggingface.co/datasets/evalplus/mbppplus), and [DS-1000](https://github.com/HKUNLP/DS-1000/) for both completion (left-to-right) and insertion (FIM) mode are available.
@@ -104,6 +104,7 @@ python main.py \
     --batch_size 10 \
     --allow_code_execution
 ```
+
 #### function call usage
 ```python
 from evals.evaluation.bigcode_evaluation_harness import BigcodeEvalParser, evaluate
diff --git a/evals/benchmark/stresscli/README.md b/evals/benchmark/stresscli/README.md
index a2f1bfa1..2cc7fc98 100644
--- a/evals/benchmark/stresscli/README.md
+++ b/evals/benchmark/stresscli/README.md
@@ -35,7 +35,7 @@ pip install -r requirements.txt
 
 ### Usage
 ```
-$ ./stresscli.py --help
+./stresscli.py --help
 Usage: stresscli.py [OPTIONS] COMMAND [ARGS]...
 
   StressCLI - A command line tool for stress testing OPEA workloads.
@@ -60,7 +60,7 @@ Commands:
 
 More detail options:
 ```
-$ ./stresscli.py load-test --help
+./stresscli.py load-test --help
 Usage: stresscli.py load-test [OPTIONS]
 
   Do load test
@@ -74,12 +74,12 @@ Options:
 
 You can generate the report for test cases by:
 ```
-$ ./stresscli.py report --folder /home/sdp/test_reports/20240710_004105 --format csv -o data.csv
+./stresscli.py report --folder /home/sdp/test_reports/20240710_004105 --format csv -o data.csv
 ```
 
 More detail options:
 ```
-$ ./stresscli.py report --help
+./stresscli.py report --help
 Usage: stresscli.py report [OPTIONS]
 
   Print the test report
@@ -101,7 +101,7 @@ You can dump the current testing profile by
 ```
 More detail options:
 ```
-$ ./stresscli.py dump --help
+./stresscli.py dump --help
 Usage: stresscli.py dump [OPTIONS]
 
   Dump the test spec
@@ -115,12 +115,12 @@ Options:
 
 You can validate if the current K8s and workloads deployment comply with the test spec by:
 ```
-$ ./stresscli.py validate --file testspec.yaml
+./stresscli.py validate --file testspec.yaml
 ```
 
 More detail options:
 ```
-$ ./stresscli.py validate --help
+./stresscli.py validate --help
 Usage: stresscli.py validate [OPTIONS]
 
   Validate against the test spec
diff --git a/evals/metrics/bleu/README.md b/evals/metrics/bleu/README.md
index d92598f6..cd6985f0 100644
--- a/evals/metrics/bleu/README.md
+++ b/evals/metrics/bleu/README.md
@@ -1,28 +1,5 @@
----
-title: BLEU
-emoji: 🤗
-colorFrom: blue
-colorTo: red
-sdk: gradio
-sdk_version: 3.19.1
-app_file: app.py
-pinned: false
-tags:
-- evaluate
-- metric
-description: >-
-  BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another.
-  Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is"
-  – this is the central idea behind BLEU. BLEU was one of the first metrics to claim a high correlation with human judgements of quality, and remains one of the most popular automated and inexpensive metrics.
-
-  Scores are calculated for individual translated segments—generally sentences—by comparing them with a set of good quality reference translations.
-  Those scores are then averaged over the whole corpus to reach an estimate of the translation's overall quality.
-  Neither intelligibility nor grammatical correctness are not taken into account.
----
-
 # Metric Card for BLEU
-
 ## Metric Description
 BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is" – this is the central idea behind BLEU. BLEU was one of the first metrics to claim a high correlation with human judgements of quality, and remains one of the most popular automated and inexpensive metrics.
@@ -48,17 +25,20 @@ This metric takes as input a list of predicted sentences and a list of lists of
 ```
 
 ### Inputs
+
 - **predictions** (`list` of `str`s): Translations to score.
 - **references** (`list` of `list`s of `str`s): references for each translation.
-- ** tokenizer** : approach used for standardizing `predictions` and `references`.
+- **tokenizer** : approach used for standardizing `predictions` and `references`.
The default tokenizer is `tokenizer_13a`, a relatively minimal tokenization approach that is however equivalent to `mteval-v13a`, used by WMT. This can be replaced by another tokenizer from a source such as [SacreBLEU](https://github.com/mjpost/sacrebleu/tree/master/sacrebleu/tokenizers). The default tokenizer is based on whitespace and regexes. It can be replaced by any function that takes a string as input and returns a list of tokens as output. E.g. `word_tokenize()` from [NLTK](https://www.nltk.org/api/nltk.tokenize.html) or pretrained tokenizers from the [Tokenizers library](https://huggingface.co/docs/tokenizers/index). + - **max_order** (`int`): Maximum n-gram order to use when computing BLEU score. Defaults to `4`. - **smooth** (`boolean`): Whether or not to apply Lin et al. 2004 smoothing. Defaults to `False`. ### Output Values + - **bleu** (`float`): bleu score - **precisions** (`list` of `float`s): geometric mean of n-gram precisions, - **brevity_penalty** (`float`): brevity penalty,
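
The inputs and output values listed above can be exercised end to end. The following is a minimal sketch that assumes the Hugging Face `evaluate` implementation of BLEU, which this metric card mirrors; the `evals/metrics/bleu` module in this repository may expose the metric through a different entry point.

```python
# Minimal sketch of the BLEU inputs/outputs described above.
# Assumption: the Hugging Face `evaluate` implementation (this metric card
# mirrors its card); the evals/metrics/bleu wrapper may differ.
import evaluate

bleu = evaluate.load("bleu")

# predictions: list of str; references: list of list of str
# (one or more reference translations per prediction)
predictions = ["hello there general kenobi", "foo bar foobar"]
references = [
    ["hello there general kenobi", "hello there !"],
    ["foo bar foobar"],
]

# max_order and smooth correspond to the optional inputs listed above.
results = bleu.compute(
    predictions=predictions,
    references=references,
    max_order=4,
    smooth=False,
)

print(results["bleu"])             # overall BLEU score (float)
print(results["precisions"])       # n-gram precisions (list of float)
print(results["brevity_penalty"])  # brevity penalty (float)
```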