Commit

Fix: Broken links in README.rst banner text [skip ci] (#33)
* Fix: Broken links in README.rst banner text [skip ci]

Signed-off-by: Matthew Watkins <[email protected]>

* Chore: pre-commit autoupdate

---------

Signed-off-by: Matthew Watkins <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent 45d61a6 commit 7d4d882
Showing 2 changed files with 57 additions and 40 deletions.
4 changes: 2 additions & 2 deletions README.rst
@@ -1,7 +1,7 @@

💬 Important

On June 26, 2024, the Linux Foundation announced the merger of its financial services umbrella, the Fintech Open Source Foundation (`FINOS <https://finos.org>`_), with OS-Climate, an open source community dedicated to building data technologies, modelling, and analytic tools that will drive global capital flows into climate change mitigation and resilience. OS-Climate projects are in the process of transitioning to the `FINOS governance framework <https://community.finos.org/docs/governance>`_; read more at `finos.org/press/finos-join-forces-os-open-source-climate-sustainability-esg <https://finos.org/press/finos-join-forces-os-open-source-climate-sustainability-esg>`_.


.. image:: https://img.shields.io/badge/OS-Climate-blue
93 changes: 55 additions & 38 deletions src/osc_transformer_based_extractor/README.md
@@ -1,52 +1,55 @@
---

# Relevance Detector

This folder contains scripts and notebooks to process data, train a sentence transformer model, and run inference to detect whether a paragraph is relevant to a given question. Below is a detailed description of each file and folder in this repository.

## How to Use This Repository

1. **Prepare Training Data**:

- You need data from the curator module to train the model. The curator output is a CSV file structured as follows:

### Example Snippet

| question                  | context                                                                                                                                                                                                        | company | source_file                   | source_page | kpi_id | year | answer      | data_type | relevant_paragraphs                | annotator             | Index | label |
| ------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- | ----------------------------- | ----------- | ------ | ---- | ----------- | --------- | ---------------------------------- | --------------------- | ----- | ----- |
| What is the company name? | The Company is exposed to a risk of by losses counterparties their contractual financial obligations when due, and in particular depends on the reliability of banks the Company deposits its available cash. | NOVATEK | 04_NOVATEK_AR_2016_ENG_11.pdf | ['0']       | 0      | 2016 | PAO NOVATEK | TEXT      | ["PAO NOVATEK ANNUAL REPORT 2016"] | train_anno_large.xlsx | 1022  | 0     |

- If you have CSV data from the curator module, run `make_training_data_from_curator.py` to process it and save the result in the `Data` folder (a sketch of the transformation follows this list).
- Alternatively, you can use `make_sample_training_data.ipynb` to generate sample data from a sample CSV file.
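
A minimal sketch of what this step produces, assuming the curator export contains the `question`, `context`, and `label` columns shown in the snippet above (the real script may handle columns and paths differently; check `make_training_data_from_curator.py` for the actual interface):

```python
import pandas as pd

# Illustrative only: reduce the curator export to the fields used for training.
df = pd.read_csv("curator_output.csv")  # hypothetical input path
train_df = df[["question", "context", "label"]]
train_df.to_csv("Data/train_data.csv", index=False)
```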


2. **Train the Model**:

- Use `train_sentence_transformer.ipynb` or `train_sentence_transformer.py` to train a sentence transformer model with the processed data from the `Data` folder and save it locally. Follow the steps in the notebook or script to configure and start the training process.

- To train the model with a function call:

```python
from train_sentence_transformer import fine_tune_model

fine_tune_model(
    data_path="data/train_data.csv",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    num_labels=2,
    max_length=512,
    epochs=2,
    batch_size=4,
    output_dir="./saved_models_during_training",
    save_steps=500,
)
```

**Parameters**:

- `data_path (str)`: Path to the training data CSV file.
- `model_name (str)`: Pre-trained model name from HuggingFace.
- `num_labels (int)`: Number of labels for the classification task.
- `max_length (int)`: Maximum sequence length.
- `epochs (int)`: Number of training epochs.
- `batch_size (int)`: Batch size for training.
- `output_dir (str)`: Directory to save the trained models.
- `save_steps (int)`: Number of steps between saving checkpoints.
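
Checkpoints written to `output_dir` can be loaded back for inference. A sketch, assuming the script saves standard HuggingFace sequence-classification checkpoints (the checkpoint directory name is illustrative):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumes the standard HuggingFace save format; point this at a real checkpoint.
model = AutoModelForSequenceClassification.from_pretrained(
    "./saved_models_during_training/checkpoint-500"
)
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
```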

- To train the model from the command line, run `fine_tune.py` with the required arguments:

```bash
# The flags after --data_path are reconstructed on the assumption that they
# mirror the fine_tune_model() parameters documented above.
python fine_tune.py \
  --data_path "data/train_data.csv" \
  --model_name "sentence-transformers/all-MiniLM-L6-v2" \
  --num_labels 2 \
  --max_length 512 \
  --epochs 2 \
  --batch_size 4 \
  --output_dir "./saved_models_during_training" \
  --save_steps 500
```

3. **Perform Inference**:
- Use `inference_demo.ipynb` to perform inferences with your trained model. Specify the model and tokenizer paths (either local or from HuggingFace) and run the notebook cells to see the results.
- For programmatic inference, you can use the function provided in `inference.py`:

```python
from inference import get_inference
result = get_inference(question="What is the relevance?", paragraph="This is a sample paragraph.", model_path="path/to/model", tokenizer_path="path/to/tokenizer")
```


## Repository Contents

### Python Scripts

1. **`inference.py`**

- This script contains the function to make inferences using the trained model.
- **Usage**: Import this script and use the provided function to predict the relevance of new data.
- **Example**:

```python
from inference import get_inference
result = get_inference(question="What is the relevance?", paragraph="This is a sample paragraph.", model_path="path/to/model", tokenizer_path="path/to/tokenizer")
```

**Parameters**:

- `question (str)`: The question for inference.
- `paragraph (str)`: The paragraph to be analyzed.
- `model_path (str)`: Path to the pre-trained model.
- `tokenizer_path (str)`: Path to the tokenizer of the pre-trained model.
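
To score several paragraphs against one question, you can loop over `get_inference` (a sketch; the paths are placeholders, and the returned label format is an assumption based on the example above):

```python
from inference import get_inference

question = "What is the company name?"
paragraphs = [
    "PAO NOVATEK ANNUAL REPORT 2016",
    "The weather was mild for the season.",
]

# Collect one relevance prediction per paragraph.
results = [
    get_inference(
        question=question,
        paragraph=p,
        model_path="path/to/model",
        tokenizer_path="path/to/tokenizer",
    )
    for p in paragraphs
]
```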

2. **`make_training_data_from_curator.py`**

- This script processes CSV data obtained from a module named `curator` to make it suitable for training the model.
- **Usage**: Run this script to generate training data from the curator's output and save it in the `Data` folder.

3. **`train_sentence_transformer.py`**

- This script defines a function to train a sentence transformer model, which can be called from other scripts or notebooks.
- **Usage**: Import and call the `fine_tune_model` function to train your model.
- **Example**:

```python
from train_sentence_transformer import fine_tune_model

fine_tune_model(
    data_path="data/train_data.csv",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    num_labels=2,
    max_length=512,
    epochs=2,
    batch_size=4,
    output_dir="./saved_models_during_training",
    save_steps=500,
)
```

**Parameters**:

- `data_path (str)`: Path to the training data CSV file.
- `model_name (str)`: Pre-trained model name from HuggingFace.
- `num_labels (int)`: Number of labels for the classification task.
- `max_length (int)`: Maximum sequence length.
- `epochs (int)`: Number of training epochs.
- `batch_size (int)`: Batch size for training.
- `output_dir (str)`: Directory to save the trained models.
- `save_steps (int)`: Number of steps between saving checkpoints.

4. **`fine_tune.py`**
- This script allows you to train a sentence transformer model from the command line.
- **Usage**: Run this script from the command line with the necessary arguments.
- **Example**:

```bash
# Flags after --data_path reconstructed to mirror the documented parameters.
python fine_tune.py \
  --data_path "data/train_data.csv" \
  --model_name "sentence-transformers/all-MiniLM-L6-v2" \
  --num_labels 2 \
  --max_length 512 \
  --epochs 2 \
  --batch_size 4 \
  --output_dir "./saved_models_during_training" \
  --save_steps 500
```

### Jupyter Notebooks

1. **`inference_demo.ipynb`**

- A notebook to demonstrate how to perform inferences using a custom model and tokenizer.
- **Features**: Allows specifying model and tokenizer paths, which can be local paths or HuggingFace paths.
- **Usage**: Open this notebook and follow the instructions to test inference with your own models.

2. **`make_sample_training_data.ipynb`**

- This notebook was used to create sample training data from a sample CSV file.
- **Usage**: Open and run this notebook to understand the process of creating sample data for training.

### Folders

- **`Data/`**
- This folder contains the processed training data obtained from the `curator` module. It serves as the input for training the sentence transformer model.


## Setting Up the Environment

To set up the working environment for this repository, follow these steps:

1. **Clone the repository**:

```bash
git clone https://github.com/yourusername/folder-relevance-detector.git
cd folder-relevance-detector
```

2. **Create a new virtual environment and activate it**:

```bash
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
```

3. **Install PDM**:

```bash
pip install pdm
```

4. **Sync the environment using PDM**:

```bash
pdm sync
```

5. **Add any new library**:

```bash
pdm add <library-name>
```
