Commit: update readme on package structure

bw4sz committed Oct 30, 2024
1 parent 6cacad0 commit 3ca5454
Showing 15 changed files with 201 additions and 122 deletions.
164 changes: 138 additions & 26 deletions README.md
@@ -1,57 +1,169 @@
# ML Workflow Manager
# ML Pipeline Project

ML Workflow Manager is a high-level Python package for managing machine learning workflows. It provides a modular structure for data ingestion, processing, model training, evaluation, deployment, and monitoring. It also includes an annotation module based on the AirborneFieldGuide project.
A modular machine learning pipeline for data processing, model training, evaluation, and deployment with pre-annotation prediction capabilities for Bureau of Ocean Energy Management (BOEM) data.

## Project Structure

```
project_root/
├── src/                                # Source code for the ML pipeline
│   ├── __init__.py
│   ├── data_ingestion.py               # Data loading and preparation
│   ├── data_processing.py              # Data preprocessing and transformations
│   ├── model_training.py               # Model training functionality
│   ├── pipeline_evaluation.py          # Pipeline and model evaluation metrics
│   ├── model_deployment.py             # Model deployment utilities
│   ├── monitoring.py                   # Monitoring and logging functionality
│   ├── reporting.py                    # Report generation for pipeline results
│   ├── pre_annotation_prediction.py    # Pre-annotation model predictions
│   └── annotation/                     # Annotation-related functionality
│       ├── __init__.py
│       └── pipeline.py                 # Annotation pipeline implementation
├── tests/                              # Test files for each component
│   ├── test_data_ingestion.py
│   ├── test_data_processing.py
│   ├── test_model_training.py
│   ├── test_pipeline_evaluation.py
│   ├── test_model_deployment.py
│   ├── test_monitoring.py
│   ├── test_reporting.py
│   └── test_pre_annotation_prediction.py
├── conf/                               # Configuration files
│   └── config.yaml                     # Main configuration file
├── main.py                             # Main entry point for the pipeline
├── run_ml_workflow.sh                  # Script to run pipeline in Serenity container
├── requirements.txt                    # Project dependencies
├── .gitignore                          # Git ignore file
├── CONTRIBUTING.md                     # Contributing guidelines
├── LICENSE                             # Project license
└── README.md                           # This file
```

## Components

### Source Code (`src/`)

- **data_ingestion.py**: Handles data loading and initial preparation
- **data_processing.py**: Implements data preprocessing and transformations
- **model_training.py**: Contains model training logic
- **pipeline_evaluation.py**: Evaluates pipeline performance and model metrics
- **model_deployment.py**: Manages model deployment
- **monitoring.py**: Provides monitoring and logging capabilities
- **reporting.py**: Generates reports for pipeline results
- **pre_annotation_prediction.py**: Handles pre-annotation model predictions
- **annotation/**: Contains annotation-related functionality
- **pipeline.py**: Implements the annotation pipeline
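
The sketch below shows one way these modules could be wired together. The class names and the `ingest_data`, `process_data`, `train_model`, and `deploy_model` methods follow the test files in this repository; the `evaluate` call and the overall flow are illustrative assumptions, not the project's exact API.

```python
# Illustrative sketch of the pipeline flow; actual signatures may differ.
from src.data_ingestion import DataIngestion
from src.data_processing import DataProcessing
from src.model_training import ModelTraining
from src.pipeline_evaluation import PipelineEvaluation
from src.model_deployment import ModelDeployment

def run_pipeline():
    raw_data = DataIngestion().ingest_data()              # load raw inputs
    processed = DataProcessing().process_data(raw_data)   # clean / transform
    model = ModelTraining().train_model(processed)        # fit a model
    PipelineEvaluation().evaluate(model)                  # hypothetical evaluation call
    ModelDeployment().deploy_model(model)                 # deploy the checkpoint

if __name__ == "__main__":
    run_pipeline()
```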

### Tests (`tests/`)

Contains test files corresponding to each component in `src/`. Uses pytest for testing.

### Configuration (`conf/`)

Contains YAML configuration files managed by Hydra:
- **config.yaml**: Main configuration file defining pipeline parameters
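
A minimal sketch of how the Hydra-decorated entry point exposes these values; the nested field names mirror the example configuration shown later in this README and are illustrative only:

```python
import hydra
from omegaconf import DictConfig

@hydra.main(version_base=None, config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    # Nested keys in conf/config.yaml become attribute-style lookups.
    print(cfg.model.parameters.learning_rate)  # illustrative field
    print(list(cfg.pipeline.steps))            # illustrative field

if __name__ == "__main__":
    main()
```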

## Installation

You can install the ML Workflow Manager using pip:
1. Clone the repository:
```bash
git clone https://github.com/your-username/project-name.git
cd project-name
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

## Usage

To run the main workflow and the annotation pipeline:
### Running the Pipeline

Using the Serenity container:
```bash
./run_ml_workflow.sh your-branch-name
```

Or directly with Python:
```bash
python main.py
```
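
Because the entry point is wrapped with Hydra, configuration values can typically be overridden on the command line without editing `conf/config.yaml`. The field names below follow the example configuration in this README and are illustrative:

```bash
# Override nested config fields at launch time (illustrative field names)
python main.py model.parameters.batch_size=64 model.parameters.learning_rate=0.0005
```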

### Running Tests

To run the tests, make sure you have pytest installed and then run:

```bash
pytest
pytest tests/
```

This will run all the tests in the `tests/` directory and display the results.
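
To run a single component's tests or get more detail, the usual pytest selectors apply:

```bash
# Run one test file verbosely, or filter tests by name
pytest tests/test_data_ingestion.py -v
pytest tests/ -k "model_training"
```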

## Configuration

The pipeline uses Hydra for configuration management. Main configuration options are defined in `conf/config.yaml`.

Example configuration:

```yaml
data:
  input_dir: "path/to/input"
  output_dir: "path/to/output"

model:
  type: "classification"
  parameters:
    learning_rate: 0.001
    batch_size: 32

pipeline:
  steps:
    - data_ingestion
    - data_processing
    - model_training
    - evaluation
```

## Annotation Module

The `annotation` module is based on the AirborneFieldGuide project. It provides additional functionality for annotating airborne data. To use this module, import it in your Python scripts:

```python
from annotation.pipeline import config_pipeline

# Use the config_pipeline function to run the annotation workflow
config_pipeline(your_config)
```

For more details on how to use the annotation module, please refer to the AirborneFieldGuide documentation.

## Contributing

Please read [CONTRIBUTING.md](CONTRIBUTING.md) for details on our code of conduct and the process for submitting pull requests.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Pipeline Components

- Data Ingestion
- Data Processing
- Model Training
- Pipeline Evaluation
- Model Deployment
- Monitoring
- Reporting

## Dependencies

Key dependencies include:

- Hydra
- PyTorch
- NumPy
- Pandas
- Pytest

See `requirements.txt` for a complete list.

## Development

### Code Organization

- Each component is a separate module in the `src/` directory
- Tests mirror the source code structure in the `tests/` directory
- Configuration is managed through Hydra
- Monitoring and logging are integrated throughout the pipeline using Comet (see the sketch below)
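
A minimal sketch of how Comet experiment logging could be wired into a pipeline step; the project name, helper function, and metrics are illustrative assumptions rather than the project's actual monitoring module:

```python
from comet_ml import Experiment

# Illustrative only: the project name is a placeholder, and the API key is
# expected to be available (e.g., via the COMET_API_KEY environment variable).
experiment = Experiment(project_name="ml-pipeline")

def log_training_metrics(epoch: int, loss: float, accuracy: float) -> None:
    # Hypothetical helper; the pipeline's monitoring module may differ.
    experiment.log_metric("loss", loss, step=epoch)
    experiment.log_metric("accuracy", accuracy, step=epoch)

log_training_metrics(epoch=1, loss=0.42, accuracy=0.88)
```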

### Testing

- Tests are written using pytest
- Each component has its own test file
- Run tests with `pytest tests/`

### Adding New Components

1. Create a new module in `src/`
2. Add corresponding test file in `tests/`
3. Update configuration in `conf/config.yaml`
4. Update `main.py` to integrate the new component
5. Create a branch and push your changes to the remote repository
6. Create a pull request to merge your changes into the main branch
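
As a concrete illustration of steps 1-2, a hypothetical new component and its matching test might look like the sketch below; the module and class names are invented for the example:

```python
# src/example_component.py (hypothetical new module)
class ExampleComponent:
    def run(self, data):
        # Component logic goes here; return the transformed result.
        return data


# tests/test_example_component.py (matching test, mirroring src/)
import pytest
from src.example_component import ExampleComponent

@pytest.fixture
def example_component():
    return ExampleComponent()

def test_run(example_component):
    result = example_component.run("sample data")
    assert result is not None
```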

34 changes: 0 additions & 34 deletions initiate.py

This file was deleted.

15 changes: 7 additions & 8 deletions main.py
@@ -9,21 +9,20 @@
from src.label_studio import check_for_new_annotations, upload_to_label_studio
from src.model import Model

@hydra.main(version_base=None, config_path="conf", config_name="config")
@hydra.main(version_base=None, config_path="conf", config_name="config", check_annotations=True)
def main(cfg: DictConfig):

    # Check for new annotations
    new_annotations = check_for_new_annotations(**cfg.label_studio)
    if new_annotations is None:
        print("No new annotations, exiting")
        return None
    # Check for new annotations if the check_annotations flag is set
    if cfg.check_annotations:
        new_annotations = check_for_new_annotations(**cfg.label_studio)
        if new_annotations is None:
            print("No new annotations, exiting")
            return None

    model_training = Model()
    trained_model = model_training.train_model(annotations)

    # Update the model path
    cfg.model.path = trained_model

    existing_model = cfg.model.path

    pipeline_monitor = PipelineEvaluation(trained_model)
4 changes: 0 additions & 4 deletions src/data_processing.py
@@ -1,4 +0,0 @@
from src.monitoring import Monitoring

class DataProcessing:
    # ... existing code ...
22 changes: 21 additions & 1 deletion src/model_deployment.py
@@ -1,10 +1,30 @@
import os
from src.monitoring import Monitoring
from src.pipeline_evaluation import PipelineEvaluation
from huggingface_hub import HfApi, HfFolder

class ModelDeployment:
    def __init__(self):
        self.monitoring = Monitoring()
        self.pipeline_evaluation = PipelineEvaluation()
        self.hf_api = HfApi()
        self.hf_token = HfFolder.get_token()

    def upload_to_huggingface(self, model_path, repo_id):
        """
        Upload the successful checkpoint to Hugging Face.

        Args:
            model_path (str): The path to the model checkpoint.
            repo_id (str): The repository ID on Hugging Face.

        Returns:
            None
        """
        self.hf_api.upload_file(
            path_or_fileobj=model_path,
            path_in_repo=os.path.basename(model_path),
            repo_id=repo_id,
            token=self.hf_token
        )

    # ... existing code ...
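
A brief usage sketch for the new upload method, added here for illustration only: the checkpoint path and repository ID are placeholders, and authentication assumes a token already stored (e.g., via `huggingface-cli login`).

```python
from src.model_deployment import ModelDeployment

deployment = ModelDeployment()
# Placeholder path and repo ID; replace with a real checkpoint and Hugging Face repo.
deployment.upload_to_huggingface("checkpoints/model.ckpt", "your-username/boem-detector")
```
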
4 changes: 0 additions & 4 deletions src/monitoring.py

This file was deleted.

5 changes: 0 additions & 5 deletions tests/test_annotation_pipeline.py

This file was deleted.

4 changes: 2 additions & 2 deletions tests/test_data_ingestion.py
@@ -1,15 +1,15 @@
import pytest
from src.data_ingestion import DataIngestion
from src.monitoring import Monitoring

@pytest.fixture
def data_ingestion():
    return DataIngestion()

def test_ingest_data(data_ingestion):
    # Example test for data ingestion
    data = data_ingestion.ingest_data()
    assert data is not None
    # Add more specific assertions based on your expected data structure
    # Add more assertions based on expected data structure

if __name__ == '__main__':
    pytest.main()
9 changes: 4 additions & 5 deletions tests/test_data_processing.py
@@ -1,14 +1,13 @@
import pytest
from src.data_processing import DataProcessing
from src.monitoring import Monitoring

@pytest.fixture
def data_processing():
    return DataProcessing()

def test_process_data(data_processing):
    processing = data_processing
    raw_data = "Sample raw data"  # Replace with appropriate test data
    processed_data = processing.process_data(raw_data)
    # Example test for data processing
    raw_data = "raw data"
    processed_data = data_processing.process_data(raw_data)
    assert processed_data is not None
    # Add more specific assertions based on your expected processed data structure
    # Add more assertions based on expected processed data
10 changes: 5 additions & 5 deletions tests/test_model_deployment.py
@@ -8,11 +8,11 @@ def model_deployment():
    return ModelDeployment()

def test_deploy_model(model_deployment):
    deployment = model_deployment
    model = "Sample model"  # Replace with appropriate test model
    deployed_model = deployment.deploy_model(model)
    assert deployed_model is not None
    # Add more specific assertions based on your expected deployed model structure
    # Example test for model deployment
    model = "model"
    deployment_result = model_deployment.deploy_model(model)
    assert deployment_result is not None
    # Add more assertions based on expected deployment results

def test_model_deployment():
    # ... (other test setup code)
11 changes: 5 additions & 6 deletions tests/test_model_training.py
@@ -1,14 +1,13 @@
import pytest
from src.model_training import ModelTraining
from src.monitoring import Monitoring

@pytest.fixture
def model_training():
    return ModelTraining()

def test_train_model(model_training):
    training = model_training
    processed_data = "Sample processed data"  # Replace with appropriate test data
    trained_model = training.train_model(processed_data)
    assert trained_model is not None
    # Add more specific assertions based on your expected model structure
    # Example test for model training
    training_data = "training data"
    model = model_training.train_model(training_data)
    assert model is not None
    # Add more assertions based on expected model properties
16 changes: 0 additions & 16 deletions tests/test_monitoring.py

This file was deleted.
