Showing 15 changed files with 201 additions and 122 deletions.
# ML Pipeline Project

A modular machine learning pipeline for data processing, model training, evaluation, and deployment with pre-annotation prediction capabilities for Bureau of Ocean Energy Management (BOEM) data.

## Project Structure

```
project_root/
│
├── src/                              # Source code for the ML pipeline
│   ├── __init__.py
│   ├── data_ingestion.py             # Data loading and preparation
│   ├── data_processing.py            # Data preprocessing and transformations
│   ├── model_training.py             # Model training functionality
│   ├── pipeline_evaluation.py        # Pipeline and model evaluation metrics
│   ├── model_deployment.py           # Model deployment utilities
│   ├── monitoring.py                 # Monitoring and logging functionality
│   ├── reporting.py                  # Report generation for pipeline results
│   ├── pre_annotation_prediction.py  # Pre-annotation model predictions
│   └── annotation/                   # Annotation-related functionality
│       ├── __init__.py
│       └── pipeline.py               # Annotation pipeline implementation
│
├── tests/                            # Test files for each component
│   ├── test_data_ingestion.py
│   ├── test_data_processing.py
│   ├── test_model_training.py
│   ├── test_pipeline_evaluation.py
│   ├── test_model_deployment.py
│   ├── test_monitoring.py
│   ├── test_reporting.py
│   └── test_pre_annotation_prediction.py
│
├── conf/                             # Configuration files
│   └── config.yaml                   # Main configuration file
│
├── main.py                           # Main entry point for the pipeline
├── run_ml_workflow.sh                # Script to run pipeline in Serenity container
├── requirements.txt                  # Project dependencies
├── .gitignore                        # Git ignore file
├── CONTRIBUTING.md                   # Contributing guidelines
├── LICENSE                           # Project license
└── README.md                         # This file
```

## Components

### Source Code (`src/`)

- **data_ingestion.py**: Handles data loading and initial preparation
- **data_processing.py**: Implements data preprocessing and transformations
- **model_training.py**: Contains model training logic
- **pipeline_evaluation.py**: Evaluates pipeline performance and model metrics
- **model_deployment.py**: Manages model deployment
- **monitoring.py**: Provides monitoring and logging capabilities
- **reporting.py**: Generates reports for pipeline results
- **pre_annotation_prediction.py**: Handles pre-annotation model predictions
- **annotation/**: Contains annotation-related functionality
  - **pipeline.py**: Implements the annotation pipeline

### Tests (`tests/`)

Contains test files corresponding to each component in `src/`. Uses pytest for testing.

### Configuration (`conf/`)

Contains YAML configuration files managed by Hydra:
- **config.yaml**: Main configuration file defining pipeline parameters
## Installation

1. Clone the repository:
```bash
git clone https://github.com/your-username/project-name.git
cd project-name
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```
## Usage

### Running the Pipeline

Using the Serenity container:
```bash
./run_ml_workflow.sh your-branch-name
```

Or directly with Python:
```bash
python main.py
```

### Running Tests

```bash
pytest tests/
```

## Configuration

The pipeline uses Hydra for configuration management. Main configuration options are defined in `conf/config.yaml`.

Example configuration:
```yaml
data:
  input_dir: "path/to/input"
  output_dir: "path/to/output"

model:
  type: "classification"
  parameters:
    learning_rate: 0.001
    batch_size: 32

pipeline:
  steps:
    - data_ingestion
    - data_processing
    - model_training
    - evaluation
```
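The configuration above is consumed through Hydra in `main.py`. As an illustrative sketch (not the repository's actual `main.py`), stepping through `pipeline.steps` might look like this; the Hydra-injected config is mocked as a plain dict, and the handler functions are invented for demonstration:

```python
# Sketch only: Hydra would normally inject this object via
# @hydra.main(config_path="conf", config_name="config"); here the same
# structure is built by hand so the step dispatch can run standalone.
config = {
    "data": {"input_dir": "path/to/input", "output_dir": "path/to/output"},
    "model": {
        "type": "classification",
        "parameters": {"learning_rate": 0.001, "batch_size": 32},
    },
    "pipeline": {
        "steps": ["data_ingestion", "data_processing",
                  "model_training", "evaluation"],
    },
}

def run_pipeline(cfg):
    """Dispatch each configured step to a (hypothetical) handler."""
    handlers = {
        "data_ingestion": lambda c: "ingested from " + c["data"]["input_dir"],
        "data_processing": lambda c: "processed",
        "model_training": lambda c: "trained %s model" % c["model"]["type"],
        "evaluation": lambda c: "evaluated",
    }
    # Steps run in the order given in the YAML, so reordering the list
    # reorders the pipeline without code changes.
    return [handlers[step](cfg) for step in cfg["pipeline"]["steps"]]

print(run_pipeline(config))
```

Because the step order lives in configuration, variants of the pipeline (e.g. skipping training for an evaluation-only run) can be expressed as Hydra overrides rather than code edits.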

## Contributing

Please read [CONTRIBUTING.md](CONTRIBUTING.md) for details on our code of conduct and the process for submitting pull requests.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Dependencies

Key dependencies include:
- Hydra
- PyTorch
- NumPy
- Pandas
- Pytest

See `requirements.txt` for a complete list.

## Development

### Code Organization

- Each component is a separate module in the `src/` directory
- Tests mirror the source code structure in the `tests/` directory
- Configuration is managed through Hydra
- Monitoring and logging are integrated throughout the pipeline using Comet

### Testing

- Tests are written using pytest
- Each component has its own test file
- Run tests with `pytest tests/`

### Adding New Components

1. Create a new module in `src/`
2. Add a corresponding test file in `tests/`
3. Update configuration in `conf/config.yaml`
4. Update `main.py` to integrate the new component
5. Create a branch and push your changes to the remote repository
6. Create a pull request to merge your changes into the main branch
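As a sketch of step 1 (module and class names below are invented for illustration, not taken from the repository), a new component would mirror the class-per-module shape of the existing `src/` modules:

```python
# Hypothetical new component, e.g. src/report_archiving.py (name invented
# for illustration). It follows the same pattern as the existing modules:
# one class per file, constructed with its own configuration.
class ReportArchiving:
    def __init__(self, output_dir="archive"):
        self.output_dir = output_dir

    def archive_report(self, report):
        # A real implementation would persist `report` under output_dir;
        # this stub just returns a record describing what it would store.
        return {"stored_in": self.output_dir, "report": report}
```

Its matching test file (`tests/test_report_archiving.py` under this naming scheme) would then follow the fixture pattern used by the existing tests.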
This file was deleted.
```
@@ -1,4 +0,0 @@
from src.monitoring import Monitoring

class DataProcessing:
    # ... existing code ...
```
**src/model_deployment.py**:

```python
import os
from src.monitoring import Monitoring
from src.pipeline_evaluation import PipelineEvaluation
from huggingface_hub import HfApi, HfFolder

class ModelDeployment:
    def __init__(self):
        self.monitoring = Monitoring()
        self.pipeline_evaluation = PipelineEvaluation()
        self.hf_api = HfApi()
        self.hf_token = HfFolder.get_token()

    def upload_to_huggingface(self, model_path, repo_id):
        """
        Upload the successful checkpoint to Hugging Face.

        Args:
            model_path (str): The path to the model checkpoint.
            repo_id (str): The repository ID on Hugging Face.

        Returns:
            None
        """
        self.hf_api.upload_file(
            path_or_fileobj=model_path,
            path_in_repo=os.path.basename(model_path),
            repo_id=repo_id,
            token=self.hf_token
        )

    # ... existing code ...
```
This file was deleted.

This file was deleted.
**tests/test_data_ingestion.py**:

```python
import pytest
from src.data_ingestion import DataIngestion
from src.monitoring import Monitoring

@pytest.fixture
def data_ingestion():
    return DataIngestion()

def test_ingest_data(data_ingestion):
    # Example test for data ingestion
    data = data_ingestion.ingest_data()
    assert data is not None
    # Add more assertions based on expected data structure

if __name__ == '__main__':
    pytest.main()
```
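What "more assertions" should look like depends on what `ingest_data()` returns, which the diff does not show. A hedged example, assuming it returned a list of record dicts (the stand-in class below is invented for illustration, not the repository's `DataIngestion`):

```python
class FakeDataIngestion:
    """Stand-in for src.data_ingestion.DataIngestion; the return value
    here is assumed for demonstration, not taken from the repository."""
    def ingest_data(self):
        return [{"image": "frame_001.jpg", "label": "bird"}]

def test_ingest_data_structure():
    data = FakeDataIngestion().ingest_data()
    assert data is not None
    assert len(data) > 0                       # at least one record
    assert {"image", "label"} <= set(data[0])  # expected fields present
```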
**tests/test_data_processing.py**:

```python
import pytest
from src.data_processing import DataProcessing
from src.monitoring import Monitoring

@pytest.fixture
def data_processing():
    return DataProcessing()

def test_process_data(data_processing):
    # Example test for data processing
    raw_data = "raw data"
    processed_data = data_processing.process_data(raw_data)
    assert processed_data is not None
    # Add more assertions based on expected processed data
```
**tests/test_model_training.py**:

```python
import pytest
from src.model_training import ModelTraining
from src.monitoring import Monitoring

@pytest.fixture
def model_training():
    return ModelTraining()

def test_train_model(model_training):
    # Example test for model training
    training_data = "training data"
    model = model_training.train_model(training_data)
    assert model is not None
    # Add more assertions based on expected model properties
```
This file was deleted.