Commit: update readme on package structure

bw4sz committed Oct 30, 2024
1 parent 6cacad0 commit 3ca5454
Showing 15 changed files with 201 additions and 122 deletions.
164 changes: 138 additions & 26 deletions README.md
@@ -1,57 +1,169 @@
# ML Workflow Manager
# ML Pipeline Project

ML Workflow Manager is a high-level Python package for managing machine learning workflows. It provides a modular structure for data ingestion, processing, model training, evaluation, deployment, and monitoring. It also includes an annotation module based on the AirborneFieldGuide project.
A modular machine learning pipeline for data processing, model training, evaluation, and deployment with pre-annotation prediction capabilities for Bureau of Ocean Energy Management (BOEM) data.

## Project Structure

```
project_root/
├── src/                                # Source code for the ML pipeline
│   ├── __init__.py
│   ├── data_ingestion.py               # Data loading and preparation
│   ├── data_processing.py              # Data preprocessing and transformations
│   ├── model_training.py               # Model training functionality
│   ├── pipeline_evaluation.py          # Pipeline and model evaluation metrics
│   ├── model_deployment.py             # Model deployment utilities
│   ├── monitoring.py                   # Monitoring and logging functionality
│   ├── reporting.py                    # Report generation for pipeline results
│   ├── pre_annotation_prediction.py    # Pre-annotation model predictions
│   └── annotation/                     # Annotation-related functionality
│       ├── __init__.py
│       └── pipeline.py                 # Annotation pipeline implementation
├── tests/                              # Test files for each component
│   ├── test_data_ingestion.py
│   ├── test_data_processing.py
│   ├── test_model_training.py
│   ├── test_pipeline_evaluation.py
│   ├── test_model_deployment.py
│   ├── test_monitoring.py
│   ├── test_reporting.py
│   └── test_pre_annotation_prediction.py
├── conf/                               # Configuration files
│   └── config.yaml                     # Main configuration file
├── main.py                             # Main entry point for the pipeline
├── run_ml_workflow.sh                  # Script to run pipeline in Serenity container
├── requirements.txt                    # Project dependencies
├── .gitignore                          # Git ignore file
├── CONTRIBUTING.md                     # Contributing guidelines
├── LICENSE                             # Project license
└── README.md                           # This file
```

## Components

### Source Code (`src/`)

- **data_ingestion.py**: Handles data loading and initial preparation
- **data_processing.py**: Implements data preprocessing and transformations
- **model_training.py**: Contains model training logic
- **pipeline_evaluation.py**: Evaluates pipeline performance and model metrics
- **model_deployment.py**: Manages model deployment
- **monitoring.py**: Provides monitoring and logging capabilities
- **reporting.py**: Generates reports for pipeline results
- **pre_annotation_prediction.py**: Handles pre-annotation model predictions
- **annotation/**: Contains annotation-related functionality
- **pipeline.py**: Implements the annotation pipeline
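
The sketch below shows one way these modules could be wired together. The class names and the `ingest_data`, `process_data`, `train_model`, and `deploy_model` methods follow the test files in this repository; the `evaluate` call and the overall flow are illustrative assumptions, not the project's exact API.

```python
# Illustrative sketch of the pipeline flow; actual signatures may differ.
from src.data_ingestion import DataIngestion
from src.data_processing import DataProcessing
from src.model_training import ModelTraining
from src.pipeline_evaluation import PipelineEvaluation
from src.model_deployment import ModelDeployment

def run_pipeline():
    raw_data = DataIngestion().ingest_data()              # load raw inputs
    processed = DataProcessing().process_data(raw_data)   # clean / transform
    model = ModelTraining().train_model(processed)        # fit a model
    PipelineEvaluation().evaluate(model)                  # hypothetical evaluation call
    ModelDeployment().deploy_model(model)                 # deploy the checkpoint

if __name__ == "__main__":
    run_pipeline()
```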

### Tests (`tests/`)

Contains test files corresponding to each component in `src/`. Uses pytest for testing.

### Configuration (`conf/`)

Contains YAML configuration files managed by Hydra:
- **config.yaml**: Main configuration file defining pipeline parameters
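
A minimal sketch of how the Hydra-decorated entry point exposes these values; the nested field names mirror the example configuration shown later in this README and are illustrative only:

```python
import hydra
from omegaconf import DictConfig

@hydra.main(version_base=None, config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    # Nested keys in conf/config.yaml become attribute-style lookups.
    print(cfg.model.parameters.learning_rate)  # illustrative field
    print(list(cfg.pipeline.steps))            # illustrative field

if __name__ == "__main__":
    main()
```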

## Installation

You can install the ML Workflow Manager using pip:
1. Clone the repository:
```bash
git clone https://github.com/your-username/project-name.git
cd project-name
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

## Usage

To run the main workflow and the annotation pipeline:
### Running the Pipeline

Using the Serenity container:
```bash
./run_ml_workflow.sh your-branch-name
```

Or directly with Python:
```bash
python main.py
```
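
Because the entry point is wrapped with Hydra, configuration values can typically be overridden on the command line without editing `conf/config.yaml`. The field names below follow the example configuration in this README and are illustrative:

```bash
# Override nested config fields at launch time (illustrative field names)
python main.py model.parameters.batch_size=64 model.parameters.learning_rate=0.0005
```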

### Running Tests

To run the tests, make sure you have pytest installed and then run:

```bash
pytest
pytest tests/
```

This will run all the tests in the `tests/` directory and display the results.
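
To run a single component's tests or get more detail, the usual pytest selectors apply:

```bash
# Run one test file verbosely, or filter tests by name
pytest tests/test_data_ingestion.py -v
pytest tests/ -k "model_training"
```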

## Configuration

The pipeline uses Hydra for configuration management. Main configuration options are defined in `conf/config.yaml`.

Example configuration:

```yaml
data:
  input_dir: "path/to/input"
  output_dir: "path/to/output"

model:
  type: "classification"
  parameters:
    learning_rate: 0.001
    batch_size: 32

pipeline:
  steps:
    - data_ingestion
    - data_processing
    - model_training
    - evaluation
```

## Annotation Module

The `annotation` module is based on the AirborneFieldGuide project. It provides additional functionality for annotating airborne data. To use this module, import it in your Python scripts:

```python
from annotation.pipeline import config_pipeline

# Use the config_pipeline function to run the annotation workflow
config_pipeline(your_config)
```

For more details on how to use the annotation module, please refer to the AirborneFieldGuide documentation.

## Contributing

Please read [CONTRIBUTING.md](CONTRIBUTING.md) for details on our code of conduct and the process for submitting pull requests.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Pipeline Components

- Data Ingestion
- Data Processing
- Model Training
- Pipeline Evaluation
- Model Deployment
- Monitoring
- Reporting

## Dependencies

Key dependencies include:

- Hydra
- PyTorch
- NumPy
- Pandas
- Pytest

See `requirements.txt` for a complete list.

## Development

### Code Organization

- Each component is a separate module in the `src/` directory
- Tests mirror the source code structure in the `tests/` directory
- Configuration is managed through Hydra
- Monitoring and logging are integrated throughout the pipeline using Comet (see the sketch below)
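
A minimal sketch of how Comet experiment logging could be wired into a pipeline step; the project name, helper function, and metrics are illustrative assumptions rather than the project's actual monitoring module:

```python
from comet_ml import Experiment

# Illustrative only: the project name is a placeholder, and the API key is
# expected to be available (e.g., via the COMET_API_KEY environment variable).
experiment = Experiment(project_name="ml-pipeline")

def log_training_metrics(epoch: int, loss: float, accuracy: float) -> None:
    # Hypothetical helper; the pipeline's monitoring module may differ.
    experiment.log_metric("loss", loss, step=epoch)
    experiment.log_metric("accuracy", accuracy, step=epoch)

log_training_metrics(epoch=1, loss=0.42, accuracy=0.88)
```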

### Testing

- Tests are written using pytest
- Each component has its own test file
- Run tests with `pytest tests/`

### Adding New Components

1. Create a new module in `src/`
2. Add corresponding test file in `tests/`
3. Update configuration in `conf/config.yaml`
4. Update `main.py` to integrate the new component
5. Create a branch and push your changes to the remote repository
6. Create a pull request to merge your changes into the main branch
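
As a concrete illustration of steps 1-2, a hypothetical new component and its matching test might look like the sketch below; the module and class names are invented for the example:

```python
# src/example_component.py (hypothetical new module)
class ExampleComponent:
    def run(self, data):
        # Component logic goes here; return the transformed result.
        return data


# tests/test_example_component.py (matching test, mirroring src/)
import pytest
from src.example_component import ExampleComponent

@pytest.fixture
def example_component():
    return ExampleComponent()

def test_run(example_component):
    result = example_component.run("sample data")
    assert result is not None
```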

34 changes: 0 additions & 34 deletions initiate.py

This file was deleted.

15 changes: 7 additions & 8 deletions main.py
@@ -9,21 +9,20 @@
from src.label_studio import check_for_new_annotations, upload_to_label_studio
from src.model import Model

@hydra.main(version_base=None, config_path="conf", config_name="config")
@hydra.main(version_base=None, config_path="conf", config_name="config", check_annotations=True)
def main(cfg: DictConfig):

    # Check for new annotations
    new_annotations = check_for_new_annotations(**cfg.label_studio)
    if new_annotations is None:
        print("No new annotations, exiting")
        return None
    # Check for new annotations if the check_annotations flag is set
    if cfg.check_annotations:
        new_annotations = check_for_new_annotations(**cfg.label_studio)
        if new_annotations is None:
            print("No new annotations, exiting")
            return None

    model_training = Model()
    trained_model = model_training.train_model(annotations)

    # Update the model path
    cfg.model.path = trained_model

    existing_model = cfg.model.path

    pipeline_monitor = PipelineEvaluation(trained_model)
4 changes: 0 additions & 4 deletions src/data_processing.py
@@ -1,4 +0,0 @@
from src.monitoring import Monitoring

class DataProcessing:
    # ... existing code ...
22 changes: 21 additions & 1 deletion src/model_deployment.py
@@ -1,10 +1,30 @@
import os
from src.monitoring import Monitoring
from src.pipeline_evaluation import PipelineEvaluation
from huggingface_hub import HfApi, HfFolder

class ModelDeployment:
    def __init__(self):
        self.monitoring = Monitoring()
        self.pipeline_evaluation = PipelineEvaluation()
        self.hf_api = HfApi()
        self.hf_token = HfFolder.get_token()

    def upload_to_huggingface(self, model_path, repo_id):
        """
        Upload the successful checkpoint to Hugging Face.

        Args:
            model_path (str): The path to the model checkpoint.
            repo_id (str): The repository ID on Hugging Face.

        Returns:
            None
        """
        self.hf_api.upload_file(
            path_or_fileobj=model_path,
            path_in_repo=os.path.basename(model_path),
            repo_id=repo_id,
            token=self.hf_token
        )

    # ... existing code ...
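
A brief usage sketch for the new upload method, added here for illustration only: the checkpoint path and repository ID are placeholders, and authentication assumes a token already stored (e.g., via `huggingface-cli login`).

```python
from src.model_deployment import ModelDeployment

deployment = ModelDeployment()
# Placeholder path and repo ID; replace with a real checkpoint and Hugging Face repo.
deployment.upload_to_huggingface("checkpoints/model.ckpt", "your-username/boem-detector")
```
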
4 changes: 0 additions & 4 deletions src/monitoring.py

This file was deleted.

5 changes: 0 additions & 5 deletions tests/test_annotation_pipeline.py

This file was deleted.

4 changes: 2 additions & 2 deletions tests/test_data_ingestion.py
@@ -1,15 +1,15 @@
import pytest
from src.data_ingestion import DataIngestion
from src.monitoring import Monitoring

@pytest.fixture
def data_ingestion():
    return DataIngestion()

def test_ingest_data(data_ingestion):
    # Example test for data ingestion
    data = data_ingestion.ingest_data()
    assert data is not None
    # Add more specific assertions based on your expected data structure
    # Add more assertions based on expected data structure

if __name__ == '__main__':
    pytest.main()
9 changes: 4 additions & 5 deletions tests/test_data_processing.py
@@ -1,14 +1,13 @@
import pytest
from src.data_processing import DataProcessing
from src.monitoring import Monitoring

@pytest.fixture
def data_processing():
    return DataProcessing()

def test_process_data(data_processing):
    processing = data_processing
    raw_data = "Sample raw data"  # Replace with appropriate test data
    processed_data = processing.process_data(raw_data)
    # Example test for data processing
    raw_data = "raw data"
    processed_data = data_processing.process_data(raw_data)
    assert processed_data is not None
    # Add more specific assertions based on your expected processed data structure
    # Add more assertions based on expected processed data
10 changes: 5 additions & 5 deletions tests/test_model_deployment.py
@@ -8,11 +8,11 @@ def model_deployment():
    return ModelDeployment()

def test_deploy_model(model_deployment):
    deployment = model_deployment
    model = "Sample model"  # Replace with appropriate test model
    deployed_model = deployment.deploy_model(model)
    assert deployed_model is not None
    # Add more specific assertions based on your expected deployed model structure
    # Example test for model deployment
    model = "model"
    deployment_result = model_deployment.deploy_model(model)
    assert deployment_result is not None
    # Add more assertions based on expected deployment results

def test_model_deployment():
    # ... (other test setup code)
11 changes: 5 additions & 6 deletions tests/test_model_training.py
@@ -1,14 +1,13 @@
import pytest
from src.model_training import ModelTraining
from src.monitoring import Monitoring

@pytest.fixture
def model_training():
    return ModelTraining()

def test_train_model(model_training):
    training = model_training
    processed_data = "Sample processed data"  # Replace with appropriate test data
    trained_model = training.train_model(processed_data)
    assert trained_model is not None
    # Add more specific assertions based on your expected model structure
    # Example test for model training
    training_data = "training data"
    model = model_training.train_model(training_data)
    assert model is not None
    # Add more assertions based on expected model properties
16 changes: 0 additions & 16 deletions tests/test_monitoring.py

This file was deleted.
