- Philip Gehde
- Moiyad Alfawwar
Welcome to our Data Science project repository for CSCI-347 Data Mining. This project focuses on applying various data science techniques and algorithms to medical diagnostic work, with a specific emphasis on heart disease prediction. Our work is inspired by the potential of machine learning to revolutionize medical diagnostics, making early detection and prevention of diseases like heart disease more accurate and accessible.
We use the Statlog Heart Disease Dataset from the UCI Machine Learning Repository. This dataset is widely used in data science research. It contains 270 instances with 13 attributes and no missing values, so it can be analyzed without imputation.
The dataset features the following attributes, crucial for heart disease diagnosis:
- Age
- Sex (binary: male or female)
- Chest Pain Type (4 values, label-encoded)
- Resting Blood Pressure
- Serum Cholesterol in mg/dl
- Fasting Blood Sugar > 120 mg/dl (binary)
- Resting Electrocardiographic Results (values 0, 1, 2)
- Maximum Heart Rate Achieved
- Exercise Induced Angina (binary)
- Oldpeak (ST depression induced by exercise relative to rest)
- The Slope of the Peak Exercise ST Segment
- Number of Major Vessels (0-3) Colored by Fluoroscopy
- Thal (3 = normal; 6 = fixed defect; 7 = reversible defect)
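As a minimal sketch of how these attributes could be loaded with pandas: the column names below are our own hypothetical labels, and the code assumes a whitespace-delimited file with no header row, the 13 attributes in the order listed above, and a final class column coded 1 (absence) or 2 (presence). Check the file you download against these assumptions before relying on it.

```python
import pandas as pd

# Hypothetical column labels for the 13 attributes listed above, in that
# order, plus the class label. The raw file is assumed to have no header row.
COLUMNS = [
    "age", "sex", "chest_pain_type", "resting_bp", "serum_cholesterol",
    "fasting_blood_sugar", "resting_ecg", "max_heart_rate",
    "exercise_angina", "oldpeak", "st_slope", "major_vessels", "thal",
    "target",
]


def load_statlog(source):
    """Read a whitespace-delimited, header-less copy of the dataset.

    `source` may be a file path or an open file-like object.
    """
    df = pd.read_csv(source, sep=r"\s+", header=None, names=COLUMNS)
    # Assumption: the class is coded 1 = absence, 2 = presence of heart
    # disease. Recode it to 0/1 for convenience.
    df["target"] = (df["target"] == 2).astype(int)
    return df
```

For example, `df = load_statlog("heart.dat")` would return a DataFrame with 14 columns, where `heart.dat` stands in for wherever you saved the file.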
This repository contains Jupyter Notebooks covering the following key topics:
- Graph Analysis: Exploration of data relationships and patterns.
- Linear Transformation: Application of linear algebra techniques to optimize data representation.
- K-Means Clustering: Unsupervised learning method to identify data clusters.
- Additional notebooks will explore various data preprocessing, analysis, and machine learning techniques relevant to our project's goal.
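To give a flavor of the linear transformation and clustering topics, here is a rough sketch using scikit-learn. It is not taken from the notebooks: the function name `cluster_patients`, the choice of two clusters, and the use of standardization as the transformation are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def cluster_patients(X, n_clusters=2, seed=0):
    """Standardize the features, then assign each row to a K-Means cluster."""
    X = np.asarray(X, dtype=float)
    # Standardization is an affine transformation: each column is shifted to
    # zero mean and scaled to unit variance, so that features measured on
    # large scales (e.g. cholesterol) do not dominate the distance metric.
    X_scaled = StandardScaler().fit_transform(X)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(X_scaled)
    # Cluster centers are expressed in the standardized feature space.
    return labels, model.cluster_centers_
```

Fixing `random_state` makes the cluster assignment reproducible across runs. The number of clusters is a modeling choice, not something the dataset dictates.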
Our project aims to:
- Evaluate and Clean the Dataset: Assess the quality, potential biases, and applicability of the dataset for heart disease diagnostics.
- Apply Data Mining Techniques: Utilize various algorithms to uncover patterns and insights that could inform medical diagnostics.
- Enhance Medical Diagnostic Work: Explore how machine learning can improve diagnostic accuracy, focusing on heart disease.
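The first goal, evaluating the dataset, could start with a few basic checks like the ones sketched below. The helper `summarize_quality` is hypothetical and only covers missing values, duplicate rows, and class balance. Assessing bias in a medical dataset requires more than these counts.

```python
import pandas as pd


def summarize_quality(df, target):
    """Return simple data-quality indicators for a tabular dataset."""
    return {
        "n_rows": len(df),
        # Number of missing entries in each column.
        "missing_per_column": df.isna().sum().to_dict(),
        # Rows that exactly repeat an earlier row.
        "duplicate_rows": int(df.duplicated().sum()),
        # Share of each class in the target column (class balance).
        "class_share": df[target].value_counts(normalize=True).to_dict(),
    }
```

A strongly uneven `class_share` or a non-zero `missing_per_column` would be a signal to revisit preprocessing before modeling.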
Our personal experiences and observations in the medical field highlight the urgent need for improved diagnostics. This project is not just an academic exercise; it's a step toward leveraging data science for real-world medical advancements.
To get started with our notebooks:
- Clone this repository to your local machine.
- Ensure you have Jupyter Notebook installed, or use Google Colab for an online alternative.
- Open the notebooks and follow the instructions within to replicate our analyses.
- Python 3.x
- Jupyter Notebook
- Libraries: NumPy, pandas, matplotlib, scikit-learn, etc. (A full list of dependencies is available in the `requirements.txt` file.)
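Before opening the notebooks, you may want to confirm that the libraries named above are importable. The snippet below is a small sketch of such a check. The `REQUIRED` list covers only the four libraries named here, not the full dependency list, and note that scikit-learn is imported under the name `sklearn`.

```python
from importlib.util import find_spec

# Import names for the libraries named above. scikit-learn is imported as
# "sklearn". This list is illustrative and may not match requirements.txt.
REQUIRED = ["numpy", "pandas", "matplotlib", "sklearn"]


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are available.")
```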
We welcome contributions from the data science community. Whether it's improving the code, suggesting new analysis techniques, or discussing the implications of our findings, your input is valuable.
This project is licensed under the MIT License - see the LICENSE file for details.
Our heartfelt gratitude goes to the researchers and contributors of the Statlog Heart Disease Dataset at the UCI Machine Learning Repository. Their work provides the foundation for our project and many others in the field of medical diagnostics.
Join us in this exploratory journey through data science to make a tangible impact on medical diagnostics. Together, we can push the boundaries of what's possible in healthcare through the power of data mining.