
Remove CSV copies, read from original CSV in notebook #23

Open
wants to merge 10 commits into
base: main

Conversation

colevandersWands
Contributor

Maintaining copies of data files in your repository is a risk. It's easy to update the original and forget to copy it everywhere else.

Better to read directly from the original generated CSV file.

@colevandersWands colevandersWands added the enhancement New feature or request label Feb 15, 2024
@colevandersWands colevandersWands added this to the Data Cleaning milestone Feb 15, 2024
@joshuaSamuel06
Collaborator

joshuaSamuel06 commented Feb 16, 2024

The file named original is not sorted and is the original dataset that we downloaded. I think I should only keep the latest and cleaned dataset which is not the file named original. I will delete the unwanted dataset. I can't merge this pull request as you have deleted the file we are using for analysis. But I will delete the unwanted data.

Question: I kept the other files because, as you said, if someone tries to run your code it should run without errors. But I think I got it wrong. It's the main (Analysis) code that should run without errors and not the other data-cleaning code. Right?

@colevandersWands
Contributor Author

> The file named original is not sorted and is the original dataset that we downloaded. I think I should only keep the latest and cleaned dataset which is not the file named original.

It's a good idea to keep all copies, especially the original. Without the original copy it's hard for anyone to know if you made a mistake cleaning and sorting it. Anyone should be able to run your script on the original data to re-generate the cleaned/sorted set.
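
To make the re-generation idea concrete, here is a minimal sketch of a script that rebuilds the cleaned/sorted file from the untouched original. The function name, the cleaning rule (drop incomplete rows), and the sort column are all hypothetical placeholders, not what this repository's cleaning code actually does. It uses the standard-library `csv` module so it runs without installing anything.

```python
import csv
from pathlib import Path


def regenerate_cleaned(original_path, cleaned_path, sort_column):
    """Rebuild the cleaned CSV from the untouched original.

    The cleaning rules here (drop incomplete rows, sort by one column)
    are hypothetical stand-ins for the project's real cleaning step.
    """
    with open(original_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)

    # Hypothetical rule: keep only rows where every field is a non-empty string.
    complete = [
        r for r in rows
        if all(isinstance(v, str) and v.strip() for v in r.values())
    ]
    # Sorts as text; a real script would convert numeric columns first.
    complete.sort(key=lambda r: r[sort_column])

    # The original file is only read, never modified.
    Path(cleaned_path).parent.mkdir(parents=True, exist_ok=True)
    with open(cleaned_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(complete)
    return len(complete)
```

Because the output is always derived from the original, anyone can delete the cleaned file, re-run the script, and compare the result with what the analysis used.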

@colevandersWands
Contributor Author

> I can't merge this pull request as you have deleted the file we are using for analysis.

I haven't! Check the changes in your analysis notebook: I updated the path to read from the generated file in Data Analysis. This way you never need to copy-paste or manually update any data. Everything you expect someone to do manually is something they can either forget to do or do incorrectly, even if it's carefully documented.

I know this may seem unnecessary because the data is already cleaned, but you need to think about reproducibility. What if someone had an update to your original data? They should just need to replace the original file and re-run the scripts.
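
As a rough illustration of reading the generated file in place, the notebook could load it through one small helper like the sketch below. `load_generated_rows` is a hypothetical name, and the real notebook may well use a different reader (for example pandas). The point is only that there is a single path to the generated file and a clear message when it is missing.

```python
import csv
import os


def load_generated_rows(generated_path):
    """Read the generated CSV where the cleaning script wrote it.

    No copy of the file is made; if the original data changes, re-running
    the cleaning script is enough for the analysis to pick it up.
    """
    if not os.path.exists(generated_path):
        # Point the reader at the fix instead of failing deep inside the analysis.
        raise FileNotFoundError(
            f"{generated_path} not found; run the cleaning script to generate it"
        )
    with open(generated_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```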

@colevandersWands
Contributor Author

> It's the main (Analysis) code that should run without errors and not the other data-cleaning code. Right?

All of your scripts should run without errors. Someone checking your project shouldn't need to debug it, or even need to read all the code if they don't want to understand the details.

@joshuaSamuel06
Collaborator

I tried giving the relative path, but I think relative paths are not supported in VS Code.

@colevandersWands
Contributor Author

Can you share a screenshot of your notebook and the error? I was able to run these notebooks in VS Code.

@joshuaSamuel06
Collaborator

image

@colevandersWands
Contributor Author

@joshuaSamuel06, thanks for the screenshot! I replaced the string path with os.path.join, so it should now work on Windows and macOS.
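
For reference, a minimal sketch of the os.path.join approach. The folder and file names ("data_cleaning", "cleaned.csv") and the helper name are hypothetical, used only for illustration; substitute the project's real layout.

```python
import os


def generated_csv_path(start_dir):
    """Build the path to the generated CSV without hardcoding a separator.

    "data_cleaning" and "cleaned.csv" are hypothetical names; they stand
    in for wherever the cleaning script writes its output.
    """
    # os.path.join inserts the separator that the current OS expects.
    joined = os.path.join(start_dir, "..", "data_cleaning", "cleaned.csv")
    # normpath collapses the ".." so the path is easier to read in errors.
    return os.path.normpath(joined)
```

A path string typed with one platform's separator may not resolve on the other, which is what joining the parts avoids. Note also that a relative path is resolved against the current working directory of the running Python process, which is not necessarily the folder the notebook lives in.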

Labels
enhancement New feature or request
Projects
Status: READY FOR REVIEW
2 participants