Remove CSV copies, read from original CSV in notebook #23
base: main
Maintaining copies of data files in your repository is a risk. It's easy to update the original and forget to copy it everywhere else. Better to read directly from the original generated CSV file.
cleaning analysis folder: use generated CSV, remove unused CSV
The file named original is not sorted; it is the original dataset that we downloaded. I think I should keep only the latest, cleaned dataset, which is not the file named original, so I will delete the unwanted dataset. I can't merge this pull request because you have deleted the file we are using for analysis, but I will delete the unwanted data. Question: I kept the other files because, as you said, if someone tries to run your code it should run without errors. But I think I got it wrong. It's the main (Analysis) code that should run without errors, not the other data-cleaning code. Right?
It's a good idea to keep all copies, especially the original. Without the original copy it's hard for anyone to know if you made a mistake cleaning and sorting it. Anyone should be able to run your script on the original data to re-generate the cleaned/sorted set.
I haven't! Check the changes in your analysis notebook: I updated the path to read from the generated file in Data Analysis. This way you never need to copy-paste or manually update any data. Everything you expect someone to do manually is something they can either forget to do or do incorrectly, even if it's carefully documented. I know this may seem unnecessary because the data is already cleaned, but you need to think about reproducibility. What if someone had an update to your original data? They should just need to replace the original file and re-run the scripts.
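To make the workflow concrete, here is a minimal sketch of what I mean, not the actual code in this repository. The function names `regenerate_cleaned` and `load_rows`, the file locations, and the sort column are all hypothetical, and it uses the standard-library `csv` module so it runs without extra dependencies. The cleaning step reads the original file and writes the generated one; the analysis step only ever reads the generated one.

```python
import csv
from pathlib import Path


def regenerate_cleaned(original_csv, cleaned_csv, sort_column):
    """Read the untouched original CSV and write a sorted copy.

    Hypothetical helper: the real cleaning rules live in the project's
    own scripts. The original file is only read, never modified.
    """
    original_csv, cleaned_csv = Path(original_csv), Path(cleaned_csv)

    with original_csv.open(newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)

    # Values are strings, so this is a plain string sort on one column.
    rows.sort(key=lambda row: row[sort_column])

    # Create the output folder if it does not exist yet.
    cleaned_csv.parent.mkdir(parents=True, exist_ok=True)
    with cleaned_csv.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return cleaned_csv


def load_rows(csv_path):
    """Analysis side: read the generated file directly, no manual copies."""
    with Path(csv_path).open(newline="") as f:
        return list(csv.DictReader(f))
```

With this shape, updating the dataset means replacing the original file and re-running the cleaning step; nothing has to be copied by hand into the analysis folder.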
All of your scripts should run without errors. Someone checking your project shouldn't need to debug it, or even need to read all the code if they don't want to understand the details.
I tried giving the relative path, but I think relative paths are not supported in VS Code.
Can you share a screenshot of your notebook and the error? I was able to run these notebooks from my VS Code.
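One possible cause, offered as a guess until I see the error: a relative path is resolved against the current working directory of the process, and that can differ depending on how and from where the notebook is started. Below is a minimal sketch of one way to make the lookup independent of the working directory. The helper name `find_upwards` and the example path are hypothetical and not part of this repository.

```python
from pathlib import Path


def find_upwards(relative_path, start=None):
    """Look for relative_path in start and then in each parent directory.

    Hypothetical helper: start defaults to the current working directory.
    Returns the first existing match, or raises FileNotFoundError.
    """
    start = Path(start) if start is not None else Path.cwd()
    start = start.resolve()

    for base in [start, *start.parents]:
        candidate = base / relative_path
        if candidate.exists():
            return candidate

    raise FileNotFoundError(
        f"{relative_path!r} not found in {start} or any parent directory"
    )
```

In a notebook this could be called as `find_upwards("data/cleaned.csv")`, where `data/cleaned.csv` is a placeholder for the real location of the generated file. Printing `Path.cwd()` in the first cell is also a quick way to check which directory the notebook is actually running in.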
@joshuaSamuel06, thanks for the screenshot! I replaced the string path with