Make process improvements #59

Open
wants to merge 1 commit into
base: main

gouenji-shuuya

  • Move to bash as that was implicitly expected (ref. #vidyut on Discord)
  • Some refactoring.
  • Sub-make is correctly called when using make create_all_data.
  • Use -j`nproc` in make.
  • Ignore venv in git.

@gouenji-shuuya gouenji-shuuya force-pushed the make-fix branch 2 times, most recently from f38da63 to df9c149 Compare March 12, 2023 12:31
@akprasad
Contributor

Thank you for this! Forgive the late look.

Contributor

@akprasad akprasad left a comment

Thank you very much!

@@ -78,7 +78,7 @@ tests:
```shell
$ git clone https://github.com/ambuda-org/vidyut.git
$ cd vidyut
-$ make test
+$ make -j`nproc` test
Contributor

My understanding is that -j controls the number of make recipes that run in parallel. If so, what is the benefit of using -j here?
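
As a rough illustration of that understanding, here is a minimal sketch, assuming GNU make is installed. The targets `slow` and `fast` are hypothetical and are not part of this repository; they only stand in for two recipes that do not depend on each other.

```shell
#!/bin/sh
# Sketch only: `slow` and `fast` are hypothetical stand-ins for two
# independent recipes. Assumes GNU make is available on PATH.
set -eu

workdir="$(mktemp -d)"
trap 'rm -rf "$workdir"' EXIT

# printf emits real tab characters, which make requires before recipe lines.
printf '.PHONY: all slow fast\nall: slow fast\nslow:\n\t@sleep 1; echo slow-done\nfast:\n\t@echo fast-done\n' \
  > "$workdir/Makefile"

# Without -j, the prerequisites of `all` are built one at a time, in order.
serial_order="$(make -s -f "$workdir/Makefile" all | paste -sd ' ' -)"

# With -j2, both recipes start together, so the quick one finishes first.
parallel_order="$(make -s -f "$workdir/Makefile" -j2 all | paste -sd ' ' -)"

echo "serial:   $serial_order"
echo "parallel: $parallel_order"
```

If every target in a Makefile sits on a single dependency chain, there is nothing to overlap, and `-j` by itself would not be expected to change the run time of the make recipes.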

Author

Cargo can respect the jobserver settings passed down by make. Refer to rust-lang/rust#42682.

Even if the workflow is currently serial, steps could be decoupled in the future for faster builds.

For instance, create_kosha executes successfully before create_sandhi_rules has run, so the former does not depend on the latter (though the latter is quick). I think the cloning can also be parallelized, using recipes like these:

# Run once before these targets (bash is needed for [[ ... ]]):
#   mkdir -p build

get_corpus_data:
	@if [[ -e "data/raw/dcs" ]]; then 				\
		echo "Training data already exists -- skipping fetch.";	\
	else 								\
		echo "Training data does not exist -- fetching.";	\
		mkdir -p "data/raw/dcs";				\
		git clone https://github.com/OliverHellwig/sanskrit.git	\
			--depth=1 build/dcs-data;			\
		mv build/dcs-data/dcs/data/conllu data/raw/dcs/conllu;	\
	fi


get_linguistic_data:
	@if [[ -e "data/raw/lex" ]]; then 				\
		echo "Lexical data already exists -- skipping fetch.";	\
	else 								\
		echo "Lexical data does not exist -- fetching.";	\
		mkdir -p "data/raw/lex";				\
		git clone https://github.com/sanskrit/data.git		\
			--depth=1 build/data-git;			\
		python3 build/data-git/bin/make_data.py 		\
			--make_prefixed_verbals;			\
		mv build/data-git/all-data/* data/raw/lex;		\
	fi

So the create_all_data.sh file really is a bottleneck: it contains several make commands, and they execute one after another. Maybe a bit of parallelization would help, like in test/train?
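
One way the sub-make idea could look, as a minimal sketch and not the actual change in this PR: a top-level target invokes the sub-make through `$(MAKE)`, so that a `-j` given on the command line is shared with the sub-make. The file name `sub.mk` and the targets `slow` and `fast` are hypothetical; only `create_all_data` is a name taken from this thread. GNU make is assumed.

```shell
#!/bin/sh
# Sketch only: sub.mk, `slow`, and `fast` are hypothetical names.
# Assumes GNU make is available on PATH.
set -eu

workdir="$(mktemp -d)"
trap 'rm -rf "$workdir"' EXIT

# Hypothetical sub-makefile with two steps that do not depend on each other.
printf '.PHONY: all slow fast\nall: slow fast\nslow:\n\t@sleep 1; echo slow-done\nfast:\n\t@echo fast-done\n' \
  > "$workdir/sub.mk"

# Top-level Makefile: the recipe calls the sub-make via $(MAKE), which lets
# GNU make hand its jobserver (and flags such as -s) down to the child.
printf '.PHONY: create_all_data\ncreate_all_data:\n\t@$(MAKE) -f %s all\n' "$workdir/sub.mk" \
  > "$workdir/Makefile"

# -j2 is given only to the outer make; the sub-make still runs both steps
# concurrently, so the quick step reports first.
sub_order="$(make -s -f "$workdir/Makefile" -j2 create_all_data | paste -sd ' ' -)"

echo "sub-make order: $sub_order"
```

A shell script that calls `make` several times in sequence gets no such sharing between its steps, which is the bottleneck described above.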
