Add pattern checks for linting #1374

Open
wants to merge 3 commits into
base: main

nimbinatus
Member

Fixes #1369

Adds three pattern checks to address issues discovered during triage. These are emitted as warnings rather than errors because they will not break code; they reflect legal requirements. The best documentation I can find is https://github.com/instructlab/taxonomy/blob/main/README.md?plain=1#L284-L287, but this was also discussed in a triage meeting with @jjasghar and @juliadenham.

Also, hides uv-specific files from git.

@nimbinatus nimbinatus self-assigned this Dec 19, 2024
@github-actions github-actions bot added the ci label Dec 19, 2024
@nimbinatus
Member Author

(Also, does squash and merge not work in this repo? Do I need to squash and force-push myself to clean up the history before merging?)

if taxonomy.errors > 0:
    exit_code = 1
if taxonomy.warnings > 0:
Contributor

This is wrong: N errors and 1 warning would result in exit code 0. These 2 lines are not needed since exit_code is already initialized to 0.

Member Author

Oh bugger, you're right; I'm resetting the exit code. Sorry; fixing that.
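
For reference, a minimal sketch of the exit-code behavior requested in this thread. It assumes a hypothetical `Result` stand-in with `errors` and `warnings` counters (the real taxonomy object is not reproduced here) and a hypothetical `compute_exit_code` helper: the exit code starts at 0, only errors raise it, and warnings never reset it.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical stand-in for a parsed taxonomy object."""
    errors: int = 0
    warnings: int = 0


def compute_exit_code(results):
    """Return 1 if any result has errors; warnings alone leave it at 0."""
    exit_code = 0  # initialized once; no warnings branch resets it
    for result in results:
        if result.errors > 0:
            exit_code = 1
    return exit_code
```

With this shape, a file with several errors and one warning still yields a non-zero exit code, while a file with only warnings yields 0.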

# maintainers to address rather than block on them. We will
# revisit when other content is allowed.
qna_file_path = taxonomy.rel_path.with_name("qna.yaml")
if "knowledge" in qna_file_path.parts:
Contributor

This may be too permissive, since the user could put "knowledge" somewhere in their sub path: compositional_skills/philosophy/knowledge/qna.yaml

I think you need to look at only part 0.
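
A minimal sketch of checking only the first path component, assuming the path is relative to the taxonomy root (as `taxonomy.rel_path` suggests). The helper name `is_knowledge_path` is hypothetical and used only for illustration.

```python
from pathlib import PurePosixPath


def is_knowledge_path(rel_path):
    """True only when the first path component is 'knowledge'."""
    parts = PurePosixPath(rel_path).parts
    # Compare part 0 only, so a nested "knowledge" directory elsewhere
    # in the path does not trigger the check.
    return len(parts) > 0 and parts[0] == "knowledge"
```

Under this check, `knowledge/science/qna.yaml` qualifies, while `compositional_skills/philosophy/knowledge/qna.yaml` does not.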

# revisit when other content is allowed.
qna_file_path = taxonomy.rel_path.with_name("qna.yaml")
if "knowledge" in qna_file_path.parts:
    qna_file_contents = parser.parse(qna_file_path).contents
Contributor

The qna.yaml is already parsed in the taxonomy object: taxonomy.contents. Why parse it again?

if "knowledge" in qna_file_path.parts:
    qna_file_contents = parser.parse(qna_file_path).contents
    for element in qna_file_contents["document"]["patterns"]:
        if not re.match('.*.md', element):
Contributor

I thought pdf support was added? Anyway the regex here would be

re.search(r"\.md$", element)

But what about the pattern folder_of_md_files/*? That is a legitimate value and should not be rejected.

If you want to do more checking here, I don't think you can do it by pattern matching the yaml contents. You would need to clone the repo, find all files in the repo which match the patterns, and then check that all those files match the desired file types.
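
A rough sketch of both points, under stated assumptions. It first shows why the unanchored regex is too loose, then checks file types against the files the patterns select rather than against the pattern strings. The helper `non_markdown_matches` is hypothetical, `repo_files` stands in for a file listing from an already cloned repo, and `fnmatchcase` is only an approximation of however the patterns are actually resolved.

```python
import re
from fnmatch import fnmatchcase

# re.match anchors only at the start, and the unescaped "." matches any
# character, so a name that merely contains "md" passes the original check.
assert re.match(".*.md", "cmd_notes.txt")
# Escaping the dot and anchoring at the end rejects that name.
assert re.search(r"\.md$", "cmd_notes.txt") is None


def non_markdown_matches(patterns, repo_files):
    """Return files selected by any pattern that do not end in .md."""
    selected = [
        path for path in repo_files
        if any(fnmatchcase(path, pattern) for pattern in patterns)
    ]
    return [path for path in selected if not path.endswith(".md")]
```

With this approach the pattern `folder_of_md_files/*` is accepted as written, and only the files it actually selects are inspected.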

qna_file_contents["document"]["repo"]):
taxonomy.warning(
"The document repo \"%s\" needs to be a "
"GitHub-based repository.",
Contributor

No, it doesn't have to be GitHub. Any valid git repo can be used; we just expect that the repo is accessible because any necessary authorization has been configured.

"GitHub-based repository.",
qna_file_contents["document"]["repo"]
)
if not re.match(
Contributor

This potential check is discussed here: instructlab/schema#30

If we do want to require SHA values, we should probably do that in the schema. But I am not convinced that is a great idea, since we could allow non-SHA values that have special meanings.
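
For illustration only, a sketch of what a strict SHA check could look like, assuming "SHA value" means a full 40-character lowercase hexadecimal SHA-1. The name `looks_like_full_sha` is hypothetical. As noted above, a rule like this may belong in the schema, and it would reject any non-SHA values that carry special meaning.

```python
import re

# Assumption: a commit must be a full 40-character hexadecimal SHA-1.
FULL_SHA_RE = re.compile(r"[0-9a-f]{40}")


def looks_like_full_sha(commit):
    """True when the whole string is exactly 40 lowercase hex digits."""
    return FULL_SHA_RE.fullmatch(commit) is not None
```

A branch name such as `main` fails this check, which is the trade-off raised in the comment.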

Successfully merging this pull request may close these issues.

Linter needs to check pattern and link
2 participants