Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

For solving error for cases when github repo is found however subdir is None #13

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

Parth59
Copy link

@Parth59 Parth59 commented Feb 23, 2022

get_repository_url_and_subdir(PYPI, "kiwisolver")

url = "https://pypi.org/pypi/{}/json".format("cffi")
data = json.loads(requests.get(url).content)
#get_base_repo_url(data["info"]["project_urls"]["Source Code"])
search_github_url_in_json_data(PYPI, "cffi",data )
homepage = {"body": requests.get(data["info"]["home_page"]).text}
repo_url, subdir = search_github_url_in_json_data(PYPI, "cffi", homepage)
urls = search_for_github_repo(homepage)
urls
@@ -12,6 +12,8 @@ def search_github_url_in_json_data(ecosystem, package, json_data):
subdir = locate_subdir(ecosystem, package, repo_url)
return repo_url, subdir
except Exception as e:
if repo_url!= None:
return repo_url, None
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In search_github_url_in_json_data, we only return the repo_url if we could validate the repo by locating the package directory, otherwise we don't return anything.

The rationale here is that we can pass in any data to this function, e.g., package homepage, which may contain various GitHub URLs in the data for whatever reason. Therefore, a validation step here is a must.

Let me know if I'm missing anything.

Copy link
Owner

@nasifimtiazohi nasifimtiazohi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,

I approve 1b3187ce317c15338a9e77f9b415e258c280af76. Can you add a test case for this and preferably a bit more elaborate commit message? If you don't have time, that's fine, I'll fix it myself.

See comments for the other two commits.

Also, you said you could identify repos for 200-300 more projects. Could you identify the subdir as well? If not, is it because your change in the search_github_url_in_json_data. In that case, I don't accept that change as mentioned in my comment. Anyway, can you send me these additional projects for whom you identified the GitHub project?

Update:

  1. No need to share your data. For kiwisolver see my comment below. We need some fix here, but not the way this PR has proposed.
  2. I have fixed 1b3187ce317c15338a9e77f9b415e258c280af76 by myself in a slightly different way.
  3. For backslash in the regex, please add a test case on why it is necessary. If you keep only this fix (or, open another PR just with that one), I'll be happy to merge that in.

@@ -69,7 +69,7 @@ def get_base_repo_url(repo_url):

def search_for_github_repo(data):
urls = set()
url_pattern = re.compile(r"""https?:\/\/(?:www\.)?github\.com[^\s)|<|>]+""")
url_pattern = re.compile(r"""https?:\/\/(?:www\.)?github\.com[^\s)|<|>\"]+""")
Copy link
Owner

@nasifimtiazohi nasifimtiazohi Mar 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you show me a test case where a backslash would help?
Also, are we sure that a backslash cannot legitimately appear within a URL?

return get_base_repo_url(repo_url), postprocess_subdir(subdir)
if subdir != None:
return get_base_repo_url(repo_url), postprocess_subdir(subdir)
return get_base_repo_url(repo_url), None
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. This is indeed a bug. I introduced it in the last commit on this file, it seems, but I think I processed my data set before introducing this bug. But we definitely need to fix this, and add a test case for this.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a second thought, the change needs to happen within postprocess_subdir. That function requires an input validation check. I'll do that myself. But thanks a lot for catching the bug.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed it.

@nasifimtiazohi
Copy link
Owner

nasifimtiazohi commented Mar 7, 2022

kiwisolver is an interesting case.
This and some other packages, I see the that we fail to validate because the package contains only binaries.

In package-locator, we may tweak the design by considering the stars and other fields as a reliable source of info, and we will return the repository from these fields even if we can't validate the subdirectory. I have opened an issue for this, and will fix it in due time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants