-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
For solving error for cases when github repo is found however subdir is None #13
base: main
Are you sure you want to change the base?
Conversation
url = "https://pypi.org/pypi/{}/json".format("cffi") data = json.loads(requests.get(url).content) #get_base_repo_url(data["info"]["project_urls"]["Source Code"]) search_github_url_in_json_data(PYPI, "cffi",data ) homepage = {"body": requests.get(data["info"]["home_page"]).text} repo_url, subdir = search_github_url_in_json_data(PYPI, "cffi", homepage) urls = search_for_github_repo(homepage) urls
@@ -12,6 +12,8 @@ def search_github_url_in_json_data(ecosystem, package, json_data): | |||
subdir = locate_subdir(ecosystem, package, repo_url) | |||
return repo_url, subdir | |||
except Exception as e: | |||
if repo_url!= None: | |||
return repo_url, None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In search_github_url_in_json_data
, we only return the repo_url if we could validate the repo by locating the package directory, otherwise we don't return anything.
The rationale here is that we can pass in any data to this function, e.g., package homepage, which may contain various GitHub URLs in the data for whatever reason. Therefore, a validation step here is a must.
Let me know if I'm missing anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi,
I approve 1b3187ce317c15338a9e77f9b415e258c280af76
. Can you add a test case for this and preferably a bit more elaborate commit message? If you don't have time, that's fine, I'll fix it myself.
See comments for the other two commits.
Also, you said you could identify repos for 200-300 more projects. Could you identify the subdir
as well? If not, is it because your change in the search_github_url_in_json_data
. In that case, I don't accept that change as mentioned in my comment. Anyway, can you send me these additional projects for whom you identified the GitHub project?
Update:
- No need to share your data. For
kiwisolver
see my comment below. We need some fix here, but not the way this PR has proposed. - I have fixed
1b3187ce317c15338a9e77f9b415e258c280af76
by myself in a slightly different way. - For backslash in the regex, please add a test case on why it is necessary. If you keep only this fix (or, open another PR just with that one), I'll be happy to merge that in.
@@ -69,7 +69,7 @@ def get_base_repo_url(repo_url): | |||
|
|||
def search_for_github_repo(data): | |||
urls = set() | |||
url_pattern = re.compile(r"""https?:\/\/(?:www\.)?github\.com[^\s)|<|>]+""") | |||
url_pattern = re.compile(r"""https?:\/\/(?:www\.)?github\.com[^\s)|<|>\"]+""") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you show me a test case where a backslash would help?
Also, are we sure that a backslash cannot legitimately appear within a URL?
return get_base_repo_url(repo_url), postprocess_subdir(subdir) | ||
if subdir != None: | ||
return get_base_repo_url(repo_url), postprocess_subdir(subdir) | ||
return get_base_repo_url(repo_url), None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. This is indeed a bug. I introduced it in the last commit on this file, it seems, but I think I processed my data set before introducing this bug. But we definitely need to fix this, and add a test case for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On a second thought, the change needs to happen within postprocess_subdir
. That function requires an input validation check. I'll do that myself. But thanks a lot for catching the bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed it.
In package-locator, we may tweak the design by considering the stars and other fields as a reliable source of info, and we will return the repository from these fields even if we can't validate the subdirectory. I have opened an issue for this, and will fix it in due time. |
get_repository_url_and_subdir(PYPI, "kiwisolver")