Design the Cachito archive pruning feature #938

ben-alkov · 2023-11-13T17:41:19Z

ben-alkov
Nov 13, 2023
Maintainer

Background

Where do cachito-archives come from?

Cachito generates a source archive as part of processing requests that involve
git repositories in two different cases

Cloning the request repository: When a git repository specified in a request
has not been archived before, cachito will clone it and create an
archive
Cloning a git dependency: For dependencies specified in pip, npm, yarn, or
rubygems that are not available in their respective registries and instead
must be retrieved from a git repository, cachito will clone and package these
repositories

After archiving, it uploads them to a Nexus repository.

Structure of cachito-archives

The archive directory is hosted on an OpenShift dynamic NFS Persistent Volume
Claim (PVC) with ReadWriteMany (RWX) access mode. This directory is mounted
across all worker pods. When cachito processes a request and generates a source
archive, it creates a tarball with the following directory structure:

Standard repository archives are named 'namespace/repo_name/<git_ref>.tar.gz'.

Archives including git submodules are named
'namespace/repo_name/<git_ref>-with-submodules.tar.gz'.

For repositories cloned via SSH, the namespace includes the clone method
'[email protected]:namespace/repo_name/<git_ref>.tar.gz'.

From the archive structure, we have the repository namespace/name, but not the
full repo URL that is stored for the request in the cachito DB.

See

[root@83e3332f36ed sources]# tree .                                                              
.
├── cachito-testing
│   ├── cachito-gomod-with-deps
│   │   └── fb299a69ff17a1673c2d9446cc2df67891738e68.tar.gz
│   ├── cachito-npm-with-deps
│   │   └── 6349948efe8dd2399ce87af9cc906ecfe4a141a3.tar.gz
│   └── cachito-npm-without-deps
│       └── 2f0ce1d7b1f8b35572d919428b965285a69583f6-with-submodules.tar.gz
├── [email protected]:cachito-testing
│   └── cachito-gomod-with-deps
│       └── fb299a69ff17a1673c2d9446cc2df67891738e68.tar.gz
└── kevva
    └── is-positive
        └── 97edff6f525f192a3f83cea1944765f769ae2678-with-submodules.tar.gz

Proposal

Archive “Pruner” Script

Develop a script designed to clean up old source archives by deleting those which exceed a specified age

This script should

Accept a configurable time threshold for how old an archive must be to qualify
for deletion
Run periodically and preferably during times of low activity, such as
weekends. This scheduling should be implemented as a cron job within
OpenShift, mirroring the approach taken with our current script that
identifies stale requests

Determine whether an archive is stale by performing the following process

Scanning the directory containing the source archives
Extracting the repo_name and ref from each archive's path
Making an API call to retrieve the most recent request associated with the
extracted repo_name and ref. Querying the cachito DB seems preferable to
relying on file system timestamps. We have copied the cachito-archives volume
before and the integrity of that metadata could be questionable
Assessing whether the archive has been utilized since the configurable
threshold based on the "created" timestamp within the request
Proceed to remove any source archive that has not been requested within the
set time frame and log the details

Additional Topics

Location
Place the new script in cachito/workers alongside the existing
cleanup_job.py script
No special handling needed for “with-submodules” archives
Extract the repo_name and ref from the path, but whether it has submodules
is not relevant to the decision of whether or not to prune the archive.
No special handling for ssh archives
As mentioned above, the repositories cloned using SSH are stored in a
different namespace within the archives directory than those cloned via
HTTPS, even if they refer to the same repository. The naming convention for
archives cloned via SSH looks like '[email protected]:namespace/repo_name/<git_ref>.tar.gz'.
I don’t think we need to address this distinction at this time, since the
impact is negligible.
Specifically, if we receive both SSH and non-SSH clone requests for the same
repository and reference, the determination of the "age" for the archive
that was cloned without SSH will be based on whichever request, SSH or
non-SSH, is the most recent. In contrast, the age of the SSH-cloned archive
will be determined solely by when the SSH clone request was made.
Archive deletion will be performed by the script and not by the cachito
application itself
It seems like this should be an ancillary, deployment-specific feature and
should be implemented outside the main application. Upstream cachito will
continue to retain source archives indefinitely
Don’t prune archives that are not associated with a cachito request
This refers to archives created for git dependencies that are subsequently
uploaded to nexus. They should be safe to delete, but we would need to
fall-back to file system timestamps or some similar method because there is no
associated request in the cachito DB for them. The number of these should be
much smaller and we can revisit if it ever becomes an issue with a backlog
story

New API Endpoint - `GET api/v1/requests/latest`

Purpose

Return the most recent request for the specified repository name and git reference.

Operation

The endpoint queries the Request table, leveraging an index on the ref column to
efficiently filter results. It matches the repo column based on a substring
search, which requires a wildcard comparison due to the nature of the repo_name
input.

To optimize performance, the 'ref' parameter is filtered first to reduce the
dataset before applying the wildcard search on the 'repo_name'. The latest request
is identified by the highest id value.

SQLAlchemy Query (Draft)

request = (
       Request.query.filter(Request.ref == ref)
       .filter(Request.repo.contains(repo_name))
       .order_by(desc(Request.id))
       .first()
   )

Parameters

repo_name: The name of the repository, consistent with the cachito-archives
directory structure. This parameter is a substring of the full repository
URL stored in the 'repo' column. Example: "release-engineering/retrodep"
ref: The Git reference to match. It is used for an exact match and benefits
from the database index. Example: "bc9767a71ede6e0084ae4a9e01dcd8b81c30b741"

Response

200 OK: Returns the latest request object in JSON format if a match is found.
404 Not Found: Returns an error message if no matching request is present.

OpenAPI Specification (Draft)

 "/requests/latest":
   get:
     operationId: cachito.web.api_v1.get_latest_request
     summary: Get the latest request
     description: Return the latest request for a given repo_name and ref
     parameters:
     - name: repo_name
       in: query
       description: A repository name to filter by
       schema:
         type: string
         maxLength: 200
         example: release-engineering/retrodep
     - name: ref
       in: query
       description: A git ref to filter request by
       schema:
         type: string
         minLength: 40
         maxLength: 40
         pattern: '^[a-f0-9]{40}$'
         example: bc9767a71ede6e0084ae4a9e01dcd8b81c30b741
     responses:
       "200":
         description: The requested Cachito request
         content:
           application/json:
             schema:
               $ref: "#/components/schemas/Request"
       "404":
         description: The request wasn't found
         content:
           application/json:
             schema:
               type: object
               properties:
                 error:
                   type: string
                   example: The requested resource was not found

ben-alkov · 2023-11-13T17:48:27Z

ben-alkov
Nov 13, 2023
Maintainer Author

@brunoapimentel @lkolacek @ejegrova @eskultety @taylormadore:

This is what a Design doc could look like in Git Hub Discussions.

Unfortunately, we don't have the ability to comment directly in the text of the doc, but something like the following could work (sorry, Bruno, Taylor)

0 replies

ben-alkov · 2023-11-13T17:48:31Z

ben-alkov
Nov 13, 2023
Maintainer Author

"""

Making an API call to retrieve the most recent request

What if the API request fails (I know it rarely does, right? ^ ^)

Retry a few times before failing the job? How is the "cleanup_job" reporting failures? Should we use the same approach?
"""

0 replies

ben-alkov · 2023-11-13T17:49:03Z

ben-alkov
Nov 13, 2023
Maintainer Author

"""

What if the API request fails (I know it rarely does, right? ^ ^)

As long as we use the worker requests session, requests should retry: https://github.com/containerbuildsystem/cachito/blob/ccadeef3f9f86d08fa4b8434a3f7f15c9edb9cee/cachito/workers/requests.py#L28
"""

0 replies

ben-alkov · 2023-11-13T17:51:53Z

ben-alkov
Nov 13, 2023
Maintainer Author

Obviously it's very different to GDocs.

Another thing we lose is the document versioning and history which GDocs provides, but I don't have a sense of how valuable that is.

0 replies

ben-alkov · 2023-11-13T17:53:31Z

ben-alkov
Nov 13, 2023
Maintainer Author

Note that in addition to the usual emoji response button, there is also an upvote, which we could use as our "+1" indicator for approvals

0 replies

ben-alkov · 2023-11-13T17:59:26Z

ben-alkov
Nov 13, 2023
Maintainer Author

The GH markdown editor is quite nice, with some cool new "slash" features in beta, including

/Code - for inserting a Markdown code block
/Table - for inserting a Markdown table
/Details - for creating a "hidden" section, like so

A really long text block (doesn't have to be code)

 "/requests/latest":
   get:
     operationId: cachito.web.api_v1.get_latest_request
     summary: Get the latest request
     description: Return the latest request for a given repo_name and ref
     parameters:
     - name: repo_name
       in: query
       description: A repository name to filter by
       schema:
         type: string
         maxLength: 200
         example: release-engineering/retrodep
     - name: ref
       in: query
       description: A git ref to filter request by
       schema:
         type: string
         minLength: 40
         maxLength: 40
         pattern: '^[a-f0-9]{40}$'
         example: bc9767a71ede6e0084ae4a9e01dcd8b81c30b741
     responses:
       "200":
         description: The requested Cachito request
         content:
           application/json:
             schema:
               $ref: "#/components/schemas/Request"
       "404":
         description: The request wasn't found
         content:
           application/json:
             schema:
               type: object
               properties:
                 error:
                   type: string
                   example: The requested resource was not found

0 replies

eskultety · 2023-11-14T09:44:40Z

eskultety
Nov 14, 2023
Maintainer

Thank you @ben-alkov for taking a look at this! Yes, versioning becomes a problem once you update the original post with additional findings as it will immediately render most comments irrelevant - do we want to start deleting irrelevant comments? Probably not. So keeping the discussion history relevant to the latest findings and navigation through it is going to be a challenge. That's where IMO mailing lists shine because the thread can be infinite and you always know what email and then (after opening one) which bit of the body a given message responds to which compared to GH comments (any comments for that matter) would be unacceptable for large projects with many stakeholders as that would get extremely messy IMO. That said, are we in that kind of situation? Not at the moment, so unless we want to create a mailing list, I think GH discussions will do. It's not the most refined interface out there, but then again, it's all hosted in a single place so contributors always know where to look for the source of truth and the experience is quite consistent (whether that's a good or a bad thing). Back to GDocs, it's true it has "versioning", how useful given its interface is it? Even if it were nice and refined, the fact one cannot use simple quick markdown formatting for technical discussions automatically disqualify GDocs IMO.

As for commenting directly on a line of text or code, well, at least there's the usual quoting people are already used to (from reviews), so one can always refer to a specific paragraph, it might become a PITA for code blocks though once the level of quoting will nest significantly during the discussion, but hey, it is what it is and we can always look for something better in the meantime and add pointers to it here in the repo, but the one thing to bear in mind is the motivation we're even thinking about this - to host all relevant pieces of information and discussions here in the main repository AND to streamline developer's day-to-day workflows by moving many of the processes to GitHub, hence making the overall experience consistent.

What might turn out as a nice feature is that we can create an issue automatically from the discussion (not sure how messy that will be). Once the main issue is tracked, it can be decomposed into smaller pieces linked to the main issue, just food for thought.

1 reply

ben-alkov Nov 14, 2023
Maintainer Author

one cannot use simple quick markdown formatting for technical discussions automatically disqualify GDocs IMO

+1

What might turn out as a nice feature is that we can create an issue automatically from the discussion (not sure how messy that will be)

Similar other GitHub features I've used work quite well

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Design the Cachito archive pruning feature #938

{{title}}

Replies: 7 comments 1 reply

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

Design the Cachito archive pruning feature #938

ben-alkov Nov 13, 2023 Maintainer

Background

Structure of cachito-archives

Proposal

Archive “Pruner” Script

Develop a script designed to clean up old source archives by deleting those which exceed a specified age

Determine whether an archive is stale by performing the following process

Additional Topics

New API Endpoint - GET api/v1/requests/latest

Purpose

Operation

SQLAlchemy Query (Draft)

Parameters

Response

OpenAPI Specification (Draft)

Replies: 7 comments · 1 reply

ben-alkov Nov 13, 2023 Maintainer Author

ben-alkov Nov 13, 2023 Maintainer Author

ben-alkov Nov 13, 2023 Maintainer Author

ben-alkov Nov 13, 2023 Maintainer Author

ben-alkov Nov 13, 2023 Maintainer Author

ben-alkov Nov 13, 2023 Maintainer Author

eskultety Nov 14, 2023 Maintainer

ben-alkov Nov 14, 2023 Maintainer Author

ben-alkov
Nov 13, 2023
Maintainer

New API Endpoint - `GET api/v1/requests/latest`

Replies: 7 comments 1 reply

ben-alkov
Nov 13, 2023
Maintainer Author

ben-alkov
Nov 13, 2023
Maintainer Author

ben-alkov
Nov 13, 2023
Maintainer Author

ben-alkov
Nov 13, 2023
Maintainer Author

ben-alkov
Nov 13, 2023
Maintainer Author

ben-alkov
Nov 13, 2023
Maintainer Author

eskultety
Nov 14, 2023
Maintainer

ben-alkov Nov 14, 2023
Maintainer Author