This issue was originally created at: 2008-02-12 08:41:52.
This issue was reported by: belley.
belley said at 2008-02-12 08:41:52
Hi everyone,
I have implemented the following command-line option:
--cache-verify
When using CacheDir(), verify that already-existing, up-to-date
derived files and files built by this invocation match the ones
present in the cache. This is done by performing a binary
comparison between each derived file and its corresponding cache
file. A warning message is printed if they do not match. This is
useful to identify files that are corrupted in the cache.
One has to be careful when interpreting the --cache-verify
warnings. Derived files can hash to the same cache entry if their
build signature is incomplete.
For example, a scanner that forgets to list all of its implicit
dependencies could lead to an action producing a different
derived file on different machines, and yet they would all hash
to the same cache file. In that case, --cache-verify would
complain about files not matching, but the problem is not a
corrupted cache.
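A minimal sketch of the binary comparison described above, assuming the derived file and its cache entry are both reachable as ordinary file paths. `verify_against_cache` is a hypothetical name, not the function in the attached patch. It relies on the standard library's `filecmp.cmp(..., shallow=False)` to force a content comparison instead of trusting file metadata.

```python
import filecmp
import os
import warnings


def verify_against_cache(derived_path: str, cache_path: str) -> bool:
    """Hypothetical helper: compare a derived file with its cache entry.

    Returns True when the files are byte-for-byte identical (or when there
    is no cache entry to compare against) and warns when they differ.
    """
    if not os.path.exists(cache_path):
        return True  # nothing cached under this signature yet
    # shallow=False makes filecmp compare file contents rather than
    # trusting matching os.stat() signatures.
    if filecmp.cmp(derived_path, cache_path, shallow=False):
        return True
    warnings.warn(
        f"{derived_path} does not match cache entry {cache_path}",
        RuntimeWarning,
    )
    return False
```

As the text above cautions, a mismatch reported this way does not by itself prove the cache is corrupted. It may equally point to an incomplete build signature.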
I have also slightly modified the behavior of --cache-force and --cache-populate. As discussed on the mailing list, the change in behavior should not affect the current usage of these options.
--cache-force
When using CacheDir(), populate a cache by copying any already-
existing, up-to-date derived files to the cache, in addition to
files built by this invocation. This is useful to populate a
new cache with all the current derived files, or to add to the
cache any derived files recently built with caching disabled via
the --cache-disable option.
When using the --cache-disable and --cache-force options
together, scons will not retrieve files from the cache, but will
still copy derived files to the cache.
When using the --cache-force option, derived files will always
be copied to the cache, overwriting the corresponding file if it
is already present. This behavior is useful for getting rid of
corrupted files that might exist in the cache.
--cache-populate
When using CacheDir(), populate a cache by copying any already-
existing, up-to-date derived files to the cache, in addition to
files built by this invocation. This is useful to populate a
new cache with all the current derived files, or to add to the
cache any derived files recently built with caching disabled via
the --cache-disable option.
When using the --cache-disable and --cache-populate options
together, scons will not retrieve files from the cache, but will
still copy derived files to the cache.
When using the --cache-populate option, derived files will not
be copied to the cache if the corresponding file is already
present.
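The option combinations described above can be summarized as a small decision table. This is a sketch under stated assumptions. `cache_policy` and `CachePolicy` are hypothetical names. A plain run with no cache options is assumed to mean "retrieve, push built files, never overwrite". If both --cache-force and --cache-populate are given, the sketch lets --cache-force win.

```python
from typing import NamedTuple


class CachePolicy(NamedTuple):
    retrieve: bool         # fetch matching files from the cache
    push_built: bool       # copy files built by this invocation to the cache
    push_up_to_date: bool  # also copy already-existing, up-to-date files
    overwrite: bool        # replace a cache entry that already exists


def cache_policy(disable: bool, force: bool, populate: bool) -> CachePolicy:
    """Hypothetical mapping from command-line options to cache actions."""
    copy_all = force or populate
    return CachePolicy(
        # --cache-disable always stops retrieval from the cache.
        retrieve=not disable,
        # --cache-disable alone also stops pushing; combined with
        # --cache-force or --cache-populate, files are still copied.
        push_built=(not disable) or copy_all,
        push_up_to_date=copy_all,
        # Only --cache-force replaces an entry that is already present.
        overwrite=force,
    )
```

For example, `cache_policy(disable=True, force=True, populate=False)` yields a policy that never retrieves but pushes every derived file and overwrites existing entries. That is the "purge stale entries" use of --cache-force.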
(Submitted against the changeset 2659 of branches/core.)
Benoit
belley said at 2008-02-12 08:42:23
Created an attachment (id=303)
Implementation + regression tests
bdbaddog said at 2008-04-08 23:12:22
Benoit - What's the use model for this? Central cache, flags used by any and/or specified user to update cache from other build tree?
gregnoel said at 2008-04-14 13:23:17
Bug party triage: Jim to research use case and see if there's a base issue (e.g., asynchronous cache updates) that's the root cause of this fix, and whether the root cause might be fixable.
belley said at 2008-04-14 13:25:32
(Attached previous e-mail to this bug description to help tracking.)
Hi William,
A user might introduce a bug in one of their SConscripts in such a way that it causes cache corruption. These options are useful for identifying these bugs in your SConscripts and fixing them.
Let me expand a bit on this.
SCons uses build signatures to fetch elements out of the cache. But a build signature can sometimes be erroneous. I have found a few instances of this at my own workplace. For example:
A user might forget to properly specify the varlist of an action. The result is that different targets get stored under the same cache key (i.e. build signature).
A builder might produce a different target on different OSes even if its build signature is the same. The different handling of CR-LF under Windows and UNIX makes it very easy for automatically generated header files to have the same build signature but different contents.
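To make the CR-LF case concrete, here is a toy model. It is not SCons's actual signature code, and `build_signature` and `generate_header` are hypothetical. The cache key covers only the declared command and source digests, while the generated header also depends on the platform's newline convention. SHA-256 is used purely for illustration.

```python
import hashlib
from typing import Sequence


def build_signature(command: str, source_digests: Sequence[str]) -> str:
    """Toy cache key: hashes only what the build declares."""
    h = hashlib.sha256()
    h.update(command.encode())
    for digest in source_digests:
        h.update(digest.encode())
    return h.hexdigest()


def generate_header(newline: str) -> bytes:
    # The output depends on the platform's line ending, which is not an
    # input to build_signature() above.
    return f"#define VERSION 1{newline}".encode()


sig_unix = build_signature("gen_version_h", ["d41d8cd9"])
sig_win = build_signature("gen_version_h", ["d41d8cd9"])
out_unix = generate_header("\n")
out_win = generate_header("\r\n")

# Same cache key, different contents: whichever machine pushes first wins,
# and the other machine keeps fetching a file it would not have built.
print(sig_unix == sig_win)  # True
print(out_unix == out_win)  # False
```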
Please note that caching a file under the wrong build signature is EXTREMELY dangerous. It will be re-fetched from the cache over and over again. Your build will fail mysteriously. You try cleaning (i.e. scons --clean) and the next build fails again. The worst part is that the build does not crash at the point where the erroneous file is fetched from the cache, but at the point where the file is used later in the build. That makes it hard to debug.
The option --cache-verify is very handy for tracking down these issues and fixing the faulty SConscripts. It is very important that the bugs in the SConscripts be fixed, because otherwise your cache will get corrupted quickly. You'll be erasing your cache so often that you might as well not have a cache... ;-(
The --cache-force option is used to fix a broken cache once you have identified the bug in your SConscript. You could also just erase your cache, but selectively using --cache-force to purge the stale file out of the cache is faster. This matters most for central caches, where erasing the cache means hours and hours of recompilation.
Benoit
belley said at 2008-04-14 13:29:44
Greg wrote:
Bug party triage: Jim to research use case and see if there's a base issue (e.g., asynchronous cache updates) that's the root cause of this fix, and whether the root cause might be fixable.
Hi Greg,
As explained in my previous e-mail (I have just attached it to this bug report), an error in a user-written SConscript may cause a corruption of the cache. The --cache-verify option is therefore useful for debugging SConscripts and, IMO, should be available to any user.
Does that make sense?
BTW, I am currently rebasing the patch to the changeset 2773 and verifying that it works with Python 2.5.1.
Benoit
gregnoel said at 2008-04-14 19:48:41
Benoit, there's no doubt that this fix helps in the short term; the question was really if it's more productive to go after the root causes so that a short-term fix is not required. SCons already has too many confusing command-line options; if we can avoid creating any more, we're ahead. If there's a way to be sure the problem doesn't arise in the first place or if the cache becomes self-correcting, these options may not be needed. If they will only be needed in the short term, they could be documented as for debugging only and to be withdrawn when the root causes are fixed.
mightyllamas said at 2008-04-14 20:22:56
Another potential use case to add to the ones mentioned by Benoit would be to help track down problems caused by buggy network hardware and/or software.
I wouldn't be surprised if the problems I'm having are due to some network flakiness, but I'm not 100% sure.
belley said at 2008-04-15 07:55:48
Greg wrote:
Benoit, there's no doubt that this fix helps in the short term; the question was really if it's more productive to go after the root causes so that a short-term fix is not required. SCons already has too many confusing command-line options; if we can avoid creating any more, we're ahead. If there's a way to be sure the problem doesn't arise in the first place or if the cache becomes self-correcting, these options may not be needed. If they will only be needed in the short term, they could be documented as for debugging only and to be withdrawn when the root causes are fixed.
Hi Greg,
I agree that SCons already has too many command-line options. Unfortunately, I do not see how we could entirely eliminate the need for the --cache-verify option. Let's examine the various cases/suggestions:
Bugs in SCons
I just submitted a patch for a bug in SCons that could cause cache corruption (due to the erroneous re-use of a corrupted file). See issue 2014. I feel that this is not an isolated case. The signature system is complex. Bugs will be introduced. The --cache-verify option gives an effective way of tracking down these types of bugs (in addition to --tree and sconsign).
Bugs in SConscript
This is an error that a user makes while writing their SConscript that leads to an erroneous signature.
I have already encountered a few examples of this: an action that generates different targets on different OSes, machines, versions of Python, etc.; a signature that is incomplete because a user-written scanner or emitter fails to list all dependencies; a FunctionAction that does not list every accessed construction variable in its varlist.
I could see the varlist case being fixed by dynamically listing every accessed construction variable using a proxy of some sort. But for the other cases, a fix is not obvious at all.
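The proxy idea for the varlist case might look something like the following sketch. `RecordingEnv` and `write_config` are hypothetical, and a real implementation would have to wrap SCons's construction environment rather than a plain dict. The proxy records every key read through it, so comparing the recorded set against the declared varlist shows which variables the action forgot to list.

```python
from collections.abc import Mapping
from typing import Any, Dict, Iterator, Set


class RecordingEnv(Mapping):
    """Read-only view of construction variables that records every key read."""

    def __init__(self, variables: Dict[str, Any]) -> None:
        self._vars = dict(variables)
        self.accessed: Set[str] = set()

    def __getitem__(self, key: str) -> Any:
        # Record the key before the lookup, so reads of unset variables
        # are captured too (their absence can also affect the output).
        self.accessed.add(key)
        return self._vars[key]

    def __iter__(self) -> Iterator[str]:
        return iter(self._vars)

    def __len__(self) -> int:
        return len(self._vars)


def write_config(target: str, env: Mapping) -> str:
    # A function action that reads CC and CFLAGS but declares only CC.
    return f"{target}: {env['CC']} {env['CFLAGS']}"


env = RecordingEnv({"CC": "gcc", "CFLAGS": "-O2", "LINKFLAGS": "-s"})
write_config("config.h", env)
declared_varlist = ["CC"]
missing = env.accessed - set(declared_varlist)
print(sorted(missing))  # ['CFLAGS']
```

Because `collections.abc.Mapping` implements `get()` and `in` in terms of `__getitem__`, those accesses are recorded as well.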
Corrupted disk and/or network
Not much we can do about this. At least the --cache-verify and --cache-force options let you fix your cache without having to recompile the entire world to rebuild your caches.
Self-correcting cache
I fail to see how this could work. For example, a build fetches a file from the cache because the build signature matches, but the file just fetched is, for some reason, actually different from the one that you would have built. How could you determine that the file you just fetched is different from the one you would have built without actually building it?
Benoit
belley said at 2008-06-02 12:02:21
The first patch was broken on Windows because os.rename() fails on Windows if the target already exists. This was not detected before because the cache-write-error warning is turned off by default.
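For reference, in Python 3.3 and later the usual way around this Windows `os.rename()` limitation is `os.replace()`, which overwrites an existing destination on both POSIX and Windows. The following sketch uses a hypothetical `atomic_copy_to_cache`, not the code from the patch, which predates `os.replace()`. It writes to a temporary file in the cache directory and then moves that file into place.

```python
import os
import shutil
import tempfile


def atomic_copy_to_cache(src: str, cache_path: str) -> None:
    """Hypothetical helper: copy src into the cache, overwriting any entry."""
    cache_dir = os.path.dirname(cache_path) or "."
    os.makedirs(cache_dir, exist_ok=True)
    # Write to a temporary file in the same directory so the final move
    # stays on one filesystem. Note: mkstemp creates the file with mode
    # 0o600, so a shared cache may need an os.chmod() afterwards.
    fd, tmp = tempfile.mkstemp(dir=cache_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as out, open(src, "rb") as inp:
            shutil.copyfileobj(inp, out)
        # Unlike os.rename(), os.replace() does not fail on Windows when
        # cache_path already exists; on POSIX the replacement is atomic.
        os.replace(tmp, cache_path)
    except BaseException:
        os.unlink(tmp)  # do not leave partial files in the cache
        raise
```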
I have also realized that built targets could be copied multiple times back to the cache because:
Both push() and push_if_cached() are called for targets that are actually built.
For up-to-date targets, the visited() method was called twice, once in make_ready() and a second time in executed() !!!
This was all fixed.
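The thread does not show how the duplicate copies were eliminated. One generic way to make repeated push requests harmless is to remember which targets have already been pushed during the current invocation. The following sketch shows this with a hypothetical `PushOnce` wrapper that is not part of SCons.

```python
from typing import Callable, Set


class PushOnce:
    """Hypothetical wrapper: forward each target to `push` at most once
    per invocation, however many code paths ask for it."""

    def __init__(self, push: Callable[[str], None]) -> None:
        self._push = push
        self._pushed: Set[str] = set()

    def __call__(self, target: str) -> bool:
        if target in self._pushed:
            return False  # already copied to the cache during this run
        self._push(target)
        self._pushed.add(target)  # record only after a successful push
        return True


copies = []
push = PushOnce(copies.append)
push("build/foo.o")  # e.g. the call made for a target that was built
push("build/foo.o")  # e.g. a second call for the same target: skipped
print(copies)  # ['build/foo.o']
```

Removing the redundant calls at their source is the more direct fix. A guard like this only masks the duplication.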
(Is it too late to check this in before 1.0? It does not seem too risky to me.)
Note that the newer patch was submitted against the changeset 3019.
Benoit
mightyllamas said at 2009-03-04 07:46:55
There may yet be an asynchronous cache issue, but if so, it's rare enough that it's hard to know if there's an actual problem or whether it's people killing builds at inopportune times. Certainly the code looks ok, for whatever that's worth.
Our error rate for cache is pretty darn low now (or people aren't telling me any more, but I presume it's the former :)
azverkan said at 2009-03-17 23:12:33
What type of network filesystem is this?
The description sounds like a CIFS share with Windows clients. If that is the case, it would be interesting to know the OS version and service pack level of the server and clients.
mightyllamas said at 2009-03-18 17:10:54
You are correct, it is a samba server. The clients are mostly XP and a few Vista, mostly up-to-date. The server is FreeBSD 6.
gregnoel said at 2009-03-19 00:50:32
Bug party triage. If --cache-verify is strictly for debugging, the name should be changed to something with "debug" in it.
Benoit, if you want to do it, assign the issue to yourself and go for it.