Message boards : Number crunching : Computation Error
| Author | Message | 
|---|---|
|  Tern  Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,700,566 RAC: 2 | 
 I'm using 4.8 and am losing at least 50% of WUs also from "computation error". It doesn't happen on other BOINC projects, and I'm sure my computer is OK. 1) Please realize that this was two bad batches of WUs that came out just at the end of the day as they were leaving for the holidays. Yes, it was a bad time for it to happen, but "bad batches" happen to every project. 2) SETI had two of these "bad batches" in the last couple of weeks, very similar to the two Rosetta has had: one that ran extremely long before failing, wasting MANY hours of CPU time, and one that failed immediately because it was "0 length". SETI staff were still there and were able to kill most of these after only three or four people had crunched them, but those who spent 10 or 11 hours on one got zero credit. Rosetta staff haven't been there to kill these, so more people are getting them as they get reissued, but then Rosetta has said that they will figure out a way to make sure everyone gets credit for at least the "long" ones. 3) All of the "DEFAULT_xxxxx_205" long-running WUs should have already cleared by now. What remains are the last traces of the "short WUs" (which could have any name). So with every hour that passes, the percentage of computation errors you get decreases. It's not a good situation, and nobody is happy about it, but the Rosetta staff is putting safeguards in to try to prevent it from happening again, and is doing everything they can to make sure that nobody loses any more credit than they have to.   | 
| STE\/E Send message Joined: 17 Sep 05 Posts: 125 Credit: 4,103,208 RAC: 0 | 
 1) Please realize that this was two bad batches of WUs that came out just at the end of the day as they were leaving for the holidays. ======= I don't believe all the Computation Errors are due to the bad WU's that the Project Released. I think some of it has to do with the way the BOINC Manager Adjusts the Time to Completion. I just had 2 WU's get the Computation Error & I feel it had to do with the Completion Time being too low. If you get a run of shorter WU's, say in the 1 - 2 hour range, the Manager will adjust the Time down to that amount of time. Then say all of a sudden you get a few 5-7 hour WU's; there's a good chance that you'll get the Computation Error once you get about 4 hours into the run because of a Time Overrun for the WU. I had this happen to me a lot a while back at the PrimeGrid Project & I had to actually manually edit the Benchmark scores downward in the .xml files, so the Manager would show more Time to Completion, before the Errors stopped. Rytis has extended the maximum execution length for the WU's several times to try & help out also; maybe they need to do that with the Rosetta WU's too. | 
| DeHackedDragon Send message Joined: 24 Dec 05 Posts: 1 Credit: 112 RAC: 0 | 
 I'm using 4.8 and am losing at least 50% of WUs also from "computation error". It doesn't happen on other BOINC projects, and I'm sure my computer is OK. Yeah, I have that problem too, and I'm really unhappy about it. I lost almost 50 credits to that problem of that bad batch of 205 something... :( | 
|  Tern  Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,700,566 RAC: 2 | 
 I don't believe all the Computation Errors are due to the bad WU's that the Project Released. I think some of it has to do with the way the BOINC Manager Adjusts the Time to Completion. PoorBoy, is the "maximum CPU time" affected by the BOINC Manager settings? I understood that it was just a "cpu seconds" value passed in the WU itself. This would make slower machines more likely to hit it, but would be easier to code. If the DCF etc., _is_ taken into account, I think that would normally be a "good thing", but could be an issue here, at least on _when_ they blow up. Have you had any CPU-time-exceeded errors on anything other than the DEFAULT_xxxxx_205's? Those monsters are going to hit _any_ reasonable limit, adjusted or not. I've only _seen_ that error on those. All the other computation errors that _I've_ seen have been on the "short WUs", the ones where the random seed is miscalculated/misread.   | 
|  River~~  Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 | 
 
 One of my slow boxes (733MHz) reached 26hrs before I noticed a DEFAULT_xxxx_205 - whereas I think I have seen other people get cut off at around 11 hours, if I remember right. If so, then the max run must depend in some way on the benchmarks and/or historic run lengths. River~~ | 
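River~~'s numbers (26 hours on a 733MHz box versus roughly 11 hours elsewhere) are what one would expect if the cutoff were an operation-count bound divided by the host's benchmarked speed. The following is only a minimal sketch of that arithmetic under that assumption. The function name and every figure in it are made up for illustration and are not taken from Rosetta or from the BOINC client.

```python
def max_cpu_seconds(fpops_bound: float, benchmark_flops: float) -> float:
    """Hypothetical cutoff: operation bound divided by benchmarked speed."""
    if benchmark_flops <= 0:
        raise ValueError("benchmark must be positive")
    return fpops_bound / benchmark_flops


# Made-up values for illustration only (assumptions, not project data).
FPOPS_BOUND = 4.0e13   # upper bound on floating-point operations for one WU
fast_host = 1.0e9      # benchmarked floating-point operations per second
slow_host = 0.4e9      # a slower box with a lower benchmark

fast_hours = max_cpu_seconds(FPOPS_BOUND, fast_host) / 3600
slow_hours = max_cpu_seconds(FPOPS_BOUND, slow_host) / 3600
print(f"fast host cutoff: {fast_hours:.1f} h")
print(f"slow host cutoff: {slow_hours:.1f} h")
```

Under this model the same WU is allowed more CPU time on a slower host, and lowering a host's benchmark figure by hand lengthens its cutoff, which would be consistent with what STE\/E reported from PrimeGrid.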
|  stilespj Send message Joined: 17 Dec 05 Posts: 1 Credit: 749,870 RAC: 0 | 
 Out of the last 5 work units, all five had unrecoverable errors.  In fact, since I joined Rosetta@home via BOINC, this has been typical. I see no point in wasting my computer power on this.  I do not get these errors on other projects, such as SETI@home.  Maybe when Rosetta@home (via BOINC) is ready for prime time, I'll be back! Bye. Paul | 
|  Tern  Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,700,566 RAC: 2 | 
 Yeah, I have that problem too, and I'm really unhappy about it. I lost almost 50 credits to that problem of that bad batch of 205 something... :( Please note that the message you quoted (and others) have explained that the credits "lost" to the DEFAULT_xxxxx_205 problems will be replaced/granted after the staff returns from the holidays and the bad WUs have flushed through the system.   | 
|  Tern  Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,700,566 RAC: 2 | 
 I have moved several "off-topic" messages out of this thread, over to thread 750, "Moderated Messages moved here". Please limit comments in this thread to "computation error" issues. Thanks! EDIT: Scribe, I deleted your one-word response, because I could not tell which posting it was directed at, and when moved to the other thread, it made no sense at all. If you object, I'll restore it.   | 
| STE\/E Send message Joined: 17 Sep 05 Posts: 125 Credit: 4,103,208 RAC: 0 | 
 I don't believe all the Computation Errors are due to the bad WU's that the Project Released. I think some of it has to do with the way the BOINC Manager Adjusts the Time to Completion. hehe ... good thing I scrolled down the page a little further because I was about to go on a Rant. I thought my Post had been completely deleted because I couldn't find it. I wondered why it would have been because I thought I gave a rational response to some of the Computation Errors. As far as I know from watching the BOINC Manager, your settings or preferences have nothing to do with the maximum CPU time. When a Benchmark is run it sets the Time to Completion at that time for what your Benchmark is saying the Computer is capable of completing the WU's in. As you run the WU's & finish them the Manager will slowly adjust the time upwards or downwards depending on the actual amount of time you're taking to complete them in. If you get a bunch of WU's in a row that only take 2-3 hours the Manager will eventually adjust the Time to Completion to that amount. Then you all of a sudden get a few 6-8 hour WU's and I think the Manager will Error out the WU's once they reach 4 or 5 hours of run time. That's just my feelings on the matter, but like I said it's what was happening to me over @ the PrimeGrid Site until I jacked the Time to Completion amount back up over what it was actually taking me to run the WU's. I haven't run any of the DEFAULT_xxxxxx_205's yet because I stopped running the Project for about a week. The 2 WU's I mentioned in my Post were from older WU's I had left yet from when I stopped running. They both Erred out at the same time around the 3 1/2 hour mark showing 50% done. The Time to Completion for the WU's that were left to run was @ 2 1/2 so right away I suspected an overrun as being the cause of the Errors & why I made my Post. 
I Manually Re-Benchmarked the Computer and the Time to Completion jumped back up to around 5 1/2 hours & so far I haven't had another Error on that Computer again. I'm keeping an eye on it because the Time to Completion has slowly dropped back down to under 4 hours again. If it goes much lower I'm going to Manually Re-Benchmark it again to kick the Time back up again ... | 
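The drift STE\/E describes, where the Time to Completion creeps toward recent run times, can be illustrated with a toy model. This sketch uses plain exponential smoothing as a stand-in. It is not the BOINC Manager's actual algorithm, and the smoothing rate and the run of ten 2-hour WUs are assumptions chosen only to show the shape of the effect. Whether a low estimate actually triggers the error is exactly the open question in this thread.

```python
def update_estimate(estimate_h: float, actual_h: float, rate: float = 0.1) -> float:
    """Move the estimate a fraction `rate` of the way toward the actual run time."""
    return estimate_h + rate * (actual_h - estimate_h)


# Start from the 5.5 hours seen right after a manual re-benchmark.
estimate = 5.5
for _ in range(10):  # hypothetical run of ten short WUs, 2 hours each
    estimate = update_estimate(estimate, 2.0)
print(f"estimate after short WUs: {estimate:.2f} h")

# A long WU arriving after the estimate has drifted down.
long_wu = 7.0
print(f"long WU is {long_wu / estimate:.1f}x the current estimate")
```

If the error limit scaled with the estimate, a long WU arriving after such a run would be the one most at risk, which is the pattern described in the post above.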
|  ecafkid Send message Joined: 5 Oct 05 Posts: 40 Credit: 15,177,319 RAC: 0 | 
 I have been getting a lot of computation errors lately. Just look at my results. It seems not only to be with the 205 batch. Ecaf. | 
|  Tern  Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,700,566 RAC: 2 | 
 If you get a bunch of WU's in a row that only take 2-3 hours the Manager will eventually adjust the Time to Completion to that amount. Then you all of a sudden get a few 6-8 hour WU's and I think the Manager will Error out the WU's once they reach 4 or 5 hours of run time. That's just my feelings on the matter, but like I said it's what was happening to me over @ the PrimeGrid Site until I jacked the Time to Completion amount back up over what it was actually taking me to run the WU's. This is something that someone with a lot more knowledge of the code than I have will have to answer, but if the DCF etc. DO affect the "max time", then the project definitely needs to keep that in mind, and possibly raise it quite a bit. That would make the "extreme" cases like these "DEFAULT_xxxxx_205"s _worse_, but if the other alternative is causing _good_ WUs to error out... Can you copy your posting or write up something on the topic and create a new thread? I'm afraid it may be "missed" by the staff if it's buried in here, and it seems to be a separate issue that they need to be aware of.   | 
|  Tern  Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,700,566 RAC: 2 | 
 I have been getting a lot of computation errors lately. Just look at my results. It seems not only to be with the 205 batch. "Computation errors" are _NOT_ the issue with the "DEFAULT_xxxxx_205" WUs. Those "run forever" and eventually give a "maximum cpu time exceeded" error. The computation errors are from the bad handling of the random seed, and seem to be across multiple batches, almost randomly. At this point, there's nothing to do about them other than let them fail - we can't identify them to abort them, or anything else. The project staff has already turned off the creation of _more_ of these, but doing anything else with the existing ones will have to wait for them to return after the holidays.   | 
|  ecafkid Send message Joined: 5 Oct 05 Posts: 40 Credit: 15,177,319 RAC: 0 | 
 Thanks Bill for your explanation. I'll just keep crunching for now. | 
|  Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 | 
 Just one more point: if you look closely at most of the computation error work units, they die within a few seconds to a minute or so. As Bill said, they seem to be unstable in one way or another. In LHC@Home we see this a lot where the work unit stops, though usually not with a computation error. :) I am not sure that granting me 0.06 CS per failed work unit is going to substantially change my standing ... but I won't complain ... :) | 
| truckpuller Send message Joined: 5 Nov 05 Posts: 40 Credit: 229,134 RAC: 0 | 
 Well, just had 5 more jobs (out of 12) give me computation errors, and none of them was a Default 205. I did just get a Default 205 (12/28, yesterday) download and it ran ok; was under the impression that the Default 205's were all gone. The jobs that got the computation errors were topology ones (I think).  Visit us at Christianboards.org | 
|  Tern  Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,700,566 RAC: 2 | 
 I did just get a Default 205 (12/28, yesterday) download and it ran ok; was under the impression that the Default 205's were all gone. Again, "computation error" is on the _non_ "DEFAULT_xxxx_205"s... I looked through your results trying to find the 205 in question and couldn't locate it - can you give a link or a WU number? There are still some being "recycled", that had been delayed in large queues; I'm trying to see how many more hosts are likely to be bit by these. Also, if it ran okay - how long did it take? Those are supposed to be 100x as large as the "normal" ones...   | 
| truckpuller Send message Joined: 5 Nov 05 Posts: 40 Credit: 229,134 RAC: 0 | 
 I did just get a Default 205 (12/28, yesterday) download and it ran ok; was under the impression that the Default 205's were all gone. It was Default_1hz6_205_87_1. I just uploaded it, but in the Transfers tab I did not see it being uploaded. The CPU time shown was 10:41:53. I am mentioning this just to inform you that there are still some of these floating around. Thanks again. Visit us at Christianboards.org | 
| truckpuller Send message Joined: 5 Nov 05 Posts: 40 Credit: 229,134 RAC: 0 | 
 Just went to upload from another computer, and 3 out of the 4 jobs had computation errors also. The jobs are as follows: 1ogw_topology_sample_207_1688_10 1ogw_topology_sample_207_12547_8 1ogw_topology_sample_207_9521_4 As I had mentioned before, I had several of these topology jobs fail on another machine. I am uploading the above jobs now. When I went to upload the jobs, they did not show up in the Transfers section as being uploaded. Visit us at Christianboards.org | 
|  Tern  Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,700,566 RAC: 2 | 
 can you give a link or a WU number? The name doesn't help; it's not shown on the "list of results" page. And I don't even know which of your computers this is... However, the "10 hours+" was enough; I just looked at the latest results from each of your computers until I saw one that was 38,000+ seconds, then dug down to find the name. It did _NOT_ complete successfully; it shows client error, as expected. The good news is that the WU's error status is "cancelled", meaning it will not be sent to a third person. (This one was in the first person's cache for eight days before they aborted it, as requested.) So the "205"s are coming along well. I also found your "short" computation-error results from today on another of your computers. The one I looked at in detail had been processed by five people now, but it _also_ shows as "cancelled". So... someone on the project staff apparently has managed to figure out how to identify these, and has cancelled the resending of them; all that remains is for those who have already downloaded these and have them in their cache to return the final failure. Progress! :-)   | 
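The match between truckpuller's reported CPU time of 10:41:53 and the "38,000+ seconds" result can be checked with a quick conversion. A small sketch follows. The function name is only illustrative, and it assumes the time is always given as hours:minutes:seconds.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an 'H:MM:SS' CPU-time string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


print(hms_to_seconds("10:41:53"))  # 38513
```

So 10:41:53 is 38,513 seconds, which fits the result that was found on the results page.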
| Scribe  Send message Joined: 2 Nov 05 Posts: 284 Credit: 157,359 RAC: 0 | 
 Bill, I have just checked a couple of the 'short' failures visible on my results, one with 10 sendings, the other with 4. Both have now been set to 'cancelled', so you are probably correct about the staff finding a way...     | 
         ©2025 University of Washington 
https://www.bakerlab.org