Issues with 4.82

Message boards : Number crunching : Issues with 4.82

David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 11232 - Posted: 23 Feb 2006, 6:27:28 UTC

We think the increased frequency of problems with version 4.82 is probably due to the increased average work unit run time. If a significant fraction of your work units are having problems, please reduce the target run time to two hours (the default is currently 8 hours); this should reduce the chance of an error during the run by a factor of four. We will also reduce the default target time to four hours. On RALPH we probably didn't see these problems because the default time was set to one hour so we could get test results back quickly.
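The factor-of-four figure follows if errors strike at a roughly constant rate per hour of run time. A minimal sketch of that reasoning, assuming a constant hazard rate; the rate used below is a made-up illustrative value, not a measurement from 4.82:

```python
import math

def error_probability(rate_per_hour: float, run_hours: float) -> float:
    """Chance of at least one error during a run, assuming errors arrive
    at a constant rate per hour (an assumption, not a measured property)."""
    return 1.0 - math.exp(-rate_per_hour * run_hours)

RATE = 0.01  # hypothetical: one error per 100 CPU hours

p8 = error_probability(RATE, 8.0)  # current default target run time
p2 = error_probability(RATE, 2.0)  # suggested target run time
ratio = p8 / p2

print(f"8 h run: {p8:.4f}   2 h run: {p2:.4f}   ratio: {ratio:.2f}")
```

Under this model the ratio approaches 4 when the error rate is small. The error rate per hour is unchanged; what a shorter run reduces is the chance that any single work unit fails, and the amount of work lost when one does.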

David Kim is working hard to get stack tracing implemented so we can eliminate the sources of the errors as soon as possible. This is our number one priority.

On the few Windows machines we have locally, we have seen almost no errors with 4.82. Of course, it would be very useful to know which machine configurations are most correlated with errors. For example, could optimized clients be more likely to have problems?

It would be very useful if people who read this could briefly describe their machines and the fraction of work units that are having problems. Hopefully patterns will emerge that help isolate the problems.
ID: 11232 · Rating: 0
[B@H] Ray
Joined: 20 Sep 05
Posts: 118
Credit: 100,251
RAC: 0
Message 11235 - Posted: 23 Feb 2006, 10:19:09 UTC

Dave,
No problems yet over here. Two finished in about 8 hours each.


ID: 11235 · Rating: 0
Nothing But Idle Time
Joined: 28 Sep 05
Posts: 209
Credit: 139,545
RAC: 0
Message 11245 - Posted: 23 Feb 2006, 13:11:50 UTC - in response to Message 11232.  

...we will also reduce the default target time to four hours.

I haven't encountered any errors -- yet -- and would like to leave the run time at 8 hours but don't see any option to select 8 hours, per se.

On the few windows machines we have locally, we have seen almost no errors with 4.82.
For what it's worth: My machine is 3GHz/HT windows xp with no WU errors so far (fingers crossed). In fact, I rarely encounter any error on any WU on any project (except for ghost WUs). Makes me wonder if my machine setup is super good (yeah, right!) or the problems are more related to individual computer setups like networking or over-clocking, or using machines at near minimum requirements. My setup is straight out of the box from Dell and I don't tamper with it (don't know how anyway), though I did increase the memory considerably because I could afford it and I like big safety margins even if it's wasted.
ID: 11245 · Rating: 0
Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 11246 - Posted: 23 Feb 2006, 13:26:02 UTC

From here you can change your "target CPU time". The default (i.e., nothing selected) is 8 hours.
ID: 11246 · Rating: 0
arklms
Joined: 17 Dec 05
Posts: 7
Credit: 177,488
RAC: 0
Message 11248 - Posted: 23 Feb 2006, 13:56:22 UTC

The work units that have errored for me have all crashed within the first few seconds of startup. Most of these occurred on a P3 Coppermine 667MHz, but I've seen the occasional one on a dual AMD setup too.
ID: 11248 · Rating: 0
Hoelder1in
Joined: 30 Sep 05
Posts: 169
Credit: 3,915,947
RAC: 0
Message 11251 - Posted: 23 Feb 2006, 15:13:10 UTC - in response to Message 11248.  

The workunits I have had error have all crashed within the first few seconds of startup. Most of these have occurred on a P3 Coppermine 667MHz, but I've seen the occasional one on a Dual AMD setup too.

See this post by David Kim.
ID: 11251 · Rating: 0
David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 11258 - Posted: 23 Feb 2006, 16:20:45 UTC - in response to Message 11245.  

...we will also reduce the default target time to four hours.

I haven't encountered any errors -- yet -- and would like to leave the run time at 8 hours but don't see any option to select 8 hours, per se.

On the few windows machines we have locally, we have seen almost no errors with 4.82.
For what it's worth: My machine is 3GHz/HT windows xp with no WU errors so far (fingers crossed). In fact, I rarely encounter any error on any WU on any project (except for ghost WUs). Makes me wonder if my machine setup is super good (yeah, right!) or the problems are more related to individual computer setups like networking or over-clocking, or using machines at near minimum requirements. My setup is straight out of the box from Dell and I don't tamper with it (don't know how anyway), though I did increase the memory considerably because I could afford it and I like big safety margins even if it's wasted.



Thanks! Collectively we should be able to figure out which machine configurations are having the most errors and then track down the problems. Please post any ideas and success rates on your machines (good and bad).
ID: 11258 · Rating: 0
Scribe
Joined: 2 Nov 05
Posts: 284
Credit: 157,359
RAC: 0
Message 11259 - Posted: 23 Feb 2006, 16:26:10 UTC

This one has bombed out twice....
ID: 11259 · Rating: 0
Hoelder1in
Joined: 30 Sep 05
Posts: 169
Credit: 3,915,947
RAC: 0
Message 11261 - Posted: 23 Feb 2006, 16:34:49 UTC - in response to Message 11259.  

This one has bombed out twice....

Yes, but it won't do so again because the batch (*fullatom*318*) has been cancelled by David Kim (see my link to his post further down in this thread).
ID: 11261 · Rating: 0
AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 11264 - Posted: 23 Feb 2006, 18:18:51 UTC

I have an Athlon XP 2400 with 512MB and Win 2000 sp2 with no errors (has crunched about a dozen 8-hour WUs).
https://boinc.bakerlab.org/rosetta/results.php?hostid=109566

It does have an occasional ghost WU. When this happened, the message log said:
Wed Feb 22 00:11:57 2006|rosetta@home|Started upload of PRODUCTION_ABINITIO_INCREASECYCLES50_1fkb__317_256_0_0
Wed Feb 22 00:11:57 2006|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
Wed Feb 22 00:11:57 2006|rosetta@home|Reason: To fetch work
Wed Feb 22 00:11:57 2006|rosetta@home|Requesting 17537 seconds of new work
Wed Feb 22 00:12:04 2006|rosetta@home|Finished upload of PRODUCTION_ABINITIO_INCREASECYCLES50_1fkb__317_256_0_0
Wed Feb 22 00:12:04 2006|rosetta@home|Throughput 10382 bytes/sec
Wed Feb 22 00:14:03 2006|rosetta@home|Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi failed with a return value of 500
Wed Feb 22 00:14:03 2006|rosetta@home|No schedulers responded
Wed Feb 22 00:15:03 2006|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
Wed Feb 22 00:15:03 2006|rosetta@home|Reason: To fetch work
Wed Feb 22 00:15:03 2006|rosetta@home|Requesting 17352 seconds of new work, and reporting 1 results
Wed Feb 22 00:15:08 2006|rosetta@home|Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
Wed Feb 22 00:15:08 2006|rosetta@home|Message from server: Not sending work - last RPC too recent: 185 sec
Wed Feb 22 00:15:08 2006|rosetta@home|No work from project

The above was for WU https://boinc.bakerlab.org/rosetta/result.php?resultid=11872367

With an earlier ghost the message log said:
Sat Feb 18 04:18:26 2006|rosetta@home|Started upload of PRODUCTION_ABINITIO_INCREASECYCLES50_1cei__312_341_0_0
Sat Feb 18 04:18:26 2006|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
Sat Feb 18 04:18:26 2006|rosetta@home|Reason: To fetch work
Sat Feb 18 04:18:26 2006|rosetta@home|Requesting 2286 seconds of new work
Sat Feb 18 04:18:31 2006|rosetta@home|Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
Sat Feb 18 04:18:31 2006|rosetta@home|Message from server: Server can't open database
Sat Feb 18 04:18:31 2006|rosetta@home|Project is down
Sat Feb 18 04:18:32 2006|rosetta@home|Finished upload of PRODUCTION_ABINITIO_INCREASECYCLES50_1cei__312_341_0_0
Sat Feb 18 04:18:32 2006|rosetta@home|Throughput 10921 bytes/sec
Sat Feb 18 05:18:32 2006|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
Sat Feb 18 05:18:32 2006|rosetta@home|Reason: To fetch work
Sat Feb 18 05:18:32 2006|rosetta@home|Requesting 6523 seconds of new work, and reporting 1 results
Sat Feb 18 05:22:02 2006|rosetta@home|Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi failed with a return value of 500
Sat Feb 18 05:22:02 2006|rosetta@home|No schedulers responded

That was for WU https://boinc.bakerlab.org/rosetta/result.php?resultid=11663100

There is no sign that the files for these ghost WUs were ever downloaded. Perhaps there can be a problem if work is requested when an upload is in progress?
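One way to test that hypothesis across a longer log is to flag every scheduler request sent while an upload was still open. A rough sketch, assuming the `timestamp|project|message` layout shown in the logs above; the function name is made up for illustration, and the sample is three lines copied from the first log:

```python
from datetime import datetime

LOG = """\
Wed Feb 22 00:11:57 2006|rosetta@home|Started upload of PRODUCTION_ABINITIO_INCREASECYCLES50_1fkb__317_256_0_0
Wed Feb 22 00:11:57 2006|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
Wed Feb 22 00:12:04 2006|rosetta@home|Finished upload of PRODUCTION_ABINITIO_INCREASECYCLES50_1fkb__317_256_0_0
"""

START = "Started upload of "
FINISH = "Finished upload of "

def requests_during_upload(log_text):
    """Return timestamps of scheduler requests sent while any upload was open."""
    open_uploads = set()
    hits = []
    for line in log_text.splitlines():
        parts = line.split("|", 2)
        if len(parts) != 3:
            continue  # not a timestamp|project|message line
        stamp, _project, message = parts
        when = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
        if message.startswith(START):
            open_uploads.add(message[len(START):])
        elif message.startswith(FINISH):
            open_uploads.discard(message[len(FINISH):])
        elif message.startswith("Sending scheduler request") and open_uploads:
            hits.append(when)
    return hits

hits = requests_during_upload(LOG)
for when in hits:
    print("scheduler request during upload at", when)
```

Both ghost WUs above would be flagged by this check. That shows the overlap occurred, not that it caused the ghosts; the server errors in the same logs are an equally plausible explanation.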

I have a bunch of Linux machines that haven't had any problems, but they seem to be using 4.81.
ID: 11264 · Rating: 0
uioped1
Joined: 9 Feb 06
Posts: 15
Credit: 1,058,481
RAC: 0
Message 11273 - Posted: 23 Feb 2006, 22:08:43 UTC - in response to Message 11232.  

We think the increased frequency of problems with version 4.82 is probably due to the increased average work unit run time. If a significant fraction of your work units are having problems, please reduce the target run time to two hours (the default is currently 8 hours); this should reduce the chance of an error during the run by a factor of four. We will also reduce the default target time to four hours. On RALPH we probably didn't see these problems because the default time was set to one hour so we could get test results back quickly.


Not an error exactly, but a complaint: I just got my first batch of the new 4.82 workunits, which will take vastly more time than Boinc requested. I'm not clear on how the scheduler decides on how many workunits will fulfill a request for x seconds of work, but apparently this was not adjusted with the new search mode. (requested 48 hours of work, received possibly 120)
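A simplified model shows how a stale run-time estimate could produce numbers like these. It assumes the scheduler simply divides the requested seconds by its per-task estimate; the real BOINC scheduler weighs other factors, and the 3.2-hour estimate is a hypothetical value chosen to reproduce the 48-versus-120-hour figures reported above:

```python
import math

def tasks_sent(requested_s: int, estimated_s_per_task: int) -> int:
    """Tasks needed to fill a work request, given the per-task estimate."""
    return math.ceil(requested_s / estimated_s_per_task)

REQUESTED_S = 48 * 3600     # 48 hours of work requested
STALE_ESTIMATE_S = 11_520   # hypothetical 3.2 h estimate left over from shorter WUs
ACTUAL_S = 8 * 3600         # 8 h default target run time under 4.82

n = tasks_sent(REQUESTED_S, STALE_ESTIMATE_S)
actual_hours = n * ACTUAL_S / 3600

print(f"{n} tasks sent, about {actual_hours:.0f} hours of real work")
```

With these inputs the request is filled by 15 tasks, which then take 120 hours to run. Any estimate that is too low by a factor of 2.5 gives the same result.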
ID: 11273 · Rating: 0
genes
Joined: 8 Oct 05
Posts: 60
Credit: 847,605
RAC: 0
Message 11287 - Posted: 24 Feb 2006, 3:23:56 UTC
Last modified: 24 Feb 2006, 4:11:35 UTC

Just had another WU fail on this machine:

https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=13228

but this time it was with a Ralph WU. On Ralph the link to the machine is this:

http://ralph.bakerlab.org/show_host_detail.php?hostid=953

The failure was exactly the same as with the three 4.82 WU's that failed earlier. BTW the last one I got completed successfully.

A little bit of log around the error:

2/23/2006 8:09:41 PM|ralph@home|Resuming result BARCODE_30_1a68__215_17_0 using rosetta_beta version 486
2/23/2006 8:09:41 PM|rosetta@home|Pausing result PRODUCTION_ABINITIO_DBFLAGS_1tit__307_607_1 (left in memory)
2/23/2006 8:40:14 PM|ralph@home|Unrecoverable error for result BARCODE_30_1a68__215_17_0 ( - exit code -1073741811 (0xc000000d))
2/23/2006 8:40:17 PM||request_reschedule_cpus: process exited
2/23/2006 8:40:17 PM|ralph@home|Computation for result BARCODE_30_1a68__215_17_0 finished
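The negative exit code and the hex value in that log line are the same 32-bit number read as signed and unsigned. A short sketch of the conversion; the lookup table holds only two well-known Windows NTSTATUS values and is not a complete list:

```python
# A couple of well-known Windows NTSTATUS codes (deliberately incomplete).
KNOWN_STATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000000D: "STATUS_INVALID_PARAMETER",
}

def describe_exit_code(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as unsigned hex and name it if known."""
    unsigned = exit_code & 0xFFFFFFFF
    name = KNOWN_STATUS.get(unsigned, "unknown")
    return f"0x{unsigned:08x} ({name})"

description = describe_exit_code(-1073741811)
print(description)  # 0xc000000d (STATUS_INVALID_PARAMETER)
```

Searching for the hex form or the status name usually turns up more than searching for the negative decimal value.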


The machine is a dual P3 1GHz with 1GB of ram, running WinXP SP2. Running with "Leave in Memory" = YES, several other BOINC projects, ... did I forget anything? Oh yes, BOINC CC 5.2.15.

Failed Rosetta WU's:
https://boinc.bakerlab.org/rosetta/result.php?resultid=11823719
https://boinc.bakerlab.org/rosetta/result.php?resultid=11805479
https://boinc.bakerlab.org/rosetta/result.php?resultid=11796212
Failed Ralph WU:
http://ralph.bakerlab.org/result.php?resultid=6153


ID: 11287 · Rating: 0
Beezlebub
Joined: 18 Oct 05
Posts: 40
Credit: 260,375
RAC: 0
Message 11288 - Posted: 24 Feb 2006, 4:40:35 UTC - in response to Message 11232.  
Last modified: 24 Feb 2006, 4:46:32 UTC


David Kim is working hard to get stack tracing implemented so we can eliminate the sources of the errors as soon as possible. This is our number one priority.

On the few Windows machines we have locally, we have seen almost no errors with 4.82. Of course, it would be very useful to know which machine configurations are most correlated with errors. For example, could optimized clients be more likely to have problems?

It would be very useful if people who read this could briefly describe their machines and the fraction of work units that are having problems. Hopefully patterns will emerge that help isolate the problems.


I am running three machines (a 2 GHz P4, a 2.1 GHz AMD 2800, and a 2.8 GHz Pentium D 820) on Rosetta, SETI, and Einstein, all running 24/7. Of 91 total Rosetta results, only 2 were unsuccessful, due to download problems (server-side, I think). I believe the people with multiple failures need to check their hardware and stability (overclocking, heat, antivirus, etc.) BEFORE complaining about the program itself. I'm not attacking anyone; I'm just pointing out that my mix of machines, and the others who posted "no problems", point to something OTHER than 4.82 being the problem.

My machine specs:

"Black" P4 2ghz 1024mb PC2700 DDR Foxcon mb.

"Boinc" AMD 2800+ 2.1ghz 1024mb Kingmax PC3200 DDR.

"TxEagle" Pentium D820 dual core 2.8ghz. 1gig OCZ Gold PC2-5400 DDR2 dual channel, ASUS P5LD2-VM MB.

All are running XP Pro Sp2



ID: 11288 · Rating: 0
uioped1
Joined: 9 Feb 06
Posts: 15
Credit: 1,058,481
RAC: 0
Message 11297 - Posted: 24 Feb 2006, 6:32:34 UTC - in response to Message 11282.  


Not an error exactly, but a complaint: I just got my first batch of the new 4.82 workunits, which will take vastly more time than Boinc requested. I'm not clear on how the scheduler decides on how many workunits will fulfill a request for x seconds of work, but apparently this was not adjusted with the new search mode. (requested 48 hours of work, received possibly 120)



You can adjust the run length of the WUs in your preferences. See this post in the Rosetta FAQs for details.


Ah, I hadn't realized that you could do that for units already downloaded. Still, that wouldn't have fixed the problem I experienced. At best, I can set my requested time back to two, and try to set it back before I start my last result so that I can correct the time correction factor for next time.

Thanks for your help.


ID: 11297 · Rating: 0
uioped1
Joined: 9 Feb 06
Posts: 15
Credit: 1,058,481
RAC: 0
Message 11392 - Posted: 25 Feb 2006, 18:32:32 UTC - in response to Message 11359.  
Last modified: 25 Feb 2006, 18:34:25 UTC




See if this post helps you at all.



That post was definitely helpful. I think a new FAQ entry or stickied thread would be useful, specifically about the effects you will see before the BOINC clients adjust to the new units. What I posted here was an attempt, but I haven't worked through my queue yet, so I don't even know that it's correct. The points I'm specifically referring to are:

This is a temporary problem.
It can be mitigated temporarily by doing X.
It will happen every time you download work units after increasing the processing time.
Aborting all your units won't fix the problem; it persists until you have completed at least a few of the new units.


Please correct me if I've got something wrong, I'd hate to be disseminating wrong info.
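The "completed at least a few of the new units" point can be illustrated with a toy model of a run-time correction factor. This is NOT the BOINC client's actual update rule; it is a plain smoothing rule with an assumed weight of 0.5 and the same hypothetical 3.2-hour raw estimate against 8-hour real runs:

```python
def next_factor(factor: float, raw_estimate_s: float, actual_s: float,
                weight: float = 0.5) -> float:
    """Move the correction factor part of the way toward actual/estimate.
    Illustrative smoothing rule only; the weight is an assumption."""
    target = actual_s / raw_estimate_s
    return factor + weight * (target - factor)

RAW_ESTIMATE_S = 11_520   # hypothetical stale estimate (3.2 h)
ACTUAL_S = 28_800         # real run time (8 h)

factor = 1.0
history = []
for completed in range(1, 6):
    factor = next_factor(factor, RAW_ESTIMATE_S, ACTUAL_S)
    history.append(factor)
    print(f"after {completed} completed units: factor {factor:.3f}")
```

In this model the factor only moves when a unit finishes, which is why aborting the queue changes nothing, and it closes in on the true ratio of 2.5 after a handful of completed units.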
ID: 11392 · Rating: 0
dcdc
Joined: 3 Nov 05
Posts: 1834
Credit: 124,260,318
RAC: 8
Message 11396 - Posted: 25 Feb 2006, 19:40:56 UTC
Last modified: 25 Feb 2006, 19:42:22 UTC

I had one PC running an 'optimised' client and it started getting lots of errors. I swapped boinc.exe for the standard one from one of my other PCs and the errors seem to have stopped. Anything reported after the 24th Feb has been run using the standard updated client: https://boinc.bakerlab.org/rosetta/results.php?hostid=53007

Might be coincidence, or another unrelated problem, but it might be the problem with a number of the clients out there...
ID: 11396 · Rating: 0




©2025 University of Washington
https://www.bakerlab.org