Problems and Technical Issues with Rosetta@home

Jean-David Beyer
Joined: 2 Nov 05
Posts: 187
Credit: 6,375,683
RAC: 5,738
Message 109853 - Posted: 13 Oct 2024, 0:11:58 UTC - in response to Message 109844.  

There is a new batch of Beta work out.
Just be warned: you need roughly 2.2GB of RAM per Task (although it looks like it will drop down after a while to 1-2GB).

My latest of these are taking 2.3GB to 2.5GB each on my Linux machine. I allow 4 Rosetta tasks to run at a time. IIRC, they take 8 to 9 hours of wall-clock time to run.

Computer 5910575
Computer information

CPU type 	GenuineIntel
Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
Number of processors 	16
Coprocessors 	---
Operating System 	Linux Red Hat Enterprise Linux
Red Hat Enterprise Linux 8.10 (Ootpa) [4.18.0-553.22.1.el8_10.x86_64|libc 2.28]
BOINC version 	7.20.2
Memory 	128085.97 MB
Cache 	16896 KB
Swap space 	15992 MB
Total disk space 	488.04 GB
Free Disk Space 	479.37 GB

ID: 109853 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2117
Credit: 41,155,895
RAC: 16,061
Message 109855 - Posted: 13 Oct 2024, 3:25:51 UTC - in response to Message 109852.  

You made me look.
All less than 400MB here for some reason
After about half an hour mine end up down around 700-800MB. But when they first start after 5min or so they're still up around 2GB+ before dropping down again.
I'd check what application is actually running those other Tasks - probably your other projects, or they're resends and not one of the latest Beta batch.

When I wrote that I was on my 6-core machine.
Odder still, now that I've returned to my 8C/16T machine, I found it was running only 8 tasks, plus 2 waiting for memory, and only 1 WCG task running, leaving 7 cores idle.
Checking Task Manager, all Rosetta tasks were again only using 300-400MB each and I had a lot of RAM spare too. What happened to the other cores, I don't know.
I suspended all Rosetta tasks, letting SiDock get its turn - 16 tasks very low RAM - then when they finished, priority went back to Rosetta and all 16 threads started running tasks again.
Some funny business going on somewhere...

At least I created a little space to download more of the remaining few Rosetta tasks available. Not many left now.
ID: 109855 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2117
Credit: 41,155,895
RAC: 16,061
Message 109857 - Posted: 13 Oct 2024, 8:35:03 UTC - in response to Message 109855.  
Last modified: 13 Oct 2024, 8:52:22 UTC

Got up to find only 6 Rosetta tasks running, plus 4 waiting for memory and 6 threads idle, while RAM is at 65% used and 5.5GB free.
5 of the tasks are using 310-440MB, only one using 2.122GB.
This is very odd

Edit: This is getting even weirder.
Having set NNT (No New Tasks) for all projects, I suspended all the <non-running> Rosetta tasks first, and immediately the 6 idle threads started running 6 SiDock tasks. Those Rosetta tasks waiting for memory were still waiting for memory.
Then I suspended all the running and waiting-for-memory Rosetta tasks, and all 16 threads are now running SiDock tasks.
My intention, as before, was to free up my non-Rosetta offline cache to allow more Rosetta into the cache, then return to Rosetta tasks, restarting them one at a time, which seemed to allow all 16 threads to run Rosetta the last time.
Why suspending unstarted tasks allowed the idle threads to be utilised, I have no idea. Never seen that before.
ID: 109857 · Rating: 0
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1673
Credit: 17,616,240
RAC: 22,198
Message 109858 - Posted: 13 Oct 2024, 9:07:18 UTC - in response to Message 109857.  

Got up to find only 6 Rosetta tasks running, plus 4 waiting for memory and 6 cores idle, while RAM is at 65% used and 5.5GB free.
5 of the tasks are using 310-440MB, only one using 2.122GB.
This is very odd
Very, very odd.
Most of my Tasks are now using around 2GB of RAM, even after running for a few hours.

I'd suggest checking your "When and how BOINC uses your computer" preferences.

These are mine. The most likely to be causing issues are the Memory preferences: is "Leave non-GPU tasks in memory while suspended" selected? Low "Use at most" values would also cause issues.

Computing
Usage limits
    Use at most 100 % of the CPUs
    Use at most 100 % of CPU time

When to suspend
    Suspend when computer is on battery: No
    Suspend when computer is in use: No
    Suspend GPU computing when computer is in use: No
    'In use' means mouse/keyboard input in last 3 minutes
    Suspend when no mouse/keyboard input in last --- minutes
    Suspend when non-BOINC CPU usage is above --- %
    Compute only between ---

Other
    Store at least 0.35 days of work
    Store up to an additional 0.01 days of work
    Switch between tasks every 60 minutes
    Request tasks to checkpoint at most every 60 seconds

Disk
    Use no more than 30 GB
    Leave at least 2 GB free
    Use no more than 60 % of total

Memory
    When computer is in use, use at most 95 %
    When computer is not in use, use at most 98 %
    Leave non-GPU tasks in memory while suspended: No
    Page/swap file: use at most 75 %

Grant
Darwin NT
ID: 109858 · Rating: 0
Klimax
Joined: 27 Apr 07
Posts: 44
Credit: 2,800,788
RAC: 2,415
Message 109859 - Posted: 13 Oct 2024, 9:53:29 UTC - in response to Message 109851.  
Last modified: 13 Oct 2024, 9:55:38 UTC

Also only three days of deadline?
Which has been the case for years now, and which is why you don't return a large amount of the work you download- you miss the deadline almost 50% of the time.

Your initial Estimated completion times are set by the project at 8 hours, but your actual Target CPU time appears to be 12 hours. So even with a small cache, you would miss deadlines. You need no cache, or at the very most a very small cache to avoid missing deadlines (1 day or less).
Running more than one project there is no need for a cache at all. If one project doesn't have work, the other will make up that gap, till the first project has work again. No missed deadlines, no not getting work from one project because the other has filled the cache with work.

Ideally
           Store at least 0.02 days of work
Store up to an additional 0.01 days of work

But if you really feel the need for a cache
           Store at least 0.35 days of work
Store up to an additional 0.01 days of work
would be plenty.


If all of your projects have server issues or lots of periods with no work, then "Store at least xx days of work" could be set to 1 day (or 1.5 days if you go with the default Target CPU time of 8 hours).

While on the surface, or at first pass, you'd be correct, things are a bit more complex. (Side note: my question about deadlines shows how long it has been since I paid full attention to the project.)

First, variable memory consumption. Often several tasks are waiting for others to finish or to release memory they currently hold. (It seems that, unlike others, I get tasks that keep all 2GB allocated even much later in the run.)

Second, BOINC or Rosetta has very weird accounting of remaining time to completion. Lots of tasks are now around the 50% mark, yet their estimated time to completion is still 12 hours, while waiting tasks show 8 hours. (And yes, my target runtime is 12 hours.)

All those canceled tasks were, in any case, from the beginning of the month, when the computer was configured to use all 20 cores and ran out of memory very fast, leaving lots of tasks waiting for memory and thus lots of cancellations. I have since changed the configuration to use only 10 cores, so the same situation shouldn't occur.

BTW: My cache is configured for 1+1. As long as the computer is running 24h there should be a day of reserve.

ETA: Looks like some tasks finally dropped in memory usage to 1GB and one to 700MB.
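
For anyone following the deadline argument in the quoted advice, here is a rough Python sketch of the arithmetic. It is only an illustration under a deliberately simple assumption: the client fills the cache using the 8 hour estimate while every task really runs for the 12 hour target, and no correction of the estimates is applied. The function name `real_queue_days` is made up for this example and is not part of BOINC.

```python
def real_queue_days(cache_days, estimated_hours, actual_hours):
    """Days needed to crunch through a cache that was sized by estimate.

    Simplified model (an assumption, not BOINC's own logic): the client
    downloads cache_days of work believing each task takes estimated_hours,
    but every task really runs for actual_hours.
    """
    return cache_days * actual_hours / estimated_hours


DEADLINE_DAYS = 3  # the three day deadline mentioned in the thread

for label, cache_days in [("1 + 1", 2.0), ("0.35 + 0.01", 0.36)]:
    days = real_queue_days(cache_days, estimated_hours=8, actual_hours=12)
    margin = DEADLINE_DAYS - days
    print(f"cache {label}: queue drains in {days:.2f} days, "
          f"margin before deadline {margin:.2f} days")
```

Under this model a full 1 + 1 cache takes about three days to drain, so the last task downloaded finishes right at the deadline with no margin for downtime, while a 0.35 + 0.01 cache drains in roughly half a day.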
ID: 109859 · Rating: 0
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1673
Credit: 17,616,240
RAC: 22,198
Message 109871 - Posted: 14 Oct 2024, 7:31:42 UTC - in response to Message 109859.  
Last modified: 14 Oct 2024, 7:32:16 UTC

BTW: My cache is configured for 1+1. As long as the computer is running 24h there should be a day of reserve.
Setting it that way may not give you what you might expect it to.
If you want 2 days' worth, then set it to 2 + 0.01.

Those additional days are just that: additional days. They will only be added when the cache gets low enough to reach the "Store at least" value and needs to be topped up. Then it will also top up the additional day, which will then run down again until the "Store at least" value is reached again.


With it set to 1+1 you will get one day's worth, plus another day's worth, but then the cache will run down to just under 1 day's worth, then it will refill the 1 day and then refill the second additional day.
With it set to 2 + 0.01, as it returns a Task, it will download another to keep the cache at the 2 day level.
Grant
Darwin NT
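
The refill behaviour described in this post can be sketched as a tiny simulation. This is a simplified model, not the BOINC client's actual work-fetch code: work is assumed to drain at a constant rate, and the buffer is assumed to be refilled to "Store at least" plus "additional" whenever it drops below "Store at least". Levels are tracked in hundredths of a day to avoid floating-point drift, and `simulate_cache` is a hypothetical helper name.

```python
def simulate_cache(min_hundredths, extra_hundredths, steps):
    """Return (lowest, highest) buffered work seen, in hundredths of a day.

    Assumed model: each step 0.01 days of work is crunched; when the
    buffer falls below the "store at least" level it is topped up to
    "store at least" + "additional".
    """
    full = min_hundredths + extra_hundredths
    level = full  # start with a full cache
    lowest = highest = level
    for _ in range(steps):
        level -= 1  # 0.01 days of work completed
        lowest = min(lowest, level)  # record the level before any refill
        if level < min_hundredths:
            level = full  # top up to minimum + additional
        highest = max(highest, level)
    return lowest, highest


for label, minimum, extra in [("1 + 1", 100, 100), ("2 + 0.01", 200, 1)]:
    low, high = simulate_cache(minimum, extra, steps=1000)
    print(f"{label}: cache swings between {low / 100:.2f} "
          f"and {high / 100:.2f} days")
```

In this model the 1 + 1 setting swings between just under 1 day and 2 days, while 2 + 0.01 stays within about 0.01 days of the 2 day level, which matches the description in the post.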
ID: 109871 · Rating: 0
tgbauer
Joined: 5 Jan 06
Posts: 10
Credit: 100,868,563
RAC: 74,698
Message 109875 - Posted: 15 Oct 2024, 1:26:27 UTC

Looks like Application "Rosetta Beta 6.06" tasks are using 2.5GB of RAM each! That becomes a bit inefficient when you have 128 cores in a computer and 128GB of RAM (only 46/128 cores used). The ones before that, and "Rosetta 4.20", are consuming less than 0.5GB (and all 128 cores are used).
Is it possible to limit the RAM usage per task, so I can use all cores again?
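
As a rough check of the numbers in this post, the following Python sketch estimates how many tasks can run at once when RAM, rather than core count, is the limit. It is a back-of-the-envelope model only: `tasks_that_fit` and its `usable_fraction` parameter are hypothetical names, and the real BOINC client schedules on measured working-set sizes and its own memory preferences rather than a fixed per-task figure.

```python
def tasks_that_fit(total_ram_gb, ram_per_task_gb, cores, usable_fraction=1.0):
    """Upper bound on concurrent tasks for a given RAM budget and core count.

    usable_fraction is the share of RAM assumed to be available to tasks
    (1.0 means all of it, which is optimistic on a real machine).
    """
    by_memory = int(total_ram_gb * usable_fraction // ram_per_task_gb)
    return min(cores, by_memory)


# 128 cores and 128GB of RAM, as described in the post
print(tasks_that_fit(128, 2.5, 128))  # Rosetta Beta 6.06 at about 2.5GB each
print(tasks_that_fit(128, 0.5, 128))  # older tasks at under 0.5GB each
```

With every byte of RAM available, 2.5GB tasks top out at 51 running at once, and 0.5GB tasks are limited only by the 128 cores. The 46 actually observed would be consistent with some RAM being held back for the operating system or by the BOINC memory preferences, although the post does not say.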
ID: 109875 · Rating: 0
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1673
Credit: 17,616,240
RAC: 22,198
Message 109876 - Posted: 15 Oct 2024, 5:46:03 UTC - in response to Message 109875.  

Is it possible to limit the RAM usage per task, so I can use all cores again?
No.
As mentioned in the previous posts, the high RAM usage is generally only for the first 30min or so. After that, it drops down to 1GB or less (although there can be some Tasks where it goes up to 2GB per Task for a while later on, before dropping down again- my current Tasks after 4 hours are using around 800MB each).
Grant
Darwin NT
ID: 109876 · Rating: 0
Bill Swisher
Joined: 10 Jun 13
Posts: 32
Credit: 32,808,832
RAC: 52,579
Message 109877 - Posted: 15 Oct 2024, 15:01:46 UTC - in response to Message 109876.  
Last modified: 15 Oct 2024, 15:02:31 UTC

No. As mentioned in the previous posts, the high RAM usage is generally only for the first 30min or so. After that, it drops down to 1GB or less (although there can be some Tasks where it goes up to 2GB per Task for a while later on, before dropping down again- my current Tasks after 4 hours are using around 800MB each).


Then I have no choice but to NOT run Rosetta beta, and I don't see an option to turn the beta work off. If I could limit the number of them running (say 4), it would be possible. So I guess I'll leave Rosetta turned off. As I said, 16c/32t and 32GB of memory (I build pretty much all my computers nowadays with 32GB of memory), and I'm physically away from the computers for 5 months out of the year. Well, it was fun while it lasted.
ID: 109877 · Rating: 0
PMH_UK
Joined: 9 Aug 08
Posts: 16
Credit: 1,243,749
RAC: 0
Message 109878 - Posted: 15 Oct 2024, 15:23:59 UTC - in response to Message 109877.  

You can limit the number running using an app_config.xml file in Rosetta's project directory.
Create or amend it, then select Read config files from the menu.
Paul.
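
A minimal sketch of what such a file might contain, generated here with Python so the output can be checked. The `<app_config>`, `<app>`, `<name>` and `<max_concurrent>` elements are the ones the BOINC client documents for app_config.xml. The app name `rosetta_beta` and the limit of 4 are assumptions for illustration: the short app name should be confirmed against your own client_state.xml, and the limit chosen to suit your RAM. The helper name `build_app_config` is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, max_concurrent):
    """Build app_config.xml text that caps concurrent tasks for one app."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    ET.indent(root)  # pretty-print; requires Python 3.9 or later
    return ET.tostring(root, encoding="unicode")


# "rosetta_beta" is an assumed short name; 4 is the example limit
# suggested earlier in the thread.
xml_text = build_app_config("rosetta_beta", 4)
print(xml_text)
```

The printed text would be saved as app_config.xml in the Rosetta project directory and picked up with Read config files, as described in the post.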
ID: 109878 · Rating: 0
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1673
Credit: 17,616,240
RAC: 22,198
Message 109881 - Posted: 16 Oct 2024, 11:14:57 UTC

Well, at least it's been a while since the last time.

boinc-process host is down again, so no Validation until it lives again.
Grant
Darwin NT
ID: 109881 · Rating: 0
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1673
Credit: 17,616,240
RAC: 22,198
Message 109884 - Posted: 17 Oct 2024, 5:46:19 UTC - in response to Message 109881.  

Well, at least it's been a while since the last time.

boinc-process host is down again, so no Validation until it lives again.
And now the download server has died as well.
Grant
Darwin NT
ID: 109884 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2117
Credit: 41,155,895
RAC: 16,061
Message 109886 - Posted: 17 Oct 2024, 8:04:12 UTC - in response to Message 109858.  

Got up to find only 6 Rosetta tasks running, plus 4 waiting for memory and 6 cores idle, while RAM is at 65% used and 5.5GB free.
5 of the tasks are using 310-440MB, only one using 2.122GB.
This is very odd
Very, very odd.
Most of my Tasks are now using around 2GB of RAM, even after running for a few hours.

I'd suggest checking your "When and how BOINC uses your computer" preferences.

These are mine. The most likely to be causing issues are the Memory preferences: is "Leave non-GPU tasks in memory while suspended" selected? Low "Use at most" values would also cause issues.

Disk
                             Use no more than 60 % of total

Memory
         When computer is in use, use at most 95 %
     When computer is not in use, use at most 98 %

My only settings that are more restrictive are:
Disk 50%
Memory in use 85%
Memory not in use 95%

More likely it's that I had a faulty RAM stick the other month, so I'm only running with 16GB of RAM rather than 32GB.
ID: 109886 · Rating: 0
Klimax
Joined: 27 Apr 07
Posts: 44
Credit: 2,800,788
RAC: 2,415
Message 109887 - Posted: 17 Oct 2024, 9:48:37 UTC - in response to Message 109871.  

BTW: My cache is configured for 1+1. As long as the computer is running 24h there should be a day of reserve.
Setting it that way may not give you what you might expect it to.
If you want 2 days' worth, then set it to 2 + 0.01.

Those additional days are just that: additional days. They will only be added when the cache gets low enough to reach the "Store at least" value and needs to be topped up. Then it will also top up the additional day, which will then run down again until the "Store at least" value is reached again.


With it set to 1+1 you will get one day's worth, plus another day's worth, but then the cache will run down to just under 1 day's worth, then it will refill the 1 day and then refill the second additional day.
With it set to 2 + 0.01, as it returns a Task, it will download another to keep the cache at the 2 day level.

After a quick verification on another project... damn, you are correct. I have been misunderstanding that option for the past 17 years or so.
ID: 109887 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2117
Credit: 41,155,895
RAC: 16,061
Message 109889 - Posted: 17 Oct 2024, 11:11:05 UTC - in response to Message 109887.  

BTW: My cache is configured for 1+1. As long as the computer is running 24h there should be a day of reserve.
With it set to 1+1 you will get one day's worth, plus another day's worth, but then the cache will run down to just under 1 day's worth, then it will refill the 1 day and then refill the second additional day.
With it set to 2 + 0.01, as it returns a Task, it will download another to keep the cache at the 2 day level.

After a quick verification on another project... damn, you are correct. I have been misunderstanding that option for the past 17 years or so.

I also misunderstood it for a decade or more, but in the end I decided I <did> actually want somewhere between the minimum and maximum number of days, and didn't really care where I was as long as I was in that general area.
My target actually hovered between 1 and 1.5 days total back then, but more recently I've found it more appropriate for me to halve that, so no one project runs away with itself too far when Rosetta tasks sometimes become available.

As they have in the last hour or so.
The trouble is, as well as the boinc-process server being down, so is the download server, boinc-files.bakerlab.org, so downloads of the necessary files are failing atm.
Fingers crossed someone notices.
ID: 109889 · Rating: 0
tgbauer
Joined: 5 Jan 06
Posts: 10
Credit: 100,868,563
RAC: 74,698
Message 109891 - Posted: 17 Oct 2024, 18:59:38 UTC - in response to Message 109876.  
Last modified: 17 Oct 2024, 19:00:31 UTC

high RAM usage is generally only for the first 30min or so. After that, it drops down to 1GB or less


This is not my experience. I have beta 6.06 tasks that are currently near 50% complete and RAM usage is between 2.26GB and 2.50GB each (1.7GB to 2.2GB compressed).
Sounds like limiting the Rosetta count is the only recourse, because the RAM-to-CPU ratio is so far off, I can't prioritize the more RAM-efficient tasks, and swapping causes tasks to take 10x longer.
ID: 109891 · Rating: 0
mmonnin
Joined: 2 Jun 16
Posts: 58
Credit: 23,333,405
RAC: 33,629
Message 109893 - Posted: 17 Oct 2024, 22:49:37 UTC - in response to Message 109891.  

high RAM usage is generally only for the first 30min or so. After that, it drops down to 1GB or less


This is not my experience. I have beta 6.06 tasks that are currently near 50% complete and RAM usage is between 2.26GB and 2.50GB each (1.7GB to 2.2GB compressed).
Sounds like limiting the Rosetta count is the only recourse, because the RAM-to-CPU ratio is so far off, I can't prioritize the more RAM-efficient tasks, and swapping causes tasks to take 10x longer.


I agree, I have high RAM usage the entire time in Linux. A Win10 system had lower RAM usage than Linux, and I could run 100% R@H with 2GB of RAM per thread and still use it as my primary desktop.
ID: 109893 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2117
Credit: 41,155,895
RAC: 16,061
Message 109895 - Posted: 18 Oct 2024, 2:10:09 UTC - in response to Message 109889.  

The trouble is, as well as the boinc-process server being down, so is the download server, boinc-files.bakerlab.org, so downloads of the necessary files are failing atm.
Fingers crossed someone notices.

Looks like it was fixed 3 or 4 hours later. By the time I got back from work to do something about it, the few remaining tasks had all been snapped up again <sigh>
ID: 109895 · Rating: 0
Jonathan
Joined: 31 Jul 24
Posts: 2
Credit: 94,344
RAC: 1,476
Message 109896 - Posted: 19 Oct 2024, 6:02:50 UTC
Last modified: 19 Oct 2024, 6:51:03 UTC

Hi, I had to abort a couple of Rosetta Beta workunits on my arm64 Linux (RPi5) machine because they made it unresponsive to ssh. Possibly they were memory constrained, with 4 cores and 4GB of memory.
Task        Work unit   Computer  Sent                      Time reported or deadline  Status   Run time (sec)  CPU time (sec)  Credit  Application
1584608775  1409829153  6297726   13 Oct 2024, 9:04:25 UTC  17 Oct 2024, 9:03:56 UTC   Aborted  44.46           36.32           ---     Rosetta Beta v6.06 aarch64-unknown-linux-gnu
1584607687  1409833590  6297726   13 Oct 2024, 9:04:25 UTC  17 Oct 2024, 9:03:56 UTC   Aborted  4.78            0.00            ---     Rosetta Beta v6.06 aarch64-unknown-linux-gnu

Stderr output
<core_client_version>7.20.5</core_client_version>
<![CDATA[
<message>
aborted by user</message>
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.06_aarch64-unknown-linux-gnu @SETDB1_8UWP_boinc_fulldb_6hkEP2_0_3936.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937
Using database: database_f5ae1de8e1/database
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.

</stderr_txt>
]]>
ID: 109896 · Rating: 0
Jonathan
Joined: 31 Jul 24
Posts: 2
Credit: 94,344
RAC: 1,476
Message 109897 - Posted: 19 Oct 2024, 6:02:56 UTC
Last modified: 19 Oct 2024, 6:05:21 UTC

Apologies, duplicate post.
ID: 109897 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org