Message boards : Number crunching : Rosetta@home using AVX / AVX2 ?
Author | Message |
---|---|
Joined: 1 Dec 05 Posts: 2124 Credit: 12,426,657 RAC: 2,579 |
rjs5 has put a huge effort to help and look into optimization possibilities with Rosetta. Now that CASP is almost over, we can get back to this. ...... |
rjs5 Joined: 22 Nov 10 Posts: 274 Credit: 23,730,845 RAC: 0 |
Quote: "rjs5 has put a huge effort to help and look into optimization possibilities with Rosetta. Now that CASP is almost over, we can get back to this." I retired in June and I am decompressing from work and trying to figure out this thing called "leisure". 8-) I haven't done much more on Rosetta and my opinion has not changed. I am going to Seattle 1/7 for several days but currently have no plans to stop by the lab. David was supportive and interested, but the "developers" were "skeptical" (even somewhat "hostile") about my performance expectations. Without their interest, not even the simple changes will happen ... and IMO, those are the only ones that make sense. 1. The Rosetta Project server/network infrastructure is "creaky" and is probably already the project bottleneck. a. The big problem is likely just disk IO and reliability, but it could also be the network. b. They could improve performance by supporting multiple MACHINE CLASSES (SSE2, AVX2, ...) and getting rid of the x87 floating point. 2. The Rosetta source code is kludgy and cumbersome, and major changes to it will be VERY difficult. a. Changing the compiler to ICC gives a bump. b. Adding a FOURTH DIMENSION to the vector coordinates will enable SSE/AVX to use VECTOR instructions instead of the SCALAR instructions the code used the last time I looked. c. I did not see how the current code could be modified beyond the 4th coordinate ... a GPU that has hundreds of compute elements would not have any parallel work. 3. The "performance" increase that a person will see on their machine will vary widely. Some will see big bumps. Others will see little change. a. I saw wide variations on small machines and little variation on big machines ... it looked like cache size and memory latency were big factors. |
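As a rough illustration of the "machine classes" idea in item 1b above, the sketch below picks an application variant from the instruction sets the CPU reports. Everything named here (`MachineClass`, `detect_machine_class`, `app_variant`, and the binary names) is hypothetical and not taken from Rosetta or BOINC. The detection relies on the GCC/Clang x86 builtin `__builtin_cpu_supports`; on other compilers or architectures the sketch simply falls back to the baseline class.

```cpp
#include <cassert>
#include <string>

// Hypothetical "machine classes"; the names are illustrative, not Rosetta's.
enum class MachineClass { Baseline, SSE2, AVX2 };

// Pick the widest instruction set the running CPU reports.
// __builtin_cpu_supports is a GCC/Clang builtin for x86 targets; on any
// other compiler or architecture this sketch returns Baseline.
inline MachineClass detect_machine_class() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return MachineClass::AVX2;
    if (__builtin_cpu_supports("sse2")) return MachineClass::SSE2;
#endif
    return MachineClass::Baseline;
}

// Map a machine class to the (hypothetical) binary a server could hand out.
inline std::string app_variant(MachineClass mc) {
    switch (mc) {
        case MachineClass::AVX2: return "rosetta_avx2";
        case MachineClass::SSE2: return "rosetta_sse2";
        default:                 return "rosetta_baseline";
    }
}

// Usage example: the variant chosen for this host is one of the three names.
inline void example() {
    const std::string v = app_variant(detect_machine_class());
    assert(v == "rosetta_avx2" || v == "rosetta_sse2" || v == "rosetta_baseline");
}
```

In a real project the selection would more likely happen on the server side when work is assigned; the client-side check here is only meant to show that one binary per class is a small amount of plumbing.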
Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
Quote: "I retired in June and I am decompressing from work and trying to figure out this thing called 'leisure'." Its significance will become apparent when there is a snow and ice storm on the way and you realize that you don't have to go out in it. Then it is a big deal. |
Joined: 1 Dec 05 Posts: 2124 Credit: 12,426,657 RAC: 2,579 |
Quote: "I retired in June and I am decompressing from work and trying to figure out this thing called 'leisure'." To be clear, I'm NOT speaking about you, rjs5, you're GREAT!!! I'm referring to Rosetta's admins and their silence about this. |
rjs5 Joined: 22 Nov 10 Posts: 274 Credit: 23,730,845 RAC: 0 |
Quote: "I retired in June and I am decompressing from work and trying to figure out this thing called 'leisure'." Yep! I got that. 8-) My point was that the admins were good to great as volunteers. They, however, do not speak for the "developers", nor do they have much (if any) influence over the direction or tasks the developers work on. Developers get sensitive when you point out these things. Humorously, if you doubled the compute performance of Rosetta, they would get twice the work completed for about the same network bandwidth. A 6-hour job would do 12 hours of the old work with the same network traffic. Rosetta uses the BOINC timer to kill the Rosetta job at the end of the next Rosetta compute loop. The developers I interacted with were justifiably skeptical about my claims. I frequently had compiler developers giving me a new compiler with a fancy new switch ... telling me it would improve performance by "5%". The results were ALWAYS slower, since they did not understand the application I was working on. I worked for the last 16 years as a Software Performance Engineer on a very large enterprise-sized application that was structured and behaved similarly to Rosetta. I used my knowledge/experience with CPU/cache, memory and IO architectures to drive source code changes and compiler improvements. The Rosetta developers know what I have recommended (and are familiar with the technique) AND THEY control their implementation. 8-) If they don't care, nothing will happen. They were not "pleased" with my candid recommendations. The admins are OK. |
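The "twice the work for the same network traffic" point can be sketched as a small model of a time-budgeted job. This is only an illustration under assumed numbers: `models_completed` and its parameters are hypothetical names, and the loop merely mimics the behaviour described in the post, where the timer is consulted at the end of each compute loop so the last model always runs to completion.

```cpp
#include <cassert>
#include <cstdint>

// Count how many models (compute loops) finish under a run-time budget.
// The budget is checked only at the END of each loop, so the job may
// overshoot the budget by up to one model. Names and numbers are
// illustrative assumptions, not Rosetta or BOINC code.
inline std::int64_t models_completed(std::int64_t budget_seconds,
                                     std::int64_t seconds_per_model) {
    assert(budget_seconds > 0 && seconds_per_model > 0);
    std::int64_t elapsed = 0;
    std::int64_t models = 0;
    do {
        elapsed += seconds_per_model;    // one full compute loop
        ++models;
    } while (elapsed < budget_seconds);  // timer consulted between loops
    return models;
}
```

With a 6-hour budget (21600 s), a model that takes 600 s yields 36 models per job, and halving the per-model time to 300 s yields 72 models from the same job, so the download and upload for that job carry twice the science.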
Joined: 1 Dec 05 Posts: 2124 Credit: 12,426,657 RAC: 2,579 |
Quote: "My point was that the admins were good to great as volunteers. They, however, do not speak for the 'developers', nor do they have much (if any) influence over the direction or tasks the developers work on. Developers get sensitive when you point out these things." I thought admins and devs were working TOGETHER to make the project better :-P Quote: "They were not 'pleased' with my candid recommendations." If you are interested, a volunteer on Tn-Grid (Italian genetic map project) is working on optimization, with interesting preliminary results: optimize Quote: "It turned out that on my SandyBridge CPUs the SSE version was the fastest one (I suspect that unaligned loads kill the performance of the AVX version). It needs about 1 hour per WU (the original version needed about 2.5 hours)." |
Joined: 5 Dec 10 Posts: 81 Credit: 2,657,273 RAC: 0 |
Vote with your feet... Core a7 looks very promising and they have a forum that is actually alive. |
Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
Quote: "Vote with your feet..." I have done Folding for many years, but almost exclusively on the GPUs. Their official party line is that anything they can do on the CPUs can also be done on the GPUs. So why bother with a7 on a CPU when you can do Core 21 on a GPU and be an order of magnitude more efficient? Of course, if your GPUs are committed to other projects, then a7 is perfectly reasonable, though you will be getting a lot more a4s for some time, and you can't select. So I like to reserve my CPU power for the projects that have no alternative. Maybe Rosetta should have a more efficient alternative; it is annoying to think that they are overlooking the easy improvements, but they may have reasons. |
sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
GPUs these days have thousands of vector stream processors. If all those vector stream processors could be activated to do supercomputing-style vector compute as a BOINC group, R@h alone could easily reach 10s or 100s of *petaflops*, easily outgunning the fastest supercomputers in the world. But that is premised on the notion that everyone is running top-end GPUs like the recent Nvidia GTX 1070. |
rjs5 Joined: 22 Nov 10 Posts: 274 Credit: 23,730,845 RAC: 0 |
I can tie my shoes one at a time (Rosetta today). I can have one person help me and we can tie both in parallel (Rosetta with the extra 4th vector dimension). If I have the help of "thousands" of people to tie my shoes, they can still only tie 2 in parallel with "thousands" idling. A GPU is worthless for Rosetta work UNTIL the developers invest a TON of time to REDESIGN the entire software design (if even possible). With the number of machines currently crunching their work, they have near zero incentive to burn the man-years of effort. |
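The shoe-tying argument can be put into numbers with a tiny model: when the work splits into only a fixed number of independent pieces, helpers beyond that number sit idle. The function names below are hypothetical, and the model assumes equal-sized, fully independent pieces, which is a simplification of any real workload.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Rounds needed when `tasks` independent, equal-sized pieces of work are
// shared among `workers`. Workers beyond `tasks` cannot contribute.
inline std::int64_t rounds_needed(std::int64_t tasks, std::int64_t workers) {
    assert(tasks > 0 && workers > 0);
    const std::int64_t busy = std::min(tasks, workers);
    return (tasks + busy - 1) / busy;  // ceiling division
}

// Number of workers left with nothing to do.
inline std::int64_t idle_workers(std::int64_t tasks, std::int64_t workers) {
    assert(tasks > 0 && workers > 0);
    return workers - std::min(tasks, workers);
}
```

For two shoes, one person needs 2 rounds and two people need 1 round; a thousand people still need 1 round, with 998 of them idle, which is the situation the post describes for a GPU running code that exposes only a handful of parallel lanes.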
Joined: 1 Dec 05 Posts: 2124 Credit: 12,426,657 RAC: 2,579 |
Quote: "A GPU is worthless for Rosetta work UNTIL the developers invest a TON of time to REDESIGN the entire software design (if even possible). With the number of machines currently crunching their work, they have near zero incentive to burn the man-years of effort." Indeed, this is the thread for CPU optimizations :-P And your words "skeptical", "hostile" and not "pleased" with my candid recommendations (referring to the devs) are not so encouraging :-( |
sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
+1 agreed. Truth be told, even CERN, the *physics* people whose high energy physics computations are associated with *vector supercomputers*, declared that *vector parallel computation* fits only very few real-world scenarios. For much of the rest, all that extremely parallel vector supercomputing horsepower, be it 100s of petaflops, is useless: those problems can only be solved sequentially, where the next iteration depends on the prior one, so probably only 1 out of the millions of vector cores is used. https://indico.cern.ch/event/327306/contributions/760669/attachments/635800/875267/HaswellConundrum.pdf Quote: "On modern architecture, extrapolation based on synthetic" Yup, only a teeny tiny weeny few problems out of the whole universe of problems can be simply expressed as a large set of linear equations. All the rest never fit that pattern. |
Joined: 1 Dec 05 Posts: 2124 Credit: 12,426,657 RAC: 2,579 |
Happy new year to all of you!!! |
Joined: 1 Dec 05 Posts: 2124 Credit: 12,426,657 RAC: 2,579 |
Quote: "+1 agreed. Truth be told, even CERN, the *physics* people whose high energy physics computations are associated with *vector supercomputers*, declared that *vector parallel computation* fits only very few real-world scenarios. For much of the rest, all that extremely parallel vector supercomputing horsepower, be it 100s of petaflops, is useless: those problems can only be solved sequentially, where the next iteration depends on the prior one, so probably only 1 out of the millions of vector cores is used." Yeah, but at CERN, the *physics* people are not closed to new possibilities, like OpenCL: https://www.hpcwire.com/2017/04/14/xeon-fpga-processor-tested-at-cern/ |
Joined: 1 Dec 05 Posts: 2124 Credit: 12,426,657 RAC: 2,579 |
TJ, a Rosetta dev, wrote in another thread: "I am a Rosetta developer who looked at the issue rjs5 pointed out, which was using ICC rather than gcc. I also found a large speed improvement. As such we started transitioning over to using icc in compiling." Just considering the new AOCC: "The AOCC compiler system is a high performance, production quality code generation tool. The AOCC environment provides the developer the essential choices when building and optimizing C, C++, and Fortran applications targeting 32-bit and 64-bit Linux® platforms. The AOCC compiler system offers a high level of advanced optimizations, multi-threading, and processor support that includes global optimization, vectorization, interprocedural analyses, loop transformations, and code generation. Also highly optimized libraries, which extracts the optimal performance from each x86 processor core, are used. The AOCC Compiler Suite simplifies and accelerates development and tuning for x86, AMD64 (AMD® x86-64 Architecture), and Intel64 (Intel® x86-64 Architecture) applications" |
Joined: 1 Dec 05 Posts: 2124 Credit: 12,426,657 RAC: 2,579 |
Quote: "A lot has changed including increased use of C++11 from the commons developers, some of which is not yet supported by Visual C++ and has to be ported, also new dependencies, and a migration away from Boost (related to increased C++11 use). Also, the 'rosetta scripts' protocols will not be backwards compatible due to new XML format rules so the next app version will be a new app named 'rosetta' which is appropriate :)" These changes will help the SSEx/AVX development, I hope.... |
rjs5 Joined: 22 Nov 10 Posts: 274 Credit: 23,730,845 RAC: 0 |
Quote: "A lot has changed including increased use of C++11 from the commons developers, some of which is not yet supported by Visual C++ and has to be ported, also new dependencies, and a migration away from Boost (related to increased C++11 use). Also, the 'rosetta scripts' protocols will not be backwards compatible due to new XML format rules so the next app version will be a new app named 'rosetta' which is appropriate :)" These changes will make no difference to vector computing. The only possibility is .... since they have the chest open on the source code for major surgery, ... they could possibly make the simple but widespread changes to add vector capability. They need to change their primary TYPEDEF statements to pad to a 2^n size, or no compiler will do anything other than sequential, SCALAR crunching. |
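The padding idea can be sketched with two coordinate types. These are hypothetical stand-ins, not Rosetta's actual typedefs, and `float` is used only to keep the sizes small (the same reasoning applies to `double` with 32-byte padding). A 3-component type occupies 12 bytes, which does not match a vector register, while a padded and aligned 4-component type occupies exactly 16 bytes, so a compiler at least has the opportunity to handle all lanes with one SSE instruction. Whether it actually does depends on the compiler and flags.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical coordinate types; NOT Rosetta's real typedefs.
struct Xyz3 { float x, y, z; };                  // 12 bytes, not a power of two
struct alignas(16) Xyz4 { float x, y, z, w; };   // 16 bytes, w is padding

constexpr bool is_power_of_two(std::size_t n) {
    return n != 0 && (n & (n - 1)) == 0;
}

static_assert(sizeof(Xyz3) == 12, "three packed floats");
static_assert(!is_power_of_two(sizeof(Xyz3)), "3-component size is not 2^n");
static_assert(sizeof(Xyz4) == 16 && alignof(Xyz4) == 16,
              "padded type fills one 128-bit register");
static_assert(is_power_of_two(sizeof(Xyz4)), "padded size is 2^n");

// Element-wise add over four lanes; the padding lane is carried along.
inline Xyz4 add(const Xyz4& a, const Xyz4& b) {
    return Xyz4{a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w};
}

// Usage example: adding two padded coordinates.
inline void example() {
    const Xyz4 r = add(Xyz4{1.0f, 2.0f, 3.0f, 0.0f}, Xyz4{4.0f, 5.0f, 6.0f, 0.0f});
    assert(r.x == 5.0f && r.y == 7.0f && r.z == 9.0f && r.w == 0.0f);
}
```

The cost of the padding is 33% more memory per coordinate, which is one reason the change has to be weighed against cache footprint rather than applied blindly.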
Joined: 1 Dec 05 Posts: 2124 Credit: 12,426,657 RAC: 2,579 |
Quote: "These changes will make no difference to vector computing." In Italy we say "campa cavallo" (something like "don't hold your breath" / "That'll be the day!") |
rjs5 Joined: 22 Nov 10 Posts: 274 Credit: 23,730,845 RAC: 0 |
Quote: "These changes will make no difference to vector computing." When the Project managers realize that "doubling the performance of the software" is the same thing as "reducing the project operating costs by ~50%", they will begin pressuring for the changes. The changes are simple, but will require careful edits sprinkled across the source code. There are likely very few developers with enough knowledge of the code to make the changes safely. I suspect that those capable developers are not interested in that kind of work. The developers currently appear (to me) to put very low priority on performance changes. |
Joined: 1 Dec 05 Posts: 2124 Credit: 12,426,657 RAC: 2,579 |
Hi everybody, AVX-512 seems not so good: Avx512 |
©2025 University of Washington
https://www.bakerlab.org