
Re: Happen to have a Cray XC-40? Q3Map2 will work on it.

Posted: Tue Jun 30, 2015 8:52 pm
by Captain
Pext wrote:Can you use the computational power to turn Radiant into a REPL like experience?
Can you use it to map the circumference of Gwamps' head or would that cause the entire universe to implode?

Re: Happen to have a Cray XC-40? Q3Map2 will work on it.

Posted: Wed Jul 01, 2015 12:43 am
by VolumetricSteve
syp0s wrote:Steve, how come the Cray isn't powerful enough to just smash the times regardless of how unoptimised the code is for its architecture (if that's the right word)?

I'd have thought you just have enough power to crush anything made for commercial systems? - is that not the case?
There's a lot to this discussion, but the easiest answer is no. I had also hoped for the same, but while the numbers are being crushed with very little work on my end, there are many reasons why there isn't more...crushing by default.

In a way, there is some pretty solid ass kicking going on, even without using the fancy-pants Intel compiler or the Cray compiler. Totally unoptimized, untested, foreign code running better (even if it's just slightly) on a system this weird is kind of a miracle. I haven't had to edit a single line of source code yet - ALL I've been doing up to this point has been murdering compile options and messing with my build environment. In a way, this kinda reflects the raw power/capability of the system.

Unfortunately, there are so many layers to this that there's a lot of room for things to go untailored and unoptimized. A 'for instance' would be the use of a dynamically linked executable vs a statically linked one. Let's say you're building a program that uses a lot of libraries...and let's say it calls those libraries constantly (unlikely, but just for the sake of argument). On a cluster like this, those shared libraries usually live on a shared filesystem, so a dynamically linked executable has to reach across the machine to load them, and even once they're resolved, every call still goes through an extra layer of indirection. Those cycles add up the more it's done. Alternatively, you have statically linked executables where all of those libraries live inside the executable itself, so you have a big fat executable, but it won't have to go far to find the functions it's looking for. Weird things like this can drastically impact performance and they need to be tailored, at build time, for the system you're running on.

A better example would be the differences between AMD and Intel cpus. While both are X86, they do things very differently, particularly when it comes to cache. The Cray I'm working on uses Intel Haswell cpus, which use what's called an inclusive cache: whatever sits in L1 and L2 is also kept in the big L3, so a miss in the inner levels can be resolved from L3 quickly, and the chip only has to snoop L3 to keep cores coherent. The price is duplication - you get less effective capacity out of the same total amount of cache. AMD cpus of that era use an exclusive cache, where each level holds different data. You get more effective capacity out of the same silicon, but if you need something that's only sitting in an outer level, you lose cycles swapping lines between levels.

If your code is built to make amazing use of exclusive caches, you'll likely see a weird performance curve on an Intel machine, likewise if you've optimized your code for an Intel cpu...there was a time when it flat out wouldn't work on an AMD chip, though I think those days are long since past...but you get the point.

Disclaimer, this is coming from memory and I may be full of crap:
For a good while there, from about 1990 (historians, correct me if I'm wrong) to whenever MMX came out, the key performance factor we saw on commodity cpus was frequency. I believe Socket 7 could range from 90MHz to 233MHz - more if you played your cards right. As I recall, there wasn't a ton going on with processor extensions at that time, so your performance was determined by core architecture and frequency. That's roughly a 2.5X difference, so for us today, that'd be like having a 4GHz cpu upgrade to a 10GHz cpu. Assuming your cores could handle that frequency well, you'd see an insane performance jump like we saw back in the day going from 90 to 233. I don't recall a lot of code changes being made to keep up with new systems...old things just got faster and faster. Today, there's a lot...A LOT more going on with multi-core, AVX, floating point and tons of other stuff CPUs get saddled with now, and if your code isn't talking to those features, you won't see the performance you expect.

Re: Happen to have a Cray XC-40? Q3Map2 will work on it.

Posted: Wed Jul 01, 2015 2:35 am
by VolumetricSteve
ToxicBug wrote:[spoiler]Image[/spoiler]

:paranoid:
I only just saw this now for some reason. This is happening because *drum roll* you have a faster cpu! You have a 3GHz Haswell and ours are 2.3GHz (though we happen to have over 6000 of them). Ours are also Xeons, so there are a few other differences going on, but primarily your test results continue to show what I've been suspecting for some time...Q3map2 loves cache more than any other computational resource...and you've got 20MB of it. I'm thinking we have more cores and less cache; I'll have to check tomorrow. Interesting results!

Re: Happen to have a Cray XC-40? Q3Map2 will work on it.

Posted: Wed Jul 01, 2015 5:29 am
by Eraser
An easy example is AMD vs Intel's approach to cores. AMD has just slapped more cores on their CPUs while Intel aimed to make individual cores faster.

So if your software is purely single threaded, you can't leverage the power of multiple cores and you're at a disadvantage with a CPU that has more but slower cores.

Re: Happen to have a Cray XC-40? Q3Map2 will work on it.

Posted: Wed Jul 01, 2015 5:58 am
by ToxicBug
VolumetricSteve wrote:
ToxicBug wrote:[spoiler]Image[/spoiler]

:paranoid:
I only just saw this now for some reason. This is happening because *drum roll* you have a faster cpu! You have a 3GHz Haswell and ours are 2.3GHz (though we happen to have over 6000 of them). Ours are also Xeons, so there are a few other differences going on, but primarily your test results continue to show what I've been suspecting for some time...Q3map2 loves cache more than any other computational resource...and you've got 20MB of it. I'm thinking we have more cores and less cache; I'll have to check tomorrow. Interesting results!
It's overclocked to 4.4GHz BTW.

Re: Happen to have a Cray XC-40? Q3Map2 will work on it.

Posted: Wed Jul 01, 2015 12:25 pm
by VolumetricSteve
@Eraser
I strongly agree.

@ToxicBug
I wonder how much power it's pulling at 4.4GHz. Would you be able to do another benchmark at stock clocks?

Re: Happen to have a Cray XC-40? Q3Map2 will work on it.

Posted: Wed Jul 01, 2015 9:50 pm
by losCHUNK
Eraser wrote:An easy example is AMD vs Intel's approach to cores. AMD has just slapped more cores on their CPUs while Intel aimed to make individual cores faster.

So if your software is purely single threaded, you can't leverage the power of multiple cores and you're at a disadvantage with a CPU that has more but slower cores.
Yah, but does this explain why the i7 is mopping the floor when compared to an i5? It seems that the i7's extra cores with HT, or its beastly amount of cache, are giving it a major advantage.

Re: Happen to have a Cray XC-40? Q3Map2 will work on it.

Posted: Wed Jul 01, 2015 10:44 pm
by ToxicBug
VolumetricSteve wrote:@ToxicBug
I wonder how much power it's pulling at 4.4GHz. Would you be able to do another benchmark at stock clocks?
3000MHz core (3500MHz turbo)
3000MHz uncore
2133MHz RAM

Image

4400MHz core
4100MHz uncore
2133MHz RAM

Image

4400MHz core
4100MHz uncore
2666MHz RAM

Image

Re: Happen to have a Cray XC-40? Q3Map2 will work on it.

Posted: Thu Jul 02, 2015 12:54 am
by VolumetricSteve
Well that's interesting...

It continues to amaze me how little impact insane clock speeds seem to have on this code.

It's a shame you're on windows or I could send you a newly compiled intel binary.

At work, as far as I know, I have the Intel C++ compiler (v14 and v15) some version of GCC and the official cray compiler
It's also a crap-shoot at the office as to what kinds of performance I'll get since other folks might be running stuff on my compute nodes. I'm seriously considering asking for remote access so I can continue my work at home but I'm weighing that against how completely insane I want to appear to my employers. It's probably worth it. :)

Today, I recompiled everything with the full litany of optimization flags Intel had to offer only to find my compile time shot up to 25 seconds...then I discovered some users were running jobs on that system which messed up my results. I'll have to try again tomorrow.

Alternatively, if anyone wants to front me the 10 grand I'll need for a dual Haswell Xeon system at home, I'd be fine with that.

Re: Happen to have a Cray XC-40? Q3Map2 will work on it.

Posted: Thu Jul 02, 2015 8:17 am
by Eraser
losCHUNK wrote:
Eraser wrote:An easy example is AMD vs Intel's approach to cores. AMD has just slapped more cores on their CPUs while Intel aimed to make individual cores faster.

So if your software is purely single threaded, you can't leverage the power of multiple cores and you're at a disadvantage with a CPU that has more but slower cores.
Yah, but does this explain why the i7 is mopping the floor when compared to an i5? It seems that the i7's extra cores with HT, or its beastly amount of cache, are giving it a major advantage.
Mopping the floor when doing what? You might not see the expected improvements when running single threaded software tests.
Also, like you said, the difference between an i5 and i7 is more than just added cores.

It's similar to how GPU's are much more than simply faster versions of their predecessors. Added clockspeed alone is not what makes our modern day games run at sixty frames per second and still look absolutely amazing. The introduction of hardware processing for specific tasks is what makes things go. Similar to how hardware T&L back in the GeForce 256 days made a huge difference in games that supported it, just as well as vertex and pixel shader pipelines gave a huge boost when it was introduced in DirectX 8. Those are all major improvements that didn't depend on significantly higher clock speeds.

Re: Happen to have a Cray XC-40? Q3Map2 will work on it.

Posted: Thu Jul 02, 2015 8:37 am
by losCHUNK
Mopping the floor when compared to an i5; there's like a 6/7 second bonus. What's in the i7 that increases the instructions per cycle?

Single threaded performance between the i5 and i7 is practically the same. Even when the clock speed was dropped well below the equivalent i5 it maintained the lead.

So - if the program was multi-threaded I would expect the AMD to do better, and if not, I was expecting the i7 to be pretty close to the i5. Is there something else other than cache that could explain this? Could the i7 be utilising more than one thread thanks to Windows or Intel's process management? 'Cos it's a pretty big gain.

Re: Happen to have a Cray XC-40? Q3Map2 will work on it.

Posted: Sat Jul 04, 2015 6:37 am
by Captain
QUAKE3 MAP BENCHMARK 1.3 - RESULTS
==================================
OS = Win 6.1
CPU = Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
RAM = 4095 MByte
==================================
Map Compile = 00:02
Vis = 00:02
Bspc = 00:07
Lightning = 00:11
Total = 00:24

Actual clock speed is 4.5GHz, also 16GB system memory.

Re: Happen to have a Cray XC-40? Q3Map2 will work on it.

Posted: Sat Jul 04, 2015 8:21 am
by Pext
VolumetricSteve wrote:Well that's interesting...

It continues to amaze me how little impact insane clock speeds seem to have on this code.
Maybe there's a part of the code that is more or less independent of cpu speed, like loading files generated by a previous compilation stage. For example, if there are 20 iterations of the lighting algorithm and loading the previous stage takes 300ms, this would already amount to 6 seconds.

Re: Happen to have a Cray XC-40? Q3Map2 will work on it.

Posted: Mon Jul 06, 2015 1:20 pm
by VolumetricSteve
@Pext

That's exactly what happens, but there's something else weird going on where Captain Mazda's i7-4790 at 4GHz gets nearly identical performance in q3map2 to my 2.3GHz chip, they're both haswells, the only real difference is cache size. I'm not sure what he's got, but I'm pretty sure it's not 40MB of L3.

Maybe this shows some weird trade off between clock speed and cache. I'm going to establish a stronger baseline with some reference maps.



I've built a small map to test reliability of the compiler to make sure it's not just arbitrarily leaking on random maps and to make sure VIS is working correctly.
It compiles here in 1 second....so that'll need to be made more complex.
I think I've found a bug in q3map2 2.5.17n, but I'm not sure if it impacts both the gtkradiant and netradiant branches or just the netradiant one. I'm doing more testing before I send it up to the devs...hopefully today.

Re: Happen to have a Cray XC-40? Q3Map2 will work on it.

Posted: Tue Jul 07, 2015 10:16 am
by Ryoki
You're doing good work Steve, applying high-powered science stuff to quake like that. I have no idea what you're on about but applaud you all the same.

Re: Happen to have a Cray XC-40? Q3Map2 will work on it.

Posted: Tue Jul 07, 2015 12:35 pm
by VolumetricSteve
Ryoki wrote:You're doing good work Steve, applying highpowered science stuff to quake like that. I have no idea what you're on about but applaud you all the same.
Thanks, I try. I think it's practically tradition. When my office first switched to IPv6, they used some updated version of quake 1 to make sure connectivity was solid.


At the moment, I'm trying to build something for q3map2 that would be like what the Acid tests are for web browsers.

I got into Radiant last night and built a huge map meant to stress the VIS algorithm. The map data comes to about 6MB and once vis is done, the resultant bsp is nearly 12MB. It's gotta be pushing some maximums because the light process won't even attempt to run.

Re: Happen to have a Cray XC-40? Q3Map2 will work on it.

Posted: Wed Jul 08, 2015 5:27 pm
by duffman91
I'm going to put my computer science hat on for a second here and go over some high-level concepts. I haven't dealt with a Cray in particular, but I've worked with other supercomputers. You can't directly compare a commercial home-user application's performance on a PC to its performance on a supercomputer.

Let's stick to gigaflops and IOPS as performance baselines, and let's take a look at how this applies to applications.

gigaflops - billions of floating point operations per second - measures CPU compute power
iops - input/output operations per second - measures storage read/write speed

gigaflops:
Home PC - Single processor with multiple cores. 1 ALU/FPU per core (average x86 chip).
Supercomputer - Multiple processors, multiple cores. 1 ALU/FPU per core (average x86 chip)

iops:
Home PC - single drive, maybe a RAID 0 with two drives - low IOPS
Supercomputer - some form of RAIN is usually implemented (Redundant Array of Independent Nodes); each node is some obnoxious RAID configuration with a storage controller - high IOPS

From the high level definitions above, a supercomputer is always the more powerful hardware. Why would Q3MAP2 run the same on my hella sweet PC? LOL?!

Gigaflops:
To start, the software running on your home PC was not written for parallel computing (multiple processors crunching at the same time). It was written for a single CPU, though it may work well with multiple cores.

IOPS:
As a compiler, Q3Map2 shouldn’t be doing a lot of IOPS at all. Instead it’s likely using virtual memory and CPU cache. Reading 1 file and writing 1 file to disk takes the same amount of time regardless of storage infrastructure if we assume the heads are in the right spot already. Let’s keep net seek times out of this talk.

So what are you finding? It’s not the case that your overclocked PC at home is as powerful as a Cray. Instead, the software you’re running is not designed to leverage the majority of compute resources available in a Cray. Therefore, the performance is degraded to that of a home PC.

Does this make sense?

If you really want to compare compute power, take a look at analytics software. It's built to handle tons of IOPS to and from disk across multiple processors.

Edit:
Since you're only using 1 node of the super computer, and you have 32 cores with an Intel chip, the question is really how many threads is q3map2 using. If the threads aren't consuming all the cores, then results comparable to a PC are expected.

Directing this to people asking why the Cray isn't smashing a PC's time.

Re: Happen to have a Cray XC-40? Q3Map2 will work on it.

Posted: Wed Jul 08, 2015 6:10 pm
by VolumetricSteve
duffman91 wrote: IOPS:
As a compiler, Q3Map2 shouldn’t be doing a lot of IOPS at all. Instead it’s likely using virtual memory and CPU cache. Reading 1 file and writing 1 file to disk takes the same amount of time regardless of storage infrastructure if we assume the heads are in the right spot already. Let’s keep net seek times out of this talk.

So what are you finding? It’s not the case that your overclocked PC at home is as powerful as a Cray. Instead, the software you’re running is not designed to leverage the majority of compute resources available in a Cray. Therefore, the performance is degraded to that of a home PC.

Does this make sense?

If you really want to compare compute power, take a look at analytics software. It's built to handle tons of IOPS to and from disk across multiple processors.

Edit:
Since you're only using 1 node of the super computer, and you have 32 cores with an Intel chip, the question is really how many threads is q3map2 using. If the threads aren't consuming all the cores, then results comparable to a PC are expected.

Directing this to people asking why the Cray isn't smashing a PC's time.

Agreed on all fronts. However, there's no real goal to assess compute performance here. My only concern is q3map2, and this is more for a laugh than any practical application beyond experimenting with code. Each node of the XC-40 is two Intel E5-2698 cpus and 128GB of ram tied into an Aries chip. I think they have hyperthreading disabled, so I execute with -threads 32, but I can just as easily try 64. There are small variances in performance from day to day because of the userbase, but I have no reason to believe that using 64 would make a huge difference...or at least it hasn't so far.

You're also correct that IOPS don't really factor in here...except in the small hiccup that our filesystem *sucks* and performance is an issue for almost all of our researchers. To circumvent that, I want to try to run everything out of a ramdisk but that'll come later, and it's pretty much a non-issue anyway for q3map2.

So based on this, the crux of what interests me is that it looks like a 4.5GHz Haswell with 10 or so MB of cache does roughly as well as two Xeons with 40MB each. Is there a software optimization issue? Absolutely. It's just interesting that it manifests itself in this way. Getting down to brass tacks, this is pitting x86 cores against x86 cores; the secret hardware sauce from Cray is all network related, which I'm not touching. You might be surprised to see how little engineering Cray actually does these days when they release a system. In the case of the XC-40, they built the Aries network interface and assemble the boards and hardware...but the layer at which the cpu functions and talks to its neighboring cpu via QPI is 100% Intel's doing, and virtually identical technology exists in the consumer market outside of this system.