Problem:
I’ve been doing some computations at home and at work and have noticed some unexpected performance behaviour. The work machine is quite a bit more “serious” than the home machine, but sometimes the home machine outperforms it. I’m curious why this happens and whether there is anything I can tweak.
Ultimately, on both machines the computations are just a lot of very large arbitrary-precision integer linear algebra (based on the GNU multi-precision library, GMP): reducing many sparse but “large” integer matrices, finding the vertices on the boundary of high-dimensional polyhedra, and so on.
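(To give a sense of the arithmetic involved, here is a rough sketch of a single fraction-free row-elimination step on GMP integers. This is only an illustration of the kind of operation these reductions spend their time on, not the actual code; eliminate_row is a made-up name.)

/* Illustration only: one fraction-free elimination step on GMP integers,
   row_j <- pivot*row_j - factor*row_i, which keeps everything in exact
   integer arithmetic. */
#include <gmp.h>
#include <stddef.h>

static void eliminate_row(mpz_t *row_i, mpz_t *row_j, size_t n,
                          const mpz_t pivot,    /* a[i][i] */
                          const mpz_t factor)   /* a[j][i] */
{
    mpz_t t1, t2;
    mpz_init(t1);
    mpz_init(t2);
    for (size_t k = 0; k < n; ++k) {
        mpz_mul(t1, pivot, row_j[k]);    /* t1 = pivot * a[j][k] */
        mpz_mul(t2, factor, row_i[k]);   /* t2 = factor * a[i][k] */
        mpz_sub(row_j[k], t1, t2);       /* a[j][k] = t1 - t2 */
    }
    mpz_clear(t1);
    mpz_clear(t2);
}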
On my home computer (which has two cores), if I run a standard computation on only one core (with the second core near-idle), it takes 282s.
On the home computer, running two identical copies of that same computation in parallel (one per core), each takes approximately 320s.
On my office computer, with all cores essentially idle except for one core running this computation, it takes 196s.
On my office computer, if all 8 cores are running full-out and one of them is running the computation above, that core takes 356s.
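(For what it’s worth, these single-core numbers are easiest to compare when the process is pinned to one core while it is timed. A minimal sketch of doing that, assuming Linux/glibc, with run_computation standing in for the real job:)

/* Sketch: pin this process to core 0, then time the job with a monotonic
   clock.  run_computation() is a placeholder for the real computation. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <time.h>

static void run_computation(void) { /* the real job goes here */ }

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                               /* core 0 only */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    run_computation();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("elapsed: %.1f s\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    return 0;
}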
Here are the details on my home computer:
cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 Duo CPU T7700 @ 2.40GHz
stepping : 10
cpu MHz : 800.000
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm ida tpr_shadow vnmi flexpriority
bogomips : 4787.65
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 Duo CPU T7700 @ 2.40GHz
stepping : 10
cpu MHz : 800.000
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm ida tpr_shadow vnmi flexpriority
bogomips : 4787.98
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
and my office computer:
cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Core(TM) i7 CPU 950 @ 3.07GHz
stepping : 5
cpu MHz : 1600.000
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida tpr_shadow vnmi flexpriority ept vpid
bogomips : 6147.45
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Core(TM) i7 CPU 950 @ 3.07GHz
stepping : 5
cpu MHz : 1600.000
cache size : 8192 KB
...
Solution:
Gridengine doesn’t do what you want here. It is meant for distributing work across processors that don’t share a memory bus, whereas the cores in your SMP machines do share one.
You have the right idea in trying to keep your task’s memory in cache. You do that by keeping the memory accesses as close to each other as possible, for both instructions and data; don’t jump around from one region to another.
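As a simplified illustration of “keep accesses close together”: both functions below sum the same n integers, but the strided version touches a new cache line on almost every access once the array no longer fits in cache.

/* Both functions sum the same n values; only the order of the accesses
   differs.  The strided ordering defeats the cache on large arrays. */
#include <stddef.h>

long sum_sequential(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; ++i)          /* consecutive addresses */
        s += a[i];
    return s;
}

long sum_strided(const long *a, size_t n, size_t stride)
{
    long s = 0;
    for (size_t j = 0; j < stride; ++j)             /* same elements,     */
        for (size_t i = j; i < n; i += stride)      /* scattered ordering */
            s += a[i];
    return s;
}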
SMP systems are really optimized for throughput rather than latency. If latency is that important to you, you are best off dividing the work into one task per processor.
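A minimal sketch of “one task per processor”, assuming Linux/glibc and that the problem can be partitioned up front: fork one worker per core and pin each worker to its own core, so each task’s working set stays in that core’s cache. Here work_on_partition is a placeholder for whatever slice of the problem each worker handles.

/* Sketch: one pinned worker process per core. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static void work_on_partition(int part) { (void)part; /* real work here */ }

int main(void)
{
    int ncores = (int)sysconf(_SC_NPROCESSORS_ONLN);

    for (int c = 0; c < ncores; ++c) {
        if (fork() == 0) {                  /* child: pin to core c, run its slice */
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(c, &set);
            if (sched_setaffinity(0, sizeof(set), &set) != 0)
                perror("sched_setaffinity");
            work_on_partition(c);
            _exit(0);
        }
    }
    for (int c = 0; c < ncores; ++c)
        wait(NULL);                         /* reap all workers */
    return 0;
}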