Original Link: https://www.anandtech.com/show/1032
Xeon 2.8GHz DP & 2.0GHz MP - Part I: Taking over the Enterprise
by Anand Lal Shimpi on November 18, 2002 9:04 AM EST
Posted in: IT Computing
When Intel first released their Xeon branded processor in May of 2001 we were impressed with its performance, but the crown was quickly taken away by AMD with their Athlon MP that followed one month later.
The original Xeon offered higher performance than its predecessor, the Pentium III Xeon, mostly because of its high-bandwidth FSB and memory bus; unfortunately, the Xeon suffered from the same drawbacks as the original Pentium 4. Think back to our review of the first Pentium 4: the performance of the NetBurst-based CPU was nothing to cheer about - in many cases it was no faster than the Pentium III. The Xeon launched at 1.7GHz, faster than the first Pentium 4, but the architecture was identical to its desktop counterpart. Fortunately, because of the nature of enterprise applications, the Xeon's performance was made respectable thanks to the aforementioned bandwidth advantages over the Pentium III.
The Athlon MP, on the other hand, came out of the gates strong, initially offering a huge performance advantage over Intel's Xeon at considerably lower cost. The benefits of higher-IPC execution, larger on-die caches, and a point-to-point bus protocol gave AMD a clear advantage over Intel in entry-level dual processor systems. However, just as was the case in the desktop world, over time Intel gained momentum and made the Xeon far more competitive.
Things have changed considerably since the release of the original Xeon last year; for starters, the Xeon DP (Dual Processor version) now ships with twice as much L2 cache, courtesy of Intel's 0.13-micron manufacturing process. Beyond the 512KB L2 cache, the 0.13-micron process also gave Intel the ability to ramp clock speeds as high as 2.80GHz. The Xeon still trails the desktop Pentium 4 in clock speed, but that's mainly because of the more rigorous validation that all enterprise-class products must go through. The higher clock speeds and larger L2 cache worked together with Hyper-Threading to close the performance gap between the Xeon and the Athlon MP.
Alongside the Xeon DP, Intel also introduced the Xeon MP - capable of 4-way operation. The Xeon MP was introduced on a more mature 0.18-micron process but featured up to a 2MB on-die L3 cache in addition to the 256KB on-die L2 cache. The massive on-die caches approximately doubled the transistor count of the Xeon and thus reduced yields on the CPUs. In order to compensate, Intel drastically reduced the clock speeds of these CPUs - introduced at 1.6GHz while the Xeon DP approached 2.8GHz.
Earlier this month Intel transitioned their Xeon MP processors to the now mature 0.13-micron process, outfitting them with a 512KB L2 cache and a 2MB on-die L3 cache. The 108 million transistor processors are now available at speeds up to 2.0GHz, still lower than the Xeon DP parts, but as you're well aware - clock speed isn't everything.
Today Intel is continuing to turn up the heat with new chipsets and new CPUs for the enterprise and workstation markets. The Xeon DP finally gets a 533MHz FSB, and while the E7205 (Granite Bay) offers dual channel DDR for Pentium 4s, Intel brings the E7505 (Placer) to the Xeon for some dual processor, dual channel DDR action.
Because of the sheer number of products Intel is introducing and the multiple target markets we're splitting our coverage of the Xeon processors into three parts. This first part will focus on performance in the enterprise market, more specifically on database server applications. The second part will be posted after Comdex and will focus on web server performance. Finally, part three will look at the workstation performance of these processors and platforms.
So without further ado let's dive right into the new CPUs...
Xeon DP - Finally with a 533MHz FSB
The Xeon DP family is extended today with the introduction of the first 533MHz FSB parts. The architecture of the Xeon DP is identical to the current 0.13-micron Northwood Pentium 4 CPUs, except that you can use them in dual processor systems. This means that these CPUs have the same 512KB on-die L2 cache as the Pentium 4, the same trace cache and now the same 4.2GB/s FSB interface.
As their name implies, you can only use these CPUs in single or dual processor configurations; and to make certain that you won't be able to use them in quad processor motherboards, Intel outfits the Xeon DP processors with an extra pin, giving the chips a 604-pin interface instead of the Xeon MP's 603-pin interface. The sole purpose of this extra pin is to guarantee that only Xeon MP processors are used in 4-way configurations, which is a shame since the Xeon DP processors ship at much higher clock speeds - but we'll get to that later.
Socket-604 (Xeon DP)
Xeon MP - 108 Million Transistors @ 2.0GHz
Just a couple of weeks ago Intel introduced the first 0.13-micron Xeon MP parts. As we mentioned in the introduction, the Xeon MP isn't special for its ability to run in quad processor configurations but rather for its massive on-die caches.
The new Xeon MP comes equipped with the same 512KB on-die L2 cache as its DP counterpart, but is also outfitted with a massive 2MB on-die L3 cache. The on-die L3 cache and its associated controller logic increase the Xeon's transistor count by no less than 53 million transistors. The end result is that the Xeon MP weighs in at a total of 108 million transistors - about twice the size of the Xeon DP.
Socket-603 (Xeon MP)
Getting a 55 million transistor processor to run at 2.80GHz isn't a big deal for Intel, but getting a 108 million transistor CPU to run that fast is considerably harder. As you all know, the more transistors you have on a CPU the larger your die size becomes; and the larger your die, the more defects will be present, thus driving the overall yield (ratio of working chips to total chips produced) down.
Luckily the majority of the transistors on the Xeon MP are used for cache, which is much less prone to showstopping defects thanks to the redundancy that's built into the cache (there's actually slightly more cache than necessary on the CPUs, just in case defects do occur in some parts of it). However, to make sure that they don't throw away more Xeon MPs than they sell, Intel limits the clock speeds of the processors. Remember that it is difficult to get good yields on higher clocked CPUs, so one way of balancing out the large die area of the Xeon MP is to keep clock speeds much lower than the Xeon DP's.
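To put some rough numbers on the die size and yield relationship, a commonly used first-order approximation is the Poisson yield model, where yield falls off exponentially with die area times defect density. The sketch below is purely illustrative - the die areas and the defect density are assumptions for the sake of the example, not Intel's actual figures:

```python
from math import exp

def poisson_yield(die_area_cm2, defects_per_cm2):
    """First-order Poisson yield model: Y = e^(-A * D)."""
    return exp(-die_area_cm2 * defects_per_cm2)

# Hypothetical numbers for illustration only - not Intel's actual data.
defect_density = 0.5    # defects per cm^2, assumed
xeon_dp_area   = 1.3    # cm^2, assumed Northwood-class die
xeon_mp_area   = 2.6    # cm^2, roughly doubled by the on-die L3

print(f"Xeon DP yield: {poisson_yield(xeon_dp_area, defect_density):.0%}")
print(f"Xeon MP yield: {poisson_yield(xeon_mp_area, defect_density):.0%}")
# Doubling the die area drops yield from ~52% to ~27% in this toy model,
# before accounting for the cache redundancy that recovers some dies.
```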
The result is that the fastest 0.13-micron Xeon MP ships at 2.0GHz, approximately 30% slower than the fastest Xeon DP. This raises an interesting question that many IT managers will face - what's faster: 2.8GHz processors with no L3 cache or 2.0GHz processors with 2MB L3 caches? We'll be answering that shortly, but now let's take a look at the new chipsets.
New Platforms - DDR Across the Board
A year ago, Intel's Xeon could only be paired with RDRAM on the 860 chipset; today, there isn't a single chipset being introduced with RDRAM support. This mirrors Intel's efforts on the desktop side, with Springdale on its way to replace the 850E as the high-end desktop platform of choice. As we've mentioned in the past, this doesn't mean Intel's abandoning RDRAM completely - but it does mean that the two are on a pretty big break. For now, RDRAM isn't necessary in the desktop, workstation or server segments; and the success or failure of DDR2 will determine whether it is in the near future.
From top to bottom, Intel has finally regained their lost title as the king of chipsets. Today Intel has chipsets for virtually every market segment, with the exception of 4-way and larger configurations, which are handled by ServerWorks.
On the entry-level workstation side Intel announced the official name for their Granite Bay chipset today - the E7205. We're covering the E7205 in a separate article today but it deserves a mention here because of its role as a workstation chipset. Despite Intel's attempts to make the 7205 a workstation board, most motherboard manufacturers will be producing high-end desktop solutions based on the chipset - designed to replace the 850E in their product lines.
The features of the E7205 are as follows:
- Single Processor Pentium 4 support
- 2 x 64-bit DDR266 memory channels (just like NVIDIA's nForce2)
- 533MHz FSB support
- AGP 8X support
- USB 2.0 support
What's worth noting is that there is no DDR333 support provided by the chipset. Just as we saw with the nForce2, keeping the FSB and memory bus in sync with one another results in faster overall performance. Also, dual channel DDR333 won't be necessary until Intel increases the FSB once again to 667MHz, which will require a higher bandwidth memory bus. Chances are that we will see dual channel DDR333 support with the forthcoming Springdale chipset, or potentially a second revision of Springdale.
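The bandwidth matching is easy to verify with some quick arithmetic; here's a short sketch of the numbers behind the synchronized-bus argument:

```python
# Back-of-envelope bandwidth math behind the FSB/memory-bus matching.
bytes_per_transfer = 8                        # 64-bit bus = 8 bytes

fsb_533 = 533e6 * bytes_per_transfer          # quad-pumped 133MHz FSB
ddr266_dual = 2 * 266e6 * bytes_per_transfer  # two 64-bit DDR266 channels

print(f"533MHz FSB:          {fsb_533 / 1e9:.2f} GB/s")     # 4.26 GB/s, the quoted '4.2GB/s'
print(f"Dual-channel DDR266: {ddr266_dual / 1e9:.2f} GB/s") # 4.26 GB/s - a perfect match
# Dual-channel DDR333 (~5.3GB/s) would outrun the FSB entirely, so it
# buys nothing until Intel raises the FSB clock again.
```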
Next we have the E7505 chipset, otherwise known as Placer. The role of the 7505 is basically to take the functionality provided by the E7205 and bring it to dual processor Xeon workstations. The specifications of the chipset are the same; the only difference is that these chipsets are validated for dual processor operation and support the 604-pin Xeon DP interface.
Moving onto the server side, Intel is updating their current chipset line with the Plumas-533, now known as the E7501 chipset. The original Plumas chipset (E7500) was the first dual channel DDR solution from Intel and as the name implies, the E7501 simply adds support for the 533MHz FSB and dual channel DDR266.
Just like the E7500, the 7501 does not have an AGP interface (why would you need AGP on a server?) and is strictly a Xeon DP chipset; Pentium 4 owners will have to stick to the E7205 (Granite Bay).
For 4-way configurations Intel once again turns to ServerWorks to provide the chipsets for the Xeon MP. The upfront investment required to develop a 4-way (or larger) chipset is incredible, running into tens of millions of dollars of development and validation costs. It should be noted that these costs are significantly reduced with a simpler MP architecture such as AMD's Opteron, but with the conventional x86 MP setup they remain just as high. Because of the extremely high barrier to entry, ServerWorks finds themselves without competition in this arena and thus services the entire 4-way Xeon server market.
ServerWorks' Xeon MP chipsets have been around for quite some time and thus there's nothing new to report on here. The Grand Champion HE is their top of the line solution offering support for up to 4 processors and up to 4 independent, load-balanced 64-bit memory controllers.
Since the Xeon MP processors still all connect to the North Bridge through a shared 3.2GB/s FSB link, having a 6.4GB/s memory bus might seem a bit useless. However, if you take into account that a great deal of these 4-way machines require more than 1GB/s of I/O bandwidth alone, it becomes clear where the additional memory bandwidth is useful.
Xeon MP Servers from Appro & Intel
We received a 4U Xeon MP server from Intel and a 5U server from Appro for this review.
The biggest advantages of the Intel server are ease of use and accessibility:
There are only 5 removable drive bays on the Intel machine as you can see above.
The six cooling fans are easily removed:
If a fan fails or is removed the remaining fans will spin up to compensate for the fan loss until it is repaired or replaced.
Memory is installed in a riser card:
You must install DIMMs in groups of four to match up with the 4 x 64-bit memory buses. There are a total of 12 DIMM slots on this riser card.
A big blue duct directs airflow over the 4 Xeon MP CPUs:
Removing the duct reveals the CPU module:
Removing the module is as simple as lifting two levers:
With the CPU module out this is what you see:
Note that the motherboard is covered with a piece of plastic to prevent loose screws or anything else from coming in contact with the board itself.
Here we have the CPU module removed with two new CPUs ready for installation:
Appro's 4-way Xeon MP server, built around Tyan's Thunder GC-HE S4520 motherboard, was a bit different from Intel's solution although the two performed identically:
The biggest advantage Appro has is the presence of 10 removable drive bays as opposed to Intel's 5:
The Appro server features three hot swappable power supplies, much like the Intel solution:
With the top off the Appro is very similar to the Intel server:
There is no protective cover on the motherboard and you must unscrew the memory riser card to get it out:
The Appro heatsinks are thankfully much easier to remove than the Intel heatsinks:
Four screws hold the heatsinks in place:
Thumbscrews hold the Appro case fans in place:
Performance: Asking the Right Questions
Now that you know all of the specs of the CPUs and the platforms, let's talk about performance. With so many different CPUs and platforms it is easy to lose track of what's best, and as always you're left with the feeling that the manufacturer will simply recommend the most expensive option.
Before we start asking performance questions, let's look at the situation we have with the Xeon DP and MP systems from the perspective of a database server application.
On the one hand we have the Xeon DP processor, which runs at very high clock speeds and has a fairly good sized on-die L2 cache (512KB). But does its 40% clock speed advantage outweigh the presence of a large L3 cache on the Xeon MP?
As large as a 2MB L3 cache is, you're not going to be able to fit any reasonable database into that small of a cache; we'd need a 2GB cache for that. Instead, the presence of a large L3 cache helps reduce memory latency significantly. Remember that caching improves overall performance not by attempting to store everything in cache, but by taking what's used most frequently and putting it in a location much closer to the CPU. Along with putting more frequently used data in cache, a larger cache allows the CPU to keep more of the data that's adjacent to frequently used data in cache as well. Based on the principle of locality, we know that frequently used data tends to be surrounded in memory by data that will also be used in the near future; a large on-die L3 cache lets us store more of this data on the CPU as well. Since database servers are very memory and I/O intensive, you would expect any reduction in overall data access latency to help performance tremendously, but again the question is whether that reduction in latency is enough to hide the clock speed penalty.
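One way to frame the clock speed versus cache trade-off is the classic average memory access time (AMAT) calculation. All of the hit rates and latencies below are assumptions chosen purely to illustrate the shape of the trade-off, not measured figures for these CPUs:

```python
# Illustrative average-memory-access-time comparison: high clock vs. big L3.
# Every hit rate and latency here is an assumption, not a measurement.

def amat_ns(levels, memory_ns):
    """levels: list of (hit rate for accesses reaching this level, latency in ns)."""
    total, reach = 0.0, 1.0
    for hit_rate, latency in levels:
        total += reach * hit_rate * latency   # accesses satisfied at this level
        reach *= (1.0 - hit_rate)             # fraction that misses and goes deeper
    return total + reach * memory_ns          # the rest go all the way to memory

# 2.8GHz Xeon DP: L1 + 512KB L2, then straight to main memory.
dp = amat_ns([(0.95, 1.0), (0.90, 6.5)], memory_ns=180.0)

# 2.0GHz Xeon MP: slower clock (longer cache latencies in ns), but a 2MB L3
# catches most of what misses the L2 before it has to go to memory.
mp = amat_ns([(0.95, 1.4), (0.90, 9.0), (0.80, 25.0)], memory_ns=180.0)

print(f"Xeon DP AMAT: {dp:.1f} ns")  # ~2.1 ns in this toy model
print(f"Xeon MP AMAT: {mp:.1f} ns")  # ~2.0 ns - the L3 roughly offsets the
                                     # clock deficit on memory-bound work
```

In this toy model the L3 cache erases the memory latency penalty of the lower clock, but the DP's 40% clock advantage still applies to everything that executes out of cache - which is exactly the tension the benchmarks need to resolve.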
Then there's the question of going beyond two processors; with the Xeon DP you're stuck with a maximum of two CPUs, but with the Xeon MP you can very easily expand to four. How much of a performance gain will you get from moving to four CPUs, and is it worth the added cost and additional licensing fees?
On top of all of these concerns we have yet another variable to consider: Hyper-Threading. Does the presence of Hyper-Threading diminish the need for more than two CPUs? And how is Hyper-Threading performance affected by the lower clock speed and larger cache of the Xeon MP? Remember that Hyper-Threading does increasingly well at higher clock speeds, and that it is most beneficial during periods of idle execution. The Xeon MP, however, runs at lower clock speeds than the Xeon DP, and its on-die L3 cache reduces the periods of idle execution - both working against Hyper-Threading.
These are the questions we hoped to answer when we started working on this review weeks ago, but as you're about to see, answering them was much more difficult than you'd expect.
Performance: Answering the Questions
When the Xeon and Athlon MP processors were first released, we turned to our own server environment to provide the test tools we needed to evaluate the platforms. We recorded a trace of 30 minutes of activity on the AnandTech Forums DB server and then used that as the best way of gauging the real world performance of these CPUs.
Since we originally ran that test we've added AnandTech Web and Ad DB tests to our suite of database server benchmarks, and we've updated the AnandTech Forums DB test. Our last trace recording of the Forums was made when the Forums DB was around 3GB in size; today the database is 10GB and growing.
When we received the Xeon platforms for testing we immediately fired up our in-house tests and went at it. The first thing we realized was that even our largest, most stressful DB test was not enough to even remotely stress a 4-way Xeon MP server. During our first run of benchmarks we couldn't get the 4-way Xeon MP with Hyper-Threading enabled to ever peak above 20 - 25% CPU utilization; we needed something more stressful than our server environment.
Around the same time that we were working on this article, we were also working on the server setup for AnandTech. Until very recently, we still had two older Pentium III and Pentium III Xeon based database servers running the AnandTech Web DB and the AnandTech Ad DB. Because of their relatively low CPU utilization we were able to consolidate both of those servers into a single modern-day server platform, in this case a dual Athlon MP 2200+. While running both DBs on a single server does increase our memory and I/O requirements, it saves on licensing costs and cuts down on rack space as well. To give you an idea of the amount to be saved, a single processor license of SQL Server Enterprise Edition retails for $19,995. The fewer CPUs you have and the fewer systems you have to license, the lower your overall server management costs will be.
This idea of running multiple DBs off of a single box inspired us to try the same with our DB server tests. In essence, instead of having one big database as our test platform, we'd run multiple DBs in order to simulate larger loads. This approach ended up being exactly what we needed, as we could finally stress not only the faster 2-way servers but 4-way servers as well. And at the higher load multiples we were able to simulate tests of databases that weren't 3GB in size, but 24GB in size.
Setting up the Tests
Setting up the tests and getting everything perfect wasn't an easy task at all; in fact, the first issue we encountered wasn't one we had thought about until working on our Hyper-Threading Pentium 4 article. In that article we mentioned that Windows 2000 does not properly recognize Hyper-Threading and thus treats every CPU, whether logical or physical, as a physical CPU. The problem is that if you have a 4-way Xeon MP server and enable Hyper-Threading, the OS will think you have 8 total CPUs, and unless you're running Windows 2000 Advanced Server the OS will only recognize 4 of them. Because of this we had to switch our test OS from Windows 2000 Server to Windows 2000 Advanced Server. Keep this in mind if you plan on deploying a 4-way Xeon MP setup on a Windows 2000 Server platform.
Microsoft's SQL Server doesn't suffer from this same fate; it properly recognizes Hyper-Threading enabled processors. Because of this you can use SQL Server 2000 Standard Edition with a 4-way Xeon MP with Hyper-Threading enabled, and you'll only have to be licensed for the four physical processors in that machine. With any more than 4 physical CPUs, however, you'll have to switch over to SQL Server 2000 Enterprise Edition, but the switch will grant you the ability to use more than 2GB of memory, which is quite useful.
The other issue that creeps up is that without any knowledge of Hyper-Threading, Windows 2000 won't necessarily assign threads to physical CPUs first before assigning them to logical CPUs. Remember that an idle physical CPU will always be faster than a logical processor sharing resources within an already-busy physical CPU; so if you only have four concurrent threads being dispatched, you'd want them sent to all four physical CPUs and not to two physical and two logical processors.
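Until the OS is Hyper-Threading aware, one workaround is to pin critical processes to one logical processor per physical package by hand. The sketch below shows the idea of building such an affinity mask; the enumeration order it assumes (logical processors paired per package) varies by BIOS and would need to be verified on real hardware, and on Windows the mask would be applied with the Win32 SetProcessAffinityMask call:

```python
# Sketch: build an affinity mask selecting one logical processor per
# physical package, so threads land on distinct physical CPUs first.
# ASSUMPTION: logical processors are enumerated in pairs per package
# (0-1 on package 0, 2-3 on package 1, ...) - real enumeration order
# varies by BIOS, so a deployment would query it rather than assume it.

PHYSICAL_CPUS = 4
LOGICAL_PER_PHYSICAL = 2   # Hyper-Threading enabled

mask = 0
for package in range(PHYSICAL_CPUS):
    mask |= 1 << (package * LOGICAL_PER_PHYSICAL)  # first logical CPU of each package

print(f"Affinity mask: {mask:#010b}")  # 0b01010101 - logical CPUs 0, 2, 4, 6
# A Hyper-Threading-aware scheduler makes this hand-tuning unnecessary by
# dispatching to idle physical CPUs first on its own.
```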
The OS problems will be solved once .NET Server is released, but until then you're just going to have to keep that in the back of your mind before deploying hardware. Now let's move onto the tests themselves.
As we've mentioned in previous articles, recording a trace on a database server is much like recording a timedemo in Quake III. Every single request that is sent to the database server is recorded into a file that is nothing more than a list of those requests. Then you can take a copy of the database that the trace was recorded from, load it on any machine, and play the trace back in order to simulate the exact same load on that machine. For our load multiplied tests we simply ran multiple copies of the DB and multiple traces concurrently.
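Conceptually, the load-multiplied test looks something like the sketch below: one worker per database copy, each replaying the same recorded trace. The helper names here are hypothetical stand-ins for the actual playback tooling; the concurrent structure is the point:

```python
# Conceptual sketch of load-multiplied trace playback. The helpers are
# hypothetical stand-ins for the real trace-replay tool; only the
# concurrent structure is meant to mirror our test setup.
import threading, time

def execute_against(db_name, request):
    """Stub: send one recorded request to the named database copy."""
    pass

def replay_trace(trace_file, db_name):
    """Replay every recorded request, in order, against one DB copy."""
    with open(trace_file) as trace:
        for request in trace:              # each line = one recorded request
            execute_against(db_name, request)

def run_load_multiplied_test(trace_file, multiplier):
    """Run `multiplier` copies of the same trace concurrently."""
    start = time.time()
    workers = [threading.Thread(target=replay_trace,
                                args=(trace_file, f"forums_copy_{i}"))
               for i in range(multiplier)]  # one DB copy + trace per load unit
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return time.time() - start              # total running time of the traces
```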
For the AnandTech Forums DB tests we ran single and 3x load test runs, while the AnandTech Web and AnandTech Ad DBs were run under 1x, 4x, 6x and 12x load.
Each test was run 5 times, with the first run being thrown out in order to let the DB server begin to cache queries; the remaining runs were averaged to produce the final score in transactions per second. The score for each run was the sum of all of the concurrent transactions divided by the total running time of the traces. The DB server was rebooted and defragmented before switching tests or load settings.
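In code form, the scoring works out to something like this minimal sketch (the transaction counts and timings below are illustrative placeholders, not actual benchmark data):

```python
# Scoring sketch: 5 runs per test, first run discarded as cache warm-up,
# remaining runs averaged into a transactions-per-second figure.
def score_runs(runs):
    """runs: list of (total concurrent transactions, total trace time in s)."""
    kept = runs[1:]                                   # throw out the warm-up run
    tps = [txns / seconds for txns, seconds in kept]  # per-run transactions/sec
    return sum(tps) / len(tps)

# Illustrative numbers only - not actual benchmark data.
runs = [(41000, 310.0), (52000, 305.0), (52400, 303.5),
        (52100, 304.2), (52300, 303.9)]
print(f"Final score: {score_runs(runs):.1f} transactions/sec")  # ~171.6
```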
The DB servers were configured with 2GB of memory and 6 x 18GB 10K RPM Seagate Cheetah drives in RAID 0 running off of an Intel 64-bit PCI SCSI RAID controller. Normally you wouldn't find a DB server configured with drives in RAID 0 but in order to prevent us from having to use twice as many drives we stuck with RAID 0. We weren't concerned with data loss in the event of a drive failure since this was only a test server, but in a real world situation we'd have the drives configured in a RAID 10 array to provide performance and data redundancy.
AnandTech Web DB Performance
The first DB test we have is of the AnandTech Web DB; this is the database that houses all of the articles (including this one) and makes sure that they are available for you to read every day. As you can guess, the vast majority of requests to this database are selects (basically reads). The data request comes over the Internet, down through the NIC, to the CPU which then sends it to memory or I/O to fulfill the read request; this test simulates everything after the request hits the NIC.
The original copy of the database is 249MB, but note that in the multiplied load tests the overall working database size is multiplied as well (4x = 996MB, 12x = 2988MB, etc.).
The purple bars indicate that Hyper-Threading was enabled, blue bars indicate Hyper-Threading disabled and the green bar is for AMD.
[Chart: AnandTech Web DB performance - 1x load]
First of all, the standings here won't be changing much but that is to be expected, especially as we increase the load. It would take a pair of very powerful CPUs to beat out a 4-way configuration like we have here.
Even at such low loads you can see that the 4-way Xeon MP offers a 30% performance boost over a 2-way Xeon MP system. But once you look at the 2-way 2.8GHz Xeon DP setup, the performance advantage of the 4-way Xeon MP system is much less impressive. Here the 4-way setup is only able to provide a 10% performance advantage over the 2-way 2.8GHz setup. The on-die L3 cache is the major reason behind this advantage since the load isn't great enough to stress all four CPUs.
Moving down to the 2-way configurations you can see that the 2-way Xeon MP, despite its large on-die L3 cache, cannot outperform the 2-way Xeon DP because of the DP's 40% clock speed advantage. Will this trend continue as we increase the load? There's only one way to find out, but we'll get to that in a bit.
The impact of Hyper-Threading has clearly improved since we first looked at the technology; on the 4-way Xeon MP setup there is a 5% performance advantage provided by HT, the 2-way Xeon DP gets a 13% boost and the 2-way Xeon MP receives a 9% kick in the pants. In theory, the performance boost provided by Hyper-Threading should increase as load increases simply because the increase in I/O utilization would trigger more periods of idle execution within the CPUs.
Once we factor AMD into the equation, it's clear that the clock speed and cache advantages of the Xeons are widening the performance gap. We are due for an Athlon MP 2400+ relatively soon, which should help keep AMD competitive.
[Chart: AnandTech Web DB performance - 4x load]
The first thing you'll notice is that the spread increases noticeably as we increase the load level.
Hyper-Threading gets a chance to make much more of an impact here, where the 2-way Xeon MP with Hyper-Threading enabled can outperform the 2-way Xeon DP with HT disabled. Enabling Hyper-Threading increased performance by 28% on the 2-way Xeon MP and by 26% on the 2-way Xeon DP. Neither of those performance gains was enough to beat out the raw power of having four CPUs, but even on the 4-way setup HT was able to increase performance by 20%.
The gain from moving from 2 processors to 4 extends to 67%, and the performance advantage of the 4-way Xeon MP over the 2-way Xeon DP extends to 39%. Once again we see that the 2-way Xeon MP's 2MB L3 cache isn't able to overcome the 2-way Xeon DP's 40% clock speed advantage.
[Chart: AnandTech Web DB performance - 6x load]
Things change slightly as we increase the load to 6x; for starters, the 2-way Athlon MP and 2-way Xeon MP setups swap positions but still offer very similar performance.
The 4-way Xeon MP continues to exert its dominance, extending its lead over the 2-way Xeon DP to 48%. The performance gain by going to 4 CPUs in this case ends up being an incredible 62%. It is clear that server applications scale much better with more processors than anything we could ever find in the desktop or workstation arenas.
Hyper-Threading continues to offer impressive performance gains; once again, the higher the load, the more often the CPU has to go out to I/O and the more periods of idle execution are present. The end result is that Hyper-Threading yields a 23% gain on the 4-way Xeon MP, a 25% gain on the Xeon DP and a 36% gain on the Xeon MP. It is very interesting to note that the largest boost from Hyper-Threading comes with the 2-way Xeon MP setup and not the 2-way Xeon DP configuration, as we originally theorized. One possible explanation is that the load isn't great enough to completely saturate the 2-way Xeon DP platform, whereas the slower 2-way Xeon MPs are more CPU bound in this test; the outcome is that the CPU bound 2-way Xeon MP configuration benefits more from being able to use all of its execution power.
[Chart: AnandTech Web DB performance - 12x load]
For our final Web DB test we crank up the load to 12 times that of the original Web DB test, for a total working database size of almost 3GB; for a content-only database, 3GB is quite a few reviews. It is here that we separate the boys from the men and the benefits of 4-way SMP and Hyper-Threading really become clear.
First off, the 4-way Xeon MP maintains a 64% performance lead over the 2-way Xeon DP courtesy of the number of CPUs. It's interesting to note that even under such heavy load, the 2-way Xeon MP is not able to overcome the clock speed deficit and outperform the 2-way Xeon DP. This could be an indication of why Intel didn't want 4-way configurations of their 2.8GHz Xeon DP processors as they could theoretically outperform the more expensive 4-way Xeon MP solutions.
In this test we see that the benefit from going to 4 processors is a healthy 88%. However you must realize that it took us multiplying the load on the server by a factor of 12 before we got this sort of scaling with CPUs as well as this sort of performance advantage over the 2.8GHz Xeon DPs.
Hyper-Threading proves to be most useful here, providing a 35% performance gain for the 4-way Xeon MP solution and a 33% gain for the 2-way Xeon MP. The Xeon DP setup received a smaller 27% performance boost. This is the second time we've seen a more impressive performance gain out of the Xeon MP parts than the Xeon DP, and it could be related to the fact that the Xeon MP CPUs have more cache to share amongst the multiple threads contending for resources. We would assume that there's a good deal of locality in these concurrent memory accesses, but this data may indicate otherwise.
In either case, Hyper-Threading definitely proves its worth here; although not as effective as moving to a 4-way configuration, Hyper-Threading offers more bang for your buck than adding more cache or even increasing clock frequency. The combination of all three, however, results in a much more powerful server microprocessor.
AnandTech Ad DB Performance
Next we have the AnandTech Ad DB, which is very similar to the Web database in terms of usage patterns but much larger in sheer size. The Ad DB itself is 2.1GB, and its size is obviously multiplied in the load multiplied tests.
[Chart: AnandTech Ad DB performance - 1x load]
Immediately it becomes clear that this isn't the same sort of load we saw in the last set of tests. The most obvious characteristic of the Ad DB tests is that Hyper-Threading isn't initially very helpful. Granted there is an overall performance increase of around 7% on the 2-way Xeon DP, but for the two Xeon MP platforms the technology does much more harm than good. This could be the work of Hyper-Threading scaling well with clock speed at relatively low loads, which aren't I/O intensive enough to introduce more periods of idle execution.
Both Xeon MPs perform wonderfully in this test, clearly using their large on-die L3 cache to their advantage. It's clear that we're not very CPU bound initially as the 4-way Xeon MP only offers a 10% performance advantage over the 2-way Xeon MP.
[Chart: AnandTech Ad DB performance - 4x load]
Quadrupling the load sure does change things, doesn't it?
The biggest change is that Hyper-Threading now yields a performance gain across all platforms instead of selectively doing better on the Xeon DP setup. If we look at the performance gains - 27% on the 4-way Xeon MP, 21% on the 2-way Xeon MP and 12% on the 2-way Xeon DP - it is clear that the Xeon MP setups benefit much more from HT than the Xeon DP. This is the exact opposite of what we saw in the original test, but let's take a look at some more heavily loaded situations before determining why...
[Chart: AnandTech Ad DB performance - 6x load]
We saw this in the last test, but the trend continues here as well - the 40% clock speed advantage of the Xeon DP 2.8 cannot make up for the presence of a 2MB on-die L3 cache on the Xeon MP and thus the two perform virtually identically to one another. This positions our Ad DB test as a very cache-friendly performance scenario, which also helps explain some of the interesting Hyper-Threading results we've seen.
Remember that when you're executing multiple threads on a single Xeon core, each thread effectively gets half of the cache it would have had executing exclusively on the CPU. Assuming there is a great deal of locality between concurrent threads, this reduction in usable cache isn't too big of an issue, but it is clearly important here; the Xeon MP shows HT scaling equal to or greater than that of the higher clocked Xeon DP.
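To see why a bigger cache softens the blow of sharing, consider a toy miss-rate model. The square-root scaling rule and the baseline miss rates below are assumptions chosen for illustration, not measurements from these CPUs:

```python
# Toy model: how halving the per-thread cache under HT affects miss rates.
# ASSUMED: miss rate scales as 1/sqrt(cache size) (the classic "root-2
# rule") and the baseline miss rates below - illustrative only.
from math import sqrt

def miss_rate(base_rate, base_kb, effective_kb):
    return base_rate * sqrt(base_kb / effective_kb)

# One thread per core vs. two HT threads splitting the last-level cache.
for name, cache_kb, base in [("Xeon DP (512KB L2)", 512, 0.08),
                             ("Xeon MP (2MB L3)", 2048, 0.04)]:
    solo = miss_rate(base, cache_kb, cache_kb)
    ht = miss_rate(base, cache_kb, cache_kb / 2)
    print(f"{name}: {solo:.1%} -> {ht:.1%} misses ({ht - solo:.1%} extra)")
# In this model the extra misses caused by cache sharing are twice as
# large in absolute terms on the smaller cache, which is one way the MP's
# 2MB L3 could let Hyper-Threading scale better under cache-friendly loads.
```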
AMD unfortunately doesn't stand a chance in this test, falling almost 30% behind the slowest contender with HT enabled.
[Chart: AnandTech Ad DB performance - 12x load]
As we crank the load up to our highest level the 4-way Xeon MP separates itself from the rest of the pack with a 68 - 74% advantage over both 2-way setups.
Hyper-Threading also proves to be a worthy ally, giving the 4-way setup a 28% boost and both 2-way setups a 30% performance increase.
The performance trends continue to mirror what we saw earlier, which provides us with more than enough information to draw some conclusions.
Final Words
Just recently we showed the benefits of Hyper-Threading on the desktop but it is clear that the true potential of Intel's new technology is fully realized in the server market.
While unable to replace the power of additional CPUs, Hyper-Threading provides no less than a 30% boost in I/O loaded situations, which occur quite often in the real world. In fact, once you factor in the latency introduced by traffic flowing over the network before reaching the server, there are quite a few opportunities for Hyper-Threading to fill those moments of idle execution.
However, for lightly loaded servers that don't experience much I/O utilization, there may actually be a small performance drop with Hyper-Threading enabled. In any situation where CPU performance is a key issue, though, enabling Hyper-Threading does seem to provide a performance gain. We can attribute this to an updated revision of the Hyper-Threading architecture in the new Xeons as well as the faster clock speeds these processors are running at.
Needless to say, if developers can come even remotely close to implementing this sort of thread level parallelism in desktop applications, then Hyper-Threading will bring the Pentium 4 much more success than any other single architectural improvement.
Intel's new 533MHz FSB Xeon DPs perform quite well at their new 2.80GHz clock speed. As our tests have shown, the increased clock speed provides performance that's equal to if not greater than a Xeon MP system with an identical number of CPUs; the performance advantage only tilts in favor of the Xeon MP once you move to 4 processors.
We can also conclude that Intel's decision to limit the Xeon DP to 2-way configurations is solely to prevent cannibalizing 4-way Xeon MP sales. There are indeed many situations that would benefit from a cheaper 4-way configuration of 2.8GHz CPUs without any on-die L3 cache. This isn't to discount the power of the 108 million transistor Xeon MPs, but they are definitely not the most economical solution for enterprise customers, and in many cases a 4-way Xeon DP configuration (if one existed) would suffice.
What is very interesting to note is the impressive scalability we saw when going from 2 to 4 processors, despite the fact that Intel implements a shared bus architecture with the Xeon. Even though all 4 CPUs share the same 3.2GB/s of FSB bandwidth, they are able to outperform a similarly configured 2-way setup by up to 88% in heavily loaded situations. Keeping this in mind, we are curious to see whether AMD's Opteron architecture will improve 4-way scalability even further, or whether even this much load isn't enough to saturate the Xeon's shared FSB.
That concludes part I of our Xeon DP & MP coverage, next we will be taking a look at the web server performance of these platforms. Be sure to check back after Comdex for part II...