Maximum Performance Computing Solutions, Servers, Workstations, Rackmount, Dual Core, AMD, X2, Opteron, Intel, P4, Xeon, Linux, Windows, Samba, email, NAS, Beowulf, Cluster, Scyld, Disk, RAID, Controller, Storage, Redhat


Systems Designed To Your Specifications
Based On Single or Dual Core
AMD Athlon 64 / Opteron or Intel P4 / Xeon Processors

The Definition of High Performance Computing:

Performance can be measured many ways. As the saying goes: there are lies, damn lies and then there are benchmarks. Benchmarks can be useful and meaningful, if and only if we understand what they are measuring and how a computer system works.

In a workstation, performance means fast response to human requests. This can mean a short start time for a program once the user has clicked its icon, or a shorter compile time when a user is building a program or an operating system (such as the Linux kernel).

In a server, performance is measured as the time it takes the server to read or write some external data to or from its data storage medium or the time it takes to find a given piece of data in its internal data stores. Performance can also be measured on a server in transactions (usually different for each application) per second.

For special cases, like firewalls or routers, performance is measured by the maximum amount of network traffic it can move from one input/output port to another input/output port all the while filtering or rerouting the data stream.

To better understand the options that affect performance, we first need to understand what is happening inside a computer.

How a Computer System Works:

Most computer systems are based on the same principles and, in a general sense, on the same type of architecture. A processor or multiple processors execute a program. A program is just a sequence of instructions that tells a processor how to do what you want the machine to do.

At a given clock speed, internal processor architecture, internal cache size and architecture, processor to memory bandwidth (memory speed and type and Memory Controller Architecture) and software optimization all affect system performance.

How fast a processor executes a program is controlled by many hardware and software decisions made when that particular computing system is designed. In today's machines, data locality is the key to system performance. What this means is: the closer the program and data are to the processor, the faster the processor can execute that program. The function of a processor's cache is to hold program elements or data "close" to the processor for faster access.

All processors being sold today have at least one level of memory cache inside; most have two levels and some have three. In most cases this processor cache runs at the same speed as the processor. The fastest cache is the level one cache. Depending on the processor design, it may be organized as a unified cache or as two caches, one for instructions and one for data (as on AMD processors). The level two cache is almost always unified and is usually described as being "N-way set associative". In a 2-way set associative cache, a given memory address can be stored in either of two locations (ways) within the set it maps to. In a 4-way set associative cache, it can be stored in any of four.

Below is a much simplified graphic that shows the relationship among the standard computing elements in a computer system.

 

 

If a given piece of information is not in the level 1 cache, the processor is forced to look for it in the level 2 cache. If the information is not in the level 2 cache, the system is forced to look for it in main memory and so on. If the information needed is stored close to the CPU ALU/FPU block (Arithmetic Logic Unit / Floating Point Unit), the system has very fast access to the information. If it is not close to the CPU ALU/FPU block, the system is forced to look for it further away in slower storage areas (further down in the graphic) and a significant time delay is incurred, thus decreasing overall performance.
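As a rough illustration of why this fall-through matters, the sketch below estimates the average cost of one memory access when each miss is passed down to the next, slower level. The latencies and hit rates are made-up assumptions chosen for illustration, not measurements of any particular processor.

```python
# Hypothetical latencies (in CPU clock cycles) and hit rates, for illustration only.
LEVELS = [
    ("L1 cache", 3, 0.95),
    ("L2 cache", 15, 0.90),
    ("main memory", 200, 1.0),  # the last level always supplies the data
]


def average_access_cycles(levels):
    """Expected cycles per access when every miss falls through to the next level."""
    expected = 0.0
    reach_probability = 1.0  # chance that a request has to go as far as this level
    for _name, latency, hit_rate in levels:
        expected += reach_probability * latency
        reach_probability *= 1.0 - hit_rate
    return expected


print(f"{average_access_cycles(LEVELS):.2f} cycles per access on average")
```

With these assumed numbers the occasional trip to main memory accounts for a noticeable share of the average, even though only a small fraction of requests ever get that far.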

The larger the storage capacity of the upper blocks in the above illustration, the faster most programs run. Today, most CPUs have all the cache built in (older Socket 7 CPUs had between 16K and 256K bytes of internal level 1 cache; associated Socket 7 motherboards usually had 512K bytes of external level 2 cache). Today's completely internal cache structure affords faster cache accesses. It also has a downside: as the CPU gets more complex and the caches get larger and faster, the CPU chip puts out more heat and therefore requires a more efficient Heat Sink Fan (HSF) assembly.

Today's processors have between 512K and 2 MB of internal cache. Cache size is not the only influencing factor in system performance.

If processor "A" can execute more instructions in one clock cycle than processor "B", then processor "A" is said to provide better performance at the same clock speed.
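A minimal sketch of that comparison follows. The instructions-per-clock (IPC) figures and clock speeds are hypothetical values, not data for any real processor.

```python
def instructions_per_second(ipc, clock_hz):
    """Peak instruction throughput: instructions per clock times clocks per second."""
    return ipc * clock_hz


# Hypothetical processors "A" and "B" running at the same 2.0 GHz clock.
throughput_a = instructions_per_second(ipc=3.0, clock_hz=2.0e9)
throughput_b = instructions_per_second(ipc=2.0, clock_hz=2.0e9)

# Clock speed processor "B" would need to match processor "A".
clock_b_needed_hz = throughput_a / 2.0

print(f"A: {throughput_a:.1e} instr/s, B: {throughput_b:.1e} instr/s")
print(f"B would need {clock_b_needed_hz / 1e9:.1f} GHz to match A")
```

This is why clock speed alone is a poor way to compare processors with different internal architectures.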

Below is a comparison of the most popular processors:

Processor Architecture/Technology Comparison:

Single Processor System Competitive Comparison (usually used in workstations and low-end servers)

Features: AMD Athlon™ 64 versus Pentium® 4

Architecture Introduction
    AMD Athlon 64:  2003 (Socket 754), 2004 (Socket 939)
    Pentium 4:      2000

Infrastructure
    AMD Athlon 64:  Socket 754, Socket 939
    Pentium 4:      Socket 478, Socket LGA775

Process Technology
    AMD Athlon 64:  90 nanometer or 130 nanometer, SOI
    Pentium 4:      90 nanometer

Number of Transistors
    AMD Athlon 64:  105.9 million
    Pentium 4:      125 million

64-bit Instruction Set Support
    AMD Athlon 64:  Yes, AMD64 technology
    Pentium 4:      No

32-bit Instruction Set Support
    AMD Athlon 64:  Yes
    Pentium 4:      Yes

Enhanced Virus Protection for Windows® XP SP2
    AMD Athlon 64:  Yes
    Pentium 4:      No

System Bus Technology
    AMD Athlon 64:  HyperTransport™ technology up to 2000 MHz, full duplex
    Pentium 4:      Front Side Bus @ 800 MHz, half duplex

Integrated DDR Memory Controller (MCT)
    AMD Athlon 64 (Socket 754):  64-bit + 8-bit ECC; PC3200, PC2700, PC2100
    AMD Athlon 64 (Socket 939):  128-bit + 16-bit ECC; unbuffered PC3200, PC2700, or PC2100
    Pentium 4:                   No, discrete logic device on motherboard

Processor-to-System Bandwidth
    AMD Athlon 64 (Socket 754):  HyperTransport up to 6.4 GB/s, memory up to 3.2 GB/s, total up to 9.6 GB/s
    AMD Athlon 64 (Socket 939):  HyperTransport up to 8.0 GB/s, memory up to 6.4 GB/s, total up to 14.4 GB/s
    Pentium 4:                   Total up to 6.4 GB/s

Integrated Northbridge
    AMD Athlon 64:  Yes, 128-bit data path @ CPU core frequency
    Pentium 4:      No, discrete logic device on motherboard, 64-bit data path @ 200 MHz

High-Performance, On-chip Cache
    AMD Athlon 64:  L1: 128KB; L2: 512KB or 1024KB (exclusive); total effective cache: 640KB or 1152KB
    Pentium 4:      L1: 20K or 28K; L2: 1MB (inclusive); total effective cache: 1MB

3D and Multimedia Instructions
    AMD Athlon 64:  3DNow!™ Professional technology, SSE2
    Pentium 4:      SSE, SSE2, SSE3


Dual Processor Server Competitive Comparison:

Server Feature Comparison                     AMD Opteron™          Intel Xeon            Intel Xeon            Intel Itanium 2
Modular, glueless scalability                 Yes                   Requires Northbridge  Requires Northbridge  Requires Northbridge
SMP capabilities                              Up to 8-way           Up to 2-way           Up to 2-way           Up to 4-way
Direct Connect Architecture                   Yes                   No                    No                    No
High Performance 32-bit and 64-bit computing  AMD64                 No                    EM64T                 No
HyperTransport™ technology                    Yes                   No                    No                    No
Integrated DDR memory controller              Yes                   No                    No                    No
Front Side Bus frequency                      1.4 - 2.6 GHz         533 MHz               800 MHz               400 MHz
Front Side Bus bandwidth                      11.2 - 20.8 GB/s      4.3 GB/s              6.4 GB/s              6.4 GB/s
Maximum inter-processor bandwidth             8.0 GB/s              4.3 GB/s              6.4 GB/s              6.4 GB/s
Memory support                                DDR266/333/400        DDR266                DDR333 or DDR2-400    DDR200
Memory bandwidth, 2P system                   12.8 GB/s             4.3 GB/s              6.4 GB/s              6.4 GB/s
Memory bandwidth, 4P system                   25.6 GB/s             N/A                   N/A                   6.4 GB/s
L1 cache size (max)                           128 KB                8KB + 12k µop         16KB + 12k µop        32 KB
L2 cache size (max)                           1 MB                  512 KB                1 or 2 MB             256 KB
L3 cache size (max)                           N/A                   2 MB                  N/A                   9 MB
Maximum I/O bandwidth, 2P system              16.0 GB/s             3.2 GB/s              12.3 GB/s             6.4 GB/s
Maximum I/O bandwidth, 4P system              32.0 GB/s             N/A                   N/A                   6.4 GB/s
SIMD instruction set support                  SSE, SSE2, SSE3       SSE, SSE2             SSE, SSE2, SSE3       N/A

The best overall performance of a processor is achieved by:

1) Maximizing the speed and size of level 1 on-chip cache.
2) Maximizing the number of instructions executed in one clock cycle.
3) Maximizing the processor clock speed.
4) Maximizing the processor's access to system memory.

If the internal architecture and cache sizes were the same on all processors (which they are not), then the next factor that affects performance is the processor's connection to the memory.

All of today's Intel based systems use a chip called the North Bridge chip to connect the processor to memory, graphics (usually via an AGP or PCIe port) and the other parts of the system. The connection from the processor to the North Bridge chip is called the Front Side Bus (FSB).

In contrast to this, all of the AMD Athlon 64 and Opteron based systems have the memory controller built into the processor, and they use multiple HyperTransport busses to communicate with bridge chips for graphics, PCI-X, PCIe and the other parts of the system, and with other processors in a multiprocessor system. This provides certain distinct advantages:

  • Faster CPU to memory controller bus - It runs at the same clock speed as the processor!

  • Wider FSB and memory bus - 128 bits

  • An Exclusive NON-shared FSB on each Processor

  • An Exclusive NON-shared memory bank on each processor

We will look at overall motherboard and system architecture after we understand memory and how it is accessed.
 

Memory Architecture and Performance:

The memory controller (located in the North Bridge chip on Intel based systems, or built into the processor on Athlon 64 and Opteron based systems) controls the memory. When the processor requests the data at a given memory location, the memory controller does the actual memory access. Dynamic Random Access Memory (DRAM) is constructed like a spreadsheet, with a row and column address defining the location of each bit of information. Each location has a row address and a column address, but the DRAM chip only has one set of address input pins. This requires splitting the address input to the chip. DRAM requires three stages to access a given piece of information:

1) Row Address Strobe - Time to strobe the row address on the DRAM address pins

2) Row to Column Address Delay - Wait Period for changing the address information on the address pins from the row address to the column address.

3) Column Address Strobe Time - Time to strobe the column address on the DRAM address pins
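The address split described above can be sketched as follows. The row and column widths are hypothetical values picked for illustration; real DRAM chips use different geometries.

```python
# Assumed geometry for illustration only: 13 row bits and 10 column bits.
ROW_BITS = 13
COL_BITS = 10


def split_address(cell_address):
    """Split a flat cell address into the (row, column) pair sent over the shared pins."""
    column = cell_address & ((1 << COL_BITS) - 1)              # low bits select the column
    row = (cell_address >> COL_BITS) & ((1 << ROW_BITS) - 1)   # high bits select the row
    return row, column


# Two neighboring cells share a row, so only the column address differs.
print(split_address((5 << COL_BITS) | 7))
print(split_address((5 << COL_BITS) | 8))
```

The row part is presented first (stage 1) and the column part is presented on the same pins afterwards (stage 3), with the wait period of stage 2 in between.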

Different memory types have different performance capabilities. The standard memory types in use today have the following theoretical maximum performance capabilities:

Memory type          Theoretical maximum
PC100 SDRAM          800 MB/s
PC133 SDRAM          1.1 GB/s
PC2100 DDR           2.1 GB/s
PC2700 DDR           2.7 GB/s
PC3200 DDR           3.2 GB/s
PC3200 Dual Bank     6.4 GB/s
PC800 1 CH           1.6 GB/s
PC800 2 CH           3.2 GB/s
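The theoretical maximums listed above follow from bus width, clock rate and the number of transfers per clock. The sketch below reproduces them under stated assumptions: a 64-bit data bus for SDRAM and DDR modules, a 16-bit channel clocked at 400 MHz with two transfers per clock for PC800 (Rambus) memory, and nominal clocks rounded to whole MHz, so results are approximate.

```python
def peak_bandwidth_mb_s(clock_mhz, bus_bits, transfers_per_clock=1, channels=1):
    """Theoretical peak in MB/s: clock x transfers per clock x bytes per transfer x channels."""
    return clock_mhz * transfers_per_clock * (bus_bits // 8) * channels


# Assumed nominal figures; the function and table names here are illustrative only.
MEMORY_TYPES = {
    "PC100 SDRAM": peak_bandwidth_mb_s(100, 64),
    "PC133 SDRAM": peak_bandwidth_mb_s(133, 64),
    "PC2100 DDR": peak_bandwidth_mb_s(133, 64, transfers_per_clock=2),
    "PC2700 DDR": peak_bandwidth_mb_s(166, 64, transfers_per_clock=2),
    "PC3200 DDR": peak_bandwidth_mb_s(200, 64, transfers_per_clock=2),
    "PC3200 Dual Bank": peak_bandwidth_mb_s(200, 64, transfers_per_clock=2, channels=2),
    "PC800 1 CH": peak_bandwidth_mb_s(400, 16, transfers_per_clock=2),
    "PC800 2 CH": peak_bandwidth_mb_s(400, 16, transfers_per_clock=2, channels=2),
}

for name, mb_s in MEMORY_TYPES.items():
    print(f"{name:18s} {mb_s:5d} MB/s")
```

These are upper bounds only. Sustained bandwidth is lower because of the access latencies discussed in this section.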

Memory of the same type can have differing performance depending upon how fast the DRAM chips on a DIMM (Dual Inline Memory Module) can be accessed. Common speed ratings for PC133 SDRAM (Synchronous DRAM) are CL3 and CL2: CL3 is a 3clk - 3clk - 3clk access and CL2 is a 2clk - 2clk - 2clk access. Clearly CL2 memory provides better performance.

Double Data Rate (DDR) memory can transfer data on both the rising edge and the falling edge of the memory clock signal. DDR memory is rated at CL3, CL2.5 and CL2 speeds. Again, the CL2 rated memory provides better performance.

The above explanation is over-simplified. SDRAM only requires the first access to contain all three parts of the memory cycle. After the first access, any other accesses in the same row (memory page) only require the column address to be strobed. A typical CL2 access of four consecutive locations on the same memory page would really be a 6-1-1-1 clock access, or 9 clocks for 4 locations. Using slower CL3 memory would cause these 4 accesses to take place as 9-1-1-1, or 12 clocks, which takes about 33% longer.
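The burst arithmetic above can be checked with a short sketch. The function name and parameters are illustrative, not part of any memory specification.

```python
def burst_clocks(first_access_clocks, burst_length=4, following_clocks=1):
    """Total clocks for a burst: one full first access plus short follow-on accesses."""
    return first_access_clocks + (burst_length - 1) * following_clocks


fast = burst_clocks(6)  # 6-1-1-1 pattern (2 + 2 + 2 clocks for the first access)
slow = burst_clocks(9)  # 9-1-1-1 pattern (3 + 3 + 3 clocks for the first access)
extra_time = (slow - fast) / fast

print(f"fast: {fast} clocks, slow: {slow} clocks, extra time: {extra_time:.0%}")
```

Longer bursts dilute the cost of the first access, which is one reason sequential access patterns suffer less from slow memory timings than scattered ones.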

As an example, we tested a dual processor AMD MP1800+ based system (Tyan S2466N-4M motherboard) with the memory timing set to 2.5-3-3 and then set to 2-2-2 (the memory was rated at 2-2-2). The results, as benchmarked by SiSoftware Sandra 2002 SP1, were 1500 MB/sec for the first setting and 1743 MB/sec for the faster setting, a difference of 243 MB/sec. This comes out to a real increase in performance of 16.2%. This is a typical performance increase for a system that has been properly tuned to use fast memory.

Take note though, that as processor caches get bigger, the performance of the memory on programs that are small in size becomes much less apparent. On the other hand if the programs deal with large data sets or large amounts of disk based data, the memory speed becomes more critical.

Now that we have an overview of the items that most affect system performance, let's take a look at a dual processor system.
 

Typical Dual Processor System Architectures:

Over the last year, differing motherboard architectures have come about due to the AMD Opteron having the memory controller and three HyperTransport links added to the processor chip. This allowed a significantly higher level of integration on the CPU itself and therefore brought the main system memory closer to the CPU, which in turn increased performance significantly.

If you look carefully at the next 2 diagrams you will see that by integrating more of the motherboard functions (mainly the memory controller from the North Bridge Chip or Memory Controller Hub - MCH) onto the processor the architecture of the motherboard can eliminate the largest bottlenecks that used to be present in older systems.

First a typical Intel Dual Xeon based system:

The red circles in the diagram shown above indicate severe bottlenecks in system performance that can cripple an otherwise very fast processor, depending on the application. The shortcomings of this architecture are:

  • Processor to MCH is a fixed frequency and requires a separate chip
         Newer processors will run at the same fixed FSB speed
         Memory access is delayed by passing through a separate chip
     

  • Both Processors share the same Front Side Bus
         Effectively halving each processor's bandwidth to memory
         Stalling one processor while the other is accessing memory or I/O
         All processor to system I/O and control must use this one path
     

  • One interleaved memory bank for both processors
         Again, effectively halving each processor's bandwidth to memory
         Half the bandwidth of a 2 memory bank architecture
     

  • All program access to graphics, PCI, PCI-X or other I/O must be through this bottleneck

Clearly this older architecture needs to have these deficiencies addressed and corrected to provide better system concurrency and higher throughput.

Next a Dual Opteron Based System:

AMD Dual Opteron Processor Based Server

The above graphic shows a dual AMD Opteron based motherboard (this is the block diagram for the Tyan S2895 and the Iwill DK8ES). Each processor has its own memory controller and a dedicated bank of local memory. As you can see, this architecture provides high performance, local connections between the processors and memory plus bi-directional HyperTransport busses to interconnect 1) the processors to each other, 2) the processors to PCI-X and SCSI, 3) the processors to the disk I/O, 4) the processors to the gigabit LAN interfaces and various other I/O ports.

The total available bandwidth between the processors and the rest of the system is significantly higher in this architecture compared to the previous Intel Xeon + MCH architecture.
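To put rough numbers on the difference, the sketch below contrasts a memory path shared by all processors with a dedicated memory bank per processor. It uses the 6.4 GB/s figures from the server comparison table and assumes the worst case, where every processor is accessing memory at the same time. The function names are illustrative.

```python
def shared_bus(bus_gb_s, processors):
    """All processors contend for one bus: returns (per-processor, aggregate) GB/s."""
    return bus_gb_s / processors, bus_gb_s


def dedicated_banks(per_processor_gb_s, processors):
    """Each processor has its own memory bank: returns (per-processor, aggregate) GB/s."""
    return per_processor_gb_s, per_processor_gb_s * processors


print("Shared 6.4 GB/s FSB, 2 processors:  ", shared_bus(6.4, 2))
print("Dedicated 6.4 GB/s banks, 2 processors:", dedicated_banks(6.4, 2))
print("Dedicated 6.4 GB/s banks, 4 processors:", dedicated_banks(6.4, 4))
```

The aggregate values for the dedicated case line up with the 12.8 GB/s (2P) and 25.6 GB/s (4P) Opteron memory bandwidth figures in the table. Accesses to memory attached to the other processor still have to cross a HyperTransport link, which this simple model ignores.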
 

Performance considerations to take into account when defining a system are:

1) Processor(s) speed, cache size, interconnect type and I/O architecture
2) Motherboard architecture
3) Memory Size, Speed and Access Times (CLx)
4) On board I/O controllers (on-board versus add-in cards)
 

So far we have only looked at processors, memory and motherboards. We still need to store our programs and connect to the outside world. To do this we will need disk storage and either a Local Area Network connection or a modem/DSL connection.

Disk Performance:

Disks store data on magnetic media known as a platter. The platter is a circle of magnetic media that can be written to through the use of a coil that flies very close to the platter. This coil is known as a head. The platter turns underneath the head. As the platter turns, the head can read or write data on the media that is directly underneath it. To write more data than can fit in a single track, the arm holding the head must move it to the next track. The things that affect disk performance are:

1) Platter rotation speed (affects average rotational latency)
2) Track to Track head speed
3) Track Data Density
4) Disk Cache Size

Platter rotation speed affects average latency. This is the average time it takes for the platter to spin around until it gets to the piece of data you are looking for on that track. The faster the disk spins, the sooner you can get (on average) to the data you want. Most disks today have a rotation speed from 5400 Rotations Per Minute (RPM) to as high as 15,000 RPM.
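On average the platter has to turn half a revolution before the requested data arrives under the head, so the average rotational latency can be estimated directly from the rotation speed. A minimal sketch:

```python
def average_rotational_latency_ms(rpm):
    """Average wait is the time for half a revolution, in milliseconds."""
    ms_per_revolution = 60_000.0 / rpm
    return ms_per_revolution / 2.0


for rpm in (5400, 7200, 10_000, 15_000):
    print(f"{rpm:6d} RPM: {average_rotational_latency_ms(rpm):.2f} ms")
```

This delay is in addition to the time needed to move the head to the correct track.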

Track to Track Access Time. If the data requested is not on the present track, the track to track access time becomes important. The faster the disk can move the head to the track where the requested data resides, the sooner the system can get access to the needed data. Most disks today have a track to track access time from 10 milliseconds (ms) to under 3 ms. Drive manufacturers sometimes list this parameter as head seek time.

Track Data Density defines how much information can be stored on a given track. The higher the track data density, the more information the disk can store on one track. If a disk can store more data on one track it does not have to move the head to the next track as often. This means that the higher the recording density the lower the chances are that the head will have to be moved to the next track to get the required data.

Disk Cache Size. Disks contain intelligent controllers, a read cache and a write cache. When you ask for a given piece of data, the disk locates the data and sends it back to the motherboard. It also reads the rest of the track and caches this data on the assumption that you will want the next piece of data on the disk. This data is stored locally in its read cache. If, some time later, you request the next piece of data and it is in the read cache, the disk can deliver it with almost no delay.

When the motherboard writes to the disk the data is initially stored in the disk's write cache. This allows the motherboard to finish the write and go onto other things while the disk is writing the data to the media.

Usually a disk with a larger cache will perform better for small to medium transfer sizes than a disk with a smaller cache. Most disks today have cache sizes from 2 Megabytes to 8 Megabytes. Maximizing disk cache size, all other parameters being equal, will usually provide higher overall disk performance.
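The read cache behavior described above can be illustrated with a toy model. It is a deliberately simplified sketch: the class name, the sectors-per-track value and the single-track cache are assumptions for illustration, and a real drive's cache and firmware are far more sophisticated.

```python
class ReadAheadDisk:
    """Toy disk model that caches the whole track whenever it has to read the media."""

    def __init__(self, sectors_per_track=8):
        self.sectors_per_track = sectors_per_track
        self.cached_track = None  # track currently held in the read cache
        self.media_reads = 0      # number of slow accesses to the platter

    def read(self, sector):
        track = sector // self.sectors_per_track
        if track == self.cached_track:
            return "cache hit"    # delivered with almost no delay
        self.media_reads += 1     # seek, wait for rotation, read the track
        self.cached_track = track
        return "media read"


disk = ReadAheadDisk(sectors_per_track=8)
results = [disk.read(sector) for sector in range(10)]
print(results)
print("slow media reads:", disk.media_reads)
```

In this model ten sequential reads need only two slow media accesses. A larger cache would let the disk keep more than one track at a time, which is why cache size matters for small to medium transfers.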
 

Input/Output (I/O) Performance:

Most of the I/O that we add to a computer system today sits on the PCI bus. These connections are made using the (in most cases) four to six white parallel connectors on the motherboard. Most of the on-board peripherals are connected to an internal PCI bus. PCI bus speeds range from 32 bit / 33 MHz to 64 bit / 133 MHz, known as 32 bit PCI and PCI-X respectively. The first one is the most popular today on single processor Athlon 64 and Pentium 4 PC motherboards. Most manufacturers put the faster, 64 bit PCI slots onto server or dual processor workstation motherboards. Most of the I/O card vendors now offer a choice of 32 bit and 64 bit I/O cards that are "universal" in that they work in all four types of PCI slots.

To optimize the performance of a computer system, it is always better to use 64 bit cards in 64 bit PCI slots. This type of arrangement allows the motherboard to write or read the I/O card at twice to four times the speed of a 32 bit I/O card. As the availability of higher performance I/O cards goes up, the price will come down, making it easier and less expensive to build a higher performance system.
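The speed difference comes from the bus width and clock rate. The sketch below computes the theoretical peak for several slot types, with clocks rounded to whole MHz. The 64 bit / 33 MHz and 64 bit / 66 MHz combinations are assumed here as the intermediate slot types.

```python
def pci_peak_mb_s(bus_bits, clock_mhz):
    """Theoretical peak in MB/s: bytes per transfer times millions of transfers per second."""
    return (bus_bits // 8) * clock_mhz


baseline = pci_peak_mb_s(32, 33)  # standard 32 bit / 33 MHz PCI
for bus_bits, clock_mhz in ((32, 33), (64, 33), (64, 66), (64, 133)):
    peak = pci_peak_mb_s(bus_bits, clock_mhz)
    print(f"{bus_bits} bit / {clock_mhz:3d} MHz: {peak:4d} MB/s ({peak / baseline:.1f}x)")
```

The 64 bit slots at 33 MHz and 66 MHz give the two times and four times figures mentioned above. All devices on the same PCI bus share this bandwidth.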

When comparing I/O cards look for their bus type and speed and whether they contain an intelligent I/O processor on board. This I/O processor offloads some of the work required to communicate with the peripherals connected to your system. This leaves most processing power available for the applications running on the processor. The down side to this is that intelligent I/O cards are more expensive than simpler cards that rely on the internal CPU for control.
 

Summary:

1) CPU Performance

A faster processor, with more cache, with an integrated memory controller or with a faster FSB, that can do more in a single clock will usually give better performance.

2) Memory Performance

Larger memory, more banks of memory, with a faster access time will usually give better performance.

3) Disk Performance

A disk with a higher rotation speed, with faster track to track access time, with more cache and a higher track density will usually give better performance.

4) I/O Performance

An intelligent I/O card with a faster (66-133 MHz), wider (64 bit) PCI interface will usually give better performance.

Note that all of the above statements read "usually". This is because there are limitations on how fast a given function can run. An example of this is a modem. A 56K modem will never run faster than the phone line connection. The limitation is the phone line. The same holds true for almost all of the above items that were discussed. No matter how fast a CPU you use, if the program you are running causes a lot of disk accesses the CPU will still have to wait for the disk access to complete before it can continue. This is called "waiting faster" (more unused CPU clock in the same wait period). The overall key to high performance is to balance the speed of all the components listed above so that it is possible to get the best overall system performance for a given dollar spent on a computer.
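The "waiting faster" effect can be put into numbers with a simple model in which only the CPU portion of a job speeds up. This is essentially the reasoning commonly known as Amdahl's law. The timings below are hypothetical.

```python
def run_time_seconds(cpu_seconds, disk_wait_seconds, cpu_speedup=1.0):
    """Total run time when only the CPU-bound part benefits from a faster processor."""
    return cpu_seconds / cpu_speedup + disk_wait_seconds


# Hypothetical job: 20 s of computation and 80 s of waiting on the disk.
base = run_time_seconds(20.0, 80.0)
with_faster_cpu = run_time_seconds(20.0, 80.0, cpu_speedup=2.0)

print(f"original: {base:.0f} s, with a CPU twice as fast: {with_faster_cpu:.0f} s")
print(f"overall speedup: {base / with_faster_cpu:.2f}x")
```

Doubling the CPU speed in this example shortens the job by only ten percent, whereas spending the same money on the disk subsystem could help far more. That is the case for a balanced system.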

If you have questions about the above article or want to discuss the trade-offs of a system design please email or contact us. Our specialty is providing the best price / performance ratio for a given application.



Integrated Solutions and Systems, L.L.C.
1510 Old North Gate Road
Colorado Springs, CO  80921

Phone - 719-495-5866,  info@integratedsolutions.org


All product names and logos used in this web site are for identification
only and are the trademarks or logos of their respective owners