Performance can be measured many ways. As
the saying goes: there are lies, damn lies and then there are
benchmarks. Benchmarks can be useful and meaningful, if and only if we
understand what they are measuring and how a computer system works.
In a workstation, performance means fast
response to human requests. This can be defined as fast start times for
a program once the user has clicked an icon or it can be shorter compile
times when a user is trying to build a program or an operating system
(like a Linux kernel).
In a server, performance is measured as the
time it takes the server to read or write some external data to or from
its data storage medium or the time it takes to find a given piece of
data in its internal data stores. Performance can also be measured on a
server in transactions (usually different for each application) per
second.
For special cases, like firewalls or
routers, performance is measured by the maximum amount of network
traffic it can move from one input/output port to another input/output
port all the while filtering or rerouting the data stream.
To better understand the options that
affect performance, we first need to understand what is happening inside
a computer.
How a Computer System Works:
Most computer systems are based on the same
principles and, in a general sense, on the same type of architecture. A
processor or multiple processors execute a program. A program is just a
sequence of instructions that tells a processor how to do what you want
the machine to do.
At a given clock speed, internal processor
architecture, internal cache size and architecture, processor to memory
bandwidth (memory speed and type and Memory Controller Architecture) and
software optimization all affect system performance.
How fast a processor executes a program is
controlled by many hardware and software decisions made when that
particular computing system is designed. In today's machines, data
locality is the key to system performance. What this means is: the
closer the program and data are to the processor, the faster the
processor can execute that program. The function of a processor's cache
is to hold program elements or data "close" to the processor for faster
access.
All processors being sold today have at
least one level of memory cache inside; most have two levels of cache
and some have three. In most cases this processor cache
runs at the same speed as the processor. The fastest cache is
the level one cache. It may be organized as a unified cache (e.g.
Intel) or as two caches, one for instructions and one for data (e.g.
AMD). The level two cache is almost always unified and is usually
described as "set associative" with some number of "ways". In a 2 way
set associative cache, each memory address can be stored in either of
two locations in the cache. In a 4 way set associative cache, each
address can be stored in any of four locations.
Below is a much simplified graphic that
shows the relationship among the standard computing elements in a
computer system.
If a given piece of information is not in
the level 1 cache, the processor is forced to look for it in the level 2
cache. If the information is not in the level 2 cache, the system is
forced to look for it in main memory and so on. If the information
needed is stored close to the CPU ALU/FPU block (Arithmetic Logic Unit /
Floating Point Unit), the system has very fast access to the
information. If it is not close to the CPU ALU/FPU block the system is
forced to look for it further away in slower storage areas (further down
in the graphic) and a significant time delay is incurred, thus decreasing
overall performance.
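The cost of falling through the cache levels described above can be sketched with a small calculation. The hit rates and latencies below are hypothetical round numbers chosen only for illustration; they are not measurements of any particular processor.

```python
# Illustrative sketch: expected clocks per memory access in a cache hierarchy.
# All hit rates and latencies are assumed values, not measured ones.
LEVELS = [
    # (name, hit rate at this level, latency in CPU clocks)
    ("L1 cache", 0.90, 3),
    ("L2 cache", 0.95, 15),
    ("main memory", 1.00, 200),
]


def average_access_clocks(levels):
    """Expected clocks per access when each miss falls through to the next level."""
    expected = 0.0
    reach_probability = 1.0  # chance that a request gets as far as this level
    for _name, hit_rate, latency in levels:
        expected += reach_probability * latency
        reach_probability *= 1.0 - hit_rate
    return expected


if __name__ == "__main__":
    print(f"average clocks per access: {average_access_clocks(LEVELS):.2f}")
```

With these assumed numbers, only 1 access in 200 reaches main memory, yet those accesses still contribute about as much to the average as the level 2 cache does. This is why keeping data "close" matters so much.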
The larger the storage capacity of the
upper blocks in the above illustration, the faster most programs run.
Today, most CPUs have all the cache built in (older Socket 7 CPUs had
between 16K and 256K bytes of internal level 1 cache; associated Socket
7 motherboards usually had 512K bytes of external level 2 cache).
Today's completely internal cache structure affords faster cache
accesses. It also has a downside. As the CPU gets more complex and the
caches get larger and faster, the CPU chip puts out more heat and
therefore requires a more efficient Heat Sink Fan (HSF) assembly.
Today's processors have between 512K and 2
MB of internal cache. Cache size is not the only influencing factor in
system performance.
If processor "A" can execute more
instructions in one clock cycle
than processor "B", then processor "A" is said to provide better
performance at the same clock speed.
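The comparison between processor "A" and processor "B" comes down to multiplying instructions per clock by clock speed. The sketch below uses made-up values for both processors, purely to show the arithmetic.

```python
# Illustrative sketch: throughput = instructions per clock x clock frequency.
# The IPC and clock values are hypothetical, not figures for real processors.
def instructions_per_second(instructions_per_clock, clock_hz):
    """Peak instruction throughput for a given IPC and clock frequency."""
    return instructions_per_clock * clock_hz


# Two hypothetical processors running at the same 2.0 GHz clock.
proc_a = instructions_per_second(3.0, 2.0e9)
proc_b = instructions_per_second(2.0, 2.0e9)

# Clock speed processor "B" would need to match processor "A".
clock_b_needs_hz = proc_a / 2.0

if __name__ == "__main__":
    print(f"A is {proc_a / proc_b:.2f}x faster than B at the same clock")
    print(f"B would need {clock_b_needs_hz / 1e9:.1f} GHz to match A")
```

This is why clock speed alone is a poor basis for comparing processors with different internal architectures.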
Below is a comparison of the most popular
processors:
Processor Architecture/Technology Comparison:
Single Processor System Competitive Comparison (usually used in workstations and low end servers)
Features | AMD Athlon™ 64 (Socket 754) | AMD Athlon™ 64 (Socket 939) | Pentium® 4
Architecture Introduction | 2003 | 2004 | 2000
Infrastructure | Socket 754 | Socket 939 | Socket 478, Socket LGA775
Process Technology | 90 nanometer or 130 nanometer, SOI | 90 nanometer or 130 nanometer, SOI | 90 nanometer
Number of Transistors | 105.9 Million | 105.9 Million | 125 Million
64-bit Instruction Set Support | Yes, AMD64 technology | Yes, AMD64 technology | No
32-bit Instruction Set Support | Yes | Yes | Yes
Enhanced Virus Protection for Windows® XP SP2 | Yes | Yes | No
System Bus Technology | HyperTransport™ technology up to 2000 MHz, full duplex | HyperTransport™ technology up to 2000 MHz, full duplex | Front Side Bus @ 800 MHz, half duplex
Integrated DDR Memory Controller (MCT) | 64-bit + 8-bit ECC; PC3200, PC2700, PC2100 | 128-bit + 16-bit ECC unbuffered; PC3200, PC2700 or PC2100 | No, discrete logic device on motherboard
Processor-to-System Bandwidth | HyperTransport: up to 6.4 GB/s; Memory: up to 3.2 GB/s; Total: up to 9.6 GB/s | HyperTransport: up to 8.0 GB/s; Memory: up to 6.4 GB/s; Total: up to 14.4 GB/s | Total: up to 6.4 GB/s
Integrated Northbridge | Yes, 128-bit data path @ CPU core frequency | Yes, 128-bit data path @ CPU core frequency | No, discrete logic device on motherboard, 64-bit data path @ 200 MHz
High-Performance, On-chip Cache | L1: 128KB; L2: 512KB or 1024KB (exclusive); Total Effective Cache: 640KB or 1152KB | L1: 128KB; L2: 512KB or 1024KB (exclusive); Total Effective Cache: 640KB or 1152KB | L1: 20KB or 28KB; L2: 1MB (inclusive); Total Effective Cache: 1MB
3D and Multimedia Instructions | 3DNow!™ Professional technology, SSE2 | 3DNow!™ Professional technology, SSE2 | SSE, SSE2, SSE3
Dual Processor Server Competitive Comparison:
Server Feature Comparison | AMD Opteron™ | Intel Xeon (533 MHz FSB) | Intel Xeon (800 MHz FSB) | Intel Itanium 2
Modular, glueless scalability | Yes | Requires Northbridge | Requires Northbridge | Requires Northbridge
SMP Capabilities | Up to 8-way | Up to 2-way | Up to 2-way | Up to 4-way
Direct Connect Architecture | Yes | No | No | No
High Performance 32-bit and 64-bit computing | AMD64 | No | EM64T | No
HyperTransport™ technology | Yes | No | No | No
Integrated DDR memory controller | Yes | No | No | No
Front Side Bus frequency | 1.4 - 2.6 GHz | 533 MHz | 800 MHz | 400 MHz
Front Side Bus bandwidth | 11.2 - 20.8 GB/s | 4.3 GB/s | 6.4 GB/s | 6.4 GB/s
Maximum Inter-processor bandwidth | 8.0 GB/s | 4.3 GB/s | 6.4 GB/s | 6.4 GB/s
Memory support | DDR266/333/400 | DDR266 | DDR333 or DDR2-400 | DDR200
Memory Bandwidth 2P System | 12.8 GB/s | 4.3 GB/s | 6.4 GB/s | 6.4 GB/s
Memory Bandwidth 4P System | 25.6 GB/s | N/A | N/A | 6.4 GB/s
L1 cache size (max) | 128 KB | 8 KB + 12K µop | 16 KB + 12K µop | 32 KB
L2 cache size (max) | 1 MB | 512 KB | 1 or 2 MB | 256 KB
L3 cache size (max) | N/A | 2 MB | N/A | 9 MB
Maximum I/O bandwidth 2P System | 16.0 GB/s | 3.2 GB/s | 12.3 GB/s | 6.4 GB/s
Maximum I/O bandwidth 4P System | 32.0 GB/s | N/A | N/A | 6.4 GB/s
SIMD Instruction Set Support | SSE, SSE2, SSE3 | SSE, SSE2 | SSE, SSE2, SSE3 | N/A
The best overall performance of a processor is achieved by:
1) Maximizing the speed and size of level 1 on-chip cache.
2) Maximizing the number of instructions executed in one clock cycle.
3) Maximizing the processor clock speed.
4) Maximizing the processor's access to system memory.
If the internal architecture and cache sizes were the same on
all processors (which they are not) then the next factor that affects performance is the processor's
connection to the memory.
All of today's Intel-based
systems use a chip called the North Bridge chip to connect the processor
to memory, graphics (usually via an AGP or PCIe port) and to the other parts of the
system. The connection from the processor to the North Bridge chip is called the Front Side Bus (FSB).
In contrast to this, all of the
AMD Athlon 64 and Opteron based systems have the memory controller built
into the processor and they use multiple HyperTransport buses to
communicate with bridge chips for graphics, PCI-X, PCIe, the other parts
of the system and other processors in a multiprocessor system. This
provides certain distinct advantages:
Faster CPU to memory controller bus - it runs at the same clock speed as the processor!
Wider FSB and memory bus - 128 bits
An exclusive, non-shared FSB on each processor
An exclusive, non-shared memory bank on each processor
We will look at overall
motherboard and system architecture after we understand memory and how
it is accessed.
Memory Architecture and Performance:
The North Bridge chip controls the memory. When the processor
requests the data at a given memory location, the North Bridge chip does the actual memory access.
Dynamic Random Access Memory (DRAM) is constructed like a spreadsheet with a row and column address
defining the location of each bit of information. Each location has a row address and a column address
but the DRAM chip only has one set of address input pins. This requires splitting the address input to
the chip. DRAM requires three stages to access a given piece of information:
1) Row Address Strobe - Time to strobe the row address on the
DRAM address pins
2) Row to Column Address Delay - Wait Period for changing the
address information on the address pins from the row address to the column address.
3) Column Address Strobe Time - Time to strobe the column
address on the DRAM address pins
Different memory types have different performance capabilities.
The standard memory types in use today have the following theoretical maximum performance capabilities:
Memory Type | Theoretical Maximum Bandwidth
PC100 SDRAM | 800 MB/s
PC133 SDRAM | 1.1 GB/s
PC2100 DDR | 2.1 GB/s
PC2700 DDR | 2.7 GB/s
PC3200 DDR | 3.2 GB/s
PC3200 Dual Bank | 6.4 GB/s
PC800 1 CH | 1.6 GB/s
PC800 2 CH | 3.2 GB/s
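These theoretical maximums follow from bus width, memory clock and the number of transfers per clock. The sketch below reproduces them. The widths and clocks are the commonly quoted nominal figures for each memory type and are assumptions added here, not values taken from this article (clocks are rounded down, so PC2700 comes out slightly under 2.7 GB/s).

```python
# Illustrative sketch: peak bandwidth = bus width x clock x transfers per clock.
# Bus widths and clocks are assumed nominal values, not from the article.
def peak_bandwidth_mb_s(bus_bytes, clock_mhz, transfers_per_clock):
    """Theoretical peak bandwidth in MB/s."""
    return bus_bytes * clock_mhz * transfers_per_clock


MEMORY_TYPES = {
    # name: (bus width in bytes, clock in MHz, transfers per clock)
    "PC100 SDRAM": (8, 100, 1),
    "PC133 SDRAM": (8, 133, 1),
    "PC2100 DDR": (8, 133, 2),
    "PC2700 DDR": (8, 166, 2),
    "PC3200 DDR": (8, 200, 2),
    "PC3200 Dual Bank": (16, 200, 2),
    "PC800 1 CH": (2, 400, 2),
    "PC800 2 CH": (4, 400, 2),
}

if __name__ == "__main__":
    for name, (width, clock, transfers) in MEMORY_TYPES.items():
        print(f"{name:18s} {peak_bandwidth_mb_s(width, clock, transfers):5d} MB/s")
```

Real systems deliver less than these peaks, because the figures ignore the access latencies discussed next.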
Memory of the same type can have differing performance
depending upon how fast the DRAM chips can be accessed on a DIMM (Dual Inline Memory Module). Common
speeds for PC-133 SDRAM (Synchronous DRAM) are CL3 and CL2: CL3 is a 3clk - 3clk - 3clk access and CL2 is a
2clk - 2clk - 2clk access. Clearly CL2 memory provides better performance.
Double Data Rate (DDR) memory
can transfer data on both the rising edge and the falling edge of the
memory clock signal. DDR memory is rated
at CL3, CL2.5 and CL2 speeds. Again, the CL2 rated memory provides better performance.
The above explanation is over-simplified.
SDRAM only requires the first access to contain all three parts of the memory cycle. After the first
access, any other accesses in the same row (memory page) only require the column address to be strobed. A typical CL2
access of four consecutive locations on the same memory page would really be a 6-1-1-1 clock access or 9
clocks for 4 locations. Using slower memory would cause these 4 accesses to take place as 9-1-1-1, or 12
clocks, which is about 33% slower.
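The burst arithmetic above can be checked with a few lines. This is a minimal sketch of the simplified model used in this article (one full access followed by one clock per additional location), not a model of any specific memory controller.

```python
# Illustrative sketch of the x-1-1-1 burst model described in the text.
def burst_clocks(first_access_clocks, burst_length=4):
    """Total clocks: a full cycle for the first location, 1 clock for each one after."""
    return first_access_clocks + (burst_length - 1)


fast = burst_clocks(6)  # 6-1-1-1
slow = burst_clocks(9)  # 9-1-1-1
slowdown_percent = (slow - fast) / fast * 100

if __name__ == "__main__":
    print(f"fast memory: {fast} clocks, slow memory: {slow} clocks")
    print(f"slow memory takes {slowdown_percent:.0f}% longer")
```

The slower memory needs 12 clocks instead of 9 for the same four locations, which is roughly a third longer.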
As an example, we tested a Dual Processor AMD MP1800+ based system (Tyan
S2466N-4M MB) running memory set to 2.5-3-3 and then set it to 2-2-2 (the
memory was rated at 2-2-2). The difference in performance as
benchmarked by SiSoftware Sandra 2002 SP1 was 243 Megabytes / second! The results were 1500 MB/sec for
the first setting and 1743 MB/sec for the faster setting. This comes out to a real increase in
performance of 16.2%. This is a typical performance increase for a system that has been properly tuned
to use fast memory.
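The percentage quoted for this test is simply the bandwidth difference divided by the slower result. A short check using the two Sandra figures reported above:

```python
# Check of the reported memory tuning gain, using the two figures from the text.
def percent_gain(before, after):
    """Relative improvement of 'after' over 'before', in percent."""
    return (after - before) / before * 100


gain = percent_gain(1500, 1743)  # MB/sec at 2.5-3-3 versus 2-2-2

if __name__ == "__main__":
    print(f"difference: {1743 - 1500} MB/sec, gain: {gain:.1f}%")
```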
Take note though, that as
processor caches get bigger, the performance of the memory on programs
that are small in size becomes much less apparent. On the other hand if
the programs deal with large data sets or large amounts of disk based
data, the memory speed becomes more critical.
Now that we have an overview of the items that most affect
system performance, let's take a look at a dual processor system.
Typical Dual Processor System Architectures:
Over the last year differing
motherboard architectures have come about due to the AMD Opteron having
the memory controller and three HyperTransport links added to the processor
chip. This allowed a significantly higher level of integration on the
CPU itself and therefore brought the main system memory closer to the
CPU and in turn increased performance significantly.
If you look carefully at the
next 2 diagrams you will see that by integrating more of the motherboard
functions (mainly the memory controller from the North Bridge Chip or
Memory Controller Hub - MCH) onto the processor the architecture of the
motherboard can eliminate the largest bottlenecks that used to be
present in older systems.
First a typical Intel Dual Xeon based system:
The red circles in the diagram
shown above indicate severe bottlenecks in system performance that can
cripple an otherwise very fast processor, depending on the application.
The shortcomings of this architecture are:
Processor to MCH is a fixed frequency and requires a separate chip
    Newer processors will run at the same fixed FSB speed
    Memory access is delayed by passing through a separate chip
Both processors share the same Front Side Bus
    Effectively halving each processor's bandwidth to memory
    Stalling one processor while the other is accessing memory or I/O
    All processor to system I/O and control must use this one path
One interleaved memory bank for both processors
    Again, effectively halving each processor's bandwidth to memory
    Half the bandwidth of a 2 memory bank architecture
All program access to graphics, PCI, PCI-X or other I/O must be through this bottleneck
Clearly this older architecture
needs to have these deficiencies addressed and corrected to provide
better system concurrency and higher throughput.
Next a Dual Opteron Based System:
AMD Dual Opteron Processor Based Server
The above graphic shows a dual
AMD Opteron based motherboard
(this is the block diagram for the Tyan S2895 and the Iwill DK8ES). Each
processor has its own memory controller and a dedicated bank of local
memory. As you can see, this architecture provides high performance,
local connections between the processors and memory plus bi-directional
HyperTransport buses to interconnect 1) the processors to each other, 2)
the processors to PCI-X and SCSI, 3) the processors to the disk I/O, 4)
the processors to the gigabit LAN interfaces and various other I/O
ports.
The total available bandwidth
between the processors and the rest of the system is significantly
higher in this architecture compared to the previous Intel Xeon + MCH
architecture.
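The difference between a shared Front Side Bus and a dedicated memory bank per processor can be put into numbers. The sketch below uses the 6.4 GB/s figures from the server comparison table earlier in this article. It is an idealized peak calculation that assumes each Opteron reads only its own local memory and ignores access to the other processor's memory over HyperTransport.

```python
# Illustrative sketch: shared FSB versus one memory controller per processor.
# Peak figures only; an idealized model, not a benchmark.
def per_processor_shared(bus_gb_s, processors):
    """Bandwidth left for each processor when all of them share one bus."""
    return bus_gb_s / processors


def aggregate_dedicated(per_controller_gb_s, processors):
    """Total bandwidth when every processor has its own memory controller."""
    return per_controller_gb_s * processors


xeon_each = per_processor_shared(6.4, 2)    # both Xeons busy on one 6.4 GB/s FSB
opteron_2p = aggregate_dedicated(6.4, 2)    # two Opterons, 6.4 GB/s each
opteron_4p = aggregate_dedicated(6.4, 4)    # four Opterons, 6.4 GB/s each

if __name__ == "__main__":
    print(f"Dual Xeon: {xeon_each} GB/s per processor when both are busy")
    print(f"Dual Opteron: {opteron_2p} GB/s total, quad: {opteron_4p} GB/s total")
```

The dedicated results match the 12.8 GB/s (2P) and 25.6 GB/s (4P) memory bandwidth entries in the table, while the shared bus stays at 6.4 GB/s in total no matter how many processors are attached.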
Performance considerations to take into account
when defining a system are:
1) Processor(s) speed, cache
size, interconnect type and I/O architecture
2) Motherboard architecture
3) Memory Size, Speed and Access Times (CLx)
4) On board I/O controllers (on-board versus add-in cards)
So far we have only looked at processors, memory
and motherboards. We still
need to store our programs and connect to the outside world. To do this we will need disk storage and
either a Local Area Network connection or a modem/DSL connection.
Disk Performance:
Disks store data on magnetic media known as a platter. The
platter is a circle of magnetic media that can be written to through the use of a coil that flies very
close to the platter. This coil is known as a head. The platter turns underneath the head. As the
platter turns the head can read or write data on the media that is directly underneath it. To write more
data than can fit in a single track the arm holding the head must move it to the next track. The things
that affect disk performance are:
1) Platter rotation speed (affects average rotational latency)
2) Track to Track head speed
3) Track Data Density
4) Disk Cache Size
Platter rotation speed
affects average latency. This is the average time it takes for the platter to spin around until it gets
to the piece of data you are looking for on that track. The faster the disk spins the sooner you can get
(on average) to the data you want. Most disks today have a rotation speed from 5400 Rotations Per Minute
(RPM) to as high as 15,000 RPM.
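On average the data you want is half a rotation away from the head, so average rotational latency follows directly from the rotation speed. A minimal sketch for the range of speeds mentioned above (7200 RPM is added as a common intermediate value):

```python
# Illustrative sketch: average rotational latency is the time for half a rotation.
def average_rotational_latency_ms(rpm):
    """Average wait, in milliseconds, for the data to arrive under the head."""
    seconds_per_rotation = 60.0 / rpm
    return seconds_per_rotation / 2 * 1000


if __name__ == "__main__":
    for rpm in (5400, 7200, 15000):
        print(f"{rpm:6d} RPM: {average_rotational_latency_ms(rpm):.2f} ms")
```

A 15,000 RPM disk waits about 2 ms on average, against roughly 5.6 ms for a 5400 RPM disk.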
Track to Track Access Time.
If the data requested is not on the present track, the track to track access time becomes important. The
faster the disk can move the head to the track where the requested data resides the sooner the system
can get access to the needed data. Most disks today have a track to track access time from 10
milliseconds (ms) to less than 3 ms. This parameter is also sometimes listed by the drive
manufacturers as head seek time.
Track Data Density
defines how much information can be stored on a given track. The higher the track data density, the more
information the disk can store on one track. If a disk can store more data on one track it does not have
to move the head to the next track as often. This means that the higher the recording density the lower
the chances are that the head will have to be moved to the next track to get the required data.
Disk Cache Size.
Disks contain intelligent controllers, read cache and write cache. When you ask for a given piece of
data, the disk locates the data and sends it back to the motherboard. It also reads the rest of the
track and caches this data on the assumption that you will want the next piece of data on the disk. This
data is stored locally in its read cache. If, some time later, you request the next piece of data and it
is in the read cache, the disk can deliver it with almost no delay.
When the motherboard writes to
the disk the data is initially stored in the disk's write cache. This allows the motherboard to finish the write and go on to other things
while the disk is writing the data to the media.
Usually a disk with a larger cache will perform better for
small to medium transfer sizes than a disk with a smaller cache. Most disks today have cache sizes from
2 Megabytes to 8 Megabytes. Maximizing disk cache size, all other parameters being equal, will usually
provide higher overall disk performance.
Input/Output (I/O) Performance:
Most of the I/O that we add to a computer system today sits on
the PCI bus. These connections are made using the (in most cases) four to six white parallel connectors
on the motherboard. Most of the on-board peripherals are connected to an internal PCI bus. The PCI bus
has speeds ranging from: 32 bit / 33 MHz to 64 bit / 133 MHz known as 32
Bit PCI and PCI-X respectively. The first one is
the most popular today on single processor Athlon 64 and Pentium 4 PC
motherboards. Most manufacturers put the faster, 64 bit PCI slots onto
server or dual processor workstation motherboards. Most of the I/O card vendors now offer a choice of 32 bit and 64 bit I/O cards that are
"universal" in that they work in all four types of PCI slots.
To optimize the performance of a computer system, it is always
better to use 64 bit cards in 64 bit PCI slots. This type of arrangement allows the motherboard to write
or read the I/O card at twice to four times the speed of a 32 bit I/O card. As the availability of higher performance
I/O cards goes up the price will come down, making it easier and less expensive to build a higher
performance system.
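The speed-up from wider and faster PCI slots can be estimated from bus width and clock. The sketch below uses rounded clock values (33, 66 and 133 MHz), which are an assumption made here, so the results are slightly below the figures usually quoted for these buses. These are peak transfer rates, not sustained throughput.

```python
# Illustrative sketch: peak PCI bandwidth = (bus width in bytes) x clock.
# Clock values are rounded; real buses run at 33.33, 66.66 and 133.33 MHz.
def pci_peak_mb_s(width_bits, clock_mhz):
    """Theoretical peak bandwidth of a parallel PCI bus in MB/s."""
    return width_bits / 8 * clock_mhz


BASELINE = pci_peak_mb_s(32, 33)  # standard 32 bit / 33 MHz PCI slot

if __name__ == "__main__":
    for width, clock in ((32, 33), (64, 33), (64, 66), (64, 133)):
        peak = pci_peak_mb_s(width, clock)
        print(f"{width} bit / {clock:3d} MHz: {peak:6.0f} MB/s "
              f"({peak / BASELINE:.1f}x the 32 bit / 33 MHz slot)")
```

The "twice" and "four times" figures correspond to 64 bit slots at 33 MHz and 66 MHz. With these rounded clocks, a 64 bit / 133 MHz PCI-X slot comes out higher still.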
When comparing I/O cards look for their bus type and speed and
whether they contain an intelligent I/O processor on board. This I/O processor offloads some of the work
required to communicate with the peripherals connected to your system. This leaves most processing power
available for the applications running on the processor. The down side to this is that intelligent I/O
cards are more expensive than simpler cards that rely on the internal CPU for control.
Summary:
1) CPU Performance
A faster processor, with more cache,
with an integrated memory controller or with a faster FSB, that
can do more in a single clock will usually give better performance.
2) Memory Performance
Larger memory, more banks of
memory, with a faster access time will usually give
better performance.
3) Disk Performance
A disk with a higher rotation speed, with faster track to
track access time, with more cache and a higher track density will usually give better performance.
4) I/O Performance
An intelligent I/O card with a faster (66-133 MHz), wider (64 bit)
PCI interface will usually give better performance.
Note that all of the above statements read "usually". This is
because there are limitations on how fast a given function can run. An example of this is a modem. A 56K
modem will never run faster than the phone line connection. The limitation is the phone line. The same
holds true for almost all of the above items that were discussed. No matter how fast a CPU you use, if
the program you are running causes a lot of disk accesses the CPU will still have to wait for the disk
access to complete before it can continue. This is called "waiting faster" (more unused CPU clock in the
same wait period). The overall key to high performance is to balance the speed of all the components
listed above so that it is possible to get the best overall system performance for a given dollar spent
on a computer.
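The "waiting faster" effect can be shown with a simple model in which a job's run time is CPU time plus disk wait time. The workload numbers below are hypothetical and chosen only to make the point; real programs can overlap some computing with I/O.

```python
# Illustrative sketch of "waiting faster": only the CPU part of a job speeds up.
# The 2 second / 8 second workload split is an assumed example.
def total_time(cpu_seconds, disk_wait_seconds, cpu_speedup=1.0):
    """Run time when the CPU portion is sped up and the disk wait is unchanged."""
    return cpu_seconds / cpu_speedup + disk_wait_seconds


baseline = total_time(2.0, 8.0)                     # original CPU
faster_cpu = total_time(2.0, 8.0, cpu_speedup=2.0)  # CPU twice as fast

if __name__ == "__main__":
    print(f"baseline: {baseline} s, with 2x CPU: {faster_cpu} s")
    print(f"overall speed-up: {baseline / faster_cpu:.2f}x")
```

In this example, doubling the CPU speed improves the whole job by only about 11%, because the disk wait dominates. Spending the same money on the disk subsystem would likely help more, which is the balance the summary argues for.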
If you have questions about the above article or want to
discuss the trade-offs of a system design please email or contact us. Our specialty is providing the
best price / performance ratio for a given application.