Currently, the need for 64-bit computing is driven by applications that require large amounts of virtual and physical memory. Even so, AMD and Intel are developing 64-bit processors for personal computers as well as servers, because home users and developers continue to demand more performance. AMD64 is the name of the 64-bit architecture that AMD uses in its line of processors. This paper focuses mainly on the AMD64 architecture rather than on a specific processor; where a specific AMD64 processor is mentioned, it is the AMD Athlon 64, the processor for home and personal use. IA-64 is the name of Intel’s new 64-bit architecture. Intel has developed two server processors for it: the Itanium and the Itanium 2. Both are referred to throughout this document. The purpose of this paper is to show the features and architectures of both the AMD64 and Intel IA-64 processors.
Today, nearly all personal computers use a 32-bit processor that runs a 32-bit operating system. In 1965, Gordon Moore, who later co-founded Intel, predicted that the number of transistors on an integrated circuit would keep doubling at a regular interval, now usually quoted as every couple of years. This prediction is known as Moore’s Law, and it still holds true today. 32-bit processors have existed since the mid-1980s, and they can no longer gain substantial speed efficiently without a major architectural change. To keep pace with the growth that Moore’s Law describes, a new, faster 64-bit processor needed to be created. A remark often attributed to Bill Gates in 1981 is that “640K ought to be enough for anybody”. 640K of memory is certainly not enough for anybody today, and 32-bit processors are slowly meeting the same fate. The 64-bit processors came from a need for faster computing and a desire for new technology. In 1991 Intel began to formulate plans for its first 64-bit processor. In 1992 DEC (Digital Equipment Corporation) released its 64-bit Alpha processor. Today, Intel has developed the IA-64 architecture and AMD has developed the AMD64 architecture.
An average person might expect a 64-bit processor to run twice as fast as a 32-bit processor, but this is not necessarily true. A 64-bit processor can handle numbers that are 64 bits long in a single operation, while a 32-bit processor can only handle numbers that are up to 32 bits long. Few modern applications require numbers larger than 32 bits; the main exceptions are digital video, graphics, engineering, and scientific programs, because of their purpose and complex algorithms. If a 32-bit processor encounters a number larger than 32 bits, it has to split the number into two 32-bit sections and process each one, which costs extra processing time. A 64-bit processor can handle that number without splitting it, so for such numbers it can be nearly twice as fast as a 32-bit processor running at the same clock speed. Now assume that two processors run at the same speed, one 64-bit and the other 32-bit, and both encounter a 30-bit number. Both processors will handle that number at exactly the same speed. This is why a 64-bit processor does not necessarily run twice as fast as a 32-bit processor.
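The splitting step described above can be sketched in C. This is a minimal illustration, assuming a simple addition; the function names are hypothetical and chosen only for this example.

```c
#include <stdint.h>

/* Add two 64-bit values the way a 32-bit processor must:
   split each into 32-bit halves, add the halves, propagate the carry. */
uint64_t add64_via_halves(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint32_t lo    = a_lo + b_lo;         /* first 32-bit add (wraps on overflow) */
    uint32_t carry = (lo < a_lo);         /* 1 if the low half overflowed */
    uint32_t hi    = a_hi + b_hi + carry; /* second 32-bit add */

    return ((uint64_t)hi << 32) | lo;
}

/* On a 64-bit processor the same work is a single add. */
uint64_t add64_native(uint64_t a, uint64_t b)
{
    return a + b;
}
```

Both functions return the same result, for example 0x200000000 for the inputs 0x1FFFFFFFF and 1, but the first needs two additions and a carry check where the second needs one addition. For operands that fit in 32 bits, the high halves are zero and the extra work buys nothing.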
One of the main benefits of 64-bit processors is their ability to access more memory. A 32-bit processor can normally address only 4 gigabytes of memory. There are a few tricks and workarounds that allow 32-bit processors to use more than 4 gigabytes, but they are not necessarily stable and they offer little room for future growth. A 64-bit processor can theoretically address about 18 million terabytes, or 18 billion gigabytes. While no one can currently imagine a need for that much memory, the potential is there, which shows that 64-bit processors have a long future.
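These limits follow directly from the address width, assuming one byte per address. The worked arithmetic below is a sketch of where the figures come from:

```latex
\begin{align*}
2^{32}\ \text{bytes} &= 4{,}294{,}967{,}296\ \text{bytes} = 4\ \text{GB (binary gigabytes)} \\
2^{64}\ \text{bytes} &= 18{,}446{,}744{,}073{,}709{,}551{,}616\ \text{bytes}
  \approx 1.8 \times 10^{19}\ \text{bytes} \\
 &\approx 18\ \text{billion GB} \approx 18\ \text{million TB (decimal units)}
\end{align*}
```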
Currently the only systems that really need 64-bit processors are servers and supercomputers. A server needs 64-bit computing because it needs to support more memory. While a personal computer handles only one user, a server may need to handle thousands of users simultaneously, and handling that many users requires a lot of memory. Supercomputers need 64-bit technology because they use large floating-point numbers, and many of them process huge engineering calculations. “Seattle-based Cray Inc. is building a massively parallel processing supercomputer, nicknamed Thor's Hammer, for weapons research by the National Nuclear Security Administration at the Sandia National Laboratory in Albuquerque” (Kay, 2004, pg. 1). That supercomputer needs 64-bit processors to handle its complex calculations.
Even though there is currently little need for 64-bit processors in personal computers, AMD and Intel have created such processors and continue to develop them. One reason for developing personal 64-bit processors is the advancement of technology and the increase in speed. If 64-bit processors did not exist, there would be little incentive to develop new software and equipment that takes advantage of them. Computers would remain stuck at 32 bits and would eventually reach a point where no further speed increase was possible. Andy Yaschenko, an author at XBit Labs, said, “As for the common PCs, 64bit will find no real application” (Yaschenko, 2003, pg. 3). Unfortunately, Yaschenko bases his view of future technology on past and current technologies. Even though a need cannot be foreseen today, the past has shown that a need will eventually appear and help drive 64-bit advancements.
The AMD Athlon 64 was released to the public in September 2003 as the first 64-bit processor designed specifically for personal computing. The AMD64 architecture it implements extends the x86 architecture that was first used by Intel’s 8086. Because AMD64 is built on x86, it is backwards compatible with 32-bit applications and operating systems. To provide this backwards compatibility, the architecture has two main operating modes that are divided into five sub modes. Below is a table that shows the operating modes and their features (AMD, 2003).
The two main operating modes are long mode and legacy mode. Long mode consists of two sub modes: 64-bit mode and compatibility mode. For long mode to be active, the computer must be running a 64-bit operating system. 64-bit mode is used to run 64-bit applications; the operating system enables it on an individual code-segment basis, and it supports all the new features and register extensions of the AMD64 architecture. Compatibility mode allows a 64-bit operating system to run 16-bit and 32-bit applications without recompilation. To the application, compatibility mode looks like a 32-bit operating environment, while the operating system continues to manage the application with its 64-bit mechanisms. One limitation of compatibility mode is that 16-bit and 32-bit applications are still limited to a maximum of 4GB of memory space (AMD, 2003).
The other main operating mode is legacy mode. Legacy mode is separated into three sub modes: protected mode, virtual-8086 mode, and real mode. Legacy mode is active when a 32-bit or 16-bit operating system is running. Protected mode supports 16-bit and 32-bit applications, and an application can access up to 4GB of memory space. Virtual-8086 mode supports 16-bit programs running under a 32-bit operating system; these programs can access only 1MB of memory space. Real mode supports 16-bit programs running under a 16-bit operating system. Just as in virtual-8086 mode, programs are limited to 1MB of memory space (AMD, 2003).
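The mode hierarchy described in the last two paragraphs can be summarized as a small lookup. This is only a sketch: the enum and function names are hypothetical, and the limits are the ones quoted in the text above.

```c
#include <stdint.h>

/* The five AMD64 operating sub modes named in the text. */
typedef enum {
    MODE_64BIT,          /* long mode   */
    MODE_COMPATIBILITY,  /* long mode   */
    MODE_PROTECTED,      /* legacy mode */
    MODE_VIRTUAL_8086,   /* legacy mode */
    MODE_REAL            /* legacy mode */
} amd64_mode;

/* Returns 1 if the sub mode belongs to long mode (64-bit OS required), else 0. */
int is_long_mode(amd64_mode m)
{
    return m == MODE_64BIT || m == MODE_COMPATIBILITY;
}

/* Application memory-space limit in bytes for the modes where the text
   gives one. Returns 0 for 64-bit mode, for which no figure is quoted. */
uint64_t app_address_limit(amd64_mode m)
{
    switch (m) {
    case MODE_COMPATIBILITY:
    case MODE_PROTECTED:
        return 4ULL << 30;   /* 4 GB */
    case MODE_VIRTUAL_8086:
    case MODE_REAL:
        return 1ULL << 20;   /* 1 MB */
    default:
        return 0;            /* 64-bit mode: not stated in this section */
    }
}
```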
One new feature of the AMD64 is its register extensions. “64-bit mode implements register extensions through a new group of instruction prefixes, called REX prefixes” (AMD, 2003, pg. 8). The extensions add eight new GPRs (General Purpose Registers), and all GPRs are widened to 64 bits. Eight new 128-bit XMM registers, the media registers used by SIMD instructions, are added as well. The instruction pointer is also extended to the 64-bit RIP, and data can be addressed relative to it. Finally, the opcodes are extended to support 64-bit addressing and the register extensions (AMD, 2003).
The General Purpose Registers that are available depend on the operating mode the system is in. In legacy mode or compatibility mode, a total of 24 different variations of the 8 GPRs can be used: eight 8-bit registers, eight 16-bit registers, and eight 32-bit registers. The processor determines which register to use based on the type of instruction, opcode, address size, or stack size. In 64-bit mode a total of 68 different variations can be used: sixteen 8-bit low-byte registers, four 8-bit high-byte registers, sixteen 16-bit registers, sixteen 32-bit registers, and sixteen 64-bit registers. The AMD64 architecture therefore gives 64-bit mode, and only 64-bit mode, 8 more GPRs and a total of 44 more variations (AMD, 2003).
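The register counts above can be checked with simple arithmetic:

```latex
\begin{align*}
\text{legacy / compatibility mode:}\quad & 8 + 8 + 8 = 24 \\
\text{64-bit mode:}\quad & 16 + 4 + 16 + 16 + 16 = 68 \\
\text{additional variations:}\quad & 68 - 24 = 44
\end{align*}
```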
Two special registers that exist in the AMD64 architecture are the flags register and the instruction pointer. Just like the GPRs, these two registers are treated differently depending on which operating mode the system is in. In legacy real mode or virtual-8086 mode, software has access only to the 16-bit FLAGS register; in legacy protected mode or compatibility mode it has access to the 32-bit EFLAGS register; and in 64-bit mode it has access to the 64-bit RFLAGS register. The FLAGS, EFLAGS, and RFLAGS registers all function the same way, except that the smaller ones have fewer bits. Their purpose is to hold control and status bits that the application can read and use. The other special register is the instruction pointer. In legacy or compatibility mode the instruction pointer is either the 16-bit IP or the 32-bit EIP. In 64-bit mode it is extended to 64 bits and is called the RIP register. The contents of the instruction pointer are not directly readable by software, but they are pushed onto the stack when a procedure is called. The purpose of the instruction pointer is to hold the address of the next instruction to be executed. Both of these special registers operate the same way they traditionally have; the only difference is that AMD extended them to support 64 bits (AMD, 2003).
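As a small illustration of how individual bits in a flags register are used, the sketch below tests three status flags in a saved flags value. The bit positions are the standard x86 ones (carry in bit 0, zero in bit 6, sign in bit 7) and are not taken from this paper; the macro and function names are hypothetical.

```c
#include <stdint.h>

/* Bit positions of three status flags in the low 16 bits of
   FLAGS/EFLAGS/RFLAGS (standard x86 layout, assumed here). */
#define FLAG_CF (1u << 0)   /* carry flag */
#define FLAG_ZF (1u << 6)   /* zero flag  */
#define FLAG_SF (1u << 7)   /* sign flag  */

/* Returns 1 if the given flag bit is set in a saved flags value, else 0. */
int flag_is_set(uint64_t rflags, uint32_t flag)
{
    return (rflags & flag) != 0;
}
```

Because the same low bits exist in all three register widths, a test like this works the same way whether the value came from FLAGS, EFLAGS, or RFLAGS.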
Another set of special registers is the 128-bit XMM media registers. “These registers perform integer and floating point operations primarily on vector operands” (AMD, 2003, pg.127). The XMM registers are available in all operating modes, but how many are available depends on the mode. When the computer is operating in a 32-bit or 16-bit mode, applications work normally but have access to only 8 XMM registers. To use the extra registers, an older application must be recompiled for 64-bit mode; an application that was already compiled for 64-bit mode needs no changes. Recompiling brings several benefits: the application gains eight more XMM registers, for a total of sixteen 128-bit XMM registers; it has access to all 16 GPRs, each extended to 64 bits; and it can use 64-bit virtual addressing and the RIP instruction pointer. The additional XMM registers are valuable because they let an application keep more vectors in registers and operate on them in parallel. Applications that will make use of the 16 XMM registers include speech recognition programs, 2D and 3D graphics programs, professional CAD programs, and HDTV (High Definition Television) streaming media programs. The XMM registers can process all of this data in parallel because of SIMD (Single Instruction Multiple Data) instructions, which access and manipulate several data elements with one instruction. Because multiple instructions are not needed, processing time is saved. Overall, the addition of eight XMM registers allows 64-bit applications to run more smoothly and faster (AMD, 2003).
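The idea behind a SIMD instruction can be modeled in portable C. The sketch below is not real SIMD code: it treats one 128-bit register as four 32-bit floats and uses an ordinary loop to show the work that a single packed-add instruction performs on all four lanes at once. The type and function names are hypothetical.

```c
/* Portable model of one 128-bit XMM register viewed as four 32-bit floats. */
typedef struct {
    float lane[4];
} vec4f;

/* Build a vector from four scalars. */
vec4f vec4f_of(float a, float b, float c, float d)
{
    vec4f v = {{a, b, c, d}};
    return v;
}

/* Models a single packed-add SIMD instruction: four additions, one per
   lane, written as a loop so the sketch runs on any machine. */
vec4f vec4f_add(vec4f x, vec4f y)
{
    vec4f r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = x.lane[i] + y.lane[i];
    return r;
}

/* Read one lane of a vector. */
float vec4f_get(vec4f v, int i)
{
    return v.lane[i];
}
```

On real hardware the loop body would be replaced by one instruction operating on two XMM registers, which is where the saving in processing time comes from.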
The AMD64 architecture handles four different data types. The four different types are signed integers, unsigned integers, BCD (Binary Coded Decimal) digits, and Packed BCD digits.
Signed and unsigned integers come in five sizes: byte, word, doubleword, quadword, and double quadword (8, 16, 32, 64, and 128 bits, respectively). The diagram above shows the capacity of each type of integer. The sign bit of a signed integer is stored in the most significant bit. The architecture addresses memory using little-endian byte order, which means that the least significant byte is stored at the lowest byte address. A BCD digit has a binary value ranging from 0 to 9: the lowest BCD digit is 0000, while the highest is 1001. Because a BCD digit needs only four bits while a byte contains eight, storing one BCD digit per byte wastes half of the byte. Packed BCD solves this by holding two BCD digits in one byte, one digit from 0000 to 1001 in each half. The maximum value a packed BCD byte can hold is 99, or 10011001 (AMD, 2003).
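Packing two BCD digits into one byte can be sketched as follows. The function names are hypothetical, and the code assumes each input digit is already in the range 0 to 9.

```c
#include <stdint.h>

/* Pack two decimal digits (each 0-9) into one byte: the high digit goes
   in the upper four bits, the low digit in the lower four bits. */
uint8_t bcd_pack(uint8_t high_digit, uint8_t low_digit)
{
    return (uint8_t)((high_digit << 4) | low_digit);
}

/* Recover the decimal value (0-99) from a packed BCD byte. */
uint8_t bcd_unpack(uint8_t packed)
{
    return (uint8_t)((packed >> 4) * 10 + (packed & 0x0F));
}
```

Packing the digits 9 and 9 gives the byte 10011001 (0x99 in hexadecimal), which unpacks to the decimal value 99, the maximum mentioned above.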
The entire address space that a program can use is called virtual memory. Virtual memory is translated by hardware and operating-system software into a smaller physical memory space, which is located in either main memory or on the hard disk. Virtual memory is treated differently depending on which AMD64 operating mode the system is currently in. In the legacy modes, virtual memory is treated the same as it would be on a 32-bit processor with a 32-bit operating system. In 64-bit mode, the processor uses a flat model of virtual memory: the 64-bit virtual memory space is treated as a single, flat address space, and programs can address locations anywhere in it. In compatibility mode, the processor uses a protected, multi-segment model of virtual memory. Legacy protected mode uses the same virtual memory model as compatibility mode (AMD, 2003).
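The difference between the two memory models can be sketched in a few lines. This is a deliberate simplification that ignores limit checks, protection, and paging: in a segmented model an address is formed from a segment base plus an offset, while in a flat model the base is treated as zero. The function names are hypothetical.

```c
#include <stdint.h>

/* Segmented model: the address used is the segment's base plus an offset. */
uint64_t segmented_address(uint64_t segment_base, uint64_t offset)
{
    return segment_base + offset;
}

/* Flat model: the segment base is treated as zero, so the offset a
   program uses is already its virtual address. */
uint64_t flat_address(uint64_t offset)
{
    return segmented_address(0, offset);
}
```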
The AMD64 architecture dispenses with most of the legacy segmentation functions in 64-bit mode. AMD’s reasoning is that most modern operating systems do not use the segmentation features available in the x86 architecture and instead handle those functions in software, which costs some efficiency. The AMD64 approach allows new 64-bit operating systems to be coded more simply, and it supports more efficient management of multi-programming environments than is possible in the legacy x86 architecture (AMD, 2003).
AMD has designed new technology to help reduce bottlenecks within the computer system. In most systems, data that needs to go to the video card or main memory has to pass through a motherboard chip, and data for USB, PCI, or hard drives usually has to pass through two motherboard chips. Since the data has to squeeze through all these chips, the processor ends up waiting for it, and this waiting creates bottlenecks within the system. The first advancement AMD made to reduce bottlenecks was to build the DDR memory controller directly into the processor; typically the memory controller is built into the motherboard. This allows data to transfer directly between the processor and main memory, which significantly reduces data access time because it gets rid of the “middleman” that was slowing the process down. The other advancement that AMD built into its processors is HT (HyperTransport) technology. “HT is a high-speed data carrying method designed to replace or supplement many of the traditional input/output methods that can cause bottlenecks on modern motherboards” (Dowler, 2003, pg. 4). HT provides point-to-point links between components, and it runs at varying speeds depending on the component it is connected to. HT uses these high-bandwidth links to send data to the memory and to the two chips on the motherboard, each of which has a built-in HT bridge so that it can support the high transfer speeds. HyperTransport links differ from traditional links in that they carry packets of data, similar to today’s Ethernet technology: the address of the data and the data itself are sent over the same lines. Traditional methods use separate address lines and data lines, which adds complexity and takes up space on the motherboard and the processor.
HT links can transfer data at speeds of up to 800 MHz DDR, which allows them to carry as much data as traditional buses, or more, on fewer data lines. The overall speedup comes when slower buses hand their data off to the high-speed HT links: instead of moving through a series of slow buses, the data is transferred to the HT “highway” so that it can move quickly. These two design advancements in the AMD64 architecture allow the processor and motherboard to transfer data much faster and reduce bottlenecks within the system (Dowler, 2003).
The Itanium processor is the first processor to implement the IA-64 instruction set. One of the major improvements of the IA-64 architecture over the IA-32 architecture is that it employs predication. Predication attaches a predicate bit to an instruction, and that bit determines whether the instruction takes effect, without disturbing program flow. If the predicate bit is set, the instruction is executed normally; if it is not set, the instruction is ignored, and no break occurs in the program flow.
Here is a C-source code example courtesy of ChipGeek:
if (x == 4)
    z = 9;
else
    z = 0;
Using the IA-32 architecture, the instructions follow this scheme:
1. Compare x to 4
2. If not equal goto line 5
3. z = 9
4. goto line 6
5. z = 0
6. // Program continues from here
No matter what the value of x is, there is going to be at least one break in the instruction flow (either at line 2 or at line 4). The IA-64’s use of the predicate bit allows the machine to avoid this break. Using the same C source code as above, the IA-64 scheme would be the following:
1. Compare x to 4 and store result in a predicate bit (we'll call it P)
2. If P==1; z = 9
3. If P==0; z = 0
An instruction’s result is written only if the value of P matches its predicate condition; otherwise the result is ignored. All three lines are performed sequentially without an interruption in program flow. If x is 4, for example, only the result from line 2 is kept, because that is the only predicate condition that matches the result of the compare in line 1 (Hodgin, 2001). In this way the IA-64 architecture removes one of the biggest bottlenecks of the IA-32 architecture.
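The predicated scheme can be imitated in plain C by computing a predicate and using it to select a result without an if/else. This is only a sketch of the idea: the function name is hypothetical, and a C compiler is free to translate it into whatever machine instructions it prefers.

```c
/* Branch-free version of: if (x == 4) z = 9; else z = 0;
   Both candidate results exist and the predicate selects which one is
   kept, mimicking predicated execution. */
int predicated_select(int x)
{
    int p = (x == 4);        /* predicate: 1 if the compare is true, else 0 */
    int z_if_true  = 9;      /* result guarded by P == 1 */
    int z_if_false = 0;      /* result guarded by P == 0 */
    return p * z_if_true + (1 - p) * z_if_false;
}
```

Calling the function with 4 yields 9, and any other argument yields 0, matching the original C example without a jump in the source-level control flow.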
The Itanium processor was designed for high performance Internet servers and workstations. It supports 64-bit addressing, full IA-32 instruction set compatibility, and scalability across a wide range of operating systems.
The processor is currently offered at speeds of 733 MHz and 800 MHz, with 32KB of L1 cache, 96KB of L2 cache, and either 2MB or 4MB of four-way set-associative L3 cache on two or four 1MB chips. The 4MB L3 cache uses 294.8 million transistors and gives 12.8GBps of memory bandwidth at 800MHz. With this much cache, there is a good chance that the required data or instructions are already held in cache. As a result, bus traffic is reduced and overall performance improves (Simon, 2000).
Since the Itanium was designed for workstations and servers with anywhere from 1 to 4000 processors, the different levels of cache and buses had to be optimized. The Level 3 bus offers fast communication between multiple CPUs. The large L2 cache reduces traffic in the CPU by keeping data close to the CPU that is using it. The Itanium also features page sizes from 4KB to 256MB. This gives the Itanium the flexibility to access small amounts of memory in small chunks and large amounts of memory in large chunks (Simon, 2000).
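The benefit of a wide range of page sizes can be seen by counting how many pages are needed to cover a region of memory. The helper below is a minimal sketch with a hypothetical name; the 1GB region in the usage note is chosen only for illustration.

```c
#include <stdint.h>

/* Number of pages of a given size needed to cover a region (rounds up). */
uint64_t pages_needed(uint64_t region_bytes, uint64_t page_bytes)
{
    return (region_bytes + page_bytes - 1) / page_bytes;
}
```

Covering a 1GB region takes 262,144 pages of 4KB but only 4 pages of 256MB, so large pages mean far fewer page entries to manage, while small pages avoid wasting memory on small allocations.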
Intel also uses data speculation and cache hints in the Itanium processor. “Data speculation is caching and calling for data that may be needed or may be changed before it is needed, so that, in the case that the data is needed and it has not changed, the CPU does not have to take a latency impact from calling for the data. The processor, with the help of compiled instructions, looks ahead, anticipates what info it may need, and then brings it to cache or into the processor” (Simon, 2000, pg. 7). Doing this helps hide memory latency. Cache hints are two-bit markers set on memory loads, and they help the processor quickly find the data that it needs in cache (Simon, 2000).
The Itanium processor uses the EPIC (Explicitly Parallel Instruction Computing) architecture. This architecture allows the processor to run instructions in parallel with other instructions. The compiler packs the EPIC instructions into fixed-size structures named “bundles” and marks groups of instructions that can safely execute together. There is no maximum size limit for such a group of “bundled” instructions. The instructions in a group do not affect each other, which allows multiple instructions to be handled concurrently without getting in each other’s way. The EPIC architecture relies heavily on compiler technology, because it is at the compiler stage that instructions are “bundled”. Therefore, any change in compiler technology has a direct effect on the performance of the Itanium processor (Simon, 2000).
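To make the idea of a bundle more concrete, the sketch below pulls the fields out of one bundle. The layout it assumes (a 128-bit bundle holding a 5-bit template followed by three 41-bit instruction slots) comes from the published IA-64 architecture rather than from the sources cited in this paper, so it should be checked against Intel’s manuals. The type and function names are hypothetical.

```c
#include <stdint.h>

#define SLOT_MASK ((1ULL << 41) - 1)   /* low 41 bits set */

/* A 128-bit bundle stored as two 64-bit halves (lo holds bits 0-63). */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} ia64_bundle;

/* Bits 0-4: the template, which describes the kinds of instructions
   held in the three slots. */
unsigned bundle_template(ia64_bundle b)
{
    return (unsigned)(b.lo & 0x1F);
}

/* Extract instruction slot 0, 1, or 2 (41 bits each). */
uint64_t bundle_slot(ia64_bundle b, int slot)
{
    switch (slot) {
    case 0:  return (b.lo >> 5) & SLOT_MASK;                    /* bits 5-45   */
    case 1:  return ((b.lo >> 46) | (b.hi << 18)) & SLOT_MASK;  /* bits 46-86  */
    default: return (b.hi >> 23) & SLOT_MASK;                   /* bits 87-127 */
    }
}
```

The compiler, not the processor, decides what goes into each slot, which is why the text stresses how heavily EPIC depends on compiler technology.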
The Itanium processor has four pipelined Arithmetic Logic Units (ALUs), each of which can process one integer calculation per cycle. The Itanium also has 128 floating-point registers and 128 integer registers. In addition to being plentiful, the registers are able to rotate. This enhances CPU performance by allowing the CPU to operate on multiple registers and process large amounts of data (Intel, 2003c).
The Itanium uses a 2.1 GBps multi-drop system bus, so the flow of instructions to the processor is plentiful. The first-generation systems used dual-ported SDRAM memory, providing 4.2 GBps of memory bandwidth. Later generations used DDR and SDRAM. With these speeds, combined with the large amount of cache and cache bandwidth, the Itanium can process many terabytes of information (Simon, 2000).
Intel says that the Itanium has significant error checking capabilities. Itanium processors employ an Enhanced Machine Check Architecture (MCA) with extensive Error Correcting Code (ECC) and parity error checking on most processor caches and buses (Intel, 2004a). These capabilities give Itanium-based machines the ability to recognize errors, attempt to fix them, or flag the affected data as corrupted (Simon, 2000).
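Parity checking, the simpler of the two mechanisms mentioned above, can be sketched in a few lines. This is a generic illustration of even parity over a 64-bit word, not a description of how the Itanium implements it; the function names are hypothetical.

```c
#include <stdint.h>

/* Even-parity bit for a 64-bit word: 1 if the word has an odd number of
   1 bits, so that the word plus its parity bit has an even count. */
unsigned parity_bit(uint64_t word)
{
    unsigned p = 0;
    while (word) {
        p ^= (unsigned)(word & 1);
        word >>= 1;
    }
    return p;
}

/* A stored word passes the check if its recomputed parity matches the
   parity bit saved alongside it. This detects any single-bit error. */
int parity_ok(uint64_t word, unsigned stored_parity)
{
    return parity_bit(word) == stored_parity;
}
```

Parity can only detect an error; correcting one requires the extra check bits that an error correcting code stores with each word.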
Intel Itanium 2
The Itanium 2 processor takes the Itanium to an even higher level. There are three different types of Itanium 2 processor. The first is the Itanium 2 with 6MB L3 cache for MP (multiprocessor) and DP (dual-processor) servers and workstations. It is offered at speeds of 1.50 GHz, 1.40 GHz, and 1.30 GHz, with 32 KB of Level 1 cache, 256 KB of Level 2 cache, and 6MB, 4MB, or 3MB of integrated Level 3 cache. The second is the 1.40 GHz Itanium 2 with 3MB L3 cache, optimized for DP servers and workstations. Its features are much like those of the first Itanium 2, except that it has 3MB or 1.5MB of integrated L3 cache and is optimized for dual-processor systems. The third is the low-voltage Itanium 2, optimized for higher-density DP servers and workstations. Its features are much like those of the other two processors: it runs at 1 GHz, has 1.5MB of integrated L3 cache, is optimized for dual-processor systems, and has a maximum power consumption of only 62 watts. All three processors are based on the EPIC architecture, have the enhanced machine check architecture with extensive error correcting code, and support the HP-UX, Linux, and Windows 2003 operating systems (Intel, 2004b).
Using the E8870 chipset, the Itanium 2 provides a peak memory bandwidth of 6.4 GB/sec. This chipset also allows up to four DDR SDRAM DIMMs per channel, for a total of up to 128 GB of memory using thirty-two 4 GB DDR SDRAM DIMMs. Two scalability ports provide 12.8 GB/sec of maximum bandwidth for future expansion. Intel has also balanced the system bus, memory, and I/O, giving greater performance for the entire platform (Intel, 2003b). The pipeline inside the Itanium 2 is also shorter.
Overall, the Itanium 2 far exceeds the Itanium. Multiple cores with a larger cache take performance to a new level, and power consumption is reduced. Multithreading has been enabled, which increases performance by up to 30% for multithreaded applications. An even larger cache further reduces the chance of having to go to main memory for a needed instruction. The EPIC architecture has been further optimized so that up to 6 instructions can be performed simultaneously, instead of the 3 in the original Itanium processor (Intel, 2003a).
In conclusion, the Itanium and Itanium 2 processors meet the demands of a wide range of enterprise workloads. Through the use of EPIC technology, the processor shifts the balance of responsibilities between software and hardware. With its large amount of cache and high-speed bandwidth, the time needed to acquire data and instructions is brought to an all-time low. Terabytes of data can be handled over the web quickly and with ease. The Itanium family also provides support for 64 bits of addressing, full IA-32 instruction set compatibility, and scalability across a wide range of operating systems and multiprocessor platforms. Currently Intel dominates the microprocessor market. However, most applications operate at the 32-bit level, and organizations are reluctant to experiment with new technology unless they feel that the benefits outweigh the risks. Therefore, widespread use of the 64-bit architecture within organizations will be a gradual evolution. The Itanium family of processors must prove that it is reliable, efficient, and capable of improving performance.
Overall, there are many similarities between the AMD64 architecture and the Intel IA-64 architecture. Both architectures support the x86 instruction set, although the IA-64 needs a decoder to do so, which slows down the processing of those instructions. According to ExtremeTech, the reason AMD and Intel have similar 64-bit x86 instruction sets is that Intel reverse engineered AMD’s 64-bit architecture (Hackman, 2004). While this may be true, it only helps simplify future problems, because it leaves Intel’s 64-bit x86 extensions largely compatible with the AMD64 instruction set. Both Intel and AMD can foresee the future of 64-bit computing, and they are still rapidly developing advancements within their processors. This document has shown the specific architectures and features of the AMD64 and Intel Itanium processors.