NVIDIA is going to release a DirectX 11 graphics card series!
NVIDIA’s DirectX 11 Architecture: GF100 (Fermi) In Detail
Article written by Mark Poppin and BFG10K, AlienBabelTech Senior Editors.
Introduction
At the GPU Technology Conference (GTC) last September 30th, NVIDIA announced its next-generation graphics architecture, codenamed Fermi. We covered it in a three-part series. At the GTC, graphics performance was not the focus of Tesla Fermi. Rather, the conference emphasized NVIDIA’s new architecture as a revolutionary general-purpose processor that takes far better advantage of the GPU’s fast parallel processing than the current architecture does. NVIDIA’s goal is to dominate the professional market with its Tesla GPUs. Now that Fermi GF100 GPUs for NVIDIA’s new video cards are finally in mass production, we will look at how NVIDIA intends to dominate gaming.
To summarize the new architecture, Fermi boasts a brand-new shader core built from compute clusters that NVIDIA calls streaming multiprocessors (SMs). Each stream processor within an SM has a fully pipelined integer arithmetic logic unit (ALU) and floating-point unit (FPU). Each SM can issue two independent instructions per clock, one to each of two different warps. Each instruction is executed by a 16-way SIMD block that handles single-precision fused multiply-add (FMA) instructions. The Fermi memory hierarchy is also new, with a unified L2 cache that serves all of the SMs without partitions. In addition, a unified memory space lets each SM address not only its own local registers and shared memory but also the L2 cache and beyond.
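The issue arithmetic in that summary can be sketched in a few lines of Python. This is a back-of-the-envelope model, not NVIDIA’s scheduler: the 16-way SIMD width, the 32-thread warp, the 512 cores and the 16 SMs are figures given in this article, while the function name and the assumption that every core retires one FMA per clock are ours.

```python
WARP_SIZE = 32     # threads per warp
SIMD_WIDTH = 16    # lanes in one SIMD block
CUDA_CORES = 512   # total stream processors on the chip
SM_COUNT = 16      # streaming multiprocessors on the chip


def issue_model():
    """Return simple per-SM and per-chip issue figures for the model."""
    cores_per_sm = CUDA_CORES // SM_COUNT
    # Two 16-lane blocks per SM line up with the two warps issued per clock.
    simd_blocks_per_sm = cores_per_sm // SIMD_WIDTH
    # A 32-thread warp needs two passes through a 16-lane block.
    clocks_per_warp_instruction = WARP_SIZE // SIMD_WIDTH
    # Assumption: every core retires one single-precision FMA per clock.
    peak_fma_per_clock = CUDA_CORES
    return {
        "cores_per_sm": cores_per_sm,
        "simd_blocks_per_sm": simd_blocks_per_sm,
        "clocks_per_warp_instruction": clocks_per_warp_instruction,
        "peak_fma_per_clock": peak_fma_per_clock,
        "peak_flops_per_clock": 2 * peak_fma_per_clock,  # FMA = multiply + add
    }


print(issue_model())
```

Multiplying the last figure by a shader clock would give a peak FLOPS estimate; no clock speed is given here, so the sketch stops at per-clock numbers.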
The GF100 features a 768 KB unified L2 cache as part of a rather complex cache hierarchy. Many other areas of GPU-compute performance are also improved over GT200, NVIDIA’s current Tesla-architecture GPU. The GF100 hardware can sustain peak single-precision (SP) and double-precision (DP) FMA instruction throughput. Atomic instruction throughput is greatly improved over the current generation, and Fermi supports ECC, which is essential for GPU computing. This all comes together to support a new type of multi-threading technology that improves the efficiency of the 512 cores working together. The entire Fermi family is compatible with the DirectX 11, OpenGL 3.x and OpenCL 1.x application programming interfaces (APIs). The new chips are finally in mass production on TSMC’s 40 nm process.
Let’s go ahead and see what is new and improved with GF100.
GF100 Architecture
Let’s look at the diagrams:
The first diagram, from NVIDIA’s slides, is the GF100 block diagram. It shows the Host Interface, the GigaThread Engine, four GPCs, six memory controllers, six ROP partitions, and a 768 KB L2 cache. Each GPC contains four PolyMorph Engines. The ROP partitions are immediately adjacent to the L2 cache. The second image illustrates how GF100’s graphics architecture is built from a number of hardware blocks called Graphics Processing Clusters (GPCs). A GPC contains a Raster Engine and up to four SMs. The third image illustrates how it all works together.
First, the GPU reads CPU commands via the Host Interface. The GigaThread Engine then fetches data from system memory and copies it to the framebuffer. GF100 implements six 64-bit GDDR5 memory controllers, for a 384-bit total, which provides high-bandwidth access to the framebuffer. Next, the GigaThread Engine creates thread blocks and dispatches them to the various SMs. Each SM in turn schedules warps (groups of 32 threads) to CUDA cores and to the other execution units. The GigaThread Engine also redistributes work to the SMs when work expansion occurs in the graphics pipeline.
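As a rough illustration of that dispatch path, the Python sketch below splits thread blocks into 32-thread warps and hands the blocks to the 16 SMs. The round-robin policy and the function names are assumptions made for the example; the real GigaThread Engine’s distribution policy is not described here.

```python
from math import ceil

WARP_SIZE = 32   # threads per warp
SM_COUNT = 16    # streaming multiprocessors on GF100


def warps_in_block(threads_per_block):
    """Number of warps an SM must schedule for one thread block."""
    return ceil(threads_per_block / WARP_SIZE)


def dispatch_blocks(num_blocks, threads_per_block):
    """Hand blocks to SMs round-robin (a simplifying assumption)
    and return the warp count that lands on each SM."""
    warps_per_sm = [0] * SM_COUNT
    for block_id in range(num_blocks):
        warps_per_sm[block_id % SM_COUNT] += warps_in_block(threads_per_block)
    return warps_per_sm


# 40 blocks of 256 threads: 8 warps per block, and the first 8 SMs
# receive a third block.
print(dispatch_blocks(40, 256))
```

A block whose size is not a multiple of 32 still occupies a whole number of warps, which is why the sketch rounds up.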
In the first image, the rectangular structures are SMs (streaming multiprocessors, as NVIDIA calls them), of which Fermi has sixteen. NVIDIA calls the green squares inside each SM “CUDA cores”. These CUDA cores are the chip’s most fundamental execution resource, and their number helps to determine the chip’s total processing power and ultimately its performance. The GT200 has 240 and Fermi has 512.
Each memory interface is 64 bits wide, so with six of them Fermi’s total path to memory is 384 bits wide. This is narrower than the 512-bit path on the GT200. However, Fermi compensates by delivering almost twice the bandwidth per pin thanks to its support for GDDR5 memory; GT200 used GDDR3 memory.
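The trade-off between bus width and per-pin speed is easy to check with a little arithmetic. In the sketch below, the per-pin data rates are hypothetical placeholders chosen only so that GDDR5 is “almost twice” GDDR3; they are not published specifications for either chip. The count of eight controllers for GT200 is inferred from its 512-bit bus and the 64-bit controller width.

```python
def bus_width_bits(controllers, bits_per_controller=64):
    """Total memory bus width from the number of memory controllers."""
    return controllers * bits_per_controller


def bandwidth_gb_per_s(bus_bits, gbit_per_s_per_pin):
    """Peak bandwidth in GB/s: one data pin per bus bit, 8 bits per byte."""
    return bus_bits * gbit_per_s_per_pin / 8


# Hypothetical per-pin data rates, for illustration only.
GDDR3_RATE = 2.0   # Gbit/s per pin (assumed)
GDDR5_RATE = 3.8   # Gbit/s per pin (assumed, "almost twice" GDDR3)

gt200 = bandwidth_gb_per_s(bus_width_bits(8), GDDR3_RATE)   # 512-bit bus
gf100 = bandwidth_gb_per_s(bus_width_bits(6), GDDR5_RATE)   # 384-bit bus

print(f"GT200: {gt200:.1f} GB/s, GF100: {gf100:.1f} GB/s")
```

Under these assumed rates the narrower 384-bit bus still comes out ahead, which is the point the paragraph above makes.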
To summarize, Fermi GF100 has:
- 512 CUDA cores
- 16 geometry units
- 4 raster units
- 64 texture units
- 48 ROP units
- 384-bit GDDR5 memory interface
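These totals follow from the block hierarchy described earlier, and a short Python sketch makes the cross-check explicit. The per-unit inputs (four GPCs, four SMs per GPC, four texture units per SM, six 64-bit memory controllers, six ROP partitions) are taken from the article; the cores-per-SM and ROPs-per-partition figures are not stated in the text and are derived here by dividing the totals, so treat them as inferences.

```python
GPCS = 4                  # Graphics Processing Clusters
SMS_PER_GPC = 4           # "up to four SMs" per GPC, fully enabled
TEXTURE_UNITS_PER_SM = 4
MEMORY_CONTROLLERS = 6
BITS_PER_CONTROLLER = 64
ROP_PARTITIONS = 6
TOTAL_CUDA_CORES = 512    # from the summary list
TOTAL_ROPS = 48           # from the summary list


def gf100_totals():
    """Rebuild the summary list from the per-block figures."""
    sms = GPCS * SMS_PER_GPC
    return {
        "sms": sms,
        "geometry_units": sms,        # one PolyMorph Engine per SM
        "raster_units": GPCS,         # one Raster Engine per GPC
        "texture_units": sms * TEXTURE_UNITS_PER_SM,
        "memory_bus_bits": MEMORY_CONTROLLERS * BITS_PER_CONTROLLER,
        # Derived by division, not stated in the article:
        "cores_per_sm": TOTAL_CUDA_CORES // sms,
        "rops_per_partition": TOTAL_ROPS // ROP_PARTITIONS,
    }


for name, value in gf100_totals().items():
    print(f"{name}: {value}")
```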
NVIDIA has now announced Fermi GF100, its next-generation DX11 architecture, which aims for even greater performance while being more programmable and software-friendly. There is no “GT300”. Until now, NVIDIA has chosen to discuss primarily the Fermi Tesla GPU-computing architecture and has not disclosed the microarchitecture or, especially, the game-related performance details of GF100.
The biggest changes in the GF100 architecture are in the geometry pipeline, which has been significantly revamped with improved performance in geometry shading, stream out, and culling. Fillrate has also been improved, which lets GF100 cards in SLI drive multiple displays simultaneously, much like AMD’s Eyefinity, but now also in 3D and at 120 Hz.
From the second image, we can see that the GPC is GF100’s dominant high-level hardware block. It features two key innovations: a scalable Raster Engine for triangle setup, rasterization, and z-cull, and a scalable PolyMorph Engine for vertex attribute fetch and tessellation. The Raster Engine resides in the GPC, whereas the PolyMorph Engine resides in the SM. On earlier NVIDIA GPUs, SMs and Texture Units were grouped together in hardware blocks called Texture Processing Clusters (TPCs). On GF100, each SM has four dedicated Texture Units.
Looking deeper, Fermi’s tessellation engine is impressive. It is not something just “tacked on” to GT200. NVIDIA saw early on that making only incremental changes to GT200, such as simply adding tessellation, would lead to intolerable geometry bottlenecks. They tell us that this is what took them so long: they had to design a new, better-balanced chip architecture that also builds better sequential rendering semantics into its engine.

