Good News!!!

Core

Member
Jan 23, 2010
3,120
195
0
Milky Way/Local Cluster/Local Sol/Earth
NVIDIA IS GOING TO RELEASE A DIRECTX 11 GRAPHICS CARD SERIES!!! :)


NVIDIA’s DirectX 11 Architecture: GF100 (Fermi) In Detail





Article written by Mark Poppin and BFG10K, AlienBabelTech Senior Editors.
Introduction
At their GPU Technology Conference (GTC) on September 30th last year, NVIDIA announced their next-generation graphics architecture, codenamed Fermi. We reported on it in a three-part series. At the GTC, graphics performance was not the focus of Tesla Fermi. Rather, the conference emphasized NVIDIA’s new architecture as a revolutionary general-purpose processor that exploits the GPU’s fast parallel processing far better than their current architecture does. NVIDIA’s goal is to dominate the professional market with their Tesla GPUs. Now that Fermi GF100 GPUs for NVIDIA’s new video cards are finally in mass production, we will be looking at how NVIDIA intends to dominate gaming.
[Image: Fermi production mock-up]

[Image: Fermi GPU]

To summarize the new architecture, Fermi boasts a brand new shader core organized into compute clusters called shader multiprocessors (SMs). Each stream processor has a fully-pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU). Each SM can dual-issue two independent instructions per clock to two different warps. Each instruction is run by a 16-way SIMD block that handles single-precision fused multiply-add (FMA) instructions. The Fermi memory hierarchy is also new, sporting a unified L2 cache that serves all of the SMs without partitions. In addition, a new unified memory space allows each SM to communicate not only with its own local registers and shared memory, but now also with the L2 cache and beyond.
The GF100 features a 768KB unified level-two cache as well as a rather complex cache hierarchy. In addition, many other areas of GPU-compute performance are improved over NVIDIA’s current Tesla architecture GPU, the GT200. The GF100 hardware can sustain peak Single Precision (SP) and Double Precision (DP) FMA instruction throughput. Atomic instruction throughput is much higher than in the current generation, and Fermi is backed by ECC, which is absolutely necessary for GPU computing. This all comes together to support a new type of multi-threading technology which improves the efficiency of the 512 cores working together. The entire Fermi family is compatible with the DirectX 11, OpenGL 3.x and OpenCL 1.x application programming interfaces (APIs). The new chips are finally in mass production using 40nm process technology at TSMC.
Let’s go ahead and see what is new and improved with GF100.

–~~~~~~~~~~~~–​
GF100 Architecture
Let’s look at the diagrams:

The first diagram, from NVIDIA’s slides, shows the GF100 block diagram illustrating the Host Interface, the GigaThread Engine, four GPCs, six Memory Controllers, six ROP partitions, and a 768 KB L2 cache. Each GPC contains four PolyMorph engines. The ROP partitions are immediately adjacent to the L2 cache. The second image illustrates how GF100’s graphics architecture is built from a number of hardware blocks called Graphics Processing Clusters (GPCs). A GPC contains a Raster Engine and up to four SMs. The third image illustrates how it all works together.
Firstly, CPU commands are read by the GPU via the Host Interface. In turn, the GigaThread Engine fetches data from the system memory and copies it to the framebuffer. GF100 implements six 64-bit GDDR5 memory controllers for a 384-bit total, which facilitates high bandwidth access to the framebuffer. The GigaThread Engine then creates and dispatches thread blocks to various SMs. Individual SMs in turn schedule warps (groups of 32 threads) to CUDA cores and to the other execution units. The GigaThread Engine also redistributes work to the SMs when work expansion occurs in the graphics pipeline.
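The grouping of threads into warps is simple arithmetic. Here is a minimal sketch, assuming a thread block is split into consecutive groups of 32 threads; the function name and the block sizes are made up for illustration and this is not NVIDIA's actual scheduler:

```python
import math

WARP_SIZE = 32  # from the article: a warp is a group of 32 threads


def warps_per_block(threads_per_block: int) -> int:
    """Number of warps needed to cover a thread block.

    The last warp may be only partially filled.
    """
    if threads_per_block <= 0:
        raise ValueError("threads_per_block must be positive")
    return math.ceil(threads_per_block / WARP_SIZE)


if __name__ == "__main__":
    for size in (32, 100, 256):  # hypothetical block sizes
        print(size, "threads ->", warps_per_block(size), "warps")
```

A block whose size is not a multiple of 32 still occupies a whole warp for its remainder, which is why block sizes are usually chosen as multiples of the warp size.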
In the first image, the rectangular structures are SMs, or as NVIDIA calls them, streaming multiprocessors, of which Fermi has sixteen. NVIDIA calls the green squares inside each SM “CUDA cores”. These CUDA cores make up the chip’s most fundamental execution resource, which helps to determine the chip’s total processing power and ultimately its performance. The GT200 has 240 and Fermi has 512.
Each memory interface is 64 bits wide, so Fermi’s total path to memory is 384 bits wide. This is narrower than the 512-bit pathway on the GT200. However, Fermi compensates by delivering almost twice the bandwidth per pin due to its support for GDDR5 memory; GT200 used GDDR3 memory.
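To see why a narrower bus can still come out ahead, here is a rough sketch of the arithmetic. Peak bandwidth is bus width times per-pin data rate. The per-pin rates used below are placeholder numbers chosen only to show the ratio, not actual GT200 or GF100 memory clocks:

```python
def bandwidth_gb_per_s(bus_width_bits: int, gbit_per_pin: float) -> float:
    """Peak bandwidth in GB/s: pins * per-pin rate (Gbit/s) / 8 bits per byte."""
    return bus_width_bits * gbit_per_pin / 8.0


def break_even_rate_ratio(old_bus_bits: int, new_bus_bits: int) -> float:
    """How much faster each pin must run for the new bus to match the old one."""
    return old_bus_bits / new_bus_bits


if __name__ == "__main__":
    # A 384-bit bus needs pins about 1.33x faster to match a 512-bit bus.
    print(round(break_even_rate_ratio(512, 384), 2))

    # Hypothetical rates: 2.0 Gbit/s per pin vs. exactly double that.
    old = bandwidth_gb_per_s(512, 2.0)
    new = bandwidth_gb_per_s(384, 4.0)
    print(old, new, new / old)  # doubling the per-pin rate gives 1.5x overall
```

So if GDDR5 really delivers close to twice the per-pin rate, the 384-bit bus ends up with close to 1.5 times the total bandwidth of the 512-bit GDDR3 bus.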
To summarize, Fermi GF100 has:

  • 512 CUDA cores
  • 16 Geometry Units
  • 4 raster units
  • 64 texture units
  • 48 ROP units
  • 384-bit GDDR5
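The totals in this list follow from the per-block counts given earlier in the article. A small sketch that derives the per-unit numbers from the totals (the dictionary keys are my own labels, not NVIDIA terminology):

```python
# Totals quoted in the article for GF100.
GPCS = 4
SMS = 16
CUDA_CORES = 512
TEXTURE_UNITS = 64
ROPS = 48
MEMORY_CONTROLLERS = 6       # also the number of ROP partitions
CONTROLLER_WIDTH_BITS = 64


def derived_counts() -> dict:
    """Per-block figures implied by the totals above."""
    return {
        "sms_per_gpc": SMS // GPCS,
        "cores_per_sm": CUDA_CORES // SMS,
        "texture_units_per_sm": TEXTURE_UNITS // SMS,
        "rops_per_partition": ROPS // MEMORY_CONTROLLERS,
        "memory_bus_bits": MEMORY_CONTROLLERS * CONTROLLER_WIDTH_BITS,
    }


if __name__ == "__main__":
    for name, value in derived_counts().items():
        print(f"{name}: {value}")
```

This gives 4 SMs per GPC, 32 CUDA cores per SM, 4 texture units per SM, 8 ROPs per partition and a 384-bit bus, which matches the figures stated in the text.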
NVIDIA’s current generation product, the GT200 – of which GTX 285 is the single GPU flagship – was able to improve on the original G80 design as represented by the 8800 GTX. By refining G80’s architecture, NVIDIA made it more programmable by adding double precision (DP) support and atomic operations. GT200 managed all of this while still holding on to the highest performance crown for a single GPU until nearly five months ago when AMD/ATI’s Radeon 5870 launched. Their competitor has the first DX11 chip, which was built with incremental changes over its last generation and delivers significant performance improvements over the HD 4800 series.
So now NVIDIA has announced their Fermi GF100 next generation DX11 architecture which aims for even greater performance and also is more programmable and software friendly. There is no “GT300”. Until now, NVIDIA has chosen to primarily discuss Fermi Tesla GPU computing architecture and not to disclose microarchitecture or especially game-related performance details of GF100.
The biggest changes in GF100 architecture show us that the geometry pipeline has been significantly revamped with improved performance in geometry shading, stream out, and culling. Fillrate has also been improved which enables multiple displays to be driven simultaneously by GF100 SLI, much like AMD’s Eyefinity; but now additionally in 3D and at 120 Hz.
From studying the second image, we can see that the GPC is GF100’s dominant high-level hardware block. It features two key innovations—a scalable Raster Engine for triangle setup, rasterization, and z-cull, and a scalable PolyMorph Engine for vertex attribute fetch and tessellation. The Raster Engine resides in the GPC, whereas the PolyMorph Engine resides in the SM. On earlier NVIDIA GPUs, SMs and Texture Units were grouped together in hardware blocks called Texture Processing Clusters (TPCs). On GF100, each SM has four dedicated Texture Units.
As we look deeper, we can see that Fermi’s tessellation engine is impressive. It is not something just “tacked on” to GT200. NVIDIA saw early on that if they only made incremental changes to GT200, they would run into severe bottlenecks. Simply adding tessellation to GT200 would lead to intolerable geometry bottlenecks. They tell us that this is what took them so long – they had to design a better balanced new chip architecture that could also have better sequential rendering semantics built into its engine.

–~~~~~~~~~~~~–​
 
Geometry
NVIDIA’s goal is for GF100 to enable film-like geometric realism for game characters and objects. Geometric realism is central to the GF100 architectural enhancements for graphics. In addition, PhysX simulations are faster and developers can utilize GPU computing features in games more easily and effectively.
While programmable shading has allowed PC games to mimic the cinema in per-pixel effects, geometric realism is way behind. The most advanced modern PC games will use one to two million polygons per frame whereas a typical frame in a computer generated film uses hundreds of millions of polygons. While the number of pixel shaders has grown from one to many hundreds, the triangle setup engine has remained a singular unit. For example, the GeForce GTX 285 has more than 150 times the shading horsepower of the old GeForce FX, but less than 3 times the geometry processing rate. This means that pixels are shaded well but their geometric detail is weak.
Take a look at NVIDIA’s example from Far Cry 2. The holster has a heavily segmented strap. The corrugated roof is just a flat surface with a striped texture instead of curving properly. We also note that this character wears a hat to avoid the complexity of rendering hair.
On the other hand, the exquisitely detailed characters in CG films are made possible by tessellation and displacement mapping. Tessellation refines large triangles into collections of smaller triangles, while displacement mapping changes their relative position. To achieve these same goals, GF100’s entire graphics pipeline is designed to deliver higher performance in tessellation and geometry throughput.
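The two ideas can be shown with a toy example. This is only a sketch: it tessellates by repeatedly splitting a triangle at its edge midpoints and displaces along the z axis, while real DX11 hardware uses tessellation factors and displaces along the surface normal using a displacement map. All names here are made up for illustration:

```python
import math
from typing import Callable, List, Tuple

Vec3 = Tuple[float, float, float]
Triangle = Tuple[Vec3, Vec3, Vec3]


def midpoint(a: Vec3, b: Vec3) -> Vec3:
    return ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2, (a[2] + b[2]) / 2)


def tessellate(tri: Triangle, levels: int) -> List[Triangle]:
    """Refine one triangle into 4**levels smaller triangles."""
    if levels == 0:
        return [tri]
    a, b, c = tri
    ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
    out: List[Triangle] = []
    for child in ((a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)):
        out.extend(tessellate(child, levels - 1))
    return out


def displace(tri: Triangle, height: Callable[[float, float], float]) -> Triangle:
    """Move each vertex along +z by a height looked up at its (x, y) position."""
    moved = tuple((x, y, z + height(x, y)) for x, y, z in tri)
    return (moved[0], moved[1], moved[2])


if __name__ == "__main__":
    flat = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
    bumps = lambda x, y: 0.1 * math.sin(8 * x) * math.sin(8 * y)  # stand-in height map
    mesh = [displace(t, bumps) for t in tessellate(flat, 3)]
    print(len(mesh), "triangles")
```

The flat triangle carries no detail of its own; the extra vertices created by tessellation are what give the height map something to move.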
GF100 replaces the traditional geometry processing architecture at the front end of the graphics pipeline with an entirely new distributed geometry processing architecture that is implemented using multiple “PolyMorph Engines”. Each of these engines includes a tessellation unit, an attribute setup unit, and other geometry processing units. Each SM has its own dedicated PolyMorph Engine, as shown in the three grouped diagrams earlier in this article.
Newly generated primitives are converted to pixels by four Raster Engines that operate in parallel, compared to a single Raster Engine in GT200 and in earlier GPUs. On-chip L1 and L2 caches now enable high bandwidth transfer of primitive attributes between the SM and the tessellation unit as well as between different SMs. Tessellation and all its supporting stages are performed in parallel on GF100 with improved geometry throughput. GF100’s ability to perform parallel geometry processing is possibly the single most important GF100 architectural improvement. The ability to deliver setup rates exceeding one primitive per clock while maintaining correct rendering order is a significant technical achievement.
Major compute features improved on GF100 that will be useful in games include faster context switching between graphics and PhysX, concurrent compute kernel execution and an enhanced caching architecture which is good for irregular algorithms such as ray tracing, and AI. Simultaneously, improved atomic operations performance allows threads to safely cooperate through work queues, accelerating novel rendering algorithms. For example, fast atomic operations allow transparent objects to be rendered without presorting (order independent transparency) enabling developers to create levels with complex glass environments. GF100’s GigaThread engine reduces context switch time, making it possible to execute multiple compute and physics kernels for each frame.
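To illustrate what order independent transparency means, here is a simplified CPU-side sketch. On the GPU, fragments would be appended to per-pixel lists using atomic operations; here an ordinary Python list stands in for that, and colors are single grey values. The function names are illustrative only:

```python
from typing import List, Tuple

Fragment = Tuple[float, float, float]  # (depth, color, alpha); larger depth = farther


def blend_in_arrival_order(fragments: List[Fragment], background: float) -> float:
    """Naive blending: the result depends on the order fragments arrive in."""
    color = background
    for _depth, frag_color, alpha in fragments:
        color = frag_color * alpha + color * (1.0 - alpha)
    return color


def blend_order_independent(fragments: List[Fragment], background: float) -> float:
    """Collect first, then sort far-to-near per pixel before blending."""
    ordered = sorted(fragments, key=lambda f: f[0], reverse=True)
    return blend_in_arrival_order(ordered, background)


if __name__ == "__main__":
    near_white = (1.0, 1.0, 0.5)
    far_black = (2.0, 0.0, 0.5)
    for frags in ([near_white, far_black], [far_black, near_white]):
        print(blend_in_arrival_order(frags, 0.0), blend_order_independent(frags, 0.0))
```

The naive version gives 0.25 or 0.5 depending on submission order, while the sorted version gives 0.5 either way, so the game does not have to presort its glass.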

–~~~~~~~~~~~~–​
Tessellation and Displacement Mapping
It takes DX11 to take advantage of geometry. DX9 and DX10 are unable to create generalized geometry on the GPU. Therefore we will see tessellation and displacement mapping used together to create more realism in games. The ability to control the geometric level of detail (LOD) is very important. Because it is on-demand and the data is all kept on-chip, precious memory bandwidth is preserved. Also, because one model may produce many LODs, the same game assets may be used on a variety of platforms, which makes game developers very happy. A character can also be easily adjusted according to how it appears in the scene: if it is small then it gets little geometry, and if it is close to the screen then it is rendered with greater detail.
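A distance-based LOD rule of this kind could look like the sketch below. The linear falloff and the near/far distances are assumptions picked for illustration; the only fixed number is 64, which is the maximum tessellation factor Direct3D 11 allows:

```python
MAX_TESS_FACTOR = 64.0  # Direct3D 11 upper limit for a tessellation factor


def tess_factor(distance: float, near: float = 1.0, far: float = 100.0,
                max_factor: float = MAX_TESS_FACTOR) -> float:
    """Fade linearly from max_factor at `near` down to 1.0 at `far`.

    `near` and `far` are hypothetical tuning values in scene units.
    """
    if far <= near:
        raise ValueError("far must be greater than near")
    t = (distance - near) / (far - near)
    t = min(max(t, 0.0), 1.0)  # clamp so very close/far objects stay in range
    return max_factor + t * (1.0 - max_factor)


if __name__ == "__main__":
    for d in (0.5, 1.0, 50.5, 100.0, 500.0):
        print(d, "->", tess_factor(d))
```

A game would evaluate something like this per patch in the hull shader, so a character far away gets almost no extra triangles and the same model close to the camera gets the full detail.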
As an additional benefit, developers may be able to use the same models on many generations of games and future GPUs where performance increases will allow for enabling even greater detail than was possible when the game was first released. Complexity can be adjusted dynamically to even target a given frame rate!
Here in NVIDIA’s slide from the Unigine engine demo, we see tessellation compared, on and off. There is no comparison; tessellation adds to realism.
Take a look at the third image that we presented earlier in this article. The use of tessellation fundamentally changes the GPU’s graphics workload balance. With tessellation, the triangle density of a given frame can increase by multiple orders of magnitude which strains serial resources such as the setup and rasterization units. To facilitate high triangle rates, NVIDIA designed a scalable geometry engine called the PolyMorph Engine. Each of GF100’s 16 PolyMorph engines has its own dedicated vertex fetch unit and a tessellator which expands geometry performance.
In conjunction with the PolyMorph Engine, NVIDIA designed four parallel Raster Engines which allow up to four triangles to be set up per clock. The PolyMorph Engine works in five stages, and the results calculated in each stage are passed to an SM. The SM executes the game’s shader, returning the results to the next stage in the PolyMorph Engine. After all stages are complete, the results are forwarded to one of the four Raster Engines.
The Rasterizer takes the edge equations for each primitive and computes pixel coverage. If antialiasing is enabled, coverage is performed for each multisample and coverage sample. Each Rasterizer outputs eight pixels per clock for a total of 32 rasterized pixels per clock across the chip. Pixels produced by the rasterizer are sent to the Z-cull unit. By having a dedicated tessellator for each SM, and a Raster Engine for each GPC, GF100 delivers up to 8 times the geometry performance of GT200. NVIDIA also compares the geometry performance of GF100 to HD 5870 and finds Fermi is significantly faster.
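The edge-equation coverage test can be sketched in a few lines. This is a simplified software version that samples once at each pixel center and skips the tie-breaking fill rules real hardware applies to shared edges; the function names are my own:

```python
from typing import Tuple

Point = Tuple[float, float]

RASTER_ENGINES = 4
PIXELS_PER_ENGINE_PER_CLOCK = 8
PIXELS_PER_CLOCK = RASTER_ENGINES * PIXELS_PER_ENGINE_PER_CLOCK  # 32, as in the article


def edge(a: Point, b: Point, p: Point) -> float:
    """Edge equation: positive when p is left of the directed edge a -> b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def covered(tri: Tuple[Point, Point, Point], p: Point) -> bool:
    """A point is covered when all three edge equations agree in sign."""
    a, b, c = tri
    w0, w1, w2 = edge(b, c, p), edge(c, a, p), edge(a, b, p)
    all_non_negative = w0 >= 0 and w1 >= 0 and w2 >= 0
    all_non_positive = w0 <= 0 and w1 <= 0 and w2 <= 0
    return all_non_negative or all_non_positive  # accept either winding


def coverage_count(tri: Tuple[Point, Point, Point], width: int, height: int) -> int:
    """Count pixels whose centers fall inside the triangle."""
    return sum(
        covered(tri, (x + 0.5, y + 0.5))
        for y in range(height)
        for x in range(width)
    )


if __name__ == "__main__":
    tri = ((0.0, 0.0), (4.0, 0.0), (0.0, 4.0))
    print(coverage_count(tri, 4, 4), "pixels covered")
    print(PIXELS_PER_CLOCK, "pixels per clock across the chip")
```

With antialiasing enabled, the same test would simply be repeated at each sample position inside the pixel instead of only at the center.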
Here is a performance comparison between GF100 and HD 5870 using a 60 second run with the Unigine engine:

–~~~~~~~~~~~~–​
Anti-Aliasing Image Quality
To improve anti-aliasing image quality, the GF100 introduces a new anti-aliasing mode: 32xCSAA. nVidia’s previous strongest edge AA mode was 16xQ, but this is now bested by 32xAA. Here’s the sample pattern for it, courtesy of nVidia:
[Image: 32xCSAA sample pattern]
32xAA = 8xMSAA + 24xCSAA.
Thus 32xCSAA is a natural extension of 16xQ, and offers even stronger edge (polygon) anti-aliasing, courtesy of providing a total of 32 unique samples.
But that’s not all that has improved. The GF100 has a new ability to use coverage samples to affect the quality of alpha textures, as implemented through transparency anti-aliasing. With previous nVidia hardware such as the GT200, coverage samples had no effect on transparency anti-aliasing quality, as the result was derived solely from the base multi-sampling pattern in effect.
Image quality has also improved in the specific case of transparency multi-sampling. Any titles using the older alpha test method to render transparent textures have their shader code automatically converted to use the alpha-to-coverage technique, which should greatly improve image quality, especially in heavily aliased areas.
The upshot of this is higher quality edges, and higher quality alpha textures.
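A rough way to see why more samples help alpha textures: alpha-to-coverage turns a texel's alpha value into a number of covered samples, so more samples means more distinct transparency steps. The sketch below assumes each sample can simply be switched on or off and fills the mask from the lowest bit upward; real hardware uses dithered sample patterns, so treat this as an illustration only:

```python
def coverage_mask(alpha: float, num_samples: int) -> int:
    """Bit mask with round(alpha * num_samples) of its low bits set."""
    alpha = min(max(alpha, 0.0), 1.0)
    covered = int(alpha * num_samples + 0.5)  # round half up
    return (1 << covered) - 1


def transparency_levels(num_samples: int) -> int:
    """Distinct coverage amounts available: 0 covered samples up to all of them."""
    return num_samples + 1


if __name__ == "__main__":
    for samples in (8, 32):
        print(samples, "samples ->", transparency_levels(samples), "levels")
    print(bin(coverage_mask(0.5, 8)))
```

With only the 8 multisamples there are 9 possible steps between fully transparent and fully opaque; if all 32 samples count, there are 33.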

Anti-Aliasing Performance
In addition to improving image quality, anti-aliasing performance has also increased. When it comes to AA, the most obvious area to target is the ROPs, and that’s exactly what nVidia has done. The GF100 has 48 ROPs, up from 32 ROPs on the GTX285, which is especially helpful for portions of the scene that cannot be compressed.
Each ROP is also faster and more efficient than on previous generations, so it can do more work per cycle. This includes improvements made to the compression technology.
Aside from better AA performance in general, nVidia’s old Achilles heel with 8xMSAA performance should also be addressed by the improvements. Historically, prior nVidia architectures have exhibited a much higher relative performance hit when going from 4xMSAA to 8xMSAA, compared to competing ATi architectures.
Also by moving to 384 bit GDDR5, nVidia should have access to plenty of memory bandwidth to keep all of those ROPs fed with data.

–~~~~~~~~~~~~–​
Texture Filtering
As with anti-aliasing, there have been improvements made to texturing too. Interestingly, the GF100 only has 64 TMUs, which is far fewer than the 80 TMUs on the GTX285, but nVidia claims overall performance should still be higher because of improvements to performance and efficiency.
Texture caching has been substantially improved, with the L1 cache being redesigned for greater efficiency. Also, the presence of a unified L2 cache means the texture cache is three times larger than on the GT200.
Layout changes and internal improvements to the texture units also combine with a higher TMU clock speed. On the GT200 the TMUs ran at the GPU’s core clock; on the GF100 they run at a higher clock, which allows them to perform more work in the same amount of time. nVidia’s numbers show 40% to 70% higher texturing performance than the GT200, despite having far fewer TMUs.
The GF100’s texture units also offer hardware accelerated jittered sampling. This essentially means the hardware has the ability to offer a form of stochastic filtering by varying the texture sampling on a per-pixel basis. This is done by implementing DirectX 11’s Gather4 in hardware, and it provides the ability for up to four texels to be fetched from a 128×128 pixel grid with a single instruction.
This not only improves performance with things like ambient occlusion, but it can also improve image quality by removing banding through random sampling. It also allows game developers to implement customized texture filtering more efficiently. nVidia states that the GF100’s hardware implementation of this technique offers up to twice the performance of the GT200.
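As a sketch of the idea, the code below fetches four texels in one call, either from the usual 2x2 footprint or from four randomly jittered offsets. The offset range of -64 to 63 is an assumption taken from the 128×128 grid mentioned above, the texel ordering is arbitrary, and none of this is meant to mirror the exact behaviour of the DirectX 11 instruction:

```python
import random
from typing import List, Sequence, Tuple

Offset = Tuple[int, int]
DEFAULT_FOOTPRINT: Tuple[Offset, ...] = ((0, 0), (1, 0), (0, 1), (1, 1))


def fetch(texture: List[List[float]], x: int, y: int) -> float:
    """Read one texel, clamping coordinates to the texture edges."""
    height, width = len(texture), len(texture[0])
    return texture[min(max(y, 0), height - 1)][min(max(x, 0), width - 1)]


def gather4(texture: List[List[float]], x: int, y: int,
            offsets: Sequence[Offset] = DEFAULT_FOOTPRINT) -> Tuple[float, ...]:
    """Return four texels around (x, y) in a single call."""
    return tuple(fetch(texture, x + dx, y + dy) for dx, dy in offsets)


def jittered_offsets(rng: random.Random, half_window: int = 64) -> Tuple[Offset, ...]:
    """Four random offsets inside a (2 * half_window)-texel-wide window."""
    return tuple(
        (rng.randint(-half_window, half_window - 1),
         rng.randint(-half_window, half_window - 1))
        for _ in range(4)
    )


if __name__ == "__main__":
    tex = [[float(y * 4 + x) for x in range(4)] for y in range(4)]
    print(gather4(tex, 1, 1))                                   # plain 2x2 footprint
    print(gather4(tex, 1, 1, jittered_offsets(random.Random(0), half_window=2)))
```

Averaging the four jittered texels with a different random pattern per pixel is what trades visible banding for fine noise in effects such as soft shadows and ambient occlusion.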

–~~~~~~~~~~~~–​
Compute Architecture
The compute engine is designed to handle the GPGPU side of things and encapsulates features such as CUDA, OpenCL, Direct Compute, and PhysX. Many of these have been around since the G80 days, but the GF100 delivers a number of improvements to make such general purpose computing run better.
The GF100 is designed to handle a wider range of algorithms well, which should encourage wider use of the GPU for parallel problems. One key area of improvement comes from its better cache system, which allows threads that access the same memory locations to run faster.
Another key improvement allows the GF100 to execute multiple task kernels at once, and the context switching between such tasks is much faster than on previous GPUs. This differs from the GT200 which could only run one task kernel at a time, and had very slow context switching.
And lastly, high level features such as debugging and a C++ programming environment to access GPGPU features are made possible with nVidia’s Nexus plug-in for Visual Studio. Such features simplify programming GPGPU tasks, as they help developers work at a higher level than was previously possible.

Ray Tracing
The GF100 will not be able to do complex ray tracing (RT) in real time in PC games as in the above image. However, NVIDIA believes that RT is the future of graphics and they expect some implementation of it in conjunction with rasterization fairly soon as developers begin to take advantage of GF100’s new programming capabilities.

–~~~~~~~~~~~~–​
Conclusion
It’s clear that nVidia has invested a lot of resources and design effort into trying to make the GF100 the fastest single GPU to date. In addition to a very clear focus on improving GPGPU performance and usability, numerous enhancements to image quality and performance for gaming purposes have also been made.
It’ll be very interesting to see how the card performs in actual gaming situations, and more importantly, how it compares to ATi’s current single GPU flagship, the Radeon 5870.
We are looking forward to bringing our readers the latest news about the Fermi GF100 and we will be testing its performance and image quality in gaming. There is much more to be revealed about NVIDIA’s new GPU. Stay tuned. The graphics wars are heating up and it is getting very interesting again.





Reference

by apoppin on Jan.17, 2010, under ABTnews, Articles, Technology
 

:nerd::nerd::nerd:

Would I have to buy a new PC just to fit that thing...? :(

Well, it depends on which system you currently have. If it has a PCI Express slot then you won't, but if it only has AGP then you will probably have to retire it and buy a new one.

All current hardware uses PCI-E x16. The latest PCI Express version, named 3.0, has not been released yet; currently we have 2.0 and 1.1.

So if you have a PCI-E system then don't worry, your system is ready for the next generation. It is still better to buy a good CPU and RAM, because as Nvidia mentions, DirectX 11 games apply a lot of physics, which means you need plenty of RAM and a fast CPU to process particles. Read this also:


The things you should consider before buying a graphics card!
--------------------------------------------------------------

Ranges are considered like this
-------------------------------

100-400 - Entry Level
For watching movies, video editing, animations and playing some small and old games. Sometimes it is possible to play most games (even modern ones) at the lowest settings.

500-600 - Mid Range
Plays games at medium or high settings. You can go up to 1280x1024 with VSync off; beyond that it will mess up your game.

700-900 - High Range
You can play any game at maxed-out settings with AF and AA. You can also turn VSync on, because it will not reduce your frame rate.

examples for graphic card SERIES

ATI
X1000 Series
HD2000 Series
HD3000 Series
HD4000 Series
HD5000 Series

nVidia
FX Series (Nvidia 5 Series)
6 Series
7 Series
8 Series
9 Series
100 Series
GT/GTS/GTX200 series
300 Series (expected in future)

How do I know which range a graphics card belongs to? Just look at this example:
Take a series: nVidia 9 Series
Take a range: 500-600 - Mid Range
Now combine both: 9 + 500 = 9500 - Mid Range

9500 - now read the range descriptions above from the beginning.
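If you like, the same rule of thumb can be written as a tiny script. This is just my series + range idea from above turned into code; it only makes sense for four-digit model numbers (not names like GTX 285), and the function name is made up:

```python
from typing import Tuple


def classify(model: int) -> Tuple[int, str]:
    """Split a model number like 9500 into (series, range name)."""
    series, tier = divmod(model, 1000)   # 9500 -> series 9, tier 500
    hundreds = tier // 100               # 500 -> 5
    if 1 <= hundreds <= 4:
        name = "Entry Level"
    elif 5 <= hundreds <= 6:
        name = "Mid Range"
    elif 7 <= hundreds <= 9:
        name = "High Range"
    else:
        name = "Unknown"
    return series, name


if __name__ == "__main__":
    for card in (9500, 8800, 4350):
        print(card, classify(card))
```

So 9500 comes out as 9 series Mid Range, 8800 as 8 series High Range, and 4350 as 4 series Entry Level.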