u/IamRustyRust

Let's Understand Quaternions - Part 1

Quaternions: Let's Get Real (and Imaginary, and then Some!)

Quaternions are usually considered hard and complex, but they are the king of rotations, crucial for games and math. If we take the academic route to understand them, it will sound like a waiter using long, expensive words to explain a simple carrot salad.

We do not need any of that here. That is why we will not use academic language and we will start with imaginary numbers. 

The word "imaginary" sounds like fiction, like things that do not exist in our world, right?

>But what exactly does not belong here? If you multiply two negative numbers (two numbers with minus signs) together and the result is still negative (still has a minus sign), that just does not happen in our world. That is imaginary.

Now, suppose we have a number like the square root of -1. When I say root of -1, what am I actually asking? I am asking you to find a number that gives -1 when multiplied by itself. But how is that possible, when in this world the law of negative X negative = positive always holds? It is totally impossible, and that is exactly why we call it an imaginary number. Now see the following calculation.

https://preview.redd.it/eyy85gf8nkwg1.jpg?width=963&format=pjpg&auto=webp&s=09eb9564732a1e8305651175631ca25ed97ed726

>This kindergarten calculation plays a significant role in quaternions. Yes, it is simple, but it is very powerful, and the entire quaternion is built on it. I am sharing it now because we are currently discussing imaginary numbers. I will not explain this calculation right now, but we will discuss it further ahead, where you will find out that it is the foundation of quaternions.
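The calculation in the image is presumably the powers of i; you can poke at it yourself with Python's built-in complex type, where the literal 1j plays the role of i:

```python
# Python's built-in complex type: the literal 1j is the imaginary unit i.
i = 1j

assert i * i == -1          # i^2 = -1: the "kindergarten calculation"
assert i * i * i == -i      # i^3 = -i
assert i * i * i * i == 1   # i^4 = +1, and the cycle starts over
print("powers of i cycle: i, -1, -i, +1, i, -1, ...")
```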

From Imaginary Numbers to Quaternions:

So, now we are going to step away from imaginary numbers and jump right into quaternions. We'll use the exact same formula you usually see in textbooks to represent them. You know the one: xi, yj, and zk. And if you come from a computer background, you've probably seen a w attached to that. If we just talk about x, y, and z, those are simply the physical axes you see on a standard 3D graph. Quaternions attach imaginary numbers directly to these. Let's look at how these imaginary numbers actually interact with our axes. If I take the term xi, the x is our physical axis, and the i is our imaginary number. And here, that imaginary number literally just means 90 degrees. So, what does xi actually mean? It means a strict 90-degree turn.

https://preview.redd.it/4w924hsankwg1.jpg?width=1280&format=pjpg&auto=webp&s=a43d9eaacf3198acf713826649b2b9c3e48710bb
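By the way, as plain data the textbook form w + xi + yj + zk is nothing scary: it is just four real numbers sitting together. A minimal sketch (the struct and field names here are my own choice, not any fixed standard):

```python
from dataclasses import dataclass

@dataclass
class Quat:
    w: float  # the real part (the "computer background" w)
    x: float  # coefficient of i
    y: float  # coefficient of j
    z: float  # coefficient of k

# The identity quaternion: no rotation at all.
q = Quat(w=1.0, x=0.0, y=0.0, z=0.0)
print(q)
```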

But hold on. How did i suddenly become 90 degrees? Wasn't it supposed to be the square root of -1? How did it jump to being an angle? You won't find the answer to this in high-level physics. You actually find the answer right back in the simple, basic kindergarten math we did previously. Let's break that down right now. To do this, we just need to go back to our standard graph. We take our point xi and place it on the y-axis. Now, let's say we multiply one more i into it. That means we are taking xi and multiplying it by i. Physically, this means we are adding 90 degrees and then another 90 degrees, which is i^(2). The answer lands exactly on 180, so i^(2) = 180 degrees, and then i = 180/2, meaning 90 degrees. So what does this mean?

https://preview.redd.it/1lzoxynenkwg1.jpg?width=1280&format=pjpg&auto=webp&s=60f4319cee700cf4bb7f962e84bdb4c261d0a8b8

>If we multiply imaginary numbers together, it physically picks us up and drops us on the exact opposite, negative axis (if we start on the positive x-axis, we flip straight to the negative x-axis), and the exact same mechanical thing happens with all the other axes if we start from them.
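You can watch this flip happen with Python's complex numbers: start at +1 on the positive x-axis and multiply by i twice:

```python
p = 1 + 0j      # a point on the positive x-axis
p = p * 1j      # one 90-degree turn: now on the positive y-axis
assert p == 1j
p = p * 1j      # second 90-degree turn: landed at -1, the negative x-axis
assert p == -1
print("two multiplications by i flipped +1 to -1")
```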

The Strange Co-Dependence of i and i^(2):

Now, let's assume for a second that this i^(2) just doesn't exist. Should that have any effect on a standalone i? Normal human logic says that i should be independent. It should be completely whole on its own. That means i should totally exist even without i^(2) being a thing. Because obviously, i^(2) can never be formed without having an i first.

So, the real question pops up. If we assume there is no such thing as i^(2), can a single i still perform that 90-degree turn? The answer is hidden inside some very strange mathematical logic. Because actually, whatever value or identity i has comes entirely from i^(2). If there is no i^(2) in the math, then a single i simply does not have any physical value of its own.

>This is a kind of math that literally moves backward instead of forward. It works from back to front. Think about it like this. In normal life, 1 and 1 together make 2. If your base number 1 isn't there, then the number 2 can never be formed. But in this specific game of imaginary numbers, the rule runs completely in reverse. Here, the math clearly tells us that if i^(2) (which is supposed to be the result) doesn't exist, then i (which is supposed to be the base) will not exist either.

PART 2: 3D Quaternions


u/IamRustyRust — 7 hours ago

Hardware Reality of Quaternions

Rotating 3D objects in game engines has always been a math-heavy process. Initially, using Euler angles (pitch, yaw, roll) seems easy, but we must strictly reject them. Their biggest flaw is gimbal lock, a condition where your rotation axes collapse and the entire math breaks. After rejecting Euler angles because of this failure, an engine architect is left with only two hardcore ways to handle rotations: quaternions and rotation matrices. The quaternion (the king of rotation, for me) is preferred because its math is flexible, gimbal-lock safe, and it can be made cache-friendly. But on the other hand, the standard 3D math and rendering world runs by default on rotation matrices. The problem is that when you put these matrices into real-time physics and high-performance computation, a new engineering horror starts.

This engineering horror first shows up as Non-Orthogonal Drift. A rotation matrix should always have three orthogonal axes, meaning every axis stays at 90 degrees to the others. When floating-point multiplications pile up frame after frame, rounding errors mean those axes no longer stay at strictly 90 degrees. The result is that your perfectly square character starts looking squashed, distorted, like a skewed box. To fix this drift, re-orthogonalization is needed: the object became skewed, so now the CPU has to stop and straighten that matrix out again with math. This CPU penalty makes the game slow, especially when you have 1000 objects on the screen.
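A tiny pure-Python sketch of this drift and its fix, using a 2D rotation instead of 3D to keep it short (the iteration count and numbers are illustrative, not engine measurements):

```python
import math

def mul2(a, b):
    # 2x2 matrix product, row-major tuples (a[0] a[1]; a[2] a[3]).
    return (a[0]*b[0] + a[1]*b[2], a[0]*b[1] + a[1]*b[3],
            a[2]*b[0] + a[3]*b[2], a[2]*b[1] + a[3]*b[3])

theta = 0.1
r = (math.cos(theta), -math.sin(theta),
     math.sin(theta),  math.cos(theta))

m = (1.0, 0.0, 0.0, 1.0)        # identity
for _ in range(100000):          # many frames of repeated multiplication
    m = mul2(m, r)

# For a pure rotation, each column should have length exactly 1.
len0 = math.hypot(m[0], m[2])
drift = abs(len0 - 1.0)
print("column-length drift after 100k multiplies:", drift)

# Re-orthogonalization: renormalize column 0 and rebuild the matrix from it
# (in 2D the second column is just the perpendicular of the first).
c0 = (m[0] / len0, m[2] / len0)
m_fixed = (c0[0], -c0[1], c0[1], c0[0])
assert abs(math.hypot(m_fixed[0], m_fixed[2]) - 1.0) < 1e-12
```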

This math overhead is only half the story. The real bottleneck hits when the CPU has to read the data of these 1000 objects from memory and, after fixing it, write it back, because from memory's perspective a matrix is very heavy. Think about it: a standard 3x3 (f32) rotation matrix takes 36 bytes (288 bits). Mathematically, rotation only needs that 3x3, but in game engines we always use a 4x4 matrix, so that translation (movement) and scaling (size change) can be stored in the matrix along with the rotation. Its total size becomes 64 bytes. That is exactly the number that fills an L1 cache line and hogs CPU bandwidth. Hearing this, it feels like: okay, a 64-byte matrix will perfectly fit into the CPU's 64-byte cache line, so what is the problem? The problem is that in engineering, when the size of your data becomes exactly equal to the memory container, the margin of error becomes absolutely zero.
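The byte counts above are easy to verify with Python's struct module (native f32 layout, no padding between floats):

```python
import struct

# A 3x3 rotation matrix of f32: 9 floats.
assert struct.calcsize('9f') == 36    # 36 bytes = 288 bits

# The 4x4 transform matrix game engines actually store: 16 floats.
assert struct.calcsize('16f') == 64   # exactly one 64-byte cache line

# A quaternion [w, x, y, z]: 4 floats.
assert struct.calcsize('4f') == 16    # exactly one 128-bit SIMD register

print("3x3 =", struct.calcsize('9f'), "B; 4x4 =",
      struct.calcsize('16f'), "B; quat =", struct.calcsize('4f'), "B")
```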

If the starting address of this matrix in memory is not aligned, this perfect fit suddenly becomes a hardware nightmare. Understand this with a bare-metal memory address example: suppose the CPU's first cache line covers addresses 0 to 63, and the second covers 64 to 127. If your entire 64-byte matrix is perfectly aligned (meaning it starts at address 0), it fits inside 0 to 63 in a single shot. But if the memory allocator shifts it even a little and starts it at address 16, the data crosses the boundary. Result? The first 48 bytes of the matrix stay in the first cache line, and the remaining 16 bytes spill into the second. To process this unaligned data, the hardware now has to pick up two separate cache lines for a single fetch and stitch them together. If you are using SIMD instructions, then without strict alignment either the CPU will flat-out give a segmentation fault (crash), or, if you used an unaligned load instruction (movups), the pipeline will stall and the load latency can double. And if by bad luck this unaligned data crosses a 4KB page boundary, a TLB miss triggers and the CPU has to do a page walk, which can literally drop your speed by up to 100x.
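The cache-line arithmetic here is simple enough to sketch (assuming the typical 64-byte line):

```python
LINE = 64  # typical cache-line size in bytes

def lines_touched(addr: int, size: int) -> int:
    # How many 64-byte cache lines does the object [addr, addr+size) span?
    first = addr // LINE
    last = (addr + size - 1) // LINE
    return last - first + 1

assert lines_touched(0, 64) == 1    # perfectly aligned: one line (bytes 0..63)
assert lines_touched(16, 64) == 2   # shifted to address 16: spills into line 2
assert lines_touched(0, 16) == 1    # a 16-byte quaternion anywhere aligned: one line

print("aligned 4x4:", lines_touched(0, 64), "line;",
      "misaligned 4x4:", lines_touched(16, 64), "lines")
```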

After this battle of cache lines, when the data comes inside the CPU core for final execution, another hardware limit is hit: registers. We have XMM registers, which are only 128 bits wide. That directly means a single register can hold only 4 floating-point values. When you sit down to process a 4x4 matrix with 16 values, you have to do messy loading across multiple registers, which slows the pipeline.

On the other hand, how clean and fast a quaternion is in memory is a masterstroke in itself. A quaternion's layout is absolutely precise: [w, x, y, z] together make 4 floats, and its size is exactly 16 bytes. This compact size saves us memory fetches. We avoid gimbal lock anyway, but we also use the L1 cache very efficiently. In reality, the entire [w, x, y, z] (all 16 bytes) is a native hardware fit. Modern CPUs have 128-bit registers (like the SSE XMM registers on Intel, or the NEON Q registers on ARM). Four floats times 4 bytes is 16 bytes, and 16 bytes is exactly 128 bits. That directly means the CPU can load the entire quaternion into a register in a single instruction and multiply it. Therefore its math is much faster than the matrix's.

But here is a very big catch. Perfect loading of data into a register is only a storage and bandwidth advantage; when it comes to computation, like doing the quaternion multiplication (qvq^(-1), the sandwich approach) to rotate a 3D vector, the game changes. For the multiplication, the hardware has to do cross and dot products among w, x, y, and z. And right here, memory layout becomes our biggest obstacle. When you fetch XYZ values, you have to hop around in memory because the data is laid out in rows (what we call an AoS layout). You can write branchless code using SIMD, but if you start horizontal processing (manipulating data within a single register), its overhead becomes so high that the whole point of using SIMD is defeated. To solve real-time physics we have a window of only about 2ms. There is only one way to hit that budget: eliminate the shuffling overhead and align the data, via swizzling, so that it can stream straight into the registers.
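For reference, here is what that sandwich computation actually does, as a plain-Python sketch (no SIMD, just the math; the helper names are my own):

```python
import math

def qmul(a, b):
    # Hamilton product of quaternions stored as (w, x, y, z).
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw*bw - ax*bx - ay*by - az*bz,
            aw*bx + ax*bw + ay*bz - az*by,
            aw*by - ax*bz + ay*bw + az*bx,
            aw*bz + ax*by - ay*bx + az*bw)

def rotate(q, v):
    # The sandwich q * v * q^-1, with v embedded as the pure quaternion (0, x, y, z).
    # For a unit quaternion, the inverse is just the conjugate.
    qc = (q[0], -q[1], -q[2], -q[3])
    return qmul(qmul(q, (0.0,) + v), qc)[1:]

# 90 degrees about the z-axis: q = (cos 45, 0, 0, sin 45).
h = math.radians(45.0)
q = (math.cos(h), 0.0, 0.0, math.sin(h))

rx = rotate(q, (1.0, 0.0, 0.0))
print("x-axis rotated 90 degrees about z:", rx)  # approximately (0, 1, 0)
```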

Efficient data alignment and SIMD execution is the bar that separates an average engine from a high-performance bare-metal engine.


u/IamRustyRust — 7 hours ago

Physics engine solver: TGS (Temporal Gauss-Seidel) Math.

TGS (Temporal Gauss-Seidel) Calculation For Physics Solver:

The formula of TGS

In academic language Temporal Gauss-Seidel (TGS) is a bare-metal matrix solver designed to bring a massive, interconnected web of physical constraints (like stacked boxes or character joints) into perfect balance without choking the CPU. It does this by sweeping through the chain of objects one by one: it calculates the strictly unbalanced force acting on a single object, shifts it into local equilibrium using its mobility, and immediately uses that newly updated position to calculate the forces on the next connected object. However, to prevent this heavy, sequential chain-reaction from exhausting our CPU cycles, it injects the dimension of "Time" (Temporal) by recycling the exact final physics state from the previous frame and using it as the "initial guess" for the new frame. Because objects barely move in 1/60th of a second, this brilliant shortcut saves the hardware dozens of calculation loops, instantly resolving complex physical webs into a stable state.

If we carefully analyze the TGS formula, the solver is essentially isolating a single object within an array. By extracting all the surrounding forces, it focuses exclusively on the inertia (mass) and pure external gravity acting on that specific object. To understand exactly how TGS works under the hood, let us visualize a physical scenario.

Imagine a person hanging off the edge of a cliff. But he isn't alone: his legs are being grasped by a second person, whose legs are held by a third. This chain of people hanging helplessly below the cliff can be viewed as our first array: [B1, B2, B3], where B3 is at the very bottom and B1 is right at the top, holding onto the cliff edge. Now, suppose there are people safely standing on top of the cliff trying to pull them back up. One person is holding B1's hand directly, a second person is pulling that person, and a third person is anchoring them all. This creates our upper array: [T3, T2, T1], where T1 is the one in direct physical contact with B1.

If we stitch this entire system together, we get a single appended array: [T3, T2, T1, B1, B2, B3]. In this massive interconnected chain, our Main Focus Point is strictly B1. Why? Because if B1's grip fails, everyone hanging below him will instantly fall into the abyss. Furthermore, immediately following T1, B1 is the very first element actually suspended in the air rather than standing solidly on the mountain. Until the solver properly stabilizes B1, it is physically impossible to accurately calculate the fate of B2 and B3.

Looking at the math, the very first term we see is x_i^(k+1). If we are recording this entire scene like a movie, 'k' represents the previous frame (the past), and 'k+1' represents the present frame. If we call this entire human chain the X array, then 'x_i' represents our main focus element: B1. Basically, the CPU is calculating the specific forces acting exclusively on this one guy, x_i.
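For reference, the update this section is walking through is the classic Gauss-Seidel step; reconstructed from the terms described here, it reads:

```latex
x_i^{(k+1)} \;=\; \frac{1}{a_{ii}} \left( b_i
    \;-\; \sum_{j=1}^{i-1} a_{ij}\, x_j^{(k+1)}
    \;-\; \sum_{j=i+1}^{n} a_{ij}\, x_j^{(k)} \right)
```

Every piece of this is unpacked below: 1/a_ii is the mobility, b_i is the pure external force, and the two sums are the already-updated neighbors (k+1) and the not-yet-visited neighbors (k).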

First, we have to determine B1's own mass and inertia, which is represented in the formula by the term a_ii. We know that mass and inertia provide 'stubbornness', a natural resistance to movement. This is exactly why gravity affects a heavy iron ball and a light foam ball differently. But to resolve a physics constraint, we don't want stubbornness; we desperately want movement. Therefore, the formula mathematically reverses this inertia by flipping it upside down: 1 / a_ii. And the exact inverse of inertia is mobility.

Next, we have to account for gravity. But remember, B1 isn't floating in a vacuum. We need to find the pure gravitational force acting on him, but B2 and B3 are constantly dragging B1 further down (adding heavily to the downward force), while T1, T2, and T3 are pulling B1 up (fighting to cancel out that downward force). To find the true, raw state of B1, we must measure the forces of all these neighbors. This is exactly where the two Sigmas (the summation symbols) come into play.

  • The First Sigma (sum of j=1 to i-1): This represents our T-array, the people pulling up. Because the CPU has already solved these upper elements in the current execution loop, they carry the (k+1) "present" state.
  • The Second Sigma (sum of j=i+1 to n): This represents our B-array, the people dragging him down. Because the CPU hasn't physically reached them yet in its sweep, they are still stuck in the (k) "past" state.

In both of these sigmas, the a_ij term represents the physical connection: the stiffness of the spring, or the tight grip of the hands, that is transferring force between every pair of connected elements.
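Putting the described pieces together, a minimal Gauss-Seidel sweep with a temporal warm start might look like this (a toy 2-element system, not a real physics solver):

```python
def gauss_seidel(A, b, x, sweeps):
    # x is the warm start: in TGS, last frame's final state.
    n = len(b)
    for _ in range(sweeps):
        for i in range(n):
            # Forces from already-updated neighbors (the "present", k+1)...
            s_new = sum(A[i][j] * x[j] for j in range(i))
            # ...and from not-yet-visited neighbors (the "past", k).
            s_old = sum(A[i][j] * x[j] for j in range(i + 1, n))
            # Mobility (1/a_ii) times the unbalanced force (the residual).
            x[i] = (b[i] - s_new - s_old) / A[i][i]
    return x

# A tiny diagonally dominant system; the exact solution is x = [1, 1].
A = [[4.0, 1.0],
     [1.0, 3.0]]
b = [5.0, 4.0]

cold = gauss_seidel(A, b, [0.0, 0.0], sweeps=20)
print("cold start, 20 sweeps:", cold)

# Warm start: begin from (almost) last frame's answer, so far fewer sweeps suffice.
warm = gauss_seidel(A, b, [0.99, 1.01], sweeps=3)
print("warm start, 3 sweeps:", warm)
```

This is exactly the "recycle the previous frame" trick: because the warm-start guess is already near the answer, three sweeps land where the cold start needed many more.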

The Unbalanced Force & The Slip:

Now we arrive at a highly critical concept. Inside the main bracket, the very first term you see is b_i. This is strictly the pure external force acting on B1, meaning B1's own isolated weight (his mass multiplied by gravity) pulling him straight down.

So, what does the CPU actually do? Inside that bracket, it takes b_i (B1's pure gravity) and strictly subtracts those two Sigmas (the pulls from the guys above and the guys below). When the CPU executes this subtraction, the final result it spits out is the 'UNBALANCED FORCE' (also known as the Residual).

Let's apply some hard numbers to our analogy. Imagine B1's own gravity (b_i) combined with the drag of the people below him (the second sigma) are pulling down with a massive total force of 100. Meanwhile, T1 above him (the first sigma) is pulling up with a force of only 90. When we subtract these forces (100 minus 90), we are left with a net downward force of 10. This remaining force is our Unbalanced Force. It dictates that B1 is definitely not in equilibrium; he is actively experiencing a continuous, physical pull downwards.

Because B1 will inevitably have to change his position due to this Unbalanced Force, the CPU takes B1's Mobility (that 1 / a_ii we calculated earlier) and multiplies it directly by this exact Unbalanced Force:

Mobility (1 / a_ii) * Unbalanced Force

When this multiplication occurs, this is the exact physical moment we call 'B1's hand slipping from T1'. This slip is absolutely not some random external error; it is the direct physical displacement caused by that remaining force of 10. Driven entirely by the math, B1's hand will physically slide across the grid to a brand-new position where all surrounding forces temporarily cancel each other out. This newly calculated, fully stabilized position becomes our final output on the left side of the equation: x_i^(k+1).

u/IamRustyRust — 7 days ago

The Illusion of Division and the Newton-Raphson Method

The Illusion of Division:

Multiplication is basically a looping of addition, and division is a looping of subtraction. For this reason, division is the exact opposite of multiplication. Therefore, to calculate the result of a division, we can reverse it into a multiplication: instead of dividing by D, we multiply by 1/D. When we do this, we never actually divide, yet we still get the division's result. Now, let's talk strictly about multiplication. When we multiply, we are actually just running a loop of addition. We can explain this very easily by splitting a multiplication problem into two parts: LHS X RHS (Left Hand Side multiplied by Right Hand Side). The LHS is our base value, and the RHS decides exactly how many times we run the loop of adding the LHS to itself. If I say 2 X 5, its meaning is simply this: the number 2 has to be added 5 times.

To make this crystal clear, let me give you a non-numerical example. Imagine a man running laps around a stadium. The physical meaning of one lap is one loop, and exactly how many laps have to be made is decided entirely by the RHS. Now, suppose I say 1 X 0.5. Our LHS is 1, which means the value of one full lap is 1. But the RHS says the laps to be made are only 0.5. So, how much looping has to be done? Exactly half of our LHS. So what will our answer be? It will be 0.5, simply because the amount of the lap we actually completed was half.

In the same way, suppose our LHS is 2 and our RHS is 5, making it 2 X 5. If we run one lap, its value will be 2. How much looping do we have to do? 5, meaning 5 laps have to be made. As we keep running laps, their values keep getting added together. Since the value of one lap is 2 (our LHS), and we have to run 5 laps, the physical meaning of this is: 2 + 2 + 2 + 2 + 2 = 10. In 2 X 5, 2 is the base, and we loop it 5 times. Since one loop's value is 2, the result is 10.
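As a sketch, the lap-counting story is literally this loop:

```python
def mul_by_looping(lhs, rhs):
    # rhs whole laps, each lap worth lhs: multiplication as repeated addition.
    total = 0
    for _ in range(rhs):
        total += lhs
    return total

assert mul_by_looping(2, 5) == 2 + 2 + 2 + 2 + 2 == 10
print("2 x 5 as five laps of 2:", mul_by_looping(2, 5))
```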

Now, if we look at this from another angle, mathematically 10 X 2^(-1) = 10 / 2 = 5. What is this telling us in reality? It asks: if our total distance is 10, and the value of one lap is 2, how many laps is it made of? The answer comes out to 5. This means if we do the exact reverse, we will reach from 10 back down to 0. Earlier we did 2 + 2 + 2 + 2 + 2 = 10, so now we will reverse it by subtracting that same 2 exactly five times: 10 - 2 - 2 - 2 - 2 - 2 = 0. So, in reality, when we are dividing, we are actually just multiplying by n^(-1). When we divide any number by n, we are strictly just multiplying that number by its inverse (n^(-1)). Division in math actually does not exist as an independent operation, it is simply just the multiplication of the reverse number.
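And the reverse direction, division as a loop of subtraction, is just as short:

```python
def div_by_looping(total, lap):
    # Count how many laps of size `lap` fit by subtracting until nothing is left:
    # division as repeated subtraction.
    laps = 0
    while total >= lap:
        total -= lap
        laps += 1
    return laps

assert div_by_looping(10, 2) == 5   # 10 - 2 - 2 - 2 - 2 - 2 = 0, five subtractions
print("10 / 2 by repeated subtraction:", div_by_looping(10, 2))
```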

Division with Bit Shifting:

Ancient hardware used to use this looping method for division and for multiplication, but it was very costly and used to eat many CPU cycles. What modern hardware uses is the Newton-Raphson Method. Before going towards that, I want to talk about how we achieve a division result with bit shifting.

The Repetition of 0 to 9

So I will start with the repetition of 0 to 9, and from there reach bit-shifting division.

In our math, all numbers are actually built from 0 to 9 only. If I keep only the right-most digit (the units place) of every number, you will find that the digits 0 to 9 are continuously repeating. Let me give an example. Take the counting from 0 to 29. If I categorize it after every 10 numbers, where each group starts at 0 and goes up to 9, then 3 categories will be formed: the first category 0 to 9, the second category 10 to 19, and the third category 20 to 29.

Now, if I keep only the right-most digit in every category, the digits 0 to 9 repeat. In this particular 0-to-29 range of ours, the left-most digit (the tens digit) decides the category. So I will make three categories, which we will call 'series'. That gives us three series: the Series of 0 (containing the numbers 00 to 09), the Series of 1 (containing 10 to 19), and in the same way the Series of 2. Because we have categorized by the left-most digit, we write the number of the series at the top of the series, so we do not need to write the left-most digit every time.

Now, suppose I have the number 15 and I do addition on it. The tendency of my number will be to move from right to left. If I want to turn 15 into 25, the number moves into the second category (Series 2), which is exactly to its left. So it shifted from right to left. But if I look at this exact thing with subtraction, my number jumps from left to right. Meaning, if I wanted to turn 25 into 5, I would have to go two categories to the right (first into the Series of 1, then into the Series of 0).

Shifting Due to Addition and Subtraction

In a way, we have changed the decimal digits into positions (like the Little Endian format works in memory, where the lower value is on the right side and the higher value is on the left side). Now, through this very positional shifting we can get the result of division. Because we have seen that division is nothing but a loop of subtraction, we can achieve the result of division by doing a 'Right Shift' on the value.
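As a tiny sketch of this decimal idea (the function name is mine): dropping the right-most digit is the same as one category shift to the right, which is the same as dividing by 10.

```python
def category_shift_right(n):
    # drop the units digit: one category shift to the right in decimal,
    # which is the same as integer division by 10 (a loop of subtracting 10s)
    return n // 10

print(category_shift_right(25))   # 2
print(category_shift_right(157))  # 15
```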

We will apply exactly this logic at the bit level, because at the bit level too, the lower bit is on the right and the higher bit is on the left. So suppose we have 32-bit hardware. If we right-shift the bits (like >> 1), then we can directly get the result of division.

For Example:
The hardware understands only 0 and 1 at the bit level.
Assume we have a decimal number: 8
In hardware this will be written in binary as: 1000. Now we apply the Right Shift operator on this number just once (8 >> 1). The physical meaning of this is to pick up the entire number, push it one step towards the right, and put a 0 in the empty space left on the left. When we shift 1000 one step right, the right-most 0 falls off the edge and the number becomes: 0100. If we read 0100 back in decimal, its value is: 4

We did not apply any division formula; we only shifted the number one category to the right at the hardware level, and the 8 automatically became 4.
Meaning the strict mathematical meaning of Number >> 1 in hardware is: Number divided by 2. If we shift two times (>> 2), it gets divided by 4. Division by powers of two is actually achieved through exactly this kind of physical right-shifting in the ALU of the CPU.

But this hardware cheat code has one big limitation: it only works when dividing by 2, 4, 8, 16 (powers of 2)! The binary base-2 system works such that one right shift divides by 2, and two right shifts divide by 4. But if a game engine needs to divide 100 by 3, bit-shifting fails badly, because you cannot shift by half a bit.
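The same shift works directly on Python integers; a quick sketch of both the cheat code and its limitation (`shift_divide` is my own name):

```python
def shift_divide(n, k):
    # right-shift by k: the bottom k bits fall off the edge,
    # which is exactly integer division by 2**k (for non-negative n)
    return n >> k

# the cheat code works for powers of two...
print(shift_divide(8, 1))   # 4  (8 / 2)
print(shift_divide(8, 2))   # 2  (8 / 4)
# ...but no whole number of shifts divides by 3:
print(100 // 3)             # 33, needs a real division method
```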

When it comes to division by 3, 7, or any other non-power-of-2, the CPU once again needs a proper mathematical method. And this is where a technique is born in which the CPU reaches the answer without actually dividing: the Newton-Raphson Method.

Division With the Newton-Raphson Method:

Newton-Raphson is an iterative machine: with every iteration it gets closer and closer to the perfect result of the division. How does this work? We will use exactly what we learned earlier.

Division with the Newton-Raphson Method

Look, if I do N/D, then as we have seen, in our math there are two families. One is the 'Addition Family', whose most powerful member is the exponent, because the looping of multiplication itself makes the exponent. And on the other side, our second family is the 'Subtraction Family', whose most powerful member is the negative exponent, because the looping of division itself makes the negative exponent. From this we can conclude that the Addition and Subtraction families are opposites of each other. Meaning, the work we want to achieve through division, that exact work we can achieve through multiplication.

Therefore, if I turn this into 1/D, or D^(-1), I have converted it into multiplication. And for computer hardware, multiplication is very easy compared to division. Meaning, if I assume that M = 1/D, then we bring N/D straight into the form N X M, which is very easy to calculate.

This is exactly what Newton-Raphson does. It finds the value of this 1/D (meaning M). What does it do for this? It makes an initial guess and puts that guess into its mathematical machine. This machine iteratively refines it, driving it closer and closer to the exact value of 1/D.

In this machine, x_n+1 is our new refined guess. The previous guess, which was a little wrong, this machine refines so that the product D X x_n moves closer to the perfect balance of 1. Here D is our original value (the original denominator). But the real work is being done by this 2! The 2 sitting here is actually the mechanism that refines our result. It sees how far our guess is from the perfect balance of 1: is the product too small, or has it become too big? We subtract this product from 2, and that is how the 2 balances our guess every time. If the product is bigger or smaller than 1, then accordingly this 2 refines the result on every iteration and tries to balance it at 1.

Proof That the 2 Works Like Two Edge Springs

To visualize this, you can think of this 2 as two springs. One spring is attached at 0, one spring is attached at 2, and exactly in their middle sits 1. We have a ball (our guess). With every new guess, these two springs basically try to balance the ball at 1. If the ball goes towards 0 (the guess became too small), the spring at 0 pushes the ball towards 2. If the ball goes towards 2 (the guess became too big), the spring at 2 pushes it back towards 0. In this way both springs push it away from themselves. The ball is pushed back and forth between these springs, meaning this iteration runs until the ball, away from both springs, settles perfectly in the middle at 1!
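A minimal sketch of that machine, assuming the textbook reciprocal iteration x_n+1 = x_n X (2 - D X x_n); the function names, the default guess, and the iteration count are my own choices, not part of any real hardware implementation:

```python
def reciprocal(d, guess=0.1, iterations=25):
    # Newton-Raphson for 1/d: each pass pulls the product d*x towards 1
    x = guess
    for _ in range(iterations):
        x = x * (2.0 - d * x)   # the "2" springs the guess back towards 1/d
    return x

def divide(n, d):
    # N / D rewritten as N * (1/D): only multiplications are ever executed
    return n * reciprocal(d)
```

`divide(100, 3)` comes out at roughly 33.3333. Note the iteration only converges when the starting guess lies strictly between 0 and 2/D; outside that band the "springs" fling the ball away instead of balancing it.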

Can the Newton-Raphson Method Be Used on the GPU?

Until now, we have seen how easily an iterative machine (like Newton-Raphson) solves a problem like division. From a hardware point of view, balancing a single variable between two boundaries (like 0 and 2) is very easy. This optimization works so perfectly on the GPU because there is 'Data Independence' in it. To solve its math, one calculation does not need to know the state of another calculation. Therefore, the thousands of parallel cores of the GPU can flawlessly finish their respective work at the exact same time.

But in a real-world Physics Engine, things do not exist alone. Imagine that in a game, a pile of wooden boxes placed one on top of another falls to the ground. As soon as they hit the ground, a mutual interconnected web of forces is formed between them. The bottom-most box tries to stop the one above it, the second pushes the third, and the weight of the third box puts pressure back on the bottom-most box through the second. Millions of variables are getting entangled and pushing against each other in this way at the exact same time. Newton-Raphson alone fails terribly in solving this mutual web, because as soon as you fix the math of any one box, the calculation of all the rest gets ruined.

This mutual web becomes even more complex when we come to the physics of any 'Hero' character or advanced vehicle. Hero physics is actually a highly interdependent matrix of rigid bodies, joints, and hinges. As an example, if the hand of a Hero character hits a wall, the final state of his shoulder cannot be mathematically calculated until the exact force of the wrist and elbow is resolved.

To understand why the GPU fails here, we first have to look at the mechanical freedom of the CPU, which bare-metal engines harness through 'Fibers'. Fibers are essentially a very smart 'To-Do list' for the processor. The biggest advantage of Fibers is their extreme 'Individuality' (freedom). If one CPU thread starts a physics job and stalls (perhaps waiting for a joint to resolve), it is not necessary that the same thread must finish it; another completely free thread can jump in and finish that exact job. Here, every thread can make its own independent decision.

But, you cannot implement a free, independent system like Fibers on a GPU. Why? Because GPU threads do not survive alone. They are physically chained together in groups of 32 (or 64), which NVIDIA calls 'Warps' and AMD calls 'Wavefronts'. All 32 threads in a Warp must follow the exact same command at the exact same nanosecond. They have absolutely zero 'Individuality'.

If we hand over this sequential, highly-dependent web of boxes and joints to the GPU, its parallel cores will get terribly stuck. Because they are locked in Warps, thousands of threads will stand blindly behind memory barriers, waiting for each other's work to finish. That is why bare-metal systems like TitanEngine take this heavy physics out from the GPU and route it strictly towards the CPU. Because the CPU, with its independent Fibers, is a master of solving branch-heavy and sequential logic in a straight line, without stopping.

Now, because the CPU is not solving any one isolated division, but is solving this entire web of constraints together, our Newton-Raphson machine becomes insufficient here. To handle this massive and interconnected system of equations together, we have to upgrade this mathematical machine of ours from a 'Scalar Solver' to a 'Matrix Solver'. A solver which sweeps together across all the joints and boxes until their forces balance among themselves and in this world of physics, its name is Gauss-Seidel.

Finally, the danger always remains that this entire heavy calculation might eat up all the CPU cycles of our hardware and bottleneck the system. To avoid this, we add the dimension of Time into this equation meaning we use the final result of the previous frame as the 'Initial Guess' for the new frame. With this one trick, we save dozens of calculation cycles. And by combining all these concepts, the industry standard of bare-metal rigid body physics is formed: Temporal Gauss-Seidel (TGS).


Physics Engine Solver: TGS (Temporal Gauss-Seidel) Math.

TGS (Temporal Gauss-Seidel) Calculation For Physics Solver:

The formula of TGS

In academic language Temporal Gauss-Seidel (TGS) is a bare-metal matrix solver designed to bring a massive, interconnected web of physical constraints (like stacked boxes or character joints) into perfect balance without choking the CPU. It does this by sweeping through the chain of objects one by one: it calculates the strictly unbalanced force acting on a single object, shifts it into local equilibrium using its mobility, and immediately uses that newly updated position to calculate the forces on the next connected object. However, to prevent this heavy, sequential chain-reaction from exhausting our CPU cycles, it injects the dimension of "Time" (Temporal) by recycling the exact final physics state from the previous frame and using it as the "initial guess" for the new frame. Because objects barely move in 1/60th of a second, this brilliant shortcut saves the hardware dozens of calculation loops, instantly resolving complex physical webs into a stable state.

If we carefully analyze the TGS formula, the solver is essentially isolating a single object within an array. By extracting all the surrounding forces, it focuses exclusively on the inertia (mass) and pure external gravity acting on that specific object. To understand exactly how TGS works under the hood, let us visualize a physical scenario.

Imagine a person hanging off the edge of a cliff. But he isn't alone: his legs are being grasped by a second person, whose legs are held by a third. This chain of people hanging helplessly below the cliff can be viewed as our first array: [B1, B2, B3], where B3 is at the very bottom and B1 is right at the top, holding onto the cliff edge. Now, suppose there are people safely standing on top of the cliff trying to pull them back up. One person is holding B1’s hand directly, a second person is pulling that person, and a third person is anchoring them all. This creates our upper array: [T3, T2, T1], where T1 is the one in direct physical contact with B1.

If we stitch this entire system together, we get a single appended array: [T3, T2, T1, B1, B2, B3]. In this massive interconnected chain, our Main Focus Point is strictly B1. Why? Because if B1's grip fails, everyone hanging below him will instantly fall into the abyss. Furthermore, immediately following T1, B1 is the very first element actually suspended in the air rather than standing solidly on the mountain. Until the solver properly stabilizes B1, it is physically impossible to accurately calculate the fate of B2 and B3.

Looking at the math, the very first term we see is x_i^(k+1). If we are recording this entire scene like a movie, k represents the previous frame (the past), and k+1 represents the present frame. If we call this entire human chain the X array, then x_i represents our main focus element: B1. Basically, the CPU is calculating the specific forces acting exclusively on this one guy, x_i.

First, we have to determine B1's own mass and inertia, which is represented in the formula by the term a_ii. We know that mass and inertia provide stubbornness, a natural resistance to movement. This is exactly why gravity affects a heavy iron ball and a light foam ball differently. But to resolve a physics constraint, we don't want stubbornness; we desperately want movement. Therefore, the formula mathematically reverses this inertia by flipping it upside down: 1 / a_ii. And the exact inverse of inertia is Mobility.

Next, we have to account for gravity. But remember, B1 isn't floating in a vacuum. We need to find the pure gravitational force acting on him, but B2 and B3 are constantly dragging B1 further down (adding heavily to the downward force), while T1, T2, and T3 are pulling B1 up (fighting to cancel out that downward force). To find the true, raw state of B1, we must measure the forces of all these neighbors. This is exactly where the two Sigmas (the summation symbols, where each Sigma represents an array) come into play.

  • The First Sigma (Sum of j=1 to i-1): This represents our T-array: the people pulling up. Because the CPU has already solved these upper elements in the current execution loop, they carry the (k+1) "Present" state.
  • The Second Sigma (Sum of j=i+1 to n): This represents our B-array: the people dragging him down. Because the CPU hasn't physically reached them yet in its sweep, they are still stuck in the (k) "Past" state.

In both of these Sigmas, the a_ij term represents the physical connection: the stiffness of the spring or the tight grip of the hands that is transferring force between every single element (person).
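Putting these named pieces together, the update being walked through is the standard Gauss-Seidel formula, reconstructed here from the terms above:

```latex
x_i^{(k+1)} = \frac{1}{a_{ii}} \left( b_i
    - \sum_{j=1}^{i-1} a_{ij}\, x_j^{(k+1)}
    - \sum_{j=i+1}^{n} a_{ij}\, x_j^{(k)} \right)
```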

The Unbalanced Force & The Slip:

Now we arrive at a highly critical concept. Inside the main bracket, the very first term you see is b_i. This is strictly the pure external force acting on B1, meaning B1's own isolated weight (his mass multiplied by gravity) pulling him straight down.

So, what does the CPU actually do? Inside that bracket, it takes b_i (B1's pure gravity) and strictly subtracts those two Sigmas (the pulls from the guys above and the guys below). When the CPU executes this subtraction, the final result it spits out is the UNBALANCED FORCE (also known as the Residual).

Let's apply some hard numbers to our analogy: Imagine B1's own gravity (b_i) combined with the drag of the people below him (the Second Sigma) are pulling down with a massive total force of 100. Meanwhile, T1 hanging above (the First Sigma) is pulling up with a force of only 90. When we subtract these forces (100 minus 90), we are left with a net downward force of 10. This remaining force is our Unbalanced Force. It dictates that B1 is definitely not in equilibrium; he is actively experiencing a continuous, physical pull downwards.

Because B1 will inevitably have to change his position due to this Unbalanced Force, the CPU takes B1's Mobility (that 1 / a_ii we calculated earlier) and multiplies it directly by this exact Unbalanced Force:

Mobility (1 / a_ii) * Unbalanced Force

When this multiplication occurs, this is the exact physical moment we call 'B1's hand slipping from T1's grip'. This slip is absolutely not some random external error; it is the direct physical displacement caused by that remaining force of 10. Driven entirely by the math, B1's hand will physically slide across the grid to a brand new position where all surrounding forces temporarily cancel each other out. This newly calculated, fully stabilized position becomes our final output on the left side of the equation: x_i^(k+1).
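The sweep described above can be sketched in a few lines. This is a toy scalar version of a Gauss-Seidel sweep, not a real rigid-body solver; the function names and the warm-start argument are my own illustration of the idea:

```python
def gauss_seidel_sweep(a, b, x):
    # One sweep over the chain. For each element i:
    #   - the first sigma uses neighbours already updated this sweep (k+1)
    #   - the second sigma uses neighbours still holding last sweep's state (k)
    n = len(b)
    for i in range(n):
        pulled_up   = sum(a[i][j] * x[j] for j in range(i))         # first sigma
        dragged_down = sum(a[i][j] * x[j] for j in range(i + 1, n)) # second sigma
        residual = b[i] - pulled_up - dragged_down                  # unbalanced force
        x[i] = residual / a[i][i]                                   # mobility * residual
    return x

def solve(a, b, x, sweeps=100):
    # The "Temporal" trick: pass last frame's solution in as x (the initial
    # guess) and far fewer sweeps are needed to re-balance the system.
    for _ in range(sweeps):
        gauss_seidel_sweep(a, b, x)
    return x
```

For a small well-conditioned system like a = [[4, 1], [1, 3]], b = [1, 2], the sweeps settle on the exact solution x = [1/11, 7/11]; starting from last frame's x instead of zeros cuts the number of sweeps needed.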
