r/computerarchitecture

Good material on how CPUs fetch values from RAM?

Hi,

Any good reads or videos on how, specifically, the CPU retrieves data? Stack versus heap, and why buffer overflows *can* occur.

u/Yha_Boiii — 21 hours ago

Breaking the Binary Bottleneck: Native Base-8 Logic Architecture (NDR-Octabit-Core) with O(1) Performance. Looking for Hardware/Quantum Partners

Hello everyone,

For decades, the computing industry has been locked into the binary paradigm. While silicon scaling is hitting its physical limits, most optimization efforts remain at the software level, leaving the underlying foundational logic untouched. I have developed and officially registered the NDR-Octabit-Core, a computational logic system designed to run on a native Base-8 architecture instead of traditional Base-2.

⚙️ The Core Innovation

The NDR-Octabit-Core bypasses the standard binary tree structures for data processing. By implementing a native 8-state logical mapping, the system achieves a predictable O(1) time complexity in execution benchmarks, eliminating the latency fluctuations (O(log n)) typical of traditional binary address and allocation mechanisms.

Scientific Timestamp & Registry: The architecture, formal benchmarks, and core implementation in C++ have been published and indexed via Zenodo with a public Digital Object Identifier (DOI): https://doi.org/10.5281/zenodo.20128879

🚀 The Next Frontier: Scaling into Quantum & Hardware

The mathematical framework of the NDR-Octabit-Core naturally aligns with the next generation of computing:

Hardware (FPGA/ASIC): Moving from software emulation to native multi-level logic gates (similar to advanced MLC/QLC concepts but at a logic-gate level).
Quantum Computing (Qudits): Traditional quantum computing focuses on 2-level qubits. The NDR-Octabit logic is structurally ready to map natively into 8-level Qudits (Octits), potentially offering a more efficient control layer and real-time state tracking without classical binary translation overhead.

💼 What I am looking for:

The foundational logic is proven and benchmarked. I am now looking to transition this project from a validated scientific model into a physical/emulated reality. I am seeking:

Deep Tech Investors / Venture Capital: Interested in pre-seed infrastructure, semiconductor licensing, or paradigm-shifting hardware patents.
Hardware & FPGA Engineers: To collaborate on building a hardware description layer (VHDL/Verilog) for physical prototyping.
Quantum Computing Labs/Researchers: To co-develop the driver layer, mapping the Base-8 NDR logic into multi-level quantum simulators (like Qiskit) or physical qudit platforms.

If you are tired of incremental software patches and want to discuss a foundational architecture shift, let's connect. Contact: jarav2001 [at] gmail.com

reddit.com

u/Wrong_Vacation3262 — 1 day ago

▲ 6 r/computerarchitecture

Force a CPU to run userspace code in ring 0 / EL3?

Hi,

I'm writing an app, but the kernel restricts which syscalls I can make, and I don't want the stall time that comes from flushing on every switch to a new security level. Is there a way, at the VLSI level, to force a userspace app to run without switching? The usual 50-300 cycle penalty per switch is expensive when polling the network and manipulating the data in userspace, rinse and repeat.
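To put the 50-300 cycle figure in perspective, here is a back-of-the-envelope sketch. It only multiplies a per-switch cost by a switch rate; the 3 GHz clock and the one million switches per second are made-up assumptions, and `switch_overhead_fraction` is a hypothetical helper name. It covers the direct cost only and ignores indirect effects such as cache and TLB pollution.

```python
def switch_overhead_fraction(switches_per_sec, cycles_per_switch, clock_hz):
    """Fraction of all CPU cycles spent purely on privilege-level switches."""
    if clock_hz <= 0:
        raise ValueError("clock_hz must be positive")
    return switches_per_sec * cycles_per_switch / clock_hz


if __name__ == "__main__":
    CLOCK_HZ = 3_000_000_000      # assumed 3 GHz core, not from the post
    SWITCHES_PER_SEC = 1_000_000  # assumed polling rate, not from the post
    for cycles in (50, 300):      # the range quoted in the post
        frac = switch_overhead_fraction(SWITCHES_PER_SEC, cycles, CLOCK_HZ)
        print(f"{cycles} cycles/switch -> {frac:.2%} of CPU time")
```

Under those assumptions the direct cost lands between roughly 1.7% and 10% of CPU time, so whether it matters depends almost entirely on how often the switch happens.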


u/Yha_Boiii — 2 days ago

▲ 10 r/computerarchitecture+1 crossposts

Testing whether machine memory can be built from deterministic primitives instead of only LLM context, vector search, or databases.

I’m building Crystal: a local deterministic memory substrate for machines, based on biological memory primitives.

Instead of starting with language generation, I’m starting with memory primitives:

consolidation, temporal association, simplicity selection, bounded curiosity, and embodied feedback.

I’m releasing the work layer by layer so each claim can be tested.


u/Salt_Diamond5703 — 3 days ago

▲ 60 r/computerarchitecture+6 crossposts

[BLOG] Building a SIMD Scan-Line Rasterizer from Scratch

Built a hardware scan-line triangle rasterizer from scratch, full writeup here if interested

https://mummanajagadeesh.github.io/blogs/rasterizer/

It’s simulation-based for now, asking for feedback/suggestions on improvements

u/Large-Raisin-5912 — 3 days ago

▲ 7 r/computerarchitecture

Is this decomposition-based area modeling approach reasonable for microarchitecture DSE?

I am exploring a lightweight area modeling flow for microarchitecture DSE (design-space exploration). The goal is not signoff-accurate area estimation, but fast and structurally meaningful area prediction across many gem5 / HDL parameter configurations.

The core idea is to avoid using a single black-box model. Instead, I decompose the design into several structure classes and model them separately:

SRAM-like storage structures (e.g., caches, BTBs, large regular arrays)
Register/state-array structures (e.g., register files, rename tables, scoreboards)
Queue/buffer-like structures (e.g., ROB, LSQ, FIFO, write buffers)
CAM / associative selection logic (e.g., wakeup-select, associative lookup, priority/age selection)
Remaining control and arithmetic datapath (modeled as residual area after subtracting the first four categories)

For SRAM-like structures, I plan to use OpenRAM / SRAM compiler results as ground truth. For logic-like structures, I plan to synthesize representative RTL with Yosys and train separate ML models. The final chip area would be the sum of all category predictions.

The motivation is that different microarchitectural structures scale very differently with parameters like ports, entries, width, associativity, and issue width, so a single global predictor may not capture these scaling behaviors well.
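A minimal sketch of the "sum of per-category predictors" idea described above. Everything concrete here is an assumption for illustration: the category keys, the parameter names, the formulas, and the coefficients are placeholders standing in for OpenRAM results and trained models, and the area is in arbitrary units.

```python
from typing import Callable, Dict, List, Tuple

Params = Dict[str, float]


# Placeholder per-category predictors. In the flow described above these
# would be SRAM-compiler lookups or trained models; the formulas are made up.
def sram_area(p: Params) -> float:
    return 0.5 * p["entries"] * p["width"]


def state_array_area(p: Params) -> float:
    # Multi-ported arrays: cell area assumed to grow with ports squared.
    return 1.0 * p["entries"] * p["width"] * p["ports"] ** 2


def queue_area(p: Params) -> float:
    return 1.5 * p["entries"] * p["width"]


def cam_area(p: Params) -> float:
    # Assumed: one comparator per entry per search port.
    return 4.0 * p["entries"] * p["tag_bits"] * p["ports"]


PREDICTORS: Dict[str, Callable[[Params], float]] = {
    "sram": sram_area,
    "state_array": state_array_area,
    "queue": queue_area,
    "cam": cam_area,
}


def total_area(structures: List[Tuple[str, Params]], residual: float = 0.0) -> float:
    """Sum the per-structure predictions and add the residual logic estimate."""
    return sum(PREDICTORS[kind](p) for kind, p in structures) + residual


if __name__ == "__main__":
    core = [
        ("sram", {"entries": 1024, "width": 64}),                   # e.g. cache data array
        ("state_array", {"entries": 32, "width": 64, "ports": 3}),  # e.g. register file
        ("queue", {"entries": 64, "width": 80}),                    # e.g. ROB
        ("cam", {"entries": 16, "tag_bits": 7, "ports": 2}),        # e.g. wakeup logic
    ]
    print(total_area(core, residual=1000.0))
```

Keeping each predictor behind the same call signature means one category can be swapped for a better model without touching the others, which is the main practical appeal of decomposing by structure type.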

My questions are:

Does this decomposition make sense for early-stage microarchitecture DSE?
Are these categories architecturally meaningful from an area-modeling perspective?
Would you classify structures like ROB, issue queue, LSQ, rename table, and physical register file differently?
Is combining SRAM compiler/OpenRAM results with synthesized logic models a reasonable flow?
What are the biggest pitfalls of this approach?
Are there prior works or open-source projects that use a similar methodology?

I am mainly trying to understand whether this “decompose-by-structure-type” modeling strategy is fundamentally sound, even if absolute area accuracy is limited.


u/Low_Car_7590 — 4 days ago

▲ 12 r/computerarchitecture

Power modeling

How is power modeling done in industry and/or research? Performance modeling feels easy to understand, since you only need to model cycle behavior, but power seems much more difficult to estimate from abstract representations.


u/Visplay — 9 days ago

▲ 11 r/computerarchitecture

How big is the execution-time penalty for CPU mode switching?

Hi,

If a CPU runs a program in userspace rather than kernel space, how much execution time is lost to context switching and CPU mode changes? There are two forces at play: the CPU mode bit vector itself being flipped (e.g. EL0 - EL3), and then the kernel switching.

Nothing specific, just a finger-in-the-air estimate.


u/Yha_Boiii — 10 days ago

▲ 11 r/computerarchitecture+1 crossposts

CIM as a compute macro

I genuinely think CIM (compute-in-memory) has more promise and future than the SIMT architecture that is dominating the market right now. Yet the CIM narrative has gotten stuck on the same pitch: eliminate data movement, co-locate compute with memory, show a power efficiency chart. Unfortunately, a lot of these claims do not scale as the required performance rises to enterprise grade. That is not sufficient for a product when two of the three (memory size, bandwidth, or TFLOPs) cap out.
I’ve spent significant time working through what it actually takes to make CIM a first-class compute macro in an enterprise datapath: something that can handle mixed precision and scale with data bandwidth, tensor size, etc., that sits alongside a CPU or GPU tile, exposes a clean interface to a compiler stack, and meets the reliability bar that production workloads demand.

Here are some problems that are living rent-free in my head and are worth actually debating:
The macro interface is still an open problem. Memory-mapped, tensor-core-like, or something purpose-built for dataflow — each choice has deep implications for how a workload scheduler sees the device and how much you’re asking a compiler team to build from scratch. Unfortunately, most CIM architectures punt on this and call it a software problem.
How do you architect a CPU or GPU that actually harnesses CIM at scale? The interesting question is how you redesign the memory hierarchy, execution units, and dataflow control so CIM becomes a native citizen of the compute fabric rather than an accelerator bolted on the side. What does the ISA surface look like? How does the scheduler reason about CIM availability without destroying pipeline efficiency?
Datacenter-level deployment is a network fabric problem as much as a silicon problem. A CIM macro that wins on a single chip means little if the inference serving architecture can’t distribute workloads across a rack or pod efficiently. How do you design the interconnect and topology so that CIM’s power efficiency advantage isn’t eaten by communication overhead? What does a CIM-native inference cluster actually look like?
These are the conversations I find most scarce — people who’ve thought past the device level into the full system stack.

Particularly interested in hearing from anyone who’s seriously engaged with the architecture above the macro.


u/AdmirableProject1575 — 10 days ago

▲ 43 r/computerarchitecture+1 crossposts

Hey everyone,

I’m a solo developer working nights with zero budget on a new CPU architecture called Project RJ8A. I wanted to share a major milestone and get some feedback from the veterans here.

The Milestone: I just successfully generated the Verilog RTL for my proof-gated execution path and ran it through my master test suite. As you can see in the first screenshot, it passed 55/55 algorithms flawlessly (including Quicksort, Ackermann, and Hanoi).

In our current benchmark suite, the architecture shows a ~6.1x wall-time efficiency advantage compared to heavily optimized x86 (-O3 -march=native with guarding), assuming a 2 GHz Fmax target for the final ASIC.

The "Good" Problem (Screenshot 2): Up until now, I used an ECP5-85F for my hardware proofs. However, the full Sidecar-Wrapper design just exploded past the chip's limits. Yosys native 64-bit synthesis completed successfully (peaking at 52GB RAM usage on my workstation), but the design requires ~142,000 LUTs (169% utilization on the ECP5).
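As a quick sanity check on the utilization figure, here is a tiny sketch of the arithmetic. The 83,640 LUT4 capacity for the ECP5-85F is an assumption that is not stated in the post, and `utilization` is a hypothetical helper name; the 142,000 figure is the approximate one quoted above.

```python
ECP5_85F_LUT4 = 83_640  # assumed device capacity, not stated in the post


def utilization(required_luts, device_luts):
    """Ratio of LUTs the design needs to LUTs the device provides."""
    if device_luts <= 0:
        raise ValueError("device_luts must be positive")
    return required_luts / device_luts


if __name__ == "__main__":
    ratio = utilization(142_000, ECP5_85F_LUT4)
    print(f"utilization: {ratio:.1%}")
    # LUTs that do not fit on the assumed device.
    print(f"over capacity by {142_000 - ECP5_85F_LUT4} LUT4s")
```

With that assumed capacity the ratio comes out just under 170%, in line with the ~169% reported. LUT counts are not directly comparable across families with different LUT input widths, so the same ratio should not be reused as-is for the 7-series targets.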

The Pivot: I'm currently migrating the entire pipeline to the open-source F4PGA toolchain targeting a Xilinx Artix-7 (A200T) / Kintex-7 to accommodate the massive routing graph. VPR is running right now.

Since I'm bootstrapping this entirely alone and don't currently have the budget for a high-end physical Xilinx dev board to flash the final bitstream, I'm relying heavily on strict toolchain verification (Yosys -> VPR -> STA).

Are there any F4PGA/VPR veterans here who have pushed >140k LUT designs through the open-source Xilinx flow? Any pitfalls regarding memory limits or routing congestion I should watch out for during the 12-hour PnR runs?

Also open to any academic/seed partners who want to collaborate on the first physical boot. Thanks for reading!

https://preview.redd.it/0a301ij22mzg1.png?width=3840&format=png&auto=webp&s=ab94a877c3d1073431b22e9ed6d55654358e3dc5

https://preview.redd.it/us1wa52h2mzg1.png?width=2008&format=png&auto=webp&s=957749c492ae2b9b42035f4ffac4d25b48c1ef35

https://preview.redd.it/ypdkqx5k2mzg1.png?width=573&format=png&auto=webp&s=3bceaa19251b0ddee90f3df81981f7e0bf9a0771


u/Different-Breath-645 — 13 days ago

▲ 13 r/computerarchitecture

Is "execution model" a property of each abstraction level independently, or a top-level design principle?

Hi everyone, I'm an Italian student following the awesome Onur Mutlu's Digital Design and Computer Architecture course (ETH/CMU 447, publicly available on YouTube). I'm trying to understand what "execution model" actually means — apologies in advance, English is not my first language and I used AI assistance to help me formulate this question clearly, but the confusion is genuinely mine.

The way it's introduced in the course, "execution model" sounds like a top-level design principle — you choose an execution model (Von Neumann, dataflow) and then derive an ISA and a microarchitecture from it. But in practice, both the ISA and the microarchitecture seem to have their own execution model independently — and they can differ. OOO processors are the obvious example: sequential at the ISA level, dataflow-like at the microarchitecture level.
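The OOO example can be made concrete with a toy sketch: the same program is run once in program order and once with a dataflow-style rule where an instruction fires as soon as its source operands exist. This is only an illustration under simplifying assumptions: the instruction format and register names are made up, every destination is written exactly once (standing in for register renaming), and issue width is unlimited.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

# (dest, op, src1, src2); each dest is written once, standing in for renaming.
PROGRAM = [
    ("r3", "add", "r1", "r2"),
    ("r4", "mul", "r1", "r1"),
    ("r5", "add", "r3", "r4"),
    ("r6", "sub", "r2", "r1"),
]


def run_sequential(program, regs):
    """ISA-level view: one instruction at a time, in program order."""
    regs = dict(regs)
    for dest, op, a, b in program:
        regs[dest] = OPS[op](regs[a], regs[b])
    return regs


def run_dataflow(program, regs):
    """Dataflow-style view: each step fires every instruction whose inputs are ready."""
    regs = dict(regs)
    pending = list(program)
    schedule = []
    while pending:
        ready = [ins for ins in pending if ins[2] in regs and ins[3] in regs]
        if not ready:
            raise ValueError("an instruction depends on a value nobody produces")
        # Compute all results from the pre-step state, then commit them together.
        results = {dest: OPS[op](regs[a], regs[b]) for dest, op, a, b in ready}
        regs.update(results)
        schedule.append([ins[0] for ins in ready])
        pending = [ins for ins in pending if ins not in ready]
    return regs, schedule


if __name__ == "__main__":
    start = {"r1": 2, "r2": 5}
    final, schedule = run_dataflow(PROGRAM, start)
    print(schedule)
    print(final == run_sequential(PROGRAM, start))
```

The firing order differs between the two functions, but the final register state is identical, which is the sense in which the ISA-level model stays sequential while the implementation underneath does not.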

This makes me wonder: is "execution model" just a per-level descriptor — a way to characterize how instructions fire at each layer of the hierarchy — rather than a single overarching principle?

The reason I'm confused is that Von Neumann is presented as an execution model, but it's much more than a firing mechanism — it also includes stored program and a specific hardware organization. Dataflow, by contrast, is described almost purely as a firing mechanism. So either "execution model" means different things in the two cases, or Von Neumann is being used as a shorthand for something more specific.

Is there a clean definition of "execution model" in the literature, or is it consistently informal?

IMHO the "problem" is that, pedagogically speaking, the von Neumann model is presented as an indivisible package, but since its introduction a lot of abstractions have been added, complicating the picture.

Thanks in advance.


u/LoganHX — 11 days ago

▲ 5 r/computerarchitecture

C bounds checking

Hi,

How does bounds checking and the like work at a lower level?

Why is snprintf needed when, say, normal signed ints don't need bounds checking?

Today I got a reality check that the stack is also not bounds-checked. How does it actually work, on the heap or on the stack?

Any books, videos, or other material specifically on the asm level of it all, and the compiler's side of the story?
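To make the question concrete, here is a toy model of the situation, not real C and not how any particular compiler lays out a frame. A flat `bytearray` stands in for a stack frame with an 8-byte buffer followed by 8 bytes of unrelated data; the layout, the `RETADDR!` marker, and all function names are assumptions for illustration. The point it tries to show is that an int has a fixed size known at compile time, whereas the length of a formatted string depends on run-time data, so something has to carry the buffer size along.

```python
BUF_SIZE = 8  # assumed size of the destination buffer


def make_frame():
    """16 bytes of toy 'stack': an 8-byte buffer, then 8 bytes of other data."""
    frame = bytearray(16)
    frame[8:16] = b"RETADDR!"  # stand-in for whatever sits next to the buffer
    return frame


def unchecked_copy(frame, data):
    """Like strcpy/sprintf: writes every byte plus a NUL, never looks at BUF_SIZE."""
    for i, byte in enumerate(data + b"\x00"):
        frame[i] = byte


def bounded_copy(frame, data, size=BUF_SIZE):
    """Like snprintf: writes at most size-1 bytes of data, then a NUL."""
    for i, byte in enumerate(data[: size - 1] + b"\x00"):
        frame[i] = byte
    return len(data)  # the length that would have been written without the limit


if __name__ == "__main__":
    long_input = b"A" * 12

    frame = make_frame()
    unchecked_copy(frame, long_input)
    print("unchecked neighbour:", bytes(frame[8:16]))

    frame = make_frame()
    wanted = bounded_copy(frame, long_input)
    print("bounded neighbour:  ", bytes(frame[8:16]), "truncated:", wanted >= BUF_SIZE)
```

In the unchecked case the bytes past the buffer are simply overwritten, because at this level memory is just consecutive addresses with nothing marking where one object ends. The bounded version stays inside the buffer only because the caller passed the size explicitly.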


u/Yha_Boiii — 12 days ago

▲ 1 r/computerarchitecture+2 crossposts

Hi everyone,

I am a 1st-year B.Tech student, and I recently published a theoretical architecture preprint on Zenodo exploring how to bypass the Thermal Wall and RC Delay limits using a quasi-delay-insensitive (QDI) paradigm.

Link to Paper: https://zenodo.org/records/20055657?token=eyJhbGciOiJIUzUxMiJ9.eyJpZCI6Ijk5MGI1MzU2LTEyZGItNDA5Zi1iYzJjLTYwN2JlZDg4ZWRiYiIsImRhdGEiOnt9LCJyYW5kb20iOiIwNmZkZjA2ZmE5ZTRhMjE1MmNiMzNmNjhkZDM2ODhjYSJ9.qWnAPAz0EvW4OB819gAJ_jwncxSqO9w59BX9SKoC6mOUSPgglVEwbwKb2B9OkegSu6CtGlmlQBKjyJ0zxdD7cg

The TL;DR of the LAGS Architecture:

Core: Locally Asynchronous, Globally Synchronous (LAGS). Execution islands use 4-Phase RTZ handshakes and Valid Bit completion detection (single-rail, not dual-rail) to act as glitch-filtered QDI pipelines.
Thermal Management: A hardware-level Token Ring acts as a strict power-gating enabler, forcing a rotating thermal duty-cycle across the NoC to prevent Dark Silicon meltdowns without OS intervention.
The EDA Compromise: I know pure async is an EDA nightmare. To make this theoretically fabricable, the internal NoC is clockless, but the boundaries are wrapped in standard Synchronous Interfaces (Two-Flop Synchronizers) to act as a Trojan Horse for Static Timing Analysis (STA) tools.
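For readers less familiar with the handshake mentioned above, here is a minimal sketch of the generic four-phase return-to-zero request/acknowledge ordering. It models only the order of signal events for a single channel; it does not model timing, the valid-bit completion detection, glitch filtering, or anything specific to the LAGS design, and the function and event names are hypothetical.

```python
def four_phase_transfer(items):
    """Send items over one channel; return (received items, signal trace).

    Each trace entry is (event, req, ack) recorded after the event happens.
    """
    req = ack = 0
    received = []
    trace = []
    for item in items:
        # Phase 1: sender places data on the wires, then raises req.
        data = item
        req = 1
        trace.append(("req+", req, ack))
        # Phase 2: receiver latches the data, then raises ack.
        received.append(data)
        ack = 1
        trace.append(("ack+", req, ack))
        # Phase 3: sender sees ack and lowers req (return to zero).
        req = 0
        trace.append(("req-", req, ack))
        # Phase 4: receiver lowers ack; the channel is idle again.
        ack = 0
        trace.append(("ack-", req, ack))
    return received, trace


if __name__ == "__main__":
    got, trace = four_phase_transfer([0xA, 0xB])
    for event, req, ack in trace:
        print(f"{event:5s} req={req} ack={ack}")
    print(got)
```

Every transfer costs four signal transitions and ends with both wires low, which is why the return-to-zero phases are often cited as the throughput cost of this style compared with two-phase signalling.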

What I am looking for: I am preparing to move into Phase 1 (VHDL/FPGA deployment) to empirically test the Token Ring thermal heuristics and interrupt latency.

Before I start writing hardware description logic, I want brutal feedback on the theoretical bottlenecks. Specifically:

Does my synchronous boundary wrapper adequately satisfy modern STA tools, or will the tools still choke on the internal QDI logic?
For those working with massive NoCs, does my assumption about the Token Ring acting as a strict hardware memory fence hold up under heavy localized data-dependency?

Tear it apart. I want to know where the physical limits break my logic before I try to simulate it. Thanks!


u/LuckySalary — 14 days ago

▲ 8 r/computerarchitecture

Microarchitecture Assessment and Critique

Hey guys,

I’ve recently begun to wrap up development of my most recent CPU core, Anvil-Pro. While this has mostly been an educational endeavor, I’ve ended up with what may be a fairly solid FPGA softcore. Through development, I’ve attempted not just to “implement” but also, within reason, to “perform”. As such, the microarchitecture and the decisions underlying it have been made with the explicit goal of high IPC/LUT.

My rationale was, worst case scenario, I learn strong fundamentals and end up with a resume item. Best case scenario, I may create something that could carve out a legitimate use case (however marginal) within the spectrum of demand.

Since, ultimately, the project is educational, I’ve chosen to make every decision top down from principle rather than from textbook or convention. This comes with the caveat that, ultimately, I will make suboptimal and poor decisions. I’ve implemented or reinvented many standard internal CPU structures, but have combined them in a way that is less commonly done. Importantly, the overall architecture was from what I reasoned to be effective, rather than from a model CPU.

This design philosophy has pros and cons. As to the cons, I am ignoring the previously discovered wisdom of everyone prior to me. I am also putting into practice something untested rather than proven optimal. As to the pros, I am creating something slightly interesting and perhaps less trodden. I also get to learn stronger architectural principles from the additional accountability.

Given all this, I hope to have established that Anvil-Pro is somewhat different from other softcores. While this difference is marginal, it is still worth noting. A question in my mind now remains: “Is this actually any good?”.

This is the reason for this post. If anyone is interested and has sufficient time to waste, could you evaluate my microarchitecture and tell me how it compares to convention inside its own performance class? Is my performance good, poor, decent? Are my decisions justified, is my architecture sane? I’ve yet to have someone actually look at this other than myself. To be completely honest, I really don’t know what I would change if I could do it all over again.

If interested, there’s an architecture.md document detailing the design philosophy. There are also several diagrams I’ve put together to illustrate my thoughts. You may also look through the Verilog, but I would recommend against it. My coding style is rather convoluted in all honesty, especially given that this was built solo rather than as part of a team.

Please let me know thoughts and critiques, of which I’m happy to hear.

https://github.com/JohnH2448/Anvil-Pro

Note: I have not run timing analysis or FPGA resource usage estimates. I certainly plan to, but at this point I have not yet gotten to it. This is a first functional prototype.

u/No_Experience_2282 — 11 days ago