How Taalas "prints" an LLM onto a chip

(anuragk.com)

60 points | by beAroundHere 12 hours ago

10 comments

  • Hello9999901 1 hour ago
    This would be a very interesting future. I can imagine Gemma 5 Mini running locally on hardware, or a hard-coded "AI core" like an ALU or media processor that supports particular encoding mechanisms like H.264, AV1, etc.

    Other than the obvious costs (though Taalas seems to be bringing back the structured ASIC era, so costs shouldn't be that high [1]), I'm curious why this isn't getting much attention from larger companies. Of course, this wouldn't be useful for training models, but as models further improve, I can totally see this inside fully local + ultrafast + ultra-efficient processors.

    [1] https://en.wikipedia.org/wiki/Structured_ASIC_platform

    • roncesvalles 6 minutes ago
      Well, even programmable ASICs like Cerebras and Groq give many-multiple speedups over GPUs, and the market has hardly reacted at all.
  • londons_explore 7 minutes ago
    So why only 30,000 tokens per second?

    If the chip is designed as the article says, they should be able to do 1 token per clock cycle...

    And whilst I'm sure the propagation time is long through all that logic, it should still be able to do tens of millions of tokens per second...
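    A back-of-envelope sketch of this gap, using a purely hypothetical 1 GHz clock (Taalas hasn't published a clock frequency or pipeline depth):

```python
# Back-of-envelope for the throughput question above. The clock
# frequency is an assumption for illustration only.
clock_hz = 1e9            # hypothetical 1 GHz clock
observed_tok_s = 30_000   # throughput figure quoted in the thread

# At 1 token per clock cycle the chip would emit 1e9 tokens/s, so the
# observed rate implies tens of thousands of cycles per token (e.g.
# cycling activations through layers rather than a fully unrolled,
# one-token-per-cycle pipeline):
cycles_per_token = clock_hz / observed_tok_s
print(f"implied cycles per token: {cycles_per_token:,.0f}")
```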

  • kinduff 11 minutes ago
    Very nice read, thank you for sharing; it's so well written.
  • owenpalmer 1 hour ago
    > Kinda like a CD-ROM/Game cartridge, or a printed book, it only holds one model and cannot be rewritten.

    Imagine a slot on your computer where you physically pop out and replace the chip with different models, sort of like a Nintendo DS.

    • roncesvalles 4 minutes ago
      That slot is called USB-C. I can fully imagine inference ASICs coming in powerbank form factor that you'd just plug and play.
    • beAroundHere 1 hour ago
      That's the kind of hardware I'm rooting for, since it'll encourage open-weights models and would be much more private.

      In fact, I was thinking: what if robots of the future had such slots, where they could use different models depending on the task they're given? Like a hardware MoE.

    • 8cvor6j844qw_d6 47 minutes ago
      A cartridge slot for models is a fun idea. Instead of one chip running any model, you get one model or maybe a family of models per chip at (I assume) much better perf/watt. Curious whether the economics work out for consumer use or if this stays in the embedded/edge space.
    • Onavo 8 minutes ago
      Yeah maybe you can call it PCIe.
  • rustyhancock 1 hour ago
    Edit: reading the replies below, it looks like I'm quite wrong here, but I've left the comment up...

    The single transistor multiply is intriguing.

    I'd assume they are layers of FMAs operating in the log domain.

    But everything tells me that would be too noisy and error prone to work.

    On the other hand my mind is completely biased to the digital world.

    If they stay in the log domain and use a resistor network for multiplication, and the transistor is just exponentiating for the addition that seems genuinely ingenious.

    Mulling it over, actually the noise probably doesn't matter. It'll average to 0.

    It's essentially compute and memory baked together.

    I don't know much about the area of research so can't tell if it's innovative but it does seem compelling!
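    A toy numeric sketch of the analog idea being speculated about here (per the EE Times quote cited in the reply, Taalas says its compute is fully digital, so this only illustrates the speculation): multiply two positive values by adding their logs, then exponentiate, with Gaussian noise on the summing node.

```python
import math
import random

def log_domain_mul(x, w, noise_sigma=0.0):
    """Multiply positive x and w by summing their logs (the speculated
    resistor-network step) and exponentiating (the transistor step).
    noise_sigma models noise on the analog summing node."""
    s = math.log(x) + math.log(w) + random.gauss(0.0, noise_sigma)
    return math.exp(s)

# Noiseless case reproduces an ordinary multiply:
print(log_domain_mul(3.0, 4.0))  # ~12.0

# With noise, averaging many samples lands near the true product --
# though exp() of zero-mean noise is biased slightly high, so it does
# not average out exactly:
random.seed(0)
avg = sum(log_domain_mul(3.0, 4.0, noise_sigma=0.05)
          for _ in range(10_000)) / 10_000
print(round(avg, 2))
```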

    • generuso 1 hour ago
      The document referenced in the blog does not say anything about the single transistor multiply.

      However, [1] provides the following description: "Taalas’ density is also helped by an innovation which stores a 4-bit model parameter and does multiplication on a single transistor, Bajic said (he declined to give further details but confirmed that compute is still fully digital)."

      [1] https://www.eetimes.com/taalas-specializes-to-extremes-for-e...

      • rustyhancock 53 minutes ago
        That's much more informative, I think my original comment is quite off the mark then.
  • rustybolt 44 minutes ago
    Note that this doesn't answer the question in the title, it merely asks it.
    • beAroundHere 41 minutes ago
      Yeah, I wrote the blog to wrap my head around the idea of 'how would someone even print weights onto a chip?' and 'how do you even start to think in that direction?'.

      I didn't explore the actual manufacturing process.

  • abrichr 21 minutes ago
    ChatGPT Deep Research dug through Taalas' WIPO patent filings and public reporting to piece together a hypothesis. Next Platform notes at least 14 patents filed [1]. The two most relevant:

    "Large Parameter Set Computation Accelerator Using Memory with Parameter Encoding" [2]

    "Mask Programmable ROM Using Shared Connections" [3]

    The "single transistor multiply" could be multiplication by routing, not arithmetic. Patent [2] describes an accelerator where, if weights are 4-bit (16 possible values), you pre-compute all 16 products (input x each possible value) with a shared multiplier bank, then use a hardwired mesh to route the correct result to each weight's location. The abstract says it directly: multiplier circuits produce a set of outputs, readable cells store addresses associated with parameter values, and a selection circuit picks the right output. The per-weight "readable cell" would then just be an access transistor that passes through the right pre-computed product. If that reading is correct, it's consistent with the CEO telling EE Times compute is "fully digital" [4], and explains why 4-bit matters so much: 16 multipliers to broadcast is tractable, 256 (8-bit) is not.
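
    If that reading of the abstract is right, the precompute-and-route scheme can be sketched in a few lines (names and structure here are my own illustration, not from the patent text):

```python
# Sketch of the "multiplication by routing" hypothesis: with 4-bit
# weights there are only 16 possible weight values, so a shared bank
# computes all 16 products of each input once, and each weight position
# merely selects (routes) the product matching its stored 4-bit code.
N_BITS = 4
WEIGHT_VALUES = list(range(2 ** N_BITS))  # the 16 possible 4-bit codes

def shared_multiplier_bank(x):
    """One multiply per possible weight value, shared by every weight."""
    return [x * w for w in WEIGHT_VALUES]

def routed_dot(x_vec, weight_codes):
    """Dot product where each weight contributes by selection, not
    arithmetic: the per-weight 'readable cell' just picks one of the
    precomputed products (the access-transistor role)."""
    acc = 0
    for x, code in zip(x_vec, weight_codes):
        products = shared_multiplier_bank(x)  # 16 products, broadcast
        acc += products[code]                 # hardwired selection
    return acc

# Matches an ordinary dot product:
xs = [3, 1, 4, 1]
ws = [5, 9, 2, 6]
print(routed_dot(xs, ws))  # 15 + 9 + 8 + 6 = 38
```

    This also makes the bitwidth tradeoff concrete: the bank scales with 2^bits, so 16 shared products (4-bit) are cheap to broadcast while 256 (8-bit) are not.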

    The same patent reportedly describes the connectivity mesh as configurable via top metal masks, referred to as "saving the model in the mask ROM of the system." If so, the base die is identical across models, with only top metal layers changing to encode weights-as-connectivity and dataflow schedule.

    Patent [3] covers high-density multibit mask ROM using shared drain and gate connections with mask-programmable vias, possibly how they hit the density for 8B parameters on one 815 mm² die.

    If roughly right, some testable predictions: performance very sensitive to quantization bitwidth; near-zero external memory bandwidth dependence; fine-tuning limited to what fits in the SRAM sidecar.

    Caveat: the specific implementation details beyond the abstracts are based on Deep Research's analysis of the full patent texts, not my own reading, so could be off. But the abstracts and public descriptions line up well.

    [1] https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...

    [2] https://patents.google.com/patent/WO2025147771A1/en

    [3] https://patents.google.com/patent/WO2025217724A1/en

    [4] https://www.eetimes.com/taalas-specializes-to-extremes-for-e...

  • sargun 30 minutes ago
    Isn't the highly connected nature of the model's layers problematic to build into the physical layer?
  • moralestapia 0 minutes ago
    >HOW NVIDIA GPUs process stuff? (Inefficiency 101)

    Wow. Massively ignorant take. A modern GPU is an amazing feat of engineering, particularly at making computation efficient (low power / high throughput).

    The post then proceeds to explain inference in a way it's not actually implemented, and draws conclusions from there...

  • villgax 10 minutes ago
    This read itself is slop lol, it literally dances around the term 'printing' as if it's some inkjet printer.