MoonObserver's comments

MoonObserver · 2025-10-07T15:38:17 1759851497

Intel TDX unfortunately suffers from the exact same vulnerability as Scalable SGX. The underlying root cause is the lack of randomized encryption; using a static-adversary encryption scheme (XTS) rather than a dynamic-adversary one. The result is that plaintext-ciphertext mappings are unchanged at a fixed memory address. While the choice of scheme might initially seem puzzling, it is due to a randomized encryption scheme requiring counters for each memory block, which has a prohibitive on-chip memory cost when scaling to hundreds of GBs of memory.

MoonObserver · 2025-08-07T09:51:00 1754560260

M2 Max processor. I saw 60+ tok/s on short conversations, but it degraded to 30 tok/s as the conversation got longer. Do you know what actually accounts for this slowdown? I don’t believe it was thermal throttling.

summarity · 2025-08-07T09:54:25 1754560465

Physics: You always have the same memory bandwidth. The longer the context, the more bits will need to pass through the same pipe. Context is cumulative.

VierScar · 2025-08-07T10:02:30 1754560950

No I don't think it's the bits. I would say it's the computation. Inference requires performing a lot of matmul, and with more tokens the number of computation operations increases exponentially - O(n^2) at least. So increasing your context/conversation will quickly degrade performance

I seriously doubt it's the throughput of memory during inference that's the bottleneck here.

MereInterest · 2025-08-07T11:18:12 1754565492

Nitpick: O(n^2) is quadratic, not exponential. For it to “increase exponentially”, n would need to be in the exponent, such as O(2^n).

esafak · 2025-08-07T13:35:26 1754573726

To contrast with exponential, the term is power law.

zozbot234 · 2025-08-07T10:09:48 1754561388

Typically, the token generation phase is memory-bound for LLM inference in general, and this becomes especially clear as context length increases (since the model's parameters are a fixed quantity.) If it was pure compute bound there would be huge gains to be had by shifting some of the load to the NPU (ANE) but AIUI it's just not so.

summarity · 2025-08-07T11:09:05 1754564945

It literally is. LLM inference is almost entirely memory bound. In fact for naive inference (no batching), you can calculate the token throughput just based on the model size, context size and memory bandwidth.

zozbot234 · 2025-08-07T12:02:11 1754568131

Prompt pre-processing (before the first token is output) is raw compute-bound. That's why it would be nice if we could direct llama.cpp/ollama to run that phase only on iGPU/NPU (for systems without a separate dGPU, obviously) and shift the whole thing over to CPU inference for the latter token-generation phase.

(A memory-bound workload like token gen wouldn't usually run into the CPU's thermal or power limits, so there would be little or no gain from offloading work to the iGPU/NPU in that phase.)

torginus · 2025-08-07T13:09:38 1754572178

Inference takes quadratic amount of time wrt context size

MoonObserver · 2025-08-07T09:49:29 1754560169

This has been my thinking as well. Quite anti-human that a technology that improves automation and productivity works against common interest rather than for it.

MoonObserver · on Aug 6, 2024

> Never use random IVs with GCM; this breaks the authentication [2] [3]. Given the pitfalls of AES-GCM with respect to random nonces, you might prefer switching to XSalsa20+Poly1305. The advantage of XSalsa is it has an extended nonce length, so you can use random nonces without fear.

Those papers are a bit over my head. Could you please explain what's wrong with using random IVs here? What should we do instead (assuming we can only use GCM, and not switch to chacha)

conradludgate · on Aug 6, 2024

There's two issues.

Background: the key+IV define a keystream which is xor-ed against the message. The same key+IV generate the same keystream. Thus you can XOR two cipher texts and reveal information from the two plaintext.

AES-GCM is authenticated encryption. To combat known-ciphertext-attacks, you want to have authenticated cipher texts. AES-GCM specifically is vulnerable to an attack with a reused IV to recover the authentication key. Allowing you to forge authentication tags and employ a KCA.

The solution, if you're stuck with aes, is to switch to XAES-GCM or better AES-GCM-SIV. Alternatively you must use a counter or checkes system to not reuse IV. Since this is in the context of 1fps, you could use unix timestamp + random bytes to reduce the chance of collisions.

hatsunearu · on Aug 6, 2024

Is the statement just that if you use a random value for a nonce rather than some guaranteed never-used-once value, it's possible to get a collision faster than the "natural" block collision complexity (half block size or something like that)?

conradludgate · on Aug 7, 2024

It's a birthday attack principle. With only 96bits after roughly a billion messages with the key and random IVs, you start reaching realistic probabilities that you will reuse an IV

irundebian · on Aug 7, 2024

And how you will get a billion messages with 1 frame per second?

jszymborski · on Aug 6, 2024

Not an expert, but this is my understanding.

1. It is necessary for nonces to never be re-used for a given key lest you open yourself to a certain class of attacks that can decode all messages using that key. This is specific to AES-GCM due to how it internally reuses nonces.

2. AES-GCM uses very small nonces, making the probability of randomly using the same nonce unacceptably as the number of messages encoded with a given key increases (as it would with each frame sent on 1fps).

You can avoid all this by using a different primitive with a longer nonce such as XSalsa (a version of Salsa with a 192-bit nonce)

Vegenoid · on Aug 7, 2024

In AES-GCM (simplified explanation), an encrypted message takes 3 inputs: plaintext, a symmetric encryption key (generally unique per-session, as in this program), and a 12-byte nonce (a.k.a. IV).

If an attacker intercepts 2 messages that were encrypted with the same key and the same nonce, they can reveal the authentication key used for the generation of authentication tags (auth tags), and they can then forge auth tags for any message. These auth tags are how a recipient verifies that the message was created by someone who knows the symmetric key that was used to encrypt the plaintext, and that it was not altered in transit.

More simply, it allows an attacker to alter the ciphertext of an encrypted message, and then forge an authentication tag so that the modification of the ciphertext could not be detected. It does not reveal the symmetric key that allows decryption of the ciphertext, or encryption of arbitrary plaintext.

If a random nonce is generated, there is a chance that it is the same as a random nonce that was generated earlier in the session. Since the nonce is 12 bytes, this chance is very small for any 2 random nonces (1 in 2^96), but the chance of a collision increases rapidly with the number of encrypted messages sent in a session (see the birthday problem). It still requires a large number of messages to be sent before the chance of a collision becomes significant: "after 2^28 encryptions the probability of a nonce collision will be around 0,2 % ... After 2^33 [encryptions] the probability will be more than 80 %"[0]

If this program is sending 1 message per second (1 FPS), it would take 8 years for 2^28 messages to be sent. I haven't looked at the code, it may well be sending many more messages than that.

The alternative to a random nonce is a "counter" nonce, which starts at 1 and increments with each message. The potential pitfall of counter nonces is that they can be harder to implement, as they require tracking and updating state. If the program ever fails to track or update this state correctly, nonce reuse will occur. A different counter must be used for each symmetric key (which should be randomly generated for each session).

EXTRA CREDIT: There is also information revealed about the plaintext of the 2 messages that used the same key and nonce - specifically, the XOR of the 2 plaintexts. While this doesn't directly reveal the plaintexts, if some information about one of the plaintexts is known, that can be used to reveal information about the other plaintext.

I learned most of this information from David Wong's Real-World Cryptography.

[0]: https://eprint.iacr.org/2016/475.pdf, section 3.3

MoonObserver · on July 18, 2024

Wow, that situation with Android phones looks DIRE.

The charts for Android look intriguing; are they claiming they can break into BFU (before first unlock) phones that have secure startup (i.e. pin at boot) enabled?

MoonObserver · on July 1, 2024

This is incredible work! Is a desktop version planned?

MoonObserver · on April 12, 2024

ROCm is a whole ecosystem. HIP falls under ROCm.