I'm sure there are countless tricks, but one that can be implemented at home, and I...

tough · 2025-08-08T20:03:12 1754683392

gpt-oss-120b can be used with gpt-oss-20b as the draft model for speculative decoding in LM Studio

I'm not sure it improved the speed much

roadside_picnic · 2025-08-08T20:54:20 1754686460

To measure the performance gains on a local machine (or even a standard cloud GPU setup), you need to compare the number of calls made to each model, since you can't run this in parallel with the same efficiency you could in a high-end data center.

In my experience, calls to the target model dropped to about a third of what they would have been without a draft model.

You'll still get some gains on a local model, but they won't be near what they could be theoretically if everything is properly tuned for performance.

It also depends on the type of task. I was working with pretty structured data with lots of easy-to-predict tokens.
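
For a rough sense of how acceptance rate translates into target-model calls, here is a minimal sketch. It assumes each drafted token is accepted independently with probability p and that k tokens are drafted per round; the function names and the values of p and k are illustrative assumptions, not measurements from this thread.

```python
import math


def expected_tokens_per_target_call(p: float, k: int) -> float:
    """Expected tokens emitted per target-model call.

    Assumes each of the k drafted tokens is accepted independently with
    probability p, and the target always contributes one token itself
    (the correction or the bonus token). This is a simplification.
    """
    if p >= 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    # Geometric sum 1 + p + p^2 + ... + p^k
    return (1.0 - p ** (k + 1)) / (1.0 - p)


def target_calls(n_tokens: int, p: float, k: int) -> int:
    """Approximate number of target-model calls needed for n_tokens."""
    return math.ceil(n_tokens / expected_tokens_per_target_call(p, k))


if __name__ == "__main__":
    baseline = target_calls(1000, 0.0, 4)    # no useful drafts: one call per token
    with_draft = target_calls(1000, 0.8, 4)  # hypothetical 80% acceptance
    print(baseline, with_draft, with_draft / baseline)
```

With these assumed values the target is called roughly a third as often as without drafting. Fewer target calls only turn into wall-clock gains if the draft calls are cheap relative to the target calls.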

vrm · 2025-08-08T20:37:20 1754685440

A 6:1 parameter ratio is too small for specdec to have that much of an effect. You'd really want to see 10:1 or even more for this to start to matter.

lhl · 2025-08-09T08:13:32 1754727212

You're right on ratios, but actually the ratio is much worse than 6:1 since they are MoEs. The 20B has 3.6B active, and the 120B has only 5.1B active, only about 40% more!
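
The arithmetic behind this, plus a back-of-the-envelope speedup estimate, can be sketched as follows. The parameter counts are the nominal ones quoted in this thread. The speedup formula is the standard expected-speedup expression for speculative decoding under independent acceptance; treating per-token cost as proportional to active parameters, and the acceptance rate of 0.8 with 4 drafted tokens, are my assumptions for illustration only.

```python
# Nominal parameter counts in billions, as quoted in the thread.
TOTAL_20B, TOTAL_120B = 20.0, 120.0
ACTIVE_20B, ACTIVE_120B = 3.6, 5.1

total_ratio = TOTAL_120B / TOTAL_20B      # 6.0
active_ratio = ACTIVE_120B / ACTIVE_20B   # about 1.42, i.e. roughly 40% more


def estimated_speedup(p: float, k: int, cost_ratio: float) -> float:
    """Estimated wall-clock speedup from speculative decoding.

    p          : per-token acceptance probability (assumed independent)
    k          : tokens drafted per round
    cost_ratio : cost of one draft call / cost of one target call
                 (assumed here to track the active-parameter ratio)
    """
    if p >= 1.0:
        tokens_per_round = float(k + 1)
    else:
        tokens_per_round = (1.0 - p ** (k + 1)) / (1.0 - p)
    # Each round costs k draft calls plus one target call.
    return tokens_per_round / (k * cost_ratio + 1.0)


if __name__ == "__main__":
    print(f"total ratio  {total_ratio:.2f}:1")
    print(f"active ratio {active_ratio:.2f}:1")
    # Hypothetical acceptance rate and draft length.
    print("speedup, active-param cost:", estimated_speedup(0.8, 4, ACTIVE_20B / ACTIVE_120B))
    print("speedup, 10:1 cost ratio:  ", estimated_speedup(0.8, 4, 0.1))
```

Under these assumptions the estimate for the 20B/120B pairing comes out below 1, a slowdown, while a draft model ten times cheaper than the target gives a clear gain. Real numbers depend on memory bandwidth, batching, and how well the draft actually predicts the target.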

qcnguy · 2025-08-09T13:18:28 1754745508

It depends a lot on the type of conversation. A lot of ChatGPT load appears to be therapy talk that even small models can correctly predict.