FWIW, while Apple silicon can _run_ huge models thanks to its unified memory (not to be confused with shared memory), inference is considerably slower than on dedicated GPUs, so it's a tradeoff. The significance of this PR is that inference can, at least in certain applications, be sped up using parallel decoding.
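
To illustrate the idea: in speculative decoding, one common form of parallel decoding, a small draft model proposes several tokens and the big model then verifies all of them in a single forward pass. Because decoding on this kind of hardware is mostly memory-bandwidth bound, reading the weights once per several tokens instead of once per token is where the speedup comes from. A rough greedy sketch in Python; draft_model and target_model are hypothetical next-token callables, not the PR's actual API:

    # Hypothetical sketch of greedy speculative decoding.
    # draft_model(seq)  -> predicted next token after seq
    # target_model(seq) -> list of predictions, one per position in seq
    def speculative_decode(target_model, draft_model, prompt,
                           n_draft=4, max_tokens=128):
        tokens = list(prompt)
        while len(tokens) < len(prompt) + max_tokens:
            # 1. Cheap draft model proposes n_draft tokens, one at a time.
            draft, ctx = [], list(tokens)
            for _ in range(n_draft):
                t = draft_model(ctx)
                draft.append(t)
                ctx.append(t)
            # 2. Big model scores context + all drafted tokens in ONE
            #    forward pass, amortizing the weight read over several
            #    positions instead of one.
            preds = target_model(tokens + draft)
            # 3. Accept drafted tokens while they match what the big
            #    model would have generated anyway.
            n_accepted = 0
            for i, t in enumerate(draft):
                if preds[len(tokens) - 1 + i] == t:
                    n_accepted += 1
                else:
                    break
            # The big model's prediction at the first mismatch (or after
            # all accepted drafts) is a free bonus token, so progress is
            # guaranteed even when every draft token is rejected.
            bonus = preds[len(tokens) - 1 + n_accepted]
            tokens += draft[:n_accepted]
            tokens.append(bonus)
        return tokens[:len(prompt) + max_tokens]

The output is identical to plain greedy decoding with the big model; only the number of expensive forward passes drops, by a factor that depends on how often the draft model guesses right.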

