FWIW, while Apple silicon can _run_ huge models thanks to its unified memory (not to be confused with shared memory), inference is considerably slower than on dedicated GPUs, so it's a tradeoff. The significance of this PR is that inference can, at least in certain applications, be sped up using parallel decoding.
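
To illustrate the idea: in speculative decoding, one common form of parallel decoding, a small draft model proposes several tokens and the big model then verifies all of them in a single forward pass. Because decoding on this kind of hardware is mostly memory-bandwidth bound, reading the weights once per several tokens instead of once per token is where the speedup comes from. A rough greedy sketch in Python; draft_model and target_model are hypothetical next-token callables, not the PR's actual API:

    # Hypothetical sketch of greedy speculative decoding.
    # draft_model(seq)  -> predicted next token after seq
    # target_model(seq) -> list of predictions, one per position in seq
    def speculative_decode(target_model, draft_model, prompt,
                           n_draft=4, max_tokens=128):
        tokens = list(prompt)
        while len(tokens) < len(prompt) + max_tokens:
            # 1. Cheap draft model proposes n_draft tokens, one at a time.
            draft, ctx = [], list(tokens)
            for _ in range(n_draft):
                t = draft_model(ctx)
                draft.append(t)
                ctx.append(t)
            # 2. Big model scores context + all drafted tokens in ONE
            #    forward pass, amortizing the weight read over several
            #    positions instead of one.
            preds = target_model(tokens + draft)
            # 3. Accept drafted tokens while they match what the big
            #    model would have generated anyway.
            n_accepted = 0
            for i, t in enumerate(draft):
                if preds[len(tokens) - 1 + i] == t:
                    n_accepted += 1
                else:
                    break
            # The big model's prediction at the first mismatch (or after
            # all accepted drafts) is a free bonus token, so progress is
            # guaranteed even when every draft token is rejected.
            bonus = preds[len(tokens) - 1 + n_accepted]
            tokens += draft[:n_accepted]
            tokens.append(bonus)
        return tokens[:len(prompt) + max_tokens]

The output is identical to plain greedy decoding with the big model; only the number of expensive forward passes drops, by a factor that depends on how often the draft model guesses right.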

