Why do ML models run so slowly on my laptop?
I tried downloading and running a small Llama model on my MacBook Air. It runs very slowly, and if I try a bigger model, it errors out completely. Why is this happening?
Member • 4mos
Hey Arjun - is it the M1 or the M2? Also, how much memory does your MB Air have?
Member • 4mos
Hey Kaustubh, I have an M1 MacBook Air. It has 8 GB of RAM.
Member • 4mos
Hey Arjun, the base-model M1 MacBook Air has a 7-core GPU and 8 GB of unified memory. Unified memory means the same memory pool is shared between the CPU and the GPU.

CPU vs GPU
The CPU is flexible and versatile, but it has limited throughput for this kind of work: if it has to run the same calculation over many values, it largely does them one by one. A GPU dedicates a much larger share of its transistors to arithmetic units and only a few to control logic, so it has far more throughput for math. If a GPU needs to apply the same operation to many values, it can do them all in parallel instead of sequentially. LLM inference is essentially a large number of matrix operations run over a range of inputs, so it is uniquely suited to GPUs.
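You can see this on your own machine: PyTorch can use the M1's GPU through Apple's Metal (MPS) backend. A rough illustration rather than a careful benchmark (the matrix size and repeat count here are arbitrary choices):

```python
import time
import torch

n = 2048
a = torch.rand(n, n)
b = torch.rand(n, n)

# Same matrix multiplication on the CPU...
t0 = time.perf_counter()
for _ in range(10):
    a @ b
print(f"CPU:       {time.perf_counter() - t0:.3f}s")

# ...and on the M1 GPU via the Metal (MPS) backend, if available.
if torch.backends.mps.is_available():
    a_mps, b_mps = a.to("mps"), b.to("mps")
    a_mps @ b_mps                # warm-up run (kernel compilation)
    torch.mps.synchronize()
    t0 = time.perf_counter()
    for _ in range(10):
        a_mps @ b_mps
    torch.mps.synchronize()      # wait for the GPU to finish before stopping the timer
    print(f"GPU (MPS): {time.perf_counter() - t0:.3f}s")
```

On an M1 the MPS run is typically several times faster for dense matrix math like this, which is exactly the kind of work LLM inference consists of.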
Memory Requirements for LLM Inference
These are the main components for calculating the memory required for LLM inference:
- Model parameters
- KV cache (for storing reusable attention calculations)
- Activations (intermediate values during processing)
Total Memory = Parameters + KV Cache + Activations

Example Calculations

Large Model (meta-llama/Llama-3.3-70B-Instruct)
- 70B parameters × 2 bytes (BF16 precision) = 140 GB
- With a 1.2× multiplier for activations/cache = 168 GB
- Cannot run on a MacBook Air (far exceeds the available RAM)
Smaller Model (meta-llama/Llama-3.2-1B)
- 1B parameters × 2 bytes × 1.2 = 2.4 GB
- Can run on a MacBook Air, but consumes ~30% of the available 8 GB of RAM

This memory budget is why larger models are impractical on a MacBook Air, while smaller models run, just not quickly. So how can you run models faster? You have a few options.
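Before going through those, here is a quick back-of-the-envelope script for the memory estimates above. It's a minimal sketch: the helper name is my own, the 1.2× overhead factor is just the rule of thumb used above, and real usage also depends on context length and the runtime you use.

```python
def estimate_inference_memory_gb(params_billions: float,
                                 bytes_per_param: float = 2.0,  # BF16 = 2 bytes per parameter
                                 overhead: float = 1.2):        # ~20% extra for KV cache + activations
    """Rough rule of thumb: parameters x bytes-per-parameter x overhead."""
    # Billions of parameters x bytes/parameter gives GB directly (1e9 params x 1 byte = 1 GB).
    return params_billions * bytes_per_param * overhead

print(estimate_inference_memory_gb(70))                      # ~168 GB (Llama-3.3-70B, BF16)
print(estimate_inference_memory_gb(1))                       # ~2.4 GB (Llama-3.2-1B, BF16)
print(estimate_inference_memory_gb(8, bytes_per_param=0.5))  # ~4.8 GB (an 8B model at 4-bit)
```

The last line hints at why quantization (covered below) matters so much: dropping from 2 bytes to 0.5 bytes per parameter cuts the weight memory by 4×.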
Hardware Solutions
- Use a dedicated GPU with more VRAM
- Deploy on cloud GPUs (AWS, GCP, Azure)
- Leverage specialized AI hardware (TPUs, NPUs)
Model Optimization
- Quantization (reducing precision: FP16, INT8, or even 4-bit; see the sketch below)
- Pruning (removing redundant weights)
- Knowledge distillation (training a smaller model to mimic a larger one)
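On a MacBook Air, the most practical route is usually a 4-bit quantized GGUF model run with llama.cpp, which uses the M1 GPU through Metal. A minimal sketch with the llama-cpp-python bindings; the model path is just a placeholder for whatever GGUF file you actually download (for example, a Q4_K_M quant of Llama-3.2-1B):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Hypothetical local path to a 4-bit (Q4_K_M) GGUF quant of a small Llama model.
llm = Llama(
    model_path="./Llama-3.2-1B-Instruct-Q4_K_M.gguf",
    n_ctx=2048,       # context window; larger values need more KV-cache memory
    n_gpu_layers=-1,  # offload all layers to the M1 GPU via Metal
)

out = llm("Explain unified memory in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

At 4-bit precision a 1B-parameter model needs well under 1 GB for its weights, which leaves plenty of headroom in 8 GB of unified memory.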
Inference Optimization
- Stream outputs instead of waiting for the full generation, so you see tokens as soon as they are produced (sketch below)
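Streaming doesn't make generation faster overall, but it makes the model feel much more responsive because the first tokens appear immediately. Reusing the (hypothetical) llm object from the quantization sketch above, it looks roughly like this:

```python
# Stream tokens as they are generated instead of waiting for the whole completion.
for chunk in llm("Write a haiku about unified memory.",
                 max_tokens=64, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
print()
```

Hope it helps!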