• Why do ML models run so slowly on my laptop?

    Arjun Panda

    Member

    Updated: Jan 3, 2025
    Views: 459

    I tried downloading and running a small Llama model on my MacBook Air. It runs very slowly, and if I try a bigger model, it errors out completely. Why is this happening?

Replies
  • Kaustubh Katdare

    Member · 4 months ago

    Hey Arjun - is it the M1 or M2? Also, how much memory does your MacBook Air have?

  • Arjun Panda

    Member · 4 months ago

    Hey Kaustubh, I have an M1 MacBook Air. It has 8 GB of RAM.

  • Ananth

    Member · 4 months ago

    Hey Arjun, the M1 MacBook Air base model has a 7-core GPU with 8 GB of unified memory. Unified memory means the same memory pool is shared between the CPU and the GPU.

    CPU vs GPU
    A CPU is flexible and versatile, but it doesn't have much parallel throughput: if it has to run the same calculation over many values, it largely works through them one by one. A GPU dedicates a much larger portion of its transistors to mathematical operations and only a few to control, so it has far higher throughput for this kind of work. If a GPU needs to apply the same operation to many values, it can do them all in parallel instead of sequentially. LLM inference is basically a long series of matrix operations over a range of inputs, so it is uniquely suited to GPUs (see the rough timing sketch below).
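    If you're curious, here's a rough way to see this difference on your own machine, assuming you have PyTorch installed (the "mps" device is PyTorch's backend for the Apple Silicon GPU). The exact numbers will vary; this is just an illustration:

    ```python
    # Rough illustration of CPU vs GPU throughput for matrix math.
    import time
    import torch

    def time_matmul(device: str, n: int = 2048, repeats: int = 10) -> float:
        """Average seconds per (n x n) matrix multiply on the given device."""
        a = torch.randn(n, n, device=device)
        b = torch.randn(n, n, device=device)
        _ = a @ b                      # warm-up, so one-time setup isn't timed
        if device == "mps":
            torch.mps.synchronize()    # wait for queued GPU work before timing
        start = time.perf_counter()
        for _ in range(repeats):
            _ = a @ b
        if device == "mps":
            torch.mps.synchronize()
        return (time.perf_counter() - start) / repeats

    print(f"CPU: {time_matmul('cpu'):.4f} s per matmul")
    if torch.backends.mps.is_available():
        print(f"GPU (MPS): {time_matmul('mps'):.4f} s per matmul")
    ```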

    Memory Requirements for LLM Inference
    These are the main components when calculating the memory required for LLM inference:
    • Model parameters
    • KV cache (for storing reusable calculations)
    • Activations (intermediate values during processing)

    Total Memory = Parameters + KV Cache + Activations

    Example Calculations

    Large model (meta-llama/Llama-3.3-70B-Instruct):

    • 70B parameters × 2 bytes (BF16 precision) = 140 GB
    • With a 1.2× multiplier for activations/cache = 168 GB
    • Cannot run on a MacBook Air (far exceeds the available RAM)
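    Here is the same arithmetic as a tiny Python sketch; the 1.2× overhead for the KV cache and activations is a rough rule of thumb, not an exact figure:

    ```python
    # Back-of-the-envelope memory estimate for LLM inference.
    def inference_memory_gb(params_billion: float,
                            bytes_per_param: int = 2,   # BF16/FP16
                            overhead: float = 1.2) -> float:
        """Estimated memory in GB: parameters + rough overhead for KV cache/activations."""
        return params_billion * 1e9 * bytes_per_param * overhead / 1e9

    print(inference_memory_gb(70))  # ~168 GB -> cannot fit in 8 GB of unified memory
    print(inference_memory_gb(1))   # ~2.4 GB -> fits, but a large share of 8 GB
    ```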

    Smaller model (meta-llama/Llama-3.2-1B):
    • 1B parameters × 2 bytes × 1.2 = 2.4 GB
    • Can run on a MacBook Air, but consumes ~30% of the available RAM

    This memory usage explains why larger models are impractical on a MacBook Air, while smaller models run at acceptable speeds, albeit with reduced capability.

    So how can you run models faster? You have a few options.

    Hardware Solutions
    • Use dedicated GPUs with higher VRAM
    • Deploy on cloud GPUs (AWS, GCP, Azure)
    • Leverage specialized AI hardware (TPUs, NPUs)

    Model Optimization
    • Quantization (reducing precision: FP16, INT8) - see the sketch after this list
    • Pruning (removing redundant weights)
    • Knowledge distillation (training smaller models)
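    For example, one common route on Apple Silicon (not the only one) is running a 4-bit quantized GGUF build of the model through llama-cpp-python, which can offload work to the GPU via Metal. This is only a sketch - the file name below is a placeholder for whatever GGUF build you actually download:

    ```python
    # Sketch: running a 4-bit quantized GGUF model with llama-cpp-python on an M1.
    # Assumes `pip install llama-cpp-python` and a GGUF file downloaded locally.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./llama-3.2-1b-instruct-q4_k_m.gguf",  # hypothetical local file
        n_gpu_layers=-1,   # offload all layers to the Metal GPU
        n_ctx=2048,        # context window; larger values use more memory
    )

    out = llm("Explain unified memory in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])
    ```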

    Inference Optimization
    • Stream outputs instead of waiting for the full generation (a short sketch follows below)

    Hope it helps!
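    Here is roughly what streaming looks like with the same llama-cpp-python setup as in the earlier sketch; again, the model file is a placeholder:

    ```python
    # Sketch: streaming tokens as they are generated with llama-cpp-python.
    from llama_cpp import Llama

    llm = Llama(model_path="./llama-3.2-1b-instruct-q4_k_m.gguf", n_gpu_layers=-1)

    for chunk in llm("Why does an LLM need so much memory?", max_tokens=128, stream=True):
        # Each chunk carries the newly generated piece of text; print it immediately.
        print(chunk["choices"][0]["text"], end="", flush=True)
    print()
    ```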
