Why do ML models run so slowly on my laptop?
I tried downloading and running a small Llama model on my MacBook Air. It runs very slowly, and if I try a bigger model, it errors out completely. Why is this happening?
Member • 4mos
Hey Arjun - is it the M1 or the M2? Also, how much memory does your MB Air have?
Member • 4mos
Hey Kaustubh, I have an M1 MacBook Air. It has 8 GB of RAM.
Member • 4mos
Hey Arjun, the base-model M1 MacBook Air has a 7-core GPU and 8 GB of unified memory. Unified memory means the same memory pool is shared between the CPU and the GPU.

CPU vs GPU
The CPU is flexible and versatile, but it has limited throughput for this kind of work: if it has to run the same calculation over many values, it largely does them one by one. A GPU dedicates a much larger share of its transistors to arithmetic units and only a few to control logic, so it has far more throughput for math. If a GPU needs to apply the same operation to many values, it can do them all in parallel instead of sequentially. LLM inference is essentially a large number of matrix operations run over a range of inputs, so it is uniquely suited to GPUs.
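You can see this on your own machine: PyTorch can use the M1's GPU through Apple's Metal (MPS) backend. A rough illustration rather than a careful benchmark (the matrix size and repeat count here are arbitrary choices):

```python
import time
import torch

n = 2048
a = torch.rand(n, n)
b = torch.rand(n, n)

# Same matrix multiplication on the CPU...
t0 = time.perf_counter()
for _ in range(10):
    a @ b
print(f"CPU:       {time.perf_counter() - t0:.3f}s")

# ...and on the M1 GPU via the Metal (MPS) backend, if available.
if torch.backends.mps.is_available():
    a_mps, b_mps = a.to("mps"), b.to("mps")
    a_mps @ b_mps                # warm-up run (kernel compilation)
    torch.mps.synchronize()
    t0 = time.perf_counter()
    for _ in range(10):
        a_mps @ b_mps
    torch.mps.synchronize()      # wait for the GPU to finish before stopping the timer
    print(f"GPU (MPS): {time.perf_counter() - t0:.3f}s")
```

On an M1 the MPS run is typically several times faster for dense matrix math like this, which is exactly the kind of work LLM inference consists of.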
Memory Requirements for LLM Inference
These are the main components for calculating the memory required for LLM inference:
- Model parameters
- KV cache (for storing reusable attention calculations)
- Activations (intermediate values during processing)
Total Memory = Parameters + KV Cache + Activations

Example Calculations

Large Model (meta-llama/Llama-3.3-70B-Instruct)
- 70B parameters × 2 bytes (BF16 precision) = 140 GB
- With a 1.2× multiplier for activations/cache = 168 GB
- Cannot run on a MacBook Air (far exceeds the available RAM)
Smaller Model (meta-llama/Llama-3.2-1B)
- 1B parameters × 2 bytes × 1.2 = 2.4 GB
- Can run on a MacBook Air, but consumes ~30% of the available 8 GB of RAM

This memory budget is why larger models are impractical on a MacBook Air, while smaller models run, just not quickly. So how can you run models faster? You have a few options.
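Before going through those, here is a quick back-of-the-envelope script for the memory estimates above. It's a minimal sketch: the helper name is my own, the 1.2× overhead factor is just the rule of thumb used above, and real usage also depends on context length and the runtime you use.

```python
def estimate_inference_memory_gb(params_billions: float,
                                 bytes_per_param: float = 2.0,  # BF16 = 2 bytes per parameter
                                 overhead: float = 1.2):        # ~20% extra for KV cache + activations
    """Rough rule of thumb: parameters x bytes-per-parameter x overhead."""
    # Billions of parameters x bytes/parameter gives GB directly (1e9 params x 1 byte = 1 GB).
    return params_billions * bytes_per_param * overhead

print(estimate_inference_memory_gb(70))                      # ~168 GB (Llama-3.3-70B, BF16)
print(estimate_inference_memory_gb(1))                       # ~2.4 GB (Llama-3.2-1B, BF16)
print(estimate_inference_memory_gb(8, bytes_per_param=0.5))  # ~4.8 GB (an 8B model at 4-bit)
```

The last line hints at why quantization (covered below) matters so much: dropping from 2 bytes to 0.5 bytes per parameter cuts the weight memory by 4×.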
Hardware Solutions
- Use a dedicated GPU with more VRAM
- Deploy on cloud GPUs (AWS, GCP, Azure)
- Leverage specialized AI hardware (TPUs, NPUs)
Model Optimization
- Quantization (reducing precision: FP16, INT8, or even 4-bit; see the sketch below)
- Pruning (removing redundant weights)
- Knowledge distillation (training a smaller model to mimic a larger one)
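On a MacBook Air, the most practical route is usually a 4-bit quantized GGUF model run with llama.cpp, which uses the M1 GPU through Metal. A minimal sketch with the llama-cpp-python bindings; the model path is just a placeholder for whatever GGUF file you actually download (for example, a Q4_K_M quant of Llama-3.2-1B):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Hypothetical local path to a 4-bit (Q4_K_M) GGUF quant of a small Llama model.
llm = Llama(
    model_path="./Llama-3.2-1B-Instruct-Q4_K_M.gguf",
    n_ctx=2048,       # context window; larger values need more KV-cache memory
    n_gpu_layers=-1,  # offload all layers to the M1 GPU via Metal
)

out = llm("Explain unified memory in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

At 4-bit precision a 1B-parameter model needs well under 1 GB for its weights, which leaves plenty of headroom in 8 GB of unified memory.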
Inference Optimization
- Stream outputs instead of waiting for the full generation, so you see tokens as soon as they are produced (sketch below)
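Streaming doesn't make generation faster overall, but it makes the model feel much more responsive because the first tokens appear immediately. Reusing the (hypothetical) llm object from the quantization sketch above, it looks roughly like this:

```python
# Stream tokens as they are generated instead of waiting for the whole completion.
for chunk in llm("Write a haiku about unified memory.",
                 max_tokens=64, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
print()
```

Hope it helps!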