Posted on X by Xenova No, this video is not sped up... Phi-3-mini running locally in your browser at 70 tokens per second on WebGPU!
Powered by Transformers.js and ONNX Runtime Web! Try it out yourself!
(h/t @evilpingwin for the video!)
Research Notes: Phi-3-mini Running Locally in Browser at 70 Tokens per Second on WebGPU
Overview
The post highlights the performance of Phi-3-mini, the smallest model in Microsoft's Phi-3 family, running locally in a browser at 70 tokens per second using WebGPU. The setup uses Transformers.js and ONNX Runtime Web for inference, enabling fast, private AI processing directly in the browser without sending data to a server. The post also references a demonstration video, credited to @evilpingwin.
Key insights from the search results suggest that Phi-3-mini is well suited to general-purpose tasks and, when properly optimized, can approach the performance of larger models. The successor model, Phi-3.5-mini, reportedly runs even faster in a comparable setup, reaching roughly 90 tokens per second on WebGPU.
Technical Analysis
Phi-3-mini's ability to run locally in the browser at 70 tokens per second is a testament to the efficiency of smaller models when optimized for inference. The use of WebGPU and ONNX Runtime Web allows for hardware-accelerated computation, enabling real-time performance without relying on external servers.
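In practice, a page feature-detects WebGPU before choosing an execution backend, since browser support is still uneven. Here is a minimal sketch using the standard WebGPU API; the fallback logic is illustrative, not taken from the post:

```ts
// Pick an execution backend: prefer WebGPU, fall back to WASM on the CPU.
// navigator.gpu is the standard WebGPU entry point; it is undefined in
// browsers without WebGPU support. The cast avoids needing @webgpu/types.
async function pickDevice(): Promise<"webgpu" | "wasm"> {
  const gpu = (navigator as any).gpu;
  if (!gpu) return "wasm";
  const adapter = await gpu.requestAdapter(); // may be null, e.g. headless
  return adapter ? "webgpu" : "wasm";
}
```

Libraries such as Transformers.js accept the resulting backend name as their `device` option, so the same page can serve both WebGPU-capable and CPU-only browsers.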
According to Result #2, Phi-3.5-mini achieves an even higher speed of 90 tokens per second in a similar setup, suggesting that model iteration has further improved performance. This aligns with the broader trend in AI development, where smaller models are increasingly favored for their efficiency and scalability (Result #3).
The search results also point to fine-tuning Phi-3 for specific tasks such as retrieval-augmented generation (RAG), as discussed in Result #3. Smaller models like Phi-3-mini are particularly attractive in production because of their lower computational demands and faster inference relative to larger models.
Implementation Details
The setup combines three components:
- Transformers.js: An open-source JavaScript library for running transformer-based models in the browser, enabling seamless integration with WebGPU for hardware acceleration.
- ONNX Runtime Web: A runtime for executing ONNX models in web browsers, with WebGPU, WebGL, and WebAssembly (WASM) execution backends.
- WebGPU: A low-level API for GPU computation in the browser, allowing developers to leverage hardware-accelerated computing for tasks like AI inference.
These technologies work together to deliver high-performance AI capabilities directly within a web environment, making it accessible to end-users without requiring server-side infrastructure.
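As a rough illustration of how the pieces fit together, the sketch below loads a browser-ready ONNX export of Phi-3-mini with Transformers.js v3 and streams tokens while tracking throughput. The package name reflects the current `@huggingface/transformers` release; the model ID and `q4f16` quantization are assumptions, not details confirmed by the post:

```ts
import { pipeline, TextStreamer } from "@huggingface/transformers";

// Load a text-generation pipeline on the WebGPU backend.
// Any browser-ready ONNX export of Phi-3-mini follows the same pattern.
const generator = await pipeline(
  "text-generation",
  "onnx-community/Phi-3-mini-4k-instruct-onnx-web", // assumed checkpoint
  { device: "webgpu", dtype: "q4f16" } // assumed 4-bit quantization
);

// Stream text as it is generated and keep a rough tokens/sec estimate.
let chunks = 0;
const start = performance.now();
const streamer = new TextStreamer(generator.tokenizer, {
  skip_prompt: true,
  callback_function: (text: string) => {
    chunks++;
    document.body.append(text); // render incrementally
  },
});

const messages = [{ role: "user", content: "Explain WebGPU in one paragraph." }];
await generator(messages, { max_new_tokens: 256, streamer });

const seconds = (performance.now() - start) / 1000;
console.log(`~${(chunks / seconds).toFixed(1)} chunks/sec (≈ tokens/sec)`);
```

Everything above runs client-side in a module script: the weights are downloaded once, cached by the browser, and all subsequent generation happens on the local GPU, which is what makes the private, serverless setup described in the post possible.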
Related Technologies
- Retrieval-Augmented Generation (RAG): As mentioned in Result #3, smaller models like Phi-3-mini are well suited to RAG systems due to their efficiency and scalability. This technique combines a pre-trained model with external data sources at query time (a minimal retrieval sketch follows this list).
- LongRAG: According to Result #5, models like Phi-3-mini are also being explored for LongRAG applications, which focus on document-based question answering and other long-form tasks.
- Llama Models: While not the focus of this post, Llama 3.2 and Phi 3.1 mini 3B are compared in Result #5 for their suitability in RAG and document Q&A tasks, highlighting the competitive landscape of smaller language models.
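To make the RAG connection concrete, the sketch below does browser-side retrieval with a Transformers.js embedding pipeline and a linear scan over a toy document set. The embedding model ID is an assumption, and a real system would use a proper vector store; only the overall shape (embed, retrieve, prepend to the prompt) is the point here:

```ts
import { pipeline, cos_sim } from "@huggingface/transformers";

// Embed a small document set in the browser (model ID is illustrative).
const embed = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");

const docs = [
  "WebGPU exposes modern GPU capabilities to web pages.",
  "Phi-3-mini is a 3.8B-parameter language model from Microsoft.",
  "ONNX Runtime Web executes ONNX models inside the browser.",
];
const docVecs = await Promise.all(
  docs.map(async (d) =>
    Array.from((await embed(d, { pooling: "mean", normalize: true })).data as Float32Array)
  )
);

// Retrieve the most similar document and build an augmented prompt.
async function buildPrompt(question: string): Promise<string> {
  const q = Array.from(
    (await embed(question, { pooling: "mean", normalize: true })).data as Float32Array
  );
  let best = 0;
  let bestScore = -Infinity;
  docVecs.forEach((v, i) => {
    const score = cos_sim(q, v);
    if (score > bestScore) { bestScore = score; best = i; }
  });
  return `Context: ${docs[best]}\n\nQuestion: ${question}`;
}

const prompt = await buildPrompt("What is Phi-3-mini?");
// `prompt` can now be passed to a Phi-3-mini generation pipeline like the one above.
```

This keeps the whole RAG loop on the client, which is exactly where a small, fast model like Phi-3-mini pays off.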
Key Takeaways
- Phi-3-mini achieves impressive performance at 70 tokens per second in a browser-based environment using WebGPU (Result #1).
- Smaller models like Phi-3-mini are well-suited for production environments due to their efficiency and scalability, as noted in Result #3.
- Phi-3.5-mini demonstrates even higher performance, running at 90 tokens per second on similar hardware configurations (Result #2).
- The use of WebGPU and ONNX Runtime Web enables efficient local AI processing, aligning with trends in browser-based machine learning (Results #4 and #2).
Further Research
- How Good Is Phi-3-mini for Everyone? Discusses the performance and suitability of Phi-3-mini for general use.
- Phi-3.5-mini Running in-Browser at ~90 Tokens/Second on WebGPU. Details the speed and capabilities of Phi-3.5-mini when running in a browser environment.
- Fine-Tuning PHI-3 for RAG: Why Small Models Are Best for Production. Explores optimizing Phi-3 for retrieval-augmented generation (RAG) and the benefits of using smaller models in production.
- Phi-3 WebGPU: A Private and Powerful AI Chatbot. Highlights the features of Phi-3 running on WebGPU, emphasizing privacy and performance.
- Are Llama 3.2 and Phi 3.1 mini 3B Any Good for LongRAG or Document QA? Compares the effectiveness of Llama 3.2 and Phi 3.1 mini models for LongRAG (long-document) tasks and document-based question answering.