Compress it! 8bit Post-Training Model Quantization
This week, I want to share with you a few notes about 8-bit post-training quantization of a PyTorch model using the Neural Network Compression Framework (NNCF). The final goal is to squeeze the model so that it delivers excellent performance for local inference on your device, without spending money on expensive new hardware or cloud API providers.
To achieve this goal we are going to walk through several steps, starting with the download of a PyTorch model fine-tuned on MRPC (the Microsoft Research Paraphrase Corpus).
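If you want a taste before clicking through, here is a minimal sketch of what NNCF post-training quantization can look like. It assumes an MRPC-fine-tuned BERT checkpoint; the model id used ("textattack/bert-base-uncased-MRPC") and the calibration details are illustrative choices of mine, not taken from the article.

```python
# Minimal sketch: 8-bit post-training quantization with NNCF.
# The checkpoint id below is an illustrative assumption.
import nncf
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "textattack/bert-base-uncased-MRPC"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

# A small calibration subset drawn from the MRPC validation split.
calibration_data = load_dataset("glue", "mrpc", split="validation").select(range(128))

def transform_fn(example):
    # Turn one raw sentence pair into model inputs; a dict item is
    # assumed to be passed to the model as keyword arguments.
    inputs = tokenizer(
        example["sentence1"],
        example["sentence2"],
        truncation=True,
        max_length=128,
        return_tensors="pt",
    )
    return dict(inputs)

calibration_dataset = nncf.Dataset(calibration_data, transform_fn)

# Statistics are collected on the calibration set, then weights and
# activations are converted to 8-bit.
quantized_model = nncf.quantize(model, calibration_dataset)
```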
Continue Reading >>>
A Ray of light in scaled Generative AI
This week I would like to take a step forward in presenting the opportunity to run advanced local AI capabilities (such as model fine-tuning, retrieval-augmented generation, LangChain-based applications, etc.) based on open-source frameworks specifically designed for CPUs.
After Neural-Speed, NeuralChat, auto-round, and Intel Extension for Transformers, it's time to play with llm-on-ray.
Before embarking on the exploration of Intel/llm-on-ray, we need to take some time to discover the framework it is based on: Ray.io by Anyscale.
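As a first taste of Ray before the full article, here is a minimal sketch of its two core primitives, stateless tasks and stateful actors; the `square` and `Counter` names are made up for illustration.

```python
# Minimal sketch of Ray core primitives: tasks and actors.
import ray

ray.init()  # start a local Ray runtime on this machine

@ray.remote
def square(x: int) -> int:
    # A stateless task Ray can schedule on any worker process.
    return x * x

@ray.remote
class Counter:
    # A stateful actor: a dedicated process holding state across calls.
    def __init__(self):
        self.value = 0

    def increment(self) -> int:
        self.value += 1
        return self.value

# Tasks return futures immediately; ray.get() resolves them.
futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]

counter = Counter.remote()
print(ray.get(counter.increment.remote()))  # 1
```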
Continue Reading >>>
Advanced Weight-only Quantization Technique on CPU
When LLMs started spreading at the end of 2022, some things sounded truly impossible: training, or even just fine-tuning, a model on your modest consumer-grade hardware was a fantasy.
Now, in mid-2024, thanks to intensive scientific research, considerable investment, open governance, open collaboration, and a good dose of human ingenuity, we are able to fine-tune models directly on our devices. Incredible!
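To give a feel for the technique behind the title, here is a minimal round-to-nearest sketch of weight-only INT8 quantization in plain PyTorch. This is the naive baseline, not auto-round's algorithm (auto-round refines the rounding step itself); all names here are my own illustration.

```python
# Minimal sketch: naive round-to-nearest weight-only quantization.
import torch

def quantize_weight_int8(w: torch.Tensor):
    # Symmetric per-output-channel scale: the largest |w| per row maps to 127.
    scale = (w.abs().amax(dim=1, keepdim=True) / 127.0).clamp_min(1e-8)
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_weight(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Inference reconstructs an approximate float weight on the fly.
    return q.float() * scale

w = torch.randn(4096, 4096)
q, scale = quantize_weight_int8(w)
w_hat = dequantize_weight(q, scale)
print(f"mean abs error: {(w - w_hat).abs().mean():.6f}")
print(f"memory: {w.numel() * 4 / 2**20:.0f} MiB fp32 -> {q.numel() / 2**20:.0f} MiB int8")
```

Only the weights are quantized; activations stay in float, which is why this family of techniques is a good fit for memory-bound CPU inference.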
Continue Reading >>>
NeuralChat: deploy a local chatbot within minutes
After showcasing Neural Speed in my past articles, I want to share a direct application of the theory: NeuralChat, a tool developed with Neural Speed as its very first brick.
NeuralChat is presented as "A customizable framework to create your own LLM-driven AI apps within minutes": it is available as part of the Intel® Extension for Transformers, a Transformer-based toolkit that makes it possible to accelerate Generative AI/LLM inference on both CPU and GPU.
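The "within minutes" claim is fairly literal: here is a minimal sketch following the quick-start pattern from the NeuralChat documentation; the prompt text is my own, and the default configuration is assumed to download a chat model on first run.

```python
# Minimal sketch: a local chatbot with NeuralChat's quick-start API.
from intel_extension_for_transformers.neural_chat import build_chatbot

chatbot = build_chatbot()  # default configuration, local CPU inference
response = chatbot.predict("Tell me about 8-bit model quantization.")
print(response)
```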
Continue Reading >>>