Advanced Weight-only Quantization Technique on CPU

When LLMs started spreading at the end of 2022, some things sounded truly impossible: training, or even just fine-tuning, a model on your modest consumer-grade hardware was a fantasy.
Now, in mid-2024, thanks to intensive scientific research, considerable investment, open governance, open collaboration, and a good dose of human ingenuity, we can fine-tune models directly on our own devices. Incredible!
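To make the idea behind the title concrete, here is a minimal sketch of what weight-only quantization does numerically: a symmetric round-to-nearest (RTN) int4 scheme in plain NumPy. This is illustrative only; real engines such as Neural Speed add group-wise scales, bit-packing, and optimized CPU kernels on top of this idea.

```python
import numpy as np

def quantize_rtn_int4(w: np.ndarray):
    """Symmetric round-to-nearest (RTN) int4 weight quantization, per row."""
    # One scale per output row so each row maps onto the int4 range [-7, 7].
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    # Recover an fp32 approximation of the original weights.
    return q.astype(np.float32) * scales

w = np.random.randn(4, 16).astype(np.float32)
q, s = quantize_rtn_int4(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())
```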
Continue Reading >>>
NeuralChat: deploy a local chatbot within minutes

After showcasing Neural Speed in my past articles, I want to share a direct application of the theory: NeuralChat, a tool built with Neural Speed as its very first building block.
NeuralChat is billed as “A customizable framework to create your own LLM-driven AI apps within minutes”: it is available as part of the Intel® Extension for Transformers, a Transformer-based toolkit that makes it possible to accelerate Generative AI/LLM inference on both CPU and GPU.
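To give a sense of what “within minutes” means in practice, here is a minimal sketch along the lines of NeuralChat's documented quick start; the model name is only an illustrative choice, and the exact configuration options may vary across versions.

```python
from intel_extension_for_transformers.neural_chat import PipelineConfig, build_chatbot

# Illustrative model choice; swap in any chat model you prefer.
config = PipelineConfig(model_name_or_path="Intel/neural-chat-7b-v3-1")
chatbot = build_chatbot(config)
response = chatbot.predict(query="Tell me about Intel Xeon Scalable Processors.")
print(response)
```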
Continue Reading >>>
Neural Speed, Advanced Usage

After an initial presentation and a follow-up, here is the third episode of my excursus on weight-only quantization, the SignRound technique, and their code implementation: Neural Speed, the tensor-parallelism library and inference engine.
It is an amazing tool, and it is worth exploring further the opportunities it offers through its many options.
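As a taste of what those options look like, here is a sketch of the Transformers-style entry point that the Intel® Extension for Transformers exposes, where 4-bit weight-only quantization is requested at load time; treat the flag and the model name as assumptions to check against the version you install.

```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v3-1"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)

# load_in_4bit requests weight-only quantization; on CPU the heavy
# lifting is delegated to the Neural Speed inference engine.
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)

inputs = tokenizer("Once upon a time,", return_tensors="pt").input_ids
outputs = model.generate(inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```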
Continue Reading >>>