Inference Engine

Neural Speed, Advanced Usage

May 05. 2024

After an initial presentation and a second follow-up, here is the third episode of my excursus on weight-only quantization, the SignRound technique, and their code implementation: the tensor parallelism library and inference engine, Neural Speed.

It is an amazing tool, and it makes sense to explore in more depth the opportunities it offers through its many options.
Continue Reading >>>

Tags: neural speed, Natural Language Processing, Natural Language Understanding, Natural Language Generation, LLM Inference

Efficient LLM inference on CPU: the approach explained

April 04. 2024

In the previous article I presented a new inference engine, Neural Speed, that demonstrates incredible performance and runs efficiently on consumer-grade CPUs, without the need for expensive graphics cards or other dedicated resources.

Before taking a deep dive into the advanced features of Neural Speed, it makes sense to take a seat and try to understand how it works under the hood. So, the intention of this article is to report a few concepts directly from the original Neural Speed documentation.
Continue Reading >>>

Tags: neural speed, Natural Language Processing, Natural Language Understanding, Natural Language Generation, LLM Inference

Effectual LLM inference on Intel CPUs

April 04. 2024

LLMs (Large Language Models) are incredible: they have demonstrated phenomenal performance and proven potential across thousands of different scenarios. However, they require a massive amount of computational resources to run, from several gigabytes of storage to an astronomical quantity of GPU VRAM. This is impractical for the many users who want to run simple natural language tasks, such as classification or entity recognition, and whose only available hardware is consumer-grade. The most obvious solution is to rely on the mainstream cloud providers (the triad of Amazon AWS, Microsoft Azure, and Google Cloud Platform), which offer dedicated services and direct APIs, but limited GPU availability, the chip shortage, and high demand have raised the costs of these online platforms. Those costs are not affordable for the vast audience of software developers, data scientists, researchers, students, hobbyists, and small professionals who want to try out LLM-based solutions and build something innovative from scratch, without relying on third-party services or a continuous, stable Internet connection. Not to mention that the privacy and security offered by external providers are not always fully guaranteed.

The alternative I want to propose is to develop solutions that could take full advantage of your personal hardware.

But how is it possible to deploy large language models, considering their immense number of parameters?
Continue Reading >>>

Tags: neural speed, Natural Language Processing, Natural Language Understanding, Natural Language Generation, LLM Inference

Bare-Metal AI

Generative AI on prem: secure, ethical, and accessible.
