Creating a pure Java LLM-infused application with Quarkus, Langchain4j and Jlama
Currently the vast majority of LLM-based applications rely on external services provided by specialized companies. These services typically offer access to huge, general-purpose models, implying energy consumption, and therefore costs, proportional to the size of those models.
Even worse, this usage pattern also comes with both privacy and security concerns, since it is virtually impossible to know how those service providers may eventually reuse their customers' prompts, which in some cases could also contain sensitive information.
For these reasons, many companies are exploring the option of training or fine-tuning smaller models that do not claim to be usable in a general context but are tailored to specific business needs, and then running (serving, in LLM terms) these models on premises or on private clouds.
The features provided by these specialized models need to be integrated into the existing software infrastructure, which in the enterprise world is very often written in Java. This could be accomplished with a traditional client-server architecture, for instance serving the model through an external server like Ollama and querying it through REST calls. While this does not present any particular problem for Java developers, they could work more efficiently if they could consume the model directly in Java, without needing to install additional tools. Finally, the possibility of embedding the LLM interaction directly in the same Java process running the application makes it easier to move from local development to deployment, relieving IT of the burden of managing an external server and thus bypassing the need for a more mature platform engineering strategy. This is where Jlama comes into play.
How and why to execute LLM inference in pure Java with Jlama
Jlama is a library that executes LLM inference in pure Java. It supports many LLM model families such as Llama, Mistral, Qwen2 and Granite. It also implements out of the box many useful LLM-related features like function calling, model quantization, mixture of experts and even distributed inference.
Jlama is well integrated with Quarkus through the dedicated langchain4j-based extension. Note that for performance reasons Jlama uses the Vector API, which is still incubating in Java 23 and will very likely become a final, supported feature in a future Java release.
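For reference, wiring this extension into a Maven project typically boils down to a single dependency declaration along these lines (the coordinates and version property shown here are assumptions; check the Quarkus Langchain4j documentation for the exact values):

<dependency>
    <!-- artifactId and version property are assumptions; see the extension docs -->
    <groupId>io.quarkiverse.langchain4j</groupId>
    <artifactId>quarkus-langchain4j-jlama</artifactId>
    <version>${quarkus-langchain4j.version}</version>
</dependency>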
In essence, Jlama makes it possible to serve an LLM in Java, directly embedded in the same JVM running your Java application. But why could this be useful? This is actually desirable in many use cases and presents a number of relevant advantages, such as the following:
- Similar lifecycle between model and app: There can be use cases where the model and the application using it have the same lifecycle, so that the development of a new feature in the application also requires a change in the model. Similarly, since prompts are very dependent on the model, when the model gets updated, even through fine-tuning, the prompt may need to be revised. In these situations, having the model embedded in the application contributes to simplifying the versioning and traceability of the development cycle.
- Fast development/prototyping: Not having to install, configure and interact with an external server can make the development of an LLM-based Java application much easier.
- Easy model testing: Running the LLM inference embedded in the JVM also makes it easier to test different models and their integration during the development phase.
- Security: Performing the model inference in the same JVM instance that runs the application using it eliminates the need to interact with the LLM through REST calls, thus preventing the leak of private data and allowing user authorization to be enforced at a much finer grain.
- Monolithic applications support: The former point will also be beneficial for users still running monolithic applications, who in this way will be able to include LLM-based capabilities in those applications without changing their architecture or platform.
- Monitoring and Observability: Running the LLM inference in pure Java will also simplify monitoring and observability, making it possible to gather statistics on the reliability and speed of the LLM responses.
- Developer Experience: Debuggability will be simplified in the same way, allowing the Java developer to also navigate and debug the Jlama code if necessary.
- Distribution: Having the possibility to run LLM inference embedded in the same Java process will also make it possible to include the model itself in the same package as the application using it (even though this is probably advisable only in very specific circumstances).
- Edge friendliness: The possibility of implementing and deploying a self-contained LLM-capable Java application will also make it a better fit than a client/server architecture for edge environments.
- Embedding of auxiliary LLMs: Many applications, especially the ones relying on agentic AI patterns, use many different LLMs at once. For instance, a smaller LLM could be used to validate and approve the responses of the main, bigger one. In this case a hybrid approach could be convenient, embedding the smaller auxiliary LLMs while still serving the main one through a dedicated server.
The site summarizer: a pure Java LLM-based application
To demonstrate how Quarkus, Langchain4j and Jlama make it straightforward to create a pure Java LLM-infused application, with the LLM inference directly embedded in the same JVM running the application, I created a simple project that uses an LLM to automatically generate the summary of a Wikipedia page or, more generally, of a blog post taken from any website.
This video demonstrates how the application works.
Out of the box this project uses a small Llama-3.2 model with 4-bit quantization. When the application is compiled for the first time, the model is automatically downloaded locally by Jlama from the Hugging Face repository. However, it is possible to replace this model and experiment with any other one by simply editing the quarkus.langchain4j.jlama.chat-model.model-name property in the application.properties file.
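For example, the relevant part of application.properties could look like the following minimal sketch (the model name used here is an illustrative placeholder for a 4-bit quantized Llama-3.2 model hosted on Hugging Face, not necessarily the exact identifier used by the project):

# Hypothetical example: replace with the Hugging Face id of the model you want to try
quarkus.langchain4j.jlama.chat-model.model-name=tjake/Llama-3.2-1B-Instruct-JQ4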
After compiling and packaging the project with mvn clean package, the simplest way to use it is to launch the jar, passing as an argument the URL of the web page containing the article that you want to summarize, something like:
java -jar --enable-preview --enable-native-access=ALL-UNNAMED --add-modules jdk.incubator.vector target/quarkus-app/quarkus-run.jar https://www.infoq.com/articles/native-java-quarkus/
This will generate an output like the following:
$ java -jar --enable-preview --enable-native-access=ALL-UNNAMED --add-modules jdk.incubator.vector target/quarkus-app/quarkus-run.jar https://www.infoq.com/articles/native-java-quarkus/
WARNING: Using incubator modules: jdk.incubator.vector
__ ____ __ _____ ___ __ ____ ______
--/ __ \/ / / / _ | / _ \/ //_/ / / / __/
-/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\ \
--\___\_\____/_/ |_/_/|_/_/|_|\____/___/
2024-11-28 11:01:09,295 INFO [io.quarkus] (main) site-summarizer 1.0.0-SNAPSHOT on JVM (powered by Quarkus 3.16.4) started in 0.402s. Listening on: http://0.0.0.0:8080
2024-11-28 11:01:09,299 INFO [io.quarkus] (main) Profile prod activated.
2024-11-28 11:01:09,300 INFO [io.quarkus] (main) Installed features: [cdi, langchain4j, langchain4j-jlama, qute, rest, smallrye-context-propagation, smallrye-openapi, vertx]
2024-11-28 11:01:11,006 INFO [org.mfu.sit.SiteSummarizer] (main) Site crawl took 1701 ms
2024-11-28 11:01:11,010 INFO [org.mfu.sit.SiteSummarizer] (main) Text extraction took 3 ms
2024-11-28 11:01:11,010 INFO [org.mfu.sit.SiteSummarizer] (main) Summarizing content 17749 characters long
2024-11-28 11:01:11,640 INFO [com.git.tja.jla.ten.ope.TensorOperationsProvider] (main) Using Native SIMD Operations (OffHeap)
2024-11-28 11:01:11,647 INFO [com.git.tja.jla.mod.AbstractModel] (main) Model type = Q4, Working memory type = F32, Quantized memory type = I8
The text you provided is a summary of the Kubernetes Native Java series, which is part of the "Native Compilations Boosts Java" article series. The series aims to provide answers to questions about native compilation, such as how to use native Java, when to switch to native Java, and what framework to use.
The text also mentions the following key points:
* Native compilation with GraalVM makes Java in the cloud cheaper.
* Native compilation raises many questions for all Java users, such as how to use native Java, when to switch to native Java, and what framework to use.
* The series will provide answers to these questions.
Overall, the text provides an overview of the Kubernetes Native Java series and its goals, highlighting the importance of native compilation in the cloud and the need for answers to specific questions about native Java.
Here is a summary of the key points:
* Native compilation with GraalVM makes Java in the cloud cheaper.
* Native compilation raises many questions for all Java users, such as how to use native Java, when to switch to native Java, and what framework to use.
* The series will provide answers to these questions.
I hope this summary is helpful. Let me know if you have any further questions or if there's anything else I can help with.
---
Site summarization done in 53851 ms
2024-11-28 11:02:03,164 INFO [io.quarkus] (main) site-summarizer stopped in 0.012s
Note that it is necessary to launch the JVM with a few additional arguments that enable preview features and access to the incubating Vector API, which Jlama uses internally to speed up the computation.
As explained in more detail in the project's readme, there is a dedicated operating mode to process Wikipedia pages, and the service can also be exposed through a REST endpoint.
The internal implementation of this project is relatively straightforward: after programmatically extracting the text to be summarized from the HTML page containing it, that text is sent to Jlama to be processed through an ordinary Langchain4j AiService.
import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;
import io.smallrye.mutiny.Multi;
import jakarta.inject.Singleton;

@RegisterAiService
@Singleton
public interface SummarizerAiService {

    @SystemMessage("""
            You are an assistant that receives the content of a web page and sums up
            the text on that page. Add key takeaways to the end of the sum-up.
            """)
    @UserMessage("Here's the text: '{text}'")
    Multi<String> summarize(String text);
}
As anticipated, although this implementation looks identical to any other LLM inference engine integration, in this case there is no remote call to an external service: the LLM inference is performed directly inside the same JVM running the application.
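To give an idea of how this service can be consumed, and of what the REST operating mode mentioned above might look like, here is a minimal, hypothetical sketch of a resource that streams the summary back to the caller (the class name and path are illustrative assumptions, not the project's actual code):

import io.smallrye.mutiny.Multi;
import jakarta.inject.Inject;
import jakarta.ws.rs.POST;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.core.MediaType;

// Hypothetical REST resource: class name and path are illustrative assumptions
@Path("/summarize")
public class SummarizerResource {

    @Inject
    SummarizerAiService summarizer; // the AiService shown above

    @POST
    @Produces(MediaType.TEXT_PLAIN)
    public Multi<String> summarize(String text) {
        // Streams the summary tokens to the client as Jlama generates them
        return summarizer.summarize(text);
    }
}

Since the AiService returns a Multi<String>, the summary can be streamed token by token instead of being delivered only when the whole response is complete.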
The combination of two trends, the increasing spread of small, tailored models and the adoption of these models in enterprise software development, will very likely promote the use of similar solutions in the near future.