Large language models (LLMs) are fundamental technologies in the field of machine learning. They are crucial for companies that need to manage the demands of thousands of customers efficiently and in a personalized way. However, managing these models, such as LLAMA and Falcon, can be challenging in environments where the massive use of GPUs is not practical or economically viable. This has generated a need for innovative solutions that optimize resource utilization and reduce operating costs. In this context, techniques such as QLoRA and LoRA emerge as viable options. These strategies allow models to be tailored to the specific needs of each customer without overloading systems or incurring high costs (Xu et al., 2023).
Efficient Adjustment Strategies: QLoRA and LoRA
An LLM can be defined as a function f(x,W)=yf(x, W) = yf(x,W)=y, where xxx is the input sequence, andy the output sequence, and WWW This is the set of weights that are adjusted during model training. A model's efficiency depends largely on how these weights are managed. While traditional weight updates can be costly and slow, QLoRA and LoRA have introduced approaches that store and update changes. ΔW\Delta WΔW more efficiently (Xu et al., 2023). These methods allow for lighter and more economical model adjustments, which is essential in resource-constrained environments, such as those that cannot deploy large hardware infrastructures.
The LoRA Technique: Reducing the Memory Footprint
LoRA, as Zhang et al. (2023) explain, uses singular value decomposition (SVD) to decompose changes ΔW\Delta WΔW in two matrices WaW_aWa y WbW_bWb. This allows for a significant reduction in the model's memory footprint. Multiplying Wa×WbW_a \times W_bWa×Wb provides an accurate approximation of ΔW\Delta WΔW, This facilitates rapid updates during inference. Furthermore, the decomposition range, set to 3, optimizes the fitting process by ensuring that only linearly independent rows and columns are used, thus reducing computational complexity.


