A Case for Distributed Computing for Fintech LLMs



The previous year, 2023, was clearly a standout year for advancements in the field of AI. It has traditionally been felt that getting the most out of AI requires a strong investment in infrastructure and support, and this has never been clearer than last year, thanks to the advent of Generative AI. Most traditional AI technology prior to Gen AI performed reasonably well on a handful of GPUs and a modest amount of RAM. All of this changed after the release of GPT-3 by OpenAI and the subsequent release of a large number of open-source models. These Large Language Models were large in every sense: they needed massive computational resources in the form of high-performance GPUs and large amounts of memory. The financial services sector in particular is recognized as a top beneficiary of this technology. The resources this sector spends on analyzing and processing data, particularly textual data, can be optimized to a large extent using LLMs. In fact, it is open-source LLMs that have found the most utility in this sector, for multiple reasons:

(a) Importance of data and security: A great deal of data in the financial sector is sensitive. It must be secured and kept away from public access, and a leak of this data can cause serious issues for the business. This makes the case for open-source or in-house solutions instead of proprietary ones, particularly for critical and sensitive use cases.

(b) Customization of LLMs: Most use cases in this sector require customizing the LLM on highly company-specific datasets for it to produce correct responses.

It is quite evident that the applicability of open-source LLMs in the financial sector is increasing, but at the same time there are many challenges in even a basic implementation of an LLM solution. The sheer amount of resources required, in terms of both compute and memory, is costly as well as difficult to support. Take the case of a recent milestone, the BigScience project's unveiling of BLOOM, a model with 176 billion parameters capable of supporting 46 natural languages and 13 programming languages. While the public accessibility of these 100B+ parameter models has facilitated their use, the associated challenges of high memory and computational costs persist. Notably, models like OPT-175B and BLOOM-176B demand over 350 GB of accelerator memory for inference, and even more for fine-tuning. Consequently, practical utilization of such LLMs often necessitates multiple high-end GPUs or multi-node clusters, whose high cost limits accessibility for many researchers and practitioners.
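
To put these numbers in perspective, here is a rough back-of-envelope estimate in plain Python. It is illustrative only: the bytes-per-parameter figures are simplifying assumptions, and real footprints also include activations and attention caches.

```python
# Rough memory estimates for a 176B-parameter model (illustrative assumptions only).
N_PARAMS = 176e9  # BLOOM-176B
GB = 1e9

# Inference: one copy of the weights.
print(f"fp16 weights (inference): {N_PARAMS * 2 / GB:.0f} GB")   # ~352 GB
print(f"int8 weights (inference): {N_PARAMS * 1 / GB:.0f} GB")   # ~176 GB

# Naive full fine-tuning with Adam in mixed precision: fp16 weights and gradients
# plus fp32 master weights and two fp32 optimizer moments (~16 bytes per parameter).
bytes_per_param = 2 + 2 + 4 + 4 + 4
print(f"full fine-tuning state: {N_PARAMS * bytes_per_param / GB / 1000:.1f} TB")  # ~2.8 TB
```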

This makes the case for trying out a completely different approach altogether: thinking, as they say, outside the box.

The Client-Server Approach

This makes a distributed computing setup for LLMs one of the possible solutions. It also makes sense since we already use conventional distributed computing systems such as cloud and edge computing. Such a setup facilitates collaboration among multiple users for inference and fine-tuning of large language models over the Internet. Participants in the distributed network can assume the role of a server, a client, or both. A server hosts a subset of model layers, typically Transformer blocks, and manages requests from clients. Clients, in turn, can form a chain of pipeline-parallel consecutive servers to execute inference over the entire model. Beyond inference, one can engage in fine-tuning using parameter-efficient training methods like adapters, or by training entire layers. Trained submodules can be shared on a model hub, where others can leverage them for inference or further training. Existing 100B+ models can be executed efficiently in this collaborative setting, aided by several optimizations such as dynamic quantization, prioritizing low-latency connections, and load balancing between servers. Let's discuss this in a bit more detail.
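
To make the role split tangible, here is a minimal in-process sketch. All names (including the BlockServer class) are hypothetical stand-ins, standard PyTorch encoder layers play the part of the model's blocks, and the networking/RPC layer a real swarm would need is elided entirely.

```python
import torch
import torch.nn as nn

D_MODEL, N_HEADS, N_LAYERS = 256, 4, 8   # toy dimensions, far smaller than BLOOM's

class BlockServer:
    """Hypothetical stand-in for a server hosting consecutive Transformer blocks."""
    def __init__(self, blocks: nn.ModuleList):
        self.blocks = blocks.eval()      # pretrained blocks, served read-only

    @torch.no_grad()
    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # In a real swarm this call would arrive over the network (RPC elided here).
        for block in self.blocks:
            hidden = block(hidden)
        return hidden

# The full stack of blocks, split across two "servers" of 4 consecutive blocks each.
all_blocks = nn.ModuleList(
    nn.TransformerEncoderLayer(D_MODEL, N_HEADS, batch_first=True)
    for _ in range(N_LAYERS)
)
servers = [BlockServer(all_blocks[:4]), BlockServer(all_blocks[4:])]

# Client side: only the comparatively small embedding table lives locally.
embedding = nn.Embedding(1000, D_MODEL)
token_ids = torch.randint(0, 1000, (1, 16))

hidden = embedding(token_ids)            # local embedding lookup
for server in servers:                   # pipeline-parallel chain covering all layers
    hidden = server.forward(hidden)
print(hidden.shape)                      # torch.Size([1, 16, 256])
```

In an actual deployment, each server would advertise which block range it holds so that a client can assemble a chain covering every layer before starting a session.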

Technical Overview of the Design

Practical applications of large language models can be broadly categorized into two main scenarios: inference and parameter-efficient adaptation to downstream tasks. I will try to outline the design of the distributed network, elucidating how it effectively manages both scenarios and facilitates seamless sharing of trained adapters among system users.

  • Inference of billion-scale models: In the token generation process, a client locally stores the model's token embeddings, which typically constitute a small fraction of the total parameter count and fit comfortably in the RAM of most modern laptops, servers, and workstations. The client relies on servers to execute the Transformer blocks, with each server hosting several consecutive blocks, the number of which is determined by its available GPU memory. Before each inference session, the client establishes a chain of servers that collectively covers all model layers. During the session, the client uses its local embedding layer to look up embedding vectors for the prefix tokens, transmits these vectors to the servers, and receives updated representations. After obtaining the outputs of the final block, the client computes next-token probabilities and iterates through this process (a minimal client-side sketch of this loop follows this list). Servers retain attention keys and values from past client inputs for subsequent inference steps, and clients store past inputs to each server to enable a quick replacement if a server fails or goes offline.
  • Training for downstream tasks: While Large Language Models (LLMs) excel at many problems with simple prompt engineering, achieving optimal results often requires training. Traditional fine-tuning strategies, which update all model parameters for the downstream task, become impractical for very large models due to extensive hardware requirements. For instance, fine-tuning BLOOM-176B would demand nearly 3 TB of GPU memory to accommodate the model, gradients, and optimizer states. To address this challenge, the NLP community has devised parameter-efficient fine-tuning methods that preserve most pretrained model parameters. Some approaches select a subset of existing parameters, while others augment the model with additional trainable weights. Despite lower memory requirements, these parameter-efficient approaches often compete favorably with full model fine-tuning and can outperform it in low-data scenarios.
  • Distributed fine-tuning: The fundamental idea behind fine-tuning in a distributed network is that clients own the trained parameters, while servers host the original pretrained layers. Servers can run backpropagation through their layers and return gradients with respect to activations, but they do not update server-side parameters. This allows clients to concurrently execute different training tasks on the same set of servers without interference (see the second sketch after this list).
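
As referenced in the first bullet above, below is a minimal greedy-decoding loop from the client's point of view. It is a self-contained toy under stated assumptions: the "servers" are frozen local PyTorch modules standing in for remote Transformer blocks, server-side attention caching and failover are elided (the whole sequence is re-sent each step), and every name is hypothetical.

```python
import torch
import torch.nn as nn

VOCAB, D_MODEL = 1000, 128
torch.manual_seed(0)

# Client-side state: the embedding table (also used as the LM head via weight tying).
embedding = nn.Embedding(VOCAB, D_MODEL)

# Stand-ins for remote servers, each hosting a few frozen Transformer blocks.
def make_server(n_blocks: int) -> nn.Module:
    blocks = nn.Sequential(*[
        nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        for _ in range(n_blocks)
    ])
    return blocks.eval().requires_grad_(False)

server_chain = [make_server(2), make_server(2)]   # together they cover all layers

@torch.no_grad()
def generate(prefix_ids: torch.Tensor, max_new_tokens: int = 8) -> torch.Tensor:
    ids = prefix_ids
    for _ in range(max_new_tokens):
        hidden = embedding(ids)                       # local embedding lookup
        for server in server_chain:                   # remote calls in a real swarm;
            hidden = server(hidden)                   # servers would cache keys/values
        logits = hidden[:, -1] @ embedding.weight.T   # local LM head (tied weights)
        next_id = logits.argmax(dim=-1, keepdim=True) # greedy next-token choice
        ids = torch.cat([ids, next_id], dim=1)
    return ids

prefix = torch.randint(0, VOCAB, (1, 5))
print(generate(prefix))   # the prefix plus 8 generated token ids
```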
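
For the distributed fine-tuning bullet, the second sketch shows the ownership split: server blocks stay frozen and merely propagate gradients back through their activations, while the only trainable parameters (here, a handful of soft-prompt embeddings plus a small task head, purely an illustrative choice) live with the client. Again, everything runs in one process and all names are hypothetical.

```python
import torch
import torch.nn as nn

D_MODEL, N_PROMPT, N_CLASSES = 128, 4, 2

# Server side: frozen pretrained blocks. They backpropagate gradients with respect
# to activations but never update their own weights.
server_blocks = nn.Sequential(*[
    nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True) for _ in range(4)
]).requires_grad_(False)

# Client side: the only trainable parameters, a soft prompt and a task head.
soft_prompt = nn.Parameter(torch.randn(1, N_PROMPT, D_MODEL) * 0.02)
task_head = nn.Linear(D_MODEL, N_CLASSES)
optimizer = torch.optim.Adam([soft_prompt, *task_head.parameters()], lr=1e-3)

def training_step(token_embeds: torch.Tensor, labels: torch.Tensor) -> float:
    batch = token_embeds.shape[0]
    prompts = soft_prompt.expand(batch, -1, -1)
    hidden = torch.cat([prompts, token_embeds], dim=1)  # prepend client-owned prompts
    hidden = server_blocks(hidden)     # forward/backward pass through frozen blocks
    logits = task_head(hidden[:, 0])   # client-side task head on the first position
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()                    # gradients flow back to the soft prompt only
    optimizer.step()                   # updates touch client parameters only
    return loss.item()

embeds = torch.randn(8, 16, D_MODEL)   # pretend these came from the embedding layer
labels = torch.randint(0, N_CLASSES, (8,))
print(training_step(embeds, labels))
```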

Internals and Optimizations

Performance considerations are paramount for distributed inference, involving three key aspects: computation speed (a 5-year-old gaming GPU vs. a new data-center GPU), communication delay due to node distance (intercontinental vs. local), and bandwidth-induced communication delay (10 Mbit/s vs. 10 Gbit/s). While even consumer-grade GPUs like the GeForce RTX 3070 are capable of executing a complete inference step of BLOOM-176B in less than a second, the challenge lies in GPU memory constraints, necessitating efficient solutions. One way to address this is by employing quantization for optimized parameter storage and dynamic server prioritization for enhanced communication speed.

  • Using consumer GPUs: Assuming each server has at least 16 GB of CPU RAM and 8 GB of GPU memory, the primary objective is to minimize the model's memory footprint so that each device can host more Transformer blocks. For BLOOM with 176B parameters, which requires 352 GB of GPU memory in 16-bit precision, we can compress hidden states through dynamic blockwise quantization and reduce the weights to 8-bit precision using mixed matrix decomposition. This substantially reduces the required number of nodes, effectively halving latency and minimizing the likelihood of failure.
  • Compressing communication buffers: Applying dynamic blockwise quantization to hidden states before pipeline-parallel communication halves the bandwidth requirement without degrading generation quality (see the sketch after this list).
  • Compressing model weights: Using 8-bit mixed matrix decomposition for matrix multiplications reduces memory usage by roughly half without quality degradation.
  • Collaborating over the Internet: To ensure reliable inference and training despite nodes joining, leaving, or failing, we can use the hivemind library for decentralized training, together with custom fault-tolerance protocols for servers and clients.
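
To illustrate the compressed-communication-buffers point, here is a stand-alone per-block absmax int8 quantizer for hidden states written in plain PyTorch. It only sketches the idea and the roughly 2x payload saving versus fp16; the production-grade dynamic blockwise scheme is the one implemented in the bitsandbytes library, and the block size here is an arbitrary illustrative choice.

```python
import torch

BLOCK = 256  # quantization block size (illustrative choice)

def blockwise_quantize(x: torch.Tensor, block: int = BLOCK):
    """Quantize a tensor to int8 with one absmax scale per block of values."""
    flat = x.to(torch.float32).flatten()
    pad = (-flat.numel()) % block
    flat = torch.cat([flat, flat.new_zeros(pad)])   # pad to a whole number of blocks
    blocks = flat.view(-1, block)
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.round(blocks / scales).to(torch.int8)
    return q, scales, x.shape, pad

def blockwise_dequantize(q, scales, shape, pad):
    flat = (q.to(torch.float32) * scales).flatten()
    if pad:
        flat = flat[:-pad]
    return flat.view(shape)

hidden = torch.randn(1, 16, 2048, dtype=torch.float16)   # activations to be sent
q, scales, shape, pad = blockwise_quantize(hidden)        # what goes on the wire

fp16_bytes = hidden.numel() * 2
int8_bytes = q.numel() * 1 + scales.numel() * 4           # int8 payload + fp32 scales
print(f"fp16 payload: {fp16_bytes} B, int8+scales payload: {int8_bytes} B")

restored = blockwise_dequantize(q, scales, shape, pad).to(torch.float16)
print("max abs error:", (restored - hidden).abs().max().item())
```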

Democratization and Privacy Concerns

We can take inspiration from blockchain to address the potential imbalance between peers supplying GPU resources (servers) and those utilizing these servers for inference or fine-tuning. A system of incentives could be implemented: peers running servers could earn special points, redeemable for high-priority inference and fine-tuning or other rewards. This approach aims to encourage active participation and maintain a balanced network. An acknowledged limitation of the current approach is a potential privacy concern: peers serving the initial layers of the model could use their inputs to recover the input tokens. One way to address this is for users handling sensitive data to limit their clients to trusted servers or to set up their own isolated swarm. We could also explore privacy-enhancing technologies such as secure multi-party computation or privacy-preserving hardware from NVIDIA.

Conclusion

My aim with this blog is to introduce my take on distributed computing for AI, to explain why it is needed, and to give a brief technical overview of one possible approach to implementing it. I am open to discussing new ideas for implementing this. Considering that AI will see massive application in the financial sector in the coming years, we have to start thinking about how to optimally utilize current resources before creating new ones. Another aim is to democratize access to large language models, enabling a broader range of applications, studies, and research questions that were previously challenging or cost-prohibitive.

 
