Training a 100x Smaller ModernBERT on DeepSeek R1's Reasoning Ability

Using labels generated by DeepSeek R1 to train a much smaller ModernBERT classification model

Jan 29, 2025

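The recently released DeepSeek-R1 model has drawn a lot of attention for reasoning ability reported to match or exceed that of OpenAI's o1. But a model like DeepSeek-R1 consumes resources in proportion to its performance. Several distilled versions have been released to address this; being far smaller, they can be used in many more settings.

Even so, there are cases where an even smaller and lighter BERT-family model is the better fit. For example, if you need a simple classification model, or you work in a resource-constrained environment, fine-tuning a single lightweight ModernBERT model is more efficient than running a very large GPT-style model.

The catch is that there is often not enough labeled training data for the classifier. To get around this, we can use a larger LLM with strong reasoning ability to generate not synthetic input data but synthetic labels (reasoning-based labels). This lets us train a much cheaper and lighter model while still distilling, that is, transferring, a good share of DeepSeek's reasoning ability into it.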

A DeepSeek-BERT model that finds papers introducing new datasets

Most of the ideas in the code come from the post linked below.

Distiling DeepSeek reasoning to ModernBERT classifiers

How can we use the reasoning ability of DeepSeek to generate synthetic labels for fine tuning a ModernBERT model?

danielvanstrien.xyz

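Suppose someone wants to quickly pick out, from the papers posted to ArXiv, only those that introduce a new dataset. Searching the vast ArXiv corpus for just the word “dataset” returns far too many ML papers and wastes time. What we really want are the papers whose context says “we built this and are releasing it”, and a simple keyword filter is often not enough for that.

This is where an LLM can help. We give an LLM with strong reasoning ability, such as DeepSeek-R1, the title and abstract of a paper and let it judge for itself whether the paper actually proposes a new dataset, then record that judgment as a label. Fine-tuning a smaller model such as ModernBERT on these labels gives fast, lightweight classification without calling the LLM every time.

The overall flow is as follows. First, load the ArXiv metadata hosted on the Hugging Face Hub. With polars, a high-performance DataFrame library, we can read the title, the abstract, and the other metadata fields. We then keep only papers with a category (categories) starting with “cs.”, which narrows things down to computer science, and of those keep only papers that contain the word “dataset”. This leaves papers related to datasets or benchmarks, so the labeling step that follows is more focused.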

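%pip install polars huggingface_hub hf_transfer datasets openai --upgrade

import os

# Speed up large file transfers. This relies on the hf_transfer package, and the
# variable is set before huggingface_hub is imported so the setting is picked up.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

import polars as pl
from huggingface_hub import snapshot_download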

files = snapshot_download(
    repo_id="librarian-bots/arxiv-metadata-snapshot",
    allow_patterns=["*.parquet"],
    repo_type="dataset",
)

df = pl.scan_parquet(files)

# snapshot_download fetches the ArXiv metadata (parquet) hosted on the Hugging Face Hub.
# pl.scan_parquet(...) uses Polars' lazy evaluation,
# which suits metadata that can be quite large.

# Peek at the first row only
print(df.head(1).collect())

# Keep rows whose categories column has an entry starting with "cs.".
# Together with the next filter, this leaves computer science papers that mention "dataset" in the title or abstract.
df = df.filter(
    pl.col("categories")
    .str.split(" ")
    .list.eval(pl.element().str.startswith("cs."))
    .list.any()
)

# Keep rows whose title or abstract contains the word "dataset"
df = df.filter(
    pl.col("title").str.contains("dataset") | pl.col("abstract").str.contains("dataset")
)

# Actually load the data into memory (collect)
df = df.collect()
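
One detail worth checking: `str.contains("dataset")` is case-sensitive, so a title such as "Dataset of ..." is only kept if the abstract happens to contain the lowercase word. The sketch below restates the two filters in plain Python on a few made-up rows so the logic is easy to test. `is_cs_paper`, `mentions_dataset`, and the sample rows are hypothetical and not part of the original code.

```python
import re

# Case-insensitive match, unlike the case-sensitive str.contains("dataset") above.
DATASET_RE = re.compile(r"dataset", re.IGNORECASE)


def is_cs_paper(categories: str) -> bool:
    """True if any space-separated arXiv category starts with 'cs.'."""
    return any(cat.startswith("cs.") for cat in categories.split(" "))


def mentions_dataset(title: str, abstract: str) -> bool:
    """True if 'dataset' appears in the title or abstract, ignoring case."""
    return bool(DATASET_RE.search(title) or DATASET_RE.search(abstract))


# Made-up rows for illustration only.
rows = [
    {"categories": "cs.CL stat.ML", "title": "Dataset of Korean Reviews", "abstract": "We release a corpus."},
    {"categories": "math.CO", "title": "A dataset of graphs", "abstract": "We collect graphs."},
    {"categories": "cs.LG", "title": "Faster optimizers", "abstract": "We study convergence."},
]

kept = [r for r in rows if is_cs_paper(r["categories"]) and mentions_dataset(r["title"], r["abstract"])]
print([r["title"] for r in kept])
```

In Polars itself, a pattern with an inline flag such as `(?i)dataset` should give the same case-insensitive behavior, since `str.contains` treats its argument as a regular expression by default.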

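Next, we use DeepSeek-R1 or one of its distilled versions to generate labels for the papers. One caveat: if you simply ask the model for an answer, the response format can vary from call to call, so we request structured output in JSON. For example, the “label” field must be either “new_dataset” or “no_new_dataset”, and the “explanation” field must explain the reasoning behind the decision.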

# Getting the response from the LLM as JSON makes it easy to use the result directly as a Python object.
# The example below uses pydantic to define a schema with a label and an explanation.

from enum import Enum
from typing import Annotated

from pydantic import BaseModel, StringConstraints


class DatasetLabel(str, Enum):
    # DatasetLabel enum:
    # forces a choice between "new_dataset" and "no_new_dataset".

    NEW = "new_dataset"
    NOT_NEW = "no_new_dataset"


class IntroducesNewDataset(BaseModel):
    # explanation: a string of at least 40 characters (so the LLM gives a reasonably detailed explanation).
    explanation: Annotated[str, StringConstraints(min_length=40)]

    # label: must be a member of the DatasetLabel enum.
    label: DatasetLabel

For this, it helps to define a class (schema) with pydantic that has an explanation of at least 40 characters and a label restricted to a fixed set of choices. If the JSON the model returns fails pydantic validation, an error is raised, so malformed outputs are easy to filter out.

The DeepSeek-R1 Distill models can be run locally with a tool called LM Studio. LM Studio offers an OpenAI-compatible API mode, so it can be called from Python through the openai library.

!lms server start
!lms ls | grep DeepSeek
!lms load DeepSeek-R1-Distill-Qwen-7B-GGUF

-----------------------------------------------------------------------
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
client.models.list()

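To use it, run LM Studio in server mode, set base_url to http://localhost:1234/v1 in the Python code, and set api_key to an arbitrary value such as lm-studio. Then pass the name of the model you want through the model argument of a call such as client.chat.completions.create, and LM Studio loads that model and runs inference. For example, you can use a model such as deepseek-r1-distill-qwen-7b.

The prompt below asks the model to look at the title and abstract, decide between new_dataset and no_new_dataset, and report the decision together with its reasoning as JSON.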

def format_text_as_prompt(data: dict[str, str]):
    return f"""Look at the title and abstract for the following arXiv paper. Assess whether the paper is likely to introduce a newly created dataset.

Title: {data['title']}
Abstract: {data['abstract']}

Your role is to decide whether the paper introduces a newly created dataset. First you should think about whether the paper is likely to introduce a newly created dataset. You should then return your reasoning and the label you've chosen. 
You should choose out of the "new_dataset" or "no_new_dataset" labels.

Return your reasoning and the label you've chosen as a JSON object like this:
```json
{{
    "label": "new_dataset" | "no_new_dataset",
    "explanation": "The reasoning the model used to come to its conclusion"
}}"""

With the prompt built this way, the model sees the title and abstract and is asked to return a label in JSON form.

examples = df.head(4).select(pl.col(["abstract", "title"])).to_dicts()
print(examples[0])  # {'abstract': '...', 'title': '...'}

messages = [
    {"role": "user", "content": format_text_as_prompt(examples[0])},
]

response = client.beta.chat.completions.parse(
    model="deepseek-r1-distill-qwen-7b",
    messages=messages,
    temperature=0.7,
    response_format=IntroducesNewDataset,
)

result = IntroducesNewDataset.model_validate_json(response.choices[0].message.content)
print(result)
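
The loop that follows calls a `predict_label` helper that this post never defines. A minimal sketch of what it presumably looks like is given here: it wraps the structured-output call above and returns `None` when the request or the validation fails. The factory `make_predict_label` and its parameters are assumptions made so the sketch stays self-contained; in the notebook you would pass in the `client`, `IntroducesNewDataset`, and `format_text_as_prompt` objects defined earlier.

```python
from typing import Any, Callable, Optional


def make_predict_label(
    client: Any,
    schema: Any,
    prompt_fn: Callable[[dict], str],
    model: str = "deepseek-r1-distill-qwen-7b",
) -> Callable[[dict], Optional[Any]]:
    """Build a predict_label(data) function bound to a client and a schema (hypothetical helper)."""

    def predict_label(data: dict) -> Optional[Any]:
        messages = [{"role": "user", "content": prompt_fn(data)}]
        try:
            response = client.beta.chat.completions.parse(
                model=model,
                messages=messages,
                temperature=0.7,
                response_format=schema,
            )
            # Validate the raw JSON text against the schema, as in the single call above.
            return schema.model_validate_json(response.choices[0].message.content)
        except Exception as exc:  # request errors, malformed JSON, schema violations
            print(f"Prediction failed: {exc}")
            return None

    return predict_label
```

With this in place, `predict_label = make_predict_label(client, IntroducesNewDataset, format_text_as_prompt)` would give the one-argument function that the loop expects.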

from rich import print as rich_print

structured_results = []
for example in examples:
    title = example["title"]
    abstract = example["abstract"]
    prediction = predict_label(example)
    structured_results.append(prediction)
    rich_print(title)
    rich_print(abstract)
    rich_print(prediction)
    rich_print("---")

An Entropy-based Variable Feature Weighted Fuzzy k-Means Algorithm for
  High Dimensional Data
  This paper presents a new fuzzy k-means algorithm for the clustering of high
dimensional data in various subspaces. Since, In the case of high dimensional
data, some features might be irrelevant and relevant but may have different
significance in the clustering. For a better clustering, it is crucial to
incorporate the contribution of these features in the clustering process. To
combine these features, in this paper, we have proposed a new fuzzy k-means
clustering algorithm in which the objective function of the fuzzy k-means is
modified using two different entropy term. The first entropy term helps to
minimize the within-cluster dispersion and maximize the negative entropy to
determine clusters to contribute to the association of data points. The second
entropy term helps to control the weight of the features because different
features have different contributing weights in the clustering process for
obtaining the better partition of the data. The efficacy of the proposed method
is presented in terms of various clustering measures on multiple datasets and
compared with various state-of-the-art methods.

IntroducesNewDataset(
    explanation="The paper presents an algorithm for clustering high-dimensional data, focusing on feature 
weighting and entropy-based modifications to the fuzzy k-means method. The abstract mentions that their proposed 
method is evaluated against various datasets using different measures. Since the title doesn't suggest a new 
dataset but rather an improvement or variation in an existing one (fuzzy k-means), and the abstract emphasizes 
performance evaluation across multiple datasets without indicating the introduction of a new one, it's reasonable 
to assume that no new dataset was created in this paper.",
    label=<DatasetLabel.NOT_NEW: 'no_new_dataset'>
)

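We can take just the label from the structured output, but we can also let the model write out its whole reasoning process and then return the value freely. In other words, is it better to let the model use <think>, or not? Daniel van Strien notes that allowing <think> is probably better; I leave that judgment to the reader.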

# compare the results vs structured output
for i, example in enumerate(examples):
    rich_print(example["title"])
    rich_print(example["abstract"])
    prediction = predict_label_without_structured_output(example)
    print(f"Previous: {structured_results[i].label}")
    print(f"New: {prediction}")
    rich_print("---")

An Entropy-based Variable Feature Weighted Fuzzy k-Means Algorithm for
  High Dimensional Data
  This paper presents a new fuzzy k-means algorithm for the clustering of high
dimensional data in various subspaces. Since, In the case of high dimensional
data, some features might be irrelevant and relevant but may have different
significance in the clustering. For a better clustering, it is crucial to
incorporate the contribution of these features in the clustering process. To
combine these features, in this paper, we have proposed a new fuzzy k-means
clustering algorithm in which the objective function of the fuzzy k-means is
modified using two different entropy term. The first entropy term helps to
minimize the within-cluster dispersion and maximize the negative entropy to
determine clusters to contribute to the association of data points. The second
entropy term helps to control the weight of the features because different
features have different contributing weights in the clustering process for
obtaining the better partition of the data. The efficacy of the proposed method
is presented in terms of various clustering measures on multiple datasets and
compared with various state-of-the-art methods.
Previous: DatasetLabel.NOT_NEW
New: <think>
Okay, so I need to figure out whether the paper introduces a newly created dataset. The title and abstract are provided.
The title is: "An Entropy-based Variable Feature Weighted Fuzzy k-Means Algorithm for High Dimensional Data." It mentions an algorithm related to clustering high-dimensional data using fuzzy k-means with some entropy terms and feature weighting.
Looking at the abstract, it says they've proposed a new fuzzy k-means algorithm. The focus is on modifying the objective function by adding two different entropy terms: one to minimize within-cluster dispersion and another to control feature weights because features have varying contributions in clustering.
The paper mentions that their method was tested against various datasets and compared with state-of-the-art methods, but there's no explicit mention of introducing a new dataset. They evaluate performance on multiple existing datasets without specifying any novel data creation here.
So, the key points are: they're improving an algorithm for high-dimensional clustering but don't indicate creating a new dataset; instead, they apply it to various datasets that already exist.
</think>
The paper does not introduce a newly created dataset as part of its methodology. It focuses on enhancing an existing fuzzy k-means algorithm and evaluates its performance across multiple existing datasets.
```json
{
    "label": "no_new_dataset",
    "explanation": "The paper discusses modifications to an existing fuzzy k-means algorithm for high-dimensional data, but it does not mention introducing a new dataset. Instead, it evaluates the proposed method on various datasets that likely already exist."
}
```
---
Identifying Influential Brokers on Social Media from Social Network
  Structure
  Identifying influencers in a given social network has become an important
research problem for various applications, including accelerating the spread of
information in viral marketing and preventing the spread of fake news and
rumors. The literature contains a rich body of studies on identifying
influential source spreaders who can spread their own messages to many other
nodes. In contrast, the identification of influential brokers who can spread
other nodes' messages to many nodes has not been fully explored. Theoretical
and empirical studies suggest that involvement of both influential source
spreaders and brokers is a key to facilitating large-scale information
diffusion cascades. Therefore, this paper explores ways to identify influential
brokers from a given social network. By using three social media datasets, we
investigate the characteristics of influential brokers by comparing them with
influential source spreaders and central nodes obtained from centrality
measures. Our results show that (i) most of the influential source spreaders
are not influential brokers (and vice versa) and (ii) the overlap between
central nodes and influential brokers is small (less than 15%) in Twitter
datasets. We also tackle the problem of identifying influential brokers from
centrality measures and node embeddings, and we examine the effectiveness of
social network features in the broker identification task. Our results show
that (iii) although a single centrality measure cannot characterize influential
brokers well, prediction models using node embedding features achieve F$_1$
scores of 0.35--0.68, suggesting the effectiveness of social network features
for identifying influential brokers.
Previous: DatasetLabel.NEW
New: <think>
Okay, so I need to figure out whether the paper titled "Identifying Influential Brokers on Social Media from Social Network Structure" introduces a new dataset. Let me break this down.
First, looking at the title, it's about identifying influential brokers in social media using network structure. The abstract mentions they used three social media datasets to study these influencers. They compared brokers with source spreaders and central nodes based on centrality measures.
The abstract also talks about tackling the problem of identifying brokers from both centrality measures and node embeddings. It evaluates the effectiveness of network features, getting some F1 scores as a result.
So, I'm trying to see if they created any new dataset or used existing ones. They mention using three datasets: Twitter in their experiments. The paper doesn't seem to introduce any entirely new type of data beyond what's commonly available, like Twitter datasets. They're analyzing these existing datasets with their methods.
Therefore, the paper probably uses existing social media datasets rather than creating a new one.
</think>
The paper does not introduce a newly created dataset; it utilizes existing social media datasets such as Twitter for its analysis.
```json
{
    "label": "no_new_dataset",
    "explanation": "The paper does not introduce any new datasets. It uses three social media datasets, including Twitter, which are already available data sources."
}
```
---
Improving Performance of Automatic Keyword Extraction (AKE) Methods
  Using PoS-Tagging and Enhanced Semantic-Awareness
  Automatic keyword extraction (AKE) has gained more importance with the
increasing amount of digital textual data that modern computing systems
process. It has various applications in information retrieval (IR) and natural
language processing (NLP), including text summarisation, topic analysis and
document indexing. This paper proposes a simple but effective
post-processing-based universal approach to improve the performance of any AKE
methods, via an enhanced level of semantic-awareness supported by PoS-tagging.
To demonstrate the performance of the proposed approach, we considered word
types retrieved from a PoS-tagging step and two representative sources of
semantic information - specialised terms defined in one or more
context-dependent thesauri, and named entities in Wikipedia. The above three
steps can be simply added to the end of any AKE methods as part of a
post-processor, which simply re-evaluate all candidate keywords following some
context-specific and semantic-aware criteria. For five state-of-the-art (SOTA)
AKE methods, our experimental results with 17 selected datasets showed that the
proposed approach improved their performances both consistently (up to 100% in
terms of improved cases) and significantly (between 10.2% and 53.8%, with an
average of 25.8%, in terms of F1-score and across all five methods), especially
when all the three enhancement steps are used. Our results have profound
implications considering the ease to apply our proposed approach to any AKE
methods and to further extend it.
Previous: DatasetLabel.NOT_NEW
New: <think>
Okay, I'm trying to figure out whether the paper "Improving Performance of Automatic Keyword Extraction (AKE) Methods Using PoS-Tagging and Enhanced Semantic-Awareness" introduces a newly created dataset. 
First, looking at the title suggests that it's about improving an existing AKE method, which implies they're working with existing datasets rather than creating new ones.
The abstract mentions experiments conducted on 17 selected datasets for five SOTA AKE methods. They used these datasets to test their approach but didn't mention any new data collection or creation here. 
So, there's no indication that the paper includes a newly created dataset in its methodology or results section.
</think>
The paper focuses on enhancing existing AKE methods using PoS-tagging and semantic-aware criteria without introducing new datasets.
```json
{
    "label": "no_new_dataset",
    "explanation": "The paper does not mention any new datasets being created. It evaluates existing 17 datasets with five SOTA AKE methods."
}
```

Personally, I would rather emit <think> and parse the result, even if that is less reliable. With structured output based on a Pydantic model such as IntroducesNewDataset, the model's output is constrained on the fly at decoding time, which can degrade the quality of what it produces.

def predict_label_without_structured_output(
    data: dict[str, str], model: str = "deepseek-r1-distill-qwen-1.5b", client=client
) -> str:
    prompt = format_text_as_prompt(data)
    messages = [
        {"role": "user", "content": prompt},
    ]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.7,
    )
    return response.choices[0].message.content

# Use a regular expression or a parsing function to pull out the text inside the fenced json block, or find the { "label": ..., "explanation": ... } object directly.

import contextlib
import re
import json

JSON_PATTERN = re.compile(r"```json\n(.*?)```", re.DOTALL)
DIRECT_JSON_PATTERN = re.compile(r"\{[^}]*\}", re.DOTALL)


def try_extract_json_from_text(text: str) -> tuple[str, dict | None]:
    if match := JSON_PATTERN.search(text):
        json_results = match.group(1)
        with contextlib.suppress(json.JSONDecodeError):
            return text, json.loads(json_results)
    if match := DIRECT_JSON_PATTERN.search(text):
        json_text = match.group(0)
        with contextlib.suppress(json.JSONDecodeError):
            return text, json.loads(json_text)
    return text, None
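
When the response has no fenced block, the fallback pattern above grabs the first `{...}` in the text, which could sit inside the reasoning rather than the answer. Stripping the `<think> ... </think>` span first may make parsing more robust. The following is only a sketch of that idea; `split_reasoning` is a hypothetical helper, and it assumes the model emits at most one well-formed `<think>` block before its answer.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example easy to paste
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)
FENCED_JSON_PATTERN = re.compile(FENCE + r"json\s*(.*?)" + FENCE, re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty when no <think> block is found."""
    match = THINK_PATTERN.search(text)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()


# Made-up response in the same shape as the outputs shown earlier.
raw = (
    "<think>\nThe abstract uses {existing} benchmarks only.\n</think>\n"
    "The paper does not introduce a dataset.\n"
    + FENCE + "json\n"
    + '{"label": "no_new_dataset", "explanation": "Evaluates on existing datasets only."}\n'
    + FENCE
)

reasoning, answer = split_reasoning(raw)
fenced = FENCED_JSON_PATTERN.search(answer)
parsed = json.loads(fenced.group(1)) if fenced else None
print(parsed["label"] if parsed else "no JSON found")
```

Keeping the reasoning as a separate string also makes it possible to store it next to the label for later inspection.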

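So, accepting that parsing may fail when the JSON is slightly malformed, I parse the response directly as shown above instead of using structured output. In practice the model sometimes returns clean JSON only, but it may also emit its chain-of-thought text alongside it.

In other words, the JSON may be mixed in with <think> ... </think>, so we parse only the region that contains JSON to obtain label and explanation. A regular expression is enough for this: extract the fenced json { ... } section and read it with json.loads. This fails if the response has no such section or the JSON is malformed; in that case, one option is to retry with exponential backoff using a library such as tenacity.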
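
As a rough illustration of the retry idea, the sketch below queries the model again, with exponentially growing waits, whenever parsing yields nothing. It is hand-rolled with the standard library rather than written with tenacity, and `predict_with_retry`, `ask`, and `parse` are hypothetical names: in the notebook, `ask` would be `predict_label_without_structured_output` and `parse` would return the second element of the tuple from `try_extract_json_from_text`.

```python
import time
from typing import Callable, Optional


def predict_with_retry(
    data: dict,
    ask: Callable[[dict], str],
    parse: Callable[[str], Optional[dict]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Query the model until parse() returns a dict, waiting base_delay * 2**n between tries."""
    for attempt in range(max_attempts):
        parsed = parse(ask(data))
        if parsed is not None:
            return parsed
        if attempt < max_attempts - 1:
            sleep(base_delay * 2**attempt)  # 1s, 2s, 4s with the defaults
    return None  # give up; the caller can drop or relabel this example
```

Because the requests use temperature=0.7, a second attempt samples a different completion, so a retry has a real chance of producing well-formed JSON.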

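If many examples need labels, you can predict a batch of samples in one go, as follows.

df.sample(3000, seed=42): randomly sample 3,000 rows from the full data (seed=42 fixed for reproducibility).
thread_map: a multithreaded map function from tqdm's contrib module, used to parallelize requests and speed up prediction.
predict(): send the prompt to the LLM, then extract and return the JSON result.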

from tqdm.contrib.concurrent import thread_map

sample_df = df.sample(3000, seed=42)
examples = sample_df.select(pl.col(["abstract", "title"])).to_dicts()


def predict(data):
    # LLM prompt -> response -> JSON parsing
    text_output = predict_label_without_structured_output(data)
    parsed = try_extract_json_from_text(text_output)
    return parsed


# Run up to 5 requests in parallel; results keep the order of examples.
results = thread_map(predict, examples, max_workers=5)
print(results[0])

# Attach the two fields (label, explanation) to sample_df
labels = [r[1].get("label") if r[1] else None for r in results]
explanations = [r[1].get("explanation") if r[1] else None for r in results]
sample_df = sample_df.with_columns(
    pl.Series(name="labels", values=labels),
    pl.Series(name="explanations", values=explanations),
)

# sample_df now has two new columns, "labels" and "explanations".
print(sample_df.select(["labels", "explanations"]).head(5))

# Check how many "new_dataset" vs "no_new_dataset" labels there are.
label_counts = sample_df.select(pl.col("labels").value_counts())
print(label_counts)
sample_df = sample_df.filter(pl.col("labels").is_in(["new_dataset", "no_new_dataset"]))
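
Before training on these labels, it can help to check how many responses failed to parse and how balanced the two classes are. A small sketch with the standard library follows; `summarize_labels` is a hypothetical helper, and the list passed to it is made up, standing in for the `labels` list built above.

```python
from collections import Counter

VALID_LABELS = {"new_dataset", "no_new_dataset"}


def summarize_labels(labels: list) -> dict:
    """Count valid labels and everything else (None or unexpected strings)."""
    counts = Counter(label for label in labels if label in VALID_LABELS)
    invalid = sum(1 for label in labels if label not in VALID_LABELS)
    total = len(labels)
    return {
        "counts": dict(counts),
        "invalid": invalid,
        "invalid_rate": invalid / total if total else 0.0,
    }


# Made-up labels for illustration.
example_labels = ["new_dataset", "no_new_dataset", None, "no_new_dataset", "maybe"]
print(summarize_labels(example_labels))
```

A high invalid rate would suggest tightening the prompt or adding retries before trusting the filtered dataset.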

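Running this labeling over several thousand papers yields a clean dataset of (title, abstract, label) records. We now use these labels to fine-tune ModernBERT. ModernBERT is a small BERT-family model that suits a wide range of classification tasks and is far lighter and faster at inference than an LLM. The post below covers ModernBERT in more detail.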

https://blog.sionic.ai/modernbert

ModernBERT: Can the latest LLM techniques improve BERT?

A look at the backbone of a new encoder model through ModernBERT, released in December 2024.

blog.sionic.ai

To train ModernBERT, we can use the Hugging Face datasets and transformers libraries.

from datasets import load_dataset
from evaluate import load
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
    EarlyStoppingCallback,
)
import numpy as np

# 1) Load the labeled dataset (e.g. the 'davanstrien/arxiv-new-datasets' dataset produced by this labeling process)
ds = load_dataset("davanstrien/arxiv-new-datasets", split="train")

# 2) Map labels to indices.
# label2id, id2label: map the two labels "new_dataset" and "no_new_dataset" to 0 and 1 (or 1 and 0).
labels = ds.features["labels"].names
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

# 3) Combine the title and abstract into a single text
def combine_text(row):
    return {"text": row["title"] + " " + row["abstract"]}

# ds.map(combine_text): join title + abstract into a new column named text.
ds = ds.map(combine_text)

# 4) Train/test split
ds = ds.train_test_split(test_size=0.2, stratify_by_column="labels")

# 5) Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

# 6) Tokenization function
# tokenize_function: tokenize the text into the model's input format, truncating inputs that are too long.
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_datasets = ds.map(tokenize_function, batched=True)

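With datasets, data in csv or parquet format is converted into a Dataset object, split into train and test sets, and tokenized with the tokenizer. ModernBERT is then loaded with AutoModelForSequenceClassification and trained with the Trainer class. Passing a compute_metrics function to Trainer lets you monitor accuracy and F1 during training and at evaluation time. Hyperparameters such as the learning rate, batch size, number of epochs, and warmup ratio can all be adjusted. When training finishes, check the final metrics with trainer.evaluate and save the model with trainer.save_model.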
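
To make the warmup ratio and the cosine schedule concrete, here is a small worked example of how the learning rate would evolve. It assumes a hypothetical training set of 2,400 examples on a single device, together with the batch size, epoch count, and peak learning rate used in the training arguments that follow. The multiplier follows the usual linear-warmup-then-cosine-decay shape, and `lr_at_step` is an illustrative helper, not a transformers function.

```python
import math

train_examples = 2400  # hypothetical dataset size
batch_size = 8         # per_device_train_batch_size, single device assumed
epochs = 20            # num_train_epochs (upper bound; early stopping may end sooner)
peak_lr = 3e-5         # learning_rate
warmup_ratio = 0.1

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)


def lr_at_step(step: int) -> float:
    """Linear warmup to peak_lr, then half-cosine decay towards zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print(steps_per_epoch, total_steps, warmup_steps)
for step in (0, warmup_steps // 2, warmup_steps, total_steps):
    print(step, f"{lr_at_step(step):.2e}")
```

Under these assumptions the first tenth of training ramps the rate up from zero, and the rest decays it smoothly, which is one reason a fairly long epoch budget combined with early stopping can be reasonable.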

# Metrics for measuring accuracy and F1
accuracy_metric = load("accuracy")
f1_metric = load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    acc = accuracy_metric.compute(predictions=preds, references=labels)["accuracy"]
    f1 = f1_metric.compute(predictions=preds, references=labels, average="weighted")["f1"]
    return {
        "accuracy": acc,
        "f1": f1,
    }

# Load ModernBERT
model = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base",
    num_labels=2,
    label2id=label2id,
    id2label=id2label,
)

# Training hyperparameters
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=64,
    num_train_epochs=20,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    eval_strategy="epoch",  # called evaluation_strategy in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    weight_decay=0.01,
    max_grad_norm=1.0,
    label_smoothing_factor=0.1,
    logging_dir="./logs",
    logging_strategy="epoch",
)

# Pad each batch automatically
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[
        EarlyStoppingCallback(early_stopping_patience=5, early_stopping_threshold=0.001)
    ],
)

trainer.train()
eval_results = trainer.evaluate()
print("\nFinal evaluation results:", eval_results)

# Save the best model
trainer.save_model("./best_model")
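
Since `compute_metrics` above reports F1 with `average="weighted"`, it may help to see what that number means: the per-class F1 scores are averaged with weights equal to each class's share of the true labels. The sketch below computes this by hand on made-up predictions. `f1_for_class` and `weighted_f1` are hypothetical helpers meant to mirror that definition, not the `evaluate` implementation itself.

```python
from collections import Counter


def f1_for_class(y_true: list, y_pred: list, cls: int) -> float:
    """F1 for one class, treating it as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true: list, y_pred: list) -> float:
    """Average of per-class F1, weighted by how often each class occurs in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(f1_for_class(y_true, y_pred, c) * n / total for c, n in support.items())


# Made-up labels: class 0 is the majority class.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
print(round(weighted_f1(y_true, y_pred), 4))
```

With imbalanced labels, the weighted score is dominated by the majority class, so it can be worth also looking at the F1 of the minority class on its own.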

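Wrapping up

Following the steps above, a ModernBERT model learns to predict whether a paper introduces a new dataset, using labels produced by DeepSeek R1. The resulting model is ready to use right away, and serving it for inference in production does not require the heavy resources an LLM does. Because it has indirectly inherited the LLM's reasoning ability through the labels, it can still be expected to hold up reasonably well in accuracy on this kind of classification problem.

This approach is particularly useful in RAG settings where simple text matching is not enough and some reasoning is needed, and when labels are scarce or hard to produce: train a simple model on the “mostly good enough” labels that DeepSeek generates. Paper titles and abstracts served as the example here, but the same method applies to other text data, for instance product review classification, automatic routing of support emails, or detection of specific events.

LLMs reason well, but every call is expensive in resources and cost.
That does not mean every classification problem has to be solved with an LLM alone.
Use the LLM only to generate labels, fine-tune a lightweight ModernBERT model on them, and build an inference pipeline that serves ModernBERT for fast, efficient classification in real time.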

Written by Sigrid Jin
