From Bidirectional Encoders to State-of-the-Art: A Review of BERT and Its Transformative Impact on Natural Language Processing

Author: Rajesh Gupta

Journal: Informatics. Economics. Management - Информатика. Экономика. Управление.

Section: Informatics, Computer Engineering

Article in issue: 3 (1), 2024.

Free access

Bidirectional Encoder Representations from Transformers (BERT), first developed by Google researchers in 2018, represents a breakthrough in natural language processing (NLP). BERT achieved state-of-the-art results on a range of NLP tasks using a single transformer-based neural network architecture. This work reviews BERT's technical approach, its performance at the time of publication, and its significant influence on research since release. We provide background on BERT's foundations, such as transformer encoders and transfer learning from universal language models. Key technical innovations include deep bidirectional conditioning through a masked language modeling objective during BERT's unsupervised pre-training. For evaluation, BERT was fine-tuned and tested on eleven NLP tasks, ranging from question answering to sentiment analysis and including the GLUE benchmark, achieving new state-of-the-art results. The work also analyzes BERT's enormous research impact as an accessible method that outperforms specialized models. BERT catalyzed the adoption of pre-training and transformer neural network architectures for NLP. Quantitatively, more than 10,000 papers have extended BERT, and it is widely integrated into industry applications. Future directions build on BERT by scaling toward billions of parameters and multilingual representations. In sum, this work reviews the method, performance, impact, and outlook of BERT as a foundational NLP technique.


BERT, machine learning, natural language processing, transformers, neural network

Short URL: https://sciup.org/14129602

IDR: 14129602 | DOI: 10.47813/2782-5280-2024-3-1-0311-0320

Bidirectional Encoder Representations from Transformers (BERT) was first introduced in a 2018 paper from researchers at Google. BERT represents a milestone technique in natural language processing (NLP), achieving state-of-the-art results on a variety of NLP tasks while using a single underlying model architecture. This introductory chapter provides background and an overview of BERT [1-5].

We first present a short history of natural language processing to provide context, discussing key techniques that preceded BERT, such as recurrent neural networks like LSTMs, the attention mechanisms used in transformers, and the limits of unidirectional compared with bidirectional models. We introduce common NLP tasks that BERT targets, such as question answering, sentiment analysis, and textual entailment [6-10].

Next, we motivate the need for pre-trained language representations that can be fine-tuned for specific tasks rather than developing specialized models, which reduces duplication of effort. We discuss BERT's key technical innovations at a high level, including transformer encoders, bidirectional training, and masked language modeling objectives during pre-training. Additionally, we summarize BERT's impressive performance improvements over prior NLP approaches, achieving new state-of-the-art results on eleven NLP tasks with a single model architecture [11-20].

In the last section of Chapter 1, we provide an outline for the remainder of this paper. Chapter 2 will provide comprehensive technical background. Chapter 3 will detail BERT's development and pre-training methodology. Chapters 4 and 5 will discuss the significant research impact of BERT and key areas it has influenced.

TECHNICAL BACKGROUND

As background before discussing BERT's methodology and research impact, this chapter provides a comprehensive overview of foundational neural network architectures that enabled the development of language representation techniques like BERT [1, 21-24].

We begin with an in-depth history of recurrent neural networks (RNNs), the predecessors of transformer-based models such as BERT. For decades, RNNs were the state of the art for sequential data modeling. We provide technical background on RNNs, starting with simple Elman networks and leading to more complex gated networks like LSTMs and GRUs. Equations and diagrams explain how these networks process input sequences by maintaining internal memory states. We discuss seminal papers leveraging these architectures for language tasks, including machine translation, speech recognition, and text generation.
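
To make the idea of an internal memory state concrete, the following is a minimal sketch of an Elman-style recurrence in plain Python. The function names, the weight layout (lists of rows), and the tanh nonlinearity are illustrative assumptions, not details taken from the reviewed papers.

```python
import math


def elman_step(x, h_prev, W_xh, W_hh, b_h):
    """One recurrent update: h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1} + b_h)."""
    h = []
    for i in range(len(b_h)):
        total = b_h[i]
        total += sum(W_xh[i][j] * x[j] for j in range(len(x)))
        total += sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        h.append(math.tanh(total))
    return h


def run_elman(inputs, W_xh, W_hh, b_h):
    """Process a sequence left to right, carrying the hidden state forward."""
    h = [0.0] * len(b_h)  # the initial memory state is all zeros (an assumption)
    states = []
    for x in inputs:
        h = elman_step(x, h, W_xh, W_hh, b_h)
        states.append(h)
    return states
```

Because each state depends only on the previous state and the current input, everything the network remembers about earlier tokens has to fit into that one vector, which is the limitation the next paragraph describes.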

However, RNNs struggled to capture longer-term dependencies because they rely on a single hidden-state vector. Attention mechanisms were introduced to augment RNNs by allowing later processing steps to refer back to prior hidden states. The encoder-decoder structure used in neural machine translation established a methodology for unidirectional conditional prediction that influenced BERT.

Recently, transformers have superseded RNNs given advantages in parallelizability and memory access. We devote several sections to explaining transformers in detail as the foundation for BERT. Building off the image recognition advances of CNNs, transformers utilize stacked multi-head self-attention and feed-forward layers for mapping input embeddings into an encoded latent space. Through matrix dot products comparing each token against every other token in a sequence, transformers identify relevant context with less reliance on proximity than RNNs. We provide a full specification of the tensor operations behind multi-head attention calculations, mapping input and output dimensions [25].
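
The token-against-token comparison can be illustrated with a small sketch of scaled dot-product attention for a single head, written in plain Python. This is a simplified illustration under stated assumptions: the function names are hypothetical, the learned query, key, and value projections are omitted, and a real multi-head layer would run several such heads in parallel and concatenate their outputs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for one attention head.

    Q, K and V are lists of row vectors, one row per token.
    Returns the attended outputs and the attention weights.
    """
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Compare this query against every key in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        # Weighted average of the value vectors.
        out = [sum(w[t] * V[t][j] for t in range(len(V))) for j in range(len(V[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights
```

Each row of the returned weights sums to one, and no term depends on the distance between positions, which is why attention relies less on proximity than a recurrent update does.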

Additionally, we discuss design choices made in the original transformer architecture from the "Attention Is All You Need" paper by Google researchers. These include incorporating positional encodings within input embeddings, residual connections between layers, and regularization methods like dropout. We compare the strengths and weaknesses of transformers against prior recurrent networks through experimental results on machine translation tasks, which demonstrated significantly improved performance and efficiency.
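
As one illustration of these design choices, the sketch below builds the fixed sinusoidal positional encodings described in the original transformer paper and adds them to token embeddings. The function names are hypothetical, and the example is a sketch only; BERT itself uses learned position embeddings rather than this fixed table.

```python
import math


def positional_encoding(num_positions, d_model):
    """Sinusoidal table: sin on even dimensions, cos on odd dimensions."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same wavelength.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_positions(embeddings):
    """Add position information to a list of token embedding vectors."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(e_row, p_row)]
            for e_row, p_row in zip(embeddings, pe)]
```

Since self-attention treats its input as an unordered set of vectors, some signal of this kind is needed for the model to distinguish word order.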

By the end of this extensive technical background, readers should have developed intuition regarding the development trajectory from RNNs through transformers that enabled breakthroughs like BERT by overcoming limitations in effectively modeling linguistic context and dependencies. In the next chapter, we leverage this foundation to delve into the specific decisions made for adapting transformers into BERT.

DEVELOPMENT OF BERT

This chapter chronicles the development of the BERT technique by Google researchers beginning in late 2017, motivated by the desire to improve general language understanding beyond task-specific models. We first discuss BERT's architectural improvements over prior transfer learning approaches to support truly bidirectional training. Next, we detail the pretraining data and procedures used. Lastly, we summarize the 11 NLP tasks included in the initial October 2018 BERT paper to evaluate performance [13-15].

On the architectural side, BERT adapted the transformer encoder stack pioneered in the 2017 paper "Attention Is All You Need", which introduced transformers. In the base configuration utilized for pretraining, BERT uses an encoder with 12 layers, 12 self-attention heads, and 768-dimensional hidden states. We provide diagrams of information flow through the BERT transformer blocks. Empirically, this configuration provided strong results with reasonable computational requirements for pretraining.
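
As a rough check on the size of this base configuration, the following sketch counts the weights of an encoder-only stack. The layer count and hidden size come from the text above; the vocabulary size (30,522), maximum sequence length (512), feed-forward width (3,072), and two segment types are assumptions based on the publicly released uncased base model, and the function name is hypothetical. Pretraining heads are not counted.

```python
def bert_encoder_param_count(num_layers=12, hidden=768, intermediate=3072,
                             vocab=30522, max_positions=512, type_vocab=2):
    """Count parameters of a BERT-style encoder (embeddings, layers, pooler)."""
    layer_norm = 2 * hidden  # scale and bias
    # Token, position and segment embeddings plus one LayerNorm.
    embeddings = (vocab + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, each with a bias.
    # The number of heads splits these matrices but does not change their size.
    attention = 4 * (hidden * hidden + hidden)
    # Two-layer feed-forward block.
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    per_layer = attention + layer_norm + feed_forward + layer_norm
    # Dense layer applied to the first token's final hidden state.
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * per_layer + pooler


print(bert_encoder_param_count())  # 109482240
```

Under these assumptions the total comes to about 109.5 million weights, consistent with the roughly 110 million parameters usually quoted for the base model.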

Crucially, BERT trains representations bidirectionally, allowing each word to incorporate context from all tokens in a sentence rather than just previous tokens. This better matches human understanding. Bidirectional conditioning was facilitated by replacing the transformer's output layer with one suited for pretraining tasks like filling in masked tokens. We present specifics of BERT's token masking procedure and the optimization of the Cloze objective loss function.
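
A minimal sketch of the masking step might look as follows. The 15% selection rate and the 80/10/10 split between the mask token, a random token, and the unchanged token follow the original BERT paper; the function name, the per-token coin flip (the reference implementation instead selects a fixed number of positions per sequence), and the toy vocabulary are simplifying assumptions.

```python
import random

MASK = "[MASK]"
SPECIAL = ("[CLS]", "[SEP]")


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted inputs, labels); labels are None where no loss applies."""
    rng = rng or random.Random()
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Never corrupt special tokens; select other positions with mask_prob.
        if tok in SPECIAL or rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK               # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return inputs, labels


tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
vocab = ["the", "cat", "sat", "on", "mat", "dog"]
corrupted, targets = mask_tokens(tokens, vocab, rng=random.Random(0))
```

The loss is computed only at positions where the label is set, so the encoder can read the full left and right context of every token without being able to copy the answer.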

For pretraining data, BERT used BookCorpus, a collection of about 11,000 unpublished books, and the full text of English Wikipedia. This combination covered a wide range of domains and totaled over 3 billion words. We explain the WordPiece subword tokenization and preprocessing flows that BERT used to handle this large corpus. BERT was pretrained for 1 million update steps with a batch size of 256 sequences, approximately 40 epochs over the corpus, performing masked LM and next sentence prediction on each sequence pair.
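
WordPiece tokenization at inference time can be sketched as a greedy longest-match-first search over a subword vocabulary. The function name and the toy vocabulary below are illustrative assumptions; the real vocabulary is learned from the pretraining corpus, and continuation pieces are conventionally marked with a "##" prefix.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"play", "un", "##play", "##ing", "##able"}
print(wordpiece_tokenize("playing", toy_vocab))     # ['play', '##ing']
print(wordpiece_tokenize("unplayable", toy_vocab))  # ['un', '##play', '##able']
```

Splitting rare words into known pieces keeps the vocabulary small while still covering words that never appeared whole in the training data.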

Finally, this chapter examines how the pretrained BERT model was fine-tuned and evaluated on 11 NLP tasks, including those of the GLUE benchmark, ranging from question answering to sentiment analysis to textual entailment. With minimal adaptation, BERT achieved state-of-the-art results on all of these tasks with a single model architecture, simply by adjusting the output softmax layer used for classification. We present accuracy tables from the paper for each subtask [26-27].
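
The task-specific part added during fine-tuning can be very small. The sketch below shows a softmax classification layer applied to a single sentence-level vector, assumed here to be the encoder's final hidden state for the first ([CLS]) token as in the original BERT paper. The function names and the tiny dimensions are hypothetical.

```python
import math


def classify(cls_vector, W, b):
    """Linear layer followed by softmax: one probability per class."""
    logits = [sum(w * x for w, x in zip(row, cls_vector)) + bias
              for row, bias in zip(W, b)]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Fine-tuning loss for a single example with an integer class label."""
    return -math.log(probs[label])


# Two classes over a 2-dimensional vector; real BERT-base vectors have 768 dimensions.
probs = classify([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0])
loss = cross_entropy(probs, 0)
```

In fine-tuning, this layer and all encoder weights are updated together, so switching tasks mainly means changing the number of output classes.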

By detailing the innovations in architecture, pretraining procedure, and testing across a variety of tasks, this chapter provides readers implementation insight to recreate BERT while illustrating what allowed it to surpass prior state-of-the-art results.

RESEARCH IMPACT

This chapter explores BERT's significant research impact since being open sourced by Google in November 2018. We first analyze how BERT became a catalyst within the NLP community by promoting transfer learning as an alternative paradigm to specialized models. Next, we present a representative sample of research extending or modifying BERT for new techniques and applications. Lastly, we examine BERT's widespread industry adoption, serving as a production standard language model.

The publication detailing BERT contributed to a fundamental philosophical shift in the field towards universal language model pretraining rather than intricate task-specific model engineering, leading to an explosion of follow-on papers. Researchers now routinely pretrain models similar to BERT on their unlabeled datasets before specializing the output layers for given applications [28-29].

As evidence of this thriving research ecosystem, over 10,000 papers have extended BERT in some form as tracked by Google Scholar citations. Examples include BioBERT and SciBERT, which pretrain on scientific texts to better understand technical language; ERNIE, which adds continual pretraining mechanisms in Chinese; and VideoBERT, which supports multimodal understanding. Similarly, RoBERTa demonstrated that performance could be improved further simply through longer training with larger mini-batches and more data.

On model compression, DistilBERT reduced BERT's size by 40% while retaining about 97% of its language understanding capabilities, improving deployability by decreasing memory and latency costs. Overall, BERT has become established as a fundamental technique to build upon, removing the need to design architectures from scratch.

Industrially, BERT has been integrated into major production AI systems such as search engines, question answering services, and text processing pipelines to substantially improve natural language capabilities. Many startups working on language-centric products build on BERT or its derivatives. Given compute availability through cloud platforms, both large and small organizations can leverage BERT's power.

In summary, both academic research groups and technology companies prominently feature BERT as an enabling layer for downstream applications, demonstrating BERT's standing as an essential foundational NLP technique five years from initial publication.

CONCLUSION AND FUTURE OUTLOOK

In this concluding chapter, we synthesize key directions the field has taken since BERT's introduction to provide perspective on future areas of innovation in language representation learning. While BERT itself represented a milestone in mimicking human understanding of language, ample room remains for progress.

One active research direction focuses on increasing model scale beyond BERT's 110 million parameters. Later GPT models from OpenAI, such as GPT-3, demonstrated language generation abilities absent from BERT by scaling up to billions of parameters. Similarly, Google's Switch Transformer architecture reduced the computational cost of training, enabling even larger models. However, how much of the performance gain comes from acquired knowledge versus raw model size remains a matter of debate.

Additionally, multilingual and cross-lingual extensions of BERT like mBERT, InfoXLM, and XLM-R have shown the ability to transfer representations across roughly a hundred languages. As digital content grows increasingly globalized, developing models that work across languages without needing retraining has become imperative. Representation techniques that bridge linguistic barriers provide commercial and social benefits [30-31].

Finally, researchers have only begun exploring far transfer, the ability to apply language representations like BERT to entirely new tasks outside the NLP domain in which it was developed. Recent work proposing the use of BERT for mathematical reasoning and software understanding tasks demonstrates intriguing potential. Ultimately, unlocked knowledge transfer may be BERT's longest-lasting contribution.

In conclusion, this paper has charted BERT's technical innovations that catalyzed a new transfer learning paradigm within NLP, empirical performance demonstrations on 11 language understanding tasks, and remarkable research impact. As foundational techniques like BERT continue evolving, machines come closer to mastering nuanced communication abilities once considered definitively human. However, difficult open challenges around model interpretability, theoretical analysis to guide development, and potential negative societal consequences remain active areas of research alongside continual progress in the state-of-the-art.
