Ep48. Large Language Models Can Self-Improve in Long-context Reasoning

11 min · 16 Nov 2024

Description

This research paper investigates how large language models (LLMs) can improve their ability to reason over long contexts. The authors propose a self-improvement method called SEALONG: it samples multiple reasoning outputs from an LLM, scores these outputs using Minimum Bayes Risk (MBR), and then fine-tunes the model either on the highest-scoring outputs or on preference pairs that contrast high-scoring and low-scoring outputs. Extensive experiments on several leading LLMs demonstrate that SEALONG effectively improves long-context reasoning without relying on human annotations or more advanced models. The paper further analyzes how prompting strategies, scoring methods, and training parameters affect SEALONG's performance.
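The scoring-and-selection step described above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: it assumes MBR scoring means ranking each sampled output by its average similarity to the other samples (a consensus score), and it uses `difflib.SequenceMatcher` as a stand-in similarity measure; the paper's actual measure may differ. The functions `similarity`, `mbr_scores`, and `select_training_data` are hypothetical names chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    # Hypothetical stand-in utility: character-level similarity in [0, 1].
    # The paper's actual similarity measure may differ.
    return SequenceMatcher(None, a, b).ratio()


def mbr_scores(samples: list[str]) -> list[float]:
    # MBR-style consensus score: average similarity of each sample to all others.
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute consensus scores")
    scores = []
    for i, candidate in enumerate(samples):
        others = [similarity(candidate, s) for j, s in enumerate(samples) if j != i]
        scores.append(sum(others) / len(others))
    return scores


def select_training_data(samples: list[str]) -> dict:
    # Rank samples by score; the top one serves as a fine-tuning target,
    # and (top, bottom) forms a chosen/rejected pair for preference optimization.
    scores = mbr_scores(samples)
    ranked = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    best, worst = samples[ranked[0]], samples[ranked[-1]]
    return {"sft_target": best, "preference_pair": (best, worst)}


if __name__ == "__main__":
    outputs = ["The answer is 42.", "The answer is 42!", "I think it is 17."]
    print(select_training_data(outputs))
```

In this toy run, the outlier sample disagrees with the other two, so it receives the lowest consensus score and becomes the rejected response in the preference pair.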


All episodes

10 episodes

Ep49. Artificial Intelligence, Scientific Discovery, and Product Innovation

This research paper examines the impact of an artificial intelligence tool for materials discovery on the productivity and performance of scientists working in a large U.S. firm's R&D lab. The study exploits a randomized rollout of the AI tool across teams of scientists, allowing the researchers to draw causal inferences about the effects of the technology. The paper demonstrates that the AI tool significantly increases the rate of materials discovery, patent filings, and product innovation, but these benefits are unequally distributed among scientists. The researchers find that the AI tool is most beneficial to scientists with strong judgment skills, which involve the ability to evaluate and prioritize AI-generated candidate compounds. The study also reveals that the AI tool automates a significant portion of idea generation tasks, resulting in a reallocation of scientist labor towards judgment tasks. This reallocation, along with the increased demand for judgment skills, explains the heterogeneous impact of the AI tool on scientific performance.

18 Nov 2024 · 9 min

Ep48. Large Language Models Can Self-Improve in Long-context Reasoning

16 Nov 2024 · 11 min

Ep47. Personalization of Large Language Models: A Survey

This paper is a survey of personalized large language models (LLMs), outlining different ways to adapt these models to user-specific needs. It analyzes how to personalize LLMs based on various user-specific data such as static attributes, interaction history, and pairwise human preferences. The authors propose taxonomies for personalization granularity (user-level, persona-level, and global preference), techniques (RAG, prompting, representation learning, and RLHF), evaluation metrics (intrinsic and extrinsic), and datasets (with and without ground-truth text). The paper concludes by highlighting key challenges for the future of personalized LLMs, including the cold-start problem, stereotype and bias issues, privacy concerns, and the complexities of multimodality.

16 Nov 2024 · 26 min

Ep46. Number Cookbook: Number Understanding of Language Models and How to Improve It

This research paper investigates the numerical understanding and processing abilities (NUPA) of large language models (LLMs). The authors introduce a benchmark, covering various numerical representations and tasks, to systematically evaluate LLMs' capabilities in handling numbers. The paper finds that while LLMs perform well on simpler tasks, their performance deteriorates significantly as task complexity and input length increase. The authors also explore various techniques to improve NUPA, including specialized tokenizers, positional encodings, and data formats. Despite some successes in improving NUPA during pre-training, these techniques are found to be ineffective when applied to already trained models. The paper concludes that further research is necessary to address the challenges of NUPA in LLMs and enable them to confidently handle numerical tasks in real-world applications.

14 Nov 2024 · 17 min

Ep45. Multi-expert Prompting Improves Reliability, Safety and Usefulness of Large Language Models

This paper describes a novel method called Multi-expert Prompting that aims to improve the reliability, safety, and usefulness of large language models (LLMs). The method simulates multiple experts with different areas of expertise, aggregates their responses to a query, and selects the best answer based on criteria such as truthfulness, factuality, and informativeness. This process is inspired by the Nominal Group Technique, a human-designed decision-making framework. The authors demonstrate that Multi-expert Prompting significantly outperforms prior prompting methods on various benchmarks, especially in scenarios where diverse perspectives are valuable. The paper also discusses ethical considerations related to the potential for bias amplification and explores ways to mitigate these risks.

12 Nov 2024 · 11 min
