Expertly evaluate LLMs.

Design and analyze complex LLM evaluations on your domains. Drive LLM performance increases.

New Thread
Compare LLMs
Thread {
Prompt: 'Only respond to the following topics...',
Conversation: {
Response time: 12 sec,
Input: 'What is the top cause of CO2... ',
},
}

Introduction

Getting started

Start using and evaluating LLMs for your specialized domains in seconds!

Core concepts

Introduction to the core pieces of Threads.

Core actions

Step-by-step guides to using Threads to run, evaluate, compare, & improve LLMs.

Advanced guides

Recipe books, strategies, and guides for bringing your LLMs to the next level.

Communities

Learn about and join key communities using Threads, or start your own!

The documentation is organized into four chapters. If you want to start running and evaluating LLMs immediately, use the quick start guide below.


Motivation: Turning Excitement into Results

LLMs are exciting. You're excited. We're excited. Everyone is excited! But we've seen tech trends happen again where excitement dies down because the technology doesn't deliver.

For LLMs to work, they need to work for you. On your problems. In your domains. And this is where things get difficult. Flashy demos are great, but when you go to apply them to your own problems, they fall short.

In order to successfully apply LLMs to specialized domains, these LLMs must be rigorously and exhaustively evaluated by the community experts. Armed with this knowledge, the community can continue to improve LLM performance, whether that means finding the best off-the-shelf LLM to apply to a problem or continuing to fine-tune a custom LLM. Critically, this evaluation process cannot be limited just to technical engineers, but must include the non-technical domain experts who will drive their progress forward.

Benchmarks are Broken

Right now, LLMs are graded on broad benchmark datasets and evaluation cases. The problem is, these benchmarks are broken.

Models that top the leaderboard often fail when applied "in real life." Error rates are extremely high, especially as they are applied to new tasks on which they were not previously optimized.

Generic Evaluations, Generic Results

Even if generic benchmarks worked great for generic base models, they still are not helpful for specialized domains.

Generic Evaluations Stagnate Progress!

Say you're a doctor (your parents must be very proud!) If you’re trying to understand how an LLM will respond in a complex conversation with a patient, knowing it has an MMLU score of 65.78 is not as helpful as knowing that a doctor has reviewed and graded the answer!!

To truly get value, communities need to come together to continue iteratively fine-tuning and evaluating highly-specialized LLMs for their domains. But you can't fix what you don't know is wrong!

This is where Threads come in.

Threads: Rigorous and Customized LLM Evluations

Threads enable you to evaluate your LLMs in ways that matter to you and your specialized domain. Threads allow domain experts to build specially designed sets of evaluation datasets to test their specialized LLMs, discuss LLM performance on them, and continually iterate.

Why does this matter? Because this process of careful and deliberate evaluation is what will drive massive improvements in LLM performance for your domain. You can’t choose, fix, or customize LLMs for your domain until you have a thorough understanding of its strengths and weaknesses within your domain. As LLMs continue to be fine-tuned for each specific domain or problem, Threads allow the community to find problems and where their LLMs are not working well, and see if new fine-tuned LLMs fix these problems.

Threads allow you to evaluate all aspects of LLM performance for your domain. They encompass domain-specific knowledge, as well as general concepts such as toxicity (can your LLM be tricked into saying something toxic), security (can your LLM be tricked into revealing private information it was trained on), and all the other things you need to think about before deploying an LLM to your domain.