BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval

Hongjin Su*1, Howard Yen*2, Mengzhou Xia*2, Weijia Shi3, Niklas Muennighoff,
Han-yu Wang1, Haisu Liu1, Quan Shi2, Zachary S. Siegel2, Michael Tang2,
Ruoxi Sun4, Jinsung Yoon4, Sercan Ö. Arik4, Danqi Chen2, Tao Yu1
1The University of Hong Kong, 2Princeton University, 3University of Washington, 4Google Cloud AI Research

Why a new benchmark?

Existing retrieval benchmarks primarily consist of information-seeking queries (e.g., aggregated questions from search engines), for which keyword-based or semantic retrieval is usually sufficient. However, many complex real-world queries require in-depth reasoning to identify relevant documents, which goes beyond surface-form matching. For example, finding documentation for a coding question requires understanding the logic and syntax of the functions involved. We introduce BRIGHT to better benchmark retrieval in such challenging and realistic scenarios.

BRIGHT

We introduce BRIGHT, the first text retrieval benchmark that requires intensive reasoning to retrieve relevant documents. We collect 1,385 real-world queries from diverse domains (StackExchange, LeetCode, and math competitions), sourced from naturally occurring or carefully curated human data. We pair these queries with web pages linked in StackExchange answers and with theorems tagged in math Olympiad questions, all of which require deliberate reasoning to identify the connections.

Leaderboard submission

If you would like to submit your results to the leaderboard, email them to suhongjin96@gmail.com. You are encouraged to include a link to the open-sourced codebase. Otherwise, please provide a short description of the models and approaches used (e.g., the size of the retrieval model, and whether LLMs such as GPT-4 or reranking are used).

Have Questions?

Ask us questions on our GitHub issues page or contact Hongjin Su, Howard Yen, or Mengzhou Xia.

Leaderboard

We report the average nDCG@10 score across the 12 datasets in BRIGHT. Instead of the original query, a retriever can use LLM-generated reasoning steps as the query to retrieve relevant documents.
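For readers unfamiliar with the metric, here is a minimal sketch of how nDCG@10 can be computed for one query and then averaged. It is not the benchmark's official evaluation code. The function names, the dictionary-based relevance format, and the plain unweighted mean across datasets are assumptions made for illustration.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k gains (ranks start at 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_doc_ids, relevance, k=10):
    """nDCG@k for a single query.

    ranked_doc_ids: document ids ordered by the retriever, best first.
    relevance: dict mapping document id -> relevance grade (missing ids count as 0).
    Both argument formats are hypothetical, chosen for this sketch.
    """
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    if ideal_dcg == 0:
        # No relevant documents are known for this query.
        return 0.0
    return dcg_at_k(gains, k) / ideal_dcg


def mean_score(scores):
    """Unweighted mean, e.g. over queries in a dataset or over datasets (assumed)."""
    return sum(scores) / len(scores)


# Toy example: two relevant documents, retrieved at ranks 1 and 3.
toy_relevance = {"d1": 1, "d3": 1}
toy_ranking = ["d1", "d2", "d3"]
toy_ndcg = ndcg_at_k(toy_ranking, toy_relevance)
```

In this toy case the second relevant document sits at rank 3 rather than rank 2, so the score falls below 1.0. A ranking that places both relevant documents first would score exactly 1.0.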
Rank | Date | Retriever | Organization | Score
1 | Aug 28, 2024 | BM25, with GPT-4 reasoning and top-100 reranking by Llama-3.1-70B | Salesforce Research (proprietary code) | 30.4
2 | July 11, 2024 | BM25, with gpt-4-0125-preview reasoning | Microsoft | 26.5
3 | July 11, 2024 | BM25, with Claude-3-Opus reasoning | Microsoft | 26.3
4 | July 11, 2024 | instructor-xl, with gpt-4-0125-preview reasoning | The University of Hong Kong, University of Washington | 26.2
5 | July 11, 2024 | google-gecko.text-embedding-preview-0409, dim=768, with gpt-4-0125-preview reasoning | Google | 25.8
6 | July 11, 2024 | instructor-xl, with Llama-3-70B-Instruct reasoning | The University of Hong Kong, University of Washington | 25.8
7 | July 11, 2024 | instructor-xl, with Claude-3-Opus reasoning | The University of Hong Kong, University of Washington | 25.8
8 | July 11, 2024 | BM25, with Llama-3-70B-Instruct reasoning | Microsoft | 25.3
9 | July 11, 2024 | google-gecko.text-embedding-preview-0409, dim=768, with Claude-3-Opus reasoning | Google | 25.0
10 | July 11, 2024 | gte-Qwen1.5-7B-instruct, with gpt-4-0125-preview reasoning | Alibaba | 24.5
11 | July 11, 2024 | google-gecko.text-embedding-preview-0409, dim=768, with Llama-3-70B-Instruct reasoning | Google | 24.5
12 | July 11, 2024 | gte-Qwen1.5-7B-instruct, with Claude-3-Opus reasoning | Alibaba | 24.5
13 | July 11, 2024 | voyage-large-2-instruct, with gpt-4-0125-preview reasoning | Voyage AI | 24.4
14 | July 11, 2024 | GritLM-7B, with gpt-4-0125-preview reasoning | ContextualAI, The University of Hong Kong, Microsoft | 24.0
15 | July 11, 2024 | instructor-xl, with Gemini-1.0-pro reasoning | The University of Hong Kong, University of Washington | 24.0
16 | July 11, 2024 | BM25, with Gemini-1.0-pro reasoning | Microsoft | 23.5
17 | July 11, 2024 | text-embedding-3-large, with gpt-4-0125-preview reasoning | OpenAI | 23.1
18 | July 11, 2024 | gte-Qwen1.5-7B-instruct, with Llama-3-70B-Instruct reasoning | Alibaba | 23.1
19 | July 11, 2024 | instructor-large, with gpt-4-0125-preview reasoning | The University of Hong Kong, University of Washington | 22.9
20 | July 11, 2024 | voyage-large-2-instruct, with Llama-3-70B-Instruct reasoning | Voyage AI | 22.8
21 | July 11, 2024 | GritLM-7B, with Claude-3-Opus reasoning | ContextualAI, The University of Hong Kong, Microsoft | 22.8
22 | July 11, 2024 | voyage-large-2-instruct, with Claude-3-Opus reasoning | Voyage AI | 22.8
23 | July 11, 2024 | text-embedding-3-large, with Claude-3-Opus reasoning | OpenAI | 22.6
24 | July 11, 2024 | google-gecko.text-embedding-preview-0409, dim=768, top-100 reranking by gpt-4-0125-preview | Google | 22.6
25 | July 11, 2024 | google-gecko.text-embedding-preview-0409, dim=768, with Gemini-1.0-pro reasoning | Google | 22.5
26 | July 11, 2024 | Cohere-embed-english-v3.0, with gpt-4-0125-preview reasoning | Cohere | 22.3
27 | July 11, 2024 | instructor-large, with Llama-3-70B-Instruct reasoning | The University of Hong Kong, University of Washington | 22.3
28 | July 11, 2024 | gte-Qwen1.5-7B-instruct, with Gemini-1.0-pro reasoning | Alibaba | 22.3
29 | July 11, 2024 | gte-Qwen1.5-7B-instruct | Alibaba | 22.1
30 | July 11, 2024 | instructor-xl, with GritLM-7B reasoning | The University of Hong Kong, University of Washington | 22.1
31 | July 11, 2024 | voyage-large-2-instruct, with Gemini-1.0-pro reasoning | Voyage AI | 22.1
32 | July 11, 2024 | text-embedding-3-large, with Llama-3-70B-Instruct reasoning | OpenAI | 22.0
33 | July 11, 2024 | Cohere-embed-english-v3.0, with Llama-3-70B-Instruct reasoning | Cohere | 21.9
34 | July 11, 2024 | e5-mistral-7b-instruct, with gpt-4-0125-preview reasoning | Microsoft | 21.8
35 | July 11, 2024 | SFR-Embedding-Mistral, with gpt-4-0125-preview reasoning | Salesforce | 21.7
36 | July 11, 2024 | bge-large-en-v1.5, with gpt-4-0125-preview reasoning | Beijing Academy of Artificial Intelligence | 21.6
37 | July 11, 2024 | instructor-large, with Claude-3-Opus reasoning | The University of Hong Kong, University of Washington | 21.6
38 | July 11, 2024 | SFR-Embedding-Mistral, with Claude-3-Opus reasoning | Salesforce | 21.5
39 | July 11, 2024 | Cohere-embed-english-v3.0, with Claude-3-Opus reasoning | Cohere | 21.5
40 | July 11, 2024 | google-gecko.text-embedding-preview-0409, dim=768, top-10 reranking by gpt-4-0125-preview | Google | 21.5
41 | July 11, 2024 | text-embedding-3-large, with Gemini-1.0-pro reasoning | OpenAI | 21.2
42 | July 11, 2024 | e5-mistral-7b-instruct, with Claude-3-Opus reasoning | Microsoft | 21.1
43 | July 11, 2024 | bge-large-en-v1.5, with Claude-3-Opus reasoning | Beijing Academy of Artificial Intelligence | 20.7
44 | July 11, 2024 | GritLM-7B | ContextualAI, The University of Hong Kong, Microsoft | 20.6
45 | July 11, 2024 | GritLM-7B, with Llama-3-70B-Instruct reasoning | ContextualAI, The University of Hong Kong, Microsoft | 20.5
46 | July 11, 2024 | GritLM-7B, with Gemini-1.0-pro reasoning | ContextualAI, The University of Hong Kong, Microsoft | 20.5
47 | July 11, 2024 | instructor-large, with Gemini-1.0-pro reasoning | The University of Hong Kong, University of Washington | 20.4
48 | July 11, 2024 | bge-large-en-v1.5, with Llama-3-70B-Instruct reasoning | Beijing Academy of Artificial Intelligence | 20.3
49 | July 11, 2024 | google-gecko.text-embedding-preview-0409, dim=768, top-10 reranking by Gemini-1.0-pro | Google | 20.1
50 | July 11, 2024 | SFR-Embedding-Mistral, with Gemini-1.0-pro reasoning | Salesforce | 19.9
51 | July 11, 2024 | SFR-Embedding-Mistral, with Llama-3-70B-Instruct reasoning | Salesforce | 19.7
52 | July 11, 2024 | gte-Qwen1.5-7B-instruct, with GritLM-7B reasoning | Alibaba | 19.7
53 | July 11, 2024 | e5-mistral-7b-instruct, with Llama-3-70B-Instruct reasoning | Microsoft | 19.6
54 | July 11, 2024 | google-gecko.text-embedding-preview-0409, dim=768 | Google | 19.5
55 | July 11, 2024 | Cohere-embed-english-v3.0, with Gemini-1.0-pro reasoning | Cohere | 19.5
56 | July 11, 2024 | google-gecko.text-embedding-preview-0409, dim=768, with GritLM-7B reasoning | Google | 19.3
57 | July 11, 2024 | e5-mistral-7b-instruct, with Gemini-1.0-pro reasoning | Microsoft | 19.3
58 | July 11, 2024 | BM25, with GritLM-7B reasoning | Microsoft | 19.1
59 | July 11, 2024 | instructor-xl | The University of Hong Kong, University of Washington | 18.6
60 | July 11, 2024 | voyage-large-2-instruct, with GritLM-7B reasoning | Voyage AI | 18.5
61 | July 11, 2024 | bge-large-en-v1.5, with Gemini-1.0-pro reasoning | Beijing Academy of Artificial Intelligence | 18.4
62 | July 11, 2024 | GritLM-7B, with GritLM-7B reasoning | ContextualAI, The University of Hong Kong, Microsoft | 18.1
63 | July 11, 2024 | SFR-Embedding-Mistral | Salesforce | 18.0
64 | July 11, 2024 | text-embedding-3-large, with GritLM-7B reasoning | OpenAI | 17.8
65 | July 11, 2024 | text-embedding-3-large | OpenAI | 17.6
66 | July 11, 2024 | voyage-large-2-instruct | Voyage AI | 17.6
67 | July 11, 2024 | e5-mistral-7b-instruct | Microsoft | 17.5
68 | July 11, 2024 | sentence-transformers, with gpt-4-0125-preview reasoning | Technische Universität Darmstadt | 17.5
69 | July 11, 2024 | e5-mistral-7b-instruct, with GritLM-7B reasoning | Microsoft | 17.5
70 | July 11, 2024 | BM25, top-10 reranking by gpt-4-0125-preview | Microsoft | 17.4
71 | July 11, 2024 | SFR-Embedding-Mistral, with GritLM-7B reasoning | Salesforce | 17.2
72 | July 11, 2024 | BM25, top-100 reranking by gpt-4-0125-preview | Microsoft | 17.0
73 | July 11, 2024 | Cohere-embed-english-v3.0 | Cohere | 16.3
74 | July 11, 2024 | sentence-transformers, with Llama-3-70B-Instruct reasoning | Technische Universität Darmstadt | 16.1
75 | July 11, 2024 | sentence-transformers, with Claude-3-Opus reasoning | Technische Universität Darmstadt | 16.1
76 | July 11, 2024 | Cohere-embed-english-v3.0, with GritLM-7B reasoning | Cohere | 16.0
77 | July 11, 2024 | google-gecko.text-embedding-preview-0409, dim=768, top-10 reranking by MiniLM | Google | 16.0
78 | July 11, 2024 | bge-large-en-v1.5, with GritLM-7B reasoning | Beijing Academy of Artificial Intelligence | 15.7
79 | July 11, 2024 | instructor-large, with GritLM-7B reasoning | The University of Hong Kong, University of Washington | 15.7
80 | July 11, 2024 | BM25, top-10 reranking by Gemini-1.0-pro | Microsoft | 15.7
81 | July 11, 2024 | sentence-transformers, with Gemini-1.0-pro reasoning | Technische Universität Darmstadt | 15.3
82 | July 11, 2024 | sentence-transformers | Technische Universität Darmstadt | 14.6
83 | July 11, 2024 | BM25 | Microsoft | 14.3
84 | July 11, 2024 | instructor-large | The University of Hong Kong, University of Washington | 14.0
85 | July 11, 2024 | sentence-transformers, with GritLM-7B reasoning | Technische Universität Darmstadt | 13.7
86 | July 11, 2024 | bge-large-en-v1.5 | Beijing Academy of Artificial Intelligence | 13.6
87 | July 11, 2024 | BM25, top-10 reranking by MiniLM | Microsoft | 13.1
88 | July 11, 2024 | google-gecko.text-embedding-preview-0409, dim=768, top-100 reranking by MiniLM | Google | 9.2
89 | July 11, 2024 | BM25, top-100 reranking by MiniLM | Microsoft | 8.3