Benchmarking Large Language Models for Energy

Samuel, an experienced drilling manager, was working through a series of complex issues on a continuous drilling project in the operations center of a major oil and gas operator. Faced with the challenge of optimizing drilling approaches and keeping operations running smoothly, he turned to two separate sources for solutions: the SLB specialized large language model (SLB LLM) and the general-purpose models GPT-4o and Gemini-1.5-Pro-001. The SLB LLM, built specifically for the oil and gas industry, gave Samuel comprehensive, contextually rich responses. When he asked about fishing, it provided an in-depth analysis of recovery strategies for lost tools, including specific methods and their practical applications in the field. Likewise, its answers to questions about "mud" were rich in industry-specific detail on mud composition and its critical role in preserving wellbore stability.

GPT-4o and Gemini-1.5-Pro-001, on the other hand, could offer general ideas but fell short of the drilling operation's complex needs. When Samuel inquired about "rig" and "rig release," these models gave overviews that, while accurate, lacked the detail needed for real-time operational decision making. Their answers were more theoretical than targeted to the particular challenges and processes relevant to Samuel's task. Although general models cover a broad spectrum of areas well, the side-by-side comparison showed that the SLB LLM's specialized knowledge was essential for handling the difficult, domain-specific questions that arise in oil and gas operations. The exercise underscored the need for customized knowledge to achieve operational effectiveness and success in specialized domains.

Why do we need domain benchmarking?

The oil and gas industry requires specialized, in-depth technical expertise for tasks such as subsurface modeling, generating drilling plans, managing production and operations, and analyzing data. Large language models (LLMs) have revolutionized general-purpose tasks; however, the accuracy of a generic model varies notably for domain-specific applications. A day in the life of a petrophysicist includes analyzing log data from multiple projects while applying technical knowledge and concepts from geology and petroleum engineering; the demand on an LLM is to deliver accurate, domain-specific answers that align with industry standards. Similarly, for a geophysicist the LLM needs to perform complex tasks such as interpreting seismic data, which also requires the ability to correlate specific terminology with geoscientific attributes. And to generate an optimal drilling strategy, an LLM must make complicated decisions, gathering insights from multiple layers of technical data on subsurface conditions, real-time drilling operations, and safety protocols to produce pragmatic suggestions.

Benchmarking plays a pivotal role in comparing model performance. A thorough, comparative analysis is required to objectively identify which LLM best delivers domain-specific responses and meaningful recommendations for the oil and gas sector. Benchmarking reveals each model's adaptability, strengths, and areas for development across fields such as Geophysics, Petrophysics, Drilling, and Production, directly informing accuracy and decision making in those fields.

Data preparation and considerations

During data preparation, material from nine distinct domains—Geomechanics, Reservoir Engineering, Petrophysics, Data Workspace, Well Construction, Production, Geology, Intersect, and Geoscience—was gathered and organized. The data collection comprised well-chosen questions, responses, and contextual knowledge relevant to these technical fields. This was essential because, unlike most generic tasks, the oil and gas industry involves highly technical inquiries that demand precise responses.

Figure 1: The pie chart shows the distribution of various domains, with Data Workspace taking up the largest portion, followed by Petrophysics and Intersect.

Once prepared, the data were used to evaluate eight distinct LLMs to determine how well each could manage the complexity of these domain-specific queries. The models evaluated were OpenAI GPT-4, OpenAI GPT-4o, OpenAI GPT-3.5-Turbo, Mistral Large, Mistral Nemo, Mistral Small, Gemini-1.5-Pro-001, and Gemini-1.5-Flash-001.

Each model received the same set of questions, contextual background, and ground-truth responses derived from domain-specific datasets. The objective was to assess the models' ability to accurately respond to queries based on the information provided, while simultaneously analyzing critical metrics such as the following (a minimal sketch of this evaluation loop appears after the list):

  • Precision: The degree to which the model's response was consistent with the ground-truth answer.
  • Contextual Understanding: The extent to which the model accurately interpreted and used the context provided.
  • Domain Knowledge: The depth of the model's understanding of specialized fields including Petrophysics and Geomechanics.
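
To make the setup concrete, here is a minimal sketch of such an evaluation loop. The `Example` fields, the `ask_model` and `score` callables, and the per-domain averaging are illustrative assumptions, not the actual SLB harness:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Example:
    domain: str        # e.g., "Petrophysics" or "Geomechanics"
    question: str      # the domain-specific query
    context: str       # contextual background given to every model
    ground_truth: str  # reference answer used for scoring

def evaluate(models, dataset, ask_model, score):
    """Send identical question/context pairs to every model and
    average the scores against ground truth per (model, domain) pair."""
    scores = defaultdict(list)
    for model in models:
        for ex in dataset:
            answer = ask_model(model, ex.question, ex.context)
            scores[(model, ex.domain)].append(score(answer, ex.ground_truth))
    return {pair: sum(vals) / len(vals) for pair, vals in scores.items()}
```

Holding the questions, context, and ground truth fixed across models is what makes the comparison fair: the only variable is the model itself.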

Evaluation Metrics

To assess the performance of each LLM, the generated responses were compared to ground-truth values using a variety of metrics that reflect response quality and accuracy. These included:

  • ROUGE Score: The ability of the model to preserve longer word sequences in the correct order, based on the longest common subsequence (LCS) between the generated and reference texts (a minimal computation sketch follows this list).
  • Faithfulness: Whether the generated answer remains true to the actual data provided in the context, without adding hallucinated information.
  • Answer Relevancy: How well the response answers the domain-specific question.
  • Answer Correctness: The extent to which the response is technically accurate in relation to the context provided.
  • Context Precision and Recall: The ability of the model to accurately extract and apply appropriate details from the context provided.
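
As an illustration of the first metric, the LCS-based ROUGE-L score can be computed directly from its standard definition. This is a minimal sketch of the textbook formula, not necessarily the exact tooling used in the benchmark:

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l_f1(generated: str, reference: str) -> float:
    """ROUGE-L F1: rewards preserving long word sequences in order."""
    gen, ref = generated.lower().split(), reference.lower().split()
    lcs = lcs_length(gen, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(gen), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```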

In addition to these metrics, cost efficiency, a critical factor when deploying LLMs at scale, was considered. Specifically, the computational cost per input and output token was determined, since every token an LLM processes incurs a cost. The total cost for each domain-model pair was calculated by aggregating the token counts of its question-answer pairs, allowing cost to be weighed against accuracy as a trade-off. Another critical factor in the evaluation was each LLM's average response time on domain-specific questions.
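
A sketch of that cost aggregation follows; the per-1K-token prices and model keys are placeholders for illustration, not actual vendor pricing:

```python
# Hypothetical (input, output) USD prices per 1K tokens -- placeholders only.
PRICING = {
    "gpt-4o": (0.0025, 0.0100),
    "mistral-nemo": (0.0003, 0.0003),
}

def domain_model_cost(model: str, qa_token_counts) -> float:
    """Total cost for one domain-model pair: sum the input/output token
    counts over all of its question-answer pairs, priced per 1K tokens."""
    in_price, out_price = PRICING[model]
    return sum(
        (tokens_in / 1000) * in_price + (tokens_out / 1000) * out_price
        for tokens_in, tokens_out in qa_token_counts
    )
```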

To reflect each model's overall performance, the individual evaluation metrics were combined into a single score using a weighted total.
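
Such a weighted total reduces to a dot product of metric values and weights. The weights below are illustrative assumptions only; the exact weighting is not published here:

```python
# Illustrative weights summing to 1.0 (assumed; actual weights not published).
WEIGHTS = {
    "rouge": 0.15,
    "faithfulness": 0.25,
    "answer_relevancy": 0.20,
    "answer_correctness": 0.25,
    "context_precision_recall": 0.15,
}

def overall_score(metrics: dict[str, float]) -> float:
    """Combine per-metric scores (each normalized to [0, 1]) into one total."""
    return sum(WEIGHTS[name] * value for name, value in metrics.items())
```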

Results

  • Contextual information matters: Models such as GPT-4o and Mistral-Nemo performed better in domains with rich contextual information, such as Geoscience and Subsurface, than in domains with little context.
  • Model strengths vary by domain: The strengths of Mistral-Nemo and GPT-4o vary considerably across domains. Mistral-Nemo performed best in Geoscience and Subsurface, while GPT-4o was more competitive in Production and Drilling. The two models performed similarly in the Upstream (Overall) domain, making each well suited to different specialized tasks.

    GPT-4o vs Mistral-Nemo Performance Across Domains

    Figure 2: This bar chart compares the performance of GPT-4o and Mistral-Nemo across several domains, including Geoscience, Subsurface, Production, Drilling, and Upstream (Overall). Mistral-Nemo outperformed GPT-4o in most domains, particularly Geoscience and Subsurface, while their scores were closer in Production and Drilling.

    Mistral-Nemo showed strong capability across many specialized fields. It achieved the highest score in Geoscience, demonstrating proficiency in complex geoscientific tasks, and earned the second-highest score in Subsurface. In Production it tied for second, and in Drilling it placed close to the top, reflecting a strong grasp of drilling tasks. GPT-4o, on the other hand, delivered a more balanced performance with some key strengths. It achieved the highest score in the Production domain, its moderate scores in Geoscience and Subsurface remain adequate for general tasks, and it posted competitive results in Drilling.

    Drilling

    Figure 3: The bar chart illustrates the performance of various models in the Drilling domain. Both GPT-4 and Gemini-1.5-Pro-001 achieved the highest scores, while Gemini-1.5-Flash-001 performed the lowest.

    In the Drilling domain, the models' performance revealed significant insights into their abilities. GPT-4 and Gemini-1.5-Pro-001 both achieved the highest scores, indicating expertise in handling drilling-related tasks. Close behind the top performers, Mistral-Nemo showed high competence, while Mistral-Small and GPT-4o also performed solidly with scores of 0.289 and 0.290, respectively.

  • Overall performance insights: Overall, the results indicate that Mistral-Nemo and GPT-4o consistently achieved the highest scores across most domains, with particular proficiency in Geoscience and Subsurface. GPT-4o led the Production domain with the top score, while GPT-4 and Gemini-1.5-Pro-001 led in Drilling. Although the Gemini models were competitive in certain domains, they generally scored lower than the Mistral and GPT models, suggesting areas for improvement. These findings suggest that models such as Mistral-Nemo and the GPT series are well suited for applications requiring comprehensive expertise across specialized fields, while models like Gemini-1.5-Pro-001 may be better suited to domain-specific applications such as drilling.

Way Forward

Several significant enhancements will be the primary focus of LLM benchmarking in the oil and gas industry in the years to come. LLMs are expected to become more proficient at managing specific tasks such as reservoir simulations and predictive maintenance. To improve both accuracy and decision making, benchmarks will focus on models' capacity to incorporate real-time data and inputs from multiple modalities. Significant attention will also be placed on efficiency, including maximizing resource utilization and minimizing operating costs.

Moreover, seismic and wellbore log foundation models will be benchmarked on deeper subsurface characterization, geo-feature detection, and geomechanical property estimation with improved domain-specific data. Benchmarking a model's capacity to process larger, more detailed datasets is necessary to ensure scalability across a variety of geophysical locations and formations. Benchmarking will also evaluate how effectively seismic and well log data can be integrated to produce comprehensive subsurface models, helping geologists make better-informed exploration decisions. Comparing these specialized foundation models with fine-tuned large language models is another important benchmarking aspect: testing how quickly LLMs can be adapted to geoscience tasks, whether they can reach comparable accuracy with limited fine-tuning, and how well they connect with existing geological tools and workflows. This will help determine whether foundation models are necessary for domain-specific applications.

Ready to find the best AI model for your domain-specific needs? Contact us today to learn more about how we can help you select the right model for your specialized applications!