
There's Coffee In That Nebula. Part 8: Exploring the potential of emergent LLM behaviours

Written by
Mariano Cigliano
Published on
August 16, 2024
TL;DR

This article, part 8 of "There's Coffee In That Nebula," explores the capabilities of a single AI agent for Exploratory Data Analysis (EDA) on a Customer Personality Analysis dataset. Here we compare three implementations (LlamaIndex, PandasAI, and LangChain), assessing their ability to handle basic data tasks. Key insights reveal the agent's strengths in data understanding and the limitations of single-agent systems. The article also touches on advanced reasoning techniques like Chain of Thought (CoT) and hints at future exploration of multi-agent systems for more complex data analysis.

Author
Mariano Cigliano
R&D Tech Leader

In our previous article, we introduced the concept of LEDA, a Conversational Retail Customer Analysis system that leverages autonomous LLM agents to revolutionise how retail analysts interact with data.

As we continue our three-part series, we now turn our attention to the development and testing of a single agent for Exploratory Data Analysis (EDA). This article will delve deep into the capabilities of our AI agent when working with a pandas DataFrame, specifically focusing on a Customer Personality Analysis dataset.

Building on the foundation laid in our first article, we'll examine how our autonomous agent tackles the complex task of EDA, a crucial step in understanding customer behaviour and segmentation in retail. We'll explore various implementations of pandas agents and assess their effectiveness in replicating and potentially enhancing human-led EDA processes.

Our journey will take us through the challenges of working with real-world datasets and the evaluation of LLM performance in data analysis tasks. As we progress, we'll keep in mind the strategic implications for retail businesses and the potential this technology holds for transforming data analysis workflows.

Join us as we take another step towards our vision of a more intuitive, efficient, and insightful approach to LLM-driven data analysis.


Single agent approach

The dataset

For our exploration, we've chosen the Customer Personality Analysis dataset.

This dataset, sourced from Kaggle, provides a comprehensive view of a company's customer base, offering a perfect playground for our agent to demonstrate its EDA ability.

The dataset encompasses a wide range of customer attributes, including demographics such as age, education, marital status, and income. It also captures product preferences across various categories like wines, fruits, meat, and fish. Shopping behaviours are represented through spending patterns, use of discount offers, and website visits. Additionally, it includes data on customer responses to marketing campaigns.

It presents several analytical challenges that are common in retail customer analysis. Customer segmentation stands out as a primary task, involving the identification of distinct groups based on purchasing behaviour and demographics. The data also allows for analysis of campaign effectiveness, understanding channel preferences between online and in-store shopping, and estimating the potential long-term value of customers.

To benchmark our AI agent's performance, we're referencing a Kaggle notebook by Karnika Kapoor. This notebook goes beyond basic analysis, proposing a customer segmentation through clustering techniques. The clusters are then profiled, ultimately providing four distinct customer personas. This approach represents a sophisticated level of analysis that combines statistical techniques with business acumen.

It's important to note that with our single-agent approach, we're not aiming to fully replicate or enhance this level of human analysis – at least not yet. Our primary goal at this stage is to understand if a single agent can perform basic tasks on a complex dataframe. We want to explore how well the agent can handle data exploration, generate initial insights, and potentially identify areas for deeper analysis.

By setting this dataset and human analysis as our reference point, we're challenging our agent with a real-world scenario. We'll be assessing its ability to navigate the dataset, identify key trends, and possibly suggest directions for further investigation. While we don't expect the single agent to match the depth of the human analyst's segmentation and persona creation, we're interested in seeing how it approaches these complex tasks.

In the following sections, we'll examine different implementations of our agent working with this dataset. We'll evaluate their basic EDA capabilities, their approach to data exploration, and their ability to generate initial insights. This evaluation will help us understand the current limitations of a single-agent system and set the stage for our future exploration of more complex, multi-agent approaches.

Three Paths, One Goal:
A Comparative Study of Three Implementation Approaches

LlamaIndex

LlamaIndex implements the pandas agent as a query engine within their experimental package. This implementation is designed to convert natural language queries into executable Python code using Pandas.

At the core of this approach is a tool that gives the LLM access to Python's eval function. This allows for direct code execution on the machine running the tool. However, LlamaIndex explicitly warns that this capability introduces potential security risks. They strongly advise against using this tool in production environments without implementing robust sandboxing or virtual machine safeguards.

The query engine also includes options for response synthesis, allowing for more natural language outputs if desired.

That said, LlamaIndex's implementation does not explicitly incorporate advanced reasoning techniques like ReAct. Instead, it relies on the LLM's ability to infer the necessary operations directly from the prompt and the given context. This approach is simpler but limits the agent's ability to handle more complex analytical tasks or explain its reasoning process.
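
For reference, here is a minimal sketch of what using this query engine looks like. The import path and the synthesize_response flag reflect the experimental package at the time of our exploration and may differ across LlamaIndex versions; the file path is a placeholder for a local copy of the Kaggle dataset.

```python
# Minimal sketch of LlamaIndex's experimental pandas query engine; the import path may vary by version.
import pandas as pd
from llama_index.experimental.query_engine import PandasQueryEngine

# Placeholder path for a local copy of the Customer Personality Analysis dataset.
df = pd.read_csv("marketing_campaign.csv", sep="\t")

# verbose=True prints the generated pandas code;
# synthesize_response=True asks the LLM to turn the raw pandas output into natural language.
query_engine = PandasQueryEngine(df=df, verbose=True, synthesize_response=True)

response = query_engine.query("What is the average customer age?")
print(response)
```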

PandasAI

PandasAI stands out as a robust Python library designed specifically for AI-driven data analysis. While it wasn't our intention to compete with this well-developed tool, studying its approach provided valuable insights for our project.

PandasAI offered a versatile solution capable of the following (see the usage sketch after the list):

  • Working with multiple dataframes simultaneously, enhancing its ability to handle complex data analysis tasks.
  • Elucidating the operations performed, improving transparency in AI-driven analysis.
  • Streamlining data visualisation by automatically generating and saving charts.
  • Offering various connectors to different data sources, which simplifies the process of incorporating diverse data types into the analysis workflow and enhances the library's flexibility and utility across different scenarios.
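
As a point of reference, here is a minimal sketch of how PandasAI could be exercised with the SmartDataframe API available at the time of our exploration; the library has since evolved, so treat the class names and config keys as assumptions, and the file path as a placeholder.

```python
# Sketch of PandasAI usage as it looked during our exploration; class names,
# config keys, and the file path are assumptions and may differ in newer releases.
import pandas as pd
from pandasai import SmartDataframe
from pandasai.llm import OpenAI

customers = pd.read_csv("marketing_campaign.csv", sep="\t")  # placeholder local copy of the dataset

llm = OpenAI(api_token="YOUR_API_KEY")
sdf = SmartDataframe(customers, config={"llm": llm, "save_charts": True})

# Natural-language queries; charts are generated and saved automatically when relevant.
print(sdf.chat("Which was the most successful campaign?"))
```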

Although PandasAI implements an agent system that can maintain context across multiple interactions, at the time of our exploration it did not explicitly implement ReAct or similar advanced reasoning techniques out of the box.

It's worth noting that PandasAI has been rapidly evolving since then; the library now appears to have undergone significant developments.

It offers both free and enterprise versions, with the free version limited to 25 queries per month. 

LangChain

LangChain provided support for multiple dataframes, allowing the agent to interact with various data sources simultaneously through a Python shell interface.

At the time of our exploration, LangChain offered two types of agents: OpenAI Functions agents, which are specific to OpenAI's models, and the Zero-shot ReAct agent, which we focused on due to its LLM-agnostic nature.

The ReAct (Reasoning and Acting) framework enables the agent to alternate between reasoning about the task and taking actions, providing a more structured approach to problem-solving that inherently incorporates step-by-step reasoning.
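
Below is a minimal sketch of creating such an agent with LangChain's pandas DataFrame agent and the Zero-shot ReAct agent type. The import paths reflect the langchain-experimental split, the allow_dangerous_code flag appears in more recent versions, and the file path is a placeholder.

```python
# Sketch of LangChain's pandas DataFrame agent with the Zero-shot ReAct agent type;
# import paths and flags may differ across LangChain versions.
import pandas as pd
from langchain.agents.agent_types import AgentType
from langchain_experimental.agents import create_pandas_dataframe_agent
from langchain_openai import ChatOpenAI

df = pd.read_csv("marketing_campaign.csv", sep="\t")  # placeholder local copy of the dataset

agent = create_pandas_dataframe_agent(
    ChatOpenAI(temperature=0),
    df,
    agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,                # prints the Thought / Action / Observation loop
    allow_dangerous_code=True,   # the agent runs Python on the host, same caveat as LlamaIndex
)

agent.invoke("Which age groups are more likely to prefer online shopping?")
```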

Implementation

Before delving into the specifics of our implementation, it's crucial to understand that LEDA's design is rooted in modularity and configurability, principles carried over from our previous Mobegí project.

Our architecture is divided into distinct modules: Brain, Knowledge, Memory, and Skillset. Each module has a specific responsibility, allowing for independent development and easy integration of new components.

The cornerstone of our system's flexibility is our configuration framework. It consists of:

  • Configuration Data structures that define core abstractions.
  • Collapsers that resolve specific implementations based on configuration data.
  • Configurators that manage collections of collapsers for each module.
  • Factories that use configurators to build module components.

This approach allows us to switch between different implementations, such as various LLM models or vector stores, simply by adjusting configuration files. It also facilitates easy experimentation and comparison of different setups.

For our DataFrame query engine, we implemented a collapser specifically designed to handle different DataFrame query approaches; it resolves to a specific implementation based on the provided configuration data.
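
To make the idea concrete, here is an illustrative sketch of such a collapser; the class and configuration names are hypothetical and do not reflect LEDA's actual code.

```python
# Hypothetical illustration of the collapser pattern; names do not reflect LEDA's actual code.
from dataclasses import dataclass
from typing import Protocol

import pandas as pd


class DataFrameQueryEngine(Protocol):
    """Common interface every resolved implementation must satisfy."""
    def query(self, question: str) -> str: ...


@dataclass
class DataFrameQueryConfig:
    provider: str      # e.g. "llamaindex", "pandasai", or "langchain"
    model_name: str    # which LLM the engine should use


class DataFrameQueryCollapser:
    """Resolves a concrete DataFrame query engine from configuration data."""

    def __init__(self, df: pd.DataFrame):
        self.df = df

    def collapse(self, config: DataFrameQueryConfig) -> DataFrameQueryEngine:
        builders = {
            "llamaindex": self._build_llamaindex_engine,
            "pandasai": self._build_pandasai_engine,
            "langchain": self._build_langchain_engine,
        }
        if config.provider not in builders:
            raise ValueError(f"Unknown provider: {config.provider}")
        return builders[config.provider](config)

    # Each builder would wrap one of the implementations sketched earlier.
    def _build_llamaindex_engine(self, config): ...
    def _build_pandasai_engine(self, config): ...
    def _build_langchain_engine(self, config): ...
```

Switching engines then becomes a matter of changing the provider value in a configuration file rather than touching the calling code.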

With this setup in place, we were able to start testing our implementation. We could easily switch between different DataFrame query engine implementations, compare their performance, and identify strengths and weaknesses.

Evaluation

For our evaluation, we continued to leverage LangSmith, which offers a robust set of tools for both debugging and assessing LLM-powered systems. LangSmith provides several evaluator types:

  • QA evaluator: Assesses correctness based on reference answers.
  • Context QA evaluator: Judges correctness based on example outputs without ground truths.
  • COT QA evaluator: Evaluates correctness using chain of thought reasoning.

Additionally, LangSmith offers criteria evaluators that don't require references, allowing assessment of specific aspects like helpfulness, relevance, and coherence. Custom criteria evaluators can also be created. At the time of our evaluation, OpenAI's GPT-4 was considered the best-performing model for judging responses.

Dataset preparation

We prepared a small, hand-curated set of questions (often called a golden set) designed to test various aspects of our agent's capabilities. These questions ranged from simple data queries to more complex analytical tasks. Here are some examples:

Simple queries:

  • "Which is the dataset shape?"
  • "Is the dataset clean?"
  • "What is the average customer age?"

Complex analytical questions:

  • "Which was the most successful campaign?"
  • "Which is the preferred form of purchase for young people with no kids?"
  • "Which age groups are more likely to prefer online shopping?"

For each question, we had a correct answer verifiable in the data, allowing us to assess the accuracy of our agent's responses.
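
As an illustration, registering such a golden set in LangSmith might look like the sketch below; the dataset name is hypothetical and the reference answers shown are placeholders rather than values from our actual set.

```python
# Sketch of registering a golden set in LangSmith; the dataset name is hypothetical
# and the reference answers are placeholders.
from langsmith import Client

client = Client()
dataset = client.create_dataset(
    dataset_name="leda-golden-set",
    description="Hand-curated EDA questions with answers verifiable in the data.",
)

examples = [
    ("Which is the dataset shape?", "<rows> rows and <columns> columns."),
    ("Which was the most successful campaign?", "<campaign verified in the data>"),
]

for question, answer in examples:
    client.create_example(
        inputs={"question": question},
        outputs={"answer": answer},
        dataset_id=dataset.id,
    )
```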

Evaluation Criteria and Setup

We configured our evaluation pipeline with the following components (see the setup sketch after the list):

a) COT QA evaluator: This was used to assess the correctness of responses based on the agent's chain of thought reasoning.

b) Criteria evaluators:

  • Conciseness: To determine if responses provide necessary information without excessive wordiness.
  • Relevance: To check if answers directly pertain to the question's intent and details.
  • Coherence: To assess the logical flow and clarity of responses.
  • Helpfulness: To estimate the utility of the answer for decision support.
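
A sketch of wiring this up through LangSmith's LangChain integration, as it existed at the time of our tests, is shown below; the dataset name and the agent factory are hypothetical.

```python
# Sketch of the evaluation setup using LangSmith's LangChain integration at the time of our tests;
# the dataset name and the build_dataframe_agent factory are hypothetical.
from langsmith import Client
from langchain.smith import RunEvalConfig, run_on_dataset

client = Client()


def build_dataframe_agent():
    """Placeholder factory: in practice, return one of the DataFrame agents sketched earlier."""
    ...


eval_config = RunEvalConfig(
    evaluators=[
        "cot_qa",                              # correctness judged with chain-of-thought reasoning
        RunEvalConfig.Criteria("conciseness"),
        RunEvalConfig.Criteria("relevance"),
        RunEvalConfig.Criteria("coherence"),
        RunEvalConfig.Criteria("helpfulness"),
    ],
)

run_on_dataset(
    client=client,
    dataset_name="leda-golden-set",
    llm_or_chain_factory=build_dataframe_agent,
    evaluation=eval_config,
)
```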

Results

Our evaluation revealed that while LlamaIndex, our simplest implementation, showed lower performance, both PandasAI and LangChain implementations scored similarly and outperformed it. This aligned with our expectations, given the relative complexity and sophistication of these implementations.
As in our previous project, we found that while quantitative metrics provided valuable indicators, manual review of specific datapoints was crucial for a comprehensive understanding of the agent's performance.


Our evaluation process also unveiled significant challenges and limitations in assessing agent-based systems:

  • State-Dependent Evaluation: Unlike traditional question-answering systems, our agent maintains a conversation state. This means that the correctness of an answer can depend on previous interactions, such as data cleaning operations performed earlier in the conversation.
  • Reasoning Process Evaluation: While we could evaluate the correctness of actions taken by the agent, we found ourselves lacking robust metrics to qualitatively assess the agent's reasoning process. An agent might arrive at the correct action for the wrong reasons, a nuance our current evaluation framework struggled to capture.
  • Limitations of Traditional Metrics: We realised that evaluation methods developed for Retrieval-Augmented Generation (RAG) systems don't fully apply to agent-based systems. The dynamic nature of agent interactions and the importance of reasoning processes require new evaluation paradigms.

Findings

Data understanding

One of the most impressive aspects was the agent's ability to infer meaning and relationships from data or column names without requiring examples or in-context learning. 

For instance, just asking “Which was the most successful campaign?” was enough for any of our implementations to identify the right columns and perform the right operations to provide a correct response.

PandasAI lessons

Our exploration of PandasAI yielded valuable insights that we plan to incorporate into our final solution and user experience. The system's transparency, allowing users to view and understand the agent's reasoning process, emerged as a critical feature for building trust and facilitating user comprehension. The seamless integration of automatically generated visualisations significantly enhances data interpretation and analysis efficiency. Finally, the structured output capability, which clearly presents different types of data (text, dataframes, plots) in an intuitive manner, proved to be a powerful tool for effective data communication.

Agent reasoning

The ReAct (Reasoning and Acting) approach is intimately connected with Chain of Thought (CoT) prompting. At its core, CoT encourages the model to break down complex problems into smaller, more manageable steps.
If you are interested, the paper “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” by Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou describes the approach in detail.

One of the most effective CoT techniques involves adding specific instructions and describing a role in the system prompt. This simple addition can dramatically improve a model's ability to tackle complex reasoning tasks.
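
For illustration, a role-plus-instructions system prompt in this spirit might look like the following; the wording is an example, not the exact prompt used in LEDA.

```python
# Illustrative system prompt combining a role description with step-by-step instructions;
# this is an example, not LEDA's actual prompt.
SYSTEM_PROMPT = """You are a senior data analyst working with a pandas DataFrame
of retail customer data.

For every question:
1. Think step by step about which columns and operations are needed.
2. Write and run the pandas code that answers the question.
3. Summarise the result in one or two sentences for a business audience."""
```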

While Chain of Thought has shown promising results, studies like “Chain of Thoughtlessness? An Analysis of CoT in Planning” by Kaya Stechly, Karthik Valmeekam, and Subbarao Kambhampati revealed that performance improvements from CoT prompts were highly dependent on how closely the prompts were tailored to the problem class.


In light of these challenges, the Self-Discover framework presents an innovative approach to reasoning in language models. This framework allows LLMs to self-compose reasoning structures for complex problems, potentially overcoming some of the limitations observed in traditional CoT approaches.

Self-Discover works by having the model select and compose multiple atomic reasoning modules, such as critical thinking and step-by-step analysis, into an explicit reasoning structure. This self-discovered structure then guides the model during its decoding process.

What makes Self-Discover particularly promising is its ability to outperform both standard CoT and more inference-intensive methods like CoT-Self-Consistency, while requiring significantly less computational resources. Furthermore, the reasoning structures discovered through this process have shown commonalities with human reasoning patterns and demonstrate applicability across different model families.

Implications for Multi-Agent Systems

We find there is a striking parallel between the flow of thoughts in a single agent's reasoning process and the potential flow of communication among specialised agents in a multi-agent team.

In a single agent we observe a structured sequence of cognitive steps. The agent breaks down complex problems, considers various aspects sequentially, and builds upon previous thoughts to reach a conclusion. This internal dialogue resembles a conversation, with each step informing and building upon the last.

Now, imagine translating this internal process into a multi-agent system. Instead of these steps occurring within a single agent, we can envision them as interactions between specialised agents, each responsible for a specific aspect of the problem-solving process. In this scenario, the "thoughts" become inter-agent communications, and the reasoning flow becomes a choreographed interaction among team members.

Coming next

As we conclude our exploration of the single agent approach in LEDA, we've seen how implementing and testing various DataFrame query engines has unveiled both the potential and limitations of a single agent system.

These insights have led us to a question. What if, instead of relying on a single agent, we could harness the power of multiple specialised agents working in concert? How might this change the landscape of AI-driven EDA?

In our next article, we'll embark on the second phase of the LEDA project, exploring the potential of a team of agents tackling exploratory data analysis together. We'll delve into how multiple agents can collaborate, specialise, and potentially overcome the limitations we've observed in the single agent model.

Join us as the journey of LEDA continues, and the possibilities are more exciting than ever!