import os
from elevenlabs.client import ElevenLabs
from elevenlabs.conversational_ai.conversation import Conversation
import time
from datetime import datetime, timedelta
import json
# Set up credentials
# Load API key and agent ID from .env file
from dotenv import load_dotenv
load_dotenv()
= os.getenv("ELEVENLABS_API_KEY")
api_key = os.getenv("AGENT_ID")
agent_id
# Create client
= ElevenLabs(api_key=api_key) client
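For reference, load_dotenv() expects a .env file next to the notebook with entries along these lines (the values below are placeholders, not real credentials):

ELEVENLABS_API_KEY=sk-your-elevenlabs-key
AGENT_ID=your-agent-id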
Introduction
About this Project
This project is part of my broader AI Learner initiative: a personal tool designed to help me learn more effectively, a necessity in this fast-changing AI environment.
At its core, this notebook is a self-evaluation system. I have regular voice-based conversations with an AI agent (powered by ElevenLabs for voice and LLMs for reasoning). Each conversation contains one or more questions, and I want to analyze how well I answered them. The system automatically extracts these questions, identifies my responses, and uses a large language model to evaluate my performance.
This helps me answer questions like:
- Where am I consistently strong?
- Where do I need to improve?
- Are there patterns in how I learn or answer over time?
It's like having a personal one-on-one tutor and learning coach: one that listens to how I answer, gives feedback, and helps me focus on what to learn next.
The aim of this Jupyter notebook
This Jupyter notebook will analyse the questions I answered and highlight areas for improvement as well as areas where I am competent. I wanted to automate as much as possible for two reasons: 1) it's time-consuming and I don't have much time, and 2) I want to explore how I can use a large language model as a judge, where the boundaries are, and which models are best for this type of activity.
To implement this analysis system, I need to perform several key steps. Every conversation contains multiple questions that need to be extracted for proper evaluation. When the AI agent retrieves a question from the database, it provides a question ID. This ID is crucial as it allows me to precisely locate where each question begins and ends within the conversation transcript.
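For reference, each of these tool results carries a small JSON payload of the following shape (this example is copied verbatim from a real transcript later in this notebook):

{"question_id": 25, "User_Question": "How do loss functions in contrastive learning measure alignment between image-text pairs?"}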
About ElevenLabs voice agents
ElevenLabs voice agents let you add voice to your agents on your website, mobile app, or telephony in minutes. It's a real-time API that delivers low latency, full configurability, and scalability on tap.
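As an aside, the Conversation class imported at the top of this notebook is the entry point for running such a live voice session. It isn't needed for the analysis below, but a minimal sketch, based on the ElevenLabs quickstart (exact import paths and callback names may differ between SDK versions), looks roughly like this:

from elevenlabs.conversational_ai.conversation import Conversation
from elevenlabs.conversational_ai.default_audio_interface import DefaultAudioInterface

# Start a live voice session with the agent (requires a working microphone and speakers)
conversation = Conversation(
    client,                          # the ElevenLabs client created above
    agent_id,
    requires_auth=bool(api_key),     # private agents require an API key
    audio_interface=DefaultAudioInterface(),
    callback_agent_response=lambda text: print(f"Agent: {text}"),
    callback_user_transcript=lambda text: print(f"User: {text}"),
)
conversation.start_session()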
Setup and import routines
# get all conversations for the agent
client.conversational_ai.get_conversations(agent_id=agent_id)
GetConversationsPageResponseModel(conversations=[ConversationSummaryResponseModel(agent_id='KBGx8lTP4Jbw8tn2kyXL', agent_name='ankicard in the AI world (tool)', conversation_id='OD8knPLaflzg4yXoqJtV', start_time_unix_secs=1742555298, call_duration_secs=58, message_count=13, status='done', call_successful='success'), ConversationSummaryResponseModel(agent_id='KBGx8lTP4Jbw8tn2kyXL', agent_name='ankicard in the AI world (tool)', conversation_id='TTrEIeWIcAuUDym2wgSv', start_time_unix_secs=1741963691, call_duration_secs=14, message_count=3, status='done', call_successful='success'), ConversationSummaryResponseModel(agent_id='KBGx8lTP4Jbw8tn2kyXL', agent_name='ankicard in the AI world (tool)', conversation_id='MuNftp2yyppPJBjRwKWg', start_time_unix_secs=1741800081, call_duration_secs=123, message_count=23, status='done', call_successful='success'), ConversationSummaryResponseModel(agent_id='KBGx8lTP4Jbw8tn2kyXL', agent_name='ankicard in the AI world (tool)', conversation_id='uNtMgUSX7xGIXyvkCVDE', start_time_unix_secs=1741729214, call_duration_secs=0, message_count=0, status='processing', call_successful='unknown'), ConversationSummaryResponseModel(agent_id='KBGx8lTP4Jbw8tn2kyXL', agent_name='ankicard in the AI world (tool)', conversation_id='sxoNGvcAHg86PTPjfS4l', start_time_unix_secs=1741729214, call_duration_secs=0, message_count=0, status='processing', call_successful='unknown'), ConversationSummaryResponseModel(agent_id='KBGx8lTP4Jbw8tn2kyXL', agent_name='ankicard in the AI world (tool)', conversation_id='UOlxBXfWZs6oZmuW8NFx', start_time_unix_secs=1741729210, call_duration_secs=3, message_count=1, status='done', call_successful='success'), ConversationSummaryResponseModel(agent_id='KBGx8lTP4Jbw8tn2kyXL', agent_name='ankicard in the AI world (tool)', conversation_id='obzaHjNTFFQVSoceQR15', start_time_unix_secs=1741728029, call_duration_secs=2, message_count=1, status='done', call_successful='success'), ConversationSummaryResponseModel(agent_id='KBGx8lTP4Jbw8tn2kyXL', agent_name='ankicard in the AI world (tool)', conversation_id='XkVzJFRaLNcUfEwZgdml', start_time_unix_secs=1741727981, call_duration_secs=3, message_count=1, status='done', call_successful='success'), ConversationSummaryResponseModel(agent_id='KBGx8lTP4Jbw8tn2kyXL', agent_name='ankicard in the AI world (tool)', conversation_id='ibLbMI4wZSifQYWqa1GZ', start_time_unix_secs=1741727954, call_duration_secs=0, message_count=0, status='processing', call_successful='unknown'), ConversationSummaryResponseModel(agent_id='KBGx8lTP4Jbw8tn2kyXL', agent_name='ankicard in the AI world (tool)', conversation_id='9Cu34dMDhJ9SohemkJaB', start_time_unix_secs=1741727899, call_duration_secs=0, message_count=0, status='processing', call_successful='unknown'), ConversationSummaryResponseModel(agent_id='KBGx8lTP4Jbw8tn2kyXL', agent_name='ankicard in the AI world (tool)', conversation_id='v2wiH6WhdU5e1eQ4UEO4', start_time_unix_secs=1741691933, call_duration_secs=0, message_count=0, status='processing', call_successful='unknown'), ConversationSummaryResponseModel(agent_id='KBGx8lTP4Jbw8tn2kyXL', agent_name='ankicard in the AI world (tool)', conversation_id='0GCsc6e4ed8ZMLDgM7r1', start_time_unix_secs=1741691889, call_duration_secs=0, message_count=0, status='processing', call_successful='unknown'), ConversationSummaryResponseModel(agent_id='KBGx8lTP4Jbw8tn2kyXL', agent_name='ankicard in the AI world (tool)', conversation_id='hdtX0X3FGPpwvIujgLOp', start_time_unix_secs=1741632757, call_duration_secs=304, message_count=31, status='done', 
call_successful='success'), ConversationSummaryResponseModel(agent_id='KBGx8lTP4Jbw8tn2kyXL', agent_name='ankicard in the AI world (tool)', conversation_id='HhDoWH2yrH0jdSUceJn2', start_time_unix_secs=1741599909, call_duration_secs=301, message_count=44, status='done', call_successful='success'), ConversationSummaryResponseModel(agent_id='KBGx8lTP4Jbw8tn2kyXL', agent_name='ankicard in the AI world (tool)', conversation_id='JAIkaqxDe1dB4hJU3TQb', start_time_unix_secs=1741449684, call_duration_secs=32, message_count=8, status='done', call_successful='success'), ConversationSummaryResponseModel(agent_id='KBGx8lTP4Jbw8tn2kyXL', agent_name='ankicard in the AI world (tool)', conversation_id='jyfsdK26FGAFv3sDUzYc', start_time_unix_secs=1741449403, call_duration_secs=62, message_count=14, status='done', call_successful='success'), ConversationSummaryResponseModel(agent_id='KBGx8lTP4Jbw8tn2kyXL', agent_name='ankicard in the AI world (tool)', conversation_id='WErNjC2tMjEVUkVa4Ulq', start_time_unix_secs=1741448269, call_duration_secs=8, message_count=5, status='done', call_successful='success'), ConversationSummaryResponseModel(agent_id='KBGx8lTP4Jbw8tn2kyXL', agent_name='ankicard in the AI world (tool)', conversation_id='Druasprb2fMIRPaG8NnD', start_time_unix_secs=1741448239, call_duration_secs=0, message_count=0, status='done', call_successful='unknown'), ConversationSummaryResponseModel(agent_id='KBGx8lTP4Jbw8tn2kyXL', agent_name='ankicard in the AI world (tool)', conversation_id='QXc8FQZzhKOSad7UxQSB', start_time_unix_secs=1741379212, call_duration_secs=0, message_count=0, status='processing', call_successful='unknown'), ConversationSummaryResponseModel(agent_id='KBGx8lTP4Jbw8tn2kyXL', agent_name='ankicard in the AI world (tool)', conversation_id='twoeB0EUbaGNKRuEtbJi', start_time_unix_secs=1741379164, call_duration_secs=0, message_count=0, status='processing', call_successful='unknown'), ConversationSummaryResponseModel(agent_id='KBGx8lTP4Jbw8tn2kyXL', agent_name='ankicard in the AI world (tool)', conversation_id='kRlstrOeMPDJGK4mQw8h', start_time_unix_secs=1741379101, call_duration_secs=3, message_count=1, status='done', call_successful='success'), ConversationSummaryResponseModel(agent_id='KBGx8lTP4Jbw8tn2kyXL', agent_name='ankicard in the AI world (tool)', conversation_id='2KdTo9LJ52QjrsNw0q60', start_time_unix_secs=1741378600, call_duration_secs=0, message_count=0, status='processing', call_successful='unknown'), ConversationSummaryResponseModel(agent_id='KBGx8lTP4Jbw8tn2kyXL', agent_name='ankicard in the AI world (tool)', conversation_id='MViLft1vyruti2WM8Vwu', start_time_unix_secs=1741358430, call_duration_secs=0, message_count=0, status='processing', call_successful='unknown'), ConversationSummaryResponseModel(agent_id='KBGx8lTP4Jbw8tn2kyXL', agent_name='ankicard in the AI world (tool)', conversation_id='OKuP7LkOouxr1iov58Mb', start_time_unix_secs=1741358040, call_duration_secs=0, message_count=0, status='processing', call_successful='unknown'), ConversationSummaryResponseModel(agent_id='KBGx8lTP4Jbw8tn2kyXL', agent_name='ankicard in the AI world (tool)', conversation_id='bEtiGdbkJo96AndqkOn0', start_time_unix_secs=1741357998, call_duration_secs=0, message_count=0, status='processing', call_successful='unknown'), ConversationSummaryResponseModel(agent_id='KBGx8lTP4Jbw8tn2kyXL', agent_name='ankicard in the AI world (tool)', conversation_id='mDz35DWT8frmhwtLEjeL', start_time_unix_secs=1741337967, call_duration_secs=0, message_count=0, status='processing', call_successful='unknown'), 
ConversationSummaryResponseModel(agent_id='KBGx8lTP4Jbw8tn2kyXL', agent_name='ankicard in the AI world (tool)', conversation_id='S2rzokoIef3tz4VWSshc', start_time_unix_secs=1741278004, call_duration_secs=0, message_count=0, status='processing', call_successful='unknown'), ConversationSummaryResponseModel(agent_id='KBGx8lTP4Jbw8tn2kyXL', agent_name='ankicard in the AI world (tool)', conversation_id='Q6ITHpzklwv8YHLcODX9', start_time_unix_secs=1741207243, call_duration_secs=0, message_count=0, status='processing', call_successful='unknown'), ConversationSummaryResponseModel(agent_id='KBGx8lTP4Jbw8tn2kyXL', agent_name='ankicard in the AI world (tool)', conversation_id='7gV0KcqXxtFlS982xiN6', start_time_unix_secs=1741199428, call_duration_secs=0, message_count=0, status='processing', call_successful='unknown'), ConversationSummaryResponseModel(agent_id='KBGx8lTP4Jbw8tn2kyXL', agent_name='ankicard in the AI world (tool)', conversation_id='X8oO9EYHKEHGlEJIsiZW', start_time_unix_secs=1741196744, call_duration_secs=0, message_count=0, status='processing', call_successful='unknown')], next_cursor='eyJsYXN0X2NvbnZlcnNhdGlvbl9pZCI6Ilg4b085RVlIS0VIR2xFSklzaVpXIiwib2Zmc2V0IjpudWxsLCJvcmRlcl9ieSI6InN0YXJ0X3RpbWVfdW5peF9zZWNzIiwib3JkZXJfZGlyZWN0aW9uIjoiZGVzYyJ9', has_more=True)
Understanding the Conversational AI library and data
In this section we will be exploring the conversational AI data structure. We’ll retrieve conversation transcripts using the Eleven Labs API, analyze the format of question-answer exchanges, examine a specific conversation example, and identify the tool calls that extract question IDs from the database. This will help us understand how the system tracks and processes questions throughout the conversation flow.
Get all conversations for specific agent
# Function to get all conversations for a specific agent
def get_all_conversations(agent_id):
    all_conversations = []
    cursor = None
    has_more = True

    while has_more:
        # Get page of conversations
        response = client.conversational_ai.get_conversations(
            agent_id=agent_id,
            cursor=cursor
        )

        # Add conversations from current page to our list
        all_conversations.extend(response.conversations)

        # Update pagination parameters
        cursor = response.next_cursor
        has_more = response.has_more

        print(f"Retrieved {len(response.conversations)} conversations. Has more: {has_more}")

    print(f"Total conversations retrieved: {len(all_conversations)}")
    return all_conversations

# Get all conversations for the specified agent ID
conversations = get_all_conversations(agent_id)

# Display first few conversations
conversations[:3]
Retrieved 30 conversations. Has more: True
Retrieved 30 conversations. Has more: True
Retrieved 30 conversations. Has more: True
Retrieved 5 conversations. Has more: False
Total conversations retrieved: 95
[ConversationSummaryResponseModel(agent_id='KBGx8lTP4Jbw8tn2kyXL', agent_name='ankicard in the AI world (tool)', conversation_id='OD8knPLaflzg4yXoqJtV', start_time_unix_secs=1742555298, call_duration_secs=58, message_count=13, status='done', call_successful='success'),
ConversationSummaryResponseModel(agent_id='KBGx8lTP4Jbw8tn2kyXL', agent_name='ankicard in the AI world (tool)', conversation_id='TTrEIeWIcAuUDym2wgSv', start_time_unix_secs=1741963691, call_duration_secs=14, message_count=3, status='done', call_successful='success'),
ConversationSummaryResponseModel(agent_id='KBGx8lTP4Jbw8tn2kyXL', agent_name='ankicard in the AI world (tool)', conversation_id='MuNftp2yyppPJBjRwKWg', start_time_unix_secs=1741800081, call_duration_secs=123, message_count=23, status='done', call_successful='success')]
Print the conversations in an easy-to-read format
# Get current time in Unix timestamp format
current_time = int(time.time())

# Calculate timestamp for 14 days ago
fourteen_days_ago = current_time - (14 * 24 * 60 * 60)

# Filter conversations from the last 14 days
recent_conversations = [
    conv for conv in conversations
    if conv.start_time_unix_secs >= fourteen_days_ago
]

# Display the count of recent conversations
print(f"Conversations in the last 14 days: {len(recent_conversations)}")

# Display the recent conversations with human-readable timestamps
for conv in recent_conversations:
    # Convert Unix timestamp to readable date/time
    start_time = datetime.fromtimestamp(conv.start_time_unix_secs)
    print(f"ID: {conv.conversation_id}")
    print(f"Start time: {start_time}")
    print(f"Duration: {conv.call_duration_secs} seconds ({conv.call_duration_secs / 60:.2f} minutes)")
    print(f"Status: {conv.status}")
    print("-" * 50)
Conversations in the last 14 days: 2
ID: OD8knPLaflzg4yXoqJtV
Start time: 2025-03-21 11:08:18
Duration: 58 seconds (0.97 minutes)
Status: done
--------------------------------------------------
ID: TTrEIeWIcAuUDym2wgSv
Start time: 2025-03-14 14:48:11
Duration: 14 seconds (0.23 minutes)
Status: done
--------------------------------------------------
Working through an example conversation (ID) to break the problem down
Let’s break the problem down and focus on a single conversation to understand how to extract one question.
Get the first conversation and print the agent/user turns
A conversation can contain many questions and answers, and this view shows only the user and agent responses. There are other elements to a conversation that we will get to later.
# Get example conversation hdtX0X3FGPpwvIujgLOp
# this only gets the status and the transcript details
from datetime import datetime
if recent_conversations:
    example_conv_id = "hdtX0X3FGPpwvIujgLOp"
    print(f"Getting details for the first recent conversation (ID: {example_conv_id})")

    # Get the detailed conversation data
    conversation_details = client.conversational_ai.get_conversation(
        conversation_id=example_conv_id
    )

    # Print the metadata
    # Convert Unix timestamps to readable date/time
    start_time = datetime.fromtimestamp(conversation_details.metadata.start_time_unix_secs)
    deletion_time = datetime.fromtimestamp(conversation_details.metadata.deletion_settings.deletion_time_unix_secs) if conversation_details.metadata.deletion_settings.deletion_time_unix_secs else None

    print("Conversation Metadata:")
    print(f"Start Time: {start_time}")
    print(f"Call Duration: {conversation_details.metadata.call_duration_secs} seconds ({conversation_details.metadata.call_duration_secs/60:.2f} minutes)")
    print(f"Cost: {conversation_details.metadata.cost}")
    print(f"Scheduled Deletion Time: {deletion_time}")
    print(f"Authorization Method: {conversation_details.metadata.authorization_method}")
    print(f"Termination Reason: {conversation_details.metadata.termination_reason or 'None'}")

    # Display the conversation details
    print("\nConversation Details:")
    print(f"Status: {conversation_details.status}")
    print(f"Number of transcript entries: {len(conversation_details.transcript)}")

    # Display the conversation transcript
    print("\nTranscript:")
    for i, entry in enumerate(conversation_details.transcript):
        print(f"\n[{i+1}] {entry.role.upper()}: {entry.message}")
else:
    print("No recent conversations found.")
Getting details for the first recent conversation (ID: hdtX0X3FGPpwvIujgLOp)
Conversation Metadata:
Start Time: 2025-03-10 18:52:37
Call Duration: 304 seconds (5.07 minutes)
Cost: 2023
Scheduled Deletion Time: 2027-03-10 18:57:42
Authorization Method: public
Termination Reason: None
Conversation Details:
Status: done
Number of transcript entries: 31
Transcript:
[1] AGENT: Hello Angelo, I'm here to help you study, are you ready?
[2] USER: Yes.
[3] AGENT: Okay Angelo, let's get started. I'm going to fetch a question for you.
[4] AGENT: None
[5] AGENT: None
[6] AGENT: Okay, Angelo, here's your first question: How do loss functions in contrastive learning measure alignment between image-text pairs? Take your time to think about it. If you don't know, that's perfectly fine, I can give you a little cue.
[7] USER: When they're close together, when they're at right angles, it means that they are more similar. When they're further away, it means more dissimilar. So the aim of the loss function is to make sure that they're, to get those tensors close on and those vectors close together.
[8] AGENT: That's a good start, Angelo. You're right that the goal is to bring the embeddings of similar image-text pairs closer together and push dissimilar pairs further apart.
To expand on that, contrastive loss functions typically use a distance metric (like cosine similarity) to measure the similarity between the image and text embeddings. The loss is minimized when the embeddings of matching pairs are highly similar (close together in the embedding space) and the embeddings of non-matching pairs are dissimilar (far apart).
Do you understand how the concept of 'margin' is used in some contrastive loss functions to define how far apart dissimilar pairs should be?
[9] USER: No, I don't know.
[10] AGENT: Okay, no problem at all. The 'margin' in some contrastive loss functions sets a threshold. It specifies how far apart the embeddings of dissimilar (non-matching) image-text pairs need to be in the embedding space before the loss becomes zero.
So, the loss function only penalizes those dissimilar pairs that are within the margin. This helps to create a clearer separation between similar and dissimilar pairs in the embedding space. Does that make sense?
[11] USER: It does, yes?
[12] AGENT: Great. So, to recap, contrastive loss functions use a distance metric to measure the similarity between image and text embeddings. The goal is to minimize the distance between matching pairs and maximize the distance between non-matching pairs, often with a margin to define the desired separation.
Shall we try another question?
[13] USER: Yes, please?
[14] AGENT: None
[15] AGENT: None
[16] AGENT: Okay Angelo, here's your next question: Why is Q-learning classified as an off-policy method?
[17] USER: Not sure. What is Q policy?
[18] AGENT: Okay, let's clarify that first. In reinforcement learning, a policy is essentially the strategy an agent uses to decide what action to take in a given state. It maps states to actions.
Now, Q-learning learns an optimal policy by estimating the Q-values, which represent the expected reward for taking a specific action in a specific state and following the optimal policy thereafter.
With that in mind, can you take another shot at the question: Why is Q-learning classified as an off-policy method? I'll give you another cue if you need it.
[19] USER: Yeah, give me another Q here.
[20] AGENT: Okay, here's a cue: Think about how Q-learning updates its Q-value estimates. Does it update them based on the actions it actually takes, or does it update them based on a different policy?
[21] USER: A different policy.
[22] AGENT: Exactly. Q-learning is considered off-policy because it learns the optimal Q-values (and thus the optimal policy) independently of the actions taken by the agent. It updates its Q-values based on the best possible action, regardless of what action the agent actually chose to do.
In other words, it learns about the optimal policy while following a potentially different, exploratory policy. This is in contrast to on-policy methods like SARSA, which update Q-values based on the actions actually taken by the agent.
Does that distinction between learning from the best possible action versus the action actually taken make sense?
[23] USER: Yes it does?
[24] AGENT: Fantastic. To make sure we're solid on this, can you explain in your own words the key difference between on-policy and off-policy learning?
[25] USER: No, next question.
[26] AGENT: No problem at all. Let's move on to the next question then.
[27] AGENT: None
[28] AGENT: None
[29] AGENT: Alright Angelo, next question is: What is a logarithm in mathematical terms?
[30] USER: Logarithm is a calculation to reduce the number.
[31] AGENT: That's not quite right, Angelo. While logarithms can help simplify calculations, that's not their primary definition.
A logarithm is actually the inverse operation to exponentiation. In simpler terms, the logarithm of a number is the exponent to which another fixed value, the base, must be raised to produce that number.
For example, the logarithm of one hundred to the base ten is two, because ten raised to the power of two is one hundred: ten squared equals one hundred, so log base ten of one...
Print all of the transcript, including tool calls
Same as the conversation above, but this also includes the tool calls, which will be used to split the conversation into the individual exchanges about each question.
conversation_details.transcript
[ConversationHistoryTranscriptCommonModel(role='agent', message="Hello Angelo, I'm here to help you study, are you ready? ", tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=0, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None),
ConversationHistoryTranscriptCommonModel(role='user', message='Yes.', tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=4, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None),
ConversationHistoryTranscriptCommonModel(role='agent', message="Okay Angelo, let's get started. I'm going to fetch a question for you.\n\n", tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=6, conversation_turn_metrics={'metrics': {}}, llm_override=None, rag_retrieval_info=None),
ConversationHistoryTranscriptCommonModel(role='agent', message=None, tool_calls=[ConversationHistoryTranscriptToolCallCommonModel(request_id='question_retreiver_8be15540962042e59546a10b0d41799d', tool_name='question_retreiver', params_as_json='{}', tool_has_been_called=True)], tool_results=[], feedback=None, time_in_call_secs=6, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None),
ConversationHistoryTranscriptCommonModel(role='agent', message=None, tool_calls=[], tool_results=[ConversationHistoryTranscriptToolResultCommonModel(request_id='question_retreiver_8be15540962042e59546a10b0d41799d', tool_name='question_retreiver', result_value='{"question_id":25,"User_Question":"How do loss functions in contrastive learning measure alignment between image-text pairs?"}', is_error=False, tool_has_been_called=True)], feedback=None, time_in_call_secs=6, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None),
ConversationHistoryTranscriptCommonModel(role='agent', message="Okay, Angelo, here's your first question: How do loss functions in contrastive learning measure alignment between image-text pairs? Take your time to think about it. If you don't know, that's perfectly fine, I can give you a little cue.\n", tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=6, conversation_turn_metrics={'metrics': {}}, llm_override=None, rag_retrieval_info=None),
ConversationHistoryTranscriptCommonModel(role='user', message="When they're close together, when they're at right angles, it means that they are more similar. When they're further away, it means more dissimilar. So the aim of the loss function is to make sure that they're, to get those tensors close on and those vectors close together.", tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=24, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None),
ConversationHistoryTranscriptCommonModel(role='agent', message="That's a good start, Angelo. You're right that the goal is to bring the embeddings of similar image-text pairs closer together and push dissimilar pairs further apart.\n\nTo expand on that, contrastive loss functions typically use a distance metric (like cosine similarity) to measure the similarity between the image and text embeddings. The loss is minimized when the embeddings of matching pairs are highly similar (close together in the embedding space) and the embeddings of non-matching pairs are dissimilar (far apart).\n\nDo you understand how the concept of 'margin' is used in some contrastive loss functions to define how far apart dissimilar pairs should be?\n", tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=42, conversation_turn_metrics={'metrics': {}}, llm_override=None, rag_retrieval_info=None),
ConversationHistoryTranscriptCommonModel(role='user', message="No, I don't know.", tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=83, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None),
ConversationHistoryTranscriptCommonModel(role='agent', message="Okay, no problem at all. The 'margin' in some contrastive loss functions sets a threshold. It specifies how far apart the embeddings of dissimilar (non-matching) image-text pairs need to be in the embedding space before the loss becomes zero.\n\nSo, the loss function only penalizes those dissimilar pairs that are within the margin. This helps to create a clearer separation between similar and dissimilar pairs in the embedding space. Does that make sense?\n", tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=85, conversation_turn_metrics={'metrics': {}}, llm_override=None, rag_retrieval_info=None),
ConversationHistoryTranscriptCommonModel(role='user', message='It does, yes?', tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=112, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None),
ConversationHistoryTranscriptCommonModel(role='agent', message='Great. So, to recap, contrastive loss functions use a distance metric to measure the similarity between image and text embeddings. The goal is to minimize the distance between matching pairs and maximize the distance between non-matching pairs, often with a margin to define the desired separation.\n\nShall we try another question?\n', tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=113, conversation_turn_metrics={'metrics': {}}, llm_override=None, rag_retrieval_info=None),
ConversationHistoryTranscriptCommonModel(role='user', message='Yes, please?', tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=133, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None),
ConversationHistoryTranscriptCommonModel(role='agent', message=None, tool_calls=[ConversationHistoryTranscriptToolCallCommonModel(request_id='question_retreiver_2ebf5ccad94b4771a2c8b6cbfa0e960d', tool_name='question_retreiver', params_as_json='{}', tool_has_been_called=True)], tool_results=[], feedback=None, time_in_call_secs=135, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None),
ConversationHistoryTranscriptCommonModel(role='agent', message=None, tool_calls=[], tool_results=[ConversationHistoryTranscriptToolResultCommonModel(request_id='question_retreiver_2ebf5ccad94b4771a2c8b6cbfa0e960d', tool_name='question_retreiver', result_value='{"question_id":49,"User_Question":"Why is Q-learning classified as an off-policy method?"}', is_error=False, tool_has_been_called=True)], feedback=None, time_in_call_secs=135, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None),
ConversationHistoryTranscriptCommonModel(role='agent', message="Okay Angelo, here's your next question: Why is Q-learning classified as an off-policy method?\n", tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=135, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None),
ConversationHistoryTranscriptCommonModel(role='user', message='Not sure. What is Q policy?', tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=144, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None),
ConversationHistoryTranscriptCommonModel(role='agent', message="Okay, let's clarify that first. In reinforcement learning, a policy is essentially the strategy an agent uses to decide what action to take in a given state. It maps states to actions.\n\nNow, Q-learning learns an optimal policy by estimating the Q-values, which represent the expected reward for taking a specific action in a specific state and following the optimal policy thereafter.\n\nWith that in mind, can you take another shot at the question: Why is Q-learning classified as an off-policy method? I'll give you another cue if you need it.\n", tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=148, conversation_turn_metrics={'metrics': {}}, llm_override=None, rag_retrieval_info=None),
ConversationHistoryTranscriptCommonModel(role='user', message='Yeah, give me another Q here.', tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=182, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None),
ConversationHistoryTranscriptCommonModel(role='agent', message="Okay, here's a cue: Think about how Q-learning updates its Q-value estimates. Does it update them based on the actions it actually takes, or does it update them based on a different policy?\n", tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=185, conversation_turn_metrics={'metrics': {}}, llm_override=None, rag_retrieval_info=None),
ConversationHistoryTranscriptCommonModel(role='user', message='A different policy.', tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=198, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None),
ConversationHistoryTranscriptCommonModel(role='agent', message='Exactly. Q-learning is considered off-policy because it learns the optimal Q-values (and thus the optimal policy) independently of the actions taken by the agent. It updates its Q-values based on the best possible action, regardless of what action the agent actually chose to do.\n\nIn other words, it learns about the optimal policy while following a potentially different, exploratory policy. This is in contrast to on-policy methods like SARSA, which update Q-values based on the actions actually taken by the agent.\n\nDoes that distinction between learning from the best possible action versus the action actually taken make sense?\n', tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=201, conversation_turn_metrics={'metrics': {}}, llm_override=None, rag_retrieval_info=None),
ConversationHistoryTranscriptCommonModel(role='user', message='Yes it does?', tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=239, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None),
ConversationHistoryTranscriptCommonModel(role='agent', message="Fantastic. To make sure we're solid on this, can you explain in your own words the key difference between on-policy and off-policy learning?\n", tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=242, conversation_turn_metrics={'metrics': {}}, llm_override=None, rag_retrieval_info=None),
ConversationHistoryTranscriptCommonModel(role='user', message='No, next question.', tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=250, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None),
ConversationHistoryTranscriptCommonModel(role='agent', message="No problem at all. Let's move on to the next question then.\n\n", tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=252, conversation_turn_metrics={'metrics': {}}, llm_override=None, rag_retrieval_info=None),
ConversationHistoryTranscriptCommonModel(role='agent', message=None, tool_calls=[ConversationHistoryTranscriptToolCallCommonModel(request_id='question_retreiver_e856b2a978b546cca87d93b71b137b9e', tool_name='question_retreiver', params_as_json='{}', tool_has_been_called=True)], tool_results=[], feedback=None, time_in_call_secs=252, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None),
ConversationHistoryTranscriptCommonModel(role='agent', message=None, tool_calls=[], tool_results=[ConversationHistoryTranscriptToolResultCommonModel(request_id='question_retreiver_e856b2a978b546cca87d93b71b137b9e', tool_name='question_retreiver', result_value='{"question_id":31,"User_Question":"What is a logarithm in mathematical terms?"}', is_error=False, tool_has_been_called=True)], feedback=None, time_in_call_secs=252, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None),
ConversationHistoryTranscriptCommonModel(role='agent', message='Alright Angelo, next question is: What is a logarithm in mathematical terms?\n', tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=252, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None),
ConversationHistoryTranscriptCommonModel(role='user', message='Logarithm is a calculation to reduce the number.', tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=265, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None),
ConversationHistoryTranscriptCommonModel(role='agent', message="That's not quite right, Angelo. While logarithms can help simplify calculations, that's not their primary definition.\n\nA logarithm is actually the inverse operation to exponentiation. In simpler terms, the logarithm of a number is the exponent to which another fixed value, the base, must be raised to produce that number.\n\nFor example, the logarithm of one hundred to the base ten is two, because ten raised to the power of two is one hundred: ten squared equals one hundred, so log base ten of one...", tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=275, conversation_turn_metrics={'metrics': {}}, llm_override=None, rag_retrieval_info=None)]
Understanding the ElevenLabs API (Python object)
The returned get_conversation object
dir() inspects objects for things like properties, callable methods, and internal structure. This will help us understand how we can use the existing features that ElevenLabs built into the library.
# top level
dir(conversation_details)
['__abstractmethods__',
'__annotations__',
'__class__',
'__class_getitem__',
'__class_vars__',
'__copy__',
'__deepcopy__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__eq__',
'__fields__',
'__fields_set__',
'__format__',
'__ge__',
'__get_pydantic_core_schema__',
'__get_pydantic_json_schema__',
'__getattr__',
'__getattribute__',
'__getstate__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__iter__',
'__le__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__pretty__',
'__private_attributes__',
'__pydantic_complete__',
'__pydantic_computed_fields__',
'__pydantic_core_schema__',
'__pydantic_custom_init__',
'__pydantic_decorators__',
'__pydantic_extra__',
'__pydantic_fields__',
'__pydantic_fields_set__',
'__pydantic_generic_metadata__',
'__pydantic_init_subclass__',
'__pydantic_parent_namespace__',
'__pydantic_post_init__',
'__pydantic_private__',
'__pydantic_root_model__',
'__pydantic_serializer__',
'__pydantic_validator__',
'__reduce__',
'__reduce_ex__',
'__replace__',
'__repr__',
'__repr_args__',
'__repr_name__',
'__repr_recursion__',
'__repr_str__',
'__rich_repr__',
'__setattr__',
'__setstate__',
'__signature__',
'__sizeof__',
'__slots__',
'__str__',
'__subclasshook__',
'__weakref__',
'_abc_impl',
'_calculate_keys',
'_check_frozen',
'_copy_and_set_values',
'_get_value',
'_iter',
'agent_id',
'analysis',
'construct',
'conversation_id',
'conversation_initiation_client_data',
'copy',
'dict',
'from_orm',
'json',
'metadata',
'model_computed_fields',
'model_config',
'model_construct',
'model_copy',
'model_dump',
'model_dump_json',
'model_extra',
'model_fields',
'model_fields_set',
'model_json_schema',
'model_parametrized_name',
'model_post_init',
'model_rebuild',
'model_validate',
'model_validate_json',
'model_validate_strings',
'parse_file',
'parse_obj',
'parse_raw',
'schema',
'schema_json',
'serialize_model',
'status',
'transcript',
'update_forward_refs',
'validate']
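An aside: these responses are Pydantic models (note model_dump and model_dump_json in the listing above), so instead of walking dir() output we can also dump the whole object to inspect its structure; a quick sketch:

# Dump the Pydantic model to a plain dict, or pretty-print it as JSON
as_dict = conversation_details.model_dump()
print(conversation_details.model_dump_json(indent=2)[:500])  # peek at the first 500 characters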
From the look of the transcript, tool_results appears to hold information about the result of each tool call; we can inspect that next.
# one level deeper, the transcript level
dir(conversation_details.transcript[0])
['__abstractmethods__',
'__annotations__',
'__class__',
'__class_getitem__',
'__class_vars__',
'__copy__',
'__deepcopy__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__eq__',
'__fields__',
'__fields_set__',
'__format__',
'__ge__',
'__get_pydantic_core_schema__',
'__get_pydantic_json_schema__',
'__getattr__',
'__getattribute__',
'__getstate__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__iter__',
'__le__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__pretty__',
'__private_attributes__',
'__pydantic_complete__',
'__pydantic_computed_fields__',
'__pydantic_core_schema__',
'__pydantic_custom_init__',
'__pydantic_decorators__',
'__pydantic_extra__',
'__pydantic_fields__',
'__pydantic_fields_set__',
'__pydantic_generic_metadata__',
'__pydantic_init_subclass__',
'__pydantic_parent_namespace__',
'__pydantic_post_init__',
'__pydantic_private__',
'__pydantic_root_model__',
'__pydantic_serializer__',
'__pydantic_validator__',
'__reduce__',
'__reduce_ex__',
'__replace__',
'__repr__',
'__repr_args__',
'__repr_name__',
'__repr_recursion__',
'__repr_str__',
'__rich_repr__',
'__setattr__',
'__setstate__',
'__signature__',
'__sizeof__',
'__slots__',
'__str__',
'__subclasshook__',
'__weakref__',
'_abc_impl',
'_calculate_keys',
'_check_frozen',
'_copy_and_set_values',
'_get_value',
'_iter',
'construct',
'conversation_turn_metrics',
'copy',
'dict',
'feedback',
'from_orm',
'json',
'message',
'model_computed_fields',
'model_config',
'model_construct',
'model_copy',
'model_dump',
'model_dump_json',
'model_extra',
'model_fields',
'model_fields_set',
'model_json_schema',
'model_parametrized_name',
'model_post_init',
'model_rebuild',
'model_validate',
'model_validate_json',
'model_validate_strings',
'parse_file',
'parse_obj',
'parse_raw',
'role',
'schema',
'schema_json',
'serialize_model',
'time_in_call_secs',
'tool_calls',
'tool_results',
'update_forward_refs',
'validate']
Grabbing the tool call results (contains the question ID)
I will loop through every item in the transcript and print only the items that are tool results (i.e. not user/agent responses).
for item in conversation_details.transcript:
    if item.tool_results:
        print(item.tool_results)
[ConversationHistoryTranscriptToolResultCommonModel(request_id='question_retreiver_8be15540962042e59546a10b0d41799d', tool_name='question_retreiver', result_value='{"question_id":25,"User_Question":"How do loss functions in contrastive learning measure alignment between image-text pairs?"}', is_error=False, tool_has_been_called=True)]
[ConversationHistoryTranscriptToolResultCommonModel(request_id='question_retreiver_2ebf5ccad94b4771a2c8b6cbfa0e960d', tool_name='question_retreiver', result_value='{"question_id":49,"User_Question":"Why is Q-learning classified as an off-policy method?"}', is_error=False, tool_has_been_called=True)]
[ConversationHistoryTranscriptToolResultCommonModel(request_id='question_retreiver_e856b2a978b546cca87d93b71b137b9e', tool_name='question_retreiver', result_value='{"question_id":31,"User_Question":"What is a logarithm in mathematical terms?"}', is_error=False, tool_has_been_called=True)]
Now, let's list the methods and properties of tool_results to see what will give us the question ID and question text… result_value seems to hold the required data.
dir(conversation_details.transcript[4].tool_results[0])
['__abstractmethods__',
'__annotations__',
'__class__',
'__class_getitem__',
'__class_vars__',
'__copy__',
'__deepcopy__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__eq__',
'__fields__',
'__fields_set__',
'__format__',
'__ge__',
'__get_pydantic_core_schema__',
'__get_pydantic_json_schema__',
'__getattr__',
'__getattribute__',
'__getstate__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__iter__',
'__le__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__pretty__',
'__private_attributes__',
'__pydantic_complete__',
'__pydantic_computed_fields__',
'__pydantic_core_schema__',
'__pydantic_custom_init__',
'__pydantic_decorators__',
'__pydantic_extra__',
'__pydantic_fields__',
'__pydantic_fields_set__',
'__pydantic_generic_metadata__',
'__pydantic_init_subclass__',
'__pydantic_parent_namespace__',
'__pydantic_post_init__',
'__pydantic_private__',
'__pydantic_root_model__',
'__pydantic_serializer__',
'__pydantic_validator__',
'__reduce__',
'__reduce_ex__',
'__replace__',
'__repr__',
'__repr_args__',
'__repr_name__',
'__repr_recursion__',
'__repr_str__',
'__rich_repr__',
'__setattr__',
'__setstate__',
'__signature__',
'__sizeof__',
'__slots__',
'__str__',
'__subclasshook__',
'__weakref__',
'_abc_impl',
'_calculate_keys',
'_check_frozen',
'_copy_and_set_values',
'_get_value',
'_iter',
'construct',
'copy',
'dict',
'from_orm',
'is_error',
'json',
'model_computed_fields',
'model_config',
'model_construct',
'model_copy',
'model_dump',
'model_dump_json',
'model_extra',
'model_fields',
'model_fields_set',
'model_json_schema',
'model_parametrized_name',
'model_post_init',
'model_rebuild',
'model_validate',
'model_validate_json',
'model_validate_strings',
'parse_file',
'parse_obj',
'parse_raw',
'request_id',
'result_value',
'schema',
'schema_json',
'serialize_model',
'tool_has_been_called',
'tool_name',
'update_forward_refs',
'validate']
It returns just a string:
conversation_details.transcript[4].tool_results[0].result_value
'{"question_id":25,"User_Question":"How do loss functions in contrastive learning measure alignment between image-text pairs?"}'
Convert the tool result into something I can use
one_tool_result = json.loads(conversation_details.transcript[4].tool_results[0].result_value)
one_tool_result
{'question_id': 25,
'User_Question': 'How do loss functions in contrastive learning measure alignment between image-text pairs?'}
Print each element separately
print(one_tool_result["question_id"])
print(one_tool_result["User_Question"])
25
How do loss functions in contrastive learning measure alignment between image-text pairs?
Extracting conversation data for analysis
Now I know how to extract the key data for analysis:
1. Question IDs from each conversation
2. The start position of each question
3. The end position (where the next question_retreiver tool is called)
With this information, I can extract the complete conversation for each question and begin performing analysis based just on the question conversation.
Get conversation by ID function
Next Steps: Analyzing Question Positions in Conversations
We’ve successfully extracted question start and end positions for one example conversation.
Now we’ll build a more comprehensive solution that:
- Creates a function to fetch a specific conversation by ID
- Loops through the conversation transcript to identify all questions
- Records each question’s start and end positions in a dictionary
- Organizes this data with question IDs as keys
- Presents the information in a readable format
This function will fetch a single conversation by its ID; we use the example conversation ID from above.
def get_conversation_by_id(agent_id, conversation_id):
    # Get the detailed conversation data
    try:
        conversation = client.conversational_ai.get_conversation(
            conversation_id=conversation_id
        )
        print(f"Retrieved conversation ID: {conversation_id}")
        return conversation
    except Exception as e:
        print(f"Error retrieving conversation {conversation_id}: {str(e)}")
        return None

# Example usage
conversations = get_conversation_by_id(agent_id=agent_id, conversation_id="hdtX0X3FGPpwvIujgLOp")
conversations
Retrieved conversation ID: hdtX0X3FGPpwvIujgLOp
GetConversationResponseModel(agent_id='KBGx8lTP4Jbw8tn2kyXL', conversation_id='hdtX0X3FGPpwvIujgLOp', status='done', transcript=[ConversationHistoryTranscriptCommonModel(role='agent', message="Hello Angelo, I'm here to help you study, are you ready? ", tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=0, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None), ConversationHistoryTranscriptCommonModel(role='user', message='Yes.', tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=4, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None), ConversationHistoryTranscriptCommonModel(role='agent', message="Okay Angelo, let's get started. I'm going to fetch a question for you.\n\n", tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=6, conversation_turn_metrics={'metrics': {}}, llm_override=None, rag_retrieval_info=None), ConversationHistoryTranscriptCommonModel(role='agent', message=None, tool_calls=[ConversationHistoryTranscriptToolCallCommonModel(request_id='question_retreiver_8be15540962042e59546a10b0d41799d', tool_name='question_retreiver', params_as_json='{}', tool_has_been_called=True)], tool_results=[], feedback=None, time_in_call_secs=6, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None), ConversationHistoryTranscriptCommonModel(role='agent', message=None, tool_calls=[], tool_results=[ConversationHistoryTranscriptToolResultCommonModel(request_id='question_retreiver_8be15540962042e59546a10b0d41799d', tool_name='question_retreiver', result_value='{"question_id":25,"User_Question":"How do loss functions in contrastive learning measure alignment between image-text pairs?"}', is_error=False, tool_has_been_called=True)], feedback=None, time_in_call_secs=6, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None), ConversationHistoryTranscriptCommonModel(role='agent', message="Okay, Angelo, here's your first question: How do loss functions in contrastive learning measure alignment between image-text pairs? Take your time to think about it. If you don't know, that's perfectly fine, I can give you a little cue.\n", tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=6, conversation_turn_metrics={'metrics': {}}, llm_override=None, rag_retrieval_info=None), ConversationHistoryTranscriptCommonModel(role='user', message="When they're close together, when they're at right angles, it means that they are more similar. When they're further away, it means more dissimilar. So the aim of the loss function is to make sure that they're, to get those tensors close on and those vectors close together.", tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=24, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None), ConversationHistoryTranscriptCommonModel(role='agent', message="That's a good start, Angelo. You're right that the goal is to bring the embeddings of similar image-text pairs closer together and push dissimilar pairs further apart.\n\nTo expand on that, contrastive loss functions typically use a distance metric (like cosine similarity) to measure the similarity between the image and text embeddings. 
The loss is minimized when the embeddings of matching pairs are highly similar (close together in the embedding space) and the embeddings of non-matching pairs are dissimilar (far apart).\n\nDo you understand how the concept of 'margin' is used in some contrastive loss functions to define how far apart dissimilar pairs should be?\n", tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=42, conversation_turn_metrics={'metrics': {}}, llm_override=None, rag_retrieval_info=None), ConversationHistoryTranscriptCommonModel(role='user', message="No, I don't know.", tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=83, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None), ConversationHistoryTranscriptCommonModel(role='agent', message="Okay, no problem at all. The 'margin' in some contrastive loss functions sets a threshold. It specifies how far apart the embeddings of dissimilar (non-matching) image-text pairs need to be in the embedding space before the loss becomes zero.\n\nSo, the loss function only penalizes those dissimilar pairs that are within the margin. This helps to create a clearer separation between similar and dissimilar pairs in the embedding space. Does that make sense?\n", tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=85, conversation_turn_metrics={'metrics': {}}, llm_override=None, rag_retrieval_info=None), ConversationHistoryTranscriptCommonModel(role='user', message='It does, yes?', tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=112, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None), ConversationHistoryTranscriptCommonModel(role='agent', message='Great. So, to recap, contrastive loss functions use a distance metric to measure the similarity between image and text embeddings. 
The goal is to minimize the distance between matching pairs and maximize the distance between non-matching pairs, often with a margin to define the desired separation.\n\nShall we try another question?\n', tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=113, conversation_turn_metrics={'metrics': {}}, llm_override=None, rag_retrieval_info=None), ConversationHistoryTranscriptCommonModel(role='user', message='Yes, please?', tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=133, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None), ConversationHistoryTranscriptCommonModel(role='agent', message=None, tool_calls=[ConversationHistoryTranscriptToolCallCommonModel(request_id='question_retreiver_2ebf5ccad94b4771a2c8b6cbfa0e960d', tool_name='question_retreiver', params_as_json='{}', tool_has_been_called=True)], tool_results=[], feedback=None, time_in_call_secs=135, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None), ConversationHistoryTranscriptCommonModel(role='agent', message=None, tool_calls=[], tool_results=[ConversationHistoryTranscriptToolResultCommonModel(request_id='question_retreiver_2ebf5ccad94b4771a2c8b6cbfa0e960d', tool_name='question_retreiver', result_value='{"question_id":49,"User_Question":"Why is Q-learning classified as an off-policy method?"}', is_error=False, tool_has_been_called=True)], feedback=None, time_in_call_secs=135, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None), ConversationHistoryTranscriptCommonModel(role='agent', message="Okay Angelo, here's your next question: Why is Q-learning classified as an off-policy method?\n", tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=135, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None), ConversationHistoryTranscriptCommonModel(role='user', message='Not sure. What is Q policy?', tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=144, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None), ConversationHistoryTranscriptCommonModel(role='agent', message="Okay, let's clarify that first. In reinforcement learning, a policy is essentially the strategy an agent uses to decide what action to take in a given state. It maps states to actions.\n\nNow, Q-learning learns an optimal policy by estimating the Q-values, which represent the expected reward for taking a specific action in a specific state and following the optimal policy thereafter.\n\nWith that in mind, can you take another shot at the question: Why is Q-learning classified as an off-policy method? I'll give you another cue if you need it.\n", tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=148, conversation_turn_metrics={'metrics': {}}, llm_override=None, rag_retrieval_info=None), ConversationHistoryTranscriptCommonModel(role='user', message='Yeah, give me another Q here.', tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=182, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None), ConversationHistoryTranscriptCommonModel(role='agent', message="Okay, here's a cue: Think about how Q-learning updates its Q-value estimates. 
Does it update them based on the actions it actually takes, or does it update them based on a different policy?\n", tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=185, conversation_turn_metrics={'metrics': {}}, llm_override=None, rag_retrieval_info=None), ConversationHistoryTranscriptCommonModel(role='user', message='A different policy.', tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=198, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None), ConversationHistoryTranscriptCommonModel(role='agent', message='Exactly. Q-learning is considered off-policy because it learns the optimal Q-values (and thus the optimal policy) independently of the actions taken by the agent. It updates its Q-values based on the best possible action, regardless of what action the agent actually chose to do.\n\nIn other words, it learns about the optimal policy while following a potentially different, exploratory policy. This is in contrast to on-policy methods like SARSA, which update Q-values based on the actions actually taken by the agent.\n\nDoes that distinction between learning from the best possible action versus the action actually taken make sense?\n', tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=201, conversation_turn_metrics={'metrics': {}}, llm_override=None, rag_retrieval_info=None), ConversationHistoryTranscriptCommonModel(role='user', message='Yes it does?', tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=239, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None), ConversationHistoryTranscriptCommonModel(role='agent', message="Fantastic. To make sure we're solid on this, can you explain in your own words the key difference between on-policy and off-policy learning?\n", tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=242, conversation_turn_metrics={'metrics': {}}, llm_override=None, rag_retrieval_info=None), ConversationHistoryTranscriptCommonModel(role='user', message='No, next question.', tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=250, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None), ConversationHistoryTranscriptCommonModel(role='agent', message="No problem at all. 
Let's move on to the next question then.\n\n", tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=252, conversation_turn_metrics={'metrics': {}}, llm_override=None, rag_retrieval_info=None), ConversationHistoryTranscriptCommonModel(role='agent', message=None, tool_calls=[ConversationHistoryTranscriptToolCallCommonModel(request_id='question_retreiver_e856b2a978b546cca87d93b71b137b9e', tool_name='question_retreiver', params_as_json='{}', tool_has_been_called=True)], tool_results=[], feedback=None, time_in_call_secs=252, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None), ConversationHistoryTranscriptCommonModel(role='agent', message=None, tool_calls=[], tool_results=[ConversationHistoryTranscriptToolResultCommonModel(request_id='question_retreiver_e856b2a978b546cca87d93b71b137b9e', tool_name='question_retreiver', result_value='{"question_id":31,"User_Question":"What is a logarithm in mathematical terms?"}', is_error=False, tool_has_been_called=True)], feedback=None, time_in_call_secs=252, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None), ConversationHistoryTranscriptCommonModel(role='agent', message='Alright Angelo, next question is: What is a logarithm in mathematical terms?\n', tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=252, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None), ConversationHistoryTranscriptCommonModel(role='user', message='Logarithm is a calculation to reduce the number.', tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=265, conversation_turn_metrics=None, llm_override=None, rag_retrieval_info=None), ConversationHistoryTranscriptCommonModel(role='agent', message="That's not quite right, Angelo. While logarithms can help simplify calculations, that's not their primary definition.\n\nA logarithm is actually the inverse operation to exponentiation. In simpler terms, the logarithm of a number is the exponent to which another fixed value, the base, must be raised to produce that number.\n\nFor example, the logarithm of one hundred to the base ten is two, because ten raised to the power of two is one hundred: ten squared equals one hundred, so log base ten of one...", tool_calls=[], tool_results=[], feedback=None, time_in_call_secs=275, conversation_turn_metrics={'metrics': {}}, llm_override=None, rag_retrieval_info=None)], metadata=ConversationHistoryMetadataCommonModel(start_time_unix_secs=1741632757, call_duration_secs=304, cost=2023, deletion_settings=ConversationDeletionSettings(deletion_time_unix_secs=1804705062, deleted_logs_at_time_unix_secs=None, deleted_audio_at_time_unix_secs=None, deleted_transcript_at_time_unix_secs=None, delete_transcript_and_pii=True, delete_audio=True), feedback=ConversationHistoryFeedbackCommonModel(overall_score=None, likes=0, dislikes=0), authorization_method='public', charging=ConversationChargingCommonModel(dev_discount=False), termination_reason='', phone_call=None), analysis=ConversationHistoryAnalysisCommonModel(evaluation_criteria_results={}, data_collection_results={}, call_successful='success', transcript_summary="The conversation involves a study session between an AI assistant and a user named Angelo. The assistant asks questions on various topics, including contrastive learning, Q-learning in reinforcement learning, and logarithms in mathematics. 
Angelo demonstrates varying levels of understanding across these topics.\n\nFor contrastive learning, Angelo shows some understanding of how loss functions measure alignment between image-text pairs. The assistant expands on this, explaining the concept of 'margin' in contrastive loss functions.\n\nWhen discussing Q-learning, Angelo initially struggles but eventually grasps that it's an off-policy method because it learns from a different policy than the one it follows.\n\nFinally, when asked about logarithms, Angelo provides an incorrect definition. The assistant begins to explain the correct definition but the transcript ends mid-explanation.\n\nThroughout the conversation, the assistant maintains a patient and encouraging tone, providing additional explanations and cues when Angelo struggles with concepts."), conversation_initiation_client_data=ConversationInitiationClientData(conversation_config_override=ConversationConfigClientOverride(agent=None, tts=None), custom_llm_extra_body={}, dynamic_variables={'user_name': 'Angelo'}))
Get questions from conversations
The result is a dictionary keyed by question ID; each value is another dictionary holding the start and end positions of that question's segment within the transcript.
# loop through the transcript of the selected conversation
question_positions_list = []
question_positions = {}

# loop through every item in the transcript and keep only items that are tool results (i.e. not user responses)
for index, item in enumerate(conversation_details.transcript):
    if item.tool_results and item.tool_results[0].tool_name == 'question_retreiver':
        result = json.loads(item.tool_results[0].result_value)
        # print(f"Position: {index}, Question ID: {result['question_id']}, Time in Call: {item.time_in_call_secs}")
        question_positions_list.append((result['question_id'], index))

# Then create a dictionary with each question's start position and the next question's start position
for i in range(len(question_positions_list)):
    question_id, start_pos = question_positions_list[i]
    next_pos = question_positions_list[i + 1][1] if i < len(question_positions_list) - 1 else len(conversation_details.transcript)
    question_positions[question_id] = {"start": start_pos, "end": next_pos}

question_positions
{25: {'start': 4, 'end': 14},
49: {'start': 14, 'end': 27},
31: {'start': 27, 'end': 31}}
Print conversations for each question
This is going to be useful for reviewing the information in a readable format before we pass the conversation text to an LLM to judge the answers to the questions.
for question_id, positions in question_positions.items():
    # Extract the transcript segment for this question
    start_pos = positions['start']
    end_pos = positions['end']
    question_transcript = conversation_details.transcript[start_pos:end_pos]

    print(f"Question ID: {question_id}, Start Position: {start_pos}, End Position: {end_pos}")
    print(f"Transcript segment length: {len(question_transcript)} messages")
    for index, item in enumerate(question_transcript):
        print(f"Role: {item.role}, {item.message}")
Question ID: 25, Start Position: 4, End Position: 14
Transcript segment length: 10 messages
Role: agent, None
Role: agent, Okay, Angelo, here's your first question: How do loss functions in contrastive learning measure alignment between image-text pairs? Take your time to think about it. If you don't know, that's perfectly fine, I can give you a little cue.
Role: user, When they're close together, when they're at right angles, it means that they are more similar. When they're further away, it means more dissimilar. So the aim of the loss function is to make sure that they're, to get those tensors close on and those vectors close together.
Role: agent, That's a good start, Angelo. You're right that the goal is to bring the embeddings of similar image-text pairs closer together and push dissimilar pairs further apart.
To expand on that, contrastive loss functions typically use a distance metric (like cosine similarity) to measure the similarity between the image and text embeddings. The loss is minimized when the embeddings of matching pairs are highly similar (close together in the embedding space) and the embeddings of non-matching pairs are dissimilar (far apart).
Do you understand how the concept of 'margin' is used in some contrastive loss functions to define how far apart dissimilar pairs should be?
Role: user, No, I don't know.
Role: agent, Okay, no problem at all. The 'margin' in some contrastive loss functions sets a threshold. It specifies how far apart the embeddings of dissimilar (non-matching) image-text pairs need to be in the embedding space before the loss becomes zero.
So, the loss function only penalizes those dissimilar pairs that are within the margin. This helps to create a clearer separation between similar and dissimilar pairs in the embedding space. Does that make sense?
Role: user, It does, yes?
Role: agent, Great. So, to recap, contrastive loss functions use a distance metric to measure the similarity between image and text embeddings. The goal is to minimize the distance between matching pairs and maximize the distance between non-matching pairs, often with a margin to define the desired separation.
Shall we try another question?
Role: user, Yes, please?
Role: agent, None
Question ID: 49, Start Position: 14, End Position: 27
Transcript segment length: 13 messages
Role: agent, None
Role: agent, Okay Angelo, here's your next question: Why is Q-learning classified as an off-policy method?
Role: user, Not sure. What is Q policy?
Role: agent, Okay, let's clarify that first. In reinforcement learning, a policy is essentially the strategy an agent uses to decide what action to take in a given state. It maps states to actions.
Now, Q-learning learns an optimal policy by estimating the Q-values, which represent the expected reward for taking a specific action in a specific state and following the optimal policy thereafter.
With that in mind, can you take another shot at the question: Why is Q-learning classified as an off-policy method? I'll give you another cue if you need it.
Role: user, Yeah, give me another Q here.
Role: agent, Okay, here's a cue: Think about how Q-learning updates its Q-value estimates. Does it update them based on the actions it actually takes, or does it update them based on a different policy?
Role: user, A different policy.
Role: agent, Exactly. Q-learning is considered off-policy because it learns the optimal Q-values (and thus the optimal policy) independently of the actions taken by the agent. It updates its Q-values based on the best possible action, regardless of what action the agent actually chose to do.
In other words, it learns about the optimal policy while following a potentially different, exploratory policy. This is in contrast to on-policy methods like SARSA, which update Q-values based on the actions actually taken by the agent.
Does that distinction between learning from the best possible action versus the action actually taken make sense?
Role: user, Yes it does?
Role: agent, Fantastic. To make sure we're solid on this, can you explain in your own words the key difference between on-policy and off-policy learning?
Role: user, No, next question.
Role: agent, No problem at all. Let's move on to the next question then.
Role: agent, None
Question ID: 31, Start Position: 27, End Position: 31
Transcript segment length: 4 messages
Role: agent, None
Role: agent, Alright Angelo, next question is: What is a logarithm in mathematical terms?
Role: user, Logarithm is a calculation to reduce the number.
Role: agent, That's not quite right, Angelo. While logarithms can help simplify calculations, that's not their primary definition.
A logarithm is actually the inverse operation to exponentiation. In simpler terms, the logarithm of a number is the exponent to which another fixed value, the base, must be raised to produce that number.
For example, the logarithm of one hundred to the base ten is two, because ten raised to the power of two is one hundred: ten squared equals one hundred, so log base ten of one...
LLM as a judge
Now we get to the fun part: sending the conversation, together with an evaluation prompt, to an LLM. We're trying a few methods here, starting with LangChain.
The LangChain method
LangChain is useful because it abstracts the models away, which is convenient for prompting, but it can cause more issues with tool calling, since each platform behaves differently. I've added the code below as an example method without tool calling, where you can see the response.
Function to call OpenAI (LangChain)
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage
from typing import Optional, List, Union
import os
from dotenv import load_dotenv

def get_llm_response(
    prompt: str,
    model: str = "gpt-3.5-turbo",
    temperature: float = 0.7,
    api_key: Optional[str] = None
) -> str:
    """
    Get a response from an LLM using LangChain.

    Args:
        prompt (str): The input prompt/question for the LLM
        model (str): Model identifier (default: "gpt-3.5-turbo")
        temperature (float): Sampling temperature (default: 0.7)
        api_key (Optional[str]): API key. If None, loads from environment

    Returns:
        str: The LLM's response
    """
    # Load API key from environment if not provided
    if not api_key:
        load_dotenv()
        api_key = os.getenv("OPENAI_API_KEY")
        if not api_key:
            raise ValueError("No API key provided or found in environment")

    # Initialize the LLM
    llm = ChatOpenAI(
        model=model,
        temperature=temperature,
        openai_api_key=api_key
    )

    # Get response
    messages = [HumanMessage(content=prompt)]
    response = llm.invoke(messages)

    return response.content
# test script
get_llm_response("tell me your name ? my name is alex")
'Nice to meet you, Alex! Is there anything else you would like to know or talk about?'
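To illustrate what "abstracts the models away" buys you: swapping providers only changes the line that constructs the chat model, while the message flow stays identical. This is a hypothetical sketch of my own, assuming the langchain-anthropic package is installed, ANTHROPIC_API_KEY is set in the environment, and using a placeholder model name:

# Hypothetical provider swap: same HumanMessage/invoke flow as above.
# Assumes `pip install langchain-anthropic` and ANTHROPIC_API_KEY in the environment.
from langchain_anthropic import ChatAnthropic
from langchain.schema import HumanMessage

def get_llm_response_anthropic(prompt: str, model: str = "claude-3-5-sonnet-20240620") -> str:
    # Only the constructor differs from the OpenAI version
    llm = ChatAnthropic(model=model, temperature=0.7)
    response = llm.invoke([HumanMessage(content=prompt)])
    return response.content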
Building the prompt and sending the data to the function above
Note how the data returned isn't structured, so it would be difficult to place the details into columns in a relational database.
# Create a scoring function to evaluate correctness from 1 to 10
def evaluate_correctness(transcript):
    """
    Evaluate the correctness of a user's responses in a conversation transcript.
    Returns a score from 1 to 10 based on correctness.
    """
    # Extract user messages and agent responses
    user_responses = [item.message for item in transcript if item.role == 'user']
    agent_messages = [item.message for item in transcript if item.role == 'agent']

    # Skip tool calls and focus on the conversation
    conversation_text = ""
    for i, item in enumerate(transcript):
        if item.tool_calls or item.tool_results:
            continue
        conversation_text += f"{item.role.upper()}: {item.message}\n\n"

    # Prepare prompt for LLM
    prompt = f"""
    Please evaluate the correctness of the user's responses in this conversation.

    Score the user's understanding on a scale from 1 to 10, where:
    - 1-3: Poor understanding with significant misconceptions
    - 4-6: Partial understanding with some errors
    - 7-8: Good understanding with minor inaccuracies
    - 9-10: Excellent understanding with accurate responses

    Conversation:
    {conversation_text}

    Provide a numerical score (1-10) and a brief explanation for your rating.
    """

    response = get_llm_response(prompt)
    return response
# Evaluate the correctness of the user's responses
for question_id, positions in question_positions.items():
    # Extract the transcript segment for this question
    start_pos = positions['start']
    end_pos = positions['end']
    question_transcript = conversation_details.transcript[start_pos:end_pos]
    evaluation_result = evaluate_correctness(question_transcript)
    print(evaluation_result)
I would rate the user's understanding as a 6. The user demonstrated a partial understanding of how loss functions in contrastive learning measure alignment between image-text pairs. They correctly mentioned that the goal is to bring similar pairs closer together and dissimilar pairs further apart. However, they seemed unsure about the concept of 'margin' in contrastive loss functions. Overall, the user showed some understanding with some errors in their responses.
I would rate the user's understanding as a 6. The user demonstrated a partial understanding of Q-learning being classified as an off-policy method, correctly identifying that Q-learning updates its Q-values based on a different policy. However, the user seemed unsure about the distinction between on-policy and off-policy learning. Overall, there were some errors in the responses but the user showed some understanding of the concepts.
I would rate the user's understanding a 3. The user's response shows a poor understanding of what a logarithm is, stating that it is a calculation to reduce the number, which is not accurate. They did not mention the relationship between logarithms and exponentiation, which is a key concept in understanding logarithms.
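To see concretely why these free-text replies are hard to store, here's what extracting even the numeric score by hand might look like. This is a throwaway sketch of my own (the regexes are assumptions about the model's wording, nothing more) and it breaks the moment the model phrases its rating differently, which is exactly the motivation for the structured approach in the next section.

import re

def extract_score(free_text: str):
    # Fragile: assumes the score appears as e.g. "6/10" or "as a 6"
    match = re.search(r'\b([1-9]|10)\s*/\s*10\b', free_text)
    if not match:
        match = re.search(r'\b(?:as a|a score of|rating of|a)\s+([1-9]|10)\b', free_text)
    return int(match.group(1)) if match else None

# e.g. extract_score("I would rate the user's understanding as a 6. ...") -> 6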
Calling the OpenAI model directly in JSON mode
JSON mode forces the data to be returned in a structured format that is easy to manipulate in Python and, if required, to send to a relational database.
from openai import OpenAI

# Separate client for OpenAI (the `client` variable defined earlier is the ElevenLabs client)
openai_client = OpenAI()  # picks up OPENAI_API_KEY from the environment

# Create a scoring function to evaluate correctness from 1 to 10
def evaluate_correctness(transcript):
    """
    Evaluate the correctness of a user's responses in a conversation transcript.
    Returns a score from 1 to 10 based on correctness.
    """
    # Extract user messages and agent responses
    user_responses = [item.message for item in transcript if item.role == 'user']
    agent_messages = [item.message for item in transcript if item.role == 'agent']

    # Skip tool calls and focus on the conversation
    conversation_text = ""
    for i, item in enumerate(transcript):
        if item.tool_calls or item.tool_results:
            continue
        conversation_text += f"{item.role.upper()}: {item.message}\n\n"

    # Prepare prompt for LLM
    prompt = f"""
    Please evaluate the correctness of the user's responses in this conversation.

    Score the user's understanding on a scale from 1 to 10, where:
    - 1-3: Poor understanding with significant misconceptions
    - 4-6: Partial understanding with some errors
    - 7-8: Good understanding with minor inaccuracies
    - 9-10: Excellent understanding with accurate responses

    Conversation:
    {conversation_text}

    Provide a numerical score (1-10) and a brief explanation for your rating.
    """

    response = openai_client.responses.create(
        model="gpt-4o-2024-08-06",
        input=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": "..."}
        ],
        text={
            "format": {
                "type": "json_schema",
                "name": "question_evaluation",
                "schema": {
                    "type": "object",
                    "properties": {
                        "question_id": {"type": "number"},
                        "score": {"type": "number"},
                        "Summary": {"type": "string"},
                        "Improvements": {"type": "string"}
                    },
                    "required": ["question_id", "score", "Summary", "Improvements"],
                    "additionalProperties": False
                },
                "strict": True  # strict lives inside the format object
            }
        }
    )

    evaluation_result = json.loads(response.output_text)
    return evaluation_result
Formatting the evaluation results from JSON to a more readable format
# Create a function to format the evaluation results in a more readable way
def format_evaluation_result(result):
    """Format the evaluation result in a more readable way."""
    formatted_output = f"""
Question ID: {result['question_id']}
Score: {result['score']}/10
Summary: {result['Summary']}
Improvements:
{result['Improvements']}
"""
    return formatted_output
# Evaluate and display results in a more readable format
print("\n" + "="*50 + "\nEVALUATION RESULTS\n" + "="*50)

for question_id, positions in question_positions.items():
    # Extract the transcript segment for this question
    start_pos = positions['start']
    end_pos = positions['end']
    question_transcript = conversation_details.transcript[start_pos:end_pos]

    # Get evaluation result
    evaluation_result = evaluate_correctness(question_transcript)

    # Format and display the result
    formatted_result = format_evaluation_result(evaluation_result)
    print(formatted_result)
    print("-"*50)
==================================================
EVALUATION RESULTS
==================================================
Question ID: 1
Score: 7/10
Summary: The user has a good understanding of the basic concept in contrastive learning, specifically how loss functions work to bring similar image-text embeddings closer together and push dissimilar ones further apart. However, there is some ambiguity in their response, specifically in the explanation of measuring similarity or distance (using phrases like 'when they're at right angles'). This indicates a need for clearer conceptual clarity on the mechanics, such as using specific metrics like cosine similarity.
Improvements:
To improve the response, the user could provide a more detailed explanation of how distance metrics (e.g., cosine similarity) are used to measure closeness or similarity in embeddings. Additionally, clarity on how these measures are mathematically represented in loss functions would enhance their understanding.
--------------------------------------------------
Question ID: 1
Score: 4/10
Summary: The user demonstrated partial understanding with some errors. They did not initially grasp the concept of Q-learning being off-policy and required guidance to arrive at the correct answer. They could identify the correct aspect when prompted but did not explain it independently.
Improvements:
To improve understanding, the user should learn more about how Q-learning updates Q-values using the optimal policy rather than the behavior policy. Exploring examples of both Q-learning and SARSA could help solidify the difference between off-policy and on-policy learning.
--------------------------------------------------
Question ID: 1
Score: 3/10
Summary: The user's response indicates a significant misconception about logarithms.
Improvements:
The user should understand that a logarithm is the inverse of exponentiation and represents the power to which a base number must be raised to produce a given number. For example, for log base 10 of 100, the answer is 2 because 10^2 = 100. This definition clarifies the concept, moving away from the incorrect notion of simply 'reducing the number.'
--------------------------------------------------
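One caveat visible in the output above: the judge reports Question ID: 1 for every question, because the real IDs (25, 49, 31) are never included in the prompt, so the model simply guesses. The easiest fix is to overwrite that field with the known ID after the call. With that done, and because the results are now structured, saving them to a relational database becomes straightforward. Below is a minimal sketch of how that could work; it is my own addition, with an assumed SQLite file name and table layout, not code from the pipeline above.

import sqlite3
from datetime import datetime

# Assumed schema and file name, for illustration only
conn = sqlite3.connect("evaluations.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS evaluations (
        question_id INTEGER,
        score REAL,
        summary TEXT,
        improvements TEXT,
        evaluated_at TEXT
    )
""")

def save_evaluation(result):
    # `result` is the dict returned by evaluate_correctness in JSON mode
    conn.execute(
        "INSERT INTO evaluations VALUES (?, ?, ?, ?, ?)",
        (result['question_id'], result['score'], result['Summary'],
         result['Improvements'], datetime.now().isoformat())
    )
    conn.commit()

Inside the evaluation loop this would be used as: evaluation_result['question_id'] = question_id (overriding the model's guess with the ID from question_positions), then save_evaluation(evaluation_result). Accumulating these rows across sessions is what will eventually let me query for patterns, such as average score per question over time.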