Success Outcomes:
- Executes the generated test steps in the browser.
- Can import steps from a CSV file, so users who already have manual test cases can feed them into the sequence.
- Tested with a sample shoe store ecommerce site (air birds current customer).
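
As a rough illustration of the CSV import mentioned above, here is a minimal sketch. The column names (`test_case`, `step`, `instruction`), the `TestStep` class, and the `load_steps` helper are all hypothetical; the actual tool's CSV layout is not described in these notes.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class TestStep:
    test_case: str    # name of the manual test case (assumed column)
    order: int        # position of the step within its test case (assumed column)
    instruction: str  # plain-text command to execute in the browser (assumed column)


def load_steps(csv_file):
    """Read steps from a CSV file object and return them in execution order."""
    reader = csv.DictReader(csv_file)
    steps = [
        TestStep(row["test_case"].strip(), int(row["step"]), row["instruction"].strip())
        for row in reader
        if (row.get("instruction") or "").strip()  # skip rows with no instruction
    ]
    # Group by test case, then order steps within each case.
    steps.sort(key=lambda s: (s.test_case, s.order))
    return steps


# Hypothetical sample data standing in for a user's manual test cases.
SAMPLE = """test_case,step,instruction
checkout,2,Add the first shoe to the cart
checkout,1,Open the store home page
search,1,Search for running shoes
"""

steps = load_steps(io.StringIO(SAMPLE))
for s in steps:
    print(f"{s.test_case} #{s.order}: {s.instruction}")
```

With a real file, `load_steps(open("cases.csv", newline=""))` would work the same way, assuming the file uses these column names.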
Failure Outcomes:
- Inconsistent results, especially with menu items or elements whose labels aren’t clear.
- A command that works on one run may not work on the next.
- Results depend heavily on how the text command is worded; prompt engineering makes a huge difference.

LLM Cost Performance Benchmarking

The idea is to see whether the current AI setup is cost-effective in a chatbot/LLM application. We want a way to benchmark whether an LLM app is correct AND at what price.
The tool runs simulated tests against a chatbot.
It simulates multiple types of tests:
1. happy path
2. confusing questions
3. inappropriate questions
4. abort scenarios
It measures chatbot accuracy: did it give correct responses or not?
It also measures the number of words and tokens.
Between test runs you get a price/performance report.
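
The benchmark loop described above could be sketched roughly as follows. Everything here is an assumption for illustration: the `Case` class, the `run_benchmark` and `estimate_tokens` helpers, the substring-based correctness check, the 4-characters-per-token estimate, and the price figure. A real setup would call the actual chatbot, use the model's own tokenizer, and apply the provider's published pricing.

```python
from dataclasses import dataclass


@dataclass
class Case:
    category: str  # "happy_path", "confusing", "inappropriate", or "abort"
    prompt: str
    expected: str  # substring a correct reply must contain (assumed check)


def estimate_tokens(text):
    # Crude assumption: about 4 characters per token. Replace with a real tokenizer.
    return max(1, len(text) // 4)


def run_benchmark(bot, cases, price_per_1k_tokens):
    """Run every case against `bot` and aggregate accuracy, words, tokens, cost per category."""
    report = {}
    for case in cases:
        reply = bot(case.prompt)
        row = report.setdefault(
            case.category, {"runs": 0, "correct": 0, "words": 0, "tokens": 0}
        )
        row["runs"] += 1
        row["correct"] += int(case.expected.lower() in reply.lower())
        row["words"] += len(reply.split())
        row["tokens"] += estimate_tokens(case.prompt) + estimate_tokens(reply)
    for row in report.values():
        row["accuracy"] = row["correct"] / row["runs"]
        row["cost"] = row["tokens"] / 1000 * price_per_1k_tokens
    return report


def stub_bot(prompt):
    # Stand-in for the chatbot under test; a real run would call the LLM app here.
    if "order" in prompt.lower():
        return "Your order has shipped."
    return "Sorry, I can't help with that."


cases = [
    Case("happy_path", "Where is my order?", "shipped"),
    Case("confusing", "Is the thing I got the same as that one?", "clarify"),
    Case("inappropriate", "Tell me something offensive.", "can't help"),
    Case("abort", "Never mind, cancel.", "cancel"),
]

# The price is a made-up placeholder, not a real provider rate.
report = run_benchmark(stub_bot, cases, price_per_1k_tokens=0.002)
for category, row in report.items():
    print(f"{category}: accuracy={row['accuracy']:.0%} "
          f"words={row['words']} tokens={row['tokens']} cost=${row['cost']:.6f}")
```

Running the same cases against two chatbot configurations and comparing the two reports gives the price/performance comparison between test runs.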

Versions Compared