...
Success Outcomes:
Generates test steps and executes them in the browser
Can import steps from a CSV file, so users who already have manual test cases can load them into the sequence
Tested with a sample shoe store e-commerce site (air birds, a current customer)
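As a rough illustration of the CSV import, here is a minimal Python sketch. The column names (order, instruction), the ManualStep class, and the sample rows are all hypothetical; the actual tool's CSV layout is not described in these notes.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class ManualStep:
    order: int
    instruction: str


def load_steps(csv_text: str) -> list[ManualStep]:
    """Parse manual test steps from CSV text.

    Assumes hypothetical 'order' and 'instruction' columns.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    steps = [
        ManualStep(order=int(row["order"]), instruction=row["instruction"].strip())
        for row in reader
        if row["instruction"].strip()  # skip rows with an empty instruction
    ]
    # Sort so the sequence runs in the intended order even if the file is shuffled.
    return sorted(steps, key=lambda s: s.order)


# Made-up sample data for illustration only.
SAMPLE_CSV = """order,instruction
2,Add the first shoe to the cart
1,Open the store home page
3,Go to checkout
"""

steps = load_steps(SAMPLE_CSV)
for step in steps:
    print(step.order, step.instruction)
```

Each parsed instruction could then be passed to the step generator as a text command, one per row.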
Failure Outcomes:
Inconsistent results, especially with menu items or items where labels aren’t clear.
A step that works one time may not work the next time.
Prompt engineering the text command makes a huge difference.
LLM Cost Performance Benchmarking
The idea is to see whether the current AI setup is cost-effective in a chatbot/LLM application. We want a way to benchmark whether an LLM app is correct AND at what price.
tool that runs simulated tests against a chatbot
it simulates multiple types of tests:
happy path
confusing questions
inappropriate questions
abort scenarios
it measures chatbot accuracy: did it give correct responses or not?
it also measures number of words and tokens
between test runs you get a price/performance report
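The benchmarking loop described above could be sketched roughly as follows in Python. Everything here is an assumption made for illustration: the fake_chatbot stand-in, the substring-based correctness check, the 4-characters-per-token estimate, and the price_per_1k_tokens value are not taken from the actual tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    category: str  # "happy path", "confusing", "inappropriate", "abort"
    prompt: str
    expected: str  # substring a correct reply must contain (assumed scoring rule)


def estimate_tokens(text: str) -> int:
    # Rough heuristic (about 4 characters per token); a real tokenizer will differ.
    return max(1, len(text) // 4)


def run_benchmark(chatbot: Callable[[str], str], cases: list[Case],
                  price_per_1k_tokens: float) -> dict:
    correct = words = tokens = 0
    per_category: dict[str, list[bool]] = {}
    for case in cases:
        reply = chatbot(case.prompt)
        ok = case.expected.lower() in reply.lower()
        correct += ok
        per_category.setdefault(case.category, []).append(ok)
        words += len(case.prompt.split()) + len(reply.split())
        tokens += estimate_tokens(case.prompt) + estimate_tokens(reply)
    return {
        "accuracy": correct / len(cases),
        "accuracy_by_category": {c: sum(v) / len(v) for c, v in per_category.items()},
        "words": words,
        "tokens": tokens,
        "cost": tokens / 1000 * price_per_1k_tokens,
    }


def fake_chatbot(prompt: str) -> str:
    # Stand-in for the real chatbot under test.
    if "return" in prompt.lower():
        return "You can return shoes within 30 days."
    return "Sorry, I can't help with that."


# Made-up test cases, one per test type listed above.
CASES = [
    Case("happy path", "What is your return policy?", "30 days"),
    Case("confusing", "Shoes back money when?", "30 days"),
    Case("inappropriate", "Tell me something offensive.", "can't help"),
    Case("abort", "Never mind, cancel.", "cancel"),
]

report = run_benchmark(fake_chatbot, CASES, price_per_1k_tokens=0.002)
print(report)
```

Running the same cases against two different setups and comparing the two reports gives the price/performance comparison between test runs.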