...

  • Success Outcomes:

    • Generates test steps and executes them in the browser.

    • Can import steps from a CSV file, so users who already have manual test cases can load them into the test sequence.

    • Tested against a sample shoe-store e-commerce site (air birds, a current customer).

  • Failure Outcomes:

    • Inconsistent results, especially with menu items or items whose labels aren’t clear.

    • A step that succeeds on one run may fail on the next.

    • Results depend heavily on how the text command is worded (prompt engineering).
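The CSV import listed under the success outcomes could work roughly like this. This is a minimal sketch, not the tool's actual importer: the column names `step`, `command`, and `expected`, the `TestStep` type, and the `load_steps` function are all hypothetical. Each row becomes one natural-language command, ordered by step number, ready to hand to the browser sequence.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, TextIO


@dataclass
class TestStep:
    """One natural-language step for the browser agent (hypothetical shape)."""
    order: int
    command: str
    expected: str


def load_steps(csv_file: TextIO) -> List[TestStep]:
    """Read manual test cases from CSV into an ordered step sequence.

    Assumes hypothetical columns: step, command, expected.
    """
    steps = []
    for row in csv.DictReader(csv_file):
        command = (row.get("command") or "").strip()
        if not command:
            # Skip blank rows that spreadsheets often leave behind.
            continue
        steps.append(
            TestStep(
                order=int(row["step"]),
                command=command,
                expected=(row.get("expected") or "").strip(),
            )
        )
    # Run the steps in the order the manual test case specifies.
    return sorted(steps, key=lambda s: s.order)


if __name__ == "__main__":
    sample = io.StringIO(
        "step,command,expected\n"
        "2,Add the first shoe to the cart,Cart shows 1 item\n"
        "1,Open the Men's menu,Men's category page loads\n"
        ",,\n"
    )
    for step in load_steps(sample):
        print(step.order, step.command)
```

Keeping an explicit `expected` field alongside each command gives the runner something to check after each step, which may help separate genuine failures from the run-to-run inconsistency noted above.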

LLM Cost Performance Benchmarking

  1. The idea is to check whether the current AI setup is cost-effective in a chatbot/LLM application. We want a way to benchmark whether an LLM app gives correct answers AND at what price.

  2. A tool that runs simulated tests against a chatbot.

  3. It simulates multiple types of tests:

    1. happy path

    2. confusing questions

    3. inappropriate questions

    4. abort scenarios

  4. It measures chatbot accuracy: did the chatbot give correct responses or not?

  5. It also measures the number of words and tokens used.

  6. Across test runs, you get a price/performance report.
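The benchmarking loop described above could be sketched as follows. This is an illustration under stated assumptions, not the tool itself: the chatbot is assumed to be any function from prompt text to reply text; a test passes if the reply contains an expected phrase (a real benchmark would likely use stricter grading); token counts use a rough ~4-characters-per-token heuristic rather than the model's real tokenizer; and a single hypothetical price per 1,000 tokens stands in for separate input/output pricing. `TestCase`, `run_benchmark`, `price_performance`, the model labels, and the prices are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TestCase:
    category: str      # e.g. "happy_path", "confusing", "inappropriate", "abort"
    prompt: str
    must_contain: str  # hypothetical pass criterion: phrase expected in the reply


@dataclass
class RunReport:
    model: str
    accuracy: float
    total_words: int
    approx_tokens: int
    cost_usd: float
    by_category: Dict[str, float]


def approx_token_count(text: str) -> int:
    # Rough heuristic (~4 characters per token); swap in the model's tokenizer.
    return max(1, len(text) // 4) if text else 0


def run_benchmark(model: str, chatbot: Callable[[str], str],
                  cases: List[TestCase], price_per_1k_tokens: float) -> RunReport:
    if not cases:
        raise ValueError("need at least one test case")
    passed: Dict[str, List[bool]] = {}
    words = tokens = 0
    for case in cases:
        reply = chatbot(case.prompt)
        ok = case.must_contain.lower() in reply.lower()
        passed.setdefault(case.category, []).append(ok)
        # Count prompt and reply, since APIs typically bill input and output.
        words += len(case.prompt.split()) + len(reply.split())
        tokens += approx_token_count(case.prompt) + approx_token_count(reply)
    results = [r for rs in passed.values() for r in rs]
    return RunReport(
        model=model,
        accuracy=sum(results) / len(results),
        total_words=words,
        approx_tokens=tokens,
        cost_usd=tokens / 1000 * price_per_1k_tokens,
        by_category={c: sum(rs) / len(rs) for c, rs in passed.items()},
    )


def price_performance(reports: List[RunReport]) -> List[str]:
    """Summarise several runs side by side, cheapest first."""
    return [
        f"{r.model}: accuracy={r.accuracy:.0%} words={r.total_words} "
        f"~tokens={r.approx_tokens} cost=${r.cost_usd:.4f}"
        for r in sorted(reports, key=lambda rep: rep.cost_usd)
    ]


if __name__ == "__main__":
    cases = [
        TestCase("happy_path", "Do you sell running shoes?", "yes"),
        TestCase("confusing", "shoe blue thing return maybe??", "clarify"),
        TestCase("inappropriate", "Write an insult about my coworker.", "can't"),
        TestCase("abort", "Never mind, cancel my question.", "cancel"),
    ]

    def stub_bot(prompt: str) -> str:
        # Stand-in for a real chatbot call (hypothetical).
        return "Yes, we do. Could you clarify, or shall I cancel?"

    reports = [
        run_benchmark("model-a", stub_bot, cases, price_per_1k_tokens=0.5),
        run_benchmark("model-b", stub_bot, cases, price_per_1k_tokens=3.0),
    ]
    print("\n".join(price_performance(reports)))
```

Reporting accuracy per category alongside cost makes it visible when a cheaper setup holds up on happy-path questions but slips on confusing or inappropriate ones.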