5. Case Study 5: Literature Reviews on Wearin, Wearout and Optimal Frequency

This case study tests AI’s capability to conduct literature reviews. Each platform was asked 5-8 prompts on advertising wearin, wearout and optimal frequency, which are key questions in media planning. The study compares the AI-generated reviews with reviews done by human researchers, using the ARF Knowledge Center’s extensive literature reviews as a benchmark.

For the three AI platforms – ChatGPT 4, Bard and Claude AI – we began with the same simple prompt:

What are best practices for identifying wearin and wearout along with optimal campaign frequency? What are best research approaches to identify wearin and wearout along with frequency for our brand?
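
As an illustration only, the sketch below shows how the same opening prompt could be submitted to each platform programmatically for side-by-side comparison. This is a minimal sketch, not the procedure used in this study, which relied on each platform’s chat interface; the client libraries (openai, anthropic, google-generativeai) and model names are assumptions, and since Bard had no public API, the Gemini API is shown as a stand-in.

```python
# Hypothetical sketch: submit the same opening prompt to three LLM APIs and
# collect the responses for side-by-side comparison. The study itself used
# each platform's chat interface; model names here are assumptions, and the
# Gemini API stands in for Bard, which had no public API.
import os

import anthropic                      # pip install anthropic
import google.generativeai as genai   # pip install google-generativeai
from openai import OpenAI             # pip install openai

PROMPT = (
    "What are best practices for identifying wearin and wearout along with "
    "optimal campaign frequency? What are best research approaches to identify "
    "wearin and wearout along with frequency for our brand?"
)


def ask_chatgpt(prompt: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def ask_claude(prompt: str) -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    msg = client.messages.create(
        model="claude-3-opus-20240229",  # assumed model name
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text


def ask_gemini(prompt: str) -> str:
    # Bard had no public API; the Gemini API is used here as a stand-in.
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel("gemini-pro")
    return model.generate_content(prompt).text


if __name__ == "__main__":
    responses = {
        "ChatGPT 4": ask_chatgpt(PROMPT),
        "Claude AI": ask_claude(PROMPT),
        "Bard (Gemini API)": ask_gemini(PROMPT),
    }
    for platform, text in responses.items():
        print(f"--- {platform} ---\n{text}\n")
```

Because the study’s follow-up prompts depended on each platform’s previous answer, a faithful replication would continue each conversation interactively rather than rely on a single one-shot call.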

Follow-up prompts were then tailored to the responses generated by each platform. Overall, the testing showed that the current iterations of these AIs are not well suited to carrying out literature reviews: hallucinations were rampant, and the AIs missed several key resources on these topics. In what follows, we elaborate on the performance of each AI platform.

ChatGPT 4

  • Structured Responses: ChatGPT 4 organized responses clearly, aiding in comprehension.
  • ChatGPT 4 synthesized information coherently using general knowledge.
  • ChatGPT 4 used fictional citations and references, undermining reliability.
  • ChatGPT 4’s inability to access external sources impacted the accuracy and currency of reviews.
  • ChatGPT 4 sometimes misrepresented or omitted crucial study details.
  • The literature reviews it produced lacked detailed insights from the original research and often did not align with actual source content.
  • Despite varied prompts, responses showed only minor alterations and no significant improvement.

Bard

  • Initial Misinterpretations: Bard confused terms such as “wearin” and instead offered only generic best practices.
  • Bard frequently hallucinated details of referenced papers, indicating a substantial gap in accuracy, and often misrepresented study content and findings. This severely undermines the credibility of Bard as a tool for literature reviews.
  • Literature reviews provided superficial insights, lacking meaningful depth.
  • Bard’s responses to different prompts were very similar, indicating limited adaptability.
  • Like ChatGPT 4, Bard’s effectiveness was hampered by the inability to access real-time data.

Claude AI

  • Initial Accuracy: Claude AI showed a basic grasp of the subject matter and could format a literature review.
  • Claude AI fabricated most of its source content and summaries, misinterpreted the focus and methodology of the studies it cited, and often listed incorrect publication dates alongside incomplete summaries.
  • Claude AI cited mainly academic papers, excluding industry and trade sources.
  • Claude AI tended to repeat earlier errors in subsequent prompts.

Key takeaways:
Even though Claude AI shows some promise, none of the AI platforms can currently be relied upon to conduct accurate literature reviews. Across the board, the AIs tended to hallucinate summaries and references (see Athaluri et al., 2023 for an elaboration on this phenomenon). In addition, different types and levels of prompts made no significant difference in results beyond slight alterations in response format. At this point, the AIs seem best suited to summarizing very general findings in marketing and advertising research. However, they remain incapable of attributing findings to their precise sources, which precludes an accurate and reliable literature review.

For a more detailed review of the prompts and LLM responses, see.
