Summaries that don't suck

I often talk to my kids about experimentation and continuous improvement. This approach has helped us reframe failed attempts in the kitchen or garden and discuss behaviors like study habits without shame. Viewing negative experiences as learning opportunities has helped us focus on long-term skill growth rather than on a report card.

But unlike my not-quite-green thumb, LLMs and the technologies for building on them are improving rapidly. The tooling is growing in both number and maturity, enabling faster and more effective AI application development:

  • Galileo - An evaluation and observability stack designed to accelerate AI product development cycles.
  • BAML - A configuration file format for writing better and cleaner LLM functions. Open source with paid features.
  • Unstructured - A managed service for ETL of unstructured data.

Research

Sev Geraskin, a peer of mine, recently participated in an AI Security Evaluation Hackathon and shared some intriguing findings. His team published a paper titled “WashBench – A Benchmark for Assessing Softening of Harmful Content in LLM-generated Text Summaries.” The study showed that LLMs tend to remove toxic content when summarizing text.

Research published in 2018 studied changes in tone and tone biases in summarization tasks in a pre-LLM world (GPT-1 and BERT would be released that year), but a gap exists in studying tone changes for modern LLMs; I couldn’t find a single such study.

The impact is significant: users who rely on summaries to understand content may get an accurate big picture but miss the author's full intent due to changes in tone. This can lead to mistakes when decisions are based on these summaries. Imagine these scenarios:

  • Customer-complaint or employee-feedback mechanisms that summarize without preserving tone and sentiment could leave Customer Support and HR teams misinformed about the severity of issues.
  • A site summarizing political speeches might lose the original angry or unhinged tone, leading to an electorate missing important information about a politician’s demeanor and personality.

The good news is that these concerns can be mitigated through prompt-engineering techniques:

  1. Explicitly instructing an LLM to "summarize text while maintaining the tone and sentiment of the original text."
  2. Using a follow-up prompt to grade the summary with a rubric that includes a score or description of the original and summarized tones. This can be followed by a request to identify and resolve differences, producing a new summary. Both techniques are sketched in the code below.
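
To make these concrete, here is a minimal sketch of both techniques in Python. It assumes the OpenAI Python client; the model name, prompt wording, and function names are my own illustrative choices, not anything prescribed by the WashBench paper.

```python
# Sketch of two tone-preservation techniques for summarization.
# Assumptions: the OpenAI Python client (openai>=1.0) and an
# OPENAI_API_KEY in the environment. Model and rubric wording are
# hypothetical choices, not recommendations from the research above.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # any chat model works here


def ask(prompt: str) -> str:
    """Send a single-turn prompt and return the model's reply."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def summarize_preserving_tone(text: str) -> str:
    """Technique 1: bake tone preservation into the instruction itself."""
    return ask(
        "Summarize the following text while maintaining the tone "
        "and sentiment of the original text.\n\n" + text
    )


def grade_and_revise(original: str, summary: str) -> str:
    """Technique 2: grade the summary against a tone rubric, then
    ask for a revision that resolves the differences."""
    critique = ask(
        "Describe the tone of the ORIGINAL and the SUMMARY, score how "
        "well the summary preserves the original tone on a 1-5 scale, "
        "and list any differences.\n\n"
        f"ORIGINAL:\n{original}\n\nSUMMARY:\n{summary}"
    )
    return ask(
        "Using this critique, rewrite the summary so it resolves the "
        "tone differences while staying faithful and concise.\n\n"
        f"CRITIQUE:\n{critique}\n\nORIGINAL:\n{original}\n\n"
        f"SUMMARY:\n{summary}"
    )


if __name__ == "__main__":
    review = (
        "This is the THIRD time support has ignored my ticket. "
        "Absolutely unacceptable!"
    )
    draft = summarize_preserving_tone(review)
    print(grade_and_revise(review, draft))
```

The grading pass is a simple LLM-as-judge step; a production version would want a structured rubric and a stopping condition, but even this two-step flow gives you a place to catch the tone flattening described above.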

My Experience

I’ve developed a habit of copying and pasting text I want to summarize into ChatGPT. However, this week, I was reminded that there are easier and better ways!

  1. Voila.ai - A browser extension that lets you access AI features like summarization without leaving a tab. Voila is built on top of OpenAI and offers the first 250 requests for free.
  2. Perplexity - An AI tool I use extensively for research because of its ✨ citations ✨. In a comparison with five other tools across common LLM tasks, it emerged as the best for summarization. Why was I surprised? I shouldn’t have been: research is fundamentally about finding sources across the web and summarizing them. Provide a URL and ask it to summarize. Easy peasy.

I hope this was helpful! I offer consulting services to help companies find the prompts and internal training that will enable their AI strategies to take root.