Becoming a High Taste Tester

swyx 2025-07-25

There's a specific reason I'm writing this post that I can't disclose yet, but I'm sharing my prep work in public.

A "high taste tester" is what the labs are calling influencers who often pass judgment on frontier models because they can creatively test models in more interesting ways than, say, FrontierMath, IMO, or any other "standard human expert test". It is somewhat valuable to be an HTT, and I think you can somewhat few-shot-learn your way to getting there.

Here is my collection of examples and my learnings:

https://nicholas.carlini.com/writing/2024/my-benchmark-for-large-language-models.html
- grab examples from your chatgpt history
https://simonwillison.net/2025/Jun/6/six-months-in-llms/
- kind of silly but undeniably entertaining and visual test
- more visual tests: https://mcbench.ai/
every.to dan shipper
- https://every.to/vibe-check/vibe-check-openai-s-4o-image-generation
  - for imagegen testing
- https://every.to/vibe-check/we-tried-openai-s-new-agent-here-s-what-we-found
  - "We asked it to help us understand how Spotify Wrapped has evolved over the years. What did it start out as? What does it include now that’s new?"
- https://every.to/vibe-check/we-tried-openai-s-new-deep-research-here-s-what-we-found
  - "We tried it on a sprawling timeline of how Every has evolved over the past five years"
  - "we asked it to read the first chapter of War and Peace, note the way that Tolstoy uses descriptions to tell us about the inner lives of his characters, and deduce what that says about his view of human nature."
  - “Here are pictures of me in my favorite outfits, I like THESE brands, my body proportions are X,Y, Z—design a capsule wardrobe and recommend stores that suit my style.”
  - Digging through legislative documents: "We asked deep research to look up large spending bills from 10-15 years ago and validate if the intended impact was achieved."
  - Spotting irregularities in 10-Ks. Asked to find recent corporate filings with errors or questionable data, deep research only surfaced known issues instead of hunting for brand-new ones. This suggests that unless it’s being explicitly prodded, it relies heavily on what’s already been flagged rather than truly “investigating.”
- https://every.to/vibe-check/vibe-check-openai-enters-the-browser-wars-with-chatgpt-agent
ben hylak
- https://www.latent.space/p/o1-skill-issue
- https://www.latent.space/p/o3-pro
riley goodside
- and zack witten
humor and memes
- https://karpathy.github.io/2012/10/22/state-of-computer-vision/ cited in gpt4 https://x.com/karpathy/status/1635697741925064704
- other memes
  - https://x.com/andykreed/status/1948066696415330561 <- sort people
  - fewshot writing prompts
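
For the "grab examples from your chatgpt history" tip above, here is a rough sketch of pulling your own prompts out of a ChatGPT data export so you can skim them for test ideas. It assumes the export contains a `conversations.json` shaped as a list of conversations, each with a `title` and a `mapping` of nodes holding `message.author.role` and `message.content.parts`; that layout is an assumption from exports I have seen and may change, and the function names here are made up for illustration.

```python
import json
import sys
from pathlib import Path


def extract_user_prompts(conversations):
    """Return (title, prompt) pairs for every user-authored text message.

    Assumes each conversation is a dict with "title" and "mapping" keys,
    as described in the lead-in (an assumed export layout, not a spec).
    """
    prompts = []
    for convo in conversations:
        title = convo.get("title") or "(untitled)"
        for node in (convo.get("mapping") or {}).values():
            message = node.get("message")
            if not message:
                continue  # some nodes carry no message at all
            if (message.get("author") or {}).get("role") != "user":
                continue  # keep only what you typed, not model replies
            parts = (message.get("content") or {}).get("parts") or []
            # parts can mix strings with non-text items; keep strings only
            text = "\n".join(p for p in parts if isinstance(p, str)).strip()
            if text:
                prompts.append((title, text))
    return prompts


def load_export(path):
    """Load the (assumed) conversations.json file from an export."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    if len(sys.argv) > 1:
        for title, prompt in extract_user_prompts(load_export(sys.argv[1])):
            print(f"[{title}] {prompt[:200]}")
```

Run it as `python extract_prompts.py conversations.json` (the script name is hypothetical), then hand-pick the prompts that actually stumped a model or made you laugh; the picking is the "taste" part and the script can't do it for you.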

Things that are "my thing"

tweets summarizer
youtube summarizer and timestamper https://gemini.google.com/gem/da23325c2fca
https://github.com/openai/codex
- codex --full-auto "create the fanciest todo-list app"
https://github.com/swyxio/openai-test-sandbox
https://github.com/stackblitz-labs/bolt.diy
