Challenging Token Usage Assumptions
We’re building a new feature to help solve the “needle in a haystack” problem for our users. We’re calling it Lenses. Lenses are filters that use LLMs to better understand the intent of social media posts. It’s the first in a series of features that we think will make Messijo the best social media management tool on the market.
Lenses
You might be wondering how they work. You still use your simple and compound Keywords to match and find social media posts that are relevant to you and your team. But keyword matching only goes so far: you can still end up with hundreds or even thousands of posts with massively varied themes and intentions. Lenses are LLM-based filters in which you ask the AI whether a post fits a prompt. If it does, the post is added to the Lens’s results.
Some example lenses might be:
Is this user trying to report a problem or find help in some way?
If so, you can notify your support staff or even use one of Messijo's upcoming "secret" features when it's ready.
Is this post something that could make a good testimonial on our marketing site?
You might want to wire up a webhook notification so you can get it into your marketing site's data ahead of the next build.
Is this user requesting a feature from one of our competitors that we actually have already?
You could directly respond to the thread with a link to the feature or even a direct message to the user.
All of these prompts get injected into a template that we pass on to the LLM, and we then receive a structured output that drives our business logic on the backend.
The Assumption
Naively, our prompt was restating what our structured output specification already said. Someone pointed out that we could probably remove the JSON output parts of the prompt and cut down on token usage. They also posited that we would get better and faster results.
It makes sense from a purely logical standpoint, but since LLMs are less deterministic than we would like, we decided to test it out.
The Prompt Templates
Our first template looks something like this:
def lens_prompt(user_prompt, content) do
"""
Return a JSON object classifying a social media post based on a specific prompt.
### CLASSIFICATION CRITERIA
Determine if the following post satisfies this requirement:
[PROMPT]
#{user_prompt}
[END PROMPT]
### OUTPUT SPECIFICATION
Your response must be a single JSON object with this exact structure:
{
"result": boolean,
"confidence": number // 0.0 to 1.0
}
### RULES
1. If the criteria are met, set "result" to true and provide a confidence score based on the clarity of the evidence.
2. If the criteria are NOT met, set "result" to false and provide a confidence score for why it is not a match.
3. Search through the main text, hashtags, and mentions.
4. Output ONLY the raw JSON. Do not include markdown code blocks or introductory text.
### SOCIAL MEDIA POST
[POST]
#{content}
[END POST]
"""
end
You can see the explicit instructions for it to adhere to our proposed JSON output structure. We also added a few rules to make it easier for the model to understand what we’re looking for.
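Once the model replies, the backend step is just decoding that JSON and pattern matching on it. Here’s a minimal sketch of what that could look like, assuming the Jason library for JSON decoding and a hypothetical `LLMClient.complete/1` that sends a prompt and returns the raw response text (neither is our actual implementation, and `lens_prompt/2` from above is assumed to be available in the module):

```elixir
defmodule Lens do
  # Hypothetical wrapper around the template above: render, send, decode.
  def classify(user_prompt, content) do
    raw =
      lens_prompt(user_prompt, content)
      |> LLMClient.complete()

    case Jason.decode(raw) do
      {:ok, %{"result" => result, "confidence" => confidence}}
      when is_boolean(result) and is_number(confidence) ->
        {:ok, result, confidence}

      _ ->
        # The model broke the output contract (rule 4), so surface an
        # error instead of guessing at the classification.
        {:error, :invalid_response}
    end
  end
end
```

Pattern matching on the exact map shape means any extra keys or malformed output fall through to the error clause, which is what we want for downstream business logic.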
Our second template looks like this:
def lens_prompt_lighter(user_prompt, content) do
"""
### CLASSIFICATION CRITERIA
Determine if the following social media post satisfies this requirement:
[PROMPT]
#{user_prompt}
[END PROMPT]
### RULES
1. If the criteria are met, set "result" to true and provide a confidence score based on the clarity of the evidence.
2. If the criteria are NOT met, set "result" to false and provide a confidence score for why it is not a match.
3. Search through the main text, hashtags, and mentions.
### SOCIAL MEDIA POST
[POST]
#{content}
[END POST]
"""
end
You can already see that the second template is much more concise. It also doesn’t include the JSON output structure; we’re going to see if that makes a difference.
The data we’re passing into both functions is the same:
user_prompt = "Is this pokemon related? It should use the word pokemon or ghosttype."
content = "Pokémon Center China just dropped the Ghost-type Pokémon series!"
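With that data, rendering the two prompts is just a matter of calling each function; the only difference between the resulting strings is the OUTPUT SPECIFICATION section and the JSON-only rule:

```elixir
user_prompt = "Is this pokemon related? It should use the word pokemon or ghosttype."
content = "Pokémon Center China just dropped the Ghost-type Pokémon series!"

full = lens_prompt(user_prompt, content)
light = lens_prompt_lighter(user_prompt, content)

# The lighter template renders to a shorter string, which is what
# shows up as a lower prompt_tokens count in the results below.
String.length(light) < String.length(full)
```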
The Results
We ran both prompts through the same model, openai/gpt-oss-120b.
Here are the raw outputs for you to draw your own conclusions from first.
First Prompt Results
%{
"choices" => [
%{
"finish_reason" => "stop",
"index" => 0,
"logprobs" => nil,
"message" => %{
"content" => "{\"result\":true,\"confidence\":0.99}",
"reasoning" => "We need to output JSON with result true/false and confidence. The prompt: \"Is this pokemon related? It should use the word pokemon or ghosttype.\" The post: \"Pokémon Center China just dropped the Ghost-type Pokémon series!\" Contains \"Pokémon\" and \"Ghost-type\". So criteria met. Confidence high, maybe 0.99. Return JSON.",
"reasoning_details" => [
%{
"format" => "unknown",
"index" => 0,
"text" => "We need to output JSON with result true/false and confidence. The prompt: \"Is this pokemon related? It should use the word pokemon or ghosttype.\" The post: \"Pokémon Center China just dropped the Ghost-type Pokémon series!\" Contains \"Pokémon\" and \"Ghost-type\". So criteria met. Confidence high, maybe 0.99. Return JSON.",
"type" => "reasoning.text"
}
],
"refusal" => nil,
"role" => "assistant"
},
"native_finish_reason" => "stop"
}
],
"created" => 1766968014,
"id" => "gen-1766968013-o8KxTslbg0Q79saEGMsN",
"model" => "openai/gpt-oss-120b:free",
"object" => "chat.completion",
"provider" => "OpenInference",
"usage" => %{
"completion_tokens" => 95,
"completion_tokens_details" => %{
"image_tokens" => 0,
"reasoning_tokens" => 82
},
"total_tokens" => 376,
"cost" => 0,
"cost_details" => %{
"upstream_inference_completions_cost" => 0,
"upstream_inference_cost" => nil,
"upstream_inference_prompt_cost" => 0
},
"is_byok" => false,
"prompt_tokens" => 281,
"prompt_tokens_details" => %{
"audio_tokens" => 0,
"cached_tokens" => 0,
"video_tokens" => 0
},
}
}
Second Prompt Results
%{
"choices" => [
%{
"finish_reason" => "stop",
"index" => 0,
"logprobs" => nil,
"message" => %{
"content" => "{\n \"result\": true,\n \"confidence\": 0.99,\n \"evidence\": \"The post contains the word \\\"Pokémon\\\" (appearing twice) and also mentions \\\"Ghost-type\\\" which directly references the ghosttype concept.\"\n}",
"reasoning" => "We need to determine if the post satisfies the requirement: \"Is this pokemon related? It should use the word pokemon or ghosttype.\"\n\nWe need to check if the post uses the word \"pokemon\" or \"ghosttype\". The post: \"Pokémon Center China just dropped the Ghost-type Pokémon series!\" It contains \"Pokémon\" (with accent) and \"Ghost-type Pokémon\". Also \"Ghost-type\" includes ghosttype concept. The requirement: use the word pokemon or ghosttype. It uses \"Pokémon\". That's a match. So result true. Confidence high, maybe 0.99.\n\nProvide JSON? The instruction: set \"result\" to true and provide a confidence score. Likely output something like:\n\n{\n \"result\": true,\n \"confidence\": 0.99,\n \"evidence\": \"Contains the word 'Pokémon' and 'Ghost-type Pokémon'.\"\n}\n\nWe should follow format.",
"reasoning_details" => [
%{
"format" => "unknown",
"index" => 0,
"text" => "We need to determine if the post satisfies the requirement: \"Is this pokemon related? It should use the word pokemon or ghosttype.\"\n\nWe need to check if the post uses the word \"pokemon\" or \"ghosttype\". The post: \"Pokémon Center China just dropped the Ghost-type Pokémon series!\" It contains \"Pokémon\" (with accent) and \"Ghost-type Pokémon\". Also \"Ghost-type\" includes ghosttype concept. The requirement: use the word pokemon or ghosttype. It uses \"Pokémon\". That's a match. So result true. Confidence high, maybe 0.99.\n\nProvide JSON? The instruction: set \"result\" to true and provide a confidence score. Likely output something like:\n\n{\n \"result\": true,\n \"confidence\": 0.99,\n \"evidence\": \"Contains the word 'Pokémon' and 'Ghost-type Pokémon'.\"\n}\n\nWe should follow format.",
"type" => "reasoning.text"
}
],
"refusal" => nil,
"role" => "assistant"
},
"native_finish_reason" => "stop"
}
],
"created" => 1766968045,
"id" => "gen-1766968045-7ZIh0NyZLg637VdQ4pRs",
"model" => "openai/gpt-oss-120b:free",
"object" => "chat.completion",
"provider" => "OpenInference",
"usage" => %{
"completion_tokens" => 249,
"completion_tokens_details" => %{
"image_tokens" => 0,
"reasoning_tokens" => 208
},
"cost" => 0,
"cost_details" => %{
"upstream_inference_completions_cost" => 0,
"upstream_inference_cost" => nil,
"upstream_inference_prompt_cost" => 0
},
"is_byok" => false,
"prompt_tokens" => 207,
"prompt_tokens_details" => %{
"audio_tokens" => 0,
"cached_tokens" => 0,
"video_tokens" => 0
},
"total_tokens" => 456
}
}
The first thing to notice is that the input tokens are indeed lower for the second template: 281 prompt tokens for the first template versus only 207 for the second. That was to be expected, but it was nice to validate that assumption, at least.
However, when we look at total token usage, we see a surprise. total_tokens for the first prompt is 376, but the second prompt consumed 456. That is a pretty significant increase. Looking at why, we see reasoning_tokens jump from 82 to 208. Since reasoning tokens are billed as output, we are actually paying considerably more for the second prompt.
Conclusion
The increase in token usage ends up being 21% at a raw token level, but we should also consider the price difference between input and output. At the time of this post, input is billed at $0.04 per 1M tokens, and output at $0.30 to $0.40 per 1M tokens. With output tokens costing roughly 10x input tokens, we are paying far more for the second prompt.
Some quick napkin math:
- 74 fewer input tokens (281 down to 207) vs 154 more output tokens (95 up to 249).
- Normalized to dollar-relative tokens at the 10x price ratio: 74 saved vs 1,540 added.
- 1,540 / 74 ≈ 21x the cost.
Roughly 21x the cost could really add up over time. So, the moral of the story is that we should always challenge our assumptions and test them, despite what feels logical.
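The same napkin math in code, using the exact token counts from the two responses rather than round numbers:

```elixir
# Token deltas taken from the two usage maps above.
input_saved  = 281 - 207   # 74 fewer prompt tokens
output_added = 249 - 95    # 154 more completion tokens

# Prices in dollars per token ($0.04 and $0.40 per 1M tokens).
input_price  = 0.04 / 1_000_000
output_price = 0.40 / 1_000_000

dollars_saved = input_saved * input_price
dollars_added = output_added * output_price

dollars_added / dollars_saved   # ≈ 20.8, i.e. roughly 21x
```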
It’s worth pointing out that this quick evaluation is far from scientific. The results could vary with more complex prompts, different models, etc. But it’s a good start, and we should probably look into a more formal way of tracking prompt token usage and performance in the future.
We hope you enjoyed reading this and we look forward to your feedback on our implementation of Lenses when they are released to the general public.