Getting structured input and output from Ollama

If you have ever wired a local LLM into a production-ish Rails app, you have probably met this failure mode: you ask for JSON, the model says "Sure, here is the JSON:", wraps the payload in a Markdown fence, forgets one field, adds another field because it felt helpful, and now your parser is holding the bag.

Ollama gives you useful tools for structured work, especially the format option on chat requests. But the important lesson is not "turn on JSON mode and relax." The real pattern is more boring and more reliable: shape the input so the model understands the task, constrain the output when you can, then parse, sanitize, validate, and retry when the model drifts.

That is the difference between a demo and a feature you can leave running overnight.

Start with the message shape

The chat endpoint takes a messages array. In practice, the cleanest shape is usually one system message for stable instructions, optional alternating user and assistant messages for history, and one final user message that contains the actual task.

That last message matters. If you are using RAG, do not scatter retrieved context and the user's question across multiple consecutive user messages. Bundle them together and put the real task at the end. Models are trained on conversation patterns, and the last concrete user request tends to anchor the response better than a task hidden above a wall of context.

A simple version can be Markdown:

prompt = <<~MARKDOWN
  # Context

  #{retrieved_context}

  # Task

  Answer this customer question using only the context above:

  #{question}
MARKDOWN

messages = [
  {
    role: "system",
    content: "You answer support questions for a Rails application."
  },
  {
    role: "user",
    content: prompt
  }
]

Markdown is often enough when the shape is simple. Headings give the model obvious boundaries, and the result is easy to inspect in logs. The downside is that Markdown does not protect your instruction structure from dynamic user content. If the question itself contains headings, fake instructions, or snippets that look like delimiters, the model can blur the line between your prompt and the user's data.

When that starts to matter, move to a structured envelope.

Use JSON or XML for structured input

JSON is the obvious default because every model has seen endless JSON, every stack can generate it safely, and your application already knows how to escape strings. For many Rails apps, JSON.generate is enough to build a prompt payload that keeps retrieved documents, instructions, and the user question separate.

documents = results.map.with_index(1) do |result, index|
  {
    index: index,
    title: result.title,
    snippets: result.snippets
  }
end

prompt = JSON.generate(
  instructions: "Answer the question using the documents as context.",
  documents: documents,
  question: question
)

The nice thing here is not that the model becomes a JSON parser. It is that your application owns the escaping. Quotes, braces, and newlines inside user content no longer break the prompt structure.
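A quick illustration of that point (the question text is made up): content full of quotes, braces, and newlines round-trips through the envelope without disturbing its structure.

```ruby
require "json"

# A question containing characters that would wreck a hand-built prompt.
question = %(What does "render json:" do?\nAnd why the {braces}?)

prompt = JSON.generate(
  instructions: "Answer the question.",
  question: question
)

# The quotes, braces, and newline are escaped inside the string value,
# so the envelope round-trips intact.
JSON.parse(prompt).fetch("question") == question  # => true
```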

XML can also be a good fit, especially for smaller models. Tags give the model semantic anchors that are sometimes more legible than nested JSON keys, and CDATA gives you a practical way to wrap dynamic text.

builder = Nokogiri::XML::Builder.new(encoding: "UTF-8") do |xml|
  xml.prompt do
    xml.instructions "Answer the question using the documents as context."

    xml.documents do
      results.each.with_index(1) do |result, index|
        xml.document(index: index) do
          xml.title { xml.cdata result.title }
          result.snippets.each do |snippet|
            xml.snippet { xml.cdata snippet }
          end
        end
      end
    end

    xml.question { xml.cdata question }
  end
end

prompt = builder.doc.root.to_xml(
  save_with: Nokogiri::XML::Node::SaveOptions::NO_DECLARATION
)

The rule I like is simple: if you want JSON out, JSON in is usually a good match. If your prompt is mostly long text with named sections, XML can be clearer. Do not make the model translate between too many structures at once unless you have measured that it helps.

For output, prefer JSON Schema over loose JSON mode

Ollama's chat API supports format. Passing "json" asks the model to produce JSON-ish output. Passing a JSON Schema gives the decoder a much stronger shape to follow.

For production code, schema mode is the better starting point.

schema = {
  type: "object",
  required: ["answer", "confidence"],
  additionalProperties: false,
  properties: {
    answer: { type: "string" },
    confidence: {
      type: "string",
      enum: ["low", "medium", "high"]
    }
  }
}

response = Ollama.chat(
  model: "llama3.1",
  format: schema,
  messages: [
    {
      role: "system",
      content: "Return JSON with answer and confidence fields."
    },
    {
      role: "user",
      content: prompt
    }
  ]
)

There is one sharp edge worth knowing about: schema descriptions are not a substitute for prompt instructions. Ollama uses the schema to constrain generation, but field descriptions may not be visible to the model in the way you expect. If a field has business meaning, say that meaning in the prompt too.

For example, do not rely only on a description inside the schema to explain what "confidence" means. Tell the model in prose that high confidence means the answer is directly supported by the supplied documents, medium means the answer is inferred, and low means the documents are insufficient.
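Concretely, that split looks like this: the enum stays in the schema, and the definitions live in the system message (the wording here is just the example semantics from above, not anything Ollama requires).

```ruby
# The schema constrains the shape; the prompt carries the meaning.
system_message = {
  role: "system",
  content: <<~PROMPT
    Return JSON with "answer" and "confidence" fields.
    Set confidence to "high" only when the answer is directly supported
    by the supplied documents, "medium" when it is inferred from them,
    and "low" when the documents are insufficient.
  PROMPT
}
```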

Still parse like the model is trying to embarrass you

Even with format set, treat model output as hostile until proven otherwise. You will eventually see Markdown fences, leading prose, blank content, partial JSON, or valid JSON with the wrong shape.

Start by cleaning the common wrappers.

def parse_llm_json(content)
  raise "Empty LLM response" if content.blank?

  # Strip leading prose first, then the Markdown fence it usually precedes.
  clean = content.strip
  clean = clean.sub(/\A(?:sure[,.!]?\s+)?(?:here is the json[:,!\s]*)?/i, "")
  clean = clean.sub(/\A```(?:json)?\s*/i, "")
  clean = clean.sub(/\s*```\z/, "")

  JSON.parse(clean, decimal_class: BigDecimal)
end

This helper is deliberately small. It handles common junk without pretending to repair every malformed response. Once parsing succeeds, validate the payload against the same schema your application expected in the first place.

parsed = parse_llm_json(response.message.content)
errors = JSON::Validator.fully_validate(schema, parsed)  # JSON::Validator comes from the json-schema gem

if errors.any?
  raise "LLM response did not match schema: #{errors.join(", ")}"
end

That validation step is where structured output becomes operationally useful. Your app should not continue because the response "looks about right." It should continue because the response matches the contract the rest of the code depends on.

Retry with feedback, not vibes

When parsing or validation fails, a retry can work well, but make the retry concrete. Feed the error back to the model and ask for the same response shape again. Keep the original task, include the invalid output if it is safe to do so, and include the parser or validation error.

repair_prompt = JSON.generate(
  instructions: "Repair the response so it matches the schema exactly.",
  schema: schema,
  invalid_response: response.message.content,
  error: errors.join(", ")
)

Do not retry forever. One to three attempts is usually enough to separate transient formatting drift from a prompt or model that simply cannot satisfy the contract. After that, fail loudly, log the raw output, and let the job or request surface a useful error.
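Put together, a bounded retry loop might look like this sketch. A stubbed lambda stands in for the real pipeline so the control flow is visible; in the app, `call_model` would wrap Ollama.chat plus the parse and validate steps above, raising on failure.

```ruby
require "json"

MAX_ATTEMPTS = 3

def call_with_repair(messages, call_model:)
  attempts = 0
  begin
    attempts += 1
    call_model.call(messages)
  rescue StandardError => e
    raise if attempts >= MAX_ATTEMPTS

    # Make the retry concrete: same task, plus the specific failure.
    messages += [{
      role: "user",
      content: JSON.generate(
        instructions: "Repair the response so it matches the schema exactly.",
        error: e.message
      )
    }]
    retry
  end
end

# Stub: fails on the first call, succeeds once the repair message arrives.
calls = 0
flaky_model = lambda do |messages|
  calls += 1
  raise "LLM response did not match schema" if calls == 1

  { "answer" => "ok", "confidence" => "high", "messages_seen" => messages.size }
end

result = call_with_repair([{ role: "user", content: "the task" }], call_model: flaky_model)
result["messages_seen"]  # => 2: the original task plus one repair message
```

If all `MAX_ATTEMPTS` calls fail, the last error propagates, which is exactly the "fail loudly" behavior described above.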

A practical Rails shape

In a Rails app, I like keeping the model call and the response contract near each other. You do not need a grand framework for this. A small object that builds the prompt, calls Ollama, parses the response, and validates the result is enough.

class SupportAnswer
  SCHEMA = {
    type: "object",
    required: ["answer", "confidence"],
    additionalProperties: false,
    properties: {
      answer: { type: "string" },
      confidence: { type: "string", enum: ["low", "medium", "high"] }
    }
  }

  def self.call(question:, documents:)
    new(question:, documents:).call
  end

  def initialize(question:, documents:)
    @question = question
    @documents = documents
  end

  def call
    response = Ollama.chat(
      model: "llama3.1",
      format: SCHEMA,
      messages: messages
    )

    parsed = parse_json(response.message.content)
    validate!(parsed)
    parsed
  end

  private

  attr_reader :question, :documents

  def messages
    [
      {
        role: "system",
        content: system_prompt
      },
      {
        role: "user",
        content: user_prompt
      }
    ]
  end

  def system_prompt
    "Answer with JSON only. Confidence is high only when the documents directly support the answer."
  end

  def user_prompt
    JSON.generate(
      instructions: "Use the documents to answer the question.",
      documents: documents,
      question: question
    )
  end

  def parse_json(content)
    clean = content.to_s.strip
    clean = clean.sub(/\A```(?:json)?\s*/i, "").sub(/\s*```\z/, "")

    JSON.parse(clean, decimal_class: BigDecimal)
  end

  def validate!(parsed)
    errors = JSON::Validator.fully_validate(SCHEMA, parsed)
    return if errors.empty?

    raise "Ollama response did not match schema: #{errors.join(", ")}"
  end
end

The object is not exciting, which is the point. The prompt has a stable shape. The output has a schema. The parser removes common wrappers. The validator protects the rest of the app. When this fails, it fails in one obvious place.

The production rule

Structured LLM work is not about convincing the model to be perfectly obedient. It is about reducing ambiguity before generation and enforcing a contract after generation.

For Ollama, that means putting the real task last, choosing an input format that preserves boundaries, using JSON Schema for output when possible, repeating important field semantics in the prompt, stripping common response wrappers, validating parsed data, and retrying with specific feedback.

Do that, and structured output stops being a hopeful string convention. It becomes a normal integration boundary, which is exactly how boring we want production LLM code to be.

Happy parsing!