Large Language Models (LLMs) are becoming the go-to technology for analyzing, summarizing and searching large volumes of text. But in practice, limitations surface quickly: context windows are finite, throughput bottlenecks slow systems down and even advanced retrieval methods can miss critical content.
Here are the three key pain points in LLM-based document processing, the most important lessons learned from real-world projects and the strategies that have proven successful.
Pain Point 1: Context Size Remains a Bottleneck
Even though today’s long-context models support hundreds of thousands of tokens, that does not mean everything can fit in at once. Research shows that LLMs often fail to consistently use relevant information in very long inputs (the “Lost in the Middle²” effect). If you try to summarize 50 documents at once, important details will inevitably be lost.
Strategy: Use multi-step LLM orchestration approaches such as Map-Reduce or Refine. With Map-Reduce, individual documents or larger chunks are first processed separately (Map) into smaller pieces of information (e.g. by summarization); a consolidation step (Reduce) then aggregates them into a final output. The Map step can be parallelized and is therefore well suited for scaling to massive document collections. The Refine approach processes an initial chunk or document and generates a first draft answer; it then takes the next chunk together with the current draft and refines the draft using the additional information. This is repeated until all chunks are processed. Note that both approaches increase latency.
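A minimal sketch of both patterns, assuming the OpenAI Python SDK; the model names (gpt-4o-mini for the Map step, gpt-4o for consolidation and refinement), the prompts and the thread count are illustrative choices rather than a prescribed setup:

```python
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def summarize_chunk(chunk: str) -> str:
    """Map step: condense one document or chunk into a short partial summary."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative: a small, cheap model is enough here
        messages=[
            {"role": "system", "content": "Summarize the following text in a few sentences."},
            {"role": "user", "content": chunk},
        ],
    )
    return response.choices[0].message.content


def map_reduce_summary(chunks: list[str]) -> str:
    """Map-Reduce: summarize chunks in parallel, then consolidate the results."""
    with ThreadPoolExecutor(max_workers=8) as pool:  # the Map step parallelizes well
        partial_summaries = list(pool.map(summarize_chunk, chunks))

    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative: a larger model for the final consolidation
        messages=[
            {"role": "system", "content": "Combine these partial summaries into one coherent summary."},
            {"role": "user", "content": "\n\n".join(partial_summaries)},
        ],
    )
    return response.choices[0].message.content


def refine_summary(chunks: list[str]) -> str:
    """Refine: start with a draft from the first chunk, then update it chunk by chunk."""
    draft = summarize_chunk(chunks[0])
    for chunk in chunks[1:]:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Refine the draft summary using the additional text. Keep it concise."},
                {"role": "user", "content": f"Draft:\n{draft}\n\nAdditional text:\n{chunk}"},
            ],
        )
        draft = response.choices[0].message.content
    return draft
```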
Pain Point 2: Throughput & Single-Request Latency
A single request may take seconds or even a minute. While this is acceptable in many cases, it helps when users can see early on that their request is being processed. The bigger issue is throughput. Platforms such as Azure OpenAI limit the number of requests or tokens per minute. Once those caps are hit, users can end up waiting 30–90 seconds before processing even begins.
Strategy:
- Use streaming to provide users with early feedback during long single-request latencies, even if throughput remains limited (see the sketch after this list).
- Apply prompt caching, avoiding repeated recomputation of static instructions or context.
- Manage request volume and prioritization around token and request-per-minute budgets.
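A minimal streaming sketch, again assuming the OpenAI Python SDK with an illustrative model name and prompts. Keeping static instructions at the start of the prompt also helps providers that cache prompt prefixes reuse them across requests:

```python
from openai import OpenAI

client = OpenAI()

# Static instructions go first so providers with prompt caching can reuse the prefix.
SYSTEM_PROMPT = "Answer the question using only the provided context."


def answer_with_streaming(question: str, context: str) -> str:
    """Stream the answer so the user sees output after the first token
    instead of waiting for the full completion."""
    stream = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        stream=True,
    )
    answer_parts = []
    for chunk in stream:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content or ""
        print(delta, end="", flush=True)  # immediate feedback; a web UI would push this to the client instead
        answer_parts.append(delta)
    return "".join(answer_parts)
```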
Pain Point 3: More Documents Doesn’t Mean Better Results
Handling more documents does not automatically improve accuracy. Even with hybrid search (semantic + keyword), relevant items can slip through. This creates the classic Top-k dilemma:
- Process only Top 10 → fast, but risk missing half the relevant information.
- Process Top 20 → better coverage, but at least twice the latency.
- Process Top 30 → even more complete, but three times slower and more expensive.
Strategy:
- Offer clear modes: “Fast” (Top 10, low latency, streaming) vs. “Thorough” (Top 20–30, Map-Reduce, slower); see the sketch after this list.
- Involve user expertise: let users filter or select documents to add as context, leveraging their domain knowledge.
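One way to make this trade-off explicit is a small mode configuration that the UI can switch between. This is a sketch under assumptions: hybrid_search is a hypothetical retrieval helper, and map_reduce_summary and answer_with_streaming refer to the sketches above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessingMode:
    top_k: int     # how many retrieved documents to process
    strategy: str  # "single_pass" streams one answer; "map_reduce" trades latency for coverage


MODES = {
    "fast": ProcessingMode(top_k=10, strategy="single_pass"),
    "thorough": ProcessingMode(top_k=30, strategy="map_reduce"),
}


def answer(question: str, mode_name: str = "fast") -> str:
    mode = MODES[mode_name]
    # hybrid_search is a hypothetical helper combining semantic and keyword retrieval.
    documents = hybrid_search(question, top_k=mode.top_k)
    if mode.strategy == "map_reduce":
        return map_reduce_summary(documents)  # see the Map-Reduce sketch above
    return answer_with_streaming(question, "\n\n".join(documents))  # see the streaming sketch above
```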
Lessons Learned from Customer Projects
- Often speed outweighs completeness
Assess requirements together with users. If speed is preferred, the default setup should optimize for “time to first token.”
- Transparency builds trust
When thorough modes take longer, display this information in the UI. Users accept the trade-off if it is explicitly communicated.
- Mixing model sizes improves efficiency
Smaller, cheaper models are ideal for preprocessing (classification, ranking). Larger models should be reserved for final answers. In a Map-Reduce strategy, this reduces cost and latency without sacrificing quality; a minimal sketch follows after this list.
- Search still matters
Users are already familiar with search tools. Integrating filtering and selection into the workflow ensures higher relevance and reduces unnecessary processing.
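A minimal sketch of this division of labor, assuming the OpenAI Python SDK; gpt-4o-mini and gpt-4o merely stand in for a “small” and a “large” model, and the 0-10 relevance prompt is illustrative:

```python
from openai import OpenAI

client = OpenAI()


def rank_relevance(question: str, document: str) -> int:
    """Preprocessing with a small, cheap model: score document relevance from 0 to 10."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative small model
        messages=[
            {"role": "system", "content": "Rate how relevant the document is to the question on a scale of 0-10. Reply with a single number."},
            {"role": "user", "content": f"Question: {question}\n\nDocument:\n{document}"},
        ],
    )
    try:
        return int(response.choices[0].message.content.strip())
    except ValueError:
        return 0  # treat unparsable replies as irrelevant


def answer_from_best(question: str, documents: list[str], keep: int = 10) -> str:
    """The small model filters; the large model only sees what survived."""
    ranked = sorted(documents, key=lambda d: rank_relevance(question, d), reverse=True)
    context = "\n\n".join(ranked[:keep])
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative large model for the final answer
        messages=[
            {"role": "system", "content": "Answer the question using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```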
Practical Examples
- Customer Service Knowledge: Agents get an immediate response based on the Top 10 documents. For complex issues, they switch to “Thorough Mode,” which uses Map-Reduce over 30 documents.
- Legal & Compliance: Completeness is critical. In default mode, retrieval is “Thorough,” with additional filters by time period or document type to improve results.
- Technical Documentation: Smaller models cluster and tag sections first, before a larger model generates the final summary. This prevents irrelevant material from overwhelming the context.
Key Takeaways
- Context limits remain the choke point: Even with long context windows, structured retrieval is essential.
- Throughput is the real bottleneck: Streaming, caching, and queue management are critical.
- More documents = more cost and latency: The trade-off must be made transparent to users.
- Prioritize user expectations: Most users value speed over completeness.
²Source: https://arxiv.org/abs/2307.03172