Agentic AI in legal services: what survives the billable-hour decomposition
Three of the six billable-hour sub-tasks capture durable value with agentic AI. Two increase malpractice risk versus a junior-associate equivalent at the same time-to-delivery. One is bounded by conduct rules, not technology. The evidence from AmLaw 100 deployments now allows a clear-eyed breakdown.
Holding · reviewed 12 May 2026 · next +71d
Agentic AI has entered Big Law. Allen & Overy deployed Harvey AI across 43 offices and 3,500 lawyers. DLA Piper and Clifford Chance have published formal AI policies. Thomson Reuters has moved Westlaw Precision AI into production at subscribing firms. The tooling is past the pilot stage; the questions that remain are not whether to deploy but which tasks hold and which do not.
The evidence from 2025–2026 deployments now allows a specific decomposition. Three of the six principal billable-hour sub-tasks capture durable value with agentic AI. Two produce a net increase in malpractice risk compared to a junior-associate-drafted equivalent at the same time-to-delivery. The comparator matters: the claim is not that AI produces worse output than zero assistance. It is that AI-generated legal drafting submitted as final, and AI-generated citation lists filed without independent verification, carry higher malpractice exposure than a junior-associate equivalent would at equivalent speed. The remaining sub-task, client communication, is bounded by professional-conduct rules that make the technology choice a secondary concern.
This piece walks through all six sub-tasks with the supporting evidence, addresses the malpractice insurance angle where data allows, and closes with what changes the verdict.
The billable hour decomposed
Law firm economics are built on six categories of billable work. They are not equally automatable, equally risky to automate, or equally bounded by conduct rules.
Document review and e-discovery is the work of identifying relevant documents in large productions, categorising privilege, and flagging issues for senior attorney attention. Review volumes in complex litigation routinely run to millions of pages.
Precedent retrieval and case-law research is the work of locating cases, statutes, regulations, and secondary sources that bear on a legal question, and presenting them in a usable form.
Deposition preparation is the work of synthesising discovery material, prior deposition transcripts, and case theory to prepare an attorney or expert witness for examination.
Legal drafting submitted as final is the work of producing a brief, motion, contract clause, or agreement that goes to a court or counterparty without a documented senior-review step. The qualifier “submitted as final” is load-bearing: a first draft that goes to a partner for revision is categorically different from a first draft that goes to the court.
Citation generation is the work of producing the specific citations that appear in a filed document: case name, docket, jurisdiction, reporter, pinpoint.
Client communication is the work of keeping clients informed, explaining legal developments, and obtaining instructions.
Where agentic AI captures durable value
Document review
Allen & Overy’s Harvey deployment, disclosed publicly in 2024 and covered by the Financial Times and Law.com, is the most documented AmLaw-adjacent case. The firm reported document-review and contract-analysis tasks completing in a fraction of prior time across its global network. The pattern is consistent with the structural logic of the task: document review is classification work on large corpora, which is the kind of problem LLMs are well suited to address at scale. The human-review gate is architecturally enforced because the output of AI review is a priority set for attorney attention, not a filing.
Thomson Reuters’ “Future of Professionals Report 2025” (thomsonreuters.com/legal/future-of-professionals) surveyed legal professionals across the UK and US and found document review and legal research as the two tasks where practitioners most frequently reported time savings above 50%. The report covers a self-selected survey population, which limits the generalisability, but the directional finding is consistent with the deployment evidence.
The residual risk in document review is miscategorisation, not fabrication. An AI system that tags a privileged document as non-privileged, or that classifies a responsive document as non-responsive, creates a disclosure problem. The mitigation is a senior-attorney quality-control step on the AI’s output, not a return to manual review of every document.
Precedent retrieval
Westlaw Precision AI and Lexis+ AI are both retrieval-augmented systems that ground case-law research outputs in their verified legal databases rather than generating citations from model weights alone. The architecture is the relevant fact: the model retrieves from a curated corpus, which structurally reduces (but does not eliminate) hallucination on citation tasks.
Both vendors publish accuracy claims for their retrieval-augmented research tools. Neither has published an independent hallucination rate on citation tasks that would allow an attorney to remove the verification step. The practical position, consistent with Rule 11, is that AI-retrieved precedent is a research starting point that requires attorney verification before any citation is filed, not a verified citation.
Within that constraint, the productivity gain in precedent retrieval is well-evidenced. The Thomson Reuters 2025 report found legal research as the second task (after document review) where practitioners most frequently reported material time savings. DLA Piper and Clifford Chance have both published AI policies that explicitly permit the use of AI research tools for initial case-law synthesis, subject to attorney review before reliance.
Deposition preparation
Deposition preparation is a synthesis task: a practitioner needs to understand the full record on a witness and translate it into a coherent examination strategy. Large-context-window models are well-suited to this work. The output, an attorney’s preparation notes, is internal. It does not go to a court or a counterparty. The risk of hallucination in deposition-prep synthesis is real, but it is contained: the attorney using the preparation materials is exercising judgment in real time during the deposition.
No published AmLaw 100 deployment has specifically quantified deposition-prep outcomes, which means any specific figure here would require the label source:"our-estimate". The directional finding, that the task structure makes it more tractable than drafting and safer than citation generation, is supported by the deployment evidence in aggregate.
Where agentic AI backfires
Legal drafting submitted as final
Mata v. Avianca, decided in the Southern District of New York in June 2023, is the foundational sanction case. Attorneys submitted a brief citing six cases that did not exist. The citations had been generated by ChatGPT. Counsel had not verified them against Westlaw, LexisNexis, or the court’s own docket. Judge Castel imposed a monetary sanction of $5,000, jointly and severally, on the attorneys and their firm and referred the matter for potential disciplinary review (S.D.N.Y. Jun 2023, No. 22-cv-1461).
Park v. Kim, decided by the Second Circuit in January 2024, is the immediate follow-on. The court found that counsel’s reliance on an AI-generated citation without verification violated Rule 11. The Second Circuit used more direct language than Mata: the obligation to verify citations before filing is a basic professional duty that AI tools do not modify (2nd Cir. Jan 2024).
Both cases concern citation specifically. Their structural implication extends to any legal text submitted as final without documented senior review. The pattern in both sanctions is not that AI was used; it is that the output of AI use was submitted as final without a verification step. The malpractice risk in legal drafting submitted as final is not the AI itself. It is the removal of the human-review gate that makes the risk measurable.
The comparator in the AM-151 claim is explicit: the risk increase is against a junior-associate-drafted equivalent at the same time-to-delivery. A junior associate working under time pressure produces drafts that contain errors. Those errors are caught by the review gate. When AI output goes to filing without a review gate, the error rate on the output is not the issue; the absence of the gate is.
Citation generation
Stanford’s CodeX and HAI research programs (codex.stanford.edu, hai.stanford.edu) have documented that general-purpose LLMs hallucinate legal citations at rates that would be professionally untenable if relied upon for filing. The research distinguishes between retrieval-augmented systems (Westlaw Precision, Lexis+ AI) and generative-first approaches: the retrieval-augmented systems perform substantially better, but published hallucination rates for legal citation tasks on any system still run high enough to make unverified reliance a Rule 11 violation.
The operational consequence is that citation generation by AI requires a verification step against a verified legal database before any citation appears in a filing. The AI can do the first pass. The pass is not the citation; it is the research starting point. When the AI’s output is treated as the citation rather than the starting point, the risk materialises as it did in Mata and Park.
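The verification step described above can be sketched as a gate between the AI's first pass and the filing. This is a minimal illustration under stated assumptions, not any vendor's API: `Citation`, `VERIFIED_DB`, and `gate_citations` are hypothetical names, and the in-memory dictionary stands in for a lookup against a verified legal database that an attorney would perform or supervise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    case_name: str
    docket: str
    court: str


# Hypothetical stand-in for a verified legal database, keyed by docket number.
# A real check would query the research service the firm subscribes to.
VERIFIED_DB = {
    "22-cv-1461": Citation("Mata v. Avianca", "22-cv-1461", "S.D.N.Y."),
}


def gate_citations(proposed):
    """Split AI-proposed citations into (verified, unverified).

    A citation counts as verified only when the database holds a record under
    the same docket whose case name and court also match. Everything else is
    returned for attorney follow-up and must not reach the filing.
    """
    verified, unverified = [], []
    for citation in proposed:
        record = VERIFIED_DB.get(citation.docket)
        if record is not None and record == citation:
            verified.append(citation)
        else:
            unverified.append(citation)
    return verified, unverified


# Usage: one real citation and one placeholder the database does not contain.
ai_first_pass = [
    Citation("Mata v. Avianca", "22-cv-1461", "S.D.N.Y."),
    Citation("Example v. Placeholder", "00-cv-0000", "S.D.N.Y."),
]
ok, needs_review = gate_citations(ai_first_pass)
```

The design point is that the gate fails closed: a citation the database cannot confirm is held back by default rather than passed through with a warning.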
Legal-specific models trained on legal corpora (including the models underlying Harvey and Lexis+ AI) perform better on citation tasks than general-purpose models. They do not perform well enough to remove the verification requirement. No published study as of the 12 May 2026 review date has documented a legal AI system achieving citation accuracy sufficient to satisfy Rule 11 without attorney verification.
Where conduct rules bound technology
Client communication sits in a different category. The technology for AI-drafted client communications is available. Attorneys are using it. The binding constraint is not the capability of the tool; it is ABA Formal Opinion 512, issued in July 2024 (americanbar.org).
FO 512 reaffirms that Model Rule 1.4, the attorney’s duty to keep clients reasonably informed, cannot be delegated to an AI system. The duty to explain matters sufficiently to permit informed decision-making belongs to the licensed attorney. FO 512 does not prohibit AI-drafted client communications. It requires that a licensed attorney supervise the content, exercise contextual judgment about what the client needs to know, and make the final send decision.
The practical boundary is structural: AI drafts; attorney reviews and approves. European bar associations apply comparable reasoning under their own conduct frameworks. The French and German bars have published guidance broadly consistent with FO 512’s position on supervision obligations. The UK Solicitors Regulation Authority has noted similar principles in its 2024 AI guidance.
Client communication is bounded by conduct rules rather than technology in a specific sense: the conduct rules draw a line that the technology cannot cross regardless of capability level. A model that produces perfect client communications still requires attorney review before sending, because the obligation is not about output quality; it is about professional responsibility.
The malpractice insurance angle
Several errors-and-omissions carriers are tightening underwriting criteria for legal malpractice coverage where AI tools are involved in final-drafted work. The specifics are not uniformly public. What is documentable is the direction: carriers that previously underwrote legal malpractice without AI-specific exclusions or riders are now asking whether AI-generated drafts were reviewed by a licensed attorney before filing.
The tightening is most visible in renewal negotiations for firms that have disclosed broad AI deployment without specifying the review-gate architecture. Carriers drawing on the Mata and Park sanction record are treating AI-final-drafted work as a separate underwriting category. This synthesis is directional rather than drawn from a published carrier dataset: the characterisation that E&O underwriting on AI-augmented legal practice is diverging from the pre-2023 standard carries the label source:"our-estimate".
The practical implication for managing partners is that the deployment question and the insurance question are not separable. The review-gate architecture that the deployment requires for conduct-rule compliance is the same architecture that E&O carriers are asking about. Firms that deploy AI without a documented review gate are taking on both the malpractice exposure and the underwriting risk simultaneously.
GAUGE compliance dimension
Running a legal-services AI deployment through the GAUGE framework’s compliance dimension surfaces the structural constraint quickly. GAUGE scores governance, auditability, usage boundaries, guard-rails, and escalation paths. On compliance, the question is whether the deployment can demonstrate, after the fact, that every AI output touching a filed document or client communication was reviewed by a licensed attorney before it became consequential.
Document review, precedent retrieval, and deposition prep pass the compliance dimension under GAUGE because the output is internal and the consequential decision (what to file, what to argue, what to tell the client) remains with the attorney. Legal drafting submitted as final and citation generation fail the compliance dimension when the review gate is absent. Client communication passes the compliance dimension only when the FO 512 supervision requirement is architecturally enforced. The GAUGE compliance dimension is not the only relevant dimension for legal AI, but it is the one where the deployment evidence draws the clearest line.
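The pass/fail statements in the paragraph above can be written down as a small lookup. This is only a sketch of that paragraph's logic, not the GAUGE framework's actual scoring method, which the text does not specify; the sub-task identifiers and the `passes_compliance` function are hypothetical names chosen for illustration.

```python
# Sub-tasks whose AI output stays internal: the consequential decision
# (what to file, argue, or tell the client) remains with the attorney.
INTERNAL_OUTPUT = {"document_review", "precedent_retrieval", "deposition_prep"}

# Sub-tasks whose AI output reaches a court, counterparty, or client, so
# compliance depends on an enforced attorney-review gate.
GATE_DEPENDENT = {
    "drafting_submitted_as_final",
    "citation_generation",
    "client_communication",
}


def passes_compliance(sub_task: str, review_gate_enforced: bool) -> bool:
    """Return True if the sub-task passes the compliance dimension as described."""
    if sub_task in INTERNAL_OUTPUT:
        return True
    if sub_task in GATE_DEPENDENT:
        return review_gate_enforced
    raise ValueError(f"unknown sub-task: {sub_task}")


# Usage: list the sub-tasks that fail when no review gate is in place.
failing = sorted(
    task
    for task in INTERNAL_OUTPUT | GATE_DEPENDENT
    if not passes_compliance(task, review_gate_enforced=False)
)
```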
Anti-patterns
Three patterns show up repeatedly in legal AI deployments that go wrong.
Treating Harvey, Lexis+ AI, or Westlaw Precision AI as a substitute for senior partner review. These tools are well-built for the tasks they are designed for. They are not designed to replace the professional judgment that a senior attorney applies when deciding whether a draft is ready to file. The vendor’s accuracy claims measure performance on benchmark tasks; they do not measure performance on any specific firm’s specific matters under specific time pressure.
Relying on vendor accuracy claims without establishing an independent task baseline. A vendor that reports 95% accuracy on document review is reporting performance on the vendor’s evaluation set. The firm’s matters may have different document types, different privilege profiles, different relevance standards. The defensible deployment process establishes a firm-specific baseline before relying on vendor claims, then measures against that baseline in the first 90 days of production.
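A firm-specific baseline of the kind described above can be as simple as comparing the tool's tags against attorney tags on a sample of the firm's own documents. The sketch below is illustrative only: the label values, the sample data, and the function names are assumptions, and the 0.95 figure is the hypothetical vendor claim used in the paragraph, not a measured result.

```python
def accuracy(firm_labels, ai_labels):
    """Share of sampled documents where the AI tag matches the attorney tag."""
    if not firm_labels or len(firm_labels) != len(ai_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(1 for f, a in zip(firm_labels, ai_labels) if f == a)
    return matches / len(firm_labels)


def missed_privilege(firm_labels, ai_labels):
    """Count documents attorneys tagged privileged that the AI tagged otherwise."""
    return sum(
        1
        for f, a in zip(firm_labels, ai_labels)
        if f == "privileged" and a != "privileged"
    )


def gap_to_vendor_claim(firm_labels, ai_labels, vendor_claim):
    """Positive result: the firm's measured accuracy falls short of the claim."""
    return vendor_claim - accuracy(firm_labels, ai_labels)


# Usage with a made-up five-document sample (illustrative data only).
firm = ["privileged", "responsive", "responsive", "non_responsive", "responsive"]
ai = ["responsive", "responsive", "responsive", "non_responsive", "non_responsive"]

baseline = accuracy(firm, ai)
privilege_misses = missed_privilege(firm, ai)
shortfall = gap_to_vendor_claim(firm, ai, vendor_claim=0.95)
```

Tracking privilege misses separately from overall accuracy reflects the point that miscategorised privilege is the costly error in document review, so a single headline accuracy figure can hide it.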
Using general-purpose consumer AI tools for any client-facing legal work. ChatGPT, Claude.ai consumer tier, and comparable tools are not designed for legal citation tasks. They have no verification layer against a legal database. The Mata and Park sanctions both involved counsel using a general-purpose tool (ChatGPT in Mata) without the verification infrastructure that legal-specific tools provide. The conduct-rule obligations apply regardless of which tool was used; the risk of hallucination is substantially higher with general-purpose tools.
What changes this verdict
The AM-151 claim is on a 90-day review cadence with four specific trigger conditions.
First: an E&O carrier publicly underwriting AI-final-drafted filings with a documented review-gate requirement. This would not eliminate the risk of the two high-risk sub-tasks, but it would change the insurance calculus that currently makes the risk unmanageable for most firms.
Second: a second sanction wave in 2026 H2. If federal district and circuit courts issue additional sanctions for AI-related filing failures at scale, the claim strengthens. If the sanction record remains stable at Mata and Park, the claim holds at current confidence.
Third: a shift in ABA or EU bar guidance post-FO 512. If bar associations issue guidance that tightens or relaxes the client-communication supervision requirement, the classification of that sub-task changes.
Fourth: a published hallucination-rate study from Stanford CodeX, HAI, or a comparable independent research program showing sustained citation-hallucination rates below 5% on legal-specific LLMs across verified task conditions. At that level of accuracy, the case for removing the verification step becomes analytically tractable. No such study exists as of 12 May 2026.
Related reading
The governance architecture that legal AI deployments require is addressed in the enterprise agentic AI governance playbook. The procurement due-diligence framework for vendor accuracy claims is in the unverified citation chain. For the broader use-cases picture, the use cases index maps the deployment evidence across sectors.