Which legal tasks are safe to delegate to agentic AI today?

The documented-deployment evidence points to three sub-tasks where agentic AI captures durable value: document review and e-discovery, precedent retrieval and case-law research, and deposition-prep synthesis. All three share a structural property: the output requires senior-attorney review before it affects any filing, client communication, or court record. The risk in all three is still present — hallucinated case summaries happen — but the human review gate is architecturally enforced before the output is consequential. The two sub-tasks where the risk profile is worse: legal drafting submitted as final without senior review, and AI-generated citation lists filed without independent verification. Courts have now sanctioned counsel in both Mata v. Avianca (S.D.N.Y. Jun 2023) and Park v. Kim (2nd Cir. Jan 2024) for filing fabricated AI-generated citations. The comparator throughout is vs a junior-associate-drafted equivalent at the same time-to-delivery, not vs zero assistance.

What happened in Mata v. Avianca and why does it matter for law-firm AI deployment?

In June 2023, a federal district court in the Southern District of New York sanctioned attorneys who submitted briefs citing six cases that did not exist. The citations had been generated by ChatGPT. The attorneys had not verified the citations against Westlaw, LexisNexis, or the court's own docket. Judge Castel imposed sanctions of $5,000 on the firm and referred the matter for potential disciplinary review. Park v. Kim, the 2nd Circuit's January 2024 follow-on, extended the pattern: the court found that counsel's reliance on an AI-generated citation without verification violated Rule 11. Taken together, the two cases establish that an AI-generated citation cannot be relied on the way a citation taken from a verified legal database can. The professional duty to verify before filing remains entirely with counsel.

How does ABA Formal Opinion 512 affect AI use in client communications?

ABA Formal Opinion 512, issued July 2024, addresses the use of generative AI in legal practice. On client communications specifically, it reaffirms that Model Rule 1.4 (the communication obligation) cannot be delegated to an AI system: the duty to keep clients reasonably informed and to explain matters to permit informed decision-making is a lawyer's professional obligation. An AI tool may draft client communications, but the supervision, contextual judgment, and ultimate send-decision are the attorney's. FO 512 does not prohibit AI-drafted client communications; it prohibits treating them as equivalent to attorney-supervised communications. The practical boundary is that AI drafts the first pass; a licensed attorney reads and approves before sending.

Are legal-specific AI tools safer than general-purpose models for citation tasks?

More reliable, yes. Safer in an absolute sense, no. Stanford's CodeX and HAI research programs have documented that legal-specific LLMs (models trained on legal corpora, or RAG-augmented systems with access to verified legal databases) hallucinate at lower rates on citation tasks than general-purpose models like baseline GPT-4. Westlaw Precision AI and Lexis+ AI both use retrieval-augmented architectures that ground outputs in verified case-law databases, which structurally reduces (but does not eliminate) citation fabrication. The residual hallucination rate on legal-specific models is still non-trivial; no vendor has published a documented hallucination rate on citation tasks that would allow an attorney to remove the verification step. Rule 11 applies regardless of which model generated the citation.

What should a managing partner check before deploying Harvey or a comparable tool at scale?

Ask four questions that the vendor's accuracy claims typically do not answer. First: what is the firm's independent task baseline for the specific sub-task the tool is being applied to? Vendor benchmarks are not firm-specific. Second: what is the human-review gate between AI output and any filing, court record, or client communication? The gate must be architecturally enforced, not a policy memo. Third: what does the firm's E&O carrier require for AI-final-drafted work? Several carriers are tightening underwriting criteria for filings where AI produced the first draft with no documented senior-review step. Fourth: what is the firm's plan when the tool hallucinates on a consequential task? The incident-response protocol should exist before the first production deployment.

Agentic AI in legal services: billable-hour decomposition

At a glance

Claim

Across the 2025–2026 documented deployments at AmLaw 100 firms, agentic AI captures durable value in three of the six billable-hour sub-tasks (document review, precedent retrieval, deposition prep) and produces a net malpractice-risk increase in two (legal drafting submitted as final, citation generation) vs a junior-associate-drafted equivalent at the same time-to-delivery; the remaining sub-task (client communication) is bounded by professional-conduct rules, not technology.

Supporting figure

Allen & Overy's Harvey AI deployment, disclosed publicly in 2024, spans 43 offices and 3,500 lawyers. The firm reported document-review and precedent-retrieval tasks completing in a fraction of prior time — but has not published final-drafting accuracy data, which is consistent with where the deployment evidence draws the line.

Date

12 May 2026

Verdict

Holding (AM-151)

Next review

10 Aug 2026 (+53d)

Agentic AI has entered Big Law. Allen & Overy deployed Harvey AI across 43 offices and 3,500 lawyers. DLA Piper and Clifford Chance have published formal AI policies. Thomson Reuters has moved Westlaw Precision AI into production at subscribing firms. The tooling is past the pilot stage; the questions that remain are not whether to deploy but which tasks hold and which do not.

The evidence from 2025–2026 deployments now allows a specific decomposition. Three of the six principal billable-hour sub-tasks capture durable value with agentic AI. Two produce a net increase in malpractice risk compared to a junior-associate-drafted equivalent at the same time-to-delivery. The comparator matters: the claim is not that AI produces worse output than zero assistance. It is that AI-generated legal drafting submitted as final, and AI-generated citation lists filed without independent verification, carry higher malpractice exposure than a junior-associate equivalent would at equivalent speed. The remaining sub-task, client communication, is bounded by professional-conduct rules that make the technology choice a secondary concern.

This piece walks through all six sub-tasks with the supporting evidence, addresses the malpractice insurance angle where data allows, and closes with what changes the verdict.

The billable hour decomposed

Law firm economics are built on six categories of billable work. They are not equally automatable, equally risky to automate, or equally bounded by conduct rules.

Document review and e-discovery is the work of identifying relevant documents in large productions, categorising privilege, and flagging issues for senior attorney attention. Review volumes in complex litigation routinely run to millions of pages.

Precedent retrieval and case-law research is the work of locating cases, statutes, regulations, and secondary sources that bear on a legal question, and presenting them in a usable form.

Deposition preparation is the work of synthesising discovery material, prior deposition transcripts, and case theory to prepare an attorney or expert witness for examination.

Legal drafting submitted as final is the work of producing a brief, motion, contract clause, or agreement that goes to a court or counterparty without a documented senior-review step. The qualifier “submitted as final” is load-bearing: a first-draft that goes to a partner for revision is categorically different from a first-draft that goes to the court.

Citation generation is the work of producing the specific citations that appear in a filed document: case name, docket, jurisdiction, reporter, pinpoint.

Client communication is the work of keeping clients informed, explaining legal developments, and obtaining instructions.

Where agentic AI captures durable value

Document review

Allen & Overy’s Harvey deployment, disclosed publicly in 2024 and covered by the Financial Times and Law.com, is the most documented AmLaw-adjacent case. The firm reported document-review and contract-analysis tasks completing in a fraction of prior time across its global network. The pattern is consistent with the structural logic of the task: document review is classification work on large corpora, which is exactly the problem LLMs are well-designed to address at scale. The human-review gate is architecturally enforced because the output of AI review is a priority set for attorney attention, not a filing.

Thomson Reuters’ “Future of Professionals Report 2025” (thomsonreuters.com/legal/future-of-professionals) surveyed legal professionals across the UK and US and found document review and legal research as the two tasks where practitioners most frequently reported time savings above 50%. The report covers a self-selected survey population, which limits the generalisability, but the directional finding is consistent with the deployment evidence.

The residual risk in document review is miscategorisation, not fabrication. An AI system that tags a privileged document as non-privileged, or that classifies a responsive document as non-responsive, creates a disclosure problem. The mitigation is a senior-attorney quality-control step on the AI’s output, not a return to manual review of every document.
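One way to picture that quality-control step is as a random sample of AI-tagged documents that a senior attorney re-reviews. The sketch below is illustrative only: the `TaggedDoc` record, the tag labels, the sample size, and both function names are assumptions made for the example, not details of any documented deployment.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedDoc:
    doc_id: str
    ai_tag: str  # hypothetical labels, e.g. "privileged" or "non_privileged"


def draw_qc_sample(docs, sample_size, seed=0):
    """Return a reproducible random sample of AI-tagged documents for re-review."""
    docs = list(docs)
    rng = random.Random(seed)
    return rng.sample(docs, min(sample_size, len(docs)))


def miscategorisation_rate(sample, attorney_tags):
    """Fraction of sampled documents where the attorney's tag differs from the AI's."""
    if not sample:
        raise ValueError("QC sample is empty")
    disagreements = sum(1 for d in sample if attorney_tags[d.doc_id] != d.ai_tag)
    return disagreements / len(sample)
```

A sample in which the attorney disagrees with one of four AI tags yields a rate of 0.25. What rate is acceptable, and how large the sample needs to be, are matter-specific judgments that the sketch does not attempt to settle.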

Precedent retrieval

Westlaw Precision AI and Lexis+ AI are both retrieval-augmented systems that ground case-law research outputs in their verified legal databases rather than generating citations from model weights alone. The architecture is the relevant fact: the model retrieves from a curated corpus, which structurally reduces (but does not eliminate) hallucination on citation tasks.

Both vendors publish accuracy claims for their retrieval-augmented research tools. Neither has published an independent hallucination rate on citation tasks that would allow an attorney to remove the verification step. The practical position, consistent with Rule 11, is that AI-retrieved precedent is a research starting point that requires attorney verification before any citation is filed, not a verified citation.

Within that constraint, the productivity gain in precedent retrieval is well-evidenced. The Thomson Reuters 2025 report found legal research as the second task (after document review) where practitioners most frequently reported material time savings. DLA Piper and Clifford Chance have both published AI policies that explicitly permit the use of AI research tools for initial case-law synthesis, subject to attorney review before reliance.

Deposition preparation

Deposition preparation is a synthesis task: a practitioner needs to understand the full record on a witness and translate it into a coherent examination strategy. Large-context-window models are well-suited to this work. The output, an attorney’s preparation notes, is internal. It does not go to a court or a counterparty. The risk of hallucination in deposition-prep synthesis is real, but it is contained: the attorney using the preparation materials is exercising judgment in real time during the deposition.

No published AmLaw 100 deployment has specifically quantified deposition-prep outcomes, which means any specific figure here would require the label source:"our-estimate". The directional finding, that the task structure makes it more tractable than drafting and safer than citation generation, is supported by the deployment evidence in aggregate.

Where agentic AI backfires

Legal drafting submitted as final

Mata v. Avianca, decided in the Southern District of New York in June 2023, is the foundational sanction case. Attorneys submitted a brief citing six cases that did not exist. The citations had been generated by ChatGPT. Counsel had not verified them against Westlaw, LexisNexis, or the court’s own docket. Judge Castel imposed monetary sanctions of $5,000 on the firm and referred the matter for potential disciplinary review (S.D.N.Y. Jun 2023, No. 22-cv-1461).

Park v. Kim, decided by the Second Circuit in January 2024, is the immediate follow-on. The court found that counsel’s reliance on an AI-generated citation without verification violated Rule 11. The Second Circuit used more direct language than Mata: the obligation to verify citations before filing is a basic professional duty that AI tools do not modify (2nd Cir. Jan 2024).

Both cases concern citation specifically. Their structural implication extends to any legal text submitted as final without documented senior review. The pattern in both sanctions is not that AI was used; it is that the output of AI use was submitted as final without a verification step. The malpractice risk in legal drafting submitted as final is not the AI itself. It is the removal of the human-review gate that makes the risk measurable.

The comparator in the AM-151 claim is explicit: the risk increase is against a junior-associate-drafted equivalent at the same time-to-delivery. A junior associate working under time pressure produces drafts that contain errors. Those errors are caught by the review gate. When AI output goes to filing without a review gate, the error rate on the output is not the issue; the absence of the gate is.
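What an architecturally enforced gate means in practice can be sketched in a few lines: the filing step refuses any draft that lacks a recorded approval, whoever or whatever produced it. Everything named here (`Draft`, `approve`, `file_with_court`, `ReviewGateError`) is hypothetical and stands in for whatever a firm's document-management system actually provides.

```python
from dataclasses import dataclass
from typing import Optional


class ReviewGateError(Exception):
    """Raised when a draft reaches the filing step without a recorded review."""


@dataclass
class Draft:
    matter_id: str
    text: str
    ai_generated: bool
    approved_by: Optional[str] = None  # hypothetical ID of the reviewing attorney


def approve(draft, reviewer_id):
    """Record the senior attorney who reviewed and approved the draft."""
    if not reviewer_id:
        raise ValueError("a reviewer ID is required")
    draft.approved_by = reviewer_id
    return draft


def file_with_court(draft):
    """Refuse to file unless an approval is on record; the check ignores authorship."""
    if draft.approved_by is None:
        raise ReviewGateError(f"draft for {draft.matter_id} has no recorded review")
    return f"filed:{draft.matter_id}"
```

The gate deliberately does not branch on `ai_generated`. That mirrors the argument above: the exposure comes from the missing gate, not from who or what wrote the draft.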

Citation generation

Stanford’s CodeX and HAI research programs (codex.stanford.edu, hai.stanford.edu) have documented that general-purpose LLMs hallucinate legal citations at rates that would be professionally untenable if relied upon for filing. The research distinguishes between retrieval-augmented systems (Westlaw Precision, Lexis+ AI) and generative-first approaches: the retrieval-augmented systems perform substantially better, but published hallucination rates for legal citation tasks on any system still run high enough to make unverified reliance a Rule 11 violation.

The operational consequence is that citation generation by AI requires a verification step against a verified legal database before any citation appears in a filing. The AI can do the first pass. The pass is not the citation; it is the research starting point. When the AI’s output is treated as the citation rather than the starting point, the risk materialises as it did in Mata and Park.
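The distinction between a first pass and a citation can be made concrete with a small sketch. It assumes a `lookup` callable that returns True only when a citation has been confirmed in a verified legal database; that callable, the `Citation` record, and the placeholder case names are all invented for illustration and do not correspond to any vendor's API or to real cases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    case_name: str
    reporter_cite: str


def split_by_verification(candidates, lookup):
    """Partition AI-proposed citations into (verified, unverified) lists."""
    verified, unverified = [], []
    for c in candidates:
        (verified if lookup(c) else unverified).append(c)
    return verified, unverified


def citations_for_filing(candidates, lookup):
    """Return citations only if every candidate was confirmed; otherwise fail loudly."""
    verified, unverified = split_by_verification(candidates, lookup)
    if unverified:
        names = ", ".join(c.case_name for c in unverified)
        raise ValueError(f"unverified citations, do not file: {names}")
    return verified


# Placeholder data: a stand-in for a verified-database lookup, not a real index.
known_cites = {"1 Ex. 100"}
confirmed = Citation("Example v. Placeholder", "1 Ex. 100")
fabricated = Citation("Invented v. Nonexistent", "2 Ex. 200")
verified, unverified = split_by_verification(
    [confirmed, fabricated], lambda c: c.reporter_cite in known_cites
)
```

Failing the whole batch when any candidate is unverified is a design choice made for the sketch. A database match also confirms only that the case exists, not that it supports the proposition cited, so the attorney's reading of the case remains a separate step.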

Legal-specific models trained on legal corpora (including the models underlying Harvey and Lexis+ AI) perform better on citation tasks than general-purpose models. They do not perform well enough to remove the verification requirement. No published study as of the 12 May 2026 review date has documented a legal AI system achieving citation accuracy sufficient to satisfy Rule 11 without attorney verification.

Where conduct rules bound technology

Client communication sits in a different category. The technology for AI-drafted client communications is available. Attorneys are using it. The binding constraint is not the capability of the tool; it is ABA Formal Opinion 512, issued in July 2024 (americanbar.org).

FO 512 reaffirms that Model Rule 1.4, the attorney’s duty to keep clients reasonably informed, cannot be delegated to an AI system. The duty to explain matters sufficiently to permit informed decision-making belongs to the licensed attorney. FO 512 does not prohibit AI-drafted client communications. It requires that a licensed attorney supervise the content, exercise contextual judgment about what the client needs to know, and make the final send decision.

The practical boundary is structural: AI drafts; attorney reviews and approves. European bar associations apply comparable reasoning under their own conduct frameworks. The French and German bars have published guidance broadly consistent with FO 512’s position on supervision obligations. The UK Solicitors Regulation Authority has noted similar principles in its 2024 AI guidance.

Client communication is bounded by conduct rules rather than technology in a specific sense: the conduct rules draw a line that the technology cannot cross regardless of capability level. A model that produces perfect client communications still requires attorney review before sending, because the obligation is not about output quality; it is about professional responsibility.

The malpractice insurance angle

Several errors-and-omissions carriers are tightening underwriting criteria for legal malpractice coverage where AI tools are involved in final-drafted work. The specifics are not uniformly public. What is documentable is the direction: carriers that previously underwrote legal malpractice without AI-specific exclusions or riders are now asking whether AI-generated drafts were reviewed by a licensed attorney before filing.

The tightening is most visible in renewal negotiations for firms that have disclosed broad AI deployment without specifying the review-gate architecture. Carriers drawing on the Mata and Park sanction record are treating AI-final-drafted work as a separate underwriting category. The synthesis that follows is directional rather than drawn from a published carrier dataset: source:"our-estimate" for the characterisation that E&O underwriting on AI-augmented legal practice is diverging from the pre-2023 standard.

The practical implication for managing partners is that the deployment question and the insurance question are not separable. The review-gate architecture that the deployment requires for conduct-rule compliance is the same architecture that E&O carriers are asking about. Firms that deploy AI without a documented review gate are taking on both the malpractice exposure and the underwriting risk simultaneously.

GAUGE compliance dimension

Running a legal-services AI deployment through the GAUGE framework’s compliance dimension surfaces the structural constraint quickly. GAUGE scores governance, auditability, usage boundaries, guard-rails, and escalation paths. On compliance, the question is whether the deployment can demonstrate, after the fact, that every AI output touching a filed document or client communication was reviewed by a licensed attorney before it became consequential.

Document review, precedent retrieval, and deposition prep pass the compliance dimension under GAUGE because the output is internal and the consequential decision (what to file, what to argue, what to tell the client) remains with the attorney. Legal drafting submitted as final and citation generation fail the compliance dimension when the review gate is absent. Client communication passes the compliance dimension only when the FO 512 supervision requirement is architecturally enforced. The GAUGE compliance dimension is not the only relevant dimension for legal AI, but it is the one where the deployment evidence draws the clearest line.
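The after-the-fact demonstration can be pictured as a query over an audit log. The sketch below is not GAUGE's scoring method, which this piece does not specify; it is a minimal illustration that assumes a hypothetical `OutputRecord` with review and release timestamps and flags any consequential output that was released without a prior recorded review.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Assumed categories: outputs that become consequential once released.
CONSEQUENTIAL = {"filing", "client_communication"}


@dataclass(frozen=True)
class OutputRecord:
    output_id: str
    kind: str  # "filing", "client_communication", or "internal"
    released_at: Optional[datetime]  # None if never filed or sent
    reviewed_at: Optional[datetime]
    reviewer_id: Optional[str]


def compliance_gaps(records):
    """Return IDs of released consequential outputs lacking a review at or before release."""
    gaps = []
    for r in records:
        if r.kind not in CONSEQUENTIAL or r.released_at is None:
            continue  # internal work product or unreleased drafts are out of scope
        reviewed_first = (
            r.reviewed_at is not None
            and r.reviewer_id is not None
            and r.reviewed_at <= r.released_at
        )
        if not reviewed_first:
            gaps.append(r.output_id)
    return gaps
```

An empty result is the evidence the compliance question asks for. A review timestamp that postdates the release counts as a gap, because the output was already consequential by the time anyone looked at it.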

Anti-patterns

Three patterns show up repeatedly in legal AI deployments that go wrong.

Treating Harvey, Lexis+ AI, or Westlaw Precision AI as a substitute for senior partner review. These tools are well-built for the tasks they are designed for. They are not designed to replace the professional judgment that a senior attorney applies when deciding whether a draft is ready to file. The vendor’s accuracy claims measure performance on benchmark tasks; they do not measure performance on any specific firm’s specific matters under specific time pressure.

Relying on vendor accuracy claims without establishing an independent task baseline. A vendor that reports 95% accuracy on document review is reporting performance on the vendor’s evaluation set. The firm’s matters may have different document types, different privilege profiles, different relevance standards. The defensible deployment process establishes a firm-specific baseline before relying on vendor claims, then measures against that baseline in the first 90 days of production.
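A firm-specific baseline reduces to a simple comparison: run the tool on a sample of the firm's own documents that attorneys have already labelled, and set the measured accuracy beside the vendor's figure. The sketch below assumes such a labelled sample exists; the function names and label values are hypothetical.

```python
def accuracy(predictions, gold):
    """Share of tool predictions that match the firm's attorney-assigned labels."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold labels must be non-empty and aligned")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def baseline_report(vendor_claim, firm_predictions, firm_gold):
    """Compare a vendor's claimed accuracy with accuracy measured on firm matters."""
    measured = accuracy(firm_predictions, firm_gold)
    return {
        "vendor_claim": vendor_claim,
        "firm_measured": measured,
        "gap": vendor_claim - measured,  # positive means the tool underperforms the claim
    }


# Illustrative labels only: "p" = privileged, "n" = not privileged.
report = baseline_report(0.95, ["p", "n", "n", "p"], ["p", "n", "p", "p"])
```

With three of four predictions matching, the measured accuracy is 0.75 against a claimed 0.95. A real baseline would use a far larger sample, and a single accuracy figure hides which kind of error dominates, which matters when a missed privileged document costs more than a false flag.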

Using general-purpose consumer AI tools for any client-facing legal work. ChatGPT, Claude.ai consumer tier, and comparable tools are not designed for legal citation tasks. They have no verification layer against a legal database. The Mata and Park sanctions both involved counsel using a general-purpose tool (ChatGPT in Mata) without the verification infrastructure that legal-specific tools provide. The conduct-rule obligations apply regardless of which tool was used; the risk of hallucination is substantially higher with general-purpose tools.

What changes this verdict

The AM-151 claim is on a 90-day review cadence with four specific trigger conditions.

First: an E&O carrier publicly underwriting AI-final-drafted filings with a documented review-gate requirement. This would not eliminate the risk of the two high-risk sub-tasks, but it would change the insurance calculus that currently makes the risk unmanageable for most firms.

Second: a second sanction wave in 2026 H2. If federal district and circuit courts issue additional sanctions for AI-related filing failures at scale, the claim strengthens. If the sanction record remains stable at Mata and Park, the claim holds at current confidence.

Third: a shift in ABA or EU bar guidance post-FO 512. If bar associations issue guidance that tightens or relaxes the client-communication supervision requirement, the classification of that sub-task changes.

Fourth: a published hallucination-rate study from Stanford CodeX, HAI, or a comparable independent research program showing sustained citation-hallucination rates below 5% on legal-specific LLMs across verified task conditions. At that error rate, the case for removing the verification step becomes analytically tractable. No such study exists as of 12 May 2026.

The governance architecture that legal AI deployments require is addressed in the enterprise agentic AI governance playbook. The procurement due-diligence framework for vendor accuracy claims is in the unverified citation chain. For the broader use-cases picture, the use cases index maps the deployment evidence across sectors.


