Data Licensing Agreements for AI Startups: Legal Risks of Training on Third-Party Data in Australia

If your startup is building an AI model, the data you train it on is as important as the model architecture itself. It is also where most of the legal risk lives. In Australia, there is no general right to scrape, copy, or use third-party data for AI training purposes. Every dataset your model ingests carries potential exposure under copyright law, privacy legislation, and contract.

This article explains the legal framework that applies to AI training data in Australia and sets out the key clauses that should appear in any data licensing agreement your startup enters into.

Why Data Licensing Matters More in Australia

Many AI founders assume that if data is publicly available on the internet, it is free to use. That assumption is wrong in most jurisdictions, and it is particularly wrong in Australia.

In October 2025, the Australian Government explicitly rejected a proposed text-and-data mining (TDM) exception to the Copyright Act. Unlike the European Union, which introduced a mandatory TDM exception with opt-out rights under the DSM Directive, and unlike the United States, where fair use arguments provide some (contested) cover, Australia has taken the position that copyright holders’ rights must be preserved in full. AI developers do not get automatic permission to scrape or reproduce copyrighted material for training purposes.

The Attorney-General convened a Copyright & AI Reference Group to explore licensing pathways and transparency standards, but as of early 2026, no new legislative framework is in place. The practical effect is straightforward: if you want to train a model on data you do not own, you need a licence.

Under the Copyright Act 1968 (Cth), the reproduction of a literary, dramatic, musical, or artistic work, or a substantial part of it, without the copyright owner’s permission constitutes infringement. Scraping web content, downloading image libraries, and ingesting text corpora all involve acts of reproduction.

Australia’s fair dealing exceptions are narrow and purpose-specific. They cover research and study, criticism and review, parody and satire, reporting news, and providing legal advice. There is no general fair use defence of the kind available under US copyright law, and the existing fair dealing categories do not naturally accommodate commercial AI training at scale.

The UK case of Getty Images v Stability AI illustrates the stakes. Getty alleged that Stability AI scraped approximately 12 million images without authorisation to train its Stable Diffusion model. While the UK High Court ultimately found against Getty on the specific claims brought, the case confirmed that scraping copyrighted content for AI training engages copyright law and that rights holders will pursue enforcement. In Australia, where there is no TDM exception at all, a similar claim would start from an even stronger position for the copyright holder.

The takeaway for founders is clear. If your training data includes content created by someone else — articles, images, code, audio, videos, databases — you need either a licence from the rights holder or a defensible argument that you fall within one of Australia’s narrow fair dealing exceptions. For commercial AI development, the latter is rarely available.

The Privacy Overlay

Copyright is not the only concern. If your training data contains personal information — and in practice, most large datasets scraped from the web do — the Privacy Act 1988 (Cth) and the Australian Privacy Principles (APPs) apply.

The Office of the Australian Information Commissioner (OAIC) has published specific guidance on privacy obligations for organisations developing and training generative AI models. The key obligations include:

Collection limitation (APP 3). Personal information must be collected by lawful and fair means and only where reasonably necessary for the organisation’s functions. Bulk scraping of personal data from the internet for AI training is difficult to justify under this principle.

Purpose limitation (APP 6). Personal information must be used only for the primary purpose for which it was collected, unless an exception applies. If data was collected for one purpose — say, providing a service to a customer — using it to train an AI model is a secondary use that requires either consent or a reasonable expectation on the part of the individual.

Data quality (APP 10). Organisations must take reasonable steps to ensure personal information is accurate, up-to-date, and complete. Training data scraped at scale is inherently noisy, and inaccurate personal information embedded in a model can create downstream liability.

Automated decision-making transparency. From 10 December 2026, new obligations under APPs 1.7 to 1.9 will require organisations to disclose in their privacy policies when they use personal information in substantially automated decisions that could significantly affect individuals. AI startups building models that inform or make decisions about people will need to comply.

Penalties for serious or repeated privacy breaches can reach the greater of $50 million, three times the benefit obtained, or 30 per cent of adjusted turnover. For an early-stage startup, even the threat of regulatory action can be existential.

Structuring a Data Licensing Agreement

Given the legal landscape, any AI startup acquiring third-party data for training purposes needs a properly structured data licensing agreement. These agreements are not standard software licences — they need to address risks specific to AI model development. The following clauses are essential.

Scope of Licence

The agreement must define precisely what the licensee is permitted to do with the data. At a minimum, it should cover:

  • Permitted uses. Is the data licensed for model training only, or can it also be used for fine-tuning, evaluation, benchmarking, or inference? Can it be used to train derivative models?
  • Exclusivity. Is the licence exclusive, sole, or non-exclusive? For high-value proprietary datasets, exclusivity may be critical to the startup’s competitive position.
  • Sublicensing. Can the licensee sublicense the data to third parties, including cloud providers, contractors, or research partners involved in model development?
  • Field of use. Are there restrictions on the domains or applications in which the model trained on the data can be deployed?

Vague scope definitions create disputes. If the licence says “for AI research purposes” but your startup trains a commercial product, you have a problem.
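
One practical safeguard is to record the negotiated scope in machine-readable form and have the data pipeline refuse any use that falls outside it. The Python sketch below shows one way this might look; the field names and use categories are illustrative assumptions of our own, not drawn from any statute, agreement, or industry standard.

    import dataclasses

    # Illustrative only: field names and use categories are assumptions,
    # not a legal or industry standard.
    @dataclasses.dataclass(frozen=True)
    class LicenceScope:
        licensor: str
        permitted_uses: frozenset      # e.g. {"training", "evaluation"}
        sublicensable: bool = False
        field_of_use: tuple = ()       # empty tuple means no field restriction

    def check_use(scope: LicenceScope, use: str, domain: str) -> None:
        """Refuse, before ingestion, any use that falls outside the licence."""
        if use not in scope.permitted_uses:
            raise PermissionError(f"'{use}' is not a permitted use under this licence")
        if scope.field_of_use and domain not in scope.field_of_use:
            raise PermissionError(f"'{domain}' is outside the licensed field of use")

    # Example: data licensed for training and evaluation in healthcare only.
    scope = LicenceScope(
        licensor="Example Data Co",
        permitted_uses=frozenset({"training", "evaluation"}),
        field_of_use=("healthcare",),
    )
    check_use(scope, "training", "healthcare")    # passes silently
    # check_use(scope, "inference", "finance")    # would raise PermissionError

Encoding scope this way does not replace the contract, but it turns a licence restriction that lives in a PDF into one the engineering team cannot silently overlook.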

Representations and Warranties

The data licensor should represent and warrant that:

  • It owns the data or has the right to license it for AI training purposes.
  • The data does not infringe any third-party intellectual property rights.
  • The data has been collected in compliance with all applicable privacy laws, including the APPs and any equivalent overseas legislation.
  • Where the data contains personal information, all necessary consents have been obtained for the intended use, including for AI model training.
  • The data is free from material inaccuracies, or alternatively, the licensor discloses known limitations.

These warranties are not negotiating niceties. They are the primary mechanism by which the startup shifts the risk of defective or infringing data back to the party best placed to manage it.

Indemnification

The agreement should include a mutual indemnification clause, with particular focus on:

  • IP indemnity from the licensor. If a third party claims that the training data infringes their copyright, trade mark, or other IP rights, the licensor should indemnify the licensee against those claims, including legal costs and any damages awarded.
  • Privacy indemnity. If the data contains personal information that was not lawfully collected or that triggers regulatory action against the licensee, the licensor should bear the cost.
  • Use-based indemnity from the licensee. Conversely, the licensee should indemnify the licensor for claims arising from the licensee’s use of the data outside the scope of the licence.

Indemnification caps and baskets should be negotiated carefully. An uncapped IP indemnity from a small data aggregator may be worthless in practice, so consider whether escrow arrangements or insurance requirements are appropriate.

Data Provenance and Audit Rights

AI startups are increasingly subject to pressure — from investors, customers, and regulators — to demonstrate the provenance of their training data. The agreement should address:

  • Documentation of sources. The licensor should provide a description of the data’s origins, collection methods, and any chain of title from the original rights holders.
  • Audit rights. The licensee should have the right to audit the licensor’s compliance with its representations, including verification that the data was lawfully collected and that necessary consents are in place.
  • Regulatory cooperation. If the OAIC or another regulator investigates the licensee’s use of the data, the licensor should be obligated to cooperate and provide supporting documentation.

Data provenance is rapidly becoming a due diligence item in venture capital fundraising. Investors conducting legal due diligence on AI startups are asking pointed questions about the lawfulness of training data, and a data licensing agreement without provenance protections is a red flag.
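
A lightweight way to keep provenance answers ready for diligence is to generate a structured record the moment each dataset enters the pipeline. A minimal Python sketch follows; every field name here is an illustrative assumption rather than a regulatory or industry schema, and the file path and dataset details are hypothetical.

    import hashlib
    import json
    from dataclasses import dataclass, asdict

    # Every field name below is an illustrative assumption, not a
    # regulatory or industry schema.
    @dataclass
    class ProvenanceRecord:
        dataset_name: str
        source_url: str
        licensor: str
        licence_reference: str        # e.g. agreement title and date
        collection_method: str        # how the licensor obtained the data
        contains_personal_info: bool
        consents_documented: bool
        content_hash: str             # ties the record to the exact bytes licensed

    def record_for(path: str, **meta) -> ProvenanceRecord:
        # Assumes `path` points to the dataset file as delivered.
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        return ProvenanceRecord(content_hash=digest, **meta)

    rec = record_for(
        "corpus.jsonl",               # hypothetical local dataset file
        dataset_name="news-corpus-v1",
        source_url="https://example.com/dataset",
        licensor="Example Data Co",
        licence_reference="Data Licensing Agreement dated 1 July 2026",
        collection_method="direct publisher feed",
        contains_personal_info=True,
        consents_documented=True,
    )
    print(json.dumps(asdict(rec), indent=2))   # store alongside the dataset

The content hash matters: it ties each provenance record to the exact bytes that were licensed, so a later audit can confirm that what was trained on is what was licensed.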

Confidentiality and Security

Training data is often commercially sensitive. The agreement should impose obligations on both parties to:

  • Maintain the confidentiality of the data and any proprietary methods used to compile it.
  • Implement appropriate technical and organisational security measures, consistent with the data’s sensitivity and any applicable regulatory requirements.
  • Notify the other party promptly of any data breach, consistent with the Notifiable Data Breaches scheme under the Privacy Act.
  • Return or destroy the data (and any copies embedded in training pipelines) upon termination of the licence.

The question of whether trained model weights constitute a “copy” of the underlying data is legally unsettled. The agreement should address this explicitly — either by deeming that model weights do not constitute retained data (which benefits the licensee) or by requiring deletion of model weights trained on the data upon termination (which benefits the licensor).

Term, Termination, and Survival

Data licensing agreements for AI training raise unique termination issues. Once data has been used to train a model, it cannot simply be “returned” in any meaningful sense. The agreement should specify:

  • Whether the licensee may continue to use models trained on the data after the licence expires or is terminated.
  • Whether the licensee must retrain models without the licensor’s data upon termination.
  • Which obligations survive termination — typically confidentiality, indemnification, and data security.

These are among the most heavily negotiated provisions in AI data licensing, and there is no market-standard answer. The outcome depends on the relative bargaining positions of the parties and the nature of the data.

Practical Steps for Founders

If your startup is training AI models on third-party data, the minimum steps are:

  1. Audit your training data. Understand what is in your datasets, where it came from, and whether you have a lawful basis to use it.
  2. Licence what you need. Do not assume publicly available means freely usable. Negotiate proper data licensing agreements with provenance protections.
  3. Strip personal information where possible. If your model does not need personal information to function, de-identify or anonymise the data before ingestion (a starting-point sketch follows this list).
  4. Document everything. Maintain records of your data sources, licences, collection methods, and any consent mechanisms. You will need these for investor due diligence, regulatory inquiries, and customer trust.
  5. Prepare for the December 2026 transparency obligations. If your AI model makes or contributes to automated decisions about individuals, update your privacy policy and internal processes to comply with the new APP 1.7 to 1.9 requirements.
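
On step 3, a first pass over text data can be automated with pattern matching, as in the Python sketch below. The patterns are deliberately crude and illustrative; regex scrubbing alone does not amount to de-identification or anonymisation for Privacy Act purposes, so treat this as an engineering starting point, not a compliance measure.

    import re

    # Crude patterns for obvious identifiers. Names, addresses, and
    # quasi-identifiers need far more sophisticated treatment, and residual
    # re-identification risk must still be assessed.
    PATTERNS = {
        "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
        "au_phone": re.compile(r"\b0[2-478](?:[ -]?\d){8}\b"),
        "tfn_shape": re.compile(r"\b\d{3}[ -]?\d{3}[ -]?\d{3}\b"),
    }

    def scrub(text: str) -> str:
        """Replace matches with typed placeholders so records stay readable."""
        for label, pattern in PATTERNS.items():
            text = pattern.sub(f"[{label.upper()}]", text)
        return text

    print(scrub("Contact Jo on 0412 345 678 or jo@example.com"))
    # -> Contact Jo on [AU_PHONE] or [EMAIL]  (note: the name is untouched)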

The cost of getting data licensing right is a fraction of the cost of defending an infringement claim or responding to a regulatory investigation. For AI startups in Australia, clean data is not just good practice — it is a legal necessity.


This article provides general information only and does not constitute legal advice. For guidance on data licensing agreements for AI model development, contact Viridian Lawyers.
