Supply chain security for ML and AI projects
Machine learning and AI projects have a distinctive supply chain security profile. They share all the risks of ordinary Python projects, but with several additional factors that make them higher-priority targets and harder to secure.
Why ML projects are high-value targets
Cloud compute credentials. ML training jobs run on GPU instances — AWS p3, p4, or g5 instances; GCP A100 pods; Azure NC-series VMs. These instances are expensive to run and have generous compute quotas. An attacker who exfiltrates your AWS access key from an ML development machine can immediately run arbitrary GPU workloads on your account. Cryptocurrency mining at scale, model training for the attacker's own use, or simply burning your budget — all are common outcomes.
API keys with large allowances. OpenAI, Anthropic, and similar API providers bill by usage. A compromised API key can generate thousands of dollars in API costs before you notice. These keys are almost universally stored as environment variables — exactly what supply chain malware looks for.
Model weights and proprietary data. The model checkpoint you've been training for three weeks, the proprietary dataset you spent six months collecting — these live on the filesystem and in S3 buckets with credentials accessible from the training environment. Supply chain malware that can read files can exfiltrate model weights.
The ML-specific dependency problem
A typical ML project installs 100–200 packages. The dependency tree for torch + transformers + diffusers + langchain + accelerate pulls in a staggering number of indirect dependencies from PyPI, Hugging Face, and potentially custom indexes.
# Count how many packages a typical ML project installs (in a fresh virtual environment)
pip install torch transformers diffusers langchain accelerate
pip list --format=freeze | wc -l   # typically 150-200 packages
Each of these 200 packages is a potential attack vector. Each of their maintainer accounts is a potential compromise target.
The new-package problem in ML
ML tooling is young. New packages appear weekly — new agent frameworks, new model wrappers, new quantization tools, new vector database clients. Many have small maintainer teams, limited security resources, and enormous popularity among developers eager to try the latest techniques.
"Just pip install this new RAG library" is a sentence that has preceded multiple supply chain attacks.
The cooling gate is particularly important in ML development. A package published last week by a one-person team with 50 GitHub stars deserves extra scrutiny before it runs on a machine that holds your AWS credentials.
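You can approximate the same check by hand before a one-off install. The sketch below (independent of Veln) asks PyPI's public JSON API when a package's latest release was uploaded; the seven-day threshold is an illustrative choice, not a Veln setting.

# check_package_age.py -- how old is the latest release of a given package?
import json
import sys
from datetime import datetime, timezone
from urllib.request import urlopen

def latest_release_age_days(package):
    # PyPI's JSON API lists upload timestamps for every published file
    with urlopen(f"https://pypi.org/pypi/{package}/json") as resp:
        data = json.load(resp)
    version = data["info"]["version"]
    files = data["releases"][version]
    uploaded = datetime.fromisoformat(files[0]["upload_time_iso_8601"].replace("Z", "+00:00"))
    return (datetime.now(timezone.utc) - uploaded).total_seconds() / 86400

if __name__ == "__main__":
    age = latest_release_age_days(sys.argv[1])
    print(f"{sys.argv[1]}: latest release is {age:.1f} days old")
    if age < 7:
        print("Published within the last week -- review before installing.")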
Practical mitigations
Use a requirements lockfile with hashes for your ML project. This is harder than for ordinary Python projects because ML packages have many platform-specific wheels and some packages don't publish source distributions. uv handles this well:
# Generate a comprehensive lockfile with hashes
uv lock
# Install exactly what the lockfile specifies, without updating it
uv sync --frozen
Isolate training credentials from development credentials. The credentials used to submit training jobs should be different from the credentials on your development machine. Use IAM roles with minimal scope for training (can write to a specific S3 bucket, can launch a specific instance type). Use a separate account or project for development work.
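For example, a submission script can assume a narrowly scoped role right before it talks to AWS, instead of relying on whatever long-lived credentials sit on the developer machine. This is a sketch only; the role ARN, session name, and duration below are placeholders for whatever your training setup uses.

import boto3

def training_session(role_arn):
    # Exchange long-lived developer credentials for short-lived, narrowly scoped ones
    creds = boto3.client("sts").assume_role(
        RoleArn=role_arn,                      # placeholder: your scoped training role
        RoleSessionName="training-job-submit",
        DurationSeconds=3600,
    )["Credentials"]
    return boto3.Session(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )

# The resulting session can only do what the role's policy allows,
# e.g. write checkpoints to one S3 bucket and launch one instance type.
session = training_session("arn:aws:iam::123456789012:role/training-submit")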
Inject API keys at the point of use, not via dotfiles. Keys exported from shell startup files, or auto-loaded from .env files by shell and editor tooling, sit in the environment of every process you launch, including package install scripts. Use your secret manager (AWS Secrets Manager, HashiCorp Vault, etc.) to fetch credentials only when they are needed, rather than keeping them as persistent environment variables.
# Better: fetch the key from a secret manager at the point of use
import boto3
from openai import OpenAI

def get_secret(name):
    return boto3.client("secretsmanager").get_secret_value(SecretId=name)["SecretString"]

client = OpenAI(api_key=get_secret("openai_key"))
# Worse: credentials always in the environment
# import os; api_key = os.environ["OPENAI_API_KEY"]  # available to postinstall scripts
Audit Hugging Face model downloads. The supply chain attack surface for ML extends beyond pip packages to model weights. Hugging Face model repositories can ship custom Python files that execute when the model is loaded with trust_remote_code=True. Leave trust_remote_code=False (the default) unless you have explicitly reviewed the model repository's code:
# Dangerous — executes arbitrary code from the model repository
from transformers import AutoModel
model = AutoModel.from_pretrained("some-model", trust_remote_code=True)

# Safe default
model = AutoModel.from_pretrained("some-model", trust_remote_code=False)
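Beyond leaving trust_remote_code off, you can pin the exact repository commit you reviewed so the downloaded files cannot change underneath you. revision is a standard from_pretrained argument; the model name and commit hash here are placeholders.

from transformers import AutoModel

# Pin the audited commit; a full hash cannot be silently moved the way a branch or tag can
model = AutoModel.from_pretrained(
    "some-model",
    revision="0123456789abcdef0123456789abcdef01234567",  # placeholder commit hash
    trust_remote_code=False,
)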
Use Veln in your ML development environment. The combination of large dependency trees, high-value credentials, and frequent new package installations makes ML environments exactly where Veln's cooling gate and behavioral analysis are most valuable. A one-week-old RAG library that reads OPENAI_API_KEY in its setup.py will be flagged immediately.
For CI training jobs
# training.yml
steps:
  - uses: veln-sh/setup-action@v1
    with:
      license-key: ${{ secrets.VELN_LICENSE_KEY }}
      mode: enforce
  - run: uv sync --frozen  # all packages verified before any training code runs
  - name: Submit training job
    env:
      AWS_ROLE_ARN: ${{ secrets.TRAINING_ROLE_ARN }}  # scoped role, not root credentials
    run: python submit_training.py
ML projects combine large dependency trees with high-value credentials. Veln's cooling gate and behavioral analysis are particularly effective in this environment.