
Do Cursor and vibe coding tools send your code to AI models? A developer-focused data flow guide

A developer-focused, data-driven guide to what Cursor, Lovable, and GitHub Copilot send to model providers, what gets stored, what can be used for training, and how to reduce risk.

Kimmo Ihanus
14 min read

Do Cursor and vibe coding tools send your code to AI models and crawlers?

Yes, most AI coding tools send your prompt plus some code context to their servers and then to a language model provider to generate a response. Whether that data is stored or used for training depends on the tool, your plan, and your settings, and "AI crawlers" are a separate pipeline that can index whatever you publish publicly on the web.

This article breaks down the common data flows behind tools like Cursor, Lovable, and GitHub Copilot, what "privacy mode" usually does and does not mean, and a practical checklist to reduce risk without giving up the productivity gains.

What data typically leaves your machine when you use AI coding tools?

Most tools need context to be useful. In practice, an "AI request" can include:

  • Your prompt: What you typed into chat, a command palette, or an agent task.
  • Nearby code: The current file, the selected range, or the function you are editing.
  • Relevant files: Retrieved automatically by language server signals, repo search, or semantic indexing.
  • Conversation history: Prior prompts and tool outputs to keep the thread coherent.
  • Metadata and telemetry: IDE version, feature usage, error logs, and performance metrics.

Some tools also maintain a codebase index (embeddings) to answer questions across a whole repo. This is not the same thing as uploading your entire repo as plaintext forever, but it does mean your code is processed and representations may be stored.
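To make that concrete, here is a purely illustrative sketch of what a single request might carry. The field names are hypothetical and do not match any specific vendor's schema; the point is that prompt, nearby code, retrieved files, conversation history, and telemetry travel together.

```python
# Hypothetical shape of one AI coding request. Field names are illustrative,
# not any vendor's actual API schema.
inference_request = {
    "prompt": "Why does parse() return None for empty input?",
    "active_file": {
        "path": "src/parser.py",
        "selection": "def parse(rows): ...",  # the code you highlighted
    },
    "retrieved_context": [
        # chunks pulled in by repo search or semantic indexing
        {"path": "src/models.py", "snippet": "class Row: ..."},
    ],
    "conversation_history": [
        {"role": "user", "content": "Refactor parse() to stream rows."},
    ],
    "telemetry": {"ide_version": "1.2.3", "feature": "chat"},
}
```

Everything in a structure like this leaves your machine at least for the duration of the request; whether it is retained or trained on is a separate question answered by each vendor's policy.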

AI crawlers vs model training vs product logs: what’s the difference?

Developers often use "AI crawlers" as shorthand for "my code is feeding AI." It helps to separate four different pipelines:

  • Inference request: your prompt plus context, sent to a model to get an answer now. Typical risk: sensitive data can leave your environment.
  • Retention/logging: the tool provider and/or model provider stores inputs and outputs for debugging, abuse prevention, and analytics. Typical risk: data can persist longer than expected.
  • Model training: stored data is used to train future models. Typical risk: your code might influence future outputs.
  • Web crawling/indexing: bots crawl public URLs (docs, repos, blog posts) to build search or training datasets. Typical risk: anything public can be collected, whether or not you use an AI coding tool.

The key takeaway: even if model training is off, data can still leave your environment for inference; and even if you never use an AI coding tool, public code and docs can still be crawled.

If you’re trying to understand the “crawler side” in more detail (training vs indexed search vs agentic scraping), Superlines has a clear breakdown of how AI crawlers and bots read your site differently from humans.

What model providers do with API data (example: OpenAI)

Even if your IDE vendor offers a privacy mode, your data often still touches a model provider. Providers typically publish separate policies for:

  • Whether API inputs and outputs are used for training by default
  • How long inputs and outputs may be retained for abuse monitoring and service operation
  • Whether you can request zero data retention for eligible endpoints

For example, OpenAI states for its business offerings (including the API Platform) that it does not train models on business data by default, and that it may retain API inputs and outputs for up to 30 days for eligible endpoints, with a zero data retention (ZDR) option for qualifying use cases.

Source: OpenAI: Enterprise privacy at OpenAI
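As a minimal sketch of the inference path itself, the example below sends a prompt straight to the OpenAI API with the official Python SDK. The model name and prompt are placeholders, and the comments restate the policy points above; nothing here is enforced in code.

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Even when you hold the key yourself, the prompt (and any pasted code) transits
# OpenAI's servers for inference. Per OpenAI's enterprise privacy page, API data
# is not used for training by default but may be retained for up to 30 days for
# eligible endpoints, unless zero data retention (ZDR) applies.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Explain this stack trace: ..."}],
)
print(response.choices[0].message.content)
```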

What Cursor sends, stores, and trains on (and how Privacy Mode changes it)

Cursor documents its data handling in two places: its Data Use overview and its Security page.

What Cursor says happens in Privacy Mode

Cursor states that if you enable Privacy Mode:

  • Zero data retention is enabled for their model providers.
  • Cursor may store some code data to provide features, but your code is not used for training by Cursor or third parties.

Source: Cursor: Data Use & Privacy Overview

Cursor also states that privacy mode is a guarantee: code data is never stored by model providers or used for training, and it notes that more than 50% of users have privacy mode enabled.

Source: Cursor: Security

What Cursor says happens even if you bring your own API key

Cursor explicitly notes:

  • "Even if you use your API key, your requests will still go through our backend. That's where we do our final prompt building."

Source: Cursor: Data Use & Privacy Overview

This matters for enterprise threat models, because "BYO key" is not the same thing as "direct-to-provider from my device."

What Cursor says about codebase indexing and embeddings

Cursor states that if you choose to index your codebase:

  • It uploads code in chunks to compute embeddings; the plaintext for that embedding request is not retained after the request completes (a vendor-neutral sketch of this pattern follows below).
  • Embeddings and some metadata (for example hashes and file names) may be stored.
  • It also describes temporary encrypted caching of file contents to reduce latency, and provides guidance on excluding files with .cursorignore.

Sources: Cursor: Data Use & Privacy Overview; Cursor: Security
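For intuition only, here is a vendor-neutral sketch of the general chunk, hash, and embed pattern described in the bullets above. It is not Cursor's actual pipeline: embed() is a stand-in for a real embedding model, and the point is that only hashes and vectors are kept, not the plaintext chunks.

```python
import hashlib

def embed(text: str) -> list[float]:
    # Stand-in for a real embedding model; derives a fake vector from the text
    # so the example stays self-contained and runnable.
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:8]]

def index_source(path: str, source: str, chunk_size: int = 400) -> list[dict]:
    """Split a file into chunks and keep only hashes and vectors, not plaintext."""
    records = []
    for i in range(0, len(source), chunk_size):
        chunk = source[i : i + chunk_size]
        records.append({
            "file": path,
            "chunk_hash": hashlib.sha256(chunk.encode()).hexdigest(),
            "vector": embed(chunk),  # the plaintext chunk itself is discarded
        })
    return records

sample_source = "def charge(customer, amount):\n    ...\n"
index = index_source("src/billing.py", sample_source)
print(index[0]["chunk_hash"][:12], len(index[0]["vector"]))
```

Even in this reduced form, file names and repo structure are still metadata worth protecting, which is why excluding high-risk paths matters.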

Practical Cursor-specific controls

  • Turn on Privacy Mode (and confirm it is enforced for teams if you are on a managed plan).
  • Use .cursorignore for high-risk paths (secrets, infra configs, incident docs).
  • Disable codebase indexing for repos that contain regulated data or highly sensitive IP.

Example .cursorignore starter:

```txt
# Secrets and env
.env
.env.*
**/*secret*
**/*secrets*

# Cloud and infra
**/terraform.tfstate*
**/.aws/**
**/.ssh/**

# Incident and internal docs
**/incidents/**
**/postmortems/**
**/security/**
```

What Lovable sends, stores, and trains on (platform + AI Gateway)

Lovable is a "build and deploy" platform, so the data model is different from an IDE plugin. There is typically a persistent workspace, and potentially hosted apps (Lovable Cloud).

What Lovable says about third-party model providers (AI Gateway)

Lovable states that when using its AI Gateway:

  • Inputs (prompts, queries) and related customer data are transmitted to third-party AI providers (for example OpenAI, Google Gemini, OpenRouter) for processing.
  • Transmissions occur on a pass-through basis and Lovable says it does not store raw prompts or responses unless you explicitly save them in your workspace.

Source: Lovable: Privacy Policy

What Lovable says about using customer data for training

Lovable’s Terms state that, except for PII, you grant Lovable a license to use Customer Data for business purposes including developing and training AI/ML models, and it describes an opt-out path (contact them or upgrade to a Business plan with enhanced controls).

Source: Lovable: Terms of Service

Lovable’s Privacy Policy also describes that project artifacts (for example prompts, code snippets, deployment configurations) are used to serve your workspace and, once anonymized or aggregated, to improve models. It also states they are not used to train general-purpose AI models that benefit other customers without your permission.

Source: Lovable: Privacy Policy

Practical Lovable-specific controls

  • Treat your Lovable workspace like a hosted repo: assume prompts, generated code, and deploy configs may be stored to operate the service.
  • Decide whether you are comfortable with the training license for Customer Data under the plan you are on.
  • If you need hard guarantees, use the opt-out path described in their terms or move to a plan with stronger controls.

What GitHub Copilot sends, stores, and trains on (and the settings you can control)

GitHub Copilot is typically embedded in IDEs, which means prompts and code context are sent to generate suggestions and chat answers.

What GitHub says you can control

GitHub states that individual subscribers can configure settings including:

  • Whether prompts and Copilot suggestions are collected and retained (and further processed and shared with Microsoft).
  • Whether to allow or block suggestions matching public code, including how matches are detected.

Source: GitHub Docs: Managing GitHub Copilot policies as an individual subscriber

GitHub also notes that you can configure these settings on GitHub.com (in addition to IDE plugin configuration).

Source: GitHub Docs: Configuring GitHub Copilot in your environment

What GitHub says about model training by default

GitHub states:

  • "By default, GitHub, its affiliates, and third parties will not use your data, including prompts, suggestions, and code snippets, for AI model training."

Source: GitHub Docs: Managing GitHub Copilot policies as an individual subscriber

Why developer privacy settings matter: secrets are already leaking at scale

Even without AI coding tools, secret leakage is a systemic problem. Adding "copy-paste from an AI chat" can amplify it if teams are not careful.

GitGuardian reports:

  • 23,770,171 new secrets detected in public GitHub commits in 2024 (+25% year-over-year).
  • 70% of secrets leaked in 2022 are still valid.
  • 15% of commit authors leaked a secret (in their scanned population).

Source: GitGuardian: State of Secrets Sprawl 2025

The same report cites two external benchmarks that are useful for framing business impact:

  • Verizon DBIR: stolen credentials appear in 31% of all breaches (in the report’s analysis).
  • IBM: breaches involving stolen or compromised credentials take an average of 292 days to identify and remediate.

Sources: Verizon DBIR and IBM, as cited in GitGuardian: State of Secrets Sprawl 2025

What developers are saying across different source categories (and where it can mislead you)

When this topic comes up in practice, the conclusions differ based on where you look. A useful way to de-risk decisions is to weigh multiple source categories:

  • Vendor documentation: strongest signal for what the tool claims it does, plus what settings exist (Cursor, Lovable, GitHub Copilot).
  • Security research and incident data: best for "what actually goes wrong at scale" (for example secret leakage reports).
  • Community platforms: best for spotting edge cases, confusing UX defaults, and unexpected behavior, but often mixes rumors with facts.
  • Professional networks: useful for real-world rollout patterns (policies, approvals, audits), but often light on technical detail.

If you only consume community posts, you can overestimate "training risk" and underestimate "retention risk" (logs, cached context, agent traces). Vendor docs often do the opposite. You usually need both.

A practical checklist: how to reduce risk without banning AI tools

1) Decide which data is never allowed to leave your environment

Most teams start with these categories:

  • API keys, access tokens, private keys, passwords
  • Customer PII and regulated data
  • Incident details, vulnerabilities, and internal threat models
  • Proprietary algorithms and unreleased product plans

2) Make "context minimization" the default workflow

  • Ask for approaches, not code pastes: describe the problem, not the whole file.
  • Share interfaces, not implementations: types, public contracts, example payloads (see the sketch after this list).
  • Use small reproduction projects: a minimal repo without customer data.
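To illustrate the interface-over-implementation point, the snippet below is the kind of context-minimized paste you might send to a chat: a type stub and a made-up sample payload, with no business logic, credentials, or customer data. All names and values are hypothetical.

```python
# What you paste into the AI chat: the public contract plus a sample payload,
# not the implementation. All names and values here are made up.
from dataclasses import dataclass

@dataclass
class InvoiceLine:
    sku: str
    quantity: int
    unit_price_cents: int

def total_cents(lines: list[InvoiceLine]) -> int:
    """Question for the assistant: how should this handle negative quantities?"""
    ...

sample_payload = [
    InvoiceLine(sku="A-100", quantity=2, unit_price_cents=1250),
    InvoiceLine(sku="B-200", quantity=1, unit_price_cents=9900),
]
```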

3) Enforce tool settings, not individual heroics

  • Turn on the strongest privacy mode available (for example Cursor Privacy Mode).
  • Turn off prompt/suggestion retention if your tool supports it (Copilot).
  • Use organization-enforced policies where possible, not "optional guidance."

4) Add secret detection where mistakes actually happen

Secret leaks often come from "one fast commit." Add guardrails:

  • Pre-commit secret scanning and server-side push protection (a minimal hook sketch follows this list)
  • Rotatable credentials by default (short-lived tokens)
  • Automated key rotation playbooks (assume leaks will happen)
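As a starting point, the hook below is a deliberately minimal sketch that scans staged changes for a few obvious secret-shaped strings. It is not a substitute for a dedicated scanner (for example gitleaks or GitGuardian) plus server-side push protection; the patterns are illustrative and will miss plenty.

```python
#!/usr/bin/env python3
# Minimal illustrative pre-commit hook: block commits whose staged diff adds
# obvious secret-like strings. Save as .git/hooks/pre-commit and mark it
# executable. Use a dedicated scanner for real coverage.
import re
import subprocess
import sys

PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                              # AWS access key ID
    re.compile(r"-----BEGIN (RSA|EC|OPENSSH) PRIVATE KEY-----"),  # private keys
    re.compile(r"(?i)(api[_-]?key|secret|token)\s*[:=]\s*['\"][^'\"]{16,}"),
]

def main() -> int:
    diff = subprocess.run(
        ["git", "diff", "--cached", "--unified=0"],
        capture_output=True, text=True, check=True,
    ).stdout
    added = [line[1:] for line in diff.splitlines()
             if line.startswith("+") and not line.startswith("+++")]
    hits = [line for line in added if any(p.search(line) for p in PATTERNS)]
    if hits:
        print("Possible secret in staged changes; commit blocked:")
        for line in hits[:5]:
            print("  ", line.strip()[:80])
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```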

5) Track AI crawlers separately from AI coding tools

AI crawlers hit your websites, docs, and public repos. If you care about what bots collect:

  • Monitor bot user agents at the edge (LLM crawlers, scrapers, SEO tools)
  • Separate bot traffic from human traffic in analytics
  • Decide what to allow in robots.txt based on business goals (an example robots.txt follows this list)
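If you decide to restrict some AI crawlers on public properties, a robots.txt along these lines is a common starting point. Compliance is voluntary and user agent tokens change over time, so verify each against the vendor's current documentation; the tokens below are examples, not an exhaustive list.

```txt
# Example robots.txt: keep normal search open, opt out of some AI crawlers.
# Verify current user agent tokens against each vendor's docs.

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Everything else (including regular search crawlers) remains allowed.
User-agent: *
Allow: /
```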

For a practical marketer/dev-facing overview of which AI user agents matter and how to prepare for the agentic web, see Superlines’ guide: How AI crawlers and bots read your site differently from humans.

Best practice summary (5-point rule)

  1. Separate concerns: inference vs retention vs training vs crawling are different risks.
  2. Minimize context: send the smallest useful slice of code.
  3. Use enforced settings: team-level policies beat personal preferences.
  4. Assume secrets will leak: design for detection and rotation, not perfection.
  5. Monitor bots: web crawlers are their own data pipeline.

Key takeaways

  • Most AI coding tools send prompts plus code context to a server to build the final model request.
  • "Privacy mode" rarely means "nothing leaves your machine". It usually means stricter retention and training rules.
  • Crawlers are independent: public docs and repos can be indexed even if you never use an AI coding tool.
  • Adoption is mainstream: 84% of respondents in Stack Overflow’s 2025 survey are using or planning to use AI tools in development, and more developers distrust AI tool accuracy (46%) than trust it (33%).
  • Security hygiene is the multiplier: secret scanning and credential rotation reduce the downside of human and AI-assisted mistakes.

Source: Stack Overflow: 2025 Developer Survey (AI)

Frequently asked questions

Q: If I enable Cursor Privacy Mode, does that mean my code never leaves my machine? A: No. Cursor states it still sends code context to its servers to provide AI features, but Privacy Mode changes retention and training: Cursor says zero data retention is enabled with its model providers and that your code is not used for training. See Cursor: Data Use and Cursor: Security.

Q: If I bring my own OpenAI API key to Cursor, do requests go directly to OpenAI? A: Cursor states requests still go through Cursor’s backend because prompt building happens on their servers. See Cursor: Data Use.

Q: Does GitHub Copilot train on my prompts and code? A: GitHub states that by default, prompts, suggestions, and code snippets are not used for AI model training, and it also documents settings for prompt and suggestion collection/retention. See GitHub Docs: Managing Copilot policies.

Q: Does Lovable store my prompts and code? A: Lovable’s Privacy Policy describes storing data to operate the service, and says AI Gateway requests are passed to third-party providers and are not stored unless you explicitly save them in your workspace. See Lovable: Privacy Policy. Lovable’s Terms also describe rights to use Customer Data (except PII) for business purposes including model training, with an opt-out path. See Lovable: Terms.

Q: What should I do first if my team wants to use vibe coding tools safely? A: Start with three things: enable the strictest privacy/retention settings, implement secret scanning and rotation, and establish a "no sensitive data in prompts" policy with clear examples. If you cannot enforce settings at the org level, assume drift will happen.

Conclusion

AI coding tools are not "crawlers" in the traditional web sense, but they do move code context through multiple systems to generate answers. The main engineering task is to treat these tools like any other external integration: map the data flows, configure retention and training settings, and put guardrails around secrets and sensitive data.

If you also care about what AI crawlers collect from your public properties, you can measure and separate LLM crawler traffic from human traffic and decide what to allow or block. Tools like Superlines’ AI Search Index are built for monitoring bot and LLM crawler activity across your websites.

If you want more background reading, Superlines also maintains a set of practical guides here: Superlines articles.

