# Data Hubs

> Learn how to compare multiple documents against each other using Copyleaks' private and shared databases to find similarities and prevent plagiarism.

Copyleaks' Data Hubs provide a powerful way to compare multiple documents against each other, allowing you to detect similarities and prevent plagiarism within a large batch of content.

This is particularly useful for educators who want to check whether students have shared work or submitted identical content across a batch of assignments, and for companies that need to find duplication across large document collections.

## How It Works
Copyleaks provides two types of databases for storing and comparing documents:
- **Shared Data Hub**: Global database that contains millions of documents from institutions worldwide.
- **Private Cloud Hub**: Private database that is exclusive to your organization, ensuring that your documents remain confidential and secure.

You can contribute documents to those databases and compare your documents against them.

<Note>
You can use both databases simultaneously to maximize detection coverage while keeping sensitive documents private.
</Note>

## Understanding Your Database Options

You have two database options for storing and comparing your documents:

### Shared Data Hub (Free)
- Contains millions of documents from institutions worldwide
- When you index a document, it becomes available for **everyone** to compare against
- Contributes to the global academic integrity community
- Your documents will be matched against submissions from other institutions

### Private Cloud Hub (Paid)
- Creates a completely **private database** for your organization only
- Your documents stay within your private environment
- Perfect for sensitive or confidential documents
- Only you and your organization can access and compare against these documents
- Built for large organizations looking to securely store and manage documents
- Enables team collaboration with controlled access and user management

<Note>
You can use both databases simultaneously. Your documents can be stored in your Private Cloud Hub while also being compared against the Shared Data Hub for maximum detection coverage.
</Note>

## How Cross-Comparison Works

The process involves two main steps:

1. **Index your documents**: Upload documents to your chosen database using `IndexOnly` mode.
2. **Start the comparison**: Run a scan that compares all indexed documents against each other and your selected databases.

This two-step approach ensures all documents are properly stored before the comparison begins.
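
As a rough sketch of that ordering, the snippet below lists the requests a client would send for a batch: one indexing request per document, followed by a single start request. The endpoint paths are the ones used in the steps later on this page. `plan_requests` is a hypothetical helper name chosen for illustration, not part of any Copyleaks SDK, and it sends nothing over the network.

```python
from typing import List, Tuple

# Base URL taken from the examples on this page.
API_BASE = "https://api.copyleaks.com/v3/scans"


def plan_requests(scan_ids: List[str]) -> List[Tuple[str, str]]:
    """Return (HTTP method, URL) pairs in the order they should be sent.

    Hypothetical helper: it only describes the sequence of calls.
    """
    # Step 1: one IndexOnly submission per document.
    plan = [("PUT", f"{API_BASE}/submit/file/{scan_id}") for scan_id in scan_ids]
    # Step 2: a single request that starts the comparison for the whole batch.
    plan.append(("PATCH", f"{API_BASE}/start"))
    return plan


if __name__ == "__main__":
    for method, url in plan_requests(["my-index-scan-1", "my-index-scan-2"]):
        print(method, url)
```

The start request comes last on purpose: it should only be sent once every indexing request has been confirmed.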

## Get Started

<Steps>
  <Step title="Before you begin">
    Before you start, ensure you have the following:
    - An active Copyleaks account. If you don't have one, **[sign up for free](https://api.copyleaks.com/signup)**.
    - Your API key, which you can find on the **[API Dashboard](https://api.copyleaks.com/dashboard)**.
  </Step>

  <Step title="Installation">
    <InstallSDKs />
  </Step>

  <Step title="Login">
    To perform a scan, we first need to generate an access token. For that, we will use the [**login**](/reference/actions/account/login) endpoint.
    The API key can be found on the [**Copyleaks API Dashboard**](https://api.copyleaks.com/dashboard).

    Upon successful authentication, you will receive a token that must be attached to subsequent API calls via the `Authorization: Bearer <TOKEN>` header.
    This token remains valid for 48 hours.

    <CodeGroup>
        ```http title="HTTP" icon="globe"
        POST https://id.copyleaks.com/v3/account/login/api

        Headers
        Content-Type: application/json

        Body
        {
            "email": "your@email.address",
            "key": "00000000-0000-0000-0000-000000000000"
        }
        ```
        ```bash title="cURL" icon="terminal"
        export COPYLEAKS_EMAIL="your@email.address"
        export COPYLEAKS_API_KEY="your-api-key-here"

        curl --request POST \
          --url https://id.copyleaks.com/v3/account/login/api \
          --header 'Accept: application/json' \
          --header 'Content-Type: application/json' \
          --data "{
            \"email\": \"${COPYLEAKS_EMAIL}\",
            \"key\": \"${COPYLEAKS_API_KEY}\"
          }"
        ```
        ```python title="Python" icon="python"
        from copyleaks.copyleaks import Copyleaks

        EMAIL_ADDRESS = "your@email.address"
        API_KEY = "your-api-key-here"

        # Login to Copyleaks
        auth_token = Copyleaks.login(EMAIL_ADDRESS, API_KEY)
        print("Logged in successfully!\nToken:", auth_token)
        ```
        ```javascript title="JavaScript" icon="square-js"
        const { Copyleaks } = require("plagiarism-checker");

        const EMAIL_ADDRESS = "your@email.address";
        const API_KEY = "your-api-key-here";
        const copyleaks = new Copyleaks();

        // Login function
        function loginToCopyleaks() {
          return copyleaks.loginAsync(EMAIL_ADDRESS, API_KEY).then(
            (loginResult) => {
              console.log("Login successful!");
              console.log("Access Token:", loginResult.access_token);
              return loginResult;
            },
            (err) => {
              console.error('Login failed:', err);
              throw err;
            }
          );
        }

        loginToCopyleaks();
        ```
        ```java title="Java" icon="java"
        import classes.Copyleaks;
        import models.response.CopyleaksAuthToken;

        String EMAIL_ADDRESS = "your@email.address";
        String API_KEY = "00000000-0000-0000-0000-000000000000";

        // Login to Copyleaks (imports and return type match the indexing example on this page)
        try {
            CopyleaksAuthToken authToken = Copyleaks.login(EMAIL_ADDRESS, API_KEY);
            System.out.println("Logged in successfully!");
        } catch (Exception e) {
            System.out.println("Failed to login: " + e.getMessage());
            System.exit(1);
        }
        ```
    </CodeGroup>

    **Response**
    ```json
    {
        "access_token": "<ACCESS_TOKEN>",
        ".issued": "2025-07-31T10:19:40.0690015Z",
        ".expires": "2025-08-02T10:19:40.0690016Z"
    }
    ```

    <Note>
    Save this token! It's valid for 48 hours and can be reused for subsequent API calls.
    </Note>
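
One way to reuse the token safely is to check the `.expires` field of the login response before each batch of calls. The sketch below assumes the timestamp format shown in the response above. `parse_expiry`, `token_is_usable`, and the five-minute safety margin are illustrative choices, not part of the Copyleaks SDK.

```python
from datetime import datetime, timedelta, timezone


def parse_expiry(value: str) -> datetime:
    """Parse a timestamp such as '2025-08-02T10:19:40.0690016Z' as UTC."""
    base, _, fraction = value.rstrip("Z").partition(".")
    # Pad or truncate the fraction to microseconds so strptime accepts it.
    micros = (fraction + "000000")[:6]
    parsed = datetime.strptime(f"{base}.{micros}", "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


def token_is_usable(login_response: dict, now: datetime,
                    margin: timedelta = timedelta(minutes=5)) -> bool:
    """Return True if the cached token is still valid, with a safety margin."""
    return now + margin < parse_expiry(login_response[".expires"])


if __name__ == "__main__":
    cached = {
        "access_token": "<ACCESS_TOKEN>",
        ".expires": "2025-08-02T10:19:40.0690016Z",
    }
    if not token_is_usable(cached, datetime.now(timezone.utc)):
        print("Token expired or about to expire: log in again.")
```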
  </Step>

  <Step title="Index Your Documents">
    For each document you want to include in the comparison, submit it for indexing using one of the submit endpoints (`submit-file`, `submit-url`, or `submit-ocr`). 
    
    Set `properties.action` to `2` (`IndexOnly`) to store the document without scanning it immediately. This avoids consuming scan credits during the indexing phase. You also need to specify which repository to index the document into.

    <Warning>
      **Important**: Any other scanning options (like `internet` or `aiDetection`) must be configured during this indexing step. They cannot be changed later when you start the comparison scan.
    </Warning>
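
If you are indexing files from disk, the request body can be assembled from the raw bytes. The sketch below mirrors the field names used in the HTTP example in this step. `build_index_submission` is a hypothetical helper, and any additional scanning options would need to be added to `properties` here, at indexing time.

```python
import base64
from typing import Any, Dict

INDEX_ONLY = 2  # value of properties.action used on this page


def build_index_submission(content: bytes, filename: str, repository_id: str,
                           sandbox: bool = True) -> Dict[str, Any]:
    """Build the JSON body for an IndexOnly file submission (illustrative)."""
    return {
        # The API expects the file content as a base64 string.
        "base64": base64.b64encode(content).decode("ascii"),
        "filename": filename,
        "properties": {
            "action": INDEX_ONLY,
            "indexing": {"repositories": [repository_id]},
            "sandbox": sandbox,
        },
    }


if __name__ == "__main__":
    body = build_index_submission(b"Hello world!", "document1.txt", "my-repo-id")
    print(body["base64"])
```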

    <CodeGroup>
        ```http title="HTTP" icon="globe"
        PUT https://api.copyleaks.com/v3/scans/submit/file/my-index-scan-1
        Content-Type: application/json
        Authorization: Bearer <YOUR_AUTH_TOKEN>

        {
          "base64": "SGVsbG8gd29ybGQh",
          "filename": "document1.txt",
          "properties": {
            "action": 2,
            "indexing": {
              "repositories": ["my-repo-id"]
            },
            "sandbox": true
          }
        }
        ```
        ```bash title="cURL" icon="terminal"
        curl --request PUT \
            --url https://api.copyleaks.com/v3/scans/submit/file/my-index-scan-1 \
            -H "Authorization: Bearer <YOUR_AUTH_TOKEN>" \
            -H "Content-Type: application/json" \
            -d '{
                  "base64": "SGVsbG8gd29ybGQh",
                  "filename": "document1.txt",
                  "properties": {
                    "action": 2,
                    "indexing": {
                      "repositories": ["my-repo-id"]
                    },
                    "sandbox": true
                  }
                }'
        ```
        ```python title="Python" icon="python"
        from copyleaks.copyleaks import Copyleaks
        from copyleaks.models.submit.document import FileDocument
        from copyleaks.models.submit.properties.scan_properties import ScanProperties
        from copyleaks.models.submit.properties.indexing_properties import IndexingProperties

        scan_id = "my-index-scan-1"
        
        properties = ScanProperties()
        properties.set_action(2)  # IndexOnly
        properties.set_sandbox(True)
        
        indexing = IndexingProperties()
        indexing.add_repository("my-repo-id")
        properties.set_indexing(indexing)

        file_submission = FileDocument(
            base64="SGVsbG8gd29ybGQh", 
            filename="document1.txt", 
            properties=properties
        )
        
        response = Copyleaks.Scans.submit_file(auth_token, scan_id, file_submission)
        print("Document submitted for indexing. Wait for the IndexOnly webhook.")
        print("Scan ID:", scan_id)
        print("Response:", response)
        ```
        ```javascript title="JavaScript" icon="square-js"
        const { Copyleaks } = require('plagiarism-checker');

        const EMAIL_ADDRESS = "your@email.address";
        const API_KEY = "your-api-key-here";

        async function indexDocumentToRepository() {
            const copyleaks = new Copyleaks();
            
            // Login first
            const authToken = await copyleaks.loginAsync(EMAIL_ADDRESS, API_KEY);
            console.log('Logged in successfully!\nToken:', authToken);
            
            // Document to index
            const base64Content = "SGVsbG8gd29ybGQh"; // "Hello world!" in base64
            
            // Submit file for indexing only
            const scanId = "my-index-scan-1";
            const fileSubmission = {
                base64: base64Content,
                filename: "document1.txt",
                properties: {
                    action: 2, // IndexOnly
                    indexing: {
                        repositories: ["my-repo-id"],
                        copyleaksDb: true // Also index to shared database
                    },
                    scanning: {
                        internet: true,
                        repositories: ["my-repo-id"]
                    },
                    sandbox: true
                }
            };
            
            try {
                const result = await copyleaks.submitFileAsync(authToken, scanId, fileSubmission);
                
                console.log('Document submitted for indexing.');
                console.log('Scan ID:', scanId);
                console.log('Repository ID: my-repo-id');
                console.log('Status: Submitted - waiting for IndexOnly webhook');
                
                return result;
            } catch (error) {
                console.error('Failed to index document:', error);
            }
        }

        indexDocumentToRepository();
        ```
        ```java title="Java" icon="java"
        import classes.Copyleaks;
        import models.response.CopyleaksAuthToken;
        import models.submissions.CopyleaksFileSubmissionModel;
        import models.submissions.properties.*;

        public class DataHubIndexingExample {
            private static final String EMAIL_ADDRESS = "your@email.address";
            private static final String API_KEY = "00000000-0000-0000-0000-000000000000";
            
            public static void main(String[] args) {
                try {
                    // Login to Copyleaks
                    CopyleaksAuthToken authToken = Copyleaks.login(EMAIL_ADDRESS, API_KEY);
                    System.out.println("Logged in successfully!");
                    
                    // Document content to index
                    String base64Content = "SGVsbG8gd29ybGQh"; // "Hello world!" in base64
                    
                    // Configure submission properties for indexing
                    SubmissionWebhooks webhooks = new SubmissionWebhooks("https://your-server.com/webhook/{STATUS}");
                    SubmissionProperties properties = new SubmissionProperties(webhooks);
                    properties.setSandbox(true);
                    properties.setAction(SubmissionActions.IndexOnly); // Action 2 = IndexOnly
                    
                    // Configure indexing to repositories
                    SubmissionIndexingRepository indexRepo = new SubmissionIndexingRepository();
                    indexRepo.setId("my-repo-id");
                    SubmissionIndexing indexing = new SubmissionIndexing();
                    // Requires copyleaks-java-sdk SubmissionIndexing.setRepositories (coming soon)
                    indexing.setRepositories(new SubmissionIndexingRepository[]{ indexRepo });
                    indexing.setCopyleaksDb(true); // Also index to shared database
                    properties.setIndexing(indexing);
                    
                    // Configure scanning settings (applied during indexing)
                    SubmissionScanningRepository scanRepo = new SubmissionScanningRepository();
                    scanRepo.setId("my-repo-id");
                    SubmissionScanning scanning = new SubmissionScanning();
                    scanning.setInternet(true);
                    scanning.setRepositories(new SubmissionScanningRepository[]{ scanRepo });
                    properties.setScanning(scanning);
                    
                    // Create file submission for indexing
                    String scanId = "my-index-scan-1";
                    CopyleaksFileSubmissionModel fileSubmission = new CopyleaksFileSubmissionModel(
                        base64Content, 
                        "document1.txt", 
                        properties
                    );
                    
                    // Submit file for indexing
                    Copyleaks.submitFile(authToken, scanId, fileSubmission);
                    
                    System.out.println("Document submitted for indexing.");
                    System.out.println("Scan ID: " + scanId);
                    System.out.println("Repository ID: my-repo-id");
                    System.out.println("Status: Submitted - waiting for IndexOnly webhook");
                    System.out.println("Next step: Wait for all documents to be indexed, then call /v3/scans/start");
                    
                } catch (Exception e) {
                    System.out.println("Failed to index document: " + e.getMessage());
                    e.printStackTrace();
                }
            }
        }
        ```
    </CodeGroup>

    You will need to wait for the `IndexOnly` webhook for each document to confirm it has been successfully indexed before proceeding to the next step.
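
A small in-memory tracker is one way to know when every document has been confirmed. The sketch below is illustrative only: `IndexingTracker` is a hypothetical class, and it assumes your webhook handler can extract the scan ID from each `IndexOnly` notification and pass it to `mark_indexed`.

```python
from typing import Dict, Iterable, List, Set


class IndexingTracker:
    """Track which submitted scan IDs still await their IndexOnly webhook."""

    def __init__(self, scan_ids: Iterable[str]) -> None:
        self._order: List[str] = list(scan_ids)
        self._pending: Set[str] = set(self._order)

    def mark_indexed(self, scan_id: str) -> None:
        """Call this from the webhook handler once a document is indexed."""
        self._pending.discard(scan_id)

    @property
    def pending(self) -> List[str]:
        """Scan IDs not yet confirmed, in submission order."""
        return [s for s in self._order if s in self._pending]

    @property
    def ready(self) -> bool:
        return not self._pending

    def start_payload(self) -> Dict[str, object]:
        """Body for the start request; refuses to build it too early."""
        if not self.ready:
            raise RuntimeError("Still waiting for: " + ", ".join(self.pending))
        return {"trigger": list(self._order), "errorHandling": 0}


if __name__ == "__main__":
    tracker = IndexingTracker(["my-index-scan-1", "my-index-scan-2"])
    tracker.mark_indexed("my-index-scan-1")
    print("Ready:", tracker.ready, "Pending:", tracker.pending)
```

In a real deployment the pending set would likely live in a database or cache, since webhooks may arrive on different processes.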
  </Step>

  <Step title="Start Your Cross-Comparison">
    Once all your documents are indexed, make a `PATCH` request to the [`/v3/scans/start`](/reference/actions/authenticity/start/) endpoint. This will begin the comparison scan for all the documents you indexed.

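    Provide the list of `scanId`s from the previous step in the `trigger` array. The `errorHandling` field controls what happens when some of the listed scans are in an error state: `0` cancels the whole request, while `1` ignores the failed scans and starts the rest. See the endpoint reference for the authoritative description.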

    <CodeGroup>
        ```http title="HTTP" icon="globe"
        PATCH https://api.copyleaks.com/v3/scans/start
        Content-Type: application/json
        Authorization: Bearer <YOUR_AUTH_TOKEN>

        {
          "trigger": [
            "my-index-scan-1",
            "my-index-scan-2",
            "my-index-scan-3"
          ],
          "errorHandling": 0
        }
        ```
        ```bash title="cURL" icon="terminal"
        curl --request PATCH \
             --url https://api.copyleaks.com/v3/scans/start \
             -H "Authorization: Bearer <YOUR_AUTH_TOKEN>" \
             -H "Content-Type: application/json" \
             -d '{
                   "trigger": [
                     "my-index-scan-1",
                     "my-index-scan-2",
                     "my-index-scan-3"
                   ],
                   "errorHandling": 0
                 }'
        ```
        ```python title="Python" icon="python"
        import requests

        url = "https://api.copyleaks.com/v3/scans/start"
        payload = {
            "trigger": [
                "my-index-scan-1",
                "my-index-scan-2",
                "my-index-scan-3"
            ],
            "errorHandling": 0
        }
        headers = {
            "Authorization": "Bearer <YOUR_AUTH_TOKEN>",
            "Content-Type": "application/json",
            "Accept": "application/json"
        }

        response = requests.patch(url, json=payload, headers=headers)
        result = response.json()

        print("Cross-comparison started!")
        print("Success:", result.get("success", []))
        print("Failed:", result.get("failed", []))
        
        if result.get("success"):
            print(f"Successfully started {len(result['success'])} scans")
            print("Watch for Completed webhooks for each scan")
        ```
        ```javascript title="JavaScript" icon="square-js"
        const { Copyleaks } = require('plagiarism-checker');

        const EMAIL_ADDRESS = "your@email.address";
        const API_KEY = "your-api-key-here";

        async function startCrossComparison() {
            const copyleaks = new Copyleaks();
            
            // Login first
            const authToken = await copyleaks.loginAsync(EMAIL_ADDRESS, API_KEY);
            console.log('Logged in successfully!\nToken:', authToken);
            
            // Start cross-comparison for indexed documents
            const scanIds = [
                "my-index-scan-1",
                "my-index-scan-2", 
                "my-index-scan-3"
            ];
            
            const startRequest = {
                trigger: scanIds,
                errorHandling: 0
            };
            
            try {
                const result = await copyleaks.startScansAsync(authToken, startRequest);
                
                console.log('Cross-comparison started successfully!');
                console.log('Success:', result.success || []);
                console.log('Failed:', result.failed || []);
                
                if (result.success && result.success.length > 0) {
                    console.log(`Successfully started ${result.success.length} scans`);
                    console.log('Watch for Completed webhooks for each scan');
                }
                
                return result;
            } catch (error) {
                console.error('Failed to start cross-comparison:', error);
            }
        }

        startCrossComparison();
        ```
        ```java title="Java" icon="java"
        import classes.Copyleaks;
        import models.StartScanRequest;
        import models.response.CopyleaksAuthToken;

        public class DataHubStartScanExample {
            private static final String EMAIL_ADDRESS = "your@email.address";
            private static final String API_KEY = "00000000-0000-0000-0000-000000000000";
            
            public static void main(String[] args) {
                try {
                    // Login to Copyleaks (same return type as in the indexing example)
                    CopyleaksAuthToken authToken = Copyleaks.login(EMAIL_ADDRESS, API_KEY);
                    System.out.println("Logged in successfully!");
                    
                    // Prepare list of scan IDs to start
                    String[] scanIds = {
                        "my-index-scan-1",
                        "my-index-scan-2",
                        "my-index-scan-3"
                    };
                    
                    // Create start scan request
                    StartScanRequest startRequest = new StartScanRequest();
                    startRequest.setTrigger(scanIds);
                    startRequest.setErrorHandling(0);
                    
                    // Start cross-comparison
                    var result = Copyleaks.startScans(authToken, startRequest);
                    
                    System.out.println("Cross-comparison started successfully!");
                    System.out.println("Success: " + String.join(", ", result.getSuccess()));
                    System.out.println("Failed: " + String.join(", ", result.getFailed()));
                    
                    if (result.getSuccess().length > 0) {
                        System.out.println("Successfully started " + result.getSuccess().length + " scans");
                        System.out.println("Watch for Completed webhooks for each scan");
                    }
                    
                } catch (Exception e) {
                    System.out.println("Failed to start cross-comparison: " + e.getMessage());
                    e.printStackTrace();
                }
            }
        }
        ```
    </CodeGroup>
  </Step>

  <Step title="Interpreting The Results">
    A successful `200 OK` response from the `start` endpoint will confirm which scans were started. The actual scan results for each document will be delivered asynchronously via the `Completed` webhook, just like a regular scan.

    **Example Success Response from `/v3/scans/start`:**
    ```json
    {
      "success": [
        "my-index-scan-1",
        "my-index-scan-2",
        "my-index-scan-3"
      ],
      "failed": []
    }
    ```
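
Before waiting on `Completed` webhooks, it can help to confirm that every requested scan appears in `success`. The sketch below only relies on the `success` array shown above and treats the entries of `failed` as opaque, since their shape is not documented here. `check_start_response` is a hypothetical helper name.

```python
from typing import Any, Dict, List, Tuple


def check_start_response(requested: List[str],
                         response: Dict[str, Any]) -> Tuple[List[str], List[str]]:
    """Split the requested scan IDs into (started, not_started)."""
    started_ids = set(response.get("success", []))
    started = [s for s in requested if s in started_ids]
    not_started = [s for s in requested if s not in started_ids]
    return started, not_started


if __name__ == "__main__":
    requested = ["my-index-scan-1", "my-index-scan-2", "my-index-scan-3"]
    response = {"success": requested, "failed": []}
    started, not_started = check_start_response(requested, response)
    print("Started:", started)
    print("Not started:", not_started)
```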
  </Step>

  <Step title="Summary">
    You have successfully started a cross-comparison scan between multiple documents in your Data Hub.
  </Step>
</Steps>

## Team Collaboration with Private Cloud Hub

Multiple users can access, scan against, and index to your Private Cloud Hub. Manage permissions and data masking settings through the [admin dashboard](https://admin.copyleaks.com/repositories).

## Best Practices

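- **Plan your scanning options**: Configure options such as `internet` or `aiDetection` during indexing; they cannot be changed when you start the comparison.
- **Monitor indexing progress**: Wait for the `IndexOnly` webhook of every document before starting the comparison.
- **Choose your database strategy**: Decide whether to index into the Private Cloud Hub, the Shared Data Hub, or both.
- **Batch efficiently**: Group related documents, such as the submissions for one assignment, into the same `trigger` list.
- **Respect API limits**: Monitor your usage on the [API dashboard](https://api.copyleaks.com/dashboard).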

## Next Steps

<CardGroup cols={2}>
  <Card title="Create Private Cloud Hub" icon="database" href="https://admin.copyleaks.com/repositories">Set up your own private database for document storage.</Card>
</CardGroup>

## Support

Should you require any assistance, please contact [**Copyleaks Support**](https://help.copyleaks.com/hc/en-us/requests/new) or ask a question on [**Stack Overflow**](https://stackoverflow.com/questions/tagged/copyleaks-api) with the `copyleaks-api` tag.

<br />

<Card title="Schedule a Live Demo" icon="calendar-check" href="https://copyleaks.com/book-a-demo" cta="Book a Demo" arrow="true">
  Want to see how Data Hubs can help you manage and compare your documents? Our technical team can walk you through live examples of setting up a Private Cloud Hub, indexing large batches of content, and running cross-comparisons in a secure environment.
</Card>
