Submit File (OCR)
Scan images with textual content to find where the content has been used before and check its originality. Using submit-ocr you can scan various image file types for plagiarism and identify infringed content. Only the textual content in the picture will be scanned and not the graphics. See supported formats.
Request
Path Parameters
A unique scan ID provided by you. We recommend using the same ID in your own database to represent the scan in the Copyleaks database, which helps when debugging incidents. Reusing the same ID for the same file also protects against network problems that could otherwise create multiple scans of that file. Learn more about the criteria for creating a Scan ID.
>= 3 characters
<= 36 characters
Match pattern: [a-z0-9] !@$^&-+%=_(){}<>';:/.",~
|
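The length and character constraints above can be checked on the client before a request is sent. The following is a minimal sketch, not the Copyleaks validation logic: the function name `is_valid_scan_id` is hypothetical, and the allowed symbol set is one reading of the "Match pattern" line. Whether a space or `|` is accepted is unclear from that line, so this sketch rejects both.

```python
import string

# Assumption: symbols copied from the "Match pattern" line above.
# A space and "|" are treated as not allowed because the pattern is ambiguous.
ALLOWED_SCAN_ID_CHARS = set(
    string.ascii_lowercase + string.digits + "!@$^&-+%=_(){}<>';:/.\",~"
)


def is_valid_scan_id(scan_id: str) -> bool:
    """Return True if scan_id is 3-36 characters long and uses only allowed characters."""
    if not 3 <= len(scan_id) <= 36:
        return False
    return all(ch in ALLOWED_SCAN_ID_CHARS for ch in scan_id)
```

For example, `is_valid_scan_id("my-scan-123")` passes, while a two-character ID or one containing uppercase letters does not.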
Headers
Content-Type: application/json
Authorization: Bearer YOUR_LOGIN_TOKEN
Request Body
The request body is a JSON object containing the image file to scan and a properties object to configure the scan.
A base64 data string of a file. If you would like to scan plain text, encode it as base64 and submit it.
Example: aGVsbG8gd29ybGQ=
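As a quick illustration of the encoding step, the sketch below reproduces the example value using only the Python standard library. The helper name `to_base64` is hypothetical.

```python
import base64


def to_base64(data: bytes) -> str:
    """Encode raw bytes as the ASCII base64 string expected in the request body."""
    return base64.b64encode(data).decode("ascii")


print(to_base64("hello world".encode("utf-8")))  # prints aGVsbG8gd29ybGQ=
```

For an image, the same helper would be applied to the bytes read from the file in binary mode.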
The name of the file as it will appear in the Copyleaks scan report. Make sure to include the right extension for your filetype.
<= 255 characters
Example: image.jpg
The language of the text in the image. See supported languages.
Example: en
properties object Required
Configuration options for the scan.
The type of submission action.
0: Scan - Start the scan immediately.
1: Check-Credits - Check how many credits will be used for this scan.
2: Index Only - Only index the file in the Copyleaks internal database or a Copyleaks Repository (depending on your submit request). No credits will be used.
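The three action codes can be given readable names on the client side. This is a minimal sketch under two assumptions: the enum `SubmitAction` is hypothetical, and the JSON key `action` is assumed because the key name is not shown in the text above.

```python
from enum import IntEnum


class SubmitAction(IntEnum):
    """Hypothetical client-side names for the documented action codes."""

    SCAN = 0           # start the scan immediately
    CHECK_CREDITS = 1  # only report how many credits the scan would use
    INDEX_ONLY = 2     # index the file without scanning; no credits used


def action_properties(action: SubmitAction) -> dict:
    # "action" as the JSON key is an assumption, not confirmed by the text above.
    return {"action": int(action)}


print(action_properties(SubmitAction.CHECK_CREDITS))  # prints {'action': 1}
```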
By default, Copyleaks presents the report in text format. If set to true, Copyleaks will also include HTML format.
True: results will be generated in HTML format if possible; otherwise they will be generated in text format.
False: results will be generated in text format.
Add custom developer payload that will then be provided on the webhooks.
<= 512 characters
You can test the integration with the Copyleaks API for free using the sandbox mode.
You will be able to submit content for a scan and get back mock results, simulating the way Copyleaks will work to make sure that you successfully integrated with the API.
Turn this feature off in production environments.
Rate Limiting: This method has a maximum call rate limit of 100 sandbox scans within 1 hour. See the 429 Response code section at the bottom of this page.
Specify the maximum life span of a scan in hours on the Copyleaks servers. When expired, the scan will be deleted and will no longer be accessible.
>= 1
<= 2880
Choose the algorithm goal. You can set this value depending on your use-case.
Available Options:
- 0 - MaximumCoverage: prioritize higher similarity score.
- 1 - MaximumResults: prioritize finding more sources.
Add custom properties that will be attached to your document in a Copyleaks repository.
If this document is found as a repository result, your custom properties will be added to the result.
Example:
[ { "key":"Test1", "value":"Test1" }, ...]
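If your metadata already lives in a plain dictionary, it can be converted to the list-of-objects shape shown in the example. This is a small sketch; the helper name `to_custom_metadata` is hypothetical.

```python
from typing import Dict, List


def to_custom_metadata(pairs: Dict[str, str]) -> List[Dict[str, str]]:
    """Convert a plain dict into the list of key/value objects shown above."""
    return [{"key": key, "value": value} for key, value in pairs.items()]


print(to_custom_metadata({"Test1": "Test1"}))
# prints [{'key': 'Test1', 'value': 'Test1'}]
```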
When specified, the PDF report will be generated in the selected language. Future updates may also apply this setting to the overview and other components.
Currently supported languages:
Code | Language |
---|---|
en | English |
es | Spanish |
de | German |
fr | French |
it | Italian |
pt | Portuguese |
Specify the timezone for the scan time displayed on the final PDF report. The value must be a valid, case-sensitive IANA Time Zone name (e.g., America/New_York). If unspecified, the timezone defaults to the user’s country, or UTC if their country is unknown.
Available Options: See the full List of IANA Time Zones.
Control how sensitive plagiarism detection is, trading thoroughness against scan speed. Choose 1 for a faster scan that reports the results containing the largest amount of plagiarism, or 5 for a slower, more comprehensive scan that also detects the smallest instances.
Optional Values:
Range between 1 (faster) to 5 (slower but more comprehensive)
When set to true, the submitted document will be checked for cheating. If cheating is detected, a scan alert will be added to the completed webhook.
author object
A unique identifier for the author of the content.
course object
A unique course identifier for tracking analytics.
assignment object
A unique assignment identifier for tracking analytics.
institution object
A unique institution identifier for tracking analytics.
webhooks object
The webhooks object is where you define the callback URLs for Copyleaks to send notifications to. This object is required.
A URL that will be called when the scan status changes. Use the {STATUS} placeholder, which will be replaced with completed, error, creditsChecked, or indexed. Example: https://yoursite.com/webhook/{STATUS}
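To see which endpoints your server needs to handle, the placeholder substitution can be reproduced locally. This is an illustrative sketch of the substitution described above, not Copyleaks code; the helper name `expand_status_urls` is hypothetical.

```python
from typing import Dict

# The four status values listed in the description above.
STATUSES = ("completed", "error", "creditsChecked", "indexed")


def expand_status_urls(template: str) -> Dict[str, str]:
    """Map each status to the URL produced by replacing the {STATUS} placeholder."""
    return {status: template.replace("{STATUS}", status) for status in STATUSES}


for status, url in expand_status_urls("https://yoursite.com/webhook/{STATUS}").items():
    print(status, "->", url)
```

Each of the four resulting URLs should be routable on your server before you submit a scan.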
A URL that will be called when a new result is found during the scan.
Custom headers to add to the newResult webhook. Example: [["key", "value"]]
Custom headers to add to the status webhook. Example: [["key", "value"]]
filters object
Fine-tune what kind of results are included in the scan report.
Enable matching of exact words.
Enable matching of nearly identical words (e.g., slow/slowly).
Enable matching of paraphrased content.
Only show results with at least this many copied words.
Block explicit adult content from scan results.
A list of domains to include or exclude from the scan.
0 to include domains, 1 to exclude them.
Allow results from the same domain as the submitted URL.
scanning object
Define the sources to compare your document against.
Compare your content with online sources.
exclude object
Exclude submissions from results if their ID matches the supplied pattern. Matched submissions will be excluded from both internal database and repository results.
Supported pattern wildcards:
*
— Matches any number of characters (including zero).
— Matches exactly one non-whitespace character
Examples:
abc*: Excludes submissions with IDs starting with “abc”
ab..: Excludes submissions with exactly 4-character IDs starting with “ab”
*test: Excludes submissions with IDs ending with “test”
user.*: Excludes submissions with IDs starting with “user.” followed by any characters
include object
Include results only if their scan ID matches the supplied pattern. Matched submissions will be the only submissions included from internal database and repository results.
Supported pattern wildcards:
*
— Matches any number of characters (including zero).
— Matches exactly one non-whitespace character
Examples:
abc*: Includes submissions with IDs starting with “abc”
ab..: Includes submissions with exactly 4-character IDs starting with “ab”
*test: Includes submissions with IDs ending with “test”
user.*: Includes submissions with IDs starting with “user.” followed by any characters
repositories array[object]
default: "[]"
Specify which repositories to scan the document against.
Id of a repository to scan the submitted document against.
Compare the scanned document against MY submissions in the repository.
Compare the scanned document against OTHER users' submissions in the repository.
copyleaksDb object
default: "null"
Configure scanning against the Copyleaks Shared Data Hub.
When set to true: Copyleaks will also compare against content which was uploaded by YOU to the Copyleaks internal database. If true, it will also index the scan in the Copyleaks internal database.
When set to true: Copyleaks will also compare against content which was uploaded by OTHERS to the Copyleaks internal database. If true, it will also index the scan in the Copyleaks internal database.
crossLanguages object
Configure cross-language plagiarism detection.
languages array[object]
default: "[]"
Cross-language plagiarism detection. Choose which languages to scan your content against. For each additional language chosen, pages are deducted from your balance for every page submitted. The language of the original document is always scanned, so it should not be included in the additional languages. See the supported languages list.
Language code for cross language plagiarism detection.
indexing object
Configure where to index the submitted content.
Specify which repositories to index the scanned document to.
Add the submitted document to the Copyleaks Shared Data Hub.
exclude object
Configure what content to exclude from the scan.
Exclude quoted text from the scan.
Exclude citations from the scan.
Exclude referenced text from the scan.
Exclude table of contents from the scan.
When the scanned document is an HTML document, exclude titles from the scan.
When the scanned document is an HTML document, exclude irrelevant text that appears across the site like the website footer or header.
Exclude template text found in other documents. Provide an array of scan IDs (max 3).
pdf object
Configure and request a PDF report of the scan results.
Set to true to generate a PDF report for this scan.
Customize the title for the PDF report (max 256 chars).
A base64 encoded PNG image (max 100kb) to use as a logo in the report.
When set to true the text in the report will be aligned from right to left.
PDF version to generate (1, 2, or 3). See also reportVersion.
Specifies which version of the PDF report to generate (v1, v2, v3, latest). Overrides version.
colors object
Customize the highlight colors used in the PDF report.
Change the color of titles in the PDF.
Format: Color in HEX format (#000000-#FFFFFF)
Change the color of identical match highlights in the PDF.
Format: Color in HEX format (#000000-#FFFFFF)
Change the color of minor changes highlights in the PDF.
Format: Color in HEX format (#000000-#FFFFFF)
Change the color of related meaning (paraphrased) highlights in the PDF.
Format: Color in HEX format (#000000-#FFFFFF)
Change the color of AI content detection highlights in the PDF.
Format: Color in HEX format (#000000-#FFFFFF)
Change the color of writing feedback highlights in the PDF.
Format: Color in HEX format (#000000-#FFFFFF)
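A color value can be checked against the HEX format before it is sent. This is a minimal sketch; the helper name `is_hex_color` is hypothetical, and accepting lowercase hex digits is an assumption, since the text above only shows the uppercase range.

```python
import re

# Six hex digits after a leading "#". Lowercase digits are accepted here as an assumption.
HEX_COLOR = re.compile(r"#[0-9A-Fa-f]{6}")


def is_hex_color(value: str) -> bool:
    """Return True if value looks like #RRGGBB."""
    return HEX_COLOR.fullmatch(value) is not None
```

For example, `is_hex_color("#FFFFFF")` is true, while the shorthand `"#FFF"` is rejected.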
aiGeneratedText object
Configure AI-generated text detection.
Detects whether the text was written by an AI.
Control the behavior of the AI detection (1-3).
explain object
Enable AI Logic feature for AI detection.
sensitiveDataProtection object
Mask driver’s license numbers from the scanned document with # characters. Available for users on a plan for 2500 pages or more.
Supported Types:
Type |
---|
Australia driver’s license number |
Canada driver’s license number |
United Kingdom driver’s license number |
USA driver’s license number |
Japan driver’s license number |
Spain driver’s license number |
Germany driver’s license number |
Mask credentials from the scanned document with # characters. Available for users on a plan for 2500 pages or more.
Supported Types:
Type |
---|
Authentication token |
Amazon Web Services credentials |
Azure JSON Web Token |
HTTP basic authentication header |
Google Cloud Platform service account credentials |
Google Cloud Platform API key |
JSON Web Token |
Encryption key |
Password |
Mask passports from the scanned document with # characters. Available for users on a plan for 2500 pages or more.
Supported Types:
Type |
---|
Canada passport number |
China passport number |
France passport number |
Germany passport number |
Ireland passport number |
Japan passport number |
Korea passport number |
Mexico passport number |
Spain passport number |
United Kingdom passport number |
USA passport number |
Netherlands passport number |
Poland passport number |
Sweden passport number |
Australia passport number |
Singapore passport number |
Taiwan passport number |
Mask network identifiers from the scanned document with # characters. Available for users on a plan for 2500 pages or more.
Supported Types:
Type |
---|
IP address |
Local MAC address |
MAC address |
Mask URLs from the scanned document with # characters. Available for users on a plan for 2500 pages or more.
Mask email addresses from the scanned document with # characters. Available for users on a plan for 2500 pages or more.
Mask credit card numbers and credit card track numbers from the scanned document with # characters. Available for users on a plan for 2500 pages or more.
Mask phone numbers from the scanned document with # characters. Available for users on a plan for 2500 pages or more.
writingFeedback object
Configure the automated Writing Assistant.
Enable automated Writing Assistant for grammar, spelling, etc.
score object
Configure the weighting of different categories in the overall writing score.
Grammar correction category weight. Range: 0.0 to 1.0.
Mechanics correction category weight. Range: 0.0 to 1.0.
Sentence structure correction category weight. Range: 0.0 to 1.0.
Word choice correction category weight. Range: 0.0 to 1.0.
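The documented range for each category weight can be enforced before the request is built. This is a sketch only: the helper name `validate_weights` and the dictionary keys used in the usage line are hypothetical, since the actual property names are not shown in the text above.

```python
from typing import Dict


def validate_weights(weights: Dict[str, float]) -> Dict[str, float]:
    """Raise ValueError if any category weight falls outside 0.0 to 1.0."""
    for name, weight in weights.items():
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"{name} weight {weight} is outside the range 0.0 to 1.0")
    return weights


# Keys below are placeholders for illustration, not confirmed property names.
validate_weights({"grammar": 0.4, "mechanics": 0.2, "sentenceStructure": 0.2, "wordChoice": 0.2})
```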
overview object
Enable Gen-AI Overview feature to extract key insights from the scan data.
Enable Gen-AI Overview feature to extract key insights from the scan data.
Ignore AI detection when generating the scan’s overview. Only applicable if AI detection was enabled.
Ignore plagiarism detection when generating the scan’s overview. Only applicable if plagiarism detection was enabled.
Ignore writing assistant when generating the scan’s overview. Only applicable if the writing assistant was enabled.
Ignore the author’s historical data when generating the scan’s overview. Only applicable if an author ID was added to the request.
aiSourceMatch object
The AI Source Match feature enhances plagiarism detection by identifying online sources that are suspected of containing AI-generated text. This allows you to find instances of potential plagiarism and understand if the matched source content itself might have been created by an AI.
Activates or deactivates the AI Source Match functionality.
Responses
The scan was successfully created and is now processing.
Response Schema
The response contains the following fields:
scannedDocument object
metadata object
enabled object
results object
score object
notifications object
writingFeedback object
textStatistics object
score object
readability object
Example Response
A typical response from this endpoint:
{
  "scannedDocument": {
    "scanId": "scan-id32",
    "totalWords": 2,
    "totalExcluded": 0,
    "credits": 0,
    "expectedCredits": 1,
    "creationTime": "2025-08-05T07:19:08.181236Z",
    "metadata": {
      "filename": "file.jpg"
    },
    "enabled": {
      "plagiarismDetection": true,
      "aiDetection": false,
      "explainableAi": false,
      // ... truncated
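A client will usually pull a few fields out of this response. The sketch below reads only fields visible in the truncated example and uses `.get` throughout, since the rest of the schema is not shown here. The helper name `summarize_scan` is hypothetical, and the sample string is a hand-completed copy of the visible part of the example.

```python
import json
from typing import Any, Dict

# Visible portion of the example response, closed by hand so that it parses.
SAMPLE = """
{"scannedDocument": {"scanId": "scan-id32", "totalWords": 2, "totalExcluded": 0,
 "credits": 0, "expectedCredits": 1,
 "creationTime": "2025-08-05T07:19:08.181236Z",
 "metadata": {"filename": "file.jpg"}}}
"""


def summarize_scan(response: Dict[str, Any]) -> Dict[str, Any]:
    """Extract a few fields from the response, tolerating missing keys."""
    doc = response.get("scannedDocument", {})
    return {
        "scanId": doc.get("scanId"),
        "totalWords": doc.get("totalWords"),
        "expectedCredits": doc.get("expectedCredits"),
        "filename": doc.get("metadata", {}).get("filename"),
    }


print(summarize_scan(json.loads(SAMPLE)))
```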
Examples
PUT https://api.copyleaks.com/v3/scans/submit/ocr/my-scan-123
Content-Type: application/json
Authorization: Bearer YOUR_LOGIN_TOKEN

{
  "base64": "YOUR_BASE64_HERE",
  "filename": "image.jpg",
  "langCode": "en",
  "properties": {
    "webhooks": {
      "status": "https://my-server.com/webhook/{STATUS}"
    },
    "sandbox": true
  }
}
curl --request PUT \
  --url https://api.copyleaks.com/v3/scans/submit/ocr/my-scan-123 \
  --header 'Authorization: Bearer YOUR_LOGIN_TOKEN' \
  --header 'Content-Type: application/json' \
  --data '{
    "base64": "YOUR_BASE64_HERE",
    "filename": "image.jpg",
    "langCode": "en",
    "properties": {
      "webhooks": {
        "status": "https://my-server.com/webhook/{STATUS}"
      },
      "sandbox": true
    }
  }'
import base64

import requests

# Read and encode image file to base64
with open("image.jpg", "rb") as image_file:
    image_content = image_file.read()
    base64_content = base64.b64encode(image_content).decode('utf-8')

url = "https://api.copyleaks.com/v3/scans/submit/ocr/my-scan-123"
payload = {
    "base64": base64_content,
    "filename": "image.jpg",
    "langCode": "en",
    "properties": {
        "webhooks": {
            "status": "https://my-server.com/webhook/{STATUS}"
        },
        "sandbox": True,
        "aiGeneratedText": {
            "detect": True
        }
    }
}
headers = {
    "Authorization": "Bearer YOUR_LOGIN_TOKEN",
    "Content-Type": "application/json",
    "Accept": "application/json"
}

response = requests.put(url, json=payload, headers=headers)
# Raise on any HTTP error status so a failure is not reported as success
response.raise_for_status()
result = response.json()

print("OCR scan submitted successfully!")
if 'scannedDocument' in result:
    print(f"Scan ID: {result['scannedDocument'].get('scanId', 'N/A')}")
    print(f"Total words extracted: {result['scannedDocument'].get('totalWords', 'N/A')}")
    print(f"Detected language: {result['scannedDocument'].get('detectedLanguage', 'N/A')}")

print("Full response:", result)
const { Copyleaks, CopyleaksFileOcrSubmissionModel } = require('plagiarism-checker');
const fs = require('fs');

// Replace with your email and API key.
const EMAIL_ADDRESS = 'your@email.address';
const API_KEY = '00000000-0000-0000-0000-000000000000';

async function submitOcrScan() {
  try {
    // Initialize Copyleaks
    const copyleaks = new Copyleaks();

    // Login to get the authentication token object.
    const loginResult = await copyleaks.loginAsync(EMAIL_ADDRESS, API_KEY);

    const scanId = "my-scan-123";
    const WEBHOOK_URL = "https://my-server.com/webhook";

    // Read an image file and convert it to base64
    const filePath = 'image.jpg';
    const fileContent = fs.readFileSync(filePath);
    const base64Content = fileContent.toString('base64');

    // Create a submission model
    const submission = new CopyleaksFileOcrSubmissionModel(
      'en', // Language of the text in the image
      base64Content,
      'image.jpg',
      {
        sandbox: true,
        webhooks: {
          status: `${WEBHOOK_URL}/{STATUS}`
        }
      }
    );

    // Submit the OCR file for scanning
    await copyleaks.submitFileOcrAsync(loginResult, scanId, submission);
    console.log(`Submission successful. Scan ID: ${scanId}`);

  } catch (error) {
    console.error("An error occurred:", error);
  }
}

submitOcrScan();
import classes.Copyleaks;
import models.submissions.CopyleaksOcrSubmissionModel;
import models.submissions.properties.*;
import java.util.Base64;
import java.nio.file.Files;
import java.nio.file.Paths;

public class OCRSubmissionExample {
    // Replace with your email and API key.
    private static final String EMAIL_ADDRESS = "your@email.address";
    private static final String API_KEY = "00000000-0000-0000-0000-000000000000";

    public static void main(String[] args) {
        try {
            // Login to Copyleaks
            String authToken = Copyleaks.login(EMAIL_ADDRESS, API_KEY);
            System.out.println("Logged successfully!\nToken: " + authToken);

            // Read and encode image file
            byte[] imageContent = Files.readAllBytes(Paths.get("image.jpg"));
            String base64Content = Base64.getEncoder().encodeToString(imageContent);

            // Configure webhooks
            SubmissionWebhooks webhooks = new SubmissionWebhooks("https://my-server.com/webhook/{STATUS}");

            // Create submission properties
            SubmissionProperties properties = new SubmissionProperties(webhooks);
            properties.setSandbox(true);

            // Configure AI detection
            SubmissionAIGeneratedText aiGeneratedText = new SubmissionAIGeneratedText();
            aiGeneratedText.setDetect(true);
            properties.setAiGeneratedText(aiGeneratedText);

            // Configure scanning
            SubmissionScanning scanning = new SubmissionScanning();
            scanning.setInternet(true);
            properties.setScanning(scanning);

            // Create OCR submission
            String scanId = "my-scan-123";
            CopyleaksOcrSubmissionModel ocrSubmission = new CopyleaksOcrSubmissionModel(
                base64Content,
                "image.jpg",
                "en", // language code
                properties
            );

            // Submit OCR file for scanning
            var result = Copyleaks.submitOCR(authToken, scanId, ocrSubmission);

            System.out.println("OCR scan submitted successfully!");
            System.out.println("Scan ID: " + scanId);
            System.out.println("Language: en");
            System.out.println("Status: Processing - watch for webhooks");
            System.out.println("Text will be extracted from image and scanned for plagiarism");

        } catch (Exception e) {
            System.out.println("Failed to submit OCR scan: " + e.getMessage());
            e.printStackTrace();
        }
    }
}