led via headers and query parameters.
| Header | Type | Required | Description |
|---|---|---|---|
| x-ingest-secret | string | Yes | Secret token matching the platform's INGEST_SECRET environment variable. Grants access to ingestion utilities. |
Query Parameters
| Parameter | Type | Required | Default | Max | Description |
|---|---|---|---|---|---|
| limit | integer | No | 800 | 1200 | Maximum number of SimHash values to return. Values exceeding 1200 are silently capped. |
Example Request URL:
GET /api/articles/ingest/simhashes?limit=1000
Success Response (200 OK)
Returns a JSON object containing a success flag, the actual count of returned hashes, and the array of SimHash values.
{
"success": true,
"count": 1000,
"simhashes": [
"0x8a3f2c1d4e5b6a7c",
"0x9b4e3d2c5f6a7b8d",
"0x7c2a1b0d9e8f7a6b",
"...",
"0x5d4c3b2a1f0e9d8c"
]
}
Response Fields:
success (boolean): Always true on successful execution.
count (integer): Number of SimHash values returned in the simhashes array. Matches the array length.
simhashes (string[]): Array of SimHash fingerprints. Values are typically represented as 64-bit hexadecimal strings or decimal integers, depending on platform serialization.
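
The field descriptions above can be turned into a small runtime check. This is a minimal sketch, not the platform's own client code: the names `SimhashResponse` and `parseSimhashResponse` are hypothetical, and it assumes the hashes arrive as strings, as the `string[]` type suggests. If your deployment serializes them as decimal integers, the element check would need to be relaxed.

```typescript
// Shape of a successful response, following the documented fields.
interface SimhashResponse {
  success: true;
  count: number;
  simhashes: string[];
}

// Validates untrusted JSON and returns a typed payload, or throws.
function parseSimhashResponse(raw: unknown): SimhashResponse {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("Response is not a JSON object");
  }
  const body = raw as Record<string, unknown>;
  if (body.success !== true) {
    throw new Error("Response did not report success");
  }
  const count = body.count;
  const simhashes = body.simhashes;
  if (typeof count !== "number" || !Array.isArray(simhashes)) {
    throw new Error("Missing count or simhashes");
  }
  // Assumption: every fingerprint is serialized as a string.
  const hashes: string[] = [];
  for (const h of simhashes) {
    if (typeof h !== "string") {
      throw new Error("simhashes must contain only strings");
    }
    hashes.push(h);
  }
  // The documentation states that count matches the array length.
  if (count !== hashes.length) {
    throw new Error(`count (${count}) does not match array length (${hashes.length})`);
  }
  return { success: true, count, simhashes: hashes };
}
```

Calling `parseSimhashResponse(JSON.parse(text))` right after reading the body keeps malformed payloads from reaching the comparison step.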
Error Responses
| Status Code | JSON Payload | Description |
|---|---|---|
| 401 Unauthorized | { "error": "Unauthorized" } | Missing x-ingest-secret header or invalid secret value. |
| 500 Internal Server Error | { "error": "<error_message>" } | Backend failure during hash retrieval (e.g., database timeout, query error). The error field contains the caught exception message. |
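
A client might map these status codes onto a small result type so callers can branch on them explicitly. The sketch below is illustrative only: `IngestOutcome` and `classifyResponse` are hypothetical names, and the `"unexpected"` branch is an assumption covering any status the table does not list.

```typescript
type IngestOutcome =
  | { kind: "ok" }
  | { kind: "unauthorized" }
  | { kind: "server_error"; message: string }
  | { kind: "unexpected"; status: number };

// Maps an HTTP status and parsed JSON body onto the documented cases.
function classifyResponse(status: number, body: unknown): IngestOutcome {
  if (status === 200) return { kind: "ok" };
  if (status === 401) return { kind: "unauthorized" };
  if (status === 500) {
    // The error field carries the caught exception message.
    const err =
      typeof body === "object" && body !== null
        ? (body as Record<string, unknown>).error
        : undefined;
    return {
      kind: "server_error",
      message: typeof err === "string" ? err : "unknown error",
    };
  }
  // Assumption: anything else is outside the documented contract.
  return { kind: "unexpected", status };
}
```

A 401 generally points at configuration (the secret), while a 500 may be transient, so a caller could reasonably retry only the latter.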
Usage Example
Below is a production-ready curl command demonstrating how to request 500 SimHash values with proper authentication:
curl -X GET "https://codcompass.com/api/articles/ingest/simhashes?limit=500" \
-H "x-ingest-secret: your_ingest_secret_here" \
-H "Accept: application/json"
Expected Output:
{
"success": true,
"count": 500,
"simhashes": [
"0x1a2b3c4d5e6f7a8b",
"0x9c8d7e6f5a4b3c2d",
"...",
"0x2e3f4a5b6c7d8e9f"
]
}
In a Node.js ingestion service, you would parse this response, compute the SimHash of your local article using a library like simhash or js-simhash, and iterate through the array to calculate Hamming distances before deciding whether to proceed with /api/articles/ingest/push.
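
For the request half of such a Node.js service, the curl command above might be assembled as follows. This is a sketch under stated assumptions: `buildSimhashRequest` and `SimhashRequest` are hypothetical names, and the base URL is passed in by the caller rather than hard-coded.

```typescript
interface SimhashRequest {
  url: string;
  headers: Record<string, string>;
}

// Builds the URL and headers for GET /api/articles/ingest/simhashes.
function buildSimhashRequest(
  baseUrl: string,
  secret: string,
  limit?: number
): SimhashRequest {
  const path = "/api/articles/ingest/simhashes";
  // Omit the query string entirely to fall back to the server default.
  const query =
    limit === undefined ? "" : `?limit=${encodeURIComponent(String(limit))}`;
  return {
    // Strip trailing slashes so the path is not doubled up.
    url: baseUrl.replace(/\/+$/, "") + path + query,
    headers: {
      "x-ingest-secret": secret,
      Accept: "application/json",
    },
  };
}
```

On Node.js 18 or later, the result can be passed to the built-in `fetch`, for example `fetch(req.url, { headers: req.headers })`.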
Common Pitfalls
1. Silent limit Capping at 1200
The endpoint enforces a hard ceiling of 1200 SimHash values per request. If you pass ?limit=5000, the API will silently return only 1200 hashes. This design prevents excessive memory allocation and database strain. If your ingestion workflow requires a larger reference set, implement pagination logic or schedule periodic cache refreshes rather than requesting massive payloads in a single call.
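
Because the cap is silent, one option is to clamp the value on the client so the request never asks for more than it can receive. The helper below is a hypothetical sketch (`resolveLimit` is not part of the API); the two constants mirror the documented default and ceiling.

```typescript
const DEFAULT_LIMIT = 800; // documented default
const MAX_LIMIT = 1200; // documented ceiling

interface ResolvedLimit {
  limit: number; // value to send as ?limit=
  capped: boolean; // true when the caller asked for more than the ceiling
}

function resolveLimit(requested?: number): ResolvedLimit {
  if (requested === undefined) {
    return { limit: DEFAULT_LIMIT, capped: false };
  }
  // Rejecting non-positive or fractional values is a client-side choice;
  // the documentation does not say how the server treats them.
  if (Math.floor(requested) !== requested || requested < 1) {
    throw new RangeError(`limit must be a positive integer, got ${requested}`);
  }
  return {
    limit: Math.min(requested, MAX_LIMIT),
    capped: requested > MAX_LIMIT,
  };
}
```

Logging a warning whenever `capped` is true makes the truncation visible in your own pipeline instead of leaving it implicit in the response.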
2. No Server-Side Comparison
This endpoint only returns reference hashes. It does not accept content, compute Hamming distances, or return match results, so developers must implement the comparison logic locally. A typical threshold for near-duplicate detection is a Hamming distance of ≤ 3 for 64-bit SimHashes; distances above this threshold indicate sufficiently different content. Skipping this client-side check will result in duplicate submissions.
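
The local comparison can be sketched as below. It assumes the hashes are hexadecimal strings of up to 16 digits with an optional `0x` prefix, as in the sample responses; decimal-serialized values would need converting first. The function names are hypothetical, and the default threshold of 3 is the typical value mentioned above rather than a platform requirement.

```typescript
// Counts the set bits in a 4-bit value.
function bitCount4(n: number): number {
  return (n & 1) + ((n >> 1) & 1) + ((n >> 2) & 1) + ((n >> 3) & 1);
}

// Strips an optional 0x prefix and left-pads to 16 hex digits (64 bits).
function normalizeHex(hash: string): string {
  const digits = hash.toLowerCase().replace(/^0x/, "");
  if (!/^[0-9a-f]{1,16}$/.test(digits)) {
    throw new Error(`Not a 64-bit hex SimHash: ${hash}`);
  }
  return ("0000000000000000" + digits).slice(-16);
}

// Hamming distance: number of bit positions where the two hashes differ.
function hammingDistance(a: string, b: string): number {
  const x = normalizeHex(a);
  const y = normalizeHex(b);
  let distance = 0;
  for (let i = 0; i < 16; i++) {
    // XOR each pair of hex digits and count the differing bits.
    const diff = parseInt(x.charAt(i), 16) ^ parseInt(y.charAt(i), 16);
    distance += bitCount4(diff);
  }
  return distance;
}

// True if any reference hash is within the threshold of the local hash.
function isNearDuplicate(
  local: string,
  references: string[],
  threshold: number = 3
): boolean {
  return references.some((ref) => hammingDistance(local, ref) <= threshold);
}
```

Working digit by digit avoids any dependency on 64-bit integer support; a `BigInt`-based XOR would be an equally valid choice on runtimes that provide it.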
3. Header Name and Secret Formatting
Send the authentication header as x-ingest-secret. HTTP header names are case-insensitive by specification, and HTTP/2 transmits them in lowercase, but some proxy layers, API gateways, or older HTTP/1.1 clients may still treat them as case-sensitive, so always use lowercase with hyphens as shown. Additionally, ensure the secret value matches the platform's environment configuration exactly; trailing whitespace or URL-encoded characters will trigger a 401 Unauthorized response.
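
A quick pre-flight check on the secret can catch these formatting problems before the request is sent. This is a heuristic sketch with a hypothetical name (`findSecretProblems`); in particular, a legitimate secret could contain a `%XX` sequence, so treat that finding as a warning rather than proof of a mistake.

```typescript
// Returns a list of likely formatting problems; empty means none found.
function findSecretProblems(secret: string): string[] {
  const problems: string[] = [];
  if (secret.length === 0) {
    problems.push("secret is empty");
  }
  // Stray newlines or spaces often come from .env files or shell quoting.
  if (secret !== secret.trim()) {
    problems.push("leading or trailing whitespace");
  }
  // Heuristic: a percent sign followed by two hex digits suggests URL encoding.
  if (/%[0-9A-Fa-f]{2}/.test(secret)) {
    problems.push("looks URL-encoded");
  }
  return problems;
}
```

Running this once at service startup, against the value read from your own configuration, surfaces the issue earlier than a 401 from the API would.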
Related Endpoints
This endpoint is part of CodCompass's ingestion suite and is typically used in conjunction with the following workflow endpoints:
/api/articles/ingest/push: Used to submit new or updated articles after pre-validation. Call this only when your local Hamming distance check confirms the content is sufficiently unique.
/api/articles/ingest/status: Retrieves ingestion job states, processing queues, and deduplication results for previously pushed content. Useful for monitoring pipeline health.
/api/articles/search: General-purpose article search endpoint. While not part of the ingestion flow, it can be used to verify that successfully ingested articles are discoverable and properly indexed.
Integrate /api/articles/ingest/simhashes early in your ingestion pipeline to minimize redundant uploads, reduce server load, and maintain a clean, deduplicated knowledge base.