How to Analyse Documents Using AWS Services?

Processing thousands of invoices, contracts, or medical records manually isn't just slow; it doesn't scale. AWS solves this with a pair of managed AI services: Amazon Textract for extraction and Amazon Comprehend for understanding. Together, they turn unstructured documents into structured, queryable data with no ML expertise required.
The Two Core Services To Analyse Documents
1. Amazon Textract
Textract is a fully managed OCR service that goes beyond reading text. It understands document structure, returning six extraction types:
| Feature | What it extracts |
|---|---|
| Text | Raw lines and words |
| Forms | Key-value pairs (e.g. Name: Jane Doe) |
| Tables | Rows, columns, and cell contents |
| Queries | Answers to natural language questions (e.g. "What is the invoice total?") |
| Signatures | Location and confidence of detected signatures |
| Layout | Headers, footers, paragraphs, figures in reading order |
Supported inputs: PDF, JPEG, PNG, TIFF, from S3 or as direct byte payloads.
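To make the Tables output more concrete, here is a minimal sketch that rebuilds rows and columns from the Blocks list of a Textract response. It assumes the response came from AnalyzeDocument with TABLES in FeatureTypes; the function name extract_tables is hypothetical and not part of any AWS SDK.

```python
from collections import defaultdict


def extract_tables(blocks: list) -> list:
    """Return every TABLE block as a list of rows, each row a list of cell strings."""
    block_map = {b['Id']: b for b in blocks}

    def child_ids(block: dict) -> list:
        # Collect the IDs referenced by CHILD relationships
        ids = []
        for rel in block.get('Relationships', []):
            if rel['Type'] == 'CHILD':
                ids.extend(rel['Ids'])
        return ids

    tables = []
    for block in blocks:
        if block['BlockType'] != 'TABLE':
            continue
        cells = defaultdict(dict)  # row index -> {column index: cell text}
        for cell_id in child_ids(block):
            cell = block_map.get(cell_id, {})
            if cell.get('BlockType') != 'CELL':
                continue
            words = [
                block_map[w].get('Text', '')
                for w in child_ids(cell)
                if block_map.get(w, {}).get('BlockType') == 'WORD'
            ]
            cells[cell['RowIndex']][cell['ColumnIndex']] = ' '.join(words)
        # RowIndex and ColumnIndex are 1-based in Textract responses
        n_cols = max((c for row in cells.values() for c in row), default=0)
        rows = [
            [cells[r].get(c, '') for c in range(1, n_cols + 1)]
            for r in sorted(cells)
        ]
        tables.append(rows)
    return tables
```

Each table comes back as a plain list of lists, which is straightforward to write to CSV or load into a DataFrame. Merged cells and selection elements (checkboxes) are ignored in this sketch.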
2. Amazon Comprehend
Comprehend is an NLP service that processes extracted text and adds semantic understanding. It doesn't read documents; it reads what Textract gives it.
Key capabilities:
- Entity recognition - people, orgs, dates, locations
- Sentiment analysis - positive, negative, neutral, mixed
- Key phrase extraction - core topics and themes
- PII detection - identify and redact sensitive data
- Custom classification - train models on your own document categories
Amazon Comprehend Medical extends this for healthcare, extracting ICD-10 codes, medications (RxNorm), SNOMED-CT concepts, and PHI from clinical text. It is HIPAA eligible.
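As an illustration of the PII capability listed above, the sketch below masks detected spans in a string. It assumes the entity list has the shape returned by Comprehend's detect_pii_entities (Type, Score, BeginOffset, EndOffset); the function name redact_pii and the 0.8 score threshold are assumptions made for this example.

```python
def redact_pii(text: str, entities: list, min_score: float = 0.8) -> str:
    """Replace each sufficiently confident PII span with a [TYPE] placeholder."""
    redacted = text
    # Work from the end of the string so earlier offsets stay valid
    for ent in sorted(entities, key=lambda e: e['BeginOffset'], reverse=True):
        if ent.get('Score', 0.0) < min_score:
            continue  # skip low-confidence detections (threshold is an assumption)
        redacted = (
            redacted[:ent['BeginOffset']]
            + f"[{ent['Type']}]"
            + redacted[ent['EndOffset']:]
        )
    return redacted
```

In a real pipeline the entities argument would come from something like comprehend.detect_pii_entities(Text=text, LanguageCode='en')['Entities'], and the redacted text is what you would store or pass downstream.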
End-to-End Architecture
A production document analysis pipeline on AWS typically looks like this:
S3 (upload) → SQS → Lambda → Textract → SNS → Lambda → Comprehend → DynamoDB
Step by step:
- Document uploaded to S3 input bucket
- S3 event notification sends a message to an SQS queue
- Lambda function start-textract-job calls StartDocumentAnalysis
- Textract completes and notifies via SNS topic
- SNS triggers Lambda process-textract-result → saves extracted text to S3
- Based on document type, Lambda calls Comprehend or Comprehend Medical
- NLP results processed and stored in DynamoDB or DocumentDB
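The first steps above can be sketched as a small parser for the Lambda function that consumes the queue. It assumes the S3 bucket publishes its event notifications directly to SQS, so each SQS record body is the S3 notification JSON; the function name s3_objects_from_sqs_event is hypothetical.

```python
import json
from urllib.parse import unquote_plus


def s3_objects_from_sqs_event(event: dict) -> list:
    """Return (bucket, key) pairs for every S3 object referenced in an SQS-triggered Lambda event."""
    objects = []
    for record in event.get('Records', []):
        body = json.loads(record['body'])
        # S3 test events carry no 'Records' list, so default to an empty one
        for s3_record in body.get('Records', []):
            bucket = s3_record['s3']['bucket']['name']
            # Object keys are URL-encoded in S3 event notifications
            key = unquote_plus(s3_record['s3']['object']['key'])
            objects.append((bucket, key))
    return objects
```

Inside the Lambda handler, each (bucket, key) pair would then be handed to a function that calls StartDocumentAnalysis.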
This pipeline is fully serverless: there are no instances to manage, and it scales automatically with volume.
Working Code
Extract text and forms with Textract (synchronous)
Use AnalyzeDocument for single-page documents or real-time use cases:
import boto3

textract = boto3.client('textract', region_name='ap-south-1')


def analyse_document(bucket: str, key: str) -> list:
    # Synchronous call: returns the detected blocks in a single response
    response = textract.analyze_document(
        Document={
            'S3Object': {'Bucket': bucket, 'Name': key}
        },
        FeatureTypes=['FORMS', 'TABLES', 'SIGNATURES']
    )
    return response['Blocks']

Extract from multi-page PDFs (asynchronous)
For multi-page documents, use the async API with SNS notification:
import boto3

textract = boto3.client('textract', region_name='ap-south-1')

# Placeholder ARNs: substitute your own 12-digit account ID, topic, and role
SNS_TOPIC_ARN = 'arn:aws:sns:ap-south-1:123456789012:textract-topic'
ROLE_ARN = 'arn:aws:iam::123456789012:role/TextractRole'


def start_document_analysis(bucket: str, key: str) -> str:
    response = textract.start_document_analysis(
        DocumentLocation={
            'S3Object': {'Bucket': bucket, 'Name': key}
        },
        FeatureTypes=['FORMS', 'TABLES'],
        NotificationChannel={
            'SNSTopicArn': SNS_TOPIC_ARN,
            'RoleArn': ROLE_ARN
        }
    )
    return response['JobId']  # Poll or wait for the SNS notification


def get_analysis_results(job_id: str) -> list:
    blocks = []
    response = textract.get_document_analysis(JobId=job_id)
    # Results are only available once the job has finished
    if response['JobStatus'] in ('IN_PROGRESS', 'FAILED'):
        raise RuntimeError(f"Job {job_id} is {response['JobStatus']}")
    blocks.extend(response['Blocks'])
    # Results are paginated; follow NextToken until it is absent
    while 'NextToken' in response:
        response = textract.get_document_analysis(
            JobId=job_id,
            NextToken=response['NextToken']
        )
        blocks.extend(response['Blocks'])
    return blocks

Parse key-value pairs from forms
def extract_key_value_pairs(blocks: list) -> dict:
    key_map, value_map, block_map = {}, {}, {}
    # Index every block by ID and split KEY_VALUE_SET blocks into keys and values
    for block in blocks:
        block_map[block['Id']] = block
        if block['BlockType'] == 'KEY_VALUE_SET':
            if 'KEY' in block.get('EntityTypes', []):
                key_map[block['Id']] = block
            else:
                value_map[block['Id']] = block
    kvs = {}
    for key_id, key_block in key_map.items():
        key_text = get_text(key_block, block_map)
        val_block = find_value_block(key_block, value_map)
        val_text = get_text(val_block, block_map) if val_block else ''
        kvs[key_text] = val_text
    return kvs


def get_text(block: dict, block_map: dict) -> str:
    # Concatenate the WORD children of a block into a single string
    text = ''
    for rel in block.get('Relationships', []):
        if rel['Type'] == 'CHILD':
            for child_id in rel['Ids']:
                child = block_map.get(child_id, {})
                if child.get('BlockType') == 'WORD':
                    text += child.get('Text', '') + ' '
    return text.strip()


def find_value_block(key_block: dict, value_map: dict):
    # A KEY block points at its VALUE block through a VALUE relationship
    for rel in key_block.get('Relationships', []):
        if rel['Type'] == 'VALUE':
            for val_id in rel['Ids']:
                if val_id in value_map:
                    return value_map[val_id]
    return None

Analyse extracted text with Comprehend
import boto3

comprehend = boto3.client('comprehend', region_name='ap-south-1')


def analyse_text(text: str) -> dict:
    # Three independent calls, made one after another
    entities = comprehend.detect_entities(Text=text, LanguageCode='en')
    sentiment = comprehend.detect_sentiment(Text=text, LanguageCode='en')
    key_phrases = comprehend.detect_key_phrases(Text=text, LanguageCode='en')
    return {
        'entities': entities['Entities'],
        'sentiment': sentiment['Sentiment'],
        'key_phrases': [kp['Text'] for kp in key_phrases['KeyPhrases']]
    }

Comprehend Medical for clinical documents
import boto3

# Comprehend Medical is offered in fewer regions than Textract and Comprehend;
# us-east-1 is used here as an example, so check availability for your region
comprehend_medical = boto3.client('comprehendmedical', region_name='us-east-1')


def analyse_medical_text(text: str) -> dict:
    entities = comprehend_medical.detect_entities_v2(Text=text)
    phi = comprehend_medical.detect_phi(Text=text)
    return {
        'medical_entities': entities['Entities'],
        'phi_detected': phi['Entities']  # Protected Health Information
    }

Estimated Cost: 1,000 Documents/Month
| Service | Usage | Estimated Cost |
|---|---|---|
| Amazon S3 | 1 GB storage | ~$0.02 |
| Amazon SQS | 2,000 requests | $0.0008 |
| AWS Lambda | Within free tier | $0.00 |
| Amazon Textract | Text + Forms (1,000 pages) | $51.50 |
| Amazon SNS | 1,000 notifications | $0.0005 |
| Amazon DynamoDB | 1,000 reads/writes | $0.0015 |
| Amazon Comprehend | 1M characters | $1.00 |
| Amazon Comprehend Medical | 1M characters | $12.50 |
| Total | | ~$65.00/month |
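Textract dominates the cost, specifically the Forms feature ($50 per 1,000 pages versus $1.50 for text-only in the estimate above); Tables is priced separately on top of that. If you only need raw text, call DetectDocumentText (or StartDocumentTextDetection for multi-page files) instead of AnalyzeDocument. FeatureTypes is a required parameter of AnalyzeDocument, so it cannot simply be left empty.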
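The arithmetic behind the estimate can be reproduced with a few lines of Python. This is only a sketch that reuses the figures quoted in the table above; actual prices vary by region and change over time, and the function names are made up for this example.

```python
# Per-1,000-page Textract rates as quoted in this article (USD)
TEXT_RATE_PER_1000 = 1.50
FORMS_RATE_PER_1000 = 50.00


def textract_cost(pages: int, with_forms: bool = True) -> float:
    """Textract cost for a page count, with or without the Forms feature."""
    rate = TEXT_RATE_PER_1000 + (FORMS_RATE_PER_1000 if with_forms else 0.0)
    return pages / 1000 * rate


def monthly_estimate(pages: int = 1000, with_forms: bool = True) -> dict:
    """Line items from the table; only the Textract row is recomputed from the page count."""
    items = {
        'Amazon S3': 0.02,
        'Amazon SQS': 0.0008,
        'AWS Lambda': 0.0,
        'Amazon Textract': textract_cost(pages, with_forms),
        'Amazon SNS': 0.0005,
        'Amazon DynamoDB': 0.0015,
        'Amazon Comprehend': 1.00,
        'Amazon Comprehend Medical': 12.50,
    }
    items['Total'] = sum(items.values())
    return items
```

With these figures, monthly_estimate() totals roughly $65.02, while monthly_estimate(with_forms=False) drops to roughly $15.02, which shows how much of the bill comes from Forms.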
Best Practices
Document quality matters most. High-resolution scans (300 DPI+), clear fonts, and consistent formatting directly improve Textract accuracy. Pre-process images to normalize contrast and orientation before uploading.
Use async for anything multi-page. AnalyzeDocument is synchronous and limited to a single page. For multi-page PDFs and TIFFs, use StartDocumentAnalysis and let Textract publish job completion to an SNS topic (optionally fanned out to SQS).
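On the receiving side, the Lambda function subscribed to the SNS topic has to unpack Textract's completion message. The sketch below assumes the standard SNS-to-Lambda event shape and the JobId, Status, and DocumentLocation fields that Textract includes in its notification; the function name completed_jobs_from_sns_event is hypothetical.

```python
import json


def completed_jobs_from_sns_event(event: dict) -> list:
    """Return job ID and source document for every successfully completed Textract job in the event."""
    jobs = []
    for record in event.get('Records', []):
        message = json.loads(record['Sns']['Message'])
        if message.get('Status') != 'SUCCEEDED':
            continue  # failed jobs would normally be logged or sent to a dead-letter queue
        location = message.get('DocumentLocation', {})
        jobs.append({
            'job_id': message['JobId'],
            'bucket': location.get('S3Bucket'),
            'key': location.get('S3ObjectName'),
        })
    return jobs
```

Each returned job_id can then be passed to GetDocumentAnalysis to fetch the paginated results.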
Use Queries for targeted extraction. Instead of parsing all blocks to find a specific field, pass a natural-language query like "What is the invoice number?"; Textract returns a direct answer with a confidence score.
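A minimal sketch of the Queries workflow follows: one helper builds the AnalyzeDocument arguments, the other reads the answers back out of the returned blocks. The helper names and the alias scheme are assumptions for illustration; QueriesConfig, QUERY, and QUERY_RESULT are the names Textract itself uses.

```python
def build_query_request(bucket: str, key: str, questions: dict) -> dict:
    """Build keyword arguments for analyze_document; questions maps alias -> question text."""
    return {
        'Document': {'S3Object': {'Bucket': bucket, 'Name': key}},
        'FeatureTypes': ['QUERIES'],
        'QueriesConfig': {
            'Queries': [
                {'Text': text, 'Alias': alias} for alias, text in questions.items()
            ]
        },
    }


def extract_query_answers(blocks: list) -> dict:
    """Map each query alias (or its text) to an (answer, confidence) tuple, or None if unanswered."""
    block_map = {b['Id']: b for b in blocks}
    answers = {}
    for block in blocks:
        if block['BlockType'] != 'QUERY':
            continue
        query = block['Query']
        name = query.get('Alias') or query['Text']
        answers[name] = None
        # A QUERY block links to its QUERY_RESULT through an ANSWER relationship
        for rel in block.get('Relationships', []):
            if rel['Type'] != 'ANSWER':
                continue
            for result_id in rel['Ids']:
                result = block_map.get(result_id)
                if result and result.get('BlockType') == 'QUERY_RESULT':
                    answers[name] = (result.get('Text', ''), result.get('Confidence'))
    return answers
```

Usage would look like textract.analyze_document(**build_query_request(bucket, key, {'INVOICE_NUMBER': 'What is the invoice number?'})), followed by extract_query_answers(response['Blocks']).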
Encrypt at rest and in transit. Use AWS KMS for S3 bucket encryption and enable server-side encryption on DynamoDB. For medical documents, apply strict IAM policies limiting Comprehend Medical access to specific roles.
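Chunk long texts for Comprehend. Comprehend's real-time APIs cap the input size per call; DetectSentiment, for example, accepts at most 5 KB of UTF-8 text, and other operations have their own quotas. Split the extracted text into chunks that fit the limit before sending, then merge the results.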
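One way to do the splitting is to pack whole words into chunks while counting UTF-8 bytes rather than characters. The sketch below is an illustration only: the function name chunk_text and the 4,500-byte default (chosen to leave headroom under a 5,000-byte limit) are assumptions.

```python
def chunk_text(text: str, max_bytes: int = 4500) -> list:
    """Split text on whitespace into chunks whose UTF-8 size never exceeds max_bytes."""
    chunks, current, current_size = [], [], 0
    for word in text.split():
        word_size = len(word.encode('utf-8'))
        if word_size > max_bytes:
            raise ValueError('a single word is larger than max_bytes')
        # The extra byte accounts for the space that joins this word to the chunk
        extra = word_size + (1 if current else 0)
        if current_size + extra > max_bytes:
            chunks.append(' '.join(current))
            current, current_size = [word], word_size
        else:
            current.append(word)
            current_size += extra
    if current:
        chunks.append(' '.join(current))
    return chunks
```

Because whitespace is normalized during splitting, any offsets Comprehend returns refer to positions inside the chunk, not the original document. Keep a running offset per chunk if you need to map entities back.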
Frequently Asked Questions
What's the difference between Textract and traditional OCR?
Traditional OCR extracts raw text. Textract understands document structure; it knows that Invoice No: is a key and INV-2024-001 is its value, and returns them as a pair rather than separate lines of text.
When should I use synchronous vs asynchronous Textract?
Use AnalyzeDocument (sync) for single-page documents and real-time use cases. Use StartDocumentAnalysis (async) for multi-page PDFs or batch processing; it's non-blocking, and the results come back in pages that you retrieve with NextToken.
Is this HIPAA compliant?
Amazon Comprehend Medical is HIPAA eligible when deployed correctly. Ensure PHI is encrypted with KMS, access is restricted via IAM, and the AWS Business Associate Agreement (BAA) is signed.
Can Textract handle handwritten text?
Yes, Textract supports printed and handwritten text detection. Accuracy depends on handwriting quality; printed documents consistently yield higher confidence scores.
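To see how handwriting affects your own documents, you can compare average confidence by text type. This sketch relies on the TextType field (PRINTED or HANDWRITING) that Textract sets on WORD blocks; the function name confidence_by_text_type is hypothetical.

```python
def confidence_by_text_type(blocks: list) -> dict:
    """Average the Confidence of WORD blocks, grouped by their TextType."""
    totals = {}  # text type -> (word count, summed confidence)
    for block in blocks:
        if block.get('BlockType') != 'WORD':
            continue
        text_type = block.get('TextType', 'UNKNOWN')
        count, total = totals.get(text_type, (0, 0.0))
        totals[text_type] = (count + 1, total + block.get('Confidence', 0.0))
    return {t: total / count for t, (count, total) in totals.items()}
```

A large gap between the two averages is a signal to route handwritten pages to human review.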
How do I improve extraction accuracy for custom document layouts?
Use Textract Adapters: trainable components you attach to AnalyzeDocument. You annotate sample documents, train the adapter, and pass its AdapterId and version in the AdaptersConfig parameter of future requests to customize the model output for your specific forms.
Final Thoughts
The Textract → Comprehend pipeline covers the full document intelligence loop: extract structure, then extract meaning. With Lambda and SQS tying the stages together, the entire workflow is serverless, event-driven, and scales to any volume without infrastructure changes. Start with AnalyzeDocument on a few sample documents to validate the extraction quality. Once that's solid, build the async pipeline for production; the architecture above handles the rest.



