Optical Character Recognition (OCR)

Client

Personal Project

Project Date

Dec 10, 2025

Tech Stack

HTML5, Tailwind CSS, JavaScript, Tesseract.js, PDF.js

Project Overview

An all-in-one text extraction tool built entirely with web technologies. It digitizes printed documents through OCR and extracts readable content from web pages.

<ul>

<li>AI-Powered OCR: Uses Tesseract.js v5 to recognize text in images with high accuracy.</li>

<li>Multi-Language Support: Specialized support for Sinhala and Tamil languages, including mixed-language detection.</li>

<li>Image Pre-processing: Automatically converts images to black and white and boosts contrast to improve readability before OCR.</li>

<li>PDF Reading: Accepts multi-page PDF uploads and converts each page into text automatically.</li>

<li>Smart Web Scraper: Fetches external websites via proxies, cleans up ads/sidebars, and organizes content by sections.</li>

<li>Typing Effect UI: Results are displayed with an engaging, terminal-style typing animation.</li>

</ul>
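
The image pre-processing step listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `preprocessPixels`, the default contrast factor of 1.5, and the choice of luminance weights are all assumptions. It takes RGBA pixel data in the same layout as the `data` array of a canvas `ImageData` object, converts each pixel to gray, and stretches contrast around the mid-point.

```javascript
// Hypothetical helper (name and defaults are assumptions, not from the project).
// `data` is RGBA pixel data laid out like ImageData.data: [r, g, b, a, r, g, b, a, ...].
function preprocessPixels(data, contrast = 1.5) {
  // Uint8ClampedArray rounds and clamps every stored value to 0..255.
  const out = new Uint8ClampedArray(data.length);
  for (let i = 0; i < data.length; i += 4) {
    // Luminance-weighted grayscale (ITU-R BT.601 coefficients).
    const gray = 0.299 * data[i] + 0.587 * data[i + 1] + 0.114 * data[i + 2];
    // Stretch contrast around the mid-point 128: darks get darker, lights lighter.
    const boosted = (gray - 128) * contrast + 128;
    out[i] = boosted;     // red
    out[i + 1] = boosted; // green
    out[i + 2] = boosted; // blue
    out[i + 3] = data[i + 3]; // alpha is left unchanged
  }
  return out;
}

// Example: one white, one semi-transparent black, and one pure red pixel.
const sample = new Uint8ClampedArray([255, 255, 255, 255, 0, 0, 0, 200, 255, 0, 0, 255]);
const processed = preprocessPixels(sample);
console.log(Array.from(processed));
```

In a browser, the input would typically come from `getImageData` on a canvas 2D context, and the result could be written back with `putImageData` before the canvas is handed to the OCR engine. Keeping the transform as a pure function over a pixel array makes it easy to test outside the browser.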