Empowering Digital Literacy for National Development
Built by Group 5 • Powered by Elona • Honoring NITDA/NCAIR 🇳🇬
A comprehensive guide to building an automated receipt processing system using Python, Object-Oriented Programming, and Tesseract OCR
This tutorial teaches you how to build a professional receipt management system using Python and Object-Oriented Programming (OOP) principles. The system will:
- Organize the code into Python classes (Receipt and ReceiptManager) for better maintainability.
- Extract text from receipt images with the pytesseract library, converting visual data into processable text.
- Clean and structure the extracted data using regular expressions and string manipulation.
- Save the processed data to CSV files with error handling and file existence checks.
The Receipt class handles all operations related to a single receipt:
load_image() - Reads the receipt image file
extract_text() - Uses OCR to extract text from the image
parse_items() - Processes the extracted text to identify items and prices
categorize() - Automatically categorizes items based on keywords
The ReceiptManager coordinates the overall process:
add_receipt() - Adds receipt files to the processing queue
process_receipts() - Processes all queued receipts and saves to CSV
run() - Provides the user interface for interaction
The main execution block creates a ReceiptManager instance and starts the program:
if __name__ == "__main__":
    manager = ReceiptManager()
    manager.run()
Below is the complete implementation following NITDA/NCAIR standards:
import pytesseract
import cv2
import csv
import re
import os
from datetime import datetime

# Configure Tesseract OCR path (adjust if needed)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# Predefined categories: keyword found in an item name -> category
CATEGORY_MAP = {
    "tooth": "Personal Care", "paste": "Personal Care", "soap": "Personal Care",
    "detergent": "Cleaning", "omo": "Cleaning", "colgate": "Personal Care",
    "sunlight": "Cleaning", "indomie": "Food", "noodle": "Food", "maggie": "Food",
    "rice": "Food", "milk": "Beverage", "sugar": "Food", "bread": "Food",
    "mayonnaise": "Food", "oil": "Food", "pack": "General", "cream": "Personal Care",
    "shampoo": "Personal Care", "brush": "Personal Care", "chocolate": "Snack",
    "stew": "Food", "meat": "Food", "fish": "Food", "lorem": "Food",
    "ipsum": "Food", "dolor sit amet": "Food", "consectetur": "Snack", "adipiscing elit": "Snack"
}


class Receipt:
    def __init__(self, filename):
        self.filename = filename
        self.items = []

    def load_image(self):
        # cv2.imread returns None instead of raising when the file cannot be read
        image = cv2.imread(self.filename)
        if image is None:
            raise FileNotFoundError(f"Could not read image: {self.filename}")
        return image

    def extract_text(self, image):
        # Grayscale input usually gives the OCR engine cleaner text
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        return pytesseract.image_to_string(gray)

    def parse_items(self, text):
        lines = [line.strip() for line in text.split('\n') if line.strip()]
        for line in lines:
            # Skip summary lines that are not purchased items
            if any(word in line.lower() for word in ['total', 'cash', 'change', 'receipt']):
                continue
            # Item name, whitespace, then a price with an optional ₦ or N prefix
            match = re.search(r'(.+?)\s+([₦N]?\s?[\d.,]+)[\)]?$', line)
            if match:
                item_name = match.group(1).strip()
                price_str = match.group(2).replace('₦', '').replace('N', '').replace(',', '.').strip()
                try:
                    price = round(float(price_str))
                    category = self.categorize(item_name)
                    self.items.append((item_name, price, category))
                except ValueError:
                    # The price text could not be converted to a number
                    continue

    def categorize(self, item_name):
        item_name = item_name.lower()
        for keyword, category in CATEGORY_MAP.items():
            if keyword in item_name:
                return category
        return "Uncategorized"


class ReceiptManager:
    def __init__(self):
        self.receipts = []
        self.output_file = "expenses.csv"

    def add_receipt(self, filename):
        if os.path.isfile(filename):
            self.receipts.append(filename)
            print(f"✅ Receipt added: {filename}")
        else:
            print("File not found.")

    def process_receipts(self):
        if not self.receipts:
            print("No receipts added yet.")
            return
        # Write the header row only when the CSV file is being created
        file_exists = os.path.isfile(self.output_file)
        with open(self.output_file, 'a', newline='') as file:
            writer = csv.writer(file)
            if not file_exists:
                writer.writerow(['Receipt', 'Item', 'Price', 'Category', 'Date'])
            for filename in self.receipts:
                try:
                    receipt = Receipt(filename)
                    img = receipt.load_image()
                    text = receipt.extract_text(img)
                    receipt.parse_items(text)
                    now = datetime.now().strftime("%Y-%m-%d %H:%M")
                    for item_name, price, category in receipt.items:
                        writer.writerow([filename, item_name, price, category, now])
                    print(f"✅ Processed: {filename}")
                except FileNotFoundError as e:
                    print(e)
        self.receipts.clear()
        print("All receipts processed. Check 'expenses.csv' for your report.")

    def run(self):
        print("PYTHON RECEIPT MANAGER")
        print("Developed by Nuhu @ NITDA/NCAIR\n")
        while True:
            print("\nChoose an option:")
            print("1. Add receipt image")
            print("2. Process and generate report")
            print("3. Exit")
            choice = input("Enter your choice (1/2/3): ").strip()
            if choice == '1':
                img_name = input("Enter image file name (e.g., receipt1.jpg): ").strip()
                self.add_receipt(img_name)
            elif choice == '2':
                self.process_receipts()
            elif choice == '3':
                print("Exiting. Goodbye!")
                break
            else:
                print("Invalid choice. Please enter 1, 2, or 3.")


# Run the program
if __name__ == "__main__":
    manager = ReceiptManager()
    manager.run()
pip install pytesseract opencv-python
Check expenses.csv for your report.
Two-line explanations so any Group 5 member can understand and explain confidently
Receipt class
- Handles reading, extracting, and analyzing a single receipt image.
- It finds items, prices, and categories from one photo.
__init__() in Receipt
- Stores the filename of the receipt image.
- This lets us remember which receipt we're working on.
load_image()
- Loads the image from your computer.
- If the image isn't found, it shows an error.
extract_text()
- Turns the receipt image into text using OCR (pytesseract).
- Basically, it reads what's written on the paper.
parse_items()
- Goes line by line through the text and finds items with prices.
- It also figures out which category each item belongs to.
categorize()
- Checks each item name and tries to match it to a category like food or personal care.
- If it doesn't match anything, it's marked "Uncategorized".
ReceiptManager class
- Handles all receipts, collects them, and creates the final report.
- Think of it as the team captain managing all the receipts together.
__init__() in ReceiptManager
- Creates an empty list to hold all receipt files added.
- Also sets the name of the CSV file to save results.
add_receipt()
- Checks if a file exists and adds it to the list of receipts.
- It's like telling the system "this receipt is ready to process."
process_receipts()
- Goes through all added receipts and writes the details into a CSV report.
- Each item from each receipt is saved with name, price, category, and time.
run()
- Shows the main menu where the user can add receipts, process them, or exit.
- This is the main loop that runs the app.
if __name__ == "__main__":
- This line tells Python to start the program here.
- It runs the whole system by calling the ReceiptManager.
Now you can explain this project like a pro!
Complete explanation in plain English for absolute beginners
import pytesseract
import cv2
import csv
import re
import os
from datetime import datetime
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
This is like telling Python: "The tool that reads text from images is installed here on my computer". The r before the path makes it a raw string, so the backslashes in the Windows path are kept as they are instead of being read as escape sequences.
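The hard-coded path above only works on a Windows machine with Tesseract installed in that exact folder. Here is a minimal sketch of a more portable setup. The helper name find_tesseract is hypothetical (it is not part of the original program); it uses the standard library's shutil.which to look on the system PATH first and only then falls back to the tutorial's path:

```python
import os
import shutil

# Default install location used in this tutorial (Windows only).
WINDOWS_DEFAULT = r'C:\Program Files\Tesseract-OCR\tesseract.exe'


def find_tesseract(fallback=WINDOWS_DEFAULT):
    """Hypothetical helper: return a path to the tesseract executable, or None."""
    on_path = shutil.which("tesseract")  # full path if tesseract is on the PATH
    if on_path:
        return on_path
    if os.path.isfile(fallback):
        return fallback
    return None


# The r prefix keeps backslashes literal, so both spellings are the same string.
SAME_STRING = r'C:\Program Files' == 'C:\\Program Files'  # True
```

If find_tesseract() returns a path, it can be assigned to pytesseract.pytesseract.tesseract_cmd. If it returns None, Tesseract most likely still needs to be installed.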
CATEGORY_MAP = {
    "tooth": "Personal Care",
    "paste": "Personal Care",
    # ... other items ...
    "adipiscing elit": "Snack"
}
This is like a cheat sheet that tells the program which category each keyword belongs to: if an item name contains "tooth", the item is filed under "Personal Care". This makes automatic categorization possible without manual input.
Handles everything about a single receipt - from loading the image to extracting and categorizing items
def __init__(self, filename):
    self.filename = filename
    self.items = []
What happens: When we create a new Receipt object, we:
- Store the name of the image file (self.filename)
- Create an empty list (self.items) to store found items later
def load_image(self):
    image = cv2.imread(self.filename)
    if image is None:
        raise FileNotFoundError(f"Could not read image: {self.filename}")
    return image
Step-by-step:
- Tries to read the image file (cv2.imread)
- If nothing could be read (None), raises an error with a message
- Otherwise, returns the loaded image
Think of this like trying to open a photo on your phone - if it can't be opened, you get an error.
def extract_text(self, image):
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    return pytesseract.image_to_string(gray)
How it works:
- Converts the image to grayscale (GRAY) - this helps OCR work better
- Passes the grayscale image to pytesseract.image_to_string, which returns the text it finds
Real-world analogy: Like taking a photo of a receipt and using your phone's "copy text from image" feature.
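def parse_items(self, text):
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    for line in lines:
        if any(word in line.lower() for word in ['total', 'cash', 'change', 'receipt']):
            continue
        match = re.search(r'(.+?)\s+([₦N]?\s?[\d.,]+)[\)]?$', line)
        if match:
            item_name = match.group(1).strip()
            price_str = match.group(2).replace('₦', '').replace('N', '').replace(',', '.').strip()
            try:
                price = round(float(price_str))
                category = self.categorize(item_name)
                self.items.append((item_name, price, category))
            except ValueError:
                continue
Detailed breakdown:
- Splits the text into lines and drops the empty ones
- Skips lines that mention total, cash, change, or receipt
- Looks for an item name followed by a price at the end of the line
- Removes the currency symbol, replaces commas with dots, converts the price to a number, and rounds it to a whole number
- Looks up the category and stores (item name, price, category) in self.items
- Skips any line whose price cannot be converted to a number
Key Point: The re.search pattern looks for:
1. Item name (text) → 2. Space → 3. Price (numbers with optional currency symbols)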
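To see the pattern at work without running OCR, it can be tried on a few hand-typed lines. The sketch below reuses the same regular expression. The helpers parse_line and normalize_price are hypothetical names added for illustration, and normalize_price is an alternative cleanup, not the tutorial's: it assumes a comma followed by exactly three digits is a thousands separator, and it keeps the decimal part instead of rounding.

```python
import re

# Same pattern as in parse_items: name, whitespace, optional ₦ or N, then the price.
LINE_PATTERN = re.compile(r'(.+?)\s+([₦N]?\s?[\d.,]+)[\)]?$')


def normalize_price(raw):
    """Hypothetical helper: turn an OCR price string into a float.

    Assumption: a comma followed by exactly three digits is a thousands
    separator (1,200); any other comma is a decimal mark (12,50).
    """
    cleaned = raw.replace('₦', '').replace('N', '').strip()
    cleaned = re.sub(r',(?=\d{3}(\D|$))', '', cleaned)  # drop thousands commas
    cleaned = cleaned.replace(',', '.')                  # remaining comma is a decimal mark
    return float(cleaned)


def parse_line(line):
    """Return (item_name, price), or None when the line has no usable trailing price."""
    match = LINE_PATTERN.search(line)
    if not match:
        return None
    try:
        return match.group(1).strip(), normalize_price(match.group(2))
    except ValueError:
        return None


print(parse_line("Indomie Noodles ₦1,200.00"))  # ('Indomie Noodles', 1200.0)
print(parse_line("Bread N350"))                  # ('Bread', 350.0)
print(parse_line("Thank you"))                   # None
```

Note that the tutorial's own cleanup turns "1,200.00" into "1.200.00", which float() rejects, so that line would be skipped. Whether commas on your receipts mean thousands or decimals is something to check against real scans.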
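def categorize(self, item_name):
    item_name = item_name.lower()
    for keyword, category in CATEGORY_MAP.items():
        if keyword in item_name:
            return category
    return "Uncategorized"
How categorization works:
- Converts the item name to lowercase
- Checks each keyword in CATEGORY_MAP against the name
- Returns the category of the first keyword found in the name
- Returns "Uncategorized" if no keyword matches
Example: "Colgate Toothpaste" → contains "tooth" → returns "Personal Care"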
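One thing to watch with this substring check is that a keyword can hide inside an unrelated word. The sketch below uses a small subset of CATEGORY_MAP to show this and compares it with a hypothetical variant, categorize_word_start (not part of the original program), that only accepts a keyword at the start of a word:

```python
import re

# A small subset of the tutorial's CATEGORY_MAP, for illustration only.
CATEGORY_MAP = {"tooth": "Personal Care", "oil": "Food", "rice": "Food"}


def categorize_substring(item_name):
    """Same substring approach as the tutorial's categorize()."""
    name = item_name.lower()
    for keyword, category in CATEGORY_MAP.items():
        if keyword in name:
            return category
    return "Uncategorized"


def categorize_word_start(item_name):
    """Hypothetical variant: the keyword must begin a word in the item name."""
    name = item_name.lower()
    for keyword, category in CATEGORY_MAP.items():
        if re.search(r'\b' + re.escape(keyword), name):
            return category
    return "Uncategorized"


print(categorize_substring("Toilet Paper"))         # Food ("oil" hides inside "toilet")
print(categorize_word_start("Toilet Paper"))        # Uncategorized
print(categorize_word_start("Colgate Toothpaste"))  # Personal Care
```

The stricter check has a cost of its own: a keyword such as "paste" would no longer match inside "toothpaste". The better choice depends on the item names that actually appear on the receipts.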
Manages multiple receipts and handles the overall program flow
def __init__(self):
    self.receipts = []
    self.output_file = "expenses.csv"
Setup:
self.receipts - Empty list to store receipt filenames
self.output_file - Sets the CSV filename for saving results
This is like preparing a blank notebook (receipts) and deciding where to save the final report (expenses.csv).
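def add_receipt(self, filename):
    if os.path.isfile(filename):
        self.receipts.append(filename)
        print(f"✅ Receipt added: {filename}")
    else:
        print("File not found.")
Process flow:
- Checks whether the file exists (os.path.isfile)
- If it does, adds the filename to the list and prints a confirmation
- If it does not, prints "File not found."
This is like putting a physical receipt in your "to process" tray.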
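The existence check above accepts any file, including ones that are not images. As a possible extension, the sketch below also looks at the file extension before a receipt is queued. The helper check_receipt_path and the IMAGE_EXTENSIONS set are assumptions made for illustration; adjust the set to the formats you actually scan.

```python
import os

# Assumed set of image extensions; not taken from the original program.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff"}


def check_receipt_path(filename):
    """Hypothetical helper: return an error message, or None if the path looks usable."""
    if not os.path.isfile(filename):
        return "File not found."
    extension = os.path.splitext(filename)[1].lower()
    if extension not in IMAGE_EXTENSIONS:
        return f"Unsupported file type: {extension or '(none)'}"
    return None
```

Inside add_receipt, the filename would be appended only when check_receipt_path(filename) returns None, and the message printed otherwise.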
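def process_receipts(self):
    if not self.receipts:
        print("No receipts added yet.")
        return
    file_exists = os.path.isfile(self.output_file)
    with open(self.output_file, 'a', newline='') as file:
        writer = csv.writer(file)
        if not file_exists:
            writer.writerow(['Receipt', 'Item', 'Price', 'Category', 'Date'])
        for filename in self.receipts:
            try:
                receipt = Receipt(filename)
                img = receipt.load_image()
                text = receipt.extract_text(img)
                receipt.parse_items(text)
                now = datetime.now().strftime("%Y-%m-%d %H:%M")
                for item_name, price, category in receipt.items:
                    writer.writerow([filename, item_name, price, category, now])
                print(f"✅ Processed: {filename}")
            except FileNotFoundError as e:
                print(e)
    self.receipts.clear()
    print("All receipts processed. Check 'expenses.csv' for your report.")
Complete workflow:
- Stops early if no receipts have been added
- Opens expenses.csv in append mode and writes the header row only if the file is new
- For each receipt: loads the image, extracts the text, parses the items, and writes one row per item with a timestamp
- Clears the list of receipts when done
Key Features:
- Append mode keeps the results from earlier runs
- A receipt whose image cannot be read is reported and skipped, so the remaining receipts are still processed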
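The "check whether the file exists, then append" pattern used in process_receipts can be tried on its own, without any images. The sketch below is a simplified illustration: append_rows and read_rows are hypothetical helper names, the rows are made-up sample values, and everything is written to a temporary folder so no real expenses.csv is touched.

```python
import csv
import os
import tempfile

HEADER = ['Receipt', 'Item', 'Price', 'Category', 'Date']


def append_rows(path, rows):
    """Append rows to a CSV file, writing the header only when the file is new."""
    file_exists = os.path.isfile(path)  # must be checked before open() creates the file
    with open(path, 'a', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        if not file_exists:
            writer.writerow(HEADER)
        writer.writerows(rows)


def read_rows(path):
    """Read the whole CSV file back as a list of rows."""
    with open(path, newline='', encoding='utf-8') as file:
        return list(csv.reader(file))


# Two separate "runs" appending to the same file (sample values only).
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "expenses.csv")
    append_rows(demo_path, [["r1.jpg", "Milk", 500, "Beverage", "2024-01-01 10:00"]])
    append_rows(demo_path, [["r2.jpg", "Bread", 350, "Food", "2024-01-02 09:30"]])
    demo_rows = read_rows(demo_path)

print(len(demo_rows))  # 3: one header row plus one row per run
```

The order matters: os.path.isfile has to run before open(), because opening in append mode creates the file and the check would then always report that it exists.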
def run(self):
    print("PYTHON RECEIPT MANAGER")
    print("Developed by Nuhu @ NITDA/NCAIR\n")
    while True:
        print("\nChoose an option:")
        print("1. Add receipt image")
        print("2. Process and generate report")
        print("3. Exit")
        choice = input("Enter your choice (1/2/3): ").strip()
        if choice == '1':
            img_name = input("Enter image file name (e.g., receipt1.jpg): ").strip()
            self.add_receipt(img_name)
        elif choice == '2':
            self.process_receipts()
        elif choice == '3':
            print("Exiting. Goodbye!")
            break
        else:
            print("Invalid choice. Please enter 1, 2, or 3.")
User interaction:
- Option 1 asks for an image file name and adds it (add_receipt)
- Option 2 processes all added receipts (process_receipts)
- Option 3 exits the program
This creates the interactive experience users see when running the program.
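A menu loop like this is hard to test because it waits for keyboard input. One possible variation, sketched below, looks choices up in a dictionary and lets the input and output functions be swapped out. The function run_menu and its parameters are hypothetical, not part of the original program.

```python
def run_menu(actions, read=input, write=print):
    """Hypothetical variant of run(): look each choice up in a dictionary.

    actions maps a choice string to a function that takes no arguments.
    read and write default to input and print but can be replaced in tests.
    """
    while True:
        choice = read("Enter your choice (1/2/3): ").strip()
        if choice == '3':
            write("Exiting. Goodbye!")
            break
        action = actions.get(choice)
        if action is None:
            write("Invalid choice. Please enter 1, 2, or 3.")
        else:
            action()


# Scripted run: the answers come from a list instead of the keyboard.
calls = []
answers = iter(['1', '9', '2', '3'])
run_menu(
    {'1': lambda: calls.append('add'), '2': lambda: calls.append('process')},
    read=lambda prompt: next(answers),
)
print(calls)  # ['add', 'process']
```

In the real program the dictionary values would be bound methods such as self.process_receipts, with a small wrapper for option 1 that first asks for the file name.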
if __name__ == "__main__":
    manager = ReceiptManager()
    manager.run()
What happens when you run the file:
- Creates a ReceiptManager instance
- Calls run() to start the menu
Professional Tip: The if __name__ == "__main__": block ensures this code only runs when the file is executed directly, not when imported as a module.
Now you can explain every part of this project with confidence!
Test your knowledge with these 30 essential questions