Cover2Catalog
Camden Alpert
cra200002
Henry Jones
hsj200000
Rishabh Medhi
rxm200047
Michael Nuyda
man200004
Abstract
This report presents a computer vision application aimed
at simplifying the tedious process of CD cataloging. Our
approach utilizes real-time object detection with a YOLOv8
model to identify and capture snapshots of CDs based on
their orientation (Front, Back, or Side) once a confidence
threshold is met. Subsequent steps involve extracting usable
information from the detected CD: barcode scanning for the
back and OCR-based catalog number recognition for the
side. This extracted information is then cross-referenced
with a music database to provide detailed metadata, with
the option to add entries to a collection using an exter-
nal API. While the application successfully achieved most
objectives, including object detection and information re-
trieval, integration with a collection-tracking API remains
incomplete. Future work includes refining the model with
an expanded dataset, incorporating external data, and de-
veloping a user interface with collection management capa-
bilities. This demonstrates the potential of computer vision
techniques to streamline CD cataloging workflows.
1. Keywords
YOLOv8, PaddleOCR, pyzbar, Label Studio, OpenCV,
Discogs, MusicBrainz, ultralytics
2. Introduction
CD cataloging can be a time-consuming and labor-
intensive process, especially for large collections. Whether
a store has just received a new shipment of albums or a radio
station is looking to reorganize its music archive, the first
hurdle one encounters is the tedium of manually searching
for every single release. This report explores
the application of computer vision techniques to streamline
and simplify this task. By leveraging real-time object detec-
tion through the YOLOv8 model, our approach automates
key aspects of CD cataloging, including identifying CD ori-
entation and extracting metadata from barcodes and catalog
numbers. The ultimate goal is to reduce the manual effort
required while enabling integration with music databases
for efficient collection management. This report outlines
the methods, results, and recommendations for further de-
velopment to achieve a more comprehensive solution.
3. Related Work
The application of computer vision techniques for object
detection and metadata extraction has been explored in var-
ious domains, including inventory management, library cat-
aloging, product identification, and even face identification.
Real-time object detection models, particularly YOLO (You
Only Look Once) architectures, have been widely adopted
for their speed and accuracy in detecting and classifying ob-
jects. The YOLOv8 model, which we have used, builds
upon previous iterations, and its efficiency makes it suitable
for real-time applications.
Optical character recognition (OCR) has also seen sig-
nificant advancements over the years, with applications
ranging from digitizing text in scanned documents to ex-
tracting product identifiers. Various OCR tools and custom
deep learning-based OCR models have been leveraged to
recognize alphanumeric text, such as catalog numbers, un-
der a variety of conditions. Barcode scanning, also a well-
established technology, has similarly been enhanced by in-
tegrating computer vision to improve reliability and perfor-
mance in real-world scenarios.
Our work builds on these advancements by combining
real-time object detection, OCR, and barcode scanning into
a single pipeline tailored to CD cataloging. While prior
research often focuses on specific components, such as im-
proving object detection or OCR accuracy, this project aims
to integrate these components into a practical application.
Additionally, the incorporation of music database APIs to
retrieve and manage metadata bridges the gap between com-
puter vision techniques and collection management sys-
tems, while creating a streamlined user experience.
4. Implementation
4.1. System Overview
The application streamlines CD cataloging by automating
the object detection and information retrieval steps with
computer vision techniques. The workflow begins with
identifying the orientation of a CD using real-time object
detection, followed by
extracting relevant data based on the detected orientation.
This data is then cross-referenced with an external music
database to provide metadata and collection management
options. Finally, this data can be displayed to the user and
used to track their collection.
We employed computer vision techniques to scan the al-
bum and find the exact release, so adding a release to one's
collection is as easy as taking a photo. This application-
oriented project applies existing computer vision techniques
to a new subject. The process uses object tracking and
corner matching through YOLO, fed either by a live video
feed of the CD being scanned or by a web interface for
uploading photos of CDs. Using these images, we correct
the perspective of the photos to create high-quality scans
of the front, back, and side of the CD case [2]. From the
scans, we read the barcode (on the back) with the pyzbar
Python library [3] and the catalog number (on the side) by
extracting text with the PaddleOCR Python library [1], col-
lecting uniquely identifying information for each disc. This
information is cross-referenced against the MusicBrainz
and/or Discogs databases to find the exact release of the
album. The resulting identifier can then be fed to any online
API, for example to add the album to a collection on Mu-
sicBrainz or Discogs or to search for genre information on
Last.fm. We used an archive of album metadata from Mu-
sicBrainz, a site that hosts a nearly comprehensive database
of every known music release and all of its versions.
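The pipeline described above can be sketched as a small dispatcher. The class names, threshold value, and stub extractors below are illustrative assumptions rather than our exact code:

```python
from typing import Callable, Dict, Optional

# Hypothetical stubs standing in for the pyzbar and PaddleOCR steps.
def scan_barcode(snapshot) -> Optional[str]:
    ...  # pyzbar.decode(snapshot) in the real pipeline

def read_catalog_number(snapshot) -> Optional[str]:
    ...  # PaddleOCR text extraction in the real pipeline

# Each detected orientation triggers a different extraction step.
EXTRACTORS: Dict[str, Callable] = {
    "Back": scan_barcode,         # barcode is printed on the back cover
    "Side": read_catalog_number,  # catalog number is printed on the spine
}

def route_snapshot(label: str, confidence: float, threshold: float = 0.8):
    """Return the extractor to run, or None if nothing applies yet."""
    if confidence < threshold or label not in EXTRACTORS:
        return None
    return EXTRACTORS[label]
```

The front cover deliberately maps to no extractor, since it carries no machine-readable identifier.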
4.2. Dataset
Before we implemented any object detection or
text/information extraction algorithms, we had to create a
dataset of images with which we could train our model to
identify albums from images. To start, our group had to
collect images of albums and label them.
4.2.1 Data Collection
For image collection, our group considered how the model
would perform in suboptimal conditions, such as a user
scanning in a dark environment or with bright reflections
on the album cover, so we included images under various
lighting conditions. We also considered cases in which al-
bums are viewed upside down, at a sharp angle, or through
a blurry lens, so we included unclear images with the intent
of letting the model identify albums across lighting condi-
tions and viewing angles. Lastly, we included images con-
taining no albums (negative images) to reduce the chance
of false positives. As for the source of the albums we pho-
tographed, we combined our personal album collections
with UTD's album collections to create a diverse dataset.
4.2.2 Data Labeling
After collecting the album images, the next step was to la-
bel them. For the model to identify and segment the input
image, the labels had to be descriptive enough while re-
maining simple enough for our process and time constraints.
For our labeling solution we used an open-source tool called
Label Studio. We divided the dataset among ourselves for
manual labeling, and each of us identified instances of each
class (front, back, and side) in the images and drew a mask
around them using the "Semantic Segmentation with Poly-
gons" labeling feature. Upon completing these pre-process-
ing steps, we exported the labeled dataset and began the
model training and development step. The following image
demonstrates how we labeled instances of each class in our
images:
4.3. Model Details
We utilized the YOLOv8 model for real-time object de-
tection. The model was trained on the custom-labeled
dataset of CD images described above to identify the front,
back, and side of album cases. This let us detect the given
sides of a CD and identify its orientation in the camera's
view. Training parameters and hyperparameters were tuned
to reach accuracy sufficient for detecting CDs at a confi-
dence threshold suitable for consistent snapshots.
Once the confidence level for a detected orientation
(Front, Back, or Side) exceeds the set threshold, a snapshot
of the CD is captured. This snapshot serves as input for the
subsequent data extraction step.
For the Back orientation, the snapshot undergoes bar-
code scanning to extract the CD’s barcode utilizing the
pyzbar library. If the Side orientation is detected, an OCR
process is applied to extract the catalog number using the
PaddleOCR library. These extracted details form the basis
for retrieving metadata from a music database.
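Raw OCR output is rarely a clean catalog number on its own. As an illustration, post-processing could look like the sketch below, assuming catalog numbers follow a common letters-then-digits shape (e.g. TOCP-65001); the pattern and helper name are our assumptions, since real catalog numbers vary by label:

```python
import re

# Assumed catalog-number shape: 2-6 letters, optional separator, 3-7 digits.
CATALOG_PATTERN = re.compile(r"\b([A-Z]{2,6})[-\s]?(\d{3,7})\b")

def extract_catalog_number(ocr_text: str):
    """Pull the first catalog-number-like token out of raw OCR text."""
    match = CATALOG_PATTERN.search(ocr_text.upper())
    if match is None:
        return None
    # Normalize to a LETTERS-DIGITS form for database lookup.
    return f"{match.group(1)}-{match.group(2)}"
```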
The following image demonstrates the YOLO model's ini-
tial output (detected class and confidence):
4.3.1 Training Procedures
We trained the YOLOv8 model for up to 300 epochs on our
dataset of 600 images, each 640x480 pixels, to let the model
learn to differentiate the three classes. We used a fractional
batch size of 0.9, so that ultralytics sizes the batch to roughly
90% of available GPU memory, keeping us under the video
RAM limitations of our hardware. Additionally, our train-
ing function was set to stop early once reductions in loss
became minimal; because of this, training finished after 147
epochs in 48 minutes.
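For reference, a training run like the one above corresponds roughly to the following ultralytics configuration sketch; the dataset path and patience value are illustrative assumptions, not our exact settings:

```python
from ultralytics import YOLO

# Start from pretrained YOLOv8 segmentation weights.
model = YOLO("yolov8n-seg.pt")

model.train(
    data="cd_dataset.yaml",  # assumed path to the exported YOLO-format dataset
    epochs=300,              # upper bound; early stopping ended our run at 147
    imgsz=640,
    batch=0.9,               # fraction of GPU memory (ultralytics auto-batch)
    patience=50,             # illustrative early-stopping patience
)
```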
After training the model, we performed automated valida-
tion testing with 300 images to ensure that the model had
decent accuracy, achieving 87.5% accuracy on the valida-
tion dataset. We also tested the model through a live camera
feed and produced the following results:
4.3.2 Hardware and Software Environment
We initially trained on a ThinkPad P53 (a mobile worksta-
tion laptop) with an Nvidia Quadro T2000 mobile graph-
ics card with CUDA support while our dataset was small.
We eventually switched to Google Colab with an Nvidia
Tesla T4, which offers more processing power and more
RAM; moving to Google's cloud computing platform let us
use more powerful hardware for free. On the Colab note-
book, we installed and imported the ultralytics library, from
which we imported the YOLO model to train its weights for
our application.
4.4. Retrieving Album Information
Depending on what the model detects, the application ex-
tracts either the catalog number of the album (if the model
detects the album's side) or the barcode number (if it de-
tects the album's back). The extracted barcode or catalog
number is then used to query the MusicBrainz public API
for the album release and its metadata. Although a sys-
tem was planned to allow users to add CDs directly to their
collection, this feature remains partially implemented. Cur-
rently, the received data can be displayed for user review,
with manual collection updates as a temporary workaround.
Still, our application serves as a useful tool to identify spe-
cific versions of albums, since an album may be released
across the globe with different track lists; even in its current
state, it can help those who want to find and preserve spe-
cific releases.
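As an illustration of the lookup step, MusicBrainz's public web service supports searching releases by barcode. The helper below only builds the query URL; the endpoint is MusicBrainz's ws/2 API, while the function name is our own:

```python
from urllib.parse import urlencode

MUSICBRAINZ_RELEASE_ENDPOINT = "https://musicbrainz.org/ws/2/release/"

def build_release_query(barcode: str) -> str:
    """Build a MusicBrainz release-search URL for a scanned barcode."""
    params = urlencode({"query": f"barcode:{barcode}", "fmt": "json"})
    return f"{MUSICBRAINZ_RELEASE_ENDPOINT}?{params}"
```

Fetching and parsing the JSON response (and respecting MusicBrainz's rate limits) would follow from here.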
4.5. User Interface
After extracting data such as the album name, we output it
onto the live video feed through OpenCV for user review.
As mentioned, one limitation of our project was not fin-
ishing the user interface, so for now manual collection up-
dates are required using the information fed back; we plan
to use this information with an online API to add albums to
a user's collection.
5. Results and Evaluation
5.1. Qualitative Results
Our application has shown strong results from model infer-
ence through data extraction. The model detects the proper
orientation of a shown CD in real time with high confi-
dence. The captured snapshots proved to be reliable inputs
for the data extraction step, which existing libraries made
relatively easy; the main challenges for extraction are im-
age clarity and receiving the correct image for the specified
extraction tool, and the model consistently detected the back
or side and passed a clear snapshot along. Our model's final
mean average precision (mAP) scores were mAP50 = 0.984
and mAP50-95 = 0.882.
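For context, mAP50 counts a prediction as correct when it overlaps the ground truth with an intersection-over-union (IoU) of at least 0.5, while mAP50-95 averages precision over IoU thresholds from 0.5 to 0.95. A minimal axis-aligned box IoU sketch:

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Intersection is zero when the boxes do not overlap on either axis.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```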
6. Discussion
6.1. Challenges and Modifications
One of the main challenges was achieving consistent de-
tection accuracy for the various orientations, as well as pre-
venting false positives. Additionally, the API integration for
collection tracking was delayed due to time constraints.
We addressed the issues of consistent detection and pre-
venting false positives through common dataset procedures.
We expanded our dataset to include not only more images
but images that were rotated as well as negative images.
Negative images helped greatly in reducing false positives,
since our initial dataset only had images with a CD mask in
each one.
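The rotated variants can be generated mechanically; a minimal Pillow sketch, where the helper name and angle set are our own illustrative choices:

```python
from PIL import Image

def rotated_variants(image: Image.Image, angles=(90, 180, 270)):
    """Generate rotated copies of a labeled image to diversify orientation."""
    # expand=True grows the canvas so the rotated image is not cropped.
    return [image.rotate(angle, expand=True) for angle in angles]
```

Note that the polygon masks must be transformed with the same rotation for the labels to remain valid.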
7. Conclusion and Future Work
7.1. Future Work
Future improvements include expanding the dataset for
better model performance and completing the API imple-
mentation to provide a seamless user experience. Expand-
ing the dataset means not only adding more images of dif-
ferent CDs, but also capturing images unlike those already
in our dataset, since its internal consistency limits further
generalization. In addition, our images capture only a sin-
gle CD at a time; while the model showed signs of handling
more than one CD, adding multi-CD images to the dataset
would improve the model and further streamline the pro-
cess.
7.2. Conclusion
This project demonstrates the potential of computer vi-
sion techniques to streamline and automate the process of
CD cataloging. By leveraging the YOLOv8 model for real-
time object detection, OCR for catalog number recognition,
and barcode scanning, we developed an application capable
of extracting and organizing metadata from CDs. The inte-
gration with music databases further improves the ease of
the application, enabling users to retrieve detailed informa-
tion about their collection and even add to it.
While the system successfully achieved most of its goals,
including accurate object detection and metadata extraction,
the full implementation of collection management via an
API remains incomplete. Future improvements include ex-
panding the dataset to enhance detection accuracy, refining
the OCR and barcode scanning processes, and completing
the API integration to enable seamless collection tracking.
Overall, this project highlights how combining computer
vision techniques with practical workflows can simplify tra-
ditionally tedious tasks. With further refinement, this ap-
plication has the potential to serve as a robust tool for CD
cataloging and collection management.
References
[1] Yuning Du, Chenxia Li, Ruoyu Guo, Xiaoting Yin, Weiwei
Liu, Jun Zhou, Yifan Bai, Zilin Yu, Yehua Yang, Qingqing
Dang, and Haoshuang Wang. PP-OCR: A practical ultra
lightweight OCR system, 2020.
[2] M. Ramanan, A. Ramanan, and E. Y. A. Charles. A prepro-
cessing method for printed Tamil documents: Skew correc-
tion and textual classification. In 2015 IEEE Seventh Interna-
tional Conference on Intelligent Computing and Information
Systems (ICICIS), 2015.
[3] K. Roy, S. Banerjee, R. Dhar, I. Poddar, P. Dhar, S. Halder,
and S. Kumar. An efficient OCR based technique for barcode
reading and editing. In 2017 4th International Conference on
Opto-Electronics and Applied Optics (Optronix), 2017.