Cover2Catalog
Camden Alpert
cra200002
Henry Jones
hsj200000
Rishabh Medhi
rxm200047
Michael Nuyda
man200004
Abstract
This report presents a computer vision application aimed
at simplifying the tedious process of CD cataloging. Our
approach utilizes real-time object detection with a YOLOv8
model to identify and capture snapshots of CDs based on
their orientation (Front, Back, or Side) once a confidence
threshold is met. Subsequent steps involve extracting usable
information from the detected CD: barcode scanning for the
back and OCR-based catalog number recognition for the
side. This extracted information is then cross-referenced
with a music database to provide detailed metadata, with
the option to add entries to a collection using an exter-
nal API. While the application successfully achieved most
objectives, including object detection and information re-
trieval, integration with a collection-tracking API remains
incomplete. Future work includes refining the model with
an expanded dataset, incorporating external data, and de-
veloping a user interface with collection management capa-
bilities. This demonstrates the potential of computer vision
techniques to streamline CD cataloging workflows.
1. Keywords
YOLOv8, PaddleOCR, pyzbar, Label Studio, OpenCV,
Discogs, MusicBrainz, ultralytics
2. Introduction
CD cataloging can be a time-consuming and labor-
intensive process, especially for large collections. Whether
a store has just received a new shipment of albums or a radio
station is looking to reorganize its music archive, the first
hurdle one encounters is the tedium of manually searching
for every single release. This report explores
the application of computer vision techniques to streamline
and simplify this task. By leveraging real-time object detec-
tion through the YOLOv8 model, our approach automates
key aspects of CD cataloging, including identifying CD ori-
entation and extracting metadata from barcodes and catalog
numbers. The ultimate goal is to reduce the manual effort
required while enabling integration with music databases
for efficient collection management. This report outlines
the methods, results, and recommendations for further de-
velopment to achieve a more comprehensive solution.
3. Related Work
The application of computer vision techniques for object
detection and metadata extraction has been explored in var-
ious domains, including inventory management, library cat-
aloging, product identification, and even face identification.
Real-time object detection models, particularly YOLO (You
Only Look Once) architectures, have been widely adopted
for their speed and accuracy in detecting and classifying ob-
jects. The YOLOv8 model, which we have used, builds
upon previous iterations, and its efficiency makes it suitable
for real-time applications.
Optical character recognition (OCR) has also seen sig-
nificant advancements over the years, with applications
ranging from digitizing text in scanned documents to ex-
tracting product identifiers. Various OCR tools and custom
deep learning-based OCR models have been leveraged to
recognize alphanumeric text, such as catalog numbers, un-
der a variety of conditions. Barcode scanning, also a well-
established technology, has similarly been enhanced by in-
tegrating computer vision to improve reliability and perfor-
mance in real-world scenarios.
Our work builds on these advancements by combining
real-time object detection, OCR, and barcode scanning into
a single pipeline tailored to CD cataloging. While prior
research often focuses on specific components, such as im-
proving object detection or OCR accuracy, this project aims
to integrate these components into a practical application.
Additionally, the incorporation of music database APIs to
retrieve and manage metadata bridges the gap between com-
puter vision techniques and collection management sys-
tems, while creating a streamlined user experience.
4. Implementation
4.1. System Overview
The application streamlines CD cataloging by automating
the object detection and information retrieval steps with
computer vision techniques. The workflow begins with
identifying the orientation of a CD using real-time object
detection, followed by
extracting relevant data based on the detected orientation.
This data is then cross-referenced with an external music
database to provide metadata and collection management
options. Finally, this data can be displayed to the user and
used to track their collection.
We employed computer vision techniques to scan the al-
bum and find the exact release, so adding a release to one's
collection is as easy as taking a photo. This application-
oriented project applies existing computer vision techniques
to a new subject. The process uses object tracking and
corner matching through YOLO, fed either by a live video
feed of the CD being scanned or by a web interface for
uploading photos of CDs. Using these images, we correct
the perspective of the photos to create high-quality scans
of the front, back, and side of the CD case [2]. From the
scans, we read the barcode (on the back) with the pyzbar
Python library [3] and the catalog number (on the side) by
extracting text with the PaddleOCR Python library [1], col-
lecting uniquely identifying information for each disc. This
information is cross-referenced against the MusicBrainz
and/or Discogs databases to find the exact release of the
album. The resulting identifier can then be fed to any online
API, for example to add the album to a collection on Mu-
sicBrainz or Discogs or to search for genre information on
Last.fm. We used an archive of album metadata from Mu-
sicBrainz, a site that hosts a nearly comprehensive database
of every known music release and all of its versions.
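The pipeline described above can be sketched as a small dispatcher. The class names, threshold value, and stub extractors below are illustrative assumptions rather than our exact code:

```python
from typing import Callable, Dict, Optional

# Hypothetical stubs standing in for the pyzbar and PaddleOCR steps.
def scan_barcode(snapshot) -> Optional[str]:
    ...  # pyzbar.decode(snapshot) in the real pipeline

def read_catalog_number(snapshot) -> Optional[str]:
    ...  # PaddleOCR text extraction in the real pipeline

# Each detected orientation triggers a different extraction step.
EXTRACTORS: Dict[str, Callable] = {
    "Back": scan_barcode,         # barcode is printed on the back cover
    "Side": read_catalog_number,  # catalog number is printed on the spine
}

def route_snapshot(label: str, confidence: float, threshold: float = 0.8):
    """Return the extractor to run, or None if nothing applies yet."""
    if confidence < threshold or label not in EXTRACTORS:
        return None
    return EXTRACTORS[label]
```

The front cover deliberately maps to no extractor, since it carries no machine-readable identifier.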
4.2. Dataset
Before we implemented any object detection or
text/information extraction algorithms, we had to create a
dataset of images with which we could train our model to
identify albums from images. To start, our group had to
collect images of albums and label them.
4.2.1 Data Collection
For image collection, our group considered how the model
would perform in suboptimal conditions, such as a user
scanning in a dark environment or with bright reflections
on the album cover, so we included images under various
lighting conditions. We also considered cases in which al-
bums are viewed upside down, at a sharp angle, or through
a blurry lens, so we included unclear images with the intent
of letting the model identify albums across lighting condi-
tions and viewing angles. Lastly, we included images con-
taining no albums (negative images) to reduce the chance
of false positives. As for the source of the albums we pho-
tographed, we combined our personal album collections
with UTD's album collections to create a diverse dataset.
4.2.2 Data Labeling
After collecting the album images, the next step was to la-
bel them. For the model to identify and segment the input
image, the labels had to be descriptive enough while re-
maining simple enough for our process and time constraints.
For our labeling solution we used an open-source tool called
Label Studio. We divided the dataset among ourselves for
manual labeling, and each of us identified instances of each
class (front, back, and side) in the images and drew a mask
around them using the "Semantic Segmentation with Poly-
gons" labeling feature. Upon completing these pre-process-
ing steps, we exported the labeled dataset and began the
model training and development step. The following image
demonstrates how we labeled instances of each class in our
images:
4.3. Model Details
We utilized the YOLOv8 model for real-time object de-
tection. The model was trained on the custom-labeled
dataset of CD images described above to identify the front,
back, and side of album cases. This let us detect the given
sides of a CD and identify its orientation in the camera's
view. Training parameters and hyperparameters were tuned
to reach accuracy sufficient for detecting CDs at a confi-
dence threshold suitable for consistent snapshots.
Once the confidence level for a detected orientation
(Front, Back, or Side) exceeds the set threshold, a snapshot
of the CD is captured. This snapshot serves as input for the
subsequent data extraction step.
For the Back orientation, the snapshot undergoes bar-
code scanning to extract the CD’s barcode utilizing the
pyzbar library. If the Side orientation is detected, an OCR
process is applied to extract the catalog number using the
PaddleOCR library. These extracted details form the basis
for retrieving metadata from a music database.
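Raw OCR output is rarely a clean catalog number on its own. As an illustration, post-processing could look like the sketch below, assuming catalog numbers follow a common letters-then-digits shape (e.g. TOCP-65001); the pattern and helper name are our assumptions, since real catalog numbers vary by label:

```python
import re

# Assumed catalog-number shape: 2-6 letters, optional separator, 3-7 digits.
CATALOG_PATTERN = re.compile(r"\b([A-Z]{2,6})[-\s]?(\d{3,7})\b")

def extract_catalog_number(ocr_text: str):
    """Pull the first catalog-number-like token out of raw OCR text."""
    match = CATALOG_PATTERN.search(ocr_text.upper())
    if match is None:
        return None
    # Normalize to a LETTERS-DIGITS form for database lookup.
    return f"{match.group(1)}-{match.group(2)}"
```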
The following image demonstrates the YOLO model's ini-
tial output (detected class and confidence):
4.3.1 Training Procedures
We trained the YOLOv8 model for up to 300 epochs on our
dataset of 600 images, each 640x480 pixels, to let the model
learn to differentiate the three classes. We used a fractional
batch size of 0.9, so that ultralytics sizes the batch to roughly
90% of available GPU memory, keeping us under the video
RAM limitations of our hardware. Additionally, our train-
ing function was set to stop early once reductions in loss
became minimal; because of this, training finished after 147
epochs in 48 minutes.
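For reference, a training run like the one above corresponds roughly to the following ultralytics configuration sketch; the dataset path and patience value are illustrative assumptions, not our exact settings:

```python
from ultralytics import YOLO

# Start from pretrained YOLOv8 segmentation weights.
model = YOLO("yolov8n-seg.pt")

model.train(
    data="cd_dataset.yaml",  # assumed path to the exported YOLO-format dataset
    epochs=300,              # upper bound; early stopping ended our run at 147
    imgsz=640,
    batch=0.9,               # fraction of GPU memory (ultralytics auto-batch)
    patience=50,             # illustrative early-stopping patience
)
```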
After training the model, we performed automated valida-
tion testing with 300 images to ensure that the model had
decent accuracy, achieving 87.5% accuracy on the valida-
tion dataset. We also tested the model through a live camera
feed and produced the following results:
4.3.2 Hardware and Software Environment
We initially trained on a ThinkPad P53 (a mobile worksta-
tion laptop) with an Nvidia Quadro T2000 mobile graph-
ics card with CUDA support while our dataset was small.
We eventually switched to Google Colab with an Nvidia
Tesla T4, which offers more processing power and more
RAM; moving to Google's cloud computing platform let us
use more powerful hardware for free. On the Colab note-
book, we installed and imported the ultralytics library, from
which we imported the YOLO model to train its weights for
our application.
4.4. Retrieving Album Information
Depending on what the model detects, the application ex-
tracts either the catalog number of the album (if the model
detects the album's side) or the barcode number (if it de-
tects the album's back). The extracted barcode or catalog
number is then used to query the MusicBrainz public API
for the album release and its metadata. Although a sys-
tem was planned to allow users to add CDs directly to their
collection, this feature remains partially implemented. Cur-
rently, the received data can be displayed for user review,
with manual collection updates as a temporary workaround.
Still, our application serves as a useful tool to identify spe-
cific versions of albums, since an album may be released
across the globe with different track lists; even in its current
state, it can help those who want to find and preserve spe-
cific releases.
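As an illustration of the lookup step, MusicBrainz's public web service supports searching releases by barcode. The helper below only builds the query URL; the endpoint is MusicBrainz's ws/2 API, while the function name is our own:

```python
from urllib.parse import urlencode

MUSICBRAINZ_RELEASE_ENDPOINT = "https://musicbrainz.org/ws/2/release/"

def build_release_query(barcode: str) -> str:
    """Build a MusicBrainz release-search URL for a scanned barcode."""
    params = urlencode({"query": f"barcode:{barcode}", "fmt": "json"})
    return f"{MUSICBRAINZ_RELEASE_ENDPOINT}?{params}"
```

Fetching and parsing the JSON response (and respecting MusicBrainz's rate limits) would follow from here.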
4.5. User Interface
After extracting data such as the album name, we output it
onto the live video feed through OpenCV for user review.
As mentioned, one limitation of our project was not fin-
ishing the user interface, so for now manual collection up-
dates are required using the information fed back; we plan
to use this information with an online API to add albums to
a user's collection.
5. Results and Evaluation
5.1. Qualitative Results
Our application has shown strong results from model infer-
ence through data extraction. The model detects the proper
orientation of a shown CD in real time with high confi-
dence. The captured snapshots proved to be reliable inputs
for the data extraction step, which existing libraries made
relatively easy; the main challenges for extraction are im-
age clarity and receiving the correct image for the specified
extraction tool, and the model consistently detected the back
or side and passed a clear snapshot along. Our model's final
mean average precision (mAP) scores were mAP50 = 0.984
and mAP50-95 = 0.882.
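For context, mAP50 counts a prediction as correct when it overlaps the ground truth with an intersection-over-union (IoU) of at least 0.5, while mAP50-95 averages precision over IoU thresholds from 0.5 to 0.95. A minimal axis-aligned box IoU sketch:

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Intersection is zero when the boxes do not overlap on either axis.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```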
6. Discussion
6.1. Challenges and Modifications
One of the main challenges was achieving consistent de-
tection accuracy for the various orientations, as well as pre-
venting false positives. Additionally, the API integration for
collection tracking was delayed due to time constraints.
We addressed the issues of consistent detection and pre-
venting false positives through common dataset procedures.
We expanded our dataset to include not only more images
but images that were rotated as well as negative images.
Negative images helped greatly in reducing false positives,
since our initial dataset only had images with a CD mask in
each one.
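The rotated variants can be generated mechanically; a minimal Pillow sketch, where the helper name and angle set are our own illustrative choices:

```python
from PIL import Image

def rotated_variants(image: Image.Image, angles=(90, 180, 270)):
    """Generate rotated copies of a labeled image to diversify orientation."""
    # expand=True grows the canvas so the rotated image is not cropped.
    return [image.rotate(angle, expand=True) for angle in angles]
```

Note that the polygon masks must be transformed with the same rotation for the labels to remain valid.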
7. Conclusion and Future Work
7.1. Future Work
Future improvements include expanding the dataset for
better model performance and completing the API imple-
mentation to provide a seamless user experience. Expand-
ing the dataset means not only adding more images of dif-
ferent CDs, but also capturing images unlike those already
in our dataset, since its internal consistency limits further
generalization. In addition, our images capture only a sin-
gle CD at a time; while the model showed signs of handling
more than one CD, adding multi-CD images to the dataset
would improve the model and further streamline the pro-
cess.
7.2. Conclusion
This project demonstrates the potential of computer vi-
sion techniques to streamline and automate the process of
CD cataloging. By leveraging the YOLOv8 model for real-
time object detection, OCR for catalog number recognition,
and barcode scanning, we developed an application capable
of extracting and organizing metadata from CDs. The inte-
gration with music databases further improves the ease of
the application, enabling users to retrieve detailed informa-
tion about their collection and even add to it.
While the system successfully achieved most of its goals,
including accurate object detection and metadata extraction,
the full implementation of collection management via an
API remains incomplete. Future improvements include ex-
panding the dataset to enhance detection accuracy, refining
the OCR and barcode scanning processes, and completing
the API integration to enable seamless collection tracking.
Overall, this project highlights how combining computer
vision techniques with practical workflows can simplify tra-
ditionally tedious tasks. With further refinement, this ap-
plication has the potential to serve as a robust tool for CD
cataloging and collection management.
References
[1] Yuning Du, Chenxia Li, Ruoyu Guo, Xiaoting Yin, Weiwei
Liu, Jun Zhou, Yifan Bai, Zilin Yu, Yehua Yang, Qingqing
Dang, and Haoshuang Wang. PP-OCR: A practical ultra
lightweight OCR system, 2020.
[2] M. Ramanan, A. Ramanan, and E. Y. A. Charles. A prepro-
cessing method for printed Tamil documents: Skew correc-
tion and textual classification. In 2015 IEEE Seventh Interna-
tional Conference on Intelligent Computing and Information
Systems (ICICIS), 2015.
[3] K. Roy, S. Banerjee, R. Dhar, I. Poddar, P. Dhar, S. Halder,
and S. Kumar. An efficient OCR based technique for barcode
reading and editing. In 2017 4th International Conference on
Opto-Electronics and Applied Optics (Optronix), 2017.