Abstract | Intelligent service robots are increasingly being deployed in households, hospitals, warehouses,
and other environments. These robots are capable of performing complex tasks,
largely relying on their vision systems for object detection. As these robots operate and
interact in 3D space, data obtained with 3D sensors is crucial for object detection, as it
offers a detailed representation of the environment in the form of point clouds. These
point clouds allow for precise extraction of information about the position, size, and geometry
of objects within the 3D environment. Examples of such sensors include LiDARs
and RGB-D cameras. State-of-the-art 3D object detection methods are predominantly
deep learning-based and, as such, require large amounts of annotated data to perform
effectively. However, acquiring such data can be both expensive and time-consuming.
As a result, synthetic data generation presents a promising alternative, but current object
detectors trained on synthetic data still underperform compared to those trained on real
data. This suggests that current synthetic data generation methods do not yet achieve
sufficient realism.
This thesis aims to address this issue by exploring the field of realistic synthetic 3D
data generation for indoor environments. The primary goal is to analyze the impact of
five key factors of realism: presence of background objects, camera noise, positioning
of objects, context of scenes, and object sizes. To investigate these factors, a modular
method for generating realistic synthetic single-view point clouds was developed. This
method allows for the generation of large, customizable datasets with varying levels of
realism, specifically controlling for the aforementioned factors.
These datasets are used to train state-of-the-art object detectors, and the impact of
each realism factor is evaluated based on the detection performance. Furthermore, the
experiments show that the performance of an object detector can be improved by pretraining
it on a baseline synthetic dataset and fine-tuning it on real data. Notably, a
model trained with this approach on geometric data alone outperforms the same object
detector trained solely on real data using both geometric and color data.
In addition to detecting objects, a service robot needs the ability to interact with objects
in its environment. One particular challenge arises with openable objects such as
cabinets, nightstands, and closets. While traditional openable objects often feature simple
handles that can be grasped to open them, modern design increasingly incorporates
handleless doors and drawers. This presents a significant challenge for service robots,
as most existing methods for the robotic manipulation of articulated objects focus on
objects with handles, and there is limited research on handling objects without them.
This thesis addresses the first crucial step in managing such objects: the classification
of the opening mechanism. Three categories are proposed for this classification: objects
with regular handles, objects that can be grasped by their surface to be opened, and objects
with a push latch mechanism. Given that the latter two categories often appear
visually similar and can be difficult even for humans to distinguish based solely on appearance,
this work explores methods that utilize both images of the objects and images
of a human demonstrating the approach to opening them.
Experiments conducted using a new dataset indicate that combining images without
demonstration and images with demonstration significantly improves the performance of
CNN classifiers compared to using either type of image alone. Additionally, experiments
conducted with a modern object detector show that such classifiers can be applied
to automatically detected regions, providing evidence that the proposed methods could
be used in real-world environments as part of fully autonomous systems. |
Abstract (croatian) | Intelligent service robots are increasingly used in households, hospitals, warehouses,
and other environments. These robots are capable of performing complex tasks, relying
largely on their vision systems for object detection. Because they operate in
three-dimensional space, object detection uses data obtained with 3D sensors, as it
offers a detailed representation of the environment in the form of point clouds. Point
clouds allow precise extraction of information about the position, size, and geometry
of objects within the 3D environment. Examples of such sensors include LiDARs and
RGB-D cameras. Modern 3D object detection methods are mostly based on deep learning
and, as such, require large amounts of annotated data to work effectively. However,
acquiring such data can be expensive and time-consuming. Synthetic data generation is
a promising alternative, but current object detectors trained on synthetic data still
perform worse than those trained on real data. This suggests that current synthetic
data generation methods do not yet achieve a sufficient level of realism.
The aim of this work is to contribute to solving this problem by exploring the field
of realistic synthetic 3D data generation for indoor spaces. The main goal is to analyze
the impact of five key factors of realism: the presence of background objects, camera
noise, object positioning, scene context, and object size. To investigate these factors,
a modular method for generating realistic synthetic point clouds from a single viewpoint
was developed. This method enables the generation of large, customizable datasets with
varying levels of realism, specifically controlling each of these factors.
These datasets are used to train modern object detectors, and the impact of each realism
factor is evaluated based on detection performance. Furthermore, the experiments show
that the performance of an object detector can be improved by training it on a synthetic
dataset and then further training it on real data. Interestingly, a model trained only
on geometric data outperforms the same object detector trained exclusively on real data,
which uses both geometric and color data.
In addition to the ability to detect objects, a service robot also needs the ability to
interact with objects in its environment. A particular challenge arises with openable
objects such as cabinets, nightstands, and closets. While traditional openable objects
often have simple handles that can be grasped to open them, modern design increasingly
includes handleless doors and drawers. This poses a significant challenge for service
robots, because most existing methods for robotic door opening focus on objects with
handles, and there is little research on interaction with handleless objects.
This work addresses the first key step in interacting with such objects: classification
of the opening mechanism. Three categories are proposed for this classification: objects
with handles, objects that can be grasped by their surface to be opened, and objects with
a push-to-open mechanism. Given that objects in the latter two categories often look
similar, and that even humans can find it hard to tell them apart based on appearance
alone, this work explores methods that use both images of the objects and images in which
humans demonstrate the approach to opening them. |