Building an iOS Hand-drawing Recognition Augmented Reality App Using CoreML, Vision, ARKit, and SceneKit.

Rainer Regan
15 min read · May 31, 2023


ARKit and CoreML Logo. The logo copyright belongs to Apple.

AR, or augmented reality, is one of the most fascinating technologies to emerge from the modern mobile engineering era. Augmented reality lets you see computer-generated content, such as three-dimensional objects, seamlessly integrated with the real world using nothing more than your smartphone camera. This opens up many possibilities for new ideas in mobile app development and the games industry.

In the Apple ecosystem, there is a tool that lets developers build amazing AR apps with ease. This tool, or rather framework, is ARKit. Developed by Apple, ARKit is a powerful framework that empowers developers to create immersive and interactive AR applications for iOS devices.

Another great framework developed by Apple is Core ML, a powerful framework that brings the potential of machine learning to Apple devices and empowers developers to create intelligent applications.

Core ML simplifies the integration of ML models into iOS, iPadOS, macOS, watchOS, and tvOS apps, making machine learning accessible to developers without extensive ML expertise. Together with Apple's pre-trained models and related frameworks, it covers areas like image recognition and natural language processing, as well as more complex tasks like object detection and sentiment analysis.

Introduction

In this article, we will focus on creating an iOS augmented reality app integrated with the Core ML framework to recognize the user's hand drawings using a trained machine learning model. For learning purposes, I trained the ML model with Apple's Create ML app, which simplifies the training process. The model only covers six classes of hand-drawing patterns: a tree, the Eiffel Tower, a bridge, a traffic light, a bus, and a car.

An example of the Eiffel Tower hand drawing used for model training.

This article will not cover the ML training process, as that deserves a separate article. Training a machine learning model takes a lot of time, so I have prepared a trained model that you can download here. It was trained on 1,500+ hand drawings from Google's Quick, Draw! dataset, converted into images before being processed in the Create ML app. By using this model, we can save time and stay focused on the development process.

Prerequisites

In this article, we will use macOS 13.4+ with Xcode 14.3 (14E222b). You will also need a physical iPhone compatible with this Xcode version to test the AR app, because an AR session cannot run in the iOS Simulator. So, prepare your device and software, and we will start right away 🔥.

If you want to skip reading and jump directly to the project, you can visit my GitHub repository here:

https://github.com/rainerregan/HandDrawingRecognitionAR

The Setup

Before we do anything further, we need to create a new Augmented Reality App project in Xcode.

Let's name it “Hand Drawing Recognition AR”. We will use Storyboard for the interface and SceneKit as the content technology.

After the project is created, Xcode automatically generates boilerplate code for a sample app. We can take advantage of this sample project and build on it.

I want to keep my code as clean as possible, so I will separate the delegate conformance into an extension:

//
//  ViewController.swift
//  Hand Drawing Recognition AR
//
//  Created by Rainer Regan on 27/05/23.
//

import UIKit
import SceneKit
import ARKit

class ViewController: UIViewController {

    @IBOutlet var sceneView: ARSCNView!

    override func viewDidLoad() {
        super.viewDidLoad()

        // Set the view's delegate
        sceneView.delegate = self

        // Show statistics such as fps and timing information
        sceneView.showsStatistics = true

        // Create a new scene
        let scene = SCNScene(named: "art.scnassets/ship.scn")!

        // Set the scene to the view
        sceneView.scene = scene
    }

    override func viewWillAppear(_ animated: Bool) {
        super.viewWillAppear(animated)

        // Create a session configuration
        let configuration = ARWorldTrackingConfiguration()

        // Run the view's session
        sceneView.session.run(configuration)
    }

    override func viewWillDisappear(_ animated: Bool) {
        super.viewWillDisappear(animated)

        // Pause the view's session
        sceneView.session.pause()
    }

}

// MARK: - ARSCNViewDelegate
extension ViewController: ARSCNViewDelegate {
    func session(_ session: ARSession, didFailWithError error: Error) {}
}

This is the starter code of a SceneKit-based AR project. As you can see, ViewController is the main class of this app, and most of our work will happen in this file.

If we run this boilerplate code, we will see something like the image below: the template's default 3D spaceship model placed in the world in front of the camera. You can walk around the model to view it from different angles.

Your first AR application using ARKit and SceneKit

Congratulations! You have created your first AR application 🔥. But this is not what we are aiming for in this article; we are going to build a much cooler AR app.

The 3D Assets

Thanks to some 3D artists on the internet, we can gather 3D assets to use in the AR application. Based on the trained ML model mentioned earlier, we will focus on six classes of objects: a tree, the Eiffel Tower, a bridge, a traffic light, a bus, and a car. So those are the 3D assets we will be using.

To use 3D models in SceneKit, the model format should be .usdz. USDZ is a 3D file format used on Apple platforms, especially for AR. With this format, we can import a 3D model directly into the Xcode project without any issues.

I will use Sketchfab as my source of 3D models, sticking to the free models available there. Here is the list:

You need to download all the assets from the list above in order to continue building the AR app. Note: you are not constrained to these 3D models; you can use any models you want, as long as they match the classes of the trained machine learning model. Make sure to download each 3D model as .usdz, and then rename each file to its respective classification name, e.g. Laurel_tree_---.usdz becomes tree.usdz.

After you have downloaded and renamed the assets, drag and drop them directly into the project to import them.

Make sure to tick “Copy items if needed” to copy the files into the project.

After importing the files, your project structure will look like this.

Project structure after importing the 3D files
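
If you want to double-check that every model actually made it into the app bundle, here is a small optional sketch you could drop into viewDidLoad temporarily. The file names are assumptions based on the renaming convention above; adjust them to match the class labels of your trained model.

// Optional sanity check: confirm every expected .usdz asset is in the bundle.
// The names below are assumptions based on my renaming convention;
// adjust them to match your model's class labels.
let expectedModels = ["tree", "eiffel_tower", "bridge", "traffic_light", "bus", "car"]
for name in expectedModels {
    if Bundle.main.url(forResource: name, withExtension: "usdz") == nil {
        print("Missing 3D asset: \(name).usdz")
    }
}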

Creating UI

After the 3D assets are imported, we need to create a basic interface for the scanning session: a debug label for displaying the detected classification and a reset button for clearing the virtual world of placed 3D models.

We can start by modifying the main screen in the storyboard. The components we need are a label and a button. To make the app more interactive, we can also add a ‘crosshair’ image at the center of the screen so it is easier for the user to aim at their hand drawing.

Default main screen

Open the main screen storyboard and you will see the default layout of this page. We won't go into detail on how to modify the UI; we only need to add a few basic components.

Adding a label to the page

First, add a label to the page, center it horizontally, and add some spacing from the top of the safe area. After that, link the component to the ViewController by dragging the label from the structure tree into the code while holding the ‘control’ key.

Connect the component to ViewController

We will name this component ‘debugText’. In ViewController, the code should look like this:


class ViewController: UIViewController {
    // ...

    @IBOutlet weak var debugText: UILabel!

    // ...
}

Next, we will add a Button component and link it to the ViewController in the same way.

Besides connecting the outlet, we also want to create the button's action to handle the reset function. For now, leave it empty apart from a print statement.


class ViewController: UIViewController {

    // ...
    @IBOutlet weak var resetButton: UIButton!

    @IBAction func resetButtonAction() {
        print("Reset")
    }

    // ...
}
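
By the way, the crosshair mentioned earlier can simply be a UIImageView placed in the storyboard and centered with constraints. If you prefer to add it in code instead, here is a minimal sketch you could call from viewDidLoad; it assumes an image named "crosshair" exists in your asset catalog.

// Optional: add a centered crosshair programmatically instead of via the storyboard.
// Assumes an image named "crosshair" exists in the asset catalog.
func addCrosshair() {
    let crosshairImageView = UIImageView(image: UIImage(named: "crosshair"))
    crosshairImageView.translatesAutoresizingMaskIntoConstraints = false
    view.addSubview(crosshairImageView)
    NSLayoutConstraint.activate([
        crosshairImageView.centerXAnchor.constraint(equalTo: view.centerXAnchor),
        crosshairImageView.centerYAnchor.constraint(equalTo: view.centerYAnchor),
        crosshairImageView.widthAnchor.constraint(equalToConstant: 40),
        crosshairImageView.heightAnchor.constraint(equalToConstant: 40)
    ])
}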

We also need to lock the device orientation to portrait so the UI will not break. Inside the viewDidLoad() function, add this code:

// Lock the device orientation to the desired orientation
UIDevice.current.setValue(UIInterfaceOrientation.portrait.rawValue, forKey: "orientation")

So the whole viewDidLoad function would look like this:

override func viewDidLoad() {
    super.viewDidLoad()

    // Lock the device orientation to the desired orientation
    UIDevice.current.setValue(UIInterfaceOrientation.portrait.rawValue, forKey: "orientation")

    // Set the view's delegate
    sceneView.delegate = self

    // Show statistics such as fps and timing information
    sceneView.showsStatistics = true

    // Create a new scene
    let scene = SCNScene(named: "art.scnassets/ship.scn")!

    // Set the scene to the view
    sceneView.scene = scene
}

With this code, the orientation will not change even when the iPhone is rotated sideways.

In addition, override two properties inside the ViewController class to force the application's orientation.

override var supportedInterfaceOrientations: UIInterfaceOrientationMask {
    // Specify the supported orientations (in this case, only portrait)
    return .portrait
}

override var shouldAutorotate: Bool {
    // Disable autorotation
    return false
}

The Tap Gesture

In this article, we will use raycasting to place the 3D model in the virtual AR world. To implement this, we need to set up a handleTap function and connect it to a tap gesture recognizer.

First, create a handleTap function inside the ViewController class.


@objc func handleTap(gestureRecognizer : UITapGestureRecognizer) {
    /// Create a raycast query using the current frame
    if let raycastQuery: ARRaycastQuery = sceneView.raycastQuery(
        from: gestureRecognizer.location(in: self.sceneView),
        allowing: .estimatedPlane,
        alignment: .horizontal
    ) {
        // Performing raycast from the clicked location
        let raycastResults: [ARRaycastResult] = sceneView.session.raycast(raycastQuery)

        print(raycastResults.debugDescription)

        // Based on the raycast result, get the closest intersecting point on the plane
        if let closestResult = raycastResults.first {
            /// Get the coordinate of the clicked location
            let transform : matrix_float4x4 = closestResult.worldTransform
            let worldCoord : SCNVector3 = SCNVector3Make(transform.columns.3.x, transform.columns.3.y, transform.columns.3.z)

            /// Load 3D Model into the scene as SCNNode and adding into the scene
            // TODO: Later, we will load the 3D model at the raycast location.
        }
    }
}

In this function, we first create a raycast query from the tap location in the sceneView toward an estimated plane in the AR environment; in this case, we only allow horizontal planes.

The query returns results containing the transform of the raycast hit on the estimated plane. From the closest result, we extract the transform as a 4x4 float matrix and convert it into a world coordinate. Later, we will load and place the selected 3D model at that location.

After creating the handleTap function, create a new UITapGestureRecognizer inside the viewDidLoad method and add it to the view.


// Set the tap gesture recognizer
let tapGesture = UITapGestureRecognizer(
    target: self,
    action: #selector(self.handleTap(gestureRecognizer:))
)
view.addGestureRecognizer(tapGesture)

AR Configuration

We need to specify which plane types ARKit should detect, so we adjust the configuration in the viewWillAppear method. Add this line right after the configuration is initialized.

// Create a session configuration
let configuration = ARWorldTrackingConfiguration()
configuration.planeDetection = [.horizontal]

The Machine Learning

After setting up the AR side, we can focus on the more exciting part: machine learning. In this article, we use a pre-trained model, so we don't have to go through a tiring training session. You can download the machine learning model from the GitHub repository.

First, create a new group named ‘Model’ at the root of the project.

After that, import the file into the group by dragging and dropping it directly.

The model is now integrated with the project and ready to use, so we can start wiring it up in our code.

Next, we bring the model into our view controller by creating new variables. Since the code below uses MLModelConfiguration and, later, Vision requests, also add import CoreML and import Vision at the top of ViewController.swift.

/// The ML model to be used for recognition of arbitrary objects.
private var _handDrawingModel: HandDrawingModel_v4!
private var handDrawingModel: HandDrawingModel_v4! {
    get {
        if let model = _handDrawingModel { return model }
        _handDrawingModel = {
            do {
                let configuration = MLModelConfiguration()
                return try HandDrawingModel_v4(configuration: configuration)
            } catch {
                fatalError("Couldn't create HandDrawingModel due to: \(error)")
            }
        }()
        return _handDrawingModel
    }
}

This code loads the hand-drawing machine learning model we imported earlier and stores it in a variable.

Now that the model is available in our code, we can start using it. Inside the ViewController, we will create a classification request and a handler function to process its result.

Under the handleTap function, create the classification request as a lazy property, like below.

private lazy var classificationRequest: VNCoreMLRequest = {
    do {
        // Instantiate the model from its generated Swift class.
        let model = try VNCoreMLModel(for: handDrawingModel.model)
        let request = VNCoreMLRequest(model: model, completionHandler: { [weak self] request, error in
            self?.classificationCompleteHandler(request: request, error: error)
        })

        // Crop input images to square area at center, matching the way the ML model was trained.
        request.imageCropAndScaleOption = .centerCrop

        // Use CPU for Vision processing to ensure that there are adequate GPU resources for rendering.
        request.usesCPUOnly = true

        return request
    } catch {
        fatalError("Failed to load Vision ML model: \(error)")
    }
}()

This request takes the image captured by the AR camera and processes it with Vision to obtain a classification result from the machine learning model. When it finishes, it calls a completion handler to process the result, so we need to create another method, this time called classificationCompleteHandler.

func classificationCompleteHandler(request: VNRequest, error: Error?) {
    guard let results = request.results else {
        print("Unable to classify image.\n\(error!.localizedDescription)")
        return
    }

    // The `results` will always be `VNClassificationObservation`s, as specified by the Core ML model in this project.
    let classifications = results as! [VNClassificationObservation]

    // Show a label for the highest-confidence result (but only above a minimum confidence threshold).
    if let bestResult = classifications.first(where: { result in result.confidence > 0.5 }),
       let label = bestResult.identifier.split(separator: ",").first {
        identifierString = String(label)
        confidence = bestResult.confidence
    } else {
        identifierString = ""
        confidence = 0
    }

    DispatchQueue.main.async { [weak self] in
        self?.displayClassifiedResult()
    }
}

This function handles the results returned by the classification request. It extracts the best result above a confidence threshold and saves it into class-level variables, so we need to create those first.


class ViewController: UIViewController {

    // ...

    /// Variables containing the latest CoreML prediction
    private var identifierString = ""
    private var confidence: VNConfidence = 0.0
    private let dispatchQueueML = DispatchQueue(label: "com.exacode.dispatchqueueml") // A serial queue
    private var currentBuffer : CVImageBuffer?

    // ...
}

The identifierString stores the latest prediction result as a string, and confidence stores the confidence level of that prediction. The dispatchQueueML is a serial queue for machine learning requests, and currentBuffer holds the latest frame buffer from the sceneView before it is passed into the ML request.

Next, we need a function that passes the image buffer of the captured sceneView frame to the machine learning request. It acts as a bridge into the ML request.

func classifyCurrentImage() {
    let orientation = CGImagePropertyOrientation(UIDevice.current.orientation)

    let imageRequestHandler = VNImageRequestHandler(
        cvPixelBuffer: currentBuffer!,
        orientation: orientation
    )

    // Run Image Request
    dispatchQueueML.async {
        do {
            // Release the pixel buffer when done, allowing the next buffer to be processed.
            defer { self.currentBuffer = nil }
            try imageRequestHandler.perform([self.classificationRequest])
        } catch {
            print(error)
        }
    }
}

To get the correct orientation for the captured frame, we create a new utility file called ‘Utilities’ that extends CGImagePropertyOrientation.

Create a new group called Support, and inside it, create a file called Utilities.

import Foundation
import ARKit

// Convert device orientation to image orientation for use by Vision analysis.
extension CGImagePropertyOrientation {
    init(_ deviceOrientation: UIDeviceOrientation) {
        switch deviceOrientation {
        case .portraitUpsideDown: self = .left
        case .landscapeLeft: self = .up
        case .landscapeRight: self = .down
        default: self = .right
        }
    }
}

In short, we capture each frame of the AR view from the sceneView and send its image buffer through this bridging function into the machine learning request.

This step is optional, but to let the user know what the model has detected, we can create one more function to display the result in the debug label and the console.

func displayClassifiedResult() {
    // Print the classification
    print("Classification: \(self.identifierString)", "Confidence: \(self.confidence)")
    print("---------")

    self.debugText.text = "I'm \(self.confidence * 100)% sure this is a/an \(self.identifierString)"
}

This function is called from classificationCompleteHandler and displays the classification result in the console and in the label on the UI.
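
As a small optional tweak that is not part of the original code, you could round the confidence before displaying it, so the label does not show a long decimal:

// Optional: show the confidence rounded to one decimal place.
let percentage = String(format: "%.1f", Double(self.confidence) * 100)
self.debugText.text = "I'm \(percentage)% sure this is a/an \(self.identifierString)"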

Starting the Classification Process

To run the classification, we need to call it every time a frame is updated. To do this, create an extension of the ViewController class that implements ARSessionDelegate.

Inside the extension, implement the session(_:didUpdate:) delegate method to track frame updates.


// MARK: - ARSessionDelegate
extension ViewController: ARSessionDelegate {
    func session(_ session: ARSession, didUpdate frame: ARFrame) {
        guard currentBuffer == nil, case .normal = frame.camera.trackingState else {
            return
        }

        // Retain the image buffer for Vision processing.
        self.currentBuffer = frame.capturedImage
        classifyCurrentImage()
    }
}

In the viewDidLoad function, set the session delegate to the view controller itself:

sceneView.session.delegate = self

After that, create a function to restart the AR session and start the classification process.

// MARK: - Restart Session and Rerun AR
private func restartSession() {
    let configuration = ARWorldTrackingConfiguration()
    configuration.planeDetection = [.horizontal, .vertical]
    sceneView.session.run(configuration, options: [.resetTracking, .removeExistingAnchors])
}

At the bottom of the viewDidLoad function, call this method:

override func viewDidLoad() {
    // ...

    // Start the scanning session
    self.restartSession()
}

With this, you can now see the classification result in the label text.

Congratulations! You have implemented image classification using the AR camera. Now we can focus on displaying the 3D model.

Displaying the 3D Model

Now that the ML prediction is working, we can focus on displaying the 3D model.

First, we need a function that determines which 3D model to load based on the prediction. Inside the ViewController class, create one more function.

func loadNodeBasedOnPrediction(_ text: String) -> SCNNode? {
    guard let urlPath = Bundle.main.url(forResource: text.trimmingCharacters(in: .whitespacesAndNewlines), withExtension: "usdz") else {
        return nil
    }
    let mdlAsset = MDLAsset(url: urlPath)
    mdlAsset.loadTextures()

    let asset = mdlAsset.object(at: 0) // Extract the first object
    let assetNode = SCNNode(mdlObject: asset)
    assetNode.scale = SCNVector3(0.001, 0.001, 0.001)

    return assetNode
}

This function takes the predicted class name as a string, loads the matching .usdz asset from the bundle, and returns it as an SCNNode.

For this function, we use MDLAsset and the SCNNode(mdlObject:) initializer from SceneKit's Model I/O integration, so we need to import it.

// Other imports
import SceneKit.ModelIO
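
One thing to be aware of: the hard-coded 0.001 scale happens to work for the assets I used, but different USDZ files are modeled at different sizes. If your models show up too big or too small, a per-model scale lookup is an easy fix. This is only a sketch; the values below are placeholders you would tune for the specific assets you downloaded.

/// Optional helper: per-model scale factors instead of a single 0.001 for everything.
/// The values below are placeholders; tune them for the assets you actually use.
func scale(forModelNamed name: String) -> SCNVector3 {
    let scales: [String: Float] = [
        "tree": 0.001,
        "eiffel_tower": 0.0005,
        "car": 0.002
    ]
    let s = scales[name, default: 0.001]
    return SCNVector3(s, s, s)
}

Inside loadNodeBasedOnPrediction, you would then replace the hard-coded scale with assetNode.scale = scale(forModelNamed: text).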

Back in the handleTap function, call this function whenever the screen is tapped.

@objc func handleTap(gestureRecognizer : UITapGestureRecognizer) {

    /// Create a raycast query using the current frame
    if let raycastQuery: ARRaycastQuery = sceneView.raycastQuery(
        from: gestureRecognizer.location(in: self.sceneView),
        allowing: .estimatedPlane,
        alignment: .horizontal
    ) {
        // Performing raycast from the clicked location
        // ...

        // Based on the raycast result, get the closest intersecting point on the plane
        if let closestResult = raycastResults.first {
            /// Get the coordinate of the clicked location
            // ...

            /// Load 3D Model into the scene as SCNNode and adding into the scene
            guard let node : SCNNode = loadNodeBasedOnPrediction(identifierString) else { return }
            sceneView.scene.rootNode.addChildNode(node)
            node.position = worldCoord
        }
    }
}

As you can see, we create an SCNNode on every tap, calling the loadNodeBasedOnPrediction method to get the right 3D model. We then add the node to the scene and set its position based on the raycast result.

Testing The Result

To test the result of what we have created before, take a piece of paper and draw one of these objects: a tree, the Eiffel Tower, a bridge, a traffic light, a car, or a bus. After that, start the application and point your camera to the hand drawing.

A tree is placed on the hand drawing.

Congratulations !!! 🎉 👏 You have created an AR application that can detect your hand drawing!

Optional: Reset Button

This is an optional step and does not affect the main functionality of the app. It covers the implementation of the reset button, which clears the virtual world of any placed objects.

To start, modify the IBAction we created earlier to this code:

@IBAction func resetButtonAction() {
    print("Reset")
    guard let sceneView = sceneView else { return }
    sceneView.scene.rootNode.enumerateChildNodes { (node, stop) in
        node.removeFromParentNode()
    }
}

Re-run the application and you will be able to reset the world.

Conclusion

Congratulations on your first app integrating AR and machine learning. This article covers only a small fraction of what you can build with these technologies. It is fascinating that Apple developed these frameworks; now it's your turn to create even cooler applications with them.


Rainer Regan

Hi, I'm a software engineer, I write articles to share my experiences in software engineering. Founder of Exacode Systems. https://exacode.io