International conference on machine vision. Journal section: Computer Optics
Article
The classical Otsu method is a common tool in document image binarization. Often, the two classes, text and background, are imbalanced, which means that the assumption of the classical Otsu method is not met. In this work, we considered imbalanced pixel classes of background and text: the weights of the two classes are different, but the variances are the same. We experimentally demonstrated that employing a criterion that takes the imbalance of the class weights into account allows attaining higher binarization accuracy. We described a generalization of the criterion for a two-parametric model, for which we proposed an algorithm for optimal linear separation search via fast linear clustering. We also demonstrated that the two-parametric model with the proposed separation increases binarization accuracy for documents with a complex background or spots.
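For reference, the classical criterion that this work generalizes can be sketched in a few lines: a plain NumPy implementation of Otsu's threshold search, maximizing the between-class variance over a 256-bin histogram. The paper's imbalance-aware criterion modifies this objective (unequal class weights, equal variances); that modification is not reproduced here.

```python
import numpy as np

def otsu_threshold(gray):
    # Classical Otsu: choose the threshold k that maximizes the
    # between-class variance sigma_b^2(k) of the two pixel classes.
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    p = hist / hist.sum()
    omega = np.cumsum(p)                 # weight of class 0 at threshold k
    mu = np.cumsum(p * np.arange(256))   # cumulative mean
    mu_total = mu[-1]
    denom = omega * (1.0 - omega)
    sigma_b = np.zeros(256)
    valid = denom > 0                    # guard against empty classes
    sigma_b[valid] = (mu_total * omega[valid] - mu[valid]) ** 2 / denom[valid]
    return int(np.argmax(sigma_b))
```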
Research article
We present a computer vision method to assess the cognitive load and stress resistance of an agricultural unmanned aerial vehicle operator during mission planning in virtual reality. The approach combines a geometrically rigorous gaze-to-user-interface mapping (projecting the eye-tracker ray into widget space to obtain metrically correct hits on areas of interest), behavioral and ocular biomarkers, and image-like attention representations such as heatmaps and recurrence plots. In a study with twelve participants across four scenarios, we recorded 1,198 interaction events and obtained 85.3% accuracy of gaze-to-interface hits; with increasing difficulty, fixation durations shortened, transition entropy increased, and event-locked pupil responses became larger and slower to recover. Planning time and the time required for replanning increased, while route quality decreased under time pressure. The approach relies only on aggregate signals from the software platform and does not use raw eye images, which supports privacy-preserving deployment and portability to ground control software.
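At its core, the gaze-to-interface mapping described above reduces to a ray-plane intersection in widget space. A minimal sketch (function and parameter names are ours, not the system's API):

```python
import numpy as np

def gaze_hit(origin, direction, plane_point, plane_normal):
    # Intersect the eye-tracker gaze ray with a planar UI widget;
    # returns the metric 3D hit point, or None if the ray is parallel
    # to the widget plane or points away from it.
    o = np.asarray(origin, float)
    d = np.asarray(direction, float)
    n = np.asarray(plane_normal, float)
    denom = d @ n
    if abs(denom) < 1e-9:
        return None                      # ray parallel to the plane
    t = ((np.asarray(plane_point, float) - o) @ n) / denom
    return None if t < 0 else o + t * d  # hit only in front of the eye
```

The hit point would then be converted to local widget coordinates to test it against areas of interest.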
Algorithm for choosing the best frame in a video stream in the task of identity document recognition
Article
During document recognition in a video stream using a mobile device camera, the image quality of the document varies greatly from frame to frame. Sometimes the recognition system is required not only to recognize all the specified attributes of the document, but also to select a final document image of the best quality. This is necessary, for example, for archiving or providing various services; in some countries it can be required by law. In this case, the recognition system needs to assess the quality of the frames in the video stream and choose the "best" one. In this paper, we consider a version of this problem in which the "best" frame means a document image in which all specified attributes are present in readable form. The method was tuned on a private dataset and then tested on documents from the open MIDV-2019 dataset. A practically applicable result was obtained for use in recognition systems.
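One simple way to formalize this "best frame" choice is to require every specified attribute to be present and rank frames by their weakest field. The sketch below is illustrative only; field names and the per-field confidence score stand in for the paper's actual quality estimator:

```python
def choose_best_frame(frames, required_fields):
    # Pick the frame where every required attribute is present and
    # the lowest field confidence is highest (maximin selection).
    # `frames` is a list of dicts mapping field name -> confidence.
    best, best_score = None, -1.0
    for frame in frames:
        fields = frame["fields"]
        if not all(f in fields for f in required_fields):
            continue                      # some attribute is unreadable
        score = min(fields[f] for f in required_fields)
        if score > best_score:
            best, best_score = frame, score
    return best
```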
Article
An algorithm for post-processing of grayscale 3D computed tomography (CT) images of porous structures with automatic selection of filtering parameters is proposed. The parameters are determined on a representative part of the image under analysis. A criterion for the search for optimal filtering parameters, based on the count of "levitating stone" voxels, is described. The CT image filtering and binarization stages are performed sequentially. Bilateral and anisotropic diffusion filtering are implemented; the Otsu method for unbalanced classes is chosen for binarization. The proposed algorithm was verified on model data. To create model porous structures, we used our image generator, which implements generation of anisotropic porous structures. Results of post-processing real CT images containing noise and reconstruction artifacts with the proposed method are discussed.
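The "levitating stone" criterion can be illustrated as follows: after binarization, solid-phase voxels that form small components disconnected from the main body are physically implausible, so the parameter search can minimize their count. A pure-Python sketch under that reading of the criterion (6-connectivity; not the paper's implementation):

```python
import numpy as np
from collections import deque

def levitating_voxel_count(vol):
    # Count voxels in "levitating" components: solid-phase connected
    # components (6-connectivity) other than the largest one.
    vol = np.asarray(vol, bool)
    seen = np.zeros_like(vol, bool)
    sizes = []
    for start in zip(*np.nonzero(vol)):
        if seen[start]:
            continue
        q, size = deque([start]), 0      # BFS over one component
        seen[start] = True
        while q:
            z, y, x = q.popleft()
            size += 1
            for dz, dy, dx in ((1,0,0), (-1,0,0), (0,1,0),
                               (0,-1,0), (0,0,1), (0,0,-1)):
                nb = (z + dz, y + dy, x + dx)
                if all(0 <= c < s for c, s in zip(nb, vol.shape)) \
                        and vol[nb] and not seen[nb]:
                    seen[nb] = True
                    q.append(nb)
        sizes.append(size)
    return 0 if len(sizes) <= 1 else sum(sizes) - max(sizes)
```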
AlphaDent: A dataset for automated tooth pathology detection
Research article
In this article, we present AlphaDent, a new unique dataset for dental research. The dataset is based on DSLR camera photographs of the teeth of 295 patients and contains over 1200 images. It is labeled for the instance segmentation problem and is divided into 9 classes. The article provides a detailed description of the dataset and the labeling format, as well as the details of an experiment on neural network training for instance segmentation using this dataset. The results obtained show high prediction quality. The dataset is published under an open license; the training/inference code and model weights are also available under open licenses.
Research article
Reading Aztec codes is crucial in many practical applications and is well studied for simple scenarios. However, mobile phone-based decoding is challenging under uncontrolled conditions and when the codes are printed on irregular surfaces such as warped paper. The codes must remain readable even though paper is flexible and not perfectly planar. Our novel method addresses this problem by considering local variations in adjacent symbol modules using conventional image processing techniques. It is particularly effective for Aztec Compact symbols, which lack reference elements. We evaluate it on the specially modelled CoBRA-CYL-AZ dataset, which includes curved and cropped symbol examples, and further confirm the method's applicability on a small dataset of real photos. Both the synthetic and real datasets are made publicly accessible on Zenodo. The proposed method achieves 0.59 accuracy on the CoBRA-CYL-AZ dataset, significantly outperforming the popular open-source readers ZXing (0.02), ZXing-cpp (0.16), and Dynamsoft (0.16). While our method is applicable to any Aztec symbology, it specifically targets scanning of distorted and damaged Aztec Compact codes.
Bulk cargo volume measurement for moving dump trucks with a single-layer LiDAR and a camera
Research article
The paper addresses the problem of non-contact bulk cargo volume estimation for moving dump trucks. A common scanning method that makes it possible to evaluate the volume of cargo with a complex surface on a moving truck uses two single-layer (2D) Light Detection and Ranging (LiDAR) sensors: one scans the vehicle in a plane perpendicular to its movement, and the second estimates vehicle displacements to restore the positions of the scans on an axis along the direction of movement. While LiDAR sensors provide reliable measurement signals in controlled environments, their efficacy drastically decreases under challenging outdoor conditions: sand, dust, fog, and heavy precipitation cause false detections and distort the LiDAR signal. Thus, vehicle displacements estimated from a highly corrupted LiDAR signal cannot be used for a reliable measurement, as they may lead to significant volume calculation errors. This is partially solved by multi-echo LiDARs, where distorted returns can be separated from relevant ones. In contrast to a single-echo 2D LiDAR, image data from industrial cameras is less sensitive to sand, dust, or fog. In this paper, we propose a novel bulk cargo estimation method that uses only one 2D LiDAR and estimates vehicle displacements with a camera and computer vision methods. As we demonstrate on a diverse dataset of 730 pairs of dump truck passes from an operating sand pit, the proposed method is more accurate than the two-LiDAR baseline while requiring a significantly cheaper sensor. If a camera is already present in the volume measurement system and used for loaded-material classification, the proposed method reduces the cost of the solution by the cost of one LiDAR.
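The measurement principle — cross-section scans from a single 2D LiDAR accumulated along camera-estimated vehicle displacements — amounts to integrating per-scan areas over the motion axis. A minimal sketch (function and argument names are ours):

```python
def cargo_volume(cross_section_areas, positions):
    # Integrate per-scan cargo cross-section areas (m^2) over the
    # vehicle displacement axis (m) with the trapezoidal rule.
    # `positions` would come from the camera-based displacement
    # estimator rather than a second 2D LiDAR.
    volume = 0.0
    for i in range(1, len(positions)):
        dx = positions[i] - positions[i - 1]
        volume += 0.5 * (cross_section_areas[i] + cross_section_areas[i - 1]) * dx
    return volume
```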
Research article
Identity document recognition is becoming more and more common in our daily lives. As security measures and document standards improve, the number of documents that need to be recognized is also increasing. Thus, one of the essential tasks of identity document recognition systems is to identify the document type from thousands of possible variants. However, in many cases we have supplementary information and can reduce the set of possible types on the fly to improve processing speed and quality. In this paper, we discuss ID document recognition with on-the-fly type subset selection. The main challenges in such a system are responding within a limited time and achieving computational and memory efficiency for subset handling. We propose a solution based on a feature-matching approach using binary keypoint descriptors and adjusted multi-index hashing, which uses two new heuristics to ensure a constant number of comparisons for each request. We experimentally evaluate this method on the MIDV-500 and MIDV-2019 datasets and demonstrate that it offers an excellent combination of accuracy, configuration time, and search time compared to commonly used hierarchical clustering, hierarchical navigable small-world graphs, multi-probe locality-sensitive hashing, and straightforward brute-force solutions.
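The multi-index hashing idea at the core of this search can be sketched as follows: split each binary descriptor into m chunks and index it in one exact-match hash table per chunk; by the pigeonhole principle, any database descriptor within Hamming distance below m shares at least one chunk with the query. The paper's two heuristics for bounding the number of comparisons are not reproduced here; class and method names are ours.

```python
from collections import defaultdict

class MultiIndexHash:
    # Sketch of multi-index hashing for binary keypoint descriptors.
    def __init__(self, nbytes=32, m=4):
        self.m, self.chunk = m, nbytes // m
        self.tables = [defaultdict(list) for _ in range(m)]
        self.db = []

    def add(self, desc: bytes):
        idx = len(self.db)
        self.db.append(desc)
        for i in range(self.m):
            # index descriptor by its i-th chunk in table i
            self.tables[i][desc[i * self.chunk:(i + 1) * self.chunk]].append(idx)

    def candidates(self, query: bytes):
        # retrieve all descriptors sharing at least one exact chunk;
        # candidates would then be re-ranked by full Hamming distance
        out = set()
        for i in range(self.m):
            out.update(self.tables[i].get(
                query[i * self.chunk:(i + 1) * self.chunk], ()))
        return out
```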
Research article
This paper addresses neural network segmentation of a human olfactory bulb sample on X-ray phase-contrast tomographic reconstruction. The olfactory bulb plays a key role in the primary processing of olfactory information. It consists of several nested cell layers, the morphometric analysis of which has important diagnostic value. However, manual segmentation of the reconstructed volume is labor-intensive and requires high qualifications, which makes the development of automated segmentation methods crucial. X-ray phase-contrast tomography provides a high-resolution reconstruction of the olfactory bulb morphological structure. The resulting reconstructions are characterized by excessive morphological details and reconstruction artifacts. These features, combined with limited data volume, visual similarity of neighboring slices, and sparse ground truth, hinder the application of standard neural network-based segmentation approaches. This paper examines the characteristics of the data under consideration and suggests a training pipeline for a convolutional neural network, including inter-slice smoothing at the data preprocessing stage, alternative strategies for splitting the data into subsets, a set of augmentations, and training on sparse sampling. The proposed adaptations achieved a Dice score (micro) value of 0.93 on the test subset. An ablation study demonstrated that each of the above-mentioned modifications independently improves segmentation quality. The presented training pipeline can be applied to the segmentation of morphological structures on tomographic images in biomedical tasks with a limited dataset and non-standard ground truth.
Enhanced dynamic programming-based method for text line recognition in documents
Research article
On-premise text recognition is in demand: customers want to recognize bank cards to pay online, passports to fill in ticket information, and much more, using their smartphones. As the main approach to text recognition over the last two decades has been artificial neural networks, the resulting solutions tend to be resource-hungry and ill-suited to mobile devices. In our paper, we introduce an enhanced method based on dynamic programming and a fully convolutional network for text line recognition that allows this classic model to demonstrate results competitive with much heavier architectures. The main idea is the addition of a special pin symbol to the network alphabet that makes it possible to analyze the raw neural network output effectively with dynamic programming. As our main focus is the recognition of identity documents, we employ the public dataset MIDV-500 and its extension MIDV-2019 as a test sample. We compare our recognizer with several published models, including TrOCR, Paddle OCR, and Tesseract OCR 5, to demonstrate its superior accuracy-performance trade-off. Our method is about 200 times faster than TrOCR and in most cases about 2 times faster than Paddle OCR. The accuracy of our recognizer is comparable with Paddle OCR on MIDV-500 and better on MIDV-2019, including being about 2 times more accurate on machine-readable zone images.
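For context, the baseline such methods build on is best-path decoding of the per-window network output: argmax each column, collapse repeats, drop the blank. The paper adds a special pin symbol to the alphabet and replaces this greedy pass with a dynamic-programming analysis; only the standard baseline is sketched here.

```python
import numpy as np

def best_path_decode(logits, alphabet, blank=0):
    # Greedy CTC-style decoding of a (time, classes) score matrix:
    # take the argmax per column, collapse consecutive repeats,
    # and drop the blank symbol.
    path = np.argmax(logits, axis=1)
    out, prev = [], blank
    for k in path:
        if k != blank and k != prev:
            out.append(alphabet[k])
        prev = k
    return "".join(out)
```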
Fast localization and rectification of documents folded into thirds
Research article
The ubiquitous usage of smartphones makes camera-captured document images as widely used as scanned ones as the input of a modern document recognition system. A document captured by a smartphone camera may appear mechanically distorted in the image, creating the need for an image rectification step. The present paper considers a particular case of document image distortion. Specifically, if a business document is sent via a postal service, it may need to be folded to fit the envelope. Once the document is taken out of the envelope and unfolded, its geometric shape is distorted in a very particular pattern. Since the most popular envelope formats in Europe and America require the document to be folded into thirds, this case is considered in this paper. We propose a novel content-independent, model-based algorithm for the localization and geometrical rectification of documents folded into thirds. Our algorithm outperforms current SOTA rectification methods on the recently published FDI dataset by key rectification accuracy metrics (AD and CER) and is able to rectify documents held in hand. Moreover, it can be executed on a mobile CPU with a reasonable execution time: it takes only about 17 ms to localize a document and about 110 ms to projectively rectify it. This makes it possible to embed the proposed algorithm into document recognition systems designed for on-device acquisition.
Handwritten text generation and strikethrough characters augmentation
Research article
We introduce two data augmentation techniques which, used with a ResNet-BiLSTM-CTC network, significantly reduce the Word Error Rate and Character Error Rate beyond the best reported results on handwriting text recognition tasks. We apply a novel augmentation that simulates strikethrough text (HandWritten Blots) and a handwritten text generation method based on printed text (StackMix), both of which proved to be very effective in handwriting text recognition tasks. StackMix uses a weakly supervised framework to obtain character boundaries. Because these data augmentation techniques are independent of the network used, they could also be applied to enhance the performance of other networks and approaches to handwriting text recognition. Extensive experiments on ten handwritten text datasets show that the HandWritten Blots augmentation and StackMix significantly improve the quality of handwriting text recognition models.
Research article
The human olfactory bulb (OB) is a complex neural structure critical for odor processing and one of the earliest sites of pathology in a number of neurodegenerative diseases. We used X-ray phase-contrast tomography (XPCT) to obtain high-quality 3D images of postmortem OB tissue, allowing detailed visualization of soft tissue microarchitecture, including the olfactory glomeruli. To improve spatial analysis, we developed a computational unfolding method that transforms the curved surface of the OB into a 2D map. This transformation preserves anatomical relationships, allowing accurate quantification of glomeruli by number, size, shape, and distribution. The unfolded representations of the OB support in-depth statistical analysis and are compatible with machine learning tools for automated detection and classification of OB morphological structures. This method provides a powerful framework for studying olfactory function and identifying early structural changes in diseases such as Parkinson's disease, Alzheimer's disease, and COVID-19-associated anosmia. By integrating XPCT with virtual unfolding, we offer a new approach to mapping OB morphological features with increased clarity and diagnostic accuracy.
High-resolution X-ray imaging for industrial process monitoring and quality control
Research article
High-resolution X-ray imaging is an essential component of advanced workflows for industrial process monitoring and quality control (e.g., for metrology and defect inspection in the semiconductor industry). Depending on the specific application area, however, it is subject to different requirements, particularly regarding imaging accuracy and reconstruction fidelity, which are analyzed and systematically structured in this study. As an example, a seamless workflow of two nondestructive techniques with different spatial resolution and different throughput (here shown for a combination of acoustic and X-ray techniques) is proposed to auto-detect and auto-classify defects. X-ray microscopy and high-resolution X-ray computed tomography (XCT) provide nondestructive characterization capabilities on opaque objects, observing features with sizes down to several tens of nanometers. Because of the ability of micro-XCT and nano-XCT to reveal structural characteristics, to determine deviations from a well-defined standard, or to observe kinetic processes, they are suitable imaging techniques for micro- and nano-structured objects, but also for industrial process monitoring and quality control. Typical applications of high-resolution XCT are categorized into three groups: 1) Structure analysis – creation of 3D digital images of the complete interior structure of an opaque object; 2) Flaw detection – monitoring industrial processes and defect inspection; and 3) Quality control – observing kinetic processes in objects important for industrial quality control and reliability engineering. These different categories of applications have different requirements for the accuracy of the 3D reconstruction and for the time-to-data. While the highest possible resolution is requested for group 1, data acquisition and data analysis time are essential for group 2.
Getting high-resolution 3D information on the complete interior structure of an opaque object using lens-based laboratory nano-XCT requires thorough data analysis, e.g., the application of deep convolutional neural networks for denoising and mitigation of artefacts. Kinetic studies for group 3, e.g., of reliability-limiting degradation processes in microchips, provide the opportunity to establish appropriate risk mitigation strategies to avoid catastrophic failure. The rapid evolution of advanced semiconductor technologies, including technologies for heterogeneous 3D integration of ICs and chiplet architectures, presents significant challenges for metrology, defect inspection, and physical failure analysis (PFA). The application of nano-XCT as a highly reliable inspection method requires a balance between throughput and fault detection (i.e., measurement and reconstruction accuracy). Ways to achieve a drastic increase in acquisition speed include high-brilliance laboratory X-ray sources, the application of AI algorithms for new image acquisition protocols, and high-speed data processing. A thorough and systematic analysis of the accuracy needed, and of the consequences for protocol and data analysis, will support the goal of the semiconductor industry to improve throughput in metrology and defect inspection. This work may be of interest to a broad audience, including both specialists in the field of XCT and professionals employing XCT as a tool for industrial applications.
Improving Data Matrix mobile recognition via fast Hough transform and adaptive grid extractors
Research article
The Data Matrix is a barcode symbology originally designed for industrial needs. Today, its symbols are increasingly found on everyday products such as pharmaceutical packaging, electronic components, food labels, and clothing tags. This widespread usage presents a challenge: reading Data Matrix symbols from images captured by mobile cameras in uncontrolled environments. The reading process mainly consists of three steps: barcode localization, segmentation, and decoding. In this work, we focus on the precise localization and segmentation of Data Matrix barcodes. We introduce a new method that involves localization of the finder pattern using the fast Hough transform and subsequent iterative segmentation to extract the encoded message. Our approach demonstrates superior localization quality, as measured by the mean Intersection over Union metric (0.889), and achieves better recognition accuracy (0.903) compared to open-source solutions for reading Data Matrix barcodes, such as libdmtx (0.665), ZXing (0.569), and ZXing-cpp (0.858). Our method requires only 35 milliseconds for computations on an ARM device, enabling real-time processing. It is significantly faster than libdmtx (10 seconds) and ZXing (610 milliseconds), although slightly slower than ZXing-cpp (6.65 milliseconds).
Lightweight neural network-based pipeline for barcode image preprocessing
Research article
Barcode scanning has greatly benefited from deep learning research, as have the image processing stages included in its workflow. These stages commonly handle pre-processing tasks such as localizing barcode symbols in the input image, identifying their type, and normalizing the found regions. They are especially important when there is no a priori knowledge of the image capturing conditions: recognizing multiple barcodes within a single image drastically differs from processing a single barcode in a video stream on a smartphone. We assess how the accuracy of these stages affects the accuracy of barcode scanning as a whole and propose a lightweight neural network-based pipeline implementing the tasks listed above. To perform this assessment and evaluate the proposed pipeline elements, we conduct a series of experiments using a set of popular open-source scanners, including OpenCV, WeChat, ZBar, ZXing, and ZXing-cpp, over the SE-barcode and Dubska datasets. These experiments reveal how the proposed pipeline can be configured for optimum speed and accuracy depending on the objective and the chosen scanner.
MIDV-DM: A Document-Oriented Dataset for Image Manipulation Detection and Localization
Research article
As the scope of application of document recognition systems in business processes increases, so does the number of attacks on these systems. One form of such attacks involves software manipulation of a digital image of a document. The development of methods for image manipulation detection and localization is complicated by the fact that available datasets either contain no document images or lack diversity in capture conditions and document types. Furthermore, these datasets do not cover the range of manipulations that occur under natural conditions. In this paper, we introduce MIDV-DM, a publicly available benchmark designed for the development and testing of methods for detecting and localizing manipulations in identity document images. It contains images subjected to eight types of manipulations, which we have conceptually categorized based on our analysis of over 2000 real-world fraud attempts. In total, MIDV-DM contains 1000 original document images from the public MIDV-2020 dataset and 8000 automatically created manipulated images based on them, along with ground truth masks and annotations. The paper also describes the process of obtaining a baseline quality based on the IML-ViT model. The authors believe that MIDV-DM will open new opportunities for researchers to advance document image authenticity analysis.
Research article
ID document recognition systems are already deeply integrated into human activity, and the pace of integration is only increasing. The first and most fundamental problems of such systems are document image localization and classification. In this field, template matching-based approaches have become widely used: they offer industrial precision, require minimal training data, and provide real-time performance on mobile devices. However, these methods have a significant limitation in scalability: every document type is represented by a set of local features to store and process, which affects the required computing resources. Given the number of different document types supported by modern industrial recognition systems, such approaches become unusable. To mitigate this drawback, we propose a method for selecting a subset of the most "stable" keypoints. To estimate keypoint stability, we synthesize a dataset of images containing various distortions relevant to photographing hand-held documents with a smartphone camera in uncontrolled lighting conditions. For the experiments, we use the well-known MIDV datasets, which were designed to benchmark modern ID document recognition. The experiments show that the proposed method improves ID document detection performance given thousands of document types and limited computing resources.
Neural network regularization in the problem of few-view computed tomography
Research article
Computed tomography makes it possible to reconstruct the inner morphological structure of an object without physically destroying it. The accuracy of the digital image reconstruction directly depends on the measurement conditions of the tomographic projections, in particular on the number of recorded projections. In medicine, the number of measured projections is reduced to lower the patient dose. However, in few-view computed tomography, when only a small number of projections is available, standard reconstruction algorithms lead to degradation of the reconstructed images. The main feature of our approach to few-view tomography is that the algebraic reconstruction is finalized by a neural network while keeping the measured projection data intact, because the additive predicted by the network lies in the null space of the forward projection operator. The final reconstruction is the sum of the additive calculated with the neural network and the algebraic reconstruction: the first is an element of the null space of the forward projection operator, and the second, obtained by applying an algebraic reconstruction method to the few-view sinogram, is an element of its orthogonal complement. The dependency between elements of the null space of the forward projection operator and the algebraic reconstruction is modeled with neural networks. We demonstrate that the suggested approach achieves better reconstruction accuracy and computation time than state-of-the-art approaches on test data from the Low Dose CT Challenge dataset without increasing the reprojection error.
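The decomposition described above — a measured-data-consistent algebraic part plus a network-predicted additive from the null space of the forward projection operator — can be sketched with a dense operator. Real CT operators are applied implicitly; here `lstsq` stands in for the projection onto the row space, and all names are illustrative:

```python
import numpy as np

def nullspace_correction(A, z):
    # Project a network-predicted additive z onto the null space of the
    # forward projection operator A, so that adding it to the algebraic
    # reconstruction leaves the measured sinogram A @ x unchanged.
    # lstsq returns the minimum-norm w with A @ w = A @ z, i.e. the
    # component of z in the row space of A.
    w, *_ = np.linalg.lstsq(A, A @ z, rcond=None)
    return z - w          # lies in null(A): A @ (z - w) ≈ 0

def final_reconstruction(A, x_algebraic, z):
    # Sum of the algebraic reconstruction and the null-space additive.
    return x_algebraic + nullspace_correction(A, z)
```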
Optimal affine image normalization approach for optical character recognition
Article
Optical character recognition (OCR) in images captured from arbitrary angles requires preliminary normalization, i.e., a geometric transformation resulting in an image as if it were captured at an angle suitable for OCR. In most cases, a surface containing characters can be considered flat, and a pinhole model can be adopted for the camera. Thus, in theory, the normalization should be projective. Usually, the camera optical axis is approximately perpendicular to the document surface, so the projective normalization can be replaced with an affine one without a significant loss of accuracy. An affine image transformation is performed significantly faster than a projective normalization, which is important for OCR on mobile devices. In this work, we propose a fast approach for image normalization that utilizes an affine normalization instead of a projective one when there is no significant loss of accuracy. The approach is based on a proposed criterion for normalization accuracy: the root mean square (RMS) coordinate discrepancy over the region of interest (ROI). The problem of optimal affine normalization according to this criterion is considered. We establish that this unconstrained optimization is quadratic and can be reduced to a problem of integrating fractional quadratic functions over the ROI. The latter is solved analytically in the OCR case, where the ROI consists of rectangles. The proposed approach is generalized to cases where special cases of the affine transform are used instead: scaling, translation, shearing, and their superpositions, allowing the image normalization procedure to be further accelerated.
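The criterion can be illustrated by a discrete counterpart: fit an affine map to a given projective normalization H by least squares over points sampled in the ROI and report the RMS coordinate discrepancy. The paper solves the continuous problem analytically by integrating over rectangles; this sketch only samples, and all names are ours:

```python
import numpy as np

def apply_h(H, pts):
    # Apply a 3x3 projective transform H to Nx2 points.
    ph = np.c_[pts, np.ones(len(pts))] @ H.T
    return ph[:, :2] / ph[:, 2:3]

def best_affine(H, roi_pts):
    # Least-squares affine approximation of the projective
    # normalization H over sampled ROI points, minimizing the RMS
    # coordinate discrepancy between H(p) and A(p).
    target = apply_h(H, roi_pts)
    X = np.c_[roi_pts, np.ones(len(roi_pts))]        # rows [x, y, 1]
    A, *_ = np.linalg.lstsq(X, target, rcond=None)   # 3x2 affine params
    rms = np.sqrt(np.mean(np.sum((X @ A - target) ** 2, axis=1)))
    return A, rms
```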