{"id":48387,"date":"2019-05-14T07:00:48","date_gmt":"2019-05-14T06:00:48","guid":{"rendered":"https:\/\/www.clickworker.com\/?p=48387"},"modified":"2022-07-25T17:36:34","modified_gmt":"2022-07-25T16:36:34","slug":"realistic-training-data-for-machine-learning","status":"publish","type":"post","link":"https:\/\/www.clickworker.com\/customer-blog\/realistic-training-data-for-machine-learning\/","title":{"rendered":"Realistic training data for machine learning"},"content":{"rendered":"
<\/p>\r\n
Data are the foundation for training algorithms: the more realistic the data, the better the results, because artificial intelligence depends on precise and reliable information to train its algorithms. This is obvious, but it is often overlooked. Training data are realistic when they reflect the data that the AI system encounters in real operation. Unrealistic data sets hinder machine learning and lead to costly misinterpretations. <\/p>\r\n\r\n\r\n\r\n\r\n
Artificial neural networks need to be fed good input to be able to learn – just like the human brain. Ultimately, the data used to train a system determine what it knows and can accomplish. Artificial intelligence consists of algorithms that are fed data from which they are meant to learn – so-called machine learning. When artificially created or open data are used as training data, there is a great risk of distorted results because such data are often not realistic. If the data do not match how the system will actually be used, the system can deliver insufficient or incorrect results, as the following example illustrates. <\/p>\r\n\r\n
While developing software for drone cameras, the developers use photographs found on the Internet. Such photos exist in ample supply on Facebook or Instagram. However, these photos have two typical features: <\/p>\r\n\r\n
A self-learning algorithm will draw incorrect conclusions from these features. These supposedly general patterns are not useful for assessing photos taken by a drone camera; at worst they may even be harmful. In this example the algorithm might learn that important objects are always at the center of the image – a false conclusion, because drone photographs are taken from various perspectives and distances.<\/p>\r\n\r\n
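One way to check a candidate image set for the center bias described above is to look at where annotated objects sit in the frame. The following is a minimal sketch, assuming bounding boxes are available as (x_min, y_min, x_max, y_max) tuples in pixels. The function name `center_bias_ratio` and the `radius` threshold of 0.25 are illustrative assumptions, not part of any library.

```python
from typing import Iterable, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def center_bias_ratio(
    boxes: Iterable[Tuple[Box, int, int]], radius: float = 0.25
) -> float:
    """Return the share of boxes whose center lies near the image center.

    Each item is (box, image_width, image_height). A box counts as central
    when its center is within `radius` (a fraction of width and height)
    of the image midpoint. The threshold is an illustrative assumption.
    """
    total = 0
    central = 0
    for (x_min, y_min, x_max, y_max), width, height in boxes:
        # Normalize the box center to the range [0, 1] in both directions.
        cx = (x_min + x_max) / 2 / width
        cy = (y_min + y_max) / 2 / height
        total += 1
        if abs(cx - 0.5) <= radius and abs(cy - 0.5) <= radius:
            central += 1
    return central / total if total else 0.0


# Hypothetical annotations: two centered objects, one near the corner.
sample = [
    ((40, 40, 60, 60), 100, 100),
    ((45, 30, 65, 50), 100, 100),
    ((0, 0, 10, 10), 100, 100),
]
ratio = center_bias_ratio(sample)
print(f"{ratio:.2f} of objects are near the image center")
```

If this ratio is far higher for the candidate set than for images recorded by the target camera in operation, the set is probably not realistic for that system.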
Another example: to train automobile software for the German market, the developer team uses photos of traffic situations taken worldwide. In this case there is a risk that, in practice, the artificial neural network mistakes an advertising poster that resembles a foreign traffic sign for a road sign. <\/p>\r\n\r\n
How does one identify poor training data sets? The following signs can be indications, for instance: <\/p>\r\n\r\n
The solution is to gather the data oneself or to have a provider gather them anew. That way the data can be collected to meet one's requirements, and \/ or existing data sets can be examined for their suitability for the respective system. Data sets are suitable when they correspond to the input that the system receives, recognizes and must correctly evaluate when in operation. <\/p>\r\n\r\n\r\n\r\n
Tip:<\/strong>
\r\nAt clickworker you can have your Datasets for Machine Learning<\/a> newly generated – to meet your individual requirements and tailored to the specifications of your system.<\/blockquote>\r\n\r\n\r\n\r\nThe quality of training data can be verified based on the following questions:<\/p>\r\n\r\n
\r\n
- What methods and technologies were used to generate the data?<\/li>\r\n
- Is the source of the data reliable? Or was data collection associated with a specific intention?<\/li>\r\n
- Where do the data come from? Many training data sets have a clearly defined geographical focus. Is this suitable for the special use?<\/li>\r\n
- When were the data collected? <\/li>\r\n
- In which surroundings \/ under what conditions were the data generated?<\/li>\r\n
- How are the data related, and why were they gathered?<\/li>\r\n
<\/ul>\r\n\r\n
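Some of the questions above, such as where and when the data were collected, can be checked automatically if the data set ships with metadata. The following is a minimal sketch, assuming each record carries a `country` code and a `collected` date. The field names and the function `audit_metadata` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from datetime import date
from typing import Dict, List


def audit_metadata(
    records: List[Dict[str, object]], target_country: str, oldest_allowed: date
) -> Dict[str, float]:
    """Summarize how well record metadata matches the intended deployment.

    Each record is assumed to have a "country" code and a "collected" date.
    Returns the share of records from the target country and the share
    collected on or after `oldest_allowed`.
    """
    if not records:
        return {"in_target_country": 0.0, "recent_enough": 0.0}
    countries = Counter(r["country"] for r in records)
    recent = sum(1 for r in records if r["collected"] >= oldest_allowed)
    return {
        "in_target_country": countries[target_country] / len(records),
        "recent_enough": recent / len(records),
    }


# Hypothetical metadata for four images.
records = [
    {"country": "DE", "collected": date(2021, 5, 1)},
    {"country": "US", "collected": date(2015, 3, 9)},
    {"country": "DE", "collected": date(2016, 1, 20)},
    {"country": "FR", "collected": date(2022, 7, 2)},
]
report = audit_metadata(records, target_country="DE", oldest_allowed=date(2018, 1, 1))
print(report)
```

Low shares do not prove that a data set is unusable, but they indicate that its geographical focus or age should be reviewed against the intended use.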
The crowd takes on data creation and quality control<\/h2>\r\n\r\n
The crowd is especially effective at generating and verifying training data for systems with artificial intelligence. In principle there are three approaches, which can also be combined: <\/p>\r\n\r\n
Clickworkers
\r\n\r\n
- Create new training data (for instance photos, video datasets<\/a>, audio datasets & voice datasets<\/a>),<\/li>\r\n
- Evaluate and classify existing data sets according to their quality and \/ or content,<\/li>\r\n
- Check and assess results that are supplied by AI systems.<\/li>\r\n<\/ul>\r\n\r\n
Inadequate data can also be optimized for use as training data at a later date. Within a short period of time, Clickworkers can process raw data – add keywords and tags, use bounding boxes, polygons and key points to annotate elements on images<\/a>, or carry out semantic segmentations. <\/p>\r\n\r\n
The data sets and results are subsequently checked by means of various procedures, including peer review, the dual control principle and majority decision. <\/p>\r\n\r\n
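A majority decision can be sketched in a few lines. The example below assumes that several workers have labeled the same item independently and accepts a label only if more than half of them agree. The function name `majority_label` and the sample labels are illustrative assumptions, not a description of clickworker's actual procedure.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_label(votes: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the label chosen by more than half of the voters, else None."""
    if not votes:
        return None
    # most_common(1) yields the single most frequent label and its count.
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


# Three workers label the same image; two of them agree.
accepted = majority_label(["road sign", "road sign", "poster"])
# Two workers disagree, so there is no majority.
undecided = majority_label(["road sign", "poster"])
print(accepted, undecided)
```

Items without a majority can then be passed on to a further reviewer instead of entering the training set with an uncertain label.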