A controlled, real-life experiment designed to compare two variants of a system or a model, A and B.
In the context of Artificial Neural Networks, a function that takes in the weighted sum of all of the inputs from the previous layer and generates an output value that is passed on to the next layer.
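As a minimal sketch of this idea, the snippet below computes the weighted sum for a single artificial neuron and passes it through a sigmoid activation. The sigmoid is only one common choice, picked here for illustration, and the names `sigmoid` and `neuron_output` are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic activation: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)


# Weighted sum: 1.0 * 0.5 + 2.0 * (-0.25) + 0.0 = 0.0, and sigmoid(0.0) = 0.5
print(neuron_output([1.0, 2.0], [0.5, -0.25], 0.0))
```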
A special case of Semi-Supervised Machine Learning in which a learning agent is able to interactively query an oracle (usually, a human annotator) to obtain labels at new data points.
An unambiguous specification of how to solve a class of problems. Algorithms can perform calculations, process data, and automate reasoning.
A metadatum attached to a piece of data, typically provided by a human annotator.
A methodology used in Machine Learning to determine which of several candidate models has the highest performance by measuring the area under the receiver operating characteristic (ROC) curve.
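One way to compute this area, sketched below, relies on its equivalence with the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties counting as one half). This pairwise version is an illustration for small inputs, not an efficient implementation, and the function name `roc_auc` is hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# 3 of the 4 positive/negative pairs are ranked correctly -> 0.75
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

When comparing models, the same labels would be scored by each model in turn, and the model with the largest value would be preferred under this metric.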
A broad concept encompassing machine learning, natural language processing, and other techniques, aiming to simulate human intelligence in machines.
An architecture composed of successive layers of simple connected units called artificial neurons, interleaved with non-linear activation functions, which is vaguely reminiscent of the neurons in an animal brain.
A rule-based Machine Learning method for discovering interesting relations between variables in large data sets.
A type of Artificial Neural Network used to produce efficient representations of data in an unsupervised and non-linear manner, typically to reduce dimensionality.
A subfield of Computational Linguistics interested in methods that enable the recognition and translation of spoken language into text by computers.
A method used to train Artificial Neural Networks by computing the gradient of the error function with respect to the network’s weights, which is needed to update those weights.
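The chain-rule computation behind this method can be sketched for the smallest possible case: a single sigmoid neuron with a squared-error loss. The setup (one weight `w`, one bias `b`, one training example) and all function names are assumptions made for illustration; real networks repeat the same step layer by layer.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, x, y):
    """Squared error of a single sigmoid neuron on one example (x, y)."""
    return (sigmoid(w * x + b) - y) ** 2


def gradients(w, b, x, y):
    """Gradient of the loss with respect to w and b, obtained via the chain rule."""
    a = sigmoid(w * x + b)        # forward pass: the neuron's activation
    dloss_da = 2.0 * (a - y)      # derivative of the squared error
    da_dz = a * (1.0 - a)         # derivative of the sigmoid
    dz_dw, dz_db = x, 1.0         # derivatives of the weighted sum z = w*x + b
    return dloss_da * da_dz * dz_dw, dloss_da * da_dz * dz_db


grad_w, grad_b = gradients(w=0.5, b=0.1, x=2.0, y=1.0)
print(grad_w, grad_b)
```

A gradient-descent update would then subtract a small multiple of each gradient from the corresponding parameter.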
The set of examples used in one gradient update of model training.
A famous theorem used by statisticians to describe the probability of an event based on prior knowledge of conditions that might be related to that event.
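As a worked illustration, the sketch below applies the theorem to a hypothetical diagnostic test: given a prior probability of a condition, the probability of a positive test when the condition is present, and the probability of a positive test when it is absent, it returns the probability of the condition given a positive test. The numbers and parameter names are made up for the example.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(condition | positive test) computed with Bayes' theorem."""
    # Total probability of observing a positive test
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence


# 1% prior, 90% chance of a positive test if the condition is present,
# 5% chance of a positive test if it is absent -> posterior of about 0.154
print(posterior(prior=0.01, likelihood=0.9, false_positive_rate=0.05))
```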
The conflict that arises when data scientists try to simultaneously minimize bias and variance, two sources of error that prevent supervised algorithms from generalizing beyond their training set.
A Machine Learning ensemble meta-algorithm for primarily reducing bias and variance in supervised learning, and a family of Machine Learning algorithms that convert weak learners to strong ones.
The smallest (rectangular) box fully containing a set of points or an object.
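For 2D points, such a box can be found by taking the minimum and maximum coordinate along each axis. The sketch below assumes an axis-aligned box and returns it as `(x_min, y_min, x_max, y_max)`; both the function name and this corner convention are illustrative choices.

```python
def bounding_box(points):
    """Smallest axis-aligned rectangle containing all (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


print(bounding_box([(1, 5), (3, 2), (-2, 4)]))  # (-2, 2, 3, 5)
```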
A computer program or an AI designed to interact with human users through conversation.
The task of approximating a mapping function from input variables to discrete output variables, or, by extension, a class of Machine Learning algorithms that determine the classes to which specific instances belong.
In Machine Learning, the unsupervised task of grouping a set of objects so that objects within the same group (called a cluster) are more “similar” to each other than they are to those in other groups.
A method used in the context of recommender systems to make predictions about the interests of a user by collecting preferences from a larger group of users.
The field of Machine Learning that studies how to gain high-level understanding from images or videos.
A type of interval estimate that is likely to contain the true value of an unknown population parameter. The interval is associated with a confidence level that quantifies the level of confidence of this parameter being in the interval.
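A minimal sketch of one common case follows: an approximate 95% confidence interval for a population mean, built from the sample mean and standard error using the normal approximation (z = 1.96). This is an assumption that suits reasonably large samples; for small samples a Student's t critical value would normally replace 1.96. The function name is hypothetical.

```python
import math
import statistics


def mean_confidence_interval(sample, z=1.96):
    """Approximate confidence interval for the mean (normal approximation)."""
    mean = statistics.mean(sample)
    # Standard error of the mean, using the sample standard deviation
    sem = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean - z * sem, mean + z * sem


low, high = mean_confidence_interval([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(low, high)  # roughly (3.52, 6.48), centered on the sample mean 5.0
```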
A human worker providing annotations on the Appen data annotation platform.
A class of Deep, Feed-Forward Artificial Neural Networks, often used in Computer Vision.
The electronic circuitry within a computer that carries out the instructions of a computer program by performing the basic arithmetic, logical, control and input/output operations specified by the instructions.
A collection of processes designed to evaluate how the results of a predictive model will generalize to new data sets.
– k-fold Cross-Validation
– Leave-p-out Cross-Validation
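The two schemes listed above differ only in how the data is split into training and test indices, which the sketch below illustrates. It does not shuffle the data and uses hypothetical function names; in practice a model would be trained on each training split and scored on the matching test split, with the scores averaged.

```python
from itertools import combinations


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds; each fold is the test set once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


def leave_p_out_indices(n_samples, p):
    """Yield every split that holds out exactly p samples as the test set."""
    all_indices = set(range(n_samples))
    for test in combinations(range(n_samples), p):
        yield sorted(all_indices - set(test)), list(test)


print(k_fold_indices(5, 2))
# [([3, 4], [0, 1, 2]), ([0, 1, 2], [3, 4])]
print(len(list(leave_p_out_indices(4, 2))))  # 6 splits: C(4, 2)
```

Leave-p-out grows combinatorially with the number of samples, which is why k-fold is usually the more practical choice on larger data sets.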
The most essential ingredient to all Machine Learning and Artificial Intelligence projects.
Unstructured Data: raw, unprocessed data. Textual data is a perfect example of unstructured data because it is not formatted into specific features.
Structured Data: data processed in such a way that it becomes ingestible by a Machine Learning algorithm and, in the case of Supervised Machine Learning, labeled data; data after it has been processed on the Appen data annotation platform.
Data Augmentation: the process of adding new information derived from both internal and external sources to a data set, typically through annotation.
A category of Supervised Machine Learning algorithms where the data is iteratively split with respect to a given parameter or criterion.
A chess-playing computer developed by IBM, best known for being the first computer chess-playing system to win both a chess game and a chess match against a reigning world champion under regular time controls.
One instance of some mathematical structure contained within another instance, such as a group that is a subgroup.
In Statistics and Machine Learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble consists of only a concrete finite set of alternative models but typically allows for a much more flexible structure to exist among those alternatives.
The average amount of information conveyed by a stochastic source of data.
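For a discrete source, this quantity is commonly computed as the Shannon entropy, the sum over symbols of p * log2(1/p), measured in bits. The sketch below estimates the probabilities from observed symbol frequencies, which is an assumption of the example; the function name is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Shannon entropy in bits, using observed frequencies as probabilities."""
    total = len(symbols)
    counts = Counter(symbols)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


print(shannon_entropy("aabb"))  # 1.0 bit: two equally likely symbols
print(shannon_entropy("aaaa"))  # 0.0 bits: the outcome is certain
```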
A variable that is used as an input to a model.
An ensemble of techniques meant to automatically discover the representations needed for feature detection or classification from raw data.
An error in which a result rejects the null hypothesis when it should not have.
A principle stating that flawed input data leads to misleading results and produces nonsensical output, a.k.a. “garbage”.
A regulation in EU law on data protection and privacy for all individuals within the European Union aiming to give control to citizens and residents over their personal data.
A search heuristic inspired by the Theory of Evolution that reflects the process of natural selection where the fittest individuals are selected to produce offspring of the following generation.
A configuration, external to the model and whose value cannot be estimated from data, that data scientists continuously tweak during the process of training a model.
– The process of manually determining the optimal configuration to train a specific model.
A large visual dataset made of 14 million URLs of hand-annotated images organized in twenty-thousand (20,000) different categories, designed for use in visual object recognition research.
The problem in Computer Vision of determining whether an image contains some specific object, feature, or activity.
The process of making predictions by applying a trained model to new, unlabeled instances.
A series of neurons in an Artificial Neural Network that process a set of input features, or, by extension, the output of those neurons. Hidden Layer: a layer of neurons whose outputs are connected to the inputs of other neurons, therefore not directly visible as a network output.
A new direction within the field of Machine Learning investigating how algorithms can change the way they generalize by analyzing their own learning process and improving on it.
The application of Machine Learning to the construction of ranking models for Information Retrieval systems.
The subfield of Artificial Intelligence that often uses statistical techniques to give computers the ability to “learn”, i.e., progressively improve performance on a specific task, with data, without being explicitly programmed.
DevOps for Machine Learning systems.
A subfield of computational linguistics that studies the use of software to translate text or speech from one language to another.
A family of simple probabilistic classifiers based on applying Bayes’ theorem with strong independence assumptions between the features.
A subtask of Information Extraction that seeks to identify and classify named entities in text into predetermined categories such as the names of persons, organizations, and locations.
The area of Artificial Intelligence that studies the interactions between computers and human languages, in particular how to process and analyze large amounts of natural language data.
The conversion of images of printed, handwritten or typed text into a machine-friendly textual format.
The selection of the best element (with regard to some criterion) from some set of available alternatives.
The fact that a model unknowingly identified patterns in the noise and assumed those represented the underlying structure; the production of a model that corresponds too closely to a particular set of data, and therefore fails to generalize well to unseen observations.
An area of Machine Learning focusing on the (supervised or unsupervised) recognition of patterns in the data.
The process of reducing a matrix generated by a convolutional layer to a smaller matrix.
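One widely used variant is max pooling, sketched below with a 2x2 window and a stride of 2: each non-overlapping 2x2 block of the input is replaced by its largest value. The window size, the stride, and the function name are assumptions made for the example; average pooling would take the mean of each block instead.

```python
def max_pool_2x2(matrix):
    """2x2 max pooling with stride 2; a trailing odd row or column is dropped."""
    rows, cols = len(matrix), len(matrix[0])
    return [
        [
            max(matrix[r][c], matrix[r][c + 1],
                matrix[r + 1][c], matrix[r + 1][c + 1])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


feature_map = [
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
]
print(max_pool_2x2(feature_map))  # [[6, 5], [7, 9]]
```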
Any piece of information that can be used on its own or in combination with some other information in order to identify a particular individual.
An ensemble learning method that operates by constructing a multitude of decision trees at training time and outputting a combined version (such as the mean or the mode) of the results of the individual trees.
The fraction of all relevant samples that are correctly classified as positive.
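In terms of counts, this is true positives divided by the sum of true positives and false negatives. The sketch below assumes binary labels encoded as 1 (positive) and 0 (negative); the function name is illustrative.

```python
def recall(y_true, y_pred):
    """True positives divided by all actual positives (labels are 0 or 1)."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if true_pos + false_neg == 0:
        raise ValueError("recall is undefined without any actual positives")
    return true_pos / (true_pos + false_neg)


# 2 of the 3 actual positives are predicted as positive
print(recall([1, 1, 1, 0, 0], [1, 0, 1, 1, 0]))
```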
A unit employing the rectifier function as an activation function.
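The rectifier itself is simply max(0, x): negative inputs are clipped to zero and positive inputs pass through unchanged. A minimal sketch, with an illustrative function name:

```python
def relu(x: float) -> float:
    """Rectifier: returns x for positive inputs and 0 otherwise."""
    return max(0.0, x)


print([relu(v) for v in (-2.0, 0.0, 3.5)])  # [0.0, 0.0, 3.5]
```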
A class of supervised learning techniques that also leverages available unlabeled data for training, typically using a small number of labeled instances in combination with a larger amount of unlabeled rows. See also Supervised Learning and Unsupervised Learning.
The use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.
An open-source library, popular among the Machine Learning community, for data flow programming across a range of tasks. It is a symbolic math library and is also used for machine learning applications such as neural networks.
A sequence of data points recorded at specific times and indexed according to their order of occurrence.
A range of values likely to enclose the true value.
The fact that a Machine Learning algorithm fails to capture the underlying structure of the data properly, typically because the model is either not sophisticated enough, or not appropriate for the task at hand; opposite of Overfitting.
The process of using hold-out data in order to evaluate the performance of a trained model; in contrast to the testing phase, which is used for the final assessment of the model’s performance, the validation phase is used to determine if any iterative modification needs to be made to the model.
A major obstacle to recurrent network performance that data scientists face when training Artificial Neural Networks with gradient-based learning methods and backpropagation. Because each weight receives an update proportional to the partial derivative of the error function with respect to that weight in each iteration of training, a gradient that becomes vanishingly small can effectively prevent the weight from changing its value.
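A rough numerical illustration of the effect is sketched below. It assumes a chain of sigmoid units, one per layer, each with the same weight and pre-activation value; these simplifications and the function names are hypothetical. Since the sigmoid's derivative never exceeds 0.25, the factor multiplying the gradient shrinks geometrically with depth.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # at most 0.25, reached at z = 0


def gradient_scale(depth, z=0.0, weight=1.0):
    """Factor applied to the gradient after backpropagating through `depth` layers."""
    scale = 1.0
    for _ in range(depth):
        scale *= weight * sigmoid_derivative(z)
    return scale


for depth in (1, 5, 10):
    print(depth, gradient_scale(depth))
# After 10 layers the factor is 0.25 ** 10, about 9.5e-07
```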