I Introduction
Human activity detection is the process of classifying the current physical activity of a person. A common data source for human activity detection is instantaneous acceleration (used in [1, 2, 3, 4, 5, 6, 7, 8]). Accelerometers are lightweight [9], inexpensive [10], and widely available [11] across wrist devices and smartphones [8]; thus, any system built on them could be distributed broadly.
Arguably, some physical activities are more complex than others in that they consist of other, more fundamental, subactivities. We introduce the following terms: composite activities and atomic activities. Each composite activity is a unique temporal series of consecutive atomic activities. Therefore the atomic activities are building blocks for defining complex activities. Some sports are complex activities (e.g. basketball) that are combinations of constituent physical activities (atomic activities). Notice that simple activities are also defined as composite activities in which consecutive atomic activities are almost identical.
Previous work in human activity detection typically consists of the following steps: aggregation of a subset of a sensor signal, summarization of the information in the subset (e.g. sample statistics) and an instantaneous classification of the physical activity (used in [4, 8, 12, 13]). Models that rely on these three steps face a fundamental problem in classifying composite activities. For instance, the act of playing basketball includes running, and thus these models may struggle to learn separate representations for the two. Our proposed model enables the detection of such complex physical activities by first decomposing them into unique atomic activities (Section II-B) and then combining these for activity classification (Section II-C).
In this paper, we propose a model for classifying the physical activity of a person based on the readings of a triaxial accelerometer. Compared to the existing work, our model has the following benefits:

The model learns to classify the physical activities as composite activities, which consist of atomic activities.

The computed features, which characterize the atomic activities and the composite activities, are not selected from a fixed set of features (e.g. mean or variance, as in [4, 8]). Instead, the features are learned automatically to represent the supported physical activities. Therefore we do not impose assumptions on how to build a representation of the subsets of a sensor signal to summarize their information.
The model can be utilized as a population system and as a personalized system. For a test subject, a population system is created using data of other people and a personalized system is created using data of the test subject. Our proposed model is used to experiment with both systems in Section III.
The proposed model is trained end-to-end as a single model without manual intervention. The computational components of our model originate from deep learning literature [14]. However, the model uses a limited amount of memory and is not a deep model. Instead, one should consider our model to be a computational graph [15, 16].
The experiments in this paper use a total of six hours of data from the Palantir Context Data Library [12, 13]. This data was created by recording multiple sensors, including a triaxial wrist-worn accelerometer, while users performed various activities. The sampling frequency of the accelerometer was 20 Hz, which is sufficient to detect physical activities [4]. The data were recorded outside a laboratory in a real-world environment, and represent a set of activities that are performed in personal ways (e.g. two persons run in different ways). Therefore, by utilizing the Palantir data, we obtain a good estimate of the real-world accuracy of our model for human activity detection. See [12, 13] for a detailed description of the data and the setup of the data collection.
The scope of our work in this paper is the classification of eight physical activities (walking, Nordic walking, running, soccer, rowing, bicycling, exercise bicycling and lying down). A physical activity is classified using a subset of 15 seconds of acceleration signal. We also construct and evaluate a set of baseline models to classify the 15 s of acceleration signal. The following established classifiers are selected as the baseline models in the experiments: logistic regression [17], random forest classifier [18], decision tree [17], adaptive boosting classifier [19] and linear support vector machine [17]. The proposed model obtained an overall mean accuracy of 77.9% (population) and 95.3% (personalized). The corresponding accuracies of the best baseline model (random forest) were 73.5% and 90.0%.
The reported parametrization of our proposed model in this paper (e.g. the number of computational units) was selected through empirical experiments. Our goal is not to provide a universal and general parametrization for the model, although the parametrization utilized in this paper is a solid, experimentally validated initial choice. Rather, our goal is to introduce a suitable architecture for a model to detect composite activities. The following section defines the architecture and its parametrization used in this paper.
II The proposed model
To detect composite activities, we propose a model with three major components:
An atomic activity is encoded into a vector of real values that captures its characteristics. These values are called features and the process of learning to compute them is called feature learning (see Section II-B). A set of feature vectors, corresponding to atomic activities, is sent to a recurrent model, which is used because it is able to learn the temporal relationship between these features [20]. The final internal state of the recurrent model, which is a vector of real values, represents the feature values of the composite activity. Therefore the representation of a composite activity is learned using the atomic activities (see Section II-C). The final internal state is then classified into one of the supported physical activities (see Section II-D). The following subsection introduces the computational building blocks, which are then examined in further detail in Sections II-B, II-C and II-D.
II-A Computation in layers
We factor the computation of our proposed model into consecutive steps called layers. A layer takes an input, transforms the input and outputs the transformed value. A set of chained layers forms a computational graph. For example, neural networks can be defined as computational graphs. The following layers are utilized by our model (see [14, 20, 21, 22, 23] for more details):
Fully connected layer (FC). The layer applies a nonlinear function elementwise to an affine transformation: $f(Wx + b)$, where $x \in \mathbb{R}^{n}$ denotes the $n$-dimensional input and $W$ and $b$ are adjustable parameters. We utilize a rectified linear unit for the nonlinearity ($f(z) = \max(0, z)$), which is widely employed [21] because of its computational simplicity [24]. We use fully connected layers to learn suitable features to classify the physical activities.
Convolution layer (CO). The layer processes an input signal while retaining its temporal structure. Similarly to a FC layer, a convolution layer applies affine transformations followed by a nonlinearity. The parameters for affine transformations are contained in convolution kernels. The same convolution kernel parameters are used across all the temporal subsets of the input signal. Each convolution kernel in each layer outputs a new transformed signal. Convolution layers can use multiple convolution kernels to output multiple transformed signals. The goal is to learn translation invariant features (kernels) from the input signal.

Max-pooling layer (MP). This layer reduces the dimensionality of its input signal by outputting only the largest input value from temporally neighboring regions. In our experiments, the regions have a size of four and do not overlap. Therefore, for example, eight temporally consecutive acceleration readings are transformed into two values.

Long short-term memory layer (LSTM). LSTM is a specific type of recurrent neural network, which utilizes a gating mechanism to retain an internal state over a long period of time in its memory [20]. During every time step, the LSTM determines its internal state based on the current input vector (atomic feature vector) and its own previous state. Each composite activity is modeled using the final state of the LSTM.

Softmax layer (SM). The layer computes a probability distribution (relative probabilities) for the physical activities as $p_i = e^{z_i} / \sum_{j} e^{z_j}$, where $p_i$ is the probability of the $i$:th activity and $z_i$ is the $i$:th input value for the layer. We use a softmax layer, and the resulting probability distribution, to predict the current physical activity.
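The non-recurrent layers listed above can be illustrated with a minimal, dependency-free sketch. It is not the implementation used in the experiments: the function names are hypothetical, the convolution handles a single channel and a single kernel (the actual input has three axes), and all numeric values are chosen only for illustration.

```python
import math

def relu(z):
    """Rectified linear unit applied elementwise."""
    return [max(0.0, v) for v in z]

def fully_connected(x, W, b):
    """FC layer: ReLU applied to the affine transformation W x + b."""
    affine = [sum(w * xi for w, xi in zip(row, x)) + bi
              for row, bi in zip(W, b)]
    return relu(affine)

def convolution(signal, kernel, bias=0.0):
    """One-kernel temporal convolution (cross-correlation form) plus ReLU.

    The same kernel weights are reused at every temporal position.
    """
    k = len(kernel)
    out = [sum(kernel[j] * signal[t + j] for j in range(k)) + bias
           for t in range(len(signal) - k + 1)]
    return relu(out)

def max_pool(signal, size=4):
    """Keep the largest value of each non-overlapping region of `size` samples."""
    return [max(signal[i:i + size])
            for i in range(0, len(signal) - size + 1, size)]

def softmax(z):
    """Turn real-valued scores into a probability distribution."""
    m = max(z)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Eight consecutive readings are reduced to two values by pooling.
readings = [0.1, 0.9, 0.3, 0.2, 0.5, 0.4, 0.8, 0.7]
pooled = max_pool(readings)

# The predicted class is the one with the highest probability.
probs = softmax([2.0, 1.0, 0.1])
predicted = probs.index(max(probs))
```

Subtracting the maximum score inside `softmax` does not change the resulting distribution; it is a common numerical safeguard and an addition of this sketch.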
Each layer computes the gradient of its output with respect to its parameters and its input (e.g. with respect to the weights $W$, the bias $b$ and the input $x$ for FC). The gradient across consecutive layers is computed using the chain rule, where the gradient of the lower layer is multiplied by the gradient of the upper layer. This allows one to compute the gradient of the output of the computational graph with respect to all of its parameters.
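The chain rule described above can be sketched for a toy graph of two chained scalar layers. The graph, the function names and the numeric values are assumptions made for illustration only; the analytic gradient is compared against a finite-difference estimate as a sanity check.

```python
def forward(w, b, v, x):
    """Two chained layers: h = relu(w * x + b), then y = v * h."""
    z = w * x + b
    h = max(0.0, z)
    return v * h, z, h

def gradients(w, b, v, x):
    """Gradient of y with respect to w, b and v via the chain rule."""
    _, z, h = forward(w, b, v, x)
    dy_dh = v                       # gradient of the upper layer
    dh_dz = 1.0 if z > 0 else 0.0   # gradient of the ReLU
    dy_dz = dy_dh * dh_dz           # chain rule: multiply the two
    return {"w": dy_dz * x, "b": dy_dz, "v": h}

def numeric_gradient_w(w, b, v, x, eps=1e-6):
    """Central finite difference of y with respect to w."""
    y_plus, _, _ = forward(w + eps, b, v, x)
    y_minus, _, _ = forward(w - eps, b, v, x)
    return (y_plus - y_minus) / (2 * eps)

grads = gradients(w=0.5, b=0.1, v=2.0, x=3.0)
check = numeric_gradient_w(w=0.5, b=0.1, v=2.0, x=3.0)
```

Here the pre-activation is positive, so the gradient with respect to `w` is the product of the upper-layer gradient (`v`), the ReLU gradient (1) and the input (`x`).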
The deviation between the desired output ($y$) and the output of the computational graph ($\hat{y}$) is measured using an error function ($E$). The final layer in our model is a softmax layer, which forms a probability distribution of the classification results. Therefore we obtain the probabilities of the eight physical activities. The classification result is the physical activity with the highest probability. The error function is the cross-entropy between the computed probability distribution and the desired probability distribution; see [25] for details. To minimize the error function, the parameter values are updated using gradient descent as:
$\theta \leftarrow \theta - \eta \nabla_{\theta} E(y, \hat{y})$ (1)
where $\theta$ is the set of model parameters and $\eta$ is a hyperparameter called the learning rate. The learning rate scales the magnitude of each gradient descent step [17]. It is important to select an appropriate value of the learning rate for the convergence of the model. Section II-E explains the training procedure of our proposed model in detail. The computational graph is trained by minimizing the error function. This approach is known as the backpropagation algorithm in the neural network literature; see [25, 17, 26] for more details. The following subsection describes the feature learning for atomic activities in detail.
II-B Feature learning for atomic activities
An atomic activity is represented as a fixed-size vector of real values. We selected its dimensionality based on empirical tests over several candidate values. The input for computing the vector in our experiments is three seconds of sensor signal from a triaxial accelerometer (three axes, 20 Hz sampling rate). Feature learning transforms the acceleration signal into a feature representation of a fixed size.
Fig. 1 illustrates the architecture of the computational graph for learning and computing the features of an atomic activity (AFL). The computation of the features commences by processing the acceleration signal in two consecutive convolution layers. The convolution kernels learn to identify signal patterns that correspond to different atomic activities. We experimented with varying numbers of consecutive convolution layers (architecture) and kernels (parametrization). We utilize two layers, each with a fixed number of kernels, because this configuration consistently provided good results. The maximum values of the twice-convolved signal are pooled to reduce the number of parameters. The final layer is a fully connected layer, which receives the pooled signal. The outputs of the fully connected layer are the feature values of an atomic activity. A larger model (see Fig. 2) utilizes these feature values (Fig. 1) to learn a composite activity from atomic activities. The following subsection describes the feature learning for composite activities in detail.
II-C Feature learning for composite activities
In our experiments, a composite activity is segmented into five consecutive atomic activities, which results in a total of 15 seconds of acceleration signal. Using the feature learning for atomic activities (AFL) in Fig. 1, the atomic activities are transformed into five vectors of feature values. The parametrization of AFL is shared across the temporally consecutive subsets (3 s) of the acceleration signal. The feature vectors are provided to the LSTM, one vector at a time in temporal order. The final internal state of the LSTM is the feature representation of the composite activity, which encodes the composite activity using the atomic activities. Fig. 2 illustrates the feature learning for composite activities. The following subsection describes how to classify the current physical activity of a person given the final internal state of the LSTM.
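The segmentation described above can be sketched as follows. The window arithmetic (20 Hz, 3 s atomic windows, five windows per composite activity) comes from the text; the function names and the `atomic_features` and `recurrent_step` callables are hypothetical stand-ins for AFL and the LSTM, not the actual components.

```python
SAMPLING_RATE_HZ = 20      # accelerometer sampling rate (from the text)
ATOMIC_SECONDS = 3         # length of one atomic activity (from the text)
ATOMS_PER_COMPOSITE = 5    # atomic activities per composite activity

def split_into_atomic_windows(window):
    """Split one 15 s window into five consecutive 3 s atomic windows.

    `window` is a list of (x, y, z) acceleration samples.
    """
    atom_len = SAMPLING_RATE_HZ * ATOMIC_SECONDS
    expected = atom_len * ATOMS_PER_COMPOSITE
    if len(window) != expected:
        raise ValueError(f"expected {expected} samples, got {len(window)}")
    return [window[i:i + atom_len] for i in range(0, expected, atom_len)]

def encode_composite(window, atomic_features, recurrent_step, initial_state):
    """Feed atomic feature vectors to a recurrent step in temporal order.

    The same `atomic_features` function is applied to every atomic window,
    mirroring the shared parametrization; the final state is returned.
    """
    state = initial_state
    for atom in split_into_atomic_windows(window):
        state = recurrent_step(state, atomic_features(atom))
    return state

# Toy usage with stand-in components: count the samples seen so far.
window = [(0.0, 0.0, 1.0)] * 300
atoms = split_into_atomic_windows(window)
final_state = encode_composite(
    window,
    atomic_features=lambda atom: len(atom),
    recurrent_step=lambda state, feature: state + feature,
    initial_state=0,
)
```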
II-D Classification of the composite activities
In order to classify the current physical activity, the feature representation of a composite activity (see Fig. 2) is passed through two fully connected layers. The second fully connected layer outputs a vector with a dimensionality that matches the number of supported physical activities. This vector is then transformed into a probability distribution by a softmax layer, and the classification result is the physical activity that corresponds to the highest relative probability. This process is depicted in Fig. 3. The following subsection describes how the model is trained.
II-E Training the model
Our model is trained using the established backpropagation algorithm (applied recently in [27, 28, 29, 30]). The convergence of the backpropagation is affected by the selection of a suitable learning rate (see Eq. 1) [23]. We apply the Adam algorithm [31] for adaptive selection of the learning rate (used in [32, 33, 34]), which does not require an initial selection of a fixed learning rate. We observed empirically that the model converged significantly faster using Adam than with a fixed learning rate (stochastic gradient descent). Adam converges fast because 1) it calculates a separate learning rate for each of the parameters in the model and 2) it incorporates momentum in the gradient descent. See [17] for details.
The model parametrization used in this paper requires a high number of adjustable parameters (278 696). One challenge with the model is its tendency to overfit the training data; that is, it does not generalize well to previously unseen data [22, 17, 27, 35]. One indicator of an overfitted model is a large norm of weight parameters [17]. If the weight values are large in an inner product (e.g. the weight matrix $W$ in the product $Wx$ of a fully connected layer), then small changes in the input cause large changes in the value of the inner product. A model becomes overly specific to the training data with large values of $W$. To improve the generalization ability, we apply regularization to the weight values of the fully connected layers, the LSTM and the convolution layers. The regularization penalizes the training of the model with a specified regularization strength ($\lambda$) as $E_{\text{reg}} = E + \lambda R(W)$, where $R(W)$ is the sum of the squared weight values [17] and $\lambda \geq 0$. In our experiments, we use a fixed value of the regularization strength.
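The weight penalty described above can be sketched as a sum of squared weights added to the error. The function names, the error value and the regularization strength below are hypothetical and chosen only for illustration; they are not the values used in the experiments.

```python
def l2_penalty(weights):
    """Sum of the squared weight values."""
    return sum(w * w for w in weights)

def regularized_error(error, weights, strength):
    """Error plus the weight penalty scaled by the regularization strength."""
    if strength < 0:
        raise ValueError("regularization strength must be non-negative")
    return error + strength * l2_penalty(weights)

weights = [1.0, -2.0, 0.5]          # illustrative weight values
penalty = l2_penalty(weights)       # 1.0 + 4.0 + 0.25
total = regularized_error(error=0.3, weights=weights, strength=0.01)
```

Because the penalty grows quadratically, a few large weights dominate it, which is what discourages the large inner products discussed above.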
To further improve the generalization ability and to reduce overfitting, we employ a technique called dropout [27]. We use dropout to randomly disable portions of the inputs in the fully connected layers and the LSTM with 50% probability, as in [22]. Therefore our model has to learn a robust classification, as it cannot rely on having access to all of the available information. Dropout forces the computational units to rely on themselves, and not to co-adapt with each other, which improves the classification accuracy of the neural network [22, 27].
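A minimal sketch of dropout on a vector of inputs is given below. The 50% drop probability comes from the text; the function name and the rescaling of surviving values by 1 / (1 - p), often called inverted dropout, are assumptions of this sketch and not necessarily what was used in the experiments.

```python
import random

def dropout(values, drop_probability=0.5, rng=random):
    """Randomly zero inputs during training.

    Surviving values are divided by the keep probability so that the
    expected value of each input is unchanged (an assumption of this sketch).
    """
    if not 0.0 <= drop_probability < 1.0:
        raise ValueError("drop_probability must be in [0, 1)")
    keep = 1.0 - drop_probability
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)              # seeded generator for repeatability
dropped = dropout([1.0] * 10, 0.5, rng)
unchanged = dropout([1.0, 2.0], 0.0)
```

At test time dropout is normally switched off; with the rescaling above no further correction is needed.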
Despite the regularization and dropout, we observed overfitting because 1) the representational capacity of the model is high and 2) Adam is an efficient method for stochastic optimization. In the experiments in Section III, our model consistently provided good results in the beginning of the training procedure. Therefore we utilized an approach called early stopping that trains a model for a limited amount of time. In our experiments, the training was stopped after two iterations of the gradient descent (Eq. 1) using Adam algorithm. See [17, 36, 37] for a detailed definition of the early stopping. The following section describes the evaluation experiments and reports their results.
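The combination of the update rule in Eq. 1 and early stopping can be sketched on a toy problem. This sketch uses plain gradient descent with a fixed learning rate rather than Adam, and the quadratic error, the learning rate and the function names are assumptions made for illustration.

```python
def gradient_descent(gradient, theta, learning_rate, max_iterations):
    """Apply theta <- theta - learning_rate * gradient(theta) repeatedly.

    Training stops after `max_iterations` updates (early stopping by a
    fixed iteration budget).
    """
    for _ in range(max_iterations):
        g = gradient(theta)
        theta = [t - learning_rate * gi for t, gi in zip(theta, g)]
    return theta

def quadratic_gradient(theta):
    """Gradient of the toy error E(theta) = sum((t - 3)^2)."""
    return [2.0 * (t - 3.0) for t in theta]

# Two iterations, mirroring the small iteration budget described above.
theta = gradient_descent(quadratic_gradient, [0.0],
                         learning_rate=0.25, max_iterations=2)
```

With these values the parameter moves from 0 to 1.5 and then to 2.25, approaching the minimum at 3 without reaching it, which is the intended effect of stopping early.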
III Experiments
The experiments use data recorded from nine persons. The data originate from the Palantir Context Data Library; see [12, 13] for more details. For each person, the data consist of five minutes of acceleration signal for each of the following physical activities: walking, Nordic walking, running, soccer, rowing, bicycling, exercise bicycling and lying down. The data were recorded using a triaxial accelerometer worn on the wrist. Two types of experiments are conducted for each test subject: a population system and a personalized system. The resulting classification accuracies are then reported. A population system is created using 40 minutes of acceleration signal per physical activity (from eight test subjects) and evaluated using 5 minutes of acceleration signal per physical activity (from the remaining test subject). This simulates a use case where a user utilizes a previously created model for human activity detection. A personalized system is constructed for each test subject using the first two minutes of the acceleration signal per physical activity. The classification accuracy is evaluated using the remaining three minutes of the acceleration signal. This simulates a use case where the user of the system creates a personalized model for human activity detection from scratch.
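The two evaluation protocols described above can be sketched as data splits. The subject count, the durations and the sampling rate come from the text; the function names and the representation of a recording as a plain list of samples are assumptions of this sketch.

```python
def population_splits(subjects):
    """Leave-one-subject-out: train on the others, test on the held-out one."""
    for held_out in subjects:
        train = [s for s in subjects if s != held_out]
        yield train, held_out

def personalized_split(signal, sampling_rate_hz=20, train_minutes=2):
    """Split one subject's recording of one activity in time.

    The first `train_minutes` are used for training, the rest for testing.
    """
    cut = train_minutes * 60 * sampling_rate_hz
    return signal[:cut], signal[cut:]

subjects = list(range(1, 10))                 # nine test subjects
splits = list(population_splits(subjects))

recording = list(range(5 * 60 * 20))          # five minutes at 20 Hz
train_part, test_part = personalized_split(recording)
```

Splitting the personalized data in time, rather than at random, keeps the test samples strictly later than the training samples.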
The following established classifiers are selected as the baseline models in the experiments: logistic regression [17], random forest classifier [18], decision tree [17], adaptive boosting classifier [19] and linear support vector machine [17]. A Gaussian support vector machine is not selected because of its infeasible computation time. The baseline models utilize the data in one go (15 seconds of acceleration signal). They do not model the physical activities as combinations of atomic activities. Instead, as in the existing work [4, 8, 12, 13], the baseline models attempt to instantaneously determine the current physical activity. We train, evaluate and report the baseline models individually. The next subsection reports the results from the experiments.
III-A Results using a population model
A population system was constructed for each test subject using our proposed model and the baseline models. For example, for the first test subject, a population system was constructed using the data from the remaining (eight) test subjects. The experiments measure the effectiveness of providing a pre-existing, non-personalized model for human activity detection. Table I reports the mean accuracies for the population systems, obtained by repeating their construction and evaluation ten times. The proposed model obtained an overall mean accuracy of 77.9%, while the best baseline model (random forest) obtained an overall mean accuracy of 73.5%. The proposed model also provides the highest mean accuracy for every individual user. Compared to the best baseline model, the largest improvements in the mean accuracies are obtained using the proposed model for users one (95.3% vs. 90.7%), three (95.7% vs. 91.9%) and eight (81.7% vs. 71.5%). The results show that our model outperforms the baseline models.
The models do not provide good results for the fifth and the sixth user. This unveils a fundamental problem with the population system: it cannot recognize an activity if a user performs it in a personal, unique manner. For users five and six, soccer consisted of significant amounts of walking and lying down, and the models had trouble distinguishing soccer from the latter two, which explains the low accuracies. The following subsection reports results from experiments where the users are provided with a personalized system.
Table I: Mean accuracies of the population systems (ten repetitions).
Model                          User 1  User 2  User 3  User 4  User 5  User 6  User 7  User 8  User 9  Mean
Proposed                       0.9526  0.9454  0.9565  0.8325  0.3985  0.3862  0.8403  0.8172  0.8824  0.7791
Logistic regression            0.5540  0.6319  0.5549  0.6274  0.2113  0.2916  0.5379  0.3617  0.6215  0.4880
Random forest                  0.9065  0.9390  0.9191  0.8049  0.3595  0.2973  0.8096  0.7153  0.8659  0.7352
Decision tree                  0.8402  0.7854  0.7946  0.6982  0.3304  0.3089  0.6836  0.6055  0.7235  0.6411
Adaptive boosting              0.4918  0.4602  0.5164  0.5671  0.0925  0.3376  0.5319  0.4115  0.4514  0.4289
Linear support vector machine  0.5361  0.5626  0.5403  0.4033  0.1364  0.3516  0.4405  0.3958  0.4618  0.4254
III-B Results using a personalized model
A personalized system was constructed for each test subject using our proposed model and the baseline models. Table II reports the mean accuracies for the personalized systems, obtained by repeating their construction and evaluation ten times. The mean accuracy of our model exceeds 90% for all users, while the baseline models are accurate only for a subset of the users. The overall mean accuracy of the proposed model is 95.3%, while the best baseline model (random forest) obtained an overall mean accuracy of 90.0%. The proposed model also provides the highest mean accuracy for every individual user. The results show that our model outperforms the baseline models.
As explained above, human activity detection for user 5 is particularly challenging, as the physical activities are not well defined. However, our model learns the composite activity of soccer for user 5, which is one of the advantages of the personalized system. The mean accuracy for user 5 is 92.0% using the proposed model. The baseline models do not model physical activities explicitly as composite activities, which results in significantly lower accuracies than those of our model. The best baseline model obtained a mean accuracy of 76.1% for this user. Therefore it is beneficial to use our model when the physical activities are difficult to classify without considering the temporal series of atomic activities.
Table II: Mean accuracies of the personalized systems (ten repetitions).
Model                          User 1  User 2  User 3  User 4  User 5  User 6  User 7  User 8  User 9  Mean
Proposed                       0.9974  0.9784  0.9857  0.9334  0.9203  0.9278  0.9422  0.9065  0.9836  0.9528
Logistic regression            0.8602  0.8705  0.9304  0.8435  0.6907  0.9045  0.7999  0.6099  0.8669  0.8196
Random forest                  0.9964  0.8880  0.9855  0.9059  0.7607  0.8943  0.8523  0.8979  0.9221  0.9003
Decision tree                  0.9076  0.8130  0.8939  0.8087  0.6890  0.8412  0.7367  0.7437  0.8470  0.8090
Adaptive boosting              0.4499  0.3207  0.5384  0.2753  0.2215  0.2714  0.4878  0.2707  0.7051  0.3934
Linear support vector machine  0.8690  0.8037  0.8804  0.8145  0.5923  0.8949  0.7896  0.6378  0.8449  0.7919
IV Importance of the composite activities
The central assumption of our work is that composite activities are efficiently learned as temporal series of atomic activities. To explicitly study the validity of this assumption, we conducted two experiments in the following two subsections.
IV-A Human activity detection without composite activities
We removed the recurrent components (Fig. 2) from the proposed model to test the importance of the composite activities. The trimmed model, which does not explicitly combine the atomic activities in time, is illustrated in Fig. 4. The input for the model is 15s of acceleration signal in one go.
The trimmed model was utilized as a population system and a personalized system for the nine users. Table III reports the mean accuracies (ten repetitions) for the trimmed model, as well as for the full model, for comparison. The full model performs better for every user, with mean accuracies of 77.9% in the population system and 95.3% in the personalized system. The respective figures for the trimmed model are 71.5% and 91.2%. The results suggest that the utilization of composite activities improves the accuracy of human activity detection.
Table III: Mean accuracies of the full and the trimmed model (ten repetitions).
System                  User 1  User 2  User 3  User 4  User 5  User 6  User 7  User 8  User 9  Mean
Population (full)       0.9526  0.9454  0.9565  0.8325  0.3985  0.3862  0.8403  0.8172  0.8824  0.7791
Population (trimmed)    0.8290  0.8997  0.8546  0.8204  0.2896  0.3829  0.8257  0.7234  0.8112  0.7152
Personalized (full)     0.9974  0.9784  0.9857  0.9334  0.9203  0.9278  0.9422  0.9065  0.9836  0.9528
Personalized (trimmed)  0.9662  0.9222  0.9775  0.8985  0.7898  0.9120  0.9405  0.8255  0.9776  0.9122
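As a sanity check on the table above, the reported means and the gain of the full model over the trimmed model can be recomputed from the per-user accuracies. The per-user values are copied from the table; the variable and function names are hypothetical.

```python
def mean(values):
    """Arithmetic mean of a list of accuracies."""
    return sum(values) / len(values)

# Per-user mean accuracies copied from Table III (users 1 to 9).
table_iii = {
    "population_full": [0.9526, 0.9454, 0.9565, 0.8325, 0.3985,
                        0.3862, 0.8403, 0.8172, 0.8824],
    "population_trimmed": [0.8290, 0.8997, 0.8546, 0.8204, 0.2896,
                           0.3829, 0.8257, 0.7234, 0.8112],
    "personalized_full": [0.9974, 0.9784, 0.9857, 0.9334, 0.9203,
                          0.9278, 0.9422, 0.9065, 0.9836],
    "personalized_trimmed": [0.9662, 0.9222, 0.9775, 0.8985, 0.7898,
                             0.9120, 0.9405, 0.8255, 0.9776],
}

# Means rounded to four decimals, as in the table.
means = {name: round(mean(vals), 4) for name, vals in table_iii.items()}

# Gain of the full model over the trimmed model, in accuracy units.
gain_population = means["population_full"] - means["population_trimmed"]
gain_personalized = means["personalized_full"] - means["personalized_trimmed"]

# The full model is at least as accurate as the trimmed one for every user.
full_never_worse = all(
    f >= t
    for system in ("population", "personalized")
    for f, t in zip(table_iii[system + "_full"], table_iii[system + "_trimmed"])
)
```

The gains correspond to roughly 6.4 percentage points for the population system and 4.1 percentage points for the personalized system.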
IV-B Visualization of the recurrent states
We projected the first and the last internal states of the LSTM (see Fig. 2) into a two-dimensional manifold. The projection was computed using t-distributed stochastic neighbor embedding (t-SNE, [38, 39]), which is designed to visualize high-dimensional data. The internal LSTM states of a personalized model for the second test subject were computed using three minutes of validation data. Fig. 5 illustrates the obtained results using t-SNE, where the left image is the first internal state and the right image is the last internal state. Notice that initially some of the composite activities are mixed with atomic activities. For example, soccer overlaps with walking and running. This suggests that it is a fundamentally challenging task to instantaneously recognize composite activities. However, walking and running do not overlap with soccer in the last internal state of the LSTM. Therefore our model has learned to identify composite activities as temporal series of atomic activities.
V Discussion and conclusion
There are complex activities (composite activities) that consist of simpler activities (atomic activities). It is a fundamentally challenging task to classify a composite activity without decomposing it into atomic activities. In this paper, we have proposed a model to automatically learn the atomic activities and how to combine them into composite activities. The experiments in Section III show the following benefits of our model:

Our proposed model learns to form a set of atomic activities and to combine them meaningfully. Instantaneous classification is not sufficient and one needs to learn the composite activities as combinations of atomic activities over time. This is suggested from inspecting Fig. 5 and the results in Table III.

The proposed model outperforms the baseline models by obtaining an overall mean accuracy of 77.9% (population) and 95.3% (personalized). The corresponding accuracies of the best baseline model were 73.5% and 90.0%. However, the corresponding accuracies of the proposed model without utilizing the composite activities were 71.5% and 91.2%.
The limitations of our model are the utilization of 15 s of acceleration signal and the memory consumption. Depending on the application domain, 15 seconds can be adequate. However, our model is not suited for time-critical systems that require a low latency for the results. Additionally, the current parametrization of our model uses 8.92 Mb of memory (278 696 parameters stored as 32-bit floating-point values, i.e. about 1.11 MB). This can be a large amount of memory for small embedded devices. One of our goals is to find a smaller parametrization that retains a good accuracy. However, we believe that the amount of available device memory is going to increase. Future work includes the following:

Further experiments using different parametrizations of the proposed model.

Fusion of different sensor signals for human activity detection.
References
 [1] J. Mäntyjärvi, J. Himberg, and T. Seppänen, “Recognizing human motion with multiple acceleration sensors,” in Proceedings of the 2001 IEEE International Conference on Systems, Man, and Cybernetics, vol. 2, 2001, pp. 747–752.
 [2] M. Mathie, B. Celler, N. Lovell, and A. Coster, “Classification of basic daily movements using a triaxial accelerometer,” Medical and Biological Engineering and Computing, vol. 42, no. 5, pp. 679–687, 2004.
 [3] D. M. Karantonis, M. R. Narayanan, M. Mathie, N. H. Lovell, and B. G. Celler, “Implementation of a real-time human movement classifier using a triaxial accelerometer for ambulatory monitoring,” IEEE Transactions on Information Technology in Biomedicine, vol. 10, no. 1, pp. 156–167, Jan. 2006.

 [4] P. Gupta and T. Dallas, “Feature selection and activity recognition system using a single triaxial accelerometer,” IEEE Transactions on Biomedical Engineering, vol. 61, no. 6, pp. 1780–1786, June 2014.
 [5] A. Khan, Y. Lee, S. Lee, and T. Kim, “A triaxial accelerometer-based physical-activity recognition via augmented-signal features and a hierarchical recognizer,” IEEE Transactions on Information Technology in Biomedicine, vol. 14, no. 5, pp. 1166–1172, Sep 2010.
 [6] S. Chernbumroong, A. Atkins, and H. Yu, “Activity classification using a single wrist-worn accelerometer,” in Proceedings of the 2011 5th International Conference on Software, Knowledge Information, Industrial Management and Applications (SKIMA), Sept 2011, pp. 1–6.
 [7] A. Godfrey, A. Bourke, G. Olaighin, P. van de Ven, and J. Nelson, “Activity classification using a single chest mounted triaxial accelerometer,” Medical Engineering & Physics, vol. 33, pp. 1127–1135, 2011.
 [8] J. Kwapisz, G. Weiss, and S. Moore, “Activity recognition using cell phone accelerometers,” SIGKDD Explor. Newsl., vol. 12, no. 2, pp. 74–82, Mar 2011.
 [9] J. Petersen, D. Austin, R. Sack, and T. Hayes, “Actigraphy-based scratch detection using logistic regression,” IEEE Journal of Biomedical and Health Informatics, vol. 17, no. 2, pp. 277–283, Mar 2013.
 [10] W. Cheng and D. Jhan, “Triaxial accelerometer-based fall detection method using a self-constructing cascade-AdaBoost-SVM classifier,” IEEE Journal of Biomedical and Health Informatics, vol. 17, no. 2, pp. 411–419, Mar 2013.
 [11] A. Matic, V. Osmani, and O. Mayora, “Speech activity detection using accelerometer,” in Proceedings of the 2012 IEEE Annual International Conference on Engineering in Medicine and Biology Society, Aug 2012, pp. 2112–2115.
 [12] J. Pärkkä, M. Ermes, P. Korpipää, J. Mäntyjärvi, J. Peltola, and I. Korhonen, “Activity classification using realistic data from wearable sensors,” IEEE Transactions on Information Technology in Biomedicine, vol. 10, no. 1, pp. 119–128, Jan. 2006.
 [13] M. Ermes, J. Pärkkä, J. Mäntyjärvi, and I. Korhonen, “Detection of daily activities and sports with wearable sensors in controlled and uncontrolled conditions,” IEEE Transactions on Information Technology in Biomedicine, vol. 12, no. 1, pp. 20–26, Jan. 2008.
 [14] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85–117, 2015.

 [15] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques, ser. Adaptive Computation and Machine Learning. The MIT Press, 2009.
 [16] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. Goodfellow, A. Bergeron, N. Bouchard, and Y. Bengio, “Theano: new features and speed improvements,” in Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.
 [17] C. Bishop, Pattern Recognition and Machine Learning. Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2006.
 [18] L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–32, Oct 2001.
 [19] Y. Freund and R. Schapire, “Experiments with a new boosting algorithm,” in Proceedings of the Thirteenth International Conference on Machine Learning (ICML 1996), L. Saitta, Ed. Morgan Kaufmann, 1996, pp. 148–156.
 [20] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov 1997.
 [21] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, pp. 436–444, 2015.

 [22] A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105.
 [23] Y. Bengio, Neural Networks: Tricks of the Trade: Second Edition. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, ch. Practical Recommendations for Gradient-Based Training of Deep Architectures, pp. 437–478.

 [24] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, vol. 15, 2011, pp. 315–323.
 [25] S. Theodoridis and K. Koutroumbas, Pattern Recognition, 4th ed. Academic Press, 2008.
 [26] T. Mitchell, Machine Learning, 1st ed. New York, NY, USA: McGraw-Hill, Inc., 1997.
 [27] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
 [28] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems 27. Curran Associates, Inc., 2014, pp. 2672–2680.

 [29] Y. Li, K. Swersky, and R. Zemel, “Generative moment matching networks,” in ICML, ser. JMLR Proceedings, F. Bach and D. Blei, Eds., vol. 37, 2015, pp. 1718–1727.
 [30] N. Srivastava and R. Salakhutdinov, “Multimodal learning with deep Boltzmann machines,” in Advances in Neural Information Processing Systems 25. Curran Associates, Inc., 2012, pp. 2222–2230.
 [31] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.
 [32] J. Chung, K. Kastner, L. Dinh, K. Goel, A. Courville, and Y. Bengio, “A recurrent latent variable model for sequential data,” in Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing, 2015, pp. 2980–2988.

 [33] A. Rasmus, H. Valpola, M. Honkala, M. Berglund, and T. Raiko, “Semi-supervised learning with ladder networks,” in Proceedings of the 28th International Conference on Neural Information Processing Systems, ser. NIPS’15, 2015, pp. 3546–3554.
 [34] K. Gregor, I. Danihelka, A. Graves, D. Rezende, and D. Wierstra, “Draw: A recurrent neural network for image generation,” in Proceedings of the 32nd International Conference on Machine Learning (ICML-15). JMLR Workshop and Conference Proceedings, 2015, pp. 1462–1471.
 [35] K. P. Murphy, Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
 [36] L. Prechelt, “Early stopping — but when?” in Neural Networks: Tricks of the Trade, ser. Lecture Notes in Computer Science, 2012, vol. 7700.
 [37] D. Barber, Bayesian Reasoning and Machine Learning. New York, NY, USA: Cambridge University Press, 2012.
 [38] L. Van Der Maaten and G. Hinton, “Visualizing high-dimensional data using t-SNE,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, Nov 2008.
 [39] L. Van Der Maaten, “Accelerating t-SNE using tree-based algorithms,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 3221–3245, Jan 2014.