This is my first try at writing in English, and I want to make friends in our studio. So, say hello to Natural Language Processing!
Introduction
Natural language processing (NLP) is all about creating systems that process or “understand” language in order to perform certain tasks. So what are these tasks?
For example:
- question answering
- sentiment analysis
- image-to-text mapping
- machine translation
- named entity recognition
Traditional Approach
The traditional approach to NLP involved a lot of domain knowledge from linguistics itself, such as phonemes and morphemes.
Just try to understand the following word: "uninterested".
The prefix "un" indicates an opposing or opposite idea, and the suffix "ed" can specify the time period (past tense).
But we are not skilled linguists.
Deep Learning Approach
Representation learning is the general term for this approach: instead of hand-designing linguistic features, we let the model learn useful representations of language from data.
Word Vectors
We have to represent each word as a d-dimensional vector. Let's use d = 6.
We want this six-dimensional vector to represent the word and its context, meaning, or semantics.
A co-occurrence matrix is a matrix that contains the counts of each word appearing next to every other word in the data.
Each row of this matrix already gives us a word vector, but the problem is that the matrix becomes extremely sparse (lots of 0's) and extremely large when there is a lot of training data, so its storage efficiency is poor.
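To make this concrete, here is a tiny sketch of building such a co-occurrence matrix in Python; the toy corpus and the window size of 1 are my own choices, purely for illustration:

```python
import numpy as np

# Toy corpus, just for illustration (not from the post).
corpus = ["i", "love", "nlp", "and", "i", "love", "deep", "learning"]

vocab = sorted(set(corpus))
index = {w: i for i, w in enumerate(vocab)}

# Count how often each word appears next to every other word,
# using a window of 1 word on each side.
window = 1
cooc = np.zeros((len(vocab), len(vocab)), dtype=int)
for pos, word in enumerate(corpus):
    for offset in range(-window, window + 1):
        ctx = pos + offset
        if offset == 0 or ctx < 0 or ctx >= len(corpus):
            continue
        cooc[index[word], index[corpus[ctx]]] += 1

# Each row of `cooc` is a word vector; most entries are already 0
# even for this tiny corpus, and a real vocabulary makes it far worse.
print(vocab)
print(cooc)
```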
So let us move on to Word2Vec.
Word2Vec
The basic idea is that we want to store as much information as we can in this word vector while still keeping the dimensionality at a manageable scale (25 – 1000 dimensions).
We want to predict the surrounding words of every word by maximizing the log probability of any context word given the current center word.
Word2Vec finds vector representations of different words by maximizing the log probability of context words given a center word, and it updates the vectors through SGD (see https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf and http://nlp.stanford.edu/pubs/glove.pdf).
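Concretely, the skip-gram objective from the Mikolov et al. paper linked above can be written as follows (T words in the corpus, window size m, center word vectors v, context word vectors u):

```latex
J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \;\; \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log p\!\left(w_{t+j} \mid w_{t}\right),
\qquad
p(o \mid c) = \frac{\exp\!\left(u_{o}^{\top} v_{c}\right)}{\sum_{w=1}^{V} \exp\!\left(u_{w}^{\top} v_{c}\right)}
```

In practice the full softmax over the vocabulary V is too expensive, which is why the paper uses tricks like negative sampling; the details are in the links above.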
RNN
Each of the word vectors (x) has a hidden state vector at that same time step (ht, ht-1, ht+1). Let's call each word vector together with its hidden state vector one module.
The hidden state is a function of the hidden state vector at the previous time step and the current word vector: ht = σ(Whh·ht-1 + Whx·xt), where Whh and Whx are weight matrices.
When we want an output, we take the hidden state at the last time step t, multiply it by another weight matrix WS, and apply a softmax.
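Here is a minimal numpy sketch of one such module and of the output step; the sizes (d = 6 to match the earlier example, hidden size 6, a 10-word vocabulary) are toy values chosen just for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

d, hidden = 6, 6        # word-vector and hidden-state sizes (toy values)
vocab_size = 10         # toy vocabulary size

W_hh = rng.normal(size=(hidden, hidden))     # hidden-to-hidden weights
W_hx = rng.normal(size=(hidden, d))          # input-to-hidden weights
W_S = rng.normal(size=(vocab_size, hidden))  # hidden-to-output weights

def rnn_step(h_prev, x_t):
    """One module: combine the previous hidden state and the current word vector."""
    return np.tanh(W_hh @ h_prev + W_hx @ x_t)

def output(h_t):
    """Turn the final hidden state into a distribution over the vocabulary."""
    scores = W_S @ h_t
    return np.exp(scores) / np.exp(scores).sum()

# Run over a toy sequence of three random "word vectors".
h = np.zeros(hidden)
for x in rng.normal(size=(3, d)):
    h = rnn_step(h, x)
print(output(h))
```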
Gated Recurrent Units
Why do we need something better at capturing long-distance dependencies? Because of the vanishing gradient problem: during backpropagation the gradient is multiplied step by step, so if each factor is a small number (< 0.25), then by the 3rd or 4th module the gradient will have essentially vanished (for example, 0.25^4 ≈ 0.004).
The GRU provides a different way of computing this hidden state vector h(t). It is built from three parts: an update gate, a reset gate, and a new memory container.
The two gates have their own, separate weight matrices.
The new memory container combines the current word vector with the (reset-gated) previous hidden state; see the formulation sketched below.
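One common way to write these three parts is the following (the exact convention, e.g. whether zt multiplies the old state or the new one, varies a bit between papers):

```latex
z_{t} = \sigma\!\left(W^{(z)} x_{t} + U^{(z)} h_{t-1}\right)              % update gate
r_{t} = \sigma\!\left(W^{(r)} x_{t} + U^{(r)} h_{t-1}\right)              % reset gate
\tilde{h}_{t} = \tanh\!\left(W x_{t} + r_{t} \odot U h_{t-1}\right)       % new memory container
h_{t} = z_{t} \odot h_{t-1} + (1 - z_{t}) \odot \tilde{h}_{t}             % final hidden state
```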
LSTM
An LSTM can be seen as an extension of the idea behind a GRU.
The main difference is the number of gates they have (GRU – 2, LSTM – 3).
For more detail, see http://colah.github.io/posts/2015-08-Understanding-LSTMs/.
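For reference, one standard formulation of the LSTM's three gates and its memory cell looks like this (again, notation varies between sources; this is just one common way to write it):

```latex
i_{t} = \sigma\!\left(W^{(i)} x_{t} + U^{(i)} h_{t-1}\right)              % input gate
f_{t} = \sigma\!\left(W^{(f)} x_{t} + U^{(f)} h_{t-1}\right)              % forget gate
o_{t} = \sigma\!\left(W^{(o)} x_{t} + U^{(o)} h_{t-1}\right)              % output gate
\tilde{c}_{t} = \tanh\!\left(W^{(c)} x_{t} + U^{(c)} h_{t-1}\right)       % candidate memory
c_{t} = f_{t} \odot c_{t-1} + i_{t} \odot \tilde{c}_{t}                   % memory cell
h_{t} = o_{t} \odot \tanh(c_{t})                                          % hidden state
```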
Memory Networks
See my other blog post for details.
Neural Machine Translation