NOTE: This is a Jupyter notebook converted to markdown, so the formatting is not ideal. The original notebook can be seen here.
```python
%load_ext watermark
%watermark
```
This notebook shows a modification of the original NYT Ingredient Phrase Tagger. Here is the article where they talk about it.
That GitHub repository contains The New York Times' tool for performing Named Entity Recognition via Conditional Random Fields on food recipes, extracting the ingredients used in those recipes as well as their quantities.
In their implementation they use CRF++ as the extractor.
Here I will use pycrfsuite instead of CRF++, for two main reasons:

- a full Python solution (even though pycrfsuite is just a wrapper around crfsuite) makes the model easier to deploy, and
- installing CRF++ proved to be a challenge on Ubuntu 14.04.
You can install pycrfsuite with:

```bash
pip install python-crfsuite
```
We load the train_file with features produced by calling (as described in the README):

```bash
bin/generate_data --data-path=input.csv --count=180000 --offset=0 > tmp/train_file
```
```python
import re
import json
from itertools import chain

import nltk
import pycrfsuite

from lib.training import utils

with open('tmp/train_file') as fname:
    lines = fname.readlines()

items = [line.strip('\n').split('\t') for line in lines]
items = [item for item in items if len(item) == 6]
items[:10]
```
As we can see, each line of the train_file follows the format:

- token
- position in the phrase (I1 is the first word, I2 the second, and so on)
- LX, the length group of the token (defined by LengthGroup)
- NoCAP or YesCAP, whether the token is capitalized
- YesParen or NoParen, whether the token is inside parentheses
- the tag, e.g. B-NAME or B-QTY
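For illustration, here is how one such line (a hypothetical row for the token `1`) splits into its six columns:

```python
# A hypothetical train_file line for the token "1" tagged as a quantity.
# Columns are tab-separated: token, position, length group, capitalization,
# parenthesis flag, and the gold tag.
line = "1\tI1\tL4\tNoCAP\tNoPAREN\tB-QTY\n"

item = line.strip('\n').split('\t')
print(item)       # ['1', 'I1', 'L4', 'NoCAP', 'NoPAREN', 'B-QTY']
print(len(item))  # 6 -- passes the len(item) == 6 filter above
```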
PyCRFSuite expects the input to be a list of the structured items and their respective tags, so we process the items from the train file and bucket them into sentences.
```python
sentences = []
sent = [items[0]]
for item in items[1:]:
    if 'I1' in item:
        sentences.append(sent)
        sent = [item]
    else:
        sent.append(item)
sentences.append(sent)  # close the last open sentence

len(sentences)
```
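The bucketing starts a new sentence whenever a token's position marker is I1. A toy run on made-up items shows the mechanics:

```python
# Made-up items in the train_file layout; the second field is the
# position marker (I1 = first token of an ingredient phrase).
items = [
    ['1', 'I1', 'L4', 'NoCAP', 'NoPAREN', 'B-QTY'],
    ['cup', 'I2', 'L4', 'NoCAP', 'NoPAREN', 'B-UNIT'],
    ['sugar', 'I3', 'L8', 'NoCAP', 'NoPAREN', 'B-NAME'],
    ['2', 'I1', 'L4', 'NoCAP', 'NoPAREN', 'B-QTY'],
    ['eggs', 'I2', 'L4', 'NoCAP', 'NoPAREN', 'B-NAME'],
]

sentences = []
sent = [items[0]]
for item in items[1:]:
    if 'I1' in item:          # a new ingredient phrase starts here
        sentences.append(sent)
        sent = [item]
    else:
        sent.append(item)
sentences.append(sent)        # keep the last open sentence too

print(len(sentences))  # 2
```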
```python
import random

random.shuffle(sentences)

test_size = 0.1
data_size = len(sentences)

test_data = sentences[:int(test_size * data_size)]
train_data = sentences[int(test_size * data_size):]
```
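With `test_size = 0.1` the cut point is simply the first 10% of the shuffled list, as a quick sketch on a dummy list shows:

```python
# Sketch of the 90/10 split arithmetic on a dummy list of 50 "sentences".
sentences = ['sent_%d' % i for i in range(50)]

test_size = 0.1
data_size = len(sentences)
cut = int(test_size * data_size)   # 5

test_data = sentences[:cut]
train_data = sentences[cut:]
print(len(test_data), len(train_data))  # 5 45
```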
```python
def sent2labels(sent):
    return [word[-1] for word in sent]

def sent2features(sent):
    return [word[:-1] for word in sent]

def sent2tokens(sent):
    return [word[0] for word in sent]

y_train = [sent2labels(s) for s in train_data]
X_train = [sent2features(s) for s in train_data]

X_train[1]
```
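On a toy sentence, the helpers slice each row into labels (the last column, i.e. the tag), features (everything but the tag), and tokens (the first column):

```python
# Toy sentence in the same 6-column layout as the train_file.
sent = [
    ['1', 'I1', 'L4', 'NoCAP', 'NoPAREN', 'B-QTY'],
    ['cup', 'I2', 'L4', 'NoCAP', 'NoPAREN', 'B-UNIT'],
    ['sugar', 'I3', 'L8', 'NoCAP', 'NoPAREN', 'B-NAME'],
]

def sent2labels(sent):
    return [word[-1] for word in sent]   # last column is the tag

def sent2features(sent):
    return [word[:-1] for word in sent]  # everything except the tag

def sent2tokens(sent):
    return [word[0] for word in sent]    # first column is the token

print(sent2labels(sent))       # ['B-QTY', 'B-UNIT', 'B-NAME']
print(sent2tokens(sent))       # ['1', 'cup', 'sugar']
print(sent2features(sent)[0])  # ['1', 'I1', 'L4', 'NoCAP', 'NoPAREN']
```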
We set up the CRF trainer. We will use the default values and include all the possible joint features.
```python
trainer = pycrfsuite.Trainer(verbose=False)

for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)
```
I obtained the following hyperparameters by performing a GridSearchCV with the scikit-learn implementation of pycrfsuite.
```python
trainer.set_params(
    {
        'c1': 0.43,
        'c2': 0.012,
        'max_iterations': 100,
        'feature.possible_transitions': True,
        'feature.possible_states': True,
        'linesearch': 'StrongBacktracking',
    }
)
```
We train the model (this might take a while):

```python
trainer.train('tmp/trained_pycrfsuite')
```
Now we have a pretrained model that we can simply deploy:

```python
tagger = pycrfsuite.Tagger()
tagger.open('tmp/trained_pycrfsuite')
```
Now we add a wrapper function around the script found in lib/testing/convert_to_json.py and create a convenient way to parse an ingredient sentence.
```python
import re
import json

from lib.training import utils
from string import punctuation
from nltk.tokenize import PunktSentenceTokenizer

tokenizer = PunktSentenceTokenizer()

def get_sentence_features(sent):
    """Gets the features of the sentence"""
    sent_tokens = utils.tokenize(utils.cleanUnicodeFractions(sent))

    sent_features = []
    for i, token in enumerate(sent_tokens):
        token_features = [token]
        token_features.extend(utils.getFeatures(token, i + 1, sent_tokens))
        sent_features.append(token_features)
    return sent_features
```
```python
def format_ingredient_output(tagger_output, display=False):
    """Formats the tagger output into a more convenient dictionary"""
    data = [{}]
    display = [[]]
    prevTag = None

    for token, tag in tagger_output:
        # turn B-NAME/123 back into "name"
        tag = re.sub(r'^[BI]\-', "", tag).lower()

        # ---- DISPLAY ----
        # build a structure which groups each token by its tag, so we can
        # rebuild the original display name later.
        if prevTag != tag:
            display[-1].append((tag, [token]))
            prevTag = tag
        else:
            display[-1][-1][1].append(token)
            #               ^- token
            #           ^---- tag
            # ^-------- ingredient

        # ---- DATA ----
        # build a dict grouping tokens by their tag

        # initialize this attribute if this is the first token of its kind
        if tag not in data[-1]:
            data[-1][tag] = []

        # HACK: If this token is a unit, singularize it so Scoop accepts it.
        if tag == "unit":
            token = utils.singularize(token)

        data[-1][tag].append(token)

    # reassemble the output into a list of dicts.
    output = [
        dict([(k, utils.smartJoin(tokens)) for k, tokens in ingredient.iteritems()])
        for ingredient in data
        if len(ingredient)
    ]

    # Add the raw ingredient phrase
    for i, v in enumerate(output):
        output[i]["input"] = utils.smartJoin(
            [" ".join(tokens) for k, tokens in display[i]])

    return output
```
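The grouping that format_ingredient_output performs by hand, collecting consecutive tokens that share a tag after stripping the B-/I- prefix, can be illustrated with itertools.groupby on a made-up tagger output:

```python
import re
from itertools import groupby

# Made-up (token, tag) pairs as the tagger might emit them.
tagger_output = [
    ('1', 'B-QTY'), ('cup', 'B-UNIT'),
    ('granulated', 'B-NAME'), ('sugar', 'I-NAME'),
]

# Normalize tags the same way as above: strip the B-/I- prefix, lowercase.
normalized = [(tok, re.sub(r'^[BI]\-', '', tag).lower())
              for tok, tag in tagger_output]

# Group consecutive tokens that share a normalized tag.
grouped = {tag: [tok for tok, _ in pairs]
           for tag, pairs in groupby(normalized, key=lambda p: p[1])}
print(grouped)  # {'qty': ['1'], 'unit': ['cup'], 'name': ['granulated', 'sugar']}
```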
```python
def parse_ingredient(sent):
    """ingredient parsing logic"""
    sentence_features = get_sentence_features(sent)
    tags = tagger.tag(sentence_features)
    tagger_output = zip(sent2tokens(sentence_features), tags)
    parsed_ingredient = format_ingredient_output(tagger_output)
    if parsed_ingredient:
        parsed_ingredient[0]['name'] = parsed_ingredient[0].get('name', '').strip('.')
    return parsed_ingredient
```
```python
def parse_recipe_ingredients(ingredient_list):
    """Wrapper around parse_ingredient so we can call it on an ingredient list"""
    sentences = tokenizer.tokenize(ingredient_list)
    sentences = [sent.strip('\n') for sent in sentences]
    ingredients = []
    for sent in sentences:
        ingredients.extend(parse_ingredient(sent))
    return ingredients
```
```python
q = '''
2 1⁄4 cups all-purpose flour.
1⁄2 teaspoon baking soda.
1 cup (2 sticks) unsalted butter, room temperature.
1⁄2 cup granulated sugar.
1 cup packed light-brown sugar.
1 teaspoon salt.
2 teaspoons pure vanilla extract.
2 large eggs.
2 cups (about 12 ounces) semisweet and/or milk chocolate chips.
'''

parse_recipe_ingredients(q)
```