[nltk_data] Error loading reuters: <urlopen error [Errno 11004] getaddrinfo

While working on cs224n Assignment 1, I need the reuters corpus from the NLTK library, but running nltk.download("reuters") in the code fails with a network error and the corpus cannot be downloaded:

[nltk_data] Error loading reuters: <urlopen error [Errno 11004] 
[nltk_data]     getaddrinfo

I spent a long time following online tutorials and ran into many pitfalls before arriving at a minimal solution that works. You do not have to download the entire NLTK data package (more than 700 MB); you only need the files for the corpora you actually use.

Failure mode 1: Modifying the hosts file

This method requires a system-wide (global) proxy. Following the blog post "NLTK anomaly problem", I added the suggested entry to the hosts file:
199.232.68.133 raw.githubusercontent.com
It had no effect, because this IP is no longer usable.
Since I only have a browser-side proxy, not a global one, I gave up on this approach.
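Before editing the hosts file, it can help to confirm that name resolution is really what fails. The following is a minimal sketch using only the standard library; the helper name can_resolve is made up for illustration, and the port 443 default is an assumption (HTTPS).

```python
import socket


def can_resolve(hostname, port=443):
    """Return True if the system resolver can map hostname to an address."""
    try:
        socket.getaddrinfo(hostname, port)
        return True
    except socket.gaierror:
        # Name resolution failed; this is consistent with the
        # "getaddrinfo" part of the NLTK error message.
        return False
```

Calling can_resolve("raw.githubusercontent.com") before and after a hosts change shows whether the entry made any difference. Note that a True result only means the name resolves, not that the download will succeed.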

Failure mode 2: Downloading the NLTK data package but being unable to use it

The usual advice is to download the NLTK data package from the Gitee link, but most bloggers download the entire package (700+ MB) directly, which takes up a lot of disk space. Here is how to fetch only the files you need, along with the details that need special attention (the pitfalls I ran into).

1. First check your NLTK directory and create one if you don’t have one.

In the Python interpreter, type

import nltk
nltk.data.find(".")

If no nltk_data folder exists yet, this raises a LookupError that lists the locations NLTK searched:

Searched in:
    - 'C:\\Users\\YayingLuo/nltk_data'
    - 'C:\\Users\\ghost\\anaconda3\\envs\\cs224n\\nltk_data'
    - 'C:\\Users\\ghost\\anaconda3\\envs\\cs224n\\share\\nltk_data'
    - 'C:\\Users\\ghost\\anaconda3\\envs\\cs224n\\lib\\nltk_data'
    - 'C:\\Users\\YayingLuo\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'

These paths vary with your username and environment. The message appears because NLTK could not find an nltk_data folder in any of them. Pick one of these paths and create the nltk_data folder there, for example:
C:\Users\ghost\anaconda3\envs\cs224n\nltk_data
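Step 1 can also be done from Python. The sketch below is one possible way, not the only one: the helper name ensure_nltk_data_dir is made up for illustration, and it assumes you pass in the candidate paths from the "Searched in:" list.

```python
from pathlib import Path


def ensure_nltk_data_dir(candidates):
    """Return the first existing candidate folder, creating the first one if none exist."""
    paths = [Path(p) for p in candidates]
    for path in paths:
        if path.is_dir():
            return path
    # No candidate exists yet, so create the first one (with parents).
    paths[0].mkdir(parents=True, exist_ok=True)
    return paths[0]
```

With NLTK installed, the search list is available as nltk.data.path, so ensure_nltk_data_dir(nltk.data.path) should create or return a folder that NLTK will look in. Put the path you prefer first in the list if you want to control where the folder is created.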

2. Download the corpus files you need.

For example, the original assignment code downloads the reuters corpus, a financial news dataset:

nltk.download('reuters')

In the Gitee link given above, go to the packages/corpora path and download the reuters.zip file.
Because the file is downloaded from Gitee, the zip carries a lot of extra path prefixes, and the real reuters folder is nested one level further down. This has to be fixed or the corpus will not load correctly.

3. [Important] Fix the file and folder names and put the downloaded file in the right place.

In the nltk_data folder created earlier, create a new subfolder named corpora (meaning corpus).
Rename the downloaded zip to reuters.zip.
Re-compress the inner reuters folder as reuters.zip and move it to the /nltk_data/corpora path.
I tested this myself: there is no need to unzip it there, the zip can be used directly.
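The re-compression in step 3 can be scripted with the standard zipfile module. This is a minimal sketch under stated assumptions: the helper name repackage_corpus is made up for illustration, the downloaded archive is assumed to contain a folder named exactly reuters somewhere below its path prefixes, and the downloaded file is assumed to live outside the target corpora folder.

```python
import zipfile
from pathlib import Path, PurePosixPath


def repackage_corpus(downloaded_zip, nltk_data_dir, name="reuters"):
    """Write <nltk_data_dir>/corpora/<name>.zip with entries rooted at "<name>/"."""
    corpora_dir = Path(nltk_data_dir) / "corpora"
    corpora_dir.mkdir(parents=True, exist_ok=True)
    target = corpora_dir / f"{name}.zip"

    with zipfile.ZipFile(downloaded_zip) as src, \
            zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            parts = PurePosixPath(info.filename).parts
            # Skip directory entries and files that are not inside "<name>/".
            if info.is_dir() or name not in parts[:-1]:
                continue
            # Drop every path prefix before the first "<name>" folder.
            inner = "/".join(parts[parts.index(name):])
            dst.writestr(inner, src.read(info))
    return target
```

For example, repackage_corpus("Downloads/reuters.zip", "C:/Users/ghost/anaconda3/envs/cs224n/nltk_data") would produce corpora/reuters.zip whose entries all start with reuters/; the Downloads path here is only a placeholder.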

This should work. If you still have problems, restart Jupyter Notebook and check the names of the zip file and of the folders in the path.
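If it still fails, checking the zip's name and internal layout by hand gets tedious. The sketch below automates that check; the helper name check_corpus_zip is made up for illustration, and it assumes the layout described above (corpora/reuters.zip with every entry under reuters/).

```python
import zipfile
from pathlib import Path


def check_corpus_zip(nltk_data_dir, name="reuters"):
    """Return a list of problems found with corpora/<name>.zip (empty if none)."""
    zip_path = Path(nltk_data_dir) / "corpora" / f"{name}.zip"
    if not zip_path.is_file():
        return [f"missing file: {zip_path}"]

    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()

    problems = []
    if not names:
        problems.append("archive is empty")
    # Every entry should sit under the top-level "<name>/" folder.
    stray = [n for n in names if not n.startswith(f"{name}/")]
    if stray:
        problems.append(f"entries outside '{name}/': {stray[:3]}")
    return problems
```

An empty list means the file is where it should be and has the expected structure. After that, from nltk.corpus import reuters followed by reuters.fileids() in the notebook is the real test.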

Note

If Anaconda creates a new env but Jupyter Notebook cannot connect to that kernel, the solution is to activate the environment in Anaconda first and then open Jupyter Notebook from it, instead of opening it from the base environment.
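To confirm which environment a notebook kernel is actually running in, one quick check is to print the interpreter path from inside the notebook. This is only a small sketch; the helper name interpreter_info is made up for illustration.

```python
import sys


def interpreter_info():
    """Return the path of the running Python executable and its environment prefix."""
    return {"executable": sys.executable, "prefix": sys.prefix}


print(interpreter_info())
```

If the printed paths do not point into the env you created (cs224n in this post), the notebook is still using another environment, and the nltk_data folder placed under that env will not be searched.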