This is a collection of code written by Maurice Curran that was used to process the Microscopy and Microanalysis conference proceedings corpus into the word products described in the publication "NLP-Driven Electron Microscopy Ontology Development". The scripts are written in Python and are meant to be run in the following order:
1. SettingUpTextFiles.py and CopyingText.py to get the raw text files;
2. SentenceConversion.py;
3. reference_remover.py;
4. testing.py and testingavg.py;
5. SentenceCreator.py;
6. matscholar_model.py to get the matscholar tags;
7. training_model_gensim.py to get the gensim model;
8. word2vecscript.py and gensim_visual.py.
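
The order above can be sketched as a small driver script. This is only an illustration, not part of the original code: `PIPELINE`, `build_commands`, and `run_pipeline` are hypothetical names, and the sketch assumes that every script sits in one directory, takes no command-line arguments, and that scripts sharing a step run in the order they are listed. Check each script for hard-coded paths before running it for real.

```python
import subprocess
import sys
from pathlib import Path

# Step number -> scripts, copied from the list above.
# Assumption: scripts within one step run in the order listed.
PIPELINE = [
    (1, ["SettingUpTextFiles.py", "CopyingText.py"]),
    (2, ["SentenceConversion.py"]),
    (3, ["reference_remover.py"]),
    (4, ["testing.py", "testingavg.py"]),
    (5, ["SentenceCreator.py"]),
    (6, ["matscholar_model.py"]),
    (7, ["training_model_gensim.py"]),
    (8, ["word2vecscript.py", "gensim_visual.py"]),
]


def build_commands(script_dir=".", python=sys.executable):
    """Return one command per script, in pipeline order, without running anything."""
    return [
        [python, str(Path(script_dir) / name)]
        for _step, names in PIPELINE
        for name in names
    ]


def run_pipeline(script_dir=".", dry_run=True):
    """Print each command; when dry_run is False, also execute it."""
    for cmd in build_commands(script_dir):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing script.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Assumption: no script takes arguments. Dry run by default.
    run_pipeline(dry_run=True)
```

Calling `run_pipeline(dry_run=False)` would execute the scripts one after another and stop at the first failure, so that a later step never runs on the incomplete output of an earlier one.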