{"id":3622,"date":"2016-03-03T09:19:23","date_gmt":"2016-03-03T09:19:23","guid":{"rendered":"http:\/\/blogs.oii.ox.ac.uk\/policy\/?p=3552"},"modified":"2020-12-07T14:24:52","modified_gmt":"2020-12-07T14:24:52","slug":"topic-modelling-content-from-the-everyday-sexism-project-whats-it-all-about","status":"publish","type":"post","link":"https:\/\/ensr.oii.ox.ac.uk\/topic-modelling-content-from-the-everyday-sexism-project-whats-it-all-about\/","title":{"rendered":"Topic modelling content from the &#8220;Everyday Sexism&#8221; project: what\u2019s it all about?"},"content":{"rendered":"<p class=\"Normal\" dir=\"LTR\"><span class=\"Normal__Char\">We recently\u00a0<\/span><a href=\"http:\/\/blogs.oii.ox.ac.uk\/policy\/creating-a-semantic-map-of-sexism-topic-modelling-of-everyday-sexism-content\" target=\"_blank\" rel=\"noopener noreferrer\"><span class=\"Internet_0020Link__Char\">announced<\/span><\/a><span class=\"Normal__Char\">\u00a0the start of an exciting new research\u00a0<\/span><a href=\"http:\/\/www.oii.ox.ac.uk\/research\/?id=145\" target=\"_blank\" rel=\"noopener noreferrer\"><span class=\"Internet_0020Link__Char\">project<\/span><\/a><span class=\"Normal__Char\">\u00a0that will involve the use of topic modelling in understanding the patterns in submitted stories to the\u00a0<\/span><a href=\"http:\/\/www.everydaysexism.com\" target=\"_blank\" rel=\"noopener noreferrer\"><span class=\"Internet_0020Link__Char\">Everyday Sexism<\/span><\/a><span class=\"Normal__Char\">\u00a0website. Here, we briefly explain our text analysis approach, \u201ctopic modelling\u201d.<\/span><\/p>\n<p class=\"Normal\" dir=\"LTR\"><span class=\"Normal__Char\">At its very core, topic modelling is a technique that seeks to automatically discover the topics contained within a group of documents. \u2018Documents\u2019 in this context could refer to text items as lengthy as individual books, or as short as sentences within a paragraph. 
Let\u2019s take the idea of sentences-as-documents as an example:<\/span><\/p>\n<ul>\n<li class=\"List_0020Paragraph\"><strong><span class=\"List_0020Paragraph__Char\">Document 1:<\/span><\/strong><span class=\"List_0020Paragraph__Char\">\u00a0I like to eat kippers for breakfast.<\/span><\/li>\n<li class=\"List_0020Paragraph\"><strong><span class=\"List_0020Paragraph__Char\">Document 2:<\/span><\/strong><span class=\"List_0020Paragraph__Char\">\u00a0I love all animals, but kittens are the cutest.<\/span><\/li>\n<li class=\"List_0020Paragraph\"><strong><span class=\"List_0020Paragraph__Char\">Document 3:<\/span><\/strong><span class=\"List_0020Paragraph__Char\">\u00a0My kitten eats kippers too.<\/span><\/li>\n<\/ul>\n<p class=\"Normal\" dir=\"LTR\"><span class=\"Normal__Char\">Assuming that each sentence contains a mixture of different topics (and that a \u2018topic\u2019 can be understood as a collection of words (of any part of speech) that have different probabilities of appearance in passages discussing the topic), how does the topic modelling algorithm \u2018discover\u2019 the topics within these sentences?<\/span><\/p>\n<p class=\"Normal\" dir=\"LTR\"><span class=\"Normal__Char\">The algorithm is initiated by setting the number of topics that it needs to extract. Of course, it is hard to guess this number without some prior insight into the topics, but one can think of this as a resolution tuning parameter. The smaller the number of topics, the more general the bag of words in each topic will be, and\u00a0the looser the connections between them.<\/span><\/p>\n<p class=\"Normal\" dir=\"LTR\"><span class=\"Normal__Char\">The algorithm loops through all of the words in each document, assigning every word to one of our topics in a temporary and semi-random manner. This initial assignment is arbitrary; in practice, different initializations tend to lead to similar topics after enough iterations, although identical results are not guaranteed. 
Once each word has been assigned a temporary topic, the algorithm then iterates again through each word in each document to update the topic assignment using two criteria: 1) How prevalent is the word in question across topics? And 2) How prevalent are the topics in the document?<\/span><\/p>\n<p class=\"Normal\" dir=\"LTR\"><span class=\"Normal__Char\">To quantify these two criteria, the algorithm calculates the likelihood of the words appearing in each document, given the current assignment of words to topics and of topics to documents.\u00a0<\/span><\/p>\n<p class=\"Normal\" dir=\"LTR\"><span class=\"Normal__Char\">Of course, words can appear in different topics and more than one topic can appear in a document. But the iterative algorithm seeks to maximize the self-consistency of the assignment by maximizing the likelihood of the observed word-document statistics.\u00a0<\/span><\/p>\n<p class=\"Normal\"><span class=\"Normal__Char\">We can illustrate this process and its outcome by going back to our example. 
A topic modelling approach might use the process above to discover the following topics across our documents:<\/span><\/p>\n<ul>\n<li class=\"List_0020Paragraph\"><strong><span class=\"List_0020Paragraph__Char\">Document 1:<\/span><\/strong><span class=\"List_0020Paragraph__Char\">\u00a0I like to\u00a0<span style=\"text-decoration: underline\"><span class=\"List_0020Paragraph__Char\">eat<\/span><\/span>\u00a0<span style=\"text-decoration: underline\"><span class=\"List_0020Paragraph__Char\">kippers<\/span><\/span>\u00a0for\u00a0<span style=\"text-decoration: underline\"><span class=\"List_0020Paragraph__Char\">breakfast<\/span><\/span>.\u00a0<strong>[<\/strong><\/span><strong>100% Topic A]<\/strong><\/li>\n<li class=\"List_0020Paragraph\"><strong>D<span class=\"List_0020Paragraph__Char\">ocument 2:<\/span><\/strong><span class=\"List_0020Paragraph__Char\">\u00a0I love all\u00a0<em><span class=\"List_0020Paragraph__Char\">animals<\/span><\/em>, but\u00a0<em><span class=\"List_0020Paragraph__Char\">kittens<\/span><\/em>\u00a0are the cutest. <strong>[<\/strong><\/span><strong>100% Topic B]<\/strong><\/li>\n<li class=\"List_0020Paragraph\"><span class=\"List_0020Paragraph__Char\"><strong>Document 3:<\/strong>\u00a0<\/span><span class=\"List_0020Paragraph__Char\">My\u00a0<span class=\"List_0020Paragraph__Char\"><em>kitten<\/em>\u00a0<\/span><span style=\"text-decoration: underline\"><span class=\"List_0020Paragraph__Char\">eats<\/span><\/span>\u00a0<span style=\"text-decoration: underline\"><span class=\"List_0020Paragraph__Char\">kippers<\/span><\/span>\u00a0too. <strong>[<\/strong><\/span><strong>67% Topic A, 33% Topic B]<\/strong><\/li>\n<\/ul>\n<p class=\"Normal\"><span class=\"Normal__Char\">Topic modelling defines each topic as a so-called \u2018bag of words\u2019, but it is the researcher\u2019s responsibility to decide upon an appropriate label for each topic based on their understanding of language and context. 
Going back to our example, the algorithm might classify the\u00a0<span class=\"Normal__Char\"><span style=\"text-decoration: underline\">underlined<\/span>\u00a0<\/span>words under Topic A, which we could then label as \u2018food\u2019 based on our understanding of what the words mean. Similarly, the\u00a0<em><span class=\"Normal__Char\">italicised<\/span><\/em>\u00a0words might be classified under a separate topic, Topic B, which we could label \u2018animals\u2019. In this simple example, the word \u201ceat\u201d appears in a sentence dominated by Topic A, but also in a sentence with some association to Topic B, so it can be seen as a connector between the two topics. Of course, animals eat too, and they like food!<\/span><\/p>\n<p class=\"Normal\" dir=\"LTR\"><span class=\"Normal__Char\">We are going to use a similar approach to first extract the main topics reflected in the reports submitted to the\u00a0<\/span><a href=\"http:\/\/www.everydaysexism.com\" target=\"_blank\" rel=\"noopener noreferrer\"><span class=\"Internet_0020Link__Char\"><span class=\"Internet_0020Link__Char\">Everyday Sexism Project<\/span><\/span><\/a><span class=\"Normal__Char\">\u00a0website and then extract the relations between the sexism-related topics and concepts based on the overlap between the bags of words of each topic. Finally, we can also look into the co-appearance of topics in the same document. This way, we aim to draw a linguistic picture of the more than 100,000\u00a0submitted reports.<\/span><\/p>\n<p class=\"Normal\" dir=\"LTR\"><span class=\"Normal__Char\">As ever, be sure to check back for further updates on our progress!<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>We recently\u00a0announced\u00a0the start of an exciting new research\u00a0project\u00a0that will involve the use of topic modelling in understanding the patterns in submitted stories to the\u00a0Everyday Sexism\u00a0website. 
Here, we briefly explain our text analysis approach, \u201ctopic modelling\u201d. At its very core, topic modelling is a technique that seeks to automatically discover the topics contained within a group [&hellip;]<\/p>\n","protected":false},"author":22,"featured_media":3439,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[16],"tags":[83,112,179,241],"_links":{"self":[{"href":"https:\/\/ensr.oii.ox.ac.uk\/wp-json\/wp\/v2\/posts\/3622"}],"collection":[{"href":"https:\/\/ensr.oii.ox.ac.uk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ensr.oii.ox.ac.uk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ensr.oii.ox.ac.uk\/wp-json\/wp\/v2\/users\/22"}],"replies":[{"embeddable":true,"href":"https:\/\/ensr.oii.ox.ac.uk\/wp-json\/wp\/v2\/comments?post=3622"}],"version-history":[{"count":2,"href":"https:\/\/ensr.oii.ox.ac.uk\/wp-json\/wp\/v2\/posts\/3622\/revisions"}],"predecessor-version":[{"id":4913,"href":"https:\/\/ensr.oii.ox.ac.uk\/wp-json\/wp\/v2\/posts\/3622\/revisions\/4913"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ensr.oii.ox.ac.uk\/wp-json\/wp\/v2\/media\/3439"}],"wp:attachment":[{"href":"https:\/\/ensr.oii.ox.ac.uk\/wp-json\/wp\/v2\/media?parent=3622"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ensr.oii.ox.ac.uk\/wp-json\/wp\/v2\/categories?post=3622"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ensr.oii.ox.ac.uk\/wp-json\/wp\/v2\/tags?post=3622"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}