afader/oqa: Open Question Answering - GitHub


This is a repository for the code and data from the paper *Open Question Answering Over Curated and Extracted Knowledge Bases* from KDD 2014. If you use any of these resources in a published paper, please use the following citation:

    @inproceedings{Fader14,
        author    = {Anthony Fader and Luke Zettlemoyer and Oren Etzioni},
        title     = {{Open Question Answering Over Curated and Extracted
                      Knowledge Bases}},
        booktitle = {KDD},
        year      = {2014}
    }

# Code

**Warning**: This project has lots of moving parts. It will probably take quite a bit of effort to get it running. I would recommend playing with the data before trying to run the code.

## Dependencies

Below are the dependencies used for OQA. Version numbers are what I have used, but other versions may be compatible.

* sbt (0.13)
* java (1.8.0)
* scala (2.10)
* Boost C++ libraries (1.5.7)
* Python (2.7.8)
* wget (1.15)

## Code Structure

OQA consists of the following components:

* Solr indexes (used for storing triples, paraphrases, and query rewrites)
* Language model (used for scoring answer derivation steps)
* Question answering code (used for inference and learning)

Getting the code running involves completing these steps in order:

1. Downloading the data in `oqa-data/`
2. Creating the indexes in `oqa-solr/`
3. Building the language model in `oqa-lm/`
4. Running the code in `oqa-core/`

Each of these directories has its own README file that walks you through the steps.

# Data

Below is a description of the data included with OQA.

## Knowledge Base (KB) Data

You can download the KB data at this URL: http://knowitall.cs.washington.edu/oqa/data/kb. The KB is divided into 20 gzip-compressed files. The total compressed file size is approximately 20GB; the total decompressed file size is approximately 50GB.

Each file contains a newline-separated list of KB records. Each record is a tab-separated list of (field name, field value) pairs. For example, here is a record corresponding to a Freebase assertion (with tabs replaced by newlines):

    arg1 1,2-Benzoquinone
    rel Notable types
    arg2 Chemical Compound
    arg1_fbid_s 08s9rd
    id fb-179681780
    namespace freebase

The following field names appear in the data:

| Field Name | Description | Required? |
|------------|-------------|-----------|
| `arg1` | Argument 1 of the triple | Yes |
| `rel` | Relation phrase of the triple | Yes |
| `arg2` | Argument 2 of the triple | Yes |
| `id` | Unique ID for the triple | Yes |
| `namespace` | The source of this triple | Yes |
| `arg1_fbid_s` | Arg1 Freebase ID | No |
| `arg2_fbid_s` | Arg2 Freebase ID | No |
| `num_extrs_i` | Extraction redundancy | No |
| `conf_f` | Extractor confidence | No |
| `corpora_ss` | Extractor corpus | No |
| `zipfSlope_f` | Probase statistic | No |
| `entitySize_i` | Probase statistic | No |
| `entityFrequency_i` | Probase statistic | No |
| `popularity_i` | Probase statistic | No |
| `freq_i` | Probase statistic | No |
| `zipfPearsonCoefficient_f` | Probase statistic | No |
| `conceptVagueness_f` | Probase statistic | No |
| `prob_f` | Probase statistic | No |
| `conceptSize_i` | Probase statistic | No |

There are a total of 930 million records in the data. The distribution of the different `namespace` values is:

| Namespace | Count |
|-----------|-------|
| Total | 930,143,872 |
| ReVerb | 391,345,565 |
| Freebase | 299,370,817 |
| Probase | 170,278,429 |
| Open IE 4.0 | 67,221,551 |
| NELL | 1,927,510 |

## WikiAnswers Corpus

The WikiAnswers corpus contains clusters of questions tagged by WikiAnswers users as paraphrases. Each cluster optionally contains an answer provided by WikiAnswers users. There are 30,370,994 clusters containing an average of 25 questions per cluster. 3,386,256 (11%) of the clusters have an answer.

The data can be downloaded from: http://knowitall.cs.washington.edu/oqa/data/wikianswers/. The corpus is split into 40 gzip-compressed files. The total compressed file size is 8GB; the total decompressed file size is 40GB. Each file contains one cluster per line. Each cluster is a tab-separated list of questions and answers. Questions are prefixed by `q:` and answers are prefixed by `a:`. Here is an example cluster (tabs replaced with newlines):

    q:How many muslims make up indias 1 billion population?
    q:How many of india's population are muslim?
    q:How many populations of muslims in india?
    q:What is population of muslims in india?
    a:Over 160 million Muslims per Pew Forum Study as of October 2009.
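The cluster layout described above (one cluster per line, tab-separated fields, `q:` and `a:` prefixes) can be read with a few lines of code. The following is a minimal sketch in Python 3, not code from the OQA repository; the names `Cluster` and `parse_cluster` are hypothetical, and the sketch assumes that every field starts with one of the two documented prefixes.

```python
from typing import List, NamedTuple


class Cluster(NamedTuple):
    """One WikiAnswers paraphrase cluster (hypothetical container)."""
    questions: List[str]
    answers: List[str]


def parse_cluster(line: str) -> Cluster:
    """Split one tab-separated corpus line into questions and answers."""
    questions, answers = [], []
    for field in line.rstrip("\n").split("\t"):
        if field.startswith("q:"):
            questions.append(field[2:])
        elif field.startswith("a:"):
            answers.append(field[2:])
        # Assumption: no other prefixes occur; anything else is skipped.
    return Cluster(questions, answers)


# The example cluster from the text, with the tabs restored.
example = "\t".join([
    "q:How many muslims make up indias 1 billion population?",
    "q:How many of india's population are muslim?",
    "q:How many populations of muslims in india?",
    "q:What is population of muslims in india?",
    "a:Over 160 million Muslims per Pew Forum Study as of October 2009.",
])
cluster = parse_cluster(example)
```

Because the corpus files are gzip-compressed and large, it is probably best to stream them line by line (for example with `gzip.open(path, "rt")`) and call the parser on each line instead of loading a whole file into memory.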
This corpus is different from the data used in the Paralex system (see http://knowitall.cs.washington.edu/paralex). First, it contains more questions resulting from a longer crawl of WikiAnswers. Second, it groups questions into clusters, instead of enumerating all pairs of paraphrases. Third, it contains the answers, while the Paralex data does not.

We also provide a hierarchical clustering of the lowercased tokens in the WikiAnswers corpus. We used Percy Liang's implementation of the Brown Clustering Algorithm with 1000 clusters (i.e. `--c 1000`). The raw output is available for download, and the clusters can be browsed online. We did not use these in the OQA system, but we probably should have.

## Paraphrase Template Data

The paraphrase templates used in OQA are available for download at http://knowitall.cs.washington.edu/oqa/data/paraphrase-templates.txt.gz. The file is 90M compressed and 900M decompressed. Each line in the file contains a paraphrase template pair as a tab-separated list of (field name, field value) pairs. Here is an example record (with tabs replaced with newlines):

    id pair1718534
    template1 how do people use $y ?
    template2 what be common use for $y ?
    typ anything
    count1 0.518446
    count2 0.335112
    typCount12 0.195711
    count12 0.195711
    typPmi 0.707756
    pmi 0.687842

Each template in a record is a space-delimited list of lowercased, lemmatized tokens. The token `$y` is a variable representing the argument slot position. The numeric values in the records are scaled to be in [0, 1].

| Field | Description |
|-------|-------------|
| `id` | The unique identifier for the pair of templates |
| `template1` | The first template |
| `template2` | The second template |
| `typ` | Unused field, ignore |
| `count1` | Log count of the first template |
| `count2` | Log count of the second template |
| `typCount12` | Unused field, ignore |
| `count12` | Log joint-count of the template pair |
| `typPmi` | Unused field, ignore |
| `pmi` | Log pointwise mutual information of the template pair |

There are a total of 5,137,558 records in the file.
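The KB, paraphrase template, and query rewrite files are all described as tab-separated lists of (field name, field value) pairs, so one small reader can cover all three. The sketch below is in Python 3 and is not taken from the OQA code. It assumes that names and values simply alternate on a line (`name<TAB>value<TAB>name<TAB>value...`), which the text implies but does not spell out; `parse_record` and `NUMERIC_FIELDS` are hypothetical names.

```python
from typing import Dict

# The four paraphrase-template fields that the field table marks as in use.
NUMERIC_FIELDS = ("count1", "count2", "count12", "pmi")


def parse_record(line: str) -> Dict[str, str]:
    """Turn 'name<TAB>value<TAB>name<TAB>value...' into a dict.

    Assumption: field names and values alternate, all separated by tabs.
    """
    parts = line.rstrip("\n").split("\t")
    if len(parts) % 2 != 0:
        raise ValueError("expected an even number of tab-separated items")
    return dict(zip(parts[0::2], parts[1::2]))


# The example paraphrase-template record, with the tabs restored.
example = "\t".join([
    "id", "pair1718534",
    "template1", "how do people use $y ?",
    "template2", "what be common use for $y ?",
    "typ", "anything",
    "count1", "0.518446",
    "count2", "0.335112",
    "typCount12", "0.195711",
    "count12", "0.195711",
    "typPmi", "0.707756",
    "pmi", "0.687842",
])
record = parse_record(example)

# Convert the in-use statistics to floats; they are scaled to [0, 1].
scores = {name: float(record[name]) for name in NUMERIC_FIELDS}

# Templates are space-delimited tokens; find the argument slot position.
slot_index = record["template1"].split(" ").index("$y")
```

All values are kept as strings by `parse_record` on purpose, so the same function can read KB records, where only some fields are numeric.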
## Query Rewrite Data

The query rewrite operators are available for download at http://knowitall.cs.washington.edu/oqa/data/query-rewrites.txt.gz. The file is 1G compressed and 8G decompressed. Each line in the file is a tab-separated list of (field name, field value) pairs. Here is an example record (with tabs replaced with newlines):

    inverted 0
    joint_count 18
    marg_count1 263
    marg_count2 102
    pmi -7.30675508757
    rel1 be the language of the country
    rel2 be widely speak in

Each record has statistics computed over a pair of relation phrases `rel1` and `rel2`. The relation phrases are lowercased and lemmatized.

| Field | Description |
|-------|-------------|
| `inverted` | 1 if the rule inverts arg. order, 0 otherwise |
| `joint_count` | The number of shared argument pairs in the KB |
| `marg_count1` | The number of argument pairs `rel1` takes in the KB |
| `marg_count2` | The number of argument pairs `rel2` takes in the KB |
| `pmi` | Log pointwise mutual information of `rel1` and `rel2` |
| `rel1` | Lemmatized, lowercased relation phrase 1 |
| `rel2` | Lemmatized, lowercased relation phrase 2 |

There are a total of 74,461,831 records in the file.

## Labeled Question-Answer Pairs

The questions and answers used for the evaluation are available at http://knowitall.cs.washington.edu/oqa/data/questions/. The questions are available in their own files:

* WebQuestions: train, devtest, test
* TREC: train, devtest, test
* WikiAnswers: train, devtest, test

I labeled the top predictions for each system as correct or incorrect if the predicted answer was not found in the label sets provided with WebQuestions, TREC, and WikiAnswers. These labels can be found at http://knowitall.cs.washington.edu/oqa/data/questions/labels.txt. The format of this file is a newline-separated list of tab-separated (`LABEL`, truth value, question, answer) records. The questions and answers may be lowercased and lemmatized.

## System Output

See the documentation in `oqa-data/predictions`.
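A note on the `pmi` field of the query rewrite records: the text gives no formula for it. The single example record is numerically consistent with the natural log of `joint_count / (marg_count1 * marg_count2)`, with no normalization by a total pair count. The Python 3 sketch below checks that arithmetic. It is an observation from one record, not a documented definition, and `log_pmi` is a hypothetical helper name.

```python
import math


def log_pmi(joint_count: int, marg_count1: int, marg_count2: int) -> float:
    """Natural log of joint_count / (marg_count1 * marg_count2).

    Assumption: this unnormalized form is inferred from one example
    record; the data description does not state the formula.
    """
    return math.log(joint_count / (marg_count1 * marg_count2))


# Values copied from the example query rewrite record.
example = {
    "joint_count": 18,
    "marg_count1": 263,
    "marg_count2": 102,
    "pmi": -7.30675508757,
}

computed = log_pmi(
    example["joint_count"],
    example["marg_count1"],
    example["marg_count2"],
)

# Gap between the recomputed value and the value stored in the record.
difference = abs(computed - example["pmi"])
```

If the same relationship holds across the file, `pmi` can be recomputed from the three counts, which may be a useful sanity check after parsing. Verifying this on more records before relying on it would be prudent.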


