Thursday, May 16, 2013

Semantic web. It's not that hard.



The Semantic Web is a way of representing the data on the web in a structured form. The goal is a web that "understands" itself, where you search for concepts rather than words.
This content is generally represented in RDF (Resource Description Framework), which expresses everything associated with a term. To search within this content, you use the SPARQL query language.

In this post I'm going to focus on one way of getting the meaning of the words in a text. That is, starting from a fragment, we can find its main terms and what they refer to.
To do that we'll use DBpedia, a community project that takes the information in Wikipedia and publishes it as RDF, so it has all the terms we need.

To save us from having to write SPARQL, DBpedia offers its open-source Spotlight service: a RESTful service that we can query to get the entities associated with a given text.
Looking around, I also found a gem that wraps this call: https://github.com/fumi/dbpedia-spotlight-rb
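For reference, the kind of request the gem wraps can be sketched by hand with Ruby's standard library. This is only a rough sketch, not the gem's implementation: the helper names `spotlight_annotate_uri` and `spotlight_annotate` are made up for illustration, and I'm assuming the service exposes an `annotate` resource under the base URL that takes `text` and `confidence` query parameters and returns JSON when asked to.

```ruby
require "uri"
require "net/http"
require "json"

# Hypothetical helper: builds the URI for an annotate request.
# Assumption: the endpoint is <base_url>/annotate with `text` and
# `confidence` as query parameters.
def spotlight_annotate_uri(base_url, text, confidence: 0.2)
  uri = URI.join(base_url, "annotate")
  uri.query = URI.encode_www_form(text: text, confidence: confidence)
  uri
end

# Hypothetical helper: performs the GET request and parses the JSON body.
# Assumption: the service answers with JSON when sent this Accept header.
def spotlight_annotate(base_url, text)
  uri = spotlight_annotate_uri(base_url, text)
  request = Net::HTTP::Get.new(uri)
  request["Accept"] = "application/json"
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(request)
  end
  JSON.parse(response.body)
end
```

Calling `spotlight_annotate("http://spotlight.dbpedia.org/rest/", "President Obama")` would then send the request, provided the service is reachable at that address.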

So, to use it from a Rails application, we just install this gem and call the service to get the list of associated entities. To do this:


# Client pointing at the public Spotlight REST endpoint
SPOTLIGHT_ACCESS = DBpedia::Spotlight("http://spotlight.dbpedia.org/rest/")
texto = "President Obama on Monday will call for a new minimum tax rate for individuals making more than $1 million a year to ensure that they pay at least the same percentage of their earnings as other taxpayers, according to administration officials."
# Ask the service for the entities mentioned in the text
entities = SPOTLIGHT_ACCESS.annotate texto
entities

=> [{"@URI"=>"http://dbpedia.org/resource/Presidency_of_Barack_Obama", "@support"=>"134", "@types"=>"DBpedia:OfficeHolder,DBpedia:Person,Schema:Person,Freebase:/book/book_subject,Freebase:/book,Freebase:/book/periodical_subject,Freebase:/media_common/quotation_subject,Freebase:/media_common,DBpedia:TopicalConcept", "@surfaceForm"=>"President Obama", "@offset"=>"0", "@similarityScore"=>"0.18565504252910614", "@percentageOfSecondRank"=>"-1.0"}, {"@URI"=>"http://dbpedia.org/resource/Rick_Monday", "@support"=>"96", "@types"=>"DBpedia:BaseballPlayer,DBpedia:Athlete,DBpedia:Person,Schema:Person,Freebase:/people/measured_person,Freebase:/people,Freebase:/sports/drafted_athlete,Freebase:/sports,Freebase:/sports/pro_athlete,Freebase:/baseball/baseball_player,Freebase:/baseball,Freebase:/people/person", "@surfaceForm"=>"Monday", "@offset"=>"19", "@similarityScore"=>"0.0737665593624115", "@percentageOfSecondRank"=>"0.6866002476572552"}, {"@URI"=>"http://dbpedia.org/resource/Call_option", "@support"=>"123", "@types"=>"", "@surfaceForm"=>"call", "@offset"=>"31", "@similarityScore"=>"0.12362345308065414", "@percentageOfSecondRank"=>"0.710047498083316"}, {"@URI"=>"http://dbpedia.org/resource/Maxima_and_minima", "@support"=>"131", "@types"=>"", "@surfaceForm"=>"minimum", "@offset"=>"46", "@similarityScore"=>"0.05654768645763397", "@percentageOfSecondRank"=>"-1.0"}, {"@URI"=>"http://dbpedia.org/resource/Individual", "@support"=>"312", "@types"=>"", "@surfaceForm"=>"individuals", "@offset"=>"67", "@similarityScore"=>"0.12983855605125427", "@percentageOfSecondRank"=>"-1.0"}, {"@URI"=>"http://dbpedia.org/resource/Million", "@support"=>"492", "@types"=>"", "@surfaceForm"=>"1 million", "@offset"=>"97", "@similarityScore"=>"0.12119115144014359", "@percentageOfSecondRank"=>"-1.0"}, {"@URI"=>"http://dbpedia.org/resource/University", "@support"=>"5001", 
"@types"=>"Freebase:/organization/organization_type,Freebase:/organization,Freebase:/business/company_type,Freebase:/business,Freebase:/tv/tv_subject,Freebase:/tv,Freebase:/education/school_type,Freebase:/education,Freebase:/book/book_subject,Freebase:/book,Freebase:/fictional_universe/type_of_fictional_setting,Freebase:/fictional_universe,Freebase:/architecture/building_function,Freebase:/architecture,DBpedia:TopicalConcept", "@surfaceForm"=>"year", "@offset"=>"109", "@similarityScore"=>"0.08789163082838058", "@percentageOfSecondRank"=>"0.9540837774225911"}, {"@URI"=>"http://dbpedia.org/resource/Payment", "@support"=>"129", "@types"=>"Freebase:/media_common/quotation_subject,Freebase:/media_common", "@surfaceForm"=>"pay", "@offset"=>"134", "@similarityScore"=>"0.11993571370840073", "@percentageOfSecondRank"=>"-1.0"}, {"@URI"=>"http://dbpedia.org/resource/Percentage", "@support"=>"165", "@types"=>"", "@surfaceForm"=>"percentage", "@offset"=>"156", "@similarityScore"=>"0.18815511465072632", "@percentageOfSecondRank"=>"-1.0"}, {"@URI"=>"http://dbpedia.org/resource/Income", "@support"=>"648", "@types"=>"Freebase:/media_common/quotation_subject,Freebase:/media_common,Freebase:/book/book_subject,Freebase:/book,DBpedia:TopicalConcept", "@surfaceForm"=>"earnings", "@offset"=>"176", "@similarityScore"=>"0.16177840530872345", "@percentageOfSecondRank"=>"-1.0"}, {"@URI"=>"http://dbpedia.org/resource/Tax", "@support"=>"1540", "@types"=>"Freebase:/tv/tv_subject,Freebase:/tv,Freebase:/organization/organization_sector,Freebase:/organization,Freebase:/book/book_subject,Freebase:/book", "@surfaceForm"=>"taxpayers", "@offset"=>"194", "@similarityScore"=>"0.15324345231056213", "@percentageOfSecondRank"=>"-1.0"}, {"@URI"=>"http://dbpedia.org/resource/Administration_%28government%29", "@support"=>"126", "@types"=>"", "@surfaceForm"=>"administration", "@offset"=>"218", "@similarityScore"=>"0.195627361536026", "@percentageOfSecondRank"=>"0.6496970292489601"}, 
{"@URI"=>"http://dbpedia.org/resource/Official", "@support"=>"196", "@types"=>"Freebase:/people/profession,Freebase:/people,Freebase:/fictional_universe/character_occupation,Freebase:/fictional_universe,Freebase:/book/book_subject,Freebase:/book", "@surfaceForm"=>"officials", "@offset"=>"233", "@similarityScore"=>"0.11004404723644257", "@percentageOfSecondRank"=>"-1.0"}] 

With this we get a list of entities, each with its associated types, which we can use, for example, to identify the category of a text.
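As a rough idea of how that categorization could start, here is a minimal sketch that counts how often each type appears across the entities. The `type_counts` helper and the `SAMPLE_ENTITIES` constant are made up for this example; the sample data is a trimmed-down copy of three of the hashes from the output above, keeping only the `@URI` and `@types` keys.

```ruby
# Trimmed sample shaped like the hashes in the output above.
SAMPLE_ENTITIES = [
  { "@URI" => "http://dbpedia.org/resource/Presidency_of_Barack_Obama",
    "@types" => "DBpedia:OfficeHolder,DBpedia:Person,Schema:Person" },
  { "@URI" => "http://dbpedia.org/resource/Rick_Monday",
    "@types" => "DBpedia:BaseballPlayer,DBpedia:Athlete,DBpedia:Person,Schema:Person" },
  { "@URI" => "http://dbpedia.org/resource/Call_option",
    "@types" => "" }
].freeze

# Hypothetical helper: tallies every type in the comma-separated "@types"
# strings and returns [type, count] pairs, most frequent first
# (ties are broken alphabetically).
def type_counts(entities)
  counts = Hash.new(0)
  entities.each do |entity|
    entity.fetch("@types", "").split(",").each { |type| counts[type] += 1 }
  end
  counts.sort_by { |type, count| [-count, type] }
end

type_counts(SAMPLE_ENTITIES).each do |type, count|
  puts "#{type}: #{count}"
end
```

Entities with an empty `@types` string simply contribute nothing to the tally. A real classifier would probably also want to weigh each entity by its `@similarityScore`, which this sketch ignores.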



Note: if we need to make the request from behind a proxy...
SPOTLIGHT_ACCESS.class.http_proxy proxy_host, proxy_port