{"id":7696,"date":"2019-01-08T23:26:56","date_gmt":"2019-01-08T23:26:56","guid":{"rendered":"https:\/\/www.appservgrid.com\/paw92\/?p=7696"},"modified":"2019-01-12T04:42:43","modified_gmt":"2019-01-12T04:42:43","slug":"drugs-on-the-command-line","status":"publish","type":"post","link":"https:\/\/www.appservgrid.com\/paw92\/index.php\/2019\/01\/08\/drugs-on-the-command-line\/","title":{"rendered":"Drugs on the command line"},"content":{"rendered":"<h1>Drugs on the command line<\/h1>\n<p>There&#8217;s a lot of raw material on the Web for data auditors to tinker with, but I&#8217;ve found only one website that advertises\u00a0<a href=\"https:\/\/makingnoiseandhearingthings.com\/2018\/04\/19\/datasets-for-data-cleaning-practice\/\"><em>Datasets for data cleaning practice<\/em><\/a>. It&#8217;s a 2018 blog post by computational linguist Rachael Tatman. Among the offerings is a\u00a0<a href=\"https:\/\/www.fda.gov\/Drugs\/InformationOnDrugs\/ucm142438.htm\">link<\/a>\u00a0to the National Drug Code Directory website of the US Food and Drug Administration, and one of the FDA downloadables there contains a table with 123,841 product records (2018-12-28 version).<\/p>\n<p class=\"comment\">The product table is plain text and tab-separated, but it&#8217;s in windows-1252 encoding with a Windows carriage return at the end of each line. (Sigh.) I deleted the carriage returns and converted the table to UTF-8 as the file &#8220;prods0&#8221;.<\/p>\n<p>Tatman writes\u00a0<em>&#8220;<strong>Issue<\/strong>: Non-trivial duplication (which drugs are different names for the same things?)&#8221;<\/em><\/p>\n<p>Answering Tatman&#8217;s question isn&#8217;t straightforward, because the product table contains partially duplicated records. Although each record has a unique product ID, if that ID is ignored there&#8217;s a set of more than 1100 duplicates:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.polydesmida.info\/BASHing\/img3\/2019-01-06_1.png\" alt=\"dupes1\" width=\"618\" height=\"308\" \/><\/p>\n<p class=\"comment\">Duplicate pair example from &#8220;prods0&#8221;, with PRODUCTID in\u00a0red:<\/p>\n<p>0009-0039_5e394712-e775-435b-a4e0-32e1d9647ff5 0009-0039 HUMAN PRESCRIPTION DRUG SOLU-MEDROL methylprednisolone sodium succinate INJECTION, POWDER, FOR SOLUTION INTRAMUSCULAR; INTRAVENOUS 19590402 NDA NDA011856 Pharmacia and Upjohn Company LLC METHYLPREDNISOLONE SODIUM SUCCINATE 40 mg\/mL Corticosteroid [EPC],Corticosteroid Hormone Receptor Agonists [MoA] N 20191231<\/p>\n<p>0009-0039_95289567-4341-4b6c-bc3c-aa13036bc9b4 0009-0039 HUMAN PRESCRIPTION DRUG SOLU-MEDROL methylprednisolone sodium succinate INJECTION, POWDER, FOR SOLUTION INTRAMUSCULAR; INTRAVENOUS 19590402 NDA NDA011856 Pharmacia and Upjohn Company LLC METHYLPREDNISOLONE SODIUM SUCCINATE 40 mg\/mL Corticosteroid [EPC],Corticosteroid Hormone Receptor Agonists [MoA] N 20191231<\/p>\n<p>I\u00a0<strong>cut<\/strong>\u00a0away the unique ID and\u00a0<strong>sort<\/strong>ed and\u00a0<strong>uniq<\/strong>uified the records to build the file &#8220;prods1&#8221;, retaining the header line:<\/p>\n<p class=\"code\">cat &lt;(cut -f1 &#8211;complement prods0 | head -n 1) &lt;(tail -n +2 prods0 | cut -f1 &#8211;complement | sort | uniq) &gt; prods1<\/p>\n<p>Next, I focused in &#8220;prods1&#8221; on the fields SUBSTANCENAME (field 13), ACTIVE_NUMERATOR_STRENGTH (14) and ACTIVE_INGRED_UNIT (15). The FDA&#8217;s\u00a0<a href=\"https:\/\/www.fda.gov\/Drugs\/InformationOnDrugs\/ucm254527.htm\">explainer page<\/a>\u00a0describes these as follows:<\/p>\n<p class=\"code\"><strong>SubstanceName<\/strong><br \/>\nThis is the active ingredient list. Each ingredient name is the preferred term of the UNII code submitted.<\/p>\n<p><strong>StrengthNumber<\/strong>\u00a0[older field name?]<br \/>\nThese are the strength values (to be used with units below) of each active ingredient, listed in the same order as the SubstanceName field above.<\/p>\n<p><strong>StrengthUnit<\/strong>\u00a0[older field name?]<br \/>\nThese are the units to be used with the strength values above, listed in the same order as the SubstanceName and SubstanceNumber.<\/p>\n<p>If these 3 fields are the same, then the product is the same so far as the active ingredients are concerned. To find these partial duplicates I used the two-pass method described in a previous\u00a0<a href=\"https:\/\/www.polydesmida.info\/BASHing\/2018-07-14.html\"><em>BASHing data<\/em>\u00a0post<\/a>:<\/p>\n<p class=\"code\">awk -F&#8221;\\t&#8221; &#8216;FNR==NR {a[$13,$14,$15]++; next} $13 != &#8220;&#8221; &amp;&amp; $14 != &#8220;&#8221; &amp;&amp; $15 != &#8220;&#8221; &amp;&amp; a[$13,$14,$15]&gt;1&#8217; prods1 prods1 | wc -l<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.polydesmida.info\/BASHing\/img3\/2019-01-06_2.png\" alt=\"dupes2\" width=\"557\" height=\"97\" \/><\/p>\n<p>Wow! That&#8217;s a lot of &#8220;same product&#8221; out of 123,205 unique product records. To investigate further I added the fields STARTMARKETINGDATE (field 8 in prods1), PROPRIETARYNAME (3), PROPRIETARYNAMESUFFIX (4) and LABELERNAME (12) to a print as the new file &#8220;prods2&#8221; (no header this time).<\/p>\n<p class=\"code\">awk -F&#8221;\\t&#8221; &#8216;FNR==NR {a[$13,$14,$15]++; next} $13 != &#8220;&#8221; &amp;&amp; $14 != &#8220;&#8221; &amp;&amp; $15 != &#8220;&#8221; &amp;&amp; a[$13,$14,$15]&gt;1 {print $8 FS $3 FS $4 FS $12 FS $13 FS $14 FS $15}&#8217; prods1 prods1 &gt; prods2<\/p>\n<p>&#8212;&#8211;<\/p>\n<p><strong>StartMarketingDate<\/strong><br \/>\nThis is the date that the labeler indicates was the start of its marketing of the drug product.<\/p>\n<p><strong>ProprietaryName<\/strong><br \/>\nAlso known as the trade name. It is the name of the product chosen by the labeler.<\/p>\n<p><strong>ProprietaryNameSuffix<\/strong><br \/>\nA suffix to the proprietary name, a value here should be appended to the ProprietaryName field to obtain the complete name of the product. This suffix is often used to distinguish characteristics of a product such as extended release (\u201cXR\u201d) or sleep aid (\u201cPM\u201d). Although many companies follow certain naming conventions for suffices, there is no recognized standard.<\/p>\n<p><strong>LabelerName<\/strong><br \/>\nName of Company corresponding to the labeler code segment of the ProductNDC.<\/p>\n<p>That still apparently doesn&#8217;t capture all the variation in FDA&#8217;s database, because &#8220;prods2&#8221; contains a lot of exact duplicates<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.polydesmida.info\/BASHing\/img3\/2019-01-06_3.png\" alt=\"dupes3\" width=\"292\" height=\"57\" \/><\/p>\n<p>and there don&#8217;t seem to be any differences between the duplicated records in the original downloaded table (&#8220;prods0&#8221;), apart from the FDA product code and the unique ID based on that code:<\/p>\n<p class=\"comment\">Example from &#8220;prods0&#8221;:<\/p>\n<p>17518-080_7aa3171b-36c0-48d6-e053-2991aa0a6aec 17518-080 HUMAN OTC DRUG 3M SoluPrep chlorhexidine gluconate and isopropyl alcohol SOLUTION TOPICAL 20181008 NDA NDA208288 3M Company CHLORHEXIDINE GLUCONATE; ISOPROPYL ALCOHOL 20; .7 mg\/mL; mL\/mL N 20191231<\/p>\n<p>17518-081_7aa3171b-36c0-48d6-e053-2991aa0a6aec 17518-081 HUMAN OTC DRUG 3M SoluPrep chlorhexidine gluconate and isopropyl alcohol SOLUTION TOPICAL 20181008 NDA NDA208288 3M Company CHLORHEXIDINE GLUCONATE; ISOPROPYL ALCOHOL 20; .7 mg\/mL; mL\/mL N 20191231<\/p>\n<p>Once again I\u00a0<strong>sort<\/strong>ed and\u00a0<strong>uniq<\/strong>uified, converting &#8220;prods2&#8221; to &#8220;prods3&#8221;, which has 92,452 records. One source of duplication in &#8220;prods3&#8221; is in the proprietary name suffix field, because the same basic product can be sold with slightly different formulations not affecting the active ingredients. Here&#8217;s an example (from &#8220;prods0&#8221;) \u2014 a dental fluoride paste that comes in 3 different flavours:<\/p>\n<p class=\"comment\">65222-401_6155acd9-8ec2-a87d-e053-2991aa0a7b43 65222-401 HUMAN PRESCRIPTION DRUG Nupro Fluorides NaF Oral Solution Mint Sodium Fluoride GEL DENTAL 19000101 UNAPPROVED DRUG OTHER Dentsply LLC. Professional Division Trading as &#8220;DENTSPLY Professional&#8221; SODIUM FLUORIDE 20 mg\/g N 20191231<\/p>\n<p>65222-411_6155acd9-8ec2-a87d-e053-2991aa0a7b43 65222-411 HUMAN PRESCRIPTION DRUG Nupro Fluorides NaF Oral Solution Mandarin Orange Sodium Fluoride GEL DENTAL 19000101 UNAPPROVED DRUG OTHER Dentsply LLC. Professional Division Trading as &#8220;DENTSPLY Professional&#8221; SODIUM FLUORIDE 20 mg\/g N 20191231<\/p>\n<p>65222-421_6155acd9-8ec2-a87d-e053-2991aa0a7b43 65222-421 HUMAN PRESCRIPTION DRUG Nupro Fluorides NaF Oral Solution Apple Cinnamon Sodium Fluoride GEL DENTAL 19000101 UNAPPROVED DRUG OTHER Dentsply LLC. Professional Division Trading as &#8220;DENTSPLY Professional&#8221; SODIUM FLUORIDE 20 mg\/g N 20191231<\/p>\n<p>A larger source of duplication in &#8220;prods3&#8221; is the marketing date. Walgreens, for instance, has 2 different registrations for an allergy medicine, differing only in start and end marketing dates (example from &#8220;prods0&#8221;):<\/p>\n<p class=\"comment\">0363-0211_997032ac-0110-4004-9697-d82146ba7128 0363-0211 HUMAN OTC DRUG 24 Hour Allergy Cetirizine HCl CAPSULE ORAL 20130301 NDA NDA022429 Walgreens CETIRIZINE HYDROCHLORIDE 10 mg\/1 N 20181231<\/p>\n<p>0363-1219_f3168f2c-27a7-4dd7-9770-e91443a580f1 0363-1219 HUMAN OTC DRUG 24 Hour Allergy Cetirizine HCl CAPSULE ORAL 20180914 NDA NDA022429 Walgreens CETIRIZINE HYDROCHLORIDE 10 mg\/1 N 20191231<\/p>\n<p>I generated &#8220;prods4&#8221; from &#8220;prods3&#8221; by\u00a0<strong>cut<\/strong>ting out marketing date and\u00a0<strong>sort<\/strong>ing and\u00a0<strong>uniq<\/strong>uifying again. That reduced the set of &#8220;basically the same product&#8221; records to 84,163. Here are the top 10 formulations:<\/p>\n<p class=\"code\">cut -f4-6 prods4 | sort | uniq -c | sort -nr | head<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.polydesmida.info\/BASHing\/img3\/2019-01-06_4.png\" alt=\"dupes4\" width=\"493\" height=\"220\" \/><\/p>\n<p>Those 25 mg lots of diphenhydramine hydrochloride (an antihistamine) were sold by a nominal 188 labelling entities, but again there&#8217;s duplication. The FDA lists multiple strings for what&#8217;s presumably the same company<\/p>\n<p class=\"comment\">Allergy relief\u00a0\u00a0\u00a0\u00a0\u00a0[no suffix]\u00a0\u00a0\u00a0\u00a0\u00a0Topco Associates LLC<br \/>\nAllergy relief\u00a0\u00a0\u00a0\u00a0\u00a0[no suffix]\u00a0\u00a0\u00a0\u00a0\u00a0Topco Associates, LLC<br \/>\nAllergy relief\u00a0\u00a0\u00a0\u00a0\u00a0[no suffix]\u00a0\u00a0\u00a0\u00a0\u00a0TopCo Associates LLC<\/p>\n<p>the same product<\/p>\n<p class=\"comment\">Sleep Aid\u00a0\u00a0\u00a0\u00a0\u00a0Nighttime\u00a0\u00a0\u00a0\u00a0\u00a0CVS Pharmacy<br \/>\nSleep- Aid\u00a0\u00a0\u00a0\u00a0\u00a0Nighttime\u00a0\u00a0\u00a0\u00a0\u00a0CVS Pharmacy<br \/>\nSleep-Aid\u00a0\u00a0\u00a0\u00a0\u00a0Nighttime\u00a0\u00a0\u00a0\u00a0\u00a0CVS Pharmacy<\/p>\n<p>or both<\/p>\n<p class=\"comment\">sleep aid\u00a0\u00a0\u00a0\u00a0\u00a0nighttime\u00a0\u00a0\u00a0\u00a0\u00a0Target Corporation<br \/>\nSleep Aid\u00a0\u00a0\u00a0\u00a0\u00a0NightTime\u00a0\u00a0\u00a0\u00a0\u00a0TARGET Corporation<\/p>\n<p>Summing up, the answer to Tatman&#8217;s question about this dataset, namely &#8220;Which drugs are different names for the same things?&#8221; has several answers depending on how you define &#8220;things&#8221;. But even after you&#8217;ve decided what you&#8217;re looking for, the surprising messiness of the FDA&#8217;s data means you have a lot of data cleaning to do before you can start looking. The FDA&#8217;s product table is indeed a good dataset for data cleaning practice!<\/p>\n<hr \/>\n<p class=\"comment\">Some of the ingredient fields in the product table contain semicolon-and-space-separated strings, like<\/p>\n<p class=\"code\">ACETALDEHYDE; ARSENIC TRIOXIDE; BALSAM PERU; OYSTER SHELL CALCIUM CARBONATE, CRUDE; PHENOL; CONIUM MACULATUM FLOWERING TOP; COUMARIN; SAFFRON; HISTAMINE DIHYDROCHLORIDE; LACHESIS MUTA VENOM; LYCOPODIUM CLAVATUM SPORE; PHOSPHORUS; SEPIA OFFICINALIS JUICE<\/p>\n<p class=\"comment\">Could there be additional duplication in these entries, with the same items listed in different orders in different records? Checking for different orders of items within a single field is an interesting exercise in data auditing: see the next\u00a0<em>BASHing data<\/em>\u00a0post.<\/p>\n<p><a href=\"http:\/\/lxer.com\/module\/newswire\/ext_link.php?rid=264603\" target=\"_blank\" rel=\"noopener\">Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Drugs on the command line There&#8217;s a lot of raw material on the Web for data auditors to tinker with, but I&#8217;ve found only one website that advertises\u00a0Datasets for data cleaning practice. It&#8217;s a 2018 blog post by computational linguist Rachael Tatman. Among the offerings is a\u00a0link\u00a0to the National Drug Code Directory website of the &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.appservgrid.com\/paw92\/index.php\/2019\/01\/08\/drugs-on-the-command-line\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Drugs on the command line&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-7696","post","type-post","status-publish","format-standard","hentry","category-linux"],"_links":{"self":[{"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/posts\/7696","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/comments?post=7696"}],"version-history":[{"count":2,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/posts\/7696\/revisions"}],"predecessor-version":[{"id":7869,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/posts\/7696\/revisions\/7869"}],"wp:attachment":[{"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/media?parent=7696"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/categories?post=7696"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/tags?post=7696"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}