Image by Editor
I’ve been reading, writing, and speaking since late last year on the intersection of open source software and machine learning, trying to understand what the future might bring.
When I started, I expected that I would be talking mostly about how open source software is used by the machine learning community. But the more I’ve explored, the more I’ve realized that there are a lot of similarities between the two areas of practice. In this article I’ll discuss some of those parallels — and what machine learning can and can’t learn from open source software.
The easy and obvious parallel is that both modern machine learning and modern software are built almost entirely with open source software. For software, that is compilers and code editors; for machine learning, it is training and inference frameworks like PyTorch and TensorFlow. These spaces are dominated by open source software, and nothing appears ready to change that.
There is one notable, apparent exception to this: all of these frameworks depend on the very proprietary Nvidia hardware and software stack. This is actually more of a parallel than it might look at first. For a long time, open source software ran mostly on proprietary Unix operating systems, sold by proprietary hardware vendors. It was only after Linux came along that we began to take for granted that an open “bottom” of the stack was even possible, and much open development is done these days on macOS and Windows. It is unclear how this will play out in machine learning. Amazon (for AWS), Google (for both cloud and Android), and Apple are all investing in competing chips and stacks, and it’s possible that one or more of those could follow the path laid by Linus (and Intel) of freeing the entire stack.
A more critical parallel between how open source software is built and how machine learning is built is the complexity and public availability of the data that each is built on.
As detailed in the preprint paper “The Data Provenance Project,” which I co-authored, modern machine learning is built on literally thousands of data sources, just as modern open source software is built on hundreds of thousands of libraries. And just as each open library brings with it legal, security, and maintenance challenges, each public data set brings with it the exact same set of difficulties.
At my organization, we’ve talked about open source software’s version of this challenge as being an “accidental supply chain.” The software industry started building things because the incredible building blocks of open source libraries meant that we could. This meant the industry started treating open source software as a supply chain—which came as a surprise to many of those “suppliers.”
To mitigate these challenges, open source software has developed lots of sophisticated (though imperfect) techniques, like scanners for identifying what is being used, and metadata for tracking things after deployment. We’re also starting to invest in humans, to try to address the mismatch between industrial needs and volunteer motivations.
Unfortunately, the machine learning community seems ready to plunge into the exact same “accidental” supply chain mistake—doing lots of things because it can, without stopping to think much about the long-term implications once the entire economy is based on these data sets.
A last important parallel is that I strongly suspect that machine learning will expand to fill many, many niches, just as open source software has. At the moment, the (deserved) hype is about large, generative models, but there are also many small models out there, as well as tweaks on larger models. Indeed, HuggingFace, machine learning’s primary hosting platform, reports that the number of models on its site is growing exponentially.
These models will likely be plentiful and available for improvement, much like small pieces of open source software. That will make them incredibly flexible and powerful. I’m using a small machine learning-based tool to do cheap, privacy-sensitive traffic measurement on my street, for example, a use case that wouldn’t have been possible except on expensive devices a few years ago.
But this proliferation means that they’ll need to be tracked—models may become less like mainframes and more like open source software or SaaS, which pop up all over the place because of low cost and ease of deployment.
So if there are these important parallels (particularly of complex supply chains and proliferating distribution) what can machine learning learn from open source software?
The first parallel lesson we can draw is simply that to understand its many challenges, machine learning will need metadata and tooling. Open source software stumbled into metadata work through copyright and licensing compliance, but as the accidental supply chain for software has matured, metadata has proven immensely useful on a variety of fronts.
In machine learning, metadata tracking is a work in progress. A few examples:
- A key 2019 paper, widely cited in the industry, urged developers of models to document their work with “model cards.” Unfortunately, recent research suggests their implementation in the wild is still weak.
- Both the SPDX and CycloneDX software bills of materials (SBOM) specifications are working on AI bills of materials (AI BOMs) to help track machine learning data and models, in a more structured manner than model cards (befitting the complexity one would expect if this truly does parallel open source software).
- HuggingFace has created a variety of specs and tools to allow model and dataset authors to document their sources.
- The MIT Data Provenance paper cited above tries to understand the “ground truth” of data licensing, to help flesh out the specifications with real-world data.
- Anecdotally, many companies doing machine learning training work appear to have somewhat casual relationships with data tracking, using “more is better” as an excuse to shovel data into the hopper without necessarily tracking it well.
If we’ve learned anything from open source, it’s that getting the metadata right (first the specs, then the actual data) is going to be a project of years and may require government intervention. Machine learning should take that metadata plunge sooner rather than later.
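To make the idea of tracking data sources a little more concrete, here is a minimal sketch of what a per-model record of data sets might look like. It is not the SPDX, CycloneDX, or HuggingFace format; the class names (`DatasetRecord`, `ModelManifest`), the fields, and the example URLs are all hypothetical and chosen only for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class DatasetRecord:
    """One data source used in training (hypothetical schema)."""
    name: str
    source_url: str
    license: Optional[str] = None  # None means "nobody has recorded this yet"


@dataclass
class ModelManifest:
    """A list of the data sets behind one model (hypothetical schema)."""
    model_name: str
    datasets: List[DatasetRecord] = field(default_factory=list)

    def untracked(self) -> List[str]:
        # Names of data sets that have no license recorded.
        return [d.name for d in self.datasets if not d.license]

    def to_json(self) -> str:
        # Serialize the whole manifest so it can travel with the model.
        return json.dumps(asdict(self), indent=2)


if __name__ == "__main__":
    manifest = ModelManifest(
        model_name="street-traffic-counter",
        datasets=[
            DatasetRecord("public-road-images", "https://example.com/roads", "CC-BY-4.0"),
            DatasetRecord("scraped-forum-posts", "https://example.com/forum"),
        ],
    )
    print(manifest.to_json())
    print("Missing license metadata:", manifest.untracked())
```

Even a record this small answers the two questions that matter later: what went into the model, and which parts of that nobody has checked.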
Security has been another major driver of open source software’s metadata demand—if you don’t know what you’re running, you can’t know if you’re susceptible to the seemingly endless stream of attacks.
Machine learning models aren’t subject to most types of traditional software attacks, but that doesn’t mean they’re invulnerable. (My favorite example is that it was possible to poison image training sets because they often drew from dead domains.) Research in this area is hot enough that we’ve already gone past “proof of concept” and into “there are enough attacks to list and taxonomize.”
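One commonly discussed defense against the dead-domain problem is to record a cryptographic hash of each file when a data set is first assembled, and to check that hash again whenever the file is re-downloaded. The sketch below assumes that approach; it is an illustration rather than something this article prescribes, and the function names are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of some bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def matches_recorded_hash(data: bytes, recorded_sha256: str) -> bool:
    """True if freshly downloaded bytes match the hash recorded earlier."""
    return sha256_hex(data) == recorded_sha256


if __name__ == "__main__":
    # Hash recorded when the data set was first built (stand-in bytes).
    original = b"original image bytes"
    recorded = sha256_hex(original)

    # Later, the same URL serves different content, e.g. after the
    # domain expired and was re-registered by someone else.
    later_download = b"replacement image bytes"

    print("original still matches:", matches_recorded_hash(original, recorded))
    print("later download matches:", matches_recorded_hash(later_download, recorded))
```

This only detects that content changed after it was recorded; it says nothing about whether the original content was trustworthy in the first place.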
Unfortunately, open source software can’t offer machine learning any magic bullets for security—if we had them, we’d be using them. But the history of how open source software spread to so many niches suggests that machine learning must take this challenge seriously, starting with tracking usage and deployment metadata, exactly because it is likely to be applied in so many ways beyond those in which it is currently deployed.
The motivations that drove open source metadata (licensing, then security) point to the next important parallel: as the importance of a sector grows, the scope of things that must be measured and tracked will expand, because regulation and liability will expand.
In open source software, the primary government “regulation” for many years was copyright law, and so metadata developed to support that. But open source software now faces a variety of security and product liability rules—and we must mature our supply chains to meet those new requirements.
AI will similarly be regulated in an ever-growing multitude of ways as it becomes ever-more important. The sources of regulation will be extremely diverse, including on content (both inputs and outputs), discrimination, and product liability. This will require what is sometimes called “traceability”—understanding how the models are built, and how those choices (including data sources) impact the outcomes of the models.
This core requirement—what do we have? how did it get here?—is now intimately familiar for enterprise open source software developers. However, it may be a radical change for machine learning developers and needs to be embraced.
Another parallel lesson machine learning can draw from open source software (and indeed from many waves of software before it, dating back at least to the mainframe) is that its useful life will be very, very long. Once a technology is “good enough,” it will be deployed and therefore must be maintained for a very, very long time. This implies that we must think about maintenance of this software as early as possible, and think about what it will mean that this software might survive for decades. “Decades” is not an exaggeration; many customers I encounter are using software that is old enough to vote. Many open source software companies, and some projects, now have so-called “Long Term Support” versions that are intended for these sorts of use cases.
In contrast, OpenAI kept their Codex tool available for less than two years—leading to a lot of anger, especially in the academic community. Given the rapid pace of change in machine learning, and that most adopters are probably interested in using the very cutting edge, this probably wasn’t unreasonable—but the day will come, sooner than the industry thinks, where it needs to plan for this sort of “long term”—including how it interacts with liability and security.
Finally, it’s clear that—like open source software—there is going to be a lot of money flowing into machine learning, but most of that money will pool around what one author has called the “processor rich” companies. If the parallels to open source software play out, those companies will have very different concerns and spending priorities than the median creator (or user) of models.
Our company, Tidelift, has been thinking about this problem of incentives in open source software for some time, and entities like the world’s largest purchaser of software—the US government—are looking into the problem as well.
Machine learning companies, especially those seeking to create communities of creators, should think hard about this challenge. If they’re dependent on thousands of data sets, how will they ensure those are funded for maintenance, legal compliance, and security, for decades? If large companies end up with dozens or hundreds of models deployed around the company, how will they ensure those with the best specialist knowledge—those who created the models—are still around to work on new problems as they are discovered?
Like security, there are no easy answers for this challenge. But the sooner machine learning takes the problem seriously—not as an act of charity, but as a key component of long-term growth—the better off the entire industry, and the entire world, will be.
Machine learning’s deep roots in academia’s culture of experimentalism, and Silicon Valley’s culture of fast iteration, have served it well, leading to an amazing explosion of innovation that would have seemed magical less than a decade ago. Open source software’s course in the past decade has perhaps been less glamorous, but during that time it has become the underpinning of all enterprise software, and learned a lot of lessons along the way. Hopefully machine learning will not reinvent those wheels.
Luis Villa is co-founder and general counsel at Tidelift. Previously he was a top open source lawyer advising clients, from Fortune 50 companies to leading startups, on product development and open source licensing.
Source: https://www.kdnuggets.com/ai-and-open-source-software-separated-at-birth?utm_source=rss&utm_medium=rss&utm_campaign=ai-and-open-source-software-separated-at-birth