CodePlex

From Archiveteam
URL CodePlex
Archive
Status Offline
Archiving status Saved!
Archiving type DPoS, ArchiveBot, other
Project source codeplex-grab
Project tracker codeplex
IRC channel #archiveteam-bs (on hackint)
(formerly #plexicode (on hackint))
Project lead User:Arkiver (DPoS)
User:JustAnotherArchivist (ArchiveBot)
User:Sylirana (ZIP)
Data[how to use] DPoS:
archiveteam_codeplex

ArchiveBot:
job:37xhjatblhmteac21hfztij39
job:e6b63pwhut60gwe4z41syjin7
job:9mu2jt79bwn5p5jtk3svrgbw5
job:eofk4xb29zp47t2vmkokzmz22
job:953ckm35odcko93cg6jd25gfp
job:7ugnk5fiz3i5efr87w3p82w9k
job:d84m3m6fe2793mt3333pgdoka
job:9cwff2hdnszzus34wfk0996yk

ZIP:
sylirana_ms_codeplex_zips

CodePlex was a software repository owned by Microsoft. It hosted only open source software paired with an open source license.[1]

CodePlex allowed people to commit their code into a Git, Mercurial, or Team Foundation Server version control repository. It had a downloads section for people to upload their software packages, an issue tracker, documentation repository, and discussion forums.

The platform was shut down in 2017, but a read-only archive remained online. This self-archive was announced to be shut down in July 2021 via a banner on the site, and the archive subdomain stopped resolving on 2021-10-21 (between 18:20 and 18:35 UTC).

Vital signs

The shutdown announcement was made on 31st March 2017.[2] New project creation was disabled at the same time the announcement was published. According to the announcement, the site would be made read-only on an unspecified date in October 2017, with shutdown scheduled for 15th December 2017.

Archiving

The announcement indicates that after the 15th December 2017 shutdown date, "lightweight archives" will be available, containing project source code, downloads, documentation, license, and issues as of the date the site changed to read-only. At the time of the announcement, there was no planned date to stop hosting these archives.

The shutdown announcement indicates that project owners will be provided a tool to migrate their sites to GitHub. As of the announcement date the migration tool is "in the works".

Alex Mullans, a Microsoft Program Manager of the Visual Studio Team Services product, stated in a discussion on Hacker News [3] that project archives will be available for anyone to download (as opposed to being restricted to project owners). He further stated that for projects using the Git and Mercurial version control systems the ".git" and ".hg" folders would be included in the archive, so that full source code history would be preserved. For projects using the TFS version control system, however, the full history would not be included in the archive and only the code as-of the site being changed to read-only would be available.

In late January 2021[4], a banner was added to the archive website, announcing that the archive would be shut down in July 2021.

Site structure

There is a sitemap (warning: large XML file) which contains links to 108516 individual projects in the format https://archive.codeplex.com/?p=<ID>. It has not yet been confirmed whether this covers all of the projects on the site.
Some projects listed in the sitemap have been completely removed from the site.
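
As a rough illustration, the project IDs could be pulled out of the sitemap as sketched below. This assumes the sitemap follows the standard sitemaps.org schema (a urlset of url/loc elements), which has not been verified here; the function name and the sample project IDs are made up for the example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse, parse_qs

# Assumption: the sitemap uses the standard sitemaps.org namespace.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def project_ids_from_sitemap(xml_text):
    """Return the value of the ?p=<ID> parameter of every <loc> entry."""
    root = ET.fromstring(xml_text)
    ids = []
    for loc in root.iter(SITEMAP_NS + "loc"):
        url = (loc.text or "").strip()
        query = parse_qs(urlparse(url).query)
        if "p" in query:
            ids.append(query["p"][0])
    return ids


# Hypothetical sample; the project IDs below are invented.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://archive.codeplex.com/?p=exampleproject</loc></url>
  <url><loc>https://archive.codeplex.com/?p=another</loc></url>
</urlset>"""

if __name__ == "__main__":
    print(project_ids_from_sitemap(SAMPLE_SITEMAP))
```

For the real sitemap, which is large, a streaming parser such as ET.iterparse may be a better fit than loading the whole document at once.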

It is important to note that two different types of IDs are used. The first is all lowercase and is used in the sitemap and for some other resources, such as the page JSON (called <ID> here). The second is the same ID with the project's original capitalization, i.e. with uppercase letters if the project name had any (called <ID2> here). Because of that, requesting resources that need <ID2> using the IDs from the sitemap returns a 404 for projects with capital letters in their name.

The individual project pages load their actual contents using JavaScript by requesting multiple JSON files: one for the page (https://archive.codeplex.com/metadata/<ID>.json), others for issues, etc.

There is a .zip file for each project. This uses the aforementioned second ID and is located at https://codeplexarchive.blob.core.windows.net/archive/projects/<ID2>/<ID2>.zip.
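
The URL patterns described above can be summarized in a small sketch. The function names and the example project name "MyProject" are hypothetical; only the URL templates come from the text above.

```python
ARCHIVE_HOST = "https://archive.codeplex.com"
BLOB_BASE = "https://codeplexarchive.blob.core.windows.net/archive/projects"


def sitemap_id(cased_id):
    """<ID>: the all-lowercase form of the project ID, as used in the sitemap."""
    return cased_id.lower()


def project_page_url(project_id):
    """Project page, addressed with the lowercase <ID>."""
    return f"{ARCHIVE_HOST}/?p={project_id}"


def metadata_url(project_id):
    """Page JSON, addressed with the lowercase <ID>."""
    return f"{ARCHIVE_HOST}/metadata/{project_id}.json"


def zip_url(cased_id):
    """Project ZIP, addressed with the case-preserving <ID2>."""
    return f"{BLOB_BASE}/{cased_id}/{cased_id}.zip"


if __name__ == "__main__":
    id2 = "MyProject"  # hypothetical project name
    print(project_page_url(sitemap_id(id2)))
    print(metadata_url(sitemap_id(id2)))
    print(zip_url(id2))
```

The point of keeping the two ID forms in separate helpers is that the ZIP URL built from the lowercase <ID> would differ from the one built from <ID2> whenever the project name contains capital letters.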

WARC

The way the site loads its content makes it more difficult to archive all the pages (see above for details).

The first JSON that is requested for each project contains an HTML snippet, which is then inserted into the actual page for the user to view.
The wiki on the site itself is *broken*! All wiki links simply redirect to the project's single page, from which issues and discussions can be read (also loaded with JS). Because of that, the only way to get the wikis seems to be the .zip files. The images only have hashes as filenames, though, and the links from the HTML pages can't be rewritten automatically (see below).

Project ZIP Files

User:Sylirana has started archiving all of the .zip files that Microsoft provides for each project. These contain the code (with or without full history, depending on which version control system was used, see above) and other data such as issues and wikis.

One thing to note is that while the zip files do contain image attachments, they just have a hash as a filename (and no extension). The HTML files (of the wikis, for example) do NOT link to that file, but instead to the soon-to-be-offline server. This is something to consider for anyone wanting to see images on the HTML pages in the archive. There doesn't seem to be a mapping anywhere between the URL (which contains a proper filename) and the file in the archive, which just has a hash as its filename.

Upon checking all the JSON files inside the archive, I found that there are indeed some containing mappings for files of different subdirectories of the archive. This means it should be possible to fully rebuild the wiki (and other) pages *with* attachments from the zip files.

The archiving will happen in two steps due to the different ID types.
In a first step, all of the projects with lowercase IDs will be saved.
In a second step, the JSON will be requested for every project that returned a 404 during the first step (meaning the project either has uppercase letters in its ID or has been deleted from the site). From this, another download list will be generated, along with a list of projects that have been deleted from the site (see above).

The reasoning behind doing this in two steps and not just getting the JSON for every single project to check for the letter case is that the majority of projects can be downloaded without those extra steps (and requests to the server!).
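
The two-step approach could be sketched roughly as follows. This is not the script that was actually used. The callables zip_status and lookup_cased_id are hypothetical stand-ins for the real HTTP requests, and how the case-preserving ID is read out of the project JSON is deliberately left open, since the JSON layout is not described here.

```python
def step_one(sitemap_ids, zip_status):
    """Try every lowercase sitemap ID directly; split into saved and 404."""
    saved, not_found = [], []
    for pid in sitemap_ids:
        status = zip_status(pid)
        if status == 200:
            saved.append(pid)
        elif status == 404:
            not_found.append(pid)
        else:
            raise ValueError(f"unexpected status {status} for {pid}")
    return saved, not_found


def step_two(not_found, lookup_cased_id):
    """Only for the 404s: ask the project JSON for the case-preserving ID.

    lookup_cased_id returns the <ID2> form, or None if the project's JSON
    is gone too, which is treated as "deleted from the site".
    """
    retry, deleted = [], []
    for pid in not_found:
        cased = lookup_cased_id(pid)
        if cased is None:
            deleted.append(pid)
        else:
            retry.append(cased)
    return retry, deleted


if __name__ == "__main__":
    # Invented data: "alpha" downloads directly, "beta" is really "Beta",
    # "gamma" no longer exists.
    existing_zips = {"alpha"}
    cased_ids = {"beta": "Beta"}
    saved, not_found = step_one(
        ["alpha", "beta", "gamma"],
        lambda pid: 200 if pid in existing_zips else 404,
    )
    retry, deleted = step_two(not_found, cased_ids.get)
    print(saved, retry, deleted)
```

In this layout the extra JSON request is only made for IDs that failed in step one, which mirrors the reasoning given above about sparing the server unnecessary requests.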

Downloading is rather slow as the server significantly limits the bandwidth.
Despite those limits, the project ZIP files are on track to be completed within March, which is reasonably far from the shutdown in July.

Current progress of ZIP files (step 1) [DONE]:
Total     Total done (1)     Saved (1)             404 (1)
108516    108516 (100%)      94097 (~548.3 GB)     14419

Total: Total according to sitemap.
Total done (1): Total links done during step 1 (or in other words, progress towards step 2).
Saved (1): Saved during step 1.
404 (1): Links which returned a 404 during step 1. This does NOT mean that a project is lost; it might simply have uppercase letters, which will be checked in step 2 (see above).
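
As a quick sanity check on the step 1 figures, the saved and 404 counts should add up to the sitemap total. The sketch below also derives the share of projects saved in step 1 and an approximate average ZIP size, assuming 1 GB = 1000 MB (the original unit convention is not stated).

```python
total = 108516      # projects listed in the sitemap
saved = 94097       # ZIPs saved during step 1
not_found = 14419   # 404 responses during step 1
saved_gb = 548.3    # approximate total size of the saved ZIPs

# Saved and 404 together account for every sitemap entry.
assert saved + not_found == total

saved_pct = 100 * saved / total
avg_zip_mb = saved_gb * 1000 / saved  # assumes decimal gigabytes

print(f"saved in step 1: {saved_pct:.1f}%")
print(f"left for step 2: {not_found} IDs")
print(f"average ZIP size: {avg_zip_mb:.2f} MB")
```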

Stats on step 2 will follow soon.

My apologies for the lack of updates for so long. I was really busy and ended up only updating people on request, but everything was kept up and running.

The archive for the .zip files is now complete and can be found at: https://archive.org/details/sylirana_ms_codeplex_zips .

Contact Sylirana on hackint or check the channel (see infobox) if you have any questions.

References

  1. Documentation - CodePlex FAQ - Project Hosting Requirements
  2. Shutting down CodePlex
  3. Hacker News - Shutting down CodePlex discussion
  4. No date is mentioned in the notice, but the Wayback Machine snapshots indicate that it was added in the last week of January 2021.
